mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Reply
 
Thread Tools
Old 2011-11-13, 19:37   #1
ixfd64
Bemusing Prompter
 
ixfd64's Avatar
 
"Danny"
Dec 2002
California

24·5·29 Posts
Default maximum theoretical speed of LL test w/o bandwidth limitations?

I know that in many cases, the speed of a LL test depends on not so much the CPU speed, but rather the memory bandwidth. Assuming that the latter isn't an issue, how much of a speedup can we get with current-generation processors? I recall someone saying that their Tesla GPU was using only 25% of its processing power, but does anyone know if the figures are similar in the case of CPUs?

Suddenly, the prospects of "hybrid memory cubes" and DDR4 are starting to sound attractive...

Last fiddled with by ixfd64 on 2011-11-13 at 19:45
ixfd64 is offline   Reply With Quote
Old 2011-11-13, 20:26   #2
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×2,399 Posts
Default

I've also been wondering how much the cache sizes and memory bandwidth affect LL tests. Would it improve performance to get higher bandwidth memory? (And for the same frequency, would a 2600k with 8MB L3 be faster than a 2500k with 6MB L3, for example?)
Dubslow is offline   Reply With Quote
Old 2011-11-15, 03:57   #3
Christenson
 
Christenson's Avatar
 
Dec 2010
Monticello

171210 Posts
Default

Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT.

8M multiplies per FFT
*2 FFTs per LL step
*50M LL steps
= 800 * 10^12 128 bit multiplies
Assume a (theoretical) 1 GHz machine -- it will take it 16 clock steps to do that 128 bit multiply, so we end up with
16 * 800 * 10^12 clocks to do it...12800 seconds, about 3.5 hours.

Now, I need someone to come poke holes in my numbers...and remind me where I lost a zero.
****************
LL tests are specifically designed to be friendly to caches, so more L3 cache is better, all else being equal.
Christenson is offline   Reply With Quote
Old 2011-11-15, 04:28   #4
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×2,399 Posts
Default

How much better? Would the extra 2 MB be a 2% or 6% increase?
Dubslow is offline   Reply With Quote
Old 2011-11-15, 04:54   #5
axn
 
axn's Avatar
 
Jun 2003

7×683 Posts
Default

Quote:
Originally Posted by Christenson View Post
Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT.
You have misunderstood the mechanics of FFT.
Prime95 uses doubles (64 bit floating points) as the FFT "limbs" -- not bytes. It stores about 19 bits of data in a single limb. So that is around 50M/19 ~ 2.6M limbs. Since we can't do arbitrary sized FFT, we use the next higher size FFT - 3M. So, the opcount should be 3M * log2(3M) [But this is an approximation -- to get an exact count, we need to look at the actual FFT operations]. Double that for inverse FFT. Then 3M point-wise multiplications for the actual squaring. So roughly 3M + 2*3M*log2(3M).

EDIT:- Doing parallel operations using SSE2 merely reduces the number of clock cycles -- doesn't reduce the op count.

EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount

Last fiddled with by axn on 2011-11-15 at 05:08
axn is offline   Reply With Quote
Old 2011-11-15, 05:13   #6
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

719710 Posts
Default

I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.
Dubslow is offline   Reply With Quote
Old 2011-11-15, 05:41   #7
axn
 
axn's Avatar
 
Jun 2003

7·683 Posts
Default

Quote:
Originally Posted by Dubslow View Post
I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.
Yes. But, NlgN ought to be a pretty good approximation.

FFT algorithm is regular enough that we can count exactly the number of ops. There would be a multiplier, but it'll be a small one (say, 2). But there would also be other factors, and we should be able to count all of them. Anyways, I think I'm off by a factor of 2, at the most.
axn is offline   Reply With Quote
Old 2011-11-15, 06:02   #8
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1C1D16 Posts
Default

Huh. I thought I remember reading somewhere it was closer to 5... but you at least have calculations, which is better than me.
Dubslow is offline   Reply With Quote
Old 2011-11-15, 08:53   #9
ldesnogu
 
ldesnogu's Avatar
 
Jan 2008
France

10000110012 Posts
Default

Quote:
Originally Posted by axn View Post
EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount
Hmm, even if you have unlimited bandwidth and no latency, you still have to load and store data. Also I wonder where the "4 ops/cycle" comes from; what does it take into account?

An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category
ldesnogu is offline   Reply With Quote
Old 2011-11-15, 09:34   #10
axn
 
axn's Avatar
 
Jun 2003

112558 Posts
Default

Quote:
Originally Posted by ldesnogu View Post
Hmm, even if you have unlimited bandwidth and no latency, you still have to load and store data.
In the theoretical model, you shouldn't count that. Loads/Stores can usually execute simultaneously with other inflight instructions.

Quote:
Originally Posted by ldesnogu View Post
Also I wonder where the "4 ops/cycle" comes from; what does it take into account?
Core 2 and onwards can execute upto 4 FLOP/cycle. Sandy bridge can do 8.


Quote:
Originally Posted by ldesnogu View Post
An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category
Like I said, FFT is very regular (no coditional jumps, etc.) You can do a static count based on algorithm. In fact, I think George may already have done this exercise.
axn is offline   Reply With Quote
Old 2011-11-15, 11:29   #11
ldesnogu
 
ldesnogu's Avatar
 
Jan 2008
France

3×179 Posts
Default

Quote:
Originally Posted by axn View Post
In the theoretical model, you shouldn't count that. Loads/Stores can usually execute simultaneously with other inflight instructions.
You're right as long as ld/st count doesn't exceed (math) operations count. Quickly looking at prime95 source code, it looks like there is a lot of memory operations (obviously), but that doesn't say much. I tried to find some papers about memory operations count for FFT and couldn't find anything, except for the classical papers that talk about optimization of the memory hierarchy.

Anyway I trust you and so will assume that the ld/st count is always less than the rest
ldesnogu is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
A Theoretical (vs. Proficient/Practical) Deterministic Primality Test a1call Miscellaneous Math 194 2018-03-19 05:54
The Limitations of the Cranial GPU Flatlander Science & Technology 3 2013-06-13 13:34
Maximum theoretical MPG Mini-Geek Lounge 9 2008-07-14 22:45
Custom test for maximum CPU stress Torpedo Information & Answers 10 2007-10-05 17:33
LL test speed up? jebeagles Miscellaneous Math 16 2006-01-04 02:43

All times are UTC. The time now is 09:24.

Tue Dec 1 09:24:03 UTC 2020 up 82 days, 6:35, 1 user, load averages: 1.50, 1.46, 1.49

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.