mersenneforum.org maximum theoretical speed of LL test w/o bandwidth limitations?
 Register FAQ Search Today's Posts Mark Forums Read

 2011-11-13, 19:37 #1 ixfd64 Bemusing Prompter     "Danny" Dec 2002 California 22·577 Posts maximum theoretical speed of LL test w/o bandwidth limitations? I know that in many cases, the speed of a LL test depends on not so much the CPU speed, but rather the memory bandwidth. Assuming that the latter isn't an issue, how much of a speedup can we get with current-generation processors? I recall someone saying that their Tesla GPU was using only 25% of its processing power, but does anyone know if the figures are similar in the case of CPUs? Suddenly, the prospects of "hybrid memory cubes" and DDR4 are starting to sound attractive... Last fiddled with by ixfd64 on 2011-11-13 at 19:45
 2011-11-13, 20:26 #2 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
 2011-11-15, 03:57 #3 Christenson     Dec 2010 Monticello 6B016 Posts Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT. 8M multiplies per FFT *2 FFTs per LL step *50M LL steps = 800 * 10^12 128 bit multiplies Assume a (theoretical) 1 GHz machine -- it will take it 16 clock steps to do that 128 bit multiply, so we end up with 16 * 800 * 10^12 clocks to do it...12800 seconds, about 3.5 hours. Now, I need someone to come poke holes in my numbers...and remind me where I lost a zero. **************** LL tests are specifically designed to be friendly to caches, so more L3 cache is better, all else being equal.
 2011-11-15, 04:28 #4 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
2011-11-15, 04:54   #5
axn

Jun 2003

10010011010112 Posts

Quote:
 Originally Posted by Christenson Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT.
You have misunderstood the mechanics of FFT.
Prime95 uses doubles (64 bit floating points) as the FFT "limbs" -- not bytes. It stores about 19 bits of data in a single limb. So that is around 50M/19 ~ 2.6M limbs. Since we can't do arbitrary sized FFT, we use the next higher size FFT - 3M. So, the opcount should be 3M * log2(3M) [But this is an approximation -- to get an exact count, we need to look at the actual FFT operations]. Double that for inverse FFT. Then 3M point-wise multiplications for the actual squaring. So roughly 3M + 2*3M*log2(3M).

EDIT:- Doing parallel operations using SSE2 merely reduces the number of clock cycles -- doesn't reduce the op count.

EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount

Last fiddled with by axn on 2011-11-15 at 05:08

 2011-11-15, 05:13 #6 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
2011-11-15, 05:41   #7
axn

Jun 2003

5×23×41 Posts

Quote:
 Originally Posted by Dubslow I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.
Yes. But, NlgN ought to be a pretty good approximation.

FFT algorithm is regular enough that we can count exactly the number of ops. There would be a multiplier, but it'll be a small one (say, 2). But there would also be other factors, and we should be able to count all of them. Anyways, I think I'm off by a factor of 2, at the most.

 2011-11-15, 06:02 #8 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40
2011-11-15, 08:53   #9
ldesnogu

Jan 2008
France

21016 Posts

Quote:
 Originally Posted by axn EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount
Hmm, even if you have unlimited bandwidth and no latency, you still have to load and store data. Also I wonder where the "4 ops/cycle" comes from; what does it take into account?

An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category

2011-11-15, 09:34   #10
axn

Jun 2003

5×23×41 Posts

Quote:
 Originally Posted by ldesnogu Hmm, even if you have unlimited bandwidth and no latency, you still have to load and store data.
In the theoretical model, you shouldn't count that. Loads/Stores can usually execute simultaneously with other inflight instructions.

Quote:
 Originally Posted by ldesnogu Also I wonder where the "4 ops/cycle" comes from; what does it take into account?
Core 2 and onwards can execute upto 4 FLOP/cycle. Sandy bridge can do 8.

Quote:
 Originally Posted by ldesnogu An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category
Like I said, FFT is very regular (no coditional jumps, etc.) You can do a static count based on algorithm. In fact, I think George may already have done this exercise.

2011-11-15, 11:29   #11
ldesnogu

Jan 2008
France

10208 Posts

Quote:
 Originally Posted by axn In the theoretical model, you shouldn't count that. Loads/Stores can usually execute simultaneously with other inflight instructions.
You're right as long as ld/st count doesn't exceed (math) operations count. Quickly looking at prime95 source code, it looks like there is a lot of memory operations (obviously), but that doesn't say much. I tried to find some papers about memory operations count for FFT and couldn't find anything, except for the classical papers that talk about optimization of the memory hierarchy.

Anyway I trust you and so will assume that the ld/st count is always less than the rest

 Similar Threads Thread Thread Starter Forum Replies Last Post a1call Miscellaneous Math 194 2018-03-19 05:54 Flatlander Science & Technology 3 2013-06-13 13:34 Mini-Geek Lounge 9 2008-07-14 22:45 Torpedo Information & Answers 10 2007-10-05 17:33 jebeagles Miscellaneous Math 16 2006-01-04 02:43

All times are UTC. The time now is 18:45.

Mon Oct 26 18:45:02 UTC 2020 up 46 days, 15:56, 0 users, load averages: 2.02, 2.13, 2.11