20111113, 19:37  #1 
Bemusing Prompter
"Danny"
Dec 2002
California
2^{2}·577 Posts 
maximum theoretical speed of LL test w/o bandwidth limitations?
I know that in many cases, the speed of a LL test depends on not so much the CPU speed, but rather the memory bandwidth. Assuming that the latter isn't an issue, how much of a speedup can we get with currentgeneration processors? I recall someone saying that their Tesla GPU was using only 25% of its processing power, but does anyone know if the figures are similar in the case of CPUs?
Suddenly, the prospects of "hybrid memory cubes" and DDR4 are starting to sound attractive... Last fiddled with by ixfd64 on 20111113 at 19:45 
20111113, 20:26  #2 
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
3×2,399 Posts 
I've also been wondering how much the cache sizes and memory bandwidth affect LL tests. Would it improve performance to get higher bandwidth memory? (And for the same frequency, would a 2600k with 8MB L3 be faster than a 2500k with 6MB L3, for example?)

20111115, 03:57  #3 
Dec 2010
Monticello
6B0_{16} Posts 
Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT.
8M multiplies per FFT *2 FFTs per LL step *50M LL steps = 800 * 10^12 128 bit multiplies Assume a (theoretical) 1 GHz machine  it will take it 16 clock steps to do that 128 bit multiply, so we end up with 16 * 800 * 10^12 clocks to do it...12800 seconds, about 3.5 hours. Now, I need someone to come poke holes in my numbers...and remind me where I lost a zero. **************** LL tests are specifically designed to be friendly to caches, so more L3 cache is better, all else being equal. 
20111115, 04:28  #4 
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
1C1D_{16} Posts 
How much better? Would the extra 2 MB be a 2% or 6% increase?

20111115, 04:54  #5  
Jun 2003
1001001101011_{2} Posts 
Quote:
Prime95 uses doubles (64 bit floating points) as the FFT "limbs"  not bytes. It stores about 19 bits of data in a single limb. So that is around 50M/19 ~ 2.6M limbs. Since we can't do arbitrary sized FFT, we use the next higher size FFT  3M. So, the opcount should be 3M * log2(3M) [But this is an approximation  to get an exact count, we need to look at the actual FFT operations]. Double that for inverse FFT. Then 3M pointwise multiplications for the actual squaring. So roughly 3M + 2*3M*log2(3M). EDIT: Doing parallel operations using SSE2 merely reduces the number of clock cycles  doesn't reduce the op count. EDIT2: So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation  this cannot be accounted by memory issues alone. I must be really off in my opcount Last fiddled with by axn on 20111115 at 05:08 

20111115, 05:13  #6 
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
3×2,399 Posts 
I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.

20111115, 05:41  #7  
Jun 2003
5×23×41 Posts 
Quote:
FFT algorithm is regular enough that we can count exactly the number of ops. There would be a multiplier, but it'll be a small one (say, 2). But there would also be other factors, and we should be able to count all of them. Anyways, I think I'm off by a factor of 2, at the most. 

20111115, 06:02  #8 
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
1110000011101_{2} Posts 
Huh. I thought I remember reading somewhere it was closer to 5... but you at least have calculations, which is better than me.

20111115, 08:53  #9  
Jan 2008
France
210_{16} Posts 
Quote:
An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category 

20111115, 09:34  #10  
Jun 2003
5×23×41 Posts 
Quote:
Quote:
Like I said, FFT is very regular (no coditional jumps, etc.) You can do a static count based on algorithm. In fact, I think George may already have done this exercise. 

20111115, 11:29  #11  
Jan 2008
France
1020_{8} Posts 
Quote:
Anyway I trust you and so will assume that the ld/st count is always less than the rest 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
A Theoretical (vs. Proficient/Practical) Deterministic Primality Test  a1call  Miscellaneous Math  194  20180319 05:54 
The Limitations of the Cranial GPU  Flatlander  Science & Technology  3  20130613 13:34 
Maximum theoretical MPG  MiniGeek  Lounge  9  20080714 22:45 
Custom test for maximum CPU stress  Torpedo  Information & Answers  10  20071005 17:33 
LL test speed up?  jebeagles  Miscellaneous Math  16  20060104 02:43 