mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   maximum theoretical speed of LL test w/o bandwidth limitations? (https://www.mersenneforum.org/showthread.php?t=16232)

ixfd64 2011-11-13 19:37

maximum theoretical speed of LL test w/o bandwidth limitations?
 
I know that in many cases, the speed of a LL test depends on not so much the CPU speed, but rather the memory bandwidth. Assuming that the latter isn't an issue, how much of a speedup can we get with current-generation processors? I recall someone saying that their Tesla GPU was using only 25% of its processing power, but does anyone know if the figures are similar in the case of CPUs?

Suddenly, the prospects of [url=http://blogs.intel.com/research/2011/09/hmc.php]"hybrid memory cubes"[/url] and DDR4 are starting to sound attractive...

Dubslow 2011-11-13 20:26

I've also been wondering how much the cache sizes and memory bandwidth affect LL tests. Would it improve performance to get higher bandwidth memory? (And for the same frequency, would a 2600k with 8MB L3 be faster than a 2500k with 6MB L3, for example?)

Christenson 2011-11-15 03:57

Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT.

8M multiplies per FFT
*2 FFTs per LL step
*50M LL steps
= 800 * 10^12 128 bit multiplies
Assume a (theoretical) 1 GHz machine -- it will take it 16 clock steps to do that 128 bit multiply, so we end up with
16 * 800 * 10^12 clocks to do it...12800 seconds, about 3.5 hours.

Now, I need someone to come poke holes in my numbers...and remind me where I lost a zero.
****************
LL tests are specifically designed to be friendly to caches, so more L3 cache is better, all else being equal.

Dubslow 2011-11-15 04:28

How much better? Would the extra 2 MB be a 2% or 6% increase?

axn 2011-11-15 04:54

[QUOTE=Christenson;278412]Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT. [/QUOTE]
You have misunderstood the mechanics of FFT.
Prime95 uses doubles (64 bit floating points) as the FFT "limbs" -- not bytes. It stores about 19 bits of data in a single limb. So that is around 50M/19 ~ 2.6M limbs. Since we can't do arbitrary sized FFT, we use the next higher size FFT - 3M. So, the opcount should be 3M * log2(3M) [But this is an approximation -- to get an exact count, we need to look at the actual FFT operations]. Double that for inverse FFT. Then 3M point-wise multiplications for the actual squaring. So roughly 3M + 2*3M*log2(3M).

EDIT:- Doing parallel operations using SSE2 merely reduces the number of clock cycles -- doesn't reduce the op count.

EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount :sad:

Dubslow 2011-11-15 05:13

I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.

axn 2011-11-15 05:41

[QUOTE=Dubslow;278431]I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.[/QUOTE]

Yes. But, NlgN ought to be a pretty good approximation.

FFT algorithm is regular enough that we can count exactly the number of ops. There would be a multiplier, but it'll be a small one (say, 2). But there would also be other factors, and we should be able to count all of them. Anyways, I think I'm off by a factor of 2, at the most.

Dubslow 2011-11-15 06:02

Huh. I thought I remember reading somewhere it was closer to 5... but you at least have calculations, which is better than me. :razz::smile:

ldesnogu 2011-11-15 08:53

[QUOTE=axn;278429]EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount :sad:[/QUOTE]
Hmm, even if you have unlimited bandwidth and no latency, you still have to load and store data. Also I wonder where the "4 ops/cycle" comes from; what does it take into account?

An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category :smile:

axn 2011-11-15 09:34

[QUOTE=ldesnogu;278449]Hmm, even if you have unlimited bandwidth and no latency, you still have to load and store data.[/quote]
In the theoretical model, you shouldn't count that. Loads/Stores can usually execute simultaneously with other inflight instructions.

[QUOTE=ldesnogu;278449]Also I wonder where the "4 ops/cycle" comes from; what does it take into account?[/quote]
Core 2 and onwards can execute upto 4 FLOP/cycle. Sandy bridge can do 8.


[QUOTE=ldesnogu;278449]An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category :smile:[/QUOTE]
Like I said, FFT is very regular (no coditional jumps, etc.) You can do a static count based on algorithm. In fact, I think George may already have done this exercise.

ldesnogu 2011-11-15 11:29

[QUOTE=axn;278452]In the theoretical model, you shouldn't count that. Loads/Stores can usually execute simultaneously with other inflight instructions.[/QUOTE]
You're right as long as ld/st count doesn't exceed (math) operations count. Quickly looking at prime95 source code, it looks like there is a lot of memory operations (obviously), but that doesn't say much. I tried to find some papers about memory operations count for FFT and couldn't find anything, except for the classical papers that talk about optimization of the memory hierarchy.

Anyway I trust you and so will assume that the ld/st count is always less than the rest :smile:


All times are UTC. The time now is 23:30.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.