mersenneforum.org


mdettweiler 2014-07-27 20:38

Odd scaling of test times between two machines
 
Hi all,

I've been puzzling over something the last few days and am hoping that maybe someone here with more intimate knowledge of the interactions between FFT computations and processor speed/cache size can shed some light on it. :smile:

In trying to juggle the allocation of two of my computers to various subprojects at the No Prime Left Behind and Conjectures 'R Us projects, I swapped subprojects on the two computers for a week or so and compared the test timings I saw. The results surprised me, given the comparative capabilities of the two CPUs.

The two machines are:
[LIST]
[*]AMD Phenom II X4 N970 (Champlain) - 512 KB x 4 L2 cache, 2.2 GHz (but with some thermal throttling; it's a laptop that I haven't cleaned in a while)
[*]Intel Core i5-2400 (Sandy Bridge) - 6 MB L3 cache, 3.1 GHz (max turbo 3.4 GHz, likely not throttling much if at all)
[/LIST]
The two subprojects are:
[list]
[*]NPLB 14th Drive, LLR primality tests of k*2^n-1, range k=600-1001, n=~1.3M
[*]CRUS, PRP/N-1 primality tests (using LLR) of k*6^n-1, k=1597 and 36772, n=~1.68M - roughly equivalent to n=~4.35M base 2 in terms of decimal length
[/list]
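The base 6 to base 2 equivalence quoted above follows from matching the sizes of the two powers: 6^n has about as many digits as 2^m when m = n * log2(6). A minimal sketch of that arithmetic, where the function name is made up for illustration and the k multiplier is ignored because it is negligible at this scale:

```python
import math

def equivalent_base2_exponent(n: float, base: int) -> float:
    """Exponent m such that 2^m is about the same size as base^n."""
    return n * math.log2(base)

crus_n = 1.68e6   # approximate CRUS exponent, from the post
nplb_n = 1.3e6    # approximate NPLB exponent, from the post

m = equivalent_base2_exponent(crus_n, 6)
print(f"base 6, n={crus_n:.3g} ~ base 2, n={m / 1e6:.2f}M")
print(f"CRUS numbers are about {m / nplb_n:.1f}x longer than NPLB numbers")
```

This comes out a little under 4.35M, consistent with the rough figure in the post.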
The N970 is (very roughly) equivalent in speed to a slow Core 2 Quad, while the i5-2400 is roughly 2.5-3x faster than a Core 2 per core, thanks to its better instruction throughput and AVX.

For the CRUS base 6 project, an FFT size of 448K is selected by LLR. The NPLB base 2 work uses an FFT size of 96K. Both of these are sufficiently small to fit within L2/L3 cache, respectively; the base 6 work just barely fits within the AMD's L2.

Others at NPLB and CRUS have reported that, in general, Intel CPUs tend to do better than AMDs as FFT size increases. However, this is the exact opposite of what I'm seeing here! These are the test times I'm getting (approximately):
[list]
[*]61,000 seconds/test for CRUS base 6 on the AMD
[*]24,500 seconds/test for CRUS base 6 on the Intel
[*]3,500 seconds/test for NPLB base 2 on the AMD
[*]800 seconds/test for NPLB base 2 on the Intel
[/list]
In relative terms, [B]the AMD is ~2.5x worse than the Intel on base 6 (the larger FFT), but ~4.38x worse than the Intel on base 2 (the smaller FFT).[/B]
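For anyone who wants to check the ratios, here is a small sketch that recomputes them from the approximate timings listed above. The dictionary layout and function names are only illustrative. It also shows the same data from the other angle: how much each machine slows down when moving from the small FFT to the large one.

```python
# Approximate seconds per test, copied from the post.
times = {
    ("CRUS base 6", "AMD"): 61_000,
    ("CRUS base 6", "Intel"): 24_500,
    ("NPLB base 2", "AMD"): 3_500,
    ("NPLB base 2", "Intel"): 800,
}

def amd_vs_intel(project: str) -> float:
    """How many times slower the AMD is than the Intel on one project."""
    return times[(project, "AMD")] / times[(project, "Intel")]

def large_vs_small(machine: str) -> float:
    """How many times longer a base 6 test takes than a base 2 test."""
    return times[("CRUS base 6", machine)] / times[("NPLB base 2", machine)]

for project in ("CRUS base 6", "NPLB base 2"):
    print(f"{project}: AMD is {amd_vs_intel(project):.2f}x slower")

for machine in ("AMD", "Intel"):
    print(f"{machine}: base 6 test takes {large_vs_small(machine):.1f}x longer")
```

The second loop makes the anomaly easier to see: going from the 96K FFT work to the 448K FFT work costs the Intel a much larger factor than it costs the AMD.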

Does anyone know why this might be happening? Again, it runs completely counter to the conventional wisdom on how AMDs and Intels perform as FFT sizes increase. Indeed, others at NPLB/CRUS have reported results in line with the conventional wisdom, i.e. AMD K8 processors performing increasingly badly relative to Intel Core 2s as FFT size increased. I am quite thoroughly confused. :confused:

Are gwnum's non-base-2 FFTs not quite as heavily optimized for AVX by chance? (I'm pretty sure both bases [i]are[/i] using AVX FFTs on the Intel. I don't have physical access to it so I can't tell you for sure, but I have another Sandy Bridge box with me and it's using AVX for the base 6 tests.)

Max

Prime95 2014-07-28 03:49

[QUOTE=mdettweiler;379172]
For the CRUS base 6 project, an FFT size of 448K is selected by LLR. The NPLB base 2 work uses an FFT size of 96K. Both of these are sufficiently small to fit within L2/L3 cache, respectively; the base 6 work just barely fits within the AMD's L2.
[/QUOTE]

A 448K FFT uses 448K * 8 bytes plus sin/cos and weighting data -- around 4MB.
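The arithmetic above can be sketched as follows. This assumes "K" means 1024 elements and counts only the 8-byte data values, so it is a lower bound that leaves out the sin/cos and weighting tables. The cache sizes are the ones quoted earlier in the thread, and the sketch looks at a single test only, not several workers sharing the L3.

```python
KIB = 1024
MIB = 1024 * KIB

def fft_data_bytes(fft_len_k: int, bytes_per_value: int = 8) -> int:
    """Raw data size of an FFT of fft_len_k * 1024 double-precision values."""
    return fft_len_k * KIB * bytes_per_value

# Cache sizes from the original post.
caches = {
    "AMD L2 (per core)": 512 * KIB,
    "Intel L3 (shared)": 6 * MIB,
}

for fft_len_k in (96, 448):
    size = fft_data_bytes(fft_len_k)
    print(f"{fft_len_k}K FFT: {size / MIB:.2f} MiB of raw data")
    for name, capacity in caches.items():
        verdict = "fits in" if size <= capacity else "exceeds"
        print(f"  {verdict} {name} ({capacity / MIB:.2f} MiB)")
```

Even before the extra tables are counted, the 96K FFT is already larger than one 512 KB L2, and the 448K FFT takes up more than half of the 6 MB L3.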

axn 2014-07-28 05:55

It is well known that AVX processors are very dependent on memory bandwidth (especially for larger FFTs). My 3GHz Ivy Bridge can do a 448K FFT of similar bit length (SR5) in just over half the time (about 13500s).

My best guess is that your memory is not running in dual-channel mode. Or it is just very slow. What is your memory spec/configuration?

mdettweiler 2014-07-28 16:35

Ah, that would do it - I happen to know for a fact that the i5-2400 is [i]not[/i] running in dual channel mode. (When I built it I had to pick up the RAM last-minute at a local store, and all I could get my hands on at the time was a single 4 GB stick.)

The AMD, by contrast, has 8 GB of 665 MHz memory - not very fast, but it [i]is[/i] running in dual channel mode (2x4 GB), as confirmed by CPU-Z.
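As a rough illustration of why the channel count matters, the theoretical peak bandwidth of DDR memory can be estimated as clock x 2 transfers per clock x bus width x channels. The sketch below assumes the 665 MHz figure is the DRAM clock as reported by CPU-Z and that each channel is 64 bits (8 bytes) wide. Both are assumptions, and real sustained bandwidth is lower than this peak. The i5's memory speed is not given in the thread, so the single-channel line simply reuses the same clock for comparison.

```python
def peak_bandwidth_gb_s(dram_clock_mhz: float, channels: int,
                        bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (decimal gigabytes)."""
    transfers_per_second = dram_clock_mhz * 1e6 * 2  # double data rate
    return transfers_per_second * bus_bytes * channels / 1e9

dual = peak_bandwidth_gb_s(665, channels=2)    # the AMD laptop's configuration
single = peak_bandwidth_gb_s(665, channels=1)  # hypothetical: same clock, one module

print(f"dual channel:   {dual:.2f} GB/s")
print(f"single channel: {single:.2f} GB/s")
```

At any given memory clock, a single module gives half the peak bandwidth of a matched pair, which is the gap a second stick would close.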

That makes a whole lot of sense - thanks! I'll have to look into putting a second module in that i5...

Also, thanks George for the tidbit on FFT sizes - I forgot the need to multiply by 8. In that case, it seems that neither machine can keep the 448K FFT in cache and both are working out of main memory, which would explain the memory bandwidth issues. The 96K FFT, by contrast, is well within the Intel's 6 MB L3 cache, but outside the AMD's 512 KBx4 L2 cache, which is why the Intel does so much better there.

