2014-07-27, 20:38  #1
A Sunny Moo
Aug 2007
USA (GMT-5)
1100001101001_{2} Posts 
Odd scaling of test times between two machines
Hi all,
I've been puzzling over something the last few days and am hoping that maybe someone here with more intimate knowledge of the interactions between FFT computations and processor speed/cache size can shed some light on it. In trying to juggle the allocation of two of my computers to various subprojects at the No Prime Left Behind and Conjectures 'R Us projects, I swapped subprojects on the two computers for a week or so and compared the test timings I saw. The results surprised me, given the comparative capabilities of the two CPUs. The two machines are:
The two subprojects are:
The N970 is (very roughly) equivalent in speed to a slow Core 2 Quad, while the i5-2400 is roughly 2.5-3x faster than a Core 2 per core, between its better instruction throughput and AVX. For the CRUS base 6 project, LLR selects an FFT size of 448K. The NPLB base 2 work uses an FFT size of 96K. Both of these are sufficiently small to fit within L2/L3 cache, respectively; the base 6 work just barely fits within the AMD's L2. Others at NPLB and CRUS have reported that, in general, Intel CPUs tend to do better than AMDs as FFT size increases. However, this is the exact opposite of what I'm seeing here! These are the test times I'm getting (approximately):
In relative terms, the AMD is ~2.5x worse than the Intel on base 6 (the larger FFT), but ~4.38x worse than the Intel on base 2 (the smaller FFT). Does anyone know why this might be happening? Again, it runs completely counter to the conventional wisdom on how AMDs and Intels perform as FFT sizes increase. Indeed, others at NPLB/CRUS have reported results in line with the conventional wisdom, i.e. AMD K8 processors performing increasingly badly w.r.t. Intel Core 2s as FFT size increased. I am quite thoroughly confused.
Are gwnum's non-base-2 FFTs not quite as heavily optimized for AVX, by chance? (I'm pretty sure both bases are using AVX FFTs on the Intel. I don't have physical access to it so I can't tell you for sure, but I have another Sandy Bridge box with me and it's using AVX for the base 6 tests.)
Max
Last fiddled with by mdettweiler on 2014-07-27 at 20:40
2014-07-28, 03:49  #2
P90 years forever!
Aug 2002
Yeehaw, FL
3·5·499 Posts 
A 448K FFT uses 448K * 8 bytes plus sin/cos and weighting data - around 4 MB.
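For anyone wanting to check that arithmetic, here is a minimal sketch. It assumes "K" means 1024 elements and that each element is one 8-byte double; it counts only the FFT data itself, so the sin/cos and weighting tables mentioned above come on top of these figures. The function names are made up for illustration and are not part of gwnum or LLR.

```python
# Rough FFT data footprint, assuming 1K = 1024 elements of 8 bytes each.
# Sin/cos and weighting tables are NOT included here.
BYTES_PER_DOUBLE = 8


def fft_data_bytes(fft_len_k):
    """Bytes needed for the FFT data array of a `fft_len_k`-K FFT."""
    return fft_len_k * 1024 * BYTES_PER_DOUBLE


def to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    for k in (96, 448):
        print(f"{k}K FFT: {to_mib(fft_data_bytes(k)):.2f} MiB of data")
```

Under those assumptions the 448K FFT needs 3.5 MiB for the data alone, which is consistent with the "around 4 MB" total once the extra tables are added.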

2014-07-28, 05:55  #3
Jun 2003
3^{2}·19·29 Posts 
It is well known that AVX processors are very dependent on memory bandwidth (especially for larger FFTs). My 3 GHz Ivy Bridge can do a 448K FFT of similar bit length (SR5) in just over half the time (about 13500s).
My best guess is that your memory is not running in dual-channel mode. Or it is just very slow. What is your memory spec/configuration?
Last fiddled with by axn on 2014-07-28 at 05:56
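To put a number on why single- versus dual-channel matters, here is a small sketch of the usual theoretical peak bandwidth formula: transfers per second, times 8 bytes per transfer on a 64-bit channel, times the number of channels. The 1333 MT/s rate is a hypothetical example value (DDR3-1333), not a spec reported for either machine, and real sustained bandwidth is lower than this peak.

```python
# Theoretical peak DRAM bandwidth = transfer rate * bus width * channels.
BUS_WIDTH_BYTES = 8  # one 64-bit DDR channel moves 8 bytes per transfer


def peak_bandwidth_gb_s(mega_transfers_per_s, channels):
    """Theoretical peak bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * BUS_WIDTH_BYTES * channels / 1e9


if __name__ == "__main__":
    RATE_MT_S = 1333  # hypothetical DDR3-1333; an assumption, not from the thread
    for channels in (1, 2):
        gb_s = peak_bandwidth_gb_s(RATE_MT_S, channels)
        print(f"{channels} channel(s): {gb_s:.1f} GB/s peak")
```

Whatever the actual memory speed, dropping from two channels to one halves the theoretical peak, which is the effect suspected here.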
2014-07-28, 16:35  #4
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts 
Ah, that would do it - I happen to know for a fact that the i5-2400 is not running in dual-channel mode. (When I built it I had to pick up the RAM last-minute at a local store, and all I could get my hands on at the time was one 4 GB stick.)
The AMD, by contrast, has 8 GB of 665 MHz memory - not very fast, but it is running in dual-channel mode (2x4 GB), as confirmed by CPU-Z. That makes a whole lot of sense - thanks! I'll have to look into putting a second module in that i5...
Also, thanks George for the tidbit on FFT sizes - I forgot the need to multiply by 8. In that case, it seems that both machines are running outside of cache for the 448K FFT, which would explain the memory bandwidth issues. The 96K FFT, by contrast, is well within the Intel's 6 MB L3 cache, but outside the AMD's 4x512 KB L2 cache, which is why the Intel does so much better there.
Last fiddled with by mdettweiler on 2014-07-28 at 16:38
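The cache reasoning above can be sanity-checked with a rough sketch. It uses the cache sizes quoted in this thread (6 MB shared L3 on the Intel, 512 KB of L2 per core on the AMD) and counts FFT data only, at 8 bytes per element with 1K = 1024. The `sharers` parameter is an assumption added for illustration: it models several tests running at once and sharing one cache, and the thread does not say how many instances were running. All names are hypothetical.

```python
# Does the FFT data fit in a given cache? Data only; sin/cos and weight
# tables would add to the footprint.
BYTES_PER_DOUBLE = 8
KIB = 1024
MIB = 1024 * KIB

INTEL_L3 = 6 * MIB           # shared L3, size as quoted in the thread
AMD_L2_PER_CORE = 512 * KIB  # private L2 per core, as quoted in the thread


def fft_data_bytes(fft_len_k):
    """Bytes for the data array of a `fft_len_k`-K FFT (1K = 1024)."""
    return fft_len_k * 1024 * BYTES_PER_DOUBLE


def fits_in_cache(fft_len_k, cache_bytes, sharers=1):
    """True if `sharers` copies of the FFT data fit in `cache_bytes`.

    `sharers` is an assumed number of concurrent tests sharing the cache.
    """
    return fft_data_bytes(fft_len_k) * sharers <= cache_bytes


if __name__ == "__main__":
    for k in (96, 448):
        intel = fits_in_cache(k, INTEL_L3, sharers=4)  # assumes 4 tests at once
        amd = fits_in_cache(k, AMD_L2_PER_CORE)        # L2 is private per core
        print(f"{k}K FFT: fits Intel L3 = {intel}, fits AMD L2 = {amd}")
```

Note that the 448K result on the Intel depends on the assumption: with `sharers=1` the 3.5 MiB of data alone would fit in a 6 MB L3, so the "outside cache" conclusion only follows if several tests share that cache or the extra tables push the footprint past it.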