Hi all,

I've been puzzling over something the last few days and am hoping that maybe someone here with more intimate knowledge of the interactions between FFT computations and processor speed/cache size can shed some light on it.

In trying to juggle the allocation of two of my computers to various subprojects at the No Prime Left Behind and Conjectures 'R Us projects, I swapped subprojects on the two computers for a week or so and compared the test timings I saw. The results surprised me, given the comparative capabilities of the two CPUs.

The two machines are:

- AMD Phenom II X4 N970 (Champlain) - 512 KB x 4 L2 cache, 2.2 GHz (but with some thermal throttling; it's a laptop that I haven't cleaned in a while)
- Intel Core i5-2400 (Sandy Bridge) - 6 MB L3 cache, 3.1 GHz (max turbo 3.4 GHz, likely not throttling much if at all)

The two subprojects are:

- NPLB 14th Drive, LLR primality tests of k*2^n-1, range k=600-1001, n=~1.3M
- CRUS, PRP/N-1 primality tests (using LLR) of k*6^n-1, k=1597 and 36772, n=~1.68M - roughly equivalent to n=~4.35M base 2 in terms of decimal length
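
For anyone who wants to check the base 6 to base 2 conversion, here's a quick sketch. It just scales the exponent by log2(6); the function name is my own and n=1.68M is the approximate figure from above, so treat the result as a ballpark.

```python
import math


def base2_equivalent_exponent(n: float, base: int) -> float:
    """Return m such that 2**m is about the same size as base**n.

    Hypothetical helper for illustration only; not part of LLR or gwnum.
    """
    return n * math.log2(base)


crus_n = 1.68e6  # approximate CRUS exponent for k*6^n-1
equiv = base2_equivalent_exponent(crus_n, 6)
print(f"base 6, n={crus_n / 1e6:.2f}M is roughly base 2, n={equiv / 1e6:.2f}M")
```

With the rounded n=1.68M this lands at about 4.34M, in line with the ~4.35M figure quoted above.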

The N970 is (very roughly) equivalent in speed to a slow Core 2 Quad, while the i5-2400 is roughly 2.5-3x faster than a Core 2 per-core between its better instruction throughput and AVX.

For the CRUS base 6 project, an FFT size of 448K is selected by LLR. The NPLB base 2 work uses an FFT size of 96K. Both of these are sufficiently small to fit within L2/L3 cache, respectively; the base 6 work just barely fits within the AMD's L2.

Others at NPLB and CRUS have reported that, in general, Intel CPUs tend to do better than AMDs as FFT size increases. However, this is the exact opposite of what I'm seeing here! These are the test times I'm getting (approximately):

- 61,000 seconds/test for CRUS base 6 on the AMD
- 24,500 seconds/test for CRUS base 6 on the Intel
- 3,500 seconds/test for NPLB base 2 on the AMD
- 800 seconds/test for NPLB base 2 on the Intel

In relative terms,

**the AMD is ~2.5x worse than the Intel on base 6 (the larger FFT), but ~4.38x worse than the Intel on base 2 (the smaller FFT).**

Does anyone know why this might be happening? Again, it runs completely counter to the conventional wisdom on how AMDs and Intels perform as FFT sizes increase. Indeed, others at NPLB/CRUS have reported results in line with the conventional wisdom, i.e. AMD K8 processors performing increasingly badly w.r.t. Intel Core 2s as FFT size increased. I am quite thoroughly confused.
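
In case it helps, the ratios come straight from the approximate timings I listed. A minimal sketch of the arithmetic (the dictionary layout and function name are just for illustration):

```python
# Approximate seconds per test, as reported above.
timings = {
    ("CRUS base 6", "AMD"): 61_000,
    ("CRUS base 6", "Intel"): 24_500,
    ("NPLB base 2", "AMD"): 3_500,
    ("NPLB base 2", "Intel"): 800,
}


def amd_slowdown(project: str) -> float:
    """How many times longer the AMD takes per test than the Intel."""
    return timings[(project, "AMD")] / timings[(project, "Intel")]


for project in ("CRUS base 6", "NPLB base 2"):
    print(f"{project}: AMD takes {amd_slowdown(project):.2f}x as long as the Intel")
```

Since the timings themselves are rough, the second decimal place shouldn't be taken too seriously; the point is only that the gap is noticeably wider on the smaller FFT.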

Are gwnum's non-base-2 FFTs not quite as heavily optimized for AVX by chance? (I'm pretty sure both bases *are* using AVX FFTs on the Intel. I don't have physical access to it so I can't tell you for sure, but I have another Sandy Bridge box with me and it's using AVX for the base 6 tests.)

Max