![]() |
|
|
#1 |
|
Jan 2010
516 Posts |
I have a few questions regarding FFT length.
What is the difference in the length or size of the FFT and what does it mean in terms of performance? I've been been messing around in with my worktodo.txt for LL testing and noticed it will only run between 2560K and 5120K FFT sizes. 2560K seeming to have the quickest iteration time, while running across all 4 cores with about 66% load. While the 4096K length seeming to be the slowest, and 5120K has the highest CPU load across 4 cores at about 96% load, but is quicker then 4096K length, which I found strange. Hoping someone can explain. (AMD Phenom II 955 @3.4GHz, 4GB memory with 768MB dedicated to Prime95 running on Windows 7 64-bit) Here is a crude chart of running the LL test on across 4 cores with different FFT lengths. 2560K FFT length - iteration average time 0.026ms 3072K FFT length - iteration average time 0.030ms 4096K FFT length - iteration average time 0.037ms 5120K FFT length - iteration average time 0.031ms Below is a benchmark test also showing that 5120K FFT length iterations seems to be quicker then the previous 4096K FFT length across 3 and 4 cores. I have searched around and have yet to find anything explaining this. Code:
AMD Phenom(tm) II X4 955 Processor CPU speed: 3400.14 MHz, 4 cores CPU features: RDTSC, CMOV, Prefetch, 3DNow!, MMX, SSE, SSE2 L1 cache size: 64 KB L2 cache size: 512 KB, L3 cache size: 6 MB L1 cache line size: 64 bytes L2 cache line size: 64 bytes L1 TLBS: 48 L2 TLBS: 512 Prime95 64-bit version 25.11, RdtscTiming=1 Best time for 768K FFT length: 11.671 ms. Best time for 896K FFT length: 14.178 ms. Best time for 1024K FFT length: 15.923 ms. Best time for 1280K FFT length: 20.229 ms. Best time for 1536K FFT length: 24.435 ms. Best time for 1792K FFT length: 29.733 ms. Best time for 2048K FFT length: 33.436 ms. Best time for 2560K FFT length: 44.284 ms. Best time for 3072K FFT length: 53.468 ms. Best time for 3584K FFT length: 64.629 ms. Best time for 4096K FFT length: 72.950 ms. Best time for 5120K FFT length: 96.691 ms. Best time for 6144K FFT length: 121.366 ms. Best time for 7168K FFT length: 149.088 ms. Best time for 8192K FFT length: 169.016 ms. Timing FFTs using 2 threads. Best time for 768K FFT length: 7.600 ms. Best time for 896K FFT length: 9.295 ms. Best time for 1024K FFT length: 10.797 ms. Best time for 1280K FFT length: 14.236 ms. Best time for 1536K FFT length: 17.108 ms. Best time for 1792K FFT length: 20.311 ms. Best time for 2048K FFT length: 22.891 ms. Best time for 2560K FFT length: 31.159 ms. Best time for 3072K FFT length: 36.824 ms. Best time for 3584K FFT length: 43.783 ms. Best time for 4096K FFT length: 49.583 ms. Best time for 5120K FFT length: 52.662 ms. Best time for 6144K FFT length: 64.240 ms. Best time for 7168K FFT length: 77.621 ms. Best time for 8192K FFT length: 89.281 ms. Timing FFTs using 3 threads. Best time for 768K FFT length: 5.512 ms. Best time for 896K FFT length: 6.392 ms. Best time for 1024K FFT length: 7.365 ms. Best time for 1280K FFT length: 12.054 ms. Best time for 1536K FFT length: 14.099 ms. Best time for 1792K FFT length: 16.238 ms. Best time for 2048K FFT length: 18.034 ms. Best time for 2560K FFT length: 25.825 ms. Best time for 3072K FFT length: 29.850 ms. Best time for 3584K FFT length: 34.141 ms. Best time for 4096K FFT length: 38.356 ms. Best time for 5120K FFT length: 35.637 ms. Best time for 6144K FFT length: 43.605 ms. Best time for 7168K FFT length: 53.100 ms. Best time for 8192K FFT length: 59.837 ms. Timing FFTs using 4 threads. Best time for 768K FFT length: 5.206 ms. Best time for 896K FFT length: 6.021 ms. Best time for 1024K FFT length: 6.899 ms. Best time for 1280K FFT length: 11.291 ms. Best time for 1536K FFT length: 13.120 ms. Best time for 1792K FFT length: 15.069 ms. Best time for 2048K FFT length: 16.804 ms. Best time for 2560K FFT length: 23.911 ms. Best time for 3072K FFT length: 27.644 ms. Best time for 3584K FFT length: 31.485 ms. Best time for 4096K FFT length: 35.525 ms. Best time for 5120K FFT length: 28.108 ms. Best time for 6144K FFT length: 34.150 ms. Best time for 7168K FFT length: 40.874 ms. Best time for 8192K FFT length: 45.925 ms. Best time for 58 bit trial factors: 2.075 ms. Best time for 59 bit trial factors: 2.128 ms. Best time for 60 bit trial factors: 2.123 ms. Best time for 61 bit trial factors: 2.309 ms. Best time for 62 bit trial factors: 2.354 ms. Best time for 63 bit trial factors: 2.800 ms. Best time for 64 bit trial factors: 3.148 ms. Best time for 65 bit trial factors: 3.903 ms. Best time for 66 bit trial factors: 4.656 ms. Best time for 67 bit trial factors: 4.629 ms. |
|
|
|
|
|
#2 |
|
Jan 2010
5 Posts |
Is there anyone that can help answer my question? (Been two weeks since I posted) I don't want to start messing around with values and submitting bad LL test wasting CPU hours. Maybe I was asking to much in my first post.
Why are there different FFT lengths/sizes? How does Prime95 decide the FFT length when doing LL test? What are the impacts of changing the default assigned length? |
|
|
|
|
|
#3 |
|
Sep 2006
Brussels, Belgium
32268 Posts |
You could search the Mersenne Wiki and this forum.
You should not use smaller FFT length than those proposed because the precision of the computations would not be enough and you would get a lot of rounding errors. As to why the average iteration time is smaller for 5120KB than 4096 KB with four threads on your architecture I can not answer. May be George or Ernst could answer. Jacob Last fiddled with by wblipp on 2010-02-14 at 20:35 Reason: Added "not" in "should not use" |
|
|
|
|
|
#4 |
|
"William"
May 2003
New Haven
2·7·132 Posts |
As an analog, suppose you were calculating an by taking the logarithm of a, multiplying by n, and then taking the anti-log. Do you understand that if you had too few digits in the approximation of log(a), there would be errors in the final result?
The same kind of thing happens with FFTs. Taking the FFT is a transform (the T stands for Transform) like taking the logarithm is a transform. You work in this transform space - in fact you multiply by 2 - like multiplying by n is working in the log transform space. Then you reverse transform, like taking the anti-logarithm is a reverse-transform from log space. If you have enough digits of log(a), the final result will be close enough to the exact result that you can recover the exact integer result. Likewise, if the FFT is long enough, the final result will be close enough to the exact result that you can recover the exact integer result. William Last fiddled with by wblipp on 2010-02-14 at 20:49 |
|
|
|
|
|
#5 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
1D6616 Posts |
I have no answer to 5120K being faster than 4096K using 4 threads. Something I'll have to look into one day.
|
|
|
|
|
|
#6 |
|
Aug 2002
North San Diego County
5×137 Posts |
With 3+ threads, most (if not all 4) of my Phenom II boxes benchmark with lower iteration times at 5120K than 4096K. I have not tested to see if that also translates to non-benchmark situtations.
EDIT: All Phenom II processors exhibit the odd 5120K reduction with 3+ threads. The Athlon IIx4 (essentially Phenom II without L3 cache) does not. Last fiddled with by sdbardwick on 2010-02-14 at 22:42 |
|
|
|
|
|
#7 | |
|
"Richard B. Woods"
Aug 2002
Wisconsin USA
170148 Posts |
Quote:
The longer the FFT, the more CPU instructions are required to perform the arithmetic of the transforms. On the face of it, this would imply that iterations (which consist almost entirely of two FF transforms) using a 5120K FFT would always take longer than iterations using a 4096K FFT or any shorter FFT. However, the relationship between FFT length and iteration time is not just a simple linear function of numbers of instructions executed. Another factor, which has become increasingly important as CPU design has evolved, is the amount of time needed for transferring data between the storage and execution units of the CPU. The presence of higher-speed-than-main-RAM cache memory between the main RAM storage and the execution unit(s) breaks the strictly linear time relationship between amount of data transfer and elapsed time required for data transfer. When some chunks of data can be transferred faster than other equal-sized chunks of data, execution time analysis becomes more complicated. But even after considering this, as a first approximation we'd still expect a 5120K FFT to take longer to transform than a 4096K FFT ... that is, if the general manner in which data were transferred between RAM and cache and registers were the same, overall, for the two different-sized FFTs. Now, Prime95 has definitely been optimized with the cache-memory effects in mind; the nonlinear data transfer rates have not been ignored. Also, tests have shown that, on a variety of CPUs, the different ways in which Prime95 is tuned for different type of cache-RAM relationships has indeed resulted in faster iteration times for smaller FFTs. But what we have here is some exception. Perhaps there is some obscure aspect of cache operation in the AMD Phenom II with L3 cache that has not been properly understood. Indeed, George Woltman, principal developer of Prime95, had to determine certain characteristics through experimentation ... because not every part of cache and bus operation has been documented by the manufacturers! So it could be that in your system's particular configuration, something doesn't operate in the way we think it does, and that this means that George's code isn't optimum for that configuration for FFT sizes under 5120K. (Or maybe there's some other reason.) Last fiddled with by cheesehead on 2010-02-15 at 20:39 |
|
|
|
|
|
|
#8 | |
|
Jul 2006
Calgary
1A916 Posts |
Quote:
|
|
|
|
|
|
|
#9 |
|
∂2ω=0
Sep 2002
República de California
103·113 Posts |
With a decently well-coded set of non-power-of-2 FFT-pass routines this effect should be on the 5%-penalty-or-less ballpark, when normalized by FFT length. If you see a timing effect much larger than that it's usually due to cache-mapping issues or quirks of the multithreading, both of which can be difficult to track down. (Often because they only occur on a small % of hardware/software configurations.)
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Help with FFT length | Gradient | Software | 3 | 2013-12-16 01:53 |
| Factorization length test? | siegert81 | Math | 17 | 2011-12-20 15:33 |
| FFt length | mack | Information & Answers | 1 | 2009-09-06 03:24 |
| llr: FFT-length not monotone in n? | hhh | Software | 4 | 2008-12-20 09:49 |
| Q: FATAL: length 840 K not available. Halting... ? | MartinHvidberg | Software | 14 | 2003-07-23 16:14 |