mersenneforum.org Variable FFT sizes
 Register FAQ Search Today's Posts Mark Forums Read

 2008-01-13, 13:36 #1 Mini-Geek Account Deleted     "Tim Sorbera" Aug 2006 San Antonio, TX USA 17·251 Posts Variable FFT sizes Why are the FFT sizes only a few fixed sizes? Why not have them variable, only being as large as they need to to do the number without errors. As mentioned in The Happy Me(al) thread, I barely got one that was just outside of an FFT size to use the smaller one, and I have one that's on the lower end of its FFT size. Surely it doesn't need all 2560K FFT size precision to do something 39.78M? Even if 2048K is indeed too inaccurate for the number, there must be something in between those that will give sufficient accuracy and higher speed. To see what FFT size it needs, perhaps a ~1 hour self-test running the number at different FFT sizes, picking the smallest one that's within acceptable accuracy limits. If not variable, perhaps many sizes, such as 1 per ~1-2M instead of 1 per ~5-10M. In something only somewhat-related, is there a way I can get Prime95 to check if my 39.78M will run within acceptable limits on 2048K? It did this automatically, and passed, with my 39.509M number, and I just wanted to make sure I'm not running my other one at a larger FFT size if it's not needed.
 2008-01-13, 15:08 #2 davieddy     "Lucan" Dec 2006 England 2×3×13×83 Posts If you ask Primenet for an exponent NOW, you should get one <39.5M. A cursory study of how a FFT works should explain why the optimum size is a power of 2, as will a glance at the iteration times for more awkward sizes. I assume that this trend continues for sizes less smooth than (2^19)*5. Not to mention that I suspect the programming of each size has been customized by George! David It's similar to the reason why tennis tournaments start with 64 or 128 players. Last fiddled with by davieddy on 2008-01-13 at 15:34
2008-01-13, 18:44   #3
Mini-Geek
Account Deleted

"Tim Sorbera"
Aug 2006
San Antonio, TX USA

17×251 Posts

Quote:
 Originally Posted by davieddy If you ask Primenet for an exponent NOW, you should get one <39.5M.
Yeah, but I have no reason to just pass it on instead of finishing it (unless it's likely an SSE will take it up instead of another SSE2), I just wanted to make sure it's not wasting time by using a higher FFT size.
Quote:
 Originally Posted by davieddy A cursory study of how a FFT works should explain why the optimum size is a power of 2, as will a glance at the iteration times for more awkward sizes. I assume that this trend continues for sizes less smooth than (2^19)*5. Not to mention that I suspect the programming of each size has been customized by George! David It's similar to the reason why tennis tournaments start with 64 or 128 players.
I've tried to look up how FFT works, but it's pretty complicated. Over my head.
I've never noticed how the iteration times are with powers of two before, but now I see that. I guess variable sizes really would be worse.

Anyone know about forcing the self-test to see if 2048K can run the number?

2008-01-14, 12:51   #4
davieddy

"Lucan"
Dec 2006
England

2·3·13·83 Posts

Quote:
 Originally Posted by Mini-Geek I've tried to look up how FFT works, but it's pretty complicated. Over my head.
I was referring to "butterfly" diagrams which show where the
"Fast" enters into the Fast (Discrete) Fourier Transform. These
diagrams are what resemble the draw for a tennis tournament.
I agree that a full understanding of the subject cannot be gained
by "cursory study".

2008-01-14, 16:37   #5
ewmayer
2ω=0

Sep 2002
República de California

5·7·331 Posts

Quote:
 Originally Posted by Mini-Geek Why are the FFT sizes only a few fixed sizes? Why not have them variable, only being as large as they need to to do the number without errors. As mentioned in The Happy Me(al) thread, I barely got one that was just outside of an FFT size to use the smaller one, and I have one that's on the lower end of its FFT size. Surely it doesn't need all 2560K FFT size precision to do something 39.78M? Even if 2048K is indeed too inaccurate for the number, there must be something in between those that will give sufficient accuracy and higher speed.
There is - 2304K. However, that [and analogously with all such ever-finer-gradations of length] needs something called a "radix-9 DFT" routine. [Factorize the FFT length and you'll see where that arises.] Similarly, the other 3 "in-between" lengths in going to 4096K need radices 11,13 and 15 respectively, in addition to the usual powers-of-2. By way of example, my Mlucas code supports all those, but the only one of them which consistently gives decent bang-for-the-coding buck is radix 9. Here's why:

1) The relative efficiency of an FFT is [roughly] related to the smoothness of the factorization of its length - thus power-of-2 radices are most efficient, and large-odd-prime radices are worst;

2) Larger radices lead to higher register pressure in-hardware, thus coding them to run efficiently becomes quite difficult.

The above is just a bare outline: feel free to do some homework on the DFT if you wish to understand this more fully. For example, a good advanced-topics question is: "For odd composite radices of similar size, why might one with no repeated factors be preferable to one with repeated factors? [E.g. radix 21 vs 25]."

Last fiddled with by ewmayer on 2008-01-14 at 16:42

2008-01-14, 17:16   #6
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

637910 Posts

Quote:
 "For odd composite radices of similar size, why might one with no repeated factors be preferable to one with repeated factors? [E.g. radix 21 vs 25]."
Because radix-21 has 12 roots of unity and radix-25 has 20? (OK, six and ten once you've dealt with negation) - but on an x86-like platform I'd have thought you handled the roots of unity with load-execute operations, and wouldn't the radix-21 DFT be done as 3x7 anyway?

Is there some obvious reason that I'm missing why fast FFT code tends to be {small}*2^N rather than {small}*2^N*3^M? The three-way decimation-in-time stage isn't too hideous, the cube roots of unity are conveniently 1 and +/- omega.

 2008-01-14, 17:33 #7 Mini-Geek Account Deleted     "Tim Sorbera" Aug 2006 San Antonio, TX USA 17×251 Posts Thanks for all the explanations, I understand now that it takes a very smooth FFT size to be efficient. I got this error on my 39.509M number and was wondering if it's serious or a software bug or anything, and if this significantly raises the chance that this will return an incorrect result (copied from results.txt, with the self-test checking if 2048K is acceptable): Code: Trying 1000 iterations for exponent 39509597 using 2048K FFT. If average roundoff error is above 0.2425, then a larger FFT will be used. Final average roundoff error is 0.24118, using 2048K FFT for exponent 39509597. [Mon Jan 14 01:17:53 2008] Iteration: 1420185/39509597, ERROR: ROUND OFF (0.40625) > 0.40 Continuing from last save file. [Mon Jan 14 01:42:17 2008] Disregard last error. Result is reproducible and thus not a hardware problem. For added safety, redoing iteration using a slower, more reliable method. Continuing from last save file. Last fiddled with by Mini-Geek on 2008-01-14 at 17:43 Reason: clarifying part of second sentence
2008-01-14, 17:33   #8
ewmayer
2ω=0

Sep 2002
República de California

5×7×331 Posts

Quote:
 Originally Posted by fivemack Because radix-21 has 12 roots of unity and radix-25 has 20? (OK, six and ten once you've dealt with negation) - but on an x86-like platform I'd have thought you handled the roots of unity with load-execute operations, and wouldn't the radix-21 DFT be done as 3x7 anyway?
"Why twiddle when you can permute?"

Quote:
 Is there some obvious reason that I'm missing why fast FFT code tends to be {small}*2^N rather than {small}*2^N*3^M? The three-way decimation-in-time stage isn't too hideous, the cube roots of unity are conveniently 1 and +/- omega.
Might be worth investigating, but I expect a highly optimized power-of-2 combined with a decent variety of odd leading radices gives something approaching an optimum in the performance/implementation-difficulty space. At large FFT lengths, one would have the option of perhaps-inefficient single leading radix [pay me now] followed by many really fast powers-of-two, or multiple powers of three, all slightly less efficient than their smoother power-of-2 brethren [pay me later].

EDIT: By way of example, I can do a complex radix-8 DFT [sans any multipass twiddles] with just 52 [scalar] add and 4 mul.

A radix-9 DFT, on the other hand, needs 84 add and 20 mul. That is significantly more ops-per-input than the power of 2. OTOOH, if I follow that radix-9 with a large power of 2, the extra work done in doing the one radix-9 pass gets offset [and more] by the shorter vector length compared to [say] going to the next-larger power of 2, or even to 10*[power of 2]. The radix-10 DFT needs 88 add and 20 mul and thus is slightly better-per-input, but that work gets multiplied by [10/9] in comparing vs 9*[same power of 2] to reflect the overall larger vector length.

Of course in the real world there is much more to it than mere arithmetic opcount - cache misses and memory bandwidth are much more important at larger vector lengyhs - but all other things being equal [which the same large-power-of-2 ensures] using a shorter vector length will usually give better cache behavior, as well, since proportionally more of the data array can fit in cache.

Last fiddled with by ewmayer on 2008-01-14 at 19:08

 Similar Threads Thread Thread Starter Forum Replies Last Post MARTHA Number Theory Discussion Group 11 2018-03-02 15:55 Unregistered Homework Help 4 2010-08-04 06:38 Cruelty Riesel Prime Search 3 2006-07-12 15:15 wblipp GMP-ECM 5 2005-04-24 20:04 Unregistered Hardware 4 2003-12-06 13:43

All times are UTC. The time now is 21:11.

Sat Jan 23 21:11:40 UTC 2021 up 51 days, 17:22, 0 users, load averages: 2.44, 2.08, 1.95