mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Variable FFT sizes (https://www.mersenneforum.org/showthread.php?t=9851)

Mini-Geek 2008-01-13 13:36

Variable FFT sizes
 
Why are the FFT sizes only a few fixed sizes? Why not have them variable, only being as large as they need to to do the number without errors. As mentioned in The Happy Me(al) thread, I barely got one that was just outside of an FFT size to use the smaller one, and I have one that's on the lower end of its FFT size. Surely it doesn't need all 2560K FFT size precision to do something 39.78M? Even if 2048K is indeed too inaccurate for the number, there must be something in between those that will give sufficient accuracy and higher speed.
To see what FFT size it needs, perhaps a ~1 hour self-test running the number at different FFT sizes, picking the smallest one that's within acceptable accuracy limits.
If not variable, perhaps many sizes, such as 1 per ~1-2M instead of 1 per ~5-10M.

In something only somewhat-related, is there a way I can get Prime95 to check if my 39.78M will run within acceptable limits on 2048K? It did this automatically, and passed, with my 39.509M number, and I just wanted to make sure I'm not running my other one at a larger FFT size if it's not needed.

davieddy 2008-01-13 15:08

If you ask Primenet for an exponent NOW, you should
get one <39.5M.
A cursory study of how a FFT works should explain why the
optimum size is a power of 2, as will a glance at the iteration
times for more awkward sizes. I assume that this trend continues
for sizes less smooth than (2^19)*5.
Not to mention that I suspect the programming of each size has been
customized by George!

David

It's similar to the reason why tennis tournaments start with
64 or 128 players.

Mini-Geek 2008-01-13 18:44

[quote=davieddy;122762]If you ask Primenet for an exponent NOW, you should
get one <39.5M.[/quote]
Yeah, but I have no reason to just pass it on instead of finishing it (unless it's likely an SSE will take it up instead of another SSE2), I just wanted to make sure it's not wasting time by using a higher FFT size.
[quote=davieddy;122762]A cursory study of how a FFT works should explain why the
optimum size is a power of 2, as will a glance at the iteration
times for more awkward sizes. I assume that this trend continues
for sizes less smooth than (2^19)*5.
Not to mention that I suspect the programming of each size has been
customized by George!

David

It's similar to the reason why tennis tournaments start with
64 or 128 players.[/quote]
I've tried to look up how FFT works, but it's pretty complicated. Over my head.
I've never noticed how the iteration times are with powers of two before, but now I see that. I guess variable sizes really would be worse.

Anyone know about forcing the self-test to see if 2048K can run the number?

davieddy 2008-01-14 12:51

[quote=Mini-Geek;122772]
I've tried to look up how FFT works, but it's pretty complicated. Over my head.
[/quote]

I was referring to "butterfly" diagrams which show where the
"Fast" enters into the Fast (Discrete) Fourier Transform. These
diagrams are what resemble the draw for a tennis tournament.
I agree that a full understanding of the subject cannot be gained
by "cursory study".

ewmayer 2008-01-14 16:37

[QUOTE=Mini-Geek;122760]Why are the FFT sizes only a few fixed sizes? Why not have them variable, only being as large as they need to to do the number without errors. As mentioned in The Happy Me(al) thread, I barely got one that was just outside of an FFT size to use the smaller one, and I have one that's on the lower end of its FFT size. Surely it doesn't need all 2560K FFT size precision to do something 39.78M? Even if 2048K is indeed too inaccurate for the number, there must be something in between those that will give sufficient accuracy and higher speed.[/QUOTE]

There is - 2304K. However, that [and analogously with all such ever-finer-gradations of length] needs something called a "radix-9 DFT" routine. [Factorize the FFT length and you'll see where that arises.] Similarly, the other 3 "in-between" lengths in going to 4096K need radices 11,13 and 15 respectively, in addition to the usual powers-of-2. By way of example, my Mlucas code supports all those, but the only one of them which consistently gives decent bang-for-the-coding buck is radix 9. Here's why:

1) The relative efficiency of an FFT is [roughly] related to the smoothness of the factorization of its length - thus power-of-2 radices are most efficient, and large-odd-prime radices are worst;

2) Larger radices lead to higher register pressure in-hardware, thus coding them to run efficiently becomes quite difficult.

The above is just a bare outline: feel free to do some homework on the DFT if you wish to understand this more fully. For example, a good advanced-topics question is: "For odd composite radices of similar size, why might one with no repeated factors be preferable to one with repeated factors? [E.g. radix 21 vs 25]."

fivemack 2008-01-14 17:16

[quote]
"For odd composite radices of similar size, why might one with no repeated factors be preferable to one with repeated factors? [E.g. radix 21 vs 25]."
[/quote]

Because radix-21 has 12 roots of unity and radix-25 has 20? (OK, six and ten once you've dealt with negation) - but on an x86-like platform I'd have thought you handled the roots of unity with load-execute operations, and wouldn't the radix-21 DFT be done as 3x7 anyway?

Is there some obvious reason that I'm missing why fast FFT code tends to be {small}*2^N rather than {small}*2^N*3^M? The three-way decimation-in-time stage isn't too hideous, the cube roots of unity are conveniently 1 and +/- omega.

Mini-Geek 2008-01-14 17:33

Thanks for all the explanations, I understand now that it takes a very smooth FFT size to be efficient.

I got this error on my 39.509M number and was wondering if it's serious or a software bug or anything, and if this significantly raises the chance that this will return an incorrect result (copied from results.txt, with the self-test checking if 2048K is acceptable):[code]Trying 1000 iterations for exponent 39509597 using 2048K FFT.
If average roundoff error is above 0.2425, then a larger FFT will be used.
Final average roundoff error is 0.24118, using 2048K FFT for exponent 39509597.
[Mon Jan 14 01:17:53 2008]
Iteration: 1420185/39509597, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
[Mon Jan 14 01:42:17 2008]
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.[/code]

ewmayer 2008-01-14 17:33

[QUOTE=fivemack;122824]Because radix-21 has 12 roots of unity and radix-25 has 20? (OK, six and ten once you've dealt with negation) - but on an x86-like platform I'd have thought you handled the roots of unity with load-execute operations, and wouldn't the radix-21 DFT be done as 3x7 anyway?[/quote]

"Why twiddle when you can permute?"

[quote]Is there some obvious reason that I'm missing why fast FFT code tends to be {small}*2^N rather than {small}*2^N*3^M? The three-way decimation-in-time stage isn't too hideous, the cube roots of unity are conveniently 1 and +/- omega.[/QUOTE]

Might be worth investigating, but I expect a highly optimized power-of-2 combined with a decent variety of odd leading radices gives something approaching an optimum in the performance/implementation-difficulty space. At large FFT lengths, one would have the option of perhaps-inefficient single leading radix [pay me now] followed by many really fast powers-of-two, or multiple powers of three, all slightly less efficient than their smoother power-of-2 brethren [pay me later].

[b]EDIT:[/b] By way of example, I can do a complex radix-8 DFT [sans any multipass twiddles] with just 52 [scalar] add and 4 mul.

A radix-9 DFT, on the other hand, needs 84 add and 20 mul. That is significantly more ops-per-input than the power of 2. OTOOH, if I follow that radix-9 with a large power of 2, the extra work done in doing the one radix-9 pass gets offset [and more] by the shorter vector length compared to [say] going to the next-larger power of 2, or even to 10*[power of 2]. The radix-10 DFT needs 88 add and 20 mul and thus is slightly better-per-input, but that work gets multiplied by [10/9] in comparing vs 9*[same power of 2] to reflect the overall larger vector length.

Of course in the real world there is much more to it than mere arithmetic opcount - cache misses and memory bandwidth are much more important at larger vector lengyhs - but all other things being equal [which the same large-power-of-2 ensures] using a shorter vector length will usually give better cache behavior, as well, since proportionally more of the data array can fit in cache.


All times are UTC. The time now is 04:42.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.