 mersenneforum.org Non-power-of-two FFTs
 Register FAQ Search Today's Posts Mark Forums Read  2014-05-11, 22:46   #1
jasonp
Tribal Bullet

Oct 2004

5×709 Posts Non-power-of-two FFTs

Quote:
 Originally Posted by diep hi Ken, The biggest limitation currently is that i do not know how to formulate a transform that's not a power of 2. No clue how to get that to O(n log n)...
There is a very well developed theory for computing FFTs of arbitrary size; a whirlwind introduction with references is here, though a detailed explanation of how things work requires looking up papers and PhD dissertations from the 1980s.

When the transform length is a prime number N, we use Rader's algorithm: permute the inputs, compute a cyclic convolution of length N-1, add a DC term, and permute the outputs. Libraries like FFTW use Rader's algorithm recursively to compute arbitrary size transforms in O(N*log(N)) time.

The convolution can be computed via FFT-multiply-IFFT, but can usually be computed much more quickly using Winograd's algorithm for short convolutions. Winograd's algorithm for length N convolutions is based on finding the polynomial factors of (x^N)-1, reducing the input modulo each of these polynomials, doing a bunch of tiny convolutions, then combining the results via the polynomial Chinese remainder theorem. There are a bunch of neat algebraic transformations that combine to produce remarkably high-performance FFTs, but generating these transforms tends to be very painful.

The prime factor algorithm can also be applied to transforms when N is a power of a prime. For powers of two this leads to very strange FFTs that have fewer multiplies but more additions than the standard Cooley-Tukey formulation.

PM me if you need more details.   2014-05-12, 03:26 #2 ewmayer ∂2ω=0   Sep 2002 República de California 2×11×13×41 Posts @JasonP: With a view toward non-power-of-2 FFTs for F33-mod work - which must be very nearly a power of 2 due to roundoff error considerations - I recently implemented twiddleless (i.e. use input-and-output-index permutations to obviate the usual layer of twiddle-muls needed between noncoprime radices) DFTs of length 992 (31*32) and 1008 (63*16). The radix-31 DFT I adapted (mainly by drastically reducing the number of local variables needed, with a view toward a 16-FP-register ASM implementation) from one by van Buskirk. I will have more details later in the Pepin-test thread in Math, but brief preview: o The radix-992 transform has ~30% more floating-point adds and 10% more muls per point than does a similarly well-optimized radix-1024 transform. (That is actually surprisingly good because radix-31 has fugly high opcounts - things only get reasonable when we twiddlelessly-combine with a similar-sized power-of-2 radix); o But especially the add count is still higher than is desirable, so I alternatively considered a length-1008 = 63*16 DFT. We assemble the radix-63 via a twiddleless combination of 9*radix7 + 7*radix9. We do 16 of those radix-63s, then combine the results twiddlelessly with a set of 63 radix-16 DFTs, for a total opcount of only 9% more floating-point adds and 14% fewer muls than we need for radix-1024. Note that for F33-modmul work in order to meet the 1MB-per-thread-or-less data-chunksize threshold which I noted in the Pepin thread makes a huge speed impact on x86 we will need even larger radices 31*128 and 63*64, but those are easily assembled from the same components put into place for the above proff-of-principle tests.   2014-05-12, 07:25 #3 jasonp Tribal Bullet   Oct 2004 5·709 Posts Have you considered using the WFTA? If the component transforms can be expressed as (bunch of adds) * (1 multiply per point) * (bunch of adds), then for e.g. a transform of size 9*7 you can do (adds for radix 7)*(adds for radix 9)*(multiplies for radix 7)*(multiplies for radix 9)*(adds for radix 9)*(adds for radix 7) and the two stages of multiplies form the precomputed quantities for the new transform. This cuts the number of multiplies by 25-50% without increasing the add count, at the expense of having to write a different transform for every combination of radix sizes. The WFTA can be used for as many radixes in a row as desired at the expense of lots of code. (Note that for radixes that are a power of two you have to use the Winograd version of the transform, not the Cooley-Tukey version, to get the flow graph of the transform into the correct form for using the WFTA) Edit: the knowledge of how to do all this would basically be lost if the DSP group at Rice University didn't post the dissertations of the group. This one is the only description I could find for deriving the transforms from first principles, which was necessary if you want to build number theoretic FFTs using Winograd's algorithm. My sample code for that is here. Last fiddled with by jasonp on 2014-05-12 at 07:33   2014-05-12, 20:57   #4
ewmayer
2ω=0

Sep 2002
República de California

2DCE16 Posts Quote:
 Originally Posted by jasonp Have you considered using the WFTA? If the component transforms can be expressed as (bunch of adds) * (1 multiply per point) * (bunch of adds), then for e.g. a transform of size 9*7 you can do (adds for radix 7)*(adds for radix 9)*(multiplies for radix 7)*(multiplies for radix 9)*(adds for radix 9)*(adds for radix 7) and the two stages of multiplies form the precomputed quantities for the new transform. This cuts the number of multiplies by 25-50% without increasing the add count
So far so good, though note that mul count is usually not a performance limiter in practice;

Quote:
 , at the expense of having to write a different transform for every combination of radix sizes. The WFTA can be used for as many radixes in a row as desired at the expense of lots of code.
That's the killer - no reuse of previously-optimized short-length DFT code is possible.

Quote:
 (Note that for radixes that are a power of two you have to use the Winograd version of the transform, not the Cooley-Tukey version, to get the flow graph of the transform into the correct form for using the WFTA) Edit: the knowledge of how to do all this would basically be lost if the DSP group at Rice University didn't post the dissertations of the group. This one is the only description I could find for deriving the transforms from first principles, which was necessary if you want to build number theoretic FFTs using Winograd's algorithm. My sample code for that is here.
I would be interested in exploring the differences between the Winograd and van Buskirk approaches - AFAIK the latter generally beats WFTA in terms of opcount, often by an appreciable margin. E.g. my original radix-13 DFT code was based on Winograd, several years ago I came across VB's implementation, which is what I now use.   2014-05-13, 01:15 #5 jasonp Tribal Bullet   Oct 2004 5·709 Posts Can van Buskirk's algorithms work in a general finite field with a root of unity, or is it limited to complex floating point only? Winograd's method provably needs the minimum number of multiplies but the add count is sort of unbounded; the method relies on using Toom-Cook for the small cyclic convolutions, and the add count in Toom-Cook methods explodes once the polynomials being multiplied have more than about 3 terms. For a transform of length 11, x^10-1 = (x+1)(x-1)(x^4+x^3+x^2+x+1)(x^4-x^3+x^2-x+1), and cyclic convolutions modulo the last two polynomials each require a Toom-Cook method of size 4. That's slow enough that the size 13 transform is actually faster because x^12-1 has only one of those degree-4 factors. In practice those large convolutions are handled by some other method like recursive Karatsuba, that has a few more multiplies than is optimal but many fewer additions.   2014-05-13, 02:16   #6
ewmayer
2ω=0

Sep 2002
República de California

2·11·13·41 Posts Quote:
 Originally Posted by jasonp Can van Buskirk's algorithms work in a general finite field with a root of unity, or is it limited to complex floating point only?
Dan Bernstein's 2007 PDF on the van Buskirk "tangent FFT" (that naming for this class of methods appears to be DJB's - he also has a few real zingers of footnotes and asides) approach casts things in terms of general polynomial arithmetic but mentions only complex arithmetic and the "tangent" language appears to specifically invoke factorization over C, e.g.:

The obvious way to multiply a + bi by a constant cos θ + i sin θ is to compute a cos θ − b sin θ and a sin θ + b cos θ. A different approach is to factor cos θ+i sin θ as (1+i tan θ) cos θ, or as (cot θ+i) sin θ. Multiplying by a real number cos θ is relatively easy, taking only 2 real operations. Multiplying by 1 + i tan θ is also relatively easy, taking only 4 real operations.
This change does not make any immediate difference in operation count: either strategy takes 6 real operations, when appropriate constants such as tan θ have been precomputed. But the change allows some extra flexibility: the real multiplication can be moved elsewhere in the computation. Van Buskirk’s clever observation is that these real multiplications can sometimes be combined!

Now back to what you wrote:
Quote:
 Winograd's method provably needs the minimum number of multiplies but the add count is sort of unbounded; the method relies on using Toom-Cook for the small cyclic convolutions, and the add count in Toom-Cook methods explodes once the polynomials being multiplied have more than about 3 terms. For a transform of length 11, x^10-1 = (x+1)(x-1)(x^4+x^3+x^2+x+1)(x^4-x^3+x^2-x+1), and cyclic convolutions modulo the last two polynomials each require a Toom-Cook method of size 4. That's slow enough that the size 13 transform is actually faster because x^12-1 has only one of those degree-4 factors. In practice those large convolutions are handled by some other method like recursive Karatsuba, that has a few more multiplies than is optimal but many fewer additions.
Remind me - for complex inputs, what are the real-arithmetic add and mul counts for Winograd-style radix-11 and 13?

Years ago I tackled both of those radices in order to gain some insight into that area of the DFT black arts :) ... for radix-11 I massaged things into a quartet of (real-arithmetic) length-5 cyclic subconvolutions, for a resulting opcount of 160 ADD, 44 MUL (or 168 ADD, 40 MUL in a low-MUL-count-but-higher-overall-opcount variant). I believe that matches the best-known for any other approach.

For radix-13 I used the Winograd-style polynomial factorization approach and CRT-based reconstruction of the various subproducts - result was a MUL-lean algo (46) but the ADD count was a disappointingly high 214. Some further work to address the latter issue led to a Lo-ADD variant with 190 ADD and 64 MUL, which seemed a nice compromise solution.

But van Buskirk's tangent approach gives 164 ADD, 64 MUL - 14% fewer ADD and same MUL count as my best - which beats all, as far as I can tell. The ADD total is just a smidge higher than for radix-11, but ~15% lower on a per-input normalized basis. I also use that as the basis for my radix-13 SIMD implementation, although there, as I was at the time working in a 32-bit 8-sse2-register paradigm, I modified JvB's radix-13 to target an 8-register compute model. That version increases the number of multiplies somewhat relative to the original temporary-heavy implementation (from 64 to 88) since it uses muls-by-2 to effect the in-place doubling needed by the very frequent radix-2 butterfly operation sequence which
takes inputs x,y and outputs x+y and x-y:

x -= y
y *= 2
y += x

We can of course also use an ADD to effect the doubling, but in the context of an overall add-dominated surrounding-ops environment, mul by 2 is preferable.   2014-05-13, 03:32 #7 Prime95 P90 years forever!   Aug 2002 Yeehaw, FL 2×7×563 Posts I've been meaning to ask this question in a separate thread, but since we're discussing FFT theory.... Is the current floating point DWT the best approach in a memory bandwidth limited world? Let's define this simple measure for efficiency: Prime95 can store about 19 bits per 64-bit double for a memory efficiency of 19/64, or 29.7%. Prime95 performs the forward and inverse FFT and carry propagation in two read-write passes of RAM memory, so if an alternative FFT requires more passes we'd need to reduce its efficiency rating accordingly. Sin/cos, weights, and other data only add about 5% to memory usage, so let's ignore that for now. One idea: Use a Nussbaumer FFT on say 128 bit integers in the forward FFT. We could probably get ~112 bits of data into each integer. No multiplications are required until you get to the point-wise squaring. After the point-wise squaring, we'd need to do an inverse FFT on 256 bit integers. This gives us a memory efficiency of 112/256 or 43.75%. Unfortunately, I don't see how we can use Crandall's IBDWT trick. So cut that efficiency in half. Another idea: Use an NTT. I'm not an expert on these. You can use Crandall's weightings. Can't you get close to 50% efficiency? Say you split the input number into 1024-bit chunks. The convolution output will be 2048 + several bits. So we can use 65 32-bit primes or 33 64-bit primes. Comments?   2014-05-13, 04:23 #8 ewmayer ∂2ω=0   Sep 2002 República de California 2×11×13×41 Posts @George: Cf. the several posts on this subject from waybackwhen starting with JasonP's post here (Note my p^2-1 notation in the factorization-smoothness bit really should use q^2-1 as below, since p should be reserved for the exponent M-prime q = 2^p-1). Just a quick extra data point to your candidates: If we use a hybrid of a floating-FFT and fast Galois transform over GF(q^2) with q = 2^p-1 a Mersenne prime, we supplement each double-complex [128 bits] with a mod-complex of slightly more than 2p bits. Big advantage: IBDWT is both feasible and efficient (the needed fractional roots of 2 turn into simple shifts). Current SIMD hardware development points to p=31 as the best choice due to efficient MUL support. The hybrid transform allows use of ~ p/2 extra bits per (real) input than float-FFT alone, so p=31 adds 15.5 bits per word at a cost of just 32 memory bits, for a mem-efficiency of (19+15.5)/(64+32) ~= 36%. Using a larger p's gives better %es, but even a smallish p such as 61 runs into the lack of SIMD hardware MUL for ints that size. The one issue I had not yet solved when I last seriously looked at these hybrids (back then focusing on p = 61, which seemed the most promising candidate at that time): For non-power-of-2 FFT lengths, when we use the usual trick of packing 2 real inputs into a complex datum and doing a complex FFT, we know how to recover the real-data outputs from the result by combining output pairs with indices (j, N-j). For the non-power-of-2 FGT I seem to recall being unable to discern the needed index-pair patterns. One can't simply use a pure-real complex-transform test input and look for complex conjugate pairs among the outputs as one can do with a complex FFT.   2014-05-15, 01:29 #9 jasonp Tribal Bullet   Oct 2004 354510 Posts What a great thread that was... According to the table here, a Winograd FFT of length 11 needs 40 multiplies and 168 adds, while a Winograd FFT of length 13 needs 40 multiplies and 188 adds. The DWT is possible (i.e. convolutions don't need zero-padding) anytime there is an Nth root of two in the field, for convolutions of size N. My understanding is that Nussbaumer convolution always needs zero padding, so it has twice the performance hurdle to overcome at the outset. The biggest problem with Nussbaumer is that you can only break the transform up into a small number of very large pieces, so that doing the first symbolic FFT phase means lots of passes through memory. Also, don't forget the overhead of using the CRT to reconstruct residues when using NTTs modulo a large number of primes; a naive CRT has complexity that's quadratic in the number of primes. Last fiddled with by jasonp on 2014-05-15 at 02:39   2014-05-15, 03:22   #10
ewmayer
2ω=0

Sep 2002
República de California

2×11×13×41 Posts Quote:
 Originally Posted by jasonp According to the table here, a Winograd FFT of length 11 needs 40 multiplies and 168 adds, while a Winograd FFT of length 13 needs 40 multiplies and 188 adds.
Thanks - so my radix-11 appears to be optimal (in the "best-known" sense, not "provable best") ... Winograd's radix-13 beats mine, but no surprise there, and van Buskirk beats Winograd, so my current radix-13 implementation is again the best-known-algorithm.

=================

Getting back to the original topic of what is now a newly-split thread (thanks, Jason) - the first main thing I needed to work out for myself back when I was first endeavoring to generalize my rudimentary power-of-2 FFT stuff to more general lengths was "how to generalize the concept of bit-reversal reordering of a sequence 0,1,2,...N-1 to N not a power of 2?" Since this is actually a fun exercise any serious student of this stuff should work through, rather than give a bunch of literature citations and high-flown theory I'll just briefly illustrate the approach I use, which is in terms of simple sequence-element insertions. First let's illustrate how this works for

Powers-of-2: We start with the single index 0; at each step doubling the sequence length by inserting a new index-element (j + N) after each existing index j:

N = 1: 0, insert 1 after the 0 to get
N = 2: 0,1, insert 2 and 3 to get
N = 4: 0,2,1,3, etc.

Non-powers-of-2: Here things get more interesting because for given final length N the index ordering depends on the order in which we process the radices, specifically the "primitive radices" which are the prime factors of N. (In the actual FFT implementation we may use larger radices which are products of one or more of said prime factors). We again start with the single index 0, but now if at a given step we are doing radix q, we multiply the sequence length by q by inserting (q-1) index-elements with values (j+N,j+2*N,...,j+(q-1)*N) after each existing index j. For example to end up with N = 6 we can process the primitive radices 2 and 3 in 2 different orders:

N = 1: 0, insert 1 after the 0 to get
N = 2: 0,1, insert pairs (2,4) and (3,5) to get
N = 6: 0,2,4,1,3,6.

N = 1: 0, insert pair (1,2) after the 0 to get
N = 3: 0,1,2, insert singletons 3,4 and 5 to get
N = 6: 0,3,1,4,2,5.

Note that, depending on the details of one's FFT implementation, the order of radix processing in the generalized "bit reversal" permutations may be the opposite of that in the actual FFT.   2014-05-25, 18:23   #11
diep

Sep 2006
The Netherlands

30E16 Posts Quote:
 Originally Posted by Prime95 I've been meaning to ask this question in a separate thread, but since we're discussing FFT theory.... Is the current floating point DWT the best approach in a memory bandwidth limited world? Let's define this simple measure for efficiency: Prime95 can store about 19 bits per 64-bit double for a memory efficiency of 19/64, or 29.7%. Prime95 performs the forward and inverse FFT and carry propagation in two read-write passes of RAM memory, so if an alternative FFT requires more passes we'd need to reduce its efficiency rating accordingly. Sin/cos, weights, and other data only add about 5% to memory usage, so let's ignore that for now. One idea: Use a Nussbaumer FFT on say 128 bit integers in the forward FFT. We could probably get ~112 bits of data into each integer. No multiplications are required until you get to the point-wise squaring. After the point-wise squaring, we'd need to do an inverse FFT on 256 bit integers. This gives us a memory efficiency of 112/256 or 43.75%. Unfortunately, I don't see how we can use Crandall's IBDWT trick. So cut that efficiency in half. Another idea: Use an NTT. I'm not an expert on these. You can use Crandall's weightings. Can't you get close to 50% efficiency? Say you split the input number into 1024-bit chunks. The convolution output will be 2048 + several bits. So we can use 65 32-bit primes or 33 64-bit primes. Comments?
Oh la la i missed a bunch of replies to the thread - sorry for answerring so late.

I implemented a NTT.

Now with CRT you can store more bits than i did do. I didn't explore the benefits of CRT too much.

The real problem is always the number of instructions you need for each "multiplication" and the nonstop existance of a modulo.

Total number of instructions for each butterfly (hope i express it correctly) in FFT is simply very little. That's why it is so tough to beat, provided you can solve the bandwidth problem.

Additionally to this, a NTT is only interesting if you have the availability of 64 bits integer math. I didn't figure it out for the latest GPU's, yet basically they're having 32 bits integer math.

A simple multiplication of 32 x 32 bits, to get the full 64 bits output requires 2 instructions and has a throughput of in total 2 clocks at the GPU i have, if i remember well.

A single double precision multiplication also uses 2 clocks throughput.

So at every front a NTT is going to lose it major league from double precision floating point at GPU. The transform simply slows down some factors to start with and we still didn't factor in the overhead of the CRT, which isn't for free either.

At CPU's it would be easier to break even than at manycore hardware for sure yet there is also a lack of instructions there in how i implemented it.

The NTT i implemented used 64 bits primes.

At the CPU the problem is that there isn't a 64x64 == 128 bits multiplication in SSE2 and i assume not in AVX either (is it?). So you end up using the integer multiplication for it which is multiplying 1 at a time. You suddenly need to execute 4 instructions to multiply a single AVX register versus double precision just 1 instructiion having throughput 1 clock at nowadays latest i7's.

Another issue from using the integer multiplication in the normal register is that would mean you are nonstop converting from and to SIMD registers. I'm no expert on that, but i would guess that loses a lot of extra instructions as well.

Instructions you cannot afford to lose simply.

We didn't discuss AVX2 even and i guess some years from now every cpu will have AVX2.

When dealing with 32 bits integers, like GMP is doing, you do not suffer from all those conversions. Yet even then the multiplication suddenly requires 2 instructions rather than 1. This where you just have 1 execution unit that can do those multiplications.

We can therefore prove easily the overhead in number of instructions is going to be at least factor 2. Yet i would be amazed if it wasn't factor 3 to 4.

So NTT has some huge overhead from my limited viewpoint, yet of course it does reduce the amount of bandwidth you need to the caches/RAM.

Another underlighted aspect of NTT is that you cannot pick a random number like 2^32 - 1 to toy with. You'll lose bits right from the start.

I completely stopped my NTT research after i realized this sequential instruction overhead a limb you need. Maybe someone will figure something out - yet i don't see how it can beat a double precision FFT, provided the math you're busy with allows a double precision FFT.

Now GPU's have potentially a lot of extra bandwidth over a CPU. The big problem there is basically 2 things as far as i can see right now (yet i didn't really experiment a lot yet).

Problem 1: the gpu's bandwidth to device RAM is a fraction of the processing power.

We are all aware of this problem.

Problem 2: within a SIMD life would be easy if we could crisscross proces data between the cores. In itself the total amount of cache the SIMD has for each cores local register is pretty ok, if you look to the total amount. That's not shared across the cores within each SIMD however.

Now there is a small tad of shared L1 cache yet that's pretty little. Probably will end up using 16KB there as obviously for each of the FFT's "log n" number of steps, you need different code obviously.

Every step, or i would rather want to call it iteration, they all need to get their own codebase.

I bet 48KB of L1 for instructions is no luxury then.

So the bandwidth problem is there only because it's not trivial to me how to use the caches. There is room for improvement there though.

There is 16 KB of shared RAM that can get reused and there is start data you can start with from the register files.

So a combined approach based upon throughput, where we run possibly several exponents at the same time, maybe it's possible to use all resources at nearly the same time, and get a higher efficiency out of the GPU.

the struggle for getting the maximum speed out of the gpu of 1 exponent for the entire GPU - i've given up that idea long time ago.

The initial attempt would be studying as well running 1 exponent at a single SIMD. So that whatever we just had stored in both the register file as well as the L1 data, that we can reuse part of this for the next calculation. Even though that will help little in most cases, it's SOME sort of savings :)

Alternatively of course we store in the L1 datacache precalculated complex numbers. For the start of each seqence we just need to start with 1 number and of course can selfcalculate the next one.

It's possible though this precalculated complex number ends up in the register file, as i'm guessing that if 32 cores read from the same address in the L1d at the same time, that it'll jam and slowdown (to be benchmarked).   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post jasonp GPU Computing 33 2018-12-30 13:32 scrawlings Information & Answers 39 2014-08-02 21:48 KarateF22 PrimeNet 16 2013-10-28 00:34 nwb Information & Answers 2 2011-07-08 16:04 dave_dm Factoring 9 2004-09-04 11:47

All times are UTC. The time now is 15:23.

Wed May 18 15:23:58 UTC 2022 up 34 days, 13:25, 1 user, load averages: 1.69, 1.80, 1.76