2014-05-11, 22:46  #1  
Tribal Bullet
Oct 2004
5×709 Posts 
Non-power-of-two FFTs
Quote:
When the transform length is a prime number N, we use Rader's algorithm: permute the inputs, compute a cyclic convolution of length N−1, add a DC term, and permute the outputs. Libraries like FFTW use Rader's algorithm recursively to compute arbitrary-size transforms in O(N*log(N)) time. The convolution can be computed via FFT, pointwise multiply, IFFT, but can usually be computed much more quickly using Winograd's algorithm for short convolutions. Winograd's algorithm for length-N convolutions is based on finding the polynomial factors of x^N − 1, reducing the input modulo each of these polynomials, doing a bunch of tiny convolutions, then combining the results via the polynomial Chinese remainder theorem. There are a bunch of neat algebraic transformations that combine to produce remarkably high-performance FFTs, but generating these transforms tends to be very painful. The prime factor algorithm can also be applied to transforms when N is a power of a prime. For powers of two this leads to very strange FFTs that have fewer multiplies but more additions than the standard Cooley-Tukey formulation. PM me if you need more details. 
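To make the Rader recipe above concrete, here is a small illustrative Python sketch (function names are mine, not from any library): it maps a prime-length DFT onto a length-(N−1) cyclic convolution, done schoolbook-style where a real implementation would use an FFT or a Winograd short convolution.

```python
import cmath

def primitive_root(p):
    # brute-force search for a generator of the multiplicative group mod p
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g

def rader_dft(x):
    # DFT of prime length N via a length-(N-1) cyclic convolution
    N = len(x)
    g = primitive_root(N)
    ginv = pow(g, -1, N)
    w = cmath.exp(-2j * cmath.pi / N)
    # permute inputs by powers of g; twiddles are indexed by powers of g^-1
    a = [x[pow(g, q, N)] for q in range(N - 1)]
    b = [w ** pow(ginv, q, N) for q in range(N - 1)]
    # cyclic convolution (schoolbook here; FFT or Winograd in practice)
    c = [sum(a[q] * b[(k - q) % (N - 1)] for q in range(N - 1))
         for k in range(N - 1)]
    # DC term, then permute the outputs back
    X = [sum(x)] + [0] * (N - 1)
    for k in range(N - 1):
        X[pow(ginv, k, N)] = x[0] + c[k]
    return X
```

The permutations by powers of g and g^{-1} are exactly the "permute the inputs / permute the outputs" steps; the x[0] added to every nonzero-frequency output is the DC term.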

2014-05-12, 03:26  #2 
∂^{2}ω=0
Sep 2002
República de California
2×11×13×41 Posts 
@JasonP:
With a view toward non-power-of-2 FFTs for F33-mod work - which must be very nearly a power of 2 due to roundoff error considerations - I recently implemented twiddleless (i.e. use input-and-output-index permutations to obviate the usual layer of twiddle-muls needed between non-coprime radices) DFTs of length 992 (31*32) and 1008 (63*16). The radix-31 DFT I adapted (mainly by drastically reducing the number of local variables needed, with a view toward a 16-FP-register ASM implementation) from one by van Buskirk. I will have more details later in the Pepin-test thread in Math, but a brief preview:

o The radix-992 transform has ~30% more floating-point adds and 10% more muls per point than does a similarly well-optimized radix-1024 transform. (That is actually surprisingly good, because radix-31 has fugly high opcounts - things only get reasonable when we twiddlelessly combine it with a similar-sized power-of-2 radix);

o But especially the add count is still higher than is desirable, so I alternatively considered a length-1008 = 63*16 DFT. We assemble the radix-63 via a twiddleless combination of 9*radix-7 + 7*radix-9. We do 16 of those radix-63s, then combine the results twiddlelessly with a set of 63 radix-16 DFTs, for a total opcount of only 9% more floating-point adds and 14% fewer muls than we need for radix-1024.

Note that for F33-modmul work, in order to meet the 1-MB-per-thread-or-less data-chunksize threshold which I noted in the Pepin thread makes a huge speed impact on x86, we will need even larger radices 31*128 and 63*64, but those are easily assembled from the same components put into place for the above proof-of-principle tests. 
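The twiddleless combination of coprime radices described above is, in essence, the Good-Thomas (prime-factor) index mapping. A minimal Python sketch of that mapping (illustrative only - naive O(n^2) sub-DFTs stand in for the optimized radix kernels, and all names are mine):

```python
import cmath

def dft(v):
    # naive DFT, a stand-in for an optimized small-radix kernel
    n = len(v)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(v[j] * w ** (j * k) for j in range(n)) for k in range(n)]

def good_thomas(x, n1, n2):
    # twiddleless N = n1*n2 DFT for coprime n1, n2: CRT-based
    # input/output index permutations replace the inter-radix twiddle layer
    N = n1 * n2
    # input map: element (a, b) of the grid is x[(a*n2 + b*n1) mod N]
    grid = [[x[(a * n2 + b * n1) % N] for b in range(n2)] for a in range(n1)]
    # n2 DFTs of length n1 down the columns
    cols = [dft([grid[a][b] for a in range(n1)]) for b in range(n2)]
    # n1 DFTs of length n2 across the rows -- no twiddles in between
    rows = [dft([cols[b][k1] for b in range(n2)]) for k1 in range(n1)]
    # output map via CRT: k = (k1*n2*inv(n2,n1) + k2*n1*inv(n1,n2)) mod N
    m1, m2 = pow(n1, -1, n2), pow(n2, -1, n1)
    X = [0j] * N
    for k1 in range(n1):
        for k2 in range(n2):
            X[(k1 * n2 * m2 + k2 * n1 * m1) % N] = rows[k1][k2]
    return X
```

The key identity is that with these index maps, n*k ≡ n1*k1*N2 + n2*k2*N1 (mod N), so the exponential splits into a pure length-n1 factor times a pure length-n2 factor and no inter-stage twiddles are needed.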
2014-05-12, 07:25  #3 
Tribal Bullet
Oct 2004
5·709 Posts 
Have you considered using the WFTA? If the component transforms can be expressed as (bunch of adds) * (1 multiply per point) * (bunch of adds), then e.g. for a transform of size 9*7 you can do

(adds for radix 7) * (adds for radix 9) * (multiplies for radix 7) * (multiplies for radix 9) * (adds for radix 9) * (adds for radix 7)

and the two stages of multiplies form the precomputed quantities for the new transform. This cuts the number of multiplies by 25-50% without increasing the add count, at the expense of having to write a different transform for every combination of radix sizes. The WFTA can be used for as many radixes in a row as desired, at the expense of lots of code. (Note that for radixes that are a power of two you have to use the Winograd version of the transform, not the Cooley-Tukey version, to get the flow graph of the transform into the correct form for using the WFTA.)

Edit: the knowledge of how to do all this would basically be lost if the DSP group at Rice University didn't post the dissertations of the group. This one is the only description I could find for deriving the transforms from first principles, which was necessary if you want to build number-theoretic FFTs using Winograd's algorithm. My sample code for that is here.

Last fiddled with by jasonp on 2014-05-12 at 07:33 
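As a structural illustration of the nesting (my own sketch, not jasonp's sample code), here is the toy case N = 3*2 in NumPy: each small transform is factored as (post-adds) * (diagonal of multiplies) * (pre-adds), the Kronecker product fuses the two diagonals into a single layer of pointwise multiplies, and Good-Thomas CRT permutations glue the coprime factors together. Radix 2 has a trivial diagonal, so the multiply savings only show up for larger pairs like 7*9, but the algebra is the same.

```python
import numpy as np

def naive_dft(N):
    w = np.exp(-2j * np.pi / N)
    return w ** np.outer(np.arange(N), np.arange(N))

# Winograd factorizations F = C @ D @ B (post-adds, diagonal muls, pre-adds)
B2 = np.array([[1, 1], [1, -1]], dtype=complex)
D2 = np.eye(2, dtype=complex)            # radix 2 needs no nontrivial muls
C2 = np.eye(2, dtype=complex)

c, s = np.cos(2 * np.pi / 3), np.sin(2 * np.pi / 3)
B3 = np.array([[1, 1, 1], [0, 1, 1], [0, 1, -1]], dtype=complex)
D3 = np.diag([1, c - 1, -1j * s])        # the only multiplies of radix 3
C3 = np.array([[1, 0, 0], [1, 1, 1], [1, 1, -1]], dtype=complex)

# nested (WFTA-style) length-6 transform: one merged diagonal of multiplies
N1, N2, N = 3, 2, 6
B = np.kron(B3, B2)
D = np.kron(D3, D2)                      # all multiplies fused into one stage
C = np.kron(C3, C2)

# Good-Thomas CRT permutations on input and output indices
Pin = np.zeros((N, N))
Pout = np.zeros((N, N))
for a in range(N1):
    for b in range(N2):
        Pin[a * N2 + b, (a * N2 + b * N1) % N] = 1
m1, m2 = pow(N1, -1, N2), pow(N2, -1, N1)
for k1 in range(N1):
    for k2 in range(N2):
        Pout[(k1 * N2 * m2 + k2 * N1 * m1) % N, k1 * N2 + k2] = 1

F6 = Pout @ C @ D @ B @ Pin              # equals the length-6 DFT matrix
```

Because D = kron(D3, D2) is diagonal, the multiply count of the nested transform is the product of the component multiply counts, while the add stages of both radices remain untouched - exactly the adds * muls * muls * adds arrangement described above.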
2014-05-12, 20:57  #4  
∂^{2}ω=0
Sep 2002
República de California
2DCE_{16} Posts 


2014-05-13, 01:15  #5 
Tribal Bullet
Oct 2004
5·709 Posts 
Can van Buskirk's algorithms work in a general finite field with a root of unity, or are they limited to complex floating point only?
Winograd's method provably needs the minimum number of multiplies, but the add count is sort of unbounded; the method relies on using Toom-Cook for the small cyclic convolutions, and the add count in Toom-Cook methods explodes once the polynomials being multiplied have more than about 3 terms. For a transform of length 11, x^10 − 1 = (x + 1)(x − 1)(x^4 + x^3 + x^2 + x + 1)(x^4 − x^3 + x^2 − x + 1), and cyclic convolutions modulo the last two polynomials each require a Toom-Cook method of size 4. That's slow enough that the size-13 transform is actually faster, because x^12 − 1 has only one of those degree-4 factors. In practice those large convolutions are handled by some other method like recursive Karatsuba, which has a few more multiplies than is optimal but many fewer additions. 
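Both factorizations can be checked mechanically - the factors are just the cyclotomic polynomials dividing x^N − 1. A quick pure-Python check (coefficient lists, lowest degree first; helper name is mine):

```python
from functools import reduce

def polymul(a, b):
    # multiply two polynomials given as coefficient lists (lowest degree first)
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# x^10 - 1 = (x - 1)(x + 1) * Phi_5 * Phi_10: two degree-4 factors
factors10 = [
    [-1, 1],                # x - 1
    [1, 1],                 # x + 1
    [1, 1, 1, 1, 1],        # Phi_5  = x^4 + x^3 + x^2 + x + 1
    [1, -1, 1, -1, 1],      # Phi_10 = x^4 - x^3 + x^2 - x + 1
]
p10 = reduce(polymul, factors10)

# x^12 - 1 has only one degree-4 factor, Phi_12 = x^4 - x^2 + 1
factors12 = [
    [-1, 1], [1, 1],        # x - 1, x + 1
    [1, 1, 1],              # Phi_3 = x^2 + x + 1
    [1, 0, 1],              # Phi_4 = x^2 + 1
    [1, -1, 1],             # Phi_6 = x^2 - x + 1
    [1, 0, -1, 0, 1],       # Phi_12 = x^4 - x^2 + 1
]
p12 = reduce(polymul, factors12)
```

Multiplying each list of factors back out reproduces x^10 − 1 and x^12 − 1 respectively, confirming that length 13 trades the second degree-4 convolution of length 11 for several cheap degree-2 ones.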
2014-05-13, 02:16  #6  
∂^{2}ω=0
Sep 2002
República de California
2·11·13·41 Posts 
Quote:
The obvious way to multiply a + bi by a constant cos θ + i sin θ is to compute a cos θ − b sin θ and a sin θ + b cos θ. A different approach is to factor cos θ + i sin θ as (1 + i tan θ) cos θ, or as (cot θ + i) sin θ. Multiplying by a real number cos θ is relatively easy, taking only 2 real operations. Multiplying by 1 + i tan θ is also relatively easy, taking only 4 real operations. This change does not make any immediate difference in operation count: either strategy takes 6 real operations, when appropriate constants such as tan θ have been precomputed. But the change allows some extra flexibility: the real multiplication can be moved elsewhere in the computation. Van Buskirk's clever observation is that these real multiplications can sometimes be combined!

Now back to what you wrote:
Quote:
Years ago I tackled both of those radices in order to gain some insight into that area of the DFT black arts :) ... For radix-11 I massaged things into a quartet of (real-arithmetic) length-5 cyclic subconvolutions, for a resulting opcount of 160 ADD, 44 MUL (or 168 ADD, 40 MUL in a low-MUL-count-but-higher-overall-opcount variant). I believe that matches the best known for any other approach.

For radix-13 I used the Winograd-style polynomial-factorization approach and CRT-based reconstruction of the various subproducts - the result was a MUL-lean algo (46) but the ADD count was a disappointingly high 214. Some further work to address the latter issue led to a Lo-ADD variant with 190 ADD and 64 MUL, which seemed a nice compromise solution. But van Buskirk's tangent approach gives 164 ADD, 64 MUL (14% fewer ADD and the same MUL count as my best), which beats all, as far as I can tell. The ADD total is just a smidge higher than for radix-11, but ~15% lower on a per-input normalized basis.

I also use that as the basis for my radix-13 SIMD implementation, although there, as I was at the time working in a 32-bit 8-SSE2-register paradigm, I modified JvB's radix-13 to target an 8-register compute model. That version increases the number of multiplies somewhat relative to the original temporary-heavy implementation (from 64 to 88), since it uses muls-by-2 to effect the in-place doubling needed by the very frequent radix-2 butterfly operation sequence which takes inputs x,y and outputs x+y and x−y:

x -= y
y *= 2
y += x

We can of course also use an ADD to effect the doubling, but in the context of an overall add-dominated surrounding-ops environment, mul-by-2 is preferable. 
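A quick numeric sanity check of the tangent factoring quoted above (values and names are illustrative): multiplying by (1 + i tan θ) costs 2 muls + 2 adds, and the deferred real scaling by cos θ restores the full rotation.

```python
import cmath
import math

theta = 0.3
z = complex(1.7, -0.6)             # a + bi
w = cmath.exp(1j * theta)          # cos θ + i sin θ

# direct complex multiply: 4 real muls + 2 real adds
direct = z * w

# tangent-factored form: (1 + i tan θ) applied with 2 muls + 2 adds;
# the real scaling by cos θ can be moved elsewhere in the computation
t = math.tan(theta)
scaled = complex(z.real - z.imag * t, z.imag + z.real * t)
factored = math.cos(theta) * scaled
```

Since (a + bi)(1 + i tan θ) = (a − b tan θ) + i(b + a tan θ), and cos θ · (1 + i tan θ) = cos θ + i sin θ, the two results agree to rounding error; the savings van Buskirk exploits come from merging several such deferred cos θ scalings into one.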

2014-05-13, 03:32  #7 
P90 years forever!
Aug 2002
Yeehaw, FL
2×7×563 Posts 
I've been meaning to ask this question in a separate thread, but since we're discussing FFT theory....
Is the current floating-point DWT the best approach in a memory-bandwidth-limited world? Let's define this simple measure of efficiency: Prime95 can store about 19 bits per 64-bit double, for a memory efficiency of 19/64, or 29.7%. Prime95 performs the forward and inverse FFT and carry propagation in two read-write passes over RAM, so if an alternative FFT requires more passes we'd need to reduce its efficiency rating accordingly. Sin/cos, weights, and other data only add about 5% to memory usage, so let's ignore that for now.

One idea: Use a Nussbaumer FFT on, say, 128-bit integers in the forward FFT. We could probably get ~112 bits of data into each integer. No multiplications are required until you get to the pointwise squaring. After the pointwise squaring, we'd need to do an inverse FFT on 256-bit integers. This gives us a memory efficiency of 112/256, or 43.75%. Unfortunately, I don't see how we can use Crandall's IBDWT trick, so cut that efficiency in half.

Another idea: Use an NTT. I'm not an expert on these. You can use Crandall's weightings. Can't you get close to 50% efficiency? Say you split the input number into 1024-bit chunks. The convolution output will be 2048 + several bits. So we can use 65 32-bit primes or 33 64-bit primes.

Comments? 
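For readers unfamiliar with NTTs, a minimal sketch of the squaring pipeline being discussed (naive O(N^2) transform, toy 9-bit prime, names mine): forward NTT, pointwise squaring, inverse NTT, all exactly in Z/pZ, so no roundoff error at all - the trade-offs George raises are about instruction counts and bits per word, not accuracy.

```python
def ntt(a, p, w):
    # naive O(N^2) number-theoretic transform (illustration only)
    N = len(a)
    return [sum(a[j] * pow(w, j * k, p) for j in range(N)) % p
            for k in range(N)]

def root_of_unity(N, p):
    # find w of exact multiplicative order N mod p (N a power of 2, N | p-1)
    assert (p - 1) % N == 0
    for c in range(2, p):
        w = pow(c, (p - 1) // N, p)
        if pow(w, N // 2, p) == p - 1:   # w^(N/2) = -1 means order exactly N
            return w

def cyclic_self_convolution(a, p):
    # forward NTT, pointwise squaring, inverse NTT -- exact in Z/pZ,
    # so the integer result is correct whenever coefficients stay below p
    N = len(a)
    w = root_of_unity(N, p)
    A = ntt(a, p, w)
    ninv = pow(N, -1, p)
    return [v * ninv % p for v in ntt([x * x % p for x in A], p, pow(w, -1, p))]
```

Zero-padding [1, 2, 3, 4] to length 8 and self-convolving modulo 257 yields the exact coefficients of (1 + 2x + 3x^2 + 4x^3)^2; a multi-prime scheme runs this once per prime and reconstructs via the CRT.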
2014-05-13, 04:23  #8 
∂^{2}ω=0
Sep 2002
República de California
2×11×13×41 Posts 
@George:
Cf. the several posts on this subject from way back when, starting with JasonP's post here. (Note my p^2−1 notation in the factorization-smoothness bit really should use q^2−1 as below, since p should be reserved for the exponent of the M-prime q = 2^p−1).

Just a quick extra data point for your candidates: If we use a hybrid of a floating-FFT and a fast Galois transform over GF(q^2) with q = 2^p−1 a Mersenne prime, we supplement each double-complex [128 bits] with a mod-complex of slightly more than 2p bits. Big advantage: IBDWT is both feasible and efficient (the needed fractional roots of 2 turn into simple shifts). Current SIMD hardware development points to p=31 as the best choice due to efficient MUL support. The hybrid transform allows use of ~p/2 extra bits per (real) input compared to the float-FFT alone, so p=31 adds 15.5 bits per word at a cost of just 32 memory bits, for a mem-efficiency of (19+15.5)/(64+32) ~= 36%. Using larger p's gives better percentages, but even a smallish p such as 61 runs into the lack of SIMD hardware MUL for ints of that size.

The one issue I had not yet solved when I last seriously looked at these hybrids (back then focusing on p = 61, which seemed the most promising candidate at that time): For non-power-of-2 FFT lengths, when we use the usual trick of packing 2 real inputs into a complex datum and doing a complex FFT, we know how to recover the real-data outputs from the result by combining output pairs with indices (j, N−j). For the non-power-of-2 FGT I seem to recall being unable to discern the needed index-pair patterns. One can't simply use a pure-real complex-transform test input and look for complex-conjugate pairs among the outputs, as one can do with a complex FFT. 
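A small sketch (my illustration, not Ernst's code) of the core operation such an FGT needs: "complex" multiplication in GF(q^2) = GF(q)[i] with q = 2^31 − 1. Since q ≡ 3 (mod 4), x^2 + 1 is irreducible over GF(q), and the Mersenne form makes each modular reduction just shifts and adds; the group order q^2 − 1 = 2^32 · (2^30 − 1) supplies power-of-2 roots of unity up to length 2^32.

```python
Q = (1 << 31) - 1          # q = 2^31 - 1, a Mersenne prime

def mod_q(x):
    # Mersenne reduction: 2^31 ≡ 1 (mod q), so fold the high bits down;
    # two folds plus one conditional subtract suffice for any x < 2^63
    x = (x >> 31) + (x & Q)
    x = (x >> 31) + (x & Q)
    return x - Q if x >= Q else x

def gf_mul(a, b):
    # multiply (ar + ai*i)(br + bi*i) in GF(q^2) = GF(q)[i];
    # adding Q*Q (a multiple of q) keeps the real part nonnegative
    ar, ai = a
    br, bi = b
    return (mod_q(ar * br - ai * bi + Q * Q),
            mod_q(ar * bi + ai * br))
```

In the hybrid scheme each signal element would carry such a (2x32-bit) residue pair alongside its double-complex value, with the FGT convolution exact modulo q and the float convolution supplying the rest of the information.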
2014-05-15, 01:29  #9 
Tribal Bullet
Oct 2004
3545_{10} Posts 
What a great thread that was...
According to the table here, a Winograd FFT of length 11 needs 40 multiplies and 168 adds, while a Winograd FFT of length 13 needs 40 multiplies and 188 adds.

The DWT is possible (i.e. convolutions don't need zero-padding) anytime there is an Nth root of two in the field, for convolutions of size N. My understanding is that Nussbaumer convolution always needs zero-padding, so it has twice the performance hurdle to overcome at the outset. The biggest problem with Nussbaumer is that you can only break the transform up into a small number of very large pieces, so doing the first symbolic FFT phase means lots of passes through memory.

Also, don't forget the overhead of using the CRT to reconstruct residues when using NTTs modulo a large number of primes; a naive CRT has complexity that's quadratic in the number of primes.

Last fiddled with by jasonp on 2014-05-15 at 02:39 
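The CRT reconstruction overhead mentioned above can be sketched as follows (toy primes and names mine); the per-value work grows quadratically in the number of primes because each of the k terms involves an integer of size ~M = p_1 ··· p_k.

```python
from functools import reduce

def crt(residues, primes):
    # naive Chinese-remainder reconstruction -- quadratic in the number
    # of primes, which is the overhead referred to above
    M = reduce(lambda a, b: a * b, primes)
    x = 0
    for r, p in zip(residues, primes):
        Mp = M // p
        x += r * Mp * pow(Mp, -1, p)   # r * (M/p) * ((M/p)^-1 mod p)
    return x % M

# a convolution coefficient computed modulo several small primes
primes = [97, 101, 103]
value = 123456
residues = [value % p for p in primes]
```

Each term vanishes modulo every prime except its own, so the sum is congruent to the right residue modulo each p and hence equals the original value once it is below the product M.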
2014-05-15, 03:22  #10  
∂^{2}ω=0
Sep 2002
República de California
2×11×13×41 Posts 
================= Getting back to the original topic of what is now a newly-split thread (thanks, Jason) - the first main thing I needed to work out for myself, back when I was first endeavoring to generalize my rudimentary power-of-2 FFT stuff to more general lengths, was "how does one generalize the concept of bit-reversal reordering of a sequence 0,1,2,...,N−1 to N not a power of 2?" Since this is actually a fun exercise any serious student of this stuff should work through, rather than give a bunch of literature citations and high-flown theory I'll just briefly illustrate the approach I use, which is in terms of simple sequence-element insertions.

First let's illustrate how this works for powers of 2. We start with the single index 0; at each step we double the sequence length by inserting a new index-element (j + N) after each existing index j:

N = 1: 0; insert 1 after the 0 to get
N = 2: 0,1; insert 2 and 3 to get
N = 4: 0,2,1,3, etc.

Non-powers of 2: Here things get more interesting, because for a given final length N the index ordering depends on the order in which we process the radices, specifically the "primitive radices", which are the prime factors of N. (In the actual FFT implementation we may use larger radices which are products of one or more of said prime factors.) We again start with the single index 0, but now if at a given step we are doing radix q, we multiply the sequence length by q by inserting (q−1) index-elements with values (j+N, j+2*N, ..., j+(q−1)*N) after each existing index j. For example, to end up with N = 6 we can process the primitive radices 2 and 3 in 2 different orders:

[A] Radix-2 followed by radix-3:
N = 1: 0; insert 1 after the 0 to get
N = 2: 0,1; insert the pairs (2,4) and (3,5) to get
N = 6: 0,2,4,1,3,5.

[B] Radix-3 followed by radix-2:
N = 1: 0; insert the pair (1,2) after the 0 to get
N = 3: 0,1,2; insert the singletons 3, 4 and 5 to get
N = 6: 0,3,1,4,2,5. 
Note that, depending on the details of one's FFT implementation, the order of radix processing in the generalized "bit reversal" permutations may be the opposite of that in the actual FFT. 
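The insertion procedure described above is mechanical enough to write down directly; a minimal Python sketch (function name mine):

```python
def generalized_bit_reversal(radices):
    # build the generalized "bit-reversal" permutation by successive
    # insertions: processing radix q grows the sequence from length N to
    # q*N by inserting (j+N, j+2N, ..., j+(q-1)N) after each index j
    seq = [0]
    for q in radices:
        N = len(seq)
        out = []
        for j in seq:
            out.append(j)
            out.extend(j + m * N for m in range(1, q))
        seq = out
    return seq
```

For radices [2, 3] and [3, 2] this reproduces the two length-6 orderings worked out above, and for [2, 2, 2] it reduces to the ordinary bit-reversal of 0..7.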

2014-05-25, 18:23  #11  
Sep 2006
The Netherlands
30E_{16} Posts 
I implemented an NTT. With the CRT you can store more bits than I did; I didn't explore the benefits of the CRT much. The real problem is always the number of instructions you need for each "multiplication" and the ever-present modulo. The total number of instructions per butterfly (hope I express that correctly) in an FFT is simply very small. That's why it is so tough to beat, provided you can solve the bandwidth problem.

Additionally, an NTT is only interesting if you have 64-bit integer math available. I haven't figured it out for the latest GPUs, but basically they have 32-bit integer math. A simple 32 x 32-bit multiplication producing the full 64-bit product requires 2 instructions, with a total throughput of 2 clocks on the GPU I have, if I remember well. A single double-precision multiplication also has a throughput of 2 clocks. So on every front an NTT is going to lose big to double-precision floating point on a GPU: the transform slows down by some factor to start with, and we still haven't factored in the overhead of the CRT, which isn't free either. On CPUs it would be easier to break even than on many-core hardware, but there is also a lack of instructions there in how I implemented it.

The NTT I implemented used 64-bit primes. On the CPU the problem is that there isn't a 64 x 64 = 128-bit multiplication in SSE2, and I assume not in AVX either (is there?). So you end up using the scalar integer multiplication, which multiplies one value at a time. You suddenly need 4 instructions to multiply a single AVX register's worth of data, versus just 1 double-precision instruction with a throughput of 1 clock on today's latest i7s. Another issue with using the integer multiplication in the normal registers is that you are non-stop converting to and from SIMD registers. I'm no expert on that, but I would guess it loses a lot of extra instructions as well; instructions you simply cannot afford to lose.

We didn't even discuss AVX2, and I guess some years from now every CPU will have AVX2. When dealing with 32-bit integers, as GMP does, you do not suffer from all those conversions. Yet even then the multiplication suddenly requires 2 instructions rather than 1, and with just 1 execution unit able to issue them. We can therefore easily show the overhead in instruction count is going to be at least a factor of 2; I would be amazed if it weren't a factor of 3 to 4. So an NTT has huge overhead from my limited viewpoint, though of course it does reduce the amount of bandwidth you need to the caches/RAM. Another under-appreciated aspect of NTTs is that you cannot pick a convenient number like 2^32 − 1 to toy with; you lose bits right from the start. I completely stopped my NTT research after I realized this sequential instruction overhead per limb. Maybe someone will figure something out, yet I don't see how it can beat a double-precision FFT, provided the math you're busy with allows a double-precision FFT.

Now GPUs potentially have a lot of extra bandwidth over a CPU. The big problems there are basically two, as far as I can see right now (though I haven't really experimented a lot yet). Problem 1: the GPU's bandwidth to device RAM is a fraction of its processing power; we are all aware of this problem. Problem 2: within a SIMD unit, life would be easy if we could criss-cross process data between the cores. The total amount of per-core register storage in each SIMD unit is pretty OK if you look at the total, but it is not shared across the cores within the unit. There is a small amount of shared L1 cache, yet that's pretty little; I will probably end up using 16 KB there. And for each of the FFT's log(n) steps - or rather, iterations - you obviously need different code, so I bet 48 KB of L1 for instructions is no luxury then.

So the bandwidth problem is there only because it's not obvious to me how to use the caches; there is room for improvement there, though. There is 16 KB of shared RAM that can get reused, and there is start data you can keep in the register files. So with a combined throughput-oriented approach, where we possibly run several exponents at the same time, maybe it's possible to use all resources at nearly the same time and get higher efficiency out of the GPU. The struggle to get the maximum speed out of the entire GPU for 1 exponent is an idea I gave up on a long time ago. The initial attempt would also be to study running 1 exponent on a single SIMD unit, so that whatever we just stored in both the register file and the L1 data cache can partly be reused for the next calculation. Even though that will help little in most cases, it's SOME sort of savings :) Alternatively, we store precalculated complex numbers in the L1 data cache: for the start of each sequence we just need 1 number, and we can self-calculate the next ones. It's possible, though, that this precalculated complex number ends up in the register file, as I'm guessing that if 32 cores read from the same address in the L1d at the same time, it'll jam and slow down (to be benchmarked). 
