20090301, 20:03  #1 
Dec 2008
Boycotting the Soapbox
2^{4}×3^{2}×5 Posts 
Augment SchoenhageStrassen with an FGT?
Maybe I still don't grok how SSA works, but figuring out the right k to split a number into 2^k parts, is a real bastard.
It would be a lot easier, if one could simply split a 2^(2*k1)bit number N into 2^k parts with 2^k bits each (half the bits are zero to hold the result of the product), and do a transform of length 2^k. Unfortunately the individual 2^k parts are just a little too small, as the results of the convolution require k + 2^k bits to avoid overflow. Idea: To fix this shortcomming I'm thinking about using a (small) FGT modulo the smallest 2^p1 > 2^k, and reconstructing the result modulo (2^p1)*(2^(2^k)+1) with the CRT. For example, if we're multiplying a 2^31 bit number (~600M decimals), then k==16, so we're doing the transform modulo 2^16+1 ~ 64Kbit which is comparatively small, fits into L1 at least twice and into L2 several times more. According to http://www.loria.fr/~gaudry/publis/issac07.pdf the optimal values in the MUL_FFT_TABLE are several orders of magnitude larger. The runtime of an FGT (mod 2^311) of length 2^16, should be proportional to k/(2^k) and far below 1% of the total time, computing the inputs (mod 2^311) requires only 1 extra instruction/word (+overhead). I believe doing the CRT will be the most expensive operation, but if the CRT can be done as fast as doing 1 butterfly, the time for doing this could be around 1/(2*k) ~ 3%. As soon as we have recursed deep enough (using the sqrt(2)trick when 2*k1 isn't a whole number), we can use the FGT to compute the entire negacyclic convolution, entirely removing any Toom3 or Karatsuba specialization. Comments? 
20090302, 03:44  #2 
Tribal Bullet
Oct 2004
DDB_{16} Posts 
I think for Fermat numbers your idea is essential for avoiding a doubling (at least) of the runtime. The only tricky part is that the CRT should be applied after every convolution is completed, before moving back up out of each recursion level, instead of once at the end. Ideally you don't want the fact that you split off an FGT transform to be known outside each convolution. (Your last sentence may have said that, but I'm not sure).

20090302, 07:26  #3  
Dec 2008
Boycotting the Soapbox
1320_{8} Posts 
Quote:
But the main idea is the same as for 'regular' NTTs: do two transforms modulo two different primes, use the CRT, then do the carrying. The advantage using FGT+SSA, is that (hopefully) the CRT is simple because the result is mod (2^n+1)*(2^k1), which can be done with adding and shifting. Of course in this case 2^n+1 isn't a prime but the CRT only requires the values to be coprime. But, as usual I'm handwaving a lot, mixing several things together, hoping to trigger the right neurons in someone else's brain. Here are some of the other things I've considered: Ideally, we'd like to do everything in SSA modulo 2^(wordsize*n)+1, because then we don't need any multiprecision shifts at all: if we're shifting multiples of the wordsize, we can do that by loading/storing [(pos+shift)%n] which is known at compiletime. If the wordsize in the SSApart is 61 bits, setting up the FGTpart mod 2^611 before doing the two transforms would be fairly fast  we can do 78 adds before we need to do a reduction (mov, shift, or, add). I think doing the CRT after both transforms might then also be easy. At least easier than with 2^311. As far as I can tell, one question is whether doing the extra work with an FGT mod 2^611 with simple setup & CRT will hurt less than using an FGT mod 2^311, with a difficult setup & CRT. Backofenvelope calculation tells me that mod 2^611 will take 48x longer because multiplies take 4x longer and we're operating only on half the number of elements. If it were not for the setup & CRT, one could even consider using 2^131 or even 2^71 doing 8 resp. 16 FGT's in parallel. Just to make things clear, the FGT is short and the initial elements have no padding, since we're only interested in the individual elements of the negacyclic convolution. Final carrying wouldn't be meaningful without combining pasting the two results together via CRT A 61bit SSA wordsize would still allow us to do three levels of radix2 (operating on wordsize slices of 8 values) so that L2>L1 bandwidth isn't a bottleneck. On the other hand, maybe the savings of eliminating multiprecision shifts don't compensate for the larger modulus size on the next level of recursion. This is a tough decision, because I think there is no way of avoiding having to multiprecisionshift every value at least once during every butterfly. The only savings are that this can probably be combined with the carrying, but I still think that this would increase runtime by 50%100%. The GMPdocs mention that the sqrt(2)trick (=transform*2) gave a 10% speedup, so extending the transform by 2^6 in the top level could really be worth it. Of course one could use a wordsize of 60 bits, to extend the transform by 4. Or 56 bits and extend the transform by 8, with only a few roots requiring shifts, but then we lose the fast CRT. If we use 62 bits, we can do mod 2^311 a little faster, can extend the transform by two, but can only use a radix4 butterflies, so we'd have to do more carrying. Computing carries for 2 values is as expensive as doing a whole butterfly, so this really hurts. Every which way has its drawbacks. Any ideas to help me make up my mind? Last fiddled with by __HRB__ on 20090302 at 07:54 

20090302, 14:07  #4 
"Nancy"
Aug 2002
Alexandria
2467_{10} Posts 
IIUC, you want to map the product in Z/(2^{[I]N[/I]}+1)Z to Z[x]/(x^{[I]l[/I]}+1) and then not to (Z/(2^{[I]n[/I]}+1)Z)[x]/(x^{l}+1) with n ≥ 2N/l+log_{2}(l), but to (R(Z/(2^{[I]n[/I]}+1)Z))[x]/(x^{[I]l[/I]}1), where R is some other ring that supports a lengthl transform, is large enough that the condition n ≥ 2N/l suffices and the correct product in Z[x]/(x^{[I]l[/I]}+1) can be reconstructed quickly by a CRT from separate convolutions in R[x]/(x^{[I]l[/I]}+1) and (Z/(2^{[I]n[/I]}+1)Z)[x]/(x^{[I]l[/I]}+1). In your FGT suggestion, R = GF(M_{[I]p[/I]}^{2}) for some Mersenne prime M_{[I]p[/I]}. Sounds perfectly feasible.
In fact, in a manuscript of the paper we mentioned an idea Robert Harley told us, but we had to remove it for the final paper due to the page limit. If I understood it correctly, he suggested using e.g. R=Z/2^{32}Z and doing some convolution algorithm (grammarschool, ToomCook, whatever), but of course there's no reason why we shouldn't use a ring that supports an FFT of the required length. According to Bernstein's "Multidigit Multiplication for Mathematicians" page 11, Karp came up with a similar idea, though I don't quite understand his description so I'm not sure he's talking about the same thing... So basically, yeah, I think your idea should work. Btw, if you do a 4step transform of length ≥2^{12}, actual shifts only occur in the "middle step" of applying twiddle factors between the row and columntransforms, and since the twiddle factor is ω^{ij} for matrix entry i,j, you often get an "even enough" exponent that you don't have to do an actual shift again  actual bitshifts are relatively rare in long SchönhageStrassen transforms. Alex Last fiddled with by akruppa on 20090302 at 14:13 
20090302, 17:34  #5  
Dec 2008
Boycotting the Soapbox
2^{4}×3^{2}×5 Posts 
Quote:
So, we simply take the bottom k bits, do any convolution, and use the bottom k bits of this intermediate result to CRT the final result? Great! This eliminates at least one pass over the data. Supposedly doing the CRT is easier, too. I think it actually might boil down to fiddling around with the top & bottom k bits. Quote:
I'll have to experiment a little more with this. Maybe I can cook up a poweroftwo solution, that doesn't hurt as much as mutiprecision shifting. It would be nice to be able to do everything with SSE2, which lacks explicit zeroextending loads. Last fiddled with by __HRB__ on 20090302 at 17:35 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
SchönhageStrassen Algorithm Paper  flouran  Math  8  20090523 23:52 