View Single Post
Old 2009-03-02, 07:26   #3
__HRB__
 
__HRB__'s Avatar
 
Dec 2008
Boycotting the Soapbox

24×32×5 Posts
Default

Quote:
Originally Posted by jasonp View Post
I think for Fermat numbers your idea is essential for avoiding a doubling (at least) of the runtime. The only tricky part is that the CRT should be applied after every convolution is completed, before moving back up out of each recursion level, instead of once at the end. Ideally you don't want the fact that you split off an FGT transform to be known outside each convolution. (Your last sentence may have said that, but I'm not sure).
What I meant with the last sentence was that if 2^n+1 is so small that SSA doesn't dramatically reduce the problem-size anymore, we could stop recursing and use an FGT to do the point-wise multiply of the SSA-values with a negacyclic convolution + carrying.

But the main idea is the same as for 'regular' NTTs: do two transforms modulo two different primes, use the CRT, then do the carrying. The advantage using FGT+SSA, is that (hopefully) the CRT is simple because the result is mod (2^n+1)*(2^k-1), which can be done with adding and shifting.

Of course in this case 2^n+1 isn't a prime but the CRT only requires the values to be coprime. But, as usual I'm handwaving a lot, mixing several things together, hoping to trigger the right neurons in someone else's brain.

Here are some of the other things I've considered:

Ideally, we'd like to do everything in SSA modulo 2^(wordsize*n)+1, because then we don't need any multiprecision shifts at all: if we're shifting multiples of the wordsize, we can do that by loading/storing [(pos+shift)%n] which is known at compile-time.

If the wordsize in the SSA-part is 61 bits, setting up the FGT-part mod 2^61-1 before doing the two transforms would be fairly fast - we can do 7-8 adds before we need to do a reduction (mov, shift, or, add). I think doing the CRT after both transforms might then also be easy. At least easier than with 2^31-1.

As far as I can tell, one question is whether doing the extra work with an FGT mod 2^61-1 with simple setup & CRT will hurt less than using an FGT mod 2^31-1, with a difficult setup & CRT. Back-of-envelope calculation tells me that mod 2^61-1 will take 4-8x longer because multiplies take 4x longer and we're operating only on half the number of elements. If it were not for the setup & CRT, one could even consider using 2^13-1 or even 2^7-1 doing 8 resp. 16 FGT's in parallel.

Just to make things clear, the FGT is short and the initial elements have no padding, since we're only interested in the individual elements of the negacyclic convolution. Final carrying wouldn't be meaningful without combining pasting the two results together via CRT

A 61-bit SSA wordsize would still allow us to do three levels of radix-2 (operating on wordsize slices of 8 values) so that L2->L1 bandwidth isn't a bottleneck.

On the other hand, maybe the savings of eliminating multiprecision shifts don't compensate for the larger modulus size on the next level of recursion. This is a tough decision, because I think there is no way of avoiding having to multiprecision-shift every value at least once during every butterfly. The only savings are that this can probably be combined with the carrying, but I still think that this would increase runtime by 50%-100%.

The GMP-docs mention that the sqrt(2)-trick (=transform*2) gave a 10% speedup, so extending the transform by 2^6 in the top level could really be worth it. Of course one could use a wordsize of 60 bits, to extend the transform by 4. Or 56 bits and extend the transform by 8, with only a few roots requiring shifts, but then we lose the fast CRT. If we use 62 bits, we can do mod 2^31-1 a little faster, can extend the transform by two, but can only use a radix-4 butterflies, so we'd have to do more carrying. Computing carries for 2 values is as expensive as doing a whole butterfly, so this really hurts.

Every which way has its drawbacks. Any ideas to help me make up my mind?

Last fiddled with by __HRB__ on 2009-03-02 at 07:54
__HRB__ is offline   Reply With Quote