Quote:
Originally Posted by jasonp
I think for Fermat numbers your idea is essential for avoiding a doubling (at least) of the runtime. The only tricky part is that the CRT should be applied after every convolution is completed, before moving back up out of each recursion level, instead of once at the end. Ideally you don't want the fact that you split off an FGT transform to be known outside each convolution. (Your last sentence may have said that, but I'm not sure).

What I meant with the last sentence was that if 2^n+1 is so small that SSA doesn't dramatically reduce the problem size anymore, we could stop recursing and use an FGT to do the pointwise multiply of the SSA values with a negacyclic convolution + carrying.
But the main idea is the same as for 'regular' NTTs: do two transforms modulo two different primes, use the CRT, then do the carrying. The advantage of using FGT+SSA is that (hopefully) the CRT is simple, because the result is mod (2^n+1)*(2^k-1), which can be computed with adding and shifting.
Of course in this case 2^n+1 isn't a prime, but the CRT only requires the two moduli to be coprime. But, as usual, I'm handwaving a lot, mixing several things together, hoping to trigger the right neurons in someone else's brain.
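For concreteness, the recombination step could look like this (a minimal sketch with a made-up function name, not code from any library; `pow(m1, -1, m2)` needs Python 3.8+). The hope above is that for suitable n and k the inverse and products collapse into shifts and adds:

```python
# Hypothetical sketch of the CRT recombination: given a residue mod 2^n+1
# (from the SSA part) and one mod 2^k-1 (from the FGT part), rebuild the
# value mod (2^n+1)*(2^k-1). The moduli only need to be coprime.

def crt_ssa_fgt(r1, r2, n, k):
    m1 = (1 << n) + 1          # SSA modulus
    m2 = (1 << k) - 1          # FGT modulus
    # x = r1 + m1*t, with t chosen so that x == r2 (mod m2)
    t = ((r2 - r1) * pow(m1, -1, m2)) % m2
    return r1 + m1 * t
```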
Here are some of the other things I've considered:
Ideally, we'd like to do everything in SSA modulo 2^(wordsize*n)+1, because then we don't need any multiprecision shifts at all: if we're shifting by multiples of the wordsize, we can do that by loading/storing [(pos+shift)%n], which is known at compile time.
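To illustrate what that buys (a sketch with made-up names, not actual SSA code): multiplying by 2^(wordsize*s) mod 2^(wordsize*n)+1 is just a negacyclic rotation of the word array, so the whole "shift" turns into index arithmetic:

```python
# Sketch: an element mod 2^(w*n)+1 stored as n words a[0..n-1], with value
# sum(a[i] * W^i) where W = 2^w. Since W^n == -1 (mod 2^(w*n)+1),
# multiplying by W^s rotates the words, negating those that wrap past n.

def rotate_words(a, s):
    n = len(a)
    out = [0] * n
    for pos in range(n):
        j = (pos + s) % n
        wraps = (pos + s) // n        # odd number of wraps => negate
        out[j] = a[pos] if wraps % 2 == 0 else -a[pos]
    return out
```

The (pos+s)%n indices are constants when s is fixed at compile time, which is exactly the point above; the negated words get absorbed by the carrying pass.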
If the wordsize in the SSA part is 61 bits, setting up the FGT part mod 2^61-1 before doing the two transforms would be fairly fast: we can do 7-8 adds before we need to do a reduction (mov, shift, or, add). I think doing the CRT after both transforms might then also be easy. At least easier than with 2^31-1.
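The reduction being counted there looks roughly like this (a generic sketch of shift-and-add reduction mod 2^61-1, not code from any particular FGT):

```python
M61 = (1 << 61) - 1

def reduce_m61(x):
    # Fold the high part down using 2^61 == 1 (mod 2^61-1); two folds plus
    # one conditional subtract fully reduce any x < 2^122.
    x = (x & M61) + (x >> 61)
    x = (x & M61) + (x >> 61)
    if x >= M61:
        x -= M61
    return x
```

With 61-bit values in 64-bit registers there are 3 spare bits, so about 8 values can be accumulated before a fold is forced, which is where the "7-8 adds" estimate comes from.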
As far as I can tell, one question is whether doing the extra work with an FGT mod 2^61-1 with simple setup & CRT will hurt less than using an FGT mod 2^31-1 with a difficult setup & CRT. A back-of-envelope calculation tells me that mod 2^61-1 will take 4-8x longer, because multiplies take 4x longer and we're operating on only half the number of elements. If it were not for the setup & CRT, one could even consider using 2^13-1 or even 2^7-1, doing 8 resp. 16 FGTs in parallel.
Just to make things clear, the FGT is short and the initial elements have no padding, since we're only interested in the individual elements of the negacyclic convolution. Final carrying wouldn't be meaningful without pasting the two results together via CRT.
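As a reference for what those individual elements are (the quadratic-time definition, not the FGT itself; the FGT computes the same vector in O(n log n)):

```python
def negacyclic(a, b, p):
    # c[k] = sum over i+j == k of a[i]*b[j], where products that wrap
    # around pick up a minus sign (x^n == -1), everything mod p.
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % p
    return c
```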
A 61-bit SSA wordsize would still allow us to do three levels of radix-2 (operating on wordsize slices of 8 values) so that L2->L1 bandwidth isn't a bottleneck.
On the other hand, maybe the savings from eliminating multiprecision shifts don't compensate for the larger modulus size on the next level of recursion. This is a tough decision, because I think there is no way of avoiding a multiprecision shift of every value at least once during every butterfly. The only saving is that this can probably be combined with the carrying, but I still think this would increase runtime by 50%-100%.
The GMP docs mention that the sqrt(2) trick (= transform * 2) gave a 10% speedup, so extending the transform by 2^6 in the top level could really be worth it. Of course one could use a wordsize of 60 bits to extend the transform by 4, or 56 bits to extend the transform by 8 with only a few roots requiring shifts, but then we lose the fast CRT. If we use 62 bits, we can do mod 2^31-1 a little faster and can extend the transform by two, but can only use radix-4 butterflies, so we'd have to do more carrying. Computing carries for 2 values is as expensive as doing a whole butterfly, so this really hurts.
Every which way has its drawbacks. Any ideas to help me make up my mind?