Maybe I still don't grok how SSA works, but figuring out the right k to split a number into 2^k parts is a real bastard.
It would be a lot easier if one could simply split a 2^(2*k-1)-bit number N into 2^k parts of 2^k bits each (half the bits are zero, to leave room for the result of the product), and do a transform of length 2^k.
Unfortunately the individual 2^k parts are just a little too small: the coefficients of the convolution require k + 2^k bits to avoid overflow.
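To make the bound concrete, here's a small-scale sketch (with an illustrative k = 4, far smaller than anything realistic): each of the 2^k parts carries 2^(k-1) data bits, two parts multiply to 2^k bits, and summing up to 2^k such products pushes a coefficient to k + 2^k bits, one more fold than the 2^k-bit slot can hold.

```python
k = 4                               # illustrative size, not a realistic one
part_bits = 2 ** (k - 1)            # data bits per part (half the slot)
parts = 2 ** k                      # number of parts / transform length
max_part = (1 << part_bits) - 1     # largest value a part can hold

# worst-case convolution coefficient: a sum of 2^k products of two parts
max_coeff = parts * max_part * max_part

assert max_coeff.bit_length() == k + 2 ** k   # needs k + 2^k bits...
assert max_coeff.bit_length() > 2 ** k        # ...so it overflows a 2^k-bit slot
```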
Idea:
To fix this shortcoming I'm thinking about using a (small) FGT modulo the smallest 2^p-1 > 2^k, and reconstructing the result modulo (2^p-1)*(2^(2^k)+1) with the CRT.
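The CRT step would look roughly like this (scaled down to toy moduli p = 5, k = 2 so the numbers are readable; 2^p-1 and 2^(2^k)+1 are coprime for odd p, so a coefficient below their product is uniquely recoverable from its two residues):

```python
p, k = 5, 2                  # toy sizes for illustration
m1 = (1 << p) - 1            # 2^p - 1 = 31  (the FGT modulus)
m2 = (1 << (1 << k)) + 1     # 2^(2^k) + 1 = 17  (the SSA ring modulus)

x = 400                      # a "convolution coefficient", < m1*m2
r1, r2 = x % m1, x % m2      # the two residues we'd actually compute

# CRT reconstruction: x = r2 + m2 * ((r1 - r2) * m2^-1 mod m1)
inv = pow(m2, -1, m1)        # inverse of m2 modulo m1 (Python 3.8+)
rec = r2 + m2 * (((r1 - r2) * inv) % m1)

assert rec == x
```

With the real parameters only the multiplication by m2 involves a large operand, and m2 = 2^(2^k)+1 makes that a shift and a subtract.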
For example, if we're multiplying a 2^31-bit number (~600M decimals), then k == 16, so we're doing the transform modulo 2^(2^16)+1, i.e. with ~64-Kbit coefficients, which is comparatively small: a coefficient fits into L1 at least twice and into L2 several times over. According to
http://www.loria.fr/~gaudry/publis/issac07.pdf the optimal values in the MUL_FFT_TABLE are several orders of magnitude larger.
The runtime of an FGT (mod 2^31-1) of length 2^16 should be proportional to k/2^k and far below 1% of the total time; computing the inputs (mod 2^31-1) requires only 1 extra instruction per word (plus overhead).
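The reason reduction mod 2^31-1 is so cheap is the usual Mersenne trick: 2^31 == 1 (mod 2^31-1), so the high bits just fold onto the low bits with a shift and an add. A sketch (names are mine, not GMP's):

```python
def mod_mersenne(x, p=31):
    """Reduce x modulo 2^p - 1 by folding the bits above position p
    back onto the low bits; each fold is one shift plus one add."""
    m = (1 << p) - 1
    while x > m:
        x = (x & m) + (x >> p)
    return 0 if x == m else x

assert mod_mersenne((1 << 31) - 1) == 0
assert mod_mersenne(123456789012345) == 123456789012345 % ((1 << 31) - 1)
```

Per input word that's the "1 extra instruction" above: accumulate the word, fold once.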
I believe doing the CRT will be the most expensive operation, but if the CRT can be done as fast as one butterfly, the time for it could be around 1/(2*k) ~ 3%.
As soon as we have recursed deeply enough (using the sqrt(2) trick when 2*k-1 isn't a whole number), we can use the FGT to compute the entire negacyclic convolution, entirely removing any Toom-3 or Karatsuba specialization.
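For reference, the negacyclic convolution the FGT would be computing is multiplication of the part-polynomials mod x^n + 1, where the wraparound picks up a minus sign. A naive version (useful as a correctness oracle for any transform-based implementation):

```python
def negacyclic(a, b):
    """Naive negacyclic convolution: coefficients of a(x)*b(x) mod (x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]   # x^n == -1: wrapped terms negate
    return c

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 == -5 + 10x  (mod x^2 + 1)
assert negacyclic([1, 2], [3, 4]) == [-5, 10]
```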
Comments?