Quote:
Originally Posted by akruppa
In fact, in a manuscript of the paper we mentioned an idea Robert Harley told us, but we had to remove it for the final paper due to the page limit. If I understood it correctly, he suggested using e.g. R=Z/2^{32}Z and doing some convolution algorithm (grammarschool, ToomCook, whatever), but of course there's no reason why we shouldn't use a ring that supports an FFT of the required length. According to Bernstein's "Multidigit Multiplication for Mathematicians" page 11, Karp came up with a similar idea, though I don't quite understand his description so I'm not sure he's talking about the same thing...

Thanks, that clears things up a lot. Now I actually understand what 'Harley's trick' was all about in your paper.
So, we simply take the bottom k bits, do any convolution, and use the bottom k bits of this intermediate result to CRT the final result? Great! This eliminates at least one pass over the data. Supposedly doing the CRT is easier, too. I think it actually might boil down to fiddling around with the top & bottom k bits.
Quote:
Originally Posted by akruppa
Btw, if you do a 4step transform of length ≥2^{12}, actual shifts only occur in the "middle step" of applying twiddle factors between the row and columntransforms, and since the twiddle factor is ω^{ij} for matrix entry i,j, you often get an "even enough" exponent that you don't have to do an actual shift again  actual bitshifts are relatively rare in long SchönhageStrassen transforms.

This is only true, if the machine word is a power of 2. This collides with my other idea is to perform a multiprecision add "A+B+C+D" not as "(A+B) + (C+D)" but as "(A[i] + B[i]) + (C[i] + D[i])", to make it possible that values are used several times during the higher radix butterflies, and perform the carrying only once before they are evicted from L1.
I'll have to experiment a little more with this. Maybe I can cook up a poweroftwo solution, that doesn't hurt as much as mutiprecision shifting. It would be nice to be able to do everything with SSE2, which lacks explicit zeroextending loads.