View Single Post
Old 2009-03-02, 17:34   #5
__HRB__'s Avatar
Dec 2008
Boycotting the Soapbox

24×32×5 Posts

Originally Posted by akruppa View Post
In fact, in a manuscript of the paper we mentioned an idea Robert Harley told us, but we had to remove it for the final paper due to the page limit. If I understood it correctly, he suggested using e.g. R=Z/232Z and doing some convolution algorithm (grammar-school, Toom-Cook, whatever), but of course there's no reason why we shouldn't use a ring that supports an FFT of the required length. According to Bernstein's "Multidigit Multiplication for Mathematicians" page 11, Karp came up with a similar idea, though I don't quite understand his description so I'm not sure he's talking about the same thing...
Thanks, that clears things up a lot. Now I actually understand what 'Harley's trick' was all about in your paper.

So, we simply take the bottom k bits, do any convolution, and use the bottom k bits of this intermediate result to CRT the final result? Great! This eliminates at least one pass over the data. Supposedly doing the CRT is easier, too. I think it actually might boil down to fiddling around with the top & bottom k bits.

Originally Posted by akruppa View Post
Btw, if you do a 4-step transform of length ≥212, actual shifts only occur in the "middle step" of applying twiddle factors between the row- and column-transforms, and since the twiddle factor is ωij for matrix entry i,j, you often get an "even enough" exponent that you don't have to do an actual shift again - actual bit-shifts are relatively rare in long Schönhage-Strassen transforms.
This is only true, if the machine word is a power of 2. This collides with my other idea is to perform a multiprecision add "A+B+C+D" not as "(A+B) + (C+D)" but as "(A[i] + B[i]) + (C[i] + D[i])", to make it possible that values are used several times during the higher radix butterflies, and perform the carrying only once before they are evicted from L1.

I'll have to experiment a little more with this. Maybe I can cook up a power-of-two solution, that doesn't hurt as much as mutiprecision shifting. It would be nice to be able to do everything with SSE2, which lacks explicit zero-extending loads.

Last fiddled with by __HRB__ on 2009-03-02 at 17:35
__HRB__ is offline   Reply With Quote