mersenneforum.org Augment Schoenhage-Strassen with an FGT?
 Register FAQ Search Today's Posts Mark Forums Read

 2009-03-01, 20:03 #1 __HRB__     Dec 2008 Boycotting the Soapbox 24×32×5 Posts Augment Schoenhage-Strassen with an FGT? Maybe I still don't grok how SSA works, but figuring out the right k to split a number into 2^k parts, is a real bastard. It would be a lot easier, if one could simply split a 2^(2*k-1)-bit number N into 2^k parts with 2^k bits each (half the bits are zero to hold the result of the product), and do a transform of length 2^k. Unfortunately the individual 2^k parts are just a little too small, as the results of the convolution require k + 2^k bits to avoid overflow. Idea: To fix this shortcomming I'm thinking about using a (small) FGT modulo the smallest 2^p-1 > 2^k, and reconstructing the result modulo (2^p-1)*(2^(2^k)+1) with the CRT. For example, if we're multiplying a 2^31 bit number (~600M decimals), then k==16, so we're doing the transform modulo 2^16+1 ~ 64Kbit which is comparatively small, fits into L1 at least twice and into L2 several times more. According to http://www.loria.fr/~gaudry/publis/issac07.pdf the optimal values in the MUL_FFT_TABLE are several orders of magnitude larger. The runtime of an FGT (mod 2^31-1) of length 2^16, should be proportional to k/(2^k) and far below 1% of the total time, computing the inputs (mod 2^31-1) requires only 1 extra instruction/word (+overhead). I believe doing the CRT will be the most expensive operation, but if the CRT can be done as fast as doing 1 butterfly, the time for doing this could be around 1/(2*k) ~ 3%. As soon as we have recursed deep enough (using the sqrt(2)-trick when 2*k-1 isn't a whole number), we can use the FGT to compute the entire negacyclic convolution, entirely removing any Toom-3 or Karatsuba specialization. Comments?
 2009-03-02, 03:44 #2 jasonp Tribal Bullet     Oct 2004 67318 Posts I think for Fermat numbers your idea is essential for avoiding a doubling (at least) of the runtime. The only tricky part is that the CRT should be applied after every convolution is completed, before moving back up out of each recursion level, instead of once at the end. Ideally you don't want the fact that you split off an FGT transform to be known outside each convolution. (Your last sentence may have said that, but I'm not sure).
2009-03-02, 07:26   #3
__HRB__

Dec 2008
Boycotting the Soapbox

2D016 Posts

Quote:
 Originally Posted by jasonp I think for Fermat numbers your idea is essential for avoiding a doubling (at least) of the runtime. The only tricky part is that the CRT should be applied after every convolution is completed, before moving back up out of each recursion level, instead of once at the end. Ideally you don't want the fact that you split off an FGT transform to be known outside each convolution. (Your last sentence may have said that, but I'm not sure).
What I meant with the last sentence was that if 2^n+1 is so small that SSA doesn't dramatically reduce the problem-size anymore, we could stop recursing and use an FGT to do the point-wise multiply of the SSA-values with a negacyclic convolution + carrying.

But the main idea is the same as for 'regular' NTTs: do two transforms modulo two different primes, use the CRT, then do the carrying. The advantage using FGT+SSA, is that (hopefully) the CRT is simple because the result is mod (2^n+1)*(2^k-1), which can be done with adding and shifting.

Of course in this case 2^n+1 isn't a prime but the CRT only requires the values to be coprime. But, as usual I'm handwaving a lot, mixing several things together, hoping to trigger the right neurons in someone else's brain.

Here are some of the other things I've considered:

Ideally, we'd like to do everything in SSA modulo 2^(wordsize*n)+1, because then we don't need any multiprecision shifts at all: if we're shifting multiples of the wordsize, we can do that by loading/storing [(pos+shift)%n] which is known at compile-time.

If the wordsize in the SSA-part is 61 bits, setting up the FGT-part mod 2^61-1 before doing the two transforms would be fairly fast - we can do 7-8 adds before we need to do a reduction (mov, shift, or, add). I think doing the CRT after both transforms might then also be easy. At least easier than with 2^31-1.

As far as I can tell, one question is whether doing the extra work with an FGT mod 2^61-1 with simple setup & CRT will hurt less than using an FGT mod 2^31-1, with a difficult setup & CRT. Back-of-envelope calculation tells me that mod 2^61-1 will take 4-8x longer because multiplies take 4x longer and we're operating only on half the number of elements. If it were not for the setup & CRT, one could even consider using 2^13-1 or even 2^7-1 doing 8 resp. 16 FGT's in parallel.

Just to make things clear, the FGT is short and the initial elements have no padding, since we're only interested in the individual elements of the negacyclic convolution. Final carrying wouldn't be meaningful without combining pasting the two results together via CRT

A 61-bit SSA wordsize would still allow us to do three levels of radix-2 (operating on wordsize slices of 8 values) so that L2->L1 bandwidth isn't a bottleneck.

On the other hand, maybe the savings of eliminating multiprecision shifts don't compensate for the larger modulus size on the next level of recursion. This is a tough decision, because I think there is no way of avoiding having to multiprecision-shift every value at least once during every butterfly. The only savings are that this can probably be combined with the carrying, but I still think that this would increase runtime by 50%-100%.

The GMP-docs mention that the sqrt(2)-trick (=transform*2) gave a 10% speedup, so extending the transform by 2^6 in the top level could really be worth it. Of course one could use a wordsize of 60 bits, to extend the transform by 4. Or 56 bits and extend the transform by 8, with only a few roots requiring shifts, but then we lose the fast CRT. If we use 62 bits, we can do mod 2^31-1 a little faster, can extend the transform by two, but can only use a radix-4 butterflies, so we'd have to do more carrying. Computing carries for 2 values is as expensive as doing a whole butterfly, so this really hurts.

Every which way has its drawbacks. Any ideas to help me make up my mind?

Last fiddled with by __HRB__ on 2009-03-02 at 07:54

 2009-03-02, 14:07 #4 akruppa     "Nancy" Aug 2002 Alexandria 1001101000112 Posts IIUC, you want to map the product in Z/(2[I]N[/I]+1)Z to Z[x]/(x[I]l[/I]+1) and then not to (Z/(2[I]n[/I]+1)Z)[x]/(xl+1) with n ≥ 2N/l+log2(l), but to (R(Z/(2[I]n[/I]+1)Z))[x]/(x[I]l[/I]-1), where R is some other ring that supports a length-l transform, is large enough that the condition n ≥ 2N/l suffices and the correct product in Z[x]/(x[I]l[/I]+1) can be reconstructed quickly by a CRT from separate convolutions in R[x]/(x[I]l[/I]+1) and (Z/(2[I]n[/I]+1)Z)[x]/(x[I]l[/I]+1). In your FGT suggestion, R = GF(M[I]p[/I]2) for some Mersenne prime M[I]p[/I]. Sounds perfectly feasible. In fact, in a manuscript of the paper we mentioned an idea Robert Harley told us, but we had to remove it for the final paper due to the page limit. If I understood it correctly, he suggested using e.g. R=Z/232Z and doing some convolution algorithm (grammar-school, Toom-Cook, whatever), but of course there's no reason why we shouldn't use a ring that supports an FFT of the required length. According to Bernstein's "Multidigit Multiplication for Mathematicians" page 11, Karp came up with a similar idea, though I don't quite understand his description so I'm not sure he's talking about the same thing... So basically, yeah, I think your idea should work. Btw, if you do a 4-step transform of length ≥212, actual shifts only occur in the "middle step" of applying twiddle factors between the row- and column-transforms, and since the twiddle factor is ωij for matrix entry i,j, you often get an "even enough" exponent that you don't have to do an actual shift again - actual bit-shifts are relatively rare in long Schönhage-Strassen transforms. Alex Last fiddled with by akruppa on 2009-03-02 at 14:13
2009-03-02, 17:34   #5
__HRB__

Dec 2008
Boycotting the Soapbox

24·32·5 Posts

Quote:
 Originally Posted by akruppa In fact, in a manuscript of the paper we mentioned an idea Robert Harley told us, but we had to remove it for the final paper due to the page limit. If I understood it correctly, he suggested using e.g. R=Z/232Z and doing some convolution algorithm (grammar-school, Toom-Cook, whatever), but of course there's no reason why we shouldn't use a ring that supports an FFT of the required length. According to Bernstein's "Multidigit Multiplication for Mathematicians" page 11, Karp came up with a similar idea, though I don't quite understand his description so I'm not sure he's talking about the same thing...
Thanks, that clears things up a lot. Now I actually understand what 'Harley's trick' was all about in your paper.

So, we simply take the bottom k bits, do any convolution, and use the bottom k bits of this intermediate result to CRT the final result? Great! This eliminates at least one pass over the data. Supposedly doing the CRT is easier, too. I think it actually might boil down to fiddling around with the top & bottom k bits.

Quote:
 Originally Posted by akruppa Btw, if you do a 4-step transform of length ≥212, actual shifts only occur in the "middle step" of applying twiddle factors between the row- and column-transforms, and since the twiddle factor is ωij for matrix entry i,j, you often get an "even enough" exponent that you don't have to do an actual shift again - actual bit-shifts are relatively rare in long Schönhage-Strassen transforms.
This is only true, if the machine word is a power of 2. This collides with my other idea is to perform a multiprecision add "A+B+C+D" not as "(A+B) + (C+D)" but as "(A[i] + B[i]) + (C[i] + D[i])", to make it possible that values are used several times during the higher radix butterflies, and perform the carrying only once before they are evicted from L1.

I'll have to experiment a little more with this. Maybe I can cook up a power-of-two solution, that doesn't hurt as much as mutiprecision shifting. It would be nice to be able to do everything with SSE2, which lacks explicit zero-extending loads.

Last fiddled with by __HRB__ on 2009-03-02 at 17:35

All times are UTC. The time now is 01:10.

Thu Aug 18 01:10:25 UTC 2022 up 41 days, 19:57, 1 user, load averages: 1.02, 1.09, 1.10

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔