20040119, 19:07  #1 
Apr 2003
Berlin, Germany
19^{2} Posts 
Fast calculations modulo small mersenne primes like M61
In this thread we could collect nice tricks and interesting code fragments (HLL or assembler) regarding integer operations modulo M61, M31 or other mersenne primes. These are important if one wants to implement a Fast Galois Transform.
My contributions will concentrate on 64bit x86 code, but solutions for other CPUs (or 32bit x86) are welcome too First I will give some basics. People who're familiar with this stuff could just skip the following part. The most important thing to know is how the modular reduction can be done. For mersenne primes this is easy: First we separate the number to be reduced into 2 parts: x lower bits (x is the exponent of the mersenne prime 2^x  1, e.g. 61 in case of M61) and y higher bits, which begin at bit position x (if we start counting from 0). Finally the reduction happens by adding the y higher bits (after shifting them right to index 0) to the lower x bits and repeating this until no higher bits are left. If the result equals M61 then we could set it to zero to have a valid number mod M61. I wrote "could" because we could also leave it holding the value M61 and it will still work as expected. A small example will show how it works: We want to reduce 632 using M5 (31). Code:
632 = 1001111000b M5 = 2^5  1 = 11111b 632 split into 2 parts: 10011 11000 addition: 11000 (24) + 10011 (19) =101011 (43) There is still one high bit left, so we do another iteration: splitting 43: 1 01011 addition: 1011 + 1 = 1100 (12) Usual arithmetic operations (addition, subtraction, multiplication) modulo a mersenne prime can be implemented by using standard integer operations in combination with a modular reduction of the result. Some possible ways to represent and use such integers are shown below: The simplest way is to choose a word size large enough to hold the numbers during calculations and do modular reductions whenever they are necessary (i.e. before an overlow might occur). In case of M61 we can use the available 64bit general purpose registers and still have 3 bits left. This works fine and results don't need to be reduced all the time. Another neat representation is possible by splitting the number into a low part with x bits (x=width of the modulo) and a high part holding the bits which will add up during calculations. Here it is useful to shift the lower part to the left to match it's most significant bit with the msb of the register. Now let's assume we have the result of such an operation in val (big enough to hold all bits). The reduction mod M61 could look like this (assuming that val is a 64bit unsigned int): Code:
val = (val & M61) + (val>>61); val = (val & M61) + (val>>61); //necessary if previous add led to a val > M61 val = (val==M61)? 0 : val ; //actually not necessary Last fiddled with by Dresdenboy on 20040127 at 16:15 
20040119, 22:35  #2 
Sep 2002
2×331 Posts 
Is the modular reduction using the algorithm/example valid for
other numbers that are one less than a power of two but aren't mersenne primes ? I tested 255 and it worked. m = 2^8  1 = 255 v = 512 = 10 00000000b v = (512 and 255) + 2 = 2 = 00000000b + 10b = 10b v = 2 2 = 512 mod 255 Here is delphi code using longwords ( 32bit unsigned int, dword ) Code:
Program ModRed; // 2^x  1 var v : longword; m : longword; mp : longword; begin m := 255; mp := m + 1; v := 512; while v > m do v := (v and m) + ((v div mp) and m); // (v shr 8) = (v div mp) // and m); in case original v > 16 bits if (v = m) then v := 0; writeln(v); end. Code:
.586 .model flat, stdcall option casemap :none ;######################################################################### include \masm32\include\windows.inc include \masm32\include\kernel32.inc include \masm32\include\user32.inc includelib \masm32\lib\kernel32.lib includelib \masm32\lib\user32.lib .data v dd 0 m dd 0 mp dd 0 fmtAnswer db "Answer = %4lu ",0 szAnswer db " ",0 szCaption db "Modular Reduction",0 .code start: ;  ; Modular Reduction of 2^x 1 ;  mov m,0FFh ; m := 255; mov eax,m ; eax := m; inc eax ; eax := eax + 1; mov mp,eax ; mp := eax; mov v,200h ; v := 512; mov eax,v ; eax := v; .while (eax > m) xor edx,edx ; edx := 0; div mp ; eax := edx:eax div mp; and eax,m ; eax := eax and m; mov edx,v ; edx := v; and edx,m ; edx := edx and m; add eax,edx ; eax := eax + edx; mov v,eax ; v := eax; .endw .if (eax == m) mov v, 0 .endif invoke wsprintf,ADDR szAnswer,ADDR fmtAnswer,v invoke MessageBox,NULL,ADDR szAnswer,ADDR szCaption,MB_OK mov eax, 0 invoke ExitProcess,0 end start 
20040120, 01:03  #3  
"Gang aft agley"
Sep 2002
2·1,877 Posts 
Quote:
Later, I thought of it this way: If you were doing, say, MOD 256 on a 16 bit value, the lower 8 bits would be the remainder (and the upper 8 bits would be the quotient). The original 16 bit value would be q * 2^{8} + r If I wanted MOD 255 instead, I could consider that in the previous division (which is a process of subtracting 256 repeatedly until the remainder is less than 256); each time I subtracted 256, I really wanted to subtract 255. To adjust the remainder, I would add 1 to the remainder for each time 256 had been subtracted. The amount to add would be in q. Of course r might now be greater than 255 in which case I would add any carry out back into the remainder. Knuth refers to this method of computing a modulus for a value one less then a power of two in section 4.3.2 of volume 2 (my copy is 3rd edition) of The Art of Computer Programming. He notes that it can be helpful to relax the condition that the remainder is less then the modulus, allowing it to be less than or equal to the modulus. If it is equal to the modulus (2^{k}1), treat it as if it were 0. Another way of looking at this is by analogy with casting out nines. To get the remainder modulo 9, we can add all the digits of a base 10 number. To get the remainder modulo 2^{k}1, we add all the digits of a base 2^{k} number. In casting out 9s, we use the fact that 10 ≡ 1 (mod 9) In mod 2^{k}1, we use the fact that 2^{k} ≡ 1 (mod 2^{k}1) Last fiddled with by only_human on 20040120 at 01:07 

20040120, 12:26  #4 
"Gang aft agley"
Sep 2002
2×1,877 Posts 
With a little more care, a routine can calculate mod 2^{k}+1 almost as fast. By analogy, this is like calculating mod 11 with base 10 numbers. If the modulus m = 11, we note that:
10 ≡ 1, 10^{2} ≡ 1, 10^{3} ≡ 1, 10^{4} ≡ 1, . . . (mod 11) Therefore, 632 mod 11 is calculated alternatively adding and subtracting digits starting from the least significant digit: 632 ≡ 2  3 + 6 = 5 (mod 11). Using this trick in another base, have the modulus, m, be one greater than the base. Of course for the purpose of these fast routines, the base will be a power of 2. To calculate 632 mod 33 using binary representation to compare to the example by Dresdenboy: Code:
632 = 1001111000b 632 split into 2 parts: 10011 11000 subtraction: 11000 (24)  10011 (19) = 00101 ( 5) A little more work is needed here because the number can go negative and also each more significant digit is added or subtracted alternatingly. Also the fact that the modulus is one greater than a power of two requires attention if the power of two used is the computer word size. 
20040121, 08:21  #5 
Apr 2003
Berlin, Germany
19^{2} Posts 
Although the first method works for all numbers of the form 2^{p}1, they should be prime so that they can be used for a so called Fast Galois Transform.
The analogous method for 2^{p}+1 could be useful for NTTs based on primes of such form. I think they are called Fermat Number Transform. It's nice to bring up the topic of negative numbers. These could also show up during calculations. Then we have a negative number, which is the two's complement of its positive. In this case we need to use arithmetic shifts (C++ will do that for signed ints). Now I want to show an x8664 assembler version of the modular reduction, as described in the first message in this thread. Imagine, we have a 64bit number stored in RAX and want to reduce it mod M61. I'll represent it in shortened form: UUULLLLLLLLLLLLL U = upper bits L = lower bits I show 16 instead of 64 to make it more convenient. It's still enough to show what happens. My best solution to date for the modular reduction is: Code:
lea rbx, [rax*8] ;make a copy, shift it left by 3  LLLLLLLLLLLLL000 shr rax, 58 ;move the upper 3 bits to pos 2  0000000000UUULLL and rax, 111000b ;clear the other 3 bits  0000000000UUU000 add rax, rbx ;first addition, an overflow will set the carry bit, ; LLLLLLLLLLLLL000 ; + 0000000000UUU000 adc rax, 111b ;addition by a constant and the carry bit ;the carry bit will be added to bit pos 3 actually, ;leaving the bits 20 in an undefined state ;(000b or 111b, depending on the carry bit) ;another overflow couldn't have happened during ;the last step, so we finally shr rax, 3 ;shift everything back into place If the complete reduction (removing all upper bits) is not necessary, we can use a shorter variant: Code:
mov rbx, rax ;make a copy shr rax, 61 ;move the upper 3 bits to pos 0  0000000000000UUU and rbx, M61 ;clear the upper 3 bits  000LLLLLLLLLLLLL add rax, rbx ;addition, an "overflow" will set bit 61  00ULLLLLLLLLLLLL Last fiddled with by Dresdenboy on 20040127 at 16:20 
20040124, 17:13  #6 
"Gang aft agley"
Sep 2002
3754_{10} Posts 
Just for fun I tinkered with the routines. I did not see any improvements offhand. In the first routine I changed the constant added when the carry is added back in to a 0 so it could be replaced with further data of a longer calculation (and I reformated the comments to avoid slide bars on the post). I did not code any of this and may have screwed it up
Code:
lea rbx, [rax*8] ;make a copy, shift it left by 3 bits  LLLLLLLLLLLLL000 shr rax, 58 ;move the upper 3 bits to bit pos 3  0000000000UUULLL ; LLLLLLLLLLLLL000 ; rbx or rax, 111b ; 0000000000UUU111 ; rax add rax, rbx ; ================ ; first addition, an overflow sets CF adc rax, 0 ;add in the carry bit, propagating it to bit 3, ;leaving the bits 20 = 000b if carry set, else = 111b ;another overflow couldn't have happened during this step, shr rax, 3 ;so we finally shift everything back into place Code:
mov rbx, rax ;make a copy shr rax, 61 ;move the upper 3 bits to bits 2,1,0 :0000000000000UUU and rbx, M61 ;clear the upper 3 bits (bits 63,62,61) :000LLLLLLLLLLLLL add rax, rbx ;addition, an "overflow" will set bit 61 :00ULLLLLLLLLLLLL btr rax, 61 ;move bit 61 to CF; clear bit 61 CF,000LLLLLLLLLLLLL adc rax, 0 ;add CF back into result. No "overflow" happens this step. 
20040126, 15:43  #7 
∂^{2}ω=0
Sep 2002
República de California
26717_{8} Posts 
Another timesaving idea I came up with when Peter Montgormery, Jason Papadopoulos and I were first working out the details of Mersennemod FGT: since nearly all of the modmuls in the FGT are by (presumably precomputed) transform "twiddle factors" (the modular roots of unity, analogous to the sines and cosines of an FFT), for arithmetic modulo M61, we can save some operations by premultiplying these twiddles by 8, which causes the "folding" point of the mod to be aligned precisely on the boundary of two 64bit words, rather than between bits 60 and 61 of the lower word. Dresdenboy, I presume you're also implementing this in your code?

20040126, 15:55  #8  
∂^{2}ω=0
Sep 2002
República de California
10110111001111_{2} Posts 
Quote:
Last fiddled with by ewmayer on 20040126 at 16:05 

20040207, 00:02  #9 
"Gang aft agley"
Sep 2002
2·1,877 Posts 
I just noticed that the pihacks forum has been discussing the same topic as we have. Special form moduli for fast reduction in NTTs in messages 1096 through 1120 (Jan 15thJan 21st).
There is some extremely sharp code included. This contributers are Jason Papadopoulos, Décio Luiz Gazzoni Filho, and Carey Bloodworth. 
20040229, 17:02  #10  
Apr 2003
Berlin, Germany
19^{2} Posts 
Quote:


20040229, 17:27  #11  
Apr 2003
Berlin, Germany
361_{10} Posts 
Quote:
Code:
;our number resides in rbx lea rcx, [rbx*8] ; a copy shifted left by 3  upper bits are removed shr rbx, 61 ; move the upper bits right lea rax, [rcx+rbx*8] ; add the lower bits (shifted left) to the upper bits ; and load that sum into rax (where we need it for mul) mul qword ptr [twid] ; just to show, what happens next ... The result of the first two instructions could also be used to do some 128bit adds (add+adc) with a lot of space for carries and quick modular reduction later (shr+add). But that will only be useful in larger butterflies and if we only need to do one ore two more adds and a reduction is already necessary. BTW the next version of Athlon 64/Opteron core (could be the 90nm core) will do 1cycle latency LEAs, if they only do a RISClike addition (no shift). Now it has a latency of 2. The core will also have SSE3, but that won't help us here Last fiddled with by Dresdenboy on 20040301 at 11:07 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Find Mersenne Primes twice as fast?  Derived  Number Theory Discussion Group  24  20160908 11:45 
Sieving with powers of small primes in the Small Prime variation of the Quadratic Sieve  mickfrancis  Factoring  2  20160506 08:13 
On polynomials without roots modulo small p  fivemack  Computer Science & Computational Number Theory  2  20150918 12:54 
Small(ish) matrices on fast computers  fivemack  Msieve  1  20110314 23:43 
Order of 3 modulo a Mersenne prime  T.Rex  Math  7  20090313 10:46 