 mersenneforum.org Fast calculations modulo small mersenne primes like M61
 Register FAQ Search Today's Posts Mark Forums Read 2004-01-19, 19:07 #1 Dresdenboy   Apr 2003 Berlin, Germany 192 Posts Fast calculations modulo small mersenne primes like M61 In this thread we could collect nice tricks and interesting code fragments (HLL or assembler) regarding integer operations modulo M61, M31 or other mersenne primes. These are important if one wants to implement a Fast Galois Transform. My contributions will concentrate on 64bit x86 code, but solutions for other CPUs (or 32bit x86) are welcome too First I will give some basics. People who're familiar with this stuff could just skip the following part. The most important thing to know is how the modular reduction can be done. For mersenne primes this is easy: First we separate the number to be reduced into 2 parts: x lower bits (x is the exponent of the mersenne prime 2^x - 1, e.g. 61 in case of M61) and y higher bits, which begin at bit position x (if we start counting from 0). Finally the reduction happens by adding the y higher bits (after shifting them right to index 0) to the lower x bits and repeating this until no higher bits are left. If the result equals M61 then we could set it to zero to have a valid number mod M61. I wrote "could" because we could also leave it holding the value M61 and it will still work as expected. A small example will show how it works: We want to reduce 632 using M5 (31). Code: 632 = 1001111000b M5 = 2^5 - 1 = 11111b 632 split into 2 parts: 10011 11000 addition: 11000 (24) + 10011 (19) =101011 (43) There is still one high bit left, so we do another iteration: splitting 43: 1 01011 addition: 1011 + 1 = 1100 (12) 632 == 12 (mod M5) Usual arithmetic operations (addition, subtraction, multiplication) modulo a mersenne prime can be implemented by using standard integer operations in combination with a modular reduction of the result. Some possible ways to represent and use such integers are shown below: The simplest way is to choose a word size large enough to hold the numbers during calculations and do modular reductions whenever they are necessary (i.e. before an overlow might occur). In case of M61 we can use the available 64bit general purpose registers and still have 3 bits left. This works fine and results don't need to be reduced all the time. Another neat representation is possible by splitting the number into a low part with x bits (x=width of the modulo) and a high part holding the bits which will add up during calculations. Here it is useful to shift the lower part to the left to match it's most significant bit with the msb of the register. Now let's assume we have the result of such an operation in val (big enough to hold all bits). The reduction mod M61 could look like this (assuming that val is a 64bit unsigned int): Code: val = (val & M61) + (val>>61); val = (val & M61) + (val>>61); //necessary if previous add led to a val > M61 val = (val==M61)? 0 : val ; //actually not necessary So, that's all for the beginning. Last fiddled with by Dresdenboy on 2004-01-27 at 16:15   2004-01-19, 22:35 #2 dsouza123   Sep 2002 2×331 Posts Is the modular reduction using the algorithm/example valid for other numbers that are one less than a power of two but aren't mersenne primes ? I tested 255 and it worked. m = 2^8 - 1 = 255 v = 512 = 10 00000000b v = (512 and 255) + 2 = 2 = 00000000b + 10b = 10b v = 2 2 = 512 mod 255 Here is delphi code using longwords ( 32bit unsigned int, dword ) Code: Program ModRed; // 2^x - 1 var v : longword; m : longword; mp : longword; begin m := 255; mp := m + 1; v := 512; while v > m do v := (v and m) + ((v div mp) and m); // (v shr 8) = (v div mp) // and m); in case original v > 16 bits if (v = m) then v := 0; writeln(v); end. And the equivalent in windows 32bit assembly ( MASM32 ) with psuedo pascal style assembly comments. Code:  .586 .model flat, stdcall option casemap :none ;######################################################################### include \masm32\include\windows.inc include \masm32\include\kernel32.inc include \masm32\include\user32.inc includelib \masm32\lib\kernel32.lib includelib \masm32\lib\user32.lib .data v dd 0 m dd 0 mp dd 0 fmtAnswer db "Answer = %4lu ",0 szAnswer db " ",0 szCaption db "Modular Reduction",0 .code start: ; ------------------------------------------- ; Modular Reduction of 2^x -1 ; ------------------------------------------- mov m,0FFh ; m := 255; mov eax,m ; eax := m; inc eax ; eax := eax + 1; mov mp,eax ; mp := eax; mov v,200h ; v := 512; mov eax,v ; eax := v; .while (eax > m) xor edx,edx ; edx := 0; div mp ; eax := edx:eax div mp; and eax,m ; eax := eax and m; mov edx,v ; edx := v; and edx,m ; edx := edx and m; add eax,edx ; eax := eax + edx; mov v,eax ; v := eax; .endw .if (eax == m) mov v, 0 .endif invoke wsprintf,ADDR szAnswer,ADDR fmtAnswer,v invoke MessageBox,NULL,ADDR szAnswer,ADDR szCaption,MB_OK mov eax, 0 invoke ExitProcess,0 end start   2004-01-20, 01:03   #3
only_human

"Gang aft agley"
Sep 2002

3,581 Posts Quote:
 Is the modular reduction using the algorithm/example valid for other numbers that are one less than a power of two but aren't mersenne primes ?
Yes. I first came across this method in a crytopgraphy book. I was very surprised to see in the coded examples that the carry out was added back into the mod value.

Later, I thought of it this way:
If you were doing, say, MOD 256 on a 16 bit value, the lower 8 bits would be the remainder (and the upper 8 bits would be the quotient). The original 16 bit value would be
q * 28 + r

If I wanted MOD 255 instead, I could consider that in the previous division (which is a process of subtracting 256 repeatedly until the remainder is less than 256); each time I subtracted 256, I really wanted to subtract 255. To adjust the remainder, I would add 1 to the remainder for each time 256 had been subtracted. The amount to add would be in q. Of course r might now be greater than 255 in which case I would add any carry out back into the remainder.

Knuth refers to this method of computing a modulus for a value one less then a power of two in section 4.3.2 of volume 2 (my copy is 3rd edition) of The Art of Computer Programming. He notes that it can be helpful to relax the condition that the remainder is less then the modulus, allowing it to be less than or equal to the modulus. If it is equal to the modulus (2k-1), treat it as if it were 0.

Another way of looking at this is by analogy with casting out nines. To get the remainder modulo 9, we can add all the digits of a base 10 number. To get the remainder modulo 2k-1, we add all the digits of a base 2k number.
In casting out 9s, we use the fact that 10 &equiv; 1 (mod 9)
In mod 2k-1, we use the fact that 2k &equiv; 1 (mod 2k-1)

Last fiddled with by only_human on 2004-01-20 at 01:07   2004-01-20, 12:26 #4 only_human   "Gang aft agley" Sep 2002 3,581 Posts With a little more care, a routine can calculate mod 2k+1 almost as fast. By analogy, this is like calculating mod 11 with base 10 numbers. If the modulus m = 11, we note that: 10 ≡ -1, 102 ≡ 1, 103 ≡ -1, 104 ≡ 1, . . . (mod 11) Therefore, 632 mod 11 is calculated alternatively adding and subtracting digits starting from the least significant digit: 632 ≡ 2 - 3 + 6 = 5 (mod 11). Using this trick in another base, have the modulus, m, be one greater than the base. Of course for the purpose of these fast routines, the base will be a power of 2. To calculate 632 mod 33 using binary representation to compare to the example by Dresdenboy: Code: 632 = 1001111000b 632 split into 2 parts: 10011 11000 subtraction: 11000 (24) - 10011 (19) = 00101 ( 5) 632 == 5 (mod 33) A little more work is needed here because the number can go negative and also each more significant digit is added or subtracted alternatingly. Also the fact that the modulus is one greater than a power of two requires attention if the power of two used is the computer word size.   2004-01-21, 08:21 #5 Dresdenboy   Apr 2003 Berlin, Germany 5518 Posts Although the first method works for all numbers of the form 2p-1, they should be prime so that they can be used for a so called Fast Galois Transform. The analogous method for 2p+1 could be useful for NTTs based on primes of such form. I think they are called Fermat Number Transform. It's nice to bring up the topic of negative numbers. These could also show up during calculations. Then we have a negative number, which is the two's complement of its positive. In this case we need to use arithmetic shifts (C++ will do that for signed ints). Now I want to show an x86-64 assembler version of the modular reduction, as described in the first message in this thread. Imagine, we have a 64bit number stored in RAX and want to reduce it mod M61. I'll represent it in shortened form: UUULLLLLLLLLLLLL U = upper bits L = lower bits I show 16 instead of 64 to make it more convenient. It's still enough to show what happens. My best solution to date for the modular reduction is: Code: lea rbx, [rax*8] ;make a copy, shift it left by 3 - LLLLLLLLLLLLL000 shr rax, 58 ;move the upper 3 bits to pos 2 - 0000000000UUULLL and rax, 111000b ;clear the other 3 bits - 0000000000UUU000 add rax, rbx ;first addition, an overflow will set the carry bit, ; LLLLLLLLLLLLL000 ; + 0000000000UUU000 adc rax, 111b ;addition by a constant and the carry bit ;the carry bit will be added to bit pos 3 actually, ;leaving the bits 2-0 in an undefined state ;(000b or 111b, depending on the carry bit) ;another overflow couldn't have happened during ;the last step, so we finally shr rax, 3 ;shift everything back into place Note: I don't look for the total latency of this operation because my target CPU works out-of-order and is superscalar. Instead it is important to use as few simple instructions as possible. If the complete reduction (removing all upper bits) is not necessary, we can use a shorter variant: Code: mov rbx, rax ;make a copy shr rax, 61 ;move the upper 3 bits to pos 0 - 0000000000000UUU and rbx, M61 ;clear the upper 3 bits - 000LLLLLLLLLLLLL add rax, rbx ;addition, an "overflow" will set bit 61 - 00ULLLLLLLLLLLLL (Edit: scroll bars removed, changed one case of different bit numbering ) Last fiddled with by Dresdenboy on 2004-01-27 at 16:20   2004-01-24, 17:13 #6 only_human   "Gang aft agley" Sep 2002 3,581 Posts Just for fun I tinkered with the routines. I did not see any improvements offhand. In the first routine I changed the constant added when the carry is added back in to a 0 so it could be replaced with further data of a longer calculation (and I reformated the comments to avoid slide bars on the post). I did not code any of this and may have screwed it up Code: lea rbx, [rax*8] ;make a copy, shift it left by 3 bits - LLLLLLLLLLLLL000 shr rax, 58 ;move the upper 3 bits to bit pos 3 - 0000000000UUULLL ; LLLLLLLLLLLLL000 ; rbx or rax, 111b ; 0000000000UUU111 ; rax add rax, rbx ; ================ ; first addition, an overflow sets CF adc rax, 0 ;add in the carry bit, propagating it to bit 3, ;leaving the bits 2-0 = 000b if carry set, else = 111b ;another overflow couldn't have happened during this step, shr rax, 3 ;so we finally shift everything back into place The second routine looks more easily extended to a longer calculation. I didn't try that -- I just pinned a tail on it (not that it needed one for its intended purpose). Code: mov rbx, rax ;make a copy shr rax, 61 ;move the upper 3 bits to bits 2,1,0 :0000000000000UUU and rbx, M61 ;clear the upper 3 bits (bits 63,62,61) :000LLLLLLLLLLLLL add rax, rbx ;addition, an "overflow" will set bit 61 :00ULLLLLLLLLLLLL btr rax, 61 ;move bit 61 to CF; clear bit 61 CF,000LLLLLLLLLLLLL adc rax, 0 ;add CF back into result. No "overflow" happens this step.   2004-01-26, 15:43 #7 ewmayer ∂2ω=0   Sep 2002 República de California 263F16 Posts Another time-saving idea I came up with when Peter Montgormery, Jason Papadopoulos and I were first working out the details of Mersenne-mod FGT: since nearly all of the modmuls in the FGT are by (presumably precomputed) transform "twiddle factors" (the modular roots of unity, analogous to the sines and cosines of an FFT), for arithmetic modulo M61, we can save some operations by premultiplying these twiddles by 8, which causes the "folding" point of the mod to be aligned precisely on the boundary of two 64-bit words, rather than between bits 60 and 61 of the lower word. Dresdenboy, I presume you're also implementing this in your code?   2004-01-26, 15:55   #8
ewmayer
2ω=0

Sep 2002
República de California

9,791 Posts Quote:
 Originally Posted by Dresdenboy It's nice to bring up the topic of negative numbers. These could also show up during calculations. Then we have a negative number, which is the two's complement of its positive. In this case we need to use arithmetic shifts (C++ will do that for signed ints).
The problem with this is that it is awkward and introduces a branch (or at bes some extra operations) on hardware lacking a carry bit, I.e. most RISC-style hardware. It's also unnecessary. When putting together one's transform, one should simply carefully bound the outputs of each operation (this can be done manually, which is the way I prefer - Jason Papadopoulos has actually written code that automates the procedure, and inserts modular adds and reduction where necessary), and then whenever an output range becomes so large that the next operation it's involved in could lead to overflow, we do a modular reduction (typically a partial one, i.e. we don't care if folding bits 61-63 into bits 0-60 causes a carry into bit 61). If the output range in question has a negative component (e.g. x is in [-3*q, +5*q], where q = M61), we add the needed multiple of M61 to make it nonnegative, then reduce. Since we know the output ranges in advance, we can eliminate any code branches or special runtime handling of negative operands this way.

Last fiddled with by ewmayer on 2004-01-26 at 16:05   2004-02-07, 00:02 #9 only_human   "Gang aft agley" Sep 2002 3,581 Posts I just noticed that the pi-hacks forum has been discussing the same topic as we have. Special form moduli for fast reduction in NTTs in messages 1096 through 1120 (Jan 15th--Jan 21st). There is some extremely sharp code included. This contributers are Jason Papadopoulos, Décio Luiz Gazzoni Filho, and Carey Bloodworth.   2004-02-29, 17:02   #10
Dresdenboy

Apr 2003
Berlin, Germany

192 Posts Quote:
 Originally Posted by only_human I just noticed that the pi-hacks forum has been discussing the same topic as we have. Special form moduli for fast reduction in NTTs in messages 1096 through 1120 (Jan 15th--Jan 21st). There is some extremely sharp code included. This contributers are Jason Papadopoulos, Décio Luiz Gazzoni Filho, and Carey Bloodworth.
Yep, the pi-hacks thread is interesting. But the code I saw there was some prototype integer SSE2 code for doing modular multiplications. I will bring up a similar topic (simple and complex multiplications modulo M61) later in this thread.   2004-02-29, 17:27   #11
Dresdenboy

Apr 2003
Berlin, Germany

192 Posts Quote:
 Originally Posted by ewmayer Another time-saving idea I came up with when Peter Montgormery, Jason Papadopoulos and I were first working out the details of Mersenne-mod FGT: since nearly all of the modmuls in the FGT are by (presumably precomputed) transform "twiddle factors" (the modular roots of unity, analogous to the sines and cosines of an FFT), for arithmetic modulo M61, we can save some operations by premultiplying these twiddles by 8, which causes the "folding" point of the mod to be aligned precisely on the boundary of two 64-bit words, rather than between bits 60 and 61 of the lower word. Dresdenboy, I presume you're also implementing this in your code?
Yes, I do. For that case I also found an even shorter "short reduction" which does a modular reduction and a shift left by 3 (so the twiddle factor won't need to be preshifted):

Code:
;our number resides in rbx

lea  rcx, [rbx*8]      ; a copy shifted left by 3 - upper bits are removed
shr  rbx, 61           ; move the upper bits right
lea  rax, [rcx+rbx*8]  ; add the lower bits (shifted left) to the upper bits
; and load that sum into rax (where we need it for mul)
mul  qword ptr [twid]  ; just to show, what happens next ...
This combination lessens the negative effect of the requirement to place one factor into rax (for 128bit mul) and shortens the modular reduction code. To do this (and the rest of calculations) with less instructions (x86 and/or uOps) we'd need a 64bit ARM CPU.

The result of the first two instructions could also be used to do some 128bit adds (add+adc) with a lot of space for carries and quick modular reduction later (shr+add). But that will only be useful in larger butterflies and if we only need to do one ore two more adds and a reduction is already necessary.

BTW the next version of Athlon 64/Opteron core (could be the 90nm core) will do 1cycle latency LEAs, if they only do a RISC-like addition (no shift). Now it has a latency of 2. The core will also have SSE3, but that won't help us here Last fiddled with by Dresdenboy on 2004-03-01 at 11:07  Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post Derived Number Theory Discussion Group 24 2016-09-08 11:45 mickfrancis Factoring 2 2016-05-06 08:13 fivemack Computer Science & Computational Number Theory 2 2015-09-18 12:54 fivemack Msieve 1 2011-03-14 23:43 T.Rex Math 7 2009-03-13 10:46

All times are UTC. The time now is 01:59.

Mon Oct 26 01:59:15 UTC 2020 up 45 days, 23:10, 0 users, load averages: 1.43, 1.67, 1.66