mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2004-01-19, 19:07   #1
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default Fast calculations modulo small mersenne primes like M61

In this thread we could collect nice tricks and interesting code fragments (HLL or assembler) regarding integer operations modulo M61, M31 or other mersenne primes. These are important if one wants to implement a Fast Galois Transform.

My contributions will concentrate on 64bit x86 code, but solutions for other CPUs (or 32bit x86) are welcome too

First I will give some basics. People who're familiar with this stuff could just skip the following part.



The most important thing to know is how the modular reduction can be done. For mersenne primes this is easy:

First we separate the number to be reduced into 2 parts: x lower bits (x is the exponent of the mersenne prime 2^x - 1, e.g. 61 in case of M61) and y higher bits, which begin at bit position x (if we start counting from 0).

Finally the reduction happens by adding the y higher bits (after shifting them right to index 0) to the lower x bits and repeating this until no higher bits are left. If the result equals M61 then we could set it to zero to have a valid number mod M61. I wrote "could" because we could also leave it holding the value M61 and it will still work as expected. A small example will show how it works:

We want to reduce 632 using M5 (31).
Code:
632 = 1001111000b
M5 = 2^5 - 1 = 11111b
632 split into 2 parts:
10011 11000
addition:
  11000 (24)
+ 10011 (19)
=101011 (43)

There is still one high bit left, so we do another iteration:
splitting 43: 1 01011
addition: 1011 + 1 = 1100 (12)
632 == 12 (mod M5)


Usual arithmetic operations (addition, subtraction, multiplication) modulo a mersenne prime can be implemented by using standard integer operations in combination with a modular reduction of the result.

Some possible ways to represent and use such integers are shown below:

The simplest way is to choose a word size large enough to hold the numbers during calculations and do modular reductions whenever they are necessary (i.e. before an overlow might occur).

In case of M61 we can use the available 64bit general purpose registers and still have 3 bits left. This works fine and results don't need to be reduced all the time.

Another neat representation is possible by splitting the number into a low part with x bits (x=width of the modulo) and a high part holding the bits which will add up during calculations. Here it is useful to shift the lower part to the left to match it's most significant bit with the msb of the register.

Now let's assume we have the result of such an operation in val (big enough to hold all bits). The reduction mod M61 could look like this (assuming that val is a 64bit unsigned int):
Code:
val = (val & M61) + (val>>61);
val = (val & M61) + (val>>61); //necessary if previous add led to a val > M61
val = (val==M61)? 0 : val ; //actually not necessary
So, that's all for the beginning.

Last fiddled with by Dresdenboy on 2004-01-27 at 16:15
Dresdenboy is offline   Reply With Quote
Old 2004-01-19, 22:35   #2
dsouza123
 
dsouza123's Avatar
 
Sep 2002

2×331 Posts
Default

Is the modular reduction using the algorithm/example valid for
other numbers that are one less than a power of two but aren't mersenne primes ?

I tested 255 and it worked.

m = 2^8 - 1 = 255

v = 512 = 10 00000000b

v = (512 and 255) + 2 = 2 = 00000000b + 10b = 10b
v = 2

2 = 512 mod 255

Here is delphi code using longwords ( 32bit unsigned int, dword )

Code:
Program ModRed;      // 2^x - 1
var
  v  : longword;
  m  : longword;
  mp : longword;
  
begin
  m  := 255;
  mp := m + 1;

  v := 512;
  
  while v > m do
    v := (v and m) + ((v div mp) and m);     // (v shr 8) =  (v div mp)
                                             // and m); in case original  v > 16 bits
  if (v = m) then
    v := 0;

  writeln(v);
end.
And the equivalent in windows 32bit assembly ( MASM32 ) with psuedo pascal style assembly comments.

Code:
      .586
      .model flat, stdcall
      option casemap :none

;#########################################################################

      include \masm32\include\windows.inc
      include \masm32\include\kernel32.inc
      include \masm32\include\user32.inc

      includelib \masm32\lib\kernel32.lib
      includelib \masm32\lib\user32.lib

.data
      v    dd 0
      m    dd 0
      mp   dd 0

      fmtAnswer        db "Answer = %4lu    ",0
      szAnswer         db "                 ",0
      szCaption        db "Modular Reduction",0

.code


start:
  ; -------------------------------------------
  ; Modular Reduction of 2^x -1
  ; -------------------------------------------

     mov     m,0FFh             ;  m   := 255;
     mov     eax,m              ;  eax := m;
     inc     eax                ;  eax := eax + 1;
     mov     mp,eax             ;  mp  := eax;
     mov     v,200h             ;  v   := 512;

     mov     eax,v              ;  eax := v;

     .while (eax > m)
       xor     edx,edx            ;  edx := 0;
       div     mp                 ;  eax := edx:eax div mp;
  
       and     eax,m              ;  eax := eax and m;
       mov     edx,v              ;  edx := v;
       and     edx,m              ;  edx := edx and m; 
       add     eax,edx            ;  eax := eax + edx;
       mov     v,eax              ;  v   := eax;
     .endw
     
     .if (eax == m)
        mov v, 0
     .endif
        
     invoke wsprintf,ADDR szAnswer,ADDR fmtAnswer,v
     invoke MessageBox,NULL,ADDR szAnswer,ADDR szCaption,MB_OK

     mov    eax, 0
     invoke ExitProcess,0
end start
dsouza123 is offline   Reply With Quote
Old 2004-01-20, 01:03   #3
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

3,581 Posts
Default

Quote:
Is the modular reduction using the algorithm/example valid for
other numbers that are one less than a power of two but aren't mersenne primes ?
Yes. I first came across this method in a crytopgraphy book. I was very surprised to see in the coded examples that the carry out was added back into the mod value.

Later, I thought of it this way:
If you were doing, say, MOD 256 on a 16 bit value, the lower 8 bits would be the remainder (and the upper 8 bits would be the quotient). The original 16 bit value would be
q * 28 + r

If I wanted MOD 255 instead, I could consider that in the previous division (which is a process of subtracting 256 repeatedly until the remainder is less than 256); each time I subtracted 256, I really wanted to subtract 255. To adjust the remainder, I would add 1 to the remainder for each time 256 had been subtracted. The amount to add would be in q. Of course r might now be greater than 255 in which case I would add any carry out back into the remainder.

Knuth refers to this method of computing a modulus for a value one less then a power of two in section 4.3.2 of volume 2 (my copy is 3rd edition) of The Art of Computer Programming. He notes that it can be helpful to relax the condition that the remainder is less then the modulus, allowing it to be less than or equal to the modulus. If it is equal to the modulus (2k-1), treat it as if it were 0.

Another way of looking at this is by analogy with casting out nines. To get the remainder modulo 9, we can add all the digits of a base 10 number. To get the remainder modulo 2k-1, we add all the digits of a base 2k number.
In casting out 9s, we use the fact that 10 ≡ 1 (mod 9)
In mod 2k-1, we use the fact that 2k ≡ 1 (mod 2k-1)

Last fiddled with by only_human on 2004-01-20 at 01:07
only_human is offline   Reply With Quote
Old 2004-01-20, 12:26   #4
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

3,581 Posts
Default

With a little more care, a routine can calculate mod 2k+1 almost as fast. By analogy, this is like calculating mod 11 with base 10 numbers. If the modulus m = 11, we note that:
10 ≡ -1, 102 ≡ 1, 103 ≡ -1, 104 ≡ 1, . . . (mod 11)
Therefore, 632 mod 11 is calculated alternatively adding and subtracting digits starting from the least significant digit:
632 ≡ 2 - 3 + 6 = 5 (mod 11).
Using this trick in another base, have the modulus, m, be one greater than the base. Of course for the purpose of these fast routines, the base will be a power of 2.
To calculate 632 mod 33 using binary representation to compare to the example by Dresdenboy:

Code:
632 = 1001111000b
632 split into 2 parts:
10011 11000
subtraction:
   11000 (24)
-  10011 (19)
=  00101 ( 5)
632 == 5 (mod 33)

A little more work is needed here because the number can go negative and also each more significant digit is added or subtracted alternatingly. Also the fact that the modulus is one greater than a power of two requires attention if the power of two used is the computer word size.
only_human is offline   Reply With Quote
Old 2004-01-21, 08:21   #5
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

5518 Posts
Default

Although the first method works for all numbers of the form 2p-1, they should be prime so that they can be used for a so called Fast Galois Transform.

The analogous method for 2p+1 could be useful for NTTs based on primes of such form. I think they are called Fermat Number Transform.

It's nice to bring up the topic of negative numbers. These could also show up during calculations. Then we have a negative number, which is the two's complement of its positive.

In this case we need to use arithmetic shifts (C++ will do that for signed ints).

Now I want to show an x86-64 assembler version of the modular reduction, as described in the first message in this thread.

Imagine, we have a 64bit number stored in RAX and want to reduce it mod M61.
I'll represent it in shortened form: UUULLLLLLLLLLLLL
U = upper bits
L = lower bits
I show 16 instead of 64 to make it more convenient. It's still enough to show what happens.

My best solution to date for the modular reduction is:
Code:
lea   rbx, [rax*8]  ;make a copy, shift it left by 3 - LLLLLLLLLLLLL000
shr   rax, 58       ;move the upper 3 bits to pos 2 -  0000000000UUULLL
and   rax, 111000b  ;clear the other 3 bits -          0000000000UUU000
add   rax, rbx      ;first addition, an overflow will set the carry bit,
                    ;    LLLLLLLLLLLLL000
                    ;  + 0000000000UUU000
adc   rax, 111b     ;addition by a constant and the carry bit
                    ;the carry bit will be added to bit pos 3 actually,
                    ;leaving the bits 2-0 in an undefined state
                    ;(000b or 111b, depending on the carry bit)
                    ;another overflow couldn't have happened during
                    ;the last step, so we finally
shr   rax, 3        ;shift everything back into place
Note: I don't look for the total latency of this operation because my target CPU works out-of-order and is superscalar. Instead it is important to use as few simple instructions as possible.

If the complete reduction (removing all upper bits) is not necessary, we can use a shorter variant:
Code:
mov   rbx, rax      ;make a copy
shr   rax, 61       ;move the upper 3 bits to pos 0 -          0000000000000UUU
and   rbx, M61      ;clear the upper 3 bits -                  000LLLLLLLLLLLLL
add   rax, rbx      ;addition, an "overflow" will set bit 61 - 00ULLLLLLLLLLLLL
(Edit: scroll bars removed, changed one case of different bit numbering )

Last fiddled with by Dresdenboy on 2004-01-27 at 16:20
Dresdenboy is offline   Reply With Quote
Old 2004-01-24, 17:13   #6
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

3,581 Posts
Default

Just for fun I tinkered with the routines. I did not see any improvements offhand. In the first routine I changed the constant added when the carry is added back in to a 0 so it could be replaced with further data of a longer calculation (and I reformated the comments to avoid slide bars on the post). I did not code any of this and may have screwed it up
Code:
lea   rbx, [rax*8]  ;make a copy, shift it left by 3 bits - LLLLLLLLLLLLL000
shr   rax, 58       ;move the upper 3 bits to bit pos 3   - 0000000000UUULLL
                    ; LLLLLLLLLLLLL000 ; rbx
or    rax, 111b     ; 0000000000UUU111 ; rax
add   rax, rbx      ; ================ ; first addition, an overflow sets CF
adc   rax, 0        ;add in the carry bit, propagating it to bit 3,
                    ;leaving the bits 2-0 = 000b if carry set, else = 111b
                    ;another overflow couldn't have happened during this step,
shr   rax, 3        ;so we finally shift everything back into place
The second routine looks more easily extended to a longer calculation. I didn't try that -- I just pinned a tail on it (not that it needed one for its intended purpose).
Code:
mov   rbx, rax      ;make a copy
shr   rax, 61       ;move the upper 3 bits to bits 2,1,0     :0000000000000UUU
and   rbx, M61      ;clear the upper 3 bits (bits 63,62,61)  :000LLLLLLLLLLLLL
add   rax, rbx      ;addition, an "overflow" will set bit 61 :00ULLLLLLLLLLLLL
btr   rax, 61       ;move bit 61 to CF; clear bit 61       CF,000LLLLLLLLLLLLL
adc   rax, 0        ;add CF back into result. No "overflow" happens this step.
only_human is offline   Reply With Quote
Old 2004-01-26, 15:43   #7
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

263F16 Posts
Default

Another time-saving idea I came up with when Peter Montgormery, Jason Papadopoulos and I were first working out the details of Mersenne-mod FGT: since nearly all of the modmuls in the FGT are by (presumably precomputed) transform "twiddle factors" (the modular roots of unity, analogous to the sines and cosines of an FFT), for arithmetic modulo M61, we can save some operations by premultiplying these twiddles by 8, which causes the "folding" point of the mod to be aligned precisely on the boundary of two 64-bit words, rather than between bits 60 and 61 of the lower word. Dresdenboy, I presume you're also implementing this in your code?
ewmayer is offline   Reply With Quote
Old 2004-01-26, 15:55   #8
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

Quote:
Originally Posted by Dresdenboy
It's nice to bring up the topic of negative numbers. These could also show up during calculations. Then we have a negative number, which is the two's complement of its positive.

In this case we need to use arithmetic shifts (C++ will do that for signed ints).
The problem with this is that it is awkward and introduces a branch (or at bes some extra operations) on hardware lacking a carry bit, I.e. most RISC-style hardware. It's also unnecessary. When putting together one's transform, one should simply carefully bound the outputs of each operation (this can be done manually, which is the way I prefer - Jason Papadopoulos has actually written code that automates the procedure, and inserts modular adds and reduction where necessary), and then whenever an output range becomes so large that the next operation it's involved in could lead to overflow, we do a modular reduction (typically a partial one, i.e. we don't care if folding bits 61-63 into bits 0-60 causes a carry into bit 61). If the output range in question has a negative component (e.g. x is in [-3*q, +5*q], where q = M61), we add the needed multiple of M61 to make it nonnegative, then reduce. Since we know the output ranges in advance, we can eliminate any code branches or special runtime handling of negative operands this way.

Last fiddled with by ewmayer on 2004-01-26 at 16:05
ewmayer is offline   Reply With Quote
Old 2004-02-07, 00:02   #9
only_human
 
only_human's Avatar
 
"Gang aft agley"
Sep 2002

3,581 Posts
Default

I just noticed that the pi-hacks forum has been discussing the same topic as we have. Special form moduli for fast reduction in NTTs in messages 1096 through 1120 (Jan 15th--Jan 21st).

There is some extremely sharp code included.
This contributers are Jason Papadopoulos, Décio Luiz Gazzoni Filho, and Carey Bloodworth.
only_human is offline   Reply With Quote
Old 2004-02-29, 17:02   #10
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default

Quote:
Originally Posted by only_human
I just noticed that the pi-hacks forum has been discussing the same topic as we have. Special form moduli for fast reduction in NTTs in messages 1096 through 1120 (Jan 15th--Jan 21st).

There is some extremely sharp code included.
This contributers are Jason Papadopoulos, Décio Luiz Gazzoni Filho, and Carey Bloodworth.
Yep, the pi-hacks thread is interesting. But the code I saw there was some prototype integer SSE2 code for doing modular multiplications. I will bring up a similar topic (simple and complex multiplications modulo M61) later in this thread.
Dresdenboy is offline   Reply With Quote
Old 2004-02-29, 17:27   #11
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default

Quote:
Originally Posted by ewmayer
Another time-saving idea I came up with when Peter Montgormery, Jason Papadopoulos and I were first working out the details of Mersenne-mod FGT: since nearly all of the modmuls in the FGT are by (presumably precomputed) transform "twiddle factors" (the modular roots of unity, analogous to the sines and cosines of an FFT), for arithmetic modulo M61, we can save some operations by premultiplying these twiddles by 8, which causes the "folding" point of the mod to be aligned precisely on the boundary of two 64-bit words, rather than between bits 60 and 61 of the lower word. Dresdenboy, I presume you're also implementing this in your code?
Yes, I do. For that case I also found an even shorter "short reduction" which does a modular reduction and a shift left by 3 (so the twiddle factor won't need to be preshifted):

Code:
;our number resides in rbx

lea  rcx, [rbx*8]      ; a copy shifted left by 3 - upper bits are removed
shr  rbx, 61           ; move the upper bits right
lea  rax, [rcx+rbx*8]  ; add the lower bits (shifted left) to the upper bits
                       ; and load that sum into rax (where we need it for mul)
mul  qword ptr [twid]  ; just to show, what happens next ...
This combination lessens the negative effect of the requirement to place one factor into rax (for 128bit mul) and shortens the modular reduction code. To do this (and the rest of calculations) with less instructions (x86 and/or uOps) we'd need a 64bit ARM CPU.

The result of the first two instructions could also be used to do some 128bit adds (add+adc) with a lot of space for carries and quick modular reduction later (shr+add). But that will only be useful in larger butterflies and if we only need to do one ore two more adds and a reduction is already necessary.

BTW the next version of Athlon 64/Opteron core (could be the 90nm core) will do 1cycle latency LEAs, if they only do a RISC-like addition (no shift). Now it has a latency of 2. The core will also have SSE3, but that won't help us here

Last fiddled with by Dresdenboy on 2004-03-01 at 11:07
Dresdenboy is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Find Mersenne Primes twice as fast? Derived Number Theory Discussion Group 24 2016-09-08 11:45
Sieving with powers of small primes in the Small Prime variation of the Quadratic Sieve mickfrancis Factoring 2 2016-05-06 08:13
On polynomials without roots modulo small p fivemack Computer Science & Computational Number Theory 2 2015-09-18 12:54
Small(ish) matrices on fast computers fivemack Msieve 1 2011-03-14 23:43
Order of 3 modulo a Mersenne prime T.Rex Math 7 2009-03-13 10:46

All times are UTC. The time now is 01:59.

Mon Oct 26 01:59:15 UTC 2020 up 45 days, 23:10, 0 users, load averages: 1.43, 1.67, 1.66

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.