mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   Modular multiplication and addition (https://www.mersenneforum.org/showthread.php?t=5922)

Citrix 2006-05-26 08:52

Modular multiplication and addition
 
I am trying to multiply 2, 31 bit integers and then add a number to the result; modulo a prime p(which is 31 bit)

I am using a 64 bit integer to store the product, multiplication done mod p.
Then adding the result to some other integer, mod p.

Q. Is there a faster way to do this?
After the multiplication step, I do not need 64 bit precision. Is there a way to reduce it to a 32 bit number? Assuming that 64 bit additions are slower than 32 bit additions? Would reducing the number to 32 bit help at all.


Thanks for the answer in advance.:bow:

edit: Doing this in VC++ 6.0

Greenbank 2006-05-26 13:29

Does p remain constant over many calculations?

If not:-

Look down about 8 posts here:-

[url]http://groups.google.com/group/sci.crypt/browse_frm/thread/16584f78726c6824/048e74f0b0707882?lnk=st&q=64+bit+integer+multiply+modulus&rnum=5&hl=en[/url]

If p does remain constant then look at Montgomery multiplication.

The [URL="http://www.cacr.math.uwaterloo.ca/hac/"]_Handbook of Applied Cryptography_[/URL] has a good description of Montgomery Multiplication.

If you choose a word size of 32-bits then you can use a much simpler version of the code.

Citrix 2006-05-27 02:55

Thank you for the reply. P remains constant over several multiplications. I have read on how you calculate xy*2^-n to save over the divison step. But how do you optomize this further to speed the code up if you are using the same p for several multplications.

Since in each case x and y are different. For calculating b^e it might help, but if x, y are changing how do you optomize it.

edit:

Is the time to add 2 numbers stored as 32 bit integers the same as adding 2 numbers stored as 64 bit numbers, even when the numbers are less than say 2^16?

dsouza123 2006-05-27 04:18

Using 32 bit ALU registers in assembly
the 64 bit adds require twice the instructions.

ALIGN 16
num1_32 dd 30000
num2_32 dd 1000

num1_64 dd 30000, 0 ; could be one dq (64 bit number)
num2_64 dd 1000, 0

num3_64 dq 30000

mov eax, num2_32
add num1_32, eax

versus

mov eax, num2_64
mov edx, num2_64 + 4
add num1_64, eax
adc num1_64, edx


If the values were all in registers it would be

add eax, ebx

versus

add eax, ebx
adc edx, ecx

In SSE2 there are instructions that work on 64 bit integers.
(There probably is better SSE2 code, I am far more familar with ALU instructions.
There are instructions that add the two 64 bit halves of a 128 bit xmm register.)

ALIGN 16
num1_64 dq 16386, 0 ; dq is 64 bit integer also called qword
num2_64 dq 1024, 0

num1_128 do 00000000000040020000000000000400

movq mm0, num2_64
paddq mm0, num1_64

dsouza123 2006-05-27 04:25

Also check a post in this forum by Dresdenboy

Fast calculations modulo small mersenne primes like M61
[url]http://www.mersenneforum.org/showthread.php?t=1955[/url]

Ken_g6 2006-07-03 01:32

Chris Monico's old ECCP-109 project would appear to describe the solution to your problem. The problem there was to multiply 2 128-bit (okay, 109-bit) numbers, then take them modulus a constant 109-bit number, so this should be a cinch by comparison!

Here's his comments in the assembly code after the multiplication is done:
[code]/***************************************************************/
/* The 218-bit product is now stored in t[0],...,t[6].
Our next mission, should we choose to accept it, is to
reduce this number mod 'p'.
We proceed in several steps:
1) Notice that we can write:
t = (a 128-bit number) + 2^{128}t4 + 2^{160}t5 + 2^{196}t6
Where t4,t5,t6 each have 32-bits.
Furthermore, we've precomputed the reductions :
r4 := 2^128 mod p = 7f0 55455b5c 27a9ef92 540f3986 (107 bits)
r5 := 2^160 mod p = acc f5a4e730 b2ec7424 32d0a6bf (108 bits)
r6 := 2^192 mod p = a7c a3ea9cfb 531af0b6 b786c901 (108 bits)

So that
r4*t4 will have 32+107 = 139 bits
r5*t5 will have 32+108 = 140 bits
r6*t6 will have 22+108 = 130 bits

Let's look carefully at what we'll do:
t = (a 128-bit number) + 2^{128}t4 + 2^{160}t5 + 2^{196}t6
= (128 bits) + (139 bits) + (140 bits) + (130 bits)
= (128 bits + 139 bits) + (140 bits + 130 bits)
<= (140 bits) + (141 bits)
<= (142 bits)


2) Use the precomputed tables (T) to reduce it to a number
in [0,2p)
3) Perform a single subtraction, if necessary, to get it into
[0,p)
*/
/***************************************************************/
[/code]
I believe the code does this at least once more on the 142-bit number.

You can get the entire mess from [url=http://home.att.net/~k.brazier/programs/eccp109/]my web site[/url]. Unfortunately, the good comments are only in the AT&T-syntax assembly, which is backwards to the standard Intel assembly (last argument first), and only compiles with certain Linux-style assemblers (gcc, et al).

Good luck!

P.S. If your modulus system is good enough, and you know the sum fits in 64 bits, it can be worthwhile to add a 32-bit number to the 64-bit number before taking the modulus. See my final version for that kind of code.

geoff 2007-01-26 01:00

1 Attachment(s)
In case anyone is still interested in multiplying a*b (mod p) for 0 <= a,b < p < 2^31, here are some SSE2 functions that work out faster than the usual mul/div code provided p is fixed for a reasonable number of calcualtions. If b is also fixed then it gets even faster.


All times are UTC. The time now is 23:22.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.