20060526, 08:52  #1 
Jun 2003
1,579 Posts 
Modular multiplication and addition
I am trying to multiply 2, 31 bit integers and then add a number to the result; modulo a prime p(which is 31 bit)
I am using a 64 bit integer to store the product, multiplication done mod p. Then adding the result to some other integer, mod p. Q. Is there a faster way to do this? After the multiplication step, I do not need 64 bit precision. Is there a way to reduce it to a 32 bit number? Assuming that 64 bit additions are slower than 32 bit additions? Would reducing the number to 32 bit help at all. Thanks for the answer in advance. edit: Doing this in VC++ 6.0 Last fiddled with by Citrix on 20060526 at 08:52 
20060526, 13:29  #2 
Jul 2005
602_{8} Posts 
Does p remain constant over many calculations?
If not: Look down about 8 posts here: http://groups.google.com/group/sci.c...s&rnum=5&hl=en If p does remain constant then look at Montgomery multiplication. The _Handbook of Applied Cryptography_ has a good description of Montgomery Multiplication. If you choose a word size of 32bits then you can use a much simpler version of the code. 
20060527, 02:55  #3 
Jun 2003
1,579 Posts 
Thank you for the reply. P remains constant over several multiplications. I have read on how you calculate xy*2^n to save over the divison step. But how do you optomize this further to speed the code up if you are using the same p for several multplications.
Since in each case x and y are different. For calculating b^e it might help, but if x, y are changing how do you optomize it. edit: Is the time to add 2 numbers stored as 32 bit integers the same as adding 2 numbers stored as 64 bit numbers, even when the numbers are less than say 2^16? Last fiddled with by Citrix on 20060527 at 03:02 
20060527, 04:18  #4 
Sep 2002
1226_{8} Posts 
Using 32 bit ALU registers in assembly
the 64 bit adds require twice the instructions. ALIGN 16 num1_32 dd 30000 num2_32 dd 1000 num1_64 dd 30000, 0 ; could be one dq (64 bit number) num2_64 dd 1000, 0 num3_64 dq 30000 mov eax, num2_32 add num1_32, eax versus mov eax, num2_64 mov edx, num2_64 + 4 add num1_64, eax adc num1_64, edx If the values were all in registers it would be add eax, ebx versus add eax, ebx adc edx, ecx In SSE2 there are instructions that work on 64 bit integers. (There probably is better SSE2 code, I am far more familar with ALU instructions. There are instructions that add the two 64 bit halves of a 128 bit xmm register.) ALIGN 16 num1_64 dq 16386, 0 ; dq is 64 bit integer also called qword num2_64 dq 1024, 0 num1_128 do 00000000000040020000000000000400 movq mm0, num2_64 paddq mm0, num1_64 
20060527, 04:25  #5 
Sep 2002
2·331 Posts 
Also check a post in this forum by Dresdenboy
Fast calculations modulo small mersenne primes like M61 http://www.mersenneforum.org/showthread.php?t=1955 
20060703, 01:32  #6 
Jan 2005
Caught in a sieve
110001011_{2} Posts 
Chris Monico's old ECCP109 project would appear to describe the solution to your problem. The problem there was to multiply 2 128bit (okay, 109bit) numbers, then take them modulus a constant 109bit number, so this should be a cinch by comparison!
Here's his comments in the assembly code after the multiplication is done: Code:
/***************************************************************/ /* The 218bit product is now stored in t[0],...,t[6]. Our next mission, should we choose to accept it, is to reduce this number mod 'p'. We proceed in several steps: 1) Notice that we can write: t = (a 128bit number) + 2^{128}t4 + 2^{160}t5 + 2^{196}t6 Where t4,t5,t6 each have 32bits. Furthermore, we've precomputed the reductions : r4 := 2^128 mod p = 7f0 55455b5c 27a9ef92 540f3986 (107 bits) r5 := 2^160 mod p = acc f5a4e730 b2ec7424 32d0a6bf (108 bits) r6 := 2^192 mod p = a7c a3ea9cfb 531af0b6 b786c901 (108 bits) So that r4*t4 will have 32+107 = 139 bits r5*t5 will have 32+108 = 140 bits r6*t6 will have 22+108 = 130 bits Let's look carefully at what we'll do: t = (a 128bit number) + 2^{128}t4 + 2^{160}t5 + 2^{196}t6 = (128 bits) + (139 bits) + (140 bits) + (130 bits) = (128 bits + 139 bits) + (140 bits + 130 bits) <= (140 bits) + (141 bits) <= (142 bits) 2) Use the precomputed tables (T) to reduce it to a number in [0,2p) 3) Perform a single subtraction, if necessary, to get it into [0,p) */ /***************************************************************/ You can get the entire mess from my web site. Unfortunately, the good comments are only in the AT&Tsyntax assembly, which is backwards to the standard Intel assembly (last argument first), and only compiles with certain Linuxstyle assemblers (gcc, et al). Good luck! P.S. If your modulus system is good enough, and you know the sum fits in 64 bits, it can be worthwhile to add a 32bit number to the 64bit number before taking the modulus. See my final version for that kind of code. Last fiddled with by Ken_g6 on 20060703 at 01:35 
20070126, 01:00  #7 
Mar 2003
New Zealand
13·89 Posts 
In case anyone is still interested in multiplying a*b (mod p) for 0 <= a,b < p < 2^31, here are some SSE2 functions that work out faster than the usual mul/div code provided p is fixed for a reasonable number of calcualtions. If b is also fixed then it gets even faster.

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Possible addition for the smiling Smilies?  schickel  Forum Feedback  40  20120815 00:40 
CPU clocks statistics (TF, modular multiplication)  tichy  Software  1  20101215 08:13 
Best practices in addition chains  SPWorley  Programming  10  20090728 13:50 
Addition to GMPECM...  WraithX  GMPECM  6  20070306 07:56 
New addition to the family!  SlashDude  15k Search  5  20040313 21:28 