mersenneforum.org A fast C/C++ function for base-2 modular exponentation
 Register FAQ Search Today's Posts Mark Forums Read

 2019-09-07, 00:34 #1 hansl     Apr 2019 5·41 Posts A fast C/C++ function for base-2 modular exponentation Hello, I'm writing my own C++ program for Mersenne trial factoring and seeing how fast I can make it. My program uses GMP lib for some of the math, but I found the "mpz_powm*" functions to be not particularly fast. This is my first go at a function using "native types" (+ compiler extension __uint128_t ) and its limited to factors up to 64-bits: Code: // return Mod(2,m)^e uint64_t pow2mod(uint64_t e, uint64_t m) { int lz = __builtin_clzl(e); int bits_remain = (lz > 58) ? 0 : 58 - lz; __uint128_t temp = 1ul << ((e >> bits_remain) & 63); for(; bits_remain >= 6; ) { temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; bits_remain -= 6; temp = (temp*(1ul << ((e >> bits_remain) & 63))) % m; } switch(bits_remain) { case 5: temp = (temp*temp) % m; case 4: temp = (temp*temp) % m; case 3: temp = (temp*temp) % m; case 2: temp = (temp*temp) % m; case 1: temp = (temp*temp) % m; } return (temp*(1ul << ((e & ((1<
2019-09-07, 15:44   #2
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

34×53 Posts

Quote:
 Originally Posted by hansl Hello, I'm writing my own C++ program for Mersenne trial factoring and seeing how fast I can make it. My program uses GMP lib for some of the math, but I found the "mpz_powm*" functions to be not particularly fast. This is my first go at a function using "native types" (+ compiler extension __uint128_t ) and its limited to factors up to 64-bits: Code: // return Mod(2,m)^e uint64_t pow2mod(uint64_t e, uint64_t m) { int lz = __builtin_clzl(e); int bits_remain = (lz > 58) ? 0 : 58 - lz; __uint128_t temp = 1ul << ((e >> bits_remain) & 63); for(; bits_remain >= 6; ) { temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; bits_remain -= 6; temp = (temp*(1ul << ((e >> bits_remain) & 63))) % m; } switch(bits_remain) { case 5: temp = (temp*temp) % m; case 4: temp = (temp*temp) % m; case 3: temp = (temp*temp) % m; case 2: temp = (temp*temp) % m; case 1: temp = (temp*temp) % m; } return (temp*(1ul << ((e & ((1<
A minor point about the first comment. I believe you are returning (2^e) mod m, not (2 mod m) ^e.

One possibility is that instead of using (implied) division, perhaps it may be faster to multiply by the modular inverse of m.

Another: perhaps you can exploit the structure of m for the case of Mersenne factors.

2019-09-07, 16:10   #3
LaurV
Romulan Interpreter

Jun 2011
Thailand

3×2,957 Posts

Quote:
 Originally Posted by xilman A minor point about the first comment. I believe you are returning (2^e) mod m, not (2 mod m) ^e.
You are both right, and we prefer hansl notation here. You old guy should learn some Pari
In programming languages and algebra systems a variable keeps its type (see Pari), regardless of what operations you do with it, therefore Mod(x, y) is a variable whose type is "mod" (which is different from the integer type). Mod(5,3) is not 2, but Mod(2,3). Similar, Mod(2,5)^3 is Mod(3,5), which in math notation is "2^3=3 (mod 5)". Which is right.

Code:
                                          GP/PARI CALCULATOR Version 2.11.1 (released)
amd64 running mingw (x86-64/GMP-6.1.2 kernel) 64-bit version
compiled: Nov 21 2018, gcc version 6.3.0 20170516 (GCC)
(readline v6.2 disabled, extended help enabled)

Copyright (C) 2000-2018 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.
Type ? for help, \q to quit.
Type ?17 for how to get moral (and possibly technical) support.
parisize = 10000000, primelimit = 1000000

(23:07:32) gp:> Mod(2,7)^5
%1 = Mod(4, 7)
(23:07:48) gp:>
However, you are totally right about multiplying with the modular inverse, a Barrett reduction will work faster for exponentiation.

Exploiting the structure of m for mersenne factors - we don't understand what you mean, when he does TF, he does not do "mod m", it is just an "unfortunate" choice for the name of the variable, confusing with the "standard" notation. Using the standard notation, m=2^p-1, q=2kp+1 is a factor, what he need to repeatedly do, is "mod q" (i.e. his formal parameter is m, but he call the function with actual parameter q). And for q there is not much a structure he can exploit. Please enlighten us, maybe we are missing something.

Last fiddled with by LaurV on 2019-09-07 at 16:27

2019-09-07, 16:48   #4
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

34×53 Posts

Quote:
 Originally Posted by LaurV You are both right, and we prefer hansl notation here. You old guy should learn some Pari In programming languages and algebra systems a variable keeps its type (see Pari), regardless of what operations you do with it, therefore Mod(x, y) is a variable whose type is "mod" (which is different from the integer type). Mod(5,3) is not 2, but Mod(2,3). Similar, Mod(2,5)^3 is Mod(3,5), which in math notation is "2^3=3 (mod 5)". Which is right.
True, but hansl explicitly stated that he was using C++ types, not Pari types. In raw C++ there is no native type "mod".

You young guy should learn how to read what is written, not what you believe ought to have been written.

 2019-09-07, 17:44 #5 hansl     Apr 2019 5×41 Posts Its just mathematical shorthand in comments, not a big deal. Anyways if the comment was adhering to pure C++ the "^" operator would be interpreted as XOR instead of exponentiation too.
 2019-09-07, 21:14 #6 hansl     Apr 2019 3158 Posts I've been trying some micro-optimizations to little success. I thought since the same exponent is used over and over, I could pre-process it to some degree: count leading zeroes ahead of time, split into 6-bit chunks and pass pointer to array of those, etc. But none of those resulted in a speed improvement. I only found one tiny change that seems to provide consistent improvement of about 0.75% Replacing the multiplication by a shifted one, with just directly shifting the temp value: Code:  (temp*(1ul << ((e >> bits_remain) & 63))) % m; // becomes (temp << ((e >> bits_remain) & 63)) % m; I've also started trying to implement Barrett reduction, but haven't quite figured out that algorithm yet. For example I'm not sure how to choose optimal word size and/or "k" value. I'll keep working on it though, now that I'm out of other, easier ideas to try.
 2019-09-08, 00:57 #7 hansl     Apr 2019 5×41 Posts OK, I think I have the Barrett reduction working, and it looks like I got a 21% improvement over my previous implementation. Not bad, but I was hoping Barrett would boost things more than that. There's one big 128x128 multiplication, which I might be doing in a really inefficient way. Not sure if there's a better way to extract the low and high words, or to handle overflows in the summation. Only the upper 128bits are kept for the result anyways, per the algorithm. Not sure if my implementation is 100% correct with regard to error terms, but seems to work for the inputs I've been testing. I based this on the definition from Handbook of Applied Cryptography Ch 14. Algorithm 14.42 The chapters are available online: http://cacr.uwaterloo.ca/hac/ I chose parameters b=2^64 and k=1, using the nomenclature from HAC. Then tried to simplify the math as much as I could based on those values. Code: #define H64(x) (x>>64) #define L64(x) ((__uint128_t)((uint64_t)(x))) #define _mulshr128(xh,xl,yh,yl) (xh*yh+(((((xh*yl)>>2)+((xl*yh)>>2)+(H64(xl*yl)>>2)))>>62)) #define MULSHR128(x,y) _mulshr128(H64(x),L64(x),H64(y),L64(y)) inline __uint128_t reduce( __uint128_t mu, __uint128_t x, uint64_t m) { __uint128_t r = x - MULSHR128(x, mu)*m; while (r >= m) r -= m; return r; } uint64_t pow2mod_barrett(uint64_t e, uint64_t m) { // supposed to be b^2/m, but best I can do is (b^2-1)/m , difference should be negligible? __uint128_t mu = (~__uint128_t(0))/m; int lz = __builtin_clzl(e); int bits_remain = (lz > 58) ? 0 : 58 - lz; __uint128_t temp = 1ul << ((e >> bits_remain) & 63); for (; bits_remain >= 6;) { temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); bits_remain -= 6; temp = reduce(mu, temp << ((e >> bits_remain) & 63), m); } switch (bits_remain) { case 5: temp = reduce(mu, temp*temp, m); case 4: temp = reduce(mu, temp*temp, m); case 3: temp = reduce(mu, temp*temp, m); case 2: temp = reduce(mu, temp*temp, m); case 1: temp = reduce(mu, temp*temp, m); } return reduce(mu, temp << ((e & ((1 << bits_remain) - 1))), m); }
2019-09-08, 01:55   #8
snme2pm1

"Graham uses ISO 8601"
Mar 2014
AU, Sydney

241 Posts

Quote:
 Originally Posted by xilman True, but hansl explicitly stated that he was using C++ types, not Pari types. In raw C++ there is no native type "mod". You young guy should learn how to read what is written, not what you believe ought to have been written.
Entirely off topic, I had imagined that since xilman and LaurV both knew about mud mold capacitors like me, we must be of similar vintage?
p.s. Would it be warranted for us each to expose ourselves, I mean age wise!
Maybe it suffices to say that we are all 60+

Last fiddled with by snme2pm1 on 2019-09-08 at 02:46

 2019-09-08, 03:35 #9 hansl     Apr 2019 5·41 Posts Any other ideas for further improvement? I'm wondering if there's any particular optimizations specific to "sqrmod" operations that I'm not taking advantage of. Also, would it be worth trying Montgomery reduction instead of Barrett? Or maybe I should focus my efforts on the sieving part of the program at this point.
2019-09-08, 04:48   #10
snme2pm1

"Graham uses ISO 8601"
Mar 2014
AU, Sydney

241 Posts

I've had experience with MR, but not Barrett, but note that Barrett appears to feature more highly in TF work.
Quote:
 Originally Posted by hansl Any other ideas for further improvement? I'm wondering if there's any particular optimizations specific to "sqrmod" operations that I'm not taking advantage of. Also, would it be worth trying Montgomery reduction instead of Barrett? Or maybe I should focus my efforts on the sieving part of the program at this point.

 2019-09-08, 04:49 #11 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 713810 Posts The very first temp*temp can be eliminated (shift the starting value more). The very first reduce can probably be simplified because x is a power of two.

 Similar Threads Thread Thread Starter Forum Replies Last Post danaj Computer Science & Computational Number Theory 9 2018-03-31 14:59 jasong jasong 35 2016-12-11 00:57 BenR Computer Science & Computational Number Theory 2 2016-03-27 00:37 R.D. Silverman Programming 27 2010-11-19 17:50 jasong Conjectures 'R Us 36 2010-08-03 06:25

All times are UTC. The time now is 03:28.

Tue Oct 27 03:28:16 UTC 2020 up 47 days, 39 mins, 1 user, load averages: 2.31, 2.33, 1.94