20190907, 00:34  #1 
Apr 2019
5·41 Posts 
A fast C/C++ function for base2 modular exponentation
Hello, I'm writing my own C++ program for Mersenne trial factoring and seeing how fast I can make it.
My program uses GMP lib for some of the math, but I found the "mpz_powm*" functions to be not particularly fast. This is my first go at a function using "native types" (+ compiler extension __uint128_t ) and its limited to factors up to 64bits: Code:
// return Mod(2,m)^e uint64_t pow2mod(uint64_t e, uint64_t m) { int lz = __builtin_clzl(e); int bits_remain = (lz > 58) ? 0 : 58  lz; __uint128_t temp = 1ul << ((e >> bits_remain) & 63); for(; bits_remain >= 6; ) { temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; temp = (temp*temp) % m; bits_remain = 6; temp = (temp*(1ul << ((e >> bits_remain) & 63))) % m; } switch(bits_remain) { case 5: temp = (temp*temp) % m; case 4: temp = (temp*temp) % m; case 3: temp = (temp*temp) % m; case 2: temp = (temp*temp) % m; case 1: temp = (temp*temp) % m; } return (temp*(1ul << ((e & ((1<<bits_remain)1))))) % m; } But I'm wondering what I could do to make it faster still? Without resorting to handcrafted assembly. Or ideas to improve the upper range past 64bits (for the modulus of course, not the exponent) without sacrificing speed? I know TF'ing past 64bits has pretty much already been covered for Primenet's exponents up to 1e9 (by TJAOI), so I'm specifically doing this for mersenne.ca's 1e10 range, much of which is still only factored to 55bits. The program is still at a very early/rough stage, but I plan to share it all with source once its a bit more fleshed out. Its purpose is for fast bulk factoring over large ranges of exponents. 
20190907, 15:44  #2  
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
3^{4}×5^{3} Posts 
Quote:
I'm thinking about substantive improvements. One possibility is that instead of using (implied) division, perhaps it may be faster to multiply by the modular inverse of m. Another: perhaps you can exploit the structure of m for the case of Mersenne factors. 

20190907, 16:10  #3  
Romulan Interpreter
Jun 2011
Thailand
3×2,957 Posts 
Quote:
In programming languages and algebra systems a variable keeps its type (see Pari), regardless of what operations you do with it, therefore Mod(x, y) is a variable whose type is "mod" (which is different from the integer type). Mod(5,3) is not 2, but Mod(2,3). Similar, Mod(2,5)^3 is Mod(3,5), which in math notation is "2^3=3 (mod 5)". Which is right. Code:
GP/PARI CALCULATOR Version 2.11.1 (released) amd64 running mingw (x8664/GMP6.1.2 kernel) 64bit version compiled: Nov 21 2018, gcc version 6.3.0 20170516 (GCC) threading engine: single (readline v6.2 disabled, extended help enabled) Copyright (C) 20002018 The PARI Group PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER. Type ? for help, \q to quit. Type ?17 for how to get moral (and possibly technical) support. parisize = 10000000, primelimit = 1000000 (23:07:32) gp:> Mod(2,7)^5 %1 = Mod(4, 7) (23:07:48) gp:> Exploiting the structure of m for mersenne factors  we don't understand what you mean, when he does TF, he does not do "mod m", it is just an "unfortunate" choice for the name of the variable, confusing with the "standard" notation. Using the standard notation, m=2^p1, q=2kp+1 is a factor, what he need to repeatedly do, is "mod q" (i.e. his formal parameter is m, but he call the function with actual parameter q). And for q there is not much a structure he can exploit. Please enlighten us, maybe we are missing something. Last fiddled with by LaurV on 20190907 at 16:27 

20190907, 16:48  #4  
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
3^{4}×5^{3} Posts 
Quote:
You young guy should learn how to read what is written, not what you believe ought to have been written. 

20190907, 17:44  #5 
Apr 2019
5×41 Posts 
Its just mathematical shorthand in comments, not a big deal.
Anyways if the comment was adhering to pure C++ the "^" operator would be interpreted as XOR instead of exponentiation too. 
20190907, 21:14  #6 
Apr 2019
315_{8} Posts 
I've been trying some microoptimizations to little success.
I thought since the same exponent is used over and over, I could preprocess it to some degree: count leading zeroes ahead of time, split into 6bit chunks and pass pointer to array of those, etc. But none of those resulted in a speed improvement. I only found one tiny change that seems to provide consistent improvement of about 0.75% Replacing the multiplication by a shifted one, with just directly shifting the temp value: Code:
(temp*(1ul << ((e >> bits_remain) & 63))) % m; // becomes (temp << ((e >> bits_remain) & 63)) % m; I'll keep working on it though, now that I'm out of other, easier ideas to try. 
20190908, 00:57  #7 
Apr 2019
5×41 Posts 
OK, I think I have the Barrett reduction working, and it looks like I got a 21% improvement over my previous implementation. Not bad, but I was hoping Barrett would boost things more than that.
There's one big 128x128 multiplication, which I might be doing in a really inefficient way. Not sure if there's a better way to extract the low and high words, or to handle overflows in the summation. Only the upper 128bits are kept for the result anyways, per the algorithm. Not sure if my implementation is 100% correct with regard to error terms, but seems to work for the inputs I've been testing. I based this on the definition from Handbook of Applied Cryptography Ch 14. Algorithm 14.42 The chapters are available online: http://cacr.uwaterloo.ca/hac/ I chose parameters b=2^64 and k=1, using the nomenclature from HAC. Then tried to simplify the math as much as I could based on those values. Code:
#define H64(x) (x>>64) #define L64(x) ((__uint128_t)((uint64_t)(x))) #define _mulshr128(xh,xl,yh,yl) (xh*yh+(((((xh*yl)>>2)+((xl*yh)>>2)+(H64(xl*yl)>>2)))>>62)) #define MULSHR128(x,y) _mulshr128(H64(x),L64(x),H64(y),L64(y)) inline __uint128_t reduce( __uint128_t mu, __uint128_t x, uint64_t m) { __uint128_t r = x  MULSHR128(x, mu)*m; while (r >= m) r = m; return r; } uint64_t pow2mod_barrett(uint64_t e, uint64_t m) { // supposed to be b^2/m, but best I can do is (b^21)/m , difference should be negligible? __uint128_t mu = (~__uint128_t(0))/m; int lz = __builtin_clzl(e); int bits_remain = (lz > 58) ? 0 : 58  lz; __uint128_t temp = 1ul << ((e >> bits_remain) & 63); for (; bits_remain >= 6;) { temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); temp = reduce(mu, temp*temp, m); bits_remain = 6; temp = reduce(mu, temp << ((e >> bits_remain) & 63), m); } switch (bits_remain) { case 5: temp = reduce(mu, temp*temp, m); case 4: temp = reduce(mu, temp*temp, m); case 3: temp = reduce(mu, temp*temp, m); case 2: temp = reduce(mu, temp*temp, m); case 1: temp = reduce(mu, temp*temp, m); } return reduce(mu, temp << ((e & ((1 << bits_remain)  1))), m); } 
20190908, 01:55  #8  
"Graham uses ISO 8601"
Mar 2014
AU, Sydney
241 Posts 
Quote:
p.s. Would it be warranted for us each to expose ourselves, I mean age wise! Maybe it suffices to say that we are all 60+ Last fiddled with by snme2pm1 on 20190908 at 02:46 

20190908, 03:35  #9 
Apr 2019
5·41 Posts 
Any other ideas for further improvement?
I'm wondering if there's any particular optimizations specific to "sqrmod" operations that I'm not taking advantage of. Also, would it be worth trying Montgomery reduction instead of Barrett? Or maybe I should focus my efforts on the sieving part of the program at this point. 
20190908, 04:48  #10  
"Graham uses ISO 8601"
Mar 2014
AU, Sydney
241 Posts 
I've had experience with MR, but not Barrett, but note that Barrett appears to feature more highly in TF work.
Quote:


20190908, 04:49  #11 
P90 years forever!
Aug 2002
Yeehaw, FL
7138_{10} Posts 
The very first temp*temp can be eliminated (shift the starting value more). The very first reduce can probably be simplified because x is a power of two.

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Requestion for fast chebyshev theta function  danaj  Computer Science & Computational Number Theory  9  20180331 14:59 
Do normal adults give themselves an allowance? (...to fast or not to fast  there is no question!)  jasong  jasong  35  20161211 00:57 
Fast modular reduction for primes < 512 bits?  BenR  Computer Science & Computational Number Theory  2  20160327 00:37 
Fast Approximate Ceiling Function  R.D. Silverman  Programming  27  20101119 17:50 
Base6 speed for prime testing vs. base2  jasong  Conjectures 'R Us  36  20100803 06:25 