mersenneforum.org  

2019-09-07, 00:34   #1
hansl

A fast C/C++ function for base-2 modular exponentiation

Hello, I'm writing my own C++ program for Mersenne trial factoring and seeing how fast I can make it.

My program uses the GMP library for some of the math, but I found the "mpz_powm*" functions not particularly fast.

This is my first go at a function using "native types" (plus the compiler extension __uint128_t), and it's limited to factors up to 64 bits:
Code:
// return Mod(2,m)^e
// (assumes e > 0, since __builtin_clzl(0) is undefined, and a 64-bit unsigned long for the 1ul shifts)
#include <cstdint>

uint64_t pow2mod(uint64_t e, uint64_t m) {
  int lz = __builtin_clzl(e);
  // bits of e left to process after peeling off the top (up to 6-bit) chunk
  int bits_remain = (lz > 58) ? 0 : 58 - lz;
  // seed with 2^(top chunk); the chunk is at most 63, so this fits in 64 bits
  __uint128_t temp = 1ul << ((e >> bits_remain) & 63);
  // consume 6 exponent bits per iteration: six squarings, then multiply by 2^(next chunk)
  for(; bits_remain >= 6; ) {
    temp = (temp*temp) % m;
    temp = (temp*temp) % m;
    temp = (temp*temp) % m;
    temp = (temp*temp) % m;
    temp = (temp*temp) % m;
    temp = (temp*temp) % m;
    bits_remain -= 6;
    temp = (temp*(1ul << ((e >> bits_remain) & 63))) % m;
  }
  // fewer than 6 bits left: one squaring per remaining bit (intentional fall-through)
  switch(bits_remain) {
    case 5: temp = (temp*temp) % m;
    case 4: temp = (temp*temp) % m;
    case 3: temp = (temp*temp) % m;
    case 2: temp = (temp*temp) % m;
    case 1: temp = (temp*temp) % m;
  }
  // fold in the last bits_remain bits of e
  return (temp*(1ul << ((e & ((1<<bits_remain)-1))))) % m;
}
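As a quick sanity check (my own example, not part of the original post), the function can be tested against a small known factorization: 2^11 - 1 = 2047 = 23 × 89, so pow2mod(11, 23) and pow2mod(11, 89) should both return 1, while a non-factor such as 29 should give a nonzero residue.
Code:
#include <cstdio>
#include <cstdint>

uint64_t pow2mod(uint64_t e, uint64_t m);  // the function above

int main() {
    // 2^11 - 1 = 2047 = 23 * 89
    printf("%llu\n", (unsigned long long)pow2mod(11, 23)); // expect 1 (23 divides 2^11 - 1)
    printf("%llu\n", (unsigned long long)pow2mod(11, 89)); // expect 1 (89 divides 2^11 - 1)
    printf("%llu\n", (unsigned long long)pow2mod(11, 29)); // expect 18 (29 is not a factor)
    return 0;
}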
I was pleased to find that it apparently produces correct results, and it improved my overall speed by ~84% over using GMP's mpz_powm_ui.

But I'm wondering what I could do to make it faster still, without resorting to handcrafted assembly.

Or ideas to push the upper range past 64 bits (for the modulus, of course, not the exponent) without sacrificing speed?

I know TF'ing past 64 bits has pretty much already been covered for Primenet's exponents up to 1e9 (by TJAOI), so I'm specifically doing this for mersenne.ca's 1e10 range, much of which is still only factored to 55 bits.

The program is still at a very early/rough stage, but I plan to share it all with source once it's a bit more fleshed out. Its purpose is fast bulk factoring over large ranges of exponents.
2019-09-07, 15:44   #2
xilman

Quote:
Originally Posted by hansl View Post
[post #1 quoted in full]
A minor point about the first comment. I believe you are returning (2^e) mod m, not (2 mod m)^e.

I'm thinking about substantive improvements.

One possibility: instead of using (implied) division, it may be faster to multiply by the modular inverse of m.

Another: perhaps you can exploit the structure of m for the case of Mersenne factors.
2019-09-07, 16:10   #3
LaurV

Quote:
Originally Posted by xilman View Post
A minor point about the first comment. I believe you are returning (2^e) mod m, not (2 mod m) ^e.
You are both right, and we prefer hansl's notation here. You old guy should learn some Pari.
In programming languages and algebra systems a variable keeps its type (see Pari), regardless of what operations you do with it; therefore Mod(x, y) is a variable whose type is "mod" (which is different from the integer type). Mod(5,3) is not 2, but Mod(2,3). Similarly, Mod(2,5)^3 is Mod(3,5), which in math notation is "2^3 = 3 (mod 5)". Which is right.

Code:
                                          GP/PARI CALCULATOR Version 2.11.1 (released)
                                  amd64 running mingw (x86-64/GMP-6.1.2 kernel) 64-bit version
                                    compiled: Nov 21 2018, gcc version 6.3.0 20170516 (GCC)
                                                    threading engine: single
                                        (readline v6.2 disabled, extended help enabled)

                                             Copyright (C) 2000-2018 The PARI Group

PARI/GP is free software, covered by the GNU General Public License, and comes WITHOUT ANY WARRANTY WHATSOEVER.
Type ? for help, \q to quit.
Type ?17 for how to get moral (and possibly technical) support.
parisize = 10000000, primelimit = 1000000


(23:07:32) gp:> Mod(2,7)^5
 %1 = Mod(4, 7)
(23:07:48) gp:>
However, you are totally right about multiplying with the modular inverse; a Barrett reduction will work faster for exponentiation.

Exploiting the structure of m for Mersenne factors - we don't understand what you mean. When he does TF, he does not do "mod m"; it is just an "unfortunate" choice for the name of the variable, confusing it with the "standard" notation. Using the standard notation, m=2^p-1 and q=2kp+1 is a factor; what he needs to do repeatedly is "mod q" (i.e. his formal parameter is m, but he calls the function with actual parameter q). And for q there is not much structure he can exploit. Please enlighten us, maybe we are missing something.
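To make the q = 2kp+1 point concrete, here is a minimal sketch of that kind of trial-factoring inner loop (my illustration, not hansl's actual program; the helper name tf_range is hypothetical, sieving is omitted, and the q ≡ ±1 (mod 8) filter is the standard extra property of Mersenne factors, not something stated above):
Code:
#include <cstdint>
#include <cstdio>

uint64_t pow2mod(uint64_t e, uint64_t m);  // from post #1

// Try candidate factors q = 2*k*p + 1 of M_p = 2^p - 1 for k = 1..k_max.
void tf_range(uint64_t p, uint64_t k_max) {
    for (uint64_t k = 1; k <= k_max; ++k) {
        uint64_t q = 2 * k * p + 1;
        // divisors of M_p (p an odd prime) must be 1 or 7 mod 8
        uint64_t q8 = q & 7;
        if (q8 != 1 && q8 != 7) continue;
        if (pow2mod(p, q) == 1)  // q divides 2^p - 1
            printf("M%llu has a factor: %llu\n",
                   (unsigned long long)p, (unsigned long long)q);
    }
}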

Last fiddled with by LaurV on 2019-09-07 at 16:27
2019-09-07, 16:48   #4
xilman

Quote:
Originally Posted by LaurV View Post
You are both right, and we prefer hansl's notation here. You old guy should learn some Pari. [...]
True, but hansl explicitly stated that he was using C++ types, not Pari types. In raw C++ there is no native type "mod".

You young guy should learn how to read what is written, not what you believe ought to have been written.
2019-09-07, 17:44   #5
hansl

It's just mathematical shorthand in the comments, not a big deal.

Anyway, if the comment were adhering to pure C++, the "^" operator would be interpreted as XOR rather than exponentiation, too.
2019-09-07, 21:14   #6
hansl

I've been trying some micro-optimizations, with little success.

I thought that since the same exponent is used over and over, I could pre-process it to some degree: count leading zeros ahead of time, split it into 6-bit chunks and pass a pointer to an array of those, etc. But none of those resulted in a speed improvement.

I only found one tiny change that seems to provide a consistent improvement of about 0.75%:

Replacing the multiplication by a shifted 1 with a direct shift of the temp value:
Code:
        (temp*(1ul << ((e >> bits_remain) & 63))) % m;
        // becomes
        (temp << ((e >> bits_remain) & 63)) % m;
I've also started trying to implement Barrett reduction, but haven't quite figured out that algorithm yet. For example, I'm not sure how to choose the optimal word size and/or "k" value.
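For reference, my reading of the textbook recipe (HAC Algorithm 14.42, cited in the next post) specialized to a single-word modulus, i.e. b = 2^64 and k = 1 so every intermediate value fits in 128 bits; this is only a sketch, not a worked error analysis:

\[
\mu = \left\lfloor \frac{2^{128}}{m} \right\rfloor, \qquad
q = \left\lfloor \frac{x\,\mu}{2^{128}} \right\rfloor, \qquad
r = x - q\,m,
\]

followed by subtracting m from r until r < m; the textbook notes that at most two subtractions are needed when mu and q are computed exactly.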

I'll keep working on it though, now that I'm out of other, easier ideas to try.
2019-09-08, 00:57   #7
hansl

OK, I think I have the Barrett reduction working, and it looks like I got a 21% improvement over my previous implementation. Not bad, but I was hoping Barrett would boost things more than that.

There's one big 128x128 multiplication, which I might be doing in a really inefficient way. Not sure if there's a better way to extract the low and high words, or to handle overflows in the summation. Only the upper 128 bits are kept for the result anyway, per the algorithm.
Not sure if my implementation is 100% correct with regard to error terms, but it seems to work for the inputs I've been testing.
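For what it's worth, here is one standard way to get the exact upper 128 bits of a 128x128-bit product from four 64x64→128-bit partial products, carries included; this is my own sketch (the name mulhi128 is made up), not something from this thread, and I haven't benchmarked it against the MULSHR128 macro in the code below:
Code:
#include <cstdint>

// Exact high 128 bits of the 256-bit product x*y (schoolbook, four partial products).
static inline __uint128_t mulhi128(__uint128_t x, __uint128_t y) {
    uint64_t xl = (uint64_t)x, xh = (uint64_t)(x >> 64);
    uint64_t yl = (uint64_t)y, yh = (uint64_t)(y >> 64);
    __uint128_t ll = (__uint128_t)xl * yl;
    __uint128_t lh = (__uint128_t)xl * yh;
    __uint128_t hl = (__uint128_t)xh * yl;
    __uint128_t hh = (__uint128_t)xh * yh;
    // middle column: low halves of the cross products plus the carry out of ll;
    // at most 3*(2^64 - 1), so this cannot overflow a 128-bit word
    __uint128_t mid = (ll >> 64) + (uint64_t)lh + (uint64_t)hl;
    return hh + (lh >> 64) + (hl >> 64) + (mid >> 64);
}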

I based this on Algorithm 14.42 from the Handbook of Applied Cryptography, Chapter 14. The chapters are available online: http://cacr.uwaterloo.ca/hac/

I chose parameters b=2^64 and k=1, using the nomenclature from HAC, then tried to simplify the math as much as I could based on those values.

Code:
#include <cstdint>

// high and low 64-bit halves of a 128-bit value
#define H64(x) (x>>64)
#define L64(x) ((__uint128_t)((uint64_t)(x)))
// approximate upper 128 bits of a 128x128-bit product, built from 64x64 partial products;
// low-order bits are dropped, so it can only underestimate, and reduce() absorbs the error
#define _mulshr128(xh,xl,yh,yl) (xh*yh+(((((xh*yl)>>2)+((xl*yh)>>2)+(H64(xl*yl)>>2)))>>62))
#define MULSHR128(x,y) _mulshr128(H64(x),L64(x),H64(y),L64(y))

// Barrett reduction: r = x - floor(x*mu / 2^128) * m, then a few conditional subtractions
inline __uint128_t reduce( __uint128_t mu, __uint128_t x, uint64_t m) {
  __uint128_t r = x - MULSHR128(x, mu)*m;
  while (r >= m) r -= m;
  return r;
}

uint64_t pow2mod_barrett(uint64_t e, uint64_t m)
{
    // supposed to be b^2/m, but best I can do is (b^2-1)/m; difference should be negligible?
    __uint128_t mu = (~__uint128_t(0))/m;
    int lz = __builtin_clzl(e);
    int bits_remain = (lz > 58) ? 0 : 58 - lz;
    __uint128_t temp = 1ul << ((e >> bits_remain) & 63);
    for (; bits_remain >= 6;)
    {
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        bits_remain -= 6;
        temp = reduce(mu, temp << ((e >> bits_remain) & 63), m);
    }
    switch (bits_remain)   // intentional fall-through
    {
    case 5: temp = reduce(mu, temp*temp, m);
    case 4: temp = reduce(mu, temp*temp, m);
    case 3: temp = reduce(mu, temp*temp, m);
    case 2: temp = reduce(mu, temp*temp, m);
    case 1: temp = reduce(mu, temp*temp, m);
    }
    return reduce(mu, temp << ((e & ((1 << bits_remain) - 1))), m);
}
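One cheap way to gain confidence in the reduction (my sketch, assuming the pow2mod from post #1 is compiled alongside) is to compare the two functions over a brute-force range of small odd exponents and moduli and report any disagreement:
Code:
#include <cstdio>
#include <cstdint>

uint64_t pow2mod(uint64_t e, uint64_t m);          // plain "% m" version from post #1
uint64_t pow2mod_barrett(uint64_t e, uint64_t m);  // Barrett version above

int main() {
    for (uint64_t e = 3; e < 2000; e += 2)
        for (uint64_t m = 3; m < 2000; m += 2)
            if (pow2mod(e, m) != pow2mod_barrett(e, m))
                printf("mismatch: e=%llu m=%llu\n",
                       (unsigned long long)e, (unsigned long long)m);
    return 0;
}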
2019-09-08, 01:55   #8
snme2pm1

Quote:
Originally Posted by xilman View Post
[...] You young guy should learn how to read what is written, not what you believe ought to have been written.
Entirely off topic, I had imagined that since xilman and LaurV both knew about mud mold capacitors like me, we must be of similar vintage?
P.S. Would it be warranted for us each to expose ourselves, I mean age-wise!
Maybe it suffices to say that we are all 60+.

Last fiddled with by snme2pm1 on 2019-09-08 at 02:46
2019-09-08, 03:35   #9
hansl

Any other ideas for further improvement?

I'm wondering if there are any particular optimizations specific to "sqrmod" operations that I'm not taking advantage of.

Also, would it be worth trying Montgomery reduction instead of Barrett?

Or maybe I should focus my efforts on the sieving part of the program at this point.
2019-09-08, 04:48   #10
snme2pm1

I've had experience with Montgomery reduction but not with Barrett, though I note that Barrett appears to feature more prominently in TF work.
Quote:
Originally Posted by hansl View Post
[post #9 quoted in full]
2019-09-08, 04:49   #11
Prime95

The very first temp*temp can be eliminated (shift the starting value more). The very first reduce can probably be simplified because x is a power of two.
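A possible reading of the first suggestion (my illustration only, not a tested drop-in change): the starting value is 2^t with t <= 63, so its square 2^(2t) still fits in the 128-bit temporary as a plain shift, and the first multiplication can be skipped; and since that value is a power of two, the first reduce() call could in principle be specialized as well.
Code:
// Hypothetical illustration of the idea; names follow pow2mod_barrett above.
unsigned t = (e >> bits_remain) & 63;                  // top chunk of the exponent, t <= 63
__uint128_t temp = (__uint128_t)1 << t;                // current starting value, 2^t
__uint128_t first_square = (__uint128_t)1 << (2 * t);  // 2^(2t), 2t <= 126: replaces the first temp*temp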