mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2019-09-08, 13:01   #12
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

3·2,957 Posts
Default

Quote:
Originally Posted by snme2pm1 View Post
we must be of similar vintage?
We are, and we respect each other. At least, I respect Paul a lot! Newcomers and youngsters may be a little put off by these "blow-exchanges" or how one wants to call them, but it is nothing serious, this is part of the fun, to pull the sleeve of each-other when we make mistakes, because nobody is perfect. One can't really understand a movie like Last Vegas, unless he reaches that "age of understanding".
LaurV is online now   Reply With Quote
Old 2019-09-08, 16:36   #13
xilman
Bamboozled!
 
xilman's Avatar
 
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

34×53 Posts
Default

Quote:
Originally Posted by LaurV View Post
We are, and we respect each other. At least, I respect Paul a lot! Newcomers and youngsters may be a little put off by these "blow-exchanges" or how one wants to call them, but it is nothing serious, this is part of the fun, to pull the sleeve of each-other when we make mistakes, because nobody is perfect. One can't really understand a movie like Last Vegas, unless he reaches that "age of understanding".
Exactly.

The ability successfully to decide whether or not to to take one of my posts seriously is a sign of maturity.

LaurV, despite his extreme youth, shows clear evidence of maturity.

Last fiddled with by xilman on 2019-09-08 at 16:53 Reason: Add final sentence
xilman is offline   Reply With Quote
Old 2019-09-08, 20:32   #14
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

Quote:
Originally Posted by hansl View Post
I'll keep working on it though, now that I'm out of other, easier ideas to try.
Try processing multiple TF moduli side-by-side in order to mitigate MUL latency. That is easier when one is working with atomic operations (as opposed to your 128-bit intrinsic stuff) and/or in asm, i.e. when one has access to the underlying hardware-arithmetic sequence and can structure things to best use available registers. But you should still get a decent boost in the context of your high-level coding paradigm.
ewmayer is offline   Reply With Quote
Old 2019-09-10, 07:08   #15
hansl
 
hansl's Avatar
 
Apr 2019

110011012 Posts
Default

Quote:
Originally Posted by Prime95 View Post
The very first temp*temp can be eliminated (shift the starting value more). The very first reduce can probably be simplified because x is a power of two.
Thanks! I've change that now, and yeah I found the 128x128 multiplication can be replaced with a shift for that first reduce.

Code:
#define H64(x) (x>>64)
#define L64(x) ((__uint128_t)((uint64_t)(x)))
#define _mulshr128(xh,xl,yh,yl) (xh*yh+(((((xh*yl)>>2)+((xl*yh)>>2)+(H64(xl*yl)>>2)))>>62))
#define MULSHR128(x,y) _mulshr128(H64(x),L64(x),H64(y),L64(y))

inline __uint128_t reduce( __uint128_t mu, __uint128_t x, uint64_t m) {
  __uint128_t r = x - MULSHR128(x, mu)*m;
  while (unlikely(r >= m)) 
    r -= m;
  return r;
}

// Simplified version for first reduction, where x = 2^(e0), e0 > 64
// this mean that multiplying x*mu can be done with just a left shift. 
// Since we only want the top 128 bits of this product, 
// mu_shr is mu shifted right by (128-e0)  ***expected to be done by caller***
inline __uint128_t reduce_e0( __uint128_t mu_shr, __uint128_t x, uint64_t m) {
  __uint128_t r = x - mu_shr*m;
  while (unlikely(r >= m)) 
    r -= m;
  return r;
}

uint64_t pow2mod_barrett(uint64_t e, uint64_t m)
{
    __uint128_t mu = (~__uint128_t(0))/m;
    int lz = __builtin_clzl(e);
    int bits_remain = (lz > 57) ? 0 : 57 - lz; 
    int e0 = e >> bits_remain; // take leading 7 bits
    __uint128_t temp = reduce_e0(mu >> (128-e0), (__uint128_t)1 << e0, m);
    for (; bits_remain >= 6;)
    {
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        temp = reduce(mu, temp*temp, m);
        bits_remain -= 6;
        temp = reduce(mu, temp << ((e >> bits_remain) & 63), m);
    }
    switch (bits_remain)
    {
    case 5: temp = reduce(mu, temp*temp, m);
    case 4: temp = reduce(mu, temp*temp, m);
    case 3: temp = reduce(mu, temp*temp, m);
    case 2: temp = reduce(mu, temp*temp, m);
    case 1: temp = reduce(mu, temp*temp, m);
    return reduce(mu, temp << ((e & ((1 << bits_remain) - 1))), m);
    }
    return temp;
}

Quote:
Originally Posted by ewmayer View Post
Try processing multiple TF moduli side-by-side in order to mitigate MUL latency. That is easier when one is working with atomic operations (as opposed to your 128-bit intrinsic stuff) and/or in asm, i.e. when one has access to the underlying hardware-arithmetic sequence and can structure things to best use available registers. But you should still get a decent boost in the context of your high-level coding paradigm.
I haven't tried this yet, I'll have to think some more about how exactly I would do multiple at once.

I worked some more on my k sieving code, and reduced some overhead. I've been testing/benchmarking the code against known factors downloaded from mersenne.ca. So far the largest workload I've tried is all exponents from 1277 to 1,000,000,000 from 39 to 40 bits. I have a separate program to verify the results and It finds exactly all factors in that range (plus a few composites, but I can just filter those in post)
It did this in 74m29s on a single thread of my 2700x @4.0GHz. Not sure if that's good or not, but it definitely beats the PARI script I had been using.

Extrapolating as 2x per bitlevel increase, it would only take me 2378 thread-years to go up to 64bits.
No idea how TJAOI does it (custom GPU code and some serious hardware maybe?), but damn!
Actually I've found the time per bit increase doesn't quite scale x2, its a little less, but still...

Anyways I think I can improve it a bit more more with a better/existent presieve, which I plan to focus on next.
hansl is offline   Reply With Quote
Old 2019-09-10, 15:22   #16
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

1101000012 Posts
Default

Quote:
Originally Posted by LaurV View Post
However, you are totally right about multiplying with the modular inverse, a Barrett reduction will work faster for exponentiation.

Thanks for that. Looking up the "Barrett reduction" wikipedia entry told me that this is what Thilo and me were doing in our "fast trial division by inverse multiplication". What we had was the floating point multiplication variant. Switching to integers makes another nice improvement.
Till is offline   Reply With Quote
Old 2019-09-12, 15:57   #17
hansl
 
hansl's Avatar
 
Apr 2019

110011012 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Try processing multiple TF moduli side-by-side in order to mitigate MUL latency. That is easier when one is working with atomic operations (as opposed to your 128-bit intrinsic stuff) and/or in asm, i.e. when one has access to the underlying hardware-arithmetic sequence and can structure things to best use available registers. But you should still get a decent boost in the context of your high-level coding paradigm.
I'm still a bit confused about this. The program is currently single-threaded, but would this latency effectively be covered by just making use of hypertheads?

I tend to run mprime PM1, LL-D, or PRP tasks on all cores (no hyperthreads), and use the spare hyperthreads for my extra TF stuff.

I briefly tried to add vectorization/SIMD using OpenMP pragmas but didn't have any luck making things go faster. Its my first time messing with that so I'm likely doing many things wrong.
hansl is offline   Reply With Quote
Old 2019-09-13, 19:38   #18
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

Quote:
Originally Posted by hansl View Post
I'm still a bit confused about this. The program is currently single-threaded, but would this latency effectively be covered by just making use of hypertheads?
Nothing to do with explicit threading or hyperthreading, rather with use of parallelism in a single instruction stream in order to avoid latency bubbles. Consider the following core 3-operation sequence from the Montgomery-multiply algorithm for 2 single-word (B-bit) inputs x and y, modulo q, with qinv containing the Montgomery inverse, i.e. q*qinv == 1 (mod 2^B). The Barrett reduction contains a not-dissimilar MUL sequence:

MUL_LOHI(x,y,&lo,&hi); // 2B-bit product x*y, output halves into lo and hi
MULL(qinv,lo,lo); // lo = qinv*lo, low B-bit product
lo = MULH(q,lo); // High B bits of 2B-bit product q*lo

In a typical hardware implementation, each of those MULs has 4-5 cycle latency. So in the context of e.g. Mersenne TF work, where we do a bunch of modular powerings 2^p (mod q) for various factor candidates (values of q), it makes sense to feed, say 4 such q's at a time to a 4-way modpow loop, meaning that there will be 4 independent ones of each of the above 3 MULs:

MUL_LOHI(x0,y0,&lo0,&hi0);
MUL_LOHI(x1,y1,&lo1,&hi1);
MUL_LOHI(x2,y2,&lo2,&hi2);
MUL_LOHI(x3,y3,&lo3,&hi3);
MULL(qinv0,lo0,lo0);
MULL(qinv1,lo1,lo1);
MULL(qinv2,lo2,lo2);
MULL(qinv3,lo3,lo3);
lo0 = MULH(q0,lo0);
lo1 = MULH(q1,lo1);
lo2 = MULH(q2,lo2);
lo3 = MULH(q3,lo3);

Thus, say, if latency = 4 cycles and one MUL can start every cycle, the result of the MUL_LOHI(x0,y0,&lo0,&hi0) becomes available just as it is needed for the ensuing MULL(qinv0,lo0,lo0), and so forth.
ewmayer is offline   Reply With Quote
Old 2019-09-20, 16:50   #19
hansl
 
hansl's Avatar
 
Apr 2019

5·41 Posts
Default

OK, thanks for the explanation, I understand better now.

I made a new version of my MULSHR128 routine (128x128, keeping the top 128b result), which should be more correct(and i think marginally faster) than my previous one that shifted away lsb's before carrying:
Code:
inline __uint128_t MULSHR128(__uint128_t x, __uint128_t y) {
    __uint128_t xl = L64(x);
    __uint128_t xh = H64(x);
    __uint128_t yl = L64(y);
    __uint128_t yh = H64(y);
    __uint128_t xhyl = xh*yl;
    __uint128_t xlyh = xl*yh;
    uint64_t xlyl_h = H64(xl*yl);
    unsigned long long z;
    return xh*yh + H64(xhyl) + H64(xlyh) + 
        __builtin_uaddll_overflow(xhyl, xlyh, &z) + 
        __builtin_uaddll_overflow(z, xlyl_h, &z);
}
Then I went back to a previous idea I had of pre-processing/deconstructing the exponent beforehand into 6-bit chunks, and specially handling the first 7-bit chunk, since this makes it simpler to implement multiple operations in parallel.

So in my latest code, pow2mod is a functor class now, using the operator()(...) to handle multiple modulos in parallel. I wrote parallel versions for x4, x3, x2, and x1. I'm showing just the "x1" version for brevity, but the others basically consist of the same individual lines repeated with 0,1,2,3 suffix on each variable name. Also merged the above code of MULSHR128 into the reduce function:
Code:
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0) 

constexpr inline __uint128_t H64(__uint128_t x) { return x >> 64; }
constexpr inline __uint128_t L64(__uint128_t x) { return uint64_t(x); }

struct pow2mod {
  __uint128_t pow_init;
  uint8_t sqrs[6];
  uint8_t e_chunks[6];
  uint8_t mu_shr_init; 
  uint8_t chunk_count;

  // store deconstructed exponent for simplified powmod, base-2
  pow2mod(uint64_t e) {
    int lz = __builtin_clzl(e);
    int bits_remain = (lz > 57) ? 0 : 57 - lz; 
    int e0 = e >> bits_remain;

    this->pow_init = __uint128_t(1) << e0;
    this->mu_shr_init = 128-e0;
   
    int n = 0;
    // shift chunks of bits, 6 at a time, into msb of e
    if (bits_remain) {
      for (e <<= lz + 7; e; ++n, e <<= 6) {
          lz = __builtin_clzl(e);
          e <<= lz;
          bits_remain -= lz + 6;
          sqrs[n] = lz + 6;
          e_chunks[n] = e >> 58; // grab top 6 bits
      }
    }
    // exponent more than 7 bits total
    if (n > 0) {
      e_chunks[n-1] >>= -bits_remain;
      sqrs[n-1] += bits_remain; // adjust squarings for next-to-last chunk
    }
    chunk_count = n;
  }

private:
  inline void reduce_x1(__uint128_t &x0, uint64_t mu_l0, uint64_t mu_h0, uint64_t m0) {
    __uint128_t xl0 = L64(x0);
    __uint128_t xh0 = H64(x0);
    __uint128_t xlyh0 = xl0 * mu_h0;
    __uint128_t xhyl0 = xh0 * mu_l0;
    __uint128_t xlyl_h0 = H64(xl0*mu_l0);
    __uint128_t xhyh0 = xh0 * mu_h0;
    unsigned long long z0;
    xhyh0 = (xhyh0 + H64(xlyh0) + H64(xhyl0) + __builtin_uaddll_overflow(xhyl0, xlyh0, &z0) + __builtin_uaddll_overflow(z0, xlyl_h0, &z0));
    xhyh0 *= m0;
    x0 -= xhyh0;
    while (unlikely(x0 >= m0)) x0 -= m0;
  }

public:
  inline uint8_t operator()(uint64_t m0) {
    constexpr __uint128_t B = ~__uint128_t(0);
    __uint128_t mu0 = B / m0;
    uint64_t mu_l0 = mu0;
    uint64_t mu_h0 = mu0 >> 64;
    // initial reduction for first 7 bits of exponent
    __uint128_t temp0 = mu0 >> mu_shr_init;
    temp0 *= m0;
    temp0 = this->pow_init - temp0;
    while (unlikely(temp0 >= m0)) temp0 -= m0;
    for (size_t i = 0; i < chunk_count; ++i)
    {
        for(uint8_t j = 0; j < sqrs[i]; ++j) {
          // temp = (temp * temp) % m;
          temp0 *= temp0;
          reduce_x1(temp0, mu_l0, mu_h0, m0);
        }
        // temp = (temp << chunks[i]) % m;
        temp0 <<= e_chunks[i];
        reduce_x1(temp0, mu_l0, mu_h0, m0);
    }
    // check if each result == 1
    return (temp0 == 1);
  }
};
So far I've only got this parallelized version working with a dumb "try every k value between min_k, max_k" loop; no sieve or classes.

I'm still trying to figure out a nice way control flow between my old way of pulling one candidate at a time out of the sieve/classes into calling this parallel version.

I've also done some profiling with perf, and realized I can see the effects of this latency from overall cycles spent on instructions, or more specifically recording the event "perf record -e stalled-cycles-backend ..."

With the above code, I noticed that 20% of overall time is spent in the initial "mu = 2^128 / modulus" calculation which is calls a builtin named "__udivti3".
The disassembly of this in perf shows a few div instructions very close one after the other, so its fairly non-ideal in terms of latency.
Not sure how I could implement my own parallelized version of that particular operation though. Are there any special tricks when the numerator is a constant power of 2?
hansl is offline   Reply With Quote
Old 2019-09-21, 01:50   #20
hansl
 
hansl's Avatar
 
Apr 2019

5×41 Posts
Default

I did some more timing of the various "parallel" computation functions I wrote, and found that x1, x3, and x4 are basically all the same performance.

Only x2 seems to benefit with a ~16% increase in speed.

edit: Here's some "perf stat" results for the x2 version, still with dumb "try every k" code
Code:
 Performance counter stats for './merstf 2900000000 2900000100 55 60':

        146,871.08 msec task-clock                #    1.000 CPUs utilized          
            15,371      context-switches          #    0.105 K/sec                  
                38      cpu-migrations            #    0.000 K/sec                  
               195      page-faults               #    0.001 K/sec                  
   585,354,021,276      cycles                    #    3.985 GHz                      (83.33%)
       706,590,899      stalled-cycles-frontend   #    0.12% frontend cycles idle     (83.33%)
   455,069,579,192      stalled-cycles-backend    #   77.74% backend cycles idle      (83.33%)
 1,496,220,929,283      instructions              #    2.56  insn per cycle         
                                                  #    0.30  stalled cycles per insn  (83.33%)
    54,473,802,597      branches                  #  370.895 M/sec                    (83.33%)
       624,591,586      branch-misses             #    1.15% of all branches          (83.33%)

     146.939449114 seconds time elapsed

     146.823542000 seconds user
       0.031982000 seconds sys

Last fiddled with by hansl on 2019-09-21 at 01:57
hansl is offline   Reply With Quote
Old 2019-09-21, 12:44   #21
snme2pm1
 
"Graham uses ISO 8601"
Mar 2014
AU, Sydney

3618 Posts
Default

Quote:
Originally Posted by xilman View Post
Exactly.

The ability successfully to decide whether or not to to take one of my posts seriously is a sign of maturity.

LaurV, despite his extreme youth, shows clear evidence of maturity.
Yeah a great laugh, but how long does it endure rather than confound folk unwittingly exposed to your banter?
There are some forums that I visit that would delete a message like this and prior as off topic.
mersennneforum.org moderators have demonstrated a more relaxed attitude.
p.s. I took up the challenge proposed by LaurV and have viewed the video Last Vegas (2013).
It didn't evoke any awkward feelings at all.
snme2pm1 is offline   Reply With Quote
Old 2019-10-30, 11:38   #22
Dan Ee
 
Mar 2013

7 Posts
Default

I think I may have a slightly different method for doing Mersenne number trial division that you can test in your project. This method can be summarized as "Negative power base-2 modular exponentiation using Montgomery reduction." Instead of computing Mod(2,m)^q and test whether the result is 1, we can compute Mod(2,m)^-q and test whether the result is 1.

The idea is to view Montgomery reduction as adding a multiple of the modulus to the accumulator so that 64 least significant bits of accumulator become zeros, then right-shift the accumulator by 64 bits. This right-shift operation has the effect of multiplying 2^-64 to the accumulator. We want to utilize this effect to bring the exponent more and more negative towards our target.

Code:
Example with exponent=33816457, we will compute 2^-33816457
Start with 2^0=1
Perform Montgomery reduction, accumulator now holds 2^-64
In the following table, sqmont means "square accumulator, then perform Montgomery reduction", this has the "2x-64" effect on the exponent.
div2 means "divide the accumulator by 2", this has the "minus 1" effect on the exponent.

Operation  Exponent  Operation  Exponent
sqmont         -192  div2           -193
sqmont         -450  div2           -451
sqmont         -966  div2           -967
sqmont        -1998  div2          -1999
sqmont        -4062  div2          -4063
sqmont        -8190  div2          -8191
sqmont       -16446  div2         -16447
sqmont       -32958  div2         -32959
sqmont       -65982  div2         -65983
sqmont      -132030  div2        -132031
sqmont      -264126  div2        -264127
sqmont      -528318  div2        -528319
sqmont     -1056702
sqmont     -2113468
sqmont     -4227000  div2       -4227001
sqmont     -8454066
sqmont    -16908196
sqmont    -33816456  div2      -33816457
This method has the advantage of not having any reciprocal approximation involved. The sequence of "div2 and no div2" is not obvious and I have figured that out and is shown in my demo code.
Besides trial division for Mersenne numbers, this method can be applied to base-2 Fermat primality test.
Attached Files
File Type: zip negpow.zip (1.2 KB, 56 views)
Dan Ee is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Requestion for fast chebyshev theta function danaj Computer Science & Computational Number Theory 9 2018-03-31 14:59
Do normal adults give themselves an allowance? (...to fast or not to fast - there is no question!) jasong jasong 35 2016-12-11 00:57
Fast modular reduction for primes < 512 bits? BenR Computer Science & Computational Number Theory 2 2016-03-27 00:37
Fast Approximate Ceiling Function R.D. Silverman Programming 27 2010-11-19 17:50
Base-6 speed for prime testing vs. base-2 jasong Conjectures 'R Us 36 2010-08-03 06:25

All times are UTC. The time now is 04:35.

Tue Oct 27 04:35:14 UTC 2020 up 47 days, 1:46, 0 users, load averages: 2.16, 1.96, 1.85

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.