Register FAQ Search Today's Posts Mark Forums Read

 2005-11-03, 19:58 #1 ewmayer ∂2ω=0     Sep 2002 República de California 2DDA16 Posts Mfactor sieve code thread Dmitri Gribenko (a.k.a. gribozavr) has kindly built an AMD64/linux binary of my Mfactor mersenne-sieving code for me. Binaries are available at http://hogranch.com/mayer/bin/mfactor_amd64.tar.gz or, using ftp: ftp://hogranch.com/pub/mayer/bin/mfactor_amd64.tar.gz The zip archive contains two executables, a dynamically-linked and a statically-linked one. These should run on both plain AMD64 and Opteron systems. The speed of the 2 is virtually equal, so I suggest simply using the statically-linked binary, as that should run on the widest variety of setups. To give a rough idea of speed, testing a gigabit-size M(p) to a bit depth of 70 should need roughly 3 hours on a 2GHz system. --------------------------------------------------------------------------------- 24 March 2006: An Itanium/Linux binary is available at http://hogranch.com/mayer/bin/mfactor_ia64_linux.gz --------------------------------------------------------------------------------- 24 March 2006: A binary for Alpha (should run under both TruUnix and Linux) is available at http://hogranch.com/mayer/bin/mfactor_alpha.gz --------------------------------------------------------------------------------- 22 March 2006: A Win32 binary is available at http://hogranch.com/mayer/bin/mfactor_win32.zip --------------------------------------------------------------------------------- The simple way to invoke the code (assuming you've called your local executable "Mfactor") is Mfactor -m {exponent} -bmax {some positive number} & The code allows p up to 2^50 and q's up to 2^96 (i.e. bmax as large as 96). The exponent must be an odd prime. The argument following the -bmax flag is the maximum bit depth of factoring, i.e. log2(qmax), where we use q = 2*k*p+1 (k a positive int) to denote factor candidates. The bmax argument may have a fractional part, e.g. to sieve M(3321928381) to depth 2^(72.5) enter Mfactor -m 3321928381 -bmax 72.5 The program can also start at a non-minimal bit depth; for instance if you (or someone else) had already sieved the above exponent to 2^70 and you wanted to sieve from 2^70 to 2^73, you'd enter Mfactor -m 3321928381 -bmin 70 -bmax 73 Like bmax, bmin may also have a fractional part. Note that if your run is of this type but bmin is more than 2 or 3 bits smaller than bmax I suggest just specifiying bmax (i.e. running from scratch) - that will cost little extra runtime over running from bmin, and will have the added benefit of double-checking the results (or non-results) of the previous, shallower sieving run. The code will print some diagnostic output (much of will eventually be suppressed, but as this is a beta I want it for now) and then pass-by-pass info about progress through each of the default 16 factoring passes (each of which tests a set of q's having a distinct value of k mod 60 - you don't really need to worry about that, just know that a full factoring run needs to do all 16 passes.) To suppress this on-screen output (e.g. if you want to run quietly in background or on a remote system) you can either divert the output to a file (e.g. by appending ' > foo.txt' to the above) or use the init-file method described a few paragraphs below. As it runs, the code periodically (every few minutes on a fast system) writes a small checkpoint file named t{exponent} (brackets for emphasis only), so if running in background mode you can examine that file if you want to see which of the 16 passes it's currently doing and which k it's at relative to kmax for each pass. This file has the following format: Code: exponent bmin <== log2(minimum factor to try) bmax <== log2(maximum factor to try) KNow = (some int) <== Current k of this pass KMax = (some int) <== Maximum k of this pass PassNow = (some int <= 15) <== Current pass PassMax = (some int <= 15) <== Maximum pass (typically 15) So for instance if I'm partway through a run of M(3321928381) from bit bounds 0 to 73, I might see 3321928381 0.0000000 73.000013 KNow = 1154259025940 KMax = 1421586566400 PassNow = 4 PassMax = 15 That tells me the program is roughly 80% of the way through the 5th pass (counting from 0) of the 16. This file tells the code where to resume factoring in the event of a restart-from-interrupt. To resume you simply re-enter the same command line you used to start the run, or (since the restart file contains all the needed information about the bmin/bmax bounds) simply enter Mfactor -m {exponent} & i.e. simply provide the exponent, so the program knows which restart file to read. You can also use the same restart file format to initiate a run, if you don't want to deal with the screen-output kludge mentioned above: To start a run of exponent {p} from bit bounds {bmin} to {bmax}, create a file t{p} containing (the {} indicate fields you need to fill in) {p} {bmin} {bmax} KNow = 0 KMax = 0 PassNow = 0 PassMax = 15 and invoke the program using Mfactor -m {p} & In that case all screen output except for the start-of-run diagnostics will be suppressed, i.e. after starting the run you can log out and walk away. The code will automatically generate the KMin and KMax values corresponding to your bmin and bmax bounds and write them to the checkpoint file at the time of the next file-update. Results are written to a file named results.txt - at the moment the code also writes some small additional debug information (every 2^26th q that was tried) to the file; for kicks you could check that each of these intermediate q's is either prime or has no factor less than the small-factor-sieve bound of 611957 (that's the 50000th odd prime, in case you were wondering), but I expect most of you will mainly be interested in the lines containing the word "factor". I'll be adding more binaries as the need arises and time permits - the code also runs extremely fast on Alpha and Itanium systems (which have similarly good 64-integer MUL throughput as the AMD64), and should be within a factor of 2 of that (in terms of per-cycle throughput) on the Mac G5, with some tuning of the key modmul macros. Pentium users will have to be patient - the pentium has poor integer MUL capability relative to the aforementioned platforms, which means I'll be using a floating-point modmul algorithm I recently wrote (that allows q's up to 78 bits), but that will need to be SSE2-ified to give decent performance. Have fun, let me know if you have any problems. -Ernst Last fiddled with by ewmayer on 2006-03-24 at 17:23
 2005-11-06, 21:24 #2 ewmayer ∂2ω=0     Sep 2002 República de California 2×5,869 Posts Does anybody have an AMD64 running Windows and with gcc installed which they could use to do a Windows build for me? If so, PM me and I'll send you a zip of the source code and simple build and test instructions. Thanks, -Ernst
 2005-11-07, 08:20 #3 fatphil     May 2003 F416 Posts I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us. I think that a 3x25bit pure FPU version may have some potential, the a=+/-(b*c)+/-d instruction should help pack those operations in.
2005-11-07, 13:37   #4
rogue

"Mark"
Apr 2003
Between here and the

3·17·131 Posts

Quote:
 Originally Posted by fatphil I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us.
My experience with Shark is limited, but I did find this:

Remote - a CHUD remote client program calls chudStartRemotePerfMonitor()/chudStopRemotePerfMonitor() to start/stop sampling. If you are using shark over a remote login (rlogin, telnet, ssh, etc.) connection you will need to run it and the remote client from the same session.

at Apple

2005-11-07, 19:42   #5
ewmayer
2ω=0

Sep 2002
República de California

2×5,869 Posts

Quote:
 Originally Posted by fatphil I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us. I think that a 3x25bit pure FPU version may have some potential, the a=+/-(b*c)+/-d instruction should help pack those operations in.
Using balanced-digit representation and IEEE doubles one can go up to 26 bits per input for vector lengths up to 4, i.e. 78 bits for 3-vector inputs and 104 bits for 4-vector - I've only coded the former to date, since those are most relevant for GIMPS trial-factoring work.

I'll probably try a G5 build later this week - my plan is to do timings of pure-64-bit-int vs. pure-float versions on various platforms, and then design routines that interleave various numbers of modpow operations of each type in an attempt to keep both the IU and FPU pipelines as full as possible, while having both sets of parallel powerings finish in roughly the same number of total cycles. For instance on Alpha/Itanium/AMD64 a 2:1 ratio of int/float modpows looks about right (because the former are so fast), whereas on the G5 the ratio will probably be inverted, because 64-bit integer mul has 1/3 the throughput.

I must say, the AMD64 has been a pleasant revelation in this regard - here I was working under the mistaken assumption that only the Opteron version of that chip can do screamingly fast 64x64=>128-bit MULs, when in fact all AMD64s can.

2005-11-08, 09:50   #6
fatphil

May 2003

22×61 Posts

Quote:
 Originally Posted by ewmayer Using balanced-digit representation and IEEE doubles one can go up to 26 bits per input for vector lengths up to 4, i.e. 78 bits for 3-vector inputs and 104 bits for 4-vector
Call me old-fashioned, but I like fingers at the ends of my limbs!
For limbs a,b,c, it's nice to be able to exactly represent (2*a*c+b^2) without exceeding the precision of the FPU.

Quote:
 Originally Posted by ewmayer - I've only coded the former to date, since those are most relevant for GIMPS trial-factoring work. I'll probably try a G5 build later this week - my plan is to do timings of pure-64-bit-int vs. pure-float versions on various platforms, and then design routines that interleave various numbers of modpow operations of each type in an attempt to keep both the IU and FPU pipelines as full as possible, while having both sets of parallel powerings finish in roughly the same number of total cycles. For instance on Alpha/Itanium/AMD64 a 2:1 ratio of int/float modpows looks about right (because the former are so fast), whereas on the G5 the ratio will probably be inverted, because 64-bit integer mul has 1/3 the throughput.
FP is 2/clock verses int's 2/6 clocks. i.e. 6:1. But the "area" of a FPU mulmod is much less (maybe even 1/6th). However, the mad instructions let you do 2 things at once... Personally, I can't call it - whatever the benchmarks say is the truth!

Quote:
 Originally Posted by ewmayer I must say, the AMD64 has been a pleasant revelation in this regard - here I was working under the mistaken assumption that only the Opteron version of that chip can do screamingly fast 64x64=>128-bit MULs, when in fact all AMD64s can.
Indeedy. pretty OK chip. Maybe my chrissy present to myself...

Phil

2005-11-08, 10:36   #7
fatphil

May 2003

22×61 Posts

Quote:
 Originally Posted by fatphil Call me old-fashioned, but I like fingers at the ends of my limbs!
D'oh! 3*2^50 works, you're right. My mistake. I was getting confused by the more general case of non-power-of-2 radices. (the kinds of bignums I normally deal with, where the carries need to be approximated and rounded.) However, you've really got to make sure you're in the right rounding mode, or you'll have some nasty fixups to do, I think.

2005-11-08, 13:57   #8
jasonp
Tribal Bullet

Oct 2004

5×709 Posts

Quote:
 Originally Posted by ewmayer Using balanced-digit representation and IEEE doubles one can go up to 26 bits per input for vector lengths up to 4, i.e. 78 bits for 3-vector inputs and 104 bits for 4-vector - I've only coded the former to date, since those are most relevant for GIMPS trial-factoring work.
There's another handy trick on PPC hardware, in case you haven't considered it: For <= 50-bit integers in FPU registers a and b, c = a * b and d = (c - a*b) yield a pair of values where c + d = the full 100-bit product. I don't know if the floats in the vector unit are also required to support the full double-size product internally when computing a multiply-add operation. Using a pair of doubles instead of a vector of floats may get you higher throughput for an inherently serial operation like a modmul.

jasonp

2005-11-08, 14:41   #9
fatphil

May 2003

22×61 Posts

Quote:
 Originally Posted by jasonp There's another handy trick on PPC hardware, in case you haven't considered it: For <= 50-bit integers in FPU registers a and b, c = a * b and d = (c - a*b) yield a pair of values where c + d = the full 100-bit product. I don't know if the floats in the vector unit are also required to support the full double-size product internally when computing a multiply-add operation. Using a pair of doubles instead of a vector of floats may get you higher throughput for an inherently serial operation like a modmul. jasonp
This is MADD! I love it. It changes _everything_!
(OK, everything that was on my next 2 or 3 weeks' to-do list.)

I better ask - does it guarantee <0.5 ulp?

Phil

2005-11-08, 19:36   #10
ewmayer
2ω=0

Sep 2002
República de California

2×5,869 Posts

Quote:
 Originally Posted by fatphil Call me old-fashioned, but I like fingers at the ends of my limbs!
I have a friend who lost a fingertip in a home-woodworking accident several years ago. The joke among us ever since has been that he's suffering from roundoff error. (Since it was his index finger rather than, say, the pinkie, it truly was a significant digit. ;)

Quote:
 For limbs a,b,c, it's nice to be able to exactly represent (2*a*c+b^2) without exceeding the precision of the FPU.
As you've already noted further down, balanced-digit and base-2^26 means the magnitudes of a,b,c,... are all <= 2^25, so that's not a problem. However, at least in my code I have one additional constraint, namely that all integer input operands are strictly nonnegative. Since I don't want the digit-balancing to create a carry out of the highest-order input term (that would increase my vector length by one), in my implementation the MSW remains unbalanced (like its owner, some would say ;), so in the above, |a| and |b| are both <= 2^25 but |c| <= 2^26. Thus |2*a*c+b^2| <= 2^52 + 2^50, which still leaves some headroom below the 53-bit limit. For 4-vector inputs, we have a,b,c all with magnitudes <= 2^25, the MSW d is <= 2^26, and the worst-case polynomial-product coefficient in the case of a squaring is 2*a*c+2*b*d. This nominally is <= 2^52 + 2^51 in magnitude, which is again no problem. (Note that we don't need to take advantage that the term in question is an exact multiple of 2 for this to be the case, i.e. general multiply works, too.)

In fact, it appears that this scheme with input magnitude bounds of 2^{25,25,...,26} should work all the way up to 6-vector inputs, i.e. inputs as large as 156 bits, as long as the hardware properly supports 53-mantissa-bit IEEE doubles. Of course for polynomials of length 4 and above it becomes worthwhile to consider an optimized Nussbaumer-style short-length convolution rather than straight grammar-school, and for the latter the input bounds may be slightly less lenient - I haven't done the detailed math, though, so right now I'm just talking off the top of my head w.r.to the latter.

Quote:
 Originally Posted by jasonp There's another handy trick on PPC hardware, in case you haven't considered it: For <= 50-bit integers in FPU registers a and b, c = a * b and d = (c - a*b) yield a pair of values where c + d = the full 100-bit product. I don't know if the floats in the vector unit are also required to support the full double-size product internally when computing a multiply-add operation.
Yes, that's definitely a neat little trick - it'll be interesting to see how large a variety of MADD-capable hardware supports it!

Last fiddled with by ewmayer on 2005-11-08 at 19:38

2005-11-09, 13:59   #11
jasonp
Tribal Bullet

Oct 2004

5·709 Posts

Quote:
 Originally Posted by fatphil This is MADD! I love it. It changes _everything_! (OK, everything that was on my next 2 or 3 weeks' to-do list.) I better ask - does it guarantee <0.5 ulp?
My understanding is that the result is always exact. if a + b makes the full 100-bit product, and 'a' is implicitly the 'high half', then if 'a' is too large by one then b is negative. Remember that since the inputs to the multiply started off represented exactly, as integers, then a + b will be an integer too. It's when you perform things like multiplying by an inverse to simulate a division that you start having to deal with roundoff. For a modmul, it doesn't matter if 'a' is an integer or not, since it's only there to be multiplied by an inverse. You use less precision than is available to allow a few bits for unambiguous rounding, when producing the quotient from the high half.

A while ago I experimented with using this technique to implement 50-bit modular multiplication. On the G5 a single modmul takes ~50 clocks, and I wrote a massively pipelined vectorized version that is supposed to finish one 50-bit modmul, with arbitrary modulus, in 7 clocks. Unfortunately the actual throughput is 12.5 clocks. For reference, years ago I wrote an all-integer vectorized modmul on the Alpha that sustained one modmul, with 62-bit modulus, in 6 clocks. Of course, the G5 has a much faster clock and so ended up faster in real-world terms.

Phil, before you spend as much time on this as I did, the following is a reference modmul. The quantity 'magic' is 3*2^51. n has an upper limit of about 2^50; I've tried higher and it occaisionally doesn't work.

Code:
#include <ppc_intrinsics.h>

double fast_modmul(double a, double b, double n,
double recip_n, double magic) {

double q, r, nq_hi, nq_lo;

q = a * b;                       /* compute high 53 bits of a * b */
q = __fmadd(q, recip_n, magic);  /* multiply by the reciprocal */
q = q - magic;                   /* round to nearest integer */

nq_hi = n * q;
nq_lo = __fmsub(n, q, nq_hi);  /* nq_hi + nq_lo = n * q exactly */

r = __fmsub(a, b, nq_hi);
r = r - nq_lo;                 /* compute a * b - n * q */

q = r + n;
return __fsel(r, r, q);        /* return r, or r+n if r < 0 */
}
Major bonus points to anybody who can figure out how to manipulate the FPU rounding mode so that the final correction at the end isn't necessary. Just setting the rounding mode to truncate the quotient still produces a negative remainder sometimes.

Happy crunching,
jasonp

Last fiddled with by ewmayer on 2005-11-11 at 01:17 Reason: (With Jasonp's approval) "Magic" FP rounding constant is 3*2^51, not 3*2^52

 Similar Threads Thread Thread Starter Forum Replies Last Post skan Programming 4 2018-03-24 09:54 Historian Twin Prime Search 105 2013-02-05 01:35 ET_ Operation Billion Digits 10 2008-09-17 12:28 maqableh Programming 9 2006-05-12 16:22 Citrix Prime Sierpinski Project 15 2005-08-29 13:56

All times are UTC. The time now is 09:55.

Fri Aug 12 09:55:42 UTC 2022 up 36 days, 4:43, 2 users, load averages: 1.70, 1.47, 1.29