![]() |
![]() |
#1 |
∂2ω=0
Sep 2002
República de California
5×2,351 Posts |
![]()
Dmitri Gribenko (a.k.a. gribozavr) has kindly built an AMD64/linux binary of my Mfactor mersenne-sieving code for me. Binaries are available at
http://hogranch.com/mayer/bin/mfactor_amd64.tar.gz or, using ftp: ftp://hogranch.com/pub/mayer/bin/mfactor_amd64.tar.gz The zip archive contains two executables, a dynamically-linked and a statically-linked one. These should run on both plain AMD64 and Opteron systems. The speed of the 2 is virtually equal, so I suggest simply using the statically-linked binary, as that should run on the widest variety of setups. To give a rough idea of speed, testing a gigabit-size M(p) to a bit depth of 70 should need roughly 3 hours on a 2GHz system. --------------------------------------------------------------------------------- 24 March 2006: An Itanium/Linux binary is available at http://hogranch.com/mayer/bin/mfactor_ia64_linux.gz --------------------------------------------------------------------------------- 24 March 2006: A binary for Alpha (should run under both TruUnix and Linux) is available at http://hogranch.com/mayer/bin/mfactor_alpha.gz --------------------------------------------------------------------------------- 22 March 2006: A Win32 binary is available at http://hogranch.com/mayer/bin/mfactor_win32.zip --------------------------------------------------------------------------------- The simple way to invoke the code (assuming you've called your local executable "Mfactor") is Mfactor -m {exponent} -bmax {some positive number} & The code allows p up to 2^50 and q's up to 2^96 (i.e. bmax as large as 96). The exponent must be an odd prime. The argument following the -bmax flag is the maximum bit depth of factoring, i.e. log2(qmax), where we use q = 2*k*p+1 (k a positive int) to denote factor candidates. The bmax argument may have a fractional part, e.g. to sieve M(3321928381) to depth 2^(72.5) enter Mfactor -m 3321928381 -bmax 72.5 The program can also start at a non-minimal bit depth; for instance if you (or someone else) had already sieved the above exponent to 2^70 and you wanted to sieve from 2^70 to 2^73, you'd enter Mfactor -m 3321928381 -bmin 70 -bmax 73 Like bmax, bmin may also have a fractional part. Note that if your run is of this type but bmin is more than 2 or 3 bits smaller than bmax I suggest just specifiying bmax (i.e. running from scratch) - that will cost little extra runtime over running from bmin, and will have the added benefit of double-checking the results (or non-results) of the previous, shallower sieving run. The code will print some diagnostic output (much of will eventually be suppressed, but as this is a beta I want it for now) and then pass-by-pass info about progress through each of the default 16 factoring passes (each of which tests a set of q's having a distinct value of k mod 60 - you don't really need to worry about that, just know that a full factoring run needs to do all 16 passes.) To suppress this on-screen output (e.g. if you want to run quietly in background or on a remote system) you can either divert the output to a file (e.g. by appending ' > foo.txt' to the above) or use the init-file method described a few paragraphs below. As it runs, the code periodically (every few minutes on a fast system) writes a small checkpoint file named t{exponent} (brackets for emphasis only), so if running in background mode you can examine that file if you want to see which of the 16 passes it's currently doing and which k it's at relative to kmax for each pass. This file has the following format: Code:
exponent bmin <== log2(minimum factor to try) bmax <== log2(maximum factor to try) KNow = (some int) <== Current k of this pass KMax = (some int) <== Maximum k of this pass PassNow = (some int <= 15) <== Current pass PassMax = (some int <= 15) <== Maximum pass (typically 15) 3321928381 0.0000000 73.000013 KNow = 1154259025940 KMax = 1421586566400 PassNow = 4 PassMax = 15 That tells me the program is roughly 80% of the way through the 5th pass (counting from 0) of the 16. This file tells the code where to resume factoring in the event of a restart-from-interrupt. To resume you simply re-enter the same command line you used to start the run, or (since the restart file contains all the needed information about the bmin/bmax bounds) simply enter Mfactor -m {exponent} & i.e. simply provide the exponent, so the program knows which restart file to read. You can also use the same restart file format to initiate a run, if you don't want to deal with the screen-output kludge mentioned above: To start a run of exponent {p} from bit bounds {bmin} to {bmax}, create a file t{p} containing (the {} indicate fields you need to fill in) {p} {bmin} {bmax} KNow = 0 KMax = 0 PassNow = 0 PassMax = 15 and invoke the program using Mfactor -m {p} & In that case all screen output except for the start-of-run diagnostics will be suppressed, i.e. after starting the run you can log out and walk away. The code will automatically generate the KMin and KMax values corresponding to your bmin and bmax bounds and write them to the checkpoint file at the time of the next file-update. Results are written to a file named results.txt - at the moment the code also writes some small additional debug information (every 2^26th q that was tried) to the file; for kicks you could check that each of these intermediate q's is either prime or has no factor less than the small-factor-sieve bound of 611957 (that's the 50000th odd prime, in case you were wondering), but I expect most of you will mainly be interested in the lines containing the word "factor". I'll be adding more binaries as the need arises and time permits - the code also runs extremely fast on Alpha and Itanium systems (which have similarly good 64-integer MUL throughput as the AMD64), and should be within a factor of 2 of that (in terms of per-cycle throughput) on the Mac G5, with some tuning of the key modmul macros. Pentium users will have to be patient - the pentium has poor integer MUL capability relative to the aforementioned platforms, which means I'll be using a floating-point modmul algorithm I recently wrote (that allows q's up to 78 bits), but that will need to be SSE2-ified to give decent performance. Have fun, let me know if you have any problems. -Ernst Last fiddled with by ewmayer on 2006-03-24 at 17:23 |
![]() |
![]() |
![]() |
#2 |
∂2ω=0
Sep 2002
República de California
101101111010112 Posts |
![]()
Does anybody have an AMD64 running Windows and with gcc installed which they could use to do a Windows build for me? If so, PM me and I'll send you a zip of the source code and simple build and test instructions.
Thanks, -Ernst |
![]() |
![]() |
![]() |
#3 |
May 2003
3778 Posts |
![]()
I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us.
I think that a 3x25bit pure FPU version may have some potential, the a=+/-(b*c)+/-d instruction should help pack those operations in. |
![]() |
![]() |
![]() |
#4 | |
"Mark"
Apr 2003
Between here and the
2·5·17·41 Posts |
![]() Quote:
Remote - a CHUD remote client program calls chudStartRemotePerfMonitor()/chudStopRemotePerfMonitor() to start/stop sampling. If you are using shark over a remote login (rlogin, telnet, ssh, etc.) connection you will need to run it and the remote client from the same session. at Apple |
|
![]() |
![]() |
![]() |
#5 | |
∂2ω=0
Sep 2002
República de California
5·2,351 Posts |
![]() Quote:
I'll probably try a G5 build later this week - my plan is to do timings of pure-64-bit-int vs. pure-float versions on various platforms, and then design routines that interleave various numbers of modpow operations of each type in an attempt to keep both the IU and FPU pipelines as full as possible, while having both sets of parallel powerings finish in roughly the same number of total cycles. For instance on Alpha/Itanium/AMD64 a 2:1 ratio of int/float modpows looks about right (because the former are so fast), whereas on the G5 the ratio will probably be inverted, because 64-bit integer mul has 1/3 the throughput. I must say, the AMD64 has been a pleasant revelation in this regard - here I was working under the mistaken assumption that only the Opteron version of that chip can do screamingly fast 64x64=>128-bit MULs, when in fact all AMD64s can. |
|
![]() |
![]() |
![]() |
#6 | |||
May 2003
3×5×17 Posts |
![]() Quote:
For limbs a,b,c, it's nice to be able to exactly represent (2*a*c+b^2) without exceeding the precision of the FPU. Quote:
Quote:
Phil |
|||
![]() |
![]() |
![]() |
#7 | |
May 2003
FF16 Posts |
![]() Quote:
|
|
![]() |
![]() |
![]() |
#8 | |
Tribal Bullet
Oct 2004
32·5·79 Posts |
![]() Quote:
jasonp |
|
![]() |
![]() |
![]() |
#9 | |
May 2003
FF16 Posts |
![]() Quote:
(OK, everything that was on my next 2 or 3 weeks' to-do list.) I better ask - does it guarantee <0.5 ulp? Phil |
|
![]() |
![]() |
![]() |
#10 | |||
∂2ω=0
Sep 2002
República de California
5×2,351 Posts |
![]() Quote:
Quote:
In fact, it appears that this scheme with input magnitude bounds of 2^{25,25,...,26} should work all the way up to 6-vector inputs, i.e. inputs as large as 156 bits, as long as the hardware properly supports 53-mantissa-bit IEEE doubles. Of course for polynomials of length 4 and above it becomes worthwhile to consider an optimized Nussbaumer-style short-length convolution rather than straight grammar-school, and for the latter the input bounds may be slightly less lenient - I haven't done the detailed math, though, so right now I'm just talking off the top of my head w.r.to the latter. Quote:
Last fiddled with by ewmayer on 2005-11-08 at 19:38 |
|||
![]() |
![]() |
![]() |
#11 | |
Tribal Bullet
Oct 2004
32·5·79 Posts |
![]() Quote:
A while ago I experimented with using this technique to implement 50-bit modular multiplication. On the G5 a single modmul takes ~50 clocks, and I wrote a massively pipelined vectorized version that is supposed to finish one 50-bit modmul, with arbitrary modulus, in 7 clocks. Unfortunately the actual throughput is 12.5 clocks. For reference, years ago I wrote an all-integer vectorized modmul on the Alpha that sustained one modmul, with 62-bit modulus, in 6 clocks. Of course, the G5 has a much faster clock and so ended up faster in real-world terms. Phil, before you spend as much time on this as I did, the following is a reference modmul. The quantity 'magic' is 3*2^51. n has an upper limit of about 2^50; I've tried higher and it occaisionally doesn't work. Code:
#include <ppc_intrinsics.h> double fast_modmul(double a, double b, double n, double recip_n, double magic) { double q, r, nq_hi, nq_lo; q = a * b; /* compute high 53 bits of a * b */ q = __fmadd(q, recip_n, magic); /* multiply by the reciprocal */ q = q - magic; /* round to nearest integer */ nq_hi = n * q; nq_lo = __fmsub(n, q, nq_hi); /* nq_hi + nq_lo = n * q exactly */ r = __fmsub(a, b, nq_hi); r = r - nq_lo; /* compute a * b - n * q */ q = r + n; return __fsel(r, r, q); /* return r, or r+n if r < 0 */ } Happy crunching, jasonp Last fiddled with by ewmayer on 2005-11-11 at 01:17 Reason: (With Jasonp's approval) "Magic" FP rounding constant is 3*2^51, not 3*2^52 |
|
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Easiest working Quadratic Sieve code? | skan | Programming | 4 | 2018-03-24 09:54 |
Sieve Benchmark Thread | Historian | Twin Prime Search | 105 | 2013-02-05 01:35 |
Factor5 source code thread | ET_ | Operation Billion Digits | 10 | 2008-09-17 12:28 |
Where I should write C code (thread moved) | maqableh | Programming | 9 | 2006-05-12 16:22 |
New Sieve Thread Discussion | Citrix | Prime Sierpinski Project | 15 | 2005-08-29 13:56 |