mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Operation Billion Digits (https://www.mersenneforum.org/forumdisplay.php?f=50)
-   -   Mfactor sieve code thread (https://www.mersenneforum.org/showthread.php?t=4939)

ewmayer 2005-11-03 19:58

Mfactor sieve code thread
 
Dmitri Gribenko (a.k.a. gribozavr) has kindly built an AMD64/linux binary of my Mfactor mersenne-sieving code for me. Binaries are available at

[url]http://hogranch.com/mayer/bin/mfactor_amd64.tar.gz[/url]

or, using ftp:

[url]ftp://hogranch.com/pub/mayer/bin/mfactor_amd64.tar.gz[/url]

The archive contains two executables, a dynamically-linked and a statically-linked one. These should run on both plain AMD64 and Opteron systems. The two run at virtually the same speed, so I suggest simply using the statically-linked binary, as that should run on the widest variety of setups.

To give a rough idea of speed, testing a gigabit-size M(p) to a bit depth of 70 should need roughly 3 hours on a 2GHz system.

---------------------------------------------------------------------------------

[i]24 March 2006:[/i] An Itanium/Linux binary is available at

[url]http://hogranch.com/mayer/bin/mfactor_ia64_linux.gz[/url]

---------------------------------------------------------------------------------

[i]24 March 2006:[/i] A binary for Alpha (should run under both TruUnix and Linux) is available at

[url]http://hogranch.com/mayer/bin/mfactor_alpha.gz[/url]

---------------------------------------------------------------------------------

[i]22 March 2006:[/i] A Win32 binary is available at

[url]http://hogranch.com/mayer/bin/mfactor_win32.zip[/url]

---------------------------------------------------------------------------------

The simple way to invoke the code (assuming you've called your local executable "Mfactor") is

Mfactor -m {exponent} -bmax {some positive number} &

The code allows p up to 2^50 and q's up to 2^96 (i.e. bmax as large as 96).
The exponent must be an odd prime. The argument following the -bmax flag is the maximum bit depth of factoring, i.e. log2(qmax), where we use q = 2*k*p+1 (k a positive integer) to denote factor candidates. The bmax argument may have a fractional part; e.g. to sieve M(3321928381) to depth 2^(72.5), enter

Mfactor -m 3321928381 -bmax 72.5

The program can also start at a non-minimal bit depth; for instance if you (or someone else) had already sieved the above exponent to 2^70 and you wanted to sieve from 2^70 to 2^73, you'd enter

Mfactor -m 3321928381 -bmin 70 -bmax 73

Like bmax, bmin may also have a fractional part. Note that if your run is of this type but bmin is more than 2 or 3 bits smaller than bmax, I suggest just specifying bmax (i.e. running from scratch) - that will cost little extra runtime over running from bmin, and will have the added benefit of double-checking the results (or non-results) of the previous, shallower sieving run.

The code will print some diagnostic output (much of which will eventually be suppressed, but as this is a beta I want it for now) and then pass-by-pass info about progress through each of the default 16 factoring passes (each of which tests a set of q's having a distinct value of k mod 60 - you don't really need to worry about that, just know that a full factoring run needs to do all 16 passes). To suppress this on-screen output (e.g. if you want to run quietly in the background or on a remote system) you can either divert the output to a file (e.g. by appending ' > foo.txt' to the above) or use the init-file method described a few paragraphs below.

As it runs, the code periodically (every few minutes on a fast system) writes a small checkpoint file named t{exponent} (braces for emphasis only), so if running in background mode you can examine that file if you want to see which of the 16 passes it's currently doing and which k it's at relative to kmax for each pass. This file has the following format:

[code]
exponent
bmin <== log2(minimum factor to try)
bmax <== log2(maximum factor to try)
KNow = (some int) <== Current k of this pass
KMax = (some int) <== Maximum k of this pass
PassNow = (some int <= 15) <== Current pass
PassMax = (some int <= 15) <== Maximum pass (typically 15)
[/code]
So for instance if I'm partway through a run of M(3321928381) from bit bounds 0 to 73, I might see

3321928381
0.0000000
73.000013
KNow = 1154259025940
KMax = 1421586566400
PassNow = 4
PassMax = 15

That tells me the program is roughly 80% of the way through the 5th pass (counting from 0) of the 16. This file tells the code where to resume factoring in the event of a restart-from-interrupt. To resume you simply re-enter the same command line you used to start the run, or (since the restart file contains all the needed information about the bmin/bmax bounds) simply enter

Mfactor -m {exponent} &

i.e. simply provide the exponent, so the program knows which restart file to read. You can also use the same restart file format to initiate a run, if you don't want to deal with the screen-output kludge mentioned above: To start a run of exponent {p} from bit bounds {bmin} to {bmax}, create a file t{p} containing (the {} indicate fields you need to fill in)

{p}
{bmin}
{bmax}
KNow = 0
KMax = 0
PassNow = 0
PassMax = 15

and invoke the program using

Mfactor -m {p} &

In that case all screen output except for the start-of-run diagnostics will be suppressed, i.e. after starting the run you can log out and walk away. The code will automatically generate the KNow and KMax values corresponding to your bmin and bmax bounds and write them to the checkpoint file at the time of the next file-update. Results are written to a file named results.txt. At the moment the code also writes some small additional debug information (every 2^26th q that was tried) to the file; for kicks you could check that each of these intermediate q's is either prime or has no factor less than the small-factor-sieve bound of 611957 (that's the 50000th odd prime, in case you were wondering), but I expect most of you will mainly be interested in the lines containing the word "factor".

I'll be adding more binaries as the need arises and time permits - the code also runs extremely fast on Alpha and Itanium systems (which have similarly good 64-bit integer MUL throughput to the AMD64), and should be within a factor of 2 of that (in terms of per-cycle throughput) on the Mac G5, with some tuning of the key modmul macros. Pentium users will have to be patient - the Pentium has poor integer MUL capability relative to the aforementioned platforms, which means I'll be using a floating-point modmul algorithm I recently wrote (it allows q's up to 78 bits), but that will need to be SSE2-ified to give decent performance.

Have fun, let me know if you have any problems.
-Ernst

ewmayer 2005-11-06 21:24

Does anybody have an AMD64 running Windows and with gcc installed which they could use to do a Windows build for me? If so, PM me and I'll send you a zip of the source code and simple build and test instructions.

Thanks,
-Ernst

fatphil 2005-11-07 08:20

I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us.
I think that a 3x25bit pure FPU version may have some potential, the a=+/-(b*c)+/-d instruction should help pack those operations in.

rogue 2005-11-07 13:37

[QUOTE=fatphil]I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us.[/QUOTE]

My experience with Shark is limited, but I did find this:

Remote - a CHUD remote client program calls chudStartRemotePerfMonitor()/chudStopRemotePerfMonitor() to start/stop sampling. If you are using shark over a remote login (rlogin, telnet, ssh, etc.) connection you will need to run it and the remote client from the same session.

at [URL=http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/shark.1.html]Apple[/URL]

ewmayer 2005-11-07 19:42

[QUOTE=fatphil]I pulled up 'shark' for the first time yesterday, and it seems quite informative. I have no idea if it can be used over an SSH session, as it seems gooey only. Perhaps Mark can enlighten us.
I think that a 3x25bit pure FPU version may have some potential, the a=+/-(b*c)+/-d instruction should help pack those operations in.[/QUOTE]
Using balanced-digit representation and IEEE doubles one can go up to 26 bits per input for vector lengths up to 4, i.e. 78 bits for 3-vector inputs and 104 bits for 4-vector - I've only coded the former to date, since those are most relevant for GIMPS trial-factoring work.

I'll probably try a G5 build later this week - my plan is to do timings of pure-64-bit-int vs. pure-float versions on various platforms, and then design routines that interleave various numbers of modpow operations of each type in an attempt to keep both the IU and FPU pipelines as full as possible, while having both sets of parallel powerings finish in roughly the same number of total cycles. For instance on Alpha/Itanium/AMD64 a 2:1 ratio of int/float modpows looks about right (because the former are so fast), whereas on the G5 the ratio will probably be inverted, because 64-bit integer mul has 1/3 the throughput.

I must say, the AMD64 has been a pleasant revelation in this regard - here I was working under the mistaken assumption that only the Opteron version of that chip can do screamingly fast 64x64=>128-bit MULs, when in fact all AMD64s can.

fatphil 2005-11-08 09:50

[QUOTE=ewmayer]Using balanced-digit representation and IEEE doubles one can go up to 26 bits per input for vector lengths up to 4, i.e. 78 bits for 3-vector inputs and 104 bits for 4-vector[/QUOTE]

Call me old-fashioned, but I like fingers at the ends of my limbs!
For limbs a,b,c, it's nice to be able to exactly represent (2*a*c+b^2) without exceeding the precision of the FPU.

[QUOTE=ewmayer]
- I've only coded the former to date, since those are most relevant for GIMPS trial-factoring work.

I'll probably try a G5 build later this week - my plan is to do timings of pure-64-bit-int vs. pure-float versions on various platforms, and then design routines that interleave various numbers of modpow operations of each type in an attempt to keep both the IU and FPU pipelines as full as possible, while having both sets of parallel powerings finish in roughly the same number of total cycles. For instance on Alpha/Itanium/AMD64 a 2:1 ratio of int/float modpows looks about right (because the former are so fast), whereas on the G5 the ratio will probably be inverted, because 64-bit integer mul has 1/3 the throughput.[/QUOTE]

FP is 2/clock versus int's 2/6 clocks, i.e. 6:1. But the "area" of an FPU mulmod is much less (maybe even 1/6th). However, the madd instructions let you do 2 things at once... Personally, I can't call it - whatever the benchmarks say is the truth!

[QUOTE=ewmayer]
I must say, the AMD64 has been a pleasant revelation in this regard - here I was working under the mistaken assumption that only the Opteron version of that chip can do screamingly fast 64x64=>128-bit MULs, when in fact all AMD64s can.[/QUOTE]

Indeedy. pretty OK chip. Maybe my chrissy present to myself...

Phil

fatphil 2005-11-08 10:36

[QUOTE=fatphil]Call me old-fashioned, but I like fingers at the ends of my limbs![/QUOTE]

D'oh! 3*2^50 works, you're right. My mistake. I was getting confused by the more general case of non-power-of-2 radices. (the kinds of bignums I normally deal with, where the carries need to be approximated and rounded.) However, you've really got to make sure you're in the right rounding mode, or you'll have some nasty fixups to do, I think.

jasonp 2005-11-08 13:57

[QUOTE=ewmayer]Using balanced-digit representation and IEEE doubles one can go up to 26 bits per input for vector lengths up to 4, i.e. 78 bits for 3-vector inputs and 104 bits for 4-vector - I've only coded the former to date, since those are most relevant for GIMPS trial-factoring work.
[/QUOTE]
There's another handy trick on PPC hardware, in case you haven't considered it: For <= 50-bit integers in FPU registers a and b, c = a * b and d = (c - a*b) yield a pair of values where c + d = the full 100-bit product. I don't know if the floats in the vector unit are also required to support the full double-size product internally when computing a multiply-add operation. Using a pair of doubles instead of a vector of floats may get you higher throughput for an inherently serial operation like a modmul.

jasonp

fatphil 2005-11-08 14:41

[QUOTE=jasonp]There's another handy trick on PPC hardware, in case you haven't considered it: For <= 50-bit integers in FPU registers a and b, c = a * b and d = (c - a*b) yield a pair of values where c + d = the full 100-bit product. I don't know if the floats in the vector unit are also required to support the full double-size product internally when computing a multiply-add operation. Using a pair of doubles instead of a vector of floats may get you higher throughput for an inherently serial operation like a modmul.

jasonp[/QUOTE]

This is MADD! I love it. It changes _everything_!
(OK, everything that was on my next 2 or 3 weeks' to-do list.)

I better ask - does it guarantee <0.5 ulp?

Phil

ewmayer 2005-11-08 19:36

[QUOTE=fatphil]Call me old-fashioned, but I like fingers at the ends of my limbs![/quote]I have a friend who lost a fingertip in a home-woodworking accident several years ago. The joke among us ever since has been that he's suffering from roundoff error. Since it was his index finger rather than, say, the pinkie, it truly was a significant digit. ;)

[quote]For limbs a,b,c, it's nice to be able to exactly represent (2*a*c+b^2) without exceeding the precision of the FPU.[/QUOTE]As you've already noted further down, balanced-digit and base-2^26 means the magnitudes of a,b,c,... are all <= 2^25, so that's not a problem. However, at least in my code I have one additional constraint, namely that all integer input operands are strictly nonnegative. Since I don't want the digit-balancing to create a carry out of the highest-order input term (that would increase my vector length by one), in my implementation the MSW remains unbalanced (like its owner, some would say ;), so in the above, |a| and |b| are both <= 2^25 but |c| <= 2^26. Thus |2*a*c+b^2| <= 2^52 + 2^50, which still leaves some headroom below the 53-bit limit. For 4-vector inputs, we have a,b,c all with magnitudes <= 2^25, the MSW d is <= 2^26, and the worst-case polynomial-product coefficient in the case of a squaring is 2*a*c+2*b*d. This nominally is <= 2^52 + 2^51 in magnitude, which is again no problem. (Note that we don't need to take advantage that the term in question is an exact multiple of 2 for this to be the case, i.e. general multiply works, too.)

In fact, it appears that this scheme with input magnitude bounds of 2^{25,25,...,26} should work all the way up to 6-vector inputs, i.e. inputs as large as 156 bits, as long as the hardware properly supports 53-mantissa-bit IEEE doubles. Of course for polynomials of length 4 and above it becomes worthwhile to consider an optimized Nussbaumer-style short-length convolution rather than straight grammar-school, and for the latter the input bounds may be slightly less lenient - I haven't done the detailed math, though, so right now I'm just talking off the top of my head w.r.t. the latter.

[quote=jasonp]There's another handy trick on PPC hardware, in case you haven't considered it: For <= 50-bit integers in FPU registers a and b, c = a * b and d = (c - a*b) yield a pair of values where c + d = the full 100-bit product. I don't know if the floats in the vector unit are also required to support the full double-size product internally when computing a multiply-add operation.[/quote]Yes, that's definitely a neat little trick - it'll be interesting to see how large a variety of MADD-capable hardware supports it!

jasonp 2005-11-09 13:59

[QUOTE=fatphil]This is MADD! I love it. It changes _everything_!
(OK, everything that was on my next 2 or 3 weeks' to-do list.)

I better ask - does it guarantee <0.5 ulp?
[/QUOTE]
My understanding is that the result is always exact: if a + b makes the full 100-bit product, and 'a' is implicitly the 'high half', then if 'a' is too large by one, b is negative. Remember that since the inputs to the multiply started off represented exactly, as integers, a + b will be an integer too. It's when you perform things like multiplying by an inverse to simulate a division that you start having to deal with roundoff. For a modmul, it doesn't matter if 'a' is an integer or not, since it's only there to be multiplied by an inverse. You use less precision than is available to allow a few bits for unambiguous rounding when producing the quotient from the high half.

A while ago I experimented with using this technique to implement 50-bit modular multiplication. On the G5 a single modmul takes ~50 clocks, and I wrote a massively pipelined vectorized version that is supposed to finish one 50-bit modmul, with arbitrary modulus, in 7 clocks. Unfortunately the actual throughput is 12.5 clocks. For reference, years ago I wrote an all-integer vectorized modmul on the Alpha that sustained one modmul, with 62-bit modulus, in 6 clocks. Of course, the G5 has a much faster clock and so ended up faster in real-world terms.

Phil, before you spend as much time on this as I did, the following is a reference modmul. The quantity 'magic' is 3*2^51. n has an upper limit of about 2^50; I've tried higher and it occasionally doesn't work.

[code]
#include <ppc_intrinsics.h>

double fast_modmul(double a, double b, double n,
                   double recip_n, double magic) {

    double q, r, nq_hi, nq_lo;

    q = a * b;                      /* compute high 53 bits of a * b */
    q = __fmadd(q, recip_n, magic); /* multiply by the reciprocal */
    q = q - magic;                  /* round to nearest integer */

    nq_hi = n * q;
    nq_lo = __fmsub(n, q, nq_hi);   /* nq_hi + nq_lo = n * q exactly */

    r = __fmsub(a, b, nq_hi);
    r = r - nq_lo;                  /* compute a * b - n * q */

    q = r + n;
    return __fsel(r, r, q);         /* return r, or r+n if r < 0 */
}
[/code]

Major bonus points to anybody who can figure out how to manipulate the FPU rounding mode so that the final correction at the end isn't necessary. Just setting the rounding mode to truncate the quotient still produces a negative remainder sometimes.

Happy crunching,
jasonp

fatphil 2005-11-09 14:25

[QUOTE=jasonp]
Phil, before you spend as much time on this as I did
[/QUOTE]

Too late :-)

Last night I came up with 3 slightly different routines, including one looking identical to yours, each of which could possibly be fastest depending on quantity of unrolling and related register usage. Only testing would tell me which would be best in reality...
However, all are untested as I don't have access to my G5 presently, I simply scribbled some example code on my craptop. It'll be 2 weeks before I get to see my G5 again.

[QUOTE=jasonp]
Major bonus points to anybody who can figure out how to manipulate the FPU rounding mode so that the final correction at the end isn't necessary. Just setting the rounding mode to truncate the quotient still produces a negative remainder sometimes.

Happy crunching,
jasonp[/QUOTE]

Is the routine to set the rounding mode in the ppc_intrinsics header file?
I had assumed no correction would be necessary, I shall look into it when I do finally power up gcc/gdb on the old crate. Of course, the correction isn't necessary a large proportion of the time for what most people are doing, I'd bet.

Oh - props for msieve - beautiful code.

jasonp 2005-11-09 15:40

[QUOTE=fatphil]Too late :-)

Last night I came up with 3 slightly different routines, including one looking identical to yours, each of which could possibly be fastest depending on quantity of unrolling and related register usage. Only testing would tell me which would be best in reality...
[/QUOTE]
These algorithms can be especially useful on the G4, where an integer modmul would have excruciating latency. Let me know if you'd be interested in a vectorized version that uses software pipelining.

I forgot to mention in my previous message that because the FPU handles signed inputs automatically, a or b can be negative, and can also exceed the modulus (as long as intermediate products can be represented exactly). This means that normalization isn't necessary if the above routine is the core of a modular exponentiation. Also, for things like integer FFTs where the input is the result of a subtraction, you don't have to normalize the input before going straight into the multiply code.

[QUOTE]
Is the routine to set the rounding mode in the ppc_intrinsics header file?
I had assumed no correction would be necessary, I shall look into it when I do finally power up gcc/gdb on the old crate. Of course, the correction isn't necessary a large proportion of the time for what most people are doing, I'd bet.

Oh - props for msieve - beautiful code.[/QUOTE]
Thanks, it was great fun to write.

I never looked seriously at the intrinsics for rounding mode; asm() and a powerpc manual were sufficient :)

jasonp

ewmayer 2005-11-09 20:00

Thanks for the sample code, Jason - nice to see that vendor-provided intrinsics files are becoming more and more common - less futzing around with ASM macros, which is especially useful in early stages of code development.

I'm working out how all this can be used to speed up a >64-bit modmul - using pairs of 50-bit floats one could handle moduli of 100 bits (101 using balanced-digit for the LSW of each pair), which would be a nice alternative (or complement) to, say, a 96 or 128-bit pure-integer modmul. In that context the possibly negative lower-half outputs you mention are not a problem - unless one considers getting the digit-balancing for free a problem, that is.

ValerieVonck 2005-11-27 10:58

Anyone got a Windows 32 bit build?

Best Regards.

gribozavr 2006-02-10 20:26

Win32 binaries are available here: [url]http://ohmdpp.5gigs.com/files/mfactor/Mfactor_i686_win32_bin.tar.gz[/url]
32-bit linux binaries: [url]http://ohmdpp.5gigs.com/files/mfactor/Mfactor_i686_linux_bin.tar.gz[/url]

ValerieVonck 2006-02-10 20:42

Thnx!

em99010pepe 2006-02-11 17:48

[quote=gribozavr]Win32 binaries are available here: [URL="http://ohmdpp.5gigs.com/files/mfactor/Mfactor_i686_win32_bin.tar.gz"]http://ohmdpp.5gigs.com/files/mfactor/Mfactor_i686_win32_bin.tar.gz[/URL]
32-bit linux binaries: [URL="http://ohmdpp.5gigs.com/files/mfactor/Mfactor_i686_linux_bin.tar.gz"]http://ohmdpp.5gigs.com/files/mfactor/Mfactor_i686_linux_bin.tar.gz[/URL][/quote]

Thanks.

Carlos

Joshua2 2006-03-21 05:33

How does the speed of mfactor compare to factor4 on 100 million or billion digit numbers on 32 bit windows athlon non-sse2? If this is too specific then what is known about its speed? Should I switch from using factor4?

T.Rex 2006-03-21 20:44

Source code of MFactor
 
[QUOTE=ewmayer]... my Mfactor mersenne-sieving code ...[/QUOTE]I cannot find the source code. Where is it?
It does not appear in this [URL="http://www.mersenneforum.org/showthread.php?t=3255"]thread[/URL].
Thanks,
T.

Joshua2 2006-03-22 04:44

[QUOTE=Joshua2]How does the speed of mfactor compare to factor4 on 100 million or billion digit numbers on 32 bit windows athlon non-sse2? If this is too specific then what is known about its speed? Should I switch from using factor4?[/QUOTE]
I did some limited testing and it appears that Mfactor is significantly faster. Is it ok to use? (i.e. not missing factors)

ewmayer 2006-03-22 23:43

[QUOTE=Joshua2]I did some limited testing and it appears that Mfactor is significantly faster. Is it ok to use? (i.e. not missing factors)[/QUOTE]
I've just posted an updated Win32 binary (see link in post at top of thread), which fixes a bug that could cause some factors to be missed. The bug only affects the floating-based factoring routines, i.e. only the Win32 build had this problem.

Dmitri (a.k.a. gribozavr), please update any of your Win-binary links (and local copies of the Win32 binary) as needed.

Happy hunting,
-Ernst

ewmayer 2006-04-06 19:29

Tony Reix just brought to my attention a bug in the sieve-bit-clearing code inside Mfactor: if the NUM_SIEVING_PRIME value used by the code is such that the largest sieving prime is >= the Mersenne exponent p, the sieve-clearing routine winds up sending a zero argument to the 32-bit extended-GCD routine used to calculate sieve bit offsets, causing an error. For instance, invoking the code with

Mfactor -m 100003 -bmin 10 -bmax 50

gives the following error:

[code]
INFO: No factoring savefile t100003 found ... starting from scratch.
searching in the interval k=[0, 1475695268087040], i.e. q=[1.000000e+00, 2.951479e+20]
each of 16 (p mod 60) passes will consist of 90332172 intervals of length 272272
ERROR: eGCD called with zero input: x = 0, y = 100003
ERROR: At line 825 of file ../util.c
Assertion failed: 0
[/code]

The fix is simple: in such cases the sieve-clearing routine should just skip over p as it processes the sieving primes (since p can never divide a factor candidate of the form q = 2*k*p+1 anyway), and this will be included in the next release. In the meanwhile, you should be able to run any prime p > 1299721 (the maximum sieving prime used when NUM_SIEVING_PRIME = 100000, as in most of the builds I posted; if you happen to have a build with the default value of NUM_SIEVING_PRIME = 50000, p must be > 611957) without problems. I'd been so focused on large p's that I neglected to run some small ones.

S00113 2006-04-13 16:12

Mfactor on Linksys WRT54G wireless router
 
I compiled Mfactor (an old version I found on the web) for my Linksys WRT54G router, and it works! The router runs Linux, and I have changed to the OpenWRT distribution to get a more user-friendly version. (At least it is more friendly to old unix users like me.:smile:) The CPU is a BCM4712 which runs the MIPS32 instruction set at 216MHz. The router also has 16 MiB RAM and 8 MiB flash memory. Frequency, RAM and amount of flash vary between versions, and not all versions run Linux. Go to [URL]http://www.openwrt.org/[/URL] for more information on this and other wireless routers with similar hardware.

This version limits factoring to 60 bits with 32bit integer arithmetic. Here is the result of one of my first tests (the 59 bit factor is known):

[CODE]root@line:~# time ./Mfactor -h factorlog 77485997 3 1152921504606846975
# Mersenne factorer. Author: Peter L. Montgomery, pmontgom@cwi.nl
# $Id: Mfactor.c,v 1.2 1999/03/24 05:45:50 pmontgom Exp pmontgom $
# Compiled by GCC on 32-bit MIPS under IRIX.
sh: hinv: not found
# ARITHMETIC_MODEL = ARITHMETIC_INT32
# NUM_SIEVING_PRIME = 34567, SIEVE_LENGTH_BITS = 204800
# SIEVE_STEP_BIT = 840
# TRYQ_BLOCKING = 3, USING_ASSEMBLY = 0, USING_POPULATION_COUNT = 0
# VECTOR = 0, SHORT_VECTOR_LENGTH = 20
# Type sieving_prime_t has 32 bits.
# Type wsieve_t has 32 bits, of which all 32 are used.
# Type natural_t is signed and has 32 bits.
# Type uint32 has 32 bits. Type uint64 has 64 bits.
# QBITSMAX = 60
#
# Invoked as ./Mfactor -h factorlog 77485997 3 1152921504606846975
# Running on line.(none) at Thu Apr 13 14:04:57 2006
#
# p = 77485997. Searching q in [154971995, 1152921504488807235]
# The isieve loop in tryp sieves an interval of length
# 13330071035904000 = 1.333e+16 in NSIEVE = 96 pieces.
# It will need 86.49 passes to search from 154971995 to 1152921504488807235.
# 34567-th sieving prime is 409639. 9791 are below SIEVE_LENGTH_BITS/2,
# 8562 more below SIEVE_LENGTH_BITS, 16214 above.
M( 77485997 )C: 1152331426770601081 # line.(none) @ Thu Apr 13 15:26:51 2006
# 323699721 values of q tried out of 1700467991 sieved.
# 80.96% of potential q's (mod 840*p) were eliminated by sieving.
M( 77485997 )U: 1152921504488807235 # line.(none) @ Thu Apr 13 15:27:09 2006
real 1h 22m 12s
user 1h 21m 38s
sys 0m 33.76s
[/CODE]

Next I'll try to make it faster by adjusting various defines in the source. Are there newer versions of Mfactor out there with new optimizations which may work on 32bit hardware?

Next project is mersfacgmp, which is able to factor higher and talk to a server, but mersfacgmp is much harder to cross-compile.

ewmayer 2006-04-13 17:50

Sorry about the nomenclatural confusion (Peter Montgomery and I have last names beginning with the same letter, after all), but that's a different Mfactor than the one being discussed in this thread. Peter's code is very fast on a small set of 64-bit hardware (Alpha and SGI MIPS) for which Peter wrote custom assembly-code subroutines for key operations, but be aware that that code can go only up to 63 bits, and even less on some platforms (as you note). It should be fine for use on slower hardware, though, since you're not going to want to factor very deep there to begin with.

To reiterate: although the two codes are intended for the same purpose and use similar underlying algorithms, the implementations are completely independent, i.e.

[b]M(ayer)factor != M(ontgomery)factor[/b]

As to the question about optimizations for 32-bit hardware, I have no plans for my Mfactor code along these lines - just not worth the effort. Your same test example to 60 bits needs around a minute on decent 64-bit hardware running around 2GHz - even factoring in the clock speed difference that's nearly an order of magnitude faster in terms of per-cycle performance. My intention is not to discourage you from playing with this, but just to say that optimizing for 32-bit isn't where I'll be spending my own time.

Oh, here's the on-screen output from my run, just to illustrate that we're talking about different codes:

[code]
Mfactor build flags:
TRYQ = 4
THREE_OP128 = FALSE
NUM_SIEVING_PRIME = 100000
USE_FLOAT not defined
USE_FMADD not defined
FACTOR_STANDALONE = 1
FAC_DEBUG = 0
DBG_SIEVE not defined
NOBRANCH = 1
QUIT_WHEN_FACTOR_FOUND not defined
USE_65BIT not defined
USE_128x96 = 1
P2WORD not defined
P3WORD not defined
P4WORD not defined
Mfactor self-tests:
Testing 63-bit factors...
Testing 64-bit factors...
Testing 65-bit factors...
Testing 96-bit factors...
Factoring self-tests completed successfully.
INFO: No factoring savefile t77485997 found ... starting from scratch.
searching in the interval k=[0, 7449361920], i.e. q=[1.000000e+00, 1.154442e+18]
each of 16 (p mod 60) passes will consist of 456 intervals of length 272272
max sieving prime = 1299721
Time to set up sieve = 00:00:00.080
pass = 0....
pass = 1....
pass = 2....
pass = 3....
pass = 4....
pass = 5....
pass = 6....
pass = 7....
pass = 8....
pass = 9....
pass = 10....
pass = 11....
pass = 12....
pass = 13....
pass = 14....
pass = 15....M77485997 has a factor: 1152331426770601081. Program: E2.8x

M77485997 has 1 factors in [1.000000e+00, 1.154442e+18], passes 0-15
Performed 299545084 trial divides
checksum1 = 03A973A58DED4EBE
Clocks = 00:01:12.439
73.910u 0.013s 1:14.64 99.0% 0+0k 0+0io 17pf+0w[/code]

S00113 2006-04-17 21:17

[QUOTE=ewmayer]Sorry about the nomenclatural confusion (Peter Montgomery and I have last names beginning with the same letter, after all), but that's a different Mfactor than the one being discussed in this thread.[/QUOTE]
I should have guessed that. I have used your Mfactor before on AMD64, and noticed the output was very different. Great program, btw - it once found a 70-bit factor for me.
[QUOTE]As to the question about optimizations for 32-bit hardware, I have no plans for my mfactor code along these lines - just not worth the effort.[/QUOTE]
Completely understandable. I'm just doing this for fun; it is not high priority in any way.
[QUOTE]Your same test example to 60 bits needs around a minute on decent 64-bit hardware running around 2GHz - even factoring in the clock speed difference that's nearly an order of magntiude faster in terms of per-cycle performance. My intention is not to discourage you from playing with this, but just to say that optimizing for 32-bit isn't where I'll be spending my own time.
[/QUOTE]
My intention with my wireless router is to have fun and to fill gaps in factoring: find more factors of exponents with known small factors, double-check factoring from clients with bugs which may have caused them to miss a factor, etc. Trial factoring such composites is low-priority work. Most, probably all, of the factors my slow wireless router finds will already have been found by P-1, ECM or by trial factoring because the client wasn't buggy after all. I wouldn't use my precious 64-bit CPUs on this. Old 386s, 486s, PentiumIIIs, SGI Indys, DecStations and my cheap wireless router will do the job just fine while our modern machines use their resources on heavier crunching.

ewmayer 2006-04-18 19:05

[QUOTE=S00113]My intention with my wireless router is to have fun and to fill gaps in factoring: find more factors of exponents with known small factors, double-check factoring from clients with bugs which may have caused them to miss a factor, etc. Trial factoring such composites is low-priority work. Most, probably all, of the factors my slow wireless router finds will already have been found by P-1, ECM or by trial factoring because the client wasn't buggy after all. I wouldn't use my precious 64-bit CPUs on this. Old 386s, 486s, PentiumIIIs, SGI Indys, DecStations and my cheap wireless router will do the job just fine while our modern machines use their resources on heavier crunching.[/QUOTE]
Well have fun, and keep us posted on your progress! As you say, since we're not talking the "race for the cure" for some disease here, it's more about having fun and learning interesting/useful things than anything else.

jinydu 2006-11-03 08:05

[url]http://home.earthlink.net/~elevensmooth/Billion.html[/url]

(I thought it would be useful to post this link in a Sticky thread)

