mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2019-12-06, 20:34   #1
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10011001110012 Posts
Default Ernst's Mfactor program

This thread is being started as a place to put public discussion of how to build, use, or improve the Mfactor program whose source is included with the source of Mlucas. I'm imagining this thread would encompass general considerations, and cpu-specific considerations, and that another thread, probably in the gpu-computing subforum, would address gpu-specific considerations (and include an early cross reference link to this thread).
After things shake out a bit I'll probably condense some of it into a separate reference thread. Or two.
kriesel is offline   Reply With Quote
Old 2019-12-06, 23:00   #2
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

3×53×73 Posts
Default

Ken, thanks for the thread - I will post "current status of the code" note here once I've done a few builds with various compile flags to shake out any issues in the current codebase. So, using the current Mlucas v19 codebase, let's get started:

o You need to renamed your local factor.c.txt file back to the buildable factor.c, and disable the little chunk of test code starting at line 1908 of that file by changing the #if 1 to #if 0;

o I already found 1 small code-fiddle that is needed, to un-break the "factor found" probable-prime testing call: line 5230 of mi64.c needs to be changed to

if(q) ASSERT(HERE, lenQ != 0x0, "If quotient requested, quotient-length pointer must be provided!");

That allowed me to build and run my standard quicktest, which reproduces one of the factors of mm31 - I successfully used both gcc and clang on my Macbook, the run needs around 2 min, this again is single-threaded:

cd ~/mlucas/src/obj_mfac
gcc -c -Os ../get*c && rm get_preferred_fft_radix.o
gcc -c -Os ../imul_macro.c ../mi64.c ../qfloat.c ../rng_isaac.c ../two*c ../types.c ../util.c
gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 ../factor.c
gcc -o Mfactor *o -lm
time ./Mfactor -m 2147483647 -bmax 68 -passmin 15

Here, -DTRYQ=4 activates the 4-way pure-integer modexp codepath, for which there is optimized x86_64 inline-asm. Try it yourselves!

Playing with the multithreaded build option now ... more on that, the various SIMD/floatig-point-mod and nVidia GPU build options later.

Last fiddled with by ewmayer on 2019-12-07 at 02:19
ewmayer is offline   Reply With Quote
Old 2019-12-07, 01:53   #3
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×19×37 Posts
Default

An update in factor.c, so obvious, even I can see it::
Code:
/*...Known Mersenne prime exponents. This array must be null-terminated.    */
const uint32 knowns[] = {2,3,5,7,13,17,19,31,61,89,107,127,521,607,1279,2203,2281,3217,4253,4423,9689,9941
    ,11213,19937,21701,23209,44497,86243,110503,132049,216091,756839,859433,1257787,1398269,2976221,3021377,6972593
    ,13466917,20996011,24036583,25964951,30402457,32582657,37156667,42643801,43112609,57885161,74207281,77232917
    ,82589933,0x0};

Last fiddled with by kriesel on 2019-12-07 at 01:54
kriesel is offline   Reply With Quote
Old 2019-12-07, 19:48   #4
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2D5716 Posts
Default

@Ken: Thanks for the reminder, that array is used to check for valid exponents for the double-Mersenne TFing option, which I'll describe later, when I get around to detailing how to build for TFing beyond 96 bits.
---------------

Note re. the sample quick-test case: The default build mode (at least for non-GPU builds) divides the TFing into 16 separate passes (numbered 0-15), each of which uses candidates from one of the 16 permissible (mod 60) residue classes based on the particular Mersenne exponent used. For the mm31 testcase (p = 2147483647) the factor in question happens to be found on the last of the 16 paases, so rather than do all 16 passes (-passmin 0, the -passmax 15 can be omitted if one wants to do all passes from the passmin argument through 15), or some subrange not necessarily ending in pass 15 (-passmin low -passmin high), we simply start on pass 15, which is the last-possible one.

Here is the actual-TFing part of the output for the 64-bit-int-mul-based build, on 1 core of my 2GHz Core2Duo Macbook - one can use the count of trial divides to compute a decent approximate cycle count per candidate, lumping the sieving work together wih the modexp computations:

Pass = 15............................
Factor with k = 56474845800. This factor is a probable prime.
.....
M(2147483647) has 1 factors in range k = [0, 68726898240], passes 15-15
Performed 171253887 trial divides
User time: 128 sec

Next, the code also supports a floating-double-based modexp build option, which allows factor candidates up to 78 bits, and splits each such into three 26-bit pieces. The only reason one would use this build option in scalar-double (non-SIMD) mode is if one's target hardware had dismally slow integer multiply; the scalar-double code was really developed to lead the way to SIMD-based computation, since in the past 15 years the major hardware manufacturers have put disproportionately more silicon into their SIMD FPUs. For example, take 64-bit integer multiply - x86_64 is very fast at this, relative to its double-mul capability. In the SIMD, on the other hand, none of the Intel/AMD offerings has an instruction to compute the high 64 bits of a 64x64-bit double-wide integer product. In the latest-greatest avx-512 hardware Intel added an instruction to compute the upper 52 bits of such a product, by leveraging the already-present FMA capability - more on this wide-integer-product trick using doubles and FMA in a later post.

So I'll jump right to the SSE2 modpow, which is double-float based, with available TRYQ (factor-at-a-time-to-try) counts 2,4,8 ... larger values have better instruction latency hiding, but note TRYQ=8 is only available on SSE2 systems which support the ROUND instruction, i.e. starting with the last generation of the Core2 architecture ... my Core2Duo Macbook does not support ROUND and thus throws an Illegal instruction exception when built with TRYQ=8. (For AVX and later this is not an issue, but those code paths are discussed further on). This needs rebuild of 2 files, the setup here is a little dumb, it needs both USE_FLOAT and USE_SSE2 to be defined. As I noted, TRYQ=4 was the fastest viable option on my Core2:

clang -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DUSE_FLOAT -DUSE_SSE2 ../factor.c ../twopmodq80.c

On my Core2Duo Macbook the above build with TRYQ=2 is 2.5x slower that the pure-64-bit-int-based one, and TRYQ=4 is ~50% slower:

TRYQ=2: 317 sec
TRYQ=4: 203 sec

In order to test out TRYQ=8 I could do an SSE2-enabled build on my Haswell, but there seems little point since there we have avx and avx2 to play with. Based on the above trend, SSE2 with TRYQ=8 looks like it might be close to speed-competitive with the 64-bit-int build, but I doubt it would be faster. More on AVX/AVX2 in my next post.

Last fiddled with by ewmayer on 2019-12-07 at 19:50
ewmayer is offline   Reply With Quote
Old 2019-12-07, 22:24   #5
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·19·37 Posts
Default test hardware

I have a motley assortment of old gear available for test including plain pentium, Pentium MMX, Pentium II, Pentium III, some old 32-bit laptops, twin systems containing Core2 Duo E8200 (Vista and Win 7 Pro) which is the rough minimum to be worth the electrical heating, and a few flavors of dual-Xeons, a mix of Windows 7 & 10, also i3-370M, i7500U, and i-8750H. Only very few of this assortment have an msys2/mingw build environment. A Xeon E5645 Win7 system does (mostly for gpuowl builds).
It's hard to tell what Mlucas or Mfactor flags correspond to what processor models.

Last fiddled with by kriesel on 2019-12-07 at 23:12
kriesel is offline   Reply With Quote
Old 2019-12-07, 22:40   #6
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×19×37 Posts
Default some MFactor past history threads

Thread re Ernst's Mfactor program, primarily if not entirely cpu oriented, begins late 2005:
https://www.mersenneforum.org/showthread.php?t=4939

Another thread, re getting set up for CUDA gpu MFactor compiles on linux, begins late 2014, terminating in a batch of questions Ernst had about efficient gpu implementation:
https://www.mersenneforum.org/showthread.php?t=19457
Since that's from very early 2015, nearly 5 years ago, some of them may no longer be relevant. It's probably best to leave it alone for now, until Ernst has a chance to review/revise the list. https://www.mersenneforum.org/showpo...0&postcount=67


Are there more?

Last fiddled with by kriesel on 2019-12-07 at 23:34
kriesel is offline   Reply With Quote
Old 2019-12-08, 04:49   #7
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

3,037 Posts
Default

I noticed that Mfactor from MLucas 19.1 says p<250 while Mfactor from Mlucas 14.1 can do up to 2kp+1 < 2128, I found a factor of a 98.6 bit exponent.

I compiled Mfactor 19.1 i Windows with MSYS2 but got some errors when trying to run your test:

Code:
gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 get*c imul_macro.c mi64.c qfloat.c rng_isaac.c two*c types.c util.c factor.c
rm get_preferred_fft_radix.o
gcc -o Mfactor *o -lm

mfactor -m 2147483647 -kmin 1 -kmax 68726898240 -passmin 15 -passmax 15

TRYQ = 4, max sieving prime = 1299721
Time to set up sieve = 00:00:00.025
pass = 15............................ERROR: at line 5230 of file mi64.c
Assertion failed: Either both or neither of quotient-array and quotient-length pointers must be provided!
Code:
gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DUSE_FLOAT -DUSE_SSE2 get*c imul_macro.c mi64.c qfloat.c rng_isaac.c two*c types.c util.c factor.c
rm get_preferred_fft_radix.o
gcc -o Mfactor *o -lm

mfactor -m 2147483647 -kmin 1 -kmax 68726898240 -passmin 15 -passmax 15

Testing 63-bit factors...
Testing 64-bit factors...
Testing 65-bit factors...
Testing 96-bit factors...
ERROR: at line 1366 of file factor_test.h
Assertion failed: Approx and exactly-computed k differ!

Last fiddled with by ATH on 2019-12-08 at 04:52
ATH is offline   Reply With Quote
Old 2019-12-08, 20:13   #8
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

265278 Posts
Default

@Andreas:
Quote:
Originally Posted by ATH View Post
I noticed that Mfactor from MLucas 19.1 says p<250 while Mfactor from Mlucas 14.1 can do up to 2kp+1 < 2128, I found a factor of a 98.6 bit exponent.
Where are you getting that 50-bit limit ... can you copy the warnig/assertion here?
Re. 1st error, see my code-fiddle notes in post #2 for the needed 1-line fix in mi64.c.
Re. 2nd error, don't see that in my sse2 builds ... if you have gdb, recompile factor.c and util.c with -O1 -g3 -ggdb, set a breakpoint at util.c:100 and examine the variable values involved in the failure.

@Ken: Thanks for the GPU-build thread links ... my file with GPU build notes and instructions was victim of my Macbook HD-failure a few years back (I had all the main Mlucas-dev stuff backed up in various places, but a few valuable "notes" files got lost), am currently trying nvcc-compiles of a few source files, those old links may be helpful.

Last fiddled with by ewmayer on 2019-12-08 at 20:21
ewmayer is offline   Reply With Quote
Old 2019-12-08, 23:35   #9
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

492110 Posts
Default

Quote:
Originally Posted by ewmayer View Post
@Andreas:

Where are you getting that 50-bit limit ... can you copy the warnig/assertion here?
Re. 1st error, see my code-fiddle notes in post #2 for the needed 1-line fix in mi64.c.
Re. 2nd error, don't see that in my sse2 builds ... if you have gdb, recompile factor.c and util.c with -O1 -g3 -ggdb, set a breakpoint at util.c:100 and examine the variable values involved in the failure.

@Ken: Thanks for the GPU-build thread links ... my file with GPU build notes and instructions was victim of my Macbook HD-failure a few years back (I had all the main Mlucas-dev stuff backed up in various places, but a few valuable "notes" files got lost), am currently trying nvcc-compiles of a few source files, those old links may be helpful.
I also had 50 bit exponent limit and 96 bit factor, in the available software tabulation, and could not figure out where I got it from. I gather from reading a bit of your Mfactor code, that the values will vary depending on the build defines and target hardware. I feel a table coming on eventually.
Re your build notes/instructions, ouch. Those, and your time, are valuable enough that I suggest including such documentation in a redundant backup scheme. Someone I used to supervise ran lots of servers and desktops but some important things went in his spiral bound paper notebook. Documentation is not useful if the server or drive it lives on is down!
kriesel is offline   Reply With Quote
Old 2019-12-09, 02:48   #10
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2D5716 Posts
Default

Quote:
Originally Posted by kriesel View Post
I also had 50 bit exponent limit and 96 bit factor, in the available software tabulation, and could not figure out where I got it from.
Should've looked in the (to me) obvious place first:

MacBook:src ewmayer$ grep 50 Mdata.h
128: #define MAX_BITS_P 50

There are smaller limits in at least one build type: CUDA/GPU builds require p < 2^32.
ewmayer is offline   Reply With Quote
Old 2019-12-09, 08:57   #11
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

492110 Posts
Default

Here's a limits table draft . I vaguely recall the CUDA factor limit being something like max k = 64 bits, so max factor candidate q would be 64+pbits+1.
Attached Files
File Type: pdf mfactor bits table.pdf (10.0 KB, 142 views)

Last fiddled with by kriesel on 2019-12-09 at 08:58
kriesel is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Official Ernst (ewmayer) / Richard (cheesehead) feud thread cheesehead Soap Box 50 2014-06-30 01:06
Official "Ernst is a deceiving bully and George is a meanie" thread cheesehead Soap Box 61 2013-06-11 04:30
Running Mfactor M0CZY LMH > 100M 2 2011-02-23 11:48
Mfactor sieve code thread ewmayer Operation Billion Digits 27 2006-11-03 08:05
which program? drakkar67 Prime Sierpinski Project 14 2005-11-29 06:25

All times are UTC. The time now is 14:12.

Fri Mar 5 14:12:27 UTC 2021 up 92 days, 10:23, 0 users, load averages: 3.07, 2.74, 2.74

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.