Old 2021-08-15, 23:09   #12
kriesel's Avatar
Mar 2017
US midwest

11^2×61 Posts
Tuning Mlucas V20

My first try with Mlucas V20.0 was in Ubuntu 18.04 LTS installed in WSL1 on Windows 10 Home X64 in an i7-8750H laptop.

These were run with mfaktc also running on the laptop's discrete GPU, nothing on the IGP, a web browser with active Google Colab sessions, and TightVNC remote desktop for all access. Prime95 was stopped and exited before the test.

Experimenting a bit with Ernst's posted readme guidance, I obtained the timings shown in the attachment. Since some of these cases would have run with only 100 iterations, and I may have affected them somewhat with some interactive use, I may rerun some of these, perhaps after the next update release. Timings in a bit of production running seem to have gradually improved. Possibly that relates to ambient temperature.

Top of reference tree:
Attached Files
File Type: pdf i7-8750H timings.pdf (40.7 KB, 135 views)

Last fiddled with by kriesel on 2021-08-31 at 23:57

Old 2021-09-06, 21:02   #13
Mlucas V20.1 timings on various hardware and environments, & prime95 compared

None of the following should be mistaken for criticism of anyone's efforts or results. Writing such software is hard. Making it produce correct outputs is harder. Making it fast and functional also on a variety of inputs, hardware, environments, etc, is harder still. Few even dare to try.

prime95 prevents running multiple instances in the same folder.
Mlucas does not prevent simultaneously running multiple instances on the same exponent in the same folder. Don't do that. It creates an awful mess.

Case 1: PRP DC 84.7M on i7-8750H (6 core, 12 hyperthread)
Mlucas v20.1 on Ubuntu 18.04 LTS atop WSL1 on Win10: nominal 4-thread 18 iters/sec; nominal 8-thread 29 iters/sec, so 47 iters/sec throughput for system as operated, potentially up to 54 iters/sec combined for 3 processes of 4-thread.

V29.5b6 prime95 benchmark on Windows 10 Home x64, same system: benchmarked all FFT lengths 2M-32M. For 84.7M, fft is 4480K; 88.7 to 93.4 iters/sec throughput. Best throughput is all 6 cores on one worker, which also gives minimum latency.

Mlucas v20.1/WSL1 performance observed is ~50 to 61% that of prime95/Win on this system. Note prime95 has subsequently improved speed in some aspects since the version benchmarked. TightVNC access and GPU app overhead were present in both sets of runs and should have been about constant.

Case 2: Dual e5-2697V2 (12 core & x2 HT) for wavefront PRP
V29.8b6 prime95 on Win10, benchmark 5760K fft length; best was 2 workers, 238. iters/sec throughput

Mlucas v20.1/WSL 8 thread, 5632K fft length, 15.09 ms/iter -> 66.3 iters/sec. Optimistically extrapolating to triple throughput for 24 cores, 198.8 iters/sec.
Other benchmarking showed a disadvantage to using all hyperthreads, versus 1 thread per processor core.

Mlucas V20.0/WSL 4 thread,
      5632  msec/iter =   29.25  ROE[avg,max] = [0.231956749, 0.312500000]  radices = 176 16 32 32  0  0  0  0  0  0
Corresponds to 1000/29.25 = 34.2 iter/sec
Optimistically extrapolating to 6x throughput for 24 cores, 205.1 iter/sec throughput.

Mlucas/WSL performance is ~83.5-86.2% of prime95 under favorable assumptions. Note that's in comparison to V29.8b6, not the current v30.6b4 prime95.
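
The Case 2 conversions can be reproduced with a few lines of Python. This is only a sketch of the arithmetic already shown above. The function names are illustrative and not part of Mlucas or prime95, and the linear scaling across instances is the same optimistic assumption made in the text.

```python
def iters_per_sec(ms_per_iter):
    """Convert a per-iteration timing in milliseconds to iterations/second."""
    return 1000.0 / ms_per_iter


def extrapolated_throughput(ms_per_iter, instances):
    """Optimistic assumption: throughput scales linearly with the number
    of identical instances run side by side."""
    return iters_per_sec(ms_per_iter) * instances


PRIME95_BEST = 238.0  # iters/sec, V29.8b6, 2 workers, 5760K fft (from the post)

# Mlucas v20.1, 8 threads, 15.09 ms/iter, tripled for 24 cores
mlucas_8t = extrapolated_throughput(15.09, 3)
# Mlucas V20.0, 4 threads, 29.25 ms/iter, times 6 for 24 cores
mlucas_4t = extrapolated_throughput(29.25, 6)

for label, value in (("8-thread x3", mlucas_8t), ("4-thread x6", mlucas_4t)):
    share = 100 * value / PRIME95_BEST
    print(f"{label}: {value:.1f} iters/sec, {share:.1f}% of prime95")
```

Real scaling across instances is likely to be worse than linear because of shared memory bandwidth, so these figures are upper bounds.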

Case 3: Hardware is i7-4770 (4 core & x2 HT) for wavefront PRP, 5-way test on dual-boot Win10/Ubuntu system
Mlucas V20.1/Ubuntu 20.04/WSL2/Win10 sandwich on Windows boot (primary) partition
      5632  msec/iter =   23.68  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000ms/sec / (23.68 ms/iter) = 42.23 iter/sec. Probably suffered somewhat from RDP, GPU apps running on Windows simultaneously.

Mlucas V20.1/Ubuntu 20.04 LTS boot on second partition on same system drive; 8 thread which showed advantage in WSL over 4 thread.
      5632  msec/iter =   15.98  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000ms/sec / (15.98 ms/iter) = 62.58 iter/sec. Much improved throughput over the WSL2 scenario above.

prime95 v30.6b4, usual RDP, GPU apps running, etc so some overhead load.
FFTlen=5600K, Type=3, Arch=4, Pass1=448, Pass2=12800, clm=2 (3 cores, 1 worker): 13.37 ms.
Throughput: 74.82 iter/sec.

prime95 v30.6b4, Windows 10 Pro x64, no RDP or GPU apps running
Timings for 5760K FFT length (4 cores, 1 worker): 11.31 ms.  Throughput: 88.43 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.88, 22.71 ms.  Throughput: 87.73 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 45.76, 44.74, 44.90, 44.29 ms.  Throughput: 89.06 iter/sec.
Average of the 3 worker counts, 88.41 iter/sec

mprime v30.6b4, Ubuntu 20.04, logged in at console, no GPU apps, no remote access, minimal overhead
Timings for 5760K FFT length (4 cores, 1 worker): 11.24 ms.  Throughput: 88.97 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.43, 22.48 ms.  Throughput: 89.06 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 44.65, 44.65, 44.60, 44.76 ms.  Throughput: 89.55 iter/sec.
Average of the 3 worker counts, 89.19 iter/sec (1.0088x Windows low-overhead run average throughput)
(Mprime/Linux max throughput 1.0055 of prime95/Windows max throughput)
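
The throughput figures above are the sum over workers of 1000 / (ms per iteration). A minimal Python sketch of that calculation follows, using the per-worker timings quoted in this post; the function and variable names are illustrative. Because the timings are rounded to 0.01 ms, the recomputed values can differ from the program-reported throughput in the last digit.

```python
def throughput(worker_ms):
    """Total iterations/second for workers running concurrently,
    given each worker's ms/iter."""
    return sum(1000.0 / ms for ms in worker_ms)


def mean_throughput(runs):
    """Average throughput over the tested worker counts."""
    return sum(throughput(t) for t in runs.values()) / len(runs)


# Per-worker timings (ms/iter), 5760K FFT length on the i7-4770
windows = {1: [11.31], 2: [22.88, 22.71], 4: [45.76, 44.74, 44.90, 44.29]}
linux = {1: [11.24], 2: [22.43, 22.48], 4: [44.65, 44.65, 44.60, 44.76]}

win_avg = mean_throughput(windows)
lin_avg = mean_throughput(linux)
print(f"prime95/Windows average: {win_avg:.2f} iter/sec")
print(f"mprime/Linux average:    {lin_avg:.2f} iter/sec")
print(f"Linux/Windows ratio:     {lin_avg / win_avg:.4f}")
```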

mprime and prime95 timings are very close for equalized system overhead, same hardware.
While there's essentially no speed advantage Linux vs Windows for prime95/mprime, there may be for Mlucas because of the core virtualization issue on WSL which is required to run Mlucas on Windows now. This should have less effect when the cores are fully loaded with enough Mlucas threads to occupy them all.

Mlucas/WSL performance 42.23/74.82 ~56.4% of prime95/Win10. Both sessions may have been negatively impacted by remote-desktop overhead.
Mlucas/Ubuntu performance 62.58/88.43 ~70.8% of prime95/Win10 single-worker. Both Win & Ubuntu timing were without remote desktop overhead or GPU apps.

Benchmarking experimental error unknown. Digitization error up to ~0.09%.

Mlucas can currently perform LL, PRP, and P-1 computations on higher exponents than any other GIMPS software known. Benchmark and estimate run times.

Last fiddled with by kriesel on 2021-09-06 at 21:31

Old 2021-09-17, 18:53   #14
Mlucas releases

This is an incomplete draft list.

2017-06-15 V17.0

2017-07-02? V17.1

2019-02-20 V18

2019-12-01 v19

2021-02-11 v19.1 ARMv8-SIMD / Clang/LLVM compiler compatibility

2021-07-31 V20.0 P-1 support; automake script

2021-08-31 V20.1 faster P-1 stage 2, some bug fixes, print refinements, new help provisions, corrected reference-residues, raised maximum Mp limits

tbd V20.2? minor cleanup such as labeling factor bits as bits, additional bug fixes; possibly resync mfactor variable types with shared routines typing from Mlucas

tbd V21? PRP proof generation planned

Last fiddled with by kriesel on 2021-09-19 at 13:41

Old 2021-09-17, 23:14   #15
Wish list

Features I'd like to see added in Mlucas. As always, it's the developer's call what actually happens; his time, his talent, his program. These are in when-I-thought-to-write-them order.

  1. PRP proof file generation by VDF. Preferably V2 format, such as prime95 and gpuowl produce. I think this is generally agreed to be the highest priority feature addition.
  2. ETAs in the .stat file output
  3. Jacobi symbol check for LL / LLDC running
  4. Solution for WSL-related core-hopping seen on Xeon Phi and elsewhere
  5. Solution for building native Windows executables
  6. Solution for building multithreaded native Windows executables
  7. Ability to accept a list of interim LL 64-bit residues from a parallel run for comparison at widely spaced iteration counts such as every 5M or 10M from previous runs, useful in DC / TC / new-discovery-verification
  8. Date/time stamps on first record of console or nohup.out output or upon restart in <exponent>.stat file, and on most other output
  9. Only one process can run per folder at a time
  10. Multiple-worker integration into a single process
  11. Segmenting a P-1 stage 2 run onto multiple processes or machines for running in parallel; this will be necessary for F33 P-1 stage 2, and may be useful on OBD P-1 also
  12. Total run time per P-1 stage and total of both stages and GCDs, output by program for more convenient benchmarking, run time scaling measurement.
  13. Change worktodo.ini to worktodo.txt
  14. Judging from some of the source code, a GUI is on Ernst's own wish list for someday.
  15. Integral PrimeNet API use
(what else?)

Last fiddled with by kriesel on 2021-10-14 at 17:40

Old 2021-09-17, 23:18   #16
Bug list

This is a partial list, mostly by version in which they were first seen. Testing has only involved Mersenne number related capabilities. NO attempt was made at testing on Fermat number capabilities.


gave msec/iter times, but labeled sec/iter. Resolved in later version


  1. Indicates a known Mersenne prime as composite:
    {"status":"C", "exponent":1257787, "worktype":"PRP-3", "res64":"               1", "residue-type":1, "fft-length":65536, "shift-count":758233, "error-code":"00000000", "program":{"name":"Mlucas", "version":"19.0"}, "timestamp":"2021-12-03 12:52:38 Central Standard Time", "aid":"00000000000000000000000000000000"}
  2. On Windows, fails to rename worktodo.ini. This is commonly caused by attempting to implicitly delete or overwrite an existing file of the same name by renaming another file to the same name, which Windows does not allow.
    ERROR: unable to rename WINI.TMP file ==> worktodo.ini ... attempting line-by-line copy instead.
  3. fft length (of an initial worktodo item or previous self-test) gets carried over to the next worktodo item, even if the exponent is much larger than its maximum.
    INFO: Maximum recommended exponent for this runlength = 1327103; p[ = 3321928243]/pmax_rec = 2503.1427425000.
    Initial DWT-multipliers chain length = [short] in carry step.
    ERROR: at line 1273 of file ../src/Mlucas.c
    Assertion failed: specified FFT length 64 K is much too small: Recommended length for this p = 196608 K.

There are several described at "Upgrade to V20.1 for faster P-1 stage 2 and multiple bug fixes."
At least one slipped by brief testing, so is present in V20.1 also. See also P-1 stage 2 restart issue etc. below.

  1. mislabels n-bit P-1 factor found as n (base ten) digits.
    Found 70-digit factor in Stage 2: 646560662529991467527
    Following examples from V20.0
    Found 95-digit factor in Stage 1: 33287662948300610984694812407
    Found 84-digit factor in Stage 2: 15299475858498328182948679
    Log10(33,287,662,948,300,610,984,694,812,407)=28.52...; Log10(33,287,662,948,300,610,984,694,812,407)/Log10(2) = 94.74...
    Appears to be corrected in a subsequent update.
  2. When there is a restart in P-1 stage 2 (Mlucas intended stop/restart, or Windows Update or power failure pulls the rug out from under Linux/WSL and Mlucas), the following result record for P-1 stopped/restarted in stage 2 has 1970-01-01 midnight as time stamp, instead of the actual completion time. <exponent>.stat file entries are ok. The P-1 stage 2 restart code path bypasses the usual inits of calendar_time, which later affects the result output timestamp.
    Appears to be corrected in a subsequent update.
  3. More recently, also on Ubuntu/WSL/Win10, I've observed peculiar result line date values such as "4442758-11-21 10:39:25 UTC" on ~2021-10-07 after recovering from large-memory related stage 2 Mlucas crash on 10M and on 106M exponent runs. Appears to be corrected in v20.1.1.
  4. Factors found at a GCD early in stage 2 are reported as if they were found in stage 1, with only stage 1 bound given. Computing the effective stage 2 bound in such a case is not easy or clear. Appears to be corrected in 20.1.1.
  5. Factor found after a full but interrupted stage 2 was indicated as stage 1 bounds only. Appears to be corrected in 20.1.1.
  6. -maxalloc with a % that equates to > ~32GiB attempted usage results in a segmentation fault at the beginning of P-1 stage 2. Observed on a 128 GiB ram AVX system with Win10/WSL1/Ubuntu 18.04.2 LTS combo. Ernst has been able to reproduce the issue on a KNL/Ubuntu system. Some variables that were typed uint32 will need to be uint64. Until resolved, a workaround is to use less of the available ram, at some loss of speed. Appears to be corrected in a subsequent update.
  7. In P-1 at least, some values that ought be recalculated for each worktodo item appear to be reused unchanged instead. Number of buffers in stage 2 is the first example seen, not recalculated from 106M to 334M. Another is FFT length did not get updated from a 1M P-1 task to a 3M P-1 task immediately following. Possible workarounds include sorting and segregating assignments to similar exponents, or use of the command line and scripting for separate program sessions for disparate exponents. B2start is another variable that gets carried over. Appears to be corrected in a subsequent update.
  8. For P-1 and probably other work types, on small exponents on which a stage of computation may complete faster than the checkpoint save interval or stat file update interval, no stage timing is saved to the stat file or displayed on stdout/stderr. This means run time scaling measurements on small exponents can not be made, except with a stopwatch or the batch file/shell script equivalent.
  9. In self test on enormous fft lengths (256M - 512M) on 2 models of AVX512 CPUs on Ubuntu/WSL/Win10, one of the 512M radix sets reproducibly produces a segfault crash, preventing production of a line to finish the self test. On Ernst's attempt to reproduce on bare Ubuntu on AVX512, and perhaps later version of source code, there's instead an excessive roundoff error flagged. A workaround is to hand edit the mlucas.cfg file based on console output from other radix sets completed prior to the crash. Since the enormous class fft lengths begin at 256M, there's little or no need for them currently in the Mersenne number realm. FFT length 192M is expected to be sufficient for P-1 factoring attempts on OBD candidates. Appears to be corrected in a subsequent update.
  10. for worktodo entry:
    Product of Stage 1 prime powers with b1 = 8000 is 11649 bits (183 limbs), vs estimated 12035. Setting PRP_BASE = 3.
    ERROR: at line 1165 of file ../src/mi64.c
    Assertion failed: mi64_shl: zero-length array or shift count >= 64!
    Code needs modification for the case where it incorrectly generates a shift count of 64. Appears to be corrected in a subsequent update.
  11. a lingering bug related to relocation-prime handling in P-1 stage 2 restart. Appears to be corrected in a subsequent update.
  12. see also the Mlucas readme.html for a more comprehensive list (currently 14 bullet points, some of which appear to be compound)
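
As a cross-check of the bits-versus-digits mislabeling in item 1 of the list above, the sizes of the quoted factors can be computed directly. A minimal Python sketch follows; the helper name is illustrative only and nothing here is Mlucas code.

```python
def size_in_bits_and_digits(factor):
    """Return (bit length, decimal digit count) of a positive integer."""
    return factor.bit_length(), len(str(factor))


# Factors quoted in the post, with the "digit" counts that were printed
reported = {
    646560662529991467527: 70,
    33287662948300610984694812407: 95,
    15299475858498328182948679: 84,
}

for factor, printed in reported.items():
    bits, digits = size_in_bits_and_digits(factor)
    print(f"{factor}: {bits} bits, {digits} decimal digits (printed as {printed}-digit)")
```

In each case the printed number matches the bit length, not the decimal digit count.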
v20.1.1 (2021-11-01 or 2021-11-06)
  1. Observed on first attempt on WSL, if mlucas.ini does not exist (so no entries exist), running -s m -iters 100 (and probably other command line variations), generates incorrect error messages regarding related mlucas.ini possible entries: "User set unsupported value LowMem = nan in mlucas.ini ... ignoring.
    User set non-whole-number CheckInterval = nan in mlucas.ini ... ignoring." Apparently the program does not initialize default values before looking for mlucas.ini and interpreting its contents if found. A likely simple workaround is to create mlucas.ini with applicable contents. This appears to be addressed in the 2021-11-23 patch.
  2. Attempts to run worktodo.ini entry: PRP=00000000000000000000000000000000,1,2,3321928171,-1,91,0
    yielded following, thought to be due to some variables cast (int). Consequently Mersenne number PRP and LL testing would be capped at 2^31-1 until further revision:
    INFO: Maximum recommended exponent for FFT length (196608 Kdbl) = 3409766353; p[ = 3321928171]/pmax_rec = 0.9742392373.
    Initial DWT-multipliers chain length = [medium] in carry step.
    INFO: primary restart file p3321928171 not found...looking for secondary...
    INFO: no restart file found...starting run from scratch.
    ERROR: at line 1695 of file ../src/Mlucas.c
    Assertion failed: Require (int)maxiter > 0
    Run times impose lower limits. End users may attempt increasing the limit to 2^32-1 by removing the relevant (int) casts in their copy of source code before performing a build. For example, change
    ASSERT(HERE, (int)maxiter > 0,"Require (int)maxiter > 0");
    to
    ASSERT(HERE, maxiter > 0,"Require maxiter > 0");
    (It would be useful for debugging if the assert output the current maxiter value there.) That might look something like the following, where charbuf is a hypothetical character buffer large enough to hold the message:
    sprintf(charbuf, "Require maxiter = %u > 0", maxiter);
    ASSERT(HERE, maxiter > 0,charbuf);
    The 2021-11-23 patch has as line 1697,
    ASSERT(HERE, maxiter > 0,"Require (uint32)maxiter > 0");
    which provides a bit more exponent range. A more complete solution, revising the source code to use 64-bit ints for more variables, to support the full nominal capability of the largest implemented fft lengths, is planned at some point. Until that is done, it will limit testing the code in several ways. 8937021983 is a prime exponent, above the nominal Mlucas v20.x limit of PRP exponents supported with the 512M fft and above 2^32. Attempting to run PRP on it produces not a message about being too large, but
     worktodo.ini  entry: PRP=00000000000000000000000000000000,1,2,8937021983,-1,91,0
    check_kbnc: Mersenne exponent must be prime!
    ERROR: at line 590 of file ../src/Mlucas.c
    Assertion failed: [k,b,n,c] portion of in_line fails to parse  correctly!
    8937021983 is prime, but 8937021983 mod 2^32 is 347,087,391 = 3 × 7 × 16527971. Similarly an exponent below the asympconst = 0.6 estimated limit of the 512M fft length, 8883334793 mod 2^32 = 293,400,201 = 3 × 97800067
     worktodo.ini entry: Pminus1=00000000000000000000000000000000,1,2,8883334793,-1,10000,10000
    check_kbnc: Mersenne exponent must be prime!
    ERROR: at line 716 of file ../src/Mlucas.c
    Assertion failed: [k,b,n,c] portion of in_line fails to parse correctly!
    What's occurring would be more clear if the error message indicated the exponent as perceived by the code at that point, eg
    check_kbnc: Mersenne exponent 293,400,201 must be prime!
    Just above 2^32, 4294967311 mod 2^32 = 15 = 3 × 5
     worktodo.ini entry: Pminus1=00000000000000000000000000000000,1,2,4294967311,-1,10000,10000
    check_kbnc: Mersenne exponent must be prime!
    ERROR: at line 716 of file ../src/Mlucas.c
    Assertion failed: [k,b,n,c] portion of in_line fails to parse correctly!
    while just below 2^32, 4294967291 P-1 appears at least initially to run. Note some exponents > 2^32 that are prime may correspond to another smaller prime mod 2^32 and may fail another assert re minimum magnitude. For example, 4294967357 mod 2^32 = 61, while minimum exponent = 4096.
  3. Fft length can carry over from a self test to a worktodo item, possibly yielding impossibly high bits/word.
    ERROR: at line 1285 of file ../src/Mlucas.c
    Assertion failed: ERROR: specified FFT length 44 K is much too small: Recommended length for this p = 196608 K.
    Reportedly fixed in the 2021-11-23 patch.
  4. P-1 stage 2 may indicate >100% completion along the way. Known factor is found, so it's cosmetic. Per Ernst, it's a side effect of ensuring no paired primes below B2 get skipped, by doing some paired above B2 with them, IIUC. However, there's much more status output from 100% to 104+% than from 0% to 100%, and that's a bug, for which a fix is planned in the next update. Reportedly fixed in the 2021-11-23 patch.
  5. Attempts to stop a running Mlucas session, such as to restart from beginning, or to change the mlucas.ini file and have the changes take effect, may produce assorted .stat file anomalies, depending on the method of stopping the session attempted. top kill 9 works. Ctrl-C creates anomalies. (There is a patch file available which is thought to address most cases (but possibly not P-1 stage 2). And the 2021-11-23 patch likely incorporates that if not more, as a result of restored signal handling.) Following is a list of some of the interesting features of the resulting stat file. These are mostly also typically signs of a computation gone wrong.
    1. Repeating res64 with differing iteration count during Ctrl-C as kill attempts.
    2. Clocks anomalies vs. elapsed time indicated by time stamp. For example, 100k iter to 110k iter, 13 seconds, clocks value unchanged, ms/iter 10-fold increase.
    3. Unchanged shift count between status lines indicated as 10k iterations apart, in a run using nonzero shift.
    4. Error measures zero in some entries.
    5. Exact same clocks values at 10k iters apart in early running.
    6. Every log interval Mlucas claims to be restarting, including during the last run in which it was being left to run undisturbed while I slept.
    7. Mismatch of res64 for same iteration count and exponent and computation type, between the affected run and independent runs' log output. After a thorough restart from scratch process, the 10k and 100k iter res64s match the corresponding res64s from a previous gpuowl run. I suspect the erroneous res64 in the affected Mlucas run is a res64 carried over from some earlier iteration count.
v20.1.1 (2021-12-02)
  1. Exponents > 2^32 but within the nominally supported range may produce a worktodo parse error. See for more information and a user-applyable patch.
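
The parse failures reported in this post for exponents beyond 2^32 are consistent with the exponent being reduced mod 2^32 somewhere before the primality check. The following Python sketch reproduces that arithmetic for the exponents quoted above. That truncation to 32 bits is the actual failure mode is an inference from the observed messages, not something confirmed here from the source code, and the function names are illustrative.

```python
def is_prime(n):
    """Trial division; adequate for values below 2**32."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def as_uint32(p):
    """Value a wider exponent would take if truncated to 32 bits
    (the assumed failure mode)."""
    return p & 0xFFFFFFFF


for p in (8937021983, 8883334793, 4294967311, 4294967357):
    t = as_uint32(p)
    verdict = "prime" if is_prime(t) else "composite"
    print(f"{p} -> {t} ({verdict})")
```

The first three truncated values are composite, which matches the "Mersenne exponent must be prime!" message. The last one truncates to the prime 61, which would instead fall below the minimum exponent of 4096 mentioned above.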

Last fiddled with by kriesel on 2022-02-09 at 09:32 Reason: add 20.1.1 2021-12-02

Old 2021-10-10, 16:16   #17
V20.1.x P-1 run time scaling

Based on very limited data, run time scaling is approximately p^2.1, for typical recommended bounds, in line with expectations from other applications and from first principles. (So twice the exponent is more than four times the run time, for nontrivial exponents, where fixed-duration or low-order-scaling setup time does not affect scaling much.)
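
A small Python sketch of what the p^2.1 power law implies follows. It assumes run time is exactly proportional to p^2.1 and ignores setup time. The function name and the 1.0-day reference run are hypothetical, chosen only to illustrate the scaling.

```python
def scaled_run_time(t_ref, p_ref, p, power=2.1):
    """Estimate run time at exponent p from a reference run at p_ref,
    assuming run time is proportional to p**power."""
    return t_ref * (p / p_ref) ** power


# Doubling the exponent: relative run time under the 2.1 power law
doubling_factor = scaled_run_time(1.0, 1.0, 2.0)
print(f"2x exponent -> about {doubling_factor:.2f}x run time")

# Hypothetical example: a 1.0-day run at p = 107M scaled to p = 332M
print(f"{scaled_run_time(1.0, 107e6, 332e6):.1f} days")
```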

When selecting exponents for run-time scaling tests, I recommend at least one with a known factor that should be found with usual bounds. That goes first to anchor the low end of the scaling. M10000831 works well. Widely spaced other exponents of use to GIMPS compose the rest; current first-test wavefront ~107M, ~220M, 332M (100Mdigit), & ideally higher (~500M-700M). Running them in that order allows a scaling fit to develop in a spreadsheet with the least compute time expenditure. That helps avoid single data points costing months or the appearance of a hung application. If running on WSL & Windows, take care to pause Windows updating for a sufficient duration that the scaling runs will complete without interruption, for an easier situation for tabulating compute time per exponent.

The first attachment shows results of runs while also running other GIMPS loads, and brief stage 1 tests without the other loads. Run time scaling for Mlucas v20.1 on a dual-xeon-e5-2697v2 system on Ubuntu atop WSL & Windows 10, 128 GiB ECC ram, is consistent with OBD P-1 factoring whole attempts of ~10 months standalone duration, ~15 months with other usual loads. Running stage 2 in parallel on multiple systems can be used to ensure OBD P-1 completion in under a year. (Note the fit to points including a 10M exponent is inaccurate, because that point was run with 8 cores, unlike 16 for the others in that set.)

An experimental sequence of self test at 192M fft length for varying thread counts indicated 20 threads was latency optimal for this system for that length. See

A second run time scaling set for the same dual-xeon-e5-2697v2 system on Ubuntu atop WSL & Windows 10 was run with 20 threads and Mlucas V20.1.1. See the second attachment. Several run time estimates for gigadigit P-1 were computed, all shorter than one year. There are a few available ways to shorten run time relative to those tests and estimates, listed in the attachment.
A comparison of the stat files of M3321928171 from the first scaling run in Mlucas V20.1 and the second in V20.1.1 with differing core counts but matching B1=17,000,000 shows stage 1 iteration 100,000 res64 values match. (Res64: C51C82322FC7CBE6) See also "requirements for comparability of interim residues" post.
This system's run time scaling may be revisited and improved later based on lessons learned in the following.

An additional run time scaling on a similar system (dual-xeon-e5-2690, 64 GiB ECC ram) on Ubuntu atop WSL & Windows 10 and 16 threads indicates solo completion of both stages for OBD P-1 would take ~1.9 years, and split stage 2 with an equal speed system would take ~1.25 years overall. Experiments varying number of cores for solo instance yielded local minima at 9, 19 and 32 cores (19 fastest), providing ~10.6% improvement for an estimated ~1.7 years solo both stages, and with a split stage 2 ~1.1 years (or <1 year with a faster system such as the one in the previous paragraph). Limited testing of dual instances indicated aggregate throughput ~2.65 iterations/sec is possible, corresponding to an effective timing of 377. msec/iter. This could be either a stage 1 and a stage 2 or two stage 1s, and with latency ~2 years, equivalent aggregate throughput >1 OBD P-1/year. Some of these figures could be improved by using a native Linux boot, avoiding various detrimental aspects of WSL environment. See the third attachment, which includes some CentOS 8 Stream selftest timings for comparison to Ubuntu/WSL/Win10 on the identical hardware.

A separate run time scaling was conducted on an i5-7600T (4 core no HT) 64 GiB (nonECC) ram CentOS 7.9 system. This indicates OBD P-1 may be feasible in 2 years on that hardware, or with stage 2 split with another equal or faster system, 1.4 years, although the lack of ECC is a concern. It may be used for some OBD stage 1 progress and then that system switched to P-1 for a 2.25G exponent with estimated completion ~9.8 months solo. See the fourth attachment.

Given the run time scalings obtained so far, and mlucas.cfg timing for 192M fft length, we can estimate that a timing under ~500 ms/iter is required to qualify for OBD P-1 to designated bounds solo within a year. Assuming run time is 1/3 stage 1, 2/3 stage 2, stage 1 taking no more than (4 months * 365 /12 days/month * 24 hours/day * 3600 seconds/hour) / (17000000 * 1.442 iters) = 10512000 / 24514000 = 429. msec / iter provides a rough magnitude check. So far I have observed s2/s1 duration ratios lower than 2, during most runtime scaling at exponents < 1G (and those would improve with greater memory allowed for stage 2), which would permit somewhat longer stage 1 than 4 months and longer than 429. msec/iter 192M fft length selftest times.
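
The rough magnitude check above can be written out as a short Python sketch. The function name is illustrative. The factor 1.442 iterations per unit of B1 is taken from the post (it is close to 1/ln 2), and the 4-month stage 1 allowance follows from the assumed 1/3 : 2/3 split of a one-year run.

```python
SECONDS_PER_DAY = 24 * 3600


def stage1_budget_ms_per_iter(months, b1, iters_per_b1=1.442):
    """Largest ms/iter that completes P-1 stage 1 within `months`,
    taking the iteration count as b1 * iters_per_b1 (the post's estimate)."""
    seconds = months * (365 / 12) * SECONDS_PER_DAY
    iterations = b1 * iters_per_b1
    return 1000.0 * seconds / iterations


budget = stage1_budget_ms_per_iter(months=4, b1=17_000_000)
print(f"stage 1 budget: {budget:.0f} ms/iter")
```

Passing a larger `months` value shows how much the budget relaxes when the stage 2 to stage 1 duration ratio is below 2.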

Last fiddled with by kriesel on 2022-02-13 at 17:29 Reason: updated third attachment & related text

Old 2021-10-22, 20:23   #18
Max exponent versus fft length

Mlucas v20.1 fft length, maxp (excerpted from get_fft_radices.c, subject to change)
Note, for 256M fft length and larger, -shift 0 is required.
With AsympConst = 0.6, this gives the following maxP values for various FFT lengths:
                 maxn(N) as a function of AsympConst:
             N       AC = 0.6    AC = 0.4
        --------    ----------    ----------
             1 K         22686         22788    [0.45% larger]
             2 K         44683         44888
             3 K         66435         66742
             4 K         88029         88438
             5 K        109506        110018
             6 K        130892        131506
             7 K        152201        152918
             8 K        173445        174264
             9 K        194632        195554
            10 K        215769        216793
            12 K        257912        259141
            14 K        299904        301338
            16 K        341769        343407
            18 K        383521        385364
            20 K        425174        427222
            24 K        508222        510679
            28 K        590972        593840
            32 K        673470        676747
            36 K        755746        759433
            40 K        837827        841923
            48 K       1001477       1006392
            56 K       1164540       1170275
            64 K       1327103       1333656
            72 K       1489228       1496601
            80 K       1650966       1659158
            96 K       1973430       1983260
           112 K       2294732       2306201
           128 K       2615043       2628150
           144 K       2934488       2949234
           160 K       3253166       3269550
           176 K       3571154       3589176
           192 K       3888516       3908176
           208 K       4205305       4226604
           224 K       4521565       4544502
           240 K       4837335       4861911
           256 K       5152648       5178863
           288 K       5782016       5811507
           320 K       6409862       6442630
           352 K       7036339       7072384
           384 K       7661575       7700897
           416 K       8285675       8328273
           448 K       8908726       8954601
           480 K       9530805       9579957
           512 K      10151977      10204406
           576 K      11391823      11450805
           640 K      12628648      12694184
           704 K      13862759      13934849
           768 K      15094405      15173048
           832 K      16323795      16408992
           896 K      17551103      17642854
           960 K      18776481      18874785
          1024 K      20000058      20104916    [0.52% larger]
          1152 K      22442252      22560217
          1280 K      24878447      25009519
          1408 K      27309250      27453429
          1536 K      29735157      29892444
          1664 K      32156582      32326975
          1792 K      34573872      34757372
          1920 K      36987325      37183933
          2048 K      39397201      39606917
          2304 K      44207097      44443027
          2560 K      49005071      49267215
          2816 K      53792328      54080687
          3072 K      58569855      58884428
          3328 K      63338470      63679258
          3584 K      68098867      68465868
          3840 K      72851637      73244853
          4096 K      77597294      78016725
          4608 K      87069012      87540871
          5120 K      96517023      97041311
          5632 K     105943724     106520441
          6144 K     115351074     115980220
          6656 K     124740700     125422275
          7168 K     134113980     134847983
          7680 K     143472090     144258522
  8 M =   8192 K     152816052     153654913
  9 M =   9216 K     171464992     172408710
 10 M =  10240 K     190066770     191115346
 11 M =  11264 K     208626152     209779586
 12 M =  12288 K     227147031     228405322
 13 M =  13312 K     245632644     246995793
 14 M =  14336 K     264085729     265553736
 15 M =  15360 K     282508628     284081492
 16 M =  16384 K     300903371     302581093
 18 M =  18432 K     337615274     339502711    <*** smallest 100-Mdigit moduli ***
 20 M =  20480 K     374233313     376330465
 22 M =  22528 K     410766968     413073835
 24 M =  24576 K     447223981     449740563
 26 M =  26624 K     483610796     486337093
 28 M =  28672 K     519932856     522868869
 30 M =  30720 K     556194824     559340552
 32 M =  32768 K     592400738     595756181    <*** Nov 2015: No ROE issues with a run of p = 595799947 [maxErr = 0.375], corr. to AsympConst ~= 0.4
 36 M =  36864 K     664658102     668432976
 40 M =  40960 K     736728582     740922886
 44 M =  45056 K     808631042     813244776
 48 M =  49152 K     880380890     885414055
 52 M =  53248 K     951990950     957443546
 56 M =  57344 K    1023472059    1029344085
 60 M =  61440 K    1094833496    1101124952
 64 M =  65536 K    1166083299    1172794185
 72 M =  73728 K    1308275271    1315825018
 80 M =  81920 K    1450095024    1458483632
 88 M =  90112 K    1591580114    1600807583
 96 M =  98304 K    1732761219    1742827549
104 M = 106496 K    1873663870    1884569060
112 M = 114688 K    2014309644    2026053695
120 M = 122880 K    2154717020    2167299932
128 M = 131072 K    2294902000    2308323773
144 M = 147456 K    2574659086    2589758580
160 M = 163840 K    2853674592    2870451808
176 M = 180224 K    3132023315    3150478252
192 M = 196608 K    3409766353    3429899013    <<<<*** Allows numbers slightly > 1Gdigit
208 M = 212992 K    3686954556    3708764937
224 M = 229376 K    3963630903    3987119006
240 M = 245760 K    4239832202    4264998026    <*** largest FFT length for 32-bit exponents
256 M = 262144 K    4515590327    4542433872
288 M = 294912 K    5065885246    5096084235
320 M = 327680 K    5614702299    5648256731
352 M = 360448 K    6162190494    6199100369
384 M = 393216 K    6708471554    6748736872
416 M = 425984 K    7253646785    7297267547
448 M = 458752 K    7797801823    7844778027
480 M = 491520 K    8341010002    8391341650
512 M = 524288 K    8883334834    8937021925    [0.60% larger]

Last fiddled with by kriesel on 2021-11-13 at 21:55
Old 2021-10-25, 09:50   #19
Mar 2017
US midwest

7381₁₀ Posts
Default Ram required versus exponent for P-1 stage 2 in Mlucas

Stage 1 can be run to gigadigit or higher on 12 GiB of ram, and probably on less.

Stage 2 is much more demanding of ram than stage 1.
A rough estimate of stage 2 ram required for successful launch is 30 buffers x (fft length x 8 bytes)/buffer.
For example, if for gigadigit stage 2, 256M is the fastest fft length of sufficient size (>=192Mi), 30 x 256Mi x 8 bytes = 61440 MiB = 60 GiB.
In that case the ram required could be reduced by commenting out the 256M entry in mlucas.cfg, accepting the somewhat slower 192M timing in exchange for smaller ram requirement.
Then 30 x 192Mi x 8 bytes = 45 GiB. Observed ram usage of the Mlucas program in top was 45.9 GiB.

Stage 2 P-1 for F33 is estimated to require 30 x 512Mi x 8 bytes = 120 GiB.

Wavefront stage 2 P-1 at ~107M exponent would require 6M fft length, ~1.4 GiB.
100Mdigit would require 18M fft length, ~4.2 GiB.
~1G exponent would require 56M fft length, ~13.1 GiB. So since free ram in a WSL session is ~12GiB on a 16 GiB system, max exponent for stage 2 on such a system would be ~910M.
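
The rough estimate above can be sketched in a few lines of Python. This is only an illustration of the 30-buffer rule of thumb from this post; the function name and the default of 30 buffers are my own choices for the sketch, not anything read from Mlucas itself.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def stage2_ram_bytes(fft_len_m, buffers=30, bytes_per_word=8):
    """Rough P-1 stage 2 ram estimate: buffers x (fft length x 8 bytes).

    fft_len_m is the fft length in units of Mi words (192 for 192M).
    buffers=30 is the rule of thumb from the post (an assumption here).
    """
    return buffers * fft_len_m * MIB * bytes_per_word

# fft lengths quoted in the post for each case
cases = [
    ("wavefront ~107M", 6),
    ("100Mdigit", 18),
    ("~1G exponent", 56),
    ("gigadigit at 192M", 192),
    ("gigadigit at 256M", 256),
    ("F33", 512),
]

for label, fft_len_m in cases:
    gib = stage2_ram_bytes(fft_len_m) / GIB
    print(f"{label:20s} {fft_len_m:4d}M fft  ~{gib:.1f} GiB")
```

Passing a different buffers value shows how the requirement scales if more or fewer buffers are used.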

Those values above are the estimated minimums to be able to run the stage. Somewhat more ram could enable faster completion.
In preliminary testing, I've observed Mlucas allocate buffers in multiples of 24 or 40; per Ernst, the equivalent of ~5 more are required for other data. For OBD, 64 GiB would give the same speed as 60 or 80 GiB, but 96 GiB or more may allow use of 48 or more buffers and somewhat higher speed. Successive doublings of ram and buffer count give diminishing returns, as observed in other P-1 capable software.
Ram in use may fluctuate somewhat. A 468M exponent run in stage 2, using 24 buffers, was observed in top at 6.72 GiB virtual size and 6.02 GiB resident, while the estimate for its 26M fft length gives 6.1 GiB.


Last fiddled with by kriesel on 2021-11-17 at 18:06
Old 2021-11-17, 12:37   #20
Mar 2017
US midwest

11²·61 Posts
Default Optimizing core count for fastest iteration time of a single task

Round one: Gigadigit timings
On a dual-processor-package system, 2 x E5-2697V2 (each of which is 12-core with x2 hyperthreading, for a total of 2 x 12 x 2 = 48 logical processors), with 128 GiB ECC ram, for which prime95 reports L1 cache 24x32KB, L2 24x256KB, L3 2x30MB, a series of self tests for a single fft length and varying cpu core counts was run in Mlucas v20.1.1 (2021 Nov 6 tarball), within Ubuntu running atop WSL1 on Win 10 Pro x64. Such usage is likely when attempting to complete one testing task as quickly as possible (minimum latency). Examples are OBD or F33 P-1, or confirming a new Mersenne prime discovery. It is very likely not the maximum-throughput case, which is what typical production running would aim for.

Fastest iteration time, ~400 ms/iter at 192M (suitable for OBD P-1), was obtained at 20 cores, which is fewer than the total physical core count of 24.
Iteration times were observed to have limited reproducibility at 100 iterations (10% or worse variability). Reproducibility was much better with 1000-iteration runs.
Best reproducibility was apparently obtained by running nothing else, with no interactive use and not even top, although I left a gpuowl instance running uninterrupted.

The thread efficiency = (ms/iter for 1 thread) / (ms/iter * threadcount) varied widely, down to 20.6% at 48 threads. At the fastest iteration time, 20 threads, it was 65.8%. Power-of-two thread counts were in most cases local maxima.
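
As a minimal sketch of the efficiency figure, taking it as the single-thread time divided by (thread count x multithreaded time), so that perfect scaling is 100%: the function names are hypothetical, and the single-thread time is back-calculated here from the ~400 ms/iter and 65.8% figures rather than taken from the attachment.

```python
def thread_efficiency(ms_iter_1thread, ms_iter_n, n_threads):
    """Fraction of ideal linear speedup achieved with n_threads."""
    return ms_iter_1thread / (ms_iter_n * n_threads)

def implied_single_thread_ms(efficiency, ms_iter_n, n_threads):
    """Invert the efficiency formula to recover the 1-thread ms/iter."""
    return efficiency * ms_iter_n * n_threads

# ~400 ms/iter at 20 threads with 65.8% efficiency, as quoted above.
# The 400 is approximate, so the implied 1-thread time is too.
t1 = implied_single_thread_ms(0.658, 400.0, 20)
print(f"implied 1-thread time: ~{t1:.0f} ms/iter")
print(f"efficiency at 20 threads: {thread_efficiency(t1, 400.0, 20):.1%}")
```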

The tests were performed by writing and launching a simple sequential shell script, specifying Mlucas command line and output redirection, followed by rename of the mlucas.cfg before the next thread count run. The results are tabulated in the first attachment.

Round two: gathering 1G7 PRP reference interim residues
On the same dual 12-core & x2 HT Xeon e5-2697v2, Windows 10 Pro, WSL1, Canonical Ubuntu 18.04, Mlucas v20.1.1 (2022-02-09 build)
M1,000,000,007 PRP performance versus core count specification
The value x below is the highest logical processor number in the set of logical processors used in a multithreaded run. Numbering starts at zero; 0,1 are the two logical processors (hyperthreads) of one physical core. Max x on a single 12-core, 2-way hyperthreaded Xeon is 23.
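
To make the numbering concrete, here is a small sketch that expands a lo:hi:stride spec into the logical processor numbers it names. It assumes the spec is an inclusive range with an optional stride, which matches the core counts listed in this post (0:7:2 giving 4 logical cores); the function name is hypothetical and this is not Mlucas's own parser.

```python
def expand_cpuspec(spec):
    """Expand 'lo:hi[:stride]' into a list of logical processor numbers.

    Assumes an inclusive range, as the core counts in the post imply.
    """
    parts = [int(field) for field in spec.split(":")]
    if len(parts) == 2:
        lo, hi = parts
        stride = 1
    elif len(parts) == 3:
        lo, hi, stride = parts
    else:
        raise ValueError(f"expected lo:hi or lo:hi:stride, got {spec!r}")
    if lo < 0 or hi < lo or stride < 1:
        raise ValueError(f"invalid range in {spec!r}")
    return list(range(lo, hi + 1, stride))

for spec in ("0:7:2", "0:7:1", "0:23:2", "0:47:1"):
    cores = expand_cpuspec(spec)
    print(f"{spec:8s} -> {len(cores):2d} logical processors: {cores}")
```

With stride 2 only one hyperthread per physical core is named; with stride 1 both are.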

The FFT radices used are 960 32 32 32, so core counts that are factors of 960 or of 32 are likely to give more favorable timings.
Powers of two are likely to be faster than non-power-of-two.
Core counts that fully occupy a CPU package with or without HT are likely to be faster.
There's likely to be a dip in performance due to NUMA when using both cpu packages / ram banks.
These considerations conflict somewhat, on a non-power-of-two core count CPU or dual-cpu-NUMA design.

Observed timings below are msec/iteration using the number of logical processors specified.
No hyperthreading specified: 0:x:2
(nominally single CPU package)
cpuspec lcorecount	msec/iter	lcore*msec/iter or notes
0:7:2 		4	460.9		1843.6
0:11:2 		6	401.5		2409.
0:15:2		8	221.5		1772.	min lcore*msec/iter observed
0:23:2 		12	208.6		2503.2

(nominally both cpu packages)
0:31:2 		16	123.		1968.
0:47:2		24	114.		2736.

Hyperthreading specified: 0:x:1
(nominally single CPU package)
cpuspec lcorecount	msec/iter	lcore*msec/iter or notes
0:3:1		4	452.4		1809.6
0:5:1		6	381.2		2287.2
0:7:1 		8	222.3		1778.4
0:11:1 		12	207.2		2486.4
0:15:1		16	130.7		2091.2	
0:23:1 		24	114.1		2738.4

(necessarily both cpu packages)
cpuspec lcorecount	msec/iter	lcore*msec/iter or notes
0:29:1		30	112.		3360.
0:31:1 		32	105.3		3369.6 min time/iter; power of two and a good fit to the fft's radices
0:35:1		36	107.4 		3866.4
0:39:1		40	107.3		4292.
0:47:1		48	108.4		5203.2 more logical cores made it slower
The above evaluates single-task performance. Maximum throughput may instead come from running multiple tasks, each with longer latency: for example 0:23:1 on cpu0 for one task and 24:47:1 on cpu1 for a different exponent, running in parallel in separate instances and separate folders; or three instances on 0:15:1, 16:31:1, and 32:47:1. Mlucas is multithreaded but single-worker, so running multiple tasks requires multiple instances.
Use the above figures and cpu specs with caution. In WSL Mlucas runs, the per-logical-core view in Windows Task Manager commonly shows activity spread over more cores than specified when the specified core count is low. For example, 0:7:2 should run on 4 specific logical cores, but shows significant use spread over 9 logical cores. These results may not reflect what would occur on native Linux, or even on the next WSL run. IIRC the core alignment is worse on WSL2 than WSL1.
Clock rates were not held constant. Increasing the number of active cores probably affects usable clock rates.
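
The lcore*msec/iter column in the tables above is simply logical core count times msec/iter, a rough measure of how much core time each iteration costs. A short sketch that recomputes it from the tabulated timings and picks out the fastest and the cheapest configurations follows. The timings are copied from the tables; the variable and function names are mine.

```python
# (cpuspec, logical core count, msec/iter) copied from the tables above
TIMINGS = [
    ("0:7:2", 4, 460.9), ("0:11:2", 6, 401.5), ("0:15:2", 8, 221.5),
    ("0:23:2", 12, 208.6), ("0:31:2", 16, 123.0), ("0:47:2", 24, 114.0),
    ("0:3:1", 4, 452.4), ("0:5:1", 6, 381.2), ("0:7:1", 8, 222.3),
    ("0:11:1", 12, 207.2), ("0:15:1", 16, 130.7), ("0:23:1", 24, 114.1),
    ("0:29:1", 30, 112.0), ("0:31:1", 32, 105.3), ("0:35:1", 36, 107.4),
    ("0:39:1", 40, 107.3), ("0:47:1", 48, 108.4),
]

def core_cost(row):
    """Logical cores x msec/iter: core time spent per iteration."""
    _, lcores, ms_iter = row
    return lcores * ms_iter

fastest = min(TIMINGS, key=lambda row: row[2])   # lowest latency
cheapest = min(TIMINGS, key=core_cost)           # least core time per iter

for row in TIMINGS:
    print(f"{row[0]:8s} {row[1]:3d} {row[2]:7.1f} {core_cost(row):8.1f}")
print("fastest single task:", fastest[0])
print("lowest lcore*msec/iter:", cheapest[0])
```

The lowest-latency and lowest-cost configurations differ, which is the latency versus throughput tradeoff described above.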

The minimum iteration time achieved, 105.3 msec, corresponds to ~3.34 years for a single gigabit primality test.

Attached Files
File Type: pdf optimizing 192M thread count.pdf (27.8 KB, 83 views)
File Type: pdf parallelism at 1G PRP.pdf (31.2 KB, 2 views)

Last fiddled with by kriesel on 2023-01-12 at 18:39
Old 2021-11-18, 19:22   #21
Mar 2017
US midwest

11²×61 Posts
Default File size scaling

Size of the largest files depends on exponent and computation type. The number of files depends on the computation type. File size is independent of P-1 bounds, since it is determined by the size of a residue mod Mp. LL will have p and q files. PRP will also have a .G file for the GEC data. P-1 will have p and q files and eventually .s1 and .s2, plus .s1_prod, which is ~100X smaller. .stat files tend to be of modest size. A 4G P-1 attempt means 1-2 GiB of space used, even with the minimum possible bounds. PRP with proof generation is unreleased and its behavior is unknown.
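
As a rough cross-check of the 1-2 GiB figure: a residue mod Mp is p bits, so about p/8 bytes. The sketch below assumes each full-size file is about one residue and ignores headers, checksums, and the smaller files, so it is only an order-of-magnitude estimate. The function names are hypothetical and "4G" is taken to mean p = 4,000,000,000.

```python
GIB = 1024 ** 3

def residue_bytes(exponent):
    """Bytes needed to hold one residue mod Mp (p bits, rounded up)."""
    return (exponent + 7) // 8

def full_size_files_gib(exponent, n_files):
    """Approximate space for n_files residue-sized files, in GiB."""
    return n_files * residue_bytes(exponent) / GIB

p = 4_000_000_000
# 2 files: p and q only; 4 files: p, q, .s1 and .s2 (assumed full-size)
for n_files in (2, 4):
    print(f"{n_files} full-size files: ~{full_size_files_gib(p, n_files):.2f} GiB")
```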
Attached Files
File Type: pdf file sizes.pdf (17.8 KB, 72 views)

Last fiddled with by kriesel on 2023-01-12 at 18:43 Reason: added proof generation comment

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
