2021-08-15, 23:09   #12   kriesel

Tuning Mlucas V20

My first try with Mlucas V20.0 was in Ubuntu 18.04 LTS installed in WSL1 on Windows 10 Home x64 on an i7-8750H laptop.

These tests ran with mfaktc also running on the laptop's discrete GPU, nothing on the IGP, a web browser with active Google Colab sessions, and TightVNC remote desktop for all access. Prime95 was stopped and exited before the test.

Experimenting a bit with Ernst's posted readme guidance, I obtained the timings shown in the attachment. Since some of these cases ran for only 100 iterations, and interactive use may have skewed them somewhat, I may rerun some of them, perhaps after the next update release. Timings in a bit of production running seem to have gradually improved; possibly that relates to ambient temperature.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached: i7-8750H timings.pdf (40.7 KB)

2021-09-06, 21:02   #13   kriesel

Mlucas V20.1 timings on various hardware and environments, & prime95 compared

Preface:
None of the following should be mistaken for criticism of anyone's efforts or results. Writing such software is hard. Making it produce correct outputs is harder. Making it fast and functional on a variety of inputs, hardware, environments, etc., is harder still. Few even dare to try.

Note:
prime95 prevents running multiple instances in the same folder.
Mlucas does not prevent simultaneously running multiple instances on the same exponent in the same folder. Don't do that; it creates an awful mess. (A wrapper guard is sketched below.)
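Until Mlucas gains such a guard, a wrapper can enforce one instance per folder externally. A minimal Python sketch, assuming it is started from the work folder; the lock-file name mlucas.lock and the launch command are illustrative, not part of Mlucas:
Code:
# One-instance-per-folder guard: a minimal sketch, not part of Mlucas.
# A second copy started in the same folder fails instead of trampling
# the first run's checkpoint files.
import os
import sys

LOCKFILE = "mlucas.lock"   # hypothetical name; any per-folder file works

try:
    # O_EXCL makes creation atomic: open fails if the file already exists.
    fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
except FileExistsError:
    sys.exit("Another instance appears to be running in this folder.")

try:
    os.system("./Mlucas -cpu 0:3")   # launch command is illustrative
finally:
    os.remove(LOCKFILE)              # release the lock on exit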

Case 1: PRP DC 84.7M on i7-8750H (6 core, 12 hyperthread)
Mlucas v20.1 on Ubuntu 18.04 LTS atop WSL1 on Win10: nominal 4-thread 18 iters/sec; nominal 8-thread 29 iters/sec, so 47 iters/sec throughput for the system as operated, and potentially up to 54 iters/sec combined for 3 processes of 4 threads each.

V29.5b6 prime95 benchmark on Windows 10 Home x64, same system: benchmarked all FFT lengths 2M-32M. For 84.7M, the FFT length is 4480K; 88.7 to 93.4 iters/sec throughput. Best throughput is all 6 cores on one worker, which also gives minimum latency.

Mlucas v20.1/WSL1 performance observed is ~50 to 61% that of prime95/Win on this system. Note prime95 has improved speed in some respects since the version benchmarked. Overhead from TightVNC access and the GPU app was present and should have been roughly constant.

Case 2: Dual Xeon E5-2697 v2 (12 cores & x2 HT each) for wavefront PRP
V29.8b6 prime95 on Win10, benchmarked at 5760K FFT length; best was 2 workers, 238 iters/sec throughput

Mlucas v20.1/WSL, 8 threads, 5632K FFT length, 15.09 ms/iter -> 66.3 iters/sec. Optimistically extrapolating to triple throughput for 24 cores: 198.8 iters/sec.
Other benchmarking showed a disadvantage to using all hyperthreads versus 1 thread per physical core.

Mlucas V20.0/WSL, 4 threads:
Code:
      5632  msec/iter =   29.25  ROE[avg,max] = [0.231956749, 0.312500000]  radices = 176 16 32 32  0  0  0  0  0  0
Corresponds to 1000/29.25 = 34.2 iters/sec.
Optimistically extrapolating to 6x throughput for 24 cores: 205.1 iters/sec.
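For reference, the throughput arithmetic used throughout these comparisons, as a small Python sketch; the sample inputs are the V20.0 figures just above, and the linear 6x scaling is the stated optimistic assumption, not a measurement:
Code:
# Per-iteration time to throughput, plus the optimistic extrapolation.
def iters_per_sec(ms_per_iter):
    """1000 ms/sec divided by the per-iteration time in milliseconds."""
    return 1000.0 / ms_per_iter

single = iters_per_sec(29.25)   # 4-thread Mlucas V20.0/WSL: ~34.2 iters/sec
total = 6 * single              # 6 such 4-thread processes on 24 cores
print(f"{single:.1f} iters/sec per process, {total:.1f} iters/sec extrapolated")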

Mlucas/WSL performance is ~83.5-86.2% of prime95 under favorable assumptions. Note that's in comparison to V29.8b6, not the current v30.6b4 prime95.

Case 3: Hardware is i7-4770 (4 core & x2 HT) for wavefront PRP, 5-way test on dual-boot Win10/Ubuntu system
Mlucas V20.1/Ubuntu 20.04/WSL2/Win10 sandwich on Windows boot (primary) partition
Code:
      5632  msec/iter =   23.68  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000 ms/sec / (23.68 ms/iter) = 42.23 iters/sec. Probably suffered somewhat from RDP and GPU apps running simultaneously on Windows.

Mlucas V20.1/Ubuntu 20.04 LTS booted from a second partition on the same system drive; 8 threads, which showed an advantage over 4 threads in WSL.
Code:
      5632  msec/iter =   15.98  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000 ms/sec / (15.98 ms/iter) = 62.58 iters/sec. Much improved throughput over the WSL2 scenario above.

prime95 v30.6b4, usual RDP, GPU apps running, etc., so some overhead load.
Code:
FFTlen=5600K, Type=3, Arch=4, Pass1=448, Pass2=12800, clm=2 (3 cores, 1 worker): 13.37 ms.
Throughput: 74.82 iter/sec.

prime95 v30.6b4, Windows 10 Pro x64, no RDP or GPU apps running
Code:
Timings for 5760K FFT length (4 cores, 1 worker): 11.31 ms.  Throughput: 88.43 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.88, 22.71 ms.  Throughput: 87.73 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 45.76, 44.74, 44.90, 44.29 ms.  Throughput: 89.06 iter/sec.
Average of the 3 worker counts, 88.41 iter/sec

mprime v30.6b4, Ubuntu 20.04, logged in at console, no GPU apps, no remote access, minimal overhead
Code:
Timings for 5760K FFT length (4 cores, 1 worker): 11.24 ms.  Throughput: 88.97 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.43, 22.48 ms.  Throughput: 89.06 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 44.65, 44.65, 44.60, 44.76 ms.  Throughput: 89.55 iter/sec.
Average of the 3 worker counts, 89.19 iter/sec (1.0088x Windows low-overhead run average throughput)
(Mprime/Linux max throughput 1.0055 of prime95/Windows max throughput)

mprime and prime95 timings are very close on the same hardware with equalized system overhead.
While there's essentially no Linux-vs-Windows speed advantage for prime95/mprime, there may be one for Mlucas, because running Mlucas on Windows currently requires WSL, which has the core-virtualization issue. The effect should be smaller when the cores are fully loaded with enough Mlucas threads to occupy them all.

Mlucas/WSL performance: 42.23/74.82 ~56.4% of prime95/Win10. Both sessions may have been negatively impacted by remote-desktop overhead.
Mlucas/Ubuntu performance: 62.58/88.43 ~70.8% of prime95/Win10 single-worker. Both the Windows and Ubuntu timings were taken without remote-desktop overhead or GPU apps.

Benchmarking experimental error is unknown; digitization error is up to ~0.09%.

Mlucas can currently perform LL, PRP, and P-1 computations on higher exponents than any other known GIMPS software. Benchmark first and estimate run times before committing to such work.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

2021-09-17, 18:53   #14   kriesel

Mlucas releases

This is an incomplete draft list.

2017-06-15 V17.0 https://mersenneforum.org/showthread.php?t=22391

2017-07-02? V17.1 https://mersenneforum.org/showthread.php?t=2977

2019-02-20 V18 https://mersenneforum.org/showthread.php?t=24100

2019-12-01 V19
https://mersenneforum.org/showthread.php?t=24990

2021-02-11 V19.1 ARMv8-SIMD / Clang/LLVM compiler compatibility
https://mersenneforum.org/showthread.php?t=26483

2021-07-31 V20.0 P-1 support; automake script makemake.sh
https://mersenneforum.org/showthread.php?t=27031

2021-08-31 V20.1 faster P-1 stage 2, some bug fixes, print refinements, new help provisions, corrected reference-residues, raised maximum Mp limits https://mersenneforum.org/showthread.php?t=27114

tbd V20.2? Minor cleanup, such as labeling factor bits as bits; additional bug fixes; possibly resyncing mfactor variable types with the shared-routine typing from Mlucas

tbd V21? PRP proof generation planned


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

2021-09-17, 23:14   #15   kriesel

Wish list

Features I'd like to see added to Mlucas. As always, it's the developer's call what actually happens; his time, his talent, his program. These are in when-I-thought-to-write-them order.

  1. PRP proof file generation by VDF. Preferably V2 format, such as prime95 and gpuowl produce. I think this is generally agreed to be the highest priority feature addition.
  2. ETAs in the .stat file output
  3. Jacobi symbol check for LL / LLDC running
  4. Solution for WSL-related core-hopping seen on Xeon Phi and elsewhere
  5. Solution for building native Windows executables
  6. Solution for building multithreaded native Windows executables
  7. Ability to accept a list of interim LL 64-bit residues from a parallel run, for comparison at widely spaced iteration counts such as every 5M or 10M from previous runs; useful in DC / TC / new-discovery verification. (An external cross-check is sketched after this list.)
  8. Date/time stamps on first record of console or nohup.out output or upon restart in <exponent>.stat file, and on most other output
  9. Enforcement that only one process can run per folder at a time
  10. Multiple-worker integration into a single process
  11. Segmenting a P-1 stage 2 run onto multiple processes or machines for running in parallel; this will be necessary for F33 P-1 stage 2, and may be useful on OBD P-1 also
  12. Total run time per P-1 stage, plus the total of both stages and GCDs, output by the program for more convenient benchmarking and run time scaling measurement.
  13. Change worktodo.ini to worktodo.txt
  14. A GUI someday; judging by some source code comments, this is on Ernst's own wish list.
  15. Integral PrimeNet API use
(what else?)
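Pending item 7, interim residues from two runs can be compared externally. A minimal Python sketch, assuming each run's .stat file carries an iteration count and a Res64 value per line; the regex is illustrative and would need adjusting to the actual .stat line format:
Code:
# Compare interim Res64 values from two runs of the same exponent.
# The line pattern is illustrative; adapt it to the actual .stat format.
import re

PATTERN = re.compile(r"Iter# = (\d+).*Res64: ([0-9A-Fa-f]{16})")

def residues(statfile):
    """Map iteration count -> Res64 hex string from one run's .stat file."""
    out = {}
    with open(statfile) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                out[int(m.group(1))] = m.group(2).upper()
    return out

a, b = residues("run_a.stat"), residues("run_b.stat")
for it in sorted(a.keys() & b.keys()):
    flag = "OK" if a[it] == b[it] else "MISMATCH"
    print(f"iter {it}: {a[it]} vs {b[it]} {flag}")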


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

2021-09-17, 23:18   #16   kriesel

Bug list

This is a partial list, mostly organized by the version in which each bug was first seen. Testing has involved only Mersenne-number-related capabilities; no attempt was made to test Fermat-number capabilities.

V17.0

Gave msec/iter times but labeled them sec/iter. Resolved in a later version.

V18.0
?

V19.0
?

V19.1
?

V20.0
Several bugs are described at https://mersenneforum.org/showpost.p...47&postcount=1.
Upgrade to V20.1 for faster P-1 stage 2 and multiple bug fixes.
At least one bug slipped past brief testing, so it is present in V20.1 also. See also the P-1 stage 2 restart issue etc. below.

V20.1
  1. Mislabels an n-bit P-1 factor found as n (base-ten) digits; a quick check is sketched after this list.
    Code:
    Found 70-digit factor in Stage 2: 646560662529991467527
    The following examples are from V20.0:
    Code:
    Found 95-digit factor in Stage 1: 33287662948300610984694812407
    Found 84-digit factor in Stage 2: 15299475858498328182948679
    Log10(33,287,662,948,300,610,984,694,812,407) = 28.52..., i.e. 29 decimal digits; dividing by Log10(2) gives 94.74..., i.e. 95 bits.
  2. When there is a restart in P-1 stage 2 (an intended Mlucas stop/restart, or Windows Update or a power failure pulling the rug out from under Linux/WSL and Mlucas), the subsequent result record for the P-1 stopped/restarted in stage 2 has 1970-01-01 midnight (the Unix epoch) as its time stamp instead of the actual completion time. <exponent>.stat file entries are OK. The P-1 stage 2 restart code path bypasses the usual inits of calendar_time, which later affects the result output timestamp.
  3. More recently, also on Ubuntu/WSL/Win10, I've observed peculiar result-line date values such as "4442758-11-21 10:39:25 UTC" (~2021-10-07) after recovering from a large-memory-related stage 2 Mlucas crash on 10M and 106M exponent runs.
  4. Factors found at a GCD early in stage 2 are reported as if they were found in stage 1, with only the stage 1 bound given. Computing the effective stage 2 bound in such a case is neither easy nor clear.
  5. A factor found after a full but interrupted stage 2 was likewise reported with stage 1 bounds only.
  6. -maxalloc with a percentage that equates to more than ~32 GiB of attempted usage results in a segmentation fault at the beginning of P-1 stage 2. Observed on a 128 GiB RAM AVX system with a Win10/WSL1/Ubuntu 18.04.2 LTS combo; Ernst has reproduced the issue on a KNL/Ubuntu system. Some variables that were typed uint32 will need to be uint64 (32 GiB is 2^32 eight-byte doubles, consistent with a 32-bit count overflowing there). Until resolved, a workaround is to use less of the available RAM.
  7. In P-1 at least, some values that ought to be recalculated for each worktodo item appear to be reused unchanged instead. The number of stage 2 buffers is the first example seen: not recalculated between a 106M and a 334M exponent. Another: the FFT length did not get updated from a 1M P-1 task to a 3M P-1 task immediately following. Possible workarounds include sorting and segregating assignments to similar exponents, or using the command line and scripting to run disparate exponents in separate program sessions.
  8. For P-1, and probably other work types, on small exponents where a stage of computation can complete faster than the checkpoint save interval or stat-file update interval, no stage timing is saved to the stat file or displayed on stdout/stderr. This means run time scaling measurements on small exponents cannot be made, except with a stopwatch or the batch-file/shell-script equivalent.
  9. In self-test on enormous FFT lengths (256M - 512M) on 2 models of AVX512 CPU on Ubuntu/WSL/Win10, one of the 512M radix sets reproducibly segfaults, preventing output of the line that would finish the self-test. On Ernst's attempt to reproduce on bare Ubuntu on AVX512, perhaps with a later version of the source code, an excessive roundoff error is flagged instead. A workaround is to hand-edit the mlucas.cfg file based on console output from the radix sets completed before the crash. Since the enormous-class FFT lengths begin at 256M, there's little or no current need for them in the Mersenne number realm; FFT length 192M is expected to be sufficient for P-1 factoring attempts on OBD candidates.
  10. For the worktodo entry:
    PMinus1=00000000000000000000000000000000,1,2,3000077,-1,8000,1200000
    ...
    Product of Stage 1 prime powers with b1 = 8000 is 11649 bits (183 limbs), vs estimated 12035. Setting PRP_BASE = 3.
    ERROR: at line 1165 of file ../src/mi64.c
    Assertion failed: mi64_shl: zero-length array or shift count >= 64!
    Code needs modification for the case where it incorrectly generates a shift count of 64.
  11. A lingering bug related to relocation-prime handling in P-1 stage 2 restart.
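For item 1, the bits-versus-digits mislabeling is easy to verify; a Python one-off using the V20.0 stage 1 example above:
Code:
# Item 1 check: the factor reported as "95-digit" is 95 bits, 29 digits.
f = 33287662948300610984694812407
print(f.bit_length())   # 95 -- bit count, matching log2(f) = 94.74...
print(len(str(f)))      # 29 -- decimal digit count, matching log10(f) = 28.52...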

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

2021-10-10, 16:16   #17   kriesel

V20.1 P-1 run time scaling

(draft)


Based on very limited data, run time scaling is approximately p^2, in line with expectations from other applications and from first principles. I'll add more here after more data points complete running.
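Given ~p^2 scaling, a measured run time can be extrapolated by the square of the exponent ratio. A minimal Python sketch; the sample numbers are made up for illustration, not measured:
Code:
# Extrapolate P-1 run time assuming ~p^2 scaling (sample numbers made up).
def scaled_runtime(t_measured, p_measured, p_target):
    """Estimate run time at exponent p_target from a timing at p_measured."""
    return t_measured * (p_target / p_measured) ** 2

# e.g. 2.0 hours measured at p=10M suggests ~233 hours at p=108M
print(scaled_runtime(2.0, 10_000_000, 108_000_000))   # 233.28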


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
