2021-09-06, 21:02   #13
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

23·7·127 Posts
Mlucas V20.1 timings on various hardware and environments, & prime95 compared

Preface:
None of the following should be mistaken for criticism of anyone's efforts or results. Writing such software is hard. Making it produce correct outputs is harder. Making it both fast and functional across a variety of inputs, hardware, environments, etc., is harder still. Few even dare to try.

Note:
prime95 prevents running multiple instances in the same folder.
Mlucas does not prevent simultaneously running multiple instances on the same exponent in the same folder. Don't do that. It creates an awful mess.

Case 1: PRP DC 84.7M on i7-8750H (6 core, 12 hyperthread)
Mlucas v20.1 on Ubuntu 18.04 LTS atop WSL1 on Win10: nominal 4-thread 18 iters/sec; nominal 8-thread 29 iters/sec, so 18 + 29 = 47 iters/sec throughput for the system as operated, potentially up to 3 x 18 = 54 iters/sec combined for 3 processes of 4 threads each.

prime95 V29.5b6 benchmark on Windows 10 Home x64, same system: benchmarked all FFT lengths 2M-32M. For 84.7M, the FFT length is 4480K, giving 88.7 to 93.4 iters/sec throughput. Best throughput is with all 6 cores on one worker, which also gives minimum latency.

Observed Mlucas v20.1/WSL1 performance is ~50 to 61% that of prime95/Win on this system (47/93.4 to 54/88.7). Note that prime95 has improved in speed in some aspects since the version benchmarked. TightVNC access and GPU app overhead were present and should have been about constant across runs.
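
The Case 1 arithmetic can be checked with a short sketch. All numbers are the ones quoted above; the variable names are only illustrative, and pairing the lower Mlucas figure with the higher prime95 figure (and vice versa) is how the ~50 to 61% range appears to have been formed.

```python
# Figures from the Case 1 paragraphs above (iters/sec).
mlucas_4t = 18.0
mlucas_8t = 29.0
prime95_low, prime95_high = 88.7, 93.4

as_operated = mlucas_4t + mlucas_8t   # one 4-thread plus one 8-thread instance
best_case = 3 * mlucas_4t             # three 4-thread instances on 12 hyperthreads

# Lower bound: as-operated Mlucas vs. the fastest prime95 figure.
# Upper bound: best-case Mlucas vs. the slowest prime95 figure.
ratio_low = as_operated / prime95_high
ratio_high = best_case / prime95_low

print(f"Mlucas {as_operated:.0f} to {best_case:.0f} iters/sec, "
      f"{ratio_low:.1%} to {ratio_high:.1%} of prime95")
```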

Case 2: Dual e5-2697V2 (12 core & x2 HT) for wavefront PRP
prime95 V29.8b6 on Win10, benchmark at 5760K FFT length; best was 2 workers, 238 iters/sec throughput.

Mlucas v20.1/WSL 8 thread, 5632K fft length, 15.09 ms/iter -> 66.3 iters/sec. Optimistically extrapolating to triple throughput for 24 cores, 198.8 iters/sec.
Other benchmarking showed a disadvantage to using all hyperthreads, versus 1 thread per processor core.

Mlucas V20.0/WSL 4 thread:
Code:
      5632  msec/iter =   29.25  ROE[avg,max] = [0.231956749, 0.312500000]  radices = 176 16 32 32  0  0  0  0  0  0
Corresponds to 1000/29.25 = 34.2 iter/sec
Optimistically extrapolating to 6x throughput for 24 cores, 205.1 iter/sec throughput.

Mlucas/WSL performance is ~83.5-86.2% of prime95 under favorable assumptions (198.8/238 to 205.1/238). Note that this is in comparison to V29.8b6, not the current v30.6b4 prime95.
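
The optimistic extrapolation used in Case 2 can be sketched as follows. It assumes perfectly linear scaling when enough identical instances are run to occupy all 24 cores, which is the favorable assumption named above and not a measured result; the function name is hypothetical.

```python
def extrapolated_throughput(ms_per_iter, threads_used, cores_total):
    """Iters/sec of one instance, scaled linearly to fill all cores (optimistic)."""
    instances = cores_total / threads_used
    return instances * 1000.0 / ms_per_iter

PRIME95_BEST = 238.0  # iters/sec, 2 workers, from the prime95 benchmark above

# (threads per Mlucas instance, measured ms/iter) as quoted in Case 2
for threads, ms in ((8, 15.09), (4, 29.25)):
    est = extrapolated_throughput(ms, threads, cores_total=24)
    print(f"{threads}-thread: {est:.1f} iters/sec, "
          f"{est / PRIME95_BEST:.1%} of prime95")
```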

Case 3: Hardware is i7-4770 (4 core & x2 HT) for wavefront PRP, 5-way test on dual-boot Win10/Ubuntu system
Mlucas V20.1/Ubuntu 20.04/WSL2/Win10 sandwich on Windows boot (primary) partition
Code:
      5632  msec/iter =   23.68  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000 ms/sec / (23.68 ms/iter) = 42.23 iter/sec. Probably suffered somewhat from RDP and GPU apps running on Windows simultaneously.

Mlucas V20.1/Ubuntu 20.04 LTS booted from a second partition on the same system drive; 8 threads, which showed an advantage over 4 threads in WSL.
Code:
      5632  msec/iter =   15.98  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0
1000 ms/sec / (15.98 ms/iter) = 62.58 iter/sec. Much improved throughput over the WSL2 scenario above.
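
For anyone converting many such timing lines, here is a minimal sketch that extracts the FFT length and msec/iter and converts to iter/sec. The regular expression is fitted only to the lines quoted in this post; other Mlucas versions may format them differently, and the helper name is hypothetical.

```python
import re

# Pattern fitted to the timing lines quoted in this post (an assumption,
# not a documented Mlucas format specification).
TIMING_RE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)")

def parse_timing(line):
    """Return (fft_length_K, ms_per_iter, iters_per_sec), or None if no match."""
    m = TIMING_RE.match(line)
    if m is None:
        return None
    fft_k = int(m.group(1))
    ms = float(m.group(2))
    return fft_k, ms, 1000.0 / ms

lines = [
    "      5632  msec/iter =   23.68  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0",
    "      5632  msec/iter =   15.98  ROE[avg,max] = [0.196175927, 0.250000000]  radices = 352 16 16 32  0  0  0  0  0  0",
]
for line in lines:
    fft_k, ms, ips = parse_timing(line)
    print(f"{fft_k}K: {ms} ms/iter = {ips:.2f} iter/sec")
```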

prime95 v30.6b4, with the usual RDP, GPU apps, etc. running, so some overhead load.
Code:
FFTlen=5600K, Type=3, Arch=4, Pass1=448, Pass2=12800, clm=2 (3 cores, 1 worker): 13.37 ms.
Throughput: 74.82 iter/sec.

prime95 v30.6b4, Windows 10 Pro x64, no RDP or GPU apps running
Code:
Timings for 5760K FFT length (4 cores, 1 worker): 11.31 ms.  Throughput: 88.43 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.88, 22.71 ms.  Throughput: 87.73 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 45.76, 44.74, 44.90, 44.29 ms.  Throughput: 89.06 iter/sec.
Average of the 3 worker counts, 88.41 iter/sec

mprime v30.6b4, Ubuntu 20.04, logged in at console, no GPU apps, no remote access, minimal overhead
Code:
Timings for 5760K FFT length (4 cores, 1 worker): 11.24 ms.  Throughput: 88.97 iter/sec.
Timings for 5760K FFT length (4 cores, 2 workers): 22.43, 22.48 ms.  Throughput: 89.06 iter/sec.
Timings for 5760K FFT length (4 cores, 4 workers): 44.65, 44.65, 44.60, 44.76 ms.  Throughput: 89.55 iter/sec.
Average of the 3 worker counts, 89.19 iter/sec (1.0088x Windows low-overhead run average throughput)
(mprime/Linux max throughput is 1.0055x prime95/Windows max throughput)

mprime and prime95 timings are very close for equalized system overhead, same hardware.
While there's essentially no speed advantage for Linux vs. Windows with prime95/mprime, there may be one for Mlucas, because of the core virtualization issue on WSL, which is currently required to run Mlucas on Windows. This should have less effect when the cores are fully loaded with enough Mlucas threads to occupy them all.
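
The prime95/mprime throughput figures quoted above appear to be the sum of the per-worker iteration rates. The sketch below recomputes them from the rounded ms timings shown in the post, so results can differ from the program's own figures in the last digit; the function and variable names are illustrative only.

```python
def total_throughput(ms_per_worker):
    """Sum of per-worker iteration rates, in iters/sec."""
    return sum(1000.0 / ms for ms in ms_per_worker)

# Per-worker timings (ms) for 1, 2 and 4 workers, as quoted above.
windows_runs = [[11.31], [22.88, 22.71], [45.76, 44.74, 44.90, 44.29]]
linux_runs = [[11.24], [22.43, 22.48], [44.65, 44.65, 44.60, 44.76]]

def average_throughput(runs):
    """Mean throughput over the tested worker counts."""
    return sum(total_throughput(r) for r in runs) / len(runs)

win_avg = average_throughput(windows_runs)
lin_avg = average_throughput(linux_runs)
print(f"Windows avg {win_avg:.2f}, Linux avg {lin_avg:.2f}, "
      f"ratio {lin_avg / win_avg:.4f}")
```

Recomputed this way, the averages land within about 0.01 iter/sec of the 88.41 and 89.19 quoted above.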

Mlucas/WSL performance is 42.23/74.82 = ~56.4% of prime95/Win10. Both sessions may have been negatively impacted by remote-desktop overhead.
Mlucas/Ubuntu performance is 62.58/88.43 = ~70.8% of prime95/Win10 single-worker. Both the Win and Ubuntu timings were taken without remote desktop overhead or GPU apps.

Benchmarking experimental error unknown. Digitization error up to ~0.09%.
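
One way to arrive at a figure like ~0.09% is sketched below. It assumes "digitization error" refers to the 0.01 ms display resolution of the reported timings, which is an interpretation and not something stated above; the function name is hypothetical.

```python
def digitization_error(reported_ms, resolution_ms=0.01):
    """Relative size of one display step for a timing reported in ms."""
    return resolution_ms / reported_ms

# Single-worker / single-instance timings quoted in Case 3 (ms/iter).
timings = (11.24, 11.31, 13.37, 15.98, 23.68)
for ms in timings:
    print(f"{ms:6.2f} ms: {digitization_error(ms):.3%}")

# The shortest timing gives the largest relative step.
worst = max(digitization_error(ms) for ms in timings)
print(f"worst case: {worst:.3%}")
```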

Mlucas can currently perform LL, PRP, and P-1 computations on higher exponents than any other known GIMPS software. Benchmark first and estimate run times before committing to such work.


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-09-06 at 21:31