mersenneforum.org How to get mprime to run self bench?

2022-06-28, 05:49   #12
Uncwilly
6809 > 6502

Aug 2003

2·13·409 Posts

Quote:
 Originally Posted by timbit Also I still cannot get more than 8 threads on a worker.
You want more workers, not threads per worker. Basically more workers, each doing their own assignments.

2022-06-28, 16:18   #13
timbit

Mar 2009

10100₂ Posts

Hi, thanks for all the replies.

My slow system has 32 GB DDR4-2400 ECC RDIMM (4 * 8GB) RAM running in quad channel. (Intel Xeon E5-2680 v4, 14 cores, 28 threads). In the local.txt I have:

Memory=28672 during 7:30-23:30 else 28672

So essentially 28GB RAM available to mprime.
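(For anyone copying this syntax: the value is in MB, and the during/else form exists so you can give mprime a smaller allowance while you're using the machine. A hypothetical variant of my line, reserving 8 GB for myself during the day, would be:)

```ini
Memory=20480 during 7:30-23:30 else 28672
```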

I deleted results.bench.txt and gwnum.txt. I then invoked mprime -m and did the self-throughput test with 1 worker, 4 cores, 48K FFT size. Yes, I see the replies saying "it doesn't multithread well with small FFTs" and "use many workers, 1 core each". I will lean toward that from now on, but for this benchmark I am using 1 worker, 4 cores.

I've attached snippets of results.bench.txt, and gwnum.txt.

Now according to the results.bench.txt, the fastest is:

FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (4 cores, 1 worker): 0.21 ms. Throughput: 4848.21 iter/sec.

Pass1=768, Pass2=64, clm=1. Right?
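For anyone wanting to double-check which implementation a results.bench.txt names fastest, here's a small sketch (mine, not part of mprime) that assumes the benchmark lines look like the excerpts quoted in this thread:

```python
import re

def fastest_implementation(lines):
    """Return (config, throughput) for the highest-throughput line.

    Assumes lines shaped like the results.bench.txt excerpts above, e.g.
    FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1
    (4 cores, 1 worker): 0.21 ms.  Throughput: 4848.21 iter/sec.
    """
    best = None
    for line in lines:
        if not line.startswith("FFTlen="):
            continue  # skip headers and non-benchmark lines
        m = re.search(r"Throughput:\s*([\d.]+)\s*iter/sec", line)
        if m:
            config = line.split("):")[0] + ")"  # everything before the timings
            throughput = float(m.group(1))
            if best is None or throughput > best[1]:
                best = (config, throughput)
    return best
```

Run over the whole file it just confirms what the eye picks out, but it saves squinting at a few hundred lines.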

However, when I invoke mprime -d on my ECM 999181 I see:

[Work thread Jun 27 22:30] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2, 4 threads

This isn't the fastest FFT selection given the results from the self-benchmark. Is this normal? Also, I still cannot see any autobenchmark happening in my work. Can anyone explain what triggers the autobench? I see other posts complaining about it and asking how to prevent it; I want to do the opposite and invoke it!

The reason I am asking about the autobench is that on one of my other machines it was also running slow for a day or two on a new exponent; then the autobench kicked in, and it was much, much faster after that. Obviously a different FFT implementation was chosen, which made throughput higher.

In the attached log_snippet.txt, ECM curve 50 phase 1 is taking almost 3000 seconds. On another machine with DDR4-2133 it takes < 2000 seconds for phase 1. Unfortunately that machine has been shut down for the summer, so I cannot access it again until the fall.

Also, when I run htop there are no other processes taking up CPU cycles (mprime is only using 4 of the 14 cores anyway).
Attached Files
 gwnum.txt (681 Bytes, 42 views) results.bench.txt (10.2 KB, 38 views) log_snippet.txt (2.4 KB, 45 views)

Last fiddled with by timbit on 2022-06-28 at 16:22 Reason: formatting

2022-06-28, 19:36   #14
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³×7²×17 Posts

The memory situation looks pretty good to me. The log file shows up to 8 GB used in stage 2, so 3 workers at a time could get all they want, and a fourth can get by on somewhat less. Additional workers could be running stage 1 at the same time.

It's not normally necessary to delete results.bench.txt and gwnum.txt.

Mprime defaults to 4 cores per worker because the expectation is that most users will be running DC (~61M exponent) or higher, and 4 cores/worker is reasonably close to maximum total system throughput for that exponent & FFT size and somewhat upward. Deviating considerably from usual usage, as you do (very small exponents) or I do (very large exponents), means the usual near-optimal settings no longer apply. And systems/processors do vary in what is optimal for their specific design.

Please bite the bullet and benchmark with multiple workers. It's the only way you will get close to the full capability of your system on such small exponents. I suggest benchmarking 1 & 2 cores/worker, on 14 and 7 workers respectively. Latency of an assignment will go up, but throughput (assignments completed per day) should go up too.

After that you could try whichever cores/worker setting seems faster for your chosen work, varying the number of workers downward from all-cores-occupied. It's possible that cache efficiency is higher at less than the maximum possible number of workers, enough to give higher throughput at, say, 12 rather than 14 cores working. You could also consider which configuration maximizes throughput per unit of system power consumption.

Last fiddled with by kriesel on 2022-06-28 at 19:39
2022-06-29, 16:35   #15
timbit

Mar 2009

2²·5 Posts

Ok, I managed to see the autobench last night (1 worker, 4 cores). The fastest FFT implementation did not get selected.

I then decided to go back to basics: 1 worker, 1 core, that's it. I deleted the existing gwnum.txt and results.bench.txt.

I ran ./mprime -m and chose item 17, benchmark: 48K FFT, 1 worker, 1 core.

Attached are the results.bench.txt, and gwnum.txt. When I started ECM on 999xxx exponent, B1=1000000. I can see that a non-optimal FFT was chosen.

How can I get mprime to choose the optimal FFT? Is there anything in prime.txt or local.txt that can manually select an FFT implementation? I've run 1 worker, 1 core, so no excuses now. Also, nothing else is running on my Ubuntu 22.04 x64 box (Intel Xeon E5-2680 v4).

From results.bench.txt: (bold is fastest)

Prime95 64-bit version 30.7, RdtscTiming=1
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=4 (1 core, 1 worker): 0.47 ms. Throughput: 2138.80 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=2 (1 core, 1 worker): 0.45 ms. Throughput: 2238.99 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=1 (1 core, 1 worker): 0.26 ms. Throughput: 3908.64 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=4 (1 core, 1 worker): 0.21 ms. Throughput: 4709.43 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=2 (1 core, 1 worker): 0.26 ms. Throughput: 3907.09 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (1 core, 1 worker): 0.21 ms. Throughput: 4874.29 iter/sec.

When I start ./mprime -d, I see:

[Main thread Jun 29 09:26] Mersenne number primality test program version 30.7
[Main thread Jun 29 09:26] Optimizing for CPU architecture: Core i3/i5/i7, L2 cache size: 14x256 KB, L3 cache size: 35 MB
[Main thread Jun 29 09:26] Starting worker.
[Work thread Jun 29 09:26] Worker starting
[Work thread Jun 29 09:26] Setting affinity to run worker on CPU core #2
[Work thread Jun 29 09:26] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2
[Work thread Jun 29 09:26] 0.052 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Work thread Jun 29 09:26] ECM on M999217: curve #1 with s=7014263894342847, B1=1000000, B2=TBD

Non-optimal FFT chosen. It's truly bizarre.

Any thoughts on what a root cause may be?
Attached Files
 results.bench.txt (5.1 KB, 11 views) gwnum.txt (375 Bytes, 10 views)

2022-06-29, 19:00   #16
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³×7²×17 Posts

Quote:
 Originally Posted by timbit I deleted existing gwnum.txt and results.bench.txt. Non-optimal FFT chosen. It's truly bizarre. Any thoughts on what a root cause may be?
Please stop routinely giving mprime "benchmark amnesia". The files results.bench.txt and gwnum.txt are for ACCUMULATING performance data. Over time. I have systems that have run prime95 for years on which these files have NEVER been cleared. Mprime/prime95 will repeatedly auto-benchmark, to handle relative performance being affected by fluctuations in other loads (whether caused by the user or by system activity), until it builds up a sufficient count of redundant benchmark values. You're repeatedly, intentionally erasing that history instead.

Perhaps there is a bug in v30.7 regarding benchmark handling. Have you considered trying v30.8b15?

For comparison, here's 48k fft benchmark results done two different ways on 4-core/8HT i5-1035G1, 2 of 2 DDR4-2666 SODIMMs x 8 GiB each, Win10 with prime95 v30.7b9, and the relevant portion of results.bench.txt produced from them both.
A considerable effect of what I think is a memory-bandwidth constraint is visible, given that 4 cores/4 workers produce only two to three times the total iteration throughput of a single core/worker. Note, this benchmark was run while a multi-tab web browser session and multiple remote desktop session clients were also running. Prime95 gets ~59% of the CPU while these and other tasks (Windows Explorer, AV, system services, etc.) keep the HT busy, up to 90% of the 8 logical cores. That's my normal way of operating this particular system, so that's what I benchmark for.
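To put a rough number on that, a trivial sketch (my arithmetic, with illustrative figures only, not values taken from the attached benchmark file):

```python
def scaling_efficiency(total_multi, total_single, workers):
    """Fraction of ideal linear scaling achieved when adding workers."""
    return (total_multi / total_single) / workers

# e.g. 4 workers delivering 2.5x the aggregate throughput of 1 worker:
print(scaling_efficiency(2.5, 1.0, 4))  # 0.625, i.e. ~63% of linear scaling
```

Anything well below 1.0 at low worker counts is a hint that something shared, most plausibly memory bandwidth here, is the bottleneck rather than the cores themselves.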
Quote:
 Originally Posted by kriesel Please bite the bullet and benchmark with multiple workers.

edit: for ~77M PRP as DC, 4M FFT, all 4 cores in use, checking my system above's results.bench.txt against the running FFT, I find it is using the fastest FFT benchmarked, which happens to be 4M, Pass1=1K, Pass2=4K, clm=1, 4 threads.
Its results.bench.txt is ~0.5 MB, gwnum.txt ~0.18 MB. There have been rare cases where one grew too large and caused problems, but these sizes seem OK in this version.

Attached Files
 martin48kbenchmark.txt (10.2 KB, 9 views)

Last fiddled with by kriesel on 2022-06-29 at 19:55

2022-06-29, 19:18   #17
timbit

Mar 2009

2²·5 Posts

Ok, thanks for the reply. I will move the files out of the "trash bin". I was unaware the program uses all the results from the past; I assumed it would only take the latest one. (If there's only one benchmark available, what else is it supposed to use?)

Ahhh... there's a 30.8 build 15. Okey dokey. Let me give that a shot. I'll let it run for a week or two. I'll double-check it for autobench, and hopefully there's enough smarts to let it run faster. Or perhaps a bug was indeed fixed.
2022-06-29, 19:30   #18
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1A08₁₆ Posts

Found in prime95's undoc.txt (emphasis mine):
Code:
Most FFT sizes have several implementations.  The program uses
throughput benchmark data to select the fastest FFT implementation.
The program assumes all CPU cores will be used and all workers will
be running FFTs.  This can be overridden in gwnum.txt:
        BenchCores=x
        BenchWorkers=y
If the program is selecting what would be fastest with all cores busy, as its documentation states, and your expectation is that it would select what would be fastest with a few of the 14 cores busy, that may account for some of the discrepancies between expectation and observed operation. I'd be interested to see what George's take is on the clm=2 vs. clm=1 performance and selection.
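If I read that undoc.txt text right, matching timbit's 1-worker-on-4-cores test would mean adding something like this to gwnum.txt (untested on my side; the values here are just his test setup):

```ini
BenchWorkers=1
BenchCores=4
```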
2022-06-29, 19:57   #19
timbit

Mar 2009

2²·5 Posts

Quote:
 Originally Posted by kriesel Found in prime95's undoc.txt (emphasis mine): Code: Most FFT sizes have several implementations. The program uses throughput benchmark data to select the fastest FFT implementation. The program assumes all CPU cores will be used and all workers will be running FFTs. This can be overridden in gwnum.txt: BenchCores=x BenchWorkers=y If the program is selecting what would be fastest with all cores busy, as its documentation states, and your expectation is it would select what would be fastest with a few of the 14 cores busy, that may account for some discrepancies between expectation and observed operation. I'd be interested to see what George's take is on the clm=2 vs. 1 performance and selection.
Hi, ya I saw that too. I didn't think much of it because the autobench would know your current configuration (and in the manual throughput test, the user inputs the number of cores and workers).
I'll stop mprime, save the entire directory and give the new version a try (it's a beta and I'll keep that in mind) in a fresh directory.

2022-06-29, 20:47   #20
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2×3×5²×53 Posts

Quote:
 Originally Posted by timbit Any thoughts on what a root cause may be?
A bug is certainly possible. Version 30.8 will not behave any differently.

For the throughput benchmark data to be used in FFT selection, the throughput benchmark inputs (#workers, #cores) must match the current mprime #workers/#cores configuration. As you've noted, the auto-bench should make this happen.

There could be an issue/bug in that you aren't using all cores. I did the majority of my testing assuming all cores would be used. Try running a throughput benchmark on 14 workers/14 cores, then set up mprime to run 14 workers (obviously 1 core per worker). Is the fastest implementation selected?

Last fiddled with by Prime95 on 2022-06-29 at 20:49

2022-06-29, 20:51   #21
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2×3×5²×53 Posts

Keep an eye out for version 30.9 (it won't address your problem, but will run ECM better).
2022-06-29, 20:53   #22
timbit

Mar 2009

2²·5 Posts

Quote:
 Originally Posted by Prime95 There could be an issue/bug in that you aren't using all cores. I did the majority of my testing assuming all cores would be used. Try running a throughput benchmark on 14 workers/14 cores, then set up mprime to run 14 workers (obviously 1 core per worker). Is the fastest implementation selected?
That's definitely possible. I am only using 4 cores per worker at the most.

Let me run the 30.7 version throughput test, 14 workers, 1 core per worker. I can have some results within 24 hours.

I'll also try with 30.8 b15.

Last fiddled with by timbit on 2022-06-29 at 20:58 Reason: Version 30.8

