#13 | timbit | Mar 2009 | 2²·5 Posts
Hi, thanks for all the replies.

My slow system has 32 GB of DDR4-2400 ECC RDIMM (4 × 8 GB) running in quad channel (Intel Xeon E5-2680 v4, 14 cores, 28 threads). In local.txt I have:

Memory=28672 during 7:30-23:30 else 28672

so essentially 28 GB of RAM is available to mprime.

I deleted results.bench.txt and gwnum.txt, then invoked mprime -m and ran the throughput benchmark with 1 worker, 4 cores, 48K FFT size. Yes, I've seen the replies saying "it doesn't multithread well with small FFTs" and "use many workers, 1 core each", and I will lean towards that from now on, but for this benchmark I am using 1 worker with 4 cores. I've attached snippets of results.bench.txt and gwnum.txt.

According to results.bench.txt, the fastest implementation is:

FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (4 cores, 1 worker): 0.21 ms. Throughput: 4848.21 iter/sec.

i.e. Pass1=768, Pass2=64, clm=1. Right? However, when I invoke mprime -d on my ECM work on 999181 I see:

[Work thread Jun 27 22:30] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2, 4 threads

That is not the fastest FFT selection given the results of the self-benchmark. Is this normal?

Also, I still cannot see any auto-benchmark in my work. Can anyone explain what triggers the autobench? I see other posts complaining about it and asking how to prevent it; I want to do the opposite and invoke it! The reason I'm asking is that one of my other machines was also running slowly for a day or two on a new exponent; then the autobench kicked in and it was much, much faster after that. Evidently a different FFT implementation was chosen, which made throughput higher.

In the attached log_snippet.txt, ECM curve 50 phase 1 is taking almost 3,000 seconds. On another machine with DDR4-2133 it takes under 2,000 seconds for phase 1. (That machine has been shut down for the summer, unfortunately, so I cannot access it again until the fall.) When I run htop there are no other processes taking up CPU cycles (and mprime is only using 4 of the 14 cores anyway).

Last fiddled with by timbit on 2022-06-28 at 16:22. Reason: formatting
#14 | kriesel ("TF79LL86GIMPS96gpu17") | Mar 2017 | US midwest | 3⁴·7·13 Posts
The memory situation looks pretty good to me. The log file shows up to 8 GB used in stage 2, so three workers at a time could get all they want and a fourth could get by on somewhat less. Additional workers could be running stage 1 at the same time.

It's not normally necessary to delete results.bench.txt and gwnum.txt.

Mprime defaults to 4 cores per worker because the expectation is that most users will be running DC (~61M exponent) or higher, and 4 cores/worker is reasonably close to maximum total system throughput for that exponent and FFT size and somewhat upward. Deviating considerably from usual usage, as you do (very small exponents) or I do (very large exponents), means the usual near-optimal settings no longer apply. And systems/processors do vary in what is optimal for their specific design.

Please bite the bullet and benchmark with multiple workers. It's the only way you will get close to the full capability of your system on such small exponents. I suggest benchmarking 1 and 2 cores/worker, on 14 and 7 workers respectively. Latency of an assignment will go up, but throughput (assignments completed per day) should also go up. After that you could try whichever cores/worker setting seems faster for your chosen work, varying the number of workers downward from all-cores-occupied. It's possible that cache efficiency is higher at less than the maximum possible number of workers, enough to give higher total throughput at, say, 12 rather than 14 cores working. You could also consider which configuration gives the maximum throughput per unit of system power consumption.

Last fiddled with by kriesel on 2022-06-28 at 19:39
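One way to compare those configurations once the benchmarks have run (a rough sketch, not anything built into prime95; it assumes the results.bench.txt lines follow the same "... clm=N (C cores, W workers): ... Throughput: X iter/sec." format as the snippets posted above):

Code:
#!/usr/bin/env python3
# Rough sketch, not part of prime95: summarize results.bench.txt by
# (cores, workers) configuration so that e.g. 14x1 and 7x2 runs can be
# compared on throughput.
import re
from collections import defaultdict

LINE = re.compile(
    r"FFTlen=(?P<fft>\S+), .*clm=\d+ "
    r"\((?P<cores>\d+) cores?, (?P<workers>\d+) workers?\): "
    r".*Throughput: (?P<thru>[\d.]+) iter/sec"
)

best = defaultdict(float)   # (fft, cores, workers) -> best iter/sec seen
with open("results.bench.txt") as f:
    for line in f:
        m = LINE.search(line)
        if m:
            key = (m["fft"], int(m["cores"]), int(m["workers"]))
            best[key] = max(best[key], float(m["thru"]))

for (fft, cores, workers), thru in sorted(best.items(), key=lambda kv: -kv[1]):
    print(f"{fft}: {cores} core(s), {workers} worker(s): best {thru:.2f} iter/sec")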
#15 | timbit | Mar 2009 | 2²·5 Posts
OK, I managed to see the autobench last night (1 worker, 4 cores). The fastest FFT implementation still did not get selected.

I then decided to go back to basics: 1 worker, 1 core, that's it. I deleted the existing gwnum.txt and results.bench.txt, ran ./mprime -m, chose item 17 (benchmark), and benchmarked the 48K FFT with 1 worker, 1 core. Attached are results.bench.txt and gwnum.txt. When I started ECM on the 999xxx exponent with B1=1000000, I could see that a non-optimal FFT was chosen again.

How can I get mprime to choose the optimal FFT? Is there anything in prime.txt or local.txt that can manually select an FFT implementation? I've run as 1 worker, 1 core, so no excuses now. Also, nothing else is running on my Ubuntu 22.04 x64 box (Intel Xeon E5-2680 v4).

From results.bench.txt (the fastest line is the last one, which was bolded in the original):

Prime95 64-bit version 30.7, RdtscTiming=1
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=4 (1 core, 1 worker): 0.47 ms. Throughput: 2138.80 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=2 (1 core, 1 worker): 0.45 ms. Throughput: 2238.99 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=1 (1 core, 1 worker): 0.26 ms. Throughput: 3908.64 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=4 (1 core, 1 worker): 0.21 ms. Throughput: 4709.43 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=2 (1 core, 1 worker): 0.26 ms. Throughput: 3907.09 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (1 core, 1 worker): 0.21 ms. Throughput: 4874.29 iter/sec.

When I start ./mprime -d, I see:

[Main thread Jun 29 09:26] Mersenne number primality test program version 30.7
[Main thread Jun 29 09:26] Optimizing for CPU architecture: Core i3/i5/i7, L2 cache size: 14x256 KB, L3 cache size: 35 MB
[Main thread Jun 29 09:26] Starting worker.
[Work thread Jun 29 09:26] Worker starting
[Work thread Jun 29 09:26] Setting affinity to run worker on CPU core #2
[Work thread Jun 29 09:26] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2
[Work thread Jun 29 09:26] 0.052 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Work thread Jun 29 09:26] ECM on M999217: curve #1 with s=7014263894342847, B1=1000000, B2=TBD

Again a non-optimal FFT was chosen. It's truly bizarre. Any thoughts on what the root cause may be?
#16 | kriesel ("TF79LL86GIMPS96gpu17") | Mar 2017 | US midwest | 1CCB₁₆ Posts
Perhaps there is a bug in v30.7 regarding benchmark handling. Have you considered trying v30.8b15?

For comparison, here are 48K FFT benchmark results done two different ways on a 4-core/8-HT i5-1035G1 (2 of 2 DDR4-2666 SODIMM slots populated with 8 GiB each, Win10, prime95 v30.7b9), and the relevant portion of results.bench.txt produced from both runs. There is a considerable effect of what I think is a memory-bandwidth constraint visible, given that 4 cores / 4 workers produce only two to three times the total iteration throughput of a single core/worker. Note that this benchmark was run while a multi-tab web browser session and multiple remote-desktop session clients were also running. Prime95 gets ~59% of the CPU while these and other tasks (Windows Explorer, AV, system services, etc.) keep the HT busy, up to 90% of the 8 logical cores. That's my normal way of operating this particular system, so that's what I benchmark for.

edit: for ~77M PRP as DC (4M FFT, all 4 cores in use), checking my system's results.bench.txt against the running FFT, I find it is using the fastest FFT benchmarked, which happens to be 4M, Pass1=1K, Pass2=4K, clm=1, 4 threads. Its results.bench.txt is ~0.5 MB and gwnum.txt ~0.18 MB. There have been rare cases where one grew too large and caused problems, but these sizes seem OK in this version.

Last fiddled with by kriesel on 2022-06-29 at 19:55
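To put a rough number on that scaling observation, the usual parallel-efficiency calculation looks like this (illustration only; both figures below are hypothetical placeholders, not measured values):

Code:
# Back-of-envelope parallel-efficiency check (hypothetical numbers; the
# real ones come from your own results.bench.txt). Efficiency well below
# 1.0 usually points at a shared bottleneck such as memory bandwidth.
t1 = 5000.0       # hypothetical best 1 core / 1 worker throughput, iter/sec
t_all = 12000.0   # hypothetical aggregate for 4 workers on 4 cores (about 2.4x)
workers = 4
efficiency = t_all / (workers * t1)
print(f"parallel efficiency: {efficiency:.2f}")   # 0.60 with these numbers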
#17 | timbit | Mar 2009 | 2²·5 Posts
OK, thanks for the reply. I will move the files out of the trash bin; I was unaware the program uses all the benchmark results from the past. I assumed it only takes the latest one (though if there's only one benchmark available, what else is it supposed to use?).

Ahh, there's a 30.8 build 15. Okey dokey, let me give that a shot. I'll let it run for a week or two, double-check it for autobench, and hopefully it has the smarts to run faster. Or perhaps a bug was indeed fixed.
#18 | kriesel ("TF79LL86GIMPS96gpu17") | Mar 2017 | US midwest | 3⁴·7·13 Posts
Found in prime95's undoc.txt (emphasis mine):

Code:
Most FFT sizes have several implementations. The program uses throughput
benchmark data to select the fastest FFT implementation. The program assumes
all CPU cores will be used and all workers will be running FFTs. This can be
overridden in gwnum.txt:
        BenchCores=x
        BenchWorkers=y

I'd be interested to see what George's take is on the clm=2 vs. 1 performance and selection.
#19 | timbit | Mar 2009 | 2²·5 Posts
I'll stop mprime, save the entire directory, and give the new version a try in a fresh directory (it's a beta, and I'll keep that in mind).
#20 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 2²·13·157 Posts
A bug is certainly possible. Version 30.8 will not behave any differently.

For the throughput benchmark data to be used in FFT selection, the throughput benchmark inputs (#workers, #cores) must match the current mprime #workers/#cores configuration. As you've noted, the auto-bench should make this happen. There could be an issue/bug in that you aren't using all cores; I did the majority of my testing assuming all cores would be used.

Try running a throughput benchmark with 14 workers / 14 cores, then set up mprime to run 14 workers (obviously 1 core per worker). Is the fastest implementation selected then?

Last fiddled with by Prime95 on 2022-06-29 at 20:49
#21 | Prime95 ("P90 years forever!") | Aug 2002 | Yeehaw, FL | 2²·13·157 Posts
Keep an eye out for version 30.9 (it won't address your problem, but it will run ECM better).
#22 | timbit | Mar 2009 | 2²·5 Posts
Let me run the v30.7 throughput test with 14 workers, 1 core per worker; I should have some results within 24 hours. I'll also try it with 30.8b15.

Last fiddled with by timbit on 2022-06-29 at 20:58. Reason: Version 30.8