mersenneforum.org

How to get mprime to run self bench? (https://www.mersenneforum.org/showthread.php?t=27896)

timbit 2022-06-27 02:14

How to get mprime to run self bench?
 
Hi,
I have a fresh install of mprime on a Linux x64 (Ubuntu 22.04) machine. I have it testing an exponent (ECM) and it's running really, really slowly. On a computer with almost identical hardware running Windows 10, ECM on almost the same exponent runs about 2 times faster.

I thought mprime would run a self bench (autobench) after a day or two of running? I suspect the Linux mprime is using a non-optimized FFT, or the FFT size is too big.

I have explicitly set AutoBench=1 in prime.txt on the mprime machine, but I still haven't seen the program trigger the autobench.

Where is the optimized FFT data stored? Maybe I can copy the FFT knowledge from one machine to the other.
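
One quick sanity check is to confirm that the setting really is present in the file as plain text. The sketch below only parses simple key=value lines; it makes no claim about how mprime itself reads prime.txt, and the sample content (including `SomeOtherKey`) is made up for illustration.

```python
def read_settings(text):
    """Parse simple key=value lines; skip blanks and [section] headers."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

# Hypothetical file content, for illustration only.
sample = "SomeOtherKey=42\nAutoBench=1\n"
settings = read_settings(sample)
print(settings.get("AutoBench"))
```

To check a real file, pass its contents instead of `sample`, for example `read_settings(open("prime.txt").read())`.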

MattcAnderson 2022-06-27 03:09

Welcome to mersenneforum.org!

Uncwilly 2022-06-27 03:45

On the slower machine, are all the memory modules the same speed and from the same manufacturer? And how are the banks filled? Improper memory set-up can slow down a machine dramatically.

timbit 2022-06-27 04:46

[QUOTE=Uncwilly;608512]The memory on the slower machine, are all the modules same speed and same manufacturer? And how are the banks filled? Improper memory set-up can slow down a machine dramatically.[/QUOTE]

Actually the slower machine has DDR4-2400, faster has DDR4-2133.

Again, the slower machine is 2 times slower. No idea why. I'm trying to get the slower machine to run an autobench, but it does not do so for whatever reason. The faster one does.

timbit 2022-06-27 15:28

[QUOTE=timbit;608520]Actually the slower machine has DDR4-2400, faster has DDR4-2133.

Again, slower machine is 2 times slower. No idea why. I'm trying to get slower machine to run an autobench, but does not do so for whatever reason. Faster one does.[/QUOTE]

Faster machine = Win10
4 sticks DDR4-2133 ECC RDIMM, quad channel
Intel Xeon E5-1607 v3 @ 3.1 GHz
ECM with 4 threads in the ~M1000000 range (B1=1000000): stage 1 = 1450 sec, stage 2 = 850 sec, total = ~2300 sec

Slower machine = Ubuntu 22.04
4 sticks DDR4-2400 ECC RDIMM, quad channel
Intel Xeon E5-2680 v4 @ 2.9 GHz
ECM with 6 threads in the ~M1000000 range (B1=1000000): stage 1 = 3400 sec, stage 2 = 1100 sec, total = ~4500 sec

Faster RAM, more threads, and 2 times slower? Again, I'm trying to trigger an autobench so the program can choose the best FFT algorithm. Surely the Linux box can do a curve faster than 4500 sec.
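
To put numbers on the gap, here is a small sketch that computes the per-stage and total slowdown from the timings quoted above. The function and variable names are just for illustration.

```python
# Timings quoted above, in seconds per ECM curve.
fast = {"stage1": 1450, "stage2": 850}   # Win10, Xeon E5-1607 v3, 4 threads
slow = {"stage1": 3400, "stage2": 1100}  # Ubuntu, Xeon E5-2680 v4, 6 threads

def slowdown(slow_times, fast_times):
    """Return per-stage and total slowdown ratios (slow / fast)."""
    ratios = {k: slow_times[k] / fast_times[k] for k in fast_times}
    ratios["total"] = sum(slow_times.values()) / sum(fast_times.values())
    return ratios

ratios = slowdown(slow, fast)
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x slower")
# stage1: 2.34x slower
# stage2: 1.29x slower
# total: 1.96x slower
```

Most of the difference is in stage 1; stage 2 is only about 1.3 times slower.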

axn 2022-06-27 15:56

It would be more helpful if you post the screen outputs from both systems.

BTW, according to Intel ARK, the E5-2680 v4 is a 14-core CPU, so you should be able to run 14 threads at the same time.

timbit 2022-06-27 16:09

[QUOTE=axn;608546]It would be more helpful if you post the screen outputs from both systems.

BTW, according to ark, E5-2680 v4 is a 14-core system, so you should be able to run 14 threads at the same time.[/QUOTE]

I am well aware the E5-2680 v4 is a 14-core system. Due to the slowness of any Mersenne work I assign to it, I only use 1 worker (8 threads).

Is 8 threads the max for any worker? I am setting 1 worker, 14 cores and the most I see for any worker is 8 threads. I would expect 14 unless that is some upper limit?

I don't know what you are expecting to see from the output logs. They look like the usual ECM logs, except each curve is taking a very long time to finish.

How does mprime select the fastest FFT implementation? Does manually starting a benchmark help?

axn 2022-06-27 16:21

The output will show the FFT selected, the worker configuration, affinities, etc. Let us understand the problem first before attempting a solution.

The ECM work you're mentioning (M1000000) is very small and does not multithread. You'll most likely get the best throughput by running 14 workers.
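
As a rough sketch of the throughput argument, using only the ~4500 sec per curve reported earlier in the thread: the break-even figure below is how slow a single-threaded curve could be before 14 one-core workers stop beating the current single worker. No single-threaded timing has been measured here, so none is assumed.

```python
SECONDS_PER_DAY = 86_400

def curves_per_day(workers, seconds_per_curve):
    """Total ECM curves per day if every worker takes the same time per curve."""
    return workers * SECONDS_PER_DAY / seconds_per_curve

# Current setup: one worker at the reported ~4500 sec per curve.
current = curves_per_day(1, 4500)
print(f"current: {current:.1f} curves/day")  # current: 19.2 curves/day

# 14 one-core workers match that rate when each curve takes 14 * 4500 sec.
break_even = 14 * 4500
print(f"break-even: {break_even} sec per single-threaded curve")  # 63000
```

Any single-threaded time per curve below that break-even means more curves per day from 14 workers.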

kriesel 2022-06-27 17:06

[QUOTE=timbit;608548]Is 8 threads the max for any worker?[/QUOTE]No. I've benchmarked up to 68 cores/worker on a Xeon Phi 7250.
The number of cores (threads) prime95 supports is 512 or 1024: [URL]https://mersenneforum.org/showpost.php?p=479009&postcount=202[/URL]

Yes, you can manually trigger a benchmark, and optionally specify which range of FFT sizes is benchmarked, which list or range of core counts per worker, whether HT is tried or not, etc. Start with just a few FFT sizes so each experiment runs quickly. See also [URL]https://www.mersenneforum.org/showpost.php?p=563304&postcount=11[/URL] and its attachments.

timbit 2022-06-28 05:14

OK, I've deleted the results.bench.txt and the gwnum.txt files.
I've manually run the throughput tests on the same size FFT.
Now it's running the ECM again. I'll let this go for a day or two and I'll see if the autobench runs again.
Regardless of the results in results.bench.txt or gwnum.txt, it never seems to select the FFT with the most throughput. Odd.

Also I still cannot get more than 8 threads on a worker.

kriesel 2022-06-28 05:48

As axn wrote, the number of useful cores per worker is a function of FFT size, which is a function of the exponent.
I don't run 1M ECM, but I do run DC and first-time primality test wavefront PRP, and up to 1G P-1: big exponents, big FFTs, higher core counts.
Your exponent is about 1M bits; at ~20 bits/word that gives roughly a 50K FFT size. As axn wrote, that only needs one core and is not multithreaded.
Run lots of workers, one core each. The downside is that this multiplies the demand for main memory.
How many GB of RAM do you have installed per system?
How much memory did you allow prime95 to use for daytime and nighttime P-1, P+1, and ECM stage 2, compared with the very low default?
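
The FFT size estimate above is simple arithmetic, sketched here. The ~20 bits/word figure is the rule of thumb from this post, not an exact value, and the function name is made up for illustration; the FFT length the program actually selects may differ.

```python
def estimate_fft_length(exponent_bits, bits_per_word=20.0):
    """Rough FFT length in words: exponent size divided by bits stored per word."""
    return exponent_bits / bits_per_word

length = estimate_fft_length(1_000_000)
print(f"~{length / 1000:.0f}K words")  # ~50K words
```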


All times are UTC.