My slow system (Intel Xeon E5-2680 v4, 14 cores, 28 threads) has 32 GB of DDR4-2400 ECC RDIMM RAM (4 x 8 GB) running in quad channel. In local.txt I have:

Memory=28672 during 7:30-23:30 else 28672

So essentially 28 GB of RAM is available to mprime.
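For anyone who wants to double-check that reading of the setting, here is a minimal Python sketch. It assumes the Memory values are in MB (1024 MB per GB), which is what makes 28672 come out as 28 GB; the function name parse_memory_line is made up for this illustration and is not part of mprime.

```python
import re

# Assumption: Memory values in local.txt are in MB, 1024 MB per GB.
# The pattern only covers the exact "during ... else ..." form quoted above.
MEMORY_RE = re.compile(
    r"^Memory=(?P<day>\d+) during (?P<start>\d{1,2}:\d{2})-(?P<end>\d{1,2}:\d{2})"
    r" else (?P<night>\d+)$"
)


def parse_memory_line(line):
    """Return (in_window_mb, (start, end), outside_window_mb)."""
    m = MEMORY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized Memory line: {line!r}")
    return int(m["day"]), (m["start"], m["end"]), int(m["night"])


day_mb, window, night_mb = parse_memory_line(
    "Memory=28672 during 7:30-23:30 else 28672"
)
# Convert both limits to GB; they are identical, so the time window has no effect.
print(day_mb / 1024, night_mb / 1024)
```

Since both numbers are the same, the limit is 28 GB around the clock.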

I deleted results.bench.txt and gwnum.txt, then invoked mprime -m and ran the throughput self-test with 1 worker, 4 cores, and a 48K FFT length. Yes, I have seen the replies saying "it doesn't multithread well with small FFTs" and "use many workers, 1 core", and I will lean towards that from now on. But for this benchmark I am using 1 worker with 4 cores.

I've attached snippets of results.bench.txt, and gwnum.txt.

Now according to the results.bench.txt, the fastest is:

FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (4 cores, 1 worker): 0.21 ms. Throughput: 4848.21 iter/sec.

Pass1=768, Pass2=64, clm=1. Right?
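To avoid misreading the file by eye, here is a rough Python sketch that picks the highest-throughput entry for a given FFT length, core count, and worker count. The regular expression is modelled only on the one benchmark line quoted above, so it is an assumption that the other lines in results.bench.txt follow the same layout; parse_bench and fastest are names invented for this sketch.

```python
import re

# Field layout copied from the single benchmark line quoted in this post;
# other mprime versions may format results.bench.txt differently.
BENCH_RE = re.compile(
    r"FFTlen=(?P<fftlen>\d+K), Type=(?P<type>\d+), Arch=(?P<arch>\d+), "
    r"Pass1=(?P<pass1>\d+), Pass2=(?P<pass2>\d+), clm=(?P<clm>\d+) "
    r"\((?P<cores>\d+) cores?, (?P<workers>\d+) workers?\): +"
    r"(?P<ms>[\d.]+) ms\. +Throughput: (?P<ips>[\d.]+) iter/sec"
)

INT_FIELDS = ("type", "arch", "pass1", "pass2", "clm", "cores", "workers")


def parse_bench(text):
    """Return one dict per benchmark line found in text."""
    rows = []
    for m in BENCH_RE.finditer(text):
        row = m.groupdict()
        for key in INT_FIELDS:
            row[key] = int(row[key])
        row["ms"] = float(row["ms"])
        row["ips"] = float(row["ips"])
        rows.append(row)
    return rows


def fastest(rows, fftlen, cores, workers):
    """Highest-throughput row for the given setup, or None if nothing matches."""
    candidates = [
        r for r in rows
        if r["fftlen"] == fftlen and r["cores"] == cores and r["workers"] == workers
    ]
    return max(candidates, key=lambda r: r["ips"]) if candidates else None


line = ("FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 "
        "(4 cores, 1 worker): 0.21 ms. Throughput: 4848.21 iter/sec.")
best = fastest(parse_bench(line), "48K", 4, 1)
print(best["pass1"], best["pass2"], best["clm"])
# With one worker, ms per iteration is 1000 / (iter/sec); rounds to 0.21.
print(round(1000 / best["ips"], 2))
```

Fed the whole results.bench.txt instead of one line, fastest() would return whichever 48K entry has the highest iter/sec, which is the entry I am quoting above.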

However, when I invoke mprime -d on my ECM 999181 I see:

[Work thread Jun 27 22:30] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2, 4 threads
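To make the comparison with the benchmark explicit, here is a small Python sketch that pulls the FFT implementation out of that log line and lists the fields that differ from the benchmark winner. The pattern is based only on the one log line above, and parse_selection and differing_fields are hypothetical helper names, not anything mprime provides.

```python
import re

# Pattern modelled on the single "Using ... FFT length" line quoted above.
LOG_RE = re.compile(
    r"Using (?P<arch>\S+) FFT length (?P<fftlen>\d+K), "
    r"Pass1=(?P<pass1>\d+), Pass2=(?P<pass2>\d+), clm=(?P<clm>\d+), "
    r"(?P<threads>\d+) threads?"
)


def parse_selection(log_line):
    """Extract the FFT implementation mprime reports, or None if no match."""
    m = LOG_RE.search(log_line)
    if m is None:
        return None
    return {
        "arch": m["arch"],
        "fftlen": m["fftlen"],
        "pass1": int(m["pass1"]),
        "pass2": int(m["pass2"]),
        "clm": int(m["clm"]),
        "threads": int(m["threads"]),
    }


def differing_fields(selected, benchmarked):
    """Names of the implementation fields on which the two choices disagree."""
    keys = ("fftlen", "pass1", "pass2", "clm")
    return [k for k in keys if selected[k] != benchmarked[k]]


log_line = ("[Work thread Jun 27 22:30] Using FMA3 FFT length 48K, "
            "Pass1=768, Pass2=64, clm=2, 4 threads")
# Benchmark winner as quoted earlier in this post.
benchmark_best = {"fftlen": "48K", "pass1": 768, "pass2": 64, "clm": 1}

print(differing_fields(parse_selection(log_line), benchmark_best))  # ['clm']
```

So the two agree on FFT length, Pass1, and Pass2, and differ only in clm.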

This isn't the fastest FFT selection given the results from the self-benchmark: the benchmark favours clm=1, but the worker is using clm=2. Is this normal? Also, I still cannot see any autobenchmark running during my work. Can anyone explain what triggers the autobench? I see other posts complaining about it and asking how to prevent it; I want to do the opposite and invoke it!

The reason I am asking about the autobench is that one of my other machines was also running slowly for a day or two on a new exponent; then the autobench kicked in, and it was much, much faster after that. Obviously a different FFT implementation was chosen, which made throughput higher.

In the attached log_snippet.txt, ECM curve 50 phase 1 is taking almost 3000 seconds. On my other machine, with DDR4-2133, phase 1 takes < 2000 seconds. Unfortunately that machine has been shut down for the summer, so I cannot access it again until the fall.

Also, when I run htop, no other processes are taking up CPU cycles (mprime is using only 4 of the 14 cores anyway).
