On a dual-processor-package system, 2 x E5-2697V2 (each of which are 12-core plus x2 hyperthreading, for a total of 2 x 12 x 2 =48 logical processors), within Ubuntu running atop WSL1 on Win 10 Pro x64, a series of self tests for a single fft length and varying cpu core counts were run in Mlucas v20.1.1 (Nov 6 tarball). Usage that way would be likely when attempting to complete one testing task as quickly as possible (minimum latency). Examples are OBD or F33 P-1, or confirming a new Mersenne prime discovery. It is very likely not the maximum-throughput case, that would constitute typical production running.
Fastest iteration time, ~400ms/iter at 192M (suitable for OBD P-1) was obtained at 20 cores, which is less than the total physical core count 24.
Iteration times obtained were observed to have limited reproducibility at 100 iterations. (10% or worse variability.) Reproducibility was
much better with 1000-iteration runs.
Best reproducibility was apparently by running nothing else, no interactive use, and not even top, although I left a gpuowl instance running uninterrupted.
The thread efficiency = ms/iter * threadcount / ms/iter for 1 thread varied widely, down to 20.6% at 48 threads. At the fastest iteration time, 20 threads, it was 65.8%. Power-of-two thread counts were in most cases local maxima.
The tests were performed by writing and launching a simple sequential shell script, specifying Mlucas command line and output redirection, followed by rename of the mlucas.cfg before the next thread count run.
Top of this reference thread
https://www.mersenneforum.org/showthread.php?t=23427
Top of reference tree:
https://www.mersenneforum.org/showpo...22&postcount=1