2021-11-17, 12:37   #20
kriesel
Mar 2017
US midwest
Optimizing core count for fastest iteration time of a single task

Round one: Gigadigit timings
On a dual-processor-package system with two E5-2697 v2 Xeons (each 12-core with 2-way hyperthreading, for a total of 2 x 12 x 2 = 48 logical processors) and 128 GiB of ECC RAM (prime95 reports L1 cache 24x32KB, L2 24x256KB, L3 2x30MB), a series of self-tests for a single fft length and varying cpu core counts was run in Mlucas v20.1.1 (2021 Nov 6 tarball), within Ubuntu running atop WSL1 on Win 10 Pro x64. Usage that way would be likely when attempting to complete one testing task as quickly as possible (minimum latency); examples are OBD or F33 P-1, or confirming a new Mersenne prime discovery. It is very likely not the maximum-throughput case, which is what typical production running constitutes.

The fastest iteration time, ~400 ms/iter at 192M fft length (suitable for OBD P-1), was obtained at 20 cores, which is less than the total physical core count of 24.
Iteration times showed limited reproducibility at 100 iterations (10% or worse variability). Reproducibility was much better with 1000-iteration runs.
Best reproducibility was apparently obtained by running nothing else: no interactive use, not even top, although a gpuowl instance was left running uninterrupted throughout.

Thread efficiency, defined as (ms/iter for 1 thread) / (thread count * ms/iter at that thread count), varied widely, falling to 20.6% at 48 threads. At the fastest iteration time, 20 threads, it was 65.8%. Power-of-two thread counts were in most cases local maxima.
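The efficiency figures can be sanity-checked with a one-liner. The single-thread time used below (~5264 ms/iter) is an assumption back-derived from the quoted 65.8% at 20 threads x ~400 ms/iter; it was not directly reported.

```shell
# Thread efficiency = T1 / (N * TN): single-thread ms/iter divided by
# (thread count times ms/iter at that thread count).
# t1=5264 is back-derived from the post's figures, not measured directly.
awk -v t1=5264 -v n=20 -v tn=400 'BEGIN { printf "%.1f%%\n", 100 * t1 / (n * tn) }'
# prints 65.8%
```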

The tests were performed by writing and launching a simple sequential shell script specifying the Mlucas command line and output redirection, followed by a rename of mlucas.cfg before the next thread-count run. The results are tabulated in the first attachment.
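A dry-run sketch of such a script follows; it prints each command rather than executing it (remove the echo to actually run). The flags follow Mlucas v20 conventions (-fftlen in Kdoubles, so the 192M fft is 196608K; -iters; -cpu low:high), and the core counts shown are illustrative.

```shell
#!/bin/sh
# Dry-run sketch of the sequential timing script: one self-test per
# thread count, renaming mlucas.cfg between runs so each run's timing
# line is preserved. Echoed rather than executed here.
for n in 4 8 12 16 20 24 32 48; do
    spec="0:$((n - 1))"        # logical processors 0 through n-1
    echo "./Mlucas -fftlen 196608 -iters 1000 -cpu $spec > selftest_${n}t.log 2>&1"
    echo "mv mlucas.cfg mlucas.cfg.${n}t"
done
```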

Round two: gathering 1G7 PRP reference interim residues
On the same dual 12-core, 2-way-hyperthreaded Xeon E5-2697 v2 system: Windows 10 Pro, WSL1, Canonical Ubuntu 18.04, Mlucas v20.1.1 (2022-02-09 build).
M1,000,000,007 PRP performance versus core count specification
The value x below is the highest logical processor number in the set of logical processors specified for multithreaded runs. Numbering starts at zero; 0,1 are the two logical processors (hyperthreads) of one physical core. The maximum x on a single 12-core, 2-hyperthread Xeon is 23.

The FFT radices used are 960, 32, 32, 32, so core counts that are factors of 960 or of 32 are likely to give more favorable timings.
Power-of-two core counts are likely to be faster than non-powers-of-two.
Core counts that fully occupy a CPU package, with or without HT, are likely to be faster.
There is likely to be a dip in performance due to NUMA when using both cpu packages / ram banks.
These considerations conflict somewhat on a CPU with a non-power-of-two core count, or on a dual-cpu NUMA design.
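The low:high:stride cpu spec format can be illustrated with a small helper. The function below is hypothetical (for illustration only, not part of Mlucas); it expands a spec into the logical-processor list the run would be pinned to.

```shell
# Expand a -cpu spec "low:high:stride" into the comma-separated list of
# logical processors it selects. Stride defaults to 1 if omitted.
# Hypothetical helper, for illustration only.
expand_cpuspec() {
    IFS=: read -r lo hi step <<EOF
$1
EOF
    seq -s, "$lo" "${step:-1}" "$hi"
}

expand_cpuspec 0:7:2   # prints 0,2,4,6  (4 physical cores, no hyperthread pairs)
expand_cpuspec 0:7:1   # prints 0,1,2,3,4,5,6,7  (4 cores, both hyperthreads each)
```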

Observed timings below are in msec/iteration, using however many logical processors were specified.
No hyperthreading specified: 0:x:2
(nominally single CPU package)
cpuspec lcorecount	msec/iter	lcore*msec/iter or notes
0:7:2 		4	460.9		1843.6
0:11:2 		6	401.5		2409.
0:15:2		8	221.5		1776.	min lcore*msec/iter observed
0:23:2 		12	208.6		2503.2

(nominally both cpu packages)
0:31:2 		16	123.		1968.
0:47:2		24	114.		2736.

Hyperthreading specified: 0:x:1
(nominally single CPU package)
cpuspec	lcorecount	msec/iter	lcore*msec/iter or notes
0:3:1		4	452.4		1809.6
0:5:1		6	381.2		2287.2
0:7:1 		8	222.3		1778.4
0:11:1 		12	207.2		2486.4
0:15:1		16	130.7		2091.2	
0:23:1 		24	114.1		2738.4

(necessarily both cpu packages)
cpuspec	lcorecount	msec/iter	lcore*msec/iter or notes
0:29:1		30	112.		3360.
0:31:1 		32	105.3		3369.6 min time/iter; power of two and a good fit to the fft's radices
0:35:1		36	107.4 		3866.4
0:39:1		40	107.3		4292.
0:47:1		48	108.4		5174.4 more logical cores made it slower
The above evaluates single-task performance. Maximum throughput, at longer latency per task, may come from multiple tasks: for example, 0:23:1 on cpu0 for one task and 24:47:1 on cpu1 for a different exponent, running in parallel in separate instances and separate folders; or 0:15:1, 16:31:1, 32:47:1. Mlucas is multithreaded but single-worker, so running multiple tasks requires multiple instances.
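A sketch of the two-instance layout, one per CPU package, each in its own folder with its own worktodo file. The launch commands are printed rather than executed here; folder names and paths are placeholders.

```shell
#!/bin/sh
# Max-throughput layout sketch: two Mlucas instances, one pinned to each
# CPU package, in separate working directories. Commands are echoed for
# review; remove the echo (and quotes) to launch.
mkdir -p cpu0 cpu1
echo "(cd cpu0 && nohup ../Mlucas -cpu 0:23:1  >> run.log 2>&1 &)"
echo "(cd cpu1 && nohup ../Mlucas -cpu 24:47:1 >> run.log 2>&1 &)"
```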
Use the above figures and cpu specs with caution. In WSL Mlucas runs, when monitoring logical-core-by-logical-core activity in Windows Task Manager, it is common to see activity for low specified core counts spread across more cores than specified. For example, 0:7:2 should occupy 4 specific logical cores, but showed significant use spread over 9 logical cores. This may not reflect what occurs on native Linux, or even on the next WSL application run. IIRC the core alignment is worse on WSL2 than WSL1.
Clock rates were not held constant; an increasing number of active cores probably reduces the usable clock rates.

Minimum iteration time achieved corresponds to ~3.34 years for a single gigabit primality test.
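The ~3.34-year figure can be checked directly: a PRP test of M1,000,000,007 needs about 10^9 squarings, at the best observed 105.3 ms/iter.

```shell
# Years for one gigabit PRP test: iterations * seconds/iter / seconds/year.
awk 'BEGIN { printf "%.2f years\n", 1000000007 * 0.1053 / (86400 * 365.25) }'
# prints 3.34 years
```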

Attached Files:
optimizing 192M thread count.pdf (27.8 KB)
parallelism at 1G PRP.pdf (31.2 KB)

Last fiddled with by kriesel on 2023-01-12 at 18:39