View Single Post
Old 2018-12-28, 18:13   #4
kriesel's Avatar
Mar 2017
US midwest

22×19×97 Posts
Default Effect of number of workers

Similar to the number of threads choices in gpu applications, on multicore systems, the effect of number of cores per worker in prime95 is unpredictable, and so there is provision for benchmarking.

Number of workers could be chosen to optimize performance. But which measure of performance? Aggregate throughput maximized, latency of one assignment minimized, number of joules used for a 100GhzD primality test, aggregate throughput given a constraint of latency low enough to avoid assignment expiration, something else? For which single fft length, or for the current and next several?

For minimum latency, as for confirming a newly discovered Mersenne prime, Madpoo has run experiments on a dual-14-core system. He reported the fastest primality test time around 20 cores out of the 28 available; any more than 6 on the lesser use package, and the increased package to package data transfers slow the progress.

For picking number of cores/worker per cpu type, that's a reasonable compromise for maximum aggregate throughput, so I can set it and forget it for months or years on each system, I ran the built in prime95 benchmarking over wide fft ranges for a variety of cores/worker, on a variety of cpu types. Then the timings were tabulated in spreadsheets and graphed.

If going after the maximum performance per fft length, consider that some work types restart from the beginning when the number of workers is changed. Read the readme.txt and other files, back up before changing number of workers, plan ahead, etc.

Some patterns emerge. Worker counts that would straddle the divide between processor packages if divided evenly typically do not provide as much throughput. A 12-core 2-package system with 3 workers with equal cores/worker would have at least one worker with cores in each package (4 2 + 2 4). George indicates recent versions of prime95 prevent the straddle by assigning unequal numbers of cores to the workers.
For larger core counts there can be quite a few choices to evaluate. What's fastest for one fft length may not be for others. A compromise that averages a small percentage penalty is usually available. Plotting the various combinations with trend lines seems a useful visualization method for selecting one configuration to run with for a long time.

Top of reference tree:

Last fiddled with by kriesel on 2021-03-17 at 17:44
kriesel is online now