#1
Jul 2020
13 Posts
Greetings.
Since yesterday I've been trying my hand at GIMPS using spare cycles of my new 16-core workstation. I'm running the official build of Prime95 v29.8b7 on Linux. When configuring Prime95 for the first time, I was presented with a choice of how many workers to create and how many cores to allocate to each (I suppose "work thread" is a bit of a misnomer?). Originally I chose to run 16 single-threaded workers, which subsequently received 16 different double-check assignments and started computing away at 42-45 ms/iter each. Later, I decided to experiment a bit and reconfigured Prime95 to run a single worker using all 16 available cores. This yielded a computation speed of 1.5 ms/iter, i.e. a more-than-16x per-iteration speedup — meaning that the 16 original assignments, run one after another on all 16 cores, would in total finish quicker than if I ran them in parallel. Hence three questions:
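Sanity-checking the arithmetic above (a quick Python sketch using the timings quoted in the post; the 43.5 ms figure is just the midpoint of the 42-45 ms range):

```python
# Reported timings: 16 single-core workers at ~42-45 ms/iter each,
# vs. one 16-core worker at 1.5 ms/iter.
per_worker_ms = 43.5     # midpoint of the quoted 42-45 ms/iter range
single_worker_ms = 1.5

# Aggregate throughput (iterations/second across all assignments):
parallel_throughput = 16 * (1000 / per_worker_ms)  # 16 workers side by side
serial_throughput = 1000 / single_worker_ms        # one 16-core worker

# Per-iteration speedup of the single 16-core worker:
speedup = per_worker_ms / single_worker_ms         # ~29x, i.e. more than 16x

# So finishing the 16 assignments one after another on all 16 cores
# beats running them all in parallel:
print(f"parallel: {parallel_throughput:.0f} iter/s, "
      f"serial: {serial_throughput:.0f} iter/s, speedup: {speedup:.1f}x")
```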
Last fiddled with by intelfx on 2020-07-18 at 09:17 |
#2
Romulan Interpreter
Jun 2011
Thailand
9610 Posts
Please run a benchmark from the menu(s). Linux users may be of more help here — I don't know how P95 looks on Linux, and every system is different. How many memory channels? How much cache? Air or water cooled? Etc. (rhetorical questions, no need to answer).

The common wisdom is that new processors will give you better output if you run only a few (1-2) workers, each with more than one thread, such that the sum of all threads is not higher than the number of your physical cores. It didn't use to be like that in the past, but with CPUs getting lots of cores, the limitation became memory bandwidth: 16 workers would need to exchange data for 16 tests. For example, on my system (10 cores, 20 threads, lots of cache memory) I get the best output with 2 workers, each running 5 threads. Hyper-threading is not useful for most work types; it only produces more heat, not more output.
Last fiddled with by LaurV on 2020-07-18 at 09:26 |
#3
Jul 2020
13 Posts
Quote:
I see. Memory throughput being the bottleneck sounds quite plausible. I'll run the benchmark, thanks.
#4
"Composite as Heck"
Oct 2017
32D₁₆ Posts
If you have a Ryzen 3950X, there is also the L3 cache split to consider: each CCX of 4 cores can directly access only 16 MiB of L3. 4 workers might be optimal for that processor (in theory, better cache utilisation means less memory-bandwidth consumption for the same work), but it could depend on FFT size. tl;dr: always benchmark.
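A rough illustration of that cache argument (a hypothetical Python sketch; the 8-bytes-per-FFT-element working-set estimate is an assumption for illustration, not a figure from Prime95's internals):

```python
# Rough working set of a double-precision FFT: fft_length elements * 8 bytes.
# On a Ryzen 3950X each 4-core CCX has its own 16 MiB slice of L3.
CCX_L3_MIB = 16

def working_set_mib(fft_len_k: int) -> float:
    """Estimated FFT data size in MiB (assumes 8 bytes per element)."""
    return fft_len_k * 1024 * 8 / (1024 * 1024)

for fft_k in (2048, 2240, 3072):
    ws = working_set_mib(fft_k)
    verdict = "fits" if ws <= CCX_L3_MIB else "spills out of"
    print(f"{fft_k}K FFT: ~{ws:.1f} MiB, {verdict} one CCX's 16 MiB L3")
```

On this estimate a 2048K FFT (~16 MiB) just fits inside a single CCX's L3, while 2240K and larger spill into memory — consistent with 4 workers (one per CCX) winning only at 2048K in the benchmark results below.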
#5
Jul 2020
13 Posts
Quote:
In fact, I had simply overlooked the benchmark option. Judging by the benchmark results, it would appear that for 2048K FFTs the absolute best throughput (~1860 iter/sec) is achieved with 4 workers:

Code:
FFTlen=2048K, Type=3, Arch=4, Pass1=1024, Pass2=2048, clm=2 (16 cores, 4 workers): 2.18, 2.18, 2.17, 2.17 ms. Throughput: 1840.17 iter/sec.
FFTlen=2048K, Type=3, Arch=4, Pass1=2048, Pass2=1024, clm=1 (16 cores, 4 workers): 2.15, 2.16, 2.14, 2.14 ms. Throughput: 1863.96 iter/sec.

However, at slightly larger FFT lengths, 2 workers already pull ahead:

Code:
Timings for 2240K FFT length (16 cores, 2 workers): 1.43, 1.43 ms. Throughput: 1399.54 iter/sec.
Timings for 2240K FFT length (16 cores, 4 workers): 3.39, 3.41, 3.22, 3.21 ms. Throughput: 1209.72 iter/sec.
Timings for 2304K FFT length (16 cores, 2 workers): 1.43, 1.43 ms. Throughput: 1397.51 iter/sec.
Timings for 2304K FFT length (16 cores, 4 workers): 4.04, 4.00, 3.66, 3.65 ms. Throughput: 1044.27 iter/sec.
Timings for 2400K FFT length (16 cores, 2 workers): 1.53, 1.54 ms. Throughput: 1300.87 iter/sec.
Timings for 2400K FFT length (16 cores, 4 workers): 4.52, 4.46, 4.66, 4.74 ms. Throughput: 870.69 iter/sec.

With significantly larger FFTs, 4-worker performance falls drastically below that of 2 workers:

Code:
Timings for 3072K FFT length (16 cores, 2 workers): 1.88, 1.91 ms. Throughput: 1056.56 iter/sec.
Timings for 3072K FFT length (16 cores, 4 workers): 7.88, 7.97, 7.80, 7.79 ms. Throughput: 508.94 iter/sec.

Incidentally, do you happen to know how exactly I can use the extended benchmark results (i.e. the Type, Arch, Pass1, Pass2, clm values)? Can I specify them in a config file somewhere to override the built-in values for my CPU?

Last fiddled with by intelfx on 2020-07-18 at 12:42
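For reference, the reported Throughput figures are simply the sum of each worker's iteration rate, 1000 / (ms per iter). A quick Python sketch checked against the numbers above:

```python
def throughput(per_worker_ms):
    """Aggregate iter/sec: each worker contributes 1000 / (its ms/iter)."""
    return sum(1000.0 / ms for ms in per_worker_ms)

# 2048K FFT, 16 cores, 4 workers (clm=2 line above):
print(round(throughput([2.18, 2.18, 2.17, 2.17]), 2))
# -> ~1839; the forum reports 1840.17 because the printed timings are rounded.

# 3072K FFT, 16 cores, 2 workers:
print(round(throughput([1.88, 1.91]), 2))
# -> ~1055, vs. the reported 1056.56, again within rounding error.
```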
#6
P90 years forever!
Aug 2002
Yeehaw, FL
2×5³×71 Posts
Well, it will happen automatically over time. Every night, prime95 does a quick benchmark of all the different FFT implementations for the work you'll be doing in the near future and writes the results to gwnum.txt. Prime95 then uses that file to pick the fastest FFT implementation for your machine. Some of that info is buried in undoc.txt, but it is pretty terse.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Worker #5 and Worker #7 not running (Error ILLEGAL SUMOUT) | skrupian08 | Information & Answers | 9 | 2016-08-23 16:35 |
| Multi-threaded factoring | bchaffin | Aliquot Sequences | 8 | 2010-10-24 13:38 |
| exclude single core from quad core cpu for gimps | jippie | Information & Answers | 7 | 2009-12-14 22:04 |
| NEW MERSENNES AND MULTI-THREADED SOFTWARE | lpmurray | Software | 13 | 2005-12-21 08:24 |
| Prime95 a multi-threaded application? | Unregistered | Software | 10 | 2004-06-11 05:31 |