Old 2016-08-19, 14:38   #1
sonjohan
May 2003
Belgium
426₈ Posts

4 Cores, only 1 worker?

I want to run Prime95 on an Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
CPU speed: 3492.01 MHz, 4 hyperthreaded cores

According to Prime95, I have 4 hyperthreaded cores.
When I try to assign 1 worker to each core (with worker windows), I get a warning that I've allocated more threads than there are CPUs available and that it will greatly reduce performance.

How can I run several DCs in parallel? Right now I have only 1 DC running on all cores at once, where the 3 additional cores are just 'helper threads'.
Is this really the optimal way to run Prime95?


I'm doing DC because the PC hasn't run any LL checks yet.

Last fiddled with by sonjohan on 2016-08-19 at 14:40

Old 2016-08-19, 15:04   #2
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
3×11×89 Posts

Quote:
Originally Posted by sonjohan View Post
I want to run Prime95 on an Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
CPU speed: 3492.01 MHz, 4 hyperthreaded cores

According to Prime95, I have 4 hyperthreaded cores.
When I try to assign 1 worker to each core (with worker windows), I get a warning that I've allocated more threads than there are CPUs available and that it will greatly reduce performance.
You should run at most one thread (or worker) per physical core, i.e. per pair of hyperthreaded logical CPUs. Prime95 is written efficiently enough to fully saturate a physical core with a single thread, and it is almost always memory bound on new machines anyway.

Quote:
How can I run several DCs in parallel? Right now I have only 1 DC running on all cores at once, where the 3 additional cores are just 'helper threads'.
Is this really the optimal way to run Prime95?
Yes, on modern machines. Running a single test uses the CPU's cache more efficiently.

If you had a 6 or 8 core CPU, then running 2 workers would probably be better.

Old 2016-08-19, 15:41   #3
GP2
Sep 2003
5×11×47 Posts

Quote:
Originally Posted by sonjohan View Post
How can I run several DCs in parallel? Right now I have only 1 DC running on all cores at once, where the 3 additional cores are just 'helper threads'.
Is this really the optimal way to run Prime95?
Yes, it's optimal. Only one test runs at a time, but the helper threads really help it run a lot faster.

If you really think you want to run four separate tests simultaneously (one per core), you can put the following lines in your local.txt file:

Code:
ThreadsPerTest=1
WorkerThreads=4
But... this has several drawbacks. First, each test will now run at least four times slower because there are no helper threads anymore, so you don't gain anything by running four tests simultaneously. Second, each test will in fact run even slower than that, maybe an extra 20% or 30% slower, because now the four separate threads are fighting with one another over the CPU cache, instead of one main thread and three helper threads working harmoniously.

I only use the above setting (ThreadsPerTest=1, WorkerThreads=4) on a four-core machine that does only ECM tests, since ECM tests don't make use of helper threads.

Old 2016-08-20, 03:27   #4
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand
5³×79 Posts

Without taking anything away from the answers above, you should first run a benchmark: from the menu choose "Options"/"Benchmark" and let it finish (your current test will be temporarily interrupted). Your CPU has 4 physical cores with 8 threads (2 per core), so the benchmark will run with 1, 2, 4 and 8 threads at different FFT sizes. It will take a while (minutes, not hours); let it finish. Look at the times and do some arithmetic to see which configuration (number of threads/workers) is best for you. Many parameters determine a system's performance, from the CPU speed, the number of memory channels and their speed (bandwidth), and so on, up to the power supply and heat dissipation (thermal throttling).

You may be better off running a single worker across all 4 cores, or better off running 4 workers, each on its own physical core; it also depends on the exponent you are working on (i.e. the FFT size). Older assignments used lower exponents with smaller FFTs, and there you were always better off running 4 workers (threads), each on its own core. For example, using all 4 cores to run a single LL test on an i7-2600K would take 4 days, so you could finish 4 exponents in 16 days. Using one core per test, each test would take 15 days, but since you could run 4 tests in parallel, you would finish four exponents one day earlier. On the other hand, running 4 tests in parallel uses more memory bandwidth, and you need a good motherboard and good memory, otherwise your computer may become less responsive for ordinary tasks like Acrobat Reader, even though P95 runs at the lowest priority.

That was ALWAYS the case for me, for years, with ALL the computers I ran P95 on: "one worker per core" was always faster than "one worker using all cores". But sometimes we choose the second option because we want to see a result sooner (in 4 days instead of 15, even if we lose one day every 16 days!), or because the computer stays more responsive (low-end motherboards), or because it produces less heat (laptops).

The situation has changed recently as we progressed to higher exponents, which need larger FFTs and more memory bandwidth; as the previous poster said, most systems today are bandwidth limited. So you may be better off doing a single LL test on all 4 cores, or better still, two LL tests, each using 2 cores. But for that, you have to BENCHMARK. Also, DC timings are not representative of LL timings: if you do DC on a smaller exponent (40M, 50M), you will use a smaller FFT and less memory transfer, and you will almost always be better off running 4 tests/workers in parallel, each on a single core.

To change the number of workers, open the menu "Test"/"Worker windows", select the "number of worker windows", and for each worker (or with "all workers" selected in the "worker number" box) choose how many "CPUs to use (multithreading)". This is the same as editing the local.txt configuration file as explained by the poster above, but done from the menus (a sketch of the equivalent file entries follows below). You can, for example, use 3 workers and give 1 core to worker 1, 1 core to worker 2 and 2 cores to worker 3: a total of 3 workers on 4 cores. Or you can run 2 workers and give 3 cores to the first and 1 core to the second, because you want to see the result of the first test sooner. And so on. Under no circumstances should the total number of cores used (adding up all workers) be higher than 8, and preferably it should not be higher than 4: hyper-threading is detrimental for P95, because the program is written well enough to take full advantage of a physical core, so running two threads per core in HT mode only wastes time switching thread contexts.
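For reference, the equivalent entries in the local.txt configuration file, using the same keys GP2 quoted above (a sketch only; exact option names can vary between Prime95 versions), would be, for the "2 workers with 2 cores each" layout:

Code:
WorkerThreads=2
ThreadsPerTest=2

With those values Prime95 starts two workers and gives each of them two threads (one main thread plus one helper).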

[edit: I forgot: when you increase the number of workers, a new section is added to your worktodo.txt for each new worker. When you later decrease the number of workers, those sections become orphaned in the worktodo file and a warning is printed on screen saying so. The warning is harmless and you can ignore it, but it is better to stop P95 and edit the worktodo file manually, moving the work from those sections to the remaining workers and deleting the orphaned, now-empty sections. You will understand immediately what to do, better than I can explain, once you open worktodo.txt in your favorite text editor (Notepad, etc.); an illustrative sketch is below. If you don't do this, no harm: P95 will still run perfectly with the remaining workers, but any work assigned to the higher-numbered workers will not be done. Unless you increase the number of workers again to cover the orphaned sections (or edit the file as explained), that work will just sit there.]
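As an illustration (a hand-written sketch, not copied from a real file; the DoubleCheck lines are placeholders, not real assignments), a worktodo.txt that was built for 3 workers and then dropped back to 2 might look like this, with the [Worker #3] section orphaned:

Code:
[Worker #1]
DoubleCheck=<assignment id>,<exponent>,<trial-factored bits>,<P-1 done flag>
[Worker #2]
DoubleCheck=<assignment id>,<exponent>,<trial-factored bits>,<P-1 done flag>
[Worker #3]
DoubleCheck=<assignment id>,<exponent>,<trial-factored bits>,<P-1 done flag>

The manual fix described above amounts to moving the line from [Worker #3] into one of the remaining sections and deleting the now-empty [Worker #3] heading.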

So, benchmark, then pick your favourite configuration. My "feeling" is that for your system you will be better off with two workers, each with two threads. But I may be wrong.

If you have trouble working it out, post your benchmark results here and someone will help you choose. You can copy/paste them from the screen or from the results.txt file.

Last fiddled with by LaurV on 2016-08-20 at 03:47

Old 2016-08-20, 17:54   #5
sonjohan
May 2003
Belgium
2×139 Posts

Same problem, other system.
I do notice, however, that since I changed the configuration to 1 worker only, the due date for the last exponent (5 exponents were reserved) moved from 2016-09-16 to 2016-09-15, so theoretically I gain 1 day by using helper threads?

Code:
Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
CPU speed: 3378.73 MHz, 4 cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 6 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 28.7, RdtscTiming=1
Best time for 1024K FFT length: 3.474 ms., avg: 4.008 ms.
Best time for 1280K FFT length: 4.445 ms., avg: 4.600 ms.
Best time for 1536K FFT length: 5.400 ms., avg: 5.791 ms.
Best time for 1792K FFT length: 6.505 ms., avg: 6.901 ms.
Best time for 2048K FFT length: 7.363 ms., avg: 7.852 ms.
Best time for 2560K FFT length: 9.558 ms., avg: 10.072 ms.
Best time for 3072K FFT length: 11.325 ms., avg: 11.981 ms.
Best time for 3584K FFT length: 14.158 ms., avg: 17.844 ms.
Best time for 4096K FFT length: 15.794 ms., avg: 21.011 ms.
Best time for 5120K FFT length: 19.480 ms., avg: 22.155 ms.
Best time for 6144K FFT length: 23.899 ms., avg: 27.519 ms.
Best time for 7168K FFT length: 28.615 ms., avg: 32.117 ms.
Best time for 8192K FFT length: 33.347 ms., avg: 34.771 ms.
Timing FFTs using 2 threads.
Best time for 1024K FFT length: 1.851 ms., avg: 2.027 ms.
Best time for 1280K FFT length: 2.367 ms., avg: 2.549 ms.
Best time for 1536K FFT length: 2.870 ms., avg: 2.968 ms.
Best time for 1792K FFT length: 3.488 ms., avg: 3.722 ms.
Best time for 2048K FFT length: 3.878 ms., avg: 4.155 ms.
Best time for 2560K FFT length: 4.958 ms., avg: 6.061 ms.
Best time for 3072K FFT length: 5.885 ms., avg: 6.294 ms.
Best time for 3584K FFT length: 7.244 ms., avg: 8.778 ms.
Best time for 4096K FFT length: 8.061 ms., avg: 8.681 ms.
Best time for 5120K FFT length: 10.201 ms., avg: 11.283 ms.
Best time for 6144K FFT length: 12.341 ms., avg: 12.931 ms.
Best time for 7168K FFT length: 14.751 ms., avg: 15.300 ms.
Best time for 8192K FFT length: 17.556 ms., avg: 18.322 ms.
Timing FFTs using 3 threads.
Best time for 1024K FFT length: 1.298 ms., avg: 1.434 ms.
Best time for 1280K FFT length: 1.713 ms., avg: 1.789 ms.
Best time for 1536K FFT length: 2.060 ms., avg: 2.158 ms.
Best time for 1792K FFT length: 2.483 ms., avg: 2.622 ms.
Best time for 2048K FFT length: 2.864 ms., avg: 2.967 ms.
Best time for 2560K FFT length: 3.642 ms., avg: 3.842 ms.
Best time for 3072K FFT length: 4.375 ms., avg: 4.518 ms.
Best time for 3584K FFT length: 5.237 ms., avg: 5.525 ms.
Best time for 4096K FFT length: 6.147 ms., avg: 6.352 ms.
Best time for 5120K FFT length: 7.691 ms., avg: 8.795 ms.
Best time for 6144K FFT length: 9.324 ms., avg: 9.659 ms.
Best time for 7168K FFT length: 10.928 ms., avg: 11.526 ms.
Best time for 8192K FFT length: 12.709 ms., avg: 13.256 ms.
Timing FFTs using 4 threads.
Best time for 1024K FFT length: 1.094 ms., avg: 1.217 ms.
Best time for 1280K FFT length: 1.487 ms., avg: 1.675 ms.
Best time for 1536K FFT length: 1.839 ms., avg: 2.017 ms.
Best time for 1792K FFT length: 2.237 ms., avg: 2.465 ms.
Best time for 2048K FFT length: 2.700 ms., avg: 2.923 ms.
Best time for 2560K FFT length: 3.454 ms., avg: 3.706 ms.
Best time for 3072K FFT length: 4.051 ms., avg: 4.262 ms.
Best time for 3584K FFT length: 4.900 ms., avg: 5.019 ms.
Best time for 4096K FFT length: 5.689 ms., avg: 6.844 ms.
Best time for 5120K FFT length: 7.158 ms., avg: 7.423 ms.
Best time for 6144K FFT length: 8.589 ms., avg: 10.187 ms.
Best time for 7168K FFT length: 10.101 ms., avg: 10.973 ms.
Best time for 8192K FFT length: 11.669 ms., avg: 12.568 ms.

Timings for 1024K FFT length (1 cpu, 1 worker):  3.52 ms.  Throughput: 284.04 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers):  3.79,  3.70 ms.  Throughput: 534.44 iter/sec.
Timings for 1024K FFT length (3 cpus, 3 workers):  4.52,  4.44,  4.36 ms.  Throughput: 675.80 iter/sec.
Timings for 1024K FFT length (4 cpus, 4 workers):  5.83,  5.67,  6.07,  5.43 ms.  Throughput: 696.83 iter/sec.
Timings for 1280K FFT length (1 cpu, 1 worker):  4.66 ms.  Throughput: 214.64 iter/sec.
Timings for 1280K FFT length (2 cpus, 2 workers):  4.99,  4.81 ms.  Throughput: 408.07 iter/sec.
Timings for 1280K FFT length (3 cpus, 3 workers):  5.67,  5.74,  5.50 ms.  Throughput: 532.22 iter/sec.
Timings for 1280K FFT length (4 cpus, 4 workers):  7.34,  6.94,  6.94,  6.75 ms.  Throughput: 572.68 iter/sec.
Timings for 1536K FFT length (1 cpu, 1 worker):  5.51 ms.  Throughput: 181.36 iter/sec.
Timings for 1536K FFT length (2 cpus, 2 workers):  6.03,  5.87 ms.  Throughput: 336.06 iter/sec.
Timings for 1536K FFT length (3 cpus, 3 workers):  6.81,  6.65,  6.64 ms.  Throughput: 447.83 iter/sec.
Timings for 1536K FFT length (4 cpus, 4 workers):  8.86,  8.52,  8.43,  8.08 ms.  Throughput: 472.62 iter/sec.
Timings for 1792K FFT length (1 cpu, 1 worker):  6.60 ms.  Throughput: 151.46 iter/sec.
Timings for 1792K FFT length (2 cpus, 2 workers):  7.30,  7.06 ms.  Throughput: 278.68 iter/sec.
Timings for 1792K FFT length (3 cpus, 3 workers):  8.35,  8.07,  7.87 ms.  Throughput: 370.74 iter/sec.
Timings for 1792K FFT length (4 cpus, 4 workers): 10.71,  9.88,  9.87,  9.51 ms.  Throughput: 401.01 iter/sec.
Timings for 2048K FFT length (1 cpu, 1 worker):  7.49 ms.  Throughput: 133.50 iter/sec.
[Sat Aug 20 19:45:25 2016]
Timings for 2048K FFT length (2 cpus, 2 workers):  8.16,  7.94 ms.  Throughput: 248.50 iter/sec.
Timings for 2048K FFT length (3 cpus, 3 workers):  9.66,  9.33,  9.33 ms.  Throughput: 317.95 iter/sec.
Timings for 2048K FFT length (4 cpus, 4 workers): 12.33, 12.41, 11.27, 11.39 ms.  Throughput: 338.32 iter/sec.
Timings for 2560K FFT length (1 cpu, 1 worker):  9.66 ms.  Throughput: 103.51 iter/sec.
Timings for 2560K FFT length (2 cpus, 2 workers): 10.48, 10.29 ms.  Throughput: 192.58 iter/sec.
Timings for 2560K FFT length (3 cpus, 3 workers): 12.07, 11.65, 11.51 ms.  Throughput: 255.58 iter/sec.
Timings for 2560K FFT length (4 cpus, 4 workers): 15.14, 14.52, 14.39, 14.00 ms.  Throughput: 275.82 iter/sec.
Timings for 3072K FFT length (1 cpu, 1 worker): 11.39 ms.  Throughput: 87.76 iter/sec.
Timings for 3072K FFT length (2 cpus, 2 workers): 12.55, 12.22 ms.  Throughput: 161.50 iter/sec.
Timings for 3072K FFT length (3 cpus, 3 workers): 14.38, 13.75, 13.46 ms.  Throughput: 216.60 iter/sec.
Timings for 3072K FFT length (4 cpus, 4 workers): 18.02, 17.19, 17.14, 16.85 ms.  Throughput: 231.35 iter/sec.
Timings for 3584K FFT length (1 cpu, 1 worker): 14.16 ms.  Throughput: 70.64 iter/sec.
Timings for 3584K FFT length (2 cpus, 2 workers): 15.35, 14.93 ms.  Throughput: 132.13 iter/sec.
Timings for 3584K FFT length (3 cpus, 3 workers): 16.94, 17.37, 16.75 ms.  Throughput: 176.32 iter/sec.
Timings for 3584K FFT length (4 cpus, 4 workers): 20.82, 21.11, 21.12, 19.19 ms.  Throughput: 194.86 iter/sec.
Timings for 4096K FFT length (1 cpu, 1 worker): 15.73 ms.  Throughput: 63.59 iter/sec.
Timings for 4096K FFT length (2 cpus, 2 workers): 17.29, 16.91 ms.  Throughput: 116.98 iter/sec.
Timings for 4096K FFT length (3 cpus, 3 workers): 19.60, 19.07, 18.85 ms.  Throughput: 156.49 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 24.65, 24.30, 22.80, 23.31 ms.  Throughput: 168.47 iter/sec.
Timings for 5120K FFT length (1 cpu, 1 worker): 21.94 ms.  Throughput: 45.57 iter/sec.
Timings for 5120K FFT length (2 cpus, 2 workers): 32.66, 32.01 ms.  Throughput: 61.86 iter/sec.
Timings for 5120K FFT length (3 cpus, 3 workers): 34.97, 32.24, 30.84 ms.  Throughput: 92.04 iter/sec.
Timings for 5120K FFT length (4 cpus, 4 workers): 35.62, 32.25, 31.43, 31.76 ms.  Throughput: 122.39 iter/sec.
Timings for 6144K FFT length (1 cpu, 1 worker): 23.88 ms.  Throughput: 41.87 iter/sec.
Timings for 6144K FFT length (2 cpus, 2 workers): 26.26, 25.43 ms.  Throughput: 77.41 iter/sec.
Timings for 6144K FFT length (3 cpus, 3 workers): 30.11, 29.54, 28.72 ms.  Throughput: 101.88 iter/sec.
Timings for 6144K FFT length (4 cpus, 4 workers): 36.55, 36.11, 35.42, 34.76 ms.  Throughput: 112.05 iter/sec.
[Sat Aug 20 19:50:39 2016]
Timings for 7168K FFT length (1 cpu, 1 worker): 28.53 ms.  Throughput: 35.06 iter/sec.
Timings for 7168K FFT length (2 cpus, 2 workers): 31.41, 30.84 ms.  Throughput: 64.26 iter/sec.
Timings for 7168K FFT length (3 cpus, 3 workers): 36.12, 34.30, 34.43 ms.  Throughput: 85.88 iter/sec.
Timings for 7168K FFT length (4 cpus, 4 workers): 44.40, 41.98, 41.49, 41.12 ms.  Throughput: 94.77 iter/sec.
Timings for 8192K FFT length (1 cpu, 1 worker): 33.66 ms.  Throughput: 29.71 iter/sec.
Timings for 8192K FFT length (2 cpus, 2 workers): 36.85, 36.16 ms.  Throughput: 54.79 iter/sec.
Timings for 8192K FFT length (3 cpus, 3 workers): 41.01, 40.42, 39.99 ms.  Throughput: 74.13 iter/sec.
Timings for 8192K FFT length (4 cpus, 4 workers): 50.24, 48.68, 48.56, 46.05 ms.  Throughput: 82.76 iter/sec.

Last fiddled with by sonjohan on 2016-08-20 at 17:56

Old 2016-08-22, 12:01   #6
sonjohan
May 2003
Belgium
2·139 Posts

This is the benchmark for the PC from my original post:
Code:
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz
CPU speed: 3492.01 MHz, 4 hyperthreaded cores
CPU features: Prefetch, SSE, SSE2, SSE4, AVX, AVX2, FMA
L1 cache size: 32 KB
L2 cache size: 256 KB, L3 cache size: 8 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 64
Prime95 64-bit version 28.9, RdtscTiming=1
Best time for 1024K FFT length: 4.263 ms., avg: 4.362 ms.
Best time for 1280K FFT length: 5.444 ms., avg: 5.768 ms.
Best time for 1536K FFT length: 6.769 ms., avg: 7.322 ms.
Best time for 1792K FFT length: 8.231 ms., avg: 8.381 ms.
Best time for 2048K FFT length: 9.411 ms., avg: 9.559 ms.
Best time for 2560K FFT length: 11.982 ms., avg: 12.119 ms.
Best time for 3072K FFT length: 14.615 ms., avg: 15.825 ms.
Best time for 3584K FFT length: 17.212 ms., avg: 18.413 ms.
Best time for 4096K FFT length: 19.812 ms., avg: 20.651 ms.
Best time for 5120K FFT length: 25.055 ms., avg: 25.326 ms.
Best time for 6144K FFT length: 29.501 ms., avg: 30.008 ms.
Best time for 7168K FFT length: 35.207 ms., avg: 35.709 ms.
Best time for 8192K FFT length: 40.683 ms., avg: 41.432 ms.
Timing FFTs using 2 threads on 1 physical CPU.
Best time for 1024K FFT length: 4.099 ms., avg: 4.166 ms.
Best time for 1280K FFT length: 5.300 ms., avg: 5.398 ms.
Best time for 1536K FFT length: 6.612 ms., avg: 6.748 ms.
Best time for 1792K FFT length: 7.862 ms., avg: 7.993 ms.
Best time for 2048K FFT length: 9.410 ms., avg: 9.539 ms.
Best time for 2560K FFT length: 11.646 ms., avg: 11.786 ms.
Best time for 3072K FFT length: 13.782 ms., avg: 14.141 ms.
Best time for 3584K FFT length: 16.858 ms., avg: 18.529 ms.
Best time for 4096K FFT length: 21.197 ms., avg: 21.692 ms.
Best time for 5120K FFT length: 31.279 ms., avg: 32.688 ms.
Best time for 6144K FFT length: 31.440 ms., avg: 32.040 ms.
Best time for 7168K FFT length: 36.827 ms., avg: 37.432 ms.
Best time for 8192K FFT length: 44.131 ms., avg: 44.913 ms.
Timing FFTs using 2 threads on 2 physical CPUs.
Best time for 1024K FFT length: 2.286 ms., avg: 2.362 ms.
Best time for 1280K FFT length: 2.992 ms., avg: 3.174 ms.
Best time for 1536K FFT length: 3.668 ms., avg: 3.961 ms.
Best time for 1792K FFT length: 4.392 ms., avg: 4.586 ms.
Best time for 2048K FFT length: 4.963 ms., avg: 5.130 ms.
Best time for 2560K FFT length: 6.352 ms., avg: 6.704 ms.
Best time for 3072K FFT length: 7.683 ms., avg: 7.864 ms.
Best time for 3584K FFT length: 9.108 ms., avg: 9.523 ms.
Best time for 4096K FFT length: 10.477 ms., avg: 10.720 ms.
Best time for 5120K FFT length: 13.238 ms., avg: 13.456 ms.
Best time for 6144K FFT length: 15.677 ms., avg: 17.073 ms.
Best time for 7168K FFT length: 18.702 ms., avg: 18.861 ms.
Best time for 8192K FFT length: 21.726 ms., avg: 26.040 ms.
Timing FFTs using 3 threads on 3 physical CPUs.
Best time for 1024K FFT length: 1.621 ms., avg: 1.844 ms.
Best time for 1280K FFT length: 2.201 ms., avg: 2.342 ms.
Best time for 1536K FFT length: 2.769 ms., avg: 3.403 ms.
Best time for 1792K FFT length: 3.336 ms., avg: 3.584 ms.
Best time for 2048K FFT length: 3.681 ms., avg: 3.819 ms.
Best time for 2560K FFT length: 4.719 ms., avg: 4.963 ms.
Best time for 3072K FFT length: 5.742 ms., avg: 6.230 ms.
Best time for 3584K FFT length: 6.899 ms., avg: 7.078 ms.
Best time for 4096K FFT length: 7.743 ms., avg: 7.929 ms.
Best time for 5120K FFT length: 9.746 ms., avg: 10.094 ms.
Best time for 6144K FFT length: 11.896 ms., avg: 12.236 ms.
Best time for 7168K FFT length: 14.246 ms., avg: 14.964 ms.
Best time for 8192K FFT length: 16.381 ms., avg: 17.033 ms.
Timing FFTs using 4 threads on 4 physical CPUs.
Best time for 1024K FFT length: 1.300 ms., avg: 1.389 ms.
Best time for 1280K FFT length: 1.922 ms., avg: 2.045 ms.
Best time for 1536K FFT length: 2.300 ms., avg: 2.603 ms.
Best time for 1792K FFT length: 2.874 ms., avg: 2.986 ms.
Best time for 2048K FFT length: 3.284 ms., avg: 3.453 ms.
Best time for 2560K FFT length: 4.260 ms., avg: 4.460 ms.
Best time for 3072K FFT length: 5.230 ms., avg: 5.394 ms.
Best time for 3584K FFT length: 6.213 ms., avg: 6.431 ms.
Best time for 4096K FFT length: 7.092 ms., avg: 7.373 ms.
Best time for 5120K FFT length: 8.934 ms., avg: 9.079 ms.
Best time for 6144K FFT length: 10.832 ms., avg: 11.314 ms.
Best time for 7168K FFT length: 12.790 ms., avg: 13.017 ms.
Best time for 8192K FFT length: 14.659 ms., avg: 14.943 ms.
Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 1024K FFT length: 1.369 ms., avg: 1.748 ms.
Best time for 1280K FFT length: 2.109 ms., avg: 3.429 ms.
Best time for 1536K FFT length: 2.454 ms., avg: 2.709 ms.
Best time for 1792K FFT length: 3.142 ms., avg: 3.710 ms.
Best time for 2048K FFT length: 3.384 ms., avg: 3.799 ms.
Best time for 2560K FFT length: 4.423 ms., avg: 5.081 ms.
Best time for 3072K FFT length: 5.405 ms., avg: 6.002 ms.
Best time for 3584K FFT length: 6.299 ms., avg: 6.980 ms.
Best time for 4096K FFT length: 7.476 ms., avg: 8.078 ms.
Best time for 5120K FFT length: 9.224 ms., avg: 10.125 ms.
Best time for 6144K FFT length: 11.034 ms., avg: 11.762 ms.
Best time for 7168K FFT length: 12.999 ms., avg: 14.016 ms.
Best time for 8192K FFT length: 14.922 ms., avg: 15.583 ms.

Timings for 1024K FFT length (4 cpus, 4 workers):  7.14,  7.08,  7.17,  6.86 ms.  Throughput: 566.58 iter/sec.
Timings for 1024K FFT length (4 cpus hyperthreaded, 4 workers):  7.33,  7.13,  7.03,  7.21 ms.  Throughput: 557.52 iter/sec.
Timings for 1280K FFT length (4 cpus, 4 workers):  9.26,  9.23,  9.09,  9.01 ms.  Throughput: 437.49 iter/sec.
Timings for 1280K FFT length (4 cpus hyperthreaded, 4 workers):  9.02,  8.93,  8.84,  9.05 ms.  Throughput: 446.43 iter/sec.
Timings for 1536K FFT length (4 cpus, 4 workers): 10.76, 10.67, 10.52, 10.32 ms.  Throughput: 378.59 iter/sec.
Timings for 1536K FFT length (4 cpus hyperthreaded, 4 workers): 10.81, 11.06, 10.83, 10.88 ms.  Throughput: 367.19 iter/sec.
Timings for 1792K FFT length (4 cpus, 4 workers): 13.12, 13.15, 13.05, 12.94 ms.  Throughput: 306.11 iter/sec.
Timings for 1792K FFT length (4 cpus hyperthreaded, 4 workers): 12.87, 12.85, 12.72, 12.93 ms.  Throughput: 311.50 iter/sec.
Timings for 2048K FFT length (4 cpus, 4 workers): 14.82, 14.77, 14.64, 14.36 ms.  Throughput: 273.12 iter/sec.
Timings for 2048K FFT length (4 cpus hyperthreaded, 4 workers): 14.88, 14.95, 14.72, 14.70 ms.  Throughput: 270.09 iter/sec.
[Fri Aug 19 16:07:37 2016]
Timings for 2560K FFT length (4 cpus, 4 workers): 19.24, 19.29, 18.99, 18.31 ms.  Throughput: 211.10 iter/sec.
Timings for 2560K FFT length (4 cpus hyperthreaded, 4 workers): 18.75, 18.84, 18.69, 18.54 ms.  Throughput: 213.84 iter/sec.
Timings for 3072K FFT length (4 cpus, 4 workers): 22.42, 22.45, 22.33, 21.76 ms.  Throughput: 179.88 iter/sec.
Timings for 3072K FFT length (4 cpus hyperthreaded, 4 workers): 22.84, 24.18, 23.28, 22.57 ms.  Throughput: 172.40 iter/sec.
Timings for 3584K FFT length (4 cpus, 4 workers): 26.07, 26.06, 26.08, 25.41 ms.  Throughput: 154.42 iter/sec.
Timings for 3584K FFT length (4 cpus hyperthreaded, 4 workers): 26.44, 26.38, 26.76, 26.47 ms.  Throughput: 150.89 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 29.91, 29.85, 29.99, 28.70 ms.  Throughput: 135.11 iter/sec.
Timings for 4096K FFT length (4 cpus hyperthreaded, 4 workers): 30.92, 30.71, 30.36, 30.46 ms.  Throughput: 130.68 iter/sec.
Timings for 5120K FFT length (4 cpus, 4 workers): 37.15, 37.10, 36.67, 35.99 ms.  Throughput: 108.93 iter/sec.
Timings for 5120K FFT length (4 cpus hyperthreaded, 4 workers): 40.19, 39.67, 38.87, 39.31 ms.  Throughput: 101.26 iter/sec.
Timings for 6144K FFT length (4 cpus, 4 workers): 45.77, 45.89, 44.87, 44.67 ms.  Throughput: 88.31 iter/sec.
Timings for 6144K FFT length (4 cpus hyperthreaded, 4 workers): 45.99, 46.02, 49.07, 45.91 ms.  Throughput: 85.64 iter/sec.
Timings for 7168K FFT length (4 cpus, 4 workers): 51.91, 51.83, 51.72, 50.33 ms.  Throughput: 77.76 iter/sec.
Timings for 7168K FFT length (4 cpus hyperthreaded, 4 workers): 54.99, 53.98, 53.69, 57.55 ms.  Throughput: 72.71 iter/sec.
Timings for 8192K FFT length (4 cpus, 4 workers): 60.98, 61.54, 60.28, 61.43 ms.  Throughput: 65.52 iter/sec.
Timings for 8192K FFT length (4 cpus hyperthreaded, 4 workers): 61.11, 60.88, 60.46, 65.22 ms.  Throughput: 64.66 iter/sec.

Old 2016-08-23, 15:59   #7
Madpoo
Serpentine Vermin Jar
Jul 2014
3324₁₀ Posts

Quote:
Originally Posted by sonjohan View Post
This is the benchmark for the PC from my original post:
...
What you'd want to do is run a benchmark that shows the difference between:
1 worker on 4 CPUs
4 workers on 4 CPUs

Disable the hyperthreading benchmarks (8 threads on 4 cores)... hyperthreading will probably just make it slower, since 2 threads end up competing for a single FPU pipeline and also for memory bandwidth, L1-L3 cache, etc.

In the "prime.txt" files add these lines:
BenchHyperthreads=0
BenchMultithreads=1

Optionally, also limit which FFT sizes you benchmark... the current first-time checks are around the 4M FFT size, so if you're doing first-time checks, try these:
Code:
MinBenchFFT=4096K
MaxBenchFFT=4096K

If you're interested in your assigned work in the 43M range, I don't remember what the FFT size is for those, but maybe 2560K, so use that for the min/max.

I'm 99% certain that you'll get better real world performance by using a single worker on 4 cores instead of 4 workers on 4 cores. Even if the benchmark doesn't bear that out, do some A/B testing with an actual test and compare the per-iteration times.

When you get into the 2M FFT size (like 36M-37M exponents), you might be fine running 2 of them on a CPU (2 workers using 2 cores each, in your case), and if you were doing even smaller exponents at FFT sizes of 1M or below, one worker per core is no problem.

I think that has everything to do with the L1/L2/L3 cache sizes and the main memory bandwidth. The smaller FFT sizes have less trouble fitting in cache and getting things done.
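Some rough working-set arithmetic illustrates the point (a sketch assuming 8-byte doubles and ignoring Prime95's auxiliary tables and padding, so real usage is somewhat larger): the raw FFT data for a 4096K FFT is about 32 MB per worker, far bigger than this CPU's 8 MB L3, so four independent workers at that size all end up streaming from main memory.

Code:
# Sketch: approximate FFT data size per worker, assuming 8-byte doubles and
# ignoring Prime95's extra tables and padding (real usage is somewhat larger).
def fft_working_set_mb(fft_length_k):
    doubles = fft_length_k * 1024      # FFT length in elements
    return doubles * 8 / 2**20         # bytes -> MiB

for k in (1024, 2048, 4096):
    print(f"{k}K FFT: ~{fft_working_set_mb(k):.0f} MB per worker")
# -> 1024K: ~8 MB, 2048K: ~16 MB, 4096K: ~32 MB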

In my own tests, once you get above the 2M FFT size, the performance of multiple workers on the same chip starts to fall off fast and you're better off with multi-threaded workers.

Old 2016-08-23, 19:02   #8
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
2⁵×3⁵ Posts

Quote:
Originally Posted by Madpoo View Post
I'm 99% certain that you'll get better real world performance by using a single worker on 4 cores instead of 4 workers on 4 cores. Even if the benchmark doesn't bear that out, do some A/B testing with an actual test and compare the per-iteration times.

I think that has everything to do with the L1/L2/L3 cache sizes and the main memory bandwidth. The smaller FFT sizes have less trouble fitting in cache and getting things done.
In my experience, most CPUs get more throughput with 4 workers on 4 cores. We changed the default behavior in v28.9 because a) the throughput penalty is usually small, b) it increases the chance that new users will finish at least one result before quitting, c) we won't tie up as many assignments with new users claiming 4 exponents and then abandoning them, and d) it was felt that most new users prefer the "quicker" turn-around time.

As to memory bandwidth, it is a common misconception that running helper threads reduces memory bandwidth usage. It usually does not. One worker with 3 helpers, running 4 times faster, uses the same memory bandwidth as 4 workers running with no helpers!
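Put as arithmetic (a back-of-the-envelope sketch with made-up numbers; the bytes-per-iteration and base rate below are illustrative assumptions, not measurements): bandwidth is data-moved-per-iteration times total iterations per second, and that product is the same whether one worker runs 4 times faster or four workers each run at the base rate.

Code:
# Sketch of the bandwidth identity, with illustrative (not measured) numbers.
bytes_per_iter = 32 * 2**20   # assumed data touched per iteration (roughly a 4096K FFT's data)
base_rate = 40                # assumed iterations/sec of one worker with no helpers

one_worker_with_helpers = bytes_per_iter * (4 * base_rate)  # 1 worker running 4x faster
four_workers_no_helpers = 4 * (bytes_per_iter * base_rate)  # 4 workers at the base rate
assert one_worker_with_helpers == four_workers_no_helpers   # identical bytes/second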

One major exception: if you have a humongous L3 or L4 cache where a significant portion of the FFT data can remain in cache when running 1 worker, then you can get a significant reduction in memory bandwidth usage.

Old 2016-08-24, 06:02   #9
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand
5³·79 Posts

Which is exactly what I explained, and exactly what the OP's benchmarks show. Take for example the 1024K FFT from the first benchmark post:

One worker in 1 thread: avg: 4.008 ms.
Same worker in 2 threads: avg: 2.027 ms.
in 3 threads: avg: 1.434 ms.
in 4 threads: avg: 1.217 ms.

If he runs 4 single-thread workers, and assuming memory is not the bottleneck, he gets 4 iterations every 4.008 ms (each worker does one iteration in that time).

If he runs 2 double workers (i.e. each uses 2 threads), he gets 4 iterations in 2×2.027 = 4.054 ms (each double worker does 2 iterations in that time).

One quadruple worker (4 threads) does 4 iterations in 4×1.217 = 4.868 ms, which is noticeably slower, roughly a 20-25% penalty. Memory is not the cause of that penalty, though; you can already see this by comparing 3 threads with 2 and 4 threads. You actually get some benefit from running 3 threads instead of 4: a 3-thread worker does 3 iterations in 3×1.434 = 4.302 ms and leaves a free core to do something else (including one more iteration every 4.008 ms). The penalty comes from the time spent on "cooperation" between the helper threads.

But now look at the 4096K FFT: there is no big difference between 3 and 4 threads, they need about the same time. In fact, 3 threads is a bit faster! (6.35 against 6.84 ms, average). The third helper (the 4th thread) produces no useful work, apart from heat and noise, and THAT is due to the memory bottleneck.

In fact, with 4 single-thread workers you get 4 iterations in ~22 ms (each worker does one iteration in about 22 ms), while 1 quadruple worker does 4 iterations in ~26 ms (4 × ~6.5 ms). In this case it is far better to use 2 workers, each with 2 threads (one helper): you get 4 iterations (two times two) in ~17 ms, a big speedup compared with 26 ms.

Say you want to test exponents in the current LL range, where a 4096K FFT is used. With 4 single-thread workers you will get about 170 iterations per second: in the last part of the benchmark you need about 24 ms per iteration and you have 4 workers, so 1000 (ms) / 24 ≈ 41.7 iterations per second per worker, times 4 gives ~168 iterations per second in total, working on 4 exponents.

If instead you use a single worker with 4 threads (from the first part of the benchmark, at the same 4096K FFT), you get 6.844 ms per iteration, so 1000 (ms) / 6.844 ≈ 146 iterations per second.

Between the two, which one would you choose?

Even better, running 2 double-thread workers you may get almost 200 iterations per second. You cannot read this straight from the benchmark; or rather you can, but you need to apply a penalty.

For example, 4 single-thread workers and one quadruple-thread worker (as alternatives, not at the same time) both occupy the whole CPU and draw their full bandwidth, so you can use the benchmark numbers directly. You get the numbers for the former from the second part of the benchmark (the multi-worker timings), and the numbers for the latter from the first part of the table (sorry for the crossover). It is "safe" to do so, as I explain further on.

But you cannot, for example, compute the iterations per second for 4 single-thread workers from the time for 1 single-thread worker in the first part of the table. Those values are not realistic, because not all of the CPU is occupied.

For example, at 4096K FFT with one thread and one worker, the time given in your table is 21.011 ms.
Now, 1000/21.011 ≈ 47.6. You cannot simply multiply this by 4, because running one core while the other 3 are resting is not the same as running all cores, when the other 3 cores also need resources. If you did that, you would arrive at 47.6 × 4 ≈ 190 iterations/second, which is not realistic. The realistic value is the one shown in the second part of the benchmark table, where the time per iteration per core increases to about 24 ms and you get 168 iterations/second, because all the cores are running at the same time, bumping into each other now and then and wasting time. So you only get 168 iterations per second: a penalty of about 13% caused by running all the cores at once.

If you do the same arithmetic with 2 double-thread workers, the time in the table is 8.681 ms, so 1000/8.681 ≈ 115, and with 2 workers you get 230 iterations per second. But as above, this is not the real number, because only half of your CPU was working during that test. So you have to apply the "penalty", which should again be around 10%; even in the most unfortunate case you still get about 200 iterations per second. Here the penalty is smaller, because half of the CPU was working during the benchmark, compared with the case above, where only a quarter was. And by "most unfortunate" case I mean something like a bad memory configuration, or who knows what.

On the other hand, taking the "1 quadruple-thread worker" figure from the first part of the table is "safe", since the whole CPU (all 4 cores) was running during that test. So even without applying any penalty, 1000/6.8 is still under 150 iterations per second. (A small script that redoes these estimates is sketched below.)
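The same estimates in a few lines of Python (a sketch: the 24 ms, 6.844 ms and 8.681 ms figures come from the benchmark above, and the ~10% "penalty" for configurations benchmarked with idle cores is a rough assumption, not a measured value):

Code:
# Rough throughput estimates for the 4096K FFT, from the benchmark timings above.
def total_iters_per_sec(ms_per_iter, workers, penalty=0.0):
    """Total iterations/second for `workers` identical workers,
    each taking `ms_per_iter` milliseconds per iteration."""
    return workers * (1000.0 / ms_per_iter) * (1.0 - penalty)

print(total_iters_per_sec(24.0, 4))          # 4 x 1-thread workers -> ~167 iter/s
print(total_iters_per_sec(6.844, 1))         # 1 x 4-thread worker  -> ~146 iter/s
print(total_iters_per_sec(8.681, 2, 0.10))   # 2 x 2-thread workers -> ~207 iter/s after a ~10% penalty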

Summary:
That was just an example calculation; you can do your own, and better.
Every system is different.
For the specified FFT size, your system will get:
  • 1 worker using 4 threads (cores), working on one exponent: ~150 iter/s (advantage: each exponent finishes faster, if you are very curious to see the result).
  • 4 workers, each with 1 thread (core), working on 4 exponents: ~170 iter/s in total (no advantage for your system at this FFT size, other than being a bit faster than the option above).
  • 2 workers, each with 2 threads: ~200 iter/s in total (the best compromise for your system between how fast each exponent finishes, throughput, memory, etc.).

And there went my lunch break...

Last fiddled with by LaurV on 2016-08-24 at 06:18

Old 2016-08-24, 14:45   #10
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
101101111001₂ Posts

It really does depend on your hardware though. Take my i5-6600 systems with DDR-2133, where one worker is always fastest:

[Work thread Aug 24 10:24] Benchmarking multiple workers to measure the impact of memory bandwidth
[Work thread Aug 24 10:27] Timing 2048K FFT, 4 cpus, 1 worker. Average times: 2.57 ms. Total throughput: 388.96 iter/sec.
[Work thread Aug 24 10:27] Timing 2048K FFT, 4 cpus, 2 workers. Average times: 5.30, 5.30 ms. Total throughput: 377.27 iter/sec.
[Work thread Aug 24 10:27] Timing 2048K FFT, 4 cpus, 4 workers. Average times: 10.71, 10.71, 10.71, 10.71 ms. Total throughput: 373.41 iter/sec.
...
[Work thread Aug 24 10:29] Timing 4096K FFT, 4 cpus, 1 worker. Average times: 5.40 ms. Total throughput: 185.15 iter/sec.
[Work thread Aug 24 10:29] Timing 4096K FFT, 4 cpus, 2 workers. Average times: 10.88, 10.78 ms. Total throughput: 184.67 iter/sec.
[Work thread Aug 24 10:29] Timing 4096K FFT, 4 cpus, 4 workers. Average times: 21.67, 21.70, 21.65, 21.64 ms. Total throughput: 184.64 iter/sec.

But a stock-clocked i7-4770k with DDR3-2400 is more interesting:

[Work thread Aug 24 10:30] Timing 2048K FFT, 4 cpus, 1 worker. Average times: 2.63 ms. Total throughput: 380.55 iter/sec.
[Work thread Aug 24 10:30] Timing 2048K FFT, 4 cpus, 2 workers. Average times: 5.90, 5.58 ms. Total throughput: 348.60 iter/sec.
[Work thread Aug 24 10:30] Timing 2048K FFT, 4 cpus, 4 workers. Average times: 10.49, 10.42, 10.54, 10.42 ms. Total throughput: 382.13 iter/sec.
[Work thread Aug 24 10:31] Timing 2560K FFT, 4 cpus, 1 worker. Average times: 3.56 ms. Total throughput: 280.91 iter/sec.
[Work thread Aug 24 10:31] Timing 2560K FFT, 4 cpus, 2 workers. Average times: 6.81, 6.80 ms. Total throughput: 294.08 iter/sec.
[Work thread Aug 24 10:31] Timing 2560K FFT, 4 cpus, 4 workers. Average times: 13.53, 13.56, 13.67, 13.58 ms. Total throughput: 294.46 iter/sec.
[Work thread Aug 24 10:31] Timing 3072K FFT, 4 cpus, 1 worker. Average times: 4.36 ms. Total throughput: 229.51 iter/sec.
[Work thread Aug 24 10:31] Timing 3072K FFT, 4 cpus, 2 workers. Average times: 8.07, 8.09 ms. Total throughput: 247.56 iter/sec.
[Work thread Aug 24 10:32] Timing 3072K FFT, 4 cpus, 4 workers. Average times: 16.12, 16.11, 16.25, 16.13 ms. Total throughput: 247.63 iter/sec.
[Work thread Aug 24 10:32] Timing 3584K FFT, 4 cpus, 1 worker. Average times: 4.81 ms. Total throughput: 207.94 iter/sec.
[Work thread Aug 24 10:32] Timing 3584K FFT, 4 cpus, 2 workers. Average times: 9.69, 9.66 ms. Total throughput: 206.67 iter/sec.
[Work thread Aug 24 10:32] Timing 3584K FFT, 4 cpus, 4 workers. Average times: 19.07, 18.99, 18.99, 19.09 ms. Total throughput: 210.14 iter/sec.
[Work thread Aug 24 10:32] Timing 4096K FFT, 4 cpus, 1 worker. Average times: 5.51 ms. Total throughput: 181.46 iter/sec.
[Work thread Aug 24 10:33] Timing 4096K FFT, 4 cpus, 2 workers. Average times: 10.85, 10.83 ms. Total throughput: 184.52 iter/sec.
[Work thread Aug 24 10:33] Timing 4096K FFT, 4 cpus, 4 workers. Average times: 21.96, 21.98, 21.76, 21.81 ms. Total throughput: 182.84 iter/sec.

At 2048K and below, 2 workers is the worst! 2 workers only comes out on top at 4096K and above.
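As an illustration, a few lines of Python that tabulate those i7-4770K throughput figures and pick the winner at each FFT size (a sketch; the numbers are copied from the log above):

Code:
# Best worker count per FFT size, from the i7-4770K throughput numbers quoted above.
throughput = {            # FFT size (K) -> {worker count: total iter/sec}
    2048: {1: 380.55, 2: 348.60, 4: 382.13},
    2560: {1: 280.91, 2: 294.08, 4: 294.46},
    3072: {1: 229.51, 2: 247.56, 4: 247.63},
    3584: {1: 207.94, 2: 206.67, 4: 210.14},
    4096: {1: 181.46, 2: 184.52, 4: 182.84},
}
for fft, results in sorted(throughput.items()):
    best = max(results, key=results.get)
    print(f"{fft}K: {best} worker(s) wins at {results[best]:.2f} iter/sec")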

Old 2016-08-24, 19:34   #11
Madpoo
Serpentine Vermin Jar
Jul 2014
110011111100₂ Posts

Quote:
Originally Posted by LaurV View Post
Which is exactly what I explained, and exactly what the OP's benchmarks show. Take for example the 1024K FFT from the first benchmark post:

One worker in 1 thread: avg: 4.008 ms.
Same worker in 2 threads: avg: 2.027 ms.
in 3 threads: avg: 1.434 ms.
in 4 threads: avg: 1.217 ms.

If he runs 4 single-thread workers, and assuming memory is not the bottleneck, he gets 4 iterations every 4.008 ms.
...
That's where your math always broke down for me, on my systems anyway.

Using your example: if a single worker on a single core took 4 ms per iteration, and I then ran 2 workers each using a single core, the per-worker iteration time would actually degrade to maybe 5 or 6 ms (just example values, but you get the gist).

The more workers I had running on a single CPU, competing for memory access, the worse the degradation became.

I should probably just do a real world example and post my findings.

Admittedly, at the 1024K FFT size I don't think it's an issue on a 4-core CPU. At 2048K, running 4 workers on 4 cores is on the "iffy" side, but running 2 workers with 2 cores each was okay.

Moving beyond 2M FFT sizes though, it got bad fast.

Anyway, I guess it's easy enough to tell whether it's a problem on your system. If you're already set up for 1 worker per core, take note of the per-iteration times when all workers are running. Then stop all but a single worker and see if its iteration times improve. Start the other workers again, one at a time, and note how the iteration times of that first worker change... do they get worse as more workers start running?

It always did for me... the first worker *always* slowed down when the other workers started. As mentioned, for smaller FFT sizes the slowdown wasn't too bad, and separate workers still came out ahead of the losses from the inefficiencies of multi-threading a single worker. But as the FFT size grows, it eventually becomes better (more efficient) to multi-thread a single worker.

Maybe I don't fully understand why that is, but my observations bore it out pretty convincingly.