#1
Jun 2020
23 Posts
I've recently run the P95 benchmark on a system I'm building. While looking at the results (see attached screenshot) I noticed something that, to me as a person with zero P95 knowledge and experience, seems odd.

The first "oddity" is that hyperthreaded throughput is on average lower than non-hyperthreaded (a 16% average drop in the single-worker case). The second "oddity" is that a single worker always seems to have the best throughput; the figures start dropping once the number of workers increases. That leaves me scratching my head because, not having the knowledge, I assume that a) hyperthreading should result in higher overall throughput, not lower, and b) more workers should result in more iterations per second, not fewer. So I need help, please, answering:

- Am I interpreting the figures correctly? When P95 says throughput is xyz, that is the -total- throughput, not per worker, correct?
- Am I correct in assuming the hyperthreaded figures should be higher than the non-hyperthreaded ones, not lower?
- Am I correct in assuming more workers should have resulted in higher throughput, not lower?

In other words: does this seem odd to you too / does something seem wrong?
#2
6809 > 6502
Aug 2003
101×103 Posts
Prime95 is so efficiently written that the normal gains a program sees from hyperthreading don't happen. In fact, hyperthreading interferes with it.

The throughput figure is the total potential throughput (how much total work gets done). So each core doing its own task will get the most work done. Putting multiple cores onto a single task will get that one task done faster, but the total amount of work will be less. There may also be issues with memory bandwidth if many cores are each trying to access a lot of memory, so the actual best configuration might be slightly different.
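A toy model of that trade-off (all numbers below are hypothetical, purely to illustrate the direction of the effect):

```python
# Toy model with made-up numbers: one core running its own task manages
# 50 iter/sec. Spreading one task across N cores scales imperfectly
# (efficiency < 1 from synchronization overhead), while N independent
# workers each keep one core fully busy.
single_core_rate = 50.0   # hypothetical iter/sec for one core, one task
cores = 4
efficiency = 0.85         # hypothetical multi-core scaling efficiency

one_worker_total = single_core_rate * cores * efficiency   # 1 task on 4 cores
four_worker_total = single_core_rate * cores               # 4 tasks, 1 core each

print(f"1 worker  x 4 cores: {one_worker_total:.0f} iter/sec total (task finishes soonest)")
print(f"4 workers x 1 core:  {four_worker_total:.0f} iter/sec total (most overall work)")
```

The memory-bandwidth caveat can flip this at large FFT sizes, as the 65536K figures later in the thread show.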
#4

"Composite as Heck"
Oct 2017
950₁₀ Posts
1) As mentioned, hyperthreading (generically called SMT) should normally be disabled for P95, because P95 is more efficient at occupying the core than SMT is. Hyperthreading allows two threads to queue up work simultaneously to increase occupancy, but there is overhead. A workload like P95 fully occupies the core without the cost of this thread-juggling overhead.

2) L3 cache is shared between cores, so more workers means less cache per worker. The less cache a worker has, the higher the chance that a piece of data is evicted from cache before it gets accessed again, in which case it has to be loaded from RAM again. P95 is normally memory bound, meaning throughput is limited by how much memory bandwidth you have. The more bandwidth that is consumed re-fetching evicted data, the less is available for new data, making the memory bottleneck even worse and resulting in lower throughput.
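A back-of-envelope sketch of the cache arithmetic (the 16 MB L3 and 8 bytes per FFT element are illustrative assumptions; gwnum's real layout needs more memory per element):

```python
# Illustrative assumptions only: ~8 bytes per FFT element, 16 MB shared L3.
l3_bytes = 16 * 1024**2

for fft_k in (2240, 11520, 65536):             # typical benchmark FFT lengths
    working_set_mb = fft_k * 1024 * 8 / 2**20  # one worker's FFT data, in MB
    for workers in (1, 2, 4):
        share_mb = l3_bytes / workers / 2**20  # L3 share left to each worker
        print(f"{fft_k}K FFT, {workers} worker(s): {working_set_mb:.0f} MB "
              f"working set vs {share_mb:.0f} MB cache share")
```

Even a single worker overflows L3 at these sizes, so every worker streams from RAM; each extra worker both shrinks the cache share and adds another stream competing for the same memory bus.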
#6

Undefined
"The unspeakable one"
Jun 2006
My evil lair
6,793 Posts
4EvrYng: Try more options.

You tested 10 x 1 and 1 x 10. Also test 5 x 2 and 2 x 5, plus other splits like 3,3,4 and 2,2,3,3, and even options that don't use all the cores: 4,4 and 3,3,3 and 2,2,2, etc. See which of those gives you the better outcome and use it. But I don't understand why you didn't try 20 x 1 and 1 x 20 when you had SMT enabled. Your 10 cores should logically (not physically) become 20 cores with SMT on.
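If you want to be systematic about it, the candidate layouts that use every core are just the integer partitions of the physical core count; a throwaway sketch (the core_splits helper is hypothetical, not anything in P95):

```python
def core_splits(n, max_part=None):
    """Yield every way to split n cores among workers, as descending tuples."""
    if n == 0:
        yield ()
        return
    if max_part is None or max_part > n:
        max_part = n
    for first in range(max_part, 0, -1):
        for rest in core_splits(n - first, first):
            yield (first,) + rest

# For 10 cores: (10,), (5, 5), (4, 3, 3), (3, 3, 2, 2), ... down to ten
# 1-core workers. Splits that idle some cores come from core_splits(8), etc.
for split in core_splits(10):
    print(split)
```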
#7
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
Quote:
Timings for 2240K FFT length (4 cores, 1 worker): 7.14 ms. Throughput: 140.00 iter/sec.
Timings for 2240K FFT length (4 cores, 2 workers): 9.80, 12.34 ms. Throughput: 183.03 iter/sec.
Timings for 2240K FFT length (4 cores, 4 workers): 24.74, 20.54, 18.41, 22.06 ms. Throughput: 188.76 iter/sec.
[Fri May 29 22:33:19 2020]
Timings for 2240K FFT length (4 cores hyperthreaded, 1 worker): 5.95 ms. Throughput: 168.09 iter/sec.
Timings for 2240K FFT length (4 cores hyperthreaded, 2 workers): 11.51, 11.72 ms. Throughput: 172.17 iter/sec.
Timings for 2240K FFT length (4 cores hyperthreaded, 4 workers): 46.42, 20.56, 17.51, 17.54 ms. Throughput: 184.32 iter/sec.
Timings for 11520K FFT length (4 cores, 1 worker): 25.56 ms. Throughput: 39.12 iter/sec.
Timings for 11520K FFT length (4 cores, 2 workers): 47.17, 46.74 ms. Throughput: 42.59 iter/sec.
Timings for 11520K FFT length (4 cores, 4 workers): 99.19, 98.26, 95.55, 95.97 ms. Throughput: 41.14 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 1 worker): 29.73 ms. Throughput: 33.64 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 2 workers): 58.22, 57.55 ms. Throughput: 34.55 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 4 workers): 118.11, 115.52, 115.83, 114.87 ms. Throughput: 34.46 iter/sec.
Timings for 65536K FFT length (4 cores, 1 worker): 160.64 ms. Throughput: 6.23 iter/sec.
Timings for 65536K FFT length (4 cores, 2 workers): 340.03, 339.46 ms. Throughput: 5.89 iter/sec.
Timings for 65536K FFT length (4 cores, 4 workers): 696.01, 689.13, 692.94, 688.94 ms. Throughput: 5.78 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 1 worker): 251.43 ms. Throughput: 3.98 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 2 workers): 529.32, 524.69 ms. Throughput: 3.80 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 4 workers): 1086.87, 1063.50, 1072.06, 1054.67 ms. Throughput: 3.74 iter/sec.
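Plugging the per-worker times above into a quick check (Python, written just for this post) confirms the benchmark's throughput figure is the sum of every worker's rate, i.e. a total, not a per-worker number:

```python
# Per-worker ms/iter copied verbatim from the quoted benchmark output,
# paired with the throughput the benchmark reported for that run.
runs = {
    "2240K,  2 workers": ([9.80, 12.34], 183.03),
    "2240K,  4 workers": ([24.74, 20.54, 18.41, 22.06], 188.76),
    "65536K, 4 workers": ([696.01, 689.13, 692.94, 688.94], 5.78),
}
for label, (times_ms, reported) in runs.items():
    total = sum(1000.0 / t for t in times_ms)  # each worker adds 1000/ms iter/sec
    print(f"{label}: computed {total:.2f} vs reported {reported} iter/sec")
```

The computed totals match the reported ones to within rounding of the printed timings.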
See Effect of number of workers and Effect of number of workers (continued) for several CPU models' extensive benchmark runs, analyzed and graphed.
#10

Jun 2020
23 Posts
I did. For the FFTs I tested, figures on my machine would start dropping the moment there was more than one worker; I just didn't show all the figures in the spreadsheet I posted, in order to keep it small.

P95 offers 10 as the default, and when I enter more than 10 it does nothing, so my interpretation is that it asks how many cores you want tested, and the 'test hyperthreading' checkbox controls whether the "HT" ones get tested too or not.