mersenneforum.org  

Old 2020-06-30, 03:08   #1
4EvrYng
 
Jun 2020

5 Posts
“Odd” P95 memory benchmark results?

I’ve recently run the P95 benchmark on a system I’m building. While looking at the results (see attached screenshot), I noticed something that, to me as a person with zero P95 knowledge and experience, seems odd.

The first “oddity” is that hyperthreaded throughput is on average lower than non-hyperthreaded (a 16% average drop in the case of a single worker).

The second “oddity” is that a single worker always seems to have the best throughput; the figures seem to start dropping as soon as the number of workers increases.

That leaves me scratching my head because, not having the knowledge, I assume that a) hyperthreading should result in higher overall throughput, not lower, and b) more workers should result in more iterations per second, not fewer.

So I need help, please, answering:

Am I interpreting the figures correctly? When P95 says throughput is xyz, that is -total- throughput, it isn’t per “thread” per worker, correct?

Am I correct in assuming that hyperthreaded figures should be higher than non-hyperthreaded ones, not lower?

Am I correct in assuming more workers should’ve resulted in higher throughput, not lower?

In other words: Does this seem odd to you too / does something seem wrong?
Attached: P95Benchmark.PNG (91.3 KB)
Old 2020-06-30, 04:52   #2
Uncwilly
6809 > 6502
 
Aug 2003
8,369 Posts

Prime95 is so efficiently written that the normal gains that a program sees from hyperthreading don't happen. In fact hyperthreading interferes with it.

The throughput is the total potential throughput (how much total work gets done). So each core doing its own task will get the most work done. Putting multiple cores onto a single task will get that one task done faster, but the total amount of work will be less. There may be issues with memory bandwidth if many cores are each trying to access a bunch of memory. Actual best performance might be slightly different.
Old 2020-06-30, 05:39   #3
axn
 
Jun 2003

2×3²×7×37 Posts

Quote:
Originally Posted by Uncwilly View Post
So each core doing its own task will get the most work done. Putting multiple cores on to a single task will get that one task done faster. But the total amount of work will be less.
Which is the opposite of what they're seeing.
Old 2020-06-30, 08:44   #4
M344587487
 
"Composite as Heck"
Oct 2017

3×199 Posts

1) As mentioned, hyperthreading (generically called SMT) should normally be disabled for P95, as P95 is more efficient at occupying the core than SMT is. Hyperthreading allows two threads to queue up work simultaneously to increase occupancy, but there is overhead. A workload like P95 fully occupies the core without the cost of this thread-juggling overhead.

2) L3 cache is shared between cores; more workers means less cache per worker. The less cache a worker has, the higher the chance that a piece of data is evicted from cache before it gets accessed again, in which case it has to be loaded from RAM again. P95 is normally memory bound, meaning throughput is limited by how much memory bandwidth you have. The more bandwidth that is consumed re-fetching evicted data, the less that is available for fresh data, making the memory bottleneck even worse and resulting in lower throughput.
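
To make the cache argument concrete, here is a rough back-of-the-envelope sketch (Python, purely illustrative; the 20 MB L3 size and 8 bytes per FFT element are assumptions for illustration, not the figures for any particular CPU or Prime95's actual data layout):

Code:
# Back-of-the-envelope only: real Prime95 FFT data occupies more bytes per
# element than this, and the L3 is not partitioned perfectly evenly.
L3_CACHE_MB = 20          # assumed total L3 cache for this sketch
BYTES_PER_ELEMENT = 8     # double-precision float, a simplification

for fft_k in (2240, 11520, 65536):        # FFT lengths (in K) from this thread
    working_set_mb = fft_k * 1024 * BYTES_PER_ELEMENT / 2**20
    for workers in (1, 2, 4, 10):
        cache_share_mb = L3_CACHE_MB / workers
        resident = min(100.0, 100.0 * cache_share_mb / working_set_mb)
        print(f"{fft_k:6d}K FFT, {workers:2d} workers: "
              f"~{working_set_mb:6.1f} MB working set, "
              f"{cache_share_mb:4.1f} MB L3 share -> "
              f"~{resident:5.1f}% can stay cached")

The smaller the cached fraction per worker, the more of each iteration's data has to come back over the memory bus, which is exactly where the multi-worker runs lose throughput.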
Old 2020-06-30, 12:45   #5
Uncwilly
6809 > 6502
 
Aug 2003
8,369 Posts

Quote:
Originally Posted by axn View Post
Which is the opposite of what they're seeing.
That is what I get for trying to read that spreadsheet while sleepy/under-caffeinated.
Old 2020-06-30, 13:04   #6
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

11×509 Posts

4EvrYng: Try more options.

You tested 10 x 1 and 1 x 10.

Also test 5 x 2 and 2 x 5.

Plus other splits like: 3,3,4 and 2,2,3,3.

Even options that don't use all the cores: 4,4 and 3,3,3 and 2,2,2 etc.

See which of those gives you the better outcome and use it.
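
If you want to enumerate all the possible splits rather than hand-picking a few, a small sketch like this (Python, illustrative only; not part of prime95) lists every way to divide 10 cores among workers:

Code:
# Enumerate every way to divide N physical cores among workers
# (integer partitions), e.g. for 10 cores: 10 | 5+5 | 3+3+4 | 2+2+3+3 | ...
# You would still benchmark each split in prime95 and compare throughput.
def core_splits(n, max_part=None):
    """Yield partitions of n as tuples of cores-per-worker, largest part first."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for part in range(min(n, max_part), 0, -1):
        for rest in core_splits(n - part, part):
            yield (part,) + rest

for split in core_splits(10):
    print(" + ".join(map(str, split)))

Splits that leave some cores idle (4+4, 3+3+3, ...) are just the partitions of a smaller total.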

But I don't understand why you didn't try 20 x 1 and 1 x 20 when you had SMT enabled. Your 10 cores should logically (not physically) become 20 cores with SMT on.
Old 2020-06-30, 13:24   #7
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

53×79 Posts

Quote:
Originally Posted by 4EvrYng View Post
When P95 says throughput is xyz that is -total- throughput, it isn’t per “thread” per worker
prime95 is quite clear when I run it. It states timing figures for each worker tested, and a combined total system throughput for the stated running condition. The following is from an i5-1035G1, which has 4 actual cores plus hyperthreading. Take the second line: 1000 msec/sec * (1/(9.80 msec/iter) + 1/(12.34 msec/iter)) = 183.0 iter/sec total (see the quick check after the log below).
Timings for 2240K FFT length (4 cores, 1 worker): 7.14 ms. Throughput: 140.00 iter/sec.
Timings for 2240K FFT length (4 cores, 2 workers): 9.80, 12.34 ms. Throughput: 183.03 iter/sec.
Timings for 2240K FFT length (4 cores, 4 workers): 24.74, 20.54, 18.41, 22.06 ms. Throughput: 188.76 iter/sec.
[Fri May 29 22:33:19 2020]
Timings for 2240K FFT length (4 cores hyperthreaded, 1 worker): 5.95 ms. Throughput: 168.09 iter/sec.
Timings for 2240K FFT length (4 cores hyperthreaded, 2 workers): 11.51, 11.72 ms. Throughput: 172.17 iter/sec.
Timings for 2240K FFT length (4 cores hyperthreaded, 4 workers): 46.42, 20.56, 17.51, 17.54 ms. Throughput: 184.32 iter/sec.

Timings for 11520K FFT length (4 cores, 1 worker): 25.56 ms. Throughput: 39.12 iter/sec.
Timings for 11520K FFT length (4 cores, 2 workers): 47.17, 46.74 ms. Throughput: 42.59 iter/sec.
Timings for 11520K FFT length (4 cores, 4 workers): 99.19, 98.26, 95.55, 95.97 ms. Throughput: 41.14 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 1 worker): 29.73 ms. Throughput: 33.64 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 2 workers): 58.22, 57.55 ms. Throughput: 34.55 iter/sec.
Timings for 11520K FFT length (4 cores hyperthreaded, 4 workers): 118.11, 115.52, 115.83, 114.87 ms. Throughput: 34.46 iter/sec.

Timings for 65536K FFT length (4 cores, 1 worker): 160.64 ms. Throughput: 6.23 iter/sec.
Timings for 65536K FFT length (4 cores, 2 workers): 340.03, 339.46 ms. Throughput: 5.89 iter/sec.
Timings for 65536K FFT length (4 cores, 4 workers): 696.01, 689.13, 692.94, 688.94 ms. Throughput: 5.78 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 1 worker): 251.43 ms. Throughput: 3.98 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 2 workers): 529.32, 524.69 ms. Throughput: 3.80 iter/sec.
Timings for 65536K FFT length (4 cores hyperthreaded, 4 workers): 1086.87, 1063.50, 1072.06, 1054.67 ms. Throughput: 3.74 iter/sec.
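
For anyone who wants to double-check that arithmetic, a quick sketch (Python, just for illustration) recomputes the combined figure from the per-worker timings quoted above:

Code:
# prime95's combined throughput is the sum of each worker's rate:
# every worker contributes 1000 / (ms per iteration) iterations per second.
def total_throughput(ms_per_iter):
    return sum(1000.0 / ms for ms in ms_per_iter)

# 2240K FFT, 2 workers (second line above): 1000/9.80 + 1000/12.34
print(total_throughput([9.80, 12.34]))                 # ~183.1 (reported 183.03)
# 2240K FFT, 4 workers:
print(total_throughput([24.74, 20.54, 18.41, 22.06]))  # ~188.8 (reported 188.76)

The small differences come from prime95 rounding the printed per-worker timings.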

Quote:
Am I correct in assuming that hyperthreaded figures should be higher than non-hyperthreaded ones
No. Try reading a book. Now try reading two books at the same time. Hyperthreading rarely benchmarks faster for primality testing.

Quote:
Am I correct in assuming more workers should’ve resulted in higher throughput
No. Imagine a small restaurant kitchen, with limited range surface, counter surface, aisle space, etc. For several small orders, workers may each be able to work independently without getting in each other's way and slowing each other down. Eventually, as you increase workers and tasks, you run out of resources. Now imagine they're working on big, elaborate, different dinners. The required space per order increases, and the optimal number of simultaneous orders being processed decreases. That's what appears in the example above, where, as the FFT size increases, the optimal number of workers decreases.

Quote:
In other words: Does this seem odd to you too / does something seem wrong?
No. But it did long ago when I was newer to running and analyzing it.
See Effect of number of workers and Effect of number of workers (continued)
for several cpu models' extensive benchmark runs analyzed and graphed.
Old 2020-06-30, 23:23   #8
4EvrYng
 
Jun 2020

5 Posts

Quote:
Originally Posted by Uncwilly View Post
Prime95 is so efficiently written that the normal gains that a program sees from hyperthreading don't happen. In fact hyperthreading interferes with it.
Thank you!
Old 2020-06-30, 23:37   #9
4EvrYng
 
Jun 2020

5 Posts

Quote:
Originally Posted by M344587487 View Post
As mentioned hyperthreading (generically called SMT) should normally be disabled for P95 as P95 is more efficient at occupying the core than SMT is ... A workload like P95 fully occupies the core ...
Thank you! Then I am surprised P95 offers to benchmark hyperthreading without any warning that the figures will most likely drop if you go for it. I wish that were not the case, as it has possibly left many absolute noobs like me unaware of it.

Quote:
Originally Posted by M344587487 View Post
L3 cache is shared between cores, more workers means less cache per worker. The less cache a worker has the higher the chance that a piece of data is evicted from cache before it gets accessed again, in which case it has to be loaded from RAM again. P95 is normally memory bound, meaning throughput is limited by how much memory bandwidth you have. The more bandwidth that is consumed transferring duplicate data the less that is available for unique data, making the memory bottleneck even worse and resulting in lower throughput.
Thank you for the elaborate explanation! It makes absolute sense. Again, I wish there were some warning about it, because by default P95 offers to run three different worker counts.
Old 2020-06-30, 23:58   #10
4EvrYng
 
Jun 2020

5 Posts

Quote:
Originally Posted by retina View Post
4EvrYng: Try more options.
I did. For the FFTs I tested, the figures on my machine would start dropping the moment there was more than one worker. I just didn't show all the figures in the posted spreadsheet in order to keep it small.

Quote:
Originally Posted by retina View Post
But I don't understand why you didn't try 20 x 1 and 1 x 20 when you had SMT enabled. Your 10 cores should logically (not physically) become 20 cores with SMT on.
P95 offers 10 as the default, and when I enter more than 10 it does nothing, so my interpretation is that it asks how many cores you want tested, and the 'test hyperthreading' checkbox controls whether you test the "HT" ones too or not.
Old 2020-07-01, 00:26   #11
4EvrYng
 
Jun 2020

5 Posts

Quote:
Originally Posted by kriesel View Post
No. But it did long ago when I was newer to running and analyzing it.
See Effect of number of workers and Effect of number of workers (continued)
for several cpu models' extensive benchmark runs analyzed and graphed.
Thank you for the help; the posts you linked to were very informative. I'm glad to find out the "oddity" I perceived is not an actual one / a sign of any issue.