mersenneforum.org  

Old 2017-12-02, 16:11   #12
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

2·5·293 Posts

Quote:
Originally Posted by GP2
Except Prime95 hasn't (yet) been tuned for Skylake, and therefore if you use Skylake you should try turning on hyperthreading (by adding HyperthreadLL=1 to local.txt) and doing benchmarks with and without this setting. For me it did make a difference.
What tuning still needs to be done?
Old 2017-12-02, 17:55   #13
GP2
 
Sep 2003

5·11·47 Posts

Quote:
Originally Posted by Mark Rose
What tuning still needs to be done?
Uh, all of it, I think.

George was testing on Knights Landing, I don't recall if any of that made it into the program yet. I don't think he's done any optimizations for "actual" Skylake yet.

In any case, empirical testing indicated that HyperthreadLL=1 in local.txt produces better performance for c5.large instances (Skylake) on AWS, but not on the c4.large instances (Haswell).


Note: mprime runs about 25% faster on c5.large than on c4.large, but presumably that's due entirely to larger cache and better memory bandwidth, rather than any tuning of the code.
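
For anyone repeating the with/without comparison, the arithmetic is worth spelling out, since "25% faster" can refer to throughput or to time per iteration. A minimal Python sketch, assuming the benchmark reports milliseconds per iteration; the function name and the timings below are made-up placeholders, not measurements from c4.large or c5.large:

```python
def throughput_gain_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Percent more iterations per second than the baseline.

    Both arguments are per-iteration times in milliseconds, so a
    smaller candidate_ms gives a positive gain.
    """
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("timings must be positive")
    return (baseline_ms / candidate_ms - 1.0) * 100.0


# Hypothetical timings, for illustration only.
without_ht_ms = 10.0
with_ht_ms = 8.0
gain = throughput_gain_percent(without_ht_ms, with_ht_ms)
print(f"{gain:.1f}% more iterations per second")
```

Run the same FFT size in both configurations and compare like with like; a negative result means the setting made things slower.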

Last fiddled with by GP2 on 2017-12-02 at 18:00
Old 2017-12-02, 20:46   #14
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

5562₈ Posts

Quote:
Originally Posted by GP2
Uh, all of it, I think.

George was testing on Knights Landing, I don't recall if any of that made it into the program yet. I don't think he's done any optimizations for "actual" Skylake yet.
Oh you mean the AVX512 stuff. Gotcha.
Old 2017-12-02, 20:50   #15
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1110000110101₂ Posts

Quote:
Originally Posted by Mark Rose
Oh you mean the AVX512 stuff. Gotcha.
I thought AVX512 wasn't available on Skylake (at least not for less than high-end 4 digit prices)?
Old 2017-12-02, 20:51   #16
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

B72₁₆ Posts

Quote:
Originally Posted by Dubslow
I thought AVX512 wasn't available on Skylake (at least not for less than high-end 4 digit prices)?
It's not in the Core 6xxx chips. It is in the Xeon chips and the i9 chips.
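
One way to check a given machine rather than going by model name: on Linux, the kernel lists CPU feature flags in /proc/cpuinfo, and avx512f is the AVX-512 Foundation flag. A minimal Python sketch; the helper name has_avx512f is mine, it is Linux-only, and it says nothing about whether Prime95 actually uses those instructions:

```python
def has_avx512f(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in the text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at lines whose key is exactly "flags", then match whole words.
        if sep and key.strip() == "flags" and "avx512f" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            print("AVX-512F reported:", has_avx512f(f.read()))
    except OSError:
        print("/proc/cpuinfo not readable on this system")
```

The parsing is kept separate from the file read so it can be tried on pasted cpuinfo output from another machine.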
Old 2017-12-02, 23:25   #17
Thratrun
 
Nov 2017

5 Posts

Quote:
Originally Posted by kladner
In local.txt:
Code:
[Worker #1]
Affinity=1,3,5,7
# Affinity=0,2,4,6
The second "Affinity" line is a commented-out example of another set of 'cores' that would use only physical cores.
Note that I run a single worker with 4 cores. If you have more workers, add sections as appropriate: [Worker #2] etc. Also note that, in Windows at least, core numbers start at 0.
I'm running 4 workers, so if I wanted to have 1 physical and 1 HT on each worker, would it look like this? Are the odd-numbered cores the HT cores and the even-numbered ones the physical cores?
[Worker #1]
Affinity=1
[Worker #2]
Affinity=3
[Worker #3]
Affinity=5
[Worker #4]
Affinity=7

Also, I'm running 4 workers because I was told that using all the cores in 1 worker doesn't "add" properly, but in the throughput benchmark the highest throughput was always 4 cores (non-HT) in 1 worker. I don't know if I'm understanding properly what that means, but I think it means that it would work better for me to put all the cores into 1 worker?

I don't know too much about the subject yet, sorry if some of the questions are too obvious.
Old 2017-12-03, 01:03   #18
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA

4868₁₀ Posts

Quote:
Originally Posted by Thratrun
I'm running 4 workers, so if I wanted to have 1 physical and 1 HT on each worker, would it look like this? Are the odd-numbered cores the HT cores and the even-numbered ones the physical cores?
It's more like a modern McDonald's drive-thru, where two lines of cars form but merge into a single lane at the payment window. If nobody is in line, it doesn't matter whether you use the even-numbered or odd-numbered lanes, but setting "affinity" makes sure the program doesn't send two cars to the same payment window by using core #0 and core #1. If you use all the odds, or all the evens, you use all the cores fully (all the payment windows) without sending multiple tasks to the same payment window. Neither lane is any faster or slower, as long as only one lane per payment window is being used. There isn't one designated "this one for HT traffic".

HT works a lot like McDonald's, come to think of it: some parts get faster by interleaving instructions, but the main computation engine (like the payment window) doesn't get any faster. Prime95 does not get any faster by using both lanes, which is why we don't try to use HT; using 4 jobs at once fully uses the entire CPU, while 8 jobs at once just causes a traffic jam without more cars getting through the line.
Old 2017-12-03, 02:21   #19
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

10011110101110₂ Posts

Quote:
Originally Posted by Thratrun
I'm running 4 workers, so if I wanted to have 1 physical and 1 HT on each worker, would it look like this? Are the odd-numbered cores the HT cores and the even-numbered ones the physical cores?
[Worker #1]
Affinity=1
[Worker #2]
Affinity=3
[Worker #3]
Affinity=5
[Worker #4]
Affinity=7

Also, I'm running 4 workers because I was told that using all the cores in 1 worker doesn't "add" properly, but in the throughput benchmark the highest throughput was always 4 cores (non-HT) in 1 worker. I don't know if I'm understanding properly what that means, but I think it means that it would work better for me to put all the cores into 1 worker?

I don't know too much about the subject yet, sorry if some of the questions are too obvious.
Affinity: The local.txt lines you have will run each worker thread on one physical core. The HTs will not be used by Prime95, which is as it should be. I keep HT enabled because other applications seemed to be a bit more responsive while P95 is running, at least by impression.

Cores per Worker: Part of the issue here is that modern CPUs can end up waiting on memory performance. This may be more significant when multiple assignments are competing for memory bandwidth. The benchmarks show results for different combinations of cores/workers.

There can be multiple reasons for choosing your particular setup. I kind of like running a single worker with all four cores, because it completes a 45M double check in 28-30 hours. One benefit to having multiple workers is that you can stop part of P95 to reduce load without stopping the whole process.

EDIT: To add to VBCurtis' excellent analogy, there is no assignment of HT status to core numbers. In the Windows numbering scheme, each adjacent pair of "cores" as seen in Task Manager represents a single physical core. All that matters is to not assign workers to both members of a pair: for instance, both 0 and 1, or both 2 and 3. The affinities I gave are just easy to remember. You could do Affinity=0,3,4,7 or 1,2,5,6 with the same effect.
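
The rule in the EDIT can be checked mechanically. A small Python sketch, assuming the adjacent-pair numbering described above (logical CPUs 2k and 2k+1 share physical core k); other systems may number siblings differently, and conflicting_pairs is an illustrative helper, not anything from Prime95:

```python
from collections import defaultdict


def conflicting_pairs(affinity: list[int]) -> list[tuple[int, ...]]:
    """Return groups of logical CPUs in `affinity` that share a physical core.

    Assumes adjacent-pair numbering: logical CPUs 2k and 2k+1 are the
    two hyperthreads of physical core k.
    """
    by_core = defaultdict(list)
    for cpu in affinity:
        by_core[cpu // 2].append(cpu)
    return [tuple(sorted(cpus)) for _, cpus in sorted(by_core.items()) if len(cpus) > 1]


for candidate in ([1, 3, 5, 7], [0, 3, 4, 7], [1, 2, 5, 6], [0, 1, 2, 4]):
    clashes = conflicting_pairs(candidate)
    verdict = "ok" if not clashes else f"shares a core: {clashes}"
    print(candidate, "->", verdict)
```

The first three lists are the ones mentioned in this thread and each touches every physical core once; the last is a deliberately bad example that puts two threads on core 0.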

Last fiddled with by kladner on 2017-12-03 at 02:36
Old 2017-12-03, 02:26   #20
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1C35₁₆ Posts

Quote:
Originally Posted by VBCurtis
It's more like a modern McDonald's drive-thru, where two lines of cars form but merge into a single lane at the payment window. If nobody is in line, it doesn't matter whether you use the even-numbered or odd-numbered lanes, but setting "affinity" makes sure the program doesn't send two cars to the same payment window by using core #0 and core #1. If you use all the odds, or all the evens, you use all the cores fully (all the payment windows) without sending multiple tasks to the same payment window. Neither lane is any faster or slower, as long as only one lane per payment window is being used. There isn't one designated "this one for HT traffic".

HT works a lot like McDonald's, come to think of it: some parts get faster by interleaving instructions, but the main computation engine (like the payment window) doesn't get any faster. Prime95 does not get any faster by using both lanes, which is why we don't try to use HT; using 4 jobs at once fully uses the entire CPU, while 8 jobs at once just causes a traffic jam without more cars getting through the line.
The extended analogy, of course, is that almost every other program isn't fully optimized, which is something like the other programs taking their sweet time ordering; and when every car takes its sweet time ordering, that's when having two ordering lanes for one payment window is useful. Of course, Prime95 is the exception to the rule: it knows exactly what it wants, always (more or less), so it breezes through the lanes, and using two lanes for one payment window is no benefit.
Old 2017-12-03, 04:42   #21
ewmayer
2ω=0
 
Sep 2002
República de California

2D7E₁₆ Posts

Quote:
Originally Posted by kladner
In the Windows numbering scheme, each adjacent pair of "cores" as seen in Task Manager represents a single physical core. All that matters is to not assign workers to both of, for instance, 0 and 1, 2 and 3, etc. The affinities I gave are just easy to remember. You could do Affinity=0,3,4,7 or 1,2,5,6 with the same effect.
In my own pthread affinity-setting code I need to take account of the different core-numbering conventions used by Intel and AMD, in which on a quad system, physical cores 0-3 map to logical core pairs [0,4],[1,5],[2,6],[3,7] on Intel, versus [0,1],[2,3],[4,5],[6,7] on AMD. It seems Prime95's affinity-setting schema uses a common logical-core-numbering scheme for both manufacturers' chips and does an extra translation step from that user-visible scheme to the respective ones used by Intel and AMD, is that right?
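
The two conventions described here can be written down as a pair of mappings. A Python sketch for illustration only: the function names are mine, the "split" and "adjacent" layouts are taken from the description above rather than from any vendor documentation, and a real system should be queried through the operating system's topology interfaces instead of assumed:

```python
def siblings_split(physical: int, n_physical: int) -> tuple[int, int]:
    """Logical CPUs for a physical core when siblings are numbered n apart.

    With 4 physical cores this gives [0,4], [1,5], [2,6], [3,7].
    """
    if not 0 <= physical < n_physical:
        raise ValueError("physical core index out of range")
    return (physical, physical + n_physical)


def siblings_adjacent(physical: int, n_physical: int) -> tuple[int, int]:
    """Logical CPUs for a physical core when siblings are numbered adjacently.

    With 4 physical cores this gives [0,1], [2,3], [4,5], [6,7].
    """
    if not 0 <= physical < n_physical:
        raise ValueError("physical core index out of range")
    return (2 * physical, 2 * physical + 1)


def one_thread_per_core(layout, n_physical: int) -> list[int]:
    """Pick the first logical CPU of every physical core under a layout."""
    return [layout(core, n_physical)[0] for core in range(n_physical)]


print(one_thread_per_core(siblings_split, 4))
print(one_thread_per_core(siblings_adjacent, 4))
```

A translation step of the kind asked about would amount to choosing one of these mappings per platform, so that the same user-visible core number always lands on a distinct physical core.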
Old 2017-12-03, 04:58   #22
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts

Quote:
Originally Posted by ewmayer
In my own pthread affinity-setting code I need to take account of the different core-numbering conventions used by Intel and AMD, in which on a quad system, physical cores 0-3 map to logical core pairs [0,4],[1,5],[2,6],[3,7] on Intel, versus [0,1],[2,3],[4,5],[6,7] on AMD. It seems Prime95's affinity-setting schema uses a common logical-core-numbering scheme for both manufacturers' chips and does an extra translation step from that user-visible scheme to the respective ones used by Intel and AMD, is that right?
I don't know, regarding the last question. As to core numbering, running a 6700K in Win 7 Pro, they show as
Code:
[core 1]0,1
[core 2]2,3
[core 3]4,5
[core 4]6,7
I believe this was the same when I was running AMD.
Correction: That was an 8 Integer, 4 Floating Point CPU. Not at all the same as HT. However, those cores were paired by FPU in the same order.
I think I am not fully understanding your part about 'extra translation step'.

Last fiddled with by kladner on 2017-12-03 at 05:03
All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.