#12 |
|
"/X\(‘-‘)/X\"
Jan 2013
2·5·293 Posts |
What tuning still needs to be done?
|
|
|
|
|
|
#13 |
|
Sep 2003
5·11·47 Posts |
Uh, all of it, I think.
George was testing on Knights Landing; I don't recall whether any of that has made it into the program yet. I don't think he's done any optimizations for "actual" Skylake yet.
In any case, empirical testing indicated that HyperthreadLL=1 in local.txt produces better performance on c5.large instances (Skylake) on AWS, but not on c4.large instances (Haswell).
Note: mprime runs about 25% faster on c5.large than on c4.large, but presumably that's due entirely to the larger cache and better memory bandwidth, rather than to any tuning of the code.
Last fiddled with by GP2 on 2017-12-02 at 18:00 |
|
|
|
|
|
#14 |
|
"/X\(‘-‘)/X\"
Jan 2013
5562₈ Posts |
|
|
|
|
|
|
#15 |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts |
|
|
|
|
|
|
#16 |
|
"/X\(‘-‘)/X\"
Jan 2013
B72₁₆ Posts |
|
|
|
|
|
|
#17 | |
|
Nov 2017
5 Posts |
Quote:
[Worker #1] Affinity=1
[Worker #2] Affinity=3
[Worker #3] Affinity=5
[Worker #4] Affinity=7
Also, I'm running 4 workers because I was told that using all the cores in 1 worker doesn't "add" properly, but in the throughput benchmark the highest throughput was always 4 cores (non-HT) in 1 worker. I don't know if I'm understanding properly what that means, but I think it means it would work better for me to put all the cores into 1 worker?
I don't know too much about the subject yet, so sorry if some of the questions are too obvious. |
|
|
|
|
|
|
#18 | |
|
"Curtis"
Feb 2005
Riverside, CA
4868₁₀ Posts |
Quote:
HT works a lot like McDonald's, come to think of it: some parts get faster by interleaving instructions, but the main computation engine (like the payment window) doesn't get any faster. Prime95 does not get any faster by using both lanes, which is why we don't try to use HT; running 4 jobs at once fully uses the entire CPU, while 8 jobs at once just cause a traffic jam without more cars getting through the line. |
|
|
|
|
|
|
#19 | |
|
"Kieren"
Jul 2011
In My Own Galaxy!
10011110101110₂ Posts |
Quote:
Cores per Worker: Part of the issue here is that modern CPUs can end up waiting on memory. This may be more significant when multiple assignments are competing for memory bandwidth. The benchmarks show results for different combinations of cores and workers.
There can be multiple reasons for choosing your particular setup. I kind of like running a single worker with all four cores, because it completes a 45M double check in 28-30 hours. One benefit of having multiple workers is that you can stop part of P95 to reduce load without stopping the whole process.
EDIT: To add to VBCurtis' excellent analogy, there is no assignment of HT status to core numbers. In the Windows numbering scheme, each adjacent pair of "cores" as seen in Task Manager represents a single physical core. All that matters is to not assign workers to both members of a pair, for instance both 0 and 1, or both 2 and 3, etc. The affinities I gave are just easy to remember. You could do Affinity=0,3,4,7 or 1,2,5,6 with the same effect.
Last fiddled with by kladner on 2017-12-03 at 02:36 |
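For anyone who wants to sanity-check an affinity list against the pairing rule described in this post, here is a minimal Python sketch. It is not part of Prime95; the function names `physical_core` and `shares_physical_core` are made up for illustration, and it assumes the Windows-style numbering described above, where logical CPUs 0 and 1 sit on the first physical core, 2 and 3 on the second, and so on.

```python
def physical_core(logical_cpu: int, threads_per_core: int = 2) -> int:
    """Map a logical CPU number to its physical core index.

    Assumption: adjacent logical CPUs are hyperthread siblings
    (0,1 -> core 0; 2,3 -> core 1; ...), as in the post above.
    """
    return logical_cpu // threads_per_core


def shares_physical_core(affinities, threads_per_core: int = 2) -> bool:
    """Return True if any two affinities land on the same physical core."""
    cores = [physical_core(a, threads_per_core) for a in affinities]
    return len(set(cores)) != len(cores)


if __name__ == "__main__":
    # The first three lists come from the thread; the last one is a
    # deliberately bad example that doubles up on cores 0 and 1.
    for candidate in ([1, 3, 5, 7], [0, 3, 4, 7], [1, 2, 5, 6], [0, 1, 2, 3]):
        status = "conflict" if shares_physical_core(candidate) else "ok"
        print(candidate, status)
```

The three affinity lists mentioned in the thread each pick exactly one logical CPU per pair, so the check passes for all of them; only the last list puts two workers on the same physical core. On systems that number logical CPUs differently, the pairing assumption would need to be changed.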
|
|
|
|
|
|
#20 | |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1C35₁₆ Posts |
Quote:
|
|
|
|
|
|
|
#21 | |
|
∂²ω=0
Sep 2002
República de California
2D7E₁₆ Posts |
Quote:
|
|
|
|
|
|
|
#22 | |
|
"Kieren"
Jul 2011
In My Own Galaxy!
2×3×1,693 Posts |
Quote:
Code:
[core 1]0,1
[core 2]2,3
[core 3]4,5
[core 4]6,7
Correction: That was an 8 Integer, 4 Floating Point CPU. Not at all the same as HT. However, those cores were paired by FPU in the same order.
I think I am not fully understanding your part about 'extra translation step'.
Last fiddled with by kladner on 2017-12-03 at 05:03 |
|
|
|
|