mersenneforum.org running single tests fast

2015-09-23, 22:00   #1
dragonbud20

Mar 2014

51₁₆ Posts

running single tests fast

I'm trying to help out with the strategic double check project a bit, and I want to figure out the best way to run one or two tests at a time. Currently I'm running 1 test on my 5930k, which has 6 physical cores with hyperthreading to make 12 virtual cores. In the worker windows settings I have #workers set to 1, CPU affinity: "run on any CPU", CPUs to use: "12". Is this the best way to do it? Is there any way to do it better? And I assume that if I'm running two tests I would want to set it to 2 workers and 6 CPUs each.

2015-09-24, 00:11   #2
chalsall
If I May

"Chris Halsall"
Sep 2002

9482₁₀ Posts

Quote:
 Originally Posted by dragonbud20 and I assume that if I'm running two tests I would want to set it to 2 workers and 6 CPUs each.
Nope... Hyperthreads don't help at all; in fact, they slow things down. Prime95/mprime is carefully hand-crafted assembly, so you quickly become memory-bandwidth bound.

The next critical thing is "real, on-chip" CPU affinity. Madpoo can speak to what is needed under Windows.

If you have more than one physical CPU, then (and only then) do you go to multiple tests.

Last fiddled with by chalsall on 2015-09-24 at 00:11 Reason: Closed quote.

2015-09-24, 02:27   #3
dragonbud20

Mar 2014

3⁴ Posts

Quote:
 Originally Posted by chalsall Nope... Hyperthreads don't help at all; slow things down in fact. Prime95/mprime is just too carefully hand-crafted assembly; you quickly get memory bandwidth bound. The next critical thing is "real, on-chip" CPU affinity. Madpoo can speak to what is needed under Windows. If you have more than one physical CPU, then (and only then) is when you go to multiple tests.
Hmm... so my default settings run 6 workers, each using one physical core on my CPU. Is this actually sub-optimal? If this isn't the best way to run tests on my particular CPU, what is?

2015-09-24, 02:44   #4
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

4678₁₀ Posts

Quote:
 Originally Posted by dragonbud20 hmm... so my default settings are to run 6 workers with each one using one physical core on my CPU is this actually sub-optimal? If this isn't the best way to go about doing tests what is on my particular CPU?
Experiment to find the best overall production. The work is memory-bound, but different CPU/memory combos have different best setups. Try 6x1, 3x2 (2 threads per test), 2x3 (3 threads per test), and see what produces the most work done. You may also find that first-time LL tests have a different optimal setup than DCs, as smaller FFTs tend to saturate memory less than larger ones.

You may also find that, say, 5 LL tests give as much production as 6 tests, freeing that 6th core to do something else from this forum; ECM in particular is not very memory-bandwidth intensive.
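
The comparison described above can be sketched in a few lines of Python. This is only an illustration: the layout names and millisecond timings are hypothetical placeholders, not measurements from any machine in this thread, and it assumes every worker in a layout runs the same FFT size so that iterations are comparable.

```python
from typing import Dict, Tuple


def iterations_per_second(workers: int, ms_per_iter: float) -> float:
    """Total iterations/second across all workers of one layout."""
    return workers * 1000.0 / ms_per_iter


def best_layout(timings: Dict[str, Tuple[int, float]]) -> str:
    """Return the layout name with the highest combined throughput.

    timings maps a layout name to (number of workers, ms per iteration
    reported by each worker).
    """
    return max(timings, key=lambda name: iterations_per_second(*timings[name]))


# Hypothetical numbers: replace with the iteration times you observe.
measured = {
    "6x1": (6, 30.0),   # 6 workers, 1 thread each
    "3x2": (3, 14.0),   # 3 workers, 2 threads each
    "2x3": (2, 9.5),    # 2 workers, 3 threads each
}

for name, (workers, ms) in measured.items():
    print(name, round(iterations_per_second(workers, ms), 1), "iter/s")
print("best:", best_layout(measured))
```

The point of summing over workers is that a layout with slower individual tests can still finish more work per day overall.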

2015-09-24, 02:50   #5
chalsall
If I May

"Chris Halsall"
Sep 2002

2×11×431 Posts

Quote:
 Originally Posted by dragonbud20 If this isn't the best way to go about doing tests what is on my particular CPU?
It is all rather complicated, I'm afraid. Just about every machine is different: L1 and L2 (possibly L3) caches, speed of memory, interleaved memory, other processing demands, etc., etc., etc.

The best way to find the "sweet spot" for optimal configuration is empirical testing. Try different settings and note the speeds achieved. Do the math (read: build a spreadsheet to record data). Rinse and repeat (as necessary).

Prime95/mprime's benchmarking option helps a lot in this analysis, but I found that it ignored custom Affinity settings (at least under Linux) so I ended up running the same candidate dozens of times just to establish a baseline.

And, to put it on the table, I /only/ do Linux. Again, Madpoo is a better man to speak about optimizing Windows.

2015-09-24, 04:49   #6
LaurV
Romulan Interpreter

Jun 2011
Thailand

5²·7·53 Posts

Quote:
 Originally Posted by dragonbud20 I have the worker windows setting with #workers set to 1, CPU affinity: "run on any CPU", CPUs to use:"12". Is this the best way to do it? is there any way to do it better? and I assume that if I'm running two tests I would want to set it to 2 workers and 6 CPUs each.
A better way for one worker is "run on any CPU, CPUs to use: 6"
Accordingly, for two workers use "CPUs to use: 3"

HT will only muddy the waters, without bringing any benefit. P95 is optimized enough to take advantage of the full core without HT. HT only helps programs that don't use the core fully (leaving idle ticks), so splitting it in two gives other processes/tasks/programs the opportunity to use the free ticks. That is not the case for P95: one of the logical cores waits because the other one uses all the cycles, then after the first finishes it waits for the second, then some more time is spent synchronizing them with each other, etc. So HT does not bring any benefit. On the contrary.

Note that I didn't say "the best way", but "a better way". The best way for you may be to reduce the number of cores further, if you have slow memory, for example, or not enough channels. But for sure, using 6 cores with a single worker is a better way than using 12 cores. You can use Options/Benchmark from the P95 menu, compute the output speed and the time you need in each case, and see for yourself which way is better for your system.
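
To turn a benchmark figure into "the time you need", a minimal sketch follows. It relies on the fact that a Lucas-Lehmer test of the Mersenne number with exponent p takes p − 2 squaring iterations; the function name and the example exponent and timing are illustrative assumptions, not values from this thread.

```python
def ll_days(exponent: int, ms_per_iter: float) -> float:
    """Estimated wall-clock days for one LL test at a steady iteration time."""
    iterations = exponent - 2                 # LL test of M_p: p - 2 squarings
    seconds = iterations * ms_per_iter / 1000.0
    return seconds / 86400.0                  # 86400 seconds per day


# Illustrative only: an exponent near 40 million at two hypothetical speeds.
for label, ms in (("6 CPUs", 5.0), ("12 CPUs", 5.5)):
    print(label, round(ll_days(40_000_000, ms), 2), "days")
```

Running this for each setting you benchmark gives a direct days-per-test comparison instead of comparing raw milliseconds.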

2015-09-25, 02:31   #7
dragonbud20

Mar 2014

1010001₂ Posts

Hmm, so I've been doing a bit of testing and it seems that 12 CPUs is about 15% faster than 6 CPUs. The cause seems to be that P95 is only hitting about 50% CPU utilization. Any idea how to get better utilization? Also, when running more than one worker, what is the difference between smart assignment and run on any CPU, both technically and performance-wise?

2015-09-25, 05:14   #8
sdbardwick

Aug 2002
North San Diego County

2×11×31 Posts

Quote:
 Originally Posted by dragonbud20 hmm so I've been doing a bit of testing and it seems that 12 CPUs is a 15% faster than 6CPUs the cause of this seems to be that P95 is only hitting about 50% CPU utilization any idea how to get better utilization? Also when running more than one worker what is the difference between smart assignment and run on any CPU both technically and performance wise?
A 15% increase using hyperthreaded cores strongly indicates that the threads are being allocated less than optimally. That is, some of the threads are running on logical cores that share the same physical core.
For example, the 6C12T box I use to quickly check single exponents is set up like this:
Code:
CPU Affinity: CPU #1
CPUs to use: 6
AffinityScramble2=13579B02468A (in the local.txt file)
This forces Prime95 to use 1 unshared logical core per thread, and gives me the best throughput. Windows Task Manager shows about 50% CPU use, with every other logical core at about 100%.

If I turn hyperthreading off in the BIOS (and take out the AffinityScramble2 line), then Task Manager shows about 100% CPU use, and the iteration times don't change materially.
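
As a rough way to read the AffinityScramble2 string above, the sketch below decodes it under two assumptions that are not confirmed in this thread: that each character is a single base-36 digit naming a logical CPU (so B is 11 and A is 10), and that logical CPUs 2k and 2k+1 are the two hyperthreads of physical core k. The helper names are hypothetical.

```python
from typing import List


def decode_scramble(value: str) -> List[int]:
    """Decode each character as one base-36 digit (assumed encoding)."""
    return [int(ch, 36) for ch in value]


def physical_core(logical_cpu: int, threads_per_core: int = 2) -> int:
    """Assumed numbering: consecutive logical CPUs share a physical core."""
    return logical_cpu // threads_per_core


order = decode_scramble("13579B02468A")
first_six = order[:6]                      # the CPUs six threads would land on
cores_used = {physical_core(cpu) for cpu in first_six}

print("logical CPU order:", order)
print("first six threads use physical cores:", sorted(cores_used))
```

Under those assumptions the first six entries fall on six different physical cores, which matches the "1 unshared logical core per thread" behaviour described in the post.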

Last fiddled with by sdbardwick on 2015-09-25 at 05:21

2015-09-25, 05:30   #9
LaurV
Romulan Interpreter

Jun 2011
Thailand

5²×7×53 Posts

Quote:
 Originally Posted by dragonbud20 50% CPU utilization
That is a Windows "bug". It is not really a bug, but it is how Windows counts the cores: you only use 6 of 12 possible, so no matter what you do with them, you will not be able to see more than 50% occupancy. Your CPU is used close to 100% in that case, don't worry. Windoze distinguishes between a "single physical core" and "two logical cores running on a single physical core", and it sees the latter as two cores (which is arguably wrong, but that is the case). If you disable HT in the BIOS, you will most probably still get your 15% more speed, on 6 cores only. In fact, it depends on many other things, including what else you are running on the (other) logical cores at the same time. But the result may also mean that you are not memory-bandwidth limited, which is good. You may also run tests to see whether your 15% higher speed comes with 70% higher heat and/or current consumption (which is usually the case with HT; the consensus here is that for LL, HT is not good, but of course it is up to you, and all computers/systems are different).
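
The 50% figure is plain arithmetic, sketched here with a hypothetical helper: the reported percentage is taken over logical CPUs, so six fully busy threads on a 12-logical-CPU machine can never show more than half.

```python
def reported_utilization(busy_logical: int, total_logical: int) -> float:
    """Overall CPU percentage when busy_logical CPUs are 100% busy."""
    return 100.0 * busy_logical / total_logical


print(reported_utilization(6, 12))  # HT on: 6 busy threads, 12 logical CPUs
print(reported_utilization(6, 6))   # HT off: the same 6 threads, 6 logical CPUs
```

The same six threads therefore read as half load with HT enabled and full load with it disabled, even though the physical cores are equally busy in both cases.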

Edit: Crosspost, I started to reply before the last post was made. We are in a divergent agreement, like someone here used to say, so I will let my post be.

Last fiddled with by LaurV on 2015-09-25 at 05:42

2015-09-25, 06:41   #10
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

2·2,339 Posts

I think you mean "violent agreement". dragonbud should learn to not trust what Windows tells him.

2015-09-25, 06:49   #11
LaurV
Romulan Interpreter

Jun 2011
Thailand

243B₁₆ Posts

Quote:
 Originally Posted by VBCurtis "violent agreement"
Thanks, you don't know how much I wanted to remember that expression, and I still forgot it! Grrrr... I am really getting older.
