mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   More efficient to reduce worker count? (https://www.mersenneforum.org/showthread.php?t=20561)

CuriousKit 2015-10-21 20:05

More efficient to reduce worker count?
 
Hi everyone,

I have a quad-core Intel Core i7-4770 at 3.40 GHz running 64-bit Windows 7. With 4 worker threads running, I'm currently averaging about 24 to 25 ms/iteration for first-time checks (in the 60M range) and about 14 ms/iteration for double checks (in the 35M range). If I turn off one of the worker threads, these times drop to 20 ms/iteration for first-time checks and 11 ms/iteration for double checks. Of course, I'm still learning the quirks of the thread scheduler, but for a computer that is also used as a work machine, this seems to be quite a boost in efficiency, although I'm not certain whether it equals or surpasses having an extra worker thread.

What are people's experiences with this?

aurashift 2015-10-21 21:20

Don't have time to post an explanation, but being very succinct: google interrupts, hyperthreading, and turbo boost.

petrw1 2015-10-21 21:24

[QUOTE=CuriousKit;413273]Hi everyone,

I have a quad-core Intel Core i7-4770 at 3.40 GHz running 64-bit Windows 7. With 4 worker threads running, I'm currently averaging about 24 to 25 ms/iteration for first-time checks (in the 60M range) and about 14 ms/iteration for double checks (in the 35M range). If I turn off one of the worker threads, these times drop to 20 ms/iteration for first-time checks and 11 ms/iteration for double checks. Of course, I'm still learning the quirks of the thread scheduler, but for a computer that is also used as a work machine, this seems to be quite a boost in efficiency, although I'm not certain whether it equals or surpasses having an extra worker thread.

What are people's experiences with this?[/QUOTE]

Unless you are seriously RAM constrained (as my only 4770 is), stopping a worker or two will only "slightly" improve iteration times, as you note, though going from 25 to 20 ms seems a bigger drop than I would expect. You will still get more total thruput with 4 workers at 25 ms than with 3 at 20 ms.

And don't worry about the machine doing "real" work too. Prime95 runs at the lowest priority, and unless you are running memory-intensive work like P-1 or large ECM, you will NOT notice any impact on your work.

petrw1 2015-10-21 21:25

[QUOTE=aurashift;413282]Don't have time to post an explanation, but being very succinct: google interrupts, hyperthreading, and turbo boost.[/QUOTE]

Don't use Hyper-threading in Prime95.
If you have 4 physical cores, only run 4 Prime95 workers (not the 8 that Hyper-threading proposes).

aurashift 2015-10-21 23:50

[QUOTE=petrw1;413284]Don't use Hyper-threading in Prime95.
If you have 4 physical cores, only run 4 Prime95 workers (not the 8 that Hyper-threading proposes).[/QUOTE]

I didn't propose that he use HT, just to look it up. If you don't leave a [I]physical[/I] core free, you'll be interrupting p95 so that your computer can do its everyday tasks, which will be scheduled on a [I]logical[/I] core. I'd recommend giving Intel Power Gadget a try so you can see how turbo boost, heat, and load affect your CPU performance in real time.

Mark Rose 2015-10-22 03:39

I also have a 4770 and I have the same experience, with hyperthreading enabled.

3 cores slightly outperform 4.

Madpoo 2015-10-22 03:44

[QUOTE=CuriousKit;413273]Hi everyone,

I have a quad-core Intel Core i7-4770 at 3.40 GHz running 64-bit Windows 7. With 4 worker threads running, I'm currently averaging about 24 to 25 ms/iteration for first-time checks (in the 60M range) and about 14 ms/iteration for double checks (in the 35M range). If I turn off one of the worker threads, these times drop to 20 ms/iteration for first-time checks and 11 ms/iteration for double checks. Of course, I'm still learning the quirks of the thread scheduler, but for a computer that is also used as a work machine, this seems to be quite a boost in efficiency, although I'm not certain whether it equals or surpasses having an extra worker thread.

What are people's experiences with this?[/QUOTE]

Besides the advice others gave, here's my experience with it.

For me, I would run two workers, each one using all the cores on its own CPU (it's a dual CPU system).

I experienced exactly what you're describing if one of the two workers was doing an LL test on any exponent > 58M. If that's the case, the other worker would need to be doing a DC on something below 38M otherwise they would both slow down a lot.

My theory had to do with memory contention or something like that, but whatever... I just know to either keep both workers doing something below 58M, or if I must do something larger, set the other one to a small double-check task.

I recently set some of my "lower end" systems to do a single core per worker, which means I have some with 8 or 12 workers going. I've found that the same basic rule applies... if any single worker is doing a test > 58M then it'll slow down anything else doing something > 38M. As long as I have all of them doing small work (double checks below 50M or so) I don't have to worry about it.

There is an exception to this, and that's on my newer system with dual Xeon E5-2697 v3 and DDR4 RAM. Not sure if it's the larger L1/L2/L3 caches, the faster DDR4, or something else, but that > 58M limit doesn't apply.

Right now I've got it set for two workers (using all the cores on either CPU)... one is doing a 100M test and the other a 60M and they don't interfere with each other. I haven't really pushed it yet to see what the limit is for it... if I ever have the need I'll experiment to see where it winds up.
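The rule of thumb described in this post can be written down as a small check. This is only a sketch: the function name is hypothetical, and the 58M and 38M thresholds are Madpoo's observations on his older systems, not documented Prime95 limits. As noted above, they did not hold on his newer DDR4 machine.

```python
from typing import List

# Empirical thresholds from the post above; they are assumptions that
# depend on the memory subsystem, not constants defined by Prime95.
BIG_EXPONENT = 58_000_000
SMALL_EXPONENT = 38_000_000


def likely_to_contend(exponents: List[int],
                      big: int = BIG_EXPONENT,
                      small: int = SMALL_EXPONENT) -> bool:
    """Return True if one worker exceeds `big` while another exceeds `small`."""
    for i, exponent in enumerate(exponents):
        if exponent > big:
            # Look at every worker except the large one itself.
            others = exponents[:i] + exponents[i + 1:]
            if any(other > small for other in others):
                return True
    return False


# One large first-time test next to a small double check.
print(likely_to_contend([60_000_000, 35_000_000]))
# One large first-time test next to a mid-sized test.
print(likely_to_contend([60_000_000, 40_000_000]))
```

Passing different `big` and `small` values makes it easy to record whatever limits a particular machine turns out to have after testing.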

axn 2015-10-22 04:00

[QUOTE=CuriousKit;413273]I have a quad-core Intel Core i7-4770 at 3.40 GHz running 64-bit Windows 7. With 4 worker threads running, I'm currently averaging about 24 to 25 ms/iteration for first-time checks (in the 60M range) and about 14 ms/iteration for double checks (in the 35M range). If I turn off one of the worker threads, these times drop to 20 ms/iteration for first-time checks and 11 ms/iteration for double checks. Of course, I'm still learning the quirks of the thread scheduler, but for a computer that is also used as a work machine, this seems to be quite a boost in efficiency, although I'm not certain whether it equals or surpasses having an extra worker thread.[/QUOTE]

4 workers at 24-25 ms/iter = roughly 160-167 iters/sec thruput
3 workers at 20 ms/iter = 150 iters/sec thruput

4 workers at 14 ms/iter = roughly 286 iters/sec thruput
3 workers at 11 ms/iter = roughly 273 iters/sec thruput

That suggests you're very nearly memory bottlenecked, but 4 workers are still better than 3 in terms of total productivity.
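The comparison above is just workers divided by time per iteration. A minimal sketch for redoing it with your own timings, assuming every worker runs the same kind of work at the same speed (the function name is made up for illustration):

```python
def throughput(workers: int, ms_per_iter: float) -> float:
    """Total iterations per second summed over all workers."""
    return workers * 1000.0 / ms_per_iter


# Timings reported in the first post (ms per iteration).
cases = {
    "first-time checks, 4 workers": (4, 25.0),
    "first-time checks, 3 workers": (3, 20.0),
    "double checks, 4 workers": (4, 14.0),
    "double checks, 3 workers": (3, 11.0),
}

for label, (workers, ms) in cases.items():
    print(f"{label}: {throughput(workers, ms):.1f} iters/sec")
```

If the 3-worker figure ever comes out higher than the 4-worker one on your machine, dropping a worker is the better choice for total output.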

What is your RAM spec?

CuriousKit 2015-10-22 15:03

I have 16 GB of RAM on the machine in question.

chalsall 2015-10-22 15:11

[QUOTE=CuriousKit;413335]I have 16 GB of RAM on the machine in question.[/QUOTE]

Dual channel? As in, pair(s) of "sticks"? I assume so based on the amount of RAM you have, but it's an important variable.

Also, what speed?

Another thing which is absolutely critical in optimizing Prime95/mprime is to get the affinity correct. At least under Linux (mprime) I have found that without explicitly setting the affinity manually (via the AffinityScramble2 parameter in local.txt) the processes can often jump around, sometimes ending up with hyperthreaded "virtual" cores being used.

petrw1 2015-10-22 15:50

[QUOTE=aurashift;413295]I didn't propose that he use HT, just to look it up. [/QUOTE]

Yes I understood it that way....maybe I was quick to simply say DON'T...but I have seen VERY VERY little commentary to the contrary in this forum.

petrw1 2015-10-22 15:51

[QUOTE=Mark Rose;413301]I also have a 4770 and I have the same experience, with hyperthreading enabled.

3 cores slightly outperform 4.[/QUOTE]

This seems odd but not impossible....the proof is in the pudding.

CuriousKit 2015-10-22 21:42

[QUOTE=chalsall;413337]Dual channel? As in, pair(s) of "sticks"? I assume so based on the amount of RAM you have, but it's an important variable.

Also, what speed?

Another thing which is absolutely critical in optimizing Prime95/mprime is to get the affinity correct. At least under Linux (mprime) I have found that without explicitly setting the affinity manually (via the AffinityScramble2 parameter in local.txt) the processes can often jump around, sometimes ending up with hyperthreaded "virtual" cores being used.[/QUOTE]

The 16 GB is split over four 4 GB RAM chips. I confess I don't know the brand or speed off the top of my head and will have to look inside to be sure.

I have noticed that Prime95 doesn't always know which logical CPUs apply to which core - sometimes it can work it out, but other times it can't, and just assumes 1 and 2 for core 1, 3 and 4 for core 2 etc (which is also what it determines upon working it out). My current settings are to use 4 worker threads, but with no multithreading (CPUs to use = 1). What would you recommend I set the affinity scramble to in order to ensure the worker threads stay out of the hyperthreaded virtual cores? Currently I've hacked it a bit by using Task Manager to uncheck what I assumed to be the hyperthreaded virtual cores in the process affinity for Prime95.

I figure I need to get a program that better identifies the CPU and its stats in order to make a more informed decision - any recommendations?

chalsall 2015-10-22 21:59

[QUOTE=CuriousKit;413380]I figure I need to get a program that better identifies the CPU and its stats in order to make a more informed decision - any recommendations?[/QUOTE]

No. You need many programs. Experiment. Test your theories. Rinse. Repeat.

Test. Rerun tests. Then run again (changing only one variable).

Always amuse you are wrong.

petrw1 2015-10-22 22:08

[QUOTE=chalsall;413381]Always amuse you are wrong.[/QUOTE]

Freudian slip?

kladner 2015-10-22 22:08

I see what you did there.

chalsall 2015-10-22 22:23

[QUOTE=petrw1;413382]Freudian slip?[/QUOTE]

Nope.

A true researcher must assume they might be wrong, but also that they might be correct...

And be prepared to answer the questions asked of them....

Cue the Monty Python "Run away, run away" skit....

Mark Rose 2015-10-22 22:39

Assumptions often lead to amusement.

CuriousKit 2015-10-22 23:35

I'm guessing that since Prime95 cannot reliably map Windows logical CPU identifiers to individual cores, there isn't a single means of doing so. Experimentation seems like a good idea - I just wanted to avoid doing it unnecessarily if the answers are already published somewhere. I'll see what I find though.

Worst comes to the worst, I'll play around with the Windows API and the CPUID opcode myself!

chalsall 2015-10-22 23:59

[QUOTE=CuriousKit;413397]Worst comes to the worst, I'll play around with the Windows API and the CPUID opcode myself![/QUOTE]

What this really comes down to is Prime95 / mprime doesn't actually optimize the CPU usage as well as was expected without human intervention.

No disrespect intended towards George.

But hand optimizing the affinity goes a really long way.

Madpoo 2015-10-23 22:23

[QUOTE=CuriousKit;413397]I'm guessing that since Prime95 cannot reliably map Windows logical CPU identifiers to individual cores, there isn't a single means of doing so. Experimentation seems like a good idea - I just wanted to avoid doing it unnecessarily if the answers are already published somewhere. I'll see what I find though.

Worst comes to the worst, I'll play around with the Windows API and the CPUID opcode myself![/QUOTE]

There is a way in Windows, and the Sysinternals program [URL="https://technet.microsoft.com/en-us/sysinternals/cc835722.aspx"]CoreInfo[/URL] does a great job of showing how that works.

The method used to get the details hasn't always existed in Windows which is why, I assume, George didn't use that in the code. Seems to me that since it is there now, that should be the first option to figure it out and only if it's an older OS (Windows 2000 or something?) should it revert to the much cruder "timing method".

Ideally Prime95 would also simply exclude hyperthreads from consideration at all, pretending they don't exist when it comes time to setting how many workers a machine can have.
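A minimal sketch of what "excluding hyperthreads" could look like once the logical-CPU-to-core mapping is known. On Windows that mapping is reported by the `GetLogicalProcessorInformation` API, and CoreInfo displays the same relationship. Python's standard library does not wrap that call, so the sketch takes the mapping as input. The function name and the example layout are assumptions for illustration, not Prime95 code.

```python
from collections import defaultdict
from typing import Dict, List


def one_cpu_per_core(core_of: Dict[int, int]) -> List[int]:
    """Given {logical_cpu: physical_core_id}, keep the lowest logical CPU of each core."""
    groups = defaultdict(list)
    for cpu, core in core_of.items():
        groups[core].append(cpu)
    # One representative per physical core; the rest are hyperthread siblings.
    return sorted(min(cpus) for cpus in groups.values())


# Assumed layout for a 4-core CPU with Hyper-threading enabled:
# logical CPUs 0 and 1 share core 0, 2 and 3 share core 1, and so on.
example_layout = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
print(one_cpu_per_core(example_layout))
```

The number of entries returned is also the worker count a machine would be offered if hyperthreads were ignored.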

Anyway, the setting I would use on a system with 4 physical cores is AffinityScramble2=02461357. Your physical cores in Windows are going to be 0, 2, 4 and 6, and your HT cores are 1, 3, 5 and 7. You "map" the HT cores last, so when you set the affinity for each of your four workers you can just use 0, 1, 2 and 3, and they point to the first 4 cores you've mapped out.

Otherwise if you leave the affinity at 01234567 you have to remember to set each worker to have an Affinity that matches a physical core, like 0, 2, 4 and 6.

I guess either way you'd want to set an Affinity= entry on each worker, but what affinity you use depends on how you've mapped your cores.

Confusing? Yeah, probably... it could be made clearer or just handled in the code for the optimal settings, but leaving the option to override "just in case".
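The mapping described above can be sketched as a simple lookup. This assumes the scramble string works the way this post describes it (position i names the OS logical CPU that affinity value i refers to) and that every entry is a single digit. It has not been checked against the Prime95 source, and the function name is hypothetical.

```python
from typing import List


def logical_cpus_for_workers(scramble: str, affinities: List[int]) -> List[int]:
    """Translate each worker's affinity value through the scramble string."""
    # Position i of the string names the OS logical CPU that affinity i refers to.
    return [int(scramble[a]) for a in affinities]


# Scrambled map: workers simply use affinities 0, 1, 2 and 3.
print(logical_cpus_for_workers("02461357", [0, 1, 2, 3]))

# Default map: workers must be given 0, 2, 4 and 6 explicitly.
print(logical_cpus_for_workers("01234567", [0, 2, 4, 6]))
```

Under those assumptions both configurations put the four workers on the same logical CPUs, which is the point of the post: the scramble only changes which affinity numbers you have to remember.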

CuriousKit 2015-10-24 03:40

No worries. It's something I need to pick up anyway, since my own programming is getting more advanced. Heck, maybe if I can get something consistent to work in regards to working out which logical CPUs map onto the same core, I can e-mail George with a proposed bit of code or something, or just post it on here somewhere. I don't know, whatever helps. In the meantime, I'll use the affinity override as suggested (yes, it is good to have a manual override no matter what, in cases where the automated method doesn't work for some reason, or if you want to deliberately override it to test something).


All times are UTC.