mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Hardware Benchmark Test Thread for 100M exponents (https://www.mersenneforum.org/showthread.php?t=13185)

TheMawn 2014-11-21 02:55

I don't have access to a hyperthreaded CPU, so maybe someone who has experience with that can help.

Has running a worker on a physical core and its logical friend as a helper been pretty much proven to be no faster than running on the single core? I know that trying to run 8 workers on 4 physical cores just results in more heat, but 4 workers on 8 logical cores, I don't remember.

petrw1 2014-11-21 03:14

[QUOTE=TheMawn;388135]I don't have access to a hyperthreaded CPU, so maybe someone who has experience with that can help.

Has running a worker on a physical core and its logical friend as a helper been pretty much proven to be no faster than running on the single core? I know that trying to run 8 workers on 4 physical cores just results in more heat, but 4 workers on 8 logical cores, I don't remember.[/QUOTE]

In almost every case, enabling Hyper-Threading (2 logical cores per physical) at best makes no difference and usually reduces throughput by a few percent.

tha 2014-11-21 09:45

[QUOTE=petrw1;388136]In almost every case, enabling Hyper-Threading (2 logical cores per physical) at best makes no difference and usually reduces throughput by a few percent.[/QUOTE]

And as far as I understand it, the reason is that programs written in assembler are so optimized that there is no 'air' in them, no idle execution slots that could be squeezed out by trying to fit two threads of code through a single core, in contrast to compiled code.

VBCurtis 2014-11-21 18:32

[QUOTE=tha;388152]And as far as I understand it, the reason is that programs written in assembler are so optimized that there is no 'air' in them, no idle execution slots that could be squeezed out by trying to fit two threads of code through a single core, in contrast to compiled code.[/QUOTE]

I have had good luck using HT to interleave a high-bandwidth program (LLR or Prime95) with a lower-bandwidth program (GMP-ECM or NFS sieving). It appears that doing these other tasks helps to hide memory saturation, such that Prime95 hardly slows down but ECM does meaningful work. HT perhaps schedules ECM cycles while Prime95 would have waited for data from main memory.

I found best speeds by manually assigning cores so that ECM was never on its own core.
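For anyone wanting to try the same manual assignment on Linux, here is a minimal sketch (my own illustration, not VBCurtis's actual setup) that reads the hyperthread sibling list from sysfs and pins the current process to one physical core plus its HT twin, the way you would co-locate a low-bandwidth helper with a Prime95 worker:

```python
import os

def ht_siblings(cpu):
    """Logical CPUs sharing a physical core with `cpu`, per Linux sysfs."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    try:
        text = open(path).read().strip()
    except OSError:
        return {cpu}  # no topology info available; assume no HT sibling
    cpus = set()
    for part in text.split(","):
        if "-" in part:               # range entries like "0-1"
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:                         # single entries like "0" in "0,4"
            cpus.add(int(part))
    return cpus

# Pin this process to CPU 0 plus whatever shares its physical core.
# (pid 0 means "the calling process"; this call is Linux-only.)
os.sched_setaffinity(0, ht_siblings(0))
print(sorted(os.sched_getaffinity(0)))
```

A child process (say, GMP-ECM) launched from a parent pinned this way inherits the affinity mask, so it competes only with whatever shares that one core.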

TheMawn 2014-11-21 23:14

[QUOTE=VBCurtis;388181]I have had good luck using HT to interleave a high-bandwidth program (LLR or Prime95) with a lower-bandwidth program (GMP-ECM or NFS sieving). It appears that doing these other tasks helps to hide memory saturation, such that Prime95 hardly slows down but ECM does meaningful work. HT perhaps schedules ECM cycles while Prime95 would have waited for data from main memory.

I found best speeds by manually assigning cores so that ECM was never on its own core.[/QUOTE]

Now that's a pretty neat idea.

Madpoo 2015-02-26 03:19

Dual E5-2690v2 (@ 3 GHz)
 
I went ahead and tried this exponent out on my dual-socket E5-2690v2 system (it's a DL380p Gen8).

It's running Windows Server 2012 R2 for what that's worth. Hyperthreading is enabled but I know that does pretty much nothing for P95. It does make cycles available for the OS though.

Each CPU is 10 physical cores + HT, and there are 2 CPU's installed.

At 10 threads it hits its stride at 79 days and 4 hours, and it pretty much stays in that 79-80 day range all the way up through 20 threads. Past 20 threads, once it starts using the HT cores, it also tends to stay in that same range, sometimes showing a bit longer (up to 100 days), depending on how throttled the system seemed to be at the time.

I thought I remembered doing some benchmarks on the same system a while back and seeing it scale up (slowly, but improving) all the way up through the 20 real cores even though it spanned CPU sockets. I guess that was on a smaller FFT size where the mem bandwidth between sockets wasn't a big limiting factor. Or I just misremembered.

Either way, it did enlighten me: when I'm knocking out some of these oddball exponents (those needing triple-checks, etc.), I can do a lot better than I have been doing by just throwing all 20 cores at one exponent. I get the same performance with only 10, and I can run two of those at once.

For these things I'm doing I want a single exponent tested as fast as possible because I'm collecting stats on triple-checks, or doing double-checks on "suspicious" results, so that works well for me. I know I could get more total throughput by splitting it up more.
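The trade-off above can be put as rough arithmetic (the 79 days is from the timings quoted earlier; the per-year figures are just division, assuming tests run back to back):

```python
# One exponent takes ~79 days whether the worker gets 10 or 20 cores,
# so two 10-core workers running in parallel roughly double throughput.
days_per_exponent = 79
one_worker_all_cores = 365 / days_per_exponent    # exponents per year
two_workers_split = 2 * 365 / days_per_exponent   # exponents per year
print(round(one_worker_all_cores, 2), round(two_workers_split, 2))
```

That works out to about 4.6 versus 9.2 exponents per year, the same single-exponent latency either way.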

TheJudger 2015-02-26 20:30

For (current) NUMA systems it is usually NOT a good idea to run a single worker across multiple CPU sockets (NUMA nodes to be more specific).

Oliver
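On Linux, one common way to enforce that is numactl, which binds both the threads and the memory allocations of a worker to a single node. A hypothetical sketch that only builds the command lines (the binary path and node list are made up for illustration):

```python
def per_node_commands(binary, nodes):
    """One launch command per NUMA node, CPUs and memory both bound to it."""
    return [
        f"numactl --cpunodebind={node} --membind={node} {binary}"
        for node in nodes
    ]

# e.g. a dual-socket box like the one above: one worker per socket
for cmd in per_node_commands("./mprime", [0, 1]):
    print(cmd)
```

Binding memory as well as CPUs matters: without --membind, a worker can still end up with its FFT data on the remote node's RAM.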

ATH 2015-02-26 23:12

I'm curious what is the timing for this 100M digit exponent on a Titan GPU?

TheMawn 2015-02-27 02:30

Yeah, that's the idea of the benchmark: look at total iter/sec throughput, see where the scaling stops, and use that to decide how many cores to give each worker.

I'm not entirely surprised that 20 cores is no faster than 10 cores.
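That selection rule can be sketched in a few lines (the iter/sec numbers below are invented for illustration, not measurements from this thread): take the smallest core count whose total throughput is within a few percent of the best observed.

```python
def smallest_good_count(throughput, tol=0.05):
    """Smallest core count within `tol` of the best total throughput."""
    best = max(throughput.values())
    return min(c for c, t in throughput.items() if t >= best * (1 - tol))

iters_per_sec = {5: 6.1, 10: 11.8, 15: 12.0, 20: 12.1}  # made-up numbers
print(smallest_good_count(iters_per_sec))  # -> 10: scaling stops near 10 cores
```

With those numbers, cores beyond 10 buy under 3% more throughput, so two 10-core workers beat one 20-core worker on total output.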

Madpoo 2015-02-27 04:15

[QUOTE=TheJudger;396474]For (current) NUMA systems it is usually NOT a good idea to run a single worker across multiple CPU sockets (NUMA nodes to be more specific).
[/QUOTE]

Yeah, true enough, and pretty obvious once I thought about it.

Right now I'm having difficulty figuring out why Prime95 doesn't seem to split cores apart correctly. If, for instance, I want one worker using 10 cores and another using 10 cores, it should put one worker's 10 cores all on one CPU and the other worker's 10 on the other CPU.

When I tried that, it seemed to flood one NUMA node with all 20 cores (including the HT). I posed this question in another thread so I won't belabor it here, but needless to say I'm trying to wrap my head around just how the AffinityScramble2 setting works. I'm probably psyching myself out by thinking about it in terms of dual CPUs rather than just multiple cores on one CPU. Either way, it's puzzling.

Robert_JD 2015-02-27 11:02

[QUOTE=ATH;396492]I'm curious what is the timing for this 100M digit exponent on a Titan GPU?[/QUOTE]

Approximately 50 days.

