mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2014-11-21, 02:55   #144
TheMawn
May 2013
East. Always East.
I don't have access to a hyperthreaded CPU, so maybe someone who has experience with that can help.

Has running a worker on a physical core and its logical friend as a helper been pretty much proven to be no faster than running on the single core? I know that trying to run 8 workers on 4 physical cores just results in more heat, but 4 workers on 8 logical cores, I don't remember.
2014-11-21, 03:14   #145
petrw1
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada

Quote:
Originally Posted by TheMawn
I don't have access to a hyperthreaded CPU, so maybe someone who has experience with that can help.

Has running a worker on a physical core and its logical friend as a helper been pretty much proven to be no faster than running on the single core? I know that trying to run 8 workers on 4 physical cores just results in more heat, but 4 workers on 8 logical cores, I don't remember.
In almost every case, enabling Hyper-Threading (2 logical cores per physical core) at best makes no difference and usually reduces throughput by a few percent.
2014-11-21, 09:45   #146
tha
Dec 2002

Quote:
Originally Posted by petrw1
In almost every case, enabling Hyper-Threading (2 logical cores per physical core) at best makes no difference and usually reduces throughput by a few percent.
And as far as I understand it, the reason is that hand-written assembly code is so optimized that there is no 'air' left in it for Hyper-Threading to squeeze out by fitting two threads through a single core, in contrast to compiled code.
2014-11-21, 18:32   #147
VBCurtis
"Curtis"
Feb 2005
Riverside, CA

Quote:
Originally Posted by tha
And as far as I understand it, the reason is that hand-written assembly code is so optimized that there is no 'air' left in it for Hyper-Threading to squeeze out by fitting two threads through a single core, in contrast to compiled code.
I have had good luck using HT to interleave a high-bandwidth program (LLR or Prime95) with a lower-bandwidth program (GMP-ECM or NFS sieving). It appears that doing these other tasks helps to hide memory saturation, such that Prime95 hardly slows down but ECM does meaningful work. HT perhaps schedules ECM cycles while Prime95 would have waited for data from main memory.

I found best speeds by manually assigning cores so that ECM was never on its own core.
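A minimal sketch of that manual assignment, assuming a hypothetical 4-core / 8-thread CPU where logical CPU i and i+4 share a physical core (the real mapping varies by machine; on Linux, `lscpu -e` or /proc/cpuinfo shows it):

```python
# Pair each physical core's logical CPU with its HT sibling, then give the
# bandwidth-heavy program the physical set and the light program the siblings.
N_PHYSICAL = 4  # hypothetical core count

physical = list(range(N_PHYSICAL))             # CPUs 0-3: Prime95 / LLR
siblings = [c + N_PHYSICAL for c in physical]  # CPUs 4-7: GMP-ECM / sieving

# On Linux the actual pinning could be done with, e.g.,
#   os.sched_setaffinity(prime95_pid, set(physical))
#   os.sched_setaffinity(ecm_pid, set(siblings))
# or from a shell with taskset: `taskset -c 0-3 ./mprime`.
print(physical, siblings)  # [0, 1, 2, 3] [4, 5, 6, 7]
```

This way ECM always shares a physical core with Prime95 rather than getting cores of its own, which is the arrangement described above.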
2014-11-21, 23:14   #148
TheMawn
May 2013
East. Always East.

Quote:
Originally Posted by VBCurtis
I have had good luck using HT to interleave a high-bandwidth program (LLR or Prime95) with a lower-bandwidth program (GMP-ECM or NFS sieving). It appears that doing these other tasks helps to hide memory saturation, such that Prime95 hardly slows down but ECM does meaningful work. HT perhaps schedules ECM cycles while Prime95 would have waited for data from main memory.

I found best speeds by manually assigning cores so that ECM was never on its own core.
Now that's a pretty neat idea.
2015-02-26, 03:19   #149
Madpoo
Serpentine Vermin Jar
Jul 2014

Dual E5-2690 v2 (@ 3 GHz)

I went ahead and tried this exponent out on my dual-CPU E5-2690 v2 system (it's a DL380p Gen8).

It's running Windows Server 2012 R2 for what that's worth. Hyperthreading is enabled but I know that does pretty much nothing for P95. It does make cycles available for the OS though.

Each CPU has 10 physical cores + HT, and there are 2 CPUs installed.

At 10 threads it hits its stride at 79 days and 4 hours, and it pretty much stays in that 79-80 day range all the way up through 20 threads. Past 20 threads, once it starts using the HT cores, it tends to stay in that same range, sometimes showing a bit longer (up to 100 days) depending on how throttled the system seemed to be at the time.

I thought I remembered doing some benchmarks on the same system a while back and seeing it scale up (slowly, but improving) all the way up through the 20 real cores even though it spanned CPU sockets. I guess that was on a smaller FFT size where the mem bandwidth between sockets wasn't a big limiting factor. Or I just misremembered.

Either way, it did show me that when I'm knocking out some of these oddball exponents (those needing triple-checks, etc.), I can do a lot better than I have been: instead of throwing all 20 cores at one exponent, I get the same performance with only 10, and I can run two of those at once.

For these things I'm doing I want a single exponent tested as fast as possible because I'm collecting stats on triple-checks, or doing double-checks on "suspicious" results, so that works well for me. I know I could get more total throughput by splitting it up more.
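The throughput arithmetic behind that trade-off, as a quick sketch (the 79-day figure comes from the benchmark above; the rest follows from it):

```python
# One 20-core worker: ~79 days per exponent, one exponent at a time.
# Two 10-core workers: each still ~79 days, but two exponents in flight,
# so per-exponent latency is unchanged while throughput doubles.
days_per_test = 79.0

one_worker_rate = 1 / days_per_test   # exponents finished per day
two_worker_rate = 2 / days_per_test

print(two_worker_rate / one_worker_rate)  # 2.0
```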
2015-02-26, 20:30   #150
TheJudger
"Oliver"
Mar 2005
Germany

For (current) NUMA systems it is usually NOT a good idea to run a single worker across multiple CPU sockets (NUMA nodes to be more specific).

Oliver
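A sketch of keeping one worker per node, under an assumed (common, but machine-specific) Linux logical-CPU numbering for a box like the dual 10-core E5-2690 v2 above; `numactl --hardware` reports the real layout:

```python
SOCKETS, CORES_PER_SOCKET = 2, 10  # dual 10-core CPUs, HT enabled

def node_cpus(node):
    """Logical CPUs of one NUMA node, assuming physical cores are numbered
    first (node*10 .. node*10+9) and all HT siblings after them (+20)."""
    phys = list(range(node * CORES_PER_SOCKET, (node + 1) * CORES_PER_SOCKET))
    return phys + [c + SOCKETS * CORES_PER_SOCKET for c in phys]

# One worker per node, e.g. launched as
#   numactl --cpunodebind=0 --membind=0 ./mprime   (and likewise for node 1)
# so each worker's threads and memory stay local to one socket.
print(node_cpus(0))  # [0..9] + [20..29] under this numbering
```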
2015-02-26, 23:12   #151
ATH
Einyen
Dec 2003
Denmark

I'm curious: what is the timing for this 100M-digit exponent on a Titan GPU?
2015-02-27, 02:30   #152
TheMawn
May 2013
East. Always East.

Yeah, that's the idea of the benchmark: look at total iter/sec throughput, see where the scaling stops, and use that to decide how many cores to assign.

I'm not entirely surprised that 20 cores is no faster than 10 cores.
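That "where does scaling stop" decision can be sketched mechanically: given total iter/sec at each thread count (the numbers below are invented for illustration), pick the smallest count beyond which adding threads buys less than a few percent:

```python
# Invented benchmark data: total iterations/sec by thread count.
bench = {1: 10.0, 2: 19.0, 4: 35.0, 8: 52.0, 10: 56.0, 20: 57.0}

def knee(results, min_gain=1.05):
    """Smallest thread count after which throughput grows by less than 5%."""
    counts = sorted(results)
    for prev, cur in zip(counts, counts[1:]):
        if results[cur] < results[prev] * min_gain:
            return prev  # scaling has effectively stopped here
    return counts[-1]

print(knee(bench))  # 10: going from 10 to 20 threads gains under 2%
```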
2015-02-27, 04:15   #153
Madpoo
Serpentine Vermin Jar
Jul 2014

Quote:
Originally Posted by TheJudger
For (current) NUMA systems it is usually NOT a good idea to run a single worker across multiple CPU sockets (NUMA nodes to be more specific).
Yeah, true enough, and pretty obvious once I thought about it.

Right now I'm having difficulty figuring out why Prime95 doesn't seem to split cores apart correctly: if, for instance, I want one worker using 10 cores and another worker using 10 cores, I'd expect it to put the first worker's 10 cores all on one CPU and the other 10 on the other CPU.

When I tried that, it seemed to flood one NUMA node with all 20 threads (including the HT ones). I posed this question in another thread, so I won't belabor it here, but needless to say I'm still trying to wrap my head around how the AffinityScramble2 setting works. I'm probably psyching myself out by thinking of it in terms of dual CPUs rather than just multiple cores on one CPU. Either way, it's puzzling.
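Whatever AffinityScramble2's exact syntax is (not reproduced here), the thing being configured is ultimately a per-worker CPU mask. A sketch of building one mask per socket for this 2 × 10-core + HT machine, under an assumed logical-processor numbering (socket s owns CPUs s*10..s*10+9 plus their HT siblings at +20):

```python
CORES, SOCKETS = 10, 2  # 10 physical cores per socket, 2 sockets, HT on

def socket_mask(socket):
    """Bitmask over 40 logical processors for one socket's CPUs."""
    cpus = list(range(socket * CORES, (socket + 1) * CORES))
    cpus += [c + SOCKETS * CORES for c in cpus]  # add HT siblings
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return mask

# Worker 1 -> socket 0, worker 2 -> socket 1. The masks are disjoint, so
# neither worker floods the other's NUMA node.
print(hex(socket_mask(0)), hex(socket_mask(1)))
```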
2015-02-27, 11:02   #154
Robert_JD
Sep 2010
So Cal

Quote:
Originally Posted by ATH
I'm curious: what is the timing for this 100M-digit exponent on a Titan GPU?
Approximately 50 days.
Attached Files
CULU_Test.txt (2.2 KB)