#1
Hi,

I have a computer equipped with two E5-2650 v3 Xeon CPUs, running Windows 7. For most of the day the computer just waits for me to press a key (while I develop software), so I'd like to assign the whole second CPU to prime factoring.

When running "Time..." on an exponent assigned to me (around 76,000,000) I get an optimum of 4 threads per worker, resulting in an average of 13 ms per iteration. But when I set my local.txt to the following:

Code:
Memory=1024 during 6:30-18:30 else 4096
WorkerThreads=5
Affinity=100
ThreadsPerTest=4

the worker window says "50 ms per iteration", which is annoyingly slower than the timing test suggested.

Additionally, when I start only one worker, I see that it is using only cores from the second CPU, but not the first four: it uses 0, 2, 4 and 12... And when I start a second worker, the CPU usage is divided among more than 8 cores: the first worker seems to stick to its cores, but all other workers seem to jump between 2 cores each (I see pairs of cores with ~15% and ~85% usage). Could this slow down the performance that much? What am I doing wrong?

Yours, Guenter

Last fiddled with by g33py on 2016-08-31 at 08:10
#2
The above setting would be suitable if your machine had 20 cores (5 workers times 4 threads).

However, according to Intel's website, the E5-2650 v3 has 10 cores / 20 threads. For mprime/Prime95, only the number of cores matters; hyperthreading doesn't give much advantage. So maybe try

Code:
WorkerThreads=2
ThreadsPerTest=4

or something similar.

Note: every time you change the number of WorkerThreads in local.txt, you might want to adjust worktodo.txt to add or remove section lines like [Worker #1], [Worker #2], [Worker #3], etc.
#3
Best would be:

Code:
WorkerThreads=2
ThreadsPerTest=10

EDIT: Never mind. Apparently he only wants to run on the second CPU. Considering the humongous cache, it might be best to just run one worker with all 10 threads.

Last fiddled with by axn on 2016-08-31 at 11:41
#4
Serpentine Vermin Jar
In my case, I only use 2 workers, each one using all 10 cores of a single CPU. Also, it makes a big difference whether hyperthreading is enabled or not... I'm going to assume it's *enabled* since that's the default. If you have it disabled, you need to adjust the affinity mask so that instead of skipping every other "core" it's merely sequential from 0 to J.

In the prime.txt file, add this line:

Code:
DebugAffinityScramble=2

In the local.txt file, add these lines:

Code:
AffinityScramble2=02468ACEGIKMOQSUWYac13579BDFHJLNPRTVXZbd
WorkerThreads=2
ThreadsPerTest=10
[Worker #1]
Affinity=0
[Worker #2]
Affinity=10

(PS: the scramble above is only valid for a dual 10-core system with hyperthreading. If you have a different number of cores per CPU, or a single (or quad) CPU system, you need to adjust for that particular setup. The reason I use a manual affinity is that Prime95 tries to figure things out on startup, but I found it did a poor job: it didn't assign cores from a single CPU to a worker, it mixed and matched cores across CPUs for a single worker, which was very inefficient.)

So, that's if you want *two* workers, each using 10 threads. But you said you only want to use the 2nd CPU and leave the first one available for other things. Note that Prime95 runs at idle priority, and for 99% of programs out there that's fine. The only issues are with other apps that also use idle priority.

Anyway, to accomplish your goal, simply change WorkerThreads to 1, remove the [Worker #2] entry, and change the affinity of worker #1 to 10 instead of 0. That tells the first worker to start at the 10th (base zero) core and use cores 10-19.

If you felt like running 10 workers of 1 core each, set WorkerThreads=10, ThreadsPerTest=1, and have 10 of those [Worker #x] sections, starting at Affinity=10 and going up to Affinity=19.

I pointed out in another thread that with 10 cores, by the time you start up the 6th or 7th worker you run into some kind of memory contention or other issue, and the performance of each worker suffers as a result. I found it simpler and more efficient to have a single 10-core worker instead of 10 one-core workers, especially with the larger FFT sizes.

Hope that helps.
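For anyone wondering how that scramble string is put together: each position holds one character naming a logical CPU, using the alphabet 0-9, then A-Z, then a-z, and the string lists the even-numbered logical CPUs (the first hyperthread of each physical core, as Windows enumerates them) followed by the odd-numbered HT siblings. A small sketch of my own (not Prime95 code) that reproduces the string above for a dual 10-core hyperthreaded box:

```python
import string

# Alphabet used by AffinityScramble2: 0-9, then A-Z, then a-z
CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase

def scramble_for_windows_ht(n_logical):
    """Even logical CPUs first (the physical cores under Windows'
    alternating HT enumeration), then the odd-numbered HT siblings."""
    order = list(range(0, n_logical, 2)) + list(range(1, n_logical, 2))
    return "".join(CHARS[c] for c in order)

print(scramble_for_windows_ht(40))
# -> 02468ACEGIKMOQSUWYac13579BDFHJLNPRTVXZbd
```

For a different core count, just change the argument; the string must have exactly one character per logical CPU.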
#5
"/X\(‘-‘)/X\"
And just a note to anyone doing the same on Linux: use

Code:
AffinityScramble2=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcd

and still set Affinity to 0 and 10. Here is what a dual-CPU, 20-core, 40-thread system looks like in Linux:

Code:
$ likwid-topology -g
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU type:       Intel Xeon Haswell EN/EP/EX processor
CPU stepping:   2
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:            2
Cores per socket:   10
Threads per core:   2
--------------------------------------------------------------------------------
HWThread  Thread  Core  Socket  Available
0         0       0     0       *
1         0       1     0       *
2         0       2     0       *
3         0       3     0       *
4         0       4     0       *
5         0       5     0       *
6         0       6     0       *
7         0       7     0       *
8         0       8     0       *
9         0       9     0       *
10        0       0     1       *
11        0       1     1       *
12        0       2     1       *
13        0       3     1       *
14        0       4     1       *
15        0       5     1       *
16        0       6     1       *
17        0       7     1       *
18        0       8     1       *
19        0       9     1       *
20        1       0     0       *
21        1       1     0       *
22        1       2     0       *
23        1       3     0       *
24        1       4     0       *
25        1       5     0       *
26        1       6     0       *
27        1       7     0       *
28        1       8     0       *
29        1       9     0       *
30        1       0     1       *
31        1       1     1       *
32        1       2     1       *
33        1       3     1       *
34        1       4     1       *
35        1       5     1       *
36        1       6     1       *
37        1       7     1       *
38        1       8     1       *
39        1       9     1       *
--------------------------------------------------------------------------------
Socket 0: ( 0 20 1 21 2 22 3 23 4 24 5 25 6 26 7 27 8 28 9 29 )
Socket 1: ( 10 30 11 31 12 32 13 33 14 34 15 35 16 36 17 37 18 38 19 39 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level: 1
Size:  32 kB
Cache groups: ( 0 20 ) ( 1 21 ) ( 2 22 ) ( 3 23 ) ( 4 24 ) ( 5 25 ) ( 6 26 ) ( 7 27 ) ( 8 28 ) ( 9 29 ) ( 10 30 ) ( 11 31 ) ( 12 32 ) ( 13 33 ) ( 14 34 ) ( 15 35 ) ( 16 36 ) ( 17 37 ) ( 18 38 ) ( 19 39 )
--------------------------------------------------------------------------------
Level: 2
Size:  256 kB
Cache groups: ( 0 20 ) ( 1 21 ) ( 2 22 ) ( 3 23 ) ( 4 24 ) ( 5 25 ) ( 6 26 ) ( 7 27 ) ( 8 28 ) ( 9 29 ) ( 10 30 ) ( 11 31 ) ( 12 32 ) ( 13 33 ) ( 14 34 ) ( 15 35 ) ( 16 36 ) ( 17 37 ) ( 18 38 ) ( 19 39 )
--------------------------------------------------------------------------------
Level: 3
Size:  30 MB
Cache groups: ( 0 20 1 21 2 22 3 23 4 24 5 25 6 26 7 27 8 28 9 29 ) ( 10 30 11 31 12 32 13 33 14 34 15 35 16 36 17 37 18 38 19 39 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains: 2
--------------------------------------------------------------------------------
Domain: 0
Processors:   ( 0 20 1 21 2 22 3 23 4 24 5 25 6 26 7 27 8 28 9 29 )
Distances:    10 20
Free memory:  79606.4 MB
Total memory: 80555.2 MB
--------------------------------------------------------------------------------
Domain: 1
Processors:   ( 10 30 11 31 12 32 13 33 14 34 15 35 16 36 17 37 18 38 19 39 )
Distances:    20 10
Free memory:  79876.5 MB
Total memory: 80632.7 MB
--------------------------------------------------------------------------------

[The ASCII "Graphical Topology" diagrams for both sockets are omitted; they repeat the core pairs and cache sizes listed above.]
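Because Linux already numbers the real cores 0-19 ahead of their HT siblings 20-39 (see the Socket lines above), the scramble string is just the identity mapping in the 0-9/A-Z/a-z character alphabet. A quick illustration of my own (not Prime95 code) that generates it for any logical CPU count:

```python
import string

# Alphabet used by AffinityScramble2: 0-9, then A-Z, then a-z
CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase

def identity_scramble(n_logical):
    """Identity mapping: logical CPU i stays at position i."""
    return CHARS[:n_logical]

print(identity_scramble(40))
# -> 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcd
```

With this mapping, Affinity=0 covers cores 0-9 (socket 0) and Affinity=10 covers cores 10-19 (socket 1), matching the NUMA domains above.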
#6
Indeed that did help; now I have an understanding of how these parameters work together. Especially the affinity is now much clearer. Thank you!

I have applied your settings, but I still see drastic differences between the "Time..." results and the normal worker results. I have played with 3 workers of 6 threads each. The "Time..." results suggest a time of about 8 ms per iteration, whereas the actual workers report some 20 ms (calculating the same prime candidate). Is this a systematic difference, or is something wrong with my computer settings?

Guenter
#7
Serpentine Vermin Jar
If you wanted, you could do a test where you have 10 workers of 1 core each running on a single CPU, and then see how the timings of worker #1 change as you start up each additional worker.

I did a test just like that at a couple of different FFT sizes recently, and here's the data. What it shows is the ms per iteration at the specified FFT size, and how many workers were running when I got that time. It ran fastest with just one worker going, and as each additional worker started, the times for worker #1 got slower and slower.

Code:
3584K FFT
ms/iter   workers
34.18     10
31.97      9
30.10      8
29.12      7
28.62      6
28.40      5
28.31      4
28.23      3
28.20      2
27.95      1

Code:
3840K FFT
ms/iter   workers
38.20     10
35.10      9
32.85      8
30.10      7
30.20      6
29.67      5
29.45      4
29.23      3
28.90      2
28.74      1

Code:
4096K FFT
ms/iter   workers
40.85     10
38.95      9
36.79      8
35.54      7
35.12      6
34.66      5
34.00      4
34.11      3
33.89      2
33.52      1

For the 3584K FFT it was over 22% slower with all 10 going, at 3840K it was almost 33% slower, and at 4096K it was also about 22% slower.

Then I set up another test where I had one worker using all 10 cores and checked its ms/iter times:

@3584K FFT = 3.18 ms
@3840K FFT = 3.42 ms
@4096K FFT = 4.08 ms

If we made the magical assumption that 1 worker on 1 core would NOT slow down at all when running 10 of those (which, as shown above, is not true), we can compare the 1-worker/10-core time to that magical extrapolation. To do the comparison, I took the 1-worker/1-core optimal time of (e.g. at 3584K) 27.95 ms/iter (with a single one running) and divided by 10 to get 2.795, then compared that to the 1-worker/10-core time using the same (b-a)/a: (3.18-2.795)/2.795 = 13% slower to have a single 10-core worker than 10 (idealized) single-core workers.

So in a perfect world, it'd be better to have 10 single-core workers. But in reality that'd end up being even slower in total throughput than the single 10-core worker.

I probably made my math more difficult by sticking to ms/iter instead of doing the obvious and converting to iterations per second, but oh well; if you want, feel free. That's what the benchmark option in Prime95 does, as well as showing you the average timings.

I'm not sure exactly why the Time/Benchmark options give erroneous estimates: not long enough runtimes, or something else that makes it seem far rosier than it really is? I don't know for sure.

If you have fewer than 10 cores, then the math could start looking a little better for total throughput with multiple single-core workers. I did timings using some other odd combos like 2 workers of 5 cores each, or 5 workers of 2 cores each, but in each of those cases the difference in total throughput was minor. It was only when it jumped to between 6 and 8 workers that it really started to lag.

TL;DR: if you have 4-6 cores per CPU, it probably makes very little difference, but if you have 8+ cores you'll probably do better with a single worker using all cores.
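Doing the conversion to iterations per second that the post alludes to makes the comparison easier to see. A back-of-the-envelope sketch using the 3584K FFT numbers from the tables above (my own arithmetic, with rounding that differs slightly from the 13% quoted):

```python
# 3584K FFT figures from the tables above
ms_single_core_alone = 27.95   # 1-core worker, running alone
ms_single_core_all10 = 34.18   # 1-core worker, with all 10 running
ms_one_10core_worker = 3.18    # one worker using all 10 cores

# Total throughput in iterations per second
thr_10_workers = 10 * 1000.0 / ms_single_core_all10   # ten 1-core workers
thr_1_worker = 1000.0 / ms_one_10core_worker          # one 10-core worker

# Real-world gain of the single 10-core worker over ten 1-core workers
gain = (thr_1_worker - thr_10_workers) / thr_10_workers

# Idealized comparison from the post: pretend ten 1-core workers never
# slowed each other down, i.e. an effective 27.95/10 ms/iter
ideal_ms = ms_single_core_alone / 10
slowdown_vs_ideal = (ms_one_10core_worker - ideal_ms) / ideal_ms

print(f"ten 1-core workers: {thr_10_workers:.1f} iter/s")
print(f"one 10-core worker: {thr_1_worker:.1f} iter/s ({gain:.1%} faster)")
print(f"vs. the idealized extrapolation: {slowdown_vs_ideal:.1%} slower")
```

In iterations per second, the single 10-core worker comes out roughly 7.5% ahead of the measured ten-worker setup, while trailing the (impossible) zero-contention extrapolation by about 14%.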
#8
David
I also have several dual Xeon v2 and v3 systems (10, 12, and 16 core). In almost every case, for production (current LL range) work and higher, the best overall throughput is achieved with 2 workers per system, with affinity set to match all the cores on each CPU.

If I absolutely want to blast through an exponent I can set all 32 cores across both CPUs and see slight gains over the QPI link, but overall throughput is better with 1 worker per physical chip and NumberOfCoresPerCPU threads.

Settings for my 12-core v2:

Code:
WorkerThreads=2
Affinity=0
ThreadsPerTest=12
AffinityScramble2=0123456789ABCDEFGHIJKLMNOPQRSTUV
[Worker #1]
Affinity=0
[Worker #2]
Affinity=12
#9
Serpentine Vermin Jar
#10
David
Actually, HT is not disabled, but under Debian Linux 8 on these boxes the actual cores are grouped 0-31 and their hyperthreaded partners are numbered 32-63. Don't ask me why, but a bunch of testing confirmed it.
#11
Serpentine Vermin Jar
Oh yeah, that's right: Linux huddles all the "real" cores up front and the HT cores last. It makes some things easier, but in my mind it makes NUMA mapping harder to track: a 2-CPU system has two NUMA nodes, and each node's CPUs aren't sequential. Weird, but oh well.