mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2015-09-05, 12:20   #23
patrik
"Patrik Johansson"
Aug 2002
Uppsala, Sweden
23·53 Posts
Quote:
Originally Posted by nucleon View Post
It'll be interesting to see on Skylake with DDR4 how 4 loaded cores compare with 3, and whether memory starvation shows up.
Here are results from the double-checks I was assigned on this new (built yesterday) computer. First, using all four cores:
Code:
[Worker #1 Sep 5 13:02] Iteration: 9310000 / 39026909 [23.85%], ms/iter: 11.193, ETA: 3d 20:23
[Worker #4 Sep 5 13:03] Iteration: 9280000 / 39143207 [23.70%], ms/iter: 11.209, ETA: 3d 20:59
[Worker #3 Sep 5 13:03] Iteration: 9300000 / 39070231 [23.80%], ms/iter: 11.185, ETA: 3d 20:29
[Worker #2 Sep 5 13:03] Iteration: 9290000 / 39068641 [23.77%], ms/iter: 11.197, ETA: 3d 20:37
[Worker #1 Sep 5 13:04] Iteration: 9320000 / 39026909 [23.88%], ms/iter: 11.176, ETA: 3d 20:13
[Worker #4 Sep 5 13:05] Iteration: 9290000 / 39143207 [23.73%], ms/iter: 11.210, ETA: 3d 20:57
[Worker #3 Sep 5 13:05] Iteration: 9310000 / 39070231 [23.82%], ms/iter: 11.166, ETA: 3d 20:18
Then with three cores:
Code:
[Worker #1 Sep 5 13:24] Iteration: 9350000 / 39026909 [23.95%], ms/iter:  8.529, ETA: 70:18:44
[Worker #2 Sep 5 13:26] Iteration: 9330000 / 39068641 [23.88%], ms/iter:  8.573, ETA: 70:49:06
[Worker #3 Sep 5 13:26] Iteration: 9340000 / 39070231 [23.90%], ms/iter:  8.528, ETA: 70:25:49
[Worker #1 Sep 5 13:26] Iteration: 9360000 / 39026909 [23.98%], ms/iter:  8.528, ETA: 70:16:50
[Worker #2 Sep 5 13:27] Iteration: 9340000 / 39068641 [23.90%], ms/iter:  8.587, ETA: 70:54:27
[Worker #3 Sep 5 13:27] Iteration: 9350000 / 39070231 [23.93%], ms/iter:  8.544, ETA: 70:31:56
Only when I reduce to two cores does the throughput start to drop:
Code:
[Worker #1 Sep 5 13:37] Iteration: 9440000 / 39026909 [24.18%], ms/iter:  7.186, ETA: 59:03:31
[Worker #2 Sep 5 13:39] Iteration: 9420000 / 39068641 [24.11%], ms/iter:  7.224, ETA: 59:29:35
[Worker #1 Sep 5 13:39] Iteration: 9450000 / 39026909 [24.21%], ms/iter:  7.182, ETA: 59:00:22
[Worker #2 Sep 5 13:40] Iteration: 9430000 / 39068641 [24.13%], ms/iter:  7.243, ETA: 59:37:48
[Worker #1 Sep 5 13:40] Iteration: 9460000 / 39026909 [24.23%], ms/iter:  7.201, ETA: 59:08:29
[Worker #2 Sep 5 13:41] Iteration: 9440000 / 39068641 [24.16%], ms/iter:  7.231, ETA: 59:30:32
Throughput:
Code:
4 / 11.2 ms/iter = 0.36 iter/ms
3 /  8.5 ms/iter = 0.35 iter/ms
2 /  7.2 ms/iter = 0.28 iter/ms
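The arithmetic above can be checked with a few lines of Python (worker counts and ms/iter figures taken from the logs in this post):

```python
# Aggregate throughput for N identical workers: each worker completes
# 1/t iterations per ms, so together they complete N/t iterations per ms.
def throughput(workers: int, ms_per_iter: float) -> float:
    return workers / ms_per_iter

# ms/iter figures from the logs above (DDR4 running at 2133 MHz)
for n, t in [(4, 11.2), (3, 8.5), (2, 7.2)]:
    print(f"{n} workers at {t} ms/iter -> {throughput(n, t):.2f} iter/ms")
```

This reproduces the 0.36 / 0.35 / 0.28 iter/ms figures above: 4 and 3 workers are nearly tied, and only the 2-worker run loses real throughput.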
My hardware is:
Code:
Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin
but I just realized that I didn't enable the memory's XMP profile, just the default setting, so I'm running it at 2133 MHz. I will post again if and when I get the memory running at full speed.
Old 2015-09-05, 13:36   #24
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
23×797 Posts

I'm pretty sure Skylake Xeon will need a new socket, because leaked slides like http://www.extremetech.com/computing...emory-channels suggest that it has six-channel DDR4. I am saving up for a dual Skylake Xeon box in early 2017 to replace my 48-core Opteron from May 2011. The Opteron is still quite a capable machine, with 2.5x the performance of an i7-4790 at 5x the power, but I'd expect 24 cores of Skylake to deliver twice that performance at lower absolute power.

Last fiddled with by fivemack on 2015-09-05 at 18:44
Old 2015-09-05, 17:14   #25
Madpoo
Serpentine Vermin Jar
Jul 2014
29·113 Posts

Quote:
Originally Posted by patrik View Post
Throughput:
Code:
4 / 11.2 ms/iter = 0.36 iter/ms
3 /  8.5 ms/iter = 0.35 iter/ms
2 /  7.2 ms/iter = 0.28 iter/ms
That suggests to me that, with optimal throughput in your case being 3 workers, you could set up one worker to use 2 cores and the other two workers to use one core each?

This reminds me that I really need to sit myself down sometime and work through the actual throughputs on a 6/8/10/14 core chip and see just where the sweet spots are. Right now I throw all of the cores on one CPU at a single worker, so I'm not worried about memory thrashing between multiple workers, but I know I'm leaving a little bit of throughput on the table that way. I can see that the CPUs aren't totally maxed (but they are close).

You might see the same thing with 4 workers... are your cores at 100% each, or are they slightly under, as I'd expect? Maybe just 1% or less under the max, so it could be hard to really tell.
Old 2015-09-05, 20:32   #26
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
52·172 Posts

Quote:
Originally Posted by Madpoo View Post
That suggests to me that with optimal throughput in your case being 3 workers, you could setup one of the workers to possibly use 2 cores, and the other 2 workers using just one core each?
Unlikely. One worker using two threads creates (nearly) as much memory contention as two workers.

Quote:
are your cores at 100% each, or are they slightly under as I'd expect? Maybe by just 1% or less under the max so it could be hard to really tell.
OS CPU utilization cannot measure memory contention. CPU utilization is based on the time slices the OS gives to each process. Memory delays happen at far too fine a granularity to be measured by the OS.
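The measurement point can be illustrated with a toy Python sketch (a generic single-threaded demonstration, not anything Prime95-specific): the OS charges stalled cycles to the process just like retired ones, so CPU-time-based utilization stays near 100% no matter how long the core waits on memory.

```python
import time

# Strided walk over a large list: plenty of cache-unfriendly accesses,
# but the OS still bills every stalled cycle to this process. CPU time
# (process_time) therefore tracks wall time (perf_counter) at a ratio
# near 1.0, and memory delays never show up in "CPU utilization".
N = 1 << 22
buf = list(range(N))

wall0, cpu0 = time.perf_counter(), time.process_time()
total = 0
for _ in range(10):            # several passes so the timers have work to see
    for i in range(0, N, 64):  # large stride defeats spatial locality
        total += buf[i]
wall1, cpu1 = time.perf_counter(), time.process_time()

utilization = (cpu1 - cpu0) / (wall1 - wall0)
print(f"apparent utilization: {utilization:.2f}")
```

On an otherwise idle machine the printed ratio sits near 1.00; only a hardware event counter (e.g. a stalled-cycles event) could separate waiting from working.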
Old 2015-09-05, 22:03   #27
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2×3×7×137 Posts

Could memory delays be measured within Prime95?
Old 2015-09-05, 22:09   #28
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
52·172 Posts

Quote:
Originally Posted by henryzz View Post
Could memory delays be measured within Prime95?
No.

There probably is a performance counter to measure that. But I believe reading performance counters is a privileged operation.
Old 2015-09-06, 17:35   #29
Madpoo
Serpentine Vermin Jar
Jul 2014
29·113 Posts

Quote:
Originally Posted by Prime95 View Post
Unlikely. One worker using two threads creates (nearly) as much memory contention as two workers.
In my tests, I've seen some improvement adding CPU cores even when I'm (nearly) sure that memory contention was already at a peak level. Maybe it's related to L2/L3 cache? No idea.

Quote:
Originally Posted by Prime95 View Post
OS CPU utilization cannot measure memory contention. CPU utilization is based on the time slices the OS gives to each process. Memory delays happen at far too fine a granularity to be measured by the OS.
Also, going from my own experience, which may or may not apply to anyone else: when memory contention got too high, what I saw was the CPU usage by Prime95 drop (sometimes dramatically), as an effect of the CPU idling more often while waiting on memory.

I'm guessing it's memory contention by inference, since otherwise Prime95 would presumably keep chugging along and keep those cycles going?

Based on everything I've encountered with memory lately, it makes me hopeful that new things like HBM, memory cubes, blah blah, will have an amazing effect on apps like Prime95 that really need that kind of bandwidth.

All we have to do is sit around and wait for those shiny things to show up on our desktops, right?
Old 2015-09-06, 19:52   #30
patrik
"Patrik Johansson"
Aug 2002
Uppsala, Sweden
23·53 Posts

Quote:
Originally Posted by patrik View Post
I will post again if and when I get the memory running at full speed.
It took me some time because I had to upgrade the BIOS. Before that I couldn't get the memory to run faster than 2500 MHz without getting hardware errors from mprime. Now it runs at 3000 MHz without errors.

4 cores:
Code:
[Worker #4 Sep 6 20:53] Iteration: 3280000 / 39143207 [8.37%], ms/iter:  8.641, ETA: 3d 14:04
[Worker #2 Sep 6 20:54] Iteration: 3290000 / 39068641 [8.42%], ms/iter:  8.590, ETA: 3d 13:22
[Worker #3 Sep 6 20:54] Iteration: 3290000 / 39070231 [8.42%], ms/iter:  8.640, ETA: 3d 13:52
[Worker #1 Sep 6 20:55] Iteration: 3340000 / 39026909 [8.55%], ms/iter:  8.594, ETA: 3d 13:11
[Worker #4 Sep 6 20:55] Iteration: 3290000 / 39143207 [8.40%], ms/iter:  8.646, ETA: 3d 14:06
[Worker #2 Sep 6 20:55] Iteration: 3300000 / 39068641 [8.44%], ms/iter:  8.587, ETA: 3d 13:19
[Worker #3 Sep 6 20:56] Iteration: 3300000 / 39070231 [8.44%], ms/iter:  8.641, ETA: 3d 13:51
3 cores:
Code:
[Worker #1 Sep 6 21:13] Iteration: 3480000 / 39026909 [8.91%], ms/iter:  7.507, ETA: 3d 02:07
[Worker #2 Sep 6 21:14] Iteration: 3440000 / 39068641 [8.80%], ms/iter:  7.485, ETA: 3d 02:04
[Worker #3 Sep 6 21:14] Iteration: 3440000 / 39070231 [8.80%], ms/iter:  7.487, ETA: 3d 02:05
[Worker #1 Sep 6 21:14] Iteration: 3490000 / 39026909 [8.94%], ms/iter:  7.509, ETA: 3d 02:07
[Worker #2 Sep 6 21:15] Iteration: 3450000 / 39068641 [8.83%], ms/iter:  7.488, ETA: 3d 02:04
[Worker #3 Sep 6 21:16] Iteration: 3450000 / 39070231 [8.83%], ms/iter:  7.488, ETA: 3d 02:05
2 cores:
Code:
[Worker #2 Sep 6 21:27] Iteration: 3540000 / 39068641 [9.06%], ms/iter:  7.054, ETA: 69:37:10
[Worker #1 Sep 6 21:27] Iteration: 3590000 / 39026909 [9.19%], ms/iter:  7.044, ETA: 69:20:30
[Worker #2 Sep 6 21:28] Iteration: 3550000 / 39068641 [9.08%], ms/iter:  7.063, ETA: 69:41:12
[Worker #1 Sep 6 21:28] Iteration: 3600000 / 39026909 [9.22%], ms/iter:  7.053, ETA: 69:24:24
[Worker #2 Sep 6 21:29] Iteration: 3560000 / 39068641 [9.11%], ms/iter:  7.068, ETA: 69:42:46
[Worker #1 Sep 6 21:29] Iteration: 3610000 / 39026909 [9.25%], ms/iter:  7.052, ETA: 69:22:26
Throughput:
Code:
4 / 8.62 ms/iter = 0.464 iter/ms
3 / 7.48 ms/iter = 0.401 iter/ms
2 / 7.06 ms/iter = 0.283 iter/ms
This time all four cores seem to be useful, at least when running double-checks.
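Putting the two sets of figures side by side (ms/iter values taken from this post and from post #23 above), a short Python comparison shows what the faster memory bought:

```python
# Per-worker ms/iter from the two runs in this thread:
# stock DDR4-2133 vs the DDR4-3000 XMP profile.
runs = {
    2133: {4: 11.2, 3: 8.5, 2: 7.2},
    3000: {4: 8.62, 3: 7.48, 2: 7.06},
}

for mhz in runs:
    for n in sorted(runs[mhz], reverse=True):
        print(f"DDR4-{mhz}, {n} workers: {n / runs[mhz][n]:.3f} iter/ms")

# Speedup of the 4-worker configuration from enabling XMP
speedup = runs[2133][4] / runs[3000][4]
print(f"4-worker speedup from XMP: {speedup:.2f}x")
```

At 3000 MHz the 4-worker configuration is clearly best (0.464 iter/ms), whereas at 2133 MHz it was only marginally ahead of 3 workers; the 4-worker speedup from the faster memory works out to about 1.30x.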

I repeat that my hardware is:
Code:
Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin
Old 2015-09-07, 05:22   #31
axn
Jun 2003
4,789 Posts

Quote:
Originally Posted by Madpoo View Post
Also, going from my own experience, which may or may not apply to anyone else: when memory contention got too high, what I saw was the CPU usage by Prime95 drop (sometimes dramatically), as an effect of the CPU idling more often while waiting on memory.
Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?
Old 2015-09-07, 05:47   #32
Madpoo
Serpentine Vermin Jar
Jul 2014
29·113 Posts

Quote:
Originally Posted by axn View Post
Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?
Let me put it this way, which may help if anyone else wants to try to reproduce it on their own machine...

The most dramatic example you could probably see is to configure a single worker using however many CPUs you have on your system (let's say 4 for the sake of argument).

Assign a really small exponent, in the 10M range, to that worker while you look at the graphs of each individual core. What you *should* see (and what I see) is that the first core on that worker will use roughly all of its power, but the other 3 "helper" cores will use a noticeably smaller amount. In the case of a 10M exponent, it should be pretty obvious.

The theory I posed to Science_Man_88 a little earlier today was along these lines: the processors get through a small exponent's small FFT a lot quicker, which means you're more likely to run into the memory bottleneck, since the CPU is cruising along fairly fast.

Is that true? I have no idea... it was just my theory to explain why I saw a more pronounced CPU idle with smaller exponents. The fewer cores I threw at the worker, the more of its full potential each core used, I guess because there's a point where memory and CPU are roughly balanced.

With larger exponents the CPU is doing much more work, even in tests I've done with up to 14 cores on a single CPU, to the point where the CPU might still be the limiting factor in such a setup. It's only when I start adding more cores on another physical chip, stressing the QPI link and overall memory bandwidth even more, that I see it really start to bog down.

On my 14-core CPUs I would start to see the "helper" cores using noticeably less than 100% after adding 4-5 more cores on the other chip. It might be QPI being saturated, it might be the overall memory use... but the end result is the same: I *think* the cores end up waiting on memory.

For what it's worth, I don't think Windows has any built-in way to measure raw memory bandwidth use the way it does CPU cycles. My gut tells me that measuring it is possible, but capturing that kind of detail would probably slow things down. It's what I call the "quantum effect" of monitoring a server: just by observing some things you're affecting the system itself. I try to avoid intrusive monitoring for that reason.
Old 2015-09-07, 14:35   #33
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair
3·13·151 Posts

Quote:
Originally Posted by Madpoo View Post
... look at the graphs of each individual core.
Those graphs are the OS time-slice views and have absolutely nothing at all to do with memory bandwidth or bottlenecks. The OS can't halt the CPU when instructions are waiting for memory; it just doesn't work that way. The OS can't see into the process and decide to somehow insert a HLT instruction in the middle of a memory-read instruction. The OS is just another program (albeit with higher privileges) and runs on the same core hardware as everything else. Only when some interrupt or exception happens does the OS get to run some code; during normal program operation the OS isn't even running.