mersenneforum.org So does skylake-nonXeon actually get us anything?

2015-09-05, 12:20   #23
patrik

"Patrik Johansson"
Aug 2002
Uppsala, Sweden

1A9₁₆ Posts

Quote:
 Originally Posted by nucleon It'll be interesting on skylake with DDR4, how 4 cores loaded vs 3 cores, and see memory starvation.
I tried this with the double-checks I was assigned on this new (as of yesterday) computer. First using all four cores:
Code:
[Worker #1 Sep 5 13:02] Iteration: 9310000 / 39026909 [23.85%], ms/iter: 11.193, ETA: 3d 20:23
[Worker #4 Sep 5 13:03] Iteration: 9280000 / 39143207 [23.70%], ms/iter: 11.209, ETA: 3d 20:59
[Worker #3 Sep 5 13:03] Iteration: 9300000 / 39070231 [23.80%], ms/iter: 11.185, ETA: 3d 20:29
[Worker #2 Sep 5 13:03] Iteration: 9290000 / 39068641 [23.77%], ms/iter: 11.197, ETA: 3d 20:37
[Worker #1 Sep 5 13:04] Iteration: 9320000 / 39026909 [23.88%], ms/iter: 11.176, ETA: 3d 20:13
[Worker #4 Sep 5 13:05] Iteration: 9290000 / 39143207 [23.73%], ms/iter: 11.210, ETA: 3d 20:57
[Worker #3 Sep 5 13:05] Iteration: 9310000 / 39070231 [23.82%], ms/iter: 11.166, ETA: 3d 20:18
Then with three cores:
Code:
[Worker #1 Sep 5 13:24] Iteration: 9350000 / 39026909 [23.95%], ms/iter:  8.529, ETA: 70:18:44
[Worker #2 Sep 5 13:26] Iteration: 9330000 / 39068641 [23.88%], ms/iter:  8.573, ETA: 70:49:06
[Worker #3 Sep 5 13:26] Iteration: 9340000 / 39070231 [23.90%], ms/iter:  8.528, ETA: 70:25:49
[Worker #1 Sep 5 13:26] Iteration: 9360000 / 39026909 [23.98%], ms/iter:  8.528, ETA: 70:16:50
[Worker #2 Sep 5 13:27] Iteration: 9340000 / 39068641 [23.90%], ms/iter:  8.587, ETA: 70:54:27
[Worker #3 Sep 5 13:27] Iteration: 9350000 / 39070231 [23.93%], ms/iter:  8.544, ETA: 70:31:56
Only when I reduce to two cores do I see the throughput start dropping:
Code:
[Worker #1 Sep 5 13:37] Iteration: 9440000 / 39026909 [24.18%], ms/iter:  7.186, ETA: 59:03:31
[Worker #2 Sep 5 13:39] Iteration: 9420000 / 39068641 [24.11%], ms/iter:  7.224, ETA: 59:29:35
[Worker #1 Sep 5 13:39] Iteration: 9450000 / 39026909 [24.21%], ms/iter:  7.182, ETA: 59:00:22
[Worker #2 Sep 5 13:40] Iteration: 9430000 / 39068641 [24.13%], ms/iter:  7.243, ETA: 59:37:48
[Worker #1 Sep 5 13:40] Iteration: 9460000 / 39026909 [24.23%], ms/iter:  7.201, ETA: 59:08:29
[Worker #2 Sep 5 13:41] Iteration: 9440000 / 39068641 [24.16%], ms/iter:  7.231, ETA: 59:30:32
Throughput:
Code:
4 / 11.2 ms/iter = 0.36 iter/ms
3 /  8.5 ms/iter = 0.35 iter/ms
2 /  7.2 ms/iter = 0.28 iter/ms
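For anyone who wants to repeat the arithmetic on their own logs, here is a minimal sketch that pulls the ms/iter figures out of progress lines like the ones above and sums the per-worker rates. The regular expression and the function name `total_throughput` are my own assumptions based on the line format shown in this post, not anything provided by Prime95 or mprime.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed line format, copied from the progress lines quoted in this thread:
# [Worker #1 Sep 5 13:37] Iteration: 9440000 / 39026909 [24.18%], ms/iter:  7.186, ETA: ...
LINE_RE = re.compile(
    r"\[Worker #(\d+) .*?\] Iteration: \d+ / \d+ \[[\d.]+%\], ms/iter:\s*([\d.]+)"
)

def total_throughput(log_text):
    """Return combined iterations per millisecond across all workers in the log."""
    per_worker = defaultdict(list)
    for match in LINE_RE.finditer(log_text):
        per_worker[int(match.group(1))].append(float(match.group(2)))
    if not per_worker:
        raise ValueError("no progress lines found")
    # Each worker contributes 1 / (its mean ms/iter) iterations per millisecond.
    return sum(1.0 / mean(times) for times in per_worker.values())

SAMPLE = """\
[Worker #1 Sep 5 13:37] Iteration: 9440000 / 39026909 [24.18%], ms/iter:  7.186, ETA: 59:03:31
[Worker #2 Sep 5 13:39] Iteration: 9420000 / 39068641 [24.11%], ms/iter:  7.224, ETA: 59:29:35
[Worker #1 Sep 5 13:39] Iteration: 9450000 / 39026909 [24.21%], ms/iter:  7.182, ETA: 59:00:22
[Worker #2 Sep 5 13:40] Iteration: 9430000 / 39068641 [24.13%], ms/iter:  7.243, ETA: 59:37:48
"""

if __name__ == "__main__":
    print(f"{total_throughput(SAMPLE):.2f} iter/ms")
```

Fed the first four two-core lines, this lands on the same 0.28 iter/ms as the hand calculation. Summing per-worker rates rather than dividing the worker count by one average keeps the result sensible when workers run at different speeds.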
My hardware is:
Code:
Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin
but I just realized that I didn't use the XMP setting of the memory, only the default, so I'm running it at 2133 MHz. I will post again if and when I get the memory running at full speed.
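One way to read the numbers above is a toy saturation model: each worker needs some fixed CPU time per iteration, and all workers share one memory bus, so the iteration time is whichever is larger. This is purely my own simplification for illustration, not how Prime95 models anything, and the two constants are fitted by hand to the 2133 MHz figures (7.2 ms from the two-core run, 11.2 / 4 ms from the four-core run).

```python
# Toy model (an assumption, not a measurement): a worker's iteration takes
# either its own CPU time or its share of a saturated memory bus, whichever
# is longer.
T_CPU_MS = 7.2        # fitted: ms/iter seen with two workers
T_MEM_MS = 11.2 / 4   # fitted: four workers at 11.2 ms/iter sharing the bus

def ms_per_iter(workers, t_cpu_ms=T_CPU_MS, t_mem_ms=T_MEM_MS):
    """Predicted per-worker iteration time for a given number of workers."""
    return max(t_cpu_ms, workers * t_mem_ms)

def throughput(workers, t_cpu_ms=T_CPU_MS, t_mem_ms=T_MEM_MS):
    """Predicted combined iterations per millisecond."""
    return workers / ms_per_iter(workers, t_cpu_ms, t_mem_ms)

if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n} workers: {ms_per_iter(n):.1f} ms/iter, "
              f"{throughput(n):.3f} iter/ms")
```

With those constants the model predicts about 8.4 ms/iter for three workers, close to the 8.5 measured, and the same total throughput for three and four workers. It says nothing about caches or turbo clocks, so treat it as a rough picture only.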

2015-09-05, 13:36   #24
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

41×157 Posts

I'm pretty sure Skylake Xeon will need a new socket, because leaked slides like http://www.extremetech.com/computing...emory-channels suggest that it has six-channel DDR4. I am saving up for a dual Skylake Xeon box in early 2017 to replace my 48-core Opteron from May 2011 - the Opteron is still quite a capable machine, with 2.5x the performance of an i7/4790 at 5x the power, but I'd expect 24 cores of Skylake to be twice that performance at less absolute power.

Last fiddled with by fivemack on 2015-09-05 at 18:44

2015-09-05, 17:14   #25
Serpentine Vermin Jar

Jul 2014

3,313 Posts

Quote:
 Originally Posted by patrik Throughput:
Code:
4 / 11.2 ms/iter = 0.36 iter/ms
3 /  8.5 ms/iter = 0.35 iter/ms
2 /  7.2 ms/iter = 0.28 iter/ms
That suggests to me that with optimal throughput in your case being 3 workers, you could set up one of the workers to possibly use 2 cores, and the other 2 workers using just one core each?

This reminds me that I really need to sit myself down sometime and work through the actual throughputs on a 6/8/10/14 core chip and see just where the sweet spots are. Right now I throw all of the cores on one CPU at a single worker so I'm not worried about memory thrashing between multiple workers, but I know I'm leaving a little bit of throughput out that way. I can see that the CPUs aren't totally maxed (but they are close).

You might see the same thing with 4 workers... are your cores at 100% each, or are they slightly under as I'd expect? Maybe by just 1% or less under the max so it could be hard to really tell.

2015-09-05, 20:32   #26
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2⁴·3²·53 Posts

Quote:
 Originally Posted by Madpoo That suggests to me that with optimal throughput in your case being 3 workers, you could set up one of the workers to possibly use 2 cores, and the other 2 workers using just one core each?
Unlikely. One worker using two threads creates (nearly) as much memory contention as two workers.

Quote:
 are your cores at 100% each, or are they slightly under as I'd expect? Maybe by just 1% or less under the max so it could be hard to really tell.
The OS CPU utilization cannot measure memory contention. CPU utilization is based on the time slices the OS gives to each process. Memory delays happen at far too fine a granularity to be measured by the OS.

2015-09-05, 22:03   #27
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

1011100011000₂ Posts

Could memory delays be measured within Prime95?

2015-09-05, 22:09   #28
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

16720₈ Posts

Quote:
 Originally Posted by henryzz Could memory delays be measured within Prime95?
No.

There probably is a performance counter to measure that. But I believe reading performance counters is a privileged operation.

2015-09-06, 17:35   #29
Serpentine Vermin Jar

Jul 2014

3313₁₀ Posts

Quote:
 Originally Posted by Prime95 Unlikely. One worker using two threads creates (nearly) as much memory contention as two workers.
In my tests, I've seen some improvement adding CPU cores even when I'm (nearly) sure that memory contention was already at a peak level. Maybe it's related to L2/L3 cache? No idea.

Quote:
 Originally Posted by Prime95 The OS CPU utilization cannot measure memory contention. CPU utilization is based on the time slices the OS gives to each process. Memory delays happen at far too fine a granularity to be measured by the OS.
Also, going from my own experience which may or may not apply to anyone else, when mem contention got too high, what I saw was the CPU usage by Prime95 drop (sometimes dramatically) as an effect of the CPU idling more often when waiting on memory.

I'm guessing it's memory contention by inference, since otherwise Prime95 would presumably keep chugging along and keeping those cycles going?

Based on everything I've encountered with memory lately, it makes me hopeful that new things like HBM, memory cubes, blah blah, will have an amazing effect on apps like Prime95 that really need that kind of bandwidth.

All we have to do is sit around and wait for those shiny things to show up on our desktops, right?

2015-09-06, 19:52   #30
patrik

"Patrik Johansson"
Aug 2002
Uppsala, Sweden

5²·17 Posts

Quote:
 Originally Posted by patrik I will post again if and when I get the memory running at full speed.
It took me some time because I had to upgrade the BIOS. Before that I couldn't get the memory to run faster than 2500 MHz without getting hardware errors from mprime. Now it runs at 3000 MHz without errors.

4 cores:
Code:
[Worker #4 Sep 6 20:53] Iteration: 3280000 / 39143207 [8.37%], ms/iter:  8.641, ETA: 3d 14:04
[Worker #2 Sep 6 20:54] Iteration: 3290000 / 39068641 [8.42%], ms/iter:  8.590, ETA: 3d 13:22
[Worker #3 Sep 6 20:54] Iteration: 3290000 / 39070231 [8.42%], ms/iter:  8.640, ETA: 3d 13:52
[Worker #1 Sep 6 20:55] Iteration: 3340000 / 39026909 [8.55%], ms/iter:  8.594, ETA: 3d 13:11
[Worker #4 Sep 6 20:55] Iteration: 3290000 / 39143207 [8.40%], ms/iter:  8.646, ETA: 3d 14:06
[Worker #2 Sep 6 20:55] Iteration: 3300000 / 39068641 [8.44%], ms/iter:  8.587, ETA: 3d 13:19
[Worker #3 Sep 6 20:56] Iteration: 3300000 / 39070231 [8.44%], ms/iter:  8.641, ETA: 3d 13:51
3 cores:
Code:
[Worker #1 Sep 6 21:13] Iteration: 3480000 / 39026909 [8.91%], ms/iter:  7.507, ETA: 3d 02:07
[Worker #2 Sep 6 21:14] Iteration: 3440000 / 39068641 [8.80%], ms/iter:  7.485, ETA: 3d 02:04
[Worker #3 Sep 6 21:14] Iteration: 3440000 / 39070231 [8.80%], ms/iter:  7.487, ETA: 3d 02:05
[Worker #1 Sep 6 21:14] Iteration: 3490000 / 39026909 [8.94%], ms/iter:  7.509, ETA: 3d 02:07
[Worker #2 Sep 6 21:15] Iteration: 3450000 / 39068641 [8.83%], ms/iter:  7.488, ETA: 3d 02:04
[Worker #3 Sep 6 21:16] Iteration: 3450000 / 39070231 [8.83%], ms/iter:  7.488, ETA: 3d 02:05
2 cores:
Code:
[Worker #2 Sep 6 21:27] Iteration: 3540000 / 39068641 [9.06%], ms/iter:  7.054, ETA: 69:37:10
[Worker #1 Sep 6 21:27] Iteration: 3590000 / 39026909 [9.19%], ms/iter:  7.044, ETA: 69:20:30
[Worker #2 Sep 6 21:28] Iteration: 3550000 / 39068641 [9.08%], ms/iter:  7.063, ETA: 69:41:12
[Worker #1 Sep 6 21:28] Iteration: 3600000 / 39026909 [9.22%], ms/iter:  7.053, ETA: 69:24:24
[Worker #2 Sep 6 21:29] Iteration: 3560000 / 39068641 [9.11%], ms/iter:  7.068, ETA: 69:42:46
[Worker #1 Sep 6 21:29] Iteration: 3610000 / 39026909 [9.25%], ms/iter:  7.052, ETA: 69:22:26
Throughput:
Code:
4 / 8.62 ms/iter = 0.464 iter/ms
3 / 7.48 ms/iter = 0.401 iter/ms
2 / 7.06 ms/iter = 0.283 iter/ms
This time all four cores seem to be useful, at least when running double-checks.
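To put the two memory settings side by side, here is a small sketch that computes the throughput gain at each core count from the rounded ms/iter figures in the two Throughput tables. The dictionary layout and the function names are just illustrative choices of mine.

```python
# Rounded ms/iter per worker, taken from the two Throughput tables in this thread.
MS_PER_ITER = {
    2133: {4: 11.2, 3: 8.5, 2: 7.2},    # memory at default 2133 MHz
    3000: {4: 8.62, 3: 7.48, 2: 7.06},  # memory at 3000 MHz
}

def throughput(workers, ms_per_iter):
    """Combined iterations per millisecond for identical workers."""
    return workers / ms_per_iter

def gain(workers, slow=2133, fast=3000):
    """Ratio of throughput at the faster memory setting to the slower one."""
    return (throughput(workers, MS_PER_ITER[fast][workers])
            / throughput(workers, MS_PER_ITER[slow][workers]))

if __name__ == "__main__":
    print(f"memory clock ratio: {3000 / 2133:.2f}x")
    for n in (4, 3, 2):
        print(f"{n} cores: {gain(n):.2f}x throughput")
```

The gain works out to roughly 30% with four cores, 14% with three, and only 2% with two, against a memory clock increase of about 41%. That fits the picture of four cores being memory-limited at 2133 MHz while two cores were mostly not.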

I repeat that my hardware is:
Code:
Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin

2015-09-07, 05:22   #31
axn

Jun 2003

2²·5·257 Posts

Quote:
 Originally Posted by Madpoo Also, going from my own experience which may or may not apply to anyone else, when mem contention got too high, what I saw was the CPU usage by Prime95 drop (sometimes dramatically) as an effect of the CPU idling more often when waiting on memory.
Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?

2015-09-07, 05:47   #32
Serpentine Vermin Jar

Jul 2014

3,313 Posts

Quote:
 Originally Posted by axn Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?
Let me put it this way which may help, if anyone else wanted to try and reproduce it on their own machine...

The most dramatic example you could probably see is to configure a single worker thread using however many CPUs you have on your system (let's say 4 for the sake of argument).

Assign a really small exponent to that worker in the 10M range while you look at the graphs of each individual core. What you *should* see (and what I see) is that the first core on that worker will use roughly all of its power, but the other 3 "helper" cores will use a noticeably smaller amount. In the case of a 10M exponent, it should be pretty obvious.

The theory I posed to Science_Man_88 a little earlier today was along the lines of: The processors are able to work a lot quicker on the small exponent with a small FFT, which means that you're more likely to run into the memory bottleneck since the CPU is cruising along fairly fast.

Is that true? I have no idea... it was just my theory to explain why I saw a more pronounced CPU idle with smaller exponents. The fewer cores I threw at the worker meant each CPU was using more of its full potential, I guess because there's a point where memory and CPU are roughly balanced.

With larger exponents, the CPU is doing much more work, even with tests I've done up to 14 cores on a single CPU, to the point where the CPU might still be the limiting factor in such a setup. It's only when I start adding more cores on another physical chip, stressing the QPI link and overall memory bandwidth even more, that I see it really start to bog down.

On my 14-core CPUs I would start to see the "helper" cores using noticeably less than 100% after adding 4-5 more cores on the other chip. Might be QPI being saturated, might be the overall memory use...but same end result where I *think* the cores end up waiting on memory.

For what it's worth, I don't think Windows has any way to measure raw memory bandwidth use. At least, I'm not aware of any built-in way to see that, like with CPU cycles. My gut tells me that doing so is possible but then if you were to measure that kind of detail it would probably slow things down. It's what I call the "quantum effect" of monitoring a server: just by monitoring some things you're affecting the system itself. I try to avoid intrusive monitoring for that reason.

2015-09-07, 14:35   #33
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

3×7×13×23 Posts

Quote:
 Originally Posted by Madpoo ... look at the graphs of each individual core.
Those graphs are the OS time slice views and have absolutely nothing at all to do with memory bandwidth or bottlenecks. The OS can't halt the CPU when instructions are waiting for memory, it just doesn't work that way. The OS can't see into the process and decide to somehow insert a HLT instruction in the middle of a memory read instruction. The OS is just another program (albeit with higher privileges) and runs on the same core hardware as everything else. Only when some interrupt or exception happens does the OS get to run some code, but during normal program operation the OS isn't even running.


All times are UTC.