#23
"Patrik Johansson"
Aug 2002
Uppsala, Sweden
5²·17 Posts
Quote:
4 workers: Code:
[Worker #1 Sep 5 13:02] Iteration: 9310000 / 39026909 [23.85%], ms/iter: 11.193, ETA: 3d 20:23
[Worker #4 Sep 5 13:03] Iteration: 9280000 / 39143207 [23.70%], ms/iter: 11.209, ETA: 3d 20:59
[Worker #3 Sep 5 13:03] Iteration: 9300000 / 39070231 [23.80%], ms/iter: 11.185, ETA: 3d 20:29
[Worker #2 Sep 5 13:03] Iteration: 9290000 / 39068641 [23.77%], ms/iter: 11.197, ETA: 3d 20:37
[Worker #1 Sep 5 13:04] Iteration: 9320000 / 39026909 [23.88%], ms/iter: 11.176, ETA: 3d 20:13
[Worker #4 Sep 5 13:05] Iteration: 9290000 / 39143207 [23.73%], ms/iter: 11.210, ETA: 3d 20:57
[Worker #3 Sep 5 13:05] Iteration: 9310000 / 39070231 [23.82%], ms/iter: 11.166, ETA: 3d 20:18
3 workers: Code:
[Worker #1 Sep 5 13:24] Iteration: 9350000 / 39026909 [23.95%], ms/iter: 8.529, ETA: 70:18:44
[Worker #2 Sep 5 13:26] Iteration: 9330000 / 39068641 [23.88%], ms/iter: 8.573, ETA: 70:49:06
[Worker #3 Sep 5 13:26] Iteration: 9340000 / 39070231 [23.90%], ms/iter: 8.528, ETA: 70:25:49
[Worker #1 Sep 5 13:26] Iteration: 9360000 / 39026909 [23.98%], ms/iter: 8.528, ETA: 70:16:50
[Worker #2 Sep 5 13:27] Iteration: 9340000 / 39068641 [23.90%], ms/iter: 8.587, ETA: 70:54:27
[Worker #3 Sep 5 13:27] Iteration: 9350000 / 39070231 [23.93%], ms/iter: 8.544, ETA: 70:31:56
2 workers: Code:
[Worker #1 Sep 5 13:37] Iteration: 9440000 / 39026909 [24.18%], ms/iter: 7.186, ETA: 59:03:31
[Worker #2 Sep 5 13:39] Iteration: 9420000 / 39068641 [24.11%], ms/iter: 7.224, ETA: 59:29:35
[Worker #1 Sep 5 13:39] Iteration: 9450000 / 39026909 [24.21%], ms/iter: 7.182, ETA: 59:00:22
[Worker #2 Sep 5 13:40] Iteration: 9430000 / 39068641 [24.13%], ms/iter: 7.243, ETA: 59:37:48
[Worker #1 Sep 5 13:40] Iteration: 9460000 / 39026909 [24.23%], ms/iter: 7.201, ETA: 59:08:29
[Worker #2 Sep 5 13:41] Iteration: 9440000 / 39068641 [24.16%], ms/iter: 7.231, ETA: 59:30:32
Throughput: Code:
4 / 11.2 ms/iter = 0.36 iter/ms
3 / 8.5 ms/iter = 0.35 iter/ms
2 / 7.2 ms/iter = 0.28 iter/ms
My hardware: Code:
Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin
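The comparison above divides the worker count by the per-worker ms/iter to get total throughput. A minimal Python sketch of the same arithmetic, using the rounded ms/iter figures from the logs:

```python
# Total throughput = workers / (ms per iteration per worker).
# ms/iter values are the rounded figures from the logs above.
configs = [
    (4, 11.2),  # 4 workers at ~11.2 ms/iter each
    (3, 8.5),   # 3 workers at ~8.5 ms/iter each
    (2, 7.2),   # 2 workers at ~7.2 ms/iter each
]

for workers, ms_per_iter in configs:
    total = workers / ms_per_iter  # aggregate iterations per millisecond
    print(f"{workers} workers: {total:.2f} iter/ms")
```

So on this machine 4 workers win on aggregate throughput, while fewer workers finish each individual exponent sooner.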
#24
|
(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts
I'm pretty sure Skylake Xeon will need a new socket, because leaked slides like http://www.extremetech.com/computing...emory-channels suggest that it has six-channel DDR4. I am saving up for a dual Skylake Xeon box in early 2017 to replace my 48-core Opteron from May 2011 - the Opteron is still quite a capable machine, with 2.5x the performance of an i7/4790 at 5x the power, but I'd expect 24 cores of Skylake to be twice that performance at less absolute power.
Last fiddled with by fivemack on 2015-09-05 at 18:44
#25
|
Serpentine Vermin Jar
Jul 2014
110011101111₂ Posts
Quote:
This reminds me that I really need to sit down sometime and work through the actual throughputs on a 6/8/10/14 core chip and see just where the sweet spots are. Right now I throw all of the cores on one CPU at a single worker, so I'm not worried about memory thrashing between multiple workers, but I know I'm leaving a little bit of throughput on the table that way. I can see that the CPUs aren't totally maxed (though they are close). You might see the same thing with 4 workers... are your cores at 100% each, or are they slightly under, as I'd expect? Maybe just 1% or less under the max, so it could be hard to tell.
#26
|
P90 years forever!
Aug 2002
Yeehaw, FL
2×5³×71 Posts
Quote:
#27
|
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
16F8₁₆ Posts
Could memory delays be measured within Prime95?
#28
|
P90 years forever!
Aug 2002
Yeehaw, FL
2×5³×71 Posts
#29
|
Serpentine Vermin Jar
Jul 2014
7·11·43 Posts
Quote:
I'm guessing it's memory contention by inference, since otherwise Prime95 would presumably keep chugging along and keep those cycles busy. Based on everything I've encountered with memory lately, it makes me hopeful that new things like HBM, memory cubes, and so on will have an amazing effect on apps like Prime95 that really need that kind of bandwidth. All we have to do is sit around and wait for those shiny things to show up on our desktops, right?
#30
|
"Patrik Johansson"
Aug 2002
Uppsala, Sweden
5²·17 Posts
Quote:
4 cores: Code:
[Worker #4 Sep 6 20:53] Iteration: 3280000 / 39143207 [8.37%], ms/iter: 8.641, ETA: 3d 14:04
[Worker #2 Sep 6 20:54] Iteration: 3290000 / 39068641 [8.42%], ms/iter: 8.590, ETA: 3d 13:22
[Worker #3 Sep 6 20:54] Iteration: 3290000 / 39070231 [8.42%], ms/iter: 8.640, ETA: 3d 13:52
[Worker #1 Sep 6 20:55] Iteration: 3340000 / 39026909 [8.55%], ms/iter: 8.594, ETA: 3d 13:11
[Worker #4 Sep 6 20:55] Iteration: 3290000 / 39143207 [8.40%], ms/iter: 8.646, ETA: 3d 14:06
[Worker #2 Sep 6 20:55] Iteration: 3300000 / 39068641 [8.44%], ms/iter: 8.587, ETA: 3d 13:19
[Worker #3 Sep 6 20:56] Iteration: 3300000 / 39070231 [8.44%], ms/iter: 8.641, ETA: 3d 13:51
3 cores: Code:
[Worker #1 Sep 6 21:13] Iteration: 3480000 / 39026909 [8.91%], ms/iter: 7.507, ETA: 3d 02:07
[Worker #2 Sep 6 21:14] Iteration: 3440000 / 39068641 [8.80%], ms/iter: 7.485, ETA: 3d 02:04
[Worker #3 Sep 6 21:14] Iteration: 3440000 / 39070231 [8.80%], ms/iter: 7.487, ETA: 3d 02:05
[Worker #1 Sep 6 21:14] Iteration: 3490000 / 39026909 [8.94%], ms/iter: 7.509, ETA: 3d 02:07
[Worker #2 Sep 6 21:15] Iteration: 3450000 / 39068641 [8.83%], ms/iter: 7.488, ETA: 3d 02:04
[Worker #3 Sep 6 21:16] Iteration: 3450000 / 39070231 [8.83%], ms/iter: 7.488, ETA: 3d 02:05
2 cores: Code:
[Worker #2 Sep 6 21:27] Iteration: 3540000 / 39068641 [9.06%], ms/iter: 7.054, ETA: 69:37:10
[Worker #1 Sep 6 21:27] Iteration: 3590000 / 39026909 [9.19%], ms/iter: 7.044, ETA: 69:20:30
[Worker #2 Sep 6 21:28] Iteration: 3550000 / 39068641 [9.08%], ms/iter: 7.063, ETA: 69:41:12
[Worker #1 Sep 6 21:28] Iteration: 3600000 / 39026909 [9.22%], ms/iter: 7.053, ETA: 69:24:24
[Worker #2 Sep 6 21:29] Iteration: 3560000 / 39068641 [9.11%], ms/iter: 7.068, ETA: 69:42:46
[Worker #1 Sep 6 21:29] Iteration: 3610000 / 39026909 [9.25%], ms/iter: 7.052, ETA: 69:22:26
Throughput: Code:
4 / 8.62 ms/iter = 0.464 iter/ms
3 / 7.48 ms/iter = 0.401 iter/ms
2 / 7.06 ms/iter = 0.283 iter/ms
I repeat that my hardware is: Code:
Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin
#31
|
Jun 2003
5051₁₀ Posts
Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?
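For reference, the N workers × 1 thread experiment is set up in Prime95's local.txt. A sketch along these lines should do it, though the exact option names have varied between versions (older builds used ThreadsPerTest rather than CoresPerTest), so treat this as illustrative rather than verbatim:

```ini
; local.txt sketch: four single-threaded workers -> maximum memory contention
; (option names vary by Prime95 version; check undoc.txt for your build)
WorkerThreads=4
CoresPerTest=1
```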
#32
|
Serpentine Vermin Jar
Jul 2014
7×11×43 Posts
Quote:
The most dramatic example you could probably see is to configure a single worker using however many CPUs you have on your system (let's say 4 for the sake of argument). Assign a really small exponent in the 10M range to that worker while you watch the graphs of each individual core. What you *should* see (and what I see) is that the first core on that worker will use roughly all of its power, but the other 3 "helper" cores will use a noticeably smaller amount. In the case of a 10M exponent, it should be pretty obvious.

The theory I posed to Science_Man_88 a little earlier today was along these lines: the processors are able to work a lot quicker on a small exponent with a small FFT, which means you're more likely to run into the memory bottleneck, since the CPU is cruising along fairly fast. Is that true? I have no idea... it was just my theory to explain why I saw a more pronounced CPU idle with smaller exponents. The fewer cores I threw at the worker, the more of its full potential each core used, I guess because there's a point where memory and CPU are roughly balanced.

With larger exponents the CPU is doing much more work, even in tests I've done with up to 14 cores on a single CPU, to the point where the CPU might still be the limiting factor in such a setup. It's only when I start adding more cores on another physical chip, stressing the QPI link and overall memory bandwidth even more, that I see it really start to bog down. On my 14-core CPUs I would start to see the "helper" cores using noticeably less than 100% after adding 4-5 more cores on the other chip. Might be QPI being saturated, might be overall memory use... but the same end result, where I *think* the cores end up waiting on memory.

For what it's worth, I don't think Windows has any built-in way to measure raw memory bandwidth use, the way it does CPU cycles.
My gut tells me measuring it is possible, but capturing that kind of detail would probably slow things down. It's what I call the "quantum effect" of monitoring a server: just by monitoring some things, you're affecting the system itself. I try to avoid intrusive monitoring for that reason.
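Hardware counters (via tools like Intel PCM or VTune) are the proper way to read memory-controller traffic, but a crude effective-bandwidth probe is easy to sketch: time a copy of an array far larger than L3 cache and count the bytes moved. A minimal Python/NumPy sketch:

```python
import time
import numpy as np

# Crude STREAM-style probe: copy an array too large to fit in any cache
# level, then derive effective bandwidth from bytes moved per second.
N = 20_000_000                     # 8-byte doubles -> ~160 MB, well past L3
a = np.ones(N)
b = np.empty_like(a)

start = time.perf_counter()
b[:] = a                           # one read stream plus one write stream
elapsed = time.perf_counter() - start

bytes_moved = 2 * a.nbytes         # a read once, b written once
print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s effective copy bandwidth")
```

This measures a streaming copy, not the mixed access pattern of an FFT, so treat the number as a rough ceiling rather than what Prime95 actually achieves.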
#33
|
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2²·1,549 Posts
Those graphs are the OS time slice views and have absolutely nothing at all to do with memory bandwidth or bottlenecks. The OS can't halt the CPU when instructions are waiting for memory, it just doesn't work that way. The OS can't see into the process and decide to somehow insert a HLT instruction in the middle of a memory read instruction. The OS is just another program (albeit with higher privileges) and runs on the same core hardware as everything else. Only when some interrupt or exception happens does the OS get to run some code, but during normal program operation the OS isn't even running.
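This point can be demonstrated directly: a pointer-chasing loop spends most of its wall-clock time waiting on cache misses, yet Task Manager (or top) reports the core as 100% busy, because the stalled loads never yield the CPU to the scheduler. A small Python sketch (interpreter overhead masks much of the latency effect, but the utilization behaviour is the point):

```python
import random
import time

# Build one big random cycle, then chase it: every step is a dependent
# load at an unpredictable address, so the core mostly waits on memory,
# yet from the OS's point of view it is 100% busy the entire time.
N = 1_000_000
perm = list(range(N))
random.shuffle(perm)
nxt = [0] * N
for i in range(N):
    nxt[perm[i]] = perm[(i + 1) % N]   # single N-cycle over all slots

start = time.perf_counter()
node, steps = perm[0], 0
while True:
    node = nxt[node]                   # dependent, cache-hostile load
    steps += 1
    if node == perm[0]:
        break
elapsed = time.perf_counter() - start
print(f"{steps} dependent loads in {elapsed:.2f} s, core busy throughout")
```

Run it while watching the per-core graphs: the core shows full utilization even though it is mostly stalled, which is exactly why those graphs say nothing about memory bottlenecks.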
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Skylake vs Kabylake | ET_ | Hardware | 17 | 2017-05-24 16:19 |
| 768k Skylake Problem/Bug | Aurum | Software | 590 | 2017-05-19 10:03 |
| Skylake and RAM scaling | mackerel | Hardware | 34 | 2016-03-03 19:14 |
| Skylake processor | tha | Hardware | 7 | 2015-03-05 23:49 |
| Skylake AVX-512 | clarke | Software | 15 | 2015-03-04 21:48 |