mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   So does skylake-nonXeon actually get us anything? (https://www.mersenneforum.org/showthread.php?t=20403)

fivemack 2015-08-05 16:17

So does skylake-nonXeon actually get us anything?
 
Skylake has launched; were I to go mad this moment I could get a system shipped to me by Friday.

4 GHz base, turbo to 4.2 GHz; two channels of DDR4 memory support; a reasonably fancy GPU for which Windows 10 might have useful OpenCL drivers. No new instructions. No Crystal Well L4 cache. Not the slightest information provided about improvements to the execution pipeline.

Is this likely to have any advantage over i7-4790K?

pinhodecarlos 2015-08-05 17:27

[QUOTE=fivemack;407291]Sky Lake has launched, were I to go mad this moment I could get a system shipped to me by Friday.

4GHz turbo to 4.2GHz, two channels of DDR4 memory support, reasonably fancy GPU for which Windows 10 might have useful OpenCL drivers. No new instructions. No CrystalWell L4 cache. Not the slightest information provided about improvements to the execution pipeline.

Is this likely to have any advantage over i7-4790K?[/QUOTE]

Please ship one for me too.

[QUOTE=John P. Myers;98772]Looks like the final numbers are in. The Skylake 6700K beats a 4790K by 9% at stock speeds specifically on compute alone.[/QUOTE]

Xyzzy 2015-08-06 01:16

What is the maximum memory the chipset can address?

Mark Rose 2015-08-06 01:42

[QUOTE=Xyzzy;407314]What is the maximum memory the chipset can address?[/QUOTE]

As best I can tell, it's limited to 64 GB. 32 GB modules are coming onto the market, and I haven't found anything on whether the 64 GB limit is a hardware limit or simply a lack of testing with larger modules.

ATH 2015-08-06 05:17

Unfortunately Skylake has "only" dual-channel DDR4, so I guess Haswell-E with quad-channel DDR4 is still going to be faster for LL tests? Has anyone tested the difference between dual- and quad-channel memory for LL tests?

VictordeHolland 2015-08-06 09:35

[QUOTE=ATH;407323]Unfortunately Skylake has "only" dual channel DDR4, so I guess Haswell-E with quad-channel DDR4 are still going to be faster for LL test? Anyone tested the difference between dual- and quad channel memory for LL tests?[/QUOTE]
My i7 3770 can easily saturate dual channel DDR3-2133. Fast DDR4 memory will help, but I'm afraid you'll also run into memory bandwidth bottlenecks with Skylake.
In the end it is all about price/performance and on that front Skylake will probably win it over the more expensive Haswell-E.
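
As a back-of-envelope check on the bandwidth numbers being thrown around (my own sketch, not from any benchmark in this thread; sustained bandwidth is well below theoretical peak):

```python
def peak_bandwidth_gb_s(mt_per_s, channels, bus_bytes=8):
    """Theoretical peak DRAM bandwidth: transfer rate x 64-bit bus x channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

print(peak_bandwidth_gb_s(2133, 2))  # dual-channel DDR3-2133: ~34.1 GB/s
print(peak_bandwidth_gb_s(3000, 2))  # dual-channel DDR4-3000: 48.0 GB/s
print(peak_bandwidth_gb_s(2133, 4))  # quad-channel DDR4-2133 (Haswell-E): ~68.3 GB/s
```

So at the same memory clock, quad-channel Haswell-E has roughly double the headroom of dual-channel Skylake.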

ldesnogu 2015-08-08 11:27

[QUOTE=Xyzzy;407314]What is the maximum memory the chipset can address?[/QUOTE]
It seems to be 32GB: [URL]http://ark.intel.com/products/88195/Intel-Core-i7-6700K-Processor-8M-Cache-up-to-4_20-GHz[/URL]

ldesnogu 2015-08-08 17:29

[QUOTE=fivemack;407291]No new instructions.[/QUOTE]
Compared to Haswell there's ADX. That should be useful for MP computations.

[QUOTE]Not the slightest information provided about improvements to the execution pipeline.

Is this likely to have any advantage over i7-4790K?[/QUOTE]Some hints [URL="http://users.atw.hu/instlatx64/HSWvsBDWvsSKL.txt"]here[/URL]: FMA latency reduced by 1. VADD latency increased but throughput increased.

More data: [URL]http://instlatx64.atw.hu/[/URL]

Madpoo 2015-08-09 22:40

[QUOTE=fivemack;407291]Sky Lake has launched, were I to go mad this moment I could get a system shipped to me by Friday.

4GHz turbo to 4.2GHz, two channels of DDR4 memory support, reasonably fancy GPU for which Windows 10 might have useful OpenCL drivers. No new instructions. No CrystalWell L4 cache. Not the slightest information provided about improvements to the execution pipeline.

Is this likely to have any advantage over i7-4790K?[/QUOTE]

In case anyone cares, I dug through and found just this one GIMPS benchmark for a Skylake system. I guess we need more people with Skylakes to run benchmarks and check them in:
[URL="http://www.mersenne.org/report_benchmarks/?specific_cpu=4379196"]i5-6440HQ @ 2.9 GHz[/URL]

It's [I]just[/I] an i5-6440HQ @ 2.6 GHz, so compare accordingly (it was running at 2.9 GHz for the benchmark, probably stock turbo speed).

I was trying to find a comparable i5-5xxx at a similar clock speed... maybe this one:
[URL="http://www.mersenne.org/report_benchmarks/?specific_cpu=4378720"]i5-5257U @ 3 GHz[/URL]

Then there's this... it's a Haswell clocked faster (3.3 GHz) and it seemed to have similar timings to Skylake at the 2.9 GHz clock. Pretty cool:
[URL="http://www.mersenne.org/report_benchmarks/?specific_cpu=4379198"]i5-5675C @ 3.3 GHz[/URL]

pinhodecarlos 2015-08-10 20:28

One of the things concerning me is the TDP of Broadwell vs. Skylake, and hence how the power consumption of the two processors compares when crunching the same amount of work.

ldesnogu 2015-08-18 14:49

GMP is getting nice speedups for Skylake over Haswell:
[URL]https://gmplib.org/gmpbench.html[/URL]
[code] base app GMP Score
CPU freq mult div gcd gcd rsa pi bench /GHz
Core i5-6600 (Skylake 6MB L3) 3500 68173 63168 10973 7417 9030 62.7 5049 1443 2015-08-17
Core i7-4790 (Haswell 8MB L3) 4000 66464 59923 11336 7487 8294 62.0 4882 1220
Xeon E5-1650v2 (Ivy Bridge 12MB L3) 3500 52036 49885 9108 5946 6756 50.7 3956 1130 2015-04-30
Core i5 2500 (Sandy Bridge 6MB L3) 3300 44474 43235 7960 5265 5827 44.0 3425 1038 2015-05-01
Phenom 1090T (K10 6MB L3) 3200 42183 40877 6751 4579 5876 42.4 3257 1018 2015-05-26
Core i3 5010U (Broadwell 3MB L3) 2100 37245 35997 6179 4199 4973 35.9 2831 1348 2015-05-14[/code]Nice, but not enough to justify upgrading from Haswell on that basis alone ;)

Ken_g6 2015-08-27 00:34

Skylake might get us...smoldering piles of slag. :max:

OK, that's an exaggeration, but Prime95 [url=http://www.overclock.net/t/1571038/my-6700k-dead]seems to have killed this one[/url]. But it was overclocked and overvolted. So maybe we shouldn't do that with Skylake?

kladner 2015-08-27 03:16

[QUOTE=Ken_g6;408892]Skylake might get us...smoldering piles of slag. :max:

OK, that's an exaggeration, but Prime95 [URL="http://www.overclock.net/t/1571038/my-6700k-dead"]seems to have killed this one[/URL]. But it was overclocked and overvolted. So maybe we shouldn't do that with Skylake?[/QUOTE]

Last comment to the thread linked above when I looked:

I wouldn't use prime95 on any of my newer cpus... thats just asking for it.

People don't understand what it means to [I]really load[/I] a CPU. Nor do they understand the risks of pushing limits.

LaurV 2015-08-27 04:54

Well, I won't make an account there just to start the "brood war", but there are always some idiots out there in the wild who kill the messenger when the news delivered is bad.

They should just read the "stress.txt" file that comes with the P95 distribution, in particular the last Q&A:

[QUOTE="stress.txt"]Q) A forum member said "Don't bother with prime95, it always pukes on me,
and my system is stable!. What do you make of that?"

or

"We had a server at work that ran for 2 MONTHS straight, without a reboot
I installed Prime95 on it and ran it - a couple minutes later I get an error.
You are going to tell me that the server wasn't stable?"

A) These users obviously do not subscribe to the 100% rock solid
school of thought. THEIR MACHINES DO HAVE HARDWARE PROBLEMS.
But since they are not presently running any programs that reveal
the hardware problem, the machines are quite stable. As long as
these machines never run a program that uncovers the hardware problem,
then the machines will continue to be stable.[/QUOTE]

Madpoo 2015-08-27 19:12

[QUOTE=LaurV;408926]...These users obviously do not subscribe to the 100% rock solid
school of thought. THEIR MACHINES DO HAVE HARDWARE PROBLEMS.
But since they are not presently running any programs that reveal
the hardware problem, the machines are quite stable. As long as
these machines never run a program that uncovers the hardware problem,
then the machines will continue to be stable....[/QUOTE]

Well, I know for a fact that this logic is sound. Because if I don't look under the bed, the boogeyman isn't really there. But, man... I just know if I look, he'll be there. :smile:

tha 2015-08-27 19:51

Well, my current system is a dual-core Conroe system from 2006. When I bought the parts and assembled the system, it was all state of the art. I can't say that I need a new system, but I sold some of my blue chips a month ago and bought them back this week at a 23% lower price. They've gone up 12% since, so I decided to order some new parts.

I bought a 6700K processor priced at €400,- and 16 GB of 3200 MHz DDR4 priced at €180,-. And an Asus top-of-the-line motherboard with everything on it for €300,-. Not a cost-efficient system, but one that I hope will last as long as my Conroe system and will be as pleasant to work with. Now I need to find some time to assemble it. I will post the mprime results when I have it. I am not going to overclock it, just run the memory at the highest XMP setting.

ATH 2015-08-28 00:01

[QUOTE=Madpoo;408975]Well, I know for a fact that this logic is sound. Because if I don't look under the bed, the boogeyman isn't really there. But, man... I just know if I look, he'll be there. :smile:[/QUOTE]

[URL="https://www.youtube.com/watch?v=wYzSHynBeh4&t=4s"]AHHHHHHHHHHHHHH BOOGEYMAN![/URL]

ldesnogu 2015-09-02 14:45

[QUOTE=ldesnogu;407458]It seems to be 32GB: [URL]http://ark.intel.com/products/88195/Intel-Core-i7-6700K-Processor-8M-Cache-up-to-4_20-GHz[/URL][/QUOTE]
Intel fixed the page, it's now saying 64GB.

nucleon 2015-09-05 00:15

I'm really disappointed with the last few generations of CPUs. And generally I'm an Intel fan.

I get the impression that even if AVX-512 were available in these CPUs, it would essentially be of no use, as RAM wouldn't be able to keep up with it.

I've seen +30% memory benchmarks on previous generations, but we already had memory starvation with four cores in use.

It'll be interesting to see, on Skylake with DDR4, how 4 loaded cores compare with 3, and where memory starvation sets in.

I'm curious to see when we get cpus with more advanced memory than DDRx incarnations.

I've seen this link

[url]http://www.extremetech.com/gaming/203210-leaked-details-if-true-point-to-potent-amd-zen-cpu[/url]

To me it looks like someone's wish list, but hey, if it happens, that will be awesome. (TL;DR: the link describes an AMD CPU with HBM plans)

-- Craig

Mark Rose 2015-09-05 01:29

I always wondered how the i7-5775C would perform with the 128 MB of cache. It's a shame there's no equivalent Skylake: socketed with huge eDRAM.

ATH 2015-09-05 06:31

Any idea when the Skylake Xeons with the new instructions are supposed to be available?

Madpoo 2015-09-05 06:42

[QUOTE=ATH;409632]Any idea when the Skylake Xeon with the new instruction is supposed to be available?[/QUOTE]

I was thinking it was due sometime early in 2016, but I don't remember where I read that.

I've got my fingers crossed that it'll be out and in shipping products by the time my hardware budget comes around, so I can (hopefully) get one of those instead, but the timing probably won't work out. Then again, maybe... it's all a matter of whether the ProLiant Gen9s can accept one of those or not. I forget whether they were supposed to share the same socket as the Xeon E5 v3's.

patrik 2015-09-05 12:20

[QUOTE=nucleon;409617]It'll be interesting on skylake with DDR4, how 4 cores loaded vs 3 cores, and see memory starvation.
[/QUOTE]
With the double-checks I was assigned on this new (yesterday) computer. First using all four cores:
[CODE][Worker #1 Sep 5 13:02] Iteration: 9310000 / 39026909 [23.85%], ms/iter: 11.193, ETA: 3d 20:23
[Worker #4 Sep 5 13:03] Iteration: 9280000 / 39143207 [23.70%], ms/iter: 11.209, ETA: 3d 20:59
[Worker #3 Sep 5 13:03] Iteration: 9300000 / 39070231 [23.80%], ms/iter: 11.185, ETA: 3d 20:29
[Worker #2 Sep 5 13:03] Iteration: 9290000 / 39068641 [23.77%], ms/iter: 11.197, ETA: 3d 20:37
[Worker #1 Sep 5 13:04] Iteration: 9320000 / 39026909 [23.88%], ms/iter: 11.176, ETA: 3d 20:13
[Worker #4 Sep 5 13:05] Iteration: 9290000 / 39143207 [23.73%], ms/iter: 11.210, ETA: 3d 20:57
[Worker #3 Sep 5 13:05] Iteration: 9310000 / 39070231 [23.82%], ms/iter: 11.166, ETA: 3d 20:18[/CODE]
Then with three cores:
[CODE][Worker #1 Sep 5 13:24] Iteration: 9350000 / 39026909 [23.95%], ms/iter: 8.529, ETA: 70:18:44
[Worker #2 Sep 5 13:26] Iteration: 9330000 / 39068641 [23.88%], ms/iter: 8.573, ETA: 70:49:06
[Worker #3 Sep 5 13:26] Iteration: 9340000 / 39070231 [23.90%], ms/iter: 8.528, ETA: 70:25:49
[Worker #1 Sep 5 13:26] Iteration: 9360000 / 39026909 [23.98%], ms/iter: 8.528, ETA: 70:16:50
[Worker #2 Sep 5 13:27] Iteration: 9340000 / 39068641 [23.90%], ms/iter: 8.587, ETA: 70:54:27
[Worker #3 Sep 5 13:27] Iteration: 9350000 / 39070231 [23.93%], ms/iter: 8.544, ETA: 70:31:56[/CODE]
Only when I reduce to two cores, I can see the throughput start dropping:
[CODE][Worker #1 Sep 5 13:37] Iteration: 9440000 / 39026909 [24.18%], ms/iter: 7.186, ETA: 59:03:31
[Worker #2 Sep 5 13:39] Iteration: 9420000 / 39068641 [24.11%], ms/iter: 7.224, ETA: 59:29:35
[Worker #1 Sep 5 13:39] Iteration: 9450000 / 39026909 [24.21%], ms/iter: 7.182, ETA: 59:00:22
[Worker #2 Sep 5 13:40] Iteration: 9430000 / 39068641 [24.13%], ms/iter: 7.243, ETA: 59:37:48
[Worker #1 Sep 5 13:40] Iteration: 9460000 / 39026909 [24.23%], ms/iter: 7.201, ETA: 59:08:29
[Worker #2 Sep 5 13:41] Iteration: 9440000 / 39068641 [24.16%], ms/iter: 7.231, ETA: 59:30:32[/CODE]
Throughput:
[CODE]4 / 11.2 ms/iter = 0.36 iter/ms
3 /  8.5 ms/iter = 0.35 iter/ms
2 /  7.2 ms/iter = 0.28 iter/ms[/CODE]
My hardware is:
[CODE]Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin[/CODE]but I just realized that I didn't enable the memory's XMP profile, just the default, so I'm running it at 2133 MHz. I will post again if and when I get the memory running at full speed.
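
(The aggregate figures above are just the number of workers divided by the per-worker ms/iter; a trivial check in Python:)

```python
def aggregate_iter_per_ms(workers, ms_per_iter):
    """Total LL throughput across identical workers, in iterations per ms."""
    return workers / ms_per_iter

for w, ms in [(4, 11.2), (3, 8.5), (2, 7.2)]:
    print(f"{w} workers: {aggregate_iter_per_ms(w, ms):.2f} iter/ms")
# 0.36 vs 0.35 vs 0.28 -- dropping from 4 to 3 workers costs almost nothing here.
```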

fivemack 2015-09-05 13:36

I'm pretty sure Skylake Xeon will need a new socket, because leaked slides like [url]http://www.extremetech.com/computing/206659-future-skylake-xeons-could-pack-up-to-28-cores-6-memory-channels[/url] suggest that it has six-channel DDR4. I am saving up for a dual Skylake Xeon box in early 2017 to replace my 48-core Opteron from May 2011 - the Opteron is still quite a capable machine, with 2.5x the performance of an i7/4790 at 5x the power, but I'd expect 24 cores of Skylake to be twice that performance at less absolute power.

Madpoo 2015-09-05 17:14

[QUOTE=patrik;409646]Throughput:
[CODE]4 / 11.2 ms/iter = 0.36 iter/ms
3 /  8.5 ms/iter = 0.35 iter/ms
2 /  7.2 ms/iter = 0.28 iter/ms[/CODE]
[/QUOTE]

That suggests to me that, with optimal throughput in your case coming at 3 workers, you could set up one of the workers to use 2 cores, and the other 2 workers to use just one core each?

This reminds me that I really need to sit myself down sometime and work through the actual throughputs on a 6/8/10/14 core chip and see just where the sweet spots are. Right now I throw all of the cores on one CPU at a single worker so I'm not worried about memory thrashing between multiple workers, but I know I'm leaving a little bit of throughput out that way. I can see that the CPU's aren't totally maxed (but they are close).

You might see the same thing with 4 workers... are your cores at 100% each, or are they slightly under as I'd expect? Maybe by just 1% or less under the max so it could be hard to really tell.

Prime95 2015-09-05 20:32

[QUOTE=Madpoo;409684]That suggests to me that with optimal throughput in your case being 3 workers, you could setup one of the workers to possibly use 2 cores, and the other 2 workers using just one core each?[/quote]

Unlikely. One worker using two threads creates (nearly) as much memory contention as two workers.

[quote]are your cores at 100% each, or are they slightly under as I'd expect? Maybe by just 1% or less under the max so it could be hard to really tell.[/QUOTE]

The OS CPU utilization figure cannot measure memory contention. CPU utilization is based on the time slices the OS gives to each process. Memory delays happen at far too fine a granularity to be measured by the OS.
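
The point is easy to demonstrate outside of Prime95 (a standalone illustration, nothing to do with Prime95's actual code): a single thread striding through a buffer far larger than the L3 cache spends part of its time stalled on DRAM, yet the OS accounts those stalled cycles as busy, so process time stays essentially equal to wall time:

```python
import time

buf = bytearray(64 * 1024 * 1024)      # 64 MB, far larger than any L3 cache
wall0, cpu0 = time.monotonic(), time.process_time()
total = 0
for i in range(0, len(buf), 64):       # touch one byte per cache line
    total += buf[i]
wall = time.monotonic() - wall0
cpu = time.process_time() - cpu0
# Memory stalls and all, the OS still bills nearly every cycle as busy:
print(f"utilization as the OS sees it: {cpu / wall:.0%}")
```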

henryzz 2015-09-05 22:03

Could memory delays be measured within Prime95?

Prime95 2015-09-05 22:09

[QUOTE=henryzz;409708]Could memory delays be measured within Prime95?[/QUOTE]

No.

There probably is a performance counter to measure that. But I believe reading performance counters is a privileged operation.

Madpoo 2015-09-06 17:35

[QUOTE=Prime95;409702]Unlikely. One worker using two threads creates (nearly) as much memory contention as two workers.[/QUOTE]

In my tests, I've seen some improvement adding CPU cores even when I'm (nearly) sure that memory contention was already at a peak level. Maybe it's related to L2/L3 cache? No idea.

[QUOTE=Prime95;409702]The OS CPU utilization figure cannot measure memory contention. CPU utilization is based on the time slices the OS gives to each process. Memory delays happen at far too fine a granularity to be measured by the OS.[/QUOTE]

Also, going from my own experience, which may or may not apply to anyone else: when memory contention got too high, what I saw was the CPU usage by Prime95 drop (sometimes dramatically), as an effect of the CPU idling more often while waiting on memory.

I'm guessing it's memory contention by inference, since otherwise Prime95 would presumably keep chugging along and keeping those cycles going?

Based on everything I've encountered with memory lately, it makes me hopeful that new things like HBM, memory cubes, blah blah, will have an amazing effect on apps like Prime95 that really need that kind of bandwidth.

All we have to do is sit around and wait for those shiny things to show up on our desktops, right?

patrik 2015-09-06 19:52

[QUOTE=patrik;409646]I will post again if and when I get the memory running at full speed.[/QUOTE]
It took me some time because I had to upgrade the BIOS. Before that I couldn't get the memory to run faster than 2500 MHz without getting hardware errors from mprime. Now it runs at 3000 MHz without errors.

4 cores:
[CODE][Worker #4 Sep 6 20:53] Iteration: 3280000 / 39143207 [8.37%], ms/iter: 8.641, ETA: 3d 14:04
[Worker #2 Sep 6 20:54] Iteration: 3290000 / 39068641 [8.42%], ms/iter: 8.590, ETA: 3d 13:22
[Worker #3 Sep 6 20:54] Iteration: 3290000 / 39070231 [8.42%], ms/iter: 8.640, ETA: 3d 13:52
[Worker #1 Sep 6 20:55] Iteration: 3340000 / 39026909 [8.55%], ms/iter: 8.594, ETA: 3d 13:11
[Worker #4 Sep 6 20:55] Iteration: 3290000 / 39143207 [8.40%], ms/iter: 8.646, ETA: 3d 14:06
[Worker #2 Sep 6 20:55] Iteration: 3300000 / 39068641 [8.44%], ms/iter: 8.587, ETA: 3d 13:19
[Worker #3 Sep 6 20:56] Iteration: 3300000 / 39070231 [8.44%], ms/iter: 8.641, ETA: 3d 13:51[/CODE]
3 cores:
[CODE][Worker #1 Sep 6 21:13] Iteration: 3480000 / 39026909 [8.91%], ms/iter: 7.507, ETA: 3d 02:07
[Worker #2 Sep 6 21:14] Iteration: 3440000 / 39068641 [8.80%], ms/iter: 7.485, ETA: 3d 02:04
[Worker #3 Sep 6 21:14] Iteration: 3440000 / 39070231 [8.80%], ms/iter: 7.487, ETA: 3d 02:05
[Worker #1 Sep 6 21:14] Iteration: 3490000 / 39026909 [8.94%], ms/iter: 7.509, ETA: 3d 02:07
[Worker #2 Sep 6 21:15] Iteration: 3450000 / 39068641 [8.83%], ms/iter: 7.488, ETA: 3d 02:04
[Worker #3 Sep 6 21:16] Iteration: 3450000 / 39070231 [8.83%], ms/iter: 7.488, ETA: 3d 02:05[/CODE]
2 cores:
[CODE][Worker #2 Sep 6 21:27] Iteration: 3540000 / 39068641 [9.06%], ms/iter: 7.054, ETA: 69:37:10
[Worker #1 Sep 6 21:27] Iteration: 3590000 / 39026909 [9.19%], ms/iter: 7.044, ETA: 69:20:30
[Worker #2 Sep 6 21:28] Iteration: 3550000 / 39068641 [9.08%], ms/iter: 7.063, ETA: 69:41:12
[Worker #1 Sep 6 21:28] Iteration: 3600000 / 39026909 [9.22%], ms/iter: 7.053, ETA: 69:24:24
[Worker #2 Sep 6 21:29] Iteration: 3560000 / 39068641 [9.11%], ms/iter: 7.068, ETA: 69:42:46
[Worker #1 Sep 6 21:29] Iteration: 3610000 / 39026909 [9.25%], ms/iter: 7.052, ETA: 69:22:26[/CODE]
Throughput:
[CODE]4 / 8.62 ms/iter = 0.464 iter/ms
3 / 7.48 ms/iter = 0.401 iter/ms
2 / 7.06 ms/iter = 0.283 iter/ms[/CODE]
This time all four cores seem to be useful, at least when running double-checks.

I repeat that my hardware is:
[CODE]Intel Core i5 6600K / 3.5 GHz processor S-1151
ASUS Z170-P S-1151 ATX
Kingston HyperX Predator 16GB 3000MHz DDR4 DIMM 288-pin[/CODE]

axn 2015-09-07 05:22

[QUOTE=Madpoo;409740]Also, going from my own experience which may or may not apply to anyone else, when mem contentiongot too high, what I saw was the CPU usage by Prime95 drop (sometimes dramatically) as an effect of the CPU idling more often when waiting on memory.[/QUOTE]

Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?

Madpoo 2015-09-07 05:47

[QUOTE=axn;409762]Memory contention or threading waits? Did you see it when running N workers x 1 thread (which has the max memory contention)?[/QUOTE]

Let me put it this way, which may help if anyone else wants to try to reproduce it on their own machine...

The most dramatic example you could probably see is to configure a single worker using however many CPUs you have on your system (let's say 4 for the sake of argument).

Assign a really small exponent to that worker, in the 10M range, while you look at the graphs of each individual core. What you *should* see (and what I see) is that the first core on that worker will use roughly all of its power, but the other 3 "helper" cores will use a noticeably smaller amount. In the case of a 10M exponent, it should be pretty obvious.

The theory I posed to Science_Man_88 a little earlier today was along the lines of: The processors are able to work a lot quicker on the small exponent with a small FFT, which means that you're more likely to run into the memory bottleneck since the CPU is cruising along fairly fast.

Is that true? I have no idea... it was just my theory to explain why I saw a more pronounced CPU idle with smaller exponents. The fewer cores I threw at the worker meant each CPU was using more of its full potential, I guess because there's a point where memory and CPU are roughly balanced.

With larger exponents the CPU is doing much more work, even in tests I've done with up to 14 cores on a single CPU, to the point where the CPU might still be the limiting factor in such a setup. It's only when I start adding more cores on another physical chip, stressing the QPI link and overall memory bandwidth even more, that I see it really start to bog down.

On my 14-core CPU's I would start to see the "helper" cores using noticeably less than 100% after adding 4-5 more cores on the other chip. Might be QPI being saturated, might be the overall memory use...but same end result where I *think* the cores end up waiting on memory.

For what it's worth, I don't think Windows has any way to measure raw memory bandwidth use. At least, I'm not aware of any built-in way to see that, as there is with CPU cycles. My gut tells me that doing so is possible, but measuring that kind of detail would probably slow things down. It's what I call the "quantum effect" of monitoring a server: just by monitoring some things, you're affecting the system itself. I try to avoid intrusive monitoring for that reason. :smile:

retina 2015-09-07 14:35

[QUOTE=Madpoo;409765]... look at the graphs of each individual core.[/QUOTE]Those graphs are the OS time slice views and have absolutely nothing at all to do with memory bandwidth or bottlenecks. The OS can't halt the CPU when instructions are waiting for memory, it just doesn't work that way. The OS can't see into the process and decide to somehow insert a HLT instruction in the middle of a memory read instruction. The OS is just another program (albeit with higher privileges) and runs on the same core hardware as everything else. Only when some interrupt or exception happens does the OS get to run some code, but during normal program operation the OS isn't even running.

Madpoo 2015-09-07 18:04

[QUOTE=retina;409787]Those graphs are the OS time slice views and have absolutely nothing at all to do with memory bandwidth or bottlenecks. The OS can't halt the CPU when instructions are waiting for memory, it just doesn't work that way. The OS can't see into the process and decide to somehow insert a HLT instruction in the middle of a memory read instruction. The OS is just another program (albeit with higher privileges) and runs on the same core hardware as everything else. Only when some interrupt or exception happens does the OS get to run some code, but during normal program operation the OS isn't even running.[/QUOTE]

I know all that. But the graphs *do* indicate how much each core is being used by the system including all programs. I think you misunderstood what I'm saying when I mentioned that the CPU % used on each core isn't running at 100%. All I mean is that it's a clear indication that Prime95 itself isn't so much CPU bound at that point, but rather it's bottlenecked by something else, which is almost certainly the memory.

Even on a new system when I'm burning it in and Prime95 is literally the only non-OS thing running, this is the case.

If Prime95 is in the middle of some LL iteration and it involves reading/writing a large chunk of RAM, then it will by definition be memory bound at that point, and any further execution in the program will, by necessity, be on hold until that's done. If that memory access happens to fit into L2/L3 cache then it will finish fairly quickly, but even then there's still latency involved in L2/L3 cache coherence and any cache read misses or write-throughs to the memory controller.

It's those cases where it has to access main memory and the latency involved where I suspect we're seeing the most memory related bottlenecks and the LL test stalls for a bit while waiting for those ops to complete. It's expected with large datasets and not out of the ordinary.

My only point was to what degree the CPU is stalled during those times. It may be sloppy of me to say the CPU is stalled because it's really the Prime95 execution thread that's stalled. If I had another bunch of things running on this system that needed some CPU cycles to execute (web server, SQL, whatever) it would happily use those cycles if needed. Is that the confusing part of my terminology, that you thought I was saying the CPU itself is being halted or something? Because I didn't mean to imply that.

Prime95 2015-09-07 19:40

[QUOTE=Madpoo;409765]
Assign a really small exponent to that worker, in the 10M range, while you look at the graphs of each individual core. What you *should* see (and what I see) is that the first core on that worker will use roughly all of its power, but the other 3 "helper" cores will use a noticeably smaller amount. In the case of a 10M exponent, it should be pretty obvious.

Is that true? I have no idea... it was just my theory to explain why I saw a more pronounced CPU idle with smaller exponents. The fewer cores I threw at the worker meant each CPU was using more of its full potential, I guess because there's a point where memory and CPU are roughly balanced.[/QUOTE]

In this case the OS is measuring prime95's shoddy multithreading code. Prime95 breaks the task into several decent sized chunks, submits them to the worker threads, waits for them all to finish -- repeat. What you are seeing is that all worker threads won't finish chunks at the same time causing idleness. Worse, if you have say 10 chunks to do on 4 workers, then 2 workers do three chunks and two workers do two chunks --- i.e. two workers are 33% idle.

Madpoo 2015-09-08 01:29

[QUOTE=Prime95;409814]In this case the OS is measuring prime95's shoddy multithreading code. Prime95 breaks the task into several decent sized chunks, submits them to the worker threads, waits for them all to finish -- repeat. What you are seeing is that all worker threads won't finish chunks at the same time causing idleness. Worse, if you have say 10 chunks to do on 4 workers, then 2 workers do three chunks and two workers do two chunks --- i.e. two workers are 33% idle.[/QUOTE]

Oh... weird. :smile: Is that the type of thing that would be more pronounced with smaller exponents? And is there anything that could be done to improve the way it works, like having an equal # of chunks as worker threads?

I understand that the LL process is sequential and any multithreading is the type of thing that has to happen within a single iteration which limits things a bit. Given a best case scenario (no other apps running, leaving as much of the horsepower to Prime95 as possible), each core/thread should finish its work at the same time as every other one?

And is it the role of the first thread to distribute those chunks and collect the results and prep for the next step? If so, I guess there is indeed a case to be made to make that primary thread run on something besides core zero which also has to deal with interrupts on most systems. Maybe give that core the lightest possible load. :smile:

Prime95 2015-09-08 01:42

[QUOTE=Madpoo;409838]Is that the type of thing that would be more pronounced with smaller exponents?[/quote]

Likely. With larger FFTs there are more chunks to process; thus, percentage-wise, there is less wastage when an odd number of chunks is distributed (e.g. 50 chunks on 4 threads vs. 10 chunks on 4 threads = 8% vs. 33% waste).
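
Those waste figures follow from simple round-robin accounting (a sketch of the arithmetic only, not of prime95's actual scheduler): with C chunks on T threads the work takes ceil(C/T) rounds, and the underloaded threads only work floor(C/T) of them:

```python
from math import ceil, floor

def worker_idle_fraction(chunks, threads):
    """Idle fraction of an underloaded worker when chunks don't divide evenly."""
    if chunks % threads == 0:
        return 0.0
    return 1 - floor(chunks / threads) / ceil(chunks / threads)

print(round(worker_idle_fraction(10, 4), 2))  # 0.33 -- the "33% waste" case
print(round(worker_idle_fraction(50, 4), 2))  # 0.08 -- the "8% waste" case
```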

[quote]
And is there anything that could be done to improve the way it works, like having an equal # of chunks as worker threads?
[/quote]

Maybe, but multi-threading optimization is not a high priority item for me. Ernst's mlucas seems to do a better job with multi-threading. I'm not sure if that's because he has superior methods or some other reason.

[quote]Given a best case scenario (no other apps running, leaving as much of the horsepower to Prime95 as possible), each core/thread should finish its work at the same time as every other one?[/quote]

Assuming the number of chunks is a multiple of the number of threads and no lock contention issues arise, then they should finish very close to the same time.

[quote]And is it the role of the first thread to distribute those chunks and collect the results and prep for the next step?[/QUOTE]

Yes.


All times are UTC. The time now is 23:25.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.