mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

s1riker 2016-01-15 20:00

[QUOTE=pegnose;422584]Re-reading this thread...

others have experienced freezes as well, which seem to be cured by bios 1402:
[URL]http://www.mersenneforum.org/showpost.php?p=420747&postcount=206[/URL]

But it is discussed as possibly being indicative of a different bug:
[URL]http://www.mersenneforum.org/showpost.php?p=420954&postcount=256[/URL][/QUOTE]

Regarding the second link and the hard lockup, I'm not so sure it's related to this AVX bug (my 6700k system experiences both). I've been participating in that thread extensively and from the looks of it, it seems like either a chipset or Skylake IMC bug/incompatibility. Probably growing pains with DDR4. The reason I say that is because many people have had luck in at least increasing stability by increasing either VCC or VCCIO or both. I haven't been able to get rid of the hard lock completely yet, but in playing with timings and voltages I have been able to reduce the frequency of hard locking from about once per day to once per week. Because of how infrequently it happens now, trying new things and seeing if it makes any differences is taking a long time.

Oh, and somewhat interestingly, it almost always happens when the system is idle (i.e. no interrupts from user input).

chalsall 2016-01-15 21:08

[QUOTE=s1riker;422585]Oh and somewhat interestingly, it almost always happens when the system is idle (ie. no interrupts from user input).[/QUOTE]

Then you are dealing with a very different situation.

Until you have a nominal stable system you should go back and try to figure out what's wrong.

s1riker 2016-01-15 21:54

[QUOTE=chalsall;422595]Then you are dealing with a very different situation.

Until you have a nominal stable system you should go back and try to figure out what's wrong.[/QUOTE]

Yeah, I agree. The point was that this instability is affecting quite a few people with a whole range of motherboards and RAM modules, so it points to some CPU/chipset bug that has yet to be fixed. I've been building systems for years and I've never had this much trouble getting a new build stable (I've been trying since August). Anyhow, I just wanted to point out that this is likely another issue entirely and not tied to the AVX bug discussed in this thread.

chalsall 2016-01-15 22:06

[QUOTE=s1riker;422609]Yeah I agree. The point was this instability is affecting quite a few people with a whole range of motherboards and RAM modules, so it points to some CPU/chipset bug that is yet to be fixed. I've been building systems for years and I've never had this much trouble getting a new build to be stable (been trying since August). Anyhow just wanted to point out that this is likely another issue entirely and not tied to the AVX bug discussed in this thread.[/QUOTE]

Please forgive me for this, but I don't think you have a clue about what you are talking about.

If I'm wrong, then please present testable evidence to the contrary.

VBCurtis 2016-01-15 23:08

[QUOTE=chalsall;422611]Please forgive me for this, but I don't think you have a clue about what you are talking about.

If I'm wrong, then please present testable evidence to the contrary.[/QUOTE]

The evidence has already been presented that it's likely not linked to the 768k bug: Increasing voltages reduces the incidence of the lockup. Also, it happens at idle.

Why are you attacking him?

chalsall 2016-01-15 23:18

[QUOTE=VBCurtis;422626]The evidence has already been presented that it's likely not linked to the 768k bug: Increasing voltages reduces the incidence of the lockup. Also, it happens at idle.[/QUOTE]

No, it hasn't been presented. The evidence presented suggests that neither clocks nor voltages have any influence on the bug found.

[QUOTE=VBCurtis;422626]Why are you attacking him?[/QUOTE]

I'm not "attacking". I'm asking for supporting evidence of a claim.

chalsall 2016-01-15 23:44

[QUOTE=s1riker;422609]The point was this instability is affecting quite a few people with a whole range of motherboards and RAM modules, so it points to some CPU/chipset bug that is yet to be fixed. I've been building systems for years and I've never had this much trouble getting a new build to be stable (been trying since August).[/QUOTE]

So, I'm now under attack for questioning you...

My first question is, are you overclocking any of your systems and/or sub-systems?

My second question is, are you being methodical in your testing?

pegnose 2016-01-16 12:21

[QUOTE=s1riker;422585]Regarding the second link and the hard lockup, I'm not so sure it's related to this AVX bug (my 6700k system experiences both). I've been participating in that thread extensively and from the looks of it, it seems like either a chipset or Skylake IMC bug/incompatibility. Probably growing pains with DDR4. The reason I say that is because many people have had luck in at least increasing stability by increasing either VCC or VCCIO or both. I haven't been able to get rid of the hard lock completely yet, but in playing with timings and voltages I have been able to reduce the frequency of hard locking from about once per day to once per week. Because of how infrequently it happens now, trying new things and seeing if it makes any differences is taking a long time.

Oh and somewhat interestingly, it almost always happens when the system is idle (ie. no interrupts from user input).[/QUOTE]


Thank you so much, this is exactly what I am dealing with. I got my system stable, at least for memtest86, by
- disabling the two unused DRAM slots
- enabling MCH Full Check (ASUS)
- setting the DRAM voltage tolerance to 110%

But I still experience a freeze during Prime95. Yes, testing is a PITA. Currently I am trying 16-16-16-40 (instead of 39), which should be compatible in my case according to the manual.

From what I have heard, a hard lock during idle could be a third problem. I had this when I had ASPM enabled for the link between the SA and the PCH. Others suggested the PSU being incompatible with Haswell C-states (6/7). You could try disabling all power-saving options altogether.

s1riker 2016-01-16 18:39

[QUOTE=chalsall;422632]So, I'm now under attack for questioning you...

My first question is, are you overclocking any of your systems and/or sub-systems?

My second question is, are you being methodical in your testing?[/QUOTE]

Hey Chalsall,

I've gathered a sense of what your personality is like from reading this thread, so I won't take any attacks personally :) Since you asked, here's a dump of everything I know/tried:

1. The hang happens whether I'm overclocking the core CPU frequency or not; strangely, it seems to happen less when it is overclocked to 4.5GHz.
2. The hang happens whether I'm running the RAM at the default 2133MHz or at my RAM's XMP settings, which are 3000MHz (15-15-15-35).
3. The hang never happens while I'm using the PC. I can game on it for hours and run the Prime95 torture test for over 12h.
4. I've run HCI memtest for 24 hours to ensure there are no RAM errors. All clean.
5. The hang always happens after the PC is left idle (sleep is turned off as I like to have my PC available for RDP at all times).
6. When hung, the PC must be hard shut down; the reset button does not work.
7. I've already tried a different set of RAM sticks (although same manufacturer and same rated speed), a different PSU, onboard video (instead of my 970GTX), and the motherboard was RMA'ed. Nothing made any difference; it still hangs.
8. I've tried all available BIOS updates for my motherboard, including beta ones. The newest beta does seem to reduce the frequency of the hang, but not completely.

That's all I recall for now.

chalsall 2016-01-16 19:14

[QUOTE=s1riker;422707]I've gathered a sense of what your personality is like from reading this thread, so I won't take any attacks personally :)[/QUOTE]

Great. There's a reason I'm called "Sheldon" by my friends... :smile:

[QUOTE=s1riker;422707]5. [B]Hang always happens after PC is left idle (sleep is turned off as I like to have my PC available for RDP at all times)[/B]
8. I've tried all available BIOS updates for my motherboard, including beta ones. The newest beta does seem to reduce the frequency of the hang, but not completely.[/QUOTE]

Interesting...

Are you able to replicate this on any other machines?

Madpoo 2016-01-16 19:30

[QUOTE=pegnose;422682]...
But I still experience a freeze during Prime95. Yes, testing is a PITA. Currently I try setting 16-16-16-40 (instead of 39), which should be compatible in my case according to the manual...[/QUOTE]

It can be hard to say just why Prime95 would lock a machine, but bear in mind that Prime95 is doing what it's supposed to... stressing the system so that if it has a problem, it will manifest itself.

Sounds like you're doing all the right things by focusing on different things that could be the problem until you finally (and hopefully) figure out what causes it.

One thing I'm not sure if you've tried is to underclock your system and see if it's stable. That might just help you verify if there are thermal issues, or maybe power supply related things. Same with the memory.

Madpoo 2016-01-16 19:39

[QUOTE=s1riker;422707]Hey Chalsall,

I've gathered a sense of what your personality is like from reading this thread, so I won't take any attacks personally :) [/QUOTE]

He's really a lovable old grump, deep down.

Regarding your issue, I know it can be a pain in the butt to nail down what's causing hanging issues, but with the servers I work with, my basic approach is to remove *everything* except the bare essentials needed to get the system booted. That might mean taking out all but a single stick of RAM, maybe just one CPU on a dual socket motherboard, taking out any add-on cards, etc. etc.

Then set the BIOS settings to stock/default. Give it a spin and see if it works any better. If it still fails I could swap the one mem module with another and try again, in case I happened to keep the one bad stick. Changing the BIOS settings to some low power/efficient mode might also be useful to see the effect.

If it works great, then yay, I can start adding things back in one at a time until I find the culprit.

Worst case scenario is when multiple things are bad (2 mem modules, or some funky interaction with different PCI cards, for example), but that's not common.

Oh, I guess another worst case is a bad power supply, but the servers I use have dual supplies and have never caused an issue, but on desktops it's more likely.
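The elimination procedure described above is essentially a linear search over components: strip to a known-minimal configuration, then reintroduce parts one at a time until the failure reappears. A minimal sketch of that logic, where the hypothetical `still_fails` callback stands in for actually running a stress test (Prime95, memtest) on a given configuration:

```python
def isolate_faulty_part(parts, still_fails, essentials=()):
    """Strip the machine to its essentials, then re-add parts one at a
    time; the first configuration that fails again implicates the part
    just added. still_fails(config) models running a stress test on
    that configuration and reporting whether the hang reoccurs."""
    config = set(essentials)
    if still_fails(config):
        return None  # fails even in the minimal config: suspect the
                     # essentials themselves (board, CPU, PSU)
    for part in parts:
        config.add(part)
        if still_fails(config):
            return part  # last part added is the prime suspect
    return None  # never failed: possibly an interaction between parts

# Toy usage: pretend the second RAM stick is the bad component.
suspect = isolate_faulty_part(
    ["dimm_b", "gpu", "sound_card"],
    still_fails=lambda cfg: "dimm_b" in cfg,
    essentials=["cpu", "dimm_a"],
)
```

As noted above, this breaks down in the worst case (two bad modules, or a funky interaction between cards), since a single-culprit search can't isolate a failure that needs two parts present at once.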

pegnose 2016-01-16 20:04

[QUOTE=pegnose;422682]Thank you so much, this is exactly what I am dealing with. I got my system stable at least for memtest86 by
- disabling the two unused dram slots
- enabling MCH Full Check (ASUS)
- setting dram voltage tolerance to 110%

But I still experience a freeze during Prime95. Yes, testing is a PITA. Currently I try setting 16-16-16-40 (instead of 39), which should be compatible in my case according to the manual.[/QUOTE]


Looks good so far: Prime95 ran for over 8h for the first time! I even managed to reproduce the 768k bug. :)

pegnose 2016-01-16 20:10

[QUOTE=s1riker;422707]Hey Chalsall,

I've gathered a sense of what your personality is like from reading this thread, so I won't take any attacks personally :) Since you asked, here's a dump of everything I know/tried:

1. Hang happens whether overclocking core CPU frequency or not, strangely, it seems to happen less when it is overclocked to 4.5GHz.
2. Hang happens whether I'm running RAM at default 2133MHz or my RAM's XMP settings which are 3000MHz (15-15-15-35).
3. Hang never happens while I'm using the PC. I can game on it for hours, run Prime95 torture test for over 12h.
4. I've run HCI memtest for 24 hours to ensure there are no RAM errors. All clean.
5. Hang always happens after PC is left idle (sleep is turned off as I like to have my PC available for RDP at all times)
6. When hung, PC must be hard shut down, reset button does not work
7. I've already tried a different set of RAM sticks (although same manufacturer and same rated speed), different PSU, onboard video (instead of my 970GTX) and the motherboard was RMA'ed. Nothing made any different, still hangs.
8. I've tried all available BIOS updates for my motherboard, including beta ones. The newest beta does seem to reduce the frequency of the hang, but not completely.

That's all I recall for now.[/QUOTE]


Have you read what I wrote above? You probably have, but I'll post it again:

From what I have heard, hard lock during idle could be a third problem. I had this when I had ASPM enabled for the link between SA and the PCH. Others suggested the PSU being incompatible with Haswell C-States (6/7). You could try disabling all power saving options altogether.

Did you do that? Also in Windows?

pegnose 2016-01-16 20:14

[QUOTE=Madpoo;422713]It can be hard to say just why Prime95 would lock a machine, but bear in mind that Prime95 is doing what it's supposed to... stressing the system so that if it has a problem, it will manifest itself.

Sounds like you're doing all the right things by focusing on different things that could be the problem until you finally (and hopefully) figure out what causes it.

One thing I'm not sure if you've tried is to underclock your system and see if it's stable. That might just help you verify if there are thermal issues, or maybe power supply related things. Same with the memory.[/QUOTE]


Thanks for reassuring me, Madpoo! I tried the memory @2133 MHz (JEDEC) and the CPU runs stock values (didn't go lower, so far; but it is well below 50 °C during 1344k stress testing). However, now with 16-16-16-40 timing it looks pretty good. I think I'll know by tomorrow evening.

chalsall 2016-01-16 20:23

[QUOTE=pegnose;422718]Thanks for reassuring me, Madpoo! I tried the memory @2133 MHz (JEDEC) and the CPU runs stock values (didn't go lower, so far; but it is well below 50 °C during 1344k stress testing). However, now with 16-16-16-40 timing it looks pretty good. I think I'll know by tomorrow evening.[/QUOTE]

On how many machines are these errors manifesting?

Have you tried different PSUs?

pegnose 2016-01-16 21:19

[QUOTE=chalsall;422719]On how many machines are these errors manifesting?

Have you tried different PSUs?[/QUOTE]

I have only the one machine. And no, not yet. But as I said: with 16-16-16-40 no hard lock so far. Plus: neither idle state nor massive load (300 W GTX 980TI OC during Furmark) were able to provoke the issue (at least not within 1h or so).

I didn't believe it, but maybe a) ASUS and b) Crucial were both right: a) my modules with 16-16-16-39 timing actually are not compatible (with my M. VIII Hero), but b) they are the very same hardware as the ones with 16-16-16-40, only with different timings programmed.

Mark Rose 2016-01-16 21:35

You should be able to disable sleep states in the BIOS. I would play with that setting.

Madpoo 2016-01-16 23:25

Nice boost in software downloads thanks to this issue
 
By the way for everyone following along...

The [url]www.mersenne.org[/url] server has seen a decent little boost in traffic to the download page, mostly from links on PC World.

I'm sure many of those are people interested in testing their own Skylake to see if it exhibits the issue, but maybe some will stick around and test a few exponents.

s1riker 2016-01-17 02:20

[QUOTE=chalsall;422709]Are you able to replicate this on any other machines?[/QUOTE]

Unfortunately, I don't have any other systems to test on.

s1riker 2016-01-17 02:24

[QUOTE=pegnose;422717]Have you read what I wrote above? Probably you have, but I just post it again:

From what I have heard, hard lock during idle could be a third problem. I had this when I had ASPM enabled for the link between SA and the PCH. Others suggested the PSU being incompatible with Haswell C-States (6/7). You could try disabling all power saving options altogether.

Did you do that? Also in Windows?[/QUOTE]

I tried two recent, relatively high-end power supplies, and both exhibited the exact same issue. The next thing I was going to try was to disable C-states one at a time and see what happens. I was hesitant to go down that route because I'm not fond of having the system waste more power than it needs to, given that I leave it running 24/7, but for a couple of weeks I'll give it a shot.

Also, to the above poster who suggested I try one stick of RAM: did that, it still hung, but thanks for the suggestion.

pegnose 2016-01-17 08:14

[QUOTE=Madpoo;422736]By the way for everyone following along...

The [URL="http://www.mersenne.org"]www.mersenne.org[/URL] server has seen a decent little boost in traffic to the download page, mostly from links on PC World.

I'm sure many of those are people interested in testing their own Skylake to see if it exhibits the issue, but maybe some will stick around and test a few exponents.[/QUOTE]

Any particular instructions on that? Program version, settings?

pegnose 2016-01-17 08:22

[QUOTE=s1riker;422749]I tried two recent, relative high end power supplies, and both exhibited the exact same issue. The next thing I was going to try was to disable C-States one at a time and see what happens. Although, I was hesitant to go down that route because I'm not fond of having the system waste more power than it needs to, given that I leave it running 24/7 but for a couple weeks, I'll give it a shot.

Also, to the above poster who suggested I try 1 stick of RAM. Did that, still hung, but thanks for the suggestion.[/QUOTE]

Make sure the PSU vendor explicitly states compatibility with Haswell C-states. Then disable C-state 8 in the BIOS, because to my knowledge the Haswell C-states are 6/7. It doesn't have to mean anything, but I did all this, and I don't have the issue. After reading your thread at Tom's I let my machine idle overnight; it was still there in the morning. I also have microcode 39. Hyperthreading, C-states, Turbo Boost... all enabled. ASUS M. VIII Hero, BIOS 1302.

The second thing is ASPM (Active State Power Management) for PCIe components. Make sure everything related to this is explicitly disabled in the BIOS. Also make sure PCIe power management in the Windows power settings is set to "Maximum performance". The OS's ability to control ASPM is called PCIe Native Power Management, at least in ASUS BIOSes.

And finally, I recommend a CMOS reset by inverting your CMOS battery in the socket (for at least 10 min) to make sure that (really, really) everything is gone.

s1riker 2016-01-17 14:39

[QUOTE=pegnose;422770]Make sure, the PSU vendor explicitly states compatibility with Haswell C-states. Then disable C-state 8 in Bios, because to my knowledge, Haswell C-states are 6/7. Doesn't have to mean anything, but I did all this, and I don't have the issue. After reading your thread at Toms I let my machine idle overnight. Was still there in the morning. I also have the microcode 39. Hyperthreading, C-states, Turbo boost... all enabled. ASUS M. VIII Hero, Bios 1302.

The second thing is ASPM (advanced state power management) for PCIe components. Make sure, everything related to this is explicitly disabled in the Bios. Also make sure PCIe power management in Windows power settings is set to "Maximum performance". OS's ability to do control ASPM is called PCIe Native Power Managment, at least in ASUS Bioses.

And finally I recommend a CMOS reset by inverting your CMOS battery in the socket (for at least 10 min) to make sure that (really, really) everything is gone.[/QUOTE]

Thank you, I will try your suggestions.

ralleh 2016-01-18 01:05

[URL="http://abload.de/img/unbenannt9vzhv.png"]I can confirm the 6A MC fixed everything on my system :)[/URL]

Thanks for everyone involved over here, we made it happen!

ralleh 2016-01-18 10:13

[QUOTE=pegnose;422186]Ok, thank you. I just ask because the experienced testers here said they didn't have any trouble whatsoever with settings other than AVX FFTs with 768k. So if I take them at their word, my issue must be a different one. I was hoping that some of them could tell me that they had issues with other test configurations, but only less frequently.[/QUOTE]

I can confirm there were no issues with other tests before. I tested 800k for 12 hours+ successfully with the old MC.

Your issues sound more like regular system instability due to a bad overclock or component issues. 800k is commonly used to verify memory overclocking stability (it was found to be one of the best tests for RAM overclocks), so you might want to try another kit or check whether the RAM is the problem in general (some DDR4 kits that were released for X99 systems cause severe stability problems on some Skylake motherboards).

Also, I never had any issues/freezes when using my rig for gaming with the old MC either.

[QUOTE=s1riker;422749]Although, I was hesitant to go down that route because I'm not fond of having the system waste more power than it needs to, given that I leave it running 24/7 but for a couple weeks, I'll give it a shot.[/QUOTE]

The difference in power usage is way smaller than you might think (on more recent systems). Mostly <5W on DC/Skylake systems I tested ;)

-

[B]/Edit:[/B] In addition to my post above: The usual version of P2.00 for the ASRock Z170 OCF does [u]not[/u] include the MC 6A yet, it was modded into the UEFI. German retailer [URL="http://www.jzelectronic.de/"]JZ Electronics[/URL] offers such modded UEFIs for his customers (and non-customers who bother to register in his forum). Here is the link to the [URL="http://www.jzelectronic.de/jz2/index.php?lid=dGlkPSZ0aGVtYV9pZD0mYWN0PTM0OTU2"]modded ASRock UEFIs[/URL].

pegnose 2016-01-19 08:50

[QUOTE=ralleh;422881]I can confirm there were no issues with other tests before. I tested 800k for 12 hours+ successfully with the old MC.

Your issues sound more like a regular system instability due to bad overclock or component issues. 800k is used to verify memory overclocking stability usually (was found to be one of the best tests for RAM overclock), so you might want to try another kit or check if the ram is the problem in general (some DDR4 kits that were released for X99 systems cause severe stability problems on some skylake motherboards).

Also I had never any issues/freezes when I was using my rig for gaming, when using the old MC, too.[/QUOTE]

Thanks for your follow-up and concern, ralleh! It turned out that Crucial ships the same hardware with different timings, one of which wasn't compatible with my mobo. So a software (BIOS) fix did the trick (tRAS: 39 -> 40).

pegnose 2016-01-19 09:14

Erm... why do I have an Alien avatar?!? Is this mod humor?

[B]INDEED IT IS. YOU HAVE BEEN CHOSEN. CONSIDER IT A VERY SMALL BADGE OF HONOR[/B]

Dubslow 2016-01-19 10:06

The various animals that (secretly?) run this place get bored from time to time... :smile:

pegnose 2016-01-19 12:29

[QUOTE=Dubslow;423002]The various animals that (secretly?) run this place get bored from time to time... :smile:[/QUOTE]

And here I thought you wanted to distance yourself by marking me as being from outer space. ;)

So, thank you... I guess. Already starting to feel minimally honored. It is slightly tickling...

Also started working on some prime factors in my idle time. Is it a problem to cut the internet connection while Prime95 is doing the hard work?

chalsall 2016-01-19 13:17

[QUOTE=pegnose;423011]Is it a problem to cut the internet connection while Prime95 is doing the hard work?[/QUOTE]

You should let the client contact the server every few days, but other than that it's fine to not have an Internet connection. Note that the communications only consume a hundred bytes or so; nominally once a day.

pegnose 2016-01-19 14:16

[QUOTE=chalsall;423013]You should let the client contact the server every few days, but other than that it's fine to not have an Internet connection. Note that the communications only consume a hundred bytes or so; nominally once a day.[/QUOTE]

Ok, thx! The traffic is totally OK, I just tend to switch off my wifi router when going to work.

ralleh 2016-01-19 15:47

Nick Shih posted a [URL="http://picx.xfastest.com/nickshih/asrock/Z17OCF202.zip"]2.02 BIOS[/URL] for the Z170 OC Formula today. The new BIOS contains the newer Microcode 74.

Too bad we aren't told what the difference between 6A and 74 is, as 6A already fixed the 768k problem.

pegnose 2016-01-19 16:02

[QUOTE=ralleh;423026]Nick Shih posted a [URL="http://picx.xfastest.com/nickshih/asrock/Z17OCF202.zip"]2.02 BIOS[/URL] for the Z170 OC Formula today. The new BIOS contains the newer Microcode 74.

Too bad it's not told what's the difference between 6A and 74, as 6A already fixed the 768k problem.[/QUOTE]

Hopefully memory compatibility and power management issues. Many people out there are going bonkers because of randomly freezing machines (particularly idle state).

Uncwilly 2016-01-19 22:44

Can someone with MersenneWiki privileges please edit the stub on the bug?
[url]http://mersennewiki.org/index.php/Skylake_Bug[/url]

Phil MjX 2016-01-19 23:40

I have received today a beta bios from MSI, I'll give it a try and keep you informed.
But before I'd like to do some benchmarks with prime95 to see if it has performance impact.

ixfd64 2016-01-20 00:09

Probably a stupid question as I haven't had the time to read through the thread yet, but does the bug affect LL results, or does it simply cause computers to freeze? If it's the former case, should we prioritize double-checking on results returned by Skylake machines?

VBCurtis 2016-01-20 00:15

[QUOTE=ixfd64;423140]Probably a stupid question as I haven't had the time to read through the thread yet, but does the bug affect LL results, or does it simply cause computers to freeze? If it's the former case, should we prioritize double-checking on results returned by Skylake machines?[/QUOTE]

Yes, we should double-check all skylake results run at 768k FFT size. Luckily, I did that in the time it took me to write this sentence, so we're good.

Madpoo 2016-01-20 01:45

[QUOTE=VBCurtis;423141]Yes, we should double-check all skylake results run at 768k FFT size. Luckily, I did that in the time it took me to write this sentence, so we're good.[/QUOTE]

LOL...

The good (?) news is that even if it affects other FFT sizes (which seemed like a possibility, maybe remote), it seemed to trigger round-off errors.

Regardless, they'll be double-checked eventually. :smile:
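For readers wondering what a round-off error means here: Prime95 multiplies huge numbers via floating-point FFTs, and the raw convolution outputs should land very close to integers; the reported round-off error is the distance to the nearest integer, so flaky floating-point hardware (or an FFT that is too small for the operands) shows up as a large value. A minimal numpy sketch of the idea (the base-10000 digit representation is illustrative only, not Prime95's actual internals):

```python
import numpy as np

def fft_square_roundoff(digits, base=10_000):
    """Square the integer whose little-endian base-`base` digits are
    given, using a floating-point FFT convolution. Returns the product
    digits and the maximum round-off error: on sound hardware the raw
    convolution values land very close to integers, so the error is
    tiny; bad FP hardware would show up as a large round-off error."""
    n = 1
    while n < 2 * len(digits):
        n *= 2  # pad to a power of two so cyclic wraparound is harmless
    f = np.fft.rfft(np.array(digits, dtype=np.float64), n)
    raw = np.fft.irfft(f * f, n)        # convolution of digits with itself
    rounded = np.rint(raw)
    max_roundoff = float(np.max(np.abs(raw - rounded)))
    # Propagate carries to normalise back into base-`base` digits.
    out, carry = [], 0
    for v in rounded.astype(np.int64):
        carry, d = divmod(int(v) + carry, base)
        out.append(d)
    while carry:
        carry, d = divmod(carry, base)
        out.append(d)
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out, max_roundoff
```

For example, squaring 123456789 (digits `[6789, 2345, 1]`) reproduces the exact integer product while the round-off error stays many orders of magnitude below the 0.5 threshold at which rounding to the nearest integer would silently go wrong.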

pegnose 2016-01-20 08:19

[QUOTE=ixfd64;423140]Probably a stupid question as I haven't had the time to read through the thread yet, but does the bug affect LL results, or does it simply cause computers to freeze?[/QUOTE]


The (sadly) misleading thing in the news (web article headlines explicitly state this) is the claim that the bug freezes computers. While it might (who knows; there are so many Skylake problems out there with memory, power management etc. that actually do freeze PCs), I learned here (and by testing) that Prime95 (at least 27.x) will just drop workers.

Unfortunately, by now many people with freezing Skylakes believe that the 768k bug accounts for their problems. And the first of them are already disappointed to learn that a BIOS update with the new microcode isn't helping their PCs.

I am also curious. I had solved the memory issues for my machine to the point where I could finally run memtest86 and Prime95 for whole days without a freeze, and idle states wouldn't bother me either (as opposed to other people). At that point I decided it would be worth finally resyncing my software RAID (which usually takes ~1 day, too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my French.

Ok, I had Prime95 running on all 8 cores during that, but that shouldn't be an issue for a 1.5k-Euro machine. Seriously not.

henryzz 2016-01-20 10:04

[QUOTE=Madpoo;423153]LOL...

The good (?) news is that even if it affects other FFT sizes (which seemed like a possibility, maybe remote), it seemed to trigger round-off errors.

Regardless, they'll be double-checked eventually. :smile:[/QUOTE]

The majority of Skylake systems will probably be using v28 as well, which means FMA code that avoids this bug.

s1riker 2016-01-20 15:09

[QUOTE=pegnose;423191]The (sadly) misleading thing in the news (web article headlines explicitly stating this) is that while it might lead to computers freezing (who knows; although there are so many Skylake problems out there with memory, power management etc. that actually! freeze PCs), I learned here (and by testing) that Prime95 (at least 27.x) will just drop workers.

Unfortunately, by now many people with Skylakes freezing believe that the 768k bug accounts for their problems. And the first already are disappointed learning that a bios update with the new micro code isn't helping their PCs.

I am also curious. I had solved memory issues for my machine until I finally could run memtest85 and Prime95 for whole days without a freeze and idle states wouldn't bother me as well (as opposed to other people). At that point I decided it would be worth to finally resync my software raid (which usually takes ~1d too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my french.

Ok, I had Prime95 running during that on all 8 cores, but that shouldn't be an issue for a 1.5k Euro machine. Seriously not.[/QUOTE]

I told you there was some fundamental bug here :) ... I only smile because there is someone else who can join my misery. It should not be this hard to get a stable system. I'm on my last set of things to try (increased VCCIO to 1.0V, load-line calibration set to high, and C-state 8 off). I'm not hopeful it'll fix anything. I'm just going to wait till around July/August and RMA my CPU in the hope that a new stepping will fix the issue.

Edit: The interesting thing is that it happened for you during high I/O, which is in line with what Solis3 was saying in the Tom's thread. Maybe that's something we should investigate.

Madpoo 2016-01-20 16:02

[QUOTE=pegnose;423191]...
I am also curious. I had solved memory issues for my machine until I finally could run memtest85 and Prime95 for whole days without a freeze and idle states wouldn't bother me as well (as opposed to other people). At that point I decided it would be worth to finally resync my software raid (which usually takes ~1d too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my french.

Ok, I had Prime95 running during that on all 8 cores, but that shouldn't be an issue for a 1.5k Euro machine. Seriously not.[/QUOTE]

Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM, so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's mem usage crept past a certain point and hit some bad section of memory, the thing would blue screen and an unrecoverable memory error was logged (it was a Proliant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit patterns imaginable. It "only" had 36 GB so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the Proliant in "spare memory" mode, which effectively took the bad module out of service. Then a couple of months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.

s1riker 2016-01-20 16:21

[QUOTE=Madpoo;423224]Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's mem usage crept past a certain point and hit some bad section of memory, the thing would blue screen and an unrecoverable memory error was logged (it was a Proliant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit patterns imaginable. It "only" had 36 GB so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the Proliant in "spare memory" mode, which effectively took the bad module out of service. Then a couple months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.[/QUOTE]

I would have probably posted something like this before I had built this system :) ... It's not the PSU or bad RAM; I've tried both. I have two high-end PSUs that are marked as fully compatible with Haswell. One of them was in my previous build for about 2 years and was completely stable. Both exhibit the same hard lock with my Skylake-S build. Same with RAM: I tried 2 separate sets of RAM at default speeds, even single sticks just to see, and the hard lock still happened. Others with different motherboards, different RAM and different PSUs (who've also tried swapping out components) have had the same issue with Skylake-S.

pegnose 2016-01-20 19:08

[QUOTE=s1riker;423214]I told you there was some fundamental bug here :) .. I only smile because there is someone else who can join my misery. It should not be this hard to get a stable system. I'm on my last set of things to try (increased VCCIO to 1.0v, load line calibration set to high and C-State 8 off). I'm not hopeful it'll fix anything. I'm just going to wait till around July/August and RMA my CPU in the hopes that a new stepping will fix the issue.

Edit: The interesting thing is that it happened for you during high I/O, which is in line with what Solis3 was saying in the Tom's thread. Maybe that's something we should investigate.[/QUOTE]

You can smile, fine by me. ;)

However, I highly doubt that something is broken, because I can kick my system as hard as I want and it only freezes after many hours, which somewhat rules out a power or heat problem. It is hard to understand. Freezes have become less and less frequent; this is a gradual thing. If anything, I will swap the memory for a 100% compatible kit. After that I can - again - open a support case with ASUS.

ALSO, my system was stable for the first two months! And it is mostly stable with particular load scenarios (games in my case: Fallout 4 and Deus Ex: Human Revolution; but not Dying Light, so far).

pegnose 2016-01-20 19:15

[QUOTE=Madpoo;423224]Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's mem usage crept past a certain point and hit some bad section of memory, the thing would blue screen and an unrecoverable memory error was logged (it was a Proliant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit patterns imaginable. It "only" had 36 GB so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the Proliant in "spare memory" mode, which effectively took the bad module out of service. Then a couple months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.[/QUOTE]

I highly doubt a power issue. My PSU is 700 W, and BeQuiet is an outstanding brand regarding the amperage of individual rails. Also, with the CPU at max my whole system won't consume more than 200 W, whereas with my GPU at full load it is at 450 W. And even like this I can't produce a freeze with FurMark for 1+ hours (also, a failing PSU rather causes a black-out; I had this when my GTX 980 Ti was attached to only one of the PCIe power rails ;).

RAM? a) I have 16 GB, b) memtest86 ran for 10 h straight. OK, for stress testing you should use HCI's version. I will do that soon. But I am not so sure it is memory anymore, for me.

Ok, so it was a faulty module for you and you couldn't detect it with memtest? That is actually weird. But I would be happy about this. I will probably know soon, because I plan on switching brands (more people have issues with Crucial on recent ASUS boards). The only problem: some of the modules on ASUS' HCL for that board are hard to get in Germany, and the others have some addendum like "(ver. 5.26)". How am I supposed to get exactly THAT version?!

Nevertheless, thank you very much for sharing your experience with me!

pegnose 2016-01-20 19:30

[QUOTE=s1riker;423228]I would have probably posted something like this before I had built this system :) ... It's not the PSU or bad RAM; I've tried both. I have two high-end PSUs that are marked as fully compatible with Haswell. One of them was in my previous build for about 2 years and was completely stable. Both exhibit the same hard lock with my Skylake-S build. Same with RAM: I tried 2 separate sets of RAM at default speeds, even single sticks just to see, and the hard lock still happened. Others with different motherboards, different RAM and different PSUs (who've also tried swapping out components) have had the same issue with Skylake-S.[/QUOTE]

Yes, I also think it is the platform as such. You can do things to make it more stable, like disabling C-states or optimizing RAM compatibility. But somehow these measures only effect a gradual improvement.

My last freeze (and the first AFTER I thought I was finally good) now was with high load on CPU and HDD (the ASM1061 in particular, which hangs off the PCH in platform terms). Memory is on the SA side, the other side of the DMI link, if I am correct. Interestingly, I had freezes when I activated native power management for the DMI link (the PCH side of it) in the BIOS. Maybe that is the relevant component.

- idle state (power management) for some
- video streaming (to the hdd) for others
- memory for most of us

Does that make sense?

chalsall 2016-01-20 19:42

[QUOTE=pegnose;423251]Does that make sense?[/QUOTE]

Certainly. But please keep in mind that modern computers are /incredibly/ complicated, with many different interacting components (often from different suppliers), each of which must work perfectly for the system as a whole to work correctly.

Within the software industry there's the term "once a month" bug: something goes demonstrably wrong, but it is not easy to reproduce deterministically. This is where the expression "have you tried turning it off and on again" comes from.

To try to interject something useful into this post: have you looked at your PSU's power rail loading? It is possible your PSU is fine, but by chance you have one or more of your rails /just/ at the edge of their rating because of the power cabling configuration.

pegnose 2016-01-20 20:20

[QUOTE=chalsall;423252]Certainly. But please keep in mind that modern computers are /incredibly/ complicated, with many different interacting components (often from different suppliers), each of which must work perfectly for the system as a whole to work correctly.

Within the software industry there's the term "once a month" bug: something goes demonstrably wrong, but it is not easy to reproduce deterministically. This is where the expression "have you tried turning it off and on again" comes from.

To try to interject something useful into this post: have you looked at your PSU's power rail loading? It is possible your PSU is fine, but by chance you have one or more of your rails /just/ at the edge of their rating because of the power cabling configuration.[/QUOTE]

Thanks for chiming in! Yes, this problem certainly is hard to track down and of a complicated, multifaceted nature. The common factor seems to be "unpredictable". ;)

Unfortunately, it is more like "once every 4 days" for many users, so you can't just ignore it. And it seems to affect a lot of people out there.

Yes, I thought about the power rails. But...

- my system consumes up to 450 W, on a 700 W PSU of a solid brand (BeQuiet)
- with CPU and mobo connectors there is no choice
- one SSD is M.2 on the board, the other three consume up to... 20 W?
- then there is my GTX 980 Ti, which takes up to 300 W, but (now) it is connected to both PCIe rails (before, when I was still naive, hehe, I had 2 (!) power outages in 5 months - real black-outs (PC only, I mean: instant-off) - which is nothing compared to the Skylake hard-lock); I even contacted the BeQuiet support on that matter: with both rails I should be more than fine now

So... unless my PSU has a weird defect, which is shared by many other users with many other PSUs... it is not the culprit, I guess.

chalsall 2016-01-20 20:44

[QUOTE=pegnose;423256]Unfortunately, it is more like "once every 4 days" for many users, so you can't just ignore it. And it seems to affect a lot of people out there.[/QUOTE]

Please trust me, I understand. When I was responsible for building out the first wireless WAN here in Barbados the manufacturer had a "once a month" bug. Of course, after deploying ~100 radios, the problem manifested ~3 times a day...
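
That scaling is worth making explicit: a "once a month" per-unit bug across a ~100-unit fleet becomes a daily event. A quick sanity check of the numbers above:

```python
# Expected fleet-wide failure frequency from a per-unit "once a month" bug.
per_unit_failures_per_day = 1 / 30   # roughly one failure per month per radio
fleet_size = 100                     # radios deployed

fleet_failures_per_day = per_unit_failures_per_day * fleet_size
print(round(fleet_failures_per_day, 1))  # → 3.3, matching "~3 times a day"
```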

[QUOTE=pegnose;423256]Yes, I thought about the power rails. But...

- one SSD is M.2 on the board, the other three consume up to... 20 W?[/QUOTE]

Are the other three drives SSDs or HDs? Please know that "spinning rust storage" can have rather extreme random spikes in their draw (both OS driven, and independent).

[QUOTE=pegnose;423256]So... unless my PSU has a weird defect, which is shared by many other users with many other PSUs... it is not the culprit, I guess.[/QUOTE]

Don't guess. Test.

I would suggest (if you and/or others can) first removing any kit you can (for example, HDs, GPUs, RAM), and rerunning the tests you've used to produce the observed crashing (even if not deterministically -- currently you're doing statistical testing). Swap out MBs and PSUs. Make sure your mains power is good.
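
The "statistical testing" point can be made concrete: with a non-deterministic freeze you are estimating a crash rate, and rare events need long observation windows before "no crash since the change" means anything. A rough sketch under a simple Poisson assumption (the once-a-week rate is only an example, not anyone's measured figure):

```python
import math

def prob_zero_crashes(rate_per_day: float, days: float) -> float:
    """P(observing no crash in `days`) if crashes arrive as a
    Poisson process at `rate_per_day`."""
    return math.exp(-rate_per_day * days)

# If the true rate is one crash per week, a quiet 3-day test proves little:
print(round(prob_zero_crashes(1 / 7, 3), 2))   # → 0.65: likely even with no fix

# Roughly three crash-free weeks are needed before the quiet spell itself
# is reasonable (~95%) evidence that a change actually helped:
print(round(prob_zero_crashes(1 / 7, 21), 2))  # → 0.05
```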

This is not to say this is not a CPU issue, but you don't /know/ it is yet.

I hope that makes sense and helps.

kladner 2016-01-20 21:52

chalsall:
[QUOTE]Are the other three drives SSDs or HDs? Please know that "spinning rust storage" can have rather extreme random spikes in their draw (both OS driven, and independent).[/QUOTE]pegnose:
[QUOTE][U]one SSD[/U] is M.2 on the board, [U]the other three[/U] consume up to... 20 W?[/QUOTE]I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.

chalsall 2016-01-20 22:07

[QUOTE=kladner;423263]I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.[/QUOTE]

I'm not so sure. Easy to achieve that kind of load at idle spin with no seek.

Please also note this: "My last freeze (and the first AFTER I thought I was finally good) now was with high load on CPU and HDD".

Perhaps pegnose will speak to our questions.

pegnose 2016-01-20 22:09

[QUOTE=chalsall;423258]Please trust me, I understand. When I was responsible for building out the first wireless WAN here in Barbados the manufacturer had a "once a month" bug. Of course, after deploying ~100 radios, the problem manifested ~3 times a day...[/QUOTE]

Oh nose. What was it?

[QUOTE=chalsall;423258]Are the other three drives SSDs or HDs? Please know that "spinning rust storage" can have rather extreme random spikes in their draw (both OS driven, and independent).[/QUOTE]

It's one Samsung 840 Pro 256 GB SSD and two WD Green 3 TB drives.

If I understand my PSU correctly, the 4 rails are 1) for CPU and mobo, 2) for drives and all other devices, 3) PCIe 1, 4) PCIe 2. If I am correct, there is no danger here. Unfortunately there is no info on that on their website.

[QUOTE=chalsall;423258]Don't guess. Test.[/QUOTE]

Many people with many different PSUs have the freeze. There is no reason to believe it is the problem.

[QUOTE=chalsall;423258]I would suggest (if you and/or others can) to first remove any kit you can (for example, HDs, GPUs, RAM), and rerun the tests you've used to produce the observed crashing (even if not deterministically -- currently you're doing statistical testing). Swap out MBs and PSUs. Make sure your mains power is good.[/QUOTE]

That is exactly what I did. As soon as ASUS support pointed out that my memory wasn't exactly compatible, I adjusted settings and took all components out except the CPU and one RAM module. I even unplugged all the fans. Ok, the DVD drive was still on, but I had to run memtest86 somehow. ;)

Finally I arrived at memtest86, Prime95, and idle state each running for a whole day without any issue - with correct memory timings, more power for the DRAM, and some compatibility settings that are possibly pure placebo.

[QUOTE=chalsall;423258]This is not to say this is not a CPU issue, but you don't /know/ it is yet.

I hope that makes sense and helps.[/QUOTE]

To bring you up to speed: only a few of us were able to improve their situation by RMAing the CPU. My impression is that many think it is a platform/BIOS/microcode (whichever of those) issue - me included. I don't think that anything is broken, particularly as I had two crash-free months in the beginning. Here is 'our' main discussion thread:

[url]http://www.tomshardware.co.uk/forum/id-2830772/skylake-build-randomly-freezing-crashing/page-9.html#17356820[/url]

Some have the Skylake freeze during idle, others during load, one guy while video streaming. Some could improve things by disabling C-states, some with memory timings and voltages, some even with CPU core voltage, if I remember correctly. It is diffuse, obscure, and some other nice words. ;)

It could be very different problems, but interestingly, none of these different issues were completely cured on any machine (not that I've read of); they only got better to a more or less substantial extent.

pegnose 2016-01-20 22:18

[QUOTE=kladner;423263]chalsall:
pegnose:
I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.[/QUOTE]

Samsung 840 Pro: 3.21 W in use
WD Green 3 TB: 4.1 W in use (x2)
Plus a substantial buffer - let it be 30 W altogether, on one PSU rail. What are we talking about?!

pegnose 2016-01-20 22:23

[QUOTE=chalsall;423265]Please also note this: "My last freeze (and the first AFTER I thought I was finally good) now was with high load on CPU and HDD".[/QUOTE]

Yes, I stated this intentionally. But I rather meant the data transfer. My software RAID 1 was resyncing. Of course, this also means power, but spikes? Plus: even the BeQuiet support told me that I would rather have to fear black-outs in such cases.

I really appreciate your effort!! But I am afraid we are on the wrong track. The first thing I'll do is get compatible memory from a different brand - i.e. after I flash the new 1402 BIOS update and get the next freeze (if any, haha).

chalsall 2016-01-20 22:25

[QUOTE=pegnose;423267]It is diffuse, obscure, and some other nice words. ;)

It could be very different problems, but interestingly, none of these different issues were completely cured on any machine (not that I've read of); they only got better to a more or less substantial extent.[/QUOTE]

OK. But you guys are changing many different variables all at the same time, with little cross-correlation or testable results.

This is not how the Scientific Method works.

To use an analogy, this is worse than shooting a shotgun in the dark hoping to find your keys....

pegnose 2016-01-20 22:29

[QUOTE=chalsall;423272]OK. But you guys are changing many different variables all at the same time, with little cross-correlation or testable results.

This is not how the Scientific Method works.

To use an analogy, this is worse than shooting a shotgun in the dark hoping to find your keys....[/QUOTE]

You mean shooting the shotgun at the streetlight you are standing under... ;)

pegnose 2016-01-20 22:31

And what is wrong with getting the recommended memory? I say: check one component at a time and make 100% sure it is OK. I am not done with memory yet. If I get new memory from a different brand that is on my HCL, and I still have the issue, I move on to the next component.

chalsall 2016-01-20 22:33

[QUOTE=pegnose;423271]I really appreciate your effort!! But I am afraid we are on the wrong track. The first thing I'll do is get compatible memory from a different brand - i.e. after I flash the new 1402 BIOS update and get the next freeze (if any, haha).[/QUOTE]

Have you, personally, tried a different motherboard supplier with all your other components, including the CPU?

I learnt the hard way to eliminate *all* variables.... :smile:

pegnose 2016-01-20 22:34

[QUOTE=chalsall;423276]Have you, personally, tried a different motherboard supplier with all your other components, including the CPU?

I learnt the hard way to eliminate *all* variables.... :smile:[/QUOTE]

No. As I said: I started out with memory, and I am not yet done with it.

But, of course, part of the problem is - and this is more or less the same for all of us - that I bought my mobo more than 6 months ago. What will my vendor say if I want to return it without being able to prove that it is broken and that it WAS broken from the beginning (after 6 months that is necessary), PLUS that I want a different one in return? I should be happy if he deems me worthy of even the shortest response.

AND on the other hand: as I said, I don't believe that something is broken (or sort of broken with all ASUS Z170 boards). If my other components are fine, ASUS support has to deal with it. I will make them. Oh, I will.

EDIT: You HEAR me, ASUS?!?

chalsall 2016-01-20 22:45

HDD Diet: Power Consumption and Heat Dissipation
 
Just to put this out there...

[URL="http://ixbtlabs.com/articles2/storage/hddpower.html"]Almost eleven years old, and yet still relevant[/URL]....

pegnose 2016-01-20 22:51

[QUOTE=chalsall;423278]Just to put this out there...

[URL="http://ixbtlabs.com/articles2/storage/hddpower.html"]Almost eleven years old, and yet still relevant[/URL]....[/QUOTE]

You are right, that is important.

In my case, I have a good cooling solution (and case ;). My HDDs have been resyncing for 16 h now, and they are at 31°C and 32°C.


EDIT: Nighty night.

chalsall 2016-01-20 23:07

[QUOTE=pegnose;423279]In my case, I have a good cooling solution (and case ;). My HDDs have been resyncing for 16 h now, and they are at 31°C and 32°C.[/QUOTE]

You might have missed the point of the article...

Measuring the temperature of the components involved is an averaged and high-latency measurement of the power consumed.

Taking an instantaneous power consumption measurement is a lot more difficult (particularly when Direct Current rather than Alternating Current is involved).
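
A toy model shows why: temperature behaves roughly like a first-order low-pass filter over power, so a one-second spike in draw that could momentarily stress a rail barely moves the reading. All constants below are illustrative, not drive data:

```python
def temps(power_w, tau_s=60.0, dt_s=1.0, c_per_w=0.5, ambient=25.0):
    """First-order thermal model: temperature relaxes toward the
    steady-state value for the current power with time constant tau_s."""
    t = ambient + c_per_w * power_w[0]  # start in thermal equilibrium
    out = []
    for p in power_w:
        target = ambient + c_per_w * p
        t += (target - t) * (dt_s / tau_s)
        out.append(t)
    return out

# 120 s at a 4 W baseline with a single one-second 12 W spike:
trace = [4.0] * 120
trace[60] = 12.0
ts = temps(trace)
print(round(max(ts) - ts[0], 2))  # → 0.07: the spike is nearly invisible
```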

kladner 2016-01-21 01:12

[QUOTE=chalsall;423265]I'm not so sure. Easy to achieve that kind of load at idle spin with no seek.

Please also note this: "My last freeze (and the first AFTER I thought I was finally good) now was with[B][U] high load on CPU and HDD[/U][/B]".

Perhaps pegnose will speak to our questions.[/QUOTE]
Good point.

kladner 2016-01-21 01:21

[QUOTE=chalsall;423278]Just to put this out there...

[URL="http://ixbtlabs.com/articles2/storage/hddpower.html"]Almost eleven years old, and yet still relevant[/URL]....[/QUOTE]

Great article! Thanks!

pegnose 2016-01-21 07:23

Ok, I don't get it. You mean resyncing two "Eco" drives at 5400 rpm each knocks out my gaming rig? Even if they consumed like 10+ times that much power under (actual) load? As I said: I appreciate your concern (I really do), but I don't think so. We are not talking about high-speed SCSI/SAS drives from outer space here.

EDIT: Regarding the power consumption of the HDDs, first I took it from the specs, then I measured it via subtraction. It seems they consume up to 6 W each at full load. Now let it be 12 W on spikes and - if both drives spiked synchronously - 24 W for both. That still is no big deal.
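
pegnose's budget, written out (the 6 W per-drive figure and the 2x spike allowance are his estimates from this post; the SSD figure is the "in use" number he quoted earlier, not a datasheet lookup of mine):

```python
# Worst-case simultaneous draw of the three drives on one rail.
hdd_full_load_w = 6.0     # measured per WD Green 3 TB, full load
spike_factor = 2.0        # generous allowance for transient spikes
n_hdds = 2
ssd_in_use_w = 3.21       # Samsung 840 Pro, "in use"

worst_case_w = n_hdds * hdd_full_load_w * spike_factor + ssd_in_use_w
print(round(worst_case_w, 2))  # → 27.21, trivial next to a 700 W PSU
```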

Brunnis 2016-01-21 08:09

[QUOTE=pegnose;423277]No. As I said: I started out with memory, and I am not yet done with it.

But, of course, part of the problem is - and this is more or less the same for all of us - that I bought my mobo more than 6 months ago. What will my vendor say if I want to return it without being able to prove that it is broken and that it WAS broken from the beginning (after 6 months that is necessary), PLUS that I want a different one in return? I should be happy if he deems me worthy of even the shortest response.

AND on the other hand: as I said, I don't believe that something is broken (or sort of broken with all ASUS Z170 boards). If my other components are fine, ASUS support has to deal with it. I will make them. Oh, I will.

EDIT: You HEAR me, ASUS?!?[/QUOTE]
Don't know if this is of any help, but below are the specs of my Skylake system which appears fully stable so far. I've had it for a couple of months now.

EVGA SuperNova G2 750W
ASUS Z170-A (currently BIOS 1602, but 1203, 1302 and 1402 gave me no issues)
2x8GB Corsair Vengeance LPX Black DDR4 2400MHz @ 2666MHz 16-16-16-39-1T @ 1.254V
Core i7-6700K (stock)
Sapphire Radeon R9 390 Nitro
ASUS Xonar DX (PCI-E soundcard)
Samsung 850 EVO 1TB
An old Samsung DVD-RW drive
Windows 10 Home x64 (Fully updated and clean install. No registry tweaks and no advanced power management profile modifications.)

The BIOS is mostly at default settings. I have changed the following:

- ASUS Multicore Enhancement: disabled (enabling this feature basically runs the CPU out of spec, although it "should" be harmless)
- FCLK: 1GHz (see [url]http://www.anandtech.com/show/9607/skylake-discrete-graphics-performance-pcie-optimizations[/url])
- Voltages manually set to the following, since I don't trust ASUS with "Auto" values:[INDENT]- VCCIO: 0.95000V
- System Agent: 1.05000V
- PCH Core Voltage: 1.00000V
- CPU Standby Voltage: 1.000V[/INDENT]- PCI Express Native Power Management: enabled (I think this was actually enabled by default in earlier BIOS versions and I've kept it enabled since)
- HD Audio Controller: disabled
- Intel Thunderbolt: disabled

pegnose 2016-01-21 10:14

[QUOTE=Brunnis;423339]Don't know if this is of any help, but below are the specs of my Skylake system which appears fully stable so far. I've had it for a couple of months now.

EVGA SuperNova G2 750W
ASUS Z170-A (currently BIOS 1602, but 1203, 1302 and 1402 gave me no issues)
2x8GB Corsair Vengeance LPX Black DDR4 2400MHz @ 2666MHz 16-16-16-39-1T @ 1.254V
Core i7-6700K (stock)
Sapphire Radeon R9 390 Nitro
ASUS Xonar DX (PCI-E soundcard)
Samsung 850 EVO 1TB
An old Samsung DVD-RW drive
Windows 10 Home x64 (Fully updated and clean install. No registry tweaks and no advanced power management profile modifications.)

The BIOS is mostly at default settings. I have changed the following:

- ASUS Multicore Enhancement: disabled (enabling this feature basically runs the CPU out of spec, although it "should" be harmless)
- FCLK: 1GHz (see [URL]http://www.anandtech.com/show/9607/skylake-discrete-graphics-performance-pcie-optimizations[/URL])
- Voltages manually set to the following, since I don't trust ASUS with "Auto" values:[INDENT]- VCCIO: 0.95000V
- System Agent: 1.05000V
- PCH Core Voltage: 1.00000V
- CPU Standby Voltage: 1.000V[/INDENT]- PCI Express Native Power Management: enabled (I think this was actually enabled by default in earlier BIOS versions and I've kept it enabled since)
- HD Audio Controller: disabled
- Intel Thunderbolt: disabled[/QUOTE]

Thanks for listing this in such detail, Brunnis! Only, I am afraid, the only thing our rigs have in common is the CPU. :-/

I have a few questions, though:

- Setting the voltages means they don't get reduced via C-states, right? You might be on to something here.
- Did you configure your RAM manually, or did you load XMP, or a mixture of both?

Brunnis 2016-01-21 10:46

[QUOTE=pegnose;423348]Thanks for listing this in such detail, Brunnis! Only, I am afraid, the only thing our rigs have in common is the CPU. :-/

I have a few questions, though:

- Setting the voltages means they don't get reduced via C-states, right? You might be on to something here.
- Did you configure your RAM manually, or did you load XMP, or a mixture of both?[/QUOTE]
No problem! Good questions. Let's see:

- I believe (but I'm actually not sure) that only the CPU core voltage is affected by the C-states. With my current settings (CPU voltage at default), the CPU voltage jumps around as expected depending on load. I have not paid attention to how the other voltages behave with BIOS stock settings, but maybe you have? I guess it should be possible to check with HWMonitor or similar. If they do jump around, I guess this could have an effect. I only used auto values for a short time after building the system.

- I have avoided loading XMP settings. I've simply set the memory to 2666MHz, kept every timing at default (auto), except CAS, RAS to CAS, RAS ACT Time and Command Rate, which I've set to 16, 16, 39 and 1. And then of course memory voltage to 1.25V. Actually not sure if the extra memory voltage is needed with my current settings, but it's not hurting anything. Since the memory is in fact overclocked, I thought I might as well leave it with 0.05V extra voltage to provide some additional margin.

pegnose 2016-01-21 11:01

[QUOTE=Brunnis;423349]No problem! Good questions. Let's see:

- I believe (but I'm actually not sure) that only the CPU core voltage is affected by the C-states. With my current settings (CPU voltage at default), the CPU voltage jumps around as expected depending on load.[/QUOTE]

Ok, but the lower limit is 1.0 V (your CPU standby voltage). This is interesting, because IIRC the CPU core voltage drops as low as 0.8xy V with stock settings. This is what one guy reported at Tom's Hardware who could improve his situation by RMAing his 6700K. Maybe not all CPUs that passed QC can actually stand this - hence the many people with idle issues!

[QUOTE=Brunnis;423349]I have not paid attention to how the other voltages behave with BIOS stock settings, but maybe you have? I guess it should be possible to check with HWMonitor or similar. If they do jump around, I guess this could have an effect. I only used auto values for a short time after building the system.[/QUOTE]

Not yet, but I'll have a close look!

[QUOTE=Brunnis;423349] - I have avoided loading XMP settings. I've simply set the memory to 2666MHz, kept every timing at default (auto), except CAS, RAS to CAS, RAS ACT Time and Command Rate, which I've set to 16, 16, 39 and 1. And then of course memory voltage to 1.25V. Actually not sure if the extra memory voltage is needed with my current settings, but it's not hurting anything. Since the memory is in fact overclocked, I thought I might as well leave it with 0.05V extra voltage to provide some additional margin.[/QUOTE]

This sounds very reasonable. And many people over the web have reported that upping the DRAM voltage a bit helped their stability.

Cool, thanks for your info!!

Brunnis 2016-01-21 12:32

[QUOTE=pegnose;423351]Ok, but the lower limit is 1.0 V (your CPU standby voltage). This is interesting, because IIRC the CPU core voltage drops as low as 0.8xy V with stock settings. This is what one guy reported at Tom's Hardware who could improve his situation by RMAing his 6700K. Maybe not all CPUs that passed QC can actually stand this - hence the many people with idle issues![/QUOTE]
I'm not sure exactly when the CPU standby voltage is applied, but it's not the voltage used when idling in Windows. Even though I've set the standby voltage to 1.0V, the CPU is idling at 0.8V and 800MHz in Windows, as is the default behavior. So, unfortunately I doubt this particular setting affects you while being up and running in Windows.

[QUOTE=pegnose;423351]Cool, thanks for your info!![/QUOTE]
No problem, keep us posted!

pegnose 2016-01-21 12:33

[QUOTE=Brunnis;423355]I'm not sure exactly when the CPU standby voltage is applied, but it's not the voltage used when idling in Windows. Even though I've set the standby voltage to 1.0V, the CPU is idling at 0.8V and 800MHz in Windows, as is the default behavior. So, unfortunately I doubt this particular setting affects you while being up and running in Windows.[/QUOTE]

-.-

s1riker 2016-01-21 15:36

Well just to add to the fun ...

After running HCI memtest for 48 hours with a clean bill of health and good temps all around, I figured, ok RAM's gotta be stable, so I stopped it.

I then proceeded to work all day on the PC yesterday without issue, left for a while, came back to play some GTA (maybe 20 mins at most), and left it again. Went to check on it this morning and it was frozen again.

So C-State 8 disabled, increased VCCIO and load-line calibration, and the new BIOS with microcode 6A made no difference.

EDIT: and for chalsall, I know this is not the scientific method, I'm trying multiple things at a time. But the problem is I have only 1 workstation and I need to use it as well for real work, if I were to try 1 single thing at a time, I could be at this for years. I did however test new MB (same model), new RAM (same model) and new PSU (different model) in isolation and am confident none of those are defective.

pegnose 2016-01-21 15:50

[QUOTE=s1riker;423371]Well just to add to the fun ...

After running HCI memtest for 48 hours with a clean bill of health and good temps all around, I figured, ok RAM's gotta be stable, so I stopped it.

I then proceeded to work all day on the PC yesterday without issue, left for a while, came back to play some GTA (maybe 20 mins at most), and left it again. Went to check on it this morning and it was frozen again.

So C-State 8 disabled, increased VCCIO and load-line calibration, and the new BIOS with microcode 6A made no difference.

EDIT: and for chalsall, I know this is not the scientific method, I'm trying multiple things at a time. But the problem is I have only 1 workstation and I need to use it as well for real work, if I were to try 1 single thing at a time, I could be at this for years. I did however test new MB (same model), new RAM (same model) and new PSU (different model) in isolation and am confident none of those are defective.[/QUOTE]

Try setting all your voltages to a fixed value?

s1riker 2016-01-21 15:56

[QUOTE=pegnose;423374]Try setting all your voltages to a fixed value?[/QUOTE]

Going to try that next as I got nothing else at this point.

Brunnis 2016-01-21 16:02

[QUOTE=s1riker;423371]Well just to add to the fun ...

After running HCI memtest for 48 hours with a clean bill of health and good temps all around, I figured, ok RAM's gotta be stable, so I stopped it.

I then proceeded to work all day on the PC yesterday without issue, left for a while, came back to play some GTA (maybe 20 mins at most), and left it again. Went to check on it this morning and it was frozen again.

So C-State 8 disabled, increased VCCIO and load-line calibration, and the new BIOS with microcode 6A: none of it made a difference.

EDIT: and for chalsall, I know this is not the scientific method, I'm trying multiple things at a time. But the problem is I have only 1 workstation and I need to use it as well for real work, if I were to try 1 single thing at a time, I could be at this for years. I did however test new MB (same model), new RAM (same model) and new PSU (different model) in isolation and am confident none of those are defective.[/QUOTE]
Did you try exchanging the OS drive? I've had extremely similar issues with a flaky SSD. It eventually went out, but it took 1.5 years until that happened. Before that, I got complete system freezes (usually during idle) and I could never find a way of reproducing the issue with any stress test.

I don't know if you're the one I discussed this issue with on Anandtech's forum? If so, I've already suggested the above to you. ;)

s1riker 2016-01-21 16:05

[QUOTE=Brunnis;423379]Did you try exchanging the OS drive? I've had extremely similar issues with a flaky SSD. It eventually went out, but it took 1.5 years until that happened. Before that, I got complete system freezes (usually during idle) and I could never find a way of reproducing the issue with any stress test.

I don't know if you're the one I discussed this issue with on Anandtech's forum? If so, I've already suggested the above to you. ;)[/QUOTE]

Yeah that was me. I haven't forgotten, but as you can imagine that involves a much bigger time commitment as I have to reinstall all my software (to be able to work) on the replacement SSD and then resync everything when I switch back, so I was putting it off in hopes something else was the cause, but it looks like I don't have much choice now.

pegnose 2016-01-21 16:12

Congrats btw. to 2^(74,207,281)-1! :)

pegnose 2016-01-21 16:17

[QUOTE=s1riker;423383]Yeah that was me. I haven't forgotten, but as you can imagine that involves a much bigger time commitment as I have to reinstall all my software (to be able to work) on the replacement SSD and then resync everything when I switch back, so I was putting it off in hopes something else was the cause, but it looks like I don't have much choice now.[/QUOTE]

No, just use DriveImage XML or some other tool like Drive Snapshot and restore to another drive. You could even try drive-to-drive with DXML. That's a matter of <60 min.

EDIT: If you do this, your new drive initially might not boot. You can (with DXML) set a new disk ID and then use Windows startup repair to get it going again. Keep your old system as a backup/fallback.

And for debugging purposes, you could simply dual-boot with a fresh system.

kladner 2016-01-21 16:21

[QUOTE=pegnose;423388]No, just use [B]DriveImage XML[/B] or some other tool like [B]Drive Snapshot[/B] and restore to another drive. You could even try drive-to-drive with DXML. That's a matter of <60 min.[/QUOTE]

These tips are much appreciated! I was unaware of those programs.

s1riker 2016-01-21 16:22

[QUOTE=pegnose;423388]No, just use DriveImage XML or some other tool like Drive Snapshot and restore to another drive. You could even try drive-to-drive with DXML. That's a matter of <60 min.[/QUOTE]

Yeah good point. I was thinking I was going to use a smaller spare SSD I had lying around, which didn't have enough space to do an image, but I just realized I no longer have that SSD, so I'm going to have to use a mechanical drive instead.

Brunnis 2016-01-21 16:25

[QUOTE=pegnose;423388]No, just use DriveImage XML or some other tool like Drive Snapshot and restore to another drive. You could even try drive-to-drive with DXML. That's a matter of <60 min.

EDIT: If you do this, your new drive initially might not boot. You can (with DXML) set a new disk ID and then use Windows startup repair to get it going again. Keep your old system as a backup/fallback.[/QUOTE]
Yep, I agree. I recently used Macrium Reflect Free to clone an old hard drive directly to a new SSD. It worked exactly as advertised, and the system booted up without any fuss. The only thing that took some time was the actual copying of the content.

chalsall 2016-01-21 16:36

[QUOTE=s1riker;423371]EDIT: and for chalsall, I know this is not the scientific method, I'm trying multiple things at a time. But the problem is I have only 1 workstation and I need to use it as well for real work, if I were to try 1 single thing at a time, I could be at this for years. I did however test new MB (same model), new RAM (same model) and new PSU (different model) in isolation and am confident none of those are defective.[/QUOTE]

OK. And again, please trust I know how difficult this can be. Been there, done that (many times).

One suggestion... If you can, try a different MB by a different manufacturer with all the other components unchanged.

This /may/ be fundamental to the CPU and/or the chipset. But, as demonstrated with the "Skylake freezing bug", the "Big Boys" won't take you terribly seriously until and unless you can provide a reliably reproducible fault scenario with only one component identified as the likely cause.

pegnose 2016-01-21 16:53

[QUOTE=Brunnis;423392]Yep, I agree. I recently used Macrium Reflect Free to clone an old hard drive directly to a new SSD. It worked exactly as advertised, and the system booted up without any fuss. The only thing that took some time was the actual copying of the content.[/QUOTE]

Yes, these dedicated migration tools work even better. Some of them come with new SSDs (e.g. tools based on Acronis) and work if at least one drive of a certain brand is involved (e.g. WD).

In any case, make sure the old partition is smaller than the new one (at least with the software I mentioned above). If it is not, use Windows or other tools to shrink the partition (e.g. EaseUS Partition Master). Windows can't move the file table and certain other system files; third-party tools do this during start-up.

EDIT: Ok, I see that Macrium Reflect can do even more, thanks for this valuable tip!

LaurV 2016-01-22 10:02

[QUOTE=s1riker;423383]Yeah that was me. I haven't forgotten, but as you can imagine that involves a much bigger time commitment as I have to reinstall all my software (to be able to work) on the replacement SSD and then resync everything when I switch back, so I was putting it off in hopes something else was the cause, but it looks like I don't have much choice now.[/QUOTE]
Why the bigger time commitment? Add another SSD of the same size, configure it as RAID1, run for a while (or use sync software provided by the BIOS or the OS), then take the old one out.

pegnose 2016-01-22 10:56

[QUOTE=LaurV;423506]Why the bigger time commitment? Add another SSD of the same size, configure it as RAID1, run for a while (or use sync software provided by the BIOS or the OS), then take the old one out.[/QUOTE]

Nice! :)

s1riker 2016-01-22 15:39

[QUOTE=LaurV;423506]Why the bigger time commitment? Add another SSD of the same size, configure it as RAID1, run for a while (or use sync software provided by the BIOS or the OS), then take the old one out.[/QUOTE]

Sure, if money were no issue :) ... between buying another Samsung 850 Pro 512GB, another Z170 motherboard to try out, and the tanking CAD dollar, this build is going to end up costing me well over $2.5k.

pegnose 2016-01-22 19:11

Looks like I am through with upping voltages. Now I am at disabling onboard devices: HD Audio, USB 3.1, Thunderbolt and ASMedia SATA are all off. We'll see.

My final shot before RMAing major components would be a fresh OS on a drive I haven't used in that system yet.

In any case, it might be that my previous RAM issues had nothing to do with my system being unstable under load. I don't even know whether they manifested in the OS at all. I am beginning to believe that I have (had) two issues, one of which is not RAM and is responsible for the hard locks: my machine finally survived whole days of memtesting, yet the issues in Win10 are as present as ever.

s1riker 2016-01-22 19:20

[QUOTE=pegnose;423568]Looks like I am through with upping voltages. Now I am at disabling onboard devices: HD Audio, USB 3.1, Thunderbolt and ASMedia SATA are all off. We'll see.

My final shot before RMAing major components would be a fresh OS on a drive I haven't used in that system yet.

In any case, it might be that my previous RAM issues had nothing to do with my system being unstable under load. I don't even know whether they manifested in the OS at all. I am beginning to believe that I have (had) two issues, one of which is not RAM and is responsible for the hard locks: my machine finally survived whole days of memtesting, yet the issues in Win10 are as present as ever.[/QUOTE]

Yeah, I'm done with touching voltages as well; it doesn't make sense to touch them anyway, because they are all handled, or should be handled, dynamically.

Before I go ahead and try a different drive, I've disabled Intel LPM (link power management) in the Intel Rapid Storage Technology console. Since that affects idle operation, it's worth a shot. I assume you've already tried that?

The other thing I've done is set up HWiNFO to log everything it can every second and write it to a file, in the hopes that I can look at the log the next time it happens and see if anything looks weird, e.g. a sudden drop in voltage, a jump in temperature, etc.
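[Editor's note: the log-scanning idea above can be sketched as follows. This is an illustration only; the column name "Vcore [V]" and the 0.05 V threshold are assumptions for the example, and the exact headers in an HWiNFO CSV export will differ per system.]

```python
# Sensor-log spike scanner (illustrative sketch, not from the thread).
# Flags samples where a numeric column jumps sharply between consecutive
# readings, e.g. a sudden voltage drop or temperature spike.
import csv

def find_spikes(rows, column, threshold):
    """Return (index, previous, current) for every sample in `rows` (a list
    of dicts) where `column` changes by more than `threshold` versus the
    previous sample. Rows with a missing or non-numeric value are skipped."""
    spikes = []
    prev = None
    for i, row in enumerate(rows):
        try:
            cur = float(row[column])
        except (KeyError, ValueError):
            continue
        if prev is not None and abs(cur - prev) > threshold:
            spikes.append((i, prev, cur))
        prev = cur
    return spikes

def scan_csv(path, column="Vcore [V]", threshold=0.05):
    """Scan a CSV sensor log (one sample per row) for spikes in `column`."""
    with open(path, newline="") as f:
        return find_spikes(list(csv.DictReader(f)), column, threshold)
```

Run it over the log after the next freeze; anything flagged in the last few seconds before the final sample is a candidate cause.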

chalsall 2016-01-22 19:20

[QUOTE=pegnose;423568]... but the issues in Win10 are as present as ever.[/QUOTE]

Have you considered that the OS might be the issue? It's not like Micro$oft has never had bugs.... :wink:

s1riker 2016-01-22 19:26

[QUOTE=chalsall;423573]Have you considered that the OS might be the issue? It's not like Micro$oft has never had bugs.... :wink:[/QUOTE]

People are saying the same thing happens in Windows 7 and Linux, but who knows if it's the exact same problem pegnose and I are having. A lot of people in the Tom's Hardware forum think they have the same problem, but in many cases it's your standard run-of-the-mill BSOD, a crash under load, or something entirely different. But you're right, it's definitely something to consider.

pegnose 2016-01-22 19:37

[QUOTE=s1riker;423577]People are saying the same thing happens in Windows 7 and Linux, but who knows if it's the exact same problem pegnose and I are having. A lot of people in the Tom's Hardware forum think they have the same problem, but in many cases it's your standard run-of-the-mill BSOD, a crash under load, or something entirely different. But you're right, it's definitely something to consider.[/QUOTE]

Well, I had this before upgrading from Win7 to Win10. But as I said: "My final shot before RMAing major components would be a fresh OS on a drive I haven't used in that system, yet."

pegnose 2016-01-22 19:39

[QUOTE=s1riker;423572]Yeah, I'm done with touching voltages as well; it doesn't make sense to touch them anyway, because they are all handled, or should be handled, dynamically.

Before I go ahead and try a different drive, I've disabled Intel LPM (link power management) in the Intel Rapid Storage Technology console. Since that affects idle operation, it's worth a shot. I assume you've already tried that?[/QUOTE]

I didn't have it installed until very recently. And LPM is still disabled, I think. Plus it is not enabled in the BIOS (although that option says "Aggressive Link Power Management").

[QUOTE=s1riker;423572]The other thing I've done is set up HWiNFO to log everything it can every second and write it to a file, in the hopes that I can look at the log the next time it happens and see if anything looks weird, e.g. a sudden drop in voltage, a jump in temperature, etc.[/QUOTE]

Yeah, I am doing that too, but I have never seen anything interesting in it. Tell us if you do.

pegnose 2016-01-22 19:43

I will go play some Fallout 4 now. See you after the next crash. ;)

chalsall 2016-01-22 19:52

[QUOTE=pegnose;423581]Well, I had this before upgrading from Win7 to Win10. But as I said: "My final shot before RMAing major components would be a fresh OS on a drive I haven't used in that system, yet."[/QUOTE]

If I may give some advice, based on decades of experience (I'm an old guy)...

The issue /may/ be in the current Windows codebase, but it /might/ also be the drivers supplied for and by third-party hardware suppliers (even if "Certified" by M$), including the drivers for the MB.

One suggestion would be, if you're going to try a "Fresh OS", to install as little additional hardware (and thus as few drivers) as possible, and see how that goes.

Another suggestion would be to fall back to an OS before Win7 (if that is even possible), and see how things go. This last suggestion is a shot in the dark; I haven't used any Micro$oft software for more than twenty years, but it might be worth trying if possible.

pegnose 2016-01-22 23:05

[QUOTE=chalsall;423585]If I may give some advice, based on decades of experience (I'm an old guy)...

The issue /may/ be in the current Windows codebase, but it /might/ also be the drivers supplied for and by third-party hardware suppliers (even if "Certified" by M$), including the drivers for the MB.

One suggestion would be, if you're going to try a "Fresh OS", to install as little additional hardware (and thus as few drivers) as possible, and see how that goes.[/QUOTE]

That is definitely very good advice!

[QUOTE=chalsall;423585]Another suggestion would be to fall back to an OS before Win7 (if that is even possible), and see how things go. This last suggestion is a shot in the dark; I haven't used any Micro$oft software for more than twenty years, but it might be worth trying if possible.[/QUOTE]

I doubt that Skylake would run very well under XP.

chalsall 2016-01-22 23:19

[QUOTE=pegnose;423639]I doubt that Skylake would run very well under XP.[/QUOTE]

Define "very well". Very fast, or very stable?

Also, please note that there are many Linux distributions pre-dating Win7 which could be brought into this testing.


All times are UTC. The time now is 23:23.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.