mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

s1riker 2016-01-20 15:09

[QUOTE=pegnose;423191]The (sadly) misleading thing in the news (web article headlines explicitly stating this) is that while it might lead to computers freezing (who knows; although there are so many Skylake problems out there with memory, power management, etc. that actually do freeze PCs), I learned here (and by testing) that Prime95 (at least 27.x) will just drop workers.

Unfortunately, by now many people with Skylakes freezing believe that the 768k bug accounts for their problems. And the first already are disappointed learning that a bios update with the new micro code isn't helping their PCs.

I am also curious. I had solved the memory issues for my machine to the point that I could finally run memtest86 and Prime95 for whole days without a freeze, and idle states wouldn't bother me either (as opposed to other people). At that point I decided it would be worth it to finally resync my software RAID (which usually takes ~1 day too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my french.

Ok, I had Prime95 running during that on all 8 cores, but that shouldn't be an issue for a 1.5k Euro machine. Seriously not.[/QUOTE]

I told you there was some fundamental bug here :) .. I only smile because there is someone else who can join my misery. It should not be this hard to get a stable system. I'm on my last set of things to try (increased VCCIO to 1.0v, load line calibration set to high and C-State 8 off). I'm not hopeful it'll fix anything. I'm just going to wait till around July/August and RMA my CPU in the hopes that a new stepping will fix the issue.

Edit: The interesting thing is that it happened for you during high I/O, which is in line with what Solis3 was saying in the Tom's thread. Maybe that's something we should investigate.

Madpoo 2016-01-20 16:02

[QUOTE=pegnose;423191]...
I am also curious. I had solved the memory issues for my machine to the point that I could finally run memtest86 and Prime95 for whole days without a freeze, and idle states wouldn't bother me either (as opposed to other people). At that point I decided it would be worth it to finally resync my software RAID (which usually takes ~1 day too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my french.

Ok, I had Prime95 running during that on all 8 cores, but that shouldn't be an issue for a 1.5k Euro machine. Seriously not.[/QUOTE]

Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's memory usage crept past a certain point and hit some bad section of memory, the thing would blue-screen and log an unrecoverable memory error (it was a ProLiant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit pattern imaginable. It "only" had 36 GB, so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the ProLiant in "spare memory" mode, which effectively took the bad module out of service. Then a couple of months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.
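The kind of artificial pattern testing described here can be sketched in miniature. This is only a toy sweep over an in-process buffer; real tools like memtest86 or HCI Memtest work on physical RAM with many more patterns and addressing schemes, and the pattern choices below are just illustrative:

```python
# Toy memory pattern test: write known bit patterns into a buffer and
# verify them on read-back. Real testers (memtest86, HCI Memtest) test
# physical memory with far more patterns; this only shows the idea.

def pattern_test(buf_size=1 << 20):
    patterns = [0x00, 0xFF, 0x55, 0xAA]        # zeros, ones, alternating bits
    patterns += [1 << b for b in range(8)]     # "walking ones"
    buf = bytearray(buf_size)
    errors = []
    for p in patterns:
        for i in range(len(buf)):
            buf[i] = p                         # fill with the pattern
        for i, v in enumerate(buf):
            if v != p:
                errors.append((i, p, v))       # offset, expected, got
    return errors

if __name__ == "__main__":
    print("errors:", len(pattern_test()))
```

As the server story shows, a clean run of such a test still proves little: a defect may only be reachable under a real workload's particular access pattern, temperature, and timing.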

s1riker 2016-01-20 16:21

[QUOTE=Madpoo;423224]Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

...[/QUOTE]

I would have probably posted something like this before I built this system :) ... It's not the PSU or bad RAM. I've tried both. I have two high-end PSUs that are marked as fully compatible with Haswell, one of which was in my previous build for about 2 years and was completely stable. Both exhibit the same hard lock with my Skylake-S build. Same with RAM: I tried two separate sets, at default speeds. I even tried single sticks just to see, and the hard lock still happened. Others with different motherboards, different RAM and different PSUs (who've also tried swapping out components) have had the same issue with Skylake-S.

pegnose 2016-01-20 19:08

[QUOTE=s1riker;423214]I told you there was some fundamental bug here :) .. I only smile because there is someone else who can join my misery. It should not be this hard to get a stable system. I'm on my last set of things to try (increased VCCIO to 1.0v, load line calibration set to high and C-State 8 off). I'm not hopeful it'll fix anything. I'm just going to wait till around July/August and RMA my CPU in the hopes that a new stepping will fix the issue.

Edit: The interesting thing is that it happened for you during high I/O, which is in line with what Solis3 was saying in the Tom's thread. Maybe that's something we should investigate.[/QUOTE]

You can smile, fine by me. ;)

However, I highly doubt that something is broken, because I can kick my system as hard as I want and it only freezes after many hours, which somewhat rules out a power or heat problem. It is hard to understand. Freezes have become less and less frequent; this is a gradual thing. If anything, I will swap the memory for a 100% compatible kit. After that I can - again - open a support case with Asus.

ALSO, my system was stable for the first two months! Or rather, it mostly only freezes under particular load scenarios (games in my case: Fallout 4 and Deus Ex: Human Revolution; but not Dying Light, so far).

pegnose 2016-01-20 19:15

[QUOTE=Madpoo;423224]Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

...[/QUOTE]

I highly doubt a power issue. My PSU is 700 W, and BeQuiet is an outstanding brand regarding the amps on individual rails. Even with the CPU at max, my whole system won't consume more than 200 W, whereas with my GPU at full load it is at 450 W. And like this I can't produce a freeze with FurMark for 1+ hours. (Also, a failing PSU rather causes a black-out; I had that when my GTX 980 Ti was attached to only one of the PCIe power rails. ;))

RAM? a) I have 16 GB, b) memtest86 ran for 10 h straight. OK, for stress testing you should use HCI's version; I will do that soon. But I am not so sure it is memory any more for me.

Ok, so it was a faulty module for you and you couldn't detect it with memtest? That is actually weird, but I would be happy with that explanation. I will probably know soon, because I plan on switching brands (more people have issues with Crucial on recent ASUS boards). Only, some of the modules on ASUS' HCL for that board are hard to get in Germany, and the others have some addendum like "(ver. 5.26)". How am I supposed to get exactly THAT version?!

Nevertheless, thank you very much for sharing your experience with me!

pegnose 2016-01-20 19:30

[QUOTE=s1riker;423228]I would have probably posted something like this before I built this system :) ... It's not the PSU or bad RAM. I've tried both. I have two high-end PSUs that are marked as fully compatible with Haswell, one of which was in my previous build for about 2 years and was completely stable. Both exhibit the same hard lock with my Skylake-S build. Same with RAM: I tried two separate sets, at default speeds. I even tried single sticks just to see, and the hard lock still happened. Others with different motherboards, different RAM and different PSUs (who've also tried swapping out components) have had the same issue with Skylake-S.[/QUOTE]

Yes, I also think it is the platform as such. You can do things to make it more stable, like disabling C-states or optimizing RAM compatibility. But somehow these measures only bring a gradual improvement.

My last freeze (and the first AFTER I thought I was finally good) came under high load on both CPU and HDD (the ASM1061 controller in particular, which sits on the PCH side of the platform). Memory is on the SA (system agent) side, the other end of the DMI link, if I am correct. Interestingly, I also had freezes when I activated native power management for the DMI link (the PCH side of it) in the BIOS. Maybe that is the relevant component. It would tie together the trigger scenarios reported so far:

- idle state (power management) for some
- video streaming (to the HDD) for others
- memory for most of us

Does that make sense?

chalsall 2016-01-20 19:42

[QUOTE=pegnose;423251]Does that make sense?[/QUOTE]

Certainly. But please keep in mind that modern computers are /incredibly/ complicated, with many different interacting components (often from different suppliers), each of which must work perfectly for the system as a whole to work correctly.

Within the software industry there's the term "once a month" bug: something goes demonstrably wrong, but it is not easy to deterministically reproduce. This is where the expression "have you tried turning it off and on again" comes from.

To try to interject something useful into this thread: have you looked at the loading on your PSU's power rails? It is possible your PSU is fine, but by chance you have one or more rails /just/ at the edge of their rating because of the power cabling configuration.

pegnose 2016-01-20 20:20

[QUOTE=chalsall;423252]Certainly. But please keep in mind that modern computers are /incredibly/ complicated, with many different interacting components (often from different suppliers), each of which must work perfectly for the system as a whole to work correctly.

Within the software industry there's the term "once a month" bug: something goes demonstrably wrong, but it is not easy to deterministically reproduce. This is where the expression "have you tried turning it off and on again" comes from.

To try to interject something useful into this thread: have you looked at the loading on your PSU's power rails? It is possible your PSU is fine, but by chance you have one or more rails /just/ at the edge of their rating because of the power cabling configuration.[/QUOTE]

Thanks for chiming in! Yes, this problem certainly is hard to track down and of a complicated, multifaceted nature. The common factor seems to be "unpredictable". ;)

Unfortunately, it is more like "once every 4 days" for many users, so you wouldn't think about just ignoring it. And it seems to affect a lot of people out there.

Yes, I thought about the power rails. But...

- my system consumes up to 450 W, on a 700 W PSU from a solid brand (BeQuiet)
- with the CPU and mobo connectors there is no choice
- one SSD is M.2 on the board; the other three drives consume up to... 20 W?
- then there is my GTX 980 Ti, which takes up to 300 W, but (now) it is connected to both PCIe rails (before, when I was still naive, hehe, I had 2 (!) power outages in 5 months - real black-outs (PC only, I mean: instant-off) - which is nothing compared to the Skylake hard-lock); I even contacted BeQuiet support on that matter: with both rails I should be more than fine now

So... unless my PSU has a weird defect that is somehow shared by many other users with many other PSUs... it is not the culprit, I guess.
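The power-budget argument above can be put into numbers. This is only a back-of-the-envelope sketch; the per-component wattages are rough guesses chosen to be consistent with the figures quoted in the thread (450 W system draw, 300 W GPU, 700 W PSU), not measurements:

```python
# Rough PSU headroom check with illustrative numbers from the thread.
# All per-component figures are estimates, not measured values.
loads_w = {
    "CPU at full load (est.)": 100,
    "motherboard, RAM, fans (est.)": 30,
    "SSDs (est.)": 20,
    "GTX 980 Ti at full load (est.)": 300,
}
psu_rating_w = 700

total = sum(loads_w.values())
headroom = psu_rating_w - total
print(f"estimated draw: {total} W of {psu_rating_w} W "
      f"({total / psu_rating_w:.0%} load), headroom: {headroom} W")
```

A total draw well under the rating is necessary but not sufficient: a single overloaded rail or cable can still misbehave even when the overall budget looks comfortable, which is why per-rail loading comes up in this thread at all.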

chalsall 2016-01-20 20:44

[QUOTE=pegnose;423256]Unfortunately, it is more like "once every 4 days" for many users, so you wouldn't think about just ignoring it. And it seems to affect a lot of people out there.[/QUOTE]

Please trust me, I understand. When I was responsible for building out the first wireless WAN here in Barbados, the manufacturer had a "once a month" bug. Of course, after deploying ~100 radios, the problem manifested ~3 times a day...
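That scaling effect is simple rate arithmetic: independent per-unit failure rates add across a fleet. A minimal illustration (the function name and numbers are just for this sketch):

```python
# A "once a month" bug per unit becomes a near-daily event across a
# fleet: with independent failures, fleet_rate = n_units * unit_rate.

def fleet_failures_per_day(n_units, failures_per_unit_per_month,
                           days_per_month=30):
    unit_rate = failures_per_unit_per_month / days_per_month
    return n_units * unit_rate

# ~100 radios, each failing about once a month:
rate = fleet_failures_per_day(100, 1)
print(f"expected fleet failures per day: {rate:.1f}")
```

The ~3-per-day figure from the radio deployment falls straight out of this, and the same logic explains why a "rare" Skylake bug surfaces constantly once enough machines are in the field.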

[QUOTE=pegnose;423256]Yes, I thought about the power rails. But...

- one SSD is M.2 on the board, the other three consume up to... 20 W?[/QUOTE]

Are the other three drives SSDs or HDs? Please know that "spinning rust" storage can have rather extreme random spikes in its power draw (both OS-driven and independent).

[QUOTE=pegnose;423256]So... unless my PSU has a weird defect, which is shared by many other users with many other PSUs... it is not the culprit, I guess.[/QUOTE]

Don't guess. Test.

I would suggest (if you and/or others can) first removing any kit you can (for example, HDs, GPUs, RAM) and rerunning the tests you've used to produce the observed crashing (even if not deterministically -- currently you're doing statistical testing). Swap out MBs and PSUs. Make sure your mains power is good.
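The "statistical testing" point deserves emphasis: with an intermittent fault you cannot prove a fix, only accumulate enough crash-free time that the old failure rate becomes implausible. A rough sketch, assuming (as an idealization) that freezes follow a Poisson process:

```python
import math

# If freezes arrive as a Poisson process with mean interval `mtbf_days`,
# the chance of surviving t days crash-free is exp(-t / mtbf_days).
# Solve for the t at which that chance drops below `alpha`, i.e. how
# long a changed system must run clean before the old rate looks wrong.

def crash_free_days_needed(mtbf_days, alpha=0.05):
    return mtbf_days * math.log(1 / alpha)

# A freeze every ~4 days, 95% confidence threshold:
t = crash_free_days_needed(4)
print(f"run crash-free for about {t:.1f} days before calling it fixed")
```

In other words, a "once every 4 days" machine needs roughly two crash-free weeks after each tweak before the change can be credited, which is why this kind of component-swapping debugging is so slow.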

This is not to say this is not a CPU issue, but you don't /know/ it is yet.

I hope that makes sense and helps.

kladner 2016-01-20 21:52

chalsall:
[QUOTE]Are the other three drives SSDs or HDs? Please know that "spinning rust storage" can have rather extreme random spikes in their draw (both OS driven, and independent).[/QUOTE]pegnose:
[QUOTE][U]one SSD[/U] is M.2 on the board, [U]the other three[/U] consume up to... 20 W?[/QUOTE]I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.

chalsall 2016-01-20 22:07

[QUOTE=kladner;423263]I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.[/QUOTE]

I'm not so sure. It's easy to reach that kind of load even at idle spin with no seeking.

Please also note this: "My last freeze (and the first AFTER I thought I was finally good) now was with high load on CPU and HDD".

Perhaps pegnose will speak to our questions.


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.