mersenneforum.org  

Old 2016-01-20, 15:09   #430
s1riker

Jan 2016

31 Posts

Quote:
Originally Posted by pegnose
The (sadly) misleading thing in the news (web article headlines state this explicitly) is the claim that the bug leads to computers freezing. It might (who knows; there are so many Skylake problems out there with memory, power management, etc. that actually do freeze PCs), but I learned here (and by testing) that Prime95 (at least 27.x) will just drop workers.

Unfortunately, by now many people whose Skylakes freeze believe that the 768k bug accounts for their problems. And the first of them are already disappointed to learn that a BIOS update with the new microcode isn't helping their PCs.

I am also curious. I had been solving memory issues on my machine until I could finally run memtest86 and Prime95 for whole days without a freeze, and idle states wouldn't bother me either (as opposed to other people). At that point I decided it would be worth finally resyncing my software RAID (which usually takes ~1d too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my French.

Ok, I had Prime95 running during that on all 8 cores, but that shouldn't be an issue for a 1.5k Euro machine. Seriously not.
I told you there was some fundamental bug here :) .. I only smile because there is someone else who can join my misery. It should not be this hard to get a stable system. I'm on my last set of things to try (increased VCCIO to 1.0v, load line calibration set to high and C-State 8 off). I'm not hopeful it'll fix anything. I'm just going to wait till around July/August and RMA my CPU in the hopes that a new stepping will fix the issue.

Edit: The interesting thing is that it happened for you during high I/O, which is in line with what Solis3 was saying in the Tom's thread. Maybe that's something we should investigate.

Last fiddled with by s1riker on 2016-01-20 at 15:12
Old 2016-01-20, 16:02   #431
Madpoo
Serpentine Vermin Jar

Jul 2014

3,313 Posts

Quote:
Originally Posted by pegnose
...
I am also curious. I had been solving memory issues on my machine until I could finally run memtest86 and Prime95 for whole days without a freeze, and idle states wouldn't bother me either (as opposed to other people). At that point I decided it would be worth finally resyncing my software RAID (which usually takes ~1d too).

And what can I say? MY PC FROZE!!! I am so p***ed off, excuse my French.

Ok, I had Prime95 running during that on all 8 cores, but that shouldn't be an issue for a 1.5k Euro machine. Seriously not.
Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's mem usage crept past a certain point and hit some bad section of memory, the thing would blue screen and an unrecoverable memory error was logged (it was a Proliant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit pattern imaginable. It "only" had 36 GB so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the Proliant in "spare memory" mode, which effectively took the bad module out of service. Then a couple of months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.
Old 2016-01-20, 16:21   #432
s1riker

Jan 2016

111112 Posts

Quote:
Originally Posted by Madpoo
Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's mem usage crept past a certain point and hit some bad section of memory, the thing would blue screen and an unrecoverable memory error was logged (it was a Proliant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit pattern imaginable. It "only" had 36 GB so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the Proliant in "spare memory" mode, which effectively took the bad module out of service. Then a couple of months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.
I would have probably posted something like this before I had built this system :) ... It's not the PSU or bad RAM. I've tried both. I have two high-end PSUs that are marked as fully compatible with Haswell, one of which was in my previous build for about 2 years and was completely stable. Both exhibit the same hard lock with my Skylake-S build. Same with RAM: I tried 2 separate sets of RAM, at default speeds. I even tried single sticks just to see, and the hard lock still happened. Others with different motherboards, different RAM and different PSUs (who've also tried swapping out components) have had the same issue with Skylake-S.
Old 2016-01-20, 19:08   #433
pegnose

Jan 2016

34 Posts

Quote:
Originally Posted by s1riker
I told you there was some fundamental bug here :) .. I only smile because there is someone else who can join my misery. It should not be this hard to get a stable system. I'm on my last set of things to try (increased VCCIO to 1.0v, load line calibration set to high and C-State 8 off). I'm not hopeful it'll fix anything. I'm just going to wait till around July/August and RMA my CPU in the hopes that a new stepping will fix the issue.

Edit: The interesting thing is that it happened for you during high I/O, which is in line with what Solis3 was saying in the Tom's thread. Maybe that's something we should investigate.
You can smile, fine by me. ;)

However, I highly doubt that something is broken, because I can kick my system as hard as I can and it only freezes after many hours, which more or less rules out a power or heat problem. It is hard to understand. The freezes got less and less frequent; this is a gradual thing. If anything, I will swap the memory for a 100% compatible kit. After that I can, again, open a support case with ASUS.

ALSO, my system was stable for the first two months! Or it happens mostly with particular load scenarios (games in my case: Fallout 4 and Deus Ex Human Revolution; but not Dying Light, so far).

Last fiddled with by pegnose on 2016-01-20 at 19:17
Old 2016-01-20, 19:15   #434
pegnose

Jan 2016

34 Posts

Quote:
Originally Posted by Madpoo
Could that be a power supply issue? You had the CPU running full tilt *and* the RAID rebuild would have caused the drives to use max power for a long period of time.

Just guessing... software RAID also uses a lot of RAM so if there was anything bad with it, you might have bumped into the bad section of memory.

I had a server a few years back that ran great most of the time. I could run Prime95 stress tests for days on end, no worries. I ran memtest on it for about a week solid, no problems. But when this thing was running normally and happened to take over as the primary SQL node in a cluster, once SQL's mem usage crept past a certain point and hit some bad section of memory, the thing would blue screen and an unrecoverable memory error was logged (it was a Proliant, so it actually tells you which DIMM it was).

I mean, I threw everything I could at this to try and replicate the issue with artificial memory tests, using every type of bit pattern imaginable. It "only" had 36 GB so memtest could run through the whole thing in a fairly short time, and yeah, I left it running from a memtest bootable USB for around a week and it never did throw an error.

To work around it I set up the Proliant in "spare memory" mode, which effectively took the bad module out of service. Then a couple of months later I finally got onsite with a new module and swapped it out, and the problem disappeared.

But it goes to show that even the best artificial tests out there are no match for real workloads.
I highly doubt a power issue. My PSU is 700 W, and BeQuiet is an outstanding brand regarding the amps of individual rails. Also, with the CPU at max my whole system won't consume more than 200 W, whereas with my GPU at full load it is at 450 W. And even like this I can't produce a freeze with Furmark for 1+ hours (also, a PSU problem rather causes a black-out; I had this when my GTX 980Ti was attached to only one of the PCIe power rails ;).

RAM? a) I have 16 GB, b) memtest86 ran for 10h straight. Ok, for stress testing you should use HCI's version. I will do that soon. But I am not so sure any more that it is memory for me.

Ok, so it was a faulty module for you and you couldn't detect it with memtest? That is actually weird. But I would be happy about this. Probably I will know soon, because I plan on switching the brand (more people have issues with Crucial on recent ASUS boards). The only problem is that some of the modules on ASUS' HCL for that board are hard to get in Germany, and the others have some addendum like "(ver. 5.26)". How am I supposed to get exactly THAT version?!

Nevertheless, thank you very much for sharing your experience with me!

Last fiddled with by pegnose on 2016-01-20 at 19:21
Old 2016-01-20, 19:30   #435
pegnose

Jan 2016

34 Posts

Quote:
Originally Posted by s1riker
I would have probably posted something like this before I had built this system :) ... It's not the PSU or bad RAM. I've tried both. I have two high-end PSUs that are marked as fully compatible with Haswell, one of which was in my previous build for about 2 years and was completely stable. Both exhibit the same hard lock with my Skylake-S build. Same with RAM: I tried 2 separate sets of RAM, at default speeds. I even tried single sticks just to see, and the hard lock still happened. Others with different motherboards, different RAM and different PSUs (who've also tried swapping out components) have had the same issue with Skylake-S.
Yes, I also think it is the platform as such. You can do things to get it more stable, like disabling C-states or optimizing RAM compatibility. But somehow these measures only bring a gradual improvement.

My last freeze (and the first AFTER I thought I was finally good) now was with high load on CPU and HDD (ASM1061 in particular; which is PCH in platform terms). Memory is SA, the other side of the DMI link, if I am correct. Interestingly, I had freezes when I activated native power management for the DMI link (the PCH side of it) in the BIOS. Maybe that is the relevant component.

- idle state (power management) for some
- video streaming (to the hdd) for others
- memory for most of us

Does that make sense?
Old 2016-01-20, 19:42   #436
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

2×67×73 Posts

Quote:
Originally Posted by pegnose
Does that make sense?
Certainly. But please keep in mind that modern computers are /incredibly/ complicated, with many different interacting components (often from different suppliers) which each must work perfectly for the system as a whole to work correctly.

Within the software industry there's the term "once a month" bug. This means that something is going demonstrably wrong, but it is not easy to reproduce deterministically. This is where the expression "have you tried turning it off and on again" comes from.

To try to interject something useful to this post, have you looked at your PSU's power rails loading? It is possible your PSU is fine, but by chance you have one or more of your rails /just/ at the edge of its rating because of the power cabling configuration.
Old 2016-01-20, 20:20   #437
pegnose

Jan 2016

34 Posts

Quote:
Originally Posted by chalsall
Certainly. But please keep in mind that modern computers are /incredibly/ complicated, with many different interacting components (often from different suppliers) which each must work perfectly for the system as a whole to work correctly.

Within the software industry there's the term "once a month" bug. This means that something is going demonstrably wrong, but it is not easy to reproduce deterministically. This is where the expression "have you tried turning it off and on again" comes from.

To try to interject something useful to this post, have you looked at your PSU's power rails loading? It is possible your PSU is fine, but by chance you have one or more of your rails /just/ at the edge of its rating because of the power cabling configuration.
Thanks for chiming in! Yes, this problem certainly is hard to track and of a complicated, multifaceted nature. The common factor seems to be "unpredictable". ;)

Unfortunately, it is more "once every 4 days" for many users. So you wouldn't think about just ignoring it. And it really seems to affect a lot of people out there.

Yes, I thought about the power rails. But...

- my system consumes up to 450 W, on a 700 W PSU of a solid brand (BeQuiet)
- with CPU and mobo connectors there is no choice
- one SSD is M.2 on the board, the other three consume up to... 20 W?
- then there is my GTX980Ti which takes up to 300 W, but (now) it is connected to both PCIe rails (before, when I was still naive, hehe, I had 2 (!) power outages in 5 months - real black-outs (PC only, I mean: instant-off) - which is nothing compared to the Skylake hard-lock); I even had contact with the BeQuiet support on that matter: with both rails I should be more than fine now

So... unless my PSU has a weird defect, which is shared by many other users with many other PSUs... it is not the culprit, I guess.
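
As a rough sanity check of the figures above, the overall headroom can be worked out in a few lines of Python. This is only a sketch: the function name `psu_load_fraction` is made up for illustration, the 450 W and 700 W values are the ones quoted in this post, and the total says nothing about how the load is split across individual rails.

```python
def psu_load_fraction(draw_watts: float, rated_watts: float) -> float:
    """Fraction of the PSU's rated output used by the given total draw."""
    return draw_watts / rated_watts


peak_draw_w = 450   # whole system with the GPU at full load (figure from the post)
psu_rating_w = 700  # rated output of the PSU (figure from the post)

load = psu_load_fraction(peak_draw_w, psu_rating_w)
headroom_w = psu_rating_w - peak_draw_w
print(f"load: {load:.0%}, headroom: {headroom_w} W")  # load: 64%, headroom: 250 W
```

A comfortable total like this does not rule out a single overloaded rail, which is the case that would need per-rail figures from the PSU's label or datasheet.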

Last fiddled with by pegnose on 2016-01-20 at 20:34
Old 2016-01-20, 20:44   #438
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

2·67·73 Posts

Quote:
Originally Posted by pegnose
Unfortunately, it is more "once every 4 days" for many users. So you wouldn't think about just ignoring it. And it really seems to affect a lot of people out there.
Please trust me, I understand. When I was responsible for building out the first wireless WAN here in Barbados the manufacturer had a "once a month" bug. Of course, after deploying ~100 radios, the problem manifested ~3 times a day...
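
The scaling described here is simple arithmetic. As a minimal sketch, assuming each radio fails independently about once every 30 days (the function name and the 30-day month are assumptions for illustration, not figures from the manufacturer):

```python
def expected_failures_per_day(units: int, mean_days_between_failures: float) -> float:
    """Expected fleet-wide failures per day, assuming independent units
    that each fail on average once per `mean_days_between_failures`."""
    return units / mean_days_between_failures


rate = expected_failures_per_day(units=100, mean_days_between_failures=30.0)
print(f"{rate:.1f} failures per day across the fleet")  # 3.3 failures per day across the fleet
```

A bug that looks negligible on one device becomes a daily event once enough devices are deployed, which matches the "~3 times a day" observation.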

Quote:
Originally Posted by pegnose
Yes, I thought about the power rails. But...

- one SSD is M.2 on the board, the other three consume up to... 20 W?
Are the other three drives SSDs or HDs? Please know that "spinning rust storage" can have rather extreme random spikes in their draw (both OS driven, and independent).

Quote:
Originally Posted by pegnose
So... unless my PSU has a weird defect, which is shared by many other users with many other PSUs... it is not the culprit, I guess.
Don't guess. Test.

I would suggest (if you and/or others can) first removing any kit you can (for example, HDs, GPUs, RAM), and rerunning the tests you've used to produce the observed crashing (even if not deterministically -- currently you're doing statistical testing). Swap out MBs and PSUs. Make sure your mains power is good.

This is not to say this is not a CPU issue, but you don't /know/ it is yet.

I hope that makes sense and helps.
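
To put a number on the "statistical testing" point, here is a sketch in Python. It assumes freezes arrive independently at a constant average rate (a Poisson process), which is an assumption and not something established in this thread. The 4-day mean interval is the "once every 4 days" figure quoted above; the function names are hypothetical.

```python
import math


def prob_no_freeze(days: float, mean_days_between_freezes: float) -> float:
    """Probability of seeing zero freezes in `days` if the fault is still
    present and freezes follow a Poisson process with the given mean interval."""
    return math.exp(-days / mean_days_between_freezes)


def days_needed(mean_days_between_freezes: float, significance: float = 0.05) -> float:
    """Length of a freeze-free run needed before the chance of that run
    happening by luck alone drops to `significance`."""
    return -mean_days_between_freezes * math.log(significance)


mean_interval = 4.0  # "once every 4 days", as quoted in this thread
for days in (1, 2, 4, 8):
    print(f"{days} clean day(s): {prob_no_freeze(days, mean_interval):.0%} chance of pure luck")
print(f"clean run needed for 95% confidence: {days_needed(mean_interval):.1f} days")
```

Under these assumptions, a freeze-free run has to last roughly 12 days before the chance of it being pure luck drops below 5%, so a clean day or two after swapping a component says very little.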
Old 2016-01-20, 21:52   #439
kladner

"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

chalsall:
Quote:
Are the other three drives SSDs or HDs? Please know that "spinning rust storage" can have rather extreme random spikes in their draw (both OS driven, and independent).
pegnose:
Quote:
one SSD is M.2 on the board, the other three consume up to... 20 W?
I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.

Last fiddled with by kladner on 2016-01-20 at 21:53
Old 2016-01-20, 22:07   #440
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

2×67×73 Posts

Quote:
Originally Posted by kladner
I'm pretty sure, from the OP's post, that all the drives are SSDs. Twenty watts for three spinners also seems pretty unlikely.
I'm not so sure. Easy to achieve that kind of load at idle spin with no seek.

Please also note this: "My last freeze (and the first AFTER I thought I was finally good) now was with high load on CPU and HDD".

Perhaps pegnose will speak to our questions.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.