mersenneforum.org  

Old 2019-02-07, 19:41   #309
simon389
 
Aug 2013

57₁₆ Posts

Quote:
Originally Posted by Mysticial View Post
I didn't realize there were multiple of these machines!

I guess the obvious questions/clarifications are:
  • Are all the systems identical? (same CPU, mobo, memory, etc...)
  • Are all the systems unstable?
  • If they are unstable, are they all unstable in the same way? (AIDA64 AVX512 + Prime95 PRP?)
Slightly different mobo, slightly different RAM. Otherwise all 4 machines are the same.

2/4 have EVGA x299 Micro (131-SX-E295) - latest BIOS
2/4 have EVGA x299 Micro 2 (121-SX-E296) - latest BIOS
2/4 have G.SKILL Sniper X Series 64GB (4 x 16GB) 288-Pin DDR4 SDRAM DDR4 3600 (PC4 28800) Desktop Memory Model F4-3600C19D-32GSXKB
2/4 have G.SKILL Ripjaws V Series 64GB (4 x 16GB) 288-Pin DDR4 SDRAM DDR4 3600 (PC4 28800) Desktop Memory Model F4-3600C19D-32GVRB

4/4 have:
Intel 9800X Skylake-X CPU (lapped)
Noctua NH-D15 coolers with NT-H1 paste (all temps under 80C at load)
All are using XMP settings for 3600 MHz RAM @ CAS 19
400W Seasonic Platinum PSU (actually one has an 850W to see if that was the problem)
Headless with junk 2nd hand graphics cards
120GB SSD
Win10 64 bit
Cheap microATX cases
All are running default BIOS settings except for XMP and the one I’m underclocking to reach stability
All are plugged into a Kill A Watt to measure consumption.
All are stable under 29.4 (no AVX512) and unstable @ 29.5b9 (AVX512)

Last fiddled with by simon389 on 2019-02-07 at 19:53
Old 2019-02-07, 20:01   #310
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9481₁₀ Posts

Quote:
Originally Posted by simon389 View Post
Slightly different mobo, slightly different RAM. Otherwise all 4 machines are the same.
Please forgive me for this, but your MB count adds up to eight. May I presume you meant you have four different MBs, from two different manufacturers? That's a good sign.

The only important commonality I see is the Intel 9800X Skylake-X CPU (lapped).

If you have the time and the coin, swapping one of those out and testing deeply might be worth your time.

Intel have learnt the hard way that selling bad kit is not good for public relations, nor their stock price....

Edit: Just to be clear, you "lapped" your kit!?!?!? And you're wondering why it's not working reliably?

Last fiddled with by chalsall on 2019-02-07 at 20:13 Reason: s/only commonality/only important commonality/;
Old 2019-02-07, 20:02   #311
preda
 
"Mihai Preda"
Apr 2015

53F₁₆ Posts

Quote:
Originally Posted by simon389 View Post
I’m dropping all the speeds. I keep an eye on CPU-Z and even with all speeds @ 3.8 (incl AVX and AVX512 @ 3.8) I can't get a stable system. Failed at 7.5 hours. Trying now with 3.7 ::shrug::
IMHO, it would be better to keep the failing system untouched and failing, and run mprime95 again on it, maybe even the same exponent again, to hopefully fail in the same way -- in order to find out why PRP didn't detect the error. That's more important than getting the one system stable.

The core problem is not why the system failed, but why the PRP didn't catch it.
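
For context, the PRP error detection being discussed is presumably the Gerbicz check, which keeps a running product of checkpoint residues and compares it against an independently computed value. What follows is a minimal Python sketch of that principle only, not Prime95's actual code: the function name `prp_with_check`, the `block` size, and the `corrupt_at` fault injection are all hypothetical, and real implementations verify far less often and operate on FFT data rather than Python integers.

```python
def prp_with_check(n, iterations, block, base=3, corrupt_at=None):
    """Square `base` modulo n `iterations` times, checking every `block` steps.

    Returns (residue, ok). `corrupt_at` is a hypothetical fault injector that
    flips the lowest bit of the residue after the given iteration.
    """
    u0 = base % n
    u = u0          # current residue, u_i = base^(2^i) mod n
    d = u0          # running product u_0 * u_L * u_2L * ... mod n
    for i in range(1, iterations + 1):
        u = u * u % n
        if corrupt_at == i:
            u ^= 1  # simulate a single-bit hardware error
        if i % block == 0:
            d_prev = d
            d = d_prev * u % n
            # Raising the old product to 2^block advances every factor by one
            # block, so u_0 * d_prev^(2^block) must equal the new product.
            expected = u0 * pow(d_prev, 1 << block, n) % n
            if d != expected:
                return u, False
    return u, True


if __name__ == "__main__":
    n = 2**89 - 1  # a small Mersenne prime, chosen only to keep the demo fast
    print(prp_with_check(n, 80, 8)[1])                 # clean run
    print(prp_with_check(n, 80, 8, corrupt_at=40)[1])  # run with injected error
```

In this toy setting the clean run passes the check and the run with the injected bit flip fails it. Whatever slipped through on the real machines would have to be an error that this kind of comparison does not cover, which is the question being raised here.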
Old 2019-02-07, 20:14   #312
Mysticial
 
Sep 2016

331 Posts

Quote:
Originally Posted by simon389 View Post
Slightly different mobo, slightly different RAM. Otherwise all 4 machines are the same.

2/4 have EVGA x299 Micro (131-SX-E295) - latest BIOS
2/4 have EVGA x299 Micro 2 (121-SX-E296) - latest BIOS
2/4 have G.SKILL Sniper X Series 64GB (4 x 16GB) 288-Pin DDR4 SDRAM DDR4 3600 (PC4 28800) Desktop Memory Model F4-3600C19D-32GSXKB
2/4 have G.SKILL Ripjaws V Series 64GB (4 x 16GB) 288-Pin DDR4 SDRAM DDR4 3600 (PC4 28800) Desktop Memory Model F4-3600C19D-32GVRB

4/4 have:
Intel 9800X Skylake-X CPU (lapped)
Noctua NH-D15 coolers with NT-H1 paste (all temps under 80C at load)
All are using XMP settings for 3600 MHz RAM @ CAS 19
400W Seasonic Platinum PSU (actually one has an 850W to see if that was the problem)
Headless with junk 2nd hand graphics cards
120GB SSD
Win10 64 bit
Cheap microATX cases
All are running default BIOS settings except for XMP and the one I’m underclocking to reach stability
All are plugged into a Kill A Watt to measure consumption.
All are stable under 29.4 (no AVX512) and unstable @ 29.5b9 (AVX512)
The thing that stands out is the XMP. 3600 MT/s is high for this platform. Both of my Skylake X boxes have trouble at this speed.

But since there are 4 machines here failing the same way, it seems unlikely this would be the cause.

It's worth turning off the XMP anyway to see if you still see the instability. But I doubt this is the cause. Memory instability is highly variable across even identical systems. So it seems unlikely that it would put all 4 systems right on the edge of stability.

Quote:
Originally Posted by preda View Post
IMHO, it would be better to keep the failing system untouched and failing, and run mprime95 again on it, maybe even the same exponent again, to hopefully fail in the same way -- in order to find out why PRP didn't detect the error. That's more important than getting the one system stable.

The core problem is not why the system failed, but why the PRP didn't catch it.
There are two things we're trying to solve here, and they are not conflicting. We already know the settings that are failing - stock settings. And it can be easily returned to those settings at any time.

So while we are blocked trying to solve the software issue, it is still safe to try to (in parallel) debug the hardware issue as well. There's also multiple boxes here exhibiting the same issue. So if we're extra paranoid, just leave one of them untouched.

Last fiddled with by Mysticial on 2019-02-07 at 20:25
Old 2019-02-07, 20:29   #313
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

19×499 Posts

Quote:
Originally Posted by preda View Post
The core problem is not why the system failed, but why the PRP didn't catch it.
That is one of two problems we're currently "working".

Some of us prefer to work with stable kit before trying to figure out why the software failed.

But your point is valid. Possibly simon389 will agree to not fix one of his machines, so George can run tests on it to figure out why the software didn't correctly determine that the hardware was insane.
Old 2019-02-07, 20:35   #314
R. Gerbicz
 
"Robert Gerbicz"
Oct 2005
Hungary

5·17² Posts

Quote:
Originally Posted by preda View Post
run mprime95 again on it, maybe even the same exponent again,
If he is getting/using a different shift value for PRP then he will squaremod very different numbers.
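
To illustrate the point about shift values: assuming the shift is a left rotation of the working value by s bits, which is the same as multiplying by 2^s modulo 2^p - 1, every stored residue differs between two runs with different shifts even though both unshift to the same result. A small Python sketch under that assumption (the helper names `squarings` and `unshift` are made up for illustration and are not Prime95 functions):

```python
def squarings(p, count, shift=0, base=3):
    """Return (stored_values, final_shift) for `count` squarings mod 2^p - 1.

    The stored value is the true residue multiplied by 2^shift. Squaring
    doubles the shift, reduced mod p because 2^p == 1 (mod 2^p - 1).
    """
    n = (1 << p) - 1
    s = shift % p
    x = base * pow(2, s, n) % n
    stored = [x]
    for _ in range(count):
        x = x * x % n      # this is the number the hardware actually squares
        s = 2 * s % p      # track how far the stored value is rotated
        stored.append(x)
    return stored, s


def unshift(x, s, p):
    """Undo a shift of s bits: multiply by 2^(p - s), the inverse of 2^s."""
    n = (1 << p) - 1
    return x * pow(2, (p - s) % p, n) % n


if __name__ == "__main__":
    p = 89
    plain, s0 = squarings(p, 20, shift=0)
    shifted, s1 = squarings(p, 20, shift=17)
    # The operands differ at every step, yet the unshifted result agrees.
    print(all(a != b for a, b in zip(plain, shifted)))
    print(unshift(shifted[-1], s1, p) == plain[-1])
```

So a rerun of the same exponent only exercises the same bit patterns if the shift is also the same; otherwise a data-dependent hardware fault may not reproduce.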

Quote:
Originally Posted by chalsall View Post
But your point is valid. Possibly simon389 will agree to not fix one of his machines, so George can run tests on it to figure out why the software didn't correctly determine that the hardware was insane.
Agree.
Old 2019-02-07, 21:08   #315
simon389
 
Aug 2013

1010111₂ Posts

Quote:
Originally Posted by chalsall View Post
Just to be clear, you "lapped" your kit!?!?!? And you're wondering why it's not working reliably?
They had problems before lapping the IHS. I lapped to try and get temps down (and the temps did indeed lower but the errors remained). I have lapped many times. No problems ever.

Last fiddled with by simon389 on 2019-02-07 at 21:09
Old 2019-02-07, 21:25   #316
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

10010100001001₂ Posts

Quote:
Originally Posted by simon389 View Post
I have lapped many times. No problems ever.
With all due respect, you are reporting problems now.

And you will not be able to return them as defective, since you have modified them from spec.
Old 2019-02-07, 21:32   #317
Mysticial
 
Sep 2016

331 Posts

Quote:
Originally Posted by chalsall View Post
With all due respect, you are reporting problems now.

And you will not be able to return them as defective, since you have modified them from spec.
If I had to bet (without much confidence), I would put my money on a mobo/ram incompatibility issue.

These machines have the same mobo, with the same memory brand and type (albeit with different decorations). I've seen these sorts of things show up all the time on my own hardware, even for memory that's on the QVL of the mobo. I've had multiple instances where the QVL has failed me. Quite often, downclocking doesn't work, and the only solution is to change either the mobo or the memory.

The only thing that doesn't really support this is that the errors are AVX512-sensitive. Perhaps the AVX512 makes the workload sufficiently memory-bound to stress the memory in a way that's not possible with just AVX or scalar.

I find it hard to believe that sanding down the IHS would cause problems like this. I'd expect it to be no issues at all, unrelated issues (like temperature), or catastrophic issues like a missing memory channel, or complete failure of the chip. And then you have the fact that it's 4/4 machines here - all with the exact same symptoms.

Last fiddled with by Mysticial on 2019-02-07 at 21:48
Old 2019-02-07, 21:43   #318
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

19×499 Posts

Quote:
Originally Posted by Mysticial View Post
If I had to bet (without much confidence), I would put my money on a mobo/ram incompatibility issue.
Perhaps simon389 would be willing to take some direction from you wrt MB and RAM compatibility, using only the kit they have on hand.

This might only involve some BIOS settings (set at the lowest possible, initially), to experiment....
Old 2019-02-07, 21:47   #319
simon389
 
Aug 2013

127₈ Posts

Lapping was a really smart decision IMHO. Yes, it killed my warranty, but the IHSes were all rounded in the middle and temps were all over the place. It’s amazing, actually, how uneven CPUs and heat sinks are, even of the premium variety.

Quote:
Originally Posted by chalsall View Post
Perhaps simon389 would be willing to take some direction from you wrt MB and RAM compatibility, using only the kit they have on hand.

This might only involve some BIOS settings (set at the lowest possible, initially), to experiment....
More than happy to.

Last fiddled with by simon389 on 2019-02-07 at 21:50
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.