mersenneforum.org  

Old 2019-02-05, 19:32   #254
Mysticial
 
Sep 2016

513₈ Posts

Quote:
Originally Posted by simon389
Update with my situation: I installed an 850W Seagate PSU and instead of trying 29.5b9 I ran AIDA64 stress test with FPU option checked to see if it was stable.

The system is not stable in AIDA64, even with the 850W PSU. It frequently crashes AIDA64's stress test, citing "Error: hardware error". So it appears the problem is on my end and is not the result of Prime95's AVX512 optimizations.

I do find it interesting that all 4 of my Skylake-X 9800X CPUs are stable in 29.4 or in stress tests without AVX512 optimizations, but the moment I include AVX512 in AIDA the system breaks down. I will have to do some digging about this situation.
What's the AVX512 offset set to in the BIOS? If it says "auto", what does CPU-Z show the CPU frequencies as when you run AVX512 on all cores?

Last fiddled with by Mysticial on 2019-02-05 at 19:34
Old 2019-02-05, 20:04   #255
GP2
 
Sep 2003

2583₁₀ Posts

Quote:
Originally Posted by simon389
The system is not stable in AIDA64, even with the 850W PSU. It frequently crashes AIDA64's stress test, citing "Error: hardware error". So it appears the problem is on my end and is not the result of Prime95's AVX512 optimizations.

I do find it interesting that all 4 of my Skylake-X 9800X CPUs are stable in 29.4 or in stress tests without AVX512 optimizations, but the moment I include AVX512 in AIDA the system breaks down. I will have to do some digging about this situation.
In its current flawed state, the system is an invaluable source of reliably bad PRP results that test and (so far) overcome the program's defenses against such an outcome.

Actually fixing it would destroy its usefulness for this purpose.

So, how would you feel about shipping the whole box, as-is, to George for him to experiment with? How much out-of-pocket expense would this represent for you?

I am semi-seriously proposing that this forum might take up a collection to make this happen.
Old 2019-02-05, 20:17   #256
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9,473 Posts

Quote:
Originally Posted by GP2
I am semi-seriously proposing that this forum might take up a collection to make this happen.
I would suggest a slightly different "angle of attack". Perhaps simon389 could make the machine available to George by way of remote access.

There is a lot to be said for testing rigorously a machine which is exhibiting an issue which is not yet fully understood, while hoping to change as few variables as possible.
Old 2019-02-05, 20:17   #257
Mysticial
 
Sep 2016

331 Posts

Quote:
Originally Posted by GP2
In its current flawed state, the system is an invaluable source of reliably bad PRP results that test and (so far) overcome the program's defenses against such an outcome.

Actually fixing it would destroy its usefulness for this purpose.

So, how would you feel about shipping the whole box, as-is, to George for him to experiment with? How much out-of-pocket expense would this represent for you?

I am semi-seriously proposing that this forum might take up a collection to make this happen.
I don't think it's necessary. The problem is well known. I've seen it enough myself both on my own hardware and on OC forums that I usually feel confident calling it out.

So if you're an experienced overclocker and you understand the problem, it's pretty easy to trigger it artificially on any hardware. (I've done this myself for the very purpose of testing software error-detection.)

The cause is that the AVX512 is running at too high a frequency. If it's way too high, the system freezes or BSODs. If it's only slightly too high, you get soft errors.

The reason why it's too high is because the BIOS is borked. When Skylake-X first launched, neither the engineers nor support even knew what it was. And there was no software at the time to test AVX512. So almost every single mobo ended up shipping with a broken BIOS that would fail on AVX512.

The fix is either to update to a fixed BIOS (if available) or to manually override the offsets.

Likewise, to artificially trigger it, pick an offset that puts the core right on that edge of stability. If the chip overheats, you may need to drop all the frequencies and pull back the voltage as well. The goal is to get at least one core on that edge without any core going too far unstable. SIMD instability tends to cause soft errors instead of crashes since SIMD rarely affects control flow and is usually on a different power domain in the CPU core. So it's much easier to trigger soft SIMD errors than soft scalar execution errors.

Last fiddled with by Mysticial on 2019-02-05 at 21:01
Old 2019-02-05, 21:41   #258
simon389
 
Aug 2013

3·29 Posts

I'm running this computer with an EVGA X299 2 motherboard using the latest BIOS. There is no "AVX512 offset" setting in the BIOS that I can find, although it seems like people have different names for different things sometimes.

Here's CPU-Z running with Prime95 29.5b9: https://imgur.com/pv5pf59

The 9800X doing an LL double-check with AVX512 seems to be oscillating between 4096 and 4115 MHz. I would be sad to have to underclock my CPU to get it stable at AVX512, since that sort of defeats the purpose of the boost it gives to my iterations per second. Although maybe the voltage simply needs to be bumped up.

Here are the results of running AIDA64 with AVX512 tests enabled: https://imgur.com/S5YET1v

I see that a few cores are getting up to 78C at some point during the test, which is frustrating. I think it's largely because the massive Noctua NH-D15S has a very slight unevenness in the copper face of the heatsink, even after I lapped it with various grits of sandpaper to get it as flat as possible (lapping was very awkward because of the heatsink's size). If you guys think I need to drop all cores below 78C, then I could purchase a new Noctua NH-D15S, which hopefully has a more even surface, and also apply some of that sexy liquid metal thermal compound that I see everybody raving about online. Both of those are probably good for another 2-3C drop in temps.
Old 2019-02-05, 21:51   #259
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9,473 Posts

Quote:
Originally Posted by simon389
I see that a few cores are getting up to 78C...
Personally, I make my morning coffee on my CPUs.

Everything you do is OK, so long as you understand what you are doing....
Old 2019-02-05, 21:57   #260
GP2
 
Sep 2003

2583₁₀ Posts

Quote:
Originally Posted by Mysticial
I don't think it's necessary. The problem is well known. I've seen it enough myself both on my own hardware and on OC forums that I usually feel confident calling it out.
But run-of-the-mill hardware errors are caught by Gerbicz error checking.

This is the only system we know of that reports erroneous final PRP residues to Primenet.
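
For anyone unfamiliar with the idea, here is a minimal Python sketch of a Gerbicz-style check on a repeated-squaring chain. It is a simplified illustration and not Prime95's actual code: the function name `gerbicz_prp`, the block size, and the `flip_at` error-injection hook are all made up for the example, and it only reports a mismatch where a real program would presumably roll back to a saved state.

```python
def gerbicz_prp(p, iterations, block=8, base=3, flip_at=None):
    """Square `base` repeatedly mod 2**p - 1, verifying every `block` steps.

    flip_at is a hypothetical test hook: the iteration at which one bit of
    the running value is flipped to simulate a soft hardware error.
    """
    n = (1 << p) - 1
    u0 = base % n
    u = u0            # u_0
    d = u0            # running checksum, d_0 = u_0
    for i in range(1, iterations + 1):
        u = u * u % n                  # u_i = u_(i-1)^2 mod n
        if i == flip_at:
            u ^= 1                     # injected soft error
        if i % block == 0:
            d_next = d * u % n         # fold the newest block value into the checksum
            # In an error-free run: d_next == u_0 * d^(2^block) mod n
            if d_next != u0 * pow(d, 1 << block, n) % n:
                return ("error", i)
            d = d_next
    return ("ok", u)


print(gerbicz_prp(31, iterations=32))              # clean run
print(gerbicz_prp(31, iterations=32, flip_at=13))  # error injected in the second block
```

The second call should report the mismatch at iteration 16, the first check after the injected error. Checking every block as done here roughly doubles the work, so a practical implementation would verify far less often.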
Old 2019-02-05, 22:04   #261
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9,473 Posts

Quote:
Originally Posted by GP2
This is the only system we know of that reports erroneous final PRP residues to Primenet.
But, as Mysticial has suggested, the reason for this is not a mystery....
Old 2019-02-05, 22:06   #262
Mysticial
 
Sep 2016

101001011₂ Posts

Quote:
Originally Posted by simon389
I'm running this computer with an EVGA X299 2 motherboard using the latest BIOS. There is no "AVX512 offset" setting in the BIOS that I can find, although it seems like people have different names for different things sometimes.

Here's CPU-Z running with Prime95 29.5b9: https://imgur.com/pv5pf59

The 9800X doing an LL double-check with AVX512 seems to be oscillating between 4096 and 4115 MHz. I would be sad to have to underclock my CPU to get it stable at AVX512, since that sort of defeats the purpose of the boost it gives to my iterations per second. Although maybe the voltage simply needs to be bumped up.
4.1 GHz all-core AVX512 is way too high for any chip that isn't overclocked. I can't tell for sure, but it looks like the BIOS is not applying any AVX512 offset.

What's the all-core frequency for non-AVX? Is it the same?

If it's the same, then it confirms my suspicion that the BIOS isn't doing the offsets. I looked up your motherboard and the option does exist. So you might have to find it. When you do, I recommend setting the AVX offset to -3 and the AVX512 offset to -5. That will get you closer to the "true" stock settings.
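
To make the offset arithmetic concrete, here is a rough Python sketch. It assumes a 100 MHz base clock, that offsets are counted in multiplier steps, and an all-core multiplier of 41 read off the ~4.1 GHz CPU-Z figure. None of those are confirmed for this board, and `effective_mhz` is just an illustrative helper.

```python
BCLK_MHZ = 100            # assumed base clock
ALL_CORE_MULTIPLIER = 41  # assumed from the ~4.1 GHz reading in CPU-Z


def effective_mhz(multiplier, offset=0, bclk_mhz=BCLK_MHZ):
    """Core frequency after applying a (negative) AVX offset in multiplier steps."""
    return (multiplier + offset) * bclk_mhz


# Offsets suggested above: -3 for AVX, -5 for AVX512.
for label, offset in [("non-AVX", 0), ("AVX", -3), ("AVX512", -5)]:
    print(f"{label:8s} {effective_mhz(ALL_CORE_MULTIPLIER, offset)} MHz")
```

Under those assumptions the suggested offsets would put AVX at 3800 MHz and AVX512 at 3600 MHz, versus the 4100 MHz it appears to be running now.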

If my suspicions are true, your chip is already running overclocked for AVX(2) and AVX512. It just hasn't been crashing for AVX2.

Yes, increasing the voltages will also work. But of course that counts as overclocking. Normally, I don't try to overclock until I get the system stable at stock.


Quote:
Here are the results of running AIDA64 with AVX512 tests enabled: https://imgur.com/S5YET1v

I see that a few cores are getting up to 78C at some point during the test, which is frustrating. I think it's largely because the massive Noctua NH-D15S has a very slight unevenness in the copper face of the heatsink, even after I lapped it with various grits of sandpaper to get it as flat as possible (lapping was very awkward because of the heatsink's size). If you guys think I need to drop all cores below 78C, then I could purchase a new Noctua NH-D15S, which hopefully has a more even surface, and also apply some of that sexy liquid metal thermal compound that I see everybody raving about online. Both of those are probably good for another 2-3C drop in temps.
These chips are going to be very hard to cool with an air cooler. If you do plan to overclock it, you're gonna need at least a 280/360 AIO or full custom water.


EDIT:

Also... before you do anything. How old is the BIOS? Your chip is the refresh line, not the original run. If your BIOS is older than late last year, it might not be properly updated for your chip.

Last fiddled with by Mysticial on 2019-02-05 at 22:13
Old 2019-02-05, 22:07   #263
simon389
 
Aug 2013

3·29 Posts

Quote:
Originally Posted by GP2
But run-of-the-mill hardware errors are caught by Gerbicz error checking.

This is the only system we know of that reports erroneous final PRP residues to Primenet.
If George wants me to ship the system to him, I can.
Old 2019-02-05, 22:14   #264
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

22401₈ Posts

Quote:
Originally Posted by simon389
If George wants me to ship the system to him, I can.
Please don't, yet.

We're currently having a "discussion" as to how to optimally proceed. Some call it an argument.

Some feel very comfortable in an argument. Others, not so much....
All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.