[QUOTE=simon389;507726]Update with my situation: I installed an 850W Seagate PSU and instead of trying 29.5b9 I ran AIDA64 stress test with FPU option checked to see if it was stable.
The system is not stable in AIDA64, even with the 850W PSU. It frequently crashes AIDA64's stress test, citing "Error: hardware error". So it appears the problem is on my end and is not the result of Prime95's AVX512 optimizations. I do find it interesting that all 4 of my Skylake-X 9800X CPUs are stable in 29.4 or in stress tests without AVX512 optimizations, but the moment I include AVX512 in AIDA the system breaks down. I will have to do some digging about this situation.[/QUOTE] What's the AVX512 offset set to in the BIOS? If it says "auto", what frequencies does CPU-Z show for the CPU when you run AVX512 on all cores? |
[QUOTE=simon389;507726]The system is not stable in AIDA64, even with the 850W PSU. It frequently crashes AIDA64's stress test, citing "Error: hardware error". So it appears the problem is on my end and is not the result of Prime95's AVX512 optimizations.
I do find it interesting that all 4 of my Skylake-X 9800X CPUs are stable in 29.4 or in stress tests without AVX512 optimizations, but the moment I include AVX512 in AIDA the system breaks down. I will have to do some digging about this situation.[/QUOTE] In its current flawed state, the system is an invaluable source of reliably bad PRP results that test and (so far) overcome the program's defenses against such an outcome. Actually fixing it would destroy its usefulness for this purpose. So, how would you feel about shipping the whole box, as-is, to George for him to experiment with? How much out-of-pocket expense would this represent for you? I am semi-seriously proposing that this forum might take up a collection to make this happen. |
[QUOTE=GP2;507728]I am semi-seriously proposing that this forum might take up a collection to make this happen.[/QUOTE]
I would suggest a slightly different "angle of attack". Perhaps simon389 could make the machine available to George by way of remote access. There is a lot to be said for testing rigorously a machine which is exhibiting an issue which is not yet fully understood, while hoping to change as few variables as possible. |
[QUOTE=GP2;507728]In its current flawed state, the system is an invaluable source of reliably bad PRP results that test and (so far) overcome the program's defenses against such an outcome.
Actually fixing it would destroy its usefulness for this purpose. So, how would you feel about shipping the whole box, as-is, to George for him to experiment with? How much out-of-pocket expense would this represent for you? I am semi-seriously proposing that this forum might take up a collection to make this happen.[/QUOTE] I don't think it's necessary. The problem is well known. I've seen it enough myself, both on my own hardware and on OC forums, that I usually feel confident calling it out. So if you're an experienced overclocker and you understand the problem, it's pretty easy to trigger it artificially on any hardware. (I've done this myself for the very purpose of testing software error-detection.) The cause is that AVX512 is running at too high a frequency. If it's way too high, the system freezes or BSODs. If it's only slightly too high, you get soft errors. The reason it's too high is that the BIOS is borked. When Skylake-X first launched, neither the engineers nor support even knew what it was. And there was no software at the time to test AVX512. So almost every single mobo ended up shipping with a broken BIOS that would fail on AVX512. The fix is either to update to a fixed BIOS (if available) or to manually override the offsets. Likewise, to artificially trigger it, pick an offset that puts the core right on that edge of stability. If the chip overheats, you may need to drop all the frequencies and pull back the voltage as well. The goal is to get at least one core on that edge without any core going too far unstable. SIMD instability tends to cause soft errors instead of crashes, since SIMD rarely affects control flow and is usually on a different power domain in the CPU core. So it's much easier to trigger soft SIMD errors than soft scalar execution errors. |
I'm running this computer with an EVGA X299 2 motherboard using the latest BIOS. There is no "AVX512 offset" setting in the BIOS that I can find, although it seems like people have different names for different things sometimes.
Here's CPU-Z running with Prime95 29.5b9: [url]https://imgur.com/pv5pf59[/url] The 9800X doing an LL double-check with AVX512 seems to be oscillating between 4096 and 4115 MHz. I would be sad to have to underclock my CPU to get it stable at AVX512, which sort of defeats the purpose of the boost it gives to my iterations per second. Although maybe the voltage simply needs to be bumped up. Here's the results of running AIDA64 with AVX512 tests enabled: [url]https://imgur.com/S5YET1v[/url] I see that a few cores are getting up to 78C at some point during the test, which is frustrating and I think largely due to the fact that the massive Noctua NH-D15S has a very slight unevenness to the copper face of the heatsink, even with me lapping it with various grits of sandpaper to get it as flat as possible (and lapping was very awkward because of the size of the massive heatsink). If you guys think I need to drop all cores below 78C then I could purchase a new Noctua NH-D15S and hopefully it has a more even surface, and then also apply some of that sexy liquid metal thermal compound that I see everybody raving about online. Both of those are probably good for another 2-3C drop in temps. |
[QUOTE=simon389;507739]I see that a few cores are getting up to 78C...[/QUOTE]
Personally, I make my morning coffee on my CPUs. Everything you do is OK, so long as you understand what you are doing.... |
[QUOTE=Mysticial;507731]I don't think it's necessary. The problem is well known. I've seen it enough myself both on my own hardware and on OC forums that I usually feel confident calling it out.[/QUOTE]
But run-of-the-mill hardware errors are caught by Gerbicz error checking. This is the only system we know of that reports erroneous final PRP residues to Primenet. |
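For readers unfamiliar with the Gerbicz error checking mentioned above, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not Prime95's actual code: the function name `gerbicz_prp`, the `corrupt_at` parameter, and the block size are hypothetical, and for simplicity the check runs at every block boundary, whereas a real implementation would verify far less often to keep the overhead small. The check relies on the identity d_new = u_0 * d_prev^(2^L) mod n, where d is the running product of every L-th residue in the squaring chain.

```python
def gerbicz_prp(n, total_iters, block, base=3, corrupt_at=None):
    """Compute base^(2^total_iters) mod n, verifying each block of squarings.

    Hypothetical sketch: corrupt_at simulates a soft hardware error by
    flipping the low bit of the residue at that iteration.
    """
    u = base % n   # u_0, the start of the squaring chain
    d = u          # running product u_0 * u_L * u_2L * ...
    for i in range(1, total_iters + 1):
        u = u * u % n
        if i == corrupt_at:
            u ^= 1  # simulated soft error
        if i % block == 0:
            d_prev = d
            d = d_prev * u % n
            # Independent route to the same product:
            # squaring d_prev 'block' times shifts every factor one block forward,
            # and multiplying by u_0 restores the missing first factor.
            expected = pow(d_prev, 1 << block, n) * base % n
            if d != expected:
                raise ArithmeticError(f"Gerbicz check failed at iteration {i}")
    return u


if __name__ == "__main__":
    N = 2**61 - 1  # example modulus chosen for illustration only
    print("clean run residue:", gerbicz_prp(N, 64, 8))
    try:
        gerbicz_prp(N, 64, 8, corrupt_at=13)
    except ArithmeticError as exc:
        print("corrupted run:", exc)
```

The sketch only shows how an error inside the squaring chain is caught at the next block boundary. It says nothing about how the system discussed in this thread manages to report bad final residues.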
[QUOTE=GP2;507742]This is the only system we know of that reports erroneous final PRP residues to Primenet.[/QUOTE]
But, as Mysticial has suggested, the reason for this is not a mystery.... |
[QUOTE=simon389;507739]I'm running this computer with an EVGA X299 2 motherboard using the latest BIOS. There is no "AVX512 offset" setting in the BIOS that I can find, although it seems like people have different names for different things sometimes.
Here's CPU-Z running with Prime95 29.5b9: [url]https://imgur.com/pv5pf59[/url] The 9800X doing an LL double-check with AVX512 seems to be oscillating between 4096 and 4115 MHz. I would be sad to have to underclock my CPU to get it stable at AVX512, which sort of defeats the purpose of the boost it gives to my iterations per second. Although maybe the voltage simply needs to be bumped up.[/QUOTE] 4.1 GHz all-core AVX512 is way too high for any chip that isn't overclocked. I can't tell for sure, but it looks like the BIOS is not applying any AVX512 offset. What's the all-core frequency for non-AVX? Is it the same? If it is, that confirms my suspicion that the BIOS isn't applying the offsets. I looked up your motherboard and the option does exist. So you might have to find it. When you do, I recommend setting the AVX offset to -3 and the AVX512 offset to -5. That will get you closer to the "true" stock settings. If my suspicions are true, your chip is already running overclocked for AVX(2) and AVX512. It just hasn't been crashing for AVX2. Yes, increasing the voltages will also work. But of course that counts as overclocking. Normally, I don't try to overclock until I get the system stable at stock. [QUOTE]Here's the results of running AIDA64 with AVX512 tests enabled: [url]https://imgur.com/S5YET1v[/url] I see that a few cores are getting up to 78C at some point during the test, which is frustrating and I think largely due to the fact that the massive Noctua NH-D15S has a very slight unevenness to the copper face of the heatsink, even with me lapping it with various grits of sandpaper to get it as flat as possible (and lapping was very awkward because of the size of the massive heatsink). If you guys think I need to drop all cores below 78C then I could purchase a new Noctua NH-D15S and hopefully it has a more even surface, and then also apply some of that sexy liquid metal thermal compound that I see everybody raving about online.
Both of those are probably good for another 2-3C drop in temps.[/QUOTE] These chips are going to be very hard to cool with an air cooler. If you do plan to overclock it, you're gonna need at least a 280/360 AIO or full custom water. EDIT: Also... before you do anything. How old is the BIOS? Your chip is the refresh line, not the original run. If your BIOS is older than late last year, it might not be properly updated for your chip. |
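To make the offsets recommended above concrete, here is a small Python sketch of the arithmetic. It assumes a 100 MHz base clock and treats each offset step as one multiplier bin. The ratio of 41 is inferred from the roughly 4.1 GHz CPU-Z reading reported in this thread, and the helper name `effective_mhz` is hypothetical. Some BIOSes display the offset as a positive number that gets subtracted. The sketch follows the negative notation used in the post.

```python
BCLK_MHZ = 100       # assumed base clock; check the actual value in the BIOS or CPU-Z
ALL_CORE_RATIO = 41  # inferred from the ~4.1 GHz all-core reading in this thread


def effective_mhz(ratio, offset=0, bclk_mhz=BCLK_MHZ):
    """Core frequency in MHz after applying a (negative) multiplier offset."""
    return (ratio + offset) * bclk_mhz


if __name__ == "__main__":
    for label, offset in [("non-AVX", 0), ("AVX2", -3), ("AVX512", -5)]:
        mhz = effective_mhz(ALL_CORE_RATIO, offset)
        print(f"{label:8s} offset {offset:+d} -> {mhz} MHz")
```

Under these assumptions, an AVX offset of -3 gives 3800 MHz and an AVX512 offset of -5 gives 3600 MHz, compared with 4100 MHz when no offset is applied.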
[QUOTE=GP2;507742]But run-of-the-mill hardware errors are caught by Gerbicz error checking.
This is the only system we know of that reports erroneous final PRP residues to Primenet.[/QUOTE] If George wants me to ship the system to him, I can. |
[QUOTE=simon389;507745]If George wants me to ship the system to him, I can.[/QUOTE]
Please don't, yet. We're currently having a "discussion" as to how to optimally proceed. Some call it an argument. Some feel very comfortable in an argument. Others, not so much.... |