Zeroed results on Skylake
Hello @all,
I have assembled a new machine as a replacement for my old primary one that died on me -- and it has been driving me crazy ever since because I cannot get it stable. And I don't mean OC-stable, because I am not trying to overclock the CPU at all. I need this machine to be rock-solid and reliable. The problem I am fighting at the moment is that I keep running into these zeroed results / errors and I cannot seem to fix or locate the actual problem:
[CODE]
[Sat Jul 30 08:54:06 2016]
Self-test 512K passed!
Self-test 20K passed!
Self-test 576K passed!
Self-test 864K passed!
Self-test 768K passed!
Self-test 576K passed!
Self-test 672K passed!
Self-test 28K passed!
Self-test 12K passed!
FATAL ERROR: Final result was 00000000, expected: 7CD3183C.
Hardware failure detected, consult stress.txt file.
[Sat Jul 30 08:59:14 2016]
[/CODE]
[CODE]
[Aug 1 11:13] Test 1, 21000 Lucas-Lehmer iterations of M12969343 using FMA3 FFT length 800K, Pass1=320, Pass2=2560.
[Aug 1 11:16] FATAL ERROR: Final result was 00000000, expected: 477EEFE4.
[Aug 1 11:16] Hardware failure detected, consult stress.txt file.
[Aug 1 11:16] Torture Test completed 966 tests in 39 hours, 7 minutes - 1 errors, 0 warnings.
[Aug 1 11:16] Worker stopped.
[/CODE]
[CODE]
[Aug 3 17:10] Test 1, 4000000 Lucas-Lehmer iterations of M138527 using FMA3 FFT length 8K, Pass1=128, Pass2=64.
[Aug 3 17:15] FATAL ERROR: Final result was 00000000, expected: D2E65D57.
[Aug 3 17:15] Hardware failure detected, consult stress.txt file.
[Aug 3 17:15] Torture Test completed 796 tests in 49 hours, 8 minutes - 1 errors, 0 warnings.
[Aug 3 17:15] Worker stopped.
[/CODE]
The system is a 6700k on an ASUS Z170-Deluxe w/ EFI firmware 1902 and MC 74. Memory is G.SKILL Trident Z F4-3200C14D-32GT. I tested with Prime95 28.9 blend (win64, as my main Linux install is still on hold until the machine is stable). I suspected it had something to do with running XMP (even though lowered to 3000), but even now with everything at stock, the error still occurs.
The CPU was never OC'd in the sense that it was always running at stock frequencies (but apparently not voltages, due to XMP). Right now I am down to stock everything (including memory) but after 49 hours, the above failure happened again. I was intending to run it for 72h to be sure... but well. The temperatures are absolutely fine, by the way. I am kind of out of ideas and before I start chasing ghosts, I would kindly ask for any advice / suggestions from the forum. I am a little bit baffled that it is always zeroed out results, never any rounding errors. Does this point to something specific? I was trying to find more information on the specific errors that could happen during a Prime95 stress-test but I came up empty handed. :( I would really appreciate any help, suggestions, opinions and/or hints... Thanks so much, Matthias |
Have you run memtest?
|
[QUOTE=BinaryKhaos;439277]
I am kind of out of ideas and before I start chasing ghosts, I would kindly ask for any advice / suggestions from the forum. I am a little bit baffled that it is always zeroed out results, never any rounding errors. Does this point to something specific?[/QUOTE] Unfortunately, a zeroed residue, a mismatched residue, a roundoff error, a crash all mean the same thing: something went wrong. My guess (emphasis on guess) would be memory is the culprit. Memtest is a good idea, but even if you pass a multi-threaded memtest that does not eliminate memory as a suspect. You could try running memory below its rated speed or overvolt the memory a smidge to see if that helps. You could also try overvolting the CPU. Best of luck. These rare errors are painful to track down. |
[QUOTE=BinaryKhaos;439277] The system is a 6700k on an ASUS Z170-Deluxe w/ EFI firmware 1902 and MC 74. Memory is G.SKILL Trident Z F4-3200C14D-32GT [/QUOTE]
I would try running the memory at no more than 2400, as the type number you quote doesn't appear to be on ASUS's qualified vendors list for that motherboard. |
Hello @all...
[QUOTE=Mark Rose;439281]Have you run memtest?[/QUOTE] Yes, MemTest86 v7 Free w/ multithreading enabled -- but even the latest 13-hour run did not produce any errors. Unfortunately I cannot let it run for 48 or 72 hours... since I do need the machine regardless. :( [QUOTE=Prime95;439283]Unfortunately, a zeroed residue, a mismatched residue, a roundoff error, a crash all mean the same thing: something went wrong.[/QUOTE] Yeah, I expected that much unfortunately. [QUOTE=Prime95;439283]You could try running memory below its rated speed or overvolt the memory a smidge to see if that helps. You could also try overvolting the CPU.[/QUOTE] The RAM is running at DDR4-2133 w/ 1.20 V. Like I said, everything is down to stock. But the error still occurs. The last Prime95 run (see above) was with stock settings and it took 49 hours for the error to show up. If I could at least find a reliable way to trigger this more easily, that would make locating and diagnosing it a lot more feasible. The way it is now, I feel like my hands are tied behind my back and there is not much I can actually do except for swapping components at random. The question remains though: Could this be a bug in Prime95 that is just very hard to trigger and I am barking up the wrong tree? Raising voltages when everything is at stock feels wrong to me and shouldn't be necessary, imho. [QUOTE=Prime95;439283]Best of luck. These rare errors are painful to track down.[/QUOTE] Thank you very much. :-( [QUOTE=Antonio;439285]I would try running the memory at no more than 2400, as the type number you quote doesn't appear to be on ASUS's qualified vendors list for that motherboard.[/QUOTE] G.SKILL actually uses the ASUS Z170-Deluxe for their own memory testing. And for that particular memory, the board is naturally also listed in their QVL. Nevertheless, the last Prime95 run (see above) was done at DDR4-2133 with all stock settings and the error still showed up after 49 hours.
If anyone has any more suggestions, I would very gladly hear and appreciate them. :-) Thanks, Matthias |
[QUOTE=BinaryKhaos;439304]Could this be a bug in Prime95 that is just very hard to trigger and I am barking up the wrong tree?[/QUOTE]I think it is time for you to find a different tree.
Edit: to clarify, I doubt you have found a bug in P95. Your time would be more productive spent towards fixing your computer. |
Hello..
[QUOTE=retina;439305]I doubt you have found a bug in P95. Your time would be more productive spent towards fixing your computer.[/QUOTE] Even though I do agree, as a software engineer myself, I know bugs do creep in and sometimes are very sporadic and circumstantial. And besides, I guess no one believed Skylake to be faulty when the Prime95 failures started. So maybe this is another corner-case Skylake bug. But again, I do agree, everything else is much more likely and I am grasping at straws here. The thing is, if I swap CPU, board and memory... but end up in the same situation afterwards, I will clearly be devastated. Yes, then the PSU is the last thing left. But that could easily go in circles. Oh well... Reminds me: Someone said to me that if I run the test long enough, I am bound to get a bit flip. And I do agree, naturally. But 19 to 49 hours, reproducible with some time variation, shouldn't be it, should it? Thanks, Matthias |
Can you reproduce the bug reliably if rerunning the same work?
|
Hi...
[QUOTE=Mark Rose;439309]Can you reproduce the bug reliably if rerunning the same work?[/QUOTE] I am not too well versed in Prime95. I use blend, set the amount of memory to as much as I can spare and let it run -- basically. Thus far, the failures that I have properly logged have been at FFT lengths 8K and 800K. For the others, I don't know exactly where they failed, since results.txt isn't too informative and I missed copying the worker window log. :( What would you suggest? Thanks, Matthias |
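Since the failing FFT length only shows up in the worker window, one option is to save that output to a text file and pull the failures out afterwards. Below is a minimal sketch in Python, assuming the saved log uses the same line format as the [CODE] excerpts in the first post and contains the output of a single worker (with several workers interleaved, the "last test seen" may belong to a different worker). The function name `find_failures` and the file name in the usage note are made up for illustration; nothing here is part of Prime95 itself.

```python
import re

# Matches worker-window lines such as:
# [Aug 1 11:13] Test 1, 21000 Lucas-Lehmer iterations of M12969343
#   using FMA3 FFT length 800K, Pass1=320, Pass2=2560.
TEST_RE = re.compile(r"iterations of M(\d+) using (\S+) FFT length (\d+K)")

# Matches the failure line:
# [Aug 1 11:16] FATAL ERROR: Final result was 00000000, expected: 477EEFE4.
ERROR_RE = re.compile(
    r"FATAL ERROR: Final result was ([0-9A-Fa-f]{8}), expected: ([0-9A-Fa-f]{8})"
)


def find_failures(lines):
    """Pair each FATAL ERROR line with the most recent test line before it.

    Assumption: the log comes from one worker, so the last test line seen
    is the test that failed.
    """
    failures = []
    last_test = None
    for line in lines:
        test = TEST_RE.search(line)
        if test:
            last_test = {
                "exponent": int(test.group(1)),
                "fft_type": test.group(2),
                "fft_length": test.group(3),
            }
            continue
        error = ERROR_RE.search(line)
        if error:
            failures.append(
                {
                    "got": error.group(1),
                    "expected": error.group(2),
                    "test": last_test,  # None if no test line was logged
                }
            )
    return failures
```

To use it, read the saved log (for example a hypothetical `worker.log`) with `open("worker.log").readlines()`, pass the lines to `find_failures`, and look at the `fft_length` of each entry. If the same few FFT lengths keep turning up, that gives a smaller range to rerun instead of a full blend.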
[QUOTE=BinaryKhaos;439308]So maybe this is another corner-case Skylake bug.[/QUOTE]Perhaps, but probably not. More likely just another bad mobo, or socket connection, or some RAM degradation, or voltage instability, etc., giving you random errors. I've had bad RAM sticks give me occasional errors (once in a few weeks), it happens. It also takes a lot of patience and time to finally narrow down the culprit. Sometimes we can never find the culprit. Then life is sad and we demote the machine to other less exacting work.
|
Hello...
[QUOTE=retina;439311]I've had bad RAM sticks give me occasional errors (once in a few weeks), it happens.[/QUOTE] How did you notice that? Unless you had ECC RAM, naturally. [QUOTE=retina;439311]Sometimes we can never find the culprit. Then life is sad and we demote the machine to other less exacting work.[/QUOTE] Currently I am contemplating swapping CPU, RAM and board at the same time. I am simply not sure yet if it is the right thing to do... or if I am still missing something important and will end up in the same situation afterwards. Reminds me: Shouldn't all CPUs pass Prime95 stress tests at stock voltages? I read that some people needed to raise their voltages to pass Prime95 even without any OC at all. Is that common? Thanks, Matthias |