mersenneforum.org  

Old 2016-01-16, 13:09   #1
ATH
Einyen
 
Dec 2003
Denmark

7^2×59 Posts
Weird hardware error

During verification of the new prime (no spoilers), Prime95 on my Haswell-E 5960X suddenly stopped matching the interim residues from the others' runs, without showing any errors. I had OutputRoundoff=1 in prime.txt, and the maximum roundoff was around 0.098. This happened twice with the first FFT size and once with a larger FFT size.
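For readers unfamiliar with interim residues, here is a minimal sketch of the idea, assuming a plain big-integer Lucas-Lehmer loop rather than anything Prime95 actually does internally. The function names and the `flip_at` fault-injection parameter are hypothetical and exist only for illustration. Two runs report the low 64 bits of the running value at agreed iterations, and the first checkpoint where they disagree brackets the point where one run went wrong.

```python
def ll_interim_residues(p, checkpoints, flip_at=None):
    """Lucas-Lehmer test for M_p = 2**p - 1, recording interim residues.

    Returns (is_prime, residues), where residues maps an iteration number
    to the low 64 bits of the running value as 16 hex digits.
    flip_at is a hypothetical knob: it flips one bit after that iteration
    to mimic a silent hardware error that raises no warning.
    """
    m = (1 << p) - 1
    s = 4
    residues = {}
    for i in range(1, p - 1):          # p - 2 squarings in total
        s = (s * s - 2) % m
        if i == flip_at:
            s ^= 1 << 5                # injected single-bit fault
        if i in checkpoints:
            residues[i] = format(s & 0xFFFFFFFFFFFFFFFF, "016X")
    return s == 0, residues


def first_mismatch(res_a, res_b):
    """Return the earliest shared checkpoint where two runs disagree, or None."""
    for i in sorted(set(res_a) & set(res_b)):
        if res_a[i] != res_b[i]:
            return i
    return None


checkpoints = {25, 50, 75, 100}
good_prime, good = ll_interim_residues(127, checkpoints)
_, bad = ll_interim_residues(127, checkpoints, flip_at=60)
print(good_prime, first_mismatch(good, bad))
```

In this toy run the fault is injected at iteration 60, so the checkpoints at 25 and 50 still agree and the mismatch first shows up at 75. Once a run has diverged it stays diverged, which is why comparing interim residues localizes an error long before the final result.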

Then I added ErrorCheck=1 and SumInputsErrorCheck=1 to prime.txt and started a new run with the larger FFT size. The residues matched up to iteration 23M, then I suddenly got four roundoff errors > 0.4 (I think the max was 0.5), and it said "confidence in final result is low". The interim 32M residue still matched the others, though. Very weird.

I started a new run with a much larger FFT size, still with ErrorCheck=1 and SumInputsErrorCheck=1, and got roundoff > 0.4 a few times even before iteration 1M, but the 1M residue still matched the others. At this point I gave up and just finished my CUDALucas run.
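As background on what a roundoff figure like 0.098 or 0.5 measures, here is a rough sketch, assuming a textbook recursive FFT and unsigned 8-bit digits; Prime95's real transform is different and far more optimized, and the helper names below are made up for this example. After the inverse transform every coefficient should be an integer, so the reported roundoff is the largest distance from any coefficient to its nearest integer. Near 0.5 the rounding direction becomes a coin toss, which is why values above 0.4 are flagged.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def to_digits(x, base_bits):
    """Split a non-negative integer into little-endian digits of base_bits bits."""
    mask = (1 << base_bits) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= base_bits
    return digits or [0]


def square_with_roundoff(digits, base_bits):
    """Square a number via FFT; return (square, max roundoff error)."""
    n = 1
    while n < 2 * len(digits):
        n *= 2
    spectrum = fft([complex(d) for d in digits] + [0j] * (n - len(digits)))
    conv = fft([v * v for v in spectrum], invert=True)
    total, max_err = 0, 0.0
    for i, c in enumerate(conv):
        value = c.real / n                  # should be (nearly) an integer
        nearest = round(value)
        max_err = max(max_err, abs(value - nearest))
        total += nearest << (base_bits * i)  # carries are absorbed by the sum
    return total, max_err


x = 3 ** 300
square, err = square_with_roundoff(to_digits(x, 8), 8)
print(square == x * x, err)
```

Packing fewer bits into each digit, which for a fixed number means a longer transform, keeps the coefficients small and the roundoff low. That is the usual reason for moving to a larger FFT size when roundoff warnings appear, and it is why errors above 0.4 at a much larger FFT size point to hardware rather than to the arithmetic.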

I got this computer at the beginning of November and it has completed 30+ successful double checks, so I assumed one or more of my RAM sticks was FUBAR. I am running the XMP profile at 15-16-16-39 and 3000 MHz, and the processor is overclocked to 3500 MHz.

I rebooted from a USB stick with Memtest86, which tests all 32 GB of RAM, and it ran for ~36 hours without any errors, most of the time on all 8 cores. I then booted back into Windows and ran ~45 hours of the Prime95 stress test without any errors! Now I'm running the verification again, and the residues match so far.

Can this be some error at boot-up that goes away again after a restart? I did restart my computer just before starting the initial verification run, and did not restart it again until the Memtest86 run.
Any other ideas? I will do some more double checks soon and see what happens.

Last fiddled with by ATH on 2016-01-16 at 13:19
Old 2016-01-16, 14:03   #2
Brain
 
Dec 2009
Peine, Germany

331 Posts
Power Supply or RAM

If it has always been stable before and now fails without any change, I would suspect the power supply. I had a stable system which suddenly started showing sporadic errors in Prime95. This became worse and worse over the months and ended in several blue screens, even without load. Finally, I swapped my power supply and all was good again. I can't say whether there were residue errors without Prime95 showing an error, but I guess not all calculation errors lead to an error message (as you noticed).
The same applies to CUDALucas: before downclocking my Titan's memory from 3000 MHz to 2600 MHz, I never had a roundoff error message but still ended up with wrong residues.
Old 2016-01-16, 15:40   #3
ATH
Einyen
 
Dec 2003
Denmark

101101001011_2 Posts

Yeah, you might be right. It is just strange that all the testing did not give a single error, and now it seems fine again. I'll have to keep watching it and do some more double checks.
Old 2016-01-16, 15:46   #4
TheJudger
 
"Oliver"
Mar 2005
Germany

3^3×41 Posts

Link training usually gives different results (settings) on each (re-)boot for PCIe, DDRx, QPI, whatever links...

Oliver

P.S. I've seen Xeons with DDR3 registered ECC memory throwing hundreds of correctable ECC errors per second; after the next reboot that number went down to maybe one error per hour, and on the boot after that the system refused to boot. This is of course an extreme example, not the general case.

Last fiddled with by TheJudger on 2016-01-16 at 15:47
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.