mersenneforum.org  

2015-09-01, 21:44   #1
CuriousKit
 
"J. Gareth Moreton"
Feb 2015
Nomadic

1011010₂ Posts
Catastrophic hardware failure

I'm sure this is everyone's worst nightmare, but here goes...

This morning, I arrived at my workplace to find my workstation had shut down. I didn't think much of that, apart from the minor inconvenience of having lost some prime number search time, but then I discovered to my horror that it wouldn't boot: the keyboard didn't activate, the monitors didn't get a signal, and after about a minute, one of the cooling fans started to sound like a jet engine. Initially I thought that Prime95 had somehow caused the CPU to burn out, but after some diagnostics with the in-house support team, we found that the PSU had failed and that my running a prime number checker was just a coincidence. Everything else in the computer still works, but they can't simply replace the bust PSU due to 'warranty'.

In the meantime I have been given a temporary replacement machine (annoyingly, less powerful than my original workstation). I did tell them that I may want to read the hard drive of my old computer to recover the progress made by Prime95, since one of the tests was a 100-million-digit test that had been running for over 100 days, although I'm not sure I'll be able to get access to that hard drive again. If worst comes to the worst, would those tests have to be started again from scratch, or can they be partly recovered? (I don't know if partial residues are ever sent to the server.)
Old 2015-09-01, 22:11   #2
UBR47K
 
 
Aug 2015

59 Posts

Partial residues are never sent to the server. You'll need to start them from scratch in the worst case scenario.
2015-09-02, 00:52   #3
ewmayer
2ω=0
 
 
Sep 2002
República de California

2·5,639 Posts

For very long runs like that, I suggest making a habit of copying one of the redundant residue files every 10M iterations or so (I like to append the approximate iteration count in millions to the filename, e.g. [save].130M, to make it unique) and offloading it somewhere else. Live and learn.

Good luck with the data recovery, in any event!
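A minimal sketch of that habit in Python, assuming you supply the save file and an off-machine destination as plain paths yourself (the names `save` and `offsite` below are hypothetical placeholders, not Prime95's actual file names) and that you know roughly how many iterations have been completed:

```python
import shutil
import tempfile
from pathlib import Path


def archive_checkpoint(save_file: Path, dest_dir: Path, iteration: int) -> Path:
    """Copy a save file into dest_dir, appending the approximate iteration
    count in millions so each copy gets a unique name (e.g. "save.130M")."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    tag = f"{iteration // 1_000_000}M"  # round down to whole millions
    target = dest_dir / f"{save_file.name}.{tag}"
    shutil.copy2(save_file, target)  # copy (not move) and keep timestamps
    return target


if __name__ == "__main__":
    # Demo with a throwaway file; "save" stands in for the real save file.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "save"
        src.write_bytes(b"residue data")
        copy = archive_checkpoint(src, Path(tmp) / "offsite", 130_400_000)
        print(copy.name)  # save.130M
```

Copying rather than moving leaves the live file where the program expects it; the destination would ideally be a network share or another disk, so a single PSU or drive failure can't take out both copies.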
2015-09-02, 13:04   #4
CuriousKit
 
"J. Gareth Moreton"
Feb 2015
Nomadic

2×3²×5 Posts

And I just realised that partial residues would be p bits long anyway (residues modulo 2^p - 1), much too large to send to a server on a periodic basis. I'll see if I can recover the partially completed work.

Oh well, you live and learn!
2015-09-02, 14:22   #5
ATH
Einyen
 
 
Dec 2003
Denmark

B22₁₆ Posts

The line "InterimFiles=10000000" in prime.txt will save a full backup file every 10M iterations, which for a 332M+ exponent will probably be >40 MB each?
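A rough check of that estimate: a residue modulo 2^p - 1 needs p bits, so each saved file is at least p/8 bytes, and a test of roughly p iterations with one file kept every 10M iterations leaves about p/10M files. The sketch assumes p = 332,200,000, a round illustrative figure just above the roughly 332.19 million bits a 100-million-digit number needs, and it ignores any header or format overhead in real save files, so treat the numbers as lower bounds:

```python
def residue_bytes(p: int) -> int:
    """Minimum bytes to store a p-bit residue modulo 2**p - 1 (rounded up)."""
    return (p + 7) // 8


def interim_file_count(p: int, interval: int) -> int:
    """Approximate number of interim files over a test of ~p iterations
    when one file is kept every `interval` iterations."""
    return p // interval


p = 332_200_000        # illustrative exponent, not a specific assignment
interval = 10_000_000  # corresponds to InterimFiles=10000000
size = residue_bytes(p)
count = interim_file_count(p, interval)
print(f"{size / 1e6:.1f} MB per file")                         # 41.5 MB per file
print(f"{count} files, {count * size / 1e6:.1f} MB in total")  # 33 files, 1370.3 MB in total
```

So ">40 MB" per file is about right, and keeping every interim file for the whole run adds up to well over a gigabyte.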
2015-09-02, 14:34   #6
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2×7×613 Posts

@OP:

If you can't get access to the HDD, you will have to do the tests again from scratch.

OTOH, P95 and the PSU failure may not be coincidental. If the PSU was already near its limit (as in "a 500W" or "a 750W" PSU, depending on the other HW you had in the box), then the additional, continuous stress P95 puts on it can blow it (as opposed to "normal work" such as Word, Excel, compiling, etc., which can still draw more power occasionally, but not continuously for long periods, so the MOSFETs in the PSU have some time to "cool down").

2015-09-02, 18:58   #7
CuriousKit
 
"J. Gareth Moreton"
Feb 2015
Nomadic

2·3²·5 Posts

Hmmm, that's a good point. The PSU was only 240W (the machine normally doesn't need much power; it doesn't have a dedicated graphics card, for example), so that might have pushed it to breaking point. That's something I'd better investigate, actually.

In the meantime, I've got the old computer back with a replaced PSU. I've also made sure that Prime95's progress files and work list are saved to a network store.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.