mersenneforum.org

CuriousKit 2015-09-01 21:44

Catastrophic hardware failure
 
I'm sure this is everyone's worst nightmare, but here goes...

This morning, I arrived at my workplace to find my workstation had shut down. I thought nothing much of it, apart from the minor inconvenience of having lost some prime number search time, until I discovered to my horror that it wouldn't boot: the keyboard didn't activate, the monitors didn't get a signal, and after about a minute one of the cooling fans started to sound like a jet engine. Initially I thought that Prime95 had somehow caused the CPU to burn out, but after some diagnostics with the in-house support team, we found that the PSU had failed and the fact that I was running a prime number checker was just a coincidence. Everything else in the computer still works, but they can't simply replace the bust PSU due to 'warranty'.

In the meantime I have been given a temporary replacement machine (annoyingly, less powerful than my original workstation). I did ask to be allowed to read the hard drive of my old computer to recover the progress made by Prime95, since one of the tests was a 100-million-digit test that had been running for over 100 days, although I'm not sure if I'll be able to get access to that hard drive again. If the worst comes to the worst, would those tests have to be started again from scratch, or can they be partly recovered? (I don't know if partial residues are ever sent to the server.)

UBR47K 2015-09-01 22:11

Partial residues are never sent to the server. In the worst-case scenario, you'll need to start them from scratch.

ewmayer 2015-09-02 00:52

For very long runs like that, I suggest making a habit of copying one of the redundant residue files every 10M iterations or so (I like to append the approximate iteration count in M to the filename, e.g. [save].130M, to make it unique) and offloading it somewhere else. Live and learn.
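A minimal sketch of that habit in Python, in case it helps. The function name, the file names and the destination folder are all made up for illustration; the real save file name depends on the program and the exponent, and you would have to read the current iteration count off the program's own output.

```python
import shutil
from pathlib import Path


def archive_checkpoint(save_file, iteration, dest_dir):
    """Copy a save file to dest_dir, tagging the copy with the
    approximate iteration count in millions (e.g. 'save.130M').

    All names here are hypothetical; nothing is read from the
    save file itself, the iteration count is supplied by the caller.
    """
    src = Path(save_file)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # 130_000_000 iterations -> "130M"
    tag = f"{iteration // 1_000_000}M"
    target = dest / f"{src.name}.{tag}"

    # copy2 also preserves the file's timestamps
    shutil.copy2(src, target)
    return target
```

Calling something like `archive_checkpoint("save", 130_000_000, "/mnt/backup")` every 10M iterations or so would leave a trail of tagged copies on another disk, so a dead machine costs at most the work done since the last copy.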

Good luck with the data recovery, in any event!

CuriousKit 2015-09-02 13:04

And I just realised that partial residues would be p bits long anyway (from 2^p - 1), much too long to send to a server on a periodic basis. I'll see if I can recover the partially-completed work.

Oh well, you live and learn!

ATH 2015-09-02 14:22

The line "InterimFiles=10000000" in prime.txt will save a full backup file every 10M iterations, which for a 332M+ exponent will probably be >40 MB?
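A rough back-of-the-envelope check of that size, as a sketch only. It assumes the file is dominated by the raw p-bit residue and ignores any header or padding the program adds, so the real file may be somewhat larger. The function names are made up; 332192831 is used as an example exponent in the 100-million-digit range.

```python
import math


def decimal_digits(p):
    """Number of decimal digits of 2^p - 1."""
    return math.floor(p * math.log10(2)) + 1


def residue_bytes(p):
    """Bytes needed to hold a p-bit residue (raw data only, no file header)."""
    return math.ceil(p / 8)


p = 332192831  # example exponent; 2^p - 1 has just over 100 million digits
print(decimal_digits(p))
print(residue_bytes(p) / 1e6, "MB")
```

Since p bits is p/8 bytes, anything with a 332M+ exponent comes out at roughly 41.5 MB per interim file before overhead, which is in line with the ">40 MB" guess.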

LaurV 2015-09-02 14:34

@OP:

If you can't get access to the HDD, you will have to do the tests again from scratch.

OTOH, P95 and the failure of the PSU may not be a coincidence. If the PSU was somehow at its limit (as in "a 500W" or "a 750W" PSU, depending on the other HW you had in the box), then the additional (and continuous) stress P95 puts on it can blow it (as opposed to "normal work" like Word, Excel, compiling, etc., which can still draw more power occasionally, but not continuously for a long time, so the MOSFETs in the PSU have some time to "cool down").

CuriousKit 2015-09-02 18:58

Hmmm, that's a good point. The PSU was only 240W (it normally doesn't need much power - the computer doesn't have a dedicated graphics card, for example) so that might have pushed it to breaking point - something I'd better investigate, actually.

In the meantime, I've got the old computer back with a replaced PSU. I've also made sure that Prime95's progress files and work list are saved to a network store.


All times are UTC.