mersenneforum.org


fivemack 2015-04-22 22:43

Computer starts cackling twice a week after two years OK
 
The Haswell machine that I bought in June 2013 has crashed (dropped off the network; when you go to look at it, the power light is on but there's nothing displaying on the screen) twice in the last week. It was running linear algebra at the time, but it's been running linear algebra for months and up to now it's been very reliable. Any idea how to investigate?

Mark Rose 2015-04-22 23:42

My first guess is overheating. When was the last time you dusted out the CPU cooler and power supply?

ewmayer 2015-04-23 06:51

[QUOTE=fivemack;400663]The Haswell machine that I bought in June 2013 has crashed (dropped off the network; when you go to look at it, the power light is on but there's nothing displaying on the screen) twice in the last week. It was running linear algebra at the time, but it's been running linear algebra for months and up to now it's been very reliable. Any idea how to investigate?[/QUOTE]

I started getting overheating-related system diagnostics early this year on my similar-vintage Haswell system, which has been crunching 24/7 on a 4-threaded AVX2-math Pépin test of F29. (Yes, I know F29 has a known small factor - by way of working up to F33 I'm generating full Pépin residues for all the Fermats between F24 and F33.) Carefully removing the heatsink - the thermal compound still looked quite good, so I didn't touch it - and vacuuming the accumulated dust off of that and off the MoBo resolved the issue. That may not be what ails your system, but should be the first thing you do.

Note: Properly reseating the heatsink is a bit tricky; those split-end plastic connectors are prone to having their ends bent rather than slipping into the MoBo holes. The same issue arises on a new install, but it's worse when reseating things after dedusting, because the connector ends have been spread apart by the coaxial push-to-lock mechanism, and one may need to physically squeeze them together with pliers to recover something approaching the when-new state.

fivemack 2015-04-23 10:46

I brought the machine in and opened it up; it looks pretty clean inside - dust is basically a product of human activity, and it lives in an outbuilding with a concrete floor which humans go into only when one of the servers crashes.

paulunderwood 2015-04-23 11:59

Assuming dust is not the problem, fire up the BIOS and have a look at temperatures and voltages and fan speeds. If these are all right, run memtest for a while, followed by the mprime/prime95 torture test. You might need to increase the CPU voltage a smidgen. HTH :smile:

ps. Check all plugs and sockets, like the power cables and disk cables and cards, by reseating them.
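
The BIOS only shows temperatures at idle, so it may also help to log them while the linear algebra is actually running; the last line written before a hang then shows how hot things were. Below is a minimal sketch, assuming the machine runs Linux and exposes its sensors under /sys/class/thermal (the thread does not say which OS it uses). The function names and the temps.log file name are made up for illustration.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            # The kernel reports each zone's temperature in millidegrees Celsius.
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone has no readable numeric value; skip it
        readings[os.path.basename(zone)] = millideg / 1000.0
    return readings


def format_line(readings, now=None):
    """Build one log line: a UTC timestamp followed by name=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    fields = [f"{name}={temp:.1f}C" for name, temp in sorted(readings.items())]
    return " ".join([stamp] + fields)


def log_forever(path="temps.log", interval=10.0):
    """Append a reading every `interval` seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            log.write(format_line(read_thermal_zones()) + "\n")
            # Flush and fsync so the latest line is more likely to
            # survive a hard hang or power cycle.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)
```

Calling `log_forever()` in a second terminal (or under nohup) while the job runs leaves a file whose tail can be inspected after the next crash. If the last readings look normal, overheating becomes a less likely culprit and the power supply, memory, or connectors deserve more attention.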

Mark Rose 2015-04-23 14:35

[QUOTE=fivemack;400695]I brought the machine in and opened it up; it looks pretty clean inside - dust is basically a product of human activity, and it lives in an outbuilding with a concrete floor which humans go into only when one of the servers crashes.[/QUOTE]

I'd still blow things out with compressed air for good measure. There may not be much dust, but dust is an excellent thermal insulator and will impede air from cooling heat sinks.

After that, it's anyone's guess what the faulty component is. I would start with the power supply, as in my experience they're the first to go from a power surge, but if you have a spare component you can swap in, start with that as it's free.

pinhodecarlos 2015-04-23 17:10

Only blow things out with compressed air [B]if the air is dry[/B].

aurashift 2015-04-25 04:54

I'm not positive but look for swollen capacitors too. Sometimes the cheap ones fail, or maybe your surge suppressor has exhausted its protection ability, which happens over time.

kladner 2015-05-08 01:37

When it cackles, does it lay eggs, golden or otherwise?

[QUOTE]I'm not positive but look for swollen capacitors too.[/QUOTE]

Absolutely! They are fairly easy to spot, and might even be replaceable, depending on your manual skills.

fivemack 2015-05-08 09:28

The machine has run happily for the last week, sitting in my study inside. I shook it and something which I suspect was an extremely well-dried dead slug fell out of one of the PCI slots, which I'm sure helped.

kladner 2015-05-08 11:51

[QUOTE=fivemack;401965]The machine has run happily for the last week, sitting in my study inside. I shook it and something which I suspect was an extremely well-dried dead slug fell out of one of the PCI slots, which I'm sure helped.[/QUOTE]

The meaning of "bug" expands!


All times are UTC.