mersenneforum.org  

Old 2015-04-22, 22:43   #1
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

2×13²×19 Posts
Computer starts cackling twice a week after two years OK

The Haswell machine that I bought in June 2013 has crashed (dropped off the network; when you go to look at it, the power light is on but there's nothing displaying on the screen) twice in the last week. It was running linear algebra at the time, but it's been running linear algebra for months and up to now it's been very reliable. Any idea how to investigate?
Old 2015-04-22, 23:42   #2
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

2·5·293 Posts

My first guess is overheating. When was the last time you dusted out the CPU cooler and power supply?

Last fiddled with by Mark Rose on 2015-04-22 at 23:42
Old 2015-04-23, 06:51   #3
ewmayer
2ω=0
 
Sep 2002
República de California

5×17×137 Posts

Quote:
Originally Posted by fivemack
The Haswell machine that I bought in June 2013 has crashed (dropped off the network; when you go to look at it, the power light is on but there's nothing displaying on the screen) twice in the last week. It was running linear algebra at the time, but it's been running linear algebra for months and up to now it's been very reliable. Any idea how to investigate?
I started getting overheating-related system diagnostics early this year on my similar-vintage Haswell system, which has been crunching 24/7 on a 4-threaded AVX2-math Pépin test of F29. (Yes, I know F29 has a known small factor - by way of working up to F33 I'm generating full Pépin residues for all the Fermats between F24 and F33.) Carefully removing the heatsink - the thermal compound still looked quite good, so I didn't touch it - and vacuuming the accumulated dust off of that and off the MoBo resolved the issue. That may not be what ails your system, but it should be the first thing you do.

Note: Properly reseating the heatsink is a bit tricky; those split-end plastic connectors are prone to having their ends bent rather than slipping into the MoBo holes. The same issue arises on a new install, but it's worse when reseating things after dedusting, because the connector ends have been spread apart by the coaxial push-to-lock mechanism, and one may need to physically squeeze them together with pliers to recover something approaching the when-new state.
Old 2015-04-23, 10:46   #4
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

2×13²×19 Posts

I brought the machine in and opened it up; it looks pretty clean inside - dust is basically a product of human activity, and it lives in an outbuilding with a concrete floor which humans go into only when one of the servers crashes.
Old 2015-04-23, 11:59   #5
paulunderwood
 
Sep 2002
Database er0rr

111010101011₂ Posts

Assuming dust is not the problem, fire up the BIOS and have a look at temperatures, voltages and fan speeds. If these are all right, run memtest for a while, followed by the mprime/prime95 torture test. You might need to increase the CPU voltage a smidgen. HTH

PS: Check all plugs and sockets, like the power cables and disk cables and cards, by reseating them.
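
The BIOS only shows temperatures while the machine is idle in setup. As a complement, here is a minimal sketch for logging temperatures under load, assuming the machine runs Linux and exposes its sensors through the sysfs thermal zones under /sys/class/thermal (not every board does). The function names and the 85 °C limit are illustrative assumptions, not values from this thread or from any vendor.

```python
from pathlib import Path
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone under base."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            raw = (zone / "temp").read_text().strip()
            temps[zone.name] = int(raw) / 1000.0  # sysfs reports millidegrees Celsius
        except (OSError, ValueError):
            continue  # zone has no readable, numeric temp file: skip it
    return temps


def zones_over(temps, limit_c):
    """Return the sorted names of zones at or above limit_c."""
    return sorted(name for name, t in temps.items() if t >= limit_c)


def log_line(temps, limit_c=85.0):
    """Format one timestamped log line; limit_c is a hypothetical alarm threshold."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    readings = " ".join(f"{name}={t:.1f}C" for name, t in sorted(temps.items()))
    hot = zones_over(temps, limit_c)
    suffix = f" HOT: {','.join(hot)}" if hot else ""
    return f"{stamp} {readings}{suffix}"
```

Calling `log_line(read_zone_temps())` once a minute and appending the result to a file on disk means the last reading before a hang survives the crash, which helps tell a thermal problem from a power or memory one.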

Last fiddled with by paulunderwood on 2015-04-23 at 12:02
Old 2015-04-23, 14:35   #6
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

2·5·293 Posts

Quote:
Originally Posted by fivemack
I brought the machine in and opened it up; it looks pretty clean inside - dust is basically a product of human activity, and it lives in an outbuilding with a concrete floor which humans go into only when one of the servers crashes.
I'd still blow things out with compressed air for good measure. There may not be much dust, but dust is an excellent thermal insulator and will impede air from cooling heat sinks.

After that, it's anyone's guess what the faulty component is. I would start with the power supply, as in my experience they're the first to go from a power surge, but if you have a spare component you can swap in, start with that as it's free.
Old 2015-04-23, 17:10   #7
pinhodecarlos
 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

3×17×97 Posts

Only blow things out with compressed air if the air is dry.
Old 2015-04-25, 04:54   #8
aurashift
 
Jan 2015

11111101₂ Posts

I'm not positive but look for swollen capacitors too. Sometimes the cheap ones fail, or maybe your surge suppressor has exhausted its protection ability, which happens over time.
Old 2015-05-08, 01:37   #9
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

10158₁₀ Posts

When it cackles, does it lay eggs, golden or otherwise?

Quote:
I'm not positive but look for swollen capacitors too.
Absolutely! They are fairly easy to spot, and might even be replaceable, depending on your manual skills.

Last fiddled with by kladner on 2015-05-08 at 01:41
Old 2015-05-08, 09:28   #10
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

2×13²×19 Posts

The machine has run happily for the last week, sitting in my study inside. I shook it and something which I suspect was an extremely well-dried dead slug fell out of one of the PCI slots, which I'm sure helped.
Old 2015-05-08, 11:51   #11
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts

Quote:
Originally Posted by fivemack
The machine has run happily for the last week, sitting in my study inside. I shook it and something which I suspect was an extremely well-dried dead slug fell out of one of the PCI slots, which I'm sure helped.
The meaning of "bug" expands!

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.