mersenneforum.org  

2004-06-28, 13:59   #1
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

10010010001₂ Posts
Mystery Computer Crashes...

A few weeks ago, my AMD MP-2600+ Dual machine (Asus A7M266-D mobo) with Kingston Hyper-X DDR SDRAM started crashing randomly. I am usually in Windows on that machine, and it would crash at random (causing XP Pro to reboot). I hadn't recently installed anything on the machine and I could never reproduce the error on demand; it would sometimes happen a few times in a day, or not at all for a couple of days.

I watched the temperature of the machine closely and both processors always remained under 62C. I ran memtest86, memtest86+, and even the special Prime95 torture test boot disk, and no problems were ever detected. All fans inside the case were running fine and there was lots of airflow. But when I went back into Windows the problems were still happening. I figured it was maybe a bad driver in Windows somewhere causing the problem, so I decided to boot the latest Knoppix on my machine and run mprime to see if there were any problems. I discovered that running 1 instance of mprime was fine, but when I ran two at the same time (one for each CPU) I would see errors. One copy of mprime never made it past the initial LL tests it runs before it starts working on an LL. The other copy made it past but would get sum errors. So it seemed it wasn't related to Windows specifically. Booting back into Windows I found the same thing: I could run one copy of Prime95 without errors, but two copies would cause problems. At least I could reproduce/see problems now with Prime95 (Yeah George!).
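The "one copy is fine, two copies fail" check can be illustrated with a toy sketch. This is not how Prime95/mprime works internally; it is a minimal Python illustration that assumes a plain Lucas-Lehmer test and uses hypothetical helper names (`run_workers`, `results_consistent`). The idea is only that the same deterministic calculation is run on several cores at once and every answer is compared against a known result.

```python
from multiprocessing import Pool


def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime. p is assumed to be an odd prime."""
    m = (1 << p) - 1          # the Mersenne number 2**p - 1
    s = 4
    for _ in range(p - 2):    # p - 2 squaring steps
        s = (s * s - 2) % m
    return s == 0


def results_consistent(results, expected):
    """True only if every worker returned the expected answer."""
    return all(r == expected for r in results)


def run_workers(workers: int, p: int):
    """Hypothetical helper: run the same test in `workers` processes at once."""
    with Pool(processes=workers) as pool:
        return pool.map(lucas_lehmer, [p] * workers)
```

To try it, call something like `results_consistent(run_workers(2, 4423), True)` from under an `if __name__ == "__main__":` guard (4423 is a known Mersenne prime exponent). On healthy hardware every worker should agree; a mismatch that only appears with two workers points at something that only misbehaves under full load.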

Talking with Xyzzy, he said there was a guy with dual AMD CPUs and an Asus board that saw the same problems and it turned out to be a bad CPU. So I spent the whole weekend running tests.

I first pulled out all my PCI cards and disabled the internal audio and booted Knoppix. Still problems with two copies of mprime running so it wasn't any of the PCI cards. I pulled out one CPU and one stick of ram (so I now just had 1 stick of 512MB) and tested that with both mprime torture tests and memtest86, no problems. I swapped sticks of RAM and redid the tests, no problems. Ok, so I figured the RAM was good. I pulled the CPU out, and put the other CPU back in. Redid the tests, no problems. I swapped the RAM stick again, and re-did the tests, still no problems.

I was a little stumped since none of these combinations produced any errors. So I put the second CPU back in and tested with Knoppix and 2x mprime with 1 stick of RAM. No problems. I stuck the second stick of RAM back in so now there were 2 CPUs and both sticks of RAM. Redid the tests again. No problems.
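The swap-and-retest procedure in the two paragraphs above can be written down as a small test plan. This is a rough sketch only: the labels "CPU A", "CPU B", "stick 1" and "stick 2" are made up for illustration, and `suspects` is a hypothetical helper that lists the parts present in every failing configuration.

```python
from itertools import product

CPUS = ("CPU A", "CPU B")          # hypothetical labels for the two processors
STICKS = ("stick 1", "stick 2")    # hypothetical labels for the two 512MB sticks


def full_plan():
    """Each CPU alone with each stick, then both CPUs with one and two sticks."""
    plan = [((cpu,), (stick,)) for cpu, stick in product(CPUS, STICKS)]
    plan.append((CPUS, (STICKS[0],)))   # both CPUs, one stick
    plan.append((CPUS, STICKS))         # both CPUs, both sticks
    return plan


def suspects(results):
    """Return the components that appear in every failing configuration."""
    failing = [set(cpus) | set(sticks)
               for (cpus, sticks), ok in results.items() if not ok]
    if not failing:
        return set()                    # nothing failed, so nothing is implicated
    return set.intersection(*failing)
```

With every configuration passing, as happened here, `suspects` returns an empty set, which matches the conclusion that no single part could be blamed. If only the configurations containing one particular CPU had failed, that CPU would be the only name left in the result.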

This morning I plugged all my PCI cards back in and reconnected all the drives. After about an hour of testing everything was OK, so hopefully when I get home tonight it will still be working.

So I have no idea what the problem was, but by taking apart my system and re-installing the hardware, it seems to be working fine. It could be that something was loose, I guess. Our house was recently built and there is a lot of construction going on in the area, so our house vibrates whenever any heavy equipment drives by; maybe something got unseated slightly.

Has anyone else seen this kind of thing happen before?

Jeff.

Last fiddled with by Jeff Gilchrist on 2004-06-28 at 14:04
2004-06-28, 14:58   #2
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10100101011011₂ Posts

Quote:
Originally Posted by Jeff Gilchrist
So I have no idea what the problem was, but by taking apart my system and re-installing the hardware, it seems to be working fine. It could be that something was loose, I guess. Our house was recently built and there is a lot of construction going on in the area, so our house vibrates whenever any heavy equipment drives by; maybe something got unseated slightly.

Has anyone else seen this kind of thing happen before?

Jeff.
Not with the specific hardware you describe, but many times I've seen strange hardware problems being fixed by reseating everything in sight.

A common cause is dust, or other such material, filtering down between pins and sockets as a result of vibration or of thermal cycling as the machine is powered on and off. Sometimes the slow build-up of oxides between metal-to-metal contacts is the culprit.

A very useful tool kit for working on machines includes an air blower, a small battery powered vacuum cleaner with both narrow nozzle and long-bristled brush attachment, and a small mallet (rubber or nylon head is best, wood will serve). If you really don't have an alternative, a plastic or wooden handled screwdriver will do at a pinch. The use of the first two should be obvious. The mallet comes in handy for gently tapping flat-mounted chips to reseat them.

Now for something that doesn't apply in Jeff's case but can be a lifesaver and seems not to be widely known.

The mallet is also very useful for ungumming disks if they should fail to spin up after being powered down for a while. Remove the disk from the mounting cage but leave the cables connected. First try the following with the power disconnected. Holding the circuit board side in your hand, give the other side a sharp tap with the mallet. Very frequently this shock is enough to break the stiction in the disk bearings. Apply power and see whether the disk restarts. If it doesn't, try again but there's no sense in trying more than 3 or 4 times. If the disk remains stuck, the next phase is to remove the disk completely, put it in a plastic bag and squash the air out, then put it in the freezer for 24 hours. Remove from freezer and the bag, attach cables and power it up while it's still as cold as possible, hoping the differential thermal contraction will fix it.

If, after all that, the disk still doesn't work, you should now be getting desperate. It's unlikely you'll see your data again, but if you want one last try before junking the disk, you can repeat the mallet trick with the disk powered up this time. There is a fair chance that the shock will cause permanent damage to at least some of the data, possibly all of it, but if the choice is between possible damage and certain data loss it's worth a try in my opinion. If you go down this route and the disk does return to life, copy all the data somewhere safe and then replace the disk.


I've been fixing hardware for something over 20 years now. My guess is that reseating cards and chips has repaired a dozen or two machines in that time. Beating and/or freezing has revived about the same number of disks. Your mileage may vary.

Paul
2004-06-28, 15:32   #3
TauCeti
 
Mar 2003
Braunschweig, Germany

226₁₀ Posts

Thanks a lot for the description of these procedures, Paul. I have also had my share of hardware problems over the last 20 years but was not aware of the freezer trick :)

I think I will give it a try with my NFSNET Test-HD that died (no spin-up) some weeks ago, even if it has already been replaced now.

One technical question: 'Freezer' means minus twenty degrees centigrade and not merely a fridge?

And let me add the small piece of advice that one should not try those recovery tricks if the data on the disc is so valuable that you would need to consult a professional data recovery service if the tricks do not work.

Lars

--
There are 2 kinds of hard drives, ones that have crashed and ones that will crash.
2004-06-28, 17:31   #4
Uncwilly
6809 > 6502
 
Aug 2003
101×103 Posts

10010010010011₂ Posts

Quote:
Originally Posted by TauCeti
There are 2 kinds of hard drives, ones that have crashed and ones that will crash.
That sounds a lot like the 2 types of motorcyclists: those that have gone down and those that will go down.


Paul, I have successfully revived an MFM drive (long enough to back up) by actually being able to get my fingertip on the end of the spindle and breaking it loose that way. (It was a refurbed 42M disk; going from a single 20M to 2 42's was great.)

I have a machine that is shutting down seemingly at random. I think that my OG 411MB SCSI is in its death throes.
2004-06-28, 19:35   #5
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

3·3,529 Posts

Quote:
Originally Posted by TauCeti
Thanks a lot for the description of these procedures, Paul. I have also had my share of hardware problems over the last 20 years but was not aware of the freezer trick :)

I think I will give it a try with my NFSNET Test-HD that died (no spin-up) some weeks ago, even if it has already been replaced now.

One technical question: 'Freezer' means minus twenty degrees centigrade and not merely a fridge?

And let me add the small piece of advice that one should not try those recovery tricks if the data on the disc is so valuable that you would need to consult a professional data recovery service if the tricks do not work.
Yes, the disk should be as cold as possible, to maximise the difference in contraction between the various materials. That said, dry ice or liquid nitrogen is probably overdoing it and could well cause problems by itself.

I was assuming that the data was not worth the price, measured in thousands of EUR/GBP/USD, that professionals would charge to recover data. If you have data worth that much, why on earth didn't you have adequate backups in the first place? (Rhetorical question: I know NFSNET data isn't worth that much!)


Paul
2004-06-28, 19:39   #6
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10100101011011₂ Posts

Quote:
Originally Posted by Uncwilly
That sounds a lot like the 2 types of motorcyclists: those that have gone down and those that will go down.
Just passed my thirtieth anniversary as a motorcyclist without serious damage being done. Apart from a few scrapes and bruises, my worst injury was a cracked scaphoid (a small bone in the wrist) when I fell off at about 5mph / 10kph and put my hand down in exactly the wrong way when I hit the road. Something like 200,000 miles / 350,000 km on motorcycles in that time. Currently a member of the Pan Clan (http://www.pan-clan.org).

Paul
2004-06-28, 23:54   #7
Jeff Gilchrist
 
Jun 2003
Ottawa, Canada

7×167 Posts

I got home today and my machine was still working without any errors even with all my hardware plugged back in.

The other interesting thing I noticed was that before, my CPUs usually ran at 57C and 61C at full load. Now they are running at 51C and 52C at full load. I guess the thermal compound I used this time between the CPU and the heatsink must be better than whatever was used when the machine was built. I also removed all the dust so I'm sure that helped too.

Jeff.
2004-06-29, 13:44   #8
Reboot It
 
Aug 2002
London, UK

1011011₂ Posts

Following on from xilman's comments about how to unstick a drive that has been shut down for a while and won't restart, the following has been said to be a useful technique to try to prevent that problem arising in the first place. It applies mostly to drives that are getting old and that have been on 24/7/365 for a long time that need to be switched off for an extended period.

Before permanently shutting down the computer with the drive(s) in question, shut it down and switch it off for only 30 seconds before restarting it again. Then shut it down and leave it for the long shutdown.

I think the idea is that as the head landing zone has hardly ever been used, there is a risk of the heads somehow getting stuck there, so a "practice run" at landing and moving off is what this achieves.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.