mersenneforum.org  

2005-09-30, 16:54   #1
janderson
 
Sep 2005

3 Posts
Weird Hardware Error

I've been using prime95's torture test for years as a system stability gauge.

I recently upgraded my PC to a dual-core Athlon 64 X2 4400+ (2.2 GHz), and prime95 exhibits some odd behavior. I can run p95 on either CPU core alone for extended periods - I ran p95 for 12 hours on each core in 64-bit mode without faults.

However, running on both cores simultaneously with two instances of prime95 causes a reproducible error - a rounding error (0.5, where 0.4 or less is expected) occurs on core0 in the middle of the fourth 1024k test. core1 continues to run (I've let it go for up to 8 hours after core0 fails)
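
For readers unfamiliar with the check that failed here: FFT-based multiplication works in floating point, so every output value should land very close to an integer, and the "rounding error" is how far the worst value strays from one. The sketch below illustrates that idea with a plain radix-2 FFT in pure Python. It is not Prime95's actual code (which uses a far more optimized transform); the function names are made up for illustration, and the 0.4 limit is simply the figure quoted in the error above.

```python
import cmath

# Limit quoted in the post ("0.5, where 0.4 or less is expected").
ROUNDOFF_LIMIT = 0.4


def fft(values, invert=False):
    """Recursive radix-2 FFT. len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + twiddle * odd[k]
        out[k + n // 2] = even[k] - twiddle * odd[k]
    return out


def to_digits(number, base=10):
    """Little-endian digit list of a non-negative integer."""
    digits = []
    while True:
        number, digit = divmod(number, base)
        digits.append(digit)
        if number == 0:
            return digits


def from_digits(digits, base=10):
    """Inverse of to_digits."""
    return sum(d * base ** i for i, d in enumerate(digits))


def square_with_roundoff(digits, base=10):
    """Square a number via FFT; return (result_digits, max_roundoff_error)."""
    size = 1
    while size < 2 * len(digits):
        size *= 2
    padded = [complex(d) for d in digits] + [0j] * (size - len(digits))
    spectrum = fft(padded)
    product = fft([x * x for x in spectrum], invert=True)

    max_error = 0.0
    carry = 0
    result = []
    for x in product:
        value = x.real / size          # unnormalized inverse FFT, so divide by size
        nearest = round(value)         # the integer the value *should* be
        max_error = max(max_error, abs(value - nearest))
        carry, digit = divmod(nearest + carry, base)
        result.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        result.append(digit)
    return result, max_error


if __name__ == "__main__":
    digits, error = square_with_roundoff(to_digits(123456789))
    print(from_digits(digits), error, error <= ROUNDOFF_LIMIT)
```

An error of 0.5 is the worst case possible: the value sits exactly halfway between two integers and cannot be rounded reliably, which is why a torture test treats it as a fault rather than ordinary floating-point noise.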

Here's what makes it really weird to me:
1) The RAM passes memtest86 tests - I can run memtest86 for hours without problems.
2) The CPU is cool enough - 45 Celsius max under extreme load. It's watercooled and not overclocked. The waterblock temp is ~29 Celsius under load.
3) Because the CPU failed Prime95 tests, I exchanged it for a new identical CPU - both CPUs behave identically. Same error, same place.
4) The error is ALWAYS the same. EXACTLY the same. The same number in the error (0.5, expecting 0.4 or less), in the same spot (in the middle of 4 iterations of 1024k). Always on the same CPU core (core0).
5) No lack of power: good Enermax 430W PSU.
6) The same error occurs at the same spot with both 64-bit prime95 and 32-bit prime95 (both in XP Pro x64 edition).

In my experience, hardware errors/failures stick out because errors occur intermittently or unpredictably. Since the error on this one is 100% reproducible and always the same, I'm pretty much stumped. I'm going to try to underclock the CPU and RAM over the weekend, and see what happens, but beyond that, I really don't know.

Any suggestions on how to debug, etc would be most helpful.
2005-09-30, 20:07   #2
dsouza123
 
Sep 2002

2·331 Posts

Underclocking the CPU and RAM is a very good technique for finding where it is stable. Increasing the CPU voltage slightly may also help.

It could still be a RAM issue: the RAM may not be fully stable under full load at its current speed. Underclocking the RAM alone will detect this.
2005-10-01, 07:37   #3
Cruelty
 
May 2005

2³×7×29 Posts

Which version of prime95 are you using and which stress test?
What motherboard do you have? What kind of RAM do you have, and what timings do you use? What are the chipset and "pwmic" temperatures? (Generally both should stay below 50C under a long stress test.)
Give us some more information.
Right now the only suggestion I can make is to make sure you have the newest BIOS available for your particular motherboard.
You can also try the S&M application to verify your system's stability under full load.
2005-10-01, 15:24   #4
janderson
 
Sep 2005

3 Posts

Turns out it was a RAM timing problem. For testing the first CPU I had set the RAM timing "manually" to very relaxed settings (under which it still failed). After I replaced the CPU, the motherboard's BIOS applied "default" settings across the board, including very aggressive RAM timings - more aggressive than SPD. I set the board back to "Auto" RAM timings and the problem disappeared. My fault for being too eager to test a new CPU.

I can now run Prime95-64bit for hours on end on both cores, no problem. (Had it running for 10 hours last night to test.)

You mention something I'd never heard of before: pwmic temps. I looked it up, and now I'm starting to wish I'd special ordered a better motherboard - I bought the only half-decent NF3-250gb motherboard they had in stock so I didn't have to trash my "old" 6800GT AGP. The motherboard doesn't seem to have more than just the CPU temp and voltage sensors built in. Motherboard Monitor seems to get really wacky values for temp sensors 2 and 3 (~ -5C and ~110C), so I assume they're not hooked up. I guess NF4s are even better for that kind of thing.

It would be pretty cool to have PWM IC and Chipset temp sensors. Oh well. I'm not going to be overclocking this one, so it probably doesn't matter.

Thanks for the input guys. It's very much appreciated.
2005-10-01, 16:47   #5
Cruelty
 
May 2005

1624₁₀ Posts

I previously had a Socket 754 DFI nF3 motherboard, and as far as I remember it had 3 built-in sensors - not sure if one of them was monitoring the PWM IC temp, though...
BTW: What monitoring software are you using? The last version of Motherboard Monitor is 15 months old, so you should probably use some other tool for temperature monitoring...
2005-10-01, 17:40   #6
janderson
 
Sep 2005

3 Posts

I use Gigabyte's own proprietary software in Windows. It only shows the CPU temp (which is why I tried MBM5). Is there anything nearly as good as MBM5 out there? I certainly haven't found anything as complete and configurable as MBM was. I've fiddled with "EVEREST Home Edition" (okay) and SysMetrix (depends on MBM).

Interestingly, EVEREST reads the 3 temps as 30C (CPU), 50C (GPU), and 42C (GPU Ambient).

I use Linux most of the time anyway, with gkrellm2, which reads temps from the kernel, which in turn reads them from the it87 sensor chip on the mobo. Linux can see three temperatures monitored by the it87: one reads ~35-40C (the CPU), the second reads ~25C (unknown, probably the it87 itself, the case, or the chipset), and the third reads 5000 (raw), which could mean 50C (PWM?) or that the sensor isn't connected.
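
To make the ambiguity in that raw 5000 concrete, here is a small sketch that interprets a raw reading under two guessed scalings and flags values that look implausible for a sensor inside a running PC. Everything in it is an assumption for illustration: the divisors, the 10-100C plausibility window, and the function names are hypothetical and are not taken from the it87 driver or any monitoring tool.

```python
# Assumed plausibility window (degrees Celsius) for a sensor in a running PC.
PLAUSIBLE_MIN_C = 10.0
PLAUSIBLE_MAX_C = 100.0

# Hypothetical scalings for a raw integer reading. Which one (if any)
# applies to the value seen in the thread is not known.
ASSUMED_DIVISORS = {
    "hundredths of a degree": 100.0,
    "thousandths of a degree": 1000.0,
}


def is_plausible(temp_c):
    """True if the temperature falls inside the assumed window."""
    return PLAUSIBLE_MIN_C <= temp_c <= PLAUSIBLE_MAX_C


def interpret_raw(raw):
    """Map each assumed scaling to (temperature in C, looks plausible?)."""
    results = {}
    for label, divisor in ASSUMED_DIVISORS.items():
        temp_c = raw / divisor
        results[label] = (temp_c, is_plausible(temp_c))
    return results


if __name__ == "__main__":
    for label, (temp_c, ok) in interpret_raw(5000).items():
        print(f"{label}: {temp_c:.1f} C, plausible={ok}")
    # The -5C and 110C readings reported earlier in the thread.
    for reading in (-5.0, 110.0):
        print(f"{reading:.1f} C, plausible={is_plausible(reading)}")
```

Under these assumptions only the "hundredths" reading (50C) looks like a live sensor, while the earlier -5C and 110C values fall outside the window, consistent with the guess that those inputs are not connected. It is a sanity check, not a substitute for the board's documentation.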

I really wish this stuff had documentation somewhere on the motherboard manufacturer's website, rather than just marketing fluff.
2005-10-01, 23:36   #7
Cruelty
 
May 2005

2³·7·29 Posts

Someone mentioned "SpeedFan" on this forum as a tool to monitor temperatures; however, the best way would be to get the software from the manufacturer of the sensor, in this case ITE.
As for EVEREST, I wouldn't trust it too much as far as voltage and temperature monitoring is concerned... e.g. you can monitor actual GPU temperatures using the nVidia advanced display properties.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.