#1 |
|
Sep 2005
3 Posts |
I've been using Prime95's torture test for years as a system stability gauge.

I recently upgraded my PC to a dual-core Athlon 64 X2 4400+ (2.2 GHz), and Prime95 exhibits some odd behavior. I can run Prime95 on either CPU core by itself for extended periods: I ran it for 12 hours on each core in 64-bit mode without faults. However, running on both cores simultaneously with two instances of Prime95 causes a reproducible error: a rounding error (0.5, where 0.4 or less is expected) occurs on core0 in the middle of the fourth 1024k test. core1 continues to run (I've let it go for up to 8 hours after core0 fails).

Here's what makes it really weird to me:

1) The RAM passes memtest86 tests. I can run memtest86 for hours without problems.
2) The CPU is cool enough: 45 Celsius max under extreme load. It's watercooled and not overclocked. The waterblock temp is ~29 Celsius under load.
3) Because the CPU failed Prime95 tests, I exchanged it for a new identical CPU. Both CPUs behave identically. Same error, same place.
4) The error is ALWAYS the same. EXACTLY the same. The same number in the error (0.5, expecting 0.4 or less), in the same spot (in the middle of 4 iterations of 1024k). Always on the same CPU core (core0).
5) No lack of power: good Enermax 430W PSU.
6) The same error occurs at the same spot with both 64-bit Prime95 and 32-bit Prime95 (both in XP Pro x64 Edition).

In my experience, hardware errors/failures stick out because errors occur intermittently or unpredictably. Since the error on this one is 100% reproducible and always the same, I'm pretty much stumped. I'm going to try to underclock the CPU and RAM over the weekend and see what happens, but beyond that, I really don't know.

Any suggestions on how to debug, etc. would be most helpful.
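For readers unfamiliar with the error message: Prime95 multiplies very large numbers using floating-point FFTs, so every output value should land very close to an integer, and the program reports how far the worst value strayed. The sketch below only illustrates that kind of check. It is not Prime95's code; the function names and sample values are hypothetical, and only the 0.4 limit comes from the message quoted above.

```python
# Illustrative sketch only; this is not Prime95's actual implementation.
ROUNDOFF_LIMIT = 0.4  # threshold quoted in the error message in the post


def max_roundoff(values):
    """Return the largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check_roundoff(values, limit=ROUNDOFF_LIMIT):
    """Return (ok, error); ok is False when the worst error exceeds the limit."""
    error = max_roundoff(values)
    return error <= limit, error


# Hypothetical results of a floating-point multiply that should all be integers.
healthy = [1024.0001, 77.9998, 3.0]
corrupted = [1024.0001, 77.5, 3.0]  # one value sits exactly between two integers

print(check_roundoff(healthy))
print(check_roundoff(corrupted))
```

An error of 0.5 is the worst possible case: the value is exactly halfway between two integers, so there is no way to tell which one was intended.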
|
|
|
|
|
#2 |
|
Sep 2002
2×331 Posts |
Underclocking the CPU and RAM is a very good technique for finding where it is stable. Increasing the CPU voltage slightly may also help. It could still be a RAM issue, with the RAM not fully stable at full load at its current speed; underclocking the RAM alone will detect this.
|
|
|
|
|
#3 |
|
May 2005
2³·7·29 Posts |
Which version of Prime95 are you using, and which stress test?

What motherboard do you have? What kind of RAM do you have, and what timings do you use? What are the chipset and "pwmic" temperatures (generally both should be below 50C under a long stress test)? Give us some more information.

Right now the only suggestion I can make is to make sure you have the newest BIOS available for your particular motherboard. You can also try the S&M application to verify your system stability under full load.
|
|
|
|
|
#4 |
|
Sep 2005
3 Posts |
Turns out it was a RAM timing problem. For testing the first CPU I had set the RAM timing "manually" to very relaxed settings (under which it still failed). After I replaced the CPU, the motherboard's BIOS applied "default" settings across the board, including very aggressive RAM timings, more aggressive than SPD. I set the board back to "Auto" RAM timings and the problem disappeared. My fault for being too eager to test a new CPU.

I can now run Prime95 64-bit for hours on end on both cores, no problem. (Had it running for 10 hours last night to test.)

You mention something I'd never heard of before: pwmic temps. I looked it up, and now I'm starting to wish I'd special-ordered a better motherboard. I bought the only half-decent NF3-250Gb motherboard they had in stock so I didn't have to trash my "old" 6800GT AGP. The motherboard doesn't seem to have more than just the CPU temp and voltage sensors built in. Motherboard Monitor seems to get really wacky values for temp sensors 2 and 3 (about -5C and about 110C), so I assume they're not hooked up. I guess NF4s are even better for that kind of thing. It would be pretty cool to have PWM IC and chipset temp sensors. Oh well. I'm not going to be overclocking this one, so it probably doesn't matter.

Thanks for the input, guys. It's very much appreciated.
|
|
|
|
|
#5 |
|
May 2005
2³·7·29 Posts |
I previously had a socket 754 DFI nF3 motherboard, and as far as I remember it had 3 sensors built in. I'm not sure whether one of them was monitoring the PWM IC temp, though...

BTW: What monitoring software are you using? The last version of Motherboard Monitor is 15 months old, so you should probably use some other tool for temperature monitoring...
|
|
|
|
|
#6 |
|
Sep 2005
3 Posts |
I use Gigabyte's own proprietary software in Windows. It only shows the CPU temp (which is why I tried MBM5). Is there anything nearly as good as MBM5 out there? I certainly haven't found anything as complete and configurable as MBM was. I've fiddled with "EVEREST Home Edition" (okay) and SysMetrix (depends on MBM).

Interestingly, Everest reads the 3 temps as 30C (CPU), 50C (GPU), and 42C (GPU Ambient).

I use Linux most of the time anyway. There I use gkrellm2, which reads temps from the kernel, which in turn reads from the it87 sensor on the mobo. Linux can see three temperatures monitored on the it87: one reads ~35-40C (the CPU), the second reads ~25C (unknown, probably the it87 itself, case, or chipset), and the third reads 5000 (raw), which could mean 50C (PWM?) or that the sensor isn't connected.

I really wish this stuff had documentation somewhere on the motherboard manufacturer's website, rather than just marketing fluff.
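As a rough aid for interpreting raw values like the "5000" above: the current Linux hwmon sysfs interface reports `temp*_input` files in millidegrees Celsius. Whether the kernel and it87 driver in this post used that scale is an assumption, not something the thread confirms; under it, a raw 5000 would be 5.0 C rather than 50 C. The sketch below reads whatever hwmon temperatures exist and flags readings outside a plausibility window. The 0 to 100 C window and the function names are hypothetical choices for illustration.

```python
import glob
import os


def millideg_to_celsius(raw):
    """Convert a raw hwmon reading (assumed millidegrees Celsius) to degrees."""
    return int(raw) / 1000.0


def is_plausible(temp_c, low=0.0, high=100.0):
    """Heuristic: readings outside [low, high] suggest an unconnected sensor.

    The default bounds are an assumption chosen for illustration.
    """
    return low <= temp_c <= high


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {path: degrees C} for every readable temp*_input file under root."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                readings[path] = millideg_to_celsius(f.read().strip())
        except (OSError, ValueError):
            continue  # skip sensors that cannot be read or parsed
    return readings


if __name__ == "__main__":
    for path, temp_c in read_hwmon_temps().items():
        note = "" if is_plausible(temp_c) else "  (implausible; sensor may be unconnected)"
        print(f"{path}: {temp_c:.1f} C{note}")
```

With these bounds, readings like the -5C and 110C values reported by Motherboard Monitor earlier in the thread would both be flagged as implausible.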
|
|
|
|
|
#7 |
|
May 2005
2³·7·29 Posts |
Someone mentioned "SpeedFan" on this forum as a tool to monitor temperatures; however, the best way would be to get the software from the maker of the sensor, in this case ITE.

As for Everest, I wouldn't trust it too much as far as voltage and temperature monitoring is concerned... e.g. you can monitor actual GPU temperatures using the nVidia advanced display properties.