#1 |
2³·5·239 Posts |
As in threads 10734 and 1514, I'm getting "Test 1, 4000 Lucas-Lehmer iterations of M19922945 using FFT length 1024K... FATAL ERROR: Rounding was 0.5, expected less than 0.4 ... Hardware failure detected, consult stress.txt file".
As in thread 11001 (and a few others, according to a Google search), there is no "stress.txt" file to consult. Oddly, I know that I really do have a hardware problem, though I haven't narrowed down WHAT problem.
What I can't figure out is how, if I get an error this quickly, any program on any computer ever works for anything. The percentage of people who actually bother to stress test their systems is minuscule, so the mere handful of errors like this one must translate into hundreds of thousands, if not millions, of machines in the real world with similar issues. This one particular error would seem to indicate a problem on the order of the original Pentium floating-point bug.
So what is the next step? Surely this error narrows the problem down to an extremely small number of possibilities, like a specific CPU instruction using specific registers. I can't see how anything else could possibly cause a rounding error like this, other than cosmic rays randomly flipping bits in the CPU. What is this error actually telling us?
Thanks, James
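As background to the "Rounding was 0.5, expected less than 0.4" message, here is a minimal sketch of the kind of check involved. It assumes the usual meaning of a rounding check in FFT-based multiplication: the product is computed in floating point, every output coefficient should land very close to an integer, and the largest distance from the nearest integer is the rounding error. This is not Prime95's actual code; the names `fft`, `square_with_roundoff` and `check_roundoff` are made up for illustration, and only the 0.4 limit is taken from the message itself.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def square_with_roundoff(digits):
    """Square a little-endian digit list via FFT.

    Returns (rounded product coefficients, largest distance of any raw
    coefficient from the nearest integer). Carries are not propagated.
    """
    size = 1
    while size < 2 * len(digits):
        size *= 2
    padded = [complex(d) for d in digits] + [0j] * (size - len(digits))
    spectrum = fft(padded)
    squared = [s * s for s in spectrum]
    raw = [c.real / size for c in fft(squared, invert=True)]
    rounded = [round(x) for x in raw]
    max_error = max(abs(x - r) for x, r in zip(raw, rounded))
    return rounded, max_error


def check_roundoff(max_error, limit=0.4):
    """True if the result can be trusted; 0.4 is the limit quoted in the error."""
    return max_error < limit


coefficients, error = square_with_roundoff([1, 2, 3])  # 321, least significant digit first
print(coefficients[:5], error, check_roundoff(error))
```

On healthy hardware the error for a small input like this is tiny. A value of 0.5 means a coefficient sat exactly halfway between two integers, so the program could not tell which one was correct.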
#2 |
10011011011100₂ Posts |
Hello,
I've just posted about a similar problem and am waiting for an answer as well; I just want to know where this problem comes from. Most probably RAM, as I get it only on the Blend torture test, not on small FFTs. |
#3 |
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts |
Quote:
Quote:
One way you can get the stress.txt file is to download the appropriate .zip/.gz file directly from ftp://mersenne.org/gimps, then extract the stress.txt file from that. Quote:
The Prime95 stress test can expose such problems even when no other stress test does, because it hammers all parts of your system, especially the FPU and memory bus, simultaneously, which few special-purpose stress tests do. For instance, memtest86 is a fine and competent memory test, but AFAIK it doesn't exercise the FPU at the same time as memory, as prime95 does as a routine matter of the way it works. So it's quite possible for a system to pass memtest86 with no errors yet fail early in the Prime95 torture test, even because of a memory stick that passed memtest86, since the workload imposed on memory, CPUs and data buses is different. This is not to say that Prime95 is necessarily superior to other stress tests, only that the type of workload it imposes on your system is almost certainly not duplicated by them, so each stress test has its own strengths and weaknesses in exposing certain problems. Quote:
So the Prime95 users that use it to actually test for primes _do_ have some evidence that their system is okay for that purpose. It is, of course, quite okay to use Prime95 only to stress-test a system that will be used later for things other than prime-testing. Much as GIMPS would love to have their participation, we won't begrudge their use of Prime95 only to torture-test, and we are willing to help solve the hardware problem even if it will only be used for Duke Nukem or EverQuest or whatever is popular this century (Second Life?) :-) Quote:
It's not that GIMPS users' systems never have hardware problems; it's that because of the extra testing requirements _plus_ the cross-checking that is performed automatically during a prime-testing assignment run, GIMPS users will have a lower percentage of undetected/uncorrected errors on faulty systems.
Furthermore, if a prime95 cross-check detects an error in the middle of a run, it backs up to the previous save-file and re-runs the portion that had the error. Often, such errors turn out to be "soft" and don't recur during the rerun (so the crosscheck gets a correct result the second time).
The actual data is that about 1-2% of GIMPS prime-testing runs turn out to give a faulty result. (How do we know? GIMPS _always_ requires a doublecheck run with a matching result before accepting the result as "good", and about 2-3% of the doublecheck run results differ. Of course, sometimes it's the DC run that's in error. Anyway, when first-time and DC results differ, there's a triple-check, and if necessary, quadruple-check etc. ... until we're fairly certain we have a correct result computed by independent systems.)
Quote:
Quote:
Quote:
Quote:
Go ahead -- try to surprise us ... as long as you give us all your specs as asked above! :-)
Last fiddled with by S485122 on 2008-12-06 at 08:30 Reason: some of the latest ZIPs do not include a stress.txt file.
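The double-check figures in the post above (about 1-2% faulty runs, about 2-3% mismatched double-checks) can be tied together with a little arithmetic. This is a minimal sketch under two assumptions that are not stated in the thread: errors in different runs are independent, and two faulty runs practically never produce the same wrong result. The function names and the result labels are hypothetical.

```python
from collections import Counter


def mismatch_rate(per_run_error_rate):
    """Chance that a first-time run and its double-check disagree.

    The two results agree only if both runs are correct (assumption:
    two faulty runs never agree by accident).
    """
    return 1 - (1 - per_run_error_rate) ** 2


def verified_result(results):
    """Return the first result reported twice, or None if no two runs match yet.

    `results` is the sequence of final results in the order the runs
    finished: first-time, double-check, triple-check, and so on.
    """
    seen = Counter()
    for result in results:
        seen[result] += 1
        if seen[result] == 2:
            return result
    return None


for rate in (0.01, 0.015):
    print(f"per-run error {rate:.1%} -> mismatch {mismatch_rate(rate):.2%}")

# A faulty first run, then two matching runs: the triple-check settles it.
print(verified_result(["bad-result", "good-result", "good-result"]))
```

With a per-run error rate of 1% to 1.5% the mismatch rate comes out at roughly 2% to 3%, which is in line with the numbers quoted in the post.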
#4 |
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts |
Quote:
Quote:
You could try swapping RAM sticks, to see whether that makes a difference. And clean dust out of the case!
Last fiddled with by cheesehead on 2008-12-06 at 07:41
#5 |
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts |
#6 |
Aug 2002
3²×23×41 Posts |
#7 |
2A2₁₆ Posts |
I really appreciate your very detailed answer, cheesehead. Thank you. I'll try to answer some of your questions.
Quote:
CPU is an Intel Q9450, which is a Core 2 Quad at 2.66GHz. I'm skipping the RAM, other than to say I tried three different manufacturers, all of which were 2 x 2GB of DDR2. The CPU is practically ambient: 27 C. No, I'm not making that up. The motherboard reports 35 C. The reason the chip is so cold is that (a) the test fails instantaneously and (b) the Tuniq Tower CPU cooler is a royal pain to install but it sure does work. Quote:
Quote:
Indeed, this was the case for me. I KNEW I had a hardware problem because I kept getting lock ups and blue screens in Windows. I switched to Ubuntu to see if it was a driver issue instead of an actual hardware problem. And I now have a solution, as well. It was the CPU. I swapped the CPU (requiring a fight with the %$^#*! Tuniq Tower cooler). Problems are gone. I'd already swapped the RAM and the issue stayed. Plus swapping OSes. Quote:
Quote:
I'm dumb, but I'm not so dumb that I would post about an error if I were overclocking. Like I said, you could damn near cool your beer with it.
From mersenne.org
Well, to wrap up, I know I have a bad CPU. And again, thanks for the response. I'll be leaving my system on overnight to make sure it's all good.
-James Ingraham
#8 |
"Richard B. Woods"
Aug 2002
Wisconsin USA
7692₁₀ Posts |
James,
I'm glad you found your problem. Bad CPUs are neither common enough to list as a first suspicion nor unheard-of. Also uncommon but not unheard-of is having more than one faulty component, _especially when the failure was too abrupt for the torture test to have stressed other components yet_. I suggest stress-testing your system as hard after replacing the CPU as you would have done originally.
Since you seem more interested in the process of problem-diagnosing than the average poster, I'll go into some detail about my thoughts on this end when I had only your first posting, and what your later additional details mean to me. You caught the most likely-to-be-wordy responder (me) in a wordier-than-usual mood.
Because you quoted 3 specific thread numbers, I knew you had already reviewed some past threads here, which marks you as unusual, but the scarcity of details you gave about your system could have indicated that you hadn't noticed that we almost always ask for more specific details about systems. So I originally had conflicting impressions of your likely cluefulness. Quote:
Quote:
Quote:
Quote:
Quote:
Again, it's perfectly consistent with a faulty CPU in _your_ case -- but not with some fundamental flaw in CPUs in general unless we were getting frequent similar reports from others. Quote:
Quote:
I'm not trying to criticize your original post here! What I _am_ doing is explaining my thoughts on this end that led me to respond the way I did, and explain why some of your ideas (e.g., widespread or systematic CPU flaws) don't look likely from this end. Quote:
Quote:
Quote:
Over 99% of participant computers run one of the most stressful computations they'll ever do with not a single error in over 3*10^15 clock cycles. Quote:
Quote:
|
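For a sense of scale, the "3*10^15 clock cycles" figure in the post above can be converted into running time. This is a rough sketch that assumes the 2.66GHz clock of the Q9450 mentioned earlier in the thread; machines at other clock speeds will give different numbers, and the helper name `cycles_to_days` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def cycles_to_days(cycles, clock_hz):
    """Convert a clock-cycle count into days of continuous running."""
    return cycles / clock_hz / SECONDS_PER_DAY


# Assumption: a 2.66 GHz core running flat out the whole time.
days = cycles_to_days(3e15, 2.66e9)
print(f"{days:.2f} days")
```

At that clock speed, 3*10^15 cycles is about 13 days of uninterrupted computation without a single detected error.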
#9 |
4100₁₀ Posts |
Quote:
I'm afraid I made a classic mistake. I was so focused on my problem I forgot that everybody else doesn't know every detail. Sorry. Quote:
Quote:
Quote:
Again, thanks for taking the time to reply. -James Ingraham |
#10 |
Jan 2003
2×103 Posts |
Back in the P4 days, I remember having a system that would give errors unless I underclocked the memory. It turned out to be the motherboard, which was a very cheap budget model. Running at stock speeds doesn't automatically guarantee prime95 stability.
Even with my new system, the RAM is DDR2-1066, rated at 2.2-2.4V. By default, my motherboard chooses to run it at 1.8V (wrongly detected, I guess), and at that voltage running it at its stock-rated 1066MHz would cause errors. |