mersenneforum.org  

Old 2008-12-05, 03:34   #1
Unregistered
 

2^4·3^2·13 Posts
Yet another Torture Test Fatal Error

Like threads 10734 and 1514, I'm getting "Test 1, 4000 Lucas-Lehmer iterations of M19922945 using FFT length 1024K... FATAL ERROR: Rounding was 0.5, expected less than 0.4 ... Hardware failure detected, consult stress.txt file".

Like thread 11001 (and a few others, according to a Google search) there is no "stress.txt" file to consult.

Oddly, I know that I really do have a hardware problem, though I haven't narrowed down WHAT problem. What I can't figure out is if I get an error this quickly, how any program on any computer ever works for anything. The percentage of people who actually bother to stress test their systems is minuscule, so the mere handful of errors like this one must translate into hundreds of thousands, if not millions, of machines in the real world with similar issues. This one particular error would seem to indicate a problem on the order of the original Pentium floating-point bug.

So what is the next step? Surely this error narrows the problem down to an extremely small number of possibilities, like a specific CPU instruction using specific registers. I can't see how anything else could possibly cause a rounding error like this, other than cosmic rays randomly flipping bits in the CPU. What is this error actually telling us?


Thanks,
James

Old 2008-12-05, 22:21   #2
Unregistered

22646_8 Posts

Hello,

I've just posted about a similar problem and I'm hoping for an answer as well; I just want to know where this problem comes from. Most probably RAM, as I get it only on the Blend torture test, not on Small FFTs.

Old 2008-12-06, 04:34   #3
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA

2^2×3×641 Posts

Quote:
Originally Posted by Unregistered View Post
Like threads 10734 and 1514, I'm getting "Test 1, 4000 Lucas-Lehmer iterations of M19922945 using FFT length 1024K... FATAL ERROR: Rounding was 0.5, expected less than 0.4 ... Hardware failure detected, consult stress.txt file".
May we please have the Prime95 (or mprime) version number, your operating system, plus details of CPU and RAM? How hot does your CPU get during a torture test?

Quote:
Like thread 11001 (and a few others, according to a Google search) there is no "stress.txt" file to consult.
Indeed the P95v256.zip and P95v257.zip did not include the file.

One way you can get the stress.txt file is to download the appropriate .zip/.gz file directly from ftp://mersenne.org/gimps, then extract the stress.txt file from that.

Quote:
Oddly, I know that I really do have a hardware problem, though I haven't narrowed down WHAT problem. What I can't figure out is if I get an error this quickly, how any program on any computer ever works for anything.
Actually, it's simple: most stress-test hardware failures come from a small number of possible causes: too-much-overclocked CPUs, inadequate cooling (including dust-clogging), faulty memory sticks -- that cause a bit error only once in a trillion/zillion times. (This is not a comprehensive list; we need more information to suggest what's happening in your case, but odds are it's one of those three.)

The Prime95 stress test can cause such problems to show up even when no other stress test does because of the way it hammers all parts of your system, especially the FPU and memory bus, simultaneously in a way that few other special-purpose stress tests do.

For instance, memtest86 is a fine and competent memory test, but AFAIK it doesn't exercise the FPU at the same time as the memory, which Prime95 does as a routine consequence of the way it works. So it's quite possible for a system to pass memtest86 with no errors yet fail early in the Prime95 torture test, even because of a memory stick that passed memtest86, since Prime95 imposes a different workload on memory, CPUs and data buses.

This is not to say that Prime95 is necessarily superior to other stress tests, but that the type of workload it imposes on your system is almost certainly not duplicated by other stress tests, and thus each different stress test will have its strengths and weaknesses in exposing certain problems.

Quote:
The percentage of people who actually bother to stress test their systems is minuscule,
... except that Prime95 will not run an actual prime-testing assignment unless the system has passed the appropriate torture tests for a certain length of time. It's an actual requirement in Prime95 that users see only if they use Prime95 on a prime-testing assignment.

So the Prime95 users that use it to actually test for primes _do_ have some evidence that their system is okay for that purpose.

It is, of course, quite okay to use Prime95 only to stress-test a system that will be used later for things other than prime-testing. Much as GIMPS would love to have their participation, we won't begrudge their use of Prime95 only to torture-test, and we are willing to help solve the hardware problem even if it will only be used for Duke Nukem or EverQuest or whatever is popular this century (Second Life?) :-)

Quote:
so the mere handful of errors like this one must translate into hundreds of thousands, if not millions, of machines in the real world with similar issues.
Yes, but that would apply to the systems that are _not_ used for actual GIMPS prime-testing assignments. Duke Nukem, EverQuest, and Second Life are not as sensitive to bit-flips as Prime95 (where a single bit error can make the whole result wrong) is.
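
To illustrate why a single bit error can spoil a whole result, here is a minimal sketch of the Lucas-Lehmer recurrence in plain Python, with an optional simulated one-bit fault. It is not Prime95's code (Prime95 uses FFT-based multiplication); the function name and the `flip_at` / `flip_bit` parameters are made up for this illustration.

```python
def lucas_lehmer_residue(p, flip_at=None, flip_bit=0):
    """Return the final Lucas-Lehmer value s_(p-2) mod M_p, where M_p = 2**p - 1.

    If flip_at is given, one bit of the running value is flipped right after
    that iteration, imitating a single hardware bit error (hypothetical fault
    model, for illustration only).
    """
    m = (1 << p) - 1
    s = 4
    for i in range(p - 2):
        s = (s * s - 2) % m
        if i == flip_at:
            s ^= 1 << flip_bit  # simulated one-bit fault
            s %= m
    return s


p = 127  # M127 is a known Mersenne prime, so a clean run must end at 0
clean = lucas_lehmer_residue(p)
faulty = lucas_lehmer_residue(p, flip_at=10, flip_bit=3)

print("clean run says prime: ", clean == 0)
print("faulty run says prime:", faulty == 0)
```

Every iteration squares the previous value, so one wrong bit early on feeds into every later step, and the final residue no longer matches the clean run.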

It's not that GIMPS users' systems never have hardware problems; it's that because of the extra testing requirements _plus_ the cross-checking that is performed automatically during a prime-testing assignment run, GIMPS users will have a lower percentage of undetected/uncorrected errors on faulty systems. Furthermore, if a prime95 cross-check detects an error in the middle of a run, it backs up to the previous save-file and re-runs the portion that had the error. Often, such errors turn out to be "soft" and don't recur during the rerun (so the crosscheck gets a correct result the second time).

The actual data is that about 1-2% of GIMPS prime-testing runs turn out to give a faulty result. (How do we know? GIMPS _always_ requires a doublecheck run with a matching result before accepting the result as "good", and about 2-3% of the doublecheck run results differ. Of course, sometimes it's the DC run that's in error. Anyway, when first-time and DC results differ, there's a triple-check, and if necessary, quadruple-check etc. ... until we're fairly certain we have a correct result computed by independent systems.)
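
The matching rule described above can be sketched roughly as follows. This is only an illustration, not the GIMPS server logic: the function names and the residue strings are made up, and the mismatch estimate assumes each run fails independently with the same probability and that two bad runs never agree with each other.

```python
from collections import Counter


def verified_residue(residues):
    """Return the first residue reported by two independent runs, else None."""
    seen = Counter()
    for r in residues:
        seen[r] += 1
        if seen[r] == 2:
            return r
    return None


def mismatch_probability(p_bad):
    """Chance that a first-time/double-check pair disagrees.

    Simplifying assumptions: each run is wrong independently with
    probability p_bad, and two wrong runs never produce the same residue.
    """
    return 1 - (1 - p_bad) ** 2


# Made-up residues: first-time run, a bad double-check, then a triple-check.
runs = ["9F3A", "11C0", "9F3A"]
print(verified_residue(runs))
print(round(mismatch_probability(0.015), 4))
```

Under those assumptions, a per-run error rate of about 1.5% gives a pair mismatch rate of about 3%, which is roughly in line with the figures quoted above.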

Quote:
This one particular error would seem to indicate a problem on the order of the original Pentium floating-point bug.
That might be true _IF_ the cause of the errors was always a faulty CPU instruction result, but actually GIMPS has had that happen only when there was the real Pentium FDIV bug that Intel had to fix. Errors due to faulty memory, or to circuitry that's running too hot, are not the sort of systematic problem across millions of systems like the Pentium bug.

Quote:
So what is the next step?
If your system is at all dusty inside the case, clean it out with compressed-gas dust remover designed specifically for this purpose (follow its instructions carefully). See if this cleaning has cured the problem. If not, then, as above, send us more info on your version, CPU specs, memory, overclocking if any, temperature of CPU during torture tests, and the way you got your prime95.

Quote:
Surely this error narrows the problem down to an extremely small number of possibilities, like a specific CPU instruction using specific registers.
As indicated above, there are lots of other more-likely possibilities.
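
For readers wondering what "Rounding was 0.5, expected less than 0.4" measures: as far as I understand it, the multiplication is done with floating-point FFTs, every output should land very close to a whole number, and the program checks how far the worst one is from the nearest integer. The sketch below shows that idea on a toy product in plain Python. It is not Prime95's code; the helper names are made up, and the 0.4 threshold is simply the one quoted in the error message.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def convolve_with_roundoff(x, y):
    """Multiply two coefficient lists via FFT.

    Returns (rounded integer coefficients, largest distance from an integer).
    """
    n = 1
    while n < len(x) + len(y):
        n *= 2
    fx = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (n - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    real = [v.real / n for v in prod]  # the inverse transform needs a 1/n scale
    rounded = [round(v) for v in real]
    max_err = max(abs(v - r) for v, r in zip(real, rounded))
    return rounded, max_err


# (1 + 2x + 3x^2) * (4 + 5x + 6x^2), lowest-order coefficient first
coeffs, err = convolve_with_roundoff([1, 2, 3], [4, 5, 6])
print(coeffs[:5], "round-off ok:", err < 0.4)
```

On healthy hardware the error in a case like this is tiny. A value near 0.5 means an output was about as far from every integer as it can be, so the program cannot tell which integer was intended, and the arithmetic or the data feeding it cannot be trusted.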

Quote:
I can't see how anything else could possibly cause a rounding error like this, other than cosmic rays randomly flipping bits in the CPU.
... but some of us "grew up" in the computing industry when hardware was much less reliable than it is today, and we've seen all kinds of errors, including some you might not believe could be possible. Heat is a frequent problem!

Go ahead -- try to surprise us ... as long as you give us all your specs as asked above! :-)

Last fiddled with by S485122 on 2008-12-06 at 08:30 Reason: some of the latest ZIP's do not include a stress.txt file.

Old 2008-12-06, 07:33   #4
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA

2^2·3·641 Posts

Quote:
Originally Posted by Unregistered View Post
I've just posted about a similar problem and I'm hoping for an answer as well; I just want to know where this problem comes from.
As above, it can come from several different causes. Send us the details I asked for in my previous post, so we can narrow it down.

Quote:
Most probably RAM, as I get it only on the Blend torture test, not on Small FFTs.
That's consistent with the RAM's being the problem, but other details such as temperatures might point in a different direction. (For instance, your CPU, RAM, etc. may run hotter during the blend tests than during the small FFT tests -- so, heat, not faulty RAM, might be the problem.)

You could try swapping RAM sticks, to see whether that makes a difference.

And clean dust out of the case!

Last fiddled with by cheesehead on 2008-12-06 at 07:41

Old 2008-12-06, 08:31   #5
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA

1E0C_16 Posts

UPDATE:

Quote:
Originally Posted by cheesehead View Post
I've seen others report this, too. But every .zip, and .gz distribution file I've seen for Prime95 does include the stress.txt file
I hadn't seen the p95v257.zip and p64v257.zip distributions. Now that I have, I see that neither includes stress.txt!

Old 2008-12-06, 16:14   #6
Xyzzy
"Mike"
Aug 2002

2·3^3·151 Posts

http://www.mersenneforum.org/showpos...70&postcount=2

Old 2008-12-07, 04:42   #7
Unregistered

2×3×5×173 Posts

I really appreciate your very detailed answer, cheesehead. Thank you. I'll try to answer some of your questions.

Quote:
Originally Posted by cheesehead View Post
May we please have the Prime95 (or mprime) version number, your operating system, plus details of CPU and RAM? How hot does your CPU get during a torture test?
25.73 under Windows Vista Ultimate x64 and 24.14.2 under Ubuntu 8.04 Hardy Heron, also 64-bit.

CPU is an Intel Q9450, which is a Core 2 Quad at 2.66GHz.

I'm skipping the RAM, other than to say I tried three different manufacturers, all of which were 2 x 2GB of DDR2.

The CPU is practically ambient. 27 C. No, I'm not making that up. The motherboard reports at 35 C. The reason the chip is so cold is that (a) the test fails instantaneously and (b) the Tuniq Tower CPU cooler is a royal pain to install but it sure does work.



Quote:
Originally Posted by cheesehead View Post
Indeed the P95v256.zip and P95v257.zip did not include the file. One way you can get the stress.txt file...
The confusion here, and it is not just mine, is that when the report says to look at stress.txt "for more information", it seems like there should be a file called stress.txt created that tells you something very specific about the test that just failed. That is not what stress.txt does. It's a generic document that has absolutely NO additional information about why you're failing or what to do about it. Granted, this was a misunderstanding on my part. However, considering the number of "wtf?" posts about it that turn up in a Google search, that message does seem to be a bit misleading.

Quote:
Originally Posted by cheesehead View Post
Actually, it's simple: most stress-test hardware failures come from a small number of possible causes: too-much-overclocked CPUs, inadequate cooling (including dust-clogging), faulty memory sticks -- that cause a bit error only once in a trillion/zillion times...
Agreed. However, here we seem to have a trend of failures that are happening 100% repeatable at the very beginning of a test. That's different from stress-testing the hell out of your system making weird occasional errors pop up. And the heat issue shouldn't happen instantly. If prime faults out on run 1, that implies lots of things will be going wrong.

Indeed, this was the case for me. I KNEW I had a hardware problem because I kept getting lock ups and blue screens in Windows. I switched to Ubuntu to see if it was a driver issue instead of an actual hardware problem.

And I now have a solution, as well. It was the CPU. I swapped the CPU (requiring a fight with the %$^#*! Tuniq Tower cooler). Problems are gone. I'd already swapped the RAM and the issue stayed. Plus swapping OSes.


Quote:
Originally Posted by cheesehead View Post
The actual data is that about 1-2% of GIMPS prime-testing runs turn out to give a faulty result.
That sounds like a LOT to me. Like you said, most systems aren't tested so harshly. Still, I'm shocked planes aren't falling out of the sky.

Quote:
Originally Posted by cheesehead View Post
That might be true _IF_ the cause of the errors was always a faulty CPU instruction result...
Sorry that I seem to be quoting you responding to something I hadn't said yet, but I think the spirit is the same. The reason I think this can be narrowed down to a single instruction (or thereabouts) is the immediacy of it. We're not talking about after an hour or even a minute of testing. We're talking about rundie with no pause in between.


Quote:
Originally Posted by cheesehead View Post
overclocking if any
I'm dumb, but I'm not so dumb I would post about an error if I was overclocking.

Quote:
Originally Posted by cheesehead View Post
temperature of CPU during torture tests
Like I said, you could damn near cool your beer with it.

Quote:
Originally Posted by cheesehead View Post
and the way you got your prime95.
From mersenne.org

Well, to wrap up, I know I have a bad CPU. And again, thanks for the response. I'll be leaving my system on overnight to make sure it's all good.

-James Ingraham

Old 2008-12-09, 01:06   #8
cheesehead
"Richard B. Woods"
Aug 2002
Wisconsin USA

2^2·3·641 Posts

James,

I'm glad you found your problem. Bad CPUs are neither common enough to list as a first suspicion nor unheard-of. Also uncommon but not unheard-of is to have more than one faulty component, _especially when the failure was too abrupt for the torture test to have stressed other components yet_. I suggest stress-testing your system as hard after replacing the CPU as you would have done originally.

Since you seem more interested in the process of problem-diagnosing than the average poster, I'll go into some detail about my thoughts on this end when I had only your first posting, and what your later additional details mean to me.

Quote:
Originally Posted by Unregistered View Post
I really appreciate your very detailed answer, cheesehead.
You caught the most likely-to-be-wordy responder (me) in a wordier-than-usual mood. Also, because your original post had so few details, I decided to write as though you might be among the many who are rather lost when they first post here.

Because you quoted 3 specific thread numbers, I knew you had already reviewed some past threads here, which marks you as unusual, but the rarity of details you gave about your system could have indicated that you hadn't noticed that we almost always ask for more specific details about systems. So I originally had conflicting impressions of your likely cluefulness.

Quote:
25.73 under Windows Vista Ultimate x64 and 24.14.2 under Ubuntu 8.04 Hardy Heron, also 64-bit.

CPU is an Intel Q9450, which is a Core 2 Quad at 2.66GHz.

I'm skipping the RAM, other than to say I tried three different manufacturers, all of which were 2 x 2GB of DDR2.
Ah -- so you had already tried memory swaps and even an OS swap by the time you first posted here.

Quote:
The CPU is practically ambient. 27 C. No, I'm not making that up.
That's consistent with an instantaneous failure. But your original post said only "this quickly", and some folks might have meant "after only an hour" by that phrase.

Quote:
The motherboard reports at 35 C. The reason the chip is so cold is that (a) the test fails instantaneously and (b) the Tuniq Tower CPU cooler is a royal pain to install but it sure does work.
If you'd mentioned that at first, I wouldn't have gone on so much about heat. :-)

Quote:
The confusion here, and it is not just mine, is that when the report says to look at stress.txt "for more information", it seems like there should be a file called stress.txt created that tells you something very specific about the test that just failed. That is not what stress.txt does. It's a generic document that has absolutely NO additional information about why you're failing or what to do about it.
Some day, one of us non-project-administrators will work up the gumption to volunteer to rewrite stress.txt and a few other documentation bits, so that we won't have to wait a decade for project-founder George Woltman, whose genius seems to lie more in the big-picture, project-concept and code-optimization areas, to do so. :-)

Quote:
However, here we seem to have a trend of failures that are happening 100% repeatable at the very beginning of a test.
... in _your_ case (and from only your original posting we didn't know it was at the very beginning). But since this is not being frequently reported _from others_, it does not indicate, to me, that it's some widespread systematic FDIV-bug-like CPU problem as you seemed to think.

Again, it's perfectly consistent with a faulty CPU in _your_ case -- but not with some fundamental flaw in CPUs in general unless we were getting frequent similar reports from others.

Quote:
That's different from stress-testing the hell out of your system making weird occasional errors pop up. And the heat issue shouldn't happen instantly.
Yes -- so if I'd known that originally, I wouldn't have harped on heat.

Quote:
If prime faults out on run 1, that implies lots of things will be going wrong.
Actually, from my point of view, knowing that the fault is "100% repeatable at the very beginning of a test" _rules out_ a lot of possible causes. I can't say for sure that my original response would've suggested that your CPU might be faulty if you'd mentioned that originally, but that wouldn't have been lower than a second-tier possibility to me, rather than third-tier according to what I saw in the original post.

I'm not trying to criticize your original post here! What I _am_ doing is explaining my thoughts on this end that led me to respond the way I did, and explain why some of your ideas (e.g., widespread or systematic CPU flaws) don't look likely from this end.

Quote:
Indeed, this was the case for me. I KNEW I had a hardware problem because I kept getting lock ups and blue screens in Windows.
... and what _I_ noticed was that you claimed "I know that I really do have a hardware problem", but gave few details (you never mentioned lock ups and blue screens, or "100% repeatable at the very beginning of a test", in your original post) to support that conclusion, so I remained skeptical.

Quote:
And I now have a solution, as well. It was the CPU.


Quote:
That sounds like a LOT to me.
Remember, when you consider the 1-2% GIMPS error rate, that it takes as little as a single erroneous bit to render a month-long L-L result invalid. Furthermore, most of the erroneous results come from repeat-offenders, so the actual percentage of GIMPS-reliable CPUs is well over 99%.

Over 99% of participant computers run one of the most stressful computations they'll ever do with not a single error in over 3*10^15 clock cycles.
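
As a rough sanity check on that cycle count, here is the arithmetic for one assumed case: a core running at 2.66 GHz (the clock speed of the Q9450 mentioned earlier in the thread) for a 30-day run. Both inputs are illustrative assumptions, not GIMPS statistics.

```python
def clock_cycles(freq_hz, days):
    """Total clock cycles elapsed at a constant clock frequency."""
    return freq_hz * days * 24 * 60 * 60


# Assumed inputs: 2.66 GHz clock, 30 days of continuous running.
cycles = clock_cycles(2.66e9, 30)
print(f"{cycles:.2e}")  # prints 6.89e+15
```

That comes to roughly 7*10^15 cycles, comfortably above the 3*10^15 figure, so a month-long run on hardware of that era is in the right range.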

Quote:
Like you said, most systems aren't tested so harshly. Still, I'm shocked planes aren't falling out of the sky.
I doubt there's much critical aviation software written by unpaid volunteer math geeks in their spare time, or many CPUs that haven't met FAA specifications, in aircraft.

Quote:
I'm dumb, but I'm not so dumb I would post about an error if I was overclocking.
We can see that now. :-)

Old 2008-12-09, 23:05   #9
Unregistered

1AC8_16 Posts

Quote:
Originally Posted by cheesehead View Post
So I originally had conflicting impressions of your likely cluefulness.
"Cluefulness" is my new favorite word.


Quote:
Originally Posted by cheesehead View Post
your original post said only "this quickly"
I'm afraid I made a classic mistake. I was so focused on my problem I forgot that everybody else doesn't know every detail. Sorry.


Quote:
Originally Posted by cheesehead View Post
why some of your ideas (e.g., widespread or systematic CPU flaws) don't look likely from this end.
I think I also made the classic mistake of filling in missing details from other threads with my own data. So one of the threads I quoted said "This is occurring after Test 1." I thought Test 1 was the absolute first thing happening. This is not correct. Test 1 happens lots of times. I was filtering all the Google hits I saw on this error message through the lens of "instantaneous fail-out."


Quote:
Originally Posted by cheesehead View Post
Over 99% of participant computers run one of the most stressful computations they'll ever do with not a single error in over 3*10^15 clock cycles.
The same data somehow seems much better when stated like that.


Quote:
Originally Posted by cheesehead View Post
I doubt there's much critical aviation software written by unpaid volunteer math geeks in their spare time, or many CPUs that haven't met FAA specifications, in aircraft.
It's true that the systems designed to handle aircraft or nuke plants are held to a much higher standard than Newegg. Still, the idea that a COMPUTER can look like it's working and then suddenly give me a WRONG COMPUTATION is very frightening to me.

Again, thanks for taking the time to reply.

-James Ingraham

Old 2008-12-10, 04:17   #10
db597
Jan 2003

7·29 Posts

Back in the P4 days, I remember having a system that would give errors unless I underclocked the memory. It turned out to be the motherboard, which was a very cheap budget model. Running at stock speeds doesn't automatically guarantee prime95 stability.

Even with my new system, the RAM is DDR2-1066, rated at 2.2-2.4V. By default, my motherboard chooses to run it at 1.8V. Wrongly detected, I guess, but operating at its stock-rated 1066MHz would cause errors.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.