mersenneforum.org  

2005-02-08, 16:21   #1
Jasmin
 
Feb 2005

2₁₆ Posts
Hardware failure only detected on torture test or also when factoring/LL-testing...?

Hi there...

...I have a simple question, and the thread title pretty much says it all: are hardware failures (miscalculations) only detected when running the torture test, or also when doing "real" work such as factoring, LL testing, etc.?

Hope someone can enlighten me on this...thanks in advance!

Last fiddled with by Jasmin on 2005-02-08 at 16:22

2005-02-08, 21:42   #2
patrik
 
"Patrik Johansson"
Aug 2002
Uppsala, Sweden

1A8₁₆ Posts

Hardware failures are also detected when you run an LL-test, but not all errors are detected.

When an error is detected, the program restarts from the last save file, so a detected error should do no harm. The problem is that if errors are detected, there is a fair chance that there is at least one undetected error as well (and one is enough to spoil the whole computation).

One way of making a better test is to run a few double-checks, wait until the status files are updated, and then see if they match the first tests that were done earlier.
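
To make the point about a single undetected error concrete, here is a minimal Python sketch of the Lucas-Lehmer iteration. It is not Prime95's code: the `flip_at` parameter is a hypothetical knob added only to imitate one hardware miscalculation, and the exponent 127 is chosen just because the run is instant.

```python
def lucas_lehmer_residue(p, flip_at=None):
    """Return the final Lucas-Lehmer value s_(p-2) mod 2^p - 1 for odd prime p.

    flip_at is a hypothetical knob for this sketch: if set, the lowest bit
    of s is flipped after that iteration to imitate one hardware error.
    """
    m = (1 << p) - 1              # the Mersenne number 2^p - 1
    s = 4
    for i in range(p - 2):
        s = (s * s - 2) % m
        if i == flip_at:
            s ^= 1                # simulated one-bit miscalculation
    return s


first_test = lucas_lehmer_residue(127)               # clean run
faulty_run = lucas_lehmer_residue(127, flip_at=10)   # one early bit flip
double_check = lucas_lehmer_residue(127)             # independent rerun

print(first_test == 0)               # True, since 2^127 - 1 is prime
print(double_check == first_test)    # matching tests confirm each other
print(faulty_run == first_test)      # a mismatch exposes the bad run
```

Every later iteration depends on the previous value, so the flipped bit is carried through to the final residue. Comparing the residue of a double-check against the first test is what catches errors that no in-run check noticed.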

2005-02-08, 21:43   #3
PhilF
 
Feb 2005
Colorado

7×67 Posts

Sounds most likely heat related to me, especially if you have a fast P4 or AMD processor. Does your system have an option to monitor the CPU temperature?

My 3.4 GHz P4 Prescott runs under 30 degrees C at idle, but when running a torture test or LL test it averages 50 degrees C. And that is after a lot of work to make the CPU's heat sink very efficient. My first try at installing the heat sink resulted in much higher temperatures, and without monitoring those temperatures I would never have known there was a problem.

-Phil

2005-02-08, 21:48   #4
Digital Concepts
 
Aug 2002

110110₂ Posts

The torture test runs snippets of FFTs with known results, providing the best indication of hardware failure.

With LL, P-1, and trial factoring, the end result is unknown, so prime95 is hamstrung in detecting hardware failures outside of torture testing.

That being said, with LL and P-1 factoring, even though the exact correct result is unknown, when you are expecting a result less than one (or some other bound) and it isn't, then you can report with some certainty that a hardware error has occurred. This check would be expensive to perform on every iteration, but is worthwhile every so often, and that is what prime95 does. Notice that since a check doesn't happen every iteration, an error could have occurred even though a "no error found" status is returned.
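
The kind of bound check described above can be sketched in Python. This is an illustration under stated assumptions, not prime95's implementation: it squares a number with a floating-point FFT, where every output coefficient should land very close to an integer, and treats a large distance from the nearest integer as a sign of trouble. The function names, the 8-bit digit size and the 0.4 threshold are all choices made for this sketch.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def check_roundoff(coeffs, max_error=0.4):
    """Return the worst distance to an integer, or raise if it is too large."""
    worst = max(abs(c - round(c)) for c in coeffs)
    if worst > max_error:
        raise ArithmeticError(f"round-off error {worst:.3f} exceeds {max_error}")
    return worst


def square_with_roundoff_check(x, bits_per_digit=8):
    """Square a non-negative integer via FFT and report the worst round-off."""
    base = 1 << bits_per_digit
    digits = []
    while x:
        digits.append(x & (base - 1))
        x >>= bits_per_digit
    digits = digits or [0]
    n = 1
    while n < 2 * len(digits):      # room for the full product
        n *= 2
    padded = [complex(d) for d in digits] + [0j] * (n - len(digits))
    spectrum = fft(padded)
    raw = fft([v * v for v in spectrum], invert=True)
    coeffs = [v.real / n for v in raw]
    worst = check_roundoff(coeffs)
    result = sum(round(c) << (bits_per_digit * i) for i, c in enumerate(coeffs))
    return result, worst


x = (1 << 127) - 1
result, worst = square_with_roundoff_check(x)
print(result == x * x)      # True: healthy arithmetic gives the exact square

try:
    check_roundoff([3.0, 7.49, 12.0])   # a made-up, damaged coefficient list
except ArithmeticError as err:
    print("detected:", err)
```

The check only proves that something went wrong when the bound is exceeded. A corrupted value that still happens to sit near an integer passes unnoticed, which is why a "no error found" status is not a guarantee.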

Because trial factoring is done totally in CPU/cache (very little RAM/bus access needed), machines that get stressed and fail doing LL/P-1 may still work satisfactorily there.

Hope this helped.

2005-02-09, 08:39   #5
Jasmin
 
Feb 2005

2 Posts

So if I understood you correctly, the torture test basically finds more errors because it can compare against known results, while LL testing, factoring, etc. also find some errors, just not as many as the torture test does...thanks for clearing that up...

That would mean it's potentially possible to send wrong results back to PrimeNet if run on a shaky system...I guess that's what the double-checks are good for? ;)

@PhilF: Thanks, but it's not that I have a failure problem I need help with...I know the procedures and difficulties of overclocking; I am just trying to find the limits of my system. Having reached a possibly stable point, at which I would still run the torture test for several days, I thought that since it is quite likely already stable, I might as well let it do some real prime work. But given these answers, I had better stick to the torture test until it's really stable; I don't want to submit wrong results...

Last fiddled with by Jasmin on 2005-02-09 at 08:40

2005-02-09, 12:24   #6
Boulder
 
May 2003

3·13 Posts

I've found that running two torture tests simultaneously is a very good stress test for HT-capable processors. Just set the affinities to 0 and 1 so that the load is maximal. I usually run the max FPU stress test plus the maximum heat and power consumption test.
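
The post does not say which tool was used to set the affinity. As a rough illustration of what pinning a process to one logical CPU means, here is a Python sketch that assumes a platform where `os.sched_setaffinity` is available (Linux in particular, not the Windows systems discussed in this thread). `pin_to_cpu` is a hypothetical helper name.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to one logical CPU; return the new mask.

    os.sched_setaffinity only exists on some platforms (Linux in
    particular); elsewhere this sketch does nothing and returns None.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})      # pid 0 means "this process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    original = os.sched_getaffinity(0)  # logical CPUs we may run on
    first_cpu = min(original)
    print(pin_to_cpu(first_cpu))        # a one-element set
    os.sched_setaffinity(0, original)   # restore the original mask
```

Two stress-test instances pinned to logical CPUs 0 and 1 of a Hyper-Threaded processor would each call something like this with a different CPU number.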

2005-02-10, 00:35   #7
PhilF
 
Feb 2005
Colorado

469₁₀ Posts

Ok Jasmin, I understand now. I misread your original post and thought you meant your computer was fine until you ran a torture test OR LL test.

Boulder, I don't think you are stressing your HT processor any more by running 2 tests simultaneously. I know that with only one test running you show approximately 50% utilization, but that is normal with XP running on a Hyper Threaded CPU. You really are using 100% of it. It is just that XP thinks you are using 100% of one CPU and very little of the other CPU, so it reports 50%.
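
The 50% figure is simple arithmetic. A tiny Python sketch, assuming the overall figure is just the mean of the per-logical-CPU loads (an assumption for illustration, not a description of how XP actually computes it):

```python
def overall_utilization(per_cpu_loads):
    """Average the load of each logical CPU into one summary percentage."""
    return sum(per_cpu_loads) / len(per_cpu_loads)


print(overall_utilization([100.0, 0.0]))     # 50.0: one busy logical CPU
print(overall_utilization([100.0, 100.0]))   # 100.0: both logical CPUs busy
```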

On the other hand, I just read your post again and I see now that you are running 2 different types of tests simultaneously. That might stress the system further, but probably not by much.

-Phil

Last fiddled with by PhilF on 2005-02-10 at 00:38

2005-02-10, 00:44   #8
Digital Concepts
 
Aug 2002

2·3³ Posts

Quote:
Originally Posted by Boulder
I've found out that running two torture tests simultaneously is a very good stress tester for HT-capable processors. Just set the affinity to 0 and 1 so that the load is maximal. I usually run the max FPU stress test + maximum heat&power consumption test.

I would like to hear more about running two torture tests - even on non-HT processors.

My gut would have said that if one prime95 torture test is coded to stress the system, running two would not be as stressful, due to the possible cooling effect of swapping between processes. But I have heard of two simultaneous torture tests showing a failure when one would not.

I'm running v23.7.1. Does a later version have separate 'max FPU' and 'max heat/power' options?

I believe in stress testing for 24 hours. After that I'd run a couple of double checks (might as well benefit the project if your system is running well). I'd then throw in another double check every month or two.

Anyone want to explain how fast the results of a double check get cleared, and how you can tell if it passed? We have super competent and gracious folks who do it for us over on the [www dot]teamprimerib[dot com] team.

2005-02-12, 01:10   #9
patrik
 
"Patrik Johansson"
Aug 2002
Uppsala, Sweden

2³·53 Posts

Quote:
I'm running v23.7.1. Does a later version have separate 'max FPU' and 'max heat/power' options?

In v23.6.1 and v23.8.1, which I am running, small FFTs and in-place large FFTs are said to give maximum FPU stress and maximum heat and power stress, respectively.

Quote:
Anyone want to explain how fast the results of a double check get cleared, and how you can tell if it passed? We have super competent and gracious folks who do it for us over on the [www dot]teamprimerib[dot com] team.

At the bottom of the Mersenne status page there are links to different status files. They are updated once every 1-2 weeks (at about the same time this page is updated; look at the date at the bottom). The three interesting files, hrf3, lucas_v and bad, are zipped ASCII files (although the "bad" file doesn't have the ".txt" extension).

After a first time test an exponent goes to hrf3. Here you also find non-matching double-checks (along with the first test). Either your double-check or the first time test can be faulty.

When two matching tests are found, both are put/moved into lucas_v. Any non-matching test (which then is bad for certain) goes into bad.
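
The bookkeeping described above can be sketched as a small Python function. This only models the rules as stated in this post; the function name, the dictionary layout and the residue strings are made up for illustration and say nothing about the real file formats.

```python
def classify_exponent(residues):
    """Sort one exponent's LL results into hrf3, lucas_v and bad.

    residues: one final residue (any hashable value) per completed test.
    Assumes at most one residue value is ever matched by a second test.
    """
    counts = {}
    for r in residues:
        counts[r] = counts.get(r, 0) + 1
    verified = [r for r, c in counts.items() if c >= 2]
    if not verified:
        # No two tests agree yet, so every result waits in hrf3.
        return {"hrf3": list(residues), "lucas_v": [], "bad": []}
    good = verified[0]
    return {
        "hrf3": [],
        "lucas_v": [r for r in residues if r == good],   # the matching tests
        "bad": [r for r in residues if r != good],       # certainly wrong
    }


print(classify_exponent(["A1B2"]))                   # first-time test only
print(classify_exponent(["A1B2", "FFEE"]))           # mismatch, still open
print(classify_exponent(["A1B2", "FFEE", "FFEE"]))   # third test settles it
```

In the second call neither result can be trusted yet, which is the case where either the double-check or the first-time test may be the faulty one.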

2005-02-14, 01:36   #10
Digital Concepts
 
Aug 2002

36₁₆ Posts

Thanks Patrik. I looked at my torture test menu and saw what you and Boulder were talking about - guess I haven't added a new machine in a while.

So it would take 1-2 weeks to get results back using my suggestion of running double checks instead of a couple-day-long torture test. Well, that's what I do every month or two to make sure my machines are in top working order. I had to ask about those files because some good-hearted folks on my team (Team Prime Rib) keep track of "latest results" for the rest of us.

2005-02-14, 01:58   #11
geoff
 
Mar 2003
New Zealand

485₁₆ Posts

Quote:
Originally Posted by Digital Concepts
I would like to hear more about running two torture tests - even on non-HT processors.

I have a hyperthreaded 2.4C that at 3.0GHz or a bit higher will run a single instance of the torture test for 24 hours with no errors, but if I run two instances, one will segfault after a few minutes. I have to reduce the overclock a little so that two instances will run for 24 hours.

If I was just doing LL testing then I would turn hyperthreading off and run at 3.0GHz, but for most other projects hyperthreading is worth an extra 20%-25% throughput.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.