mersenneforum.org  

2006-01-04, 02:07   #1
Citrix
Jun 2003
3²·5²·7 Posts
Hardware error

Some of my computers (all of which are identical) report the following error:

Bit: 2176/1842096, ERROR: ROUND OFF (0.465637207) > 0.40
Possible hardware failure, consult the readme file.
Continuing from last save file.

What does it mean? Second, what should I do? Third, are the residues from the same machines that were not affected by the error correct or not? Also, for n=1842096, since the machine started from scratch, is the residue trustworthy?

Thanks,
Citrix
edit: I was using LLRnet.

Last fiddled with by Citrix on 2006-01-04 at 02:08

2006-01-04, 07:39   #2
Jean Penné
May 2004
FRANCE
3·11·17 Posts
This is very probably harmless...

This is very probably not a hardware error, and if that is the case, it is harmless.
Another LLR user encountered this problem in November 2005.
I then had a discussion with George Woltman about it.

George wrote:

"Hi,

At 10:00 AM 11/22/2005, Darren Bedwell wrote:
Bit: 291584/410094, ERROR: ROUND OFF (0.4375) > 0.40
Possible hardware failure, consult the readme file.
Continuing from last save file.

The error is harmless. It means that you are testing a number near the
limit of this FFT size. If the 0.4375 was 0.5 or 0.4999... then the cause
is a hardware problem.

LLR should handle this more gracefully -- not entering an infinite
loop resuming from the last save file."

That is what I will do in the next release:
If the exact same error is reproduced at the exact same bit number, the program will say "disregard that error message, all is OK" and continue.
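
The planned behaviour described above could be sketched roughly as follows. This is only an illustration in Python, not LLR's actual code: `RoundOffError`, `run_from_save` and the `max_restarts` limit are hypothetical names and choices made up for this sketch.

```python
class RoundOffError(Exception):
    """Hypothetical exception: round-off above the threshold at a given bit."""

    def __init__(self, bit, value):
        super().__init__(f"ROUND OFF ({value}) at bit {bit}")
        self.bit = bit
        self.value = value


def run_with_reproducibility_check(run_from_save, max_restarts=5):
    """Restart from the save file on a round-off error.

    If the exact same value shows up at the exact same bit twice in a row,
    the error is treated as reproducible and is ignored from then on.

    run_from_save(ignore) is a hypothetical callable that resumes the test
    and raises RoundOffError unless (bit, value) is in the `ignore` set.
    """
    ignore = set()
    last = None
    for _ in range(max_restarts):
        try:
            return run_from_save(ignore)
        except RoundOffError as err:
            current = (err.bit, err.value)
            if current == last:
                # Same value at the same bit twice: not a random glitch.
                ignore.add(current)
            last = current
    raise RuntimeError("round-off errors were not reproducible; suspect hardware")
```

The point of the design is that a deterministic round-off repeats exactly, while a flaky machine tends to fail at different bits or with different values on each attempt.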

For now, if the error occurs on an SSE2 machine, you can work around it by
setting "CpuSupportsSSE2=0" in the .ini file and redoing the test on the problematic number; the x87 gwnums code being totally different, it will work.
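
As a small sketch of the workaround, the helper below sets `CpuSupportsSSE2=0` in the text of an .ini file. It assumes the file is a flat list of `key=value` lines without section headers, which the thread does not confirm; the function name and the `SomeOtherOption` line are invented for the example.

```python
def set_ini_option(text, key, value):
    """Return `text` with `key=value` set.

    An existing `key=...` line is replaced (exact key match);
    otherwise a new line is appended at the end.
    """
    lines = text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Made-up file content; "SomeOtherOption" is a placeholder, not a real setting.
example = "SomeOtherOption=1\nCpuSupportsSSE2=1\n"
updated = set_ini_option(example, "CpuSupportsSSE2", 0)
```

After the problematic number has been retested, the line can be removed again so that the faster SSE2 code is used for the other tests.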

Note that this error does not affect any other test done on the same machine.

I am sorry for the inconvenience...

Jean

2006-01-04, 23:56   #3
japelprime
"Erling B."
Dec 2005
1148 Posts
LLR362

So I am not the only one with this problem.

Bit: 1825443/1825449, ERROR: ROUND OFF (0.4999542236) > 0.40
Possible hardware failure, consult the readme file.


and I have had this one too....

Bit: 478553/1825215, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 3300737325341117 != 1080654913947812
Possible hardware failure, consult the readme file.

Then I changed the AMD CPU on the motherboard, without any success.
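
George's rule of thumb quoted earlier in the thread (a value of 0.5 or 0.4999... points to hardware, a lower value to the FFT size limit) can be applied to these log lines with a short script. This is a sketch only: the regular expression is written against the messages posted here, and the cutoff of 0.4999 is an assumption, since the thread gives no exact number. The verdict is a hint, not a diagnosis.

```python
import re

# Pattern written against the round-off messages posted in this thread.
ROUND_OFF = re.compile(
    r"Bit: (\d+)/(\d+), ERROR: ROUND OFF \(([0-9.]+)\) > [0-9.]+"
)

# Assumed cutoff for "0.5 or 0.4999..."; the thread gives no exact number.
HARDWARE_CUTOFF = 0.4999


def classify_round_off(line):
    """Return (bit, total_bits, value, verdict), or None if the line is
    not a round-off error message."""
    m = ROUND_OFF.search(line)
    if m is None:
        return None
    bit, total = int(m.group(1)), int(m.group(2))
    value = float(m.group(3))
    if value >= HARDWARE_CUTOFF:
        verdict = "suspect hardware"
    else:
        verdict = "probably FFT size limit"
    return bit, total, value, verdict


for msg in (
    "Bit: 2176/1842096, ERROR: ROUND OFF (0.465637207) > 0.40",
    "Bit: 291584/410094, ERROR: ROUND OFF (0.4375) > 0.40",
    "Bit: 1825443/1825449, ERROR: ROUND OFF (0.4999542236) > 0.40",
):
    print(classify_round_off(msg))
```

Under this rule the 0.4999542236 value above falls on the hardware side. The SUM(INPUTS) != SUM(OUTPUTS) message is a different check and is not covered by the sketch.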

2006-01-05, 00:36   #4
Citrix
Jun 2003
3047₈ Posts

What if it shows up 28 times in 700 residues and, on double check, some of the residues do not match those from machines that do not show this error?

Citrix

2006-05-24, 08:21   #5
Citrix
Jun 2003
3²×5²×7 Posts

Lars,

What happened to the residues with this error that I emailed you a few months back? Were they double checked? Are the computers safe to put more ranges on?

Do they still need double/triple checking? If you could post the numbers, I am sure someone will volunteer to double check them.

Thanks!

2006-05-24, 09:35   #6
hhh
Jun 2005
2²×3×31 Posts

Sure I'll do it. But Lars is on vacation now. Do you still have these numbers?

BUT: The round-off error is, from what I read, really harmless. The real mistake is calling it an error.
The program expects a value, and if the result does not quite match its expectations, it restarts from the save file to see whether the same result occurs again. If it does, everything is fine, and the program can interpret the result correctly.
Something like this. It's related to the FFT thresholds. You could increase the FFT length and not get the error, but the computing time would go up. So it is better to restart from the save file from time to time and still finish the test sooner.
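
The trade-off between FFT length and round-off can be seen in a toy model. The sketch below multiplies two integers with a plain recursive FFT in double precision and reports how far the results were from whole numbers before rounding. It is not how gwnum or LLR is implemented; all names are made up for the illustration. Packing more bits into each word gives a shorter (faster) transform but a larger round-off, which is the quantity the program compares against 0.40.

```python
import cmath
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def to_words(v, base):
    """Split a non-negative integer into little-endian words below `base`."""
    out = []
    while v:
        out.append(v % base)
        v //= base
    return out or [0]


def multiply_with_round_off(x, y, bits_per_word):
    """Multiply x and y via FFT using words of `bits_per_word` bits.

    Returns (product, max_round_off), where max_round_off is the largest
    distance of any convolution output from the nearest integer.
    """
    base = 1 << bits_per_word
    a, b = to_words(x, base), to_words(y, base)
    n = 1
    while n < len(a) + len(b):
        n *= 2
    fa = fft(a + [0] * (n - len(a)))
    fb = fft(b + [0] * (n - len(b)))
    conv = fft([p * q for p, q in zip(fa, fb)], invert=True)
    max_err, product = 0.0, 0
    for i, c in enumerate(conv):
        real = c.real / n  # the inverse transform needs the 1/n scaling
        nearest = round(real)
        max_err = max(max_err, abs(real - nearest))
        product += nearest << (bits_per_word * i)
    return product, max_err


random.seed(1)
x, y = random.getrandbits(2048), random.getrandbits(2048)
for bits in (8, 16):
    product, err = multiply_with_round_off(x, y, bits)
    print(bits, product == x * y, err)
```

As long as the round-off stays well below 0.5, rounding recovers the exact result. A value that is close to the limit but repeats exactly on a rerun is what the posts above describe as harmless.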
Don't worry. H.

2006-05-24, 18:11   #7
Citrix
Jun 2003
627₁₆ Posts

Quote:
Originally Posted by hhh
Sure I'll do it. But Lars is on vacation now. Do you still have these numbers?

BUT: The round-off error is, from what I read, really harmless.

I will look for the numbers. (But there is no point in doing them if Lars already did them.)
Lars does not go on vacation for another 3 days.

I did some double checks myself, and more than half of the residues from the computers producing the errors failed, irrespective of whether the error occurred or not.
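
A double check of this kind comes down to comparing the residue from two independent runs of the same n. A minimal sketch, assuming each run has been loaded into a dictionary mapping n to a hexadecimal residue string (the layout of res.txt is not shown in the thread, so no parser is attempted, and the values below are made up):

```python
def compare_residues(first_pass, second_pass):
    """Compare two {n: residue} mappings.

    Residues are compared as hexadecimal strings, ignoring case and
    surrounding whitespace. Returns (mismatched, unchecked): the n values
    whose residues differ, and the n values present in only one run.
    """
    common = first_pass.keys() & second_pass.keys()
    mismatched = sorted(
        n for n in common
        if first_pass[n].strip().lower() != second_pass[n].strip().lower()
    )
    unchecked = sorted(first_pass.keys() ^ second_pass.keys())
    return mismatched, unchecked


# Made-up exponents and residues, for illustration only.
suspect_machine = {
    1000001: "00000000DEADBEEF",
    1000002: "0123456789ABCDEF",
    1000003: "FFFF000011112222",
}
double_check = {
    1000001: "00000000deadbeef",
    1000002: "0123456789ABCDE0",
}
mismatched, unchecked = compare_residues(suspect_machine, double_check)
```

A mismatch only shows that one of the two runs is wrong; a third run is needed to tell which one.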

For residues to double check, see file attached.
Attached Files
File Type: txt res.txt (23.5 KB, 82 views)

Last fiddled with by Citrix on 2006-05-24 at 18:18

2006-05-24, 18:47   #8
ltd
Apr 2003
2²×193 Posts

I have not done the triple checks for your numbers so far.
I would not recommend putting these machines back to work, as they produced wrong results when there was no warning from the client.

At the moment I have 100 triple checks in my queue, but I don't know when I will find the time (and resources) to finish them.

Most of the checks are needed because one residue was done with PRP or an old llrnet version and the other one with the new llrnet.

Lars

2006-05-24, 19:16   #9
Citrix
Jun 2003
3²×5²×7 Posts

If you post the numbers, I am sure we can divide the work.

As for these numbers, all of them were done on the new LLR client. (I never used PRP or the old llrclient for them.)

2006-05-24, 19:54   #10
ltd
Apr 2003
2²·193 Posts

The different client is the reason for most of the other "errors"; it is not specific to your ranges.

The attached file contains all tests that need a third test.
The client used for these tests should be llr3.5 or later, or llrnet3.5.
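
One simple way to settle a test after a third run is to accept the residue that at least two runs agree on. The thread does not say how the project actually decides, so the sketch below is only an assumed "two matching residues" rule with a made-up function name.

```python
from collections import Counter


def resolve_by_third_test(residues):
    """Return the residue that at least two runs agree on, or None.

    `residues` is the list of residues reported for one candidate by
    independent runs; they are compared case-insensitively.
    """
    counts = Counter(r.strip().lower() for r in residues)
    if not counts:
        return None
    residue, votes = counts.most_common(1)[0]
    return residue if votes >= 2 else None
```

If all three residues differ, the function returns None and the candidate needs yet another test, preferably on a machine that has passed a torture test.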

Cheers,

Lars
Attached Files
File Type: txt test.txt (1.6 KB, 98 views)

Last fiddled with by ltd on 2006-05-24 at 19:55

2006-05-24, 20:01   #11
hhh
Jun 2005
2²·3·31 Posts

So it seems to be an issue with the machines, not with the round-off error, which is not an error but more a sign that the program is working correctly.

PM me a file with those numbers, and I'll put them on my P4 on Friday. It ran stable last time.

Why don't you run a torture test on all of your systems, and then try swapping hardware parts among the faulty ones?

H.

Edit: I started the tests today. Guess it'll take a week or so. H.

Last fiddled with by hhh on 2006-05-26 at 12:57

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.