mersenneforum.org > Great Internet Mersenne Prime Search > Software

View Poll Results: Faster LL or more error checking?
Yes, faster is better. 16 30.77%
No, faster LL isn't worth the lost error checking. 18 34.62%
Make it a user option. 17 32.69%
No opinion, instead reprogram the server to assign me the 48th Mersenne prime. 1 1.92%
Voters: 52

Old 2010-06-05, 13:25   #45
axn

Quote:
Originally Posted by ATH View Post
Those that won't be flagged suspect if it's turned off should be those with code 000000aa and xx0000aa: those that have only SUM(INPUTS) != SUM(OUTPUTS) errors, and those which also have the reproducible errors.
Flagging a suspect test is like closing the stable door after the horse has bolted. A better criterion would be: those tests that were saved from corruption. Hence, analyse those tests where at least one non-reproducible SUMIN error occurred -- xxyyzzaa where xx < aa, not just xx0000aa (it would have been better if the "reproducibles" were broken down into individual error categories).
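axn's selection criterion can be written as a one-line predicate. A quick sketch (the function name is mine; the xxyyzzaa field layout is the one ATH quotes later in the thread):

```python
def saved_from_corruption(code: str) -> bool:
    """True if a test logged at least one NON-reproducible
    SUM(INPUTS) != SUM(OUTPUTS) error, i.e. xx < aa in the
    xxyyzzaa hex error code."""
    n = int(code, 16)
    xx = (n >> 24) & 0xFF   # reproducible "errors"
    aa = n & 0xFF           # SUM(INPUTS) != SUM(OUTPUTS) errors
    return xx < aa

print(saved_from_corruption("00000002"))  # both SUMIN errors non-reproducible
print(saved_from_corruption("01000001"))  # the one SUMIN error was reproducible
```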

Last fiddled with by axn on 2010-06-05 at 13:28
Old 2010-06-05, 17:44   #46
ATH

I was looking at how many of those tests would not have been flagged suspect if George had removed the SUM(INPUTS) != SUM(OUTPUTS) check earlier: the figure was about 1% of the total tests, and 59% of those turned out to be bad tests.
Old 2010-06-06, 02:39   #47
Rhyled

Quote:
Originally Posted by Prime95 View Post
Raw error code data from LL tests between 18M and 22M is attached.
Don't remove that check, or we'll all be sorry. Unfortunately, all it takes is a single error in the LL chain to invalidate the final result.

Using the best assumption from the raw data - the 1075 bad results out of 186096 tests where the only error detected was SUMIN<>SUMOUT - that's a 0.58% error rate that wouldn't be detected under the new logic. It sounds like a fair trade for a 2-3% performance increase, but it isn't.

The problem is that the raw data averages around a 20M exponent. As with compound interest, as the count goes up, so does the chance of an error occurring. At the 335+M exponent level (which a bunch of people are chasing for some reason), the chance of success drops to 90.75%. That is NOT worth a 2-3% performance boost.

I'd need to experiment more with the math, because it might be seriously worse. As exponents increase, the amount of LL computation increases as the square of the exponent (each iteration gets more complex due to larger FFTs, and there are more iterations). If the risk of undetected error also increases by a squared term, the chance of success drops to < 20%.

This involves extrapolating far beyond the data set, which always worries me as far as accuracy goes. Perhaps a more pure-math approach would give a different result, but I don't see the risk being worth a 2-3% performance boost.
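Rhyled's compounding argument can be reproduced in a few lines. A sketch under one reading of it: treat 0.58% as the per-test undetected-error rate at a ~20M exponent, and compound it over the 335M/20M ratio (or over that ratio squared for the pessimistic case). The scaling assumptions are the post's; the code is only an illustration.

```python
# Hedged reconstruction of the extrapolation described above.
bad, total = 1075, 186096
p = bad / total                  # ~0.58% undetected-error rate near 20M
scale = 335 / 20                 # a 335M exponent vs. the ~20M average

success_linear = (1 - p) ** scale          # risk grows linearly with exponent
success_quadratic = (1 - p) ** (scale**2)  # risk grows as the square

print(f"success at 335M, linear risk:    {success_linear:.2%}")
print(f"success at 335M, quadratic risk: {success_quadratic:.2%}")
```

The linear case reproduces the 90.75% figure in the post, and the squared case lands just under 20%, matching the "< 20%" estimate.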

I have this data in an Excel 2007 spreadsheet, but apparently can't upload that file type.
Old 2010-06-06, 03:06   #48
Prime95

Quote:
Originally Posted by Rhyled View Post
As with compound interest, as the count goes up, so does the chance of an error occurring.
I think the flaw in this reasoning is that you are assuming sumout errors are randomly distributed. I don't have hard statistics to back this up, but I think 90%+ of machines never get an error. Wouldn't GIMPS be better off if all these healthy machines were producing 3% more results?
Old 2010-06-06, 07:49   #49
only_human

I have hand-examined the error text file in a text editor. I did something similar back in 2003 in this thread: Most popular error codes

I think there was some good error analysis between then and now, and maybe even charts, but I couldn't remember or find the details.

Looking at raw_error_data.txt in a text editor, I noticed that when the reproducible count matches the rounding-error count and no other errors occurred, a good result was returned in every case but one (this is strictly a visual inspection; assuming I didn't miss any, the only bad result of this type was the line "2 07000700 1"). Some entries had very high error counts and still returned good results when the errors were strictly reproducible rounding errors.

Removing the lines that ended with the "aa" field equal to 00 left only the error lines with at least one detected SUM(INPUTS) != SUM(OUTPUTS) error. This dropped the line count from 1634 to 1276.

I followed this notation from earlier in the thread:
Quote:
The second column is an 8-byte hex error code (xxyyzzaa) that consists of 4 values:

aa: Number of SUM(INPUTS) != SUM(OUTPUTS) errors
zz: Number of Roundoff > 0.4 errors
yy: Number of ILLEGAL SUMOUT errors
xx: Number of "errors" that were reproducible (i.e not an error)
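That notation can be decoded mechanically. A minimal sketch (the field names are mine) for splitting an 8-hex-digit code into its four one-byte counts:

```python
def decode_error_code(code: str) -> dict:
    """Split an xxyyzzaa hex error code into its four one-byte counters."""
    n = int(code, 16)
    return {
        "reproducible":   (n >> 24) & 0xFF,  # xx
        "illegal_sumout": (n >> 16) & 0xFF,  # yy
        "roundoff":       (n >> 8) & 0xFF,   # zz
        "sum_mismatch":   n & 0xFF,          # aa
    }

# e.g. the mysterious good result mentioned at the end of this post:
print(decode_error_code("0285FF35"))
```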
The next thing I was curious about was getting a grip on reproducible errors, so I looked at the remaining lines that had any reproducible errors whatsoever, also had some SUM(INPUTS) != SUM(OUTPUTS) errors, and yet still produced a good result. There were 40 of these spread over 17 values:
Code:
1	01000001	24
1	01000102	2
1	01000203	1
1	01000401	1
1	01000501	1
1	01000810	1
1	01000901	1
1	01001F0D	1
1	01002B01	1
1	0100543D	1
1	01007408	1
1	01010204	1
1	02000002	2
1	02000301	1
1	0285FF35	1
1	03000601	1
1	04010F04	1
1	09000909	1
As you can see, these are very few entries. The lines in red were those with strictly only reproducible SUM(INPUTS) != SUM(OUTPUTS) errors (and no other errors) that nevertheless returned a good result. From this I conclude that reproducibility is not very important for returning a good result after detected SUM(INPUTS) != SUM(OUTPUTS) errors.

I also looked at how many bad results had some reproducible errors and also some SUM(INPUTS) != SUM(OUTPUTS) errors. Among these, there were no bad-result cases with a reproducible error count exactly matching the SUM(INPUTS) != SUM(OUTPUTS) error count. This means that all bad results that had any kind of reproducible errors and also had SUM(INPUTS) != SUM(OUTPUTS) errors also had other errors. There were 155 bad-result lines with SUM(INPUTS) != SUM(OUTPUTS) errors involved, and most code values occurred only once.

I don't infer much from all this, but perhaps it is helpful anyway. Not finding much meat in the reproducible errors with regard to SUM(INPUTS) != SUM(OUTPUTS) errors, I moved on to the SUM(INPUTS) != SUM(OUTPUTS) errors that have no reproducible count. When no other errors occurred, there were only 33 good results (total) for SUM(INPUTS) != SUM(OUTPUTS) error counts of 10 or more. These are the lines that had fewer than 10 of these errors and a good result:
Code:
1	00000001	486
1	00000002	109
1	00000003	39
1	00000004	24
1	00000005	12
1	00000006	9
1	00000007	11
1	00000008	5
1	00000009	5
The corresponding lines with bad results show that there are more of them. They also drop off more slowly (I assume because higher error counts are more likely to return bad results).
Code:
2	00000001	386
2	00000002	154
2	00000003	90
2	00000004	77
2	00000005	37
2	00000006	27
2	00000007	28
2	00000008	31
2	00000009	25
I recall from overheating hardware that I could get several of these errors in a day or so; if this particular error check remains an option, it might be helpful to emphasize that very few good results occur after several of these errors. There are a few mysterious cases with high combinations of errors that nevertheless turned in a good result (e.g. 0285FF35).
Old 2010-06-06, 08:12   #50
RMAC9.5

George believes, and is probably correct, that
Quote:
I think 90%+ machines never get an error.
The first problem I see is: how do we identify the <10% of machines that are unhealthy -- those that get "weak"-memory and/or overclocking errors -- if we remove this error check? The second problem is: how do we keep these unhealthy machines from submitting lots of bad LL results if we can't identify them for two or three years (i.e. until the corresponding DCs are done)? Finally, can any of the math wizards on this forum calculate how many extra triple checks it will take to offset the 3 to 4% speed increase that George has discovered?

Maybe the best solution is to implement this, in the beginning, only for DC tests. This would help to keep the DC tests from falling further behind the more popular LL tests, and it would enable the server to identify and keep track of the healthy, reliable machines. Once these healthy, reliable machines were identified, the server could offer their users the faster client.
Old 2010-06-06, 09:44   #51
only_human

Well, some errors are going to be caught anyway, as I mentioned here:
Quote:
I also looked at how many bad results had some reproducible errors and also some SUM(INPUTS) != SUM(OUTPUTS) errors. Among these, there were no bad-result cases with a reproducible error count exactly matching the SUM(INPUTS) != SUM(OUTPUTS) error count. This means that all bad results that had any kind of reproducible errors and also had SUM(INPUTS) != SUM(OUTPUTS) errors also had other errors. There were 155 bad-result lines with SUM(INPUTS) != SUM(OUTPUTS) errors involved, and most code values occurred only once.
Also, as for the magnitude of the problem, note that many more bad results have no indication of any error at all (many more than those called out for SUM(INPUTS) != SUM(OUTPUTS) errors). As the line "2 00000000 3499" indicates, 3499 bad results had no error indication at all.

Last fiddled with by only_human on 2010-06-06 at 09:59 Reason: snuck an extra "bad" into bold sentence for clarity
Old 2010-06-06, 12:43   #52
lycorn

Quote:
Originally Posted by Prime95 View Post
Yes, prime95 does recover from a SUM(INPUTS) != SUM(OUTPUTS) error.

The problem is that machines that generate these hardware errors are likely to get undetected hardware errors that ruin the final result, thus forcing a triple check.
This is, for me, the strongest argument in favour of keeping the error checking: the capability of recovering from this type of error would be lost.
But...

1. Many bad results are returned with zero error code even now, with the error checking in place.
2. Many (the vast majority) of machines are reliable.


All things said, my proposal is to have the error-checking feature as an option (turned off by default), so the more cautious people could turn it on, particularly while not yet comfortable with the reliability of a given machine. This could be done upon putting a new machine into service, or if some machine, so far reliable, started giving out bad results with a zero error code.
But I am really happy with dropping this check on machines that are proven reliable. A 3-4% improvement is a nice figure, and we may turn the error checking on whenever we have any reason to feel suspicious about the reliability of a machine.
I also propose that, upon Prime95 installation, the program suggest that people start by doing a couple of double checks to gain (or not...) confidence in their machine's reliability before moving on to first-time LLs. This would of course be just a suggestion; people would be free to jump straight into LL if they so wish. But I think it would be beneficial to the project, by helping to provide an early reliability verification and, as a side effect, a little push on the DC trailing edge.
Old 2010-06-06, 13:41   #53
S485122

Quote:
Originally Posted by lycorn View Post
All things said, my proposal is to have the error-checking feature as an option (turned off by default), so the more cautious people could turn it on, particularly while not yet comfortable with the reliability of a given machine. This could be done upon putting a new machine into service, or if some machine, so far reliable, started giving out bad results with a zero error code.
I had a completely reliable machine before and after... a piece of mobile phone software was installed. But during the month or so that my daughter had that program installed, only bad results were returned! (But there were no error codes, so this does not apply to the current discussion.)

Jacob
Old 2010-06-06, 21:12   #54
mdettweiler

Quote:
Originally Posted by RMAC9.5 View Post
Maybe the best solution is to implement this, in the beginning, only for DC tests. This would help to keep the DC tests from falling further behind the more popular LL tests, and it would enable the server to identify and keep track of the healthy, reliable machines. Once these healthy, reliable machines were identified, the server could offer their users the faster client.
Hmm... what about a variation on this idea? The server already keeps track of machines' reliability based on the number of detected errors they report. So it should be somewhat easy to do this:

-Have two code paths in the software: one that uses the faster code without error checking, and the other that's 3% slower but can detect SUM(INPUTS) != SUM(OUTPUTS) errors.
-Initially, the software always keeps the error checking on.
-Once the server decides that the machine has accumulated a decent reliability record, it would transmit that to the client at communications. (It may already do this; I don't know.)
-The client would then follow the faster code path, confident that it is not prone to SUM(INPUTS) != SUM(OUTPUTS) errors.
-For those doing non-PrimeNet testing (and therefore not having the benefit of the server's confirmation that it's OK to turn off error checking), a "hidden" option could be added to one of the configuration files that forces error checking off. There would NOT be a GUI option for this, to discourage people from manually forcing their clients to use the less reliable code without first confirming that their machine is stable.
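The decision logic in the steps above might look something like this sketch. The flag and parameter names are hypothetical illustrations, not real prime95 or PrimeNet settings:

```python
def select_ll_path(server_marked_reliable: bool,
                   hidden_skip_option: bool,
                   using_primenet: bool) -> str:
    """Pick 'checked' (slower, SUM(INPUTS) != SUM(OUTPUTS) checking on)
    or 'fast' (checking off), per the scheme sketched above."""
    if using_primenet:
        # The server vouches for a machine only after it builds a record.
        return "fast" if server_marked_reliable else "checked"
    # Non-PrimeNet testers: only the hidden config-file option disables it.
    return "fast" if hidden_skip_option else "checked"

print(select_ll_path(False, False, True))   # new PrimeNet machine
print(select_ll_path(True,  False, True))   # proven-reliable machine
print(select_ll_path(False, False, False))  # manual tester, defaults
```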

Last fiddled with by mdettweiler on 2010-06-06 at 21:13
Old 2010-06-06, 21:36   #55
Rhyled
Include in Torture Test - optional otherwise

Quote:
Originally Posted by Prime95 View Post
I think the flaw in this reasoning is that you are assuming sumout errors are randomly distributed. I don't have hard statistics to back this up, but I think 90%+ of machines never get an error. Wouldn't GIMPS be better off if all these healthy machines were producing 3% more results?
I stand corrected. If these errors only occur due to hardware overclocking, then the statistics change dramatically. In that case, only the unstable overclocked systems will trigger errors, with or without the SUM checks. Stable systems can skip the checks.

New suggestion: include the SUM check in the Torture Test section, where overclockers want to know about errors and don't really care about Prime95 performance other than the ability to stress the CPU/memory. Make it an option (like round-off checking currently is) for those who may or may not want to run it.