Curiosity: Some errors more common than others
I noticed the following in my results.txt file:
[code]Iteration: 17844238/34XXXXXX, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.[/code]
I noticed that the specific round off error 0.40625 has occurred three times in my results.txt file. Is there a reason why that specific number occurs more often than others? Also, is there something I can do to avoid these errors? My worry is: does Prime95 catch most of these errors when they are made?
These errors occur when the calculations are near size boundaries for the DWT (discrete weighted transform); when one occurs, the iteration is redone in a slightly different way.
Assuming the code has been carefully written and the FFT-size breakpoints appropriately chosen, a typical plot of roundoff errors (not just the ones > 0.4 that get reported) will show something like a Poisson distribution, with most RO errors clustered between (say) 0 and 0.1, and the frequency decreasing rapidly (IIRC quasi-exponentially) for larger errors. Of course errors near 0.5 tend to be of form a/2^k, with k some fairly small integer, i.e. you'll see a very coarse distribution there (Think of discrete errors as being like English socket wrench sizes). Those two effects combined explain why near an FFT threshold you tend to see some 13/32 = 0.40625 errors (often quite a few of these), hopefully relatively few 7/16 = 0.4375s, and yes, an occasional 15/32 = 0.46875 or the deadly 1/2 = 0.5.
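A quick way to see the "socket wrench" effect: enumerating the dyadic fractions a/2^k with small k that land in the reported range [0.4, 0.5] yields exactly the values named above. A minimal Python sketch (illustrative only; the function name and the k cutoff are my own choices, not anything from Prime95):

```python
from fractions import Fraction

def dyadic_in_range(lo, hi, max_k=5):
    """Collect the fractions a/2^k, k <= max_k, lying in [lo, hi]."""
    found = set()
    for k in range(1, max_k + 1):
        for a in range(2 ** k + 1):
            f = Fraction(a, 2 ** k)
            if lo <= f <= hi:
                found.add(f)
    return sorted(found)

# The only coarse error values between 0.4 and 0.5 with denominator <= 32:
for f in dyadic_in_range(Fraction(2, 5), Fraction(1, 2)):
    print(f, float(f))  # 13/32, 7/16, 15/32, 1/2
```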
|
It still sounds like these errors, whether a hardware problem or not, are still bad. Do I know that all (or at least 99.99% of) errors are found? Is it possible to avoid these errors?
|
[QUOTE=Unregistered]It still sounds like these errors, whether a hardware problem or not, are still bad. Do I know that all (or at least 99.99% of) errors are found? Is it possible to avoid these errors?[/QUOTE]
These errors are not bad, per se. If you can run a test 7% faster but occasionally need to rerun part of it (costing less than 2% of the total time; these numbers are pure hand-waving), you come out ahead on time. This is what happens near the breakpoints: the shorter FFT length is faster, but the errors can pop up (and they can be dealt with). There are other threads that deal with this issue; look down in this thread: [URL="http://www.mersenneforum.org/showthread.php?t=3387&highlight=error+reproducible"]http://www.mersenneforum.org/showthread.php?t=3387&highlight=error+reproducible[/URL] What you may want to do is shorten the time between save file writes.
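To make the hand-waving concrete: with the made-up numbers above, the shorter FFT still wins even after paying for reruns.

```python
# Illustrative numbers only, taken straight from the hand-waving above:
# the short FFT is ~7% faster, and error-triggered reruns add ~2% overhead.
speedup = 0.07
rerun_overhead = 0.02

# Time relative to the longer (error-free) FFT length, normalized to 1.0:
short_fft_time = (1 - speedup) * (1 + rerun_overhead)
print(short_fft_time)  # about 0.949, i.e. still ~5% faster overall
```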
Actually, those errors are not bad. If you were to look at the file containing the results returned for completed exponents near the one you are testing, you would see from the error codes that many tests on exponents near the upper limit of an FFT size produce these errors.
Here is an excerpt from the file I referred to:

[CODE]34623493,jwdepen,pennsy12,Wc1,03000300
34623521,maekke,meiermb,WZ1,00000000
34623731,pfrakes,DP280336,Wc2,05000500
34623739,bobhinkle,Margo,WZ1,00000000
34623751,edwardsm,C84262AF2,WZ1,00000000
34623839,S62207,C052FA406,Wc2,04000400
34623857,pfrakes,DC240356,Wc2,01000100[/CODE]

The 8-digit number at the end of each line is the error code. The first two digits count how many round off > 0.4 errors occurred, and the 5th and 6th digits count how many of those errors were reproducible. As you can see, four out of the seven tests had round off errors > 0.4, and all were reproducible. Quite normal.

In a way, they are not errors at all. As long as the round off is less than 0.5, then proper rounding will always occur. But George wrote the program so that round off errors should always be less than 0.4, just to be on the safe side. Sometimes, as a result of an "unlucky bit pattern" while testing an exponent that is bumping up against the FFT size limit, the round off might be 0.4125 or even slightly more. That's OK, because it is still below 0.5 and will get rounded properly. But since the program expects all round offs to be less than 0.4, it calls it an error and tries again, starting from the last save file. As long as the second try results in the exact same amount of round off error, we know we are OK.

You could get rid of the errors by forcing the program to use the next larger FFT size, but that would make the test run slower without accomplishing anything. You are better off considering these "errors" as "informational messages".
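As a quick illustration, here is how one might decode the two fields described above in Python (the function name is mine, and only the two documented fields are extracted; the other digits carry information not discussed here):

```python
def decode_error_code(code):
    """Decode the two documented fields of Prime95's 8-digit error code:
    digits 1-2 = number of round off > 0.4 errors,
    digits 5-6 = how many of those errors were reproducible."""
    assert len(code) == 8 and code.isdigit()
    return {"roundoff_gt_0.4": int(code[0:2]),
            "reproducible": int(code[4:6])}

# From the excerpt above: 3 round off errors, all 3 reproducible.
print(decode_error_code("03000300"))
```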
[QUOTE=PhilF]As long as the round off is less than 0.5, then proper rounding will always occur.[/QUOTE]
Not so - assuming one defines the fractional error in a floating number x as frac = abs(x - nint(x)), then the result is always in [0, 0.5]. The danger of frac values near 0.5 is that the closer frac is to 0.5, the greater the chance of an incorrect rounding, i.e. what you think is a fractional error (frac) is really a (1-frac) aliased to (frac) by the above formula, e.g. a true fractional error of 0.6 aliased to 0.4. In my experience, if you only see a few errors of the 0.40625 variety there is little chance of this having occurred, but once one starts seeing errors like 0.4375 (especially more than one of these) one is on dangerous ground. The point of setting an appropriate fractional-error threshold is to reduce the odds of this kind of incorrect rounding to acceptably low levels. If you think about it, it's really quite astonishing that we can routinely get away with setting it as high as 0.4 - that only works due to a combination of very carefully written code and the quasirandom nature of LL-test intermediate residues.
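The aliasing is easy to demonstrate. A minimal sketch of the frac definition above, using Python's round() in place of nint (the function name is mine):

```python
def frac_error(x):
    """Fractional error as defined above: distance from x to its nearest integer."""
    return abs(x - round(x))

# A true error of 0.6 is indistinguishable from a true error of 0.4:
print(frac_error(7.4))  # ~0.4
print(frac_error(7.6))  # also ~0.4 -- the 0.6 has been aliased to 0.4
```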
[QUOTE=ewmayer]Not so - assuming one defines the fractional error in a floating number x as
frac = abs(x - nint(x))[/QUOTE]I stand corrected.[QUOTE=ewmayer]works due to a combination of very carefully written code...[/QUOTE]:bow: Hear, hear! :bow:
I'm the same poster as above.
I looked at my results.txt file, and I have had one error 0.5 which wasn't reproducible, and one error .4375 which was. Would you suggest then that I manually change my fft length to something safer?
[QUOTE=Unregistered]I'm the same poster as above.
I looked at my results.txt file, and I have had one error 0.5 which wasn't reproducible, and one error .4375 which was. Would you suggest then that I manually change my fft length to something safer?[/QUOTE] No - from the sound of it, the 0.5 error was probably a hardware glitch (and was caught, i.e. didn't corrupt the computation), and a small number of reproducible 0.4375 errors is usually not fatal.
[QUOTE=Unregistered]I looked at my results.txt file, and I have had one error 0.5 which wasn't reproducible, and one error .4375 which was. Would you suggest then that I manually change my fft length to something safer?[/QUOTE]No. The 0.5 error was not caused by FFT size. It indicates some sort of hardware problem, especially if you get another one. In general, one-half of the tests that have one unreproducible round off error give incorrect results.
The reproducible errors are fine; they are not causing harm. If I were you, I would start a Prime95 Torture Test on the machine and let it run for at least 24 hours.
Re. the FFT-length issue: Since it wouldn't help anyone else "steal" credit for your work, could you (the original poster) give us at least the next 2 digits of the exponent? According to my calculations, the upper limit for safe testing at 1792K is around 34.5-34.6 million using SSE2 floating arithmetic, slightly higher if using an older x86 without SSE2 support (i.e. 64-mantissa-bit register floats).
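A back-of-the-envelope check of that limit (the exponent here is an assumed value from the quoted 34.5-34.6 million range, not the poster's actual one): near the top of the 1792K range, each double in the FFT carries almost 19 bits of the number, which is what pushes the roundoff toward the threshold.

```python
# Assumed exponent near the quoted SSE2 limit for the 1792K FFT length.
exponent = 34_600_000
fft_length = 1792 * 1024  # 1792K doubles

bits_per_word = exponent / fft_length
print(bits_per_word)  # roughly 18.9 bits packed into each double
```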
|