Error rates

(This originated as https://www.mersenneforum.org/showpo...1&postcount=23 and https://www.mersenneforum.org/showpo...5&postcount=55)

What are typical error rates? The usual figure is about 2% per LL test near the wavefront. That might be from before the addition of the Jacobi check to prime95. It has fluctuated over time as exponents increased, additional code was written, additional bugs were introduced and later fixed, additional error checks were added, etc. It will rise somewhat as running larger exponents takes longer, requiring time roughly proportional to p^2.1 for the same hourly hardware reliability. The probability of an LL test being in error goes up considerably if the error counts accumulated during a prime95 run are nonzero. Even a single illegal sumout error recorded raises the probability of an erroneous final residue to around 40%, if I recall Madpoo's recent post about that correctly. Hardware tends to get less reliable with age.

PRP with the Gerbicz error check (GEC) is much more reliable at producing correct final residues, and in sufficiently recent versions of prime95/mprime or gpuowl it also produces a proof file that allows avoiding over 99.5% of double-check effort, so run PRP with proof whenever possible. PRP/GEC was bulletproof on a very unreliable system I tested it on. It is still possible to have errors in the final residue with PRP and Gerbicz check, but it is unlikely, and it is the best we can do for now. There was a case in prime95 where an error could occur in code outside the reach of the GEC; that code has since been modified to address it. The PRP/GEC overall error rate is thought to be orders of magnitude smaller than the LL/Jacobi-check error rate; it is so low that we do not yet have a sufficient empirical or statistical basis on which to compute it, with only a few known erroneous results after two years of frequent use. In a check of PRP tests on exponents > 50M reported 2019-08-12 to 2021-08-12 (>123,000 verified PRP/GEC or PRP/GEC/proof results), 3 bad results were found, indicating an error rate of ~24 per million PRP tests. At least one of those 3 errors (possibly all 3) was from before the addition of error hardening for the code handling final residues outside the reach of the GEC.

In any event, run self tests such as double checks regularly, at least annually, to check system reliability on these very unforgiving calculations.

Error rate does depend on the software and hardware used. Mlucas, CUDALucas, cllucas and some LL-capable versions of gpuowl do LL and do not have the Jacobi check. The Jacobi check has a 50% chance of detecting an error if one occurs. Hardware with unreliable memory is more error prone. Overclocking too high or overheating increases error rates.
In CUDALucas, there are CUDA levels and GPU models that interact badly, even on highly reliable hardware. These produce errors such as all zeros being returned at some point instead of the usual LL sequence value. If that happens before the subtraction of 2, then FFF...FFD is the result (the equivalent of -2). It gets squared and 2 subtracted, and voila, now you have 000...02, since (-2)^2 - 2 = 2. The residue then stays at 2 for every remaining iteration, because 2^2 - 2 = 2. These sorts of errors can be triggered at will. Some of them, under certain circumstances, have the side effect of making the iterations go much faster than expected. If something seems too good to be true, it probably is. (CUDA 4.0 or 4.1, 1024 threads, or certain fft lengths are typically trouble in CUDALucas, if I recall correctly.) That is an example where the probability of a false positive match between the first and second test may be 100%. More typical would be of order 10^-6 to 10^-12. The CUDALucas 2.06 (May 5, 2017) version has software traps built in for these error residues. There are other modes of error. The recent false positive by CUDALucas 2.05.1 resulted in an interim residue of value zero. I'm guessing that's some failure to copy an array of values. Don't run CUDALucas versions earlier than 2.06, and don't let your friends either.
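Here is a minimal Perl sketch (my illustration for this reference post, not CUDALucas code) of why a residue that collapses to zero before the subtraction of 2 then sticks at 2; it uses Math::BigInt and a deliberately tiny exponent purely for demonstration.
Code:
#!/usr/bin/perl
# Illustration only: an LL residue zeroed before the "-2" step becomes Mp-2
# (the FFF...FFD pattern) and then stays at 2 for all remaining iterations.
use strict;
use warnings;
use Math::BigInt;

my $p  = 61;                               # any Mersenne exponent works for the demo
my $Mp = Math::BigInt->new(2) ** $p - 1;   # Mp = 2^p - 1

my $s = $Mp - 2;                           # a zeroed square, minus 2: Mp - 2, i.e. FFF...FFD
print "bad residue: $s\n";
for my $i (1 .. 4) {
    $s = ($s * $s - 2) % $Mp;              # normal LL step: s := (s^2 - 2) mod Mp
    print "iteration $i: $s\n";            # prints 2 every time; (-2)^2 - 2 = 2, 2^2 - 2 = 2
}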

Other applications also have characteristic error residues. Someone who wanted to use such bugs as the CUDALucas early zero bug to fake finding a prime would be disappointed, as the error would be quickly discovered early in the verification process.
I've created application-specific reference threads for several of the popular GIMPS applications. Most of them have a post with a bug and wishlist tabulation attached, specific to that application.
https://www.mersenneforum.org/forumdisplay.php?f=154
It helps to know what to avoid and how to avoid it.
If you identify any issues that are not listed there yet, please PM me with details.
As such issues are identified, they might be fixable, or code to detect and guard against them could be added, if still of sufficient interest. (Fixing or trapping for CUDA 4.0 or 4.1 issues is not of much interest now, since many GPUs are running at CUDA 8 or above.)

It's common practice for the applications to keep more than one save file, and to be able to restart from one or the other if something is detected to have gone seriously wrong in the preceding minutes of a lengthy run, thereby perhaps saving most of the time already expended. Some users will run duplicate months-long runs side by side on two sets of hardware, periodically comparing interim 64-bit residues, which should match along the way.
Re odds of matching wrong residues:
Given that the number of residues for a completed run of first and second LL checks of all primes below the current mersenne.org limit is about 2n ≈ 50,847,478 × 2 ≈ 101,694,956, while the number of possible unique 64-bit residues is r = 2^64 = 18,446,744,073,709,551,616, with only the zero value indicating a correctly completed LL test of any of the 50-plus Mersenne prime exponents, the chance of one randomly distributed wrong residue coinciding with another randomly distributed wrong residue is very slim. If every prime exponent resulted in one randomly distributed wrong unique residue, the last wrong one, which has the most other residues in the list to dodge and so the highest odds of coinciding, would have a chance of 2n/(r − 2n) ≈ 5.5×10^-12 of coinciding with another residue. If only 2% of residues are wrong and the wrong ones are randomly distributed, that chance drops by 49% to ~2.8×10^-12. The odds of any of the wrong residues coinciding with another residue by random chance are ~0.00014 if every exponent has one wrong randomly distributed residue, ~2.9×10^-6 if 2% have a randomly distributed wrong residue. (Note though that the preceding figures do not account for the run times, and so the error rates, climbing with exponent. Or alternately, progress is assumed to occur roughly in sync with computing speed advances, so that run time and error rate do not grow.)
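For readers who want to plug in their own numbers, here is a short Perl sketch (my addition, not from the original post) that reproduces the ballpark figures above, using simple coincidence counts under the stated assumption of uniformly random wrong residues.
Code:
#!/usr/bin/perl
# Rough coincidence estimates; floating point precision is more than adequate here.
use strict;
use warnings;

my $n = 50_847_478;    # prime exponents below 10^9; one first test plus one double check each
my $r = 2 ** 64;       # number of possible res64 values

# One wrong residue dodging the others (worst case is the last one placed):
printf "worst-case single coincidence, all exponents wrong: ~%.2g\n", 2 * $n / ($r - 2 * $n);  # ~5.5e-12
printf "worst-case single coincidence, 2%% of tests wrong:   ~%.2g\n", 1.02 * $n / $r;          # ~2.8e-12

# Chance that any wrong residue coincides with another residue:
printf "any coincidence, every exponent has a wrong residue: ~%.2g\n", $n * $n / $r;            # ~1.4e-4
printf "any coincidence, 2%% of tests wrong:                  ~%.2g\n", 0.02 * $n * $n / $r;     # roughly the ~2.9e-6 above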
The problem is that bad residues from software or hardware issues are not randomly distributed. If they were, we would not be patching and trapping and searching databases for known application-specific bad residues as markers of which exponents to double or triple check. https://www.mersenneforum.org/showpo...&postcount=142 https://www.mersenneforum.org/showpo...&postcount=150
There is an LL primality test error rate of ~2%/exponent, and similarly on second checks. We iterate until there's a match.
We're always on the lookout for ways to reduce and to catch errors (without hurting performance too much). Some error detections, if efficient enough, will increase net accurate throughput.

We know from some GPU runs that some bugs/misconfigurations will preferentially stabilize on a specific wrong res64 result, not a random wrong one. One such value is a false positive, as Madpoo has long known and dealt with. So that's an existence proof of a nonrandom result from error, one that occurs despite a nonzero offset. A patch to detect and halt such runs was added. (See item 4 in the CUDALucas bug and wish list attached at https://www.mersenneforum.org/showpo...24&postcount=3)
Quote:
# (in perl form) application-specific bad residues, indicative of some problem causing the calculation to go wrong
# for applications other than gpuowl, detection of one of these means the run should be halted and the problem fixed before continuing
# for gpuowl, the Gerbicz check will force recalculation of many iterations, requiring more time; fixing the issue is recommended
%badresidues=(
'cllucas', '0x0000000000000002, 0xffffffff80000000',
'cudalucas', '0x0000000000000000, 0x0000000000000002, 0xffffffff80000000, 0xfffffffffffffffd',
'cudapm1', '0x0000000000000000, 0x0000000000000001, 0xfff7fffbfffdfffe, 0xfff7fffbfffdffff, 0xfff7fffbfffffffe, 0xfff7fffbffffffff, '.
'0xfff7fffffffdfffe, 0xfff7fffffffdffff, 0xfff7fffffffffffe, 0xfff7ffffffffffff, 0xfffffffbfffdfffe, 0xfffffffbfffdffff, '.
'0xfffffffbfffffffe, 0xfffffffbffffffff, 0xfffffffffffdfffe, 0xfffffffffffdffff, 0xfffffffffffffffe, 0xffffffffffffffff',
'gpuowl', '0x0000000000000000',
'mfaktc', '',
'mfakto', ''
); #fff* added to cudapm1 list 7/19/18
# note: since the second-to-last LL iteration's full residue can be ±2^((p+1)/2) for a Mersenne prime,
# and, above M127, that looks in a res64 like '0x0000000000000000' or '0xffffffffffffffff', special handling may be required
# for iteration p-3 for cllucas and cudalucas; add checks for the values below to the cllucas and cudalucas checking code as (ok) exceptions to bad residues
$llpm3okresidues='0x0000000000000000, 0xffffffffffffffff';
# see http://www.mersenneforum.org/showthread.php?t=5862
# see also http://www.hoegge.dk/mersenne/resultspenultimate.txt
# http://www.hoegge.dk/mersenne/penultimate.txt
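As a usage illustration (my sketch; is_bad_res64 is a hypothetical helper, not an existing tool), a reported res64 could be screened against these lists roughly as follows, with the penultimate-iteration exception handled as noted in the quote.
Code:
#!/usr/bin/perl
# Sketch of screening a res64 against the application-specific bad-residue lists.
use strict;
use warnings;

# Trimmed copies of the definitions quoted above (see the quote for the full lists):
my %badresidues = (
    'cudalucas' => '0x0000000000000000, 0x0000000000000002, 0xffffffff80000000, 0xfffffffffffffffd',
    'gpuowl'    => '0x0000000000000000',
);
my $llpm3okresidues = '0x0000000000000000, 0xffffffffffffffff';

sub is_bad_res64 {
    my ($app, $res64, $is_penultimate) = @_;     # $is_penultimate: LL iteration p-3
    my %bad = map { lc($_) => 1 } split /\s*,\s*/, ($badresidues{lc $app} // '');
    my %ok  = map { lc($_) => 1 } split /\s*,\s*/, $llpm3okresidues;
    $res64 = lc $res64;
    return 0 if $is_penultimate && $ok{$res64};  # allowed at p-3 for a Mersenne prime
    return $bad{$res64} ? 1 : 0;
}

print is_bad_res64('cudalucas', '0xFFFFFFFFFFFFFFFD'), "\n";   # 1: known bad CUDALucas residue
print is_bad_res64('cudalucas', '0x3a2f9b1c0d4e5f67'), "\n";   # 0: not a known bad residue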
You might find the strategic double check thread https://www.mersenneforum.org/showth...462#post508462 and the triple check thread https://www.mersenneforum.org/showth...=17108&page=82 interesting background also.

Historically, error rates were somewhat higher. https://www.mail-archive.com/mersenn.../msg07476.html

With the approximate empirical 2% error rate per primality test completed, and certain assumptions that seem plausible, the chance of one exponent (total, out of the 50-million-plus prime exponents p < 10^9, not individually per prime exponent) having two matched wrong residues is ~2.9 ppm. This seems to me to be a lower bound for matched wrong residues slipping by error detection. It's difficult to estimate probabilities for the nonrandom sources of incorrect matching residues (undetected software bugs, malicious reports, etc.), which are additional. So let's suppose for now that the combined chance of random and nonrandom error producing matching wrong residues is 10 ppm. Assuming further that it is distributed uniformly and independently over the ~50,847,478 prime exponents below 10^9, containing a probable number of ~55 Mersenne primes, the chance of matching wrong residues occurring, times the chance of it coinciding with a Mersenne prime, is 10 ppm × 55/50,847,478 ≈ 1.08×10^-11. If we assume the occurrence of matching wrong residues is somehow connected to the Mersenne number being prime, the probability estimate of missing a Mersenne prime rises to the assumed 10 ppm value. If we assume the occurrence of matching wrong residues is somehow connected to the Mersenne number being composite, the probability estimate of missing a Mersenne prime through matched wrong residues falls to zero. We could make various sets of assumptions about the relative probabilities of these cases (independent, prime-connected, composite-connected) and compute new probabilities as possible estimates of the real probability. At some point, such estimates seem to rest on too shaky a foundation of assumptions and guesses to pursue further. Perhaps someone with a better background in statistics could help here.
Working three cases here for illustration, carrying to much higher precision than justified:

Assumed weight (%)   Case 1   Case 2   Case 3
independent           99.98       98       34
prime-linked           0.01        1       33
composite-linked       0.01        1       33

Case 1: 0.9998 × 1.08×10^-11 + 0.0001 × 10 ppm + 0.0001 × 0 ppm ≈ 0.0010108 ppm
Case 2: 0.98 × 1.08×10^-11 + 0.01 × 10 ppm + 0.01 × 0 ppm ≈ 0.1000106 ppm
Case 3: 0.34 × 1.08×10^-11 + 0.33 × 10 ppm + 0.33 × 0 ppm ≈ 3.3000037 ppm
Intuition tells me to weight "independent" heavily, but it's unclear how many nines to give it.
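Here is a small Perl sketch (my addition) of the same weighted combination, so the weights and the assumed 10 ppm and ~55-primes inputs can be varied easily.
Code:
#!/usr/bin/perl
# Weighted estimate of the chance of missing a Mersenne prime to matched wrong residues.
use strict;
use warnings;

my $p_match = 10e-6;                         # assumed combined chance of matched wrong residues
my $p_indep = $p_match * 55 / 50_847_478;    # independent case: must also land on a Mersenne prime
my $p_prime = $p_match;                      # error somehow tied to primality: full 10 ppm
my $p_comp  = 0;                             # error tied to compositeness: cannot hide a prime

for my $w ([0.9998, 0.0001, 0.0001], [0.98, 0.01, 0.01], [0.34, 0.33, 0.33]) {
    my $p = $w->[0] * $p_indep + $w->[1] * $p_prime + $w->[2] * $p_comp;
    printf "weights %-22s -> %.7f ppm\n", join(',', @$w), $p * 1e6;
}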

Now note that the assumption made earlier, that the chance of error is distributed uniformly among the prime exponents, is a convenient simplification for estimating probabilities, but it is wrong. It would be hard to get the primality of M2 or M3 wrong. It gets easier, and more likely for an error to occur, as the exponent gets bigger. I suppose we could sum the relative run times of all prime exponents, and assign a computed probability of error proportional to individual run times, fitted to the empirical experience.

The odds of three matching wrong residues due to independent error would be much smaller. As I recall, triple checking was done of all exponents below ~3M. Some have had many more matching residues reported. See for example https://www.mersenne.org/report_expo...=101000&full=1
and note that in that range, any matching PRP results were preceded by matching LL results. It's my understanding that so far, all GIMPS discoveries of Mersenne primes were by first LL test, not by double check or later.

At the outset I gave a rough figure of LL test error rate as 2%. From my own running experience, it's clearly possible to do much better. Over a period of producing 447 verified LL tests, I also produced 6 bad residues, for a rate of 1.32% overall, 1.2% on prime95, 1.47% on GPUs. Also for a small sample of verified PRP3 tests, zero errors (23 prime95, 1 gpuowl). More to the point, decommissioning CPUs and GPUs that produce bad residues has led to zero GPU-produced bad residues since late 2017, and only one CPU-produced bad residue since then. Addition of more software checks would also help prevent completion and submission of bad runs.

It's also clearly possible to do much worse than the 2% figure. A small sample of 47 LL tests on 21 100M-digit exponents yields at least 9 bad residues, possibly as many as 13, for an estimated error rate of 19% per LL test in that region. Extrapolating based on run time yields, for 300M-digit exponents, an estimate of an 88% error rate per test.
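Here is one way to arrive at roughly the 88% figure (a Perl sketch of my reconstruction, not necessarily the exact method used): assume errors occur at a constant rate per unit of run time, and that run time scales roughly as p^2.1.
Code:
#!/usr/bin/perl
# Extrapolate a per-test error rate from 100M-digit to 300M-digit exponents.
use strict;
use warnings;

my $log10_2   = log(2) / log(10);
my $p_100M    = 1.0e8 / $log10_2;     # ~3.32e8, roughly where 100M-digit exponents begin
my $p_300M    = 3.0e8 / $log10_2;     # ~9.97e8, roughly where 300M-digit exponents begin
my $rate_100M = 0.19;                 # observed ~19% error rate per LL test near 100M digits

my $time_ratio = ($p_300M / $p_100M) ** 2.1;            # ~10x the run time
my $rate_300M  = 1 - (1 - $rate_100M) ** $time_ratio;   # chance of at least one fatal error
printf "run time ratio ~%.1f, estimated error rate ~%.0f%%\n", $time_ratio, 100 * $rate_300M;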

George computed the likely number of Mersenne primes below p = 10^9 according to the Wagstaff conjecture and posted the result, 57.09, at https://www.mersenneforum.org/showpo...&postcount=204. Note that's a bit higher than the 55 I used above in computing estimates of matching-wrong-residue probability, but not by enough to shift the probabilities much. Two more primes is about a 4% greater value for the primes' generally negligible contribution.


Top of this reference thread: https://www.mersenneforum.org/showpo...89&postcount=1
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
