How reliable are the major Mersenne testing codes, and how is it achieved? What more might be done?

There are three types of error in primality testing that we would like to avoid, and one that we generally accept:

- A false positive (reporting the Mersenne number for a given prime exponent as prime when it is actually composite) sounds serious, but is very likely to be detected quickly by the verification process applied to any suspected prime: multiple parallel runs by multiple trusted people on multiple hardware architectures with multiple software packages. There are known software issues that generate this case at much higher than random probability.
- A false negative (mistaking a Mersenne prime as composite). This has serious consequences. It both reduces achievement of the GIMPS goal of finding Mersenne primes, and distorts the perceived or empirical distribution of Mersenne primes that number theorists use to inform their study of them. Its impact is limited by (eventually) testing, twice or more, all exponents that survive factoring tests.
- A false residue: correctly concluding a composite Mersenne number is composite, but via a computation that suffered an error sufficient to give an incorrect final residue. Since nearly all Mersenne numbers are composite, probability heavily favors this type. These are less serious; they are found when a second run does not match the first.
- Recoverable errors, which are recovered from during the computation, so that they do not affect the final result. We would like to moderate both their frequency and their throughput impact. Some level of recoverable error is accepted, to the extent that it enables higher aggregate correct throughput than reducing or avoiding them would allow.

Looked at another way, there are two categories of error: those we find, and those we don't.

GIMPS uses two types of primality test. The Lucas-Lehmer (LL) test is understood to be a conclusive primality test for Mersenne numbers when performed correctly with a properly selected seed value. There is a small set of seed values suitable for any prime exponent; the GIMPS convention is to use the seed 4. The Fermat pseudoprime test (PRP) is not conclusive regarding primality: it indicates whether a candidate is definitely composite, or extremely likely (but not certainly) prime. Usually the seed (base) 3 is used. A PRP result indicating "probably prime" would promptly be followed by verification with multiple LL tests, and probably also a pseudoprime double check, using different users, hardware, and software.
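For concreteness, here is a toy sketch of both tests on small exponents (illustrative only; production codes use FFT-based multiplication plus the shift, checkpoint, and error-check machinery described below):

```python
# Illustrative sketch, not production code: the Lucas-Lehmer test with seed 4,
# and a base-3 Fermat PRP check, for small Mersenne numbers M_p = 2^p - 1.

def lucas_lehmer(p: int) -> bool:
    """Conclusive primality test for M_p (p an odd prime), GIMPS seed 4."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):          # p - 2 squarings
        s = (s * s - 2) % m
    return s == 0                   # M_p is prime iff the final residue is 0

def fermat_prp3(p: int) -> bool:
    """Base-3 Fermat pseudoprime test: False means definitely composite."""
    m = (1 << p) - 1
    return pow(3, m - 1, m) == 1

for p in [3, 5, 7, 11, 13]:
    print(p, lucas_lehmer(p), fermat_prp3(p))
```

For these tiny exponents the two tests agree (M_11 = 2047 = 23 * 89 is composite; the others are prime), but only the LL result is conclusive in general.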

Lucas-Lehmer primality tests have produced, as of 18-20 Dec 2018, the following "interesting" bad residues occurring near the minimum or maximum possible value:

LL seed 4 64-bit residues near zero regarded as bad:

https://www.mersenneforum.org/showpo...&postcount=142
Code:

Residue            Count
0000000000000000     311   (known issue with CUDALucas < v2.06 and other software)
0000000000000002      14   (known issue with CUDALucas < v2.06 and other software)
0000000000000003       1   attributed to V17 prime95 shift bug
0000000000000004       1   cause uncertain
000000000000006C       1   attributed to V17 prime95 shift bug
0000000000000269       1   attributed to V17 prime95 shift bug
total                329

LL seed 4 64-bit residues near ffffffffffffffff regarded as bad:

https://www.mersenneforum.org/showpo...&postcount=150
Code:

exponent Partial Residue Status
3370013 FFFFFFFFFFFFFFFE Bad
37614691 FFFFFFFFFFFFFFFD Bad
47359579 FFFFFFFFFFFFFFFF Bad
64847711 FFFFFFFF80000000 Bad
67077133 FFFFFFFFFFFFFFFD Bad
67151753 FFFFFFFFFFFFFFFD Bad
68834723 FFFFFFFFFFFFFFFD Bad
81857519 FFFFFFFF80000000 Bad
81857537 FFFFFFFF80000000 Bad
81860447 FFFFFFFF80000000 Bad
81860479 FFFFFFFF80000000 Bad
88283761 FFFFFFFF80000000 Bad
88283807 FFFFFFFF80000000 Bad

So, by residue, we have counts:

Residue            Count
ffffffffffffffff       1   (residue is equivalent to 0)
fffffffffffffffe       1
fffffffffffffffd       4   (residue is equivalent to -2; known issue with algebra, and CUDALucas < v2.06)
ffffffff80000000       7   (known issue with CUDALucas; 6 of 7 were from the same user in 2018)
total                 13

The total number of PRP primality tests done to date is not nearly as large as the number of LL tests, and the highly effective Gerbicz error check prevents most errors from persisting, so statistics on PRP errors are thin. From a PRP results export available 2019-05-13 at

https://www.mersenneforum.org/showpo...1&postcount=11, a dash of perl after a residue sort in OpenOffice yielded:

1 header record
9945 exponent records
0 res64 values shared among multiple exponents (no duplication of res64 values)

A review of verified good and verified bad results above 77M from 2019-08-12 to 2021-08-12 yields an error rate of 3 bad PRP results per 123522 verified (good and bad) results, or a ~24 ppm PRP error rate, on a mix of PRP/GEC and PRP/GEC/proof runs. A check of 50M to 77M for the same period showed 751 verified results and 0 errors.

Duplication would be very highly unlikely by random chance in such a small sample sprinkled over a 2^64 res64 space, so if duplication occurred, it would be indicative of some sort of error. Inspection of the low and high ends showed no special or suspicious residues (0, 3, ff...ffc, ff...ff), just as you'd expect if the Gerbicz error check was working well.
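The "very highly unlikely" claim is just the birthday bound. A quick back-of-envelope estimate, assuming the res64 values of error-free runs behave like uniform random 64-bit values:

```python
# First-order birthday approximation: probability that n uniformly random
# 64-bit residues contain at least one duplicate is about n*(n-1)/(2 * 2^64).
n = 9945                 # exponent records in the 2019-05-13 export
space = 2 ** 64          # size of the res64 space
p_collision = n * (n - 1) / (2 * space)
print(f"{p_collision:.2e}")     # roughly 3e-12
```

So even one duplicated res64 in a sample this small would point strongly at a software or hardware problem rather than chance.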

State indicators found were as follows:

R Reliable (its Gerbicz check passed, but not yet verified)
S Suspect (had error codes attached)
U Unverified (one result for the exponent; not R or S)
V Verified (more than one result for the exponent; matched)

Perl counted 7111 reliable, 3 suspect, 1878 unverified, and 953 verified, for a total of 9945. Two of those 3 suspect results are by the same user and have 0 offset. A suspect rate of 3/953 = 0.0031 (~0.3%) is markedly better than the typical LL track record of ~2.0% bad results.

Some of the codes have ancestry in common. My crude diagram of the ancestry of the various codes is attached at

https://www.mersenneforum.org/showpo...04&postcount=5 If an issue were present in a bit of "DNA in common", despite the best efforts of the authors, and despite the extreme care of using differing programs on differing hardware in multiple confirmations by multiple individuals, how could we know?

The ancestry-in-common question occurs on multiple levels:

- If there is any Mersenne-search-specific code in common, it may contain a shared flaw;
- If there is reuse of libraries with issues: completely independently written programs may nevertheless both use a library such as the NVIDIA CUDA FFT library, or the gwnum routines used in many packages;
- Commonality of conceptual error (writers of different software making the same mistake or invalid assumption);
- A flaw in the hardware design on which the programs run;
- An underlying flaw in the mathematical basis for the programs. (Imagine if the Lucas-Lehmer test were not actually a conclusive primality test but an _almost_ certain one, a pseudoprime test that gave a primality indication for 0.1 ppm of exponents too often. Or if there were some sort of issue with the irrational base discrete weighted transform that gave a similarly rare false-positive effect.)

We can rerun new software on inputs for which we know, or think we know, the correct results. Those results were obtained from earlier software or, for small inputs, from other methods.

If they all pass, it does not necessarily mean the existing codes are without flaw.

If one fails, it does not necessarily mean the existing codes are flawed; it could be a transient hardware issue.

There's a sliver of a chance of a computation error occurring outside the loop of iterations that are guarded for accuracy by the Jacobi check or Gerbicz check. Bugs have been identified outside the checking loop.

No disrespect to any of the authors (or users, or hardware designers for that matter; some tasks are just very hard). I've fixed code that failed nonreproducibly at low rates. Bugs occurring at ppb or lower frequency can be hard to show exist at all, much less identify and resolve. Current assignments can take on the order of 10^18 core clocks to complete.

The output of programs, and the best understandings and conjectures of number theorists, are compared.

In the case of prime95, which can process not only Mersenne number computations but also Proth numbers, Fermat numbers, etc., or of its underlying gwnum routines that are also used in other software, an error detected in computations on those other number types might be revealed to lie in FFT or other code common to the various computations. The more eyes on the code and the more cases checked, the better for its reliability.

Over time, and with a basis in number theory, certain interim results (indicated by their low 64 bits) have been shown to sometimes occur when they should not, and checks for these specific incorrect early residues have been incorporated into most of the various software applications. Some of these problem residues are application-specific, not merely algorithm-specific.

Most of the programs use

- either the conclusive Lucas-Lehmer primality test or a (usually base 3) pseudoprime test,
- standard libraries or carefully developed and tested math routines for performing very efficient computations,
- pseudorandom varied shift (offset), so that different numerical errors occur in different runs of the same exponent with the same algorithm,
- double precision floating point for performance, at the cost of some imprecision, addressed by various error checks and correction methods,
- roundoff error checking,
- checking that the sum of the inputs approximately equals the sum of the outputs, within a set tolerance,
- interim save files, for resumption from an earlier point if serious error is detected,
- some sort of checksum or CRC stored in the save files, and checked upon subsequent read,
- output of interim and final 64-bit residues, for comparison, and determination of at what point a given computation went wrong,
- known-bad-interim-residue checks, which are algorithm-dependent or application-dependent,
- logging for later comparison and diagnosis.
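The shift-offset item above can be illustrated on plain integers: a residue stored rotated by k bits is just the true residue times 2^k mod M_p, the shift doubles (mod p) at each squaring, and the subtracted 2 must be shifted to match. A toy sketch (my own illustration; real codes apply the rotation to the FFT data, not to a big integer):

```python
# Shifted Lucas-Lehmer iteration sketch. Since 2^p == 1 (mod M_p), multiplying
# by 2^k is a k-bit rotation, and squaring doubles the shift modulo p.

def ll_shifted(p: int, shift: int) -> int:
    """LL test of M_p with the residue stored rotated left by `shift` bits."""
    m = (1 << p) - 1
    k = shift % p
    s = (4 << k) % m                 # shifted seed: 4 * 2^k mod M_p
    for _ in range(p - 2):
        k = (2 * k) % p              # squaring doubles the shift (mod p)
        s = (s * s - (2 << k)) % m   # the subtracted 2 must be shifted too
    return (s << ((p - k) % p)) % m  # rotate back to the true residue

print(ll_shifted(13, 0), ll_shifted(13, 5))   # both 0: M_13 is prime
```

Runs with different shifts traverse different numerical data but must unshift to identical residues, which is what makes mismatches between runs meaningful.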

Some programs additionally apply the Jacobi symbol check periodically, which detects LL computation errors with 50% probability, or the Gerbicz error check (GEC), which detects errors in the base-3 pseudoprime computations with nearly 100% probability. Jacobi symbol check:

https://mersenneforum.org/showpost.p...3&postcount=30 Gerbicz check:

https://www.mersenneforum.org/showpo...1&postcount=88
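For the LL sequence with seed 4, s_i - 2 is 3 times a perfect square for every i >= 1, so the Jacobi symbol (s_i - 2 | M_p) should always be -1; a randomly corrupted residue passes that test only about half the time, which is where the 50% detection figure comes from. A toy sketch (my own illustration, not any program's actual implementation):

```python
# Jacobi-symbol check sketch on a small LL run with seed 4: every intermediate
# residue s should satisfy jacobi(s - 2, M_p) == -1.

def jacobi(a: int, n: int) -> int:
    """Jacobi symbol (a/n) for odd n > 0, by the standard binary algorithm."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:            # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                  # quadratic reciprocity flip
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

p = 61                               # M_61 is prime; the invariant holds anyway
m = (1 << p) - 1
s = 4
for _ in range(p - 2):
    s = (s * s - 2) % m
    assert jacobi(s - 2, m) == -1    # invariant of the error-free sequence
print("all Jacobi checks passed")
```

Because the invariant costs a gcd-like computation rather than extra squarings, it is cheap to run periodically, but it can only ever catch about half of the corrupted residues.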
To my knowledge, only prime95/mprime and gpuOwL incorporated the Jacobi test or Gerbicz PRP check as of early March 2019. Mlucas added PRP and the GEC at some point. The Gerbicz check has quite low overhead, costing only about 0.2% of run time at a one-million-iteration checking interval (Gerbicz block size 1000), which is comparable to the overhead of the Jacobi symbol check on LL tests.

https://www.mersenneforum.org/showpo...0&postcount=31 The big difference is in the net error rate after checking: the Jacobi symbol check is only 50% effective at detecting errors, while the GEC is essentially 100% reliable at detecting errors, resuming from known-good save files, and preventing final residues from being in error.
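The GEC rests on a simple identity: with x_i = 3^(2^i) mod M_p, if d_t is the product of every B-th residue, then d_{t+1} = 3 * d_t^(2^B) mod M_p, so roughly B extra squarings can verify B*B iterations of real work. A toy sketch with deliberately tiny parameters (illustrative only):

```python
# Gerbicz error check (GEC) sketch for a base-3 PRP run mod M_p.
# Checksum update: d_{t+1} = d_t * x_{(t+1)B}; verification: d_{t+1} == 3 * d_t^(2^B).

p, B = 127, 8            # tiny illustrative sizes; real runs use B ~ 1000
m = (1 << p) - 1

x = 3                    # current residue x_i = 3^(2^i) mod m
d = 3                    # running checksum, starts at x_0 = 3
for _ in range(B):       # B blocks of B squarings each
    d_prev = d
    for _ in range(B):
        x = (x * x) % m
    d = (d * x) % m
    # the Gerbicz check: B extra squarings of the old checksum, times 3
    assert d == (pow(d_prev, 1 << B, m) * 3) % m

# now inject a single-bit error and show the next check catches it
x ^= 1 << 20
d_prev = d
for _ in range(B):
    x = (x * x) % m
d_bad = (d_prev * x) % m
assert d_bad != (pow(d_prev, 1 << B, m) * 3) % m
print("clean blocks passed; injected error caught")
```

An undetected error would have to corrupt the work and the checksum consistently, which is why the GEC catches essentially all errors inside the checked blocks; errors outside them, discussed below, remain possible.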

A more recent development is the PRP proof and verification (certificate) process. Using an MD5 hash and other safeguards, this provides proof of correct completion of a PRP test and protects against falsified results. Gpuowl and prime95 now include this feature; it is being added to Mlucas. There is an analogous proof proposed by R. Gerbicz for the LL test, which requires more interim residues and data storage; it has not yet been implemented in any application.

Another layer of error checking, detection, and correction occurs at the project / PrimeNet server level.

The GIMPS project records the 64-bit residues returned for composite primality tests. At least two tests are done per unfactored Mersenne number (except for a first-time PRP test with a successful proof, which is more reliable). Additional tests are done if the first two are of different types, or if the first n are of the same type but do not match, until two results of matching type and residue, produced by different users, are obtained.
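The matching rule just described might be sketched as follows (a hypothetical illustration; the function and data layout are invented for this post, not PrimeNet's actual schema):

```python
# Hypothetical sketch of the server-side matching rule: an exponent is verified
# once two results of the same test type report the same 64-bit residue and
# come from different users.
from itertools import combinations

def is_verified(results) -> bool:
    """results: list of (test_type, res64, user) tuples for one exponent."""
    for a, b in combinations(results, 2):
        same_type = a[0] == b[0]
        same_res = a[1] == b[1]
        different_users = a[2] != b[2]
        if same_type and same_res and different_users:
            return True
    return False

print(is_verified([("LL", "ab12", "alice"), ("LL", "ab12", "bob")]))    # True
print(is_verified([("LL", "ab12", "alice"), ("PRP", "ab12", "bob")]))   # False
print(is_verified([("LL", "ab12", "alice"), ("LL", "ab12", "alice")]))  # False
```

Until `is_verified` would return True, the server keeps issuing further checks for the exponent.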

Some programs maintain and report error counts. Certain errors substantially increase the probability of a wrong final residue. Those results are flagged for early checking. (See the strategic double check and triple check thread on mersenneforum.org.)

Reliability of a given set of hardware declines over time. Users can help maintain the reliability of their hardware by avoiding overclocking and environmental extremes, by periodically checking logs, by periodically testing their hardware with memory test programs, by periodically running double checks, and by periodically running the self-tests incorporated in the software.

Users can help maintain and improve the accuracy of their own results and the project as a whole by logging runs, reviewing the logs for anomalies, and reporting anomalies seen, to the software authors, to the GIMPS community, and/or to me for inclusion in the bug and wish lists I am maintaining for GPU software.

If we assume the hardware error rate per unit time is about constant, averaged over the fleet, then as the project moves to larger exponents over time we can expect the error rate per primality test to increase. Addition of recently identified error checks allowing restart from a known-good savefile will help counter this effect, as will switching from LL to PRP with its excellent Gerbicz error check. It is possible that other error checks or periodic self-tests will be added in the future.

The current error rate of the final residue from LL testing (in the absence of substantial hardware issues, software misconfiguration, or certain logged error types) is around 2% per LL primality test, or ~4% per double-LL-tested exponent. The error rate rises to around 40% for prime95 LL tests with illegal sumout errors logged.

The error rate of the final residue from PRP3 with the Gerbicz check is quite small by comparison, but nonzero; hardware errors or bugs affecting software operation "outside" the Gerbicz error check of blocks of iterations can occur and affect the results, and have occurred since introduction of the check. Software has been rewritten to reduce the occurrence of errors outside the GEC. An evaluation of PRP errors found versus verified results over the period 2019-08-12 to 2021-08-12, for exponents above 77M and above 50M, yielded an observed PRP/GEC and PRP/GEC/proof combined error rate of ~24 per million PRP tests **(~24 ppm)**. That's about **800 times lower** than the LL-without-Jacobi-check error rate, and corresponds to 3 PRP errors total, at least one of which occurred before the prime95 modification to reduce errors occurring outside the GEC, such as in handling of the final residue value.

There was a software bug in gpuowl V7.1 which produced 3 proof errors before it was identified (not residue errors, IIRC). That has been addressed. Discussion of this error type begins at

https://mersenneforum.org/showpost.p...&postcount=405
In prime95 / mprime, measures hardening proof generation against error are described briefly at

https://mersenneforum.org/showpost.p...40&postcount=8
Error rate from a given user can show seasonal or longer-scale variation.

https://www.mersenne.org/various/math.php indicates an error rate per LL primality test of 1.5% when no serious errors are reported (not counting illegal sumout as a serious error), and 50% when a serious error is reported.

Historically, LL primality test error rate varied considerably, up to nearly 4% per test.

https://www.mail-archive.com/mersenn.../msg07476.html
See also

https://www.mersenneforum.org/showpo...3&postcount=19 for more discussion of error rates. This post is already rather long.

https://www.mersenneforum.org/showpo...&postcount=102 late 2016 error rate charts

https://www.mersenneforum.org/showpo...&postcount=104 late 2017 error rate charts

https://www.mersenneforum.org/showpo...&postcount=105 late 2018 error rate charts

https://www.mersenneforum.org/showpo...&postcount=107 late 2019 error rate charts

https://www.mersenneforum.org/showpo...&postcount=111 late 2020 error rate charts

(Thanks Patrik for all those!)

You can read a bit about the history circa 1999 of GIMPS bugs and QA at

https://www.mersenneforum.org/showth...=23877&page=13
Additional error checks, after discovery, may be implemented by the various software titles' authors or maintainers at their discretion, if they save more than they cost and are otherwise deemed worthwhile.

Top of reference tree:

https://www.mersenneforum.org/showpo...22&postcount=1