View Single Post
Old 2019-04-29, 21:14   #9
kriesel's Avatar
Mar 2017
US midwest

29·173 Posts
Default Why don't we use the initial iterations of a standard primality test as a self-test?

I don't know. As far as I know, no existing GIMPS software is using it, and yes I've read some of the source code. Maybe it's regarded by the talented cpu and gpu GIMPS programmers as not worth their time, based on its short duration, other existing checks, or other priorities.
See, in which the 5 initial LL seed 4 iterations are listed. These are the same for every p>64. At current GIMPS wavefronts for first time and double checking, dozens of beginning residue iterations are unaffected by the Mod Mp and so they are the same from one exponent p to another, in the absence of a very early error. (See the attachment for a table.)

The following is an idea I've been mulling for a while.

There exists a method for brief, almost free, self test, of hardware and software correct operation, at the start of every GIMPS primality test, for Lucas-Lehmer testing with either seed 4 or seed 10, and also for some PRP3 residue runs, where the initially computed powers of 3 are not dependent on the exponent's bit pattern.

It is therefore applicable to CUDALucas, ClLucas, Mlucas, some versions of GpuOwl, and some types of prime95 or mprime computation, as well as possible future implementation of PRP on CUDA computing PRP3 residues in certain ways. It is applicable to zero-offset and pseudorandom-offset forms of the algorithms.

It is also applicable to P-1.

It does not interfere with any of the other, existing error detection methods. That's important.

It can detect early effect of the following error types and sources. (Note the error check is generic, and does not require previous knowledge of the error type or of the erroneous result, as some other existing error checks do.)
  1. Serious memory error or other hardware error, such as but not limited to stuck memory bits in the most significant bit cells. (Errors occurring in insignificant bit positions will go undetected, covered by roundoff. Errors in MSB mantissa positions, or exponent positions of nonzero-mantissa floating point words in the fft length are likely to be detected.)
  2. User input error, such as specifying an fft length that is too small or much too large.
  3. Configuration issues, such as the use of -USE options in gpuOwL that are incompatible with the required fft length, or CUDA level, thread count, etc combinations that give rise to repeating 0x0 or 0x02 or 0xfffffffffffffffd residues in CUDALucas, or similar in clLucas.
  4. Some symptoms of some software bugs.
  5. Incipient hardware degradation or temporary or new issues such as excessive clock rate or thermal issues.
It does not detect any error occurring in iterations where the mod Mp has effect on the residue, since this check is performed before that begins.

The method consists of the following:
  1. Include in the applications, a table of known-good res64 values for initial iterations up to ~32, assuming no effect of the mod Mp, for each of the algorithm and seed value combinations supported by the application. These values are independent of exponent, so the tables involved are compact. Thirty two iterations are enough to cover the supported computation types up to the Mlucas limit of p<232. So if an application supports all of seed 4 LL, seed 10 LL, and PRP3 residue type 4 and type 1, the tables occupy 64 bits times 32 entries each, 4 x 64 / 8 * 32 = 1024 bytes as unsigned integers, twice that if stored as hex ASCII. (For attempts on extremely large exponents, such as F33, res128 might be useful.)
  2. For the exponent to be primality tested, and the algorithm and seed value combination to be computed, initially compute a safe iteration number at which to perform the check, which is near but no higher than the last iteration before the mod Mp begins to have effect. In other words, determine the highest safe iteration, as the iteration number that would have no more than p bits result, if the modulo were not applied. For seed 4 LL, this is ~floor(log2 p).
  3. Compute the iterations normally, through that iteration count.
  4. Obtain the res64 for the computation to that point. Assuming occupancy of at least 16 bits / fft word, this typically involves 5 fft words, since the res64 boundary is unlikely to land exactly on an fft word boundary for pseudorandom shift, and fft lengths are normally as short as are safe to use. (Any adjacent pending carries should be resolved.)
  5. Compare the res64 value obtained in step 4 to the corresponding value in the table. An exact match indicates no error detected, so continue. A mismatch means a serious reliability issue, occurring very early, that should be resolved before the computation is restarted from the seed value, or a problem in the error check. An error message and computation halt is recommended. The message could be something like:"Early iterations have produced a fatal error: incorrect res64 detected at iteration "<iter #>". This is indicative of serious memory error or other hardware error, application misconfiguration, incompatible combination of parameters, CUDA/OpenCL/driver versions, hardware model, etc, software bug, or inappropriate user specification of fft lengths. (Use ~10<p/fftlength<~17) It is unsafe to continue the computation as currently specified and configured. Please identify and correct the problem, before retrying."
The computational cost of this check is very small; one additional res64 determination, and the lookup and comparison, per exponent start, and occasionally output the error message.

The feasible number of iterations for this check and seed 4 LL is ~floor(log2 p); about 24 iterations for current LL DC wavefront, 25 for first primality test current wavefront, 27 for 100-megadigit exponents, 29 near the upper limit of, 30 very near or above p=231, 31 very near the 232 upper limit of Mlucas, ~32 for F33 P-1 or Pepin testing. For some PRP3 computation approaches, some exponents can use a little less, or one more iteration; for seed 10 LL some exponents allow one less.

If considering only the lengths of the terms per iteration in unsigned integer form, during this very brief self-test, and sum them, for LL seed 4, iteration 1...log2(p), it's 4 bits, 8, 16, ...~p/2<=bits<=p, and the sum is ~p<sum<~2p, or about the equivalent of one to two full width iterations. The fft has the effect of spreading the computation to a greater number of bits. The upper bound for this is the number of safe iterations. If the run can't pass the low threshold of matching a known-good res64 after such brief work, it should not be continued, because something is very seriously wrong. Which might be correctable.

The run time of so few iterations provides a very fast check for some basic function. It's almost instantaneous in normal usage, so is very likely to be seen by the user if there is a problem detected. The iterations are being performed anyway. Even on my old slow thermally challenged i3-370M laptop, a 79M PRP3 runs at 8 iterations/second, with significant throttling, so this initial check is around 3-4 seconds. It could potentially save two or more Gerbicz check blocks if a problem is detected early (or potentially many more if the user is inclined to launch and forget the app and not check on it often). On faster machines, it is even more likely to complete while the user is still looking at the application after launching it. For novices running CUDALucas, which has no Jacobi check yet, or safeguard against user specified too-small fft length, adding it could save days of bad runs, and perhaps some spurious submission of bad residues. For a GTX1080, with iteration time around 4.4msec at the current 84M first-test wavefront, 25 iterations =~0.11 second, so a serious misconfiguration or badly wrong fft length would provide almost instant feedback. On an RX480, 3.8msec/iter at 84M, ~0.09 second, while the first-two-1000-iteration blocks GEC take 7.6 seconds. For 100Mdigit exponents on the RX480, 16.3msec x 27 iterations =~0.5 seconds, vs. the ~33. seconds for the first-two-1000-iteration blocks GEC. For exponents near the 109 limit on the RX480 (not recommended due to years run time), 72msec x 29 iterations = ~2.1 seconds, vs. the 144. seconds for the first-two-1000-iteration blocks GEC. (Note earlier gpuowl versions use smaller block sizes, of 800, 500, 400 or 200 iterations. Normal gpuowl operation is GEC check every block-size squared or 106 iterations. Prime95 typically uses 1000-iteration blocks, 106 iteration GEC intervals, and dynamically adapts the interval according to error experience; minimum block size 25. So typical GEC occurs at >1 hour intervals.) The early res64 comparison test is also much less costly than a Jacobi check, and occurs once per exponent, much earlier than a Jacobi, so might save up to a full 12 hours of bad LL test in prime95's default Jacobi check interval. The Jacobi has a 50% chance of detecting an error, while if the early residue is wrong, detection by res64 table lookup should be 100%. The Jacobi check is not present in CUDALucas, cllucas, gpulucas, CUDAPm1, etc. The GEC is not applicable to P-1 as normally computed, so is not present in P-1 software (except gpuowl v7.x).

I'm unclear on the various approaches to PRP residue computations in the various GIMPS software and PRP residue types, so the PRP-specific content above is necessarily still vague. Perhaps some of the experts could help me out with that a bit.

All the preceding was written in the context of Mersenne number testing and P-1 factoring. It seems adaptable to P-1 factoring and Pepin testing of Fermat numbers also. For very large exponents, F33 and larger, it may be worthwhile to preload correct res128, since some res64 become very few nonzero bits and eventually 0x02 in early iterations for very large exponents. Res64=0x02 is also an indicator of some software errors, and typically treated as an error condition when encountered.
(note to self; check/update attachment someday for PRP type 1 res64 or res128)

Top of reference tree:
Attached Files
File Type: pdf iterations of res64 independent of exponent.pdf (22.3 KB, 134 views)

Last fiddled with by kriesel on 2021-03-02 at 19:40 Reason: adaptable to Fermat number P-1 factoring and Pepin testing; misc updates
kriesel is online now