Why don't we build statistical checks into the GIMPS applications and server?

1) Why don't we count detected errors and report them to the server with the results?
prime95/mprime do this. The counts are limited to a few bits per type.
I'm not sure what Mlucas does.
Gpuowl counts and logs Gerbicz check errors detected, and some versions report the count in their results.
Gpu applications other than gpuowl report errors detected to the console but do not count them or include them in the result report.
Adding this capability could allow Madpoo to do error-count-based selection of gpu results for early double-checking. Users could sign up, perhaps in a new field in their account settings https://www.mersenne.org/update/, to receive automatically generated email when there are indications of probable hardware issues with any one of their systems.

2) Why don't we thoroughly analyze server contents to identify probably unreliable results, unreliable systems, unreliable users, etc?
Madpoo periodically produces reports of exponents which had certain prime95 error result codes, for early retesting.
Extending this to system-by-system reliability rating and flagging would probably take more server utilization or manpower than is available.
Individual users can retrieve their own primality test results as in https://www.mersenne.org/report_LL/?...exfactor=1&B1= and attempt analysis themselves.
Multiple-gpu users will not be able to distinguish which gpu is responsible for which result, so gpu-specific bad residue analysis is not available. (It's all currently lumped into one "manual testing" bin, even if the "computername" included in the report was or included a unique gpu identifier.)

An extreme example I imagined is detecting a hint of an otherwise undetected software bug, because the results returned from a single software program or version, across a large sample from multiple users and multiple sets of hardware, show a distribution of found factors or produced res64 values that is, somewhere, enough sigmas away from the expected distribution. That could be a value occurring "too often" (a bump in the distribution), "not often enough" (a notch in the distribution), or possibly both (displacing some occurrences of what should have been res64 A onto res64 B), or separate issues.

If all res64 values are equally likely (ignoring for the moment the special cases of primality indicators for LL or PRP, or penultimate residues), finding a suspiciously large bump in the res64 distribution seems possible. Finding a notch appears to be impossible. Some error conditions we've already identified create occurrences of specific res64 values in error, so too often. For p < 10^9 there are only going to be roughly 40 million final primality test residues computed. (https://www.mersenne.ca/status/ shows >21 million unfactored exponents below 10^9; some will later be factored, and the remaining will probably get two or more primality tests, continuing until there are ~20 million matching res64s, or alternately PRP proofs.) So the most probable number of occurrences of a given randomly chosen res64 is zero (~20 million correct res64 values / 2^64 possible res64 values ≈ 10^-12 average occurrences per possible value).

Subdividing the res64 into smaller chunks shrinks the range of possible values and increases the probable number of occurrences.
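As a rough back-of-envelope illustration (a Python sketch; the 20-million figure is the estimate above, and the 16-bit chunk size is just an example):

[CODE]
# Rough expected-occurrence arithmetic for final res64 values at p < 10^9.
# The ~20 million figure is the estimate from the paragraph above.
matched_res64s = 20e6

full_space = 2.0 ** 64               # possible res64 values
print(matched_res64s / full_space)   # ~1e-12 expected occurrences per res64 value

chunk_space = 2.0 ** 16              # e.g. looking at only a 16-bit chunk of the res64
print(matched_res64s / chunk_space)  # ~305 expected occurrences per 16-bit value
[/CODE]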

3) Why don't we perform statistical analysis on select interim results on the fly during the various GIMPS client computations in production?
The code doesn't exist and would need to be added, tested, documented, released, and deployed.

I suspect that in many cases, the signal/noise ratio would be too low to detect an error often enough and early enough to be useful, and so would not justify the computational or developmental effort.

However, especially on unusually long runs, I think there are some possibilities of detecting useful things by careful application of simple statistical analysis, at the running applications, to data GIMPS software generates; as far as I know, this is not currently being done. I've tested the idea against saved anomalous results from past runs. It looks to be potentially useful on P-1, which is comparatively lacking in existing error checks.

When the GIMPS project started, the exponents were small enough that these statistical approaches would have been less effective because of limited sample size, per exponent or in total. At the current double-check, first-primality-test, TF, or P-1 wavefronts, the sample size per exponent can be larger.

Any statistically based checks on the fly would require the following to be useful:
  1. Knowing, or having a strong confidence in, the expected statistical distribution of an interim output
  2. Having access to that output, with sufficient sample quantity
  3. Efficiently gathering and processing that output
  4. Having a justified criterion for what is a suspicious departure from the expected distribution
  5. Having a provision for taking action when there's a suspicious departure detected
  6. Making detection of a suspicious departure early enough in a computation to make restart from a saved thought-good state effective
  7. Sufficiently low false-positive rate (flagging an error when it's just normal statistical variation should be rare)
Detecting, during a program run on a single exponent, statistically significant deviation of the interim res64 values from their expected statistical distribution, indicating a configuration issue or the onset of a hardware problem during the run, would be especially beneficial in LL runs, with or without the Jacobi symbol check, and in P-1 runs, for which the GEC does not apply, Jacobi symbol check application is difficult and limited, and, so far as I know, Jacobi symbol checks are not implemented in production P-1 code at all, except for gpuowl v7.x.

4) Why don't developers use statistical analysis during code development?
Maybe it's not useful; maybe they're already doing whatever is useful; probably they're focusing on doing the necessary number theory computations correctly and efficiently.

The importance of speed in the statistical gathering and processing depends on the application.
  1. For postprocessing use, such as analysis of server database contents, speed is not very important. It could run in the background until it eventually finishes its pass through old entries.
  2. For extra thorough examination of interim results or even every iteration during development and code testing, speed is less important than detecting any issues practical so they can be addressed. Such code could be switched off by default by an ini file setting for production, or even commented out or removed from the source code before the recompile for the release version.
  3. For on-the-fly application in TF, P-1, or primality testing, speed is important. In the slim space of detecting errors that the PRP GEC does not, speed is extremely important. I speculate that most of the on-the-fly statistical work in GIMPS applications could be done on the side in a separate hyperthread without much effect on performance.
  4. For primenet server ongoing use when receiving results, speed is important. Timing out in web interaction is bad.
What are the useful statistical properties of the various data?
  1. I think the relative probability of the various k values producing prime factors, in either TF or P-1 factor found output, is well defined.
  2. It seems not possible to collect and do statistics on the individual TF trial remainders fast enough to be worthwhile.
  3. Past the point where the modulo Mp starts having effect in a primality test, or in a P-1 stage, the individual hex digit positions of the interim res64 values are effectively pseudorandom, a good approximation of uniformly distributed, for correct software and hardware. So sampling the interim residues at 10^n iteration intervals and doing statistical analysis on the fly on the digit distributions could show significant departures from the expected distributions, indicative of a configuration error, software bug, or hardware malfunction, if n is enough smaller than p that a good statistical sample size can be accumulated. (A minimal sketch of such a digit tally follows this list.) In extreme cases, subsetting on the fly can show the onset of issues. Repetition of a single res64, as happens sometimes in CUDAPm1 stage 2, would quickly, even in a small sample, develop a very anomalous distribution for a given digit slot: ___|____________ instead of ---------------- or --_--=_----==_--. So would cyclic residues, as in https://www.mersenneforum.org/showpo...&postcount=571. The number of interim P-1 stage 2 residues output to the console depends on NRP (hence on available ram) and on ini file settings.
  4. This approach could detect some persistent memory errors, such as stuck bits, when offset = 0, as in recent gpuowl versions, or in P-1 computations. It seems unlikely to detect low-bit-density persistent memory errors for offset != 0, while the res64 walks the p-bit field at every iteration. Stuck-bit artifacts arrive at the res64 position in one of two ways: either the stuck bit is in the res64 footprint, or the stuck bit was elsewhere and its effect has been brought into the res64 position by a modulo Mp operation or a succession of them. The direct presence is <1 ppm likelihood, and declining as the wavefront advances. This appears to me to be very weak detection, very unlikely to produce any useful signal unless the memory is so bad that it's likely to cause other obvious problems.
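For concreteness, here is a minimal sketch (Python, not production GIMPS code) of the per-position digit tally described in item 3 above. The chi-square cutoff is my own illustrative stand-in for the "justified criterion" of requirement 4 earlier, not a calibrated threshold:

[CODE]
# Tally per-position hex-digit frequencies of interim res64 strings and flag
# digit positions whose distribution is far from uniform (1/16 per value).

HEX = "0123456789abcdef"

def digit_counts(res64_list):
    """16 digit positions x 16 possible hex values -> occurrence counts."""
    counts = [[0] * 16 for _ in range(16)]
    for r in res64_list:
        for pos, ch in enumerate(r.lower().zfill(16)):
            counts[pos][HEX.index(ch)] += 1
    return counts

def suspicious_positions(counts, n, threshold=80.0):
    """Chi-square against uniform per position (df = 15); threshold is illustrative."""
    expected = n / 16.0
    flagged = []
    for pos, row in enumerate(counts):
        chi2 = sum((c - expected) ** 2 / expected for c in row)
        if chi2 > threshold:
            flagged.append((pos, chi2))
    return flagged

# A run stuck on a single res64, as in the CUDAPm1 stage 2 example, gets every
# digit position flagged after only a handful of samples.
stuck = ["a5a5a5a5deadbeef"] * 8
print(suspicious_positions(digit_counts(stuck), len(stuck)))
[/CODE]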
I've started a thread at https://www.mersenneforum.org/showpo...32&postcount=1 with a description of what I think the statistical properties are. (No responses as of 2021 March 2.) There's an interesting post at https://www.mersenneforum.org/showpo...6&postcount=80 showing a several percent higher rate of finding factors that are 7 mod 8 than 1 mod 8.

Data sizes required for keeping such on-the-fly statistics are not large. A res64 is only 16 hex characters long, so a table of hexadecimal digit frequencies in the res64 is 16 positions by 16 possible values. In the extreme case, if we tabulated the frequency of values for every hex digit position separately, for every iteration, of an exponent half or more of the current maximal Mlucas v18 exponent, that's a table of 256 32-bit unsigned values, sized to avoid overflow even in the improbable pathological case of the res64 being stuck on a single value for the whole run with the error detection failing to detect, alert, and halt such a bad run. (Or perhaps twice that data size, if implementing 64-bit counts to allow for >32-bit exponents such as F33 and avoid overflow even if every iteration was stuck on the same res64 hex digits and error detection and response failed.) If we also wanted a short-term rolling-horizon histogram capability, for detecting the onset of an issue and attempting rollback to a thought-good state, add another such table and a circular buffer of recent res64 values. The in-memory or on-disk storage requirement for this is very modest, only one to several Kbytes. Ideally the table would be saved when checkpoints are, and loadable on program restarts.
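A minimal sketch of the rolling-horizon piece, assuming a plain fixed-size window in Python (the window length of 100 is arbitrary here):

[CODE]
# Keep a circular buffer of recent res64 values plus a 16x16 count table that is
# updated incrementally as residues enter and leave the window. At 4 bytes per
# counter plus a few bytes per buffered res64, this stays within the
# one-to-several-KB budget described above.
from collections import deque

class RollingDigitHistogram:
    def __init__(self, window=100):
        self.buf = deque(maxlen=window)              # recent res64 strings
        self.counts = [[0] * 16 for _ in range(16)]  # 16 positions x 16 values

    def _apply(self, res64, delta):
        for pos, ch in enumerate(res64.lower().zfill(16)):
            self.counts[pos][int(ch, 16)] += delta

    def push(self, res64):
        if len(self.buf) == self.buf.maxlen:
            self._apply(self.buf[0], -1)             # value about to fall out of the window
        self.buf.append(res64)                       # deque drops the oldest automatically
        self._apply(res64, +1)

rh = RollingDigitHistogram()
rh.push("0123456789abcdef")
[/CODE]

Saving the table at checkpoint time would just mean serializing these few hundred counters plus the buffer contents.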

Since the res64 values in primality tests for LL seed 4, LL seed 10, or PRP residue type 4 seed 3 are highly repeatable sequences in the early iterations, before the mod Mp starts to have effect, we would want to exclude them from statistics gathering so as not to distort the gathered distribution. Fortunately that is easy to do. Excluding the seed and the first 31 iterations' res64s appears sufficient for p < 2^32 to ensure the mod Mp is having effect in all the listed primality cases.
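A trivial sketch of that exclusion, assuming the interim residues are tagged with their iteration number:

[CODE]
# Skip the seed and the first 31 iterations' residues before tallying,
# per the p < 2^32 sufficiency argument above.
SKIP_ITERATIONS = 31

def residues_for_statistics(interim_res64s):
    """interim_res64s: iterable of (iteration, res64) pairs."""
    return [(it, r) for it, r in interim_res64s if it > SKIP_ITERATIONS]
[/CODE]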

I've experimented a bit with some statistical measures: mean, expected mean, binomial probability table, standard deviation (generally incorrect since the frequency is bounded by zero). In some pathological cases as few as 5 interim residues may be enough to identify an issue. This works independently of whether any of the residues are known-bad values; it's the pattern detection, that the digits are distinctly not pseudorandom but quite nonuniformly distributed, that flags them as suspicious.
I've crunched some res64 series from various runs of LL, PRP3, and P-1, separately, as follows. Per hex digit position in the res64, count the occurrences of each of the possible values. Plotting these as 2d histograms makes some interesting-looking patterns. It looks to me like novel repeating or cyclic behavior could be spotted by an algorithm within quite a small number of hex-digit-wise similar interim res64s. An example 2d histogram for 5 consecutive res64 outputs is included in the pdf attachment. This is fast enough that the onset of an error condition could be detected on the fly, and if multiple previous checkpoints were saved on disk, a retreat and resume from the last-thought-good state could be tried.

I think in these cases of a finite sample of res64s, the relevant probability function for a given digit position and given hex value is a binomial distribution, with probability 1/16 per trial and n = the number of res64 values. Computing the probabilities of differing numbers of occurrences for largish n leads to numerical problems (0, inf, NaN) pretty easily, even with floating-point rescaling and reordering of the computation. I'm considering rescaling to log(probability) to get around that.
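Here's a minimal sketch of that log-space rescaling, assuming Python's math.lgamma suffices for the factorial terms:

[CODE]
# log of the binomial pmf C(n,k) * p^k * (1-p)^(n-k), computed entirely in
# log space so that largish n does not overflow or produce 0/inf/NaN.
import math

def log_binom_pmf(k, n, p=1.0 / 16.0):
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)

# Example: a hex value showing up 40 times in 200 res64s at one digit position
# (expected ~12.5) -- a tiny probability, but no intermediate overflow.
print(log_binom_pmf(40, 200), math.exp(log_binom_pmf(40, 200)))
[/CODE]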


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached file: res64 digit frequency visualization.pdf (2.33 MB)

Last fiddled with by kriesel on 2021-03-02 at 19:06 Reason: updated for gpuowl v7.x P-1 developments; retry on anomaly indication