mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2018-06-24, 13:15   #1
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×1,481 Posts
Default Mersenne software error detection or correction opportunities

Beyond selecting reliable hardware, good systems management including occasional hardware reliability testing, and user alertness to signs of degrading reliability at the system level (event logs, crashes) or at the application level (erroneous factors, mismatched double checks, etc), what can be done to increase error detection and correction in Mersenne computing?

What algorithmic strategies can be used? Are there more?

Primality testing:
The Jacobi check has been applied to prime95 / mprime 's LL testing, and to the LL version of gpuOwL. Although computing it every iteration could provide a 75% probability of catching errors, that is too computationally expensive, dwarfing the LL iteration time. So instead it's computed much less frequently and provides a 50% chance of catching an error in the LL sequence between checks. http://www.mersenneforum.org/showpos...2&postcount=36
http://www.mersenneforum.org/showpos...3&postcount=30 & related discussion
http://www.mersenneforum.org/showthr...t=22471&page=4 That chance is independent of the other checks, such as checking roundoff error periodically along the way, sum of inputs vs outputs along the way, and double-checking interim or final 64-bit residues.

Gerbicz' efficient idea for error checking PRP testing has led to the conversion of gpuOwL from LL testing on OpenCl gpus to PRP, and the addition of PRP to mprime / prime95. Other applications, such as Mlucas, CUDALucas, and clLucas, have not yet implemented Jacobi checking or PRP with Gerbicz check.

Gerbicz has also proposed a 512-bit residue recently with application as factors are found and cofactors tested. http://www.mersenneforum.org/showthr...366#post490366

What can be done for P-1?
Prime95 uses roundoff checking in its P-1 computations and adjusts fft length for the long multiplications accordingly, and uses checksums in its save files. Jacobi checks are not implemented in the prime95 P-1 factoring.
On the gpu, CUDAPm1 does roundoff checking, but does not to my knowledge implement sum of inputs checking, which may not be applicable. It does not implement the Jacobi check.

The Jacobi check appears to be suitable for application to parts of P-1 calculations. In some portions it seems to be too computationally expensive as a check; in others, it looks to be sufficiently quick to pose a possible net gain.

Jacobi checking (and perhaps other check types) could be made optional, so that current speed is preserved, but users questioning the reliability of their configuration because of low factor yield could turn on some error checking and either detect some errors or be reassured somewhat.

Gerbicz refers to a possible check on P-1, when there's a known factor, which is a very considerable restriction, in http://www.mersenneforum.org/showpos...&postcount=252

Regardless of factorization, there are a number of simple checks that can be added to P-1. CUDAPm1 computes and periodically outputs interim 64-bit residues. Some values are known to indicate problems. Residue 0x0000000000000000 in stage 1 or 2, 0x0000000000000001 in stage 1, and residues in stage 2 that repeat the final stage 1 residue are all indicative of specific problems and reason to halt or rollback to the previous save file and try again. The gcds are done on the cpu, and above certain gpu-dependent values for exponent, halt rather than complete. Save files retained from immediately before and after gcd calculations would help. (In some cases the gcd fails, and CUDAPm1 halts silently. In some cases a gpu can't perform stage 2 for an exponent it can complete stage one on. Other gpu models fail at exponent levels in stage 1 apparently lower than in stage 2 on the same gpu.)
More error checking and handling on CUDA calls could be implemented in a slower error-checking version. Refusing to attempt a stage that exceeds the memory available would avoid some bad-residue cases.

Can we compare interim residues of parallel runs? (Presumably only if all of exponent, B1, B2, d, e, nrp match, if at all, in stage 2, or if exponent and B1 match, in stage 1.) Parallel runs are not justified in the normal course of production runs by the few factors that may be missed due to error, but may be justified as an occasional check made on hardware reliability.

Trial factoring?
Trial factoring is so fast per candidate factor, until deeply factoring exponents, that it seems to me it's not worth spending much effort on error correction. Maybe there's something to be gained there. Run times can approach an hour between console updates at high bit levels, corresponding to weeks per bit level.

PrimeNet server user summary:
I'd like to see a bit more of a scoreboard, for errors, per user, by system and work type, available to the user on the Primenet Server's web pages, missed factors, bad factors, bad residues, expirations. If there were total counts and error counts, or error rate, per userid/computerid combination, tabulated, it would make it easier for users to identify low reliability hardware, and take action to reduce error rate. By the user incorporating a unique gpuid into computerid, granularity down to the individual gpu could be provided in the error rate report for a user.
kriesel is offline   Reply With Quote
Old 2018-06-24, 14:26   #2
science_man_88
 
science_man_88's Avatar
 
"Forget I exist"
Jul 2009
Dumbassville

26·131 Posts
Default

Guessing you know -2,-1,0,1,2 before last iteration will all get stuck at an unchanging residue for the rest of the test. Modified LL (starting from a real LL) can be used to check factors as well but it's extremely inefficient.
science_man_88 is offline   Reply With Quote
Old 2018-06-24, 15:07   #3
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22·1,481 Posts
Default

Yes, and thanks for the reminder on +/-1, and hinting I had left out residue-validity checks from primality testing.

For conciseness below I'm using M as the mersenne number's binary representation.
In algebra, 1 if it occurs gives -1 as iteration result; -1 repeats.
But doesn't -1 represent as M-1?
(M-1)^2-2 = M^2 -2M +1 -2
M^2-2M-1 mod M is -1
Wouldn't -1 be represented as M-1 again?
Or does the -1 stay in the FP convolution where negative values are native, not wrapped around modulo M?
Has either been observed in practice? (64-bit residue 0xfffffffffffffffe, not observed in practice to my knowledge.)

+/-2 for s_j is very bad, yields (+/-2)^2-2 = 2. 2 then repeats to the last iteration, unless another error occurs. This has been observed.

In CUDALucas 2.06beta April, the repeating residue 0xfffffffffffffffd was observed, after a 2 or -2 in 2.05 was continued with 2.06. That's the 64-bit residue we'd expect for (unsigned) M-2 mod M. And it repeated.

I think most known-bad residue cases are already checked for in prime95 and CUDALucas. Not so in clLucas as I recall.
The general repeating residue, not otherwise a special value, is not to my knowledge checked for in any primality test software yet.

Zero 64-bit residue repeating would be a sign probably of multiple errors reoccurring.

Lists of known-bad residues I have collected are:
'cllucas', '0x0000000000000002, 0xffffffff80000000',
'cudalucas', '0x0000000000000000, 0x0000000000000002, 0xfffffffffffffffd',
'cudapm1', '0x0000000000000000, 0x0000000000000001',
'gpuowl', '0x0000000000000000',

The programs don't do algebra, they do computations, and need to check for the binary representations that are present.
So instead of checking for -1 or -2, they check for the least 64 bits of M-1 or M-2.
The clLucas second value is a special case that surfaces when fft length used is much too short, as I recall.

Last fiddled with by kriesel on 2018-06-24 at 15:20
kriesel is offline   Reply With Quote
Old 2018-08-06, 17:08   #4
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22·1,481 Posts
Default legitimate unusual residues in LL

If counting LL iterations from s0=4, such that the final iteration is numbered p-2, iteration p-3 in the case of a Mersenne prime produces +- 2(p+1)/2. At p>127, this produces res64=0x0000 0000 0000 0000 or 0xffff ffff ffff ffff legitimately. Error detection must allow for these in LL programs. see also http://www.hoegge.dk/mersenne/resultspenultimate.txt
http://www.hoegge.dk/mersenne/penultimate.txt
http://www.mersenneforum.org/showthread.php?t=5862
kriesel is offline   Reply With Quote
Old 2018-08-06, 17:36   #5
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10111001001002 Posts
Default Are there special residues near the end of PRP-3, like there are for LL?

I wonder if there are special ending 64-bit residues, for Mersenne primes, that might otherwise appear to be indicative of errors, during PRP-3 sequences.
kriesel is offline   Reply With Quote
Old 2018-08-06, 17:50   #6
GP2
 
GP2's Avatar
 
Sep 2003

258510 Posts
Default

Quote:
Originally Posted by kriesel View Post
I wonder if there are special ending 64-bit residues, for Mersenne primes, that might otherwise appear to be indicative of errors, during PRP-3 sequences.
Gerbicz error correction for PRP-3 catches a much broader category of error than obviously anomalous bit patterns. So it's probably a moot point. Any special 64-bit residues would be corrected out of existence before you saw them.
GP2 is offline   Reply With Quote
Old 2018-08-06, 21:15   #7
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×1,481 Posts
Default

Quote:
Originally Posted by GP2 View Post
Gerbicz error correction for PRP-3 catches a much broader category of error than obviously anomalous bit patterns. So it's probably a moot point. Any special 64-bit residues would be corrected out of existence before you saw them.
Gerbicz error detection and recomputation of errored spans of iterations may be pretty bulletproof and is very useful, but it too has limits, at least as implemented and when running on significantly unreliable hardware.

I have one hardware case, which provided a severe test for the Gerbicz check, an old Athlon system, that was sufficiently unreliable in long error runs, that it went far beyond maxing out the prime95 error check counts, to in some cases, logging numbers of consecutive error iterations longer than the prime95 Gerbicz check interval. In that case, some of the errored iterations appear to me to escape recomputation. The Gerbicz check is still in that case effective enough that the retreat on iteration number it makes, substantially lengthens the time estimate to completion, and it logs lots of Gerbicz detected errors and retreats to earlier iteration numbers. Percent complete indicated gradually ramps up, and then drops back. Eventually it reached a point where no net progress was being made in a month, at ~6.8% on a 77M exponent PRP. It was powered off at that point. I'm not sure looking at its logs would be worth Prime95's while, but they could be made available.

A quick perl test program indicates the second to last iteration of a Mersenne prime's PRP3 is -3 (actually Mp-3; 0x[1 or 7] f...fc) at least at small residue < 264 implying exponent <64.
kriesel is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Available software for pursuing mersenne primes kriesel GPU Computing 48 2020-01-14 02:56
LL GPU error-detection side computation preda GPU Computing 14 2017-05-09 07:47
@ Supermods: Error Correction please Raman Forum Feedback 72 2013-06-22 07:24
Software error or hardware error GuloGulo Software 3 2011-01-19 00:36
Trial division software for Mersenne SPWorley Factoring 7 2009-08-16 00:23

All times are UTC. The time now is 22:10.


Fri Dec 3 22:10:29 UTC 2021 up 133 days, 16:39, 0 users, load averages: 1.52, 1.56, 1.47

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.