mersenneforum.org Mlucas v19 available

 2019-12-01, 23:09 #1 ewmayer ∂2ω=0     Sep 2002 República de California 3²·1,303 Posts Mlucas v19 available Mlucas v19 has gone live. Use this thread to report bugs, build issues, and for any other related discussion. Last fiddled with by Uncwilly on 2020-11-28 at 20:51
 2019-12-03, 01:19 #2 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 2²·1,619 Posts Haven't tried it yet, but congrats on getting it out.
2019-12-03, 21:58 #3 ewmayer ∂2ω=0     Sep 2002 República de California 3²×1,303 Posts

Quote:
 Originally Posted by kriesel Haven't tried it yet, but congrats on getting it out.
Thanks. Meanwhile I have discovered a bug related to the new PRP-handling logic, of the kind I expected would be shaken out by further testing ... this one specifically affects exponents really close to an FFT-length breakover point (I discovered it when I fired up a first-time PRP test of M96365419, which is very close to the 5120K-FFT exponent limit). It turns out the Gerbicz-check-related breaking of the usual checkpointing interval into multiple smaller subintervals (at the end of each of which we update the G-checkproduct) breaks the is-roundoff-error-reproducible-on-retry logic. It's a simple fix; I just uploaded updated versions of the release tarball and ARM prebuilt binaries, but folks who previously built and are running the Dec 1 code snapshot can use the simpler expedient of an incremental rebuild-and-relink of the single attached sourcefile.
Attached Files
 mers_mod_square.c.bz2 (27.5 KB, 357 views)

Last fiddled with by ewmayer on 2019-12-03 at 22:00

2019-12-05, 21:51 #4 ewmayer ∂2ω=0     Sep 2002 República de California 3²·1,303 Posts

An interesting subtheme re. the newly-added PRP assignment-type support and the Gerbicz check ... shortly after the initial v19 release, I got e-mail from George about the importance of adding redundancy to the G-checking mechanism:
Quote:
 On Dec 4, 2019, at 7:57 AM, George Woltman wrote: You are basically examining the code looking for any place a one-bit error could doom your result. This is most likely to occur right after a Gerbicz compare. In my first implementation, after the compare succeeded I threw away the equal value and only had the one value in memory (and wrote a save file with one value). If a cosmic ray hit during that time period, the end result would be wrong. I now keep the compared (and equal) value in memory. One starts the next Gerbicz comparison value, the other continues the PRP exponentiation. Similarly when the computation ends, I generate a residue from both values and check that the res64s match.

Well, let's review how my code does things, say, starting from a post-interrupt savefile-read:

1. Read the PRP residue into array a[], and the accumulated G-checkproduct into b[]. Both of these residues are written to savefiles together with associated full-residue checksums - I use the Selfridge-Hurwitz residues (full-length residue mod (2^35-1) and mod (2^36-1)) for that - and the stored checksums are compared with those recomputed during the read-from-file.

2. Do an iteration interval leading up to the next savefile update, 10k or 100k mod-squarings of a[]. Every 1000 squarings update b[] *= a[]. The initial b[] is in pure-integer form; on subsequent mul-by-a[] updates the result is left in the partially-fwd-FFTed form returned by the carry step, i.e. fwd-weighted and initial-fwd-FFT-pass done.

3. On final G-checkproduct update of the current iteration interval, 1000 iterations before the next savefile write, save a copy of the current G-checkproduct b[] in a third array c[], before doing the usual G-checkproduct update b[] *= a[].

4. At end of the current iteration interval, prior to writing savefiles, do 1000 mod-squarings of c[] and compare the result to b[]. If mismatch, no savefiles written, instead roll back to last 'good' G-checkproduct data, which in my current first implementation means the previous multiple of 1M iterations.
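To illustrate the step-1 checksums: reducing a residue mod a Mersenne-like modulus 2^m - 1 is cheap because 2^m ≡ 1 (mod 2^m - 1), so one can just fold successive m-bit digits of the number into an accumulator. The sketch below is mine, not the actual Mlucas routine; the function name and the little-endian byte ordering of the input are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Reduce a little-endian byte-array residue mod (2^m - 1), for 8 < m < 57,
   by folding successive base-2^m digits; uses 2^m == 1 (mod 2^m - 1). */
uint64_t res_mod_2m1(const uint8_t *x, size_t nbytes, unsigned m) {
	const uint64_t mask = (1ull << m) - 1;	/* == 2^m - 1 */
	uint64_t acc = 0, cur = 0;
	unsigned nbits = 0;
	size_t i;
	for (i = 0; i < nbytes; i++) {
		cur |= (uint64_t)x[i] << nbits;	/* append next 8 bits */
		nbits += 8;
		if (nbits >= m) {	/* fold one complete base-2^m digit */
			acc += cur & mask;
			acc = (acc & mask) + (acc >> m);	/* keep acc small */
			cur >>= m;
			nbits -= m;
		}
	}
	acc += cur;	/* leftover high digit */
	acc = (acc & mask) + (acc >> m);
	acc = (acc & mask) + (acc >> m);	/* second fold catches the edge case */
	return (acc == mask) ? 0 : acc;	/* 2^m - 1 == 0 (mod 2^m - 1) */
}
```

With this, the two Selfridge-Hurwitz values for a packed residue r[] of nbytes bytes would be res_mod_2m1(r, nbytes, 35) and res_mod_2m1(r, nbytes, 36).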

So during the above, the G-checkproduct accumulator b[] is vulnerable to a 1-bit error, of the kind which would not show up, say, via a roundoff error during the ensuing *= a[] FFT-mul update.

So, what to do? Since the b[] data are kept in partially-fwd-FFTed form for most of the iteration interval, the Selfridge-Hurwitz (or similar CRC-style) checksums can't be easily computed from that. I think the easiest thing would be, every time I do an update b[] *= a[], do a memcpy to save a separate copy of the result, and compare that vs b[] prior to each update of the latter.

[followup e-mail a few hours later] Additional thoughts:

We are essentially trying to guard against a "false G-check failure", in the sense that the G-check might fail not because the PRP-residue array a[] had gotten corrupted but rather because the G-checkproduct accumulator b[] had. So every time we update b[] (or read it from a savefile) we also make a copy c[] = b[], and prior to each b[] *= a[] update we check that b == c.

OK, but if at some point we find b != c, how can we tell which of the 2 is the good one? Obvious answer is to compute some kind of whole-array checksum at every update. Since post-update b[] may be in some kind of partially-FFTed state (that is the case for my code) the checksum needs to not assume integer data - perhaps simply treat the floats in a[] as integer bitfields. Would something as simple as computing a mod-2^64 sum of the uint64-reinterpretation-casted elements of a[] suffice, do you think?

Further, any such checksum will be a much smaller bit-corruption target than b[], but to be safe one should probably make at least 2 further copies of *it*, call our 3 redundant checksums s1,s2,s3, then the attendant logic would look something like this:

Code:
// Mod-2^64 sum of elements of double-float array a[], treated as uint64 bitfields:
uint64 sum64(double a[], int n) {
	int i;
	uint64 sum = 0ull;
	for(i = 0; i < n; i++)
		sum += *(uint64 *)(a+i);	// Type-punning cast of a[i]
	return sum;
}
// Simple majority-vote consensus:
uint64 consensus_checksum(uint64 s1, uint64 s2, uint64 s3) {
	if(s1 == s2) return s1;
	if(s1 == s3) return s1;
	if(s2 == s3) return s2;
	return 0ull;
}
int n;		// FFT length in doubles
double a[], b[], c[];	// a[] is PRP residue; b,c are redundant copies of G-checkproduct array
uint64 s1,s2,s3;	// Triply-redundant whole-array checksum on b,c-arrays
...
[bunch of mod-squaring updates of a[]]
// Prior to each b[]-update, check integrity of the array data:
if(b[] != c[]) {	// Houston, we have a problem
	s1 = consensus_checksum(s1,s2,s3);
	if(s1 == sum64(b,n)) {
		/* b-data good: no-op */
	} else if(s1 == sum64(c,n)) {	// c-data good, copy back into b
		b[] = c[];
	} else {	// Catastrophic data corruption
		[roll back to last-good G-check savefile]
	}
}
b[] *= a[];	// G-checkproduct update
s1 = s2 = s3 = sum64(b,n);	// Triply-redundant whole-array checksum update
c[] = b[];	// Make a copy
And if that is an effective anti-corruption strategy, the obvious question is, why not apply it to the main residue array a[] itself? Likely performance impact is one issue - the cost of making a copy of a[] and of updating the whole-array checksum at each iteration, while O(n) and thus certainly smaller than that of an FFT-mod-squaring, is likely going to be nontrivial, a few percent I would guess.

 2019-12-05, 23:24 #5 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 6476₁₀ Posts After the dust settles, an update of the Mlucas save file format description to final v18, and to v19 PRP, would be appreciated. For your convenience, https://www.mersenneforum.org/showpo...91&postcount=2
2019-12-06, 02:25 #6 ewmayer ∂2ω=0     Sep 2002 República de California 10110111001111₂ Posts

Quote:
 Originally Posted by kriesel After the dust settles, an update on the Mlucas save file format description to final v18, and to v19 PRP would be appreciated. For your convenience, https://www.mersenneforum.org/showpo...91&postcount=2
Simple: PRP-test savefiles tack on an additional instance of the last 4 items in the 'current Mlucas file format' list - a full-length residue byte-array (this one holding the accumulated Gerbicz checkproduct) and the 3 associated checksums, totaling a further 18 bytes. Thus, where an LL-savefile read reads one such residue+checksum data quartet, then 3 bytes for the FFT length in Kdoubles which the code was using at the time of the savefile write (*note* your quote in post #2 needs to change that from 4 to 3 bytes), then 8 bytes for the circular shift to be applied to the (shift-removed) savefile residue, a PRP-savefile read follows those reads with another read of a residue+checksum data quartet.
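For concreteness, here is a sketch of how one might parse a residue+checksum quartet per the above description. The struct and function names are mine; I am assuming the 18 checksum bytes break down as the 8-byte Res64 plus the two 5-byte Selfridge-Hurwitz residues, and that all fields are little-endian. The authoritative layout is whatever read|write_ppm1_savefiles in Mlucas.c does.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical view of one residue+checksum "quartet"; names are mine. */
typedef struct {
	const uint8_t *residue;	/* full-length residue byte-array */
	uint64_t res64;		/* low 64 bits of residue: 8 bytes */
	uint64_t res35m1;	/* residue mod 2^35-1: 5 bytes */
	uint64_t res36m1;	/* residue mod 2^36-1: 5 bytes */
} quartet_t;

/* Assemble an unsigned value from nbytes little-endian bytes. */
static uint64_t get_le(const uint8_t *p, int nbytes) {
	uint64_t v = 0;
	for (int i = 0; i < nbytes; i++)
		v |= (uint64_t)p[i] << (8*i);
	return v;
}

/* Parse one quartet; returns the number of bytes consumed (nbytes + 18). */
static size_t read_quartet(const uint8_t *buf, size_t nbytes, quartet_t *q) {
	q->residue = buf;
	q->res64   = get_le(buf + nbytes,      8);
	q->res35m1 = get_le(buf + nbytes + 8,  5);
	q->res36m1 = get_le(buf + nbytes + 13, 5);
	return nbytes + 18;
}

/* Per the post, the savefile tail would then read as:
   [quartet #1: PRP residue] [3-byte FFT length in Kdoubles]
   [8-byte circular shift] [quartet #2: Gerbicz checkproduct, PRP only] */
```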

2019-12-06, 07:18 #7 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 1100101001100₂ Posts

Quote:
 Originally Posted by ewmayer Simple, PRP test savefiles tack on an additional version of the last 4 items in the 'current Mlucas file format' list - full-length residue byte-array (this one holding the accumulated Gerbicz checkproduct) and the 3 associated checksums totaling a further 18 bytes. Thus, where an LL-savefile read reads one such residue+checksum data quartet, 3 bytes for FFT-length-in-Kdoubles which the code was using at time of savefile write (*note* your quote in post #2 needs to change that from 4 to 3 bytes), 8 bytes for circular-shift to be applied to the (shift-removed) savefile residue, a PRP-savefile read follows those reads with another read of a residue+checksum data quartet.
Thanks; https://www.mersenneforum.org/showpo...91&postcount=2 is updated and extended.

2019-12-06, 20:13 #8 ewmayer ∂2ω=0     Sep 2002 República de California 26717₈ Posts

Quote:
 Originally Posted by kriesel Thanks; https://www.mersenneforum.org/showpo...91&postcount=2 is updated and extended.
FYI, the "master reference" for savefile format is, as always, the actual code - the relevant functions are read|write_ppm1_savefiles in the Mlucas.c source. ('ppm1' is short for 'Primality-test and P-1' ... the latter as yet unsupported, but we remain ever-optimistic ... the 2-input FFT-modmul support added in v19 for Gerbicz checking will help in that regard, since p-1 stage 2 needs that capability.)

2020-01-03, 03:05 #9 ewmayer ∂2ω=0     Sep 2002 República de California 26717₈ Posts

*** Patch *** 03 Jan 2020: This patch adds one functionality-related item, namely adding redundancy to the PRP-test Gerbicz-check mechanism to prevent data corruption in the G-check residue from causing a "false Gerbicz-check failure", i.e. a failure not due to a corrupted PRP-test residue itself. This more or less follows the schema laid out in post #4.

I have also patched another logic bug related to roundoff-error retry; this one was occasionally causing the run to switch to the next-larger FFT length when encountering a reproducible roundoff error, rather than first retrying at the current FFT length but with a shorter carry-chain recurrence computation for the DWT weights. Not fatal, just suboptimal in terms of CPU usage.

NOTE ALSO that I hit a Primenet-server-side bug on Dec 31 when I used the primenet.py script to submit my first batch of v19 LL-test results (my previous v19 submissions were all PRP-test ones). The server code was incorrectly expecting a Prime95-style checksum as part of such results lines. The really nasty part of this was that I almost missed it: until now, the primenet.py script grepped the page resulting from each attempted result-line submission for "Error code"; if it found that, it emitted a user-visible echo of the error message, and the attempted submission line was not copied to the results_sent.txt file for archiving. In this case (I only saw this after retrying one of the submits via the manual-test webform) there was "Error" on the returned page, but it was not followed by "code", so the script treated the submissions as successful. I only saw the problem when I checked the exponent status page for one of the expos and saw that no result had been registered. James Heinrich has fixed the server-side issue, and to be safe I've tweaked the primenet.py script to grep only for "Error". But if you used the script to submit any v19 LL-test results (PRP tests were being correctly handled at both ends) prior to the current patch, please delete the corresponding lines from your results_sent.txt file and retry submitting using the patched primenet.py file. To be safe, check the exponent status at mersenne.org to make sure your results appear there.

I just uploaded updated versions of the release tarball and ARM prebuilt binaries, but folks who previously built and are running the Dec 3 code snapshot can use the simpler expedient of an incremental rebuild-and-relink of the attached Mlucas.c sourcefile. The also-attached tweaked primenet.py file - matching the updated one in the release tarball - is not necessary now that James has made the above-described server-side bugfix, but better safe than sorry, I say.
Attached Files
 Mlucas.c.bz2 (74.8 KB, 320 views) primenet.py.bz2 (7.3 KB, 319 views)

 2020-01-16, 14:14 #10 Jumba   Jan 2020 1₁₀ Posts Version 19.0 error I'm getting the following error after getting to the 100% mark: ERROR: at line 2313 of file ../src/Mlucas.c Assertion failed: After short-div, R != 0 (mod B) Nothing has been written to the results.txt file since I started the run a week ago. I can restart the process, and it resumes from just before the end, but still spits out the same error after a couple of minutes.
2020-01-16, 20:15 #11 ewmayer ∂2ω=0     Sep 2002 República de California 3²×1,303 Posts

Quote:
 Originally Posted by Jumba I'm getting the following error after getting to the 100% mark: ERROR: at line 2313 of file ../src/Mlucas.c Assertion failed: After short-div, R != 0 (mod B) Nothing has been written to the results.txt file since I started the run a week ago. I can restart the process, and it resumes from just before the end, but still spits out the same error after a couple minutes.
That looks like you've found a bug in the PRP-residue postprocessing code ... could you upload your p[exponent] savefile to Dropbox or similar site so I can download it and re-run the final few (whatever) iterations within a debug session? PM me the resulting download location and the worktodo.ini file entry.

In the meantime, if you've not already done so, I suggest you switch the top 2 entries in worktodo.ini and start on the next assignment. By the time that finishes you can grab a bug-patched version of the code, which should allow you to successfully complete your above run.

Oh, and your data should be fine - like I said, this appears to strictly be a postprocessing bug.

Last fiddled with by ewmayer on 2020-01-16 at 20:16
