#56
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
Quote:
PrimeNet's track record for these lengthy computations, mostly done without ECC memory, is pretty good overall, absent flaky failing hardware, misconfigurations, or pilot-error issues such as using an incompatible thread count on a GPU or too small an FFT length for an exponent. Worst case, a bad LL test gets submitted and is probably identified years later by a double check and triple check. If I recall correctly, the percentage of bad residues at completion is increasing as we progress to higher exponents, and is currently around 4%?

Thank you Mihai for taking the initiative and pursuing a new avenue to detect and reduce the impact of computational errors along the way!

Last fiddled with by kriesel on 2017-08-09 at 19:39
#57
Sep 2003
2·5·7·37 Posts
There is a selection effect: strategic double checks are being done systematically on likely-bad exponents well ahead of the wavefront, so in the interim the error rate for higher exponents is artificially high. It will settle back down to normal as the advancing wavefront finishes double-checking all the routine, correct first-time tests.
#58
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1111010010000₂ Posts
Quote:
Something like e3,170j2,17 would represent roundoff error checked 170 times in the exponent's run, with excessive roundoff recovered from by retries 3 times; and the Jacobi symbol checked 17 times, with 2 recoveries.

One could also make a case for continuing despite a Jacobi retry failure; that's the same error rate we've had until now, without the test being implemented. These failures could also be counted, and would mark the result as somewhat suspect, an early candidate for double check, the more so as the count of separate error occurrences per exponent or per piece of hardware increases: e3,170j2,17J3.

More frequent error checking will detect errors more frequently, for systems of equal reliability. Counting and reporting the total number of Jacobi checks may allow compensating for that confounding effect.

Other possibilities I feel are inferior, such as negative numbers representing error types or counts in place of the residue, or non-hex characters in the residue. Who knows, "00000000DEADBEEF" might actually be a legit final residue sometime. And as an error flag it is vague: did it fail the Jacobi check, the zero-residue check, the repeating-twos check, or something else? Given enough residues (monkeys on typewriters), all sorts of things show up. Checking one GPU's logs, I found both "dead" and "beef", though not together; 0x887b4dead500ba11, for example. (Is that flagging a large zombie formal dance?)
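If a format like this were adopted, parsing it server-side would be straightforward. A minimal sketch, assuming the e/j/J layout proposed above; the regex and field names are my own illustration, not anything PrimeNet actually implements:

Code:
import re

# Proposed layout (per the post above): "e3,170j2,17J3" means roundoff was
# checked 170 times with 3 retry recoveries, Jacobi was checked 17 times
# with 2 recoveries, and 3 Jacobi retry failures were continued past anyway.
ERROR_COUNTS = re.compile(r"e(\d+),(\d+)j(\d+),(\d+)(?:J(\d+))?")

def parse_error_counts(s):
    m = ERROR_COUNTS.fullmatch(s)
    if m is None:
        raise ValueError("unrecognized error-count string: %r" % s)
    ro_retries, ro_checks, j_recov, j_checks, j_ignored = m.groups()
    return {
        "roundoff_retries": int(ro_retries),
        "roundoff_checks": int(ro_checks),
        "jacobi_recoveries": int(j_recov),
        "jacobi_checks": int(j_checks),
        "jacobi_failures_continued": int(j_ignored or 0),
    }

print(parse_error_counts("e3,170j2,17J3"))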
#59
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
Quote:
#60
Sep 2014
17₁₆ Posts
Quote:
#61
"Mihai Preda"
Apr 2015
5AC₁₆ Posts
The Jacobi symbol computation for the 75M range on a Ryzen 1700X (3.4 GHz) takes only 30 s single-core (hyperthreading disabled), better than on the 2.4 GHz Xeon where it takes 50 s (RAM likely plays a role too).

For gpuOwL this works nicely because the Jacobi check is done on the CPU "in the shadow" of the GPU kernels -- i.e. during the time the CPU would normally spend waiting for the enqueued GPU kernels to complete. As a batch of 20K iterations takes about 45 s on the GPU, the Jacobi check introduces no additional delay. Of course, the check is not free, because it does take some CPU away from mprime. By default, the check is done every 200K iterations (10 × logstep).

I dropped the -offset option from gpuOwL 0.6 (the only, implicit offset is now 0). Before doing this I created a branch "offset" on github https://github.com/preda/gpuowl/tree/offset in case anybody wants to look. The reason for dropping the offset is that I felt the resulting simplification outweighs the offset benefit.

Today I added a new option to gpuOwL, -supersafe. This runs every iteration *twice*, using independent memory buffers (so that a memory corruption affects the two computations differently), and checks after each batch (20K iterations) that the results are identical -- otherwise it retries. The two parallel runs use the same offset (0). The drawback of -supersafe is that it's twice as slow. The benefit is that it is very strongly protected against hardware errors such as memory corruption (at any level: global/cache/register) or non-systematic arithmetic corruption (if such a kind of hardware error exists).

I have a GPU that went bad; I could not use it for LL anymore. Now, with -supersafe, it's twice as slow, but I trust the results again. (Also, I can drop the underclock I was using in the hope of improving reliability.)

Last fiddled with by preda on 2017-08-10 at 11:15
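A minimal sketch of the -supersafe idea as described above, with plain CPU arithmetic standing in for the GPU kernels; the function names and the retry policy are my illustration, not gpuOwL's actual code:

Code:
# Two independent "buffers" carry the same LL sequence; after each batch the
# results are compared, and on a mismatch both roll back to the last agreed
# state. On a deterministic CPU the compare always succeeds; on flaky
# hardware it catches corruption that hits the two buffers differently.
def ll_batch(s, M, n):
    for _ in range(n):
        s = (s * s - 2) % M
    return s

def supersafe_ll(p, batch=20_000):
    M = (1 << p) - 1
    a = b = checkpoint = 4
    done = 0
    while done < p - 2:
        n = min(batch, p - 2 - done)
        ra, rb = ll_batch(a, M, n), ll_batch(b, M, n)
        if ra != rb:                    # mismatch: redo the batch
            a = b = checkpoint
            continue
        a = b = checkpoint = ra
        done += n
    return a                            # 0 iff 2^p - 1 is prime

print(supersafe_ll(127) == 0)           # True: M127 is prime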
#62
"Mihai Preda"
Apr 2015
2²×3×11² Posts
A log excerpt; the GPU is a 390X, running with -supersafe and Jacobi. (The Jacobi check, I would say, is not really needed when using -supersafe.) The ms/iter figure is the time for the "double" iteration.
Quote:
Last fiddled with by preda on 2017-08-10 at 11:18 |
#63
Sep 2003
101000011110₂ Posts
I started a thread about this in the Number Theory Discussion Group subforum.
I hope that number theory experts can not only confirm the theoretical soundness of this idea (it seems so simple; how did it elude us for so long?) but maybe even suggest additional, independent checks on interim or final Lucas-Lehmer residues, if any exist.
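For reference, here is my reading of why the check works: with s[i+1] = s[i]^2 - 2 (mod Mp), we get s[i+1] - 2 = (s[i] - 2)(s[i] + 2), and s[i] + 2 = s[i-1]^2 is a square, so for i >= 1 the Jacobi symbol (s[i] - 2 | Mp) never changes. Starting from s[1] = 14 it always equals -1, so any interim residue giving +1 or 0 signals an error. A small self-contained spot check (the jacobi routine is textbook; none of this is code from the linked thread):

Code:
# Textbook binary Jacobi symbol (a | n), n odd and positive.
def jacobi(a, n):
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

p = 4423                     # a known Mersenne prime exponent
M = (1 << p) - 1
s = 4
for i in range(200):         # spot-check the first 200 iterations
    s = (s * s - 2) % M
    assert jacobi(s - 2, M) == -1, "check would fire at iteration %d" % (i + 1)
print("Jacobi invariant holds for the first 200 iterations")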
#64
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1E90₁₆ Posts
Quote:
I'm curious about your choice to run the Jacobi check on the CPU rather than the GPU. Was this a design decision to do the check on different, probably more reliable hardware than your problematic GPU, or on different hardware in general, or for ease of programming, or something else?

Removing the nonzero offset seems unfortunate. The two supersafe runs may need to use the same offset, but a nonzero same offset should work. Nonzero offset is a feature that promotes utility in double checks.
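For readers unfamiliar with the offset, here is my understanding of the idea (this sketch is mine, not gpuOwL's code, and I am assuming the offset is a shift count in the prime95 sense): carry the LL value multiplied by 2^k mod Mp. Squaring doubles k (mod p, since 2^p = 1 mod Mp), and the "-2" becomes "-2·2^k", so two runs with different offsets exercise entirely different bit patterns yet must agree after unshifting, which is exactly what makes nonzero offsets valuable for double checks:

Code:
def ll_residue_shifted(p, shift):
    M = (1 << p) - 1
    k = shift % p
    s = (4 << k) % M                  # stored value = true value * 2^k (mod M)
    for _ in range(p - 2):
        s = s * s % M                 # squaring doubles the shift
        k = 2 * k % p                 # 2^p == 1 (mod M), so shifts live mod p
        s = (s - (2 << k)) % M        # the "-2" of the LL step, shifted
    return (s << (p - k) % p) % M     # unshift: multiply by 2^(p-k)

# Runs with different offsets must agree after unshifting:
print(ll_residue_shifted(89, 0) == ll_residue_shifted(89, 37) == 0)  # M89 prime
print(ll_residue_shifted(97, 0) == ll_residue_shifted(97, 11))       # M97 composite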
#65
Serpentine Vermin Jar
Jul 2014
D4E₁₆ Posts
Quote:
At the end of the test those temp files could be deleted. Probably the only reason Prime95 doesn't do that now has everything to do with the good old days when it started, back in '96: drive space (and speed) were factors, and saving a bunch of temp files along the way could have caused issues. A 79M exponent has a temp file of about 10 MB; keeping one every 10%, maxing out at 100 MB for that exponent size, wouldn't be terrible (especially if it was optional, under the guise of "this could potentially save a lot of time if it finds an error"). Something to think about, anyway.

EDIT: Oh, and that reminds me of another idea that's been brought up before: have PrimeNet save partial residues at fixed percentages along the way as well, so during a double check you'd know sooner whether a mismatch happened, or at what point along the way the mismatches began.

Last fiddled with by Madpoo on 2017-08-10 at 21:18
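A back-of-the-envelope sketch of the retention policy being suggested, assuming a hypothetical directory and file-name scheme and a copy of the normal save file at each 10% mark; none of this is Prime95's actual save-file handling:

Code:
import os, shutil

def retain_checkpoint(save_path, iteration, total_iters, keep_dir="interim"):
    # Keep a copy of the save file the first time each 10% mark is passed.
    # For a 79M exponent each file is ~10 MB, so ten of them stay near the
    # 100 MB cap mentioned above; remove keep_dir once the test finishes.
    milestone = iteration * 10 // total_iters        # 0 .. 10
    if milestone == 0:
        return
    dest = os.path.join(keep_dir, f"ckpt_{milestone * 10:03d}pct.save")
    if not os.path.exists(dest):                     # first crossing only
        os.makedirs(keep_dir, exist_ok=True)
        shutil.copy2(save_path, dest)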
#66
"Forget I exist"
Jul 2009
Dartmouth NS
8,461 Posts
Quote:
https://en.wikipedia.org/wiki/Legendre_symbol etc.