#1
"Mihai Preda"
Apr 2015
5AC₁₆ Posts
It appears one of my GPUs recently became less reliable than before -- once in a while (about every 12hours) I get "Error is too large; retrying", with the retry producing a different, plausible-looking result, and it keeps going from there.
This got me thinking about how to make better use of unreliable hardware. Let's say - the probability to get a correct result in any one iteration is "p", then - the probability to get a correct result after N iterations is p^N - which is approximated with 1 - N*(1 - p) when N*(1-p) is small (close to 0). In short, the probability to have a wrong LL result grows linearly with the number of iterations N. Even generally reliable hardware gets into trouble as N grows. As an example, a GPU which produces 80% correct for a 75M exponent, would produce about 40% correct for a 300M exponent (because 0.8**4 == 0.4), or less. Last fiddled with by kladner on 2018-06-14 at 02:35 |
#2
"Forget I exist"
Jul 2009
Dartmouth NS
8,461 Posts
Quote:
#3
"Mihai Preda"
Apr 2015
2²×3×11² Posts
The classical way to "validate" an LL result is the double check: if two independent LL runs produce the same result, it is extremely unlikely that the result is wrong. (The space of LL results is huge -- even the space of 64-bit residues is huge -- and assuming a mostly uniform distribution of wrong results over this space, the probability of two erroneous LLs matching "by chance" is very small.)
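To put a rough number on "very small": under that uniformity assumption, two independently wrong runs agree on a 64-bit residue with probability about 2^-64 (illustrative figures, not measured data):

Code:
# Chance that a classical double check falsely "verifies" a result,
# assuming wrong 64-bit residues are spread roughly uniformly.
q = 0.2                          # example: probability one run is wrong
false_verify = q * q * 2.0**-64  # both runs wrong AND matching by chance
print(false_verify)              # ~2.2e-21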
But what if my GPU, for some big exponent range, displays a reliability of 20%? Then most of the results would be wrong. Even if later disproved by double checks, I would call the work of this GPU useless or even negative.

The situation changes radically if the GPU itself applies iterative double-checking, i.e. it double-checks every iteration along the way. The probability of an individual iteration being correct is extremely high (e.g. 0.99999998 for the previous example of 20% reliability at an 80M exponent). If the results of running the iteration twice [with different offsets] match, then we are "sure" the iteration result is correct. Thus from a "bad" GPU we get extremely reliable LL results.

I would argue such a result -- let's call it "iteratively self-double-checked" -- is almost as strong as an independent double check. It does take twice the work, though in this aspect it is no different from a double-checked LL (twice the work as well).
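A minimal sketch of that idea (hypothetical helper names; a real GPU implementation would evaluate each squaring via two independent paths, e.g. different shift offsets, rather than calling two functions):

Code:
# "Iteratively self-double-checked" LL: every squaring step is computed
# twice via independent paths and only accepted when both agree.
# ll_step_a / ll_step_b are hypothetical stand-ins for one LL iteration
# x -> x^2 - 2 (mod 2^p - 1), computed with different internal offsets.
def self_checked_ll(p, ll_step_a, ll_step_b):
    x = 4
    for _ in range(p - 2):
        a, b = ll_step_a(x), ll_step_b(x)
        while a != b:                             # at least one path erred
            a, b = ll_step_a(x), ll_step_b(x)     # redo this iteration
        x = a
    return x == 0                  # x == 0  <=>  2^p - 1 is prime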
#4
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Based on much personal experience with this sort of thing, 2 side-by-side runs with different shifts or slightly differing FFT lengths, proceeding at as close to the same speed as possible and saving uniquely named checkpoint files every (say) 10M iterations, is the way to go. But from the perspective of the project as a whole:
[1] In terms of avoiding wasted cycles on runs which have gone off the rails, that is only marginally better than the current scheme, which is based on the assumption of an overall low error rate. From the perspective of nailing a single LL test result with minimal cycle wastage, though, the above is good: if a daily check reveals the 2 runs have diverged, stop 'em both and restart from whichever 10M-iteration (or whatever -- on your hardware every 1M iterations makes more sense) persistent checkpoint file was deposited before the point of divergence, after making sure said file matches between both runs. Hopefully on retry both runs will agree past the previous point of divergence.

[2] The major drawback from the project perspective, however, is that it relies on the user being honest. That is not a problem if the user claims to have found a prime -- then we just insist on a copy of the last-written checkpoint file and rerun the small number of iterations from there to the end; if it comes up "prime" we proceed to a full formal independent DC. But let's say someone wants to hurdle up the Top Producers list just for bragging rights and starts submitting faked-up "double checks" of this kind -- if we accepted them we could easily miss a prime.

I use the above 2-side-by-side-runs method in my Fermat number testing, but the difference there is that I would never think of publishing a primality-test result gotten via this method without also making the full set of interim checkpoint files available, enabling a rapid "parallel" triple-check method whereby multiple machines can run the individual 10M-iteration intervals simultaneously, each one checking whether its result after 10M iterations agrees with the next deposited checkpoint file, as described here.
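A sketch of the checkpoint bookkeeping behind that rollback scheme (the interval and data layout are illustrative, not taken from any actual client):

Code:
# Two side-by-side runs deposit uniquely named checkpoints every
# CHECK_INTERVAL iterations; on divergence, both restart from the newest
# checkpoint at which their residues still agree.
CHECK_INTERVAL = 10_000_000    # "10M iter"; ~1M may suit flakier hardware

def last_agreeing_checkpoint(ckpts_a, ckpts_b):
    """ckpts_a / ckpts_b map iteration number -> 64-bit residue."""
    for it in sorted(set(ckpts_a) & set(ckpts_b), reverse=True):
        if ckpts_a[it] == ckpts_b[it]:
            return it          # restart both runs from this iteration
    return 0                   # runs never agreed: restart from scratch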
#5
∂²ω=0
Sep 2002
República de California
11756₁₀ Posts
#6
Serpentine Vermin Jar
Jul 2014
2·13·131 Posts
Quote:
In theory though, yeah, it makes perfect sense to do double-checking along the way, especially if you're doing a huge exponent like those 600M+ results, where he did a verifying run alongside and (presumably) compared residues along the way. If they diverged, then you roll both back to the last place they matched and resume. You save cycles there because you're catching the error without having to run through the whole thing, waiting for a DC, getting a mismatch, doing a triple (or even more) check, etc.

I still go by my general approximation of a 5% bad result rate, so if you were able to do side-by-side runs, you could effectively increase the throughput of the entire project (first and double-check, not just first-time milestones) by 5%.
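Rough arithmetic behind that ~5% figure (a sketch under a simplified model that charges one extra test per bad first test or bad double check and ignores rarer cascades):

Code:
# Expected primality tests per exponent today vs. with self-verified
# side-by-side runs, at a 5% bad-result rate per individual test.
bad_rate = 0.05
tests_now = 2 + 2 * bad_rate   # first test + DC + expected extra checks
tests_sbs = 2                  # a self-verified side-by-side pair
print(tests_now / tests_sbs - 1)   # ~0.05: the ~5% project-wide gain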
#7
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
Quote:
#8
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Quote:
As Mark notes, as long as the overall error rate remains reasonably low, the potential savings are simply unlikely to be worth the effort.
#9
"David"
Jul 2015
Ohio
1005₈ Posts
Quote:
#10
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3·23·89 Posts
Quote:
#11
Undefined
"The unspeakable one"
Jun 2006
My evil lair
6,793 Posts