
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Optimized doublechecking (https://www.mersenneforum.org/showthread.php?t=10727)

biwema 2008-10-02 23:12

Optimized doublechecking
 
I already suggested this some years ago. At that time the exponents were too small, so it was not worth implementing.

When the exponents get twice as large, there is more than four times as much work to complete one exponent. If there is an error during an LL test, a lot of work will be lost. Therefore it makes sense to find out as soon as possible when a doublecheck no longer matches.

To prevent that problem, we could change the double check procedure this way:
- First time LL test saves an interim residue every million iterations and submits them to the server when the test is complete.
- When this exponent is assigned as a doublecheck, that computer also gets those interim residues (maybe with some masked bits).
- When an intermediate residue matches, the intermediate result is saved. If after one more million iterations the intermediate residue does still match, that result replaces the previous intermediate result and the calculation continues.
- If an intermediate residue does not match, the calculation restarts at the last intermediate result and the last million iterations are recalculated.
- If the result matches now, the calculation continues. The test recovered from an error.
- If the result matches the result from the last run of the million iterations, the first time check failed in that segment. Now the doublecheck becomes a first time check; the previous first time LL test is erroneous.
- If the result does not match any results, the segment needs to be recalculated a third time.
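The steps above can be sketched as a toy model. This is an illustration of the proposed control flow only, not GIMPS code: the real LL kernel is replaced by repeated squaring minus 2 modulo a small Mersenne number, full states are compared where the real scheme would compare 64-bit interim residues, and the segment length and error model are invented for the example.

```python
# Toy model of the segment-by-segment doublecheck proposed above.
M = (1 << 61) - 1  # small Mersenne number standing in for the one under test

def iterate_segment(s, iters, flip_at=None):
    """Advance the toy LL state by `iters` steps; optionally corrupt one step."""
    for i in range(iters):
        s = (s * s - 2) % M
        if i == flip_at:
            s ^= 1  # simulate a transient hardware error
    return s

def doublecheck(s0, first_residues, seg_len, error_segments=()):
    """Replay the test one segment at a time against the first test's residues."""
    checkpoint = s0  # last verified intermediate result
    for seg, expected in enumerate(first_residues):
        flip = 0 if seg in error_segments else None
        state = iterate_segment(checkpoint, seg_len, flip_at=flip)
        if state == expected:
            checkpoint = state          # match: commit and continue
            continue
        retry = iterate_segment(checkpoint, seg_len)  # redo the segment once
        if retry == expected:
            checkpoint = retry          # our own error: recovered, continue
        elif retry == state:
            # Reproducible disagreement with the first test: this run would
            # effectively become the first-time check for the exponent.
            return ("first-test suspect in segment %d" % seg, retry)
        else:
            return ("segment %d needs a third pass" % seg, None)
    return ("ok", checkpoint)
```

A transient error injected into one segment is caught and repaired locally, while a residue list that disagrees reproducibly flags the first-time test instead.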

Now, while the server is being updated to version 5 and Prime95 version 25 is in beta, is the best time to implement that functionality.
It is possible to use less reliable machines for doublechecking.
Doublechecks could even be used as torture tests.

I hope it is not too difficult to implement.

Uncwilly 2008-10-02 23:44

[QUOTE=biwema;144335]I already suggested this some years ago. At that time the exponents were too small, so it was not worth implementing.

To prevent that problem, we could change the double check procedure this way:
- First time LL test saves an interim residue every million iterations and submits them to the server when the test is complete.
- When this exponent is assigned as a doublecheck, that computer also gets those interim residues (maybe with some masked bits).[/QUOTE]
Maybe, have the server hold all the cards. Have the computer check in every 1, 5, or 10 M iterations. Then, depending on the response, proceed as you dictate. If the residues don't match after a redo, have a "trusted machine/user" start the triplecheck.

cheesehead 2008-10-03 03:37

biwema,

Thank you for presenting this idea. I appreciate your desire to improve GIMPS.

However, there are fundamental flaws in your proposed procedure that make it unacceptable for GIMPS to use.

[quote=biwema;144335]To prevent that problem, we could change the double check procedure this way:
- First time LL test saves an interim residue every million iterations and submits them to the server when the test is complete.
- When this exponent is assigned as a doublecheck, that computer also gets those interim residues (maybe with some masked bits).
- When an intermediate residue matches, the intermediate result is saved.[/quote]By "intermediate residue", I presume you mean the final 64 bits of the current S[sub]n[/sub].

By "intermediate result", I take it you mean the entire S[sub]n[/sub] full-length value at that point.

[quote]- If an intermediate residue does not match, the calculation restarts at the last intermediate result and the last million iterations are recalculated.[/quote]So, in effect, your procedure would start a triplecheck at the previous checkpoint ...

... except that instead of having the doublecheck and triplecheck start with different randomized offsets (as GIMPS currently does), your procedure would use exactly the same offset for each (because it doesn't go all the way back to the start when recalculating). Also, instead of performing the triplecheck on a different system than either the first-time or doublecheck (as GIMPS currently does), your procedure performs the triplecheck on the same system as the doublecheck.

So in your procedure, we lose the independence between doublecheck and triplecheck that GIMPS currently has, thus leading to less reliable security. In addition to making it harder to detect computational errors, it makes it easier to fake a doublecheck result without actually doing the calculation.

[quote]- If the result matches now,[/quote]I presume you mean "If the latest [I]residue[/I] (i.e., last 64 bits of result) matches the first-time test residue now,"

[quote] the calculation continues. The test recovered from an error.[/quote]In effect, your procedure moves the two-out-of-three vote to here, but not on independent systems, rather than after both the full doublecheck and full triplecheck on independent systems as it is now.

[quote]- If the result matches the result from the last run of the million iterations,[/quote]I presume you mean, "If the [I]residue[/I] matches the [I]residue[/I] ..."

[quote]the first time check failed in that segment.[/quote]Or, at least, the two-out-of-three vote is in favor of the current run.

[quote]Now the doublecheck becomes a first time check; the previous first time LL test is erroneous.[/quote][U]Not necessarily![/U]

It could also be that the second system simply has an identical error occur in each of its duplicate calculations on the current interval. In two-out-of-three voting, two wrongs outweigh one right.

The infamous division bug in Intel's first Pentium CPU would repeat exactly the same result every time, and could not have been detected by your procedure. In fact, if GIMPS used the FDIV instruction in its main L-L calculation loops (it doesn't, because there are faster ways to do what GIMPS wants), your procedure would decide that the buggy CPU was correct and a non-buggy one was faulty, exactly the opposite of what we want.

By removing the independence of doublechecks and triplechecks, your procedure makes it easier for some errors to be accepted as correct results.

[quote]- If the result does not match any results, the segment needs to be recalculated a third time.[/quote]No.

The failure of two calculations performed on the same system to match either each other or the first-time test means that at least one, maybe both, of these latter two calculations is incorrect.

At this point, [U]we need to have this machine stop its test and report its errors to GIMPS because [I]it has demonstrated that it is unreliable[/I][/U] and should not be trusted in future tests.

[quote]It is possible to use less reliable machines for doublechecking.[/quote]Yes, but not as simultaneous double- and triple-checkers.

In GIMPS, correctness of calculations is the first priority. Optimizations that impair that are unacceptable.

biwema 2008-10-03 11:51

[QUOTE=cheesehead;144351]biwema,

By "intermediate residue", I presume you mean the final 64 bits of the current S[sub]n[/sub].

By "intermediate result", I take it you mean the entire S[sub]n[/sub] full-length value at that point.[/QUOTE]

Exactly.
It is only necessary to synchronize the residue with the server.
The intermediate result is necessary to restart the last segment after an error is detected.
It is also possible to keep the whole intermediate files on the server. In that case the first time check could also be restarted from that point, but there would be a huge storage and transfer bottleneck.

The whole point of the procedure is to detect a mismatched residue as early as possible. The main purpose is to detect random errors. If the error is reproducible, that doublecheck becomes a first time check; that means the previous first time check is marked as "probably faulty". In that case a third check is necessary because we need a matching doublecheck with a different offset.

Besides that, it is still possible to redo the last segment with a larger FFT size and/or a changed offset (is that possible during the LL test?).

The 2-out-of-3 vote is always a problem. It could also be that the first time check is faulty but has a correct residue. A doublecheck would not detect that.
A 128-bit residue could further reduce the chance of matching an interim residue by accident.

I think there will be no negative impact on the correctness of the project. There is still a complete doublecheck with a different offset necessary.

cheesehead 2008-10-03 20:37

[quote=biwema;144381]The whole procedure is to detect a not matching residue as early as possible. The main purpose it to detect random errors.[/quote]I understand what you mean.

What I'm trying to explain is that there are [I]non-random[/I] errors that your procedure fails to detect, and erroneously treats as correct. Furthermore, the extra execution time that your procedure requires (i.e., redoing a million-iteration segment if there's an intermediate residue mismatch) will not be compensated for by the quicker detection of errors in some cases, because those cases don't occur often enough to make it worthwhile.

You have a good idea, and I think a slight modification of it (I won't try to explain now) has the potential to improve GIMPS throughput -- but only very, very slightly, and it would require changing all the software (GIMPS and PrimeNet) and the database, plus saving intermediate residues, at the cost of making everything much more complicated. Furthermore, it could not be used with any of the saved data that GIMPS has for the millions of tests already performed before this modification was implemented. These added complications would not be acceptable for the very small improvement it would bring (unless there were other benefits I haven't thought of).

For now, I'll resume responding to your last posting.

[quote]If the error is reproducible, that doublecheck becomes a first time check. That means, the first time check before is marked as "probably faulty".[/quote]Wrong thing to do! Your procedure has _not_ shown that the first time check before is in error. If the second check that you're doing has a reproducible error, then _it_ should be the one labelled as "faulty", not the first one!

Please carefully re-read my first post, where I presented a specific example (Intel Pentium FDIV bug) of a reproducible, non-random error that your procedure would treat in exactly the reverse of the correct way -- your procedure would label a correct result as faulty, and an incorrect result as correct.

[quote]In that case a third check is necessary because we need a matching doublecheck with a different offset.[/quote]Yes, but you can't declare that the first time check before was faulty. Your procedure does not prove that.

[quote]Besides that it is still possible to redo the last segment with a larger FFT size and/or change the offset (is that possible during the LL test?).[/quote]Yes, it's possible, but you have to go back to the very beginning of the LL test to do so. You can't change the FFT size or the offset at some intermediate point, so it would waste all the calculation done up until that time.

[quote]The 2 out of 3 vote is always a problem.[/quote]The key flaw is that in your procedure, 2 of the votes come from the same system's run, so that there are not 3 _independent_ votes. Unless all votes are from independent systems, 2-out-of-3 voting is meaningless.

[quote]It could also be that the first time check is faulty but has a correct residue.[/quote]... or that it's not faulty at all.

BTW, by "faulty", I mean that the residue is incorrect. If there are nonreproducible errors in a calculation, but the residue matches residues independently calculated on separate systems anyway, then we consider the residue as almost certainly correct.

[quote]A doublecheck would not detect that.[/quote]No, a triplecheck is required ... not using your procedure, but it's possible if the double- and triple-checks are done on independent systems.

[quote]A 128-bit residue could further reduce the chance of matching an interim residue by accident.[/quote]Not really. With a 64-bit residue match, it's already been mathematically shown that the probability that the full-length result is incorrect is much, much, much smaller than the probability of many other types of error.

[quote]I think there will be no negative impact on the correctness of the project. there is still a complete doublecheck with a different offset necessary.[/quote]Your procedure could improve GIMPS, but only by cutting short a doublecheck test run that detected its own errors. Unfortunately, that would not happen often enough (based on GIMPS's experience with types of errors that actually occur) to compensate for extra time spent uselessly in many other tests.

The flaws in your procedure are not in the quality of your design for early detection of an error in a doublecheck run, considered only by itself -- they're in its failure to account for other types of random and nonrandom error.

biwema 2008-10-03 23:36

I did read your first post completely. Probably I just did not choose the correct words.
Mainly the idea is to detect random errors in the doublecheck.

I cannot improve the first time run.

Of course you are right. If there is a reproducible error in the doublecheck, I cannot detect that error. I just see that my residues match where the first time check did not. If that happens, a third verification is necessary (not only to prevent reproducible errors: there is no way to show that the rest of the doublecheck is correct after the first residue mismatch occurs).

I should not have said "probably faulty"; I see that you do not like that expression. In these cases it is just more likely that there was an error during the first time run.
I had better call it "a bit more likely that an error occurred during the first time run". Prime95 does something similar: if a test run recovers from a roundoff or sumout error, it is marked as more risky, and those exponents are looked at more carefully.

About the change of the FFT size: I think it is possible to change the FFT size during a run. Every iteration the number is transformed back to its representation mod Mxxxxx, so the FFT size can be changed then.
I don't know if it is possible to change the offset as well, or if it is merged too much. I think it is possible; otherwise it would not be possible to get comparable residues between different offsets.

[QUOTE=cheesehead;144428]Your procedure could improve GIMPS, but only by cutting short a doublecheck test run that detected its own errors. Unfortunately, that would not happen often enough (based on GIMPS's experience with types of errors that actually occur) to compensate for extra time spent uselessly in many other tests.

The flaws in your procedure are not in the quality of your design for early detection of an error in a doublecheck run, consided only by itself -- they're in its failure to account for other types of random and nonrandom error.[/QUOTE]

After all, it is no big improvement and will not change much until there are enough exponents with logged interim residues. As the candidates grow, it could become more and more useful. Compared to other improvements (intermediate FFT sizes 9, 11, 13, 15 * 2^N; running first time and double checks in parallel; mixing trial factoring and P-1; etc.) it is quite easy to implement.

CRGreathouse 2008-10-04 04:35

[QUOTE=cheesehead;144428][quote]If the error is reproducible, that doublecheck becomes a first time check. That means, the first time check before is marked as "probably faulty".[/quote]

Wrong thing to do! Your procedure has _not_ shown that the first time check before is in error. If the second check that you're doing has a reproducible error, then _it_ should be the one labelled as "faulty", not the first one![/QUOTE]

Of course we want to catch nonrandom errors; biwema's procedure is useful only for double checks, not triple checks. If one system has a repeatable residue that conflicts with the first check residue, then:
1. the first check is wrong due to a random failure, or
2. the first check is wrong due to a systemic failure, or
3. the second check is wrong due to a systemic failure, or
4. the second check is wrong due to a random failure (with vanishingly small probability, say one-in-a-trillion that it happens to any system before GIMPS hits billion-digit exponents).

#1 is far more likely than #4; hence biwema's suggestion that the second check be assumed 'by default' to be correct. Of course, to protect against systemic failures like the Pentium bug, it must be triple-checked, ideally on a computer different from both the primary and secondary checkers. It could be checked by the same processor type as the primary checker, but the result could only be accepted if it matched the secondary. If it matched the primary, with overwhelming probability a systemic failure has been detected. A triple-check on a processor matching the second would be worthless.
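The triple-check placement rules in that last paragraph can be written out as a small decision table. This is a sketch of the reasoning only; the function name and residue values are illustrative, not part of PrimeNet:

```python
# Given residues from the first check, a second check whose residue is known
# to be repeatable, and a triple check run on independent hardware, decide
# which result to accept: a match with the second confirms the first check
# failed, while a match with the first points to a systemic failure in the
# second checker.

def triplecheck_verdict(first, second, third):
    if third == second:
        return "accept second; first check failed"
    if third == first:
        return "systemic failure suspected in second checker"
    return "no two runs agree; another independent check needed"
```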

cheesehead 2008-10-04 15:08

[quote=biwema;144442]Mainly the idea is to detect random errors of the doublecheck.[/quote]Yes, and [U]if[/U] it did that without imposing any additional cost (re-doing iterations, the need to change the database and much of the software, ...), then it could be evaluated without regard to any other factors, such as the frequency of random errors, the occurrence of nonrandom errors, the extra time needed for computation, and so on.

But it _does_ impose added cost, so we need to compare its added value to its added cost. Evaluation of added cost drags in other factors -- frequency of random errors, occurrence of nonrandom errors, and so on.

[quote]About the change of the FFT size: i think it is possible to change the FFT size during a run.[/quote]Yes, I think I goofed on that.

[quote]I don't know if it is possible to change the offset as well, or if it is merged too much.[/quote]It's possible, but this would require still more modifications to both software and database. Prime95 would need to record and report the offset changes and when they occurred, I think. However, this change wouldn't help anything: changing the offset mid-way through does nothing to improve reliability or error detection.

[quote]Compared to other improvements (intermediate FFT sizes 9,11,13,15 *2^N, running first time and double check in parallel, mixing Trialfactoring and P-1 etc. ) it is quite easy to implement.[/quote]But unlike those others, its benefit does not outweigh its cost, it appears to me.

cheesehead 2008-10-04 15:31

[quote=CRGreathouse;144451]#1 is far more likely than #4, thus biwema's suggestion that it is assumed 'by default' to be correct.[/quote]I agree that biwema's suggestion would reduce (but not eliminate) the proportion of reported doublecheck results that were corrupted by random errors. I'm arguing that the magnitude of this benefit alone does not justify the added cost.

CRGreathouse 2008-10-04 20:08

[QUOTE=cheesehead;144471]I agree that biwema's suggestion would reduce (but not eliminate) the proportion of reported doublecheck results that were corrupted by random errors. I'm arguing that the magnitude of this benefit alone does not justify the added cost.[/QUOTE]

Are there good numbers out there on the percentage of doublechecks that come up bad?

Costs:
1. Processor cycles to check partial residues
2. Bandwidth for partial residue checking
3. Centralized storage for partial residues
4. Programmer time to implement

Benefits:
I. Redundant processor cycles saved
II. Better knowledge of where verification fails

I would expect that #I would easily outweigh #1 unless the intervals are too close. #2 seems to be a nonissue.

I'm concerned about #3 and especially #4. #3 is all cost, no benefit until the numbers we're starting to check now are double checked, and could easily quadruple hard drive space needed. (This might not be too bad, actually; the residues are pretty small, after all.) #4 is even worse: skilled programmer time is the scarcest resource of all in the project.

#II offers little benefit to random errors if our current understanding of how they happen is correct. But if not (if some parts are more vulnerable than others, for some nonobvious reason) this could have diagnostic value. It also offers benefits for identifying systemic errors, though those are major enough that the time savings might not matter at all.

Do you agree with this analysis, cheesehead?

cheesehead 2008-10-05 08:21

[quote=CRGreathouse;144482]Are there good numbers out there on the percentage of doublechecks that come up bad?[/quote]How to gather the data (skip to the boldface [B]%[/B] figure, if you wish):

Look at the database of LL results. Tote up cases something like these (which go beyond your question):

Category 1: First-time and doublecheck residues agree.

Case 1a: Neither reports a nonzero error code (the string that has numbers of nonreproducible roundoff errors, and so forth).

Case 1b: First-time reports a nonzero error code. DC doesn't.

Case 1c: DC reports a nonzero error code. First-time doesn't.

Case 1d: Both first-time and DC report nonzero error codes.

Category 2: First-time and doublecheck residues disagree. Triplecheck agrees with first-time.

Cases 2a-d: as 1a-d above, with no triplecheck error.

Cases 2e-h: like 2a-d except triplecheck returned nonzero error code.

Category 3: First-time and doublecheck residues disagree. Triplecheck agrees with doublecheck.

Cases 3a-d and 3e-h like 2a-d and 2e-h above.

Category 4: First-time, doublecheck, and triplecheck all disagree with each other.

Cases 4a-zzz in sets of 8 according to how many nnnble-checks are necessary for two residues to match.
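The tally scheme above can be sketched as a small classifier. This is an illustration of the categories, not the actual database query; the function names are invented for the example:

```python
# Classify one exponent's results: residue agreement picks the category
# (1-4), and which runs reported nonzero error codes picks the a-d subcase.

def category(first, dc, tc=None):
    """1: first==DC; 2: TC sides with first; 3: TC sides with DC; 4: none agree."""
    if first == dc:
        return 1
    if tc is None:
        return None  # residues disagree and no triplecheck yet
    if tc == first:
        return 2
    if tc == dc:
        return 3
    return 4

def subcase(first_err, dc_err):
    """'a': no error codes, 'b': first-time only, 'c': DC only, 'd': both."""
    return "abcd"[(1 if first_err else 0) + (2 if dc_err else 0)]
```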

Now, having said all that, let me [I]answer your question[/I], "Are there good numbers out there on the percentage of doublechecks that come up bad?":

It's been discussed several times in the past; such threads might be distributed across the Software, Hardware, PrimeNet and Data subforums. I'd start looking in the Data subforum first. In those discussions the numbers you(we) seek sometimes appear.

George has built in an estimate of [B]1.8%[/B], but this is for all L-L runs, not just doublechecks. It's in source module commonc.h:
[code]#define ERROR_RATE 0.018 /* Estimated error rate on clean run */
[/code]Prime95 uses that figure in the calculations for setting P-1 limits. (module ecm.c) Why? It's used in calculating the time required for (one or two, as the case may be) L-L runs that would be saved if factoring found a factor. That, in turn, is used to calculate how far P-1 should go in order to maximize the benefit of trying P-1 before L-L.
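As a rough illustration of how that constant enters the P-1 economics (a simplified model of the trade-off just described, not Prime95's actual ecm.c logic):

```python
# Expected LL work saved by a P-1 factor is the number of scheduled tests
# inflated by the clean-run error rate (each error forces a rerun). P-1 to
# given bounds is worthwhile when the expected saving exceeds its own cost.

ERROR_RATE = 0.018  # estimated error rate on a clean run (commonc.h)

def expected_ll_tests_saved(tests_scheduled=2):
    """Tests saved if a factor is found: scheduled tests plus expected reruns."""
    return tests_scheduled * (1 + ERROR_RATE)

def p1_worthwhile(factor_probability, ll_time, p1_time, tests_scheduled=2):
    """Compare the expected LL time saved against the time spent on P-1."""
    saving = factor_probability * expected_ll_tests_saved(tests_scheduled) * ll_time
    return saving > p1_time
```

Deeper P-1 bounds raise both `factor_probability` and `p1_time`, so the real code searches for the bounds that maximize the net saving rather than just testing one point.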

[quote]Benefits:
I. Redundant processor cycles saved[/quote]... but as I pointed out above, this is only in the case where a doublecheck using biwema's procedure cuts itself short because it can't agree with itself after redoing an interval.

[quote]II. Better knowledge of where verification fails[/quote]Suppose we find that verification failed on the interval of iterations 16000000 to 17000000. What does that tell us, that can be of any real use? I think the answer to that is: nothing.

[quote]I would expect that #I would easily outweigh #1 unless the intervals are too close.[/quote]I disagree.

Draw up an example.

[quote]#2 seems to be a nonissue.[/quote]Since the current transmitted assignment and report messages are fairly short (< 100 bytes IIRC), those intermediate residues (8 bytes each, plus overhead for residue-checking interval length and probably some other stuff) could easily double, or more, their lengths. So we could be looking at a substantial increase in bandwidth.
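To put rough numbers on that bandwidth point: with a ~100-byte base message, 8-byte residues, and a few bytes of per-residue overhead (all of these figures are illustrative assumptions, not measured PrimeNet values):

```python
# Back-of-envelope size of a result report once interim residues are attached.

def report_bytes(exponent, seg_len=1_000_000, base=100, per_residue=8, overhead=4):
    """Approximate report size: base message plus one entry per segment."""
    segments = exponent // seg_len
    return base + segments * (per_residue + overhead)
```

For a 40M exponent this gives 100 + 40*12 = 580 bytes, several times the current message size.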

[quote]#II offers little benefit to random errors if our current understanding of how they happen is correct.[/quote]Right.

[quote]But if not (if some parts are more vulnerable than others, for some nonobvious reason) this could have diagnostic value.[/quote]Not really. Tell me how knowing that, e.g., iteration interval 16000000-17000000 is the first one that failed helps us.

[quote]It also offers benefits for identifying systemic errors[/quote]Perhaps if it's iteration interval 1-1000000. :) But I think that might become evident in the QA testing that several folks do before a new version is released.

Otherwise, tell me how knowing the iteration interval that first failed (but is later than 1-1000000) helps us in this regard.

