mersenneforum.org (https://www.mersenneforum.org/index.php)
-   PrimeNet (https://www.mersenneforum.org/forumdisplay.php?f=11)
-   -   Large sample size reliability (https://www.mersenneforum.org/showthread.php?t=21608)

Dubslow 2016-09-27 10:51

Large sample size reliability
 
Why does my computer with 100s of verified results and zero bad results still have a reliability of 0.98?

Prime95 2016-09-27 11:46

The wacky way Primenet processes results makes 0.98 about the maximum possible value.

Each LL result you submit adds to your rolling average either 0.98 for a flawless run, 0.5 for a run with one non-reproducible error, or 0.3 for multiple errors.

The above is for first-time checks and double-checks. If the number was previously verified then you get either 1.0 for a matching triple-check or 0.0 for a mismatch.


However, I just tweaked the above. When you submit a matching double-check you will get 1.0 instead of 0.98. Note that you don't get the extra 0.02 when you are the first-time tester and someone later verifies your result.
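For concreteness, the scoring rules described above can be sketched in Python. This is a hypothetical reconstruction of the per-result score only; the actual PrimeNet code, and how the score is folded into the rolling average, are not shown in this thread, so the function name and signature are invented:

```python
def result_score(error_count, previously_verified=False, matches=None):
    """Score one LL result under the rules quoted in this thread.

    `matches` is None for a first-time check (no comparison exists yet),
    True/False once the result has been compared against another run.
    """
    if previously_verified:
        # Triple-check of an already-verified exponent:
        # 1.0 for a match, 0.0 for a mismatch.
        return 1.0 if matches else 0.0
    if matches and error_count == 0:
        # Matching double-check: 1.0 instead of 0.98 (the post-tweak rule).
        return 1.0
    if error_count == 0:
        return 0.98   # flawless run, not (yet) matched
    if error_count == 1:
        return 0.5    # one non-reproducible error
    return 0.3        # multiple errors
```

With this scheme a machine submitting only flawless first-time checks averages at most 0.98, which answers the original question.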

Dubslow 2016-09-27 12:19

:tu:


Does this apply retroactively to DC results?

Also what about computers that turn in LLs which are subsequently verified? Is that reliability subsequently upped as well? Or could that be effectively done?

ATH 2016-09-27 13:03

[QUOTE=Dubslow;443581]:tu:


Does this apply retroactively to DC results?

Also what about computers that turn in LLs which are subsequently verified? Is that reliability subsequently upped as well? Or could that be effectively done?[/QUOTE]

Answer:

[QUOTE=Prime95;443578]Note that you don't get the extra 0.02 when you are the first-time tester and someone later verifies your result.[/QUOTE]

Dubslow 2016-09-27 13:11

[QUOTE=ATH;443587]Answer:[/QUOTE]

Ah yes, my second question was indeed indicative of how quickly I read his post. :davieddy: I suppose I'm still curious if that could be changed (assuming we all agree there's a rationale for doing so).

However, my first question remains open.

chris2be8 2016-09-27 15:47

Logically a mis-matching DC should get 0.5 because it's equally likely either result is wrong. This could be adjusted for the number of errors each run had.

But that would strengthen the case for updating the score when a matching result is finally found.

Chris

CRGreathouse 2016-09-27 16:21

[QUOTE=chris2be8;443601]Logically a mis-matching DC should get 0.5 because it's equally likely either result is wrong. This could be adjusted for the number of errors each run had.[/QUOTE]

The probability should be slightly less than 0.5, because both may be wrong. If the probabilities of each system being right are p and q, then the probability that the first is wrong is
[$$]\frac{1-p}{1-pq}[/$$]

so if the reliability for each is 0.98, for example, they 'should' each get 49/99 = 0.4949....

I'm not suggesting this needs to be implemented, of course. When we're using 0.5 and 0.3 as ballpark estimates of success chance with one or more errors, there's no need to worry about an extra 0.5% here or there. :smile:
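The 49/99 figure can be checked with exact arithmetic. A quick sketch, using the same conditioning as the post above (a mismatch occurs unless both runs are right, i.e. two wrong runs are assumed never to agree):

```python
from fractions import Fraction

def p_first_wrong(p, q):
    # P(first result is wrong | residues mismatch) = (1 - p) / (1 - p*q),
    # since the runs mismatch with probability 1 - p*q.
    return (1 - p) / (1 - p * q)

p = q = Fraction(98, 100)
score = 1 - p_first_wrong(p, q)   # credit = probability of being right
# score is Fraction(49, 99), i.e. ~0.4949, slightly below 1/2
```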

Madpoo 2016-09-28 15:34

[QUOTE=CRGreathouse;443606]The probability should be slightly less than 0.5, because both may be wrong. If the probabilities of each system being right are p and q, then the probability that the first is wrong is
[$$]\frac{1-p}{1-pq}[/$$]

so if the reliability for each is 0.98, for example, they 'should' each get 49/99 = 0.4949....

I'm not suggesting this needs to be implemented, of course. When we're using 0.5 and 0.3 as ballpark estimates of success chance with one or more errors, there's no need to worry about an extra 0.5% here or there. :smile:[/QUOTE]

If it were up to me, a first-time check wouldn't change the reliability at all because it means nothing by itself. :smile:

If reliability were adjusted after the fact, when a result is either proven good or bad, then a match would be 1.0 and a bad result gets a big fat zero.

Right now though, reliability isn't adjusted after the fact, which is why it gets set when the first-time check comes in and is based on the result code from the run. It provides some instant feedback but since flawless runs often turn out to be bad, it's also misleading.

Could that code be changed to only adjust reliability once the results are known? Well, I imagine it could be. I could probably do something now to go back over the history of each machine and calculate the *real* reliability, omitting any #'s where it's unknown. Would a machine with no verifications at all be set to 0.5, indicating its absolute ambiguity?
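A recalculation along these lines is straightforward to sketch. The record format below is invented for illustration; only the idea (count verified good results, skip unknowns, fall back to 0.5 when nothing is verified) comes from the post:

```python
def true_reliability(results, prior=0.5):
    """Fraction of a machine's verified results that were good.

    Results whose good/bad status is unknown are omitted; with no
    verified results at all, return `prior` (0.5 = complete ambiguity).
    Each result is a dict like {"status": "good" | "bad" | "unknown"}.
    """
    verified = [r for r in results if r["status"] in ("good", "bad")]
    if not verified:
        return prior
    good = sum(1 for r in verified if r["status"] == "good")
    return good / len(verified)
```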

Would that be an incentive to anyone to do some double-checks to improve the reliability rating of their system? Perhaps, although right now the assignment rules (I think?) are using more real time data over the past XX days to see how a machine is performing... is it fast enough, is it "up" and not letting work expire, etc. And besides looking to see if it's had any recent bad results, I don't think it's using the reliability score.

I could be wrong on that though.

CRGreathouse 2016-09-29 20:18

[QUOTE=Madpoo;443704]If it were up to me, a first-time check wouldn't change the reliability at all because it means nothing by itself. :smile:[/QUOTE]

Sure it does -- you get to see the error codes, if any. You can look at the total number of tests submitted without error codes and for which a double-check exists, and use that to approximate reliability. Same for runs with error codes. I feel like you could use the unverified checks as a Bayesian prior or something (as funny as that sounds) based only on the number of errors found (compared to the averages with errors in all verified cases). That would also have the effect of making it fairly unimportant for users with a reasonable number of verified checks.
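One way to make the Bayesian-prior idea concrete is a Beta-style estimate: unverified runs contribute only a down-weighted pseudo-count based on their error codes, while verified results count in full. The prior weight and the per-error success rates below are invented numbers (the thread's 0.98/0.5/0.3 ballparks), not anything PrimeNet actually uses:

```python
def estimated_reliability(verified_good, verified_bad,
                          unverified_error_counts, prior_weight=0.5):
    """Posterior-mean reliability estimate from a Beta(1, 1) starting prior."""
    def expected_good(errors):
        # Ballpark success chances from the thread.
        return 0.98 if errors == 0 else 0.5 if errors == 1 else 0.3

    a, b = 1.0, 1.0                    # uniform Beta(1, 1) prior
    for errors in unverified_error_counts:
        p = expected_good(errors)
        a += prior_weight * p          # soft evidence, down-weighted
        b += prior_weight * (1 - p)
    a += verified_good                 # hard evidence counts in full
    b += verified_bad
    return a / (a + b)                 # posterior mean
```

Because verified results enter with full weight, a machine with many verified checks quickly swamps the soft evidence, matching the "fairly unimportant" behavior described above.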

[QUOTE=Madpoo;443704]If reliability were adjusted after the fact, when a result is either proven good or bad, then a match would be 1.0 and a bad result gets a big fat zero.[/QUOTE]

Right.

[QUOTE=Madpoo;443704]Would a machine with no verifications at all be set to 0.5, indicating it's absolute ambiguity?[/QUOTE]

Probably their reliability should be set to the average reliability (which seems like the most reasonable prior). Of course if you have additional information, like what kind of CPU and clock speed they have, you could provide a more educated estimate.

[QUOTE=Madpoo;443704]Could that code be changed to only adjust reliability once the results are known? Well, I imagine it could be. I could probably do something now to go back over the history of each machine and calculate the *real* reliability, omitting any #'s where it's unknown.[/QUOTE]

By the way I'm not suggesting that this be done -- probably it doesn't matter that much. But it can be fun to think about how to do it regardless. :smile:

Madpoo 2016-10-02 21:09

[QUOTE=CRGreathouse;443842]Sure it does -- you get to see the error codes, if any. You can look at the total number of tests submitted without error codes and for which a double-check exists, and use that to approximate reliability.[/QUOTE]

Maybe... I've seen that on average, a "suspect" result has a 50/50 chance of being bad. Some suspect results are worse than others, but a "suspect" flag is binary yes/no in general terms.

On my own runs I've had a number of results come in as suspect but the result matched. I chalk that up mainly to exponents that were on the threshold of some FFT range. Usually the error is repeatable and it does that thing where it tries again using a safer method or whatever, but sometimes that's not the case, or it happens near the end of the run and finishes before it retries and finds out it was repeatable.

I would posit that since a suspect result has a 50/50 chance of being bad, it's still not a great indicator of even short term reliability and could wind up tarnishing some awesome systems. Maybe if it only included certain types of errors that would be reasonable.

One thing's for sure... the history of proven good/bad results is going to be the ultimate benchmark. :smile:

chalsall 2016-10-02 22:50

[QUOTE=Madpoo;444079]I would posit that since a suspect result has a 50/50 chance of being bad, it's still not a great indicator of even short term reliability and could wind up tarnishing some awesome systems. Maybe if it only included certain types of errors that would be reasonable.[/QUOTE]

You (and we) have a great deal of data available for mining to determine (or even just posit) the optimal equation based on many metrics.

Perhaps we should try to bring some "deep learning" based on heuristics into this, if for no other reason than as an exercise.

[QUOTE=Madpoo;444079]One thing's for sure... the history of proven good/bad results is going to be the ultimate benchmark. :smile:[/QUOTE]

Without question. Which can, of course, feed back into the above. In real time.

