[QUOTE=Madpoo;420308]who gets credit (only the one who finishes, if you ask me), should the server start saving temporary partial residues along the way to provide more "point in time" comparisons besides just the final, etc.[/QUOTE]
Re: credit - we could do one of two things: (1) gussy up the server enough to award a proportional number of GHz-days to the original and subsequent tester(s) (e.g. 20% of a 200-GHz-day LL test would earn 40 GHz-days, 35% of the same test would earn 70 GHz-days, etc.), or (2) hold the interim result until the original AID has expired (which would end any claim that the original assignee might have on the exponent or its test result). The GIMPS legal disclaimers could be amended (if they do not already state this) to equate expiration of an assignment with expiration of any claim on the test result. Re: partial residues - definitely worth considering. Saving a Res64 even at 1% intervals would only require a total of 800 bytes or so. Trivial to transfer and store.
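As a back-of-the-envelope sketch, the proportional-credit idea in option (1) is just linear scaling. The 200-GHz-day figure matches the example above, and the function name is invented for illustration; this is not how PrimeNet actually awards credit:

```python
def partial_credit(fraction_done, full_test_ghz_days=200.0):
    """Award GHz-days proportional to the fraction of an LL test completed.

    full_test_ghz_days is the assumed credit for one complete test (200
    here, per the example above; real assignments cost somewhat more).
    """
    if not 0.0 <= fraction_done <= 1.0:
        raise ValueError("fraction_done must be between 0 and 1")
    return fraction_done * full_test_ghz_days

# 20% of a 200-GHz-day test earns 40 GHz-days; 35% earns 70.
```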
[QUOTE=NBtarheel_33;420323]Trivial to transfer and store.[/QUOTE]
I understand what you are saying, but _*not*_ trivial to code.
[QUOTE=chalsall;420325]I understand what you are saying, but _*not*_ trivial to code.[/QUOTE]
Would it be any easier to just send a Res64 every time Prime95 "phones home" to PrimeNet? That's only an extra eight bytes of information added to the ETA, etc. that otherwise gets sent every day.
[QUOTE=NBtarheel_33;420364]Would it be any easier to just send a Res64 every time Prime95 "phones home" to PrimeNet? That's only an extra eight bytes of information added to the ETA, etc. that otherwise gets sent every day.[/QUOTE]Even assuming you could and assuming it is just a five minute job to change everything (perhaps quite aggressive assumptions!), what is the purpose of the captured data? How is it used? What happens if systems don't agree on the staged res64?
The server should keep the 64-bit residue for every 5 or 10 million iterations.
[QUOTE=retina;420367]Even assuming you could and assuming it is just a five minute job to change everything (perhaps quite aggressive assumptions!), what is the purpose of the captured data? How is it used? What happens if systems don't agree on the staged res64?[/QUOTE]
I would imagine a system where:[LIST][*]A first-time LL test has a saved partial residue (64-bit, just like now) every 5% or so.[*]The second (double) check would compare its residue along the way.[*]If everything matches, good; hopefully the final residue matches as well.[*]If there's a mismatch at any point along the way, the exponent can be made available immediately for a triple-check, since we know it'll need one.[/LIST] There's another benefit: the ability to spot bad machines faster. Right now we only know a result is bad when 3 checks (or more) have finished and we finally get a match that tells us which one was bad. However, if we're able to compare residues along the way, we could potentially know which one is bad as early as 5% (or whatever interval the periodic residues are saved at) into the 3rd run. If two of the three match at some interval, we can be almost certain the odd one out is the bad result. That's based on my assumption that a match at any given iteration is going to be as certain as a match at the final iteration... i.e. if they match at the 10 millionth iteration, they're both doing great up to that point. Knowing your own triple-check is on the right path and that one of the others definitely went astray will increase that user's confidence in their triple-check, and lets someone like me who searches for the bad systems identify them faster. It would also help identify bad systems that never completed their check. I imagine such a system would save those partial residues even if the test never finished. Let's say a system did a bunch of first-time checks and also had a few that went 10-20% and were abandoned. It may be years and years before anyone starts double-checking their first-time work, but since they abandoned some exponents before completion, those would be re-assigned as first-time work much sooner.
If those new assignments show mismatches along the way, we have a good idea that machine was bad, and we can check its other first-time work way before we would have normally, well before we would have any notion that the system was wonky if we had to wait for double-checking to get to them.
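The mismatch detection described above could be sketched roughly like this (names and data shapes are invented for illustration; PrimeNet obviously doesn't work in Python dicts):

```python
def first_mismatch(run_a, run_b):
    """Compare interim Res64 values from two runs of the same exponent.

    run_a and run_b map iteration number -> 64-bit residue. Returns the
    earliest iteration at which both runs have reported a residue but the
    residues disagree, or None if every common checkpoint matches.
    """
    for iteration in sorted(run_a.keys() & run_b.keys()):
        if run_a[iteration] != run_b[iteration]:
            # Mismatch: the exponent could be released for a triple-check
            # now, long before either run reaches 100%.
            return iteration
    return None

# Example: two runs agree at 5M iterations but diverge at 10M.
first_test = {5_000_000: 0xDEADBEEF01234567, 10_000_000: 0x1111111111111111}
double_check = {5_000_000: 0xDEADBEEF01234567, 10_000_000: 0x2222222222222222}
```

With three runs in hand, the same comparison at each checkpoint is what would let a 2-of-3 vote flag the odd one out early.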
[QUOTE=ATH;420369]The server should keep the 64 bit residue for every 5 or 10 million iterations.[/QUOTE]
FYI, I realized that I like that notion more than every xx% along the way... doing it every 5e6 or 10e6 iterations or whatever gives better fixed reference points. As far as the coding, yes, it would mean client changes to send those partial residues as part of its normal communication, and some stuff on the server side. Probably just a new table to hold the info... similar to the table that holds the final residue, holding user/cpu info, exponent, shift count, and maybe the assignment ID for tracking purposes. Then new columns for "nth iteration" and "64-bit residue" for the actual meat of it. There would be some back-end magic pixie dust to actually do something with that data... look for mismatches and make things available for a triple-check right away, or use it for analyzing possibly bad systems. Not terribly complicated, but then I'm not a coder. :smile:
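By way of illustration only, that table and the "magic pixie dust" mismatch query might look something like this in SQLite (the table and column names are guesses based on the post, not the real PrimeNet schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE interim_residues (
        exponent      INTEGER NOT NULL,  -- Mersenne exponent under test
        assignment_id TEXT    NOT NULL,  -- AID, to track which run reported it
        user_id       TEXT,
        cpu_id        TEXT,
        shift_count   INTEGER,           -- starting shift of this run
        iteration     INTEGER NOT NULL,  -- the "nth iteration" checkpoint
        res64         TEXT    NOT NULL,  -- 64-bit interim residue, as hex
        PRIMARY KEY (exponent, assignment_id, iteration)
    )
""")

# Two runs of the same exponent reporting the same checkpoint:
conn.executemany(
    "INSERT INTO interim_residues VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (332192831, "AID-1", "userA", "cpuA", 0, 5_000_000, "DEADBEEF01234567"),
        (332192831, "AID-2", "userB", "cpuB", 7, 5_000_000, "DEADBEEF01234567"),
    ],
)

# Any (exponent, iteration) pair with more than one distinct res64 is a
# mismatch: release the exponent for a triple-check right away.
mismatches = conn.execute("""
    SELECT exponent, iteration
    FROM interim_residues
    GROUP BY exponent, iteration
    HAVING COUNT(DISTINCT res64) > 1
""").fetchall()
```

Here `mismatches` comes back empty because the two residues agree; a third row with a different res64 at the same iteration would surface immediately.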
[QUOTE=Madpoo;420311]
[CODE]% Done   Count  Equivalent Full LL Tests  GHz-days saved
     0  116600                         0               0
    10   11806                      4261         852,200
    20    7164                      6160       1,232,000
[B]    30    5474                      7091       1,418,200
    40    4103                      7265       1,453,000
    50    3551                      7030       1,406,000[/B]
    60    3084                      6305       1,261,000
    70    2749                      5197       1,039,400
    80    2565                      3740         748,000
    90    2107                      1899         379,800
   100       3                         3             600[/CODE][/QUOTE] The above is assuming 200 GHz-days per LL test; current mainstream assignments are actually a little more expensive. The moral of the analysis looks to be that if this is worth implementing, the most "bang for the buck" (as measured by salvaged throughput to GIMPS) is achieved by collecting a single results file around 40% completion, or two or three results files between 30% and 50% completion.
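The "GHz-days saved" column can be reproduced from the "Equivalent Full LL Tests" column at the assumed 200 GHz-days per test, which also confirms the 40% sweet spot:

```python
# (percent done, count of abandoned tests, equivalent full LL tests salvaged),
# transcribed from the table above.
ROWS = [
    (0, 116600, 0), (10, 11806, 4261), (20, 7164, 6160), (30, 5474, 7091),
    (40, 4103, 7265), (50, 3551, 7030), (60, 3084, 6305), (70, 2749, 5197),
    (80, 2565, 3740), (90, 2107, 1899), (100, 3, 3),
]
GHZ_DAYS_PER_TEST = 200  # assumed cost of one full LL test; real tests cost a bit more

saved = {pct: equiv * GHZ_DAYS_PER_TEST for pct, _, equiv in ROWS}
best_pct = max(saved, key=saved.get)
# best_pct is 40: a single snapshot near 40% completion salvages the most work.
```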
[QUOTE=Madpoo;420378]FYI, I realized that I like that notion more than every xx% along the way... doing it every 5e6 or 10e6 or whatever gives better fixed reference points.
As far as the coding, yes, it would mean client changes to send those partial residues as part of its normal communication, and some stuff on the server side. Probably just a new table to hold the info... similar to the table that holds the final residue, holding user/cpu info, exponent, shift count, and maybe the assignment ID for tracking purposes. Then new columns for "nth iteration" and "64-bit residue" for the actual meat of it. There would be some back-end magic pixie dust to actually do something with that data... look for mismatches and make things available for a triple-check right away, or use it for analyzing possibly bad systems. Not terribly complicated, but then I'm not a coder. :smile:[/QUOTE] Would you also stop the machine doing the original DC, or assign it a new item? There are some other interesting ways this could be used. If the client was modified to keep a local full copy of its residue every Xe6 iterations (let's say we size this to roughly a day's worth of work), then the moment the server detected a mismatch, the original client could roll back and try again from a known good point, while also alerting the user of a hardware problem. The client could remove old full checkpoints once verified as matched. Alternatively, the offending machine could then upload its last good full iteration to PrimeNet so a triple-check wouldn't have to start from zero. This would let even lightly misbehaving machines continue to contribute, and is especially useful for those that are unattended for a long time. In the long run it also opens up the possibility of simultaneous tests by different users. We are talking about tests of roughly 210 GHz-days now, but to effectively work on, say, 100M tests it would be really nice to run at half speed but catch (and correct) errors months early.
[QUOTE=airsquirrels;420420]Would you also stop the machine doing the original DC or assign it a new item?
There are some other interesting ways this could be used. If the client was modified to keep a local full copy of its residue every Xe6 iterations (let's say we size this to roughly a day's worth of work), then the moment the server detected a mismatch, the original client could roll back and try again from a known good point, while also alerting the user of a hardware problem. The client could remove old full checkpoints once verified as matched. [/QUOTE] If this is implemented, it would also be good if Prime95 stopped an in-progress LL test when a factor is found for that exponent. I'm not sure whether it does currently?
[QUOTE=airsquirrels;420420]Would you also stop the machine doing the original DC or assign it a new item?[/QUOTE]
Hmm... well, there are two (or three) things that could happen. By way of refresher, I'm referring to a situation where a double-check in our theoretical system has a residue mismatch at, oh, let's say 20% just for example. Since there's a mismatch at that point with the first check, it's made available for a triple-check since we know it'll need one. Things that could happen:[LIST=1][*]The machine doing the double-check finishes first, checks in its result, and then just waits on the triple-check to figure out which one is correct.[*]The machine assigned the triple-check gets to that 20% where the mismatch was noted, and it matches the residue of the first check. Should the double-checker go ahead and give up, knowing that it must have screwed up, since the residue it had at 20% failed to match two other independent runs?[*]The machine assigned the triple-check gets to 20% and matches the DC at the same point. Looking good for the 2nd and 3rd tests, and we can go ahead and assume the 1st check is the bad one.[/LIST] I personally like the notion of spotting the bad one much earlier in the process by comparing residues along the way. And of course there's always the chance that the triple-checker won't match *either* of the first two and we'll need a quad+ check. It's rare, but it definitely happens. [QUOTE]There are some other interesting ways this could be used. If the client was modified to keep a local full copy of its residue every Xe6 iterations (let's say we size this to roughly a day's worth of work), then the moment the server detected a mismatch, the original client could roll back and try again from a known good point, while also alerting the user of a hardware problem. The client could remove old full checkpoints once verified as matched.[/QUOTE] That is true... going back to my example (and I'm using % done instead of millions of iterations just for whatever reason), let's say the mismatch occurred at 20%. The client could roll back to its 10% value and try again...
if it arrived at the same residue it had before, then it'll just continue on with confidence that it's on the right track. Or it may match the first check, so it knows it had some problem and might need to switch to a larger FFT or do something else for the rest of its run. Or it could come up with something else entirely, which again points to a problem with the current run, since it's inconsistent. As you suggested, if the client didn't have to roll back more than a day, that would be easiest on the resources. Rolling back a whopping 10% could be days or even weeks for some exponents/clients, so I just use that by way of example. Since storing partial residues on the server side is (relatively) cheap, being only 64 bits (plus other bookkeeping) per entry, it could be every 1M iterations, or every 500K, etc. The client would only need to save maybe the last 2 or 3 full interim files, just enough so that if it mismatched a previous run, it could go back to its last known match point and start again. Since the client *already* saves 2 backup files at 30-minute intervals (by default), perhaps it's not too much to ask the clients to save an additional backup file or two, going back to the previous 1M point? [QUOTE]Alternatively, the offending machine could then upload its last good full iteration to PrimeNet so a triple-check wouldn't have to start from zero.[/QUOTE] True... we'd have to tell the double-checker to quit working on it at that point (if we've concluded that it's flaky because it keeps coming up with different residues), because the saved file will have a fixed shift count. We can't have the original person and someone else who picks up from there *both* completing it, because the shift counts would be the same and the runs couldn't be used as verification of each other. [QUOTE]This would let even lightly misbehaving machines continue to contribute, and is especially useful for those that are unattended for a long time.
In the long run it also opens up the possibility of simultaneous tests by different users. We are talking about tests of roughly 210 GHz-days now, but to effectively work on, say, 100M tests it would be really nice to run at half speed but catch (and correct) errors months early.[/QUOTE] Right now, some users (looking at you, LaurV) do this on their own... running two tests of the same exponent alongside each other on different machines, comparing residues at fixed intervals along the way. It does help identify a potential problem midway through the run rather than only finding it at the end like we do now. On large (like 100M-digit) exponents, this makes even more sense, when an LL test can take months or even years. I know I'd like to know much sooner whether my machine crapped out at some point, and I could just roll back to the last time the residues matched (on both machines, since I wouldn't know which was bad) and start over from there. Much better than doing a full triple-check starting at zero. For that to be effective, I think I hinted at the problem where an LL and a DC running alongside each other would mean *both* systems rolling back to the last time they matched, since it would be unknown which is bad. That could mean more lost time for the faster of the two machines... it might have to roll back over several checkpoints' worth of iterations to get to where the slower machine is. So to be really effective, I guess we'd have to match machines with the same effective throughput. But those are all "technical" problems... in theory it seems like a good idea? :smile: Now I'm just waiting for someone to come along and poke holes in it, but most of the objections will probably be about implementation issues... I understand it wouldn't be trivial to code and implement, but if it's a good idea and would save time, it'd be worth it, I think.
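The roll-back-to-a-known-good-checkpoint idea discussed above could be sketched like this (checkpoint spacing and names are invented for illustration; the real client's save-file handling is more involved):

```python
def rollback_target(checkpoint_iters, mismatch_iter):
    """Pick the checkpoint to roll back to after a residue mismatch.

    checkpoint_iters: iterations at which the client kept full save files
    (e.g. every 1M iterations, sized to roughly a day of work).
    mismatch_iter: iteration at which the server reported a mismatch.
    Returns the latest checkpoint strictly before the mismatch, or 0 if
    no usable checkpoint remains and the test must restart from scratch.
    """
    good = [c for c in checkpoint_iters if c < mismatch_iter]
    return max(good, default=0)

# Mismatch detected at the 2M-iteration checkpoint; the client still has
# full save files at 1M and 2M, so it falls back to 1M.
```

In the paired-machine scenario, both clients would call something like this with the last iteration at which their residues agreed, which is why mismatched throughput costs the faster machine more.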