2003-10-04, 02:55   #1
GP2

Sep 2003

2585₁₀ Posts
Team_Prime_Rib error-prone machines

I don't want to single out Team_Prime_Rib, but they already keep track of their own bad results on their Triple Checks Required page and Incorrect Team Results page.

So it would be useful to see which machines my proposed criteria identify as error-prone, and compare that with what they know about their own machines.

Comments from Team_Prime_Rib would be welcome. Maybe we can see where to set the appropriate threshold percentage for error-prone machines (maybe 50% is too high).

Once again, the proposed standard is:

    bad / (bad + good) >= X %  and  bad >= 2
or
    uv3_plus / (uv3_plus + uv2) >= X %  and  uv3_plus >= 2

where
    X % = 50%
    uv2 = unverified exponents needing a 2nd check.
    uv3_plus = unverified exponents needing a 3rd or higher check.

If we set X % = 50%, there are 28 error-prone machines in TPR.
If we lower the threshold to 33%, there are 5 more machines.
If we lower the threshold to 20%, there are another 6 machines.

Total distinct machines evaluated for TPR = 491.

50%
16 20 24 29 40 DSheets_09 Hades_au_P4e
KL_Dancer2 KL_KenOffice KL_Looi KL_Orphanage KL_Zedd
SC_reaver02 Tasuke5 Tasuke9 bayanne_MRoe bayanne_clv2
garo4 glenon1 glenon7 greensinozw0 hum7
outlnder01 outlnder02 outlnder4 outlnderprim p1000 shlide

33%
DSheets_62 Odessit5 SC_derek2 SC_reaver01 robcreid_01

20%
boxen_05 forge2 outlnder06 outlnder07 outlnder1 outlnder5

Last fiddled with by GP2 on 2003-10-04 at 02:56

2003-10-04, 03:20   #2
GP2

Sep 2003

Posts
Re: Team_Prime_Rib error-prone machines

Quote:
 Originally posted by GP2 Total distinct machines evaluated for TPR = 491.

Some TPR machines were not even considered, even though they might have a high error rate, because there are no exponents associated with them to release for early double-checking.

An example:

The machine outlnder52 has 6 bad, 4 good, 0 unverified needing a 2nd check, and 11 unverified needing a 3rd check or higher.

So it is error-prone.
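
As a rough check of that conclusion, here is a minimal Python sketch of the proposed standard applied to the counts quoted for this machine. The function name `is_error_prone` and the parameter names `threshold` and `min_count` are illustrative assumptions, not part of any GIMPS tool; only the formula and the counts come from the thread.

```python
def is_error_prone(bad, good, uv2, uv3_plus, threshold=0.5, min_count=2):
    """Apply the proposed standard (hypothetical helper, not a GIMPS API).

    A machine is flagged when either
      bad / (bad + good) >= threshold and bad >= min_count, or
      uv3_plus / (uv3_plus + uv2) >= threshold and uv3_plus >= min_count.
    """
    verified = bad + good
    unverified = uv3_plus + uv2

    # The "> 0" guards avoid dividing by zero for machines with no results.
    by_verified = (
        bad >= min_count and verified > 0 and bad / verified >= threshold
    )
    by_unverified = (
        uv3_plus >= min_count
        and unverified > 0
        and uv3_plus / unverified >= threshold
    )
    return bool(by_verified or by_unverified)


if __name__ == "__main__":
    # Counts quoted in the post: 6 bad, 4 good, 0 uv2, 11 uv3_plus.
    # 6 / (6 + 4) = 60%, which meets the 50% threshold with bad >= 2.
    print(is_error_prone(bad=6, good=4, uv2=0, uv3_plus=11))
```

Either clause alone is enough to flag this machine: the verified ratio is 60% and the unverified ratio is 11 / (11 + 0) = 100%.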

But there are 0 exponents needing a 2nd check. There are 11 exponents needing a 3rd check or higher, but all of these already have one presumed-good result (from other, non-error-prone machines).

The basic philosophy is: any exponent that does not have one presumed-good result should be scheduled for early re-testing, while any exponent that does will be re-tested in due course (sometimes years later).
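
That philosophy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout (a dict mapping each exponent to the machines that returned a result for it), the function name `exponents_for_early_retest`, and the sample exponents and machine names are all hypothetical placeholders, and "presumed-good" is taken here to mean "returned by a machine that is not flagged as error-prone".

```python
def exponents_for_early_retest(results_by_exponent, error_prone_machines):
    """Return exponents that have no presumed-good result.

    results_by_exponent: dict mapping exponent -> list of machine names
        that returned an unverified result for it (assumed non-empty).
    error_prone_machines: set of machine names flagged as error-prone.
    """
    retest = []
    for exponent, machines in results_by_exponent.items():
        # A result counts as presumed good if its machine is not flagged.
        has_presumed_good = any(
            machine not in error_prone_machines for machine in machines
        )
        if not has_presumed_good:
            retest.append(exponent)
    return sorted(retest)


if __name__ == "__main__":
    # Placeholder data, not real GIMPS results.
    flagged = {"flaky_1", "flaky_2"}
    results = {
        1000003: ["flaky_1"],              # only an error-prone result
        1000033: ["flaky_1", "steady_1"],  # one presumed-good result
        1000037: ["flaky_1", "flaky_2"],   # two error-prone results
    }
    print(exponents_for_early_retest(results, flagged))
```

Exponents that already hold a result from a machine outside the flagged set are left for ordinary double-checking in due course.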

So in other words, the list in the previous message (and the 491 total TPR machines considered) was not a complete list of TPR machines, but only the ones that hold interest for purposes of releasing exponents for early double-checking.

Last fiddled with by GP2 on 2003-10-04 at 03:22

2003-10-04, 04:04   #3
outlnder

Aug 2002

2×3×53 Posts

Fortunately for Team Prime Rib and GIMPS in general, all listed outlnder machines no longer crunch for TPR.

Just to clarify some of the questions about my machines: none were overclocked, most used asynchronous FSB settings, and most were on cheap (ECS) motherboards.

And just for the hell of it, most had 0 Prime95 errors.

2003-10-04, 04:24   #4
GP2

Sep 2003

5×11×47 Posts

If we include all TPR machines (1365 in all), rather than just the 491 considered in the previous message, we get:

50%
16 20 22 24 28 29 33 35 40
DSheets_09 DSheets_16 DSheets_20 DSheets_22 DSheets_33 DSheets_35
Hades_au_P4e KL_Dancer2 KL_Gar KL_KenOffice KL_Looi KL_Orphanage KL_Zedd
Odessit PJG-G4-800 PM_node_2 SC_12_derek2 SC_reaver02
SlashDude10 SlashDude_WT TGC_03 Tasuke Tasuke5 Tasuke9 Tasuke_26
adoptfactor alvin bayanne_MRoe bayanne_clv2 garo4 garo_jul
glenon1 glenon7 greensinozw0 hum7 kvizbar_srv
outlnder01 outlnder02 outlnder04 outlnder2 outlnder21 outlnder4
outlnder52 outlnder55 outlnderprim p1000 riskin01 shlide

33%
4 DSheets_29 DSheets_62 Odessit5 SC_derek2 SC_reaver01
Tasuke10 Tasuke3 Tasuke4 dizzytcis glenon_700hm robcreid_01

20%
Tasuke_27 boxen_05 forge2 outlnder06 outlnder07
outlnder1 outlnder12 outlnder5 pvillecat6

2003-10-04, 04:38   #5
GP2

Sep 2003

Posts

Quote:
 Originally posted by outlnder Fortunately for Team Prime Rib and GIMPS in general, all listed outlnder machines no longer crunch for TPR.
Sorry about your farm, outlnder. You did have some good machines (outlnder10, 11, and many others), and even the error-prone ones returned a lot of good exponents in total.

Quote:
 And just for the hell of it, most had 0 Prime95 errors.
I think we have to conclude that although a nonzero error code is a good predictor of a possible bad result, a zero error code is not a good predictor of a good result.

The same was true for Team_Italia/Paperino in the M77909869 thread.

2003-10-04, 06:54   #6
robreid

Aug 2002
New Zealand

Posts

Quote:
 Originally posted by GP2 If we include all TPR machines (1365 in all), rather than just the 491 considered in the previous message, we get: 33% robcreid_01
This computer is definitely confirmed bad; it has since been dismantled and redeployed. Parts of it are now in the boxen that tracks our incorrect stats. I've already come to terms with the fact that the outstanding LL tests will all probably fail doublecheck.

2003-10-04, 07:11   #7
outlnder

Aug 2002

318₁₀ Posts

PageFault has made some deductions that asynchronous memory/FSB settings are responsible for some or all of the errors in machines that show "0" errors.

Maybe this is something we could get feedback on? Ask participants who have bad results whether their machines are set to asynchronous memory/FSB settings.

2003-10-04, 17:41   #8
PageFault

Aug 2002
Dawn of the Dead

5·47 Posts

Asynchronous RAM/FSB was suspect on boxen_05, as this machine produced AM radio interference when at 133 FSB / 166 RAM. Forcing a 1:1 ratio got rid of the interference, but the machine is still failing doublechecks.

This machine's problems were first found in a batch of doublechecks run last fall, with about 50% needing a triplecheck or confirmed bad. The machine had been on 33M most of this year until June, when I started a batch of DCs; all were bad. It was down for a few months. Recently I revived it to try to solve the problem, and the current batch of DCs all require triplechecks.

I will try once again at full stock (1.6A @ 2133) after an in-progress (and certainly corrupt) 33M test completes. This box is probably scrap; I'll get around to fixing or replacing it soon.

2003-10-05, 00:01   #9
garo

Aug 2002
Termonfeckin, IE

2²×691 Posts

GP2,
I am surprised that you have garo4 and garo_jul in that list. Can you tell me what criteria were used to get these machines? My guess is that you used the outstanding triplecheck ratio criterion. In that case, I would like to suggest a revision.

What you do not take into account is the fact that not a single exponent from either of these machines has been confirmed bad, and there are literally dozens of confirmed good results from both these machines. So the high triple check ratio is in fact a coincidence, i.e. the other machine is more likely to be at fault.

So I would like to suggest that the unverified criterion be used only in the case of those machines that do not have enough verified results. I am posting this observation in the other thread as well.

Also, I know from my observations that all the KL machines you listed are bad, as well as the SC, PM_node and DSheets machines you listed.
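
One possible reading of this suggested revision, as a hedged Python sketch: fall back to the unverified (triplecheck) ratio only when a machine has too few verified results to judge it on. The post does not say how many verified results count as "enough", so the `min_verified` cutoff below is an arbitrary assumption, as are the function and parameter names and the sample counts.

```python
def is_error_prone_revised(bad, good, uv2, uv3_plus,
                           threshold=0.5, min_count=2, min_verified=10):
    """Hypothetical variant of the proposed standard.

    The verified ratio is always checked. The unverified ratio is only
    consulted when the machine has fewer than min_verified verified results.
    min_verified is an assumed cutoff; the thread gives no number.
    """
    verified = bad + good
    unverified = uv3_plus + uv2

    by_verified = (
        bad >= min_count and verified > 0 and bad / verified >= threshold
    )
    if by_verified:
        return True

    if verified >= min_verified:
        # Enough verified history, and it looks clean: trust it.
        return False

    return bool(
        uv3_plus >= min_count
        and unverified > 0
        and uv3_plus / unverified >= threshold
    )


if __name__ == "__main__":
    # Illustrative counts only: no confirmed bad results, a couple of dozen
    # confirmed good ones, and a high outstanding triplecheck ratio.
    print(is_error_prone_revised(bad=0, good=24, uv2=1, uv3_plus=3))
```

With these illustrative counts the machine is not flagged, whereas the unrevised standard would flag it on the 3 / (3 + 1) = 75% unverified ratio alone.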
2003-10-05, 01:27   #10
GP2

Sep 2003

Posts

Quote:
 Originally posted by garo GP2, I am surprised that you have garo4 and garo_jul in that list. Can you tell me what criteria were used to get these machines?
OK, since you posted in the other thread as well, I'll answer it there.

One of the reasons I posted the list of TPR machines above was precisely to get this type of feedback.

2003-10-05, 18:34   #11
PageFault

Aug 2002
Dawn of the Dead

11101011₂ Posts

More on TPR machines:

Unlike most of the history of crunching Prime95, a competitive team is going to have many enthusiast machines and overclocked machines. For some, commitment is casual and they may walk away leaving a pile of bad results. Those dedicated to being #1 are going to be more involved in what is happening.

One aspect of winning is not losing credit due to bad tests. There is a trend in the making at TPR that a new machine must pass a set of doublechecks before going on to first-time or 33M tests. A machine should periodically do a set of doublechecks to verify integrity. Errors need to be isolated and corrected, and this is going to produce many bad results.

TPR machines are more prone than average to run doublechecks. Many are borged work machines, and the reduced P-1 time and memory mandate this. This practice has revealed problems, e.g. dsheets has a group of identical machines which all fail due to some chipset issue (or other). Many prefer the fast credit of doublechecks. Many have a spectrum of hardware at home. Old PIIIs and tbirds tend to get put on doublechecks. Indeed, the default work type for a PIII is now a doublecheck.

All these factors are going to show in TPR's error incidence. It is going to be higher than that of other groups. This may even push up the project's overall error rate.

The good thing about this is that it promotes self-awareness. TPR is but one of a group of teams, a group that has upper ranking in most of the other DC projects. Knowing that a machine is capable of good results can only benefit the other projects that get crunched.

