Old 2009-08-20, 04:28   #1205
gd_barnes
 
 
May 2007
Kansas; USA

10100100110101₂ Posts

Quote:
Originally Posted by mdettweiler
Hmm...I wouldn't be so sure about that. What's more likely is a situation like this:

1) server hands out original to client A
2) client A doesn't return in time for the jobMaxTime
3) server hands out duplicate to client B
4) client A returns a little late, but before client B returns (results are credited to client B)
5) client B returns and is rejected since the server no longer has information on the tests

As you can see, 1) and 3) would be quite a ways apart. Of course, this is only one of a number of situations that could lead to rejected results, but it is the most common. Anyway, long story short, it's not a given that the duplicates and originals were handed out at the same time.

Even in the case of the power outage, they would have been handed out two times a ways apart: one right before the power outage, and another soon after the servers came back online.
Max
I never said they were handed out at the same time. I said they were handed out twice close together. Close together = right before and right after a crash. Close together = a few hours apart. I had the server back up in an hour. One hour certainly does not constitute "a ways apart".

You are not so sure? How could you NOT be 100% sure that they were handed out well under 24 hours apart? You have all of the proof you need in my post #1185. Had you looked closely at that, you would have concluded much differently. Please check that post, which shows:

Rejected results from Aug. 18th (k, n, time returned, user):
333 970326 10:59:52 marco.bs
315 970327 10:59:52 marco.bs
339 970327 15:38:58 kar_bon
315 970334 15:38:59 kar_bon
333 970334 15:38:59 kar_bon
(etc.)

Original returned results (k, n, time returned, date, user, test time = derived hand-out time):
333 970326 16:34:25 Aug. 17th gd_barnes 1224 secs = handed out 16:14:01 on 17th
315 970327 16:34:28 Aug. 17th gd_barnes 1215 secs = handed out 16:14:13 on 17th
339 970327 10:59:52 Aug. 18th kar_bon 3855 secs = handed out 09:55:37 on 18th
315 970334 10:59:52 Aug. 18th kar_bon 3854 secs = handed out 09:55:38 on 18th
333 970334 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
(etc.)
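
Each hand-out time above is just the return time minus the test's elapsed seconds. A minimal Python sketch of that subtraction, for anyone who wants to repeat it on more pairs. The year 2009 is assumed from the date of this post, and the helper name `handed_out` is made up for illustration:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def handed_out(returned: str, elapsed_secs: int) -> datetime:
    """Approximate hand-out time = return time minus elapsed test time."""
    return datetime.strptime(returned, FMT) - timedelta(seconds=elapsed_secs)

# (k, n, time returned, elapsed seconds), copied from the list above.
# The year 2009 is an assumption taken from the post date.
originals = [
    (333, 970326, "2009-08-17 16:34:25", 1224),
    (315, 970327, "2009-08-17 16:34:28", 1215),
    (339, 970327, "2009-08-18 10:59:52", 3855),
    (315, 970334, "2009-08-18 10:59:52", 3854),
    (333, 970334, "2009-08-18 10:59:53", 3855),
]

for k, n, returned, secs in originals:
    print(k, n, "handed out", handed_out(returned, secs).strftime(FMT))
```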

Comparisons:
333 970326 reject return: 8/18 10:59:52 original handed out 8/17 16:14:01
315 970327 reject return: 8/18 10:59:52 original handed out 8/17 16:14:13
339 970327 reject return: 8/18 15:38:58 original handed out 8/18 09:55:37
315 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
333 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
(etc.)

Differences between original handed out and rejected returned:
333 970326 18 hrs, 45 mins, 51 secs
315 970327 18 hrs, 45 mins, 39 secs
339 970327 5 hrs, 43 mins, 21 secs
315 970334 5 hrs, 43 mins, 21 secs
333 970334 5 hrs, 43 mins, 21 secs
(etc.)
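
The differences above can be rechecked with a short Python sketch that subtracts the original hand-out time from the rejected return time and compares the gap against 24 hours. The year 2009 (from the post date), the 24-hour constant, and the helper names `gap` and `describe` are assumptions for illustration, not anything taken from the LLRnet code:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
# Assumption: the 24-hour figure discussed in this post.
JOB_MAX_TIME = timedelta(hours=24)

def gap(original_handed_out: str, reject_returned: str) -> timedelta:
    """Time between the original hand-out and the rejected return."""
    return (datetime.strptime(reject_returned, FMT)
            - datetime.strptime(original_handed_out, FMT))

def describe(delta: timedelta) -> str:
    """Format a gap the same way as the list above."""
    hrs, rem = divmod(int(delta.total_seconds()), 3600)
    mins, secs = divmod(rem, 60)
    return f"{hrs} hrs, {mins} mins, {secs} secs"

# (k, n, original handed out, reject returned), from the comparisons above.
comparisons = [
    (333, 970326, "2009-08-17 16:14:01", "2009-08-18 10:59:52"),
    (315, 970327, "2009-08-17 16:14:13", "2009-08-18 10:59:52"),
    (339, 970327, "2009-08-18 09:55:37", "2009-08-18 15:38:58"),
    (315, 970334, "2009-08-18 09:55:38", "2009-08-18 15:38:59"),
    (333, 970334, "2009-08-18 09:55:38", "2009-08-18 15:38:59"),
]

for k, n, out, back in comparisons:
    d = gap(out, back)
    verdict = "under 24 hrs" if d < JOB_MAX_TIME else "24 hrs or more"
    print(k, n, describe(d), "->", verdict)
```

Since the duplicate had to be handed out before it could come back rejected, each gap is an upper bound on how far apart the two hand-outs were.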

Would you care to restate what the more likely scenario is now?

You seem to want to think that there is no problem. I've already proven that there is a problem with the way pairs are handed out after an outage or crash, but you seem to want to disprove that by implying that the server waited its usual 24 hours before doing so. Given the times above, that's not mathematically possible!

I know you may deem this to be a fruitless exercise but we have gleaned very important information from it. That is:

When the LLRnet servers crash and/or there is a power outage, they frequently hand out the same k/n pair twice in short succession.

This may not help now but it could easily help narrow the problem greatly in the future.

This kind of sleuthing would be even handier for debugging the PRPnet servers because the logs are time stamped. Using time math to determine a pattern of when problems occurred will very frequently help determine WHY they occurred!


Gary

Last fiddled with by gd_barnes on 2009-08-20 at 08:15