#1200 | ||
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Quote:
1) server hands out original to client A
2) client A doesn't return in time for the jobMaxTime
3) server hands out duplicate to client B
4) client A returns a little late, but before client B returns (results are credited to client B)
5) client B returns and is rejected since the server no longer has information on the tests

As you can see, 1) and 3) would be quite a ways apart. Of course, this is only one of a number of situations that could lead to rejected results, but it is the most common.

Anyway, long story short, it's not a given that the duplicates and originals were handed out at the same time. Even in the case of the power outage, they would have been handed out at two times a ways apart: one right before the power outage, and another soon after the servers came back online.
Quote:
Max |
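The five-step scenario above can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyServer`, its methods, and the 1-day `JOB_MAX_TIME` value are hypothetical names chosen here, not LLRnet's actual code or configuration.

```python
from dataclasses import dataclass, field

JOB_MAX_TIME = 24 * 3600  # assumption: 1-day expiry, in seconds


@dataclass
class ToyServer:
    """Hypothetical stand-in for the server's bookkeeping, not LLRnet code."""
    assigned: dict = field(default_factory=dict)  # pair -> (client, hand-out time)

    def hand_out(self, pair, client, now):
        """Assign the pair unless another client still holds it within JOB_MAX_TIME."""
        holder = self.assigned.get(pair)
        if holder is not None and now - holder[1] < JOB_MAX_TIME:
            return False
        self.assigned[pair] = (client, now)
        return True

    def accept_result(self, pair):
        """Credit whoever currently holds the pair, then forget the pair."""
        holder = self.assigned.pop(pair, None)
        return "rejected" if holder is None else f"credited to {holder[0]}"


HOUR = 3600
server = ToyServer()
pair = (333, 970326)
events = [
    server.hand_out(pair, "A", 0),          # 1) original goes to client A
    server.hand_out(pair, "B", 25 * HOUR),  # 3) expired, so client B gets a duplicate
    server.accept_result(pair),             # 4) A returns late; credit goes to B
    server.accept_result(pair),             # 5) B returns; server has no record
]
print(events)
```

In this model the two hand-outs can only happen at least `JOB_MAX_TIME` apart, which is the point being made about steps 1) and 3).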
||
#1201 | |
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Quote:
|
|
#1202 | |
A Sunny Moo
Aug 2007
USA (GMT-5)
14151₈ Posts |
Quote:
|
|
#1203 |
I ♥ BOINC!
Oct 2002
Glendale, AZ. (USA)
1113₁₀ Posts |
I don't have server crashes, due to power outages or any other reasons.
All CPUs are run at stock speeds, with no other tasks running other than the llrnet servers, mysql, and the crontab entries that handle the stats part of things.

I use the screen utility at a shell prompt to run the servers within a single shell. You use some other GUI method, so I can't help you there - Max?

As to the http://nplb.ironbits.net Server status web pages, those are created by me using vbscript under a windows network and samba shares. I can send the code to Max and he can convert it to perl or shell (or AMDave, for that matter). Then they can substitute the new perl script for the vbscript one and have it point to http://noprimeleftbehind.net as the default page, rather than the current one that AMDave does for you that replaced it. Then I'll disable my script that runs every hour, and they can crontab it to run their new perl or shell version.

If the Server crashes, reboot the server so it comes up clean, then restart the clients from /etc/init.d/start_llrnetservers start (assuming you have a start_llrnetservers shell script). |
#1204 | |||
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Quote:
Quote:
Dave, not meaning to impose, but do you possibly have enough knowledge of vbscript, Perl, and/or bash scripting that you could do such a conversion? Or, even better, do you know of a way to run vbscripts natively on Linux so that we don't have to convert anything (besides changing a few pathnames in the scripts)?

Secondly, we're not interested in replacing the database display at http://www.noprimeleftbehind.net/ with the current http://nplb.ironbits.net/ page; rather, we'd like to have the latter still available as a separate page, like it is now. I'm thinking http://llrnet.noprimeleftbehind.net/ would work well for that.
Quote:
Max |
|||
#1205 | |
May 2007
Kansas; USA
19·541 Posts |
Quote:
You are not so sure? How could you NOT be 100% sure that they were handed out well < 24 hours apart? You have all of the proof you need in my post #1185. Had you looked closely at that, you would have concluded much differently. Please check that post, which shows:

Rejected results from Aug. 18th:
333 970326 10:59:52 marco.bs
315 970327 10:59:52 marco.bs
339 970327 15:38:58 kar_bon
315 970334 15:38:59 kar_bon
333 970334 15:38:59 kar_bon
(etc.)

Original returned results:
333 970326 16:34:25 Aug. 17th gd_barnes 1224 secs = handed out 16:14:01 on 17th
315 970327 16:34:28 Aug. 17th gd_barnes 1215 secs = handed out 16:14:13 on 17th
339 970327 10:59:52 Aug. 18th kar_bon 3855 secs = handed out 09:55:37 on 18th
315 970334 10:59:52 Aug. 18th kar_bon 3854 secs = handed out 09:55:38 on 18th
333 970334 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
(etc.)

Comparisons:
333 970326 reject return: 8/18 10:59:52 original handed out 8/17 16:14:01
315 970327 reject return: 8/18 10:59:52 original handed out 8/17 16:14:13
339 970327 reject return: 8/18 15:38:58 original handed out 8/18 09:55:37
315 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
333 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
(etc.)

Differences between original handed out and rejected returned:
333 970326 18 hrs, 45 mins, 51 secs
315 970327 18 hrs, 45 mins, 39 secs
339 970327 5 hrs, 43 mins, 21 secs
315 970334 5 hrs, 43 mins, 21 secs
333 970334 5 hrs, 43 mins, 21 secs
(etc.)

Would you care to restate what is the more likely scenario now?

You seem to want to think that there is no problem. I've already proven that there is a problem with the way pairs are handed out after an outage or crash, but you seem to want to disprove me by implying that the server waited its usual 24 hours before doing so. In math terms, that's not possible!

I know you may deem this to be a fruitless exercise, but we have gleaned very important information from it. That is: When the LLRnet servers crash and/or there is a power outage, they frequently hand out the same k/n pair twice in short succession. This may not help now, but it could easily help narrow the problem greatly in the future.

This kind of sleuthing would come in even handier for PRPnet server debugging because the logs are time stamped. Using time math to determine a pattern of when problems occurred will very frequently help determine WHY they occurred!

Gary
Last fiddled with by gd_barnes on 2009-08-20 at 08:15 |
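The time math in this post can be reproduced with a short script. It is a sketch only: the three rows are copied from the figures above, the year 2009 is taken from the edit timestamp, and it assumes (as the post does) that the reported seconds run from hand-out to return. The function name `handed_out` is made up for this example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def handed_out(returned: str, elapsed_secs: int) -> datetime:
    """Estimate the hand-out time as return time minus reported elapsed seconds."""
    return datetime.strptime(returned, FMT) - timedelta(seconds=elapsed_secs)


# (k, n, original return, elapsed secs, rejected return), from the post above
rows = [
    (333, 970326, "2009-08-17 16:34:25", 1224, "2009-08-18 10:59:52"),
    (315, 970327, "2009-08-17 16:34:28", 1215, "2009-08-18 10:59:52"),
    (339, 970327, "2009-08-18 10:59:52", 3855, "2009-08-18 15:38:58"),
]

gaps = []
for k, n, orig_ret, secs, rej_ret in rows:
    start = handed_out(orig_ret, secs)
    gap = datetime.strptime(rej_ret, FMT) - start
    gaps.append(gap)
    print(k, n, "handed out", start.time(), "gap to rejected return", gap)
```

Every gap comes out under 24 hours, which is the comparison the argument rests on.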
|
#1206 |
I ♥ BOINC!
Oct 2002
Glendale, AZ. (USA)
3·7·53 Posts |
Vertical Server Status Reports can be viewed here:
http://www.noprimeleftbehind.net/iro...tml/index.html |
#1207 | |
May 2007
Kansas; USA
19×541 Posts |
Quote:
Edited by MyDog: Yup, that's the one, thanks David. Last fiddled with by MyDogBuster on 2009-08-20 at 06:41 |
|
#1208 |
Jan 2006
deep in a while-loop
2×7×47 Posts |
#1209 |
May 2007
Kansas; USA
10279₁₀ Posts |
After further investigation, I see that the joblist.txt file is updated continuously. Although it is possible that something could slip through the cracks during a power outage and cause a minimal # of pairs to be assigned twice, the problem should be very isolated. I was able to nail down to the power outage just 10 rejected pairs in port G8000 that had been handed out twice. That is certainly a reasonable explanation for a small # of them. The numerous rejected results since then is another matter and may NOT be as a result of the server dropping and then coming back up.
Now, we have another problem today. Karsten, you have already returned 19 rejected results as of 2:00 AM CDT Aug. 20th (7:00 AM GMT), all of which already had returned results by you from about 11:45 AM CDT (4:45 PM GMT) Aug. 19th. Here is a list of them:
Code:
345 971680
339 971681
321 971682
345 971686
339 971769
321 971773
315 971774
339 971777
321 971858
339 971859
327 971861
339 971861
345 971931
327 971936
345 971937
321 971941
339 972019
345 972019
327 972024

Karsten, at this point, I have 2 revelations for you:

Revelation #1: All of the pairs were originally returned by you in < 3 mins., some in as few as 36 seconds, and further...they were the only pairs on that day with such short timings!! Now, unless you've invented some new software or mathematics that none of the rest of us are aware of, that would not be possible unless you crunched those pairs before you got them from the server. Would you care to enlighten us on how you managed that?

Revelation #2: In every case above, it appears that you reserved them in 30-pair chunks BUT...only the last 4 in EVERY case ended up with a problem.

Conclusion: Something is going wrong with the last 4 pairs out of the 30 that Karsten caches each time.

Max, here is an example from the stdout.log file:
Code:
connection closed (socket 4)
connection reqeust from 14e48554:1242 (socket 5)
(20 misc. proposed pairs)
Proposing pair 327/971841 to kar_bon
Proposing pair 333/971843 to kar_bon
Proposing pair 345/971844 to kar_bon
Proposing pair 315/971845 to kar_bon
Proposing pair 327/971845 to kar_bon
Proposing pair 345/971847 to kar_bon
Proposing pair 321/971858 to kar_bon }
Proposing pair 339/971859 to kar_bon } all were eventually cancelled, handed out
Proposing pair 327/971861 to kar_bon } a 2nd time, and ultimately rejected
Proposing pair 339/971861 to kar_bon }
connection closed (socket 5)
connection reqeust from 14e48554:1247 (socket 4)

Revelation #3: All of the offending pairs were cancelled and reassigned in one fell swoop in consecutive fashion!

To demonstrate my great predictive capability (lol), I will predict that before today is done, the following pair will also be rejected:
333 972027

Because that is the final pair in the final group of 4 above and is also the final pair that was cancelled in error and reassigned in the group of 20 consecutive cancelled-reassigned pairs.

Max, Karsten, or anyone else...any thoughts on this?

Gary
Last fiddled with by gd_barnes on 2009-08-20 at 08:26 |
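This kind of log check can be automated. The sketch below scans lines in the "Proposing pair k/n to user" format quoted from stdout.log and reports pairs that were proposed more than once. The function names are invented for this example, and the repeated line in `sample` is a made-up second hand-out added purely for illustration; it is not taken from the real log.

```python
import re
from collections import defaultdict

# Matches the "Proposing pair k/n to user" lines quoted from stdout.log
PROPOSE = re.compile(r"Proposing pair (\d+)/(\d+) to (\S+)")


def proposals(log_lines):
    """Map each (k, n) pair to the users it was proposed to, in log order."""
    seen = defaultdict(list)
    for line in log_lines:
        m = PROPOSE.search(line)
        if m:
            seen[(int(m.group(1)), int(m.group(2)))].append(m.group(3))
    return seen


def handed_out_twice(log_lines):
    """Return only the pairs that appear in more than one proposal line."""
    return {p: users for p, users in proposals(log_lines).items() if len(users) > 1}


sample = [
    "Proposing pair 345/971847 to kar_bon",
    "Proposing pair 321/971858 to kar_bon",
    "connection closed (socket 5)",
    "Proposing pair 321/971858 to kar_bon",  # hypothetical repeat, for illustration
]
dupes = handed_out_twice(sample)
print(dupes)
```

Run against a full log, the same function would list every pair that was proposed twice, which can then be compared with the rejected results.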
#1210 |
Mar 2006
Germany
2866₁₀ Posts |
here is the list of the rejected pairs you found from my original resultfile with local timings (GMT):
Code:
[2009-08-19 23:16:33] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1822.835 sec.
[2009-08-19 23:46:58] 339*2^971681-1 is not prime. Res64: E9B22ACD57CAD289 Time : 1824.900 sec.
[2009-08-20 00:17:23] 321*2^971682-1 is not prime. Res64: BA5C48FA90A17D3D Time : 1824.457 sec.
[2009-08-20 00:47:48] 345*2^971686-1 is not prime. Res64: F229FEFE53441AE2 Time : 1824.755 sec.
[2009-08-20 01:18:14] 339*2^971769-1 is not prime. Res64: 5A02350FBA24EFB3 Time : 1825.932 sec.
[2009-08-20 01:48:39] 321*2^971773-1 is not prime. Res64: B50AAC2F1E3605A3 Time : 1825.393 sec.
[2009-08-20 02:19:05] 315*2^971774-1 is not prime. Res64: D68B6BBA049EDC8C Time : 1825.324 sec.
[2009-08-20 02:49:30] 339*2^971777-1 is not prime. Res64: CF808406635150CA Time : 1825.271 sec.
[2009-08-20 03:19:56] 321*2^971858-1 is not prime. Res64: 7B148E3A3E36AC2C Time : 1825.337 sec.
[2009-08-20 03:50:21] 339*2^971859-1 is not prime. Res64: 8581276FB298D4CF Time : 1824.778 sec.
[2009-08-20 04:20:50] 327*2^971861-1 is not prime. Res64: 012FA1B9DB743746 Time : 1828.781 sec.
[2009-08-20 04:51:16] 339*2^971861-1 is not prime. Res64: C33A7F731DB30AF3 Time : 1825.660 sec.
[2009-08-20 05:21:48] 345*2^971931-1 is not prime. Res64: 5FDA988415CA2DAF Time : 1832.460 sec.
[2009-08-20 05:52:20] 327*2^971936-1 is not prime. Res64: 41A291142D1304CB Time : 1830.618 sec.
[2009-08-20 06:22:50] 345*2^971937-1 is not prime. Res64: 300195E9F52B74B4 Time : 1829.116 sec.
[2009-08-20 06:53:15] 321*2^971941-1 is not prime. Res64: 056DFC3984788400 Time : 1825.302 sec.
[2009-08-20 07:23:43] 339*2^972019-1 is not prime. Res64: 6293C6AA532B82F2 Time : 1827.491 sec.
[2009-08-20 07:54:07] 345*2^972019-1 is not prime. Res64: 7FD353B995437DEF Time : 1824.657 sec.
[2009-08-20 08:24:34] 327*2^972024-1 is not prime. Res64: 99B92A53CA02A8E7 Time : 1826.415 sec.

these 19 pairs were all done previously by the other 5 cores with following dates:
Code:
core #2:
[2009-08-19 07:57:39] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1220.857 sec.
[2009-08-19 08:28:05] 339*2^971681-1 is not prime. Res64: E9B22ACD57CAD289 Time : 1825.646 sec.
[2009-08-19 09:58:23] 321*2^971682-1 is not prime. Res64: BA5C48FA90A17D3D Time : 1824.229 sec.
[2009-08-19 10:28:48] 345*2^971686-1 is not prime. Res64: F229FEFE53441AE2 Time : 1825.130 sec.
core #3:
[2009-08-19 07:39:54] 339*2^971769-1 is not prime. Res64: 5A02350FBA24EFB3 Time : 152.072 sec.
[2009-08-19 08:10:20] 321*2^971773-1 is not prime. Res64: B50AAC2F1E3605A3 Time : 1825.383 sec.
[2009-08-19 08:40:45] 315*2^971774-1 is not prime. Res64: D68B6BBA049EDC8C Time : 1825.262 sec.
[2009-08-19 09:11:13] 339*2^971777-1 is not prime. Res64: CF808406635150CA Time : 1828.084 sec.
core #4:
[2009-08-19 07:53:26] 321*2^971858-1 is not prime. Res64: 7B148E3A3E36AC2C Time : 960.326 sec.
[2009-08-19 08:23:53] 339*2^971859-1 is not prime. Res64: 8581276FB298D4CF Time : 1826.467 sec.
[2009-08-19 08:54:21] 327*2^971861-1 is not prime. Res64: 012FA1B9DB743746 Time : 1827.778 sec.
[2009-08-19 09:24:46] 339*2^971861-1 is not prime. Res64: C33A7F731DB30AF3 Time : 1825.092 sec.
core #5:
[2009-08-19 07:56:08] 345*2^971931-1 is not prime. Res64: 5FDA988415CA2DAF Time : 1118.067 sec.
[2009-08-19 08:26:33] 327*2^971936-1 is not prime. Res64: 41A291142D1304CB Time : 1825.077 sec.
[2009-08-19 08:57:00] 345*2^971937-1 is not prime. Res64: 300195E9F52B74B4 Time : 1826.751 sec.
[2009-08-19 09:27:27] 321*2^971941-1 is not prime. Res64: 056DFC3984788400 Time : 1826.395 sec.
core #6:
[2009-08-19 08:01:49] 339*2^972019-1 is not prime. Res64: 6293C6AA532B82F2 Time : 1455.675 sec.
[2009-08-19 08:32:17] 345*2^972019-1 is not prime. Res64: 7FD353B995437DEF Time : 1827.579 sec.
[2009-08-19 09:02:43] 327*2^972024-1 is not prime. Res64: 99B92A53CA02A8E7 Time : 1825.951 sec.
[2009-08-19 09:33:08] 333*2^972027-1 is not prime. Res64: D27DC9BDAB3B2AAE Time : 1824.835 sec.
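The two result listings above can be cross-checked mechanically. The sketch below parses result lines in the format shown and reports, for each pair that appears in both runs, whether the Res64 residues agree. The regular expression and the helper names are assumptions written for this example; only the two sample lines come from the listings.

```python
import re

# Pattern for the result lines shown above (an assumption based on that format)
RESULT = re.compile(
    r"\[(?P<when>[\d-]+ [\d:]+)\] (?P<k>\d+)\*2\^(?P<n>\d+)-1 is not prime\. "
    r"Res64: (?P<res>[0-9A-F]{16}) Time : (?P<secs>[\d.]+) sec\."
)


def parse(text):
    """Return {(k, n): (residue, seconds)} for every result line found in text."""
    return {
        (int(m["k"]), int(m["n"])): (m["res"], float(m["secs"]))
        for m in RESULT.finditer(text)
    }


def residues_agree(first_run, second_run):
    """For pairs present in both runs, report whether the residues match."""
    a, b = parse(first_run), parse(second_run)
    return {pair: a[pair][0] == b[pair][0] for pair in a.keys() & b.keys()}


first = "[2009-08-19 07:57:39] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1220.857 sec."
second = "[2009-08-19 23:16:33] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1822.835 sec."
check = residues_agree(first, second)
print(check)
```

Matching residues confirm that the second run was a genuine retest of the same pair and not a different candidate.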
I don't know what is behind the completion timings of about 36 seconds. As you can see, each core needs about half an hour for one pair.

So my setup for this drive is:
- testing with 6 cores, each with WUCacheSize = 30
- after completion, all 30 pairs are submitted, one LLRnet directory after the other, so there's not much traffic for the server.

So normally there should be 180 pairs in one hour listed on the stats pages.

But: if I have not completed all 30 pairs for one core, say only about 20, and submit them, I have 10 'old' pairs left plus 20 new ones. I submit my results 2 times a day. This means the maxJobTime cannot be 1 day at all. If it is now set to 2 days, this should not happen again.

I think I should do this drive (about 9900 pairs left) manually if such issues still occur!

PS: I will just submit my pairs again, and 333 972027 will probably be rejected then!

Last fiddled with by kar_bon on 2009-08-20 at 15:06 |
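The caching arithmetic above can be worked through numerically. This is a toy model with loudly stated assumptions: about 1825 seconds per pair (from the timings listed), a cache of 30 pairs, results going back only at scheduled submissions every 12 hours, and the cache being fetched at a submission time. The function names are made up, and the model is not meant to reproduce the exact number of affected pairs seen in the thread.

```python
import math

SECS_PER_PAIR = 1825.0   # roughly half an hour per pair, from the timings above
CACHE_SIZE = 30          # WUCacheSize
SUBMIT_EVERY_H = 12.0    # assumption: results go back twice a day
EXPIRY_H = 24.0          # assumption: a 1-day job expiry on the server


def hours_to_finish(pairs, secs_per_pair=SECS_PER_PAIR):
    """Hours of computing needed to work through the given number of pairs."""
    return pairs * secs_per_pair / 3600


def age_at_submit_h(position, submit_every_h=SUBMIT_EVERY_H):
    """Hours from hand-out to submission for the pair at a 1-based queue position.

    The pair is finished after `position` pairs' worth of computing and then
    waits for the next scheduled submission.
    """
    done_h = hours_to_finish(position)
    return math.ceil(done_h / submit_every_h) * submit_every_h


full_cache_h = hours_to_finish(CACHE_SIZE)
at_risk = [p for p in range(1, CACHE_SIZE + 1) if age_at_submit_h(p) >= EXPIRY_H]
print(f"full cache: {full_cache_h:.1f} h, positions at the expiry edge: {at_risk}")
```

Under these assumptions only the tail of each 30-pair cache reaches the 1-day limit before it is returned, which fits the suggestion that a 2-day limit would avoid the problem.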