[quote=gd_barnes;186612]Here's why: Likely the originals and the duplicates were handed out at about the same time.[/quote]
Hmm...I wouldn't be so sure about that. What's more likely is a situation like this:
1) server hands out original to client A
2) client A doesn't return in time for the jobMaxTime
3) server hands out duplicate to client B
4) client A returns a little late, but before client B returns (results are credited to client B)
5) client B returns and is rejected since the server no longer has information on the tests
As you can see, 1) and 3) would be quite a ways apart. Of course, this is only one of a number of situations that could lead to rejected results, but it is the most common. Anyway, long story short, it's not a given that the duplicates and originals were handed out at the same time. Even in the case of the power outage, they would have been handed out two times a ways apart: one right before the power outage, and another soon after the servers came back online.
[quote]One more question: Will getting David's code on to my servers mean that we can avoid the "loop thing" code to restart the servers? If so, that will prevent quite a bit of this "after outage" multiple crashes that we keep encountering.[/quote]I don't know. David, have you ever had problems with servers crashing, especially after things like power outages? (Not that you usually encounter those since you have a UPS, but...) If so, how do you deal with them? Max :smile: |
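For anyone who wants to play with the five-step scenario, here is a minimal toy model in Python. It is only a sketch under assumptions: `ToyServer`, `hand_out` and `submit` are made-up names, and the rule that a returned result is credited to whoever currently holds the pair is taken from step 4 of the list, not from the actual LLRnet source.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy job list. NOT LLRnet's real logic, only the five steps in the post."""
    job_max_time: float                       # seconds before a pair may be reissued
    jobs: dict = field(default_factory=dict)  # pair -> (client, time handed out)

    def hand_out(self, pair, client, now):
        """Reserve a pair unless another client still holds it within job_max_time."""
        holder = self.jobs.get(pair)
        if holder is not None and now - holder[1] < self.job_max_time:
            return False
        self.jobs[pair] = (client, now)
        return True

    def submit(self, pair, client, now):
        """Accept the first result for a pair; later results find no record."""
        holder = self.jobs.pop(pair, None)
        if holder is None:
            return "rejected"
        # Assumption from step 4: credit goes to the current holder, not the sender.
        return f"accepted, credited to {holder[0]}"


HOUR = 3600
server = ToyServer(job_max_time=24 * HOUR)
pair = (333, 970326)
server.hand_out(pair, "A", now=0)                # step 1
server.hand_out(pair, "B", now=25 * HOUR)        # steps 2 and 3
print(server.submit(pair, "A", now=26 * HOUR))   # step 4
print(server.submit(pair, "B", now=30 * HOUR))   # step 5
```

In this model the two hand-outs can never be closer together than `job_max_time`, which is the point of the argument in the post.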
[quote=gd_barnes;186613]Lennart and AMDave, both of these responses are incorrect and both link to the same incorrect page. Ian asked for the "noprimeleftbehind" link name version of [URL="http://nplb.ironbits.net/"][COLOR=#800080]http://nplb.ironbits.net/[/COLOR][/URL]. If everything is going to roll over to the new server, we need a new link name with "noprimeleftbehind" in it that specifically has this web page in it.
I previously inquired to David about this. David, are you just going to leave this one link on the old "ironbits" link name or can we expect a new link that has "noprimeleftbehind" in it? This is an important page that we don't want to lose. I'll Email David with a link to this posting. Thanks, Gary[/quote] As I understand it, the plan was to get this on [URL]http://llrnet.noprimeleftbehind.net/[/URL] even before we made the plans to move things to your new server. Until the recent IP problems on David's network, that link redirected to [URL]http://nplb.ironbits.net/[/URL]. Now, it just goes to David's personal page. Once we get your static IP set up, we'll have llrnet.noprimeleftbehind.net point to your server and have your server configured to answer to that address with the nplb.ironbits.net status page. |
[quote=kar_bon;186614]so if you need timestamps for the output, try this: [URL]http://www.mersenneforum.org/showthread.php?t=10066[/URL]
i've given those timestamps for the client-side to write this (still using this on my clients):
[code]
[2009-08-20 00:36:21] 2013*2^235548-1 is not prime. Res64: 0BAAB87826667E2E Time : 61.858 sec.
[2009-08-20 00:37:23] 2013*2^235595-1 is not prime. Res64: A6ED66F8AA9036F5 Time : 61.854 sec.
[2009-08-20 00:38:24] 2013*2^235640-1 is not prime. Res64: 8BCA8E2B12058E30 Time : 61.950 sec.
[2009-08-20 00:39:26]
[/code]
note: a result and the following timestamp are a pair (couldn't handle this in other order). so you have to read it as:
[code]
[2009-08-20 00:37:23] 2013*2^235548-1 is not prime. Res64: 0BAAB87826667E2E Time : 61.858 sec.
[2009-08-20 00:38:24] 2013*2^235595-1 is not prime. Res64: A6ED66F8AA9036F5 Time : 61.854 sec.
[2009-08-20 00:39:26] 2013*2^235640-1 is not prime. Res64: 8BCA8E2B12058E30 Time : 61.950 sec.
[/code]
every line has its timestamp and result. perhaps you can change the "server.lua" the same.[/quote] Ah, good idea. I'll see about trying to get that to work as soon as I get the chance. |
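The pairing rule kar_bon describes (each result belongs to the timestamp that follows it) can be undone with a few lines of Python. This is a rough sketch, not part of the LLRnet client or of server.lua; the function name `pair_results` is made up, and it assumes timestamps always look like `[YYYY-MM-DD HH:MM:SS]`.

```python
import re

# Timestamp as it appears in the client output, e.g. [2009-08-20 00:36:21]
TS = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def pair_results(log_text):
    """Return (completion timestamp, result text) tuples.

    Each result is paired with the timestamp that FOLLOWS it, so the first
    timestamp in the log is dropped and the trailing lone timestamp is used.
    """
    parts = TS.split(log_text)        # [prefix, ts1, text1, ts2, text2, ...]
    stamps = parts[1::2]
    bodies = [p.strip() for p in parts[2::2]]
    pairs = []
    for i, body in enumerate(bodies):
        if body and i + 1 < len(stamps):
            pairs.append((stamps[i + 1], body))
    return pairs


sample = """[2009-08-20 00:36:21] 2013*2^235548-1 is not prime. Res64: 0BAAB87826667E2E Time : 61.858 sec.
[2009-08-20 00:37:23] 2013*2^235595-1 is not prime. Res64: A6ED66F8AA9036F5 Time : 61.854 sec.
[2009-08-20 00:38:24] 2013*2^235640-1 is not prime. Res64: 8BCA8E2B12058E30 Time : 61.950 sec.
[2009-08-20 00:39:26]
"""

for stamp, result in pair_results(sample):
    print(f"[{stamp}] {result}")
```

Because it splits on the timestamps rather than on line breaks, it should not matter whether the client writes the timestamp on the same line as the result or on its own line.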
I don't have server crashes, due to power outages or any other reasons.
All CPUs are run at stock speeds, no other tasks running other than the llrnet servers, mysql and crontab entries that handle the stats part of things. I use the utility screen in a shell prompt to run the servers within a single shell, you use some other GUI method, so can't help you there - Max?
As to the [url]http://nplb.ironbits.net[/url] Server status web pages, that is created by me using vbscript under a windows network and samba shares. I can send the code to Max and he can convert it to perl or shell, or AMDave for that matter, then they can substitute the vbscript script with the new perl one and have it point to [url]http://noprimeleftbehind.net[/url] as the default page, rather than the current one that AMDave does for you that replaced it, then I'll disable my script that runs every hour, and they can crontab it to run their new perl or shell version.
If the Server crashes, reboot the server so it comes up clean, then restart the clients from /etc/init.d/start_llrnetservers start (assuming you have a start_llrnetservers shell script). |
[quote=IronBits;186634]I don't have server crashes, due to power outages or any other reasons.
All CPUs are run at stock speeds, no other tasks running other than the llrnet servers, mysql and crontab entries that handle the stats part of things. I use the utility screen in a shell prompt to run the servers within a single shell, you use some other GUI method, so can't help you there - Max?[/quote] Yes, what we have is, essentially, GUI terminal windows for each of the servers, with the server started manually from inside the respective windows.
[quote]As to the [URL]http://nplb.ironbits.net[/URL] Server status web pages, that is created by me using vbscript under a windows network and samba shares. I can send the code to Max and he can convert it to perl or shell, or AMDave for that matter, then they can substitute the vbscript script with the new perl one and have it point to [URL]http://noprimeleftbehind.net[/URL] as the default page, rather than the current one that AMDave does for you that replaced it, then I'll disable my script that runs every hour, and they can crontab it to run their new perl or shell version.[/quote] First of all, my programming skills (especially my knowledge of vbscript, which is zero) are not really up to snuff to do the conversion of your scripts to Perl or shell scripts. If you looked at the code from the Perl scripts I use to generate the status pages etc. for the GB servers, you'd see what I mean. :smile: The whole thing is a too-tall tower of blocks that could come tumbling down if one more block is added to it; for one, that's why I haven't added the jobMaxTime and prunePeriod values to the displayed info on the status page as Karsten's requested a couple of times. (Karsten, don't worry, I hear you; rest assured, I'll figure out a way eventually. :smile:)
Dave, not meaning to impose, but do you possibly have enough knowledge of vbscript, Perl, and/or bash scripting that you could do such a conversion? 
Or, even better, do you know of a way to run vbscripts natively on Linux so that we don't have to convert anything (besides changing a few pathnames in the scripts)? Secondly, we're not interested in replacing the database display at [URL]http://www.noprimeleftbehind.net/[/URL] with the current [URL]http://nplb.ironbits.net/[/URL] page; rather, we'd like to have the latter still available as a separate page, like it is now. I'm thinking [URL]http://llrnet.noprimeleftbehind.net/[/URL] would work well for that. [quote]If the Server crashes, reboot the server so it comes up clean, then restart the clients from /etc/init.d/start_llrnetservers start (assuming you have a start_llrnetservers shell script).[/quote] Okay, I see. I don't have the servers set up in init.d or anything like that (having no appreciable experience with utilizing init.d; on the new server, I'll probably use cron instead since I'm more familiar with it). At any rate, though, for now the GB servers are just started manually as I mentioned above. But, possibly you misunderstood what I meant; I was referring to crashes of the LLRnet server applications themselves, not of the actual server machine itself. Have you ever encountered the former on your servers? Max :smile: |
[quote=mdettweiler;186617]Hmm...I wouldn't be so sure about that. What's more likely is a situation like this:
1) server hands out original to client A
2) client A doesn't return in time for the jobMaxTime
3) server hands out duplicate to client B
4) client A returns a little late, but before client B returns (results are credited to client B)
5) client B returns and is rejected since the server no longer has information on the tests
As you can see, 1) and 3) would be quite a ways apart. Of course, this is only one of a number of situations that could lead to rejected results, but it is the most common. Anyway, long story short, it's not a given that the duplicates and originals were handed out at the same time. Even in the case of the power outage, they would have been handed out two times a ways apart: one right before the power outage, and another soon after the servers came back online. Max :smile:[/quote] I never said they were handed out at the same time. I said they were handed out twice close together. Close together = right before and right after a crash. Close together = a few hours apart. I had the server back up in an hour. One hour certainly does not constitute "a ways apart".
You are not so sure? How could you NOT be 100% sure that they were handed out well < 24 hours apart? You have all of the proof you need in my post #1185. Had you looked closely at that, you would have concluded much differently. Please check that post that shows:
Rejected results from Aug. 18th:
333 970326 10:59:52 marco.bs
315 970327 10:59:52 marco.bs
339 970327 15:38:58 kar_bon
315 970334 15:38:59 kar_bon
333 970334 15:38:59 kar_bon
(etc.)
Original returned results:
333 970326 16:34:25 Aug. 17th gd_barnes 1224 secs = handed out 16:14:01 on 17th
315 970327 16:34:28 Aug. 17th gd_barnes 1215 secs = handed out 16:14:13 on 17th
339 970327 10:59:52 Aug. 18th kar_bon 3855 secs = handed out 09:55:37 on 18th
315 970334 10:59:52 Aug. 18th kar_bon 3854 secs = handed out 09:55:38 on 18th
333 970334 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
(etc.)
Comparisons:
333 970326 reject return: 8/18 10:59:52 original handed out 8/17 16:14:01
315 970327 reject return: 8/18 10:59:52 original handed out 8/17 16:14:13
339 970327 reject return: 8/18 15:38:58 original handed out 8/18 09:55:37
315 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
333 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
(etc.)
Differences between original handed out and rejected returned:
333 970326 18 hrs, 45 mins, 51 secs
315 970327 18 hrs, 45 mins, 39 secs
339 970327 5 hrs, 43 mins, 21 secs
315 970334 5 hrs, 43 mins, 21 secs
333 970334 5 hrs, 43 mins, 21 secs
(etc.)
Would you care to restate what is the more likely scenario now? :smile: You seem to want to think that there is no problem. I've already proven that there is a problem with the way pairs are handed out after an outage or crash but you seem to want to unprove me by implying that the server waited its usual 24 hours before doing so. In math terms, that's not mathematically possible! :smile:
I know you may deem this to be a fruitless exercise but we have gleaned very important information from it. That is: [B]When the LLRnet servers crash and/or there is a power outage, they frequently hand out the same k/n pair twice in short succession.[/B] This may not help now but it could easily help narrow the problem greatly in the future. This kind of sleuthing would also come in even handier on the PRPnet server debugging because the logs are time stamped. Using time math to determine a pattern of when problems occurred will very frequently help determine WHY they occurred! Gary |
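The time math in the post above is easy to reproduce. The following Python sketch recomputes the hand-out time (return time minus test seconds) and the gap to the rejected return for the five pairs listed. The helper names `handed_out` and `fmt` are made up for illustration, and the year 2009 is assumed from the timestamps elsewhere in the thread.

```python
from datetime import datetime, timedelta


def handed_out(returned, secs):
    """Estimate when a pair was handed out: return time minus test duration."""
    return returned - timedelta(seconds=secs)


def fmt(delta):
    """Format a timedelta as 'H hrs, M mins, S secs'."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    mins, secs = divmod(rem, 60)
    return f"{hours} hrs, {mins} mins, {secs} secs"


# (k, n, original result returned, test secs, rejected result returned)
rows = [
    (333, 970326, datetime(2009, 8, 17, 16, 34, 25), 1224, datetime(2009, 8, 18, 10, 59, 52)),
    (315, 970327, datetime(2009, 8, 17, 16, 34, 28), 1215, datetime(2009, 8, 18, 10, 59, 52)),
    (339, 970327, datetime(2009, 8, 18, 10, 59, 52), 3855, datetime(2009, 8, 18, 15, 38, 58)),
    (315, 970334, datetime(2009, 8, 18, 10, 59, 52), 3854, datetime(2009, 8, 18, 15, 38, 59)),
    (333, 970334, datetime(2009, 8, 18, 10, 59, 53), 3855, datetime(2009, 8, 18, 15, 38, 59)),
]

for k, n, returned, secs, rejected in rows:
    start = handed_out(returned, secs)
    print(k, n, "handed out", start.strftime("%m/%d %H:%M:%S"),
          "gap", fmt(rejected - start))
```

Every gap comes out under 24 hours, so under a 24-hour jobMaxTime none of these duplicates can be explained by the original assignment expiring.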
Vertical Server Status Reports can be viewed here:
[url]http://www.noprimeleftbehind.net/ironbits/html/index.html[/url] |
[quote=IronBits;186654]Vertical Server Status Reports can be viewed here:
[URL]http://www.noprimeleftbehind.net/ironbits/html/index.html[/URL][/quote] Excellent. Thanks. The 1st post in this thread has been changed. Edited by MyDog: Yup, that's the one, thanks David. |
[QUOTE=mdettweiler;186638]Dave, not meaning to impose, but do you possibly have enough knowledge of vbscript, Perl, and/or bash scripting that you could do such a conversion? [/QUOTE]
yes |
After further investigation, I see that the joblist.txt file is updated continuously. Although it is possible that something could slip through the cracks during a power outage and cause a minimal # of pairs to be assigned twice, the problem should be very isolated. I was able to nail down to the power outage just 10 rejected pairs in port G8000 that had been handed out twice. That is certainly a reasonable explanation for a small # of them. The numerous rejected results since then is another matter and may NOT be as a result of the server dropping and then coming back up.
Now, we have another problem today. Karsten, you have already returned 19 rejected results as of 2:00 AM CDT Aug. 20th (7:00 AM GMT), all of which already had returned results by you from about 11:45 AM CDT (4:45 PM GMT) Aug. 19th. Here is a list of them:
[code]
345 971680
339 971681
321 971682
345 971686
339 971769
321 971773
315 971774
339 971777
321 971858
339 971859
327 971861
339 971861
345 971931
327 971936
345 971937
321 971941
339 972019
345 972019
327 972024
[/code]
Here is what I found out: In checking stdout.log, it appears that all of the offending pairs had been previously originally assigned to you (or me in 2-3 cases), then they appeared to be cancelled in error, and then they were re-assigned to you. Karsten, at this point, I have 2 revelations for you:
Revelation #1: All of the pairs were originally returned by you in < 3 mins., some in as few as 36 seconds and further...they were the only pairs on that day with such short timings!! Now, unless you've invented some new software or mathematics that none of the rest of us are aware of, that would not be possible unless you crunched those pairs before you got them from the server. Would you care to enlighten us on how you managed that?
Revelation #2: In every case above, it appears that you reserved them in 30-pair chunks BUT...only the last 4 in EVERY case ended up with a problem. Conclusion: Something is going on wrong with the last 4 pairs out of the 30 that Karsten caches each time. Max, here is an example from the stdout.log file:
[code]
connection closed (socket 4)
connection reqeust from 14e48554:1242 (socket 5)
(20 misc. proposed pairs)
Proposing pair 327/971841 to kar_bon
Proposing pair 333/971843 to kar_bon
Proposing pair 345/971844 to kar_bon
Proposing pair 315/971845 to kar_bon
Proposing pair 327/971845 to kar_bon
Proposing pair 345/971847 to kar_bon
Proposing pair 321/971858 to kar_bon }
Proposing pair 339/971859 to kar_bon } all were eventually cancelled, handed out
Proposing pair 327/971861 to kar_bon } a 2nd time, and ultimately rejected
Proposing pair 339/971861 to kar_bon }
connection closed (socket 5)
connection reqeust from 14e48554:1247 (socket 4)
[/code]
In every case above, it was always the last 4 pairs before the connection was closed that the pairs were cancelled and handed out a 2nd time. This caused the results to be returned twice with the 2nd set of results being rejected.
Revelation #3: All of the offending pairs were cancelled and reassigned in one fell swoop in consecutive fashion! To demonstrate my great predictive capability (lol), I will predict that before today is done, the following pair will also be rejected: 333 972027. Because that is the final pair in the final group of 4 above and is also the final pair that was cancelled in error and reassigned in the group of 20 consecutive cancelled-reassigned pairs.
Max, Karsten, or anyone else...any thoughts on this? Gary |
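The stdout.log check described above can be partly automated. Here is a small Python sketch that scans log lines for the "Proposing pair k/n to user" messages shown in the excerpt and reports any pair proposed more than once. The function name `repeated_proposals` is made up, and only the proposal line format visible in the excerpt is assumed; the cancellation messages are not shown in the thread, so they are not parsed.

```python
import re
from collections import defaultdict

# Matches lines such as: Proposing pair 321/971858 to kar_bon
PROPOSAL = re.compile(r"Proposing pair (\d+)/(\d+) to (\S+)")


def repeated_proposals(log_lines):
    """Map (k, n) -> [(line number, user), ...] for pairs proposed 2+ times."""
    seen = defaultdict(list)
    for lineno, line in enumerate(log_lines, start=1):
        match = PROPOSAL.search(line)
        if match:
            k, n, user = int(match.group(1)), int(match.group(2)), match.group(3)
            seen[(k, n)].append((lineno, user))
    return {pair: hits for pair, hits in seen.items() if len(hits) > 1}


# Tiny made-up log: the second proposal of 321/971858 is illustrative only.
demo = [
    "connection reqeust from 14e48554:1242 (socket 5)",
    "Proposing pair 345/971847 to kar_bon",
    "Proposing pair 321/971858 to kar_bon }",
    "connection closed (socket 5)",
    "Proposing pair 321/971858 to kar_bon",
]
for pair, hits in repeated_proposals(demo).items():
    print(pair, hits)
```

On a real log it would be run as `repeated_proposals(open("stdout.log"))`; the line numbers make it easy to jump to each hand-out and see what happened between them.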
here is the list of the rejected pairs you found from my original resultfile with local timings (GMT):
[code]
[2009-08-19 23:16:33] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1822.835 sec.
[2009-08-19 23:46:58] 339*2^971681-1 is not prime. Res64: E9B22ACD57CAD289 Time : 1824.900 sec.
[2009-08-20 00:17:23] 321*2^971682-1 is not prime. Res64: BA5C48FA90A17D3D Time : 1824.457 sec.
[2009-08-20 00:47:48] 345*2^971686-1 is not prime. Res64: F229FEFE53441AE2 Time : 1824.755 sec.
[2009-08-20 01:18:14] 339*2^971769-1 is not prime. Res64: 5A02350FBA24EFB3 Time : 1825.932 sec.
[2009-08-20 01:48:39] 321*2^971773-1 is not prime. Res64: B50AAC2F1E3605A3 Time : 1825.393 sec.
[2009-08-20 02:19:05] 315*2^971774-1 is not prime. Res64: D68B6BBA049EDC8C Time : 1825.324 sec.
[2009-08-20 02:49:30] 339*2^971777-1 is not prime. Res64: CF808406635150CA Time : 1825.271 sec.
[2009-08-20 03:19:56] 321*2^971858-1 is not prime. Res64: 7B148E3A3E36AC2C Time : 1825.337 sec.
[2009-08-20 03:50:21] 339*2^971859-1 is not prime. Res64: 8581276FB298D4CF Time : 1824.778 sec.
[2009-08-20 04:20:50] 327*2^971861-1 is not prime. Res64: 012FA1B9DB743746 Time : 1828.781 sec.
[2009-08-20 04:51:16] 339*2^971861-1 is not prime. Res64: C33A7F731DB30AF3 Time : 1825.660 sec.
[2009-08-20 05:21:48] 345*2^971931-1 is not prime. Res64: 5FDA988415CA2DAF Time : 1832.460 sec.
[2009-08-20 05:52:20] 327*2^971936-1 is not prime. Res64: 41A291142D1304CB Time : 1830.618 sec.
[2009-08-20 06:22:50] 345*2^971937-1 is not prime. Res64: 300195E9F52B74B4 Time : 1829.116 sec.
[2009-08-20 06:53:15] 321*2^971941-1 is not prime. Res64: 056DFC3984788400 Time : 1825.302 sec.
[2009-08-20 07:23:43] 339*2^972019-1 is not prime. Res64: 6293C6AA532B82F2 Time : 1827.491 sec.
[2009-08-20 07:54:07] 345*2^972019-1 is not prime. Res64: 7FD353B995437DEF Time : 1824.657 sec.
[2009-08-20 08:24:34] 327*2^972024-1 is not prime. Res64: 99B92A53CA02A8E7 Time : 1826.415 sec.
[/code]
the last pair was done 08:24 GMT, then i submitted all 6(cores)*30(pairs/core) to the server at about 08:40. 
these 19 pairs were all done previously by the other 5 cores with following dates:
[code]
core #2:
[2009-08-19 07:57:39] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1220.857 sec.
[2009-08-19 08:28:05] 339*2^971681-1 is not prime. Res64: E9B22ACD57CAD289 Time : 1825.646 sec.
[2009-08-19 09:58:23] 321*2^971682-1 is not prime. Res64: BA5C48FA90A17D3D Time : 1824.229 sec.
[2009-08-19 10:28:48] 345*2^971686-1 is not prime. Res64: F229FEFE53441AE2 Time : 1825.130 sec.
core #3:
[2009-08-19 07:39:54] 339*2^971769-1 is not prime. Res64: 5A02350FBA24EFB3 Time : 152.072 sec.
[2009-08-19 08:10:20] 321*2^971773-1 is not prime. Res64: B50AAC2F1E3605A3 Time : 1825.383 sec.
[2009-08-19 08:40:45] 315*2^971774-1 is not prime. Res64: D68B6BBA049EDC8C Time : 1825.262 sec.
[2009-08-19 09:11:13] 339*2^971777-1 is not prime. Res64: CF808406635150CA Time : 1828.084 sec.
core #4:
[2009-08-19 07:53:26] 321*2^971858-1 is not prime. Res64: 7B148E3A3E36AC2C Time : 960.326 sec.
[2009-08-19 08:23:53] 339*2^971859-1 is not prime. Res64: 8581276FB298D4CF Time : 1826.467 sec.
[2009-08-19 08:54:21] 327*2^971861-1 is not prime. Res64: 012FA1B9DB743746 Time : 1827.778 sec.
[2009-08-19 09:24:46] 339*2^971861-1 is not prime. Res64: C33A7F731DB30AF3 Time : 1825.092 sec.
core #5:
[2009-08-19 07:56:08] 345*2^971931-1 is not prime. Res64: 5FDA988415CA2DAF Time : 1118.067 sec.
[2009-08-19 08:26:33] 327*2^971936-1 is not prime. Res64: 41A291142D1304CB Time : 1825.077 sec.
[2009-08-19 08:57:00] 345*2^971937-1 is not prime. Res64: 300195E9F52B74B4 Time : 1826.751 sec.
[2009-08-19 09:27:27] 321*2^971941-1 is not prime. Res64: 056DFC3984788400 Time : 1826.395 sec.
core #6:
[2009-08-19 08:01:49] 339*2^972019-1 is not prime. Res64: 6293C6AA532B82F2 Time : 1455.675 sec.
[2009-08-19 08:32:17] 345*2^972019-1 is not prime. Res64: 7FD353B995437DEF Time : 1827.579 sec.
[2009-08-19 09:02:43] 327*2^972024-1 is not prime. Res64: 99B92A53CA02A8E7 Time : 1825.951 sec.
[2009-08-19 09:33:08] 333*2^972027-1 is not prime. Res64: D27DC9BDAB3B2AAE Time : 1824.835 sec.
[/code]
all these i've submitted on 2009-08-19 at about 18:40 GMT. the issue with the completion timings of about 36 seconds i don't know. as you can see each core needs about half an hour for one pair. so my working for this drive is:
- testing with 6 cores, each with a WUCacheSize = 30
- after completion all 30 pairs are submitted: one LLRnet-dir after the other, so there's not much traffic for the server.
so normally there should be 180 pairs in one hour listed on the stats-pages. but: if i completed not all 30 pairs for one core, say about 20 pairs and submit them, i got 10 'old' pairs left plus 20 new ones. i submit my results 2 times a day. but this means the jobMaxTime can not be at 1 day at all. if this is set now to 2 days, this should not happen again. i think i should do this drive (about 9900 pairs left) manually if such issues still occur! PS: i will just submit again my pairs and 333 972027 seems rejected then! |
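A quick way to confirm from the client result files that the same pairs really were tested twice, and that both runs agree, is to group result lines by k/n and compare the Res64 values. This is only a sketch: the function name `retested` is made up, and the pattern only covers the "is not prime" lines shown above (a prime result has no Res64 and would be skipped).

```python
import re
from collections import defaultdict

# Matches e.g.: 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : ...
RESULT = re.compile(r"(\d+)\*2\^(\d+)-1 is not prime\.\s+Res64: ([0-9A-F]{16})")


def retested(lines):
    """Map (k, n) -> (times tested, residues all equal?) for pairs seen 2+ times."""
    residues = defaultdict(list)
    for line in lines:
        match = RESULT.search(line)
        if match:
            residues[(int(match.group(1)), int(match.group(2)))].append(match.group(3))
    return {pair: (len(res), len(set(res)) == 1)
            for pair, res in residues.items() if len(res) > 1}


# Three lines copied from the two listings in the post above.
lines = [
    "[2009-08-19 07:57:39] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1220.857 sec.",
    "[2009-08-19 09:33:08] 333*2^972027-1 is not prime. Res64: D27DC9BDAB3B2AAE Time : 1824.835 sec.",
    "[2009-08-19 23:16:33] 345*2^971680-1 is not prime. Res64: A3058E1F1BB01541 Time : 1822.835 sec.",
]
for pair, (count, consistent) in retested(lines).items():
    print(pair, "tested", count, "times, residues match:", consistent)
```

Matching residues mean the second run was wasted work but not a wrong result; a mismatch would point at a hardware or client problem instead.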