mersenneforum.org  

Old 2009-08-19, 23:21   #1200
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
6241₁₀ Posts

Quote:
Originally Posted by gd_barnes View Post
Here's why: Likely the originals and the duplicates were handed out at about the same time.
Hmm...I wouldn't be so sure about that. What's more likely is a situation like this:

1) server hands out original to client A
2) client A doesn't return in time for the jobMaxTime
3) server hands out duplicate to client B
4) client A returns a little late, but before client B returns (results are credited to client B)
5) client B returns and is rejected since the server no longer has information on the tests

As you can see, 1) and 3) would be quite a ways apart. Of course, this is only one of a number of situations that could lead to rejected results, but it is the most common. Anyway, long story short, it's not a given that the duplicates and originals were handed out at the same time.

Even in the case of the power outage, they would have been handed out two times a ways apart: one right before the power outage, and another soon after the servers came back online.
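
For anyone who wants to poke at the five-step scenario above, here is a minimal Python sketch of it. It is a toy model under stated assumptions, not LLRnet code: the `ToyServer` class, its method names, and the rule that credit goes to whoever currently holds the pair are all hypothetical and only mirror the steps as described in this post.

```python
from dataclasses import dataclass, field

@dataclass
class ToyServer:
    """Toy model of the hand-out/expiry steps listed above. Not LLRnet code."""
    job_max_time: int                              # seconds until a pair may be re-issued
    jobs: dict = field(default_factory=dict)       # pair -> (current holder, hand-out time)

    def hand_out(self, pair, client, now):
        holder = self.jobs.get(pair)
        if holder is not None and now - holder[1] < self.job_max_time:
            return False                           # still reserved, no duplicate yet
        self.jobs[pair] = (client, now)            # first issue, or re-issue after expiry
        return True

    def return_result(self, pair, client):
        holder = self.jobs.pop(pair, None)
        if holder is None:
            return ("rejected", client)            # server no longer knows this pair
        return ("accepted", holder[0])             # credit goes to the current holder

HOUR = 3600
server = ToyServer(job_max_time=24 * HOUR)
pair = (333, 970326)                               # k/n pair, used only as a label

server.hand_out(pair, "A", now=0)                  # 1) original to client A
dup = server.hand_out(pair, "B", now=25 * HOUR)    # 2)+3) A is late, B gets the duplicate
late = server.return_result(pair, "A")             # 4) A returns first, B is credited
last = server.return_result(pair, "B")             # 5) B's result is rejected

print(dup, late, last)
```

In this model the gap between steps 1) and 3) can never be shorter than `job_max_time`, which is the point being argued here.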

Quote:
One more question: Will getting David's code on to my servers mean that we can avoid the "loop thing" code to restart the servers? If so, that will prevent quite a bit of this "after outage" multiple crashes that we keep encountering.
I don't know. David, have you ever had problems with servers crashing, especially after things like power outages? (Not that you usually encounter those since you have a UPS, but...) If so, how do you deal with them?

Max
Old 2009-08-19, 23:23   #1201
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
79² Posts

Quote:
Originally Posted by gd_barnes View Post
Lennart and AMDave, both of these responses are incorrect and both link to the same incorrect page. Ian asked for the "noprimeleftbehind" link name version of http://nplb.ironbits.net/. If everything is going to roll over to the new server, we need a new link name with "noprimeleftbehind" in it that specifically has this web page in it.

I previously inquired to David about this.

David, are you just going to leave this one link on the old "ironbits" link name or can we expect a new link that has "noprimeleftbehind" in it?

This is an important page that we don't want to lose. I'll Email David with a link to this posting.


Thanks,
Gary
As I understand it, the plan was to get this on http://llrnet.noprimeleftbehind.net/ even before we made the plans to move things to your new server. Until the recent IP problems on David's network, that link redirected to http://nplb.ironbits.net/. Now, it just goes to David's personal page. Once we get your static IP set up, we'll have llrnet.noprimeleftbehind.net point to your server and have your server configured to answer to that address with the nplb.ironbits.net status page.
Old 2009-08-19, 23:24   #1202
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
79² Posts

Quote:
Originally Posted by kar_bon View Post
so if you need timestamps for the output, try this: http://www.mersenneforum.org/showthread.php?t=10066

i've given those timestamps for the client-side to write this (still using this on my clients):

Code:
[2009-08-20 00:36:21] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.
[2009-08-20 00:37:23] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.
[2009-08-20 00:38:24] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.
[2009-08-20 00:39:26]
note: a result and the following timestamp are a pair (couldn't handle this in any other order).
so you have to read it as:

Code:
[2009-08-20 00:37:23] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.
[2009-08-20 00:38:24] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.
[2009-08-20 00:39:26] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.
every line has its timestamp and result.

perhaps you can change the "server.lua" the same.
Ah, good idea. I'll see about trying to get that to work as soon as I get the chance.
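
The "read each result with the following timestamp" rule from the quote above can be sketched in a few lines of Python. This is only an illustration that assumes the file looks exactly like the quoted sample (a bracketed timestamp at the start of every line, and a trailing timestamp-only line); the `realign` helper is a hypothetical name, not part of LLRnet or the linked script.

```python
import re

# A line is "[timestamp] optional result text"
LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s*(?P<rest>.*)$")

def realign(lines):
    """Pair each result with the timestamp found on the *following* line."""
    parsed = [m for m in (LINE.match(l.rstrip("\n")) for l in lines) if m]
    fixed = []
    for cur, nxt in zip(parsed, parsed[1:]):
        if cur["rest"]:                      # skip timestamp-only lines
            fixed.append(f"[{nxt['ts']}] {cur['rest']}")
    return fixed

raw = [
    "[2009-08-20 00:36:21] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.",
    "[2009-08-20 00:37:23] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.",
    "[2009-08-20 00:38:24] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.",
    "[2009-08-20 00:39:26]",
]

fixed = realign(raw)
for line in fixed:
    print(line)
```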
Old 2009-08-20, 00:58   #1203
IronBits
I ♥ BOINC!
Oct 2002
Glendale, AZ. (USA)
3×7×53 Posts

I don't have server crashes, due to power outages or any other reasons.
All CPUs are run at stock speeds, no other tasks running other than the llrnet servers, mysql and crontab entries that handle the stats part of things.
I use the utility screen in a shell prompt to run the servers within a single shell; you use some other GUI method, so I can't help you there - Max?

As to the http://nplb.ironbits.net Server status web pages,
that is created by me using vbscript under a windows network and samba shares.

I can send the code to Max, and he (or AMDave, for that matter) can convert it to perl or shell. Then they can substitute the new perl script for the vbscript one and have it point to http://noprimeleftbehind.net as the default page, rather than the current one that AMDave does for you that replaced it. After that, I'll disable my script that runs every hour, and they can crontab it to run their new perl or shell version.

If the Server crashes, reboot the server so it comes up clean, then restart the clients from /etc/init.d/start_llrnetservers start (assuming you have a start_llrnetservers shell script).
Old 2009-08-20, 02:11   #1204
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
79² Posts

Quote:
Originally Posted by IronBits View Post
I don't have server crashes, due to power outages or any other reasons.
All CPUs are run at stock speeds, no other tasks running other than the llrnet servers, mysql and crontab entries that handle the stats part of things.
I use the utility screen in a shell prompt to run the servers within a single shell; you use some other GUI method, so I can't help you there - Max?
Yes, what we have is, essentially, GUI terminal windows for each of the servers, with the server started manually from inside the respective windows.

Quote:
As to the http://nplb.ironbits.net Server status web pages,
that is created by me using vbscript under a windows network and samba shares.

I can send the code to Max, and he (or AMDave, for that matter) can convert it to perl or shell. Then they can substitute the new perl script for the vbscript one and have it point to http://noprimeleftbehind.net as the default page, rather than the current one that AMDave does for you that replaced it. After that, I'll disable my script that runs every hour, and they can crontab it to run their new perl or shell version.
First of all, my programming skills (especially my knowledge of vbscript, which is zero) are not really up to snuff to do the conversion of your scripts to Perl or shell scripts. If you looked at the code from the Perl scripts I use to generate the status pages etc. for the GB servers, you'd see what I mean. The whole thing is a too-tall tower of blocks that could come tumbling down if one more block is added to it; for one, that's why I haven't added the jobMaxTime and prunePeriod values to the displayed info on the status page as Karsten's requested a couple of times. (Karsten, don't worry, I hear you; rest assured, I'll figure out a way eventually.)

Dave, not meaning to impose, but do you possibly have enough knowledge of vbscript, Perl, and/or bash scripting that you could do such a conversion? Or, even better, do you know of a way to run vbscripts natively on Linux so that we don't have to convert anything (besides changing a few pathnames in the scripts)?

Secondly, we're not interested in replacing the database display at http://www.noprimeleftbehind.net/ with the current http://nplb.ironbits.net/ page; rather, we'd like to have the latter still available as a separate page, like it is now. I'm thinking http://llrnet.noprimeleftbehind.net/ would work well for that.

Quote:
If the Server crashes, reboot the server so it comes up clean, then restart the clients from /etc/init.d/start_llrnetservers start (assuming you have a start_llrnetservers shell script).
Okay, I see. I don't have the servers set up in init.d or anything like that (having no appreciable experience with utilizing init.d; on the new server, I'll probably use cron instead since I'm more familiar with it). At any rate, though, for now the GB servers are just started manually as I mentioned above. But, possibly you misunderstood what I meant; I was referring to crashes of the LLRnet server applications themselves, not of the actual server machine itself. Have you ever encountered the former on your servers?

Max
Old 2009-08-20, 04:28   #1205
gd_barnes
May 2007
Kansas; USA
10241₁₀ Posts

Quote:
Originally Posted by mdettweiler View Post
Hmm...I wouldn't be so sure about that. What's more likely is a situation like this:

1) server hands out original to client A
2) client A doesn't return in time for the jobMaxTime
3) server hands out duplicate to client B
4) client A returns a little late, but before client B returns (results are credited to client B)
5) client B returns and is rejected since the server no longer has information on the tests

As you can see, 1) and 3) would be quite a ways apart. Of course, this is only one of a number of situations that could lead to rejected results, but it is the most common. Anyway, long story short, it's not a given that the duplicates and originals were handed out at the same time.

Even in the case of the power outage, they would have been handed out two times a ways apart: one right before the power outage, and another soon after the servers came back online.
Max
I never said they were handed out at the same time. I said they were handed out twice close together. Close together = right before and right after a crash. Close together = a few hours apart. I had the server back up in an hour. One hour certainly does not constitute "a ways apart".

You are not so sure? How could you NOT be 100% sure that they were handed out well < 24 hours apart? You have all of the proof you need in my post #1185. Had you looked closely at that, you would have concluded much differently. Please check that post that shows:

Rejected results from Aug. 18th:
333 970326 10:59:52 marco.bs
315 970327 10:59:52 marco.bs
339 970327 15:38:58 kar_bon
315 970334 15:38:59 kar_bon
333 970334 15:38:59 kar_bon
(etc.)

Original returned results:
333 970326 16:34:25 Aug. 17th gd_barnes 1224 secs = handed out 16:14:01 on 17th
315 970327 16:34:28 Aug. 17th gd_barnes 1215 secs = handed out 16:14:13 on 17th
339 970327 10:59:52 Aug. 18th kar_bon 3855 secs = handed out 09:55:37 on 18th
315 970334 10:59:52 Aug. 18th kar_bon 3854 secs = handed out 09:55:38 on 18th
333 970334 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
(etc.)

Comparisons:
333 970326 reject return: 8/18 10:59:52 original handed out 8/17 16:14:01
315 970327 reject return: 8/18 10:59:52 original handed out 8/17 16:14:13
339 970327 reject return: 8/18 15:38:58 original handed out 8/18 09:55:37
315 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
333 970334 reject return: 8/18 15:38:59 original handed out 8/18 09:55:38
(etc.)

Differences between original handed out and rejected returned:
333 970326 18 hrs, 45 mins, 51 secs
315 970327 18 hrs, 45 mins, 39 secs
339 970327 5 hrs, 43 mins, 21 secs
315 970334 5 hrs, 43 mins, 21 secs
333 970334 5 hrs, 43 mins, 21 secs
(etc.)

Would you care to restate what is the more likely scenario now?

You seem to want to think that there is no problem. I've already proven that there is a problem with the way pairs are handed out after an outage or crash, but you seem to want to disprove me by implying that the server waited its usual 24 hours before doing so. Given the timings above, that's not mathematically possible!

I know you may deem this to be a fruitless exercise but we have gleaned very important information from it. That is:

When the LLRnet servers crash and/or there is a power outage, they frequently hand out the same k/n pair twice in short succession.

This may not help now but it could easily help narrow the problem greatly in the future.

This kind of sleuthing would also come in even handier on the PRPnet server debugging because the logs are time stamped. Using time math to determine a pattern of when problems occurred will very frequently help determine WHY they occurred!
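
The time math used above can be reproduced with a short Python sketch. It takes three of the rows listed in this post (return time of the original result, test duration in seconds, and return time of the rejected duplicate), subtracts the duration to get the hand-out time, and then measures the gap to the rejected return. The year 2009 comes from the thread dates; the tuple layout and the `handed_out` helper are just illustrative choices.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def handed_out(returned: str, secs: int) -> datetime:
    """Hand-out time = time the result came back minus the reported test time."""
    return datetime.strptime(returned, FMT) - timedelta(seconds=secs)

rows = [
    # (k, n, original returned, test seconds, duplicate rejected at)
    (333, 970326, "2009-08-17 16:34:25", 1224, "2009-08-18 10:59:52"),
    (315, 970327, "2009-08-17 16:34:28", 1215, "2009-08-18 10:59:52"),
    (339, 970327, "2009-08-18 10:59:52", 3855, "2009-08-18 15:38:58"),
]

gaps = []
for k, n, returned, secs, rejected in rows:
    start = handed_out(returned, secs)
    gap = datetime.strptime(rejected, FMT) - start
    gaps.append(gap)
    print(k, n, "handed out", start.time(), "gap to rejected return", gap)
```

Every gap comes out below 24 hours, which matches the differences listed above.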


Gary

Last fiddled with by gd_barnes on 2009-08-20 at 08:15
Old 2009-08-20, 05:54   #1206
IronBits
I ♥ BOINC!
Oct 2002
Glendale, AZ. (USA)
3·7·53 Posts

Vertical Server Status Reports can be viewed here:

http://www.noprimeleftbehind.net/iro...tml/index.html
Old 2009-08-20, 06:27   #1207
gd_barnes
May 2007
Kansas; USA
10100000000001₂ Posts

Quote:
Originally Posted by IronBits View Post
Vertical Server Status Reports can be viewed here:

http://www.noprimeleftbehind.net/iro...tml/index.html
Excellent. Thanks. The 1st post in this thread has been changed.

Edited by MyDog: Yup, that's the one, thanks David.

Last fiddled with by MyDogBuster on 2009-08-20 at 06:41
Old 2009-08-20, 06:52   #1208
AMDave
Jan 2006
deep in a while-loop
2·7·47 Posts

Quote:
Originally Posted by mdettweiler View Post
Dave, not meaning to impose, but do you possibly have enough knowledge of vbscript, Perl, and/or bash scripting that you could do such a conversion?
yes
Old 2009-08-20, 07:34   #1209
gd_barnes
May 2007
Kansas; USA
10100000000001₂ Posts

After further investigation, I see that the joblist.txt file is updated continuously. Although it is possible that something could slip through the cracks during a power outage and cause a minimal # of pairs to be assigned twice, the problem should be very isolated. I was able to nail down to the power outage just 10 rejected pairs in port G8000 that had been handed out twice. That is certainly a reasonable explanation for a small # of them. The numerous rejected results since then are another matter and may NOT be a result of the server dropping and then coming back up.

Now, we have another problem today. Karsten, you have already returned 19 rejected results as of 2:00 AM CDT Aug. 20th (7:00 AM GMT), all of which already had returned results by you from about 11:45 AM CDT (4:45 PM GMT) Aug. 19th.

Here is a list of them:
Code:
345 971680
339 971681
321 971682
345 971686
339 971769
321 971773
315 971774
339 971777
321 971858
339 971859
327 971861
339 971861
345 971931
327 971936
345 971937
321 971941
339 972019
345 972019
327 972024
Here is what I found out: In checking stdout.log, it appears that all of the offending pairs had been previously originally assigned to you (or me in 2-3 cases), then they appeared to be cancelled in error, and then they were re-assigned to you.

Karsten, at this point, I have 2 revelations for you:

Revelation #1: All of the pairs were originally returned by you in < 3 mins., some in as few as 36 seconds and further...they were the only pairs on that day with such short timings!! Now, unless you've invented some new software or mathematics that none of the rest of us are aware of, that would not be possible unless you crunched those pairs before you got them from the server. Would you care to enlighten us on how you managed that?

Revelation #2: In every case above, it appears that you reserved them in 30-pair chunks BUT...only the last 4 in EVERY case ended up with a problem.

Conclusion: Something is going on wrong with the last 4 pairs out of the 30 that Karsten caches each time.

Max, here is an example from the stdout.log file:
Code:
connection closed (socket 4)
connection reqeust from 14e48554:1242 (socket 5)
(20 misc. proposed pairs)
Proposing pair 327/971841 to kar_bon
Proposing pair 333/971843 to kar_bon
Proposing pair 345/971844 to kar_bon
Proposing pair 315/971845 to kar_bon
Proposing pair 327/971845 to kar_bon
Proposing pair 345/971847 to kar_bon
Proposing pair 321/971858 to kar_bon }
Proposing pair 339/971859 to kar_bon } all were eventually cancelled, handed out
Proposing pair 327/971861 to kar_bon } a 2nd time, and ultimately rejected
Proposing pair 339/971861 to kar_bon }
connection closed (socket 5)
connection reqeust from 14e48554:1247 (socket 4)
In every case above, it was always the last 4 pairs before the connection was closed that the pairs were cancelled and handed out a 2nd time. This caused the results to be returned twice with the 2nd set of results being rejected.

Revelation #3: All of the offending pairs were cancelled and reassigned in one fell swoop in consecutive fashion! To demonstrate my great predictive capability (lol), I will predict that before today is done, the following pair will also be rejected:
333 972027
Because that is the final pair in the final group of 4 above and is also the final pair that was cancelled in error and reassigned in the group of 20 consecutive cancelled-reassigned pairs.
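
The "last 4 pairs before the connection closed" pattern could be checked across a whole stdout.log with something like the Python sketch below. It is only a sketch: it assumes the log lines look exactly like the excerpt above ("Proposing pair k/n to user" and "connection closed ..."), and the function names `batches` and `tail_pairs` are made up for illustration. The sample is the excerpt from this post with the annotations and the 20 elided lines left out.

```python
import re

PROPOSE = re.compile(r"Proposing pair (\d+)/(\d+) to (\S+)")

def batches(log_lines):
    """Group proposed pairs per connection; a batch ends at 'connection closed'."""
    current, done = [], []
    for line in log_lines:
        m = PROPOSE.search(line)
        if m:
            current.append((int(m[1]), int(m[2])))
        elif line.startswith("connection closed") and current:
            done.append(current)
            current = []
    if current:                      # log ended in the middle of a batch
        done.append(current)
    return done

def tail_pairs(log_lines, count=4):
    """Last `count` pairs proposed in each batch."""
    return [batch[-count:] for batch in batches(log_lines)]

sample = """connection closed (socket 4)
connection reqeust from 14e48554:1242 (socket 5)
Proposing pair 327/971841 to kar_bon
Proposing pair 333/971843 to kar_bon
Proposing pair 345/971844 to kar_bon
Proposing pair 315/971845 to kar_bon
Proposing pair 327/971845 to kar_bon
Proposing pair 345/971847 to kar_bon
Proposing pair 321/971858 to kar_bon
Proposing pair 339/971859 to kar_bon
Proposing pair 327/971861 to kar_bon
Proposing pair 339/971861 to kar_bon
connection closed (socket 5)
connection reqeust from 14e48554:1247 (socket 4)""".splitlines()

# Four of the rejected pairs from the list earlier in this post
rejected = {(321, 971858), (339, 971859), (327, 971861), (339, 971861)}

tails = tail_pairs(sample)
print(tails)
print(all(pair in rejected for pair in tails[0]))
```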

Max, Karsten, or anyone else...any thoughts on this?


Gary

Last fiddled with by gd_barnes on 2009-08-20 at 08:26
Old 2009-08-20, 15:05   #1210
kar_bon
Mar 2006
Germany
2²×23×31 Posts

here is the list of the rejected pairs you found from my original resultfile with local timings (GMT):
Code:
[2009-08-19 23:16:33] 345*2^971680-1 is not prime.  Res64: A3058E1F1BB01541  Time : 1822.835 sec.
[2009-08-19 23:46:58] 339*2^971681-1 is not prime.  Res64: E9B22ACD57CAD289  Time : 1824.900 sec.
[2009-08-20 00:17:23] 321*2^971682-1 is not prime.  Res64: BA5C48FA90A17D3D  Time : 1824.457 sec.
[2009-08-20 00:47:48] 345*2^971686-1 is not prime.  Res64: F229FEFE53441AE2  Time : 1824.755 sec.
[2009-08-20 01:18:14] 339*2^971769-1 is not prime.  Res64: 5A02350FBA24EFB3  Time : 1825.932 sec.
[2009-08-20 01:48:39] 321*2^971773-1 is not prime.  Res64: B50AAC2F1E3605A3  Time : 1825.393 sec.
[2009-08-20 02:19:05] 315*2^971774-1 is not prime.  Res64: D68B6BBA049EDC8C  Time : 1825.324 sec.
[2009-08-20 02:49:30] 339*2^971777-1 is not prime.  Res64: CF808406635150CA  Time : 1825.271 sec.
[2009-08-20 03:19:56] 321*2^971858-1 is not prime.  Res64: 7B148E3A3E36AC2C  Time : 1825.337 sec.
[2009-08-20 03:50:21] 339*2^971859-1 is not prime.  Res64: 8581276FB298D4CF  Time : 1824.778 sec.
[2009-08-20 04:20:50] 327*2^971861-1 is not prime.  Res64: 012FA1B9DB743746  Time : 1828.781 sec.
[2009-08-20 04:51:16] 339*2^971861-1 is not prime.  Res64: C33A7F731DB30AF3  Time : 1825.660 sec.
[2009-08-20 05:21:48] 345*2^971931-1 is not prime.  Res64: 5FDA988415CA2DAF  Time : 1832.460 sec.
[2009-08-20 05:52:20] 327*2^971936-1 is not prime.  Res64: 41A291142D1304CB  Time : 1830.618 sec.
[2009-08-20 06:22:50] 345*2^971937-1 is not prime.  Res64: 300195E9F52B74B4  Time : 1829.116 sec.
[2009-08-20 06:53:15] 321*2^971941-1 is not prime.  Res64: 056DFC3984788400  Time : 1825.302 sec.
[2009-08-20 07:23:43] 339*2^972019-1 is not prime.  Res64: 6293C6AA532B82F2  Time : 1827.491 sec.
[2009-08-20 07:54:07] 345*2^972019-1 is not prime.  Res64: 7FD353B995437DEF  Time : 1824.657 sec.
[2009-08-20 08:24:34] 327*2^972024-1 is not prime.  Res64: 99B92A53CA02A8E7  Time : 1826.415 sec.
the last pair was done 08:24 GMT, then i submitted all 6(cores)*30(pairs/core) to the server at about 08:40.

these 19 pairs were all done previously by the other 5 cores with the following dates:
Code:
core #2:
[2009-08-19 07:57:39] 345*2^971680-1 is not prime.  Res64: A3058E1F1BB01541  Time : 1220.857 sec.
[2009-08-19 08:28:05] 339*2^971681-1 is not prime.  Res64: E9B22ACD57CAD289  Time : 1825.646 sec.
[2009-08-19 09:58:23] 321*2^971682-1 is not prime.  Res64: BA5C48FA90A17D3D  Time : 1824.229 sec.
[2009-08-19 10:28:48] 345*2^971686-1 is not prime.  Res64: F229FEFE53441AE2  Time : 1825.130 sec.

core #3:
[2009-08-19 07:39:54] 339*2^971769-1 is not prime.  Res64: 5A02350FBA24EFB3  Time : 152.072 sec.
[2009-08-19 08:10:20] 321*2^971773-1 is not prime.  Res64: B50AAC2F1E3605A3  Time : 1825.383 sec.
[2009-08-19 08:40:45] 315*2^971774-1 is not prime.  Res64: D68B6BBA049EDC8C  Time : 1825.262 sec.
[2009-08-19 09:11:13] 339*2^971777-1 is not prime.  Res64: CF808406635150CA  Time : 1828.084 sec.

core #4:
[2009-08-19 07:53:26] 321*2^971858-1 is not prime.  Res64: 7B148E3A3E36AC2C  Time : 960.326 sec.
[2009-08-19 08:23:53] 339*2^971859-1 is not prime.  Res64: 8581276FB298D4CF  Time : 1826.467 sec.
[2009-08-19 08:54:21] 327*2^971861-1 is not prime.  Res64: 012FA1B9DB743746  Time : 1827.778 sec.
[2009-08-19 09:24:46] 339*2^971861-1 is not prime.  Res64: C33A7F731DB30AF3  Time : 1825.092 sec.

core #5:
[2009-08-19 07:56:08] 345*2^971931-1 is not prime.  Res64: 5FDA988415CA2DAF  Time : 1118.067 sec.
[2009-08-19 08:26:33] 327*2^971936-1 is not prime.  Res64: 41A291142D1304CB  Time : 1825.077 sec.
[2009-08-19 08:57:00] 345*2^971937-1 is not prime.  Res64: 300195E9F52B74B4  Time : 1826.751 sec.
[2009-08-19 09:27:27] 321*2^971941-1 is not prime.  Res64: 056DFC3984788400  Time : 1826.395 sec.

core #6:
[2009-08-19 08:01:49] 339*2^972019-1 is not prime.  Res64: 6293C6AA532B82F2  Time : 1455.675 sec.
[2009-08-19 08:32:17] 345*2^972019-1 is not prime.  Res64: 7FD353B995437DEF  Time : 1827.579 sec.
[2009-08-19 09:02:43] 327*2^972024-1 is not prime.  Res64: 99B92A53CA02A8E7  Time : 1825.951 sec.
[2009-08-19 09:33:08] 333*2^972027-1 is not prime.  Res64: D27DC9BDAB3B2AAE  Time : 1824.835 sec.
all these i've submitted on 2009-08-19 at about 18:40 GMT.
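
a cross-check like the one above (which pairs show up twice, and do the residues agree?) can be sketched in Python. this is only an illustration: it assumes result lines in exactly the timestamped format shown in this post, and `index_results` is a hypothetical helper name. the three sample lines are copied from the listings above.

```python
import re
from collections import defaultdict

RESULT = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+(?P<k>\d+)\*2\^(?P<n>\d+)-1 is not prime\.\s+"
    r"Res64: (?P<res>[0-9A-F]{16})\s+Time : (?P<secs>[\d.]+) sec\."
)

def index_results(lines):
    """Map (k, n) -> list of (timestamp, residue, seconds) for every run found."""
    seen = defaultdict(list)
    for line in lines:
        m = RESULT.search(line)
        if m:
            key = (int(m["k"]), int(m["n"]))
            seen[key].append((m["ts"], m["res"], float(m["secs"])))
    return seen

sample = [
    "[2009-08-19 07:39:54] 339*2^971769-1 is not prime.  Res64: 5A02350FBA24EFB3  Time : 152.072 sec.",
    "[2009-08-20 01:18:14] 339*2^971769-1 is not prime.  Res64: 5A02350FBA24EFB3  Time : 1825.932 sec.",
    "[2009-08-19 09:33:08] 333*2^972027-1 is not prime.  Res64: D27DC9BDAB3B2AAE  Time : 1824.835 sec.",
]

seen = index_results(sample)
twice = {pair: runs for pair, runs in seen.items() if len(runs) > 1}
for (k, n), runs in twice.items():
    same_residue = len({res for _, res, _ in runs}) == 1
    timings = [secs for _, _, secs in runs]
    print(k, n, "residues match:", same_residue, "timings:", timings)
```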

i don't know what causes the completion timings of about 36 seconds. as you can see, each core needs about half an hour for one pair.

so my working for this drive is:

- testing with 6 cores, each with WUCacheSize = 30
- after completion all 30 pairs are submitted:
one LLRnet-dir after the other, so there's not much traffic for the server.

so normally there should be 180 pairs in one hour listed on the stats-pages.
but: if i have not completed all 30 pairs for one core, say only about 20, and submit them, i have 10 'old' pairs left plus 20 new ones. i submit my results 2 times a day.

but this means the jobMaxTime cannot stay at 1 day.
if this is set now to 2 days, this should not happen again.
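
a rough back-of-the-envelope check of that claim, as a Python sketch. the assumptions are simplifications and not measured behaviour: all 30 cached pairs are handed out at t=0, each pair takes about 1825 seconds (taken from the result lines above), and finished results are only sent at fixed 12-hour intervals ("2 times a day"). the function name `max_hold_time` is made up for this example.

```python
import math

def max_hold_time(cache_size, secs_per_pair, submit_every):
    """Longest time between hand-out and return when the whole cache is handed
    out at t=0 and finished results are only sent at fixed intervals."""
    worst = 0
    for i in range(1, cache_size + 1):
        finished = i * secs_per_pair                          # pair i done at this time
        submitted = math.ceil(finished / submit_every) * submit_every
        worst = max(worst, submitted)                         # returned at next submit slot
    return worst

HOUR = 3600
worst = max_hold_time(cache_size=30, secs_per_pair=1825, submit_every=12 * HOUR)
print(worst / HOUR, "hours")
```

under these assumptions the slowest pairs come back right at the 24-hour mark, so any small delay would push them past a 1-day limit, while a 2-day limit leaves plenty of room.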

i think i should do this drive (about 9900 pairs left) manually if such issues still occur!

PS: i will just submit again my pairs and 333 972027 seems rejected then!

Last fiddled with by kar_bon on 2009-08-20 at 15:06
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.