mersenneforum.org  

Old 2009-08-19, 16:02   #1189
gd_barnes
 
 
May 2007
Kansas; USA

2×5×1,013 Posts

Quote:
Originally Posted by mdettweiler
Ah, that would make sense. Both G8000 and G7000 crashed a couple of times after the outage, as always seems to happen after an outage. They seem to have stabilized now, though I've put them in a loop so that they'll restart if they do crash again.
They kept crashing, Max. Please don't use the phrase "couple of times" when you don't know how many times. Just look in the restart.txt file. There are multiple crashes. I looked at 3 AM CDT this morning and I saw that both had crashed again within the last few hours and were automatically restarted with the loop thing.

You are bound and determined to gloss over this whole issue without doing a detailed look at the exact times and matching up when the rejected results were originally handed out. I took 2 hours last night to do that for you now. How about looking into it this time please?

Please calculate when the 26 rejected results were originally handed out today. I saved them off under an obvious file name. Like I said, I only had time to look at the first 2-3 and those were handed out at 09:55-10:00 CDT on Aug. 18th. Simply take the time that the original result was returned and subtract the # of seconds that it took to return it.
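
A minimal sketch of that subtraction in Python. The timestamp format and the example values are assumptions for illustration; the real results file may lay these fields out differently.

```python
from datetime import datetime, timedelta


def handed_out_at(returned_at: str, elapsed_seconds: float) -> datetime:
    """Estimate when a pair was handed out: return time minus test duration."""
    # Assumed format "YYYY-MM-DD HH:MM:SS"; adjust to match the actual file.
    returned = datetime.strptime(returned_at, "%Y-%m-%d %H:%M:%S")
    return returned - timedelta(seconds=elapsed_seconds)


# Hypothetical values, not taken from the real results file.
print(handed_out_at("2009-08-18 10:01:02", 62))
```

Running this over every original result for the rejected pairs gives one estimated hand-out time per pair, which can then be compared against known outage or crash times.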

I'm not going to back off on this until we nail it down. I nailed down 10 rejected results to the original power outage. The other 46 still have no explanation. How do we know that they were as a result of yet another crash? We don't. We need to match up exact crash times with times in which the original pairs were handed out.

We seem to have gotten into this habit of glossing over these server problems and that habit needs to end.

I don't know if this will help but it can't hurt:

On port G8000 only, please increase the JobMaxTime to 2 days.

Please tell me how you safely stop the server to do this. If you can let me know how that is done, then I'll do it if it is needed in the future. I know how to change the JobMaxTime and to restart it but don't want to create a problem when I stop it.

Karsten, can we talk you into returning pairs normally instead of ~100 at a time about twice a day? If you need to do so many at a time, how about you write a script to do ~20 each hour for 5 hours or something like that? That may help some.
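
A rough sketch of how such a script could plan the batches, in Python. It only works out which pairs go in which hourly batch; actually returning them depends on the client setup and is left out. The function name `plan_batches` and the pair labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def plan_batches(pairs: Sequence[str], batch_size: int = 20,
                 interval_minutes: int = 60) -> List[Tuple[int, List[str]]]:
    """Split pairs into batches, each with a delay in minutes from now."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan = []
    for start in range(0, len(pairs), batch_size):
        delay = (start // batch_size) * interval_minutes
        plan.append((delay, list(pairs[start:start + batch_size])))
    return plan


# 100 made-up pair labels, purely for illustration.
pairs = [f"pair-{i}" for i in range(100)]
for delay, batch in plan_batches(pairs):
    print(f"+{delay:3d} min: return {len(batch)} pairs")
```

With the defaults, 100 pairs become 5 batches of 20, spaced an hour apart.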


Gary

Last fiddled with by gd_barnes on 2009-08-19 at 16:06
Old 2009-08-19, 19:27   #1190
kar_bon
 
 
Mar 2006
Germany

101011110000₂ Posts

i've sent the 2 outstanding pairs at n=957k for GB8000 some time ago and they are in the "last copy off"-file, but the "First unprocessed k/n-pairs" still show one of them!

why? was the prune-time not 1 hour?

please edit the stats-page for the GB ports to show those settings like the IB ports!

PS: the stats updated 14:45 CDT with n=971k!

Last fiddled with by kar_bon on 2009-08-19 at 20:08
Old 2009-08-19, 20:52   #1191
MyDogBuster
 
 
May 2008
Wilmington, DE

2836₁₀ Posts

What has this http://nplb.ironbits.net/ been replaced with? You know, the one with the IB port, all the primes for the day, the first n to process, links to the rejects, results for the day, etc; in the vertical format.

Last fiddled with by MyDogBuster on 2009-08-19 at 20:56
Old 2009-08-19, 21:10   #1192
Lennart
 
 
"Lennart"
Jun 2007

2⁵·5·7 Posts

http://noprimeleftbehind.net/index.php

Lennart

Quote:
Originally Posted by MyDogBuster
What has this http://nplb.ironbits.net/ been replaced with? You know, the one with the IB port, all the primes for the day, the first n to process, links to the rejects, results for the day, etc; in the vertical format.
Old 2009-08-19, 21:11   #1193
AMDave
 
 
Jan 2006
deep in a while-loop

1010010010₂ Posts

try http://www.noprimeleftbehind.net
Old 2009-08-19, 21:14   #1194
AMDave
 
 
Jan 2006
deep in a while-loop

2×7×47 Posts

SNAP!
Old 2009-08-19, 21:50   #1195
MyDogBuster
 
 
May 2008
Wilmington, DE

2²×709 Posts

try http://www.noprimeleftbehind.net

http://noprimeleftbehind.net/index.php

Not the ones I'm looking for. The one I had in mind did not show the hourly progress.

It did show all the primes found for that day listed by each port.

It is similar to http://nplb-gb1.no-ip.org/llrnet/ but instead for IB. It did have a link to http://www.noprimeleftbehind.net, but also had links to all current results for the day, rejects, etc.

Last fiddled with by MyDogBuster on 2009-08-19 at 21:50
Old 2009-08-19, 22:04   #1196
mdettweiler
A Sunny Moo
 
 
Aug 2007
USA (GMT-5)

3×2,083 Posts

Quote:
Originally Posted by gd_barnes
They kept crashing, Max. Please don't use the phrase "couple of times" when you don't know how many times. Just look in the restart.txt file. There are multiple crashes. I looked at 3 AM CDT this morning and I saw that both had crashed again within the last few hours and were automatically restarted with the loop thing.

You are bound and determined to gloss over this whole issue without doing a detailed look at the exact times and matching up when the rejected results were originally handed out. I took 2 hours last night to do that for you now. How about looking into it this time please?

Please calculate when the 26 rejected results were originally handed out today. I saved them off under an obvious file name. Like I said, I only had time to look at the first 2-3 and those were handed out at 09:55-10:00 CDT on Aug. 18th. Simply take the time that the original result was returned and subtract the # of seconds that it took to return it.

I'm not going to back off on this until we nail it down. I nailed down 10 rejected results to the original power outage. The other 46 still have no explanation. How do we know that they were as a result of yet another crash? We don't. We need to match up exact crash times with times in which the original pairs were handed out.

We seem to have gotten into this habit of glossing over these server problems and that habit needs to end.

I don't know if this will help but it can't hurt:

On port G8000 only, please increase the JobMaxTime to 2 days.

Please tell me how you safely stop the server to do this. If you can let me know how that is done, then I'll do it if it is needed in the future. I know how to change the JobMaxTime and to restart it but don't want to create a problem when I stop it.

Karsten, can we talk you into returning pairs normally instead of ~100 at a time about twice a day? If you need to do so many at a time, how about you write a script to do ~20 each hour for 5 hours or something like that? That may help some.


Gary
Okay. I see that G7000 has restarted a number of times, while G4000 and G8000 have not. Note that there are no timestamps in the restart log file, so there's no way to know exactly when the crashes occurred.

As for the rejected results, here's a tally of how many were handed out when:
- 10 rejected from marco.bs around 11:00 CDT, 8/18
- 20 rejected from kar_bon around 15:39 CDT, 8/18
- 21 rejected from kar_bon around 00:37 CDT, 8/19
- 5 rejected from marco.bs around 02:23 CDT, 8/19

Note that we can't tell exactly when these were handed out because they were (like many rejected results) listed with a time of 0.0 sec. Correlating with times on the same k/n pairs in the main results files is not helpful in this case, since those could very well have been assigned at a different time.

Note that all of these rejected results are from G8000, a server which did not crash at all. Thus, we can't even circumstantially correlate these with any particular crashes. Even if it had been known to crash, then we wouldn't be able to know when the crashes happened; the time and date of restarts aren't logged.

I know it may seem like I'm glossing over this stuff, but quite frankly, LLRnet doesn't let me do much more than that. It just plain doesn't log enough info. Yes, we redirect the screen output to a file, but that's essentially useless since there are no timestamps on it. Because of this, most server glitches simply have to be glossed over, because any further investigation is just going to waste a lot of time on something that there's not enough information to pinpoint. The glitches usually (as is the case now) will be handled by the server through its normal processes of expiry and reassignment; there's just nothing more we can do except let it run its course.
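
For illustration only, here is a sketch of a restart loop that does record when each exit happened. It is written in Python and is not the loop currently running on the servers; the function name, the log line format, and the stand-in command are all assumptions.

```python
import subprocess
import sys
import time
from datetime import datetime


def run_with_restart_log(cmd, log_path, max_restarts=None, pause_seconds=5.0):
    """Run cmd repeatedly, appending one timestamped line to log_path per exit."""
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        started = datetime.now()
        exit_code = subprocess.call(cmd)  # blocks until the process exits
        stopped = datetime.now()
        with open(log_path, "a") as log:
            log.write(f"[{stopped:%Y-%m-%d %H:%M:%S}] exited with code {exit_code} "
                      f"(started {started:%Y-%m-%d %H:%M:%S})\n")
        restarts += 1
        time.sleep(pause_seconds)
    return restarts


if __name__ == "__main__":
    # Stand-in command that exits immediately; swap in the real server command.
    demo_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
    run_with_restart_log(demo_cmd, "restart_demo.txt", max_restarts=2, pause_seconds=0)
```

With a log like this, each crash gets a start and stop time that could be matched against the hand-out times of rejected pairs.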

This is one of the reasons why PRPnet will be very, very nice when it's all ready for production use. It keeps very detailed logs that are of great help when tracing down problems of any sort.

As for the jobMaxTime, unfortunately it's a rather difficult process to stop the server once it's in the loop. I can do it, but in order to verify that the server's actually stopped correctly, I have to do a number of "geek things" that would be really, really hard to explain. Ditto for restarting with the whole loop thing. I've just now changed G8000 to 2 days jobMaxTime; if you need any such changes performed while the servers are in the loop thingy, let me know and I'll do it the absolute soonest that I can. I'd love to tell you how to do it so that it isn't dependent on my availability, but quite frankly, as I said that may be a bit difficult.

Max
Old 2009-08-19, 22:41   #1197
gd_barnes
 
 
May 2007
Kansas; USA

10130₁₀ Posts

I found calculating when the original pairs (that were later handed out a 2nd time) were handed out to be helpful, even though you couldn't determine when the duplicated pairs (that actually DID reject) were handed out. When the rejected results were returned doesn't help us much.

Here's why: Likely the originals and the duplicates were handed out at about the same time. As you could see from the calculated times above, those original pairs were all handed out at 2 distinct times, one of which I was able to correlate almost exactly to the power outage. In other words, this tells me that there was some distinct problem that occurred at those 2 times. Had the original pairs been handed out at more random times, we could not come to such a conclusion. Even if the duplicated pairs had been handed out at distinct times, we couldn't discern such because, as you said, the rejected results don't show how much time was taken.
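
A small Python sketch of that reasoning: group the calculated hand-out times into clusters and check whether each cluster sits close to a known event. All times below, including the outage time, are made up for illustration, and the 15-minute thresholds are arbitrary assumptions.

```python
from datetime import datetime, timedelta


def cluster_times(times, max_gap=timedelta(minutes=15)):
    """Group timestamps into clusters separated by more than max_gap."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


def near_event(cluster, event, window=timedelta(minutes=15)):
    """True if any time in the cluster falls within window of the event."""
    return any(abs(t - event) <= window for t in cluster)


# Made-up hand-out times and a made-up outage time, for illustration only.
fmt = "%Y-%m-%d %H:%M"
handed_out = [datetime.strptime(s, fmt) for s in
              ["2009-08-18 09:55", "2009-08-18 09:58", "2009-08-18 10:00",
               "2009-08-18 15:39", "2009-08-18 15:41"]]
outage = datetime.strptime("2009-08-18 09:50", fmt)

for cluster in cluster_times(handed_out):
    label = "near outage" if near_event(cluster, outage) else "unexplained"
    print(f"{cluster[0]:%H:%M}-{cluster[-1]:%H:%M}: {len(cluster)} pairs, {label}")
```

Tight clusters point to a distinct event at that time; hand-out times spread evenly through the day would not support that conclusion.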

Gleaning as much info as possible through calculations such as this will hopefully allow us to cut down on these problems in the future. Anyway, I agree, it's not easy to glean much info from LLRnet. I guess we'll have to stop now.

One more question: Will getting David's code on to my servers mean that we can avoid the "loop thing" code to restart the servers? If so, that will prevent quite a few of these "after outage" multiple crashes that we keep encountering.


Gary

Last fiddled with by gd_barnes on 2009-08-19 at 23:00
Old 2009-08-19, 22:51   #1198
gd_barnes
 
 
May 2007
Kansas; USA

2×5×1,013 Posts

Quote:
Originally Posted by MyDogBuster
What has this http://nplb.ironbits.net/ been replaced with? You know, the one with the IB port, all the primes for the day, the first n to process, links to the rejects, results for the day, etc; in the vertical format.
Quote:
Originally Posted by Lennart
Quote:
Originally Posted by AMDave
Quote:
Originally Posted by MyDogBuster
try http://www.noprimeleftbehind.net

http://noprimeleftbehind.net/index.php

Not the ones I'm looking for. The one I had in mind did not show the hourly progress.

It did show all the primes found for that day listed by each port.

It is similar to http://nplb-gb1.no-ip.org/llrnet/ but instead for IB. It did have a link to http://www.noprimeleftbehind.net, but also had links to all current results for the day, rejects, etc.

Lennart and AMDave, both of these responses are incorrect and both link to the same incorrect page. Ian asked for the "noprimeleftbehind" link name version of http://nplb.ironbits.net/. If everything is going to roll over to the new server, we need a new link name with "noprimeleftbehind" in it that specifically has this web page in it.

I previously asked David about this.

David, are you just going to leave this one link on the old "ironbits" link name or can we expect a new link that has "noprimeleftbehind" in it?

This is an important page that we don't want to lose. I'll Email David with a link to this posting.


Thanks,
Gary
Old 2009-08-19, 22:52   #1199
kar_bon
 
 
Mar 2006
Germany

2⁴·5²·7 Posts

Quote:
Originally Posted by mdettweiler
I know it may seem like I'm glossing over this stuff, but quite frankly, LLRnet doesn't let me do much more than that. It just plain doesn't log enough info. Yes, we redirect the screen output to a file, but that's essentially useless since there's no timestamps on it.
so if you need timestamps for the output, try this: http://www.mersenneforum.org/showthread.php?t=10066

i've added those timestamps on the client side so that it writes this (still using this on my clients):

Code:
[2009-08-20 00:36:21] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.
[2009-08-20 00:37:23] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.
[2009-08-20 00:38:24] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.
[2009-08-20 00:39:26]
note: a result and the timestamp on the following line are a pair (couldn't handle this in any other order),
so you have to read it as:

Code:
[2009-08-20 00:37:23] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.
[2009-08-20 00:38:24] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.
[2009-08-20 00:39:26] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.
every line has its timestamp and result.
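
The re-pairing described above can also be done after the fact on a saved log. Here is a minimal Python sketch, assuming every line starts with a bracketed timestamp as in the sample; it is a post-processing helper, not the Lua change from the linked thread.

```python
import re

# "[timestamp] rest of line"; the rest is empty on the trailing timestamp line.
LINE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s*(?P<result>.*)$")


def repair_timestamps(lines):
    """Pair each result with the timestamp found on the following line."""
    parsed = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            parsed.append((m.group("stamp"), m.group("result")))
    fixed = []
    for (_, result), (next_stamp, _) in zip(parsed, parsed[1:]):
        if result:
            fixed.append(f"[{next_stamp}] {result}")
    return fixed


raw = [
    "[2009-08-20 00:36:21] 2013*2^235548-1 is not prime.  Res64: 0BAAB87826667E2E  Time : 61.858 sec.",
    "[2009-08-20 00:37:23] 2013*2^235595-1 is not prime.  Res64: A6ED66F8AA9036F5  Time : 61.854 sec.",
    "[2009-08-20 00:38:24] 2013*2^235640-1 is not prime.  Res64: 8BCA8E2B12058E30  Time : 61.950 sec.",
    "[2009-08-20 00:39:26]",
]
for line in repair_timestamps(raw):
    print(line)
```

The trailing timestamp-only line is consumed as the completion time of the last result, so the output has one line fewer than the input.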

perhaps you can change the "server.lua" in the same way.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.