mersenneforum.org  

2009-08-18, 17:31   #1178
gd_barnes
May 2007
Kansas; USA
2×5×1,013 Posts

Karsten,

As you probably noticed, I haven't had time to look into the problems with the missing pairs on port G8000. I looked at it briefly last night but found it was going to take quite a while to determine exactly what happened. It is on my "short term" to-do list now.

Once I verify exactly what is missing and see if I can figure out what happened, I'll stop the server, add the pairs in the correct n-value order to knpairs.txt, and restart it. They should then be immediately handed out.

I also want to see if I can figure out why all of those k/n pairs got rejected yesterday. I didn't immediately see them in the results.txt file. That is a different situation from the 2 rejected results for port G7000. For that server, I was quickly able to find that the pairs were already in results.txt, so the server must have handed the pairs out twice, with the short power blip the likely culprit.

Here is what I speculate may have happened on them. Please note that this is very much a guess. Something similar to, but not the same as, what occurred on G7000 occurred on port G8000. In the port G8000 case, I think the server may have already removed the pairs from both knpairs.txt and joblist.txt because it was "in the process" of receiving the results right as the power blip occurred. Therefore the results will never get into results.txt...not a good scenario. If that is the case, what we'll need to do is the following so that Karsten and whoever else receive credit for the already processed pairs but do not have to crunch them again:
1. Add the pairs back to knpairs.txt and to joblist.txt. In joblist.txt, make sure that the person's ID who originally worked on the pairs is correct.
2. Remove the pairs from the rejected results file.
3. Convert the rejected results to the one line "client format", that is the format that the client sends to the server when it is done with a pair.
4. Send the client formatted results from #3 back to the server. I could do this myself. If the joblist.txt file is properly updated in #1, even if I send them to the server, the correct people will get the credit that they should without crunching the pairs again.
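
The ordering part of step 1 (adding pairs back to knpairs.txt in the correct n-value order) can be sketched as below. This is only an illustration: it assumes one "k n" pair per line, as in the lists posted in this thread, and the function names are hypothetical. It does not touch joblist.txt, whose layout is not described here.

```python
def parse_pairs(lines):
    """Parse 'k n' lines into (k, n) tuples, skipping anything else."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def merge_in_n_order(existing, missing):
    """Combine both lists, drop duplicates, and sort by n, then by k."""
    return sorted(set(existing) | set(missing), key=lambda kn: (kn[1], kn[0]))


# Pairs taken from this thread; which ones are "still queued" is illustrative.
still_queued = parse_pairs(["315 971061", "339 971061"])
to_restore = parse_pairs(["315 957595", "333 957578"])
merged = merge_in_n_order(still_queued, to_restore)
```

Sorting on (n, k) puts the restored n=957K pairs ahead of the n=971K ones, which is why they would be handed out first once the server restarts.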

Max,

Can you do me a favor? Can you post the file here that you loaded into port G8000 originally up to n=970K? (I think you loaded in specifically the n=900K-970K range.) I want to check for oddball carriage control characters and other such nonsense that seem to occasionally cause the servers to simply skip over k/n pairs. I also want to make sure the missing pairs were in the file to begin with.
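
A check for stray control characters could look roughly like the following. It is a sketch under the assumption that every data line should be exactly "k n" (digits, one space, digits); any header line in the real sieve file would be flagged too and would need to be skipped by hand. The function name is hypothetical.

```python
import re

# A clean line is digits, one space, digits, and nothing else.
CLEAN_LINE = re.compile(rb"\d+ \d+")


def find_suspect_lines(data: bytes):
    """Return (line_number, raw_bytes) for every non-empty line that is not plain 'k n'."""
    suspects = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if raw and not CLEAN_LINE.fullmatch(raw):
            suspects.append((number, raw))
    return suspects


# A carriage return on line 2 and a tab on line 3 are both reported.
sample = b"345 957466\n333 957578\r\n315\t957595\n"
suspects = find_suspect_lines(sample)
```

Reading the file as bytes (for example with `open(path, "rb").read()`) keeps carriage returns visible; text mode would silently translate them.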

One good thing about having all servers on my machines, which I'm sure Karsten will like: I can make specific tweaks like this to quickly account for, and give credit for, missing or rejected pairs and results.


Gary

Last fiddled with by gd_barnes on 2009-08-18 at 17:47

2009-08-18, 20:03   #1179
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
14151₈ Posts

Okay, first of all, the rejected files are cleaned out every 24 hours, so unless you made a backup beforehand, we don't have a record of that any more. They're cleared out every day because otherwise the files could easily get very big if (say) someone had a runaway misconfigured client.

I don't have the original file that I loaded in any longer, though I do have the master file I made for the 1st 6-k minidrive, all the way for 600K-1M. That's where I pulled the data out of to load into the server, so if you'd like that I can send it to you.

Actually, quite frankly, since there were only a couple of these rejected pairs, I figured it would by far be easiest to simply let it go for now, and then when the entire range is being processed in the end, find out exactly which k/n pairs are missing and re-do them at that time. Otherwise, there's an enormous potential for messing things up much, much more than they are now (which currently is only a minor problem that seems to have only affected two k/n pairs). Believe me, I've learned that lesson over and over again with some of the mess-ups we've had before on PRPnet G3000; even though that was PRPnet rather than LLRnet, the basic idea of trying to "fix" these things manually leading to a big mess still applies.

Max

2009-08-18, 20:49   #1180
kar_bon
Mar 2006
Germany
2⁴×5²×7 Posts

Quote:
Originally Posted by kar_bon View Post
in the n-range 950k-960k there're 3 pairs missing in the results!

345 957466
333 957578
315 957595
the resultfile from 2009-08-18 contains one of those pairs:

Code:
user=gd_barnes
[2009-08-17 16:31:04]
345*2^957466-1 is not prime.  Res64: B3595697CA967AE7  Time : 1202.0 sec.

the second pair from above is still the first unprocessed k/n-pair.

i just sent some results to port GB8000 but i got none of the 2 remaining missing pairs.

perhaps they're reserved by marco.bs because he has done some in the last hour.
could someone check this in the joblist, please?!

2009-08-18, 21:02   #1181
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
6249₁₀ Posts

No, it isn't reserved by marco.bs; I just checked and the second one from your list is reserved by Gary. Gary, perhaps you forgot to cancel your test workunits on the server?

Anyway, this would seem to confirm that there are in fact no pairs missing in the server; as long as they're listed in knpairs.txt (which is how they'd get on the status page), then we're good. All we have to do is let them expire naturally, and they'll be reassigned and dealt with.

2009-08-19, 04:23   #1182
gd_barnes
May 2007
Kansas; USA
23622₈ Posts

Quote:
Originally Posted by mdettweiler View Post
Hmm...I see. First of all, your servers have been at 1 day for a while (we set them to that a while back for reasons I don't remember off the top of my head, and we never bothered to set them back). As for the servers being off for one hour, if Karsten had cached pairs from over 24 hours ago, that 1 hour might have been just enough to throw a wrench into his plans to return them before the deadline.
Quote:
Originally Posted by gd_barnes View Post
You told me they were back at 3 days again quite a while ago after everyone had this big argument over that. The agreement was that David's would be 1 day and mine 3 days except for IB9000 that was put at 2 days. Oh well, never mind. In the future, I'll check the JobMaxTime myself and set it at whatever I deem appropriate and simply let everyone know what that is. The democratic process on that has not worked at all.
Quote:
Originally Posted by mdettweiler View Post
No, it isn't reserved by marco.bs; I just checked and the second one from your list is reserved by Gary. Gary, perhaps you forgot to cancel your test workunits on the server?

Anyway, this would seem to confirm that there are in fact no pairs missing in the server; as long as they're listed in knpairs.txt (which is how they'd get on the status page), then we're good. All we have to do is let them expire naturally, and they'll be reassigned and dealt with.
Max,

I have a question for you and I want you to answer it without looking at my servers: Is the JobMaxTime on my servers 1 day or 3 days?

After answering, then check the servers and see if you are right. If not, please post the correct JobMaxTime.

First, your 2 statements above could not possibly both be true. It's either one or the other but not both. Think about and look closely at the timing of things and you'll figure it out.

Second, there are 30 rejected results today (Tuesday)! The first at 11:00 AM CDT and the last at 15:39 CDT and you responded at 15:03 CDT with your post saying that there are only a "couple of rejected pairs". PLEASE STOP saying that there is nothing wrong and that it will work its way out!! There is definitely something wrong and I am looking into it now.

I've asked this before: Please slow down when responding. If you can't take 15-30 mins. to analyze a technical problem in great detail, then don't respond at all. It only confuses matters more. If you don't have time, ask me to do a detailed analysis on it and I will.

BTW, you are correct on one thing. I failed to return those pairs on port G8000. I'll do that now.


Thank you,
Gary

2009-08-19, 04:40   #1183
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by gd_barnes View Post
Max,

I have a question for you and I want you to answer it without looking at my servers: Is the JobMaxTime on my servers 1 day or 3 days?
1 day--that's easy. I checked it earlier today and know it's set to 1 day.

Quote:
After answering, then check the servers and see if you are right. If not, please post the correct JobMaxTime.
Confirmed, 1 day.

Quote:
First, your 2 statements above could not possibly both be true. It's either one or the other but not both. Think about and look closely at the timing of things and you'll figure it out.

Second, there are 30 rejected results today (Tuesday)! The first at 11:00 AM CDT and the last at 15:39 CDT and you responded at 15:03 CDT with your post saying that there are only a "couple of rejected pairs". PLEASE STOP saying that there is nothing wrong and that it will work its way out!! There is definitely something wrong and I am looking into it now.
Okay, I hadn't seen those. As of my 15:03 post, I hadn't checked the servers yet, but was just responding to your request for the original sieve file, and to head off any attempt to mess with re-adding work to the server and whatnot (which, as I described, would be a much bigger mess than it's worth).

Quote:
I've asked this before: Please slow down when responding. If you can't take 15-30 mins. to analyze a technical problem in great detail, then don't respond at all. It only confuses matters more. If you don't have time, ask me to do a detailed analysis on it and I will.

BTW, you are correct on one thing. I failed to return those pairs on port G8000. I'll do that now


Thank you,
Gary
Okay, sorry if I rushed at all on checking this out. Actually, I must say, I did do about 15-20 minutes of checking around when I posted that last message, and at the time, things really did seem to be quite OK. I don't have time to analyze the latest 30 rejected pairs just now (I'll do it tomorrow) but in the meantime, I'll hazard a guess that they're perfectly "normal" rejected pairs. That is, pairs that were taken by a client, held past the JobMaxTime, sent to another client, and returned by the other client first, so that the first client's late result was rejected. That happens from time to time due to perfectly normal causes.
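
The "normal" rejection path described above can be pictured with a toy model. This is not LLRnet's actual code: the class, its method names, and the rule that the first returned result wins are all assumptions made for illustration, and the client names and times are invented.

```python
from datetime import datetime, timedelta


class PairServer:
    """Toy model of handout, expiry, and rejection; not LLRnet's real logic."""

    def __init__(self, job_max_time):
        self.job_max_time = job_max_time
        self.assigned = {}  # pair -> (client, handout time)
        self.done = {}      # pair -> client whose result was accepted

    def hand_out(self, pair, client, now):
        """Assign the pair unless it is done or another client still holds it."""
        if pair in self.done:
            return False
        holder = self.assigned.get(pair)
        if holder is not None and now - holder[1] < self.job_max_time:
            return False  # still within the current holder's deadline
        self.assigned[pair] = (client, now)
        return True

    def submit(self, pair, client):
        """Accept the first result for a pair; reject any later one."""
        if pair in self.done:
            return "rejected"
        self.done[pair] = client
        self.assigned.pop(pair, None)
        return "accepted"


server = PairServer(job_max_time=timedelta(days=1))
start = datetime(2009, 8, 17, 16, 0)  # illustrative time only
pair = (333, 970326)
outcomes = [
    server.hand_out(pair, "client_a", start),
    server.hand_out(pair, "client_b", start + timedelta(hours=2)),   # not expired yet
    server.hand_out(pair, "client_b", start + timedelta(hours=25)),  # expired, reassigned
    server.submit(pair, "client_b"),
    server.submit(pair, "client_a"),  # the late original result
]
```

In this model a rejection only appears after a pair has been held past JobMaxTime, which is the property the later posts test against the actual handout times.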

As long as there aren't any pairs that both a) aren't in any results files and b) aren't in knpairs.txt, then everything is normal, since any possible "problems" are being handled through the server's normal process of expiry, reassignment, and rejection. Any interference in a situation like that is likely to make things much, much worse and end up dropping a large number of results through the cracks. The best way to proceed is to simply let the server do its thing.
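
The a)/b) condition amounts to a set difference, sketched below. It assumes result lines look like the one quoted earlier in the thread ("345*2^957466-1 is not prime. ...") and that pairs are held as (k, n) tuples; the function names are hypothetical, and which pair sits in knpairs.txt in the example is invented.

```python
import re

# Matches the start of a result line such as "345*2^957466-1 is not prime."
RESULT_LINE = re.compile(r"^(\d+)\*2\^(\d+)-1 is ")


def pairs_in_results(lines):
    """Collect the (k, n) pairs that have a result line."""
    found = set()
    for line in lines:
        match = RESULT_LINE.match(line)
        if match:
            found.add((int(match.group(1)), int(match.group(2))))
    return found


def unaccounted(sieve_pairs, knpairs, result_lines):
    """Pairs from the sieve file that are neither queued nor in the results."""
    lost = set(sieve_pairs) - set(knpairs) - pairs_in_results(result_lines)
    return sorted(lost, key=lambda kn: (kn[1], kn[0]))


sieve = [(345, 957466), (333, 957578), (315, 957595)]
queued = [(333, 957578)]  # illustrative: pretend only this pair is in knpairs.txt
results = [
    "user=gd_barnes",
    "[2009-08-17 16:31:04]",
    "345*2^957466-1 is not prime.  Res64: B3595697CA967AE7  Time : 1202.0 sec.",
]
missing = unaccounted(sieve, queued, results)
```

An empty list from `unaccounted` would mean every sieved pair is either still queued or already has a result.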

Max

2009-08-19, 04:49   #1184
gd_barnes
May 2007
Kansas; USA
2·5·1,013 Posts

Odd. Then why were my pairs still sitting in knpairs.txt from 30 hours ago as of right before I returned them to the server 15 mins. ago? They were all retrieved at 16:11 CDT on 8/17 and never returned.

This seems to imply a JobMaxTime of longer than 1 day.

This is likely a mountain out of a molehill but it's more me trying to understand the quirks of these LLRnet servers.

Last fiddled with by gd_barnes on 2009-08-19 at 04:52

2009-08-19, 05:48   #1185
gd_barnes
May 2007
Kansas; USA
2×5×1,013 Posts

I have matched up all results in port 8000 vs. the original sieve file for n>=900K as of Aug. 19th at 12:01 AM CDT (5 AM GMT).

Conclusion 1: All pairs are accounted for. That is, all pairs are either still in the knpairs.txt file or they have returned results for them.

Conclusion 2: We do not know why the pairs that I reserved > 30 hours ago have not been reassigned yet. We will keep an eye on that.

Conclusion 3: Pairs were handed out twice right after the time of the power outage around 16:00-16:30 CDT on Aug. 17th causing 10 rejected results. This is understandable.

Conclusion 4: Pairs were handed out twice around 09:50-10:00 CDT on Aug. 18th for an unknown reason causing at least 46 rejected results. This is bad and warrants further investigation.

Clarification: The handout times given above are for when the pairs were handed out the 1st time. It is not known when they were handed out the 2nd time, because the rejected.txt file does not show how much time was taken to return the pair.

Explanations:

Pairs reserved by me but not tested. I returned them to the server about 30 hours later, ~1-2 hours ago. These should be handed out at any time now. If not, then there is a problem with the server.

Code:
333 957578
315 957595
315 971061
339 971061
327 971068
333 971070
339 971075
345 971076
333 971079
345 971080
327 971082
339 971083
333 971094
321 971105
327 971106
345 971106
333 971355
345 971566
315 971575
345 971577

30 pairs rejected on Aug. 18th CDT:

Code:
333 970326 10:59:52 marco.bs
315 970327 10:59:52 marco.bs
339 970327 15:38:58 kar_bon
315 970334 15:38:59 kar_bon
333 970334 15:38:59 kar_bon
321 970337 15:39:00 kar_bon
333 970343 10:59:53 marco.bs
315 970347 15:39:00 kar_bon
327 970352 15:39:01 kar_bon
327 970361 15:39:01 kar_bon
315 970366 15:39:02 kar_bon
327 970404 11:00:45 marco.bs
315 970419 11:00:47 marco.bs
339 970419 11:00:47 marco.bs
345 970419 11:00:47 marco.bs
321 970422 11:00:47 marco.bs
315 970426 11:00:47 marco.bs
327 970429 11:00:48 marco.bs
333 970410 15:39:02 kar_bon
333 970415 15:39:03 kar_bon
345 970996 15:39:20 kar_bon
321 970997 15:39:21 kar_bon
339 970999 15:39:21 kar_bon
333 971004 15:39:22 kar_bon
339 971005 15:39:22 kar_bon
315 971014 15:39:23 kar_bon
345 971019 15:39:23 kar_bon
315 971027 15:39:24 kar_bon
339 971029 15:39:24 kar_bon
315 971030 15:39:25 kar_bon

Using time math, I determined that all 30 of the above rejected pairs were handed out the 1st time either between 16:00-16:30 CDT on Aug. 17th or between 09:50-10:00 CDT on Aug. 18th. The former time was right after the power outage. The latter time is a mystery that needs investigating. The following is the return time and total time taken for the original 30 results:
Code:
333 970326 16:34:25 Aug. 17th gd_barnes 1224 secs = handed out 16:14:01 on 17th
315 970327 16:34:28 Aug. 17th gd_barnes 1215 secs = handed out 16:14:13 on 17th
339 970327 10:59:52 Aug. 18th kar_bon 3855 secs = handed out 09:55:37 on 18th
315 970334 10:59:52 Aug. 18th kar_bon 3854 secs = handed out 09:55:38 on 18th
333 970334 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
321 970337 10:59:53 Aug. 18th kar_bon 3855 secs = handed out 09:55:38 on 18th
333 970343 16:34:46 Aug. 17th gd_barnes 1220 secs = handed out 16:14:26 on 17th
315 970347 10:59:53 Aug. 18th kar_bon 3854 secs = handed out 09:55:39 on 18th
327 970352 10:59:54 Aug. 18th kar_bon 3855 secs = handed out 09:55:39 on 18th
327 970361 10:59:54 Aug. 18th kar_bon 3854 secs = handed out 09:55:40 on 18th
315 970366 10:59:54 Aug. 18th kar_bon 3854 secs = handed out 09:55:40 on 18th
327 970404 16:35:09 Aug. 17th gd_barnes 1225 secs = handed out 16:14:44 on 17th
333 970410 11:00:46 Aug. 18th kar_bon 3906 secs = handed out 09:55:40 on 18th
333 970415 11:00:46 Aug. 18th kar_bon 3905 secs = handed out 09:55:41 on 18th
315 970419 09:55:48 Aug. 18th kar_bon 62788 secs = handed out 16:29:20 on 17th
339 970419 09:55:49 Aug. 18th kar_bon 62788 secs = handed out 16:29:21 on 17th
345 970419 09:55:49 Aug. 18th kar_bon 62788 secs = handed out 16:29:21 on 17th
321 970422 09:55:50 Aug. 18th kar_bon 62788 secs = handed out 16:29:22 on 17th
315 970426 09:55:51 Aug. 18th kar_bon 62789 secs = handed out 16:29:22 on 17th
327 970429 09:55:52 Aug. 18th kar_bon 62790 secs = handed out 16:29:22 on 17th
345 970996 10:59:56 Aug. 18th kar_bon 3805 secs = handed out 09:56:31 on 18th
321 970997 10:59:56 Aug. 18th kar_bon 3804 secs = handed out 09:56:32 on 18th
339 970999 10:59:57 Aug. 18th kar_bon 3805 secs = handed out 09:56:32 on 18th
333 971004 10:59:57 Aug. 18th kar_bon 3804 secs = handed out 09:56:33 on 18th
339 971005 10:59:58 Aug. 18th kar_bon 3805 secs = handed out 09:56:33 on 18th
315 971014 10:59:59 Aug. 18th kar_bon 3806 secs = handed out 09:56:33 on 18th
345 971019 10:59:59 Aug. 18th kar_bon 3805 secs = handed out 09:56:34 on 18th
315 971027 13:59:03 Aug. 18th kar_bon 14549 secs = handed out 09:56:34 on 18th
339 971029 13:59:03 Aug. 18th kar_bon 14549 secs = handed out 09:56:34 on 18th
315 971030 13:59:04 Aug. 18th kar_bon 14549 secs = handed out 09:56:35 on 18th
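
The time math behind the table is a single subtraction: return time minus seconds taken. The sketch below redoes it for pair 339 970327 and compares the result with the 1-day JobMaxTime. It follows the same approach as the table, treating the reported test time as the time since handout, so any time a pair spent waiting in a client's cache is ignored. The function name is hypothetical.

```python
from datetime import datetime, timedelta

JOB_MAX_TIME = timedelta(days=1)  # the 1-day setting confirmed earlier in the thread


def handout_time(returned: datetime, elapsed_seconds: int) -> datetime:
    """Estimate when a pair was handed out: return time minus test time."""
    return returned - timedelta(seconds=elapsed_seconds)


# Pair 339 970327, values copied from the two lists above (all times CDT).
first_return = datetime(2009, 8, 18, 10, 59, 52)
first_handout = handout_time(first_return, 3855)  # 2009-08-18 09:55:37
rejected_at = datetime(2009, 8, 18, 15, 38, 58)

# Time between the first handout and the arrival of the rejected duplicate.
gap = rejected_at - first_handout
expired_normally = gap >= JOB_MAX_TIME  # False for this pair
print(first_handout, gap, expired_normally)
```

The second handout must have happened before the duplicate came back, so a gap shorter than JobMaxTime means the pair was reassigned before its deadline could have expired.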

Max, we need to take a serious look at what happened in the 09:55-10:00 timeframe on the 18th. This was < 24 hours ago, so these should not have been handed out a second time.

Is there a server log that shows what might have happened? It appears that something is not being saved off to the joblist.txt file correctly. There was no power outage at that time.


Gary

Last fiddled with by gd_barnes on 2009-08-19 at 07:32 Reason: done editing now

2009-08-19, 06:21   #1186
kar_bon
Mar 2006
Germany
101011110000₂ Posts

just returned some results for GB8000 and was assigned the two outstanding pairs at n=957k!

so i can return those results in about 8 hours (home again).

2009-08-19, 07:20   #1187
gd_barnes
May 2007
Kansas; USA
2·5·1,013 Posts

There are 26 more rejected results from Karsten on Aug. 19th as of 2:30 AM CDT (7:30 AM GMT).

I checked the original results for the first 2-3 pairs and they were all originally assigned the first time, once again, between 09:55 and 10:00 on Aug. 18th.

Karsten, for any pairs that your machines were assigned the FIRST TIME around 14:55-15:00 GMT on Aug. 18th, you'll likely wind up with a second assignment of the same pair that is rejected. I hope that you didn't cache 100 pairs at that time.

At the time of this post, we now have 56 rejected results in the last 24 hours. Only 10 of those can directly be associated with the power outage. The other 46 were originally assigned in the Aug. 18th 09:55-10:00 CDT time frame.

I suspect this is an unfixable LLRnet bug. The server may have simply become a little unstable after the outage. Hopefully it has worked its way through.

My battery power backup should be here by Friday. Hopefully these kinds of problems will be history after that.


Gary

Last fiddled with by gd_barnes on 2009-08-19 at 07:29

2009-08-19, 14:05   #1188
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts

Quote:
Originally Posted by gd_barnes View Post
I suspect this is an unfixable LLRnet bug. The server may have simply become a little unstable after the outage. Hopefully it has worked its way through.
Ah, that would make sense. Both G8000 and G7000 crashed a couple of times after the outage, as always seems to happen after an outage. They seem to have stabilized now, though I've put them in a loop so that they'll restart if they do crash again.
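
A restart loop of the kind mentioned here might look like the sketch below. The thread does not say how the loop was actually written, so this is only one possible shape; the function name, the delay, and the stand-in command are assumptions, and the real server command line would have to be substituted.

```python
import subprocess
import sys
import time


def run_with_restart(command, max_restarts=None, delay_seconds=5.0):
    """Run command, relaunching it each time it exits; return the launch count."""
    launches = 0
    while max_restarts is None or launches <= max_restarts:
        launches += 1
        subprocess.run(command)    # blocks until the process exits or crashes
        time.sleep(delay_seconds)  # brief pause before the next launch
    return launches


# Stand-in for a crashing server: a Python process that exits with an error at once.
DEMO_COMMAND = [sys.executable, "-c", "raise SystemExit(1)"]
```

With `max_restarts=None` the loop never ends, which is the behaviour wanted for a server; a finite value is mainly useful for trying the loop out, for example `run_with_restart(DEMO_COMMAND, max_restarts=2, delay_seconds=0)`.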

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.