mersenneforum.org  

Old 2010-08-13, 18:49   #12
gd_barnes
 
May 2007
Kansas; USA

3^3·5·7·11 Posts

Just to clarify: It was likely the attempt to get into the machine through VNC that caused the reboot. It apparently was not a random event. Since we know what caused that, that particular issue should hopefully not happen again.
Old 2010-08-13, 20:48   #13
rogue
 
"Mark"
Apr 2003
Between here and the

11×577 Posts

Quote:
Originally Posted by gd_barnes View Post
I just now stopped and restarted the server. It now seems to be handing out and accepting pairs again.

I'll keep an eye on it. Max, if you can help with that as you have time, that would help also. Do you know why there would be a maximum # of connections?

Edit: I just did one more thing to prevent a problem that we had with the last rally: I renamed the prpserver.log file to prpserver-0813.log. It was already at 755 MB. I'll keep doing that at least once/day. Something about its size caused the server to have problems last time.
Is there any explanation as to why prpserver.log is so large? I assume that debuglevel is set to 0 in the prpserver.ini file. Has anyone looked at it to see if there is anything unusual in that file? The file might contain information to reveal why max connections was reached. The last time I had seen this it appeared to be related to a user who had a .sh file (or .bat file) that got stuck in a loop trying to start the client. I never found out which version of the client they were running or what their script looked like to determine if the problem was truly with their .sh/.bat file or with the client.

You can set the max size of the log file in prpserver.ini. When it reaches that size it will rename it to prpserver.log.old (after deleting the previous prpserver.log.old file).
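A minimal Python sketch of the rename-at-max-size behaviour described above, for anyone who wants the same logic in their own tooling. This is not PRPNet's code; the function name and the byte threshold are made up for illustration.

```python
import os

def rotate_if_too_big(log_path, max_bytes):
    """Rename log_path to log_path + '.old' once it reaches max_bytes.

    Returns True if a rotation happened. Hypothetical helper, not part of PRPNet.
    """
    if not os.path.exists(log_path) or os.path.getsize(log_path) < max_bytes:
        return False
    old_path = log_path + ".old"
    # os.replace overwrites an existing .old file, so the previous copy is discarded.
    os.replace(log_path, old_path)
    return True
```

Only one old generation is kept this way, so anything older than the previous rotation is lost unless it is archived separately.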
Old 2010-08-13, 22:07   #14
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
Originally Posted by rogue View Post
Is there any explanation as to why prpserver.log is so large? I assume that debuglevel is set to 0 in the prpserver.ini file. Has anyone looked at it to see if there is anything unusual in that file? The file might contain information to reveal why max connections was reached. The last time I had seen this it appeared to be related to a user who had a .sh file (or .bat file) that got stuck in a loop trying to start the client. I never found out which version of the client they were running or what their script looked like to determine if the problem was truly with their .sh/.bat file or with the client.

You can set the max size of the log file in prpserver.ini. When it reaches that size it will rename it to prpserver.log.old (after deleting the previous prpserver.log.old file).
I have debuglevel set to 1 on all the servers--there have been way too many times that we ran into a one-off server bug that is nearly impossible to reproduce, so I like to have all the debug information available for sure. It produces some very big log files, but by renaming them to prpserver-MMDD.log periodically and compressing them with lzma, I can store them pretty efficiently.

I should implement the max log file size feature as you suggest and set it to 500MB or so. I'll still try to rename them periodically anyway to prevent any content from being deleted, but that should at least provide a failsafe in case I forget. Also, something I've been meaning to do for a while but haven't yet gotten the chance is to write a script that automatically renames the logs and moves them to a log/ folder within the server directory--that would completely obviate the need for manual renaming.
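A rough Python sketch of what such a script could look like: rename the live log to prpserver-MMDD.log, compress it with LZMA, and park it in a log/ folder inside the server directory. The function name and the .xz suffix are assumptions, and it also assumes the server starts a fresh prpserver.log after the old one is renamed, which has not been verified here.

```python
import datetime
import lzma
import os
import shutil

def archive_log(server_dir, log_name="prpserver.log", today=None):
    """Move the live log to log/prpserver-MMDD.log.xz, compressed with LZMA.

    Returns the archive path, or None if there was no log to archive.
    Hypothetical helper, not part of PRPNet.
    """
    today = today or datetime.date.today()
    src = os.path.join(server_dir, log_name)
    if not os.path.exists(src):
        return None
    archive_dir = os.path.join(server_dir, "log")
    os.makedirs(archive_dir, exist_ok=True)
    stem, ext = os.path.splitext(log_name)
    renamed = os.path.join(server_dir, f"{stem}-{today:%m%d}{ext}")
    os.rename(src, renamed)  # take the log out of the server's way first
    dest = os.path.join(archive_dir, os.path.basename(renamed) + ".xz")
    # Stream the renamed log into the compressed archive, then drop the original.
    with open(renamed, "rb") as fin, lzma.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    os.remove(renamed)
    return dest
```

Run from cron once or twice a day, something like this would replace the manual renaming. Note that a second run on the same date would overwrite that day's archive, so a twice-daily schedule would need a time component in the file name.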

As far as this specific instance, I remember last time something like this happened the log file wasn't much help; still, I'll give it a look and see if I can come up with anything useful.
Old 2010-08-14, 03:23   #15
vaughan
 
Jan 2005
Sydney, Australia

5·67 Posts

Where's MyDogBuster? With the rally running now his #2 rank is looking shaky.

Question for the mods: Would it help reduce the load on the server if we cached more than 5 tasks?
Old 2010-08-14, 03:45   #16
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
Originally Posted by vaughan View Post
Question for the mods: Would it help reduce the load on the server if we cached more than 5 tasks?
No, it's not the server load that's the problem, rather that I dropped the ball and forgot to clean out the logfile as often as I should. For the duration of the rally Gary's going to be moving it out twice a day, which should be more than plenty (I'd even be fine with once/day); we'd surely be fine even if everyone ran with a cache size of 1.

Also, note that this only applies to PRPnet port 9000. LLRnet (which is where I see your cores are) generates the same load on the server no matter what your cache size is; it is more accurately described as a queue system operating on a FIFO (first in, first out) model, with each pair being returned as soon as it completes and a new one being added to the end of the queue at that time. So no matter what the queue size, it communicates with the server just as often.
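The FIFO point can be illustrated with a toy Python model. It is only a sketch of the queue behaviour described above, not LLRnet's actual protocol: it assumes the initial fill costs one contact and that returning a result and fetching a replacement happen in a single contact.

```python
from collections import deque

def fifo_contacts(total_pairs, queue_size):
    """Count server contacts for a FIFO client that finishes total_pairs pairs."""
    queue = deque(range(queue_size))   # initial fill, counted as one contact
    next_pair = queue_size
    contacts = 1
    done = 0
    while done < total_pairs:
        queue.popleft()                # oldest pair completes (first in, first out)
        done += 1
        queue.append(next_pair)        # result returned and a new pair fetched...
        next_pair += 1
        contacts += 1                  # ...in a single contact with the server
    return contacts

# Same number of contacts whether the queue holds 1 pair or 5.
print(fifo_contacts(100, 1), fifo_contacts(100, 5))
```

Under those assumptions the count depends only on how many pairs are completed, which is why a bigger queue does not lighten the load.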
Old 2010-08-14, 05:26   #17
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3×2,083 Posts

I think I just figured out why the server got stuck in that "too many connections" issue earlier--and it didn't have anything to do with the size of the log file. I just realized that I had entirely forgotten to specify the maxclients= line in prpserver.ini! The maxclients feature was introduced in v3.3.0, but I didn't make any changes to any of the servers' .ini files when I upgraded them from 3.2.5--hence, maxclients= was unspecified and it used the default value of 10.

10 concurrent connections is far from impossible during a heavy-load period like this--and indeed it looks like it did happen:
Code:
[2010-08-13 12:09:56 CDT] 7: client connecting from 91.149.36.77
[2010-08-13 12:09:57 CDT] 4: client connecting from 91.149.36.77
[2010-08-13 12:10:05 CDT] 6: client connecting from 77.21.236.127
[2010-08-13 12:10:07 CDT] 8: client connecting from 91.149.36.77
[2010-08-13 12:10:15 CDT] 9: client connecting from 77.21.236.127
[2010-08-13 12:10:17 CDT] 10: client connecting from 91.149.36.77
[2010-08-13 12:10:19 CDT] 11: client connecting from 91.149.36.77
[2010-08-13 12:10:27 CDT] 12: client connecting from 91.149.36.77
[2010-08-13 12:10:27 CDT] 13: client connecting from 91.149.36.77
[2010-08-13 12:10:28 CDT] 14: client connecting from 91.149.36.77
[2010-08-13 12:10:28 CDT] 14: sending [ERROR:  Server cannot handle more connections]
[2010-08-13 12:10:28 CDT] Server has reached max connections of 10.  Connection from 91.149.36.77 rejected
[2010-08-13 12:10:28 CDT] 14: closing socket
[2010-08-13 12:10:30 CDT] 14: client connecting from 91.149.36.77
[2010-08-13 12:10:30 CDT] 14: sending [ERROR:  Server cannot handle more connections]
[2010-08-13 12:10:30 CDT] Server has reached max connections of 10.  Connection from 91.149.36.77 rejected
[2010-08-13 12:10:30 CDT] 14: closing socket
[2010-08-13 12:10:30 CDT] 14: client connecting from 91.149.36.77
[2010-08-13 12:10:30 CDT] 14: sending [ERROR:  Server cannot handle more connections]
[2010-08-13 12:10:30 CDT] Server has reached max connections of 10.  Connection from 91.149.36.77 rejected
[2010-08-13 12:10:30 CDT] 14: closing socket
What should have happened was for the server to just reject the 10th client (in this case, socket #14), and continue communicating with the others--and then once those finished and disconnected, sockets would be freed up and it would again accept communication on them. However, what actually happened was that the server got stuck in rejecting communications on socket #14--apparently dropping the others (#7, 4, 6, 8, 9, 10, 11, 12, and 13) and having no further communication with them. From that point they remained perpetually open, and any new clients got the unlucky socket #14 which of course is the 10th socket to be opened, thus they were turned away.
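For digging through a large log for this pattern, a small Python sketch along these lines could tally connection attempts and rejections per IP. The regular expressions are written against the line format in the excerpt above only; other debug lines in a real prpserver.log are simply ignored, and the function name is made up.

```python
import re
from collections import Counter

# Patterns match the two line shapes seen in the excerpt; anything else is skipped.
CONNECT = re.compile(r"^\[.*?\] (\d+): client connecting from (\S+)$")
REJECT = re.compile(r"Connection from (\S+) rejected$")

def summarize(log_lines):
    """Return (connects_per_ip, rejects_per_ip, sockets_seen) for a log excerpt."""
    connects, rejects, sockets = Counter(), Counter(), set()
    for line in log_lines:
        line = line.strip()
        m = CONNECT.match(line)
        if m:
            sockets.add(int(m.group(1)))
            connects[m.group(2)] += 1
            continue
        m = REJECT.search(line)
        if m:
            rejects[m.group(1)] += 1
    return connects, rejects, sockets
```

Fed the excerpt above, it would attribute 10 of the 12 connection attempts and all 3 rejections to 91.149.36.77.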

Mark, this looks like a bug--any idea why it's doing this?

@all: meanwhile, I've set the server's log file limit to 500 MB as previously discussed. It looks like it's not needed after all, but it shouldn't hurt. Also, I've set maxclients=1000, so we shouldn't run into any more of the above problem unless somebody's client really does go bonkers and get in some kind of loop.

Last fiddled with by mdettweiler on 2010-08-14 at 05:27
Old 2010-08-14, 08:15   #18
gd_barnes
 
May 2007
Kansas; USA

3^3·5·7·11 Posts

Good work Max.

On the logfile, I'm about ready to do my semi-daily renaming. One thing about your 500 MB limit: That won't help if that turns out to be an issue. It had reached 778 MB in right around a day after the rally started. [Or perhaps the logfile included a lot of logging from long before the rally started. I hadn't checked that.] Regardless, the twice daily renaming won't hurt.
Old 2010-08-14, 09:08   #19
AMDave
 
Jan 2006
deep in a while-loop

2×7×47 Posts

have a look at:
$ man 8 logrotate

Max should have the server doing that for you in a jiffy
Old 2010-08-14, 14:39   #20
rogue
 
"Mark"
Apr 2003
Between here and the

1100011001011_2 Posts

Quote:
Originally Posted by mdettweiler View Post
Mark, this looks like a bug--any idea why it's doing this?
My best guess is that some sort of deadlock is occurring and the server isn't handling it correctly. Being a multi-threaded application, the server needs to lock certain resources so that only one thread can access them at a time. It is possible that two threads are deadlocking. IIRC, pthreads will tell you when a deadlock has occurred. Since I don't check for return codes from some pthread calls, it is possible that I'm not handling an error that needs to be handled. I need to investigate further.
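The general idea of checking what a lock call returns, instead of assuming it succeeded, can be sketched in Python. This is an analogy only: the server uses pthreads, and the function name, the two-lock layout, and the timeout are all assumptions made for illustration.

```python
import threading

def locked_update(first, second, timeout=1.0):
    """Take two locks in order, reporting failure instead of blocking forever.

    Returns True if both locks were obtained (and released again), else False.
    """
    if not first.acquire(timeout=timeout):
        return False                    # the return value is checked, not ignored
    try:
        if not second.acquire(timeout=timeout):
            return False                # possible deadlock: give up so the caller can log or retry
        try:
            return True                 # the critical section would go here
        finally:
            second.release()
    finally:
        first.release()
```

A caller that sees False can log the failure and free the connection, rather than leaving a thread stuck on a lock it will never get.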
Old 2010-08-14, 16:47   #21
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

14151_8 Posts

Quote:
Originally Posted by gd_barnes View Post
Good work Max.

On the logfile, I'm about ready to do my semi-daily renaming. One thing about your 500 MB limit: That won't help if that turns out to be an issue. It had reached 778 MB in right around a day after the rally started. [Or perhaps the logfile included a lot of logging from long before the rally started. I hadn't checked that.] Regardless, the twice daily renaming won't hurt.
Yes, that did include over a month of earlier logging IIRC.
Quote:
Originally Posted by AMDave View Post
have a look at:
$ man 8 logrotate

Max should have the server doing that for you in a jiffy
Thanks, I'll check it out.
Old 2010-08-14, 17:01   #22
Vato
 
Jan 2009

20_10 Posts

I'm a couple of days late (due to work) but I'm in the rally now!

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.