mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Conjectures 'R Us (https://www.mersenneforum.org/forumdisplay.php?f=81)
-   -   PRPnet server bugs and barfing problem (https://www.mersenneforum.org/showthread.php?t=12256)

mdettweiler 2009-08-03 18:18

Server barfing again!
 
I just noticed that the G3000 server started "barfing" again about an hour ago. Like the last few times, it seems to have started doing so in response to what appears to be a communications error between the server and one of Lennart's boxes. Lennart, can you please check over your various boxes that you have on G3000 and see if any of them contain any clues to what might be happening? I'm afraid I can't narrow it down to a specific box from the logs.

I'm going to turn on "debug mode" on the server, which will have it log full socket communication data to a file. I don't usually like to use this for production servers since it produces extremely large logfiles, but I'll use it for now since it should allow us to pinpoint exactly which machine is causing this problem. It's possible that it's the same one every time, which would indicate a problem on Lennart's end rather than the server's.

Interestingly enough, though, I have never seen this barfing problem happen on any other PRPnet server. Possibly there is something specific to G3000 that's causing the problem.

Lennart 2009-08-03 18:31

[quote=mdettweiler;183872]I just noticed that the G3000 server started "barfing" again about an hour ago. Like the last few times, it seems to have started doing so in response to what appears to be a communications error between the server and one of Lennart's boxes. Lennart, can you please check over your various boxes that you have on G3000 and see if any of them contain any clues to what might be happening? I'm afraid I can't narrow it down to a specific box from the logs.

I'm going to turn on "debug mode" on the server, which will have it log full socket communication data to a file. I don't usually like to use this for production servers since it produces extremely large logfiles, but I'll use it for now since it should allow us to pinpoint exactly which machine is causing this problem. It's possible that it's the same one every time, which would indicate a problem on Lennart's end rather than the server's.

Interestingly enough, though, I have never seen this barfing problem happen on any other PRPnet server. Possibly there is something specific to G3000 that's causing the problem.[/quote]

I have checked all and i cant find the one making this.

I keep looking.

Lennart

mdettweiler 2009-08-03 19:34

[quote=Lennart;183876]I have checked all and i cant find the one making this.

I keep looking.

Lennart[/quote]
Okay, thanks. The server seems to be holding up OK since I last restarted it (right before I posted my last message), but if it happens again (which, given time, it probably will, given the track record we've been seeing), I should be able to catch exactly what machine is talking to the server at the time from the debug logs.

mdettweiler 2009-08-03 20:35

Looks like the server barfed again. I've restarted it to fix it back up again, and I'll start looking through the logs for clues to what caused it. :smile:

mdettweiler 2009-08-03 20:37

[quote=mdettweiler;183891]Looks like the server barfed again. I've restarted it to fix it back up again, and I'll start looking through the logs for clues to what caused it. :smile:[/quote]
I've looked through the debug log a bit, and while I can't find anything conclusively linking the problem to a particular machine, it does look like the problem may have started occurring when the server communicated with Lennart's machine "_207". Lennart, you may want to check the box with that ID and see if there's anything strange going on with it.

Lennart 2009-08-03 21:00

[quote=mdettweiler;183892]I've looked through the debug log a bit, and while I can't find anything conclusively linking the problem to a particular machine, it does look like the problem may have started occurring when the server communicated with Lennart's machine "_207". Lennart, you may want to check the box with that ID and see if there's anything strange going on with it.[/quote]

I have increased numbers to cache and set error time out to 1min.

Lennart

mdettweiler 2009-08-03 21:05

[quote=Lennart;183895]I have increased numbers to cache and set error time out to 1min.

Lennart[/quote]
Just curious, have you by chance been having any problems with your internet connection lately--sudden dropoffs, etc? Because if so, that might possibly explain why these problems are happening (say, if the connection gets cut during a communication with the server).

Lennart 2009-08-03 21:21

[quote=mdettweiler;183896]Just curious, have you by chance been having any problems with your internet connection lately--sudden dropoffs, etc? Because if so, that might possibly explain why these problems are happening (say, if the connection gets cut during a communication with the server).[/quote]

No and i run on prpnet.primegrid at the same time and have no problem there.

Lennart

mdettweiler 2009-08-03 21:42

[quote=Lennart;183898]No and i run on prpnet.primegrid at the same time and have no problem there.

Lennart[/quote]
Hmm, I guess that's ruled out then. I did, however, find something interesting upon further examination of the debug log. For reference, a normal result-reporting communication between client and server looks something like this. (<<< represents data coming in from the client, and >>> represents data going out from the server.)
[code][2009-08-03 18:20:13 GMT] Message coming on socket 5
[2009-08-03 18:20:13 GMT] socket 5 <<<< FROM sm5ymt@pekhult.se _153 sm5ymt
[2009-08-03 18:20:13 GMT] sm5ymt@pekhult.se connecting from *.*.*.*
[2009-08-03 18:20:13 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-03 18:20:13 GMT] socket 5 <<<< WorkUnit: 31340*6^145004+1 1249318058
[2009-08-03 18:20:13 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-03 18:20:13 GMT] socket 5 <<<< Test Result: pfgw BD78034F699566B1
[2009-08-03 18:20:13 GMT] socket 5 <<<< End of WorkUnit
[2009-08-03 18:20:13 GMT] socket 5 >>>> INFO: Test for candidate 31340*6^145004+1 accepted
[2009-08-03 18:20:13 GMT] 31340*6^145004+1: Test received by sm5ymt@pekhult.se at *.*.*.* Residue Residue: BD78034F699566B1
[2009-08-03 18:20:13 GMT] socket 5 >>>> End of Workunit Message
[2009-08-03 18:20:14 GMT] socket 5 <<<< WorkUnit: 124221*6^145005+1 1249318058
[2009-08-03 18:20:14 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-03 18:20:14 GMT] socket 5 <<<< Test Result: pfgw 30DF4273EBA52CE2
[2009-08-03 18:20:14 GMT] socket 5 <<<< End of WorkUnit
[2009-08-03 18:20:14 GMT] socket 5 >>>> INFO: Test for candidate 124221*6^145005+1 accepted
[2009-08-03 18:20:14 GMT] 124221*6^145005+1: Test received by sm5ymt@pekhult.se at *.*.*.* Residue Residue: 30DF4273EBA52CE2
[2009-08-03 18:20:14 GMT] socket 5 >>>> End of Workunit Message
[2009-08-03 18:20:15 GMT] socket 5 <<<< End of Message
[2009-08-03 18:20:15 GMT] socket 5 >>>> INFO: All 2 test results were accepted
[2009-08-03 18:20:15 GMT] socket 5 >>>> End of Message
[2009-08-03 18:20:15 GMT] socket 5 <<<< QUIT
[2009-08-03 18:20:15 GMT] closing socket 5[/code]
But, by contrast, the sessions that seem to be throwing off the server look like this:
[code][2009-08-03 21:19:35 GMT] Message coming on socket 5
[2009-08-03 21:19:35 GMT] socket 5 <<<< FROM sm5ymt@pekhult.se _31 sm5ymt
[2009-08-03 21:19:35 GMT] sm5ymt@pekhult.se connecting from *.*.*.*
[2009-08-03 21:19:35 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-03 21:19:35 GMT] socket 5 <<<< WorkUnit: 124221*6^148285+1 1249333282
[2009-08-03 21:19:35 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-03 21:19:35 GMT] socket 5 <<<< WorkUnit: 74612*6^148287+1 1249333282
[2009-08-03 21:19:35 GMT] socket 5 <<<< WorkUnit: 172257*6^148286+1 1249333282
[2009-08-03 21:19:35 GMT] socket 5 <<<< End of Message
[2009-08-03 21:19:35 GMT] socket 5 <<<< QUIT
[2009-08-03 21:19:46 GMT] socket 5 (nothing received)
[2009-08-03 21:19:46 GMT] socket 5 >>>> INFO: Test for candidate 124221*6^148285+1 accepted
[2009-08-03 21:19:46 GMT] Error sending <<INFO: Test for candidate 124221*6^148285+1 accepted>> to localhost:3000
[2009-08-03 21:19:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:19:46 GMT] 124221*6^148285+1: Test received by sm5ymt@pekhult.se at *.*.*.* Residue Residue:
[2009-08-03 21:19:46 GMT] socket 5 >>>> End of Workunit Message
[2009-08-03 21:19:46 GMT] Error sending <<End of Workunit Message>> to localhost:3000
[2009-08-03 21:19:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:19:57 GMT] socket 5 (nothing received)
[2009-08-03 21:19:57 GMT] socket 5 >>>> INFO: All 1 test results were accepted
[2009-08-03 21:19:57 GMT] Error sending <<INFO: All 1 test results were accepted>> to localhost:3000
[2009-08-03 21:19:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:19:57 GMT] socket 5 >>>> End of Message
[2009-08-03 21:19:57 GMT] Error sending <<End of Message>> to localhost:3000
[2009-08-03 21:19:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:20:08 GMT] socket 5 (nothing received)
[2009-08-03 21:20:08 GMT] closing socket 5[/code]
It seems that the client is sending the "WorkUnit:" message, but then instead of sending the "Test Result:" message, which is supposed to directly follow it, it's sending [i]another[/i] WorkUnit message before sending the first test's result! And then after that, the client just terminates the connection with "End of Message" and "QUIT". Meanwhile the server is still waiting for a result, until it times out and gives up with a "(nothing received)" message. But then, we have a problem: then, the server takes the incomplete result report and registers it as if it was true, but with a blank residue and application name since it was never told them--hence the blank results I was seeing in completed_tests.log!

This all makes perfect sense now. It is most definitely not a connection error like I thought earlier, but I was confused because connection errors also tend to produce those mysterious "Error sending <<x>> to localhost:3000" errors. Instead, it looks more like an odd bug in the client that is confusing the heck out of the server. I'll report it to Mark.

rogue 2009-08-03 22:26

If someone could append a debug log from a client that has experienced this issue, that would be helpful.

I will put a separate fix into the server to address it losing the candidates since there is no valid test result.

mdettweiler 2009-08-03 22:56

[quote=rogue;183902]If someone could append a debug log from a client that has experienced this issue, that would be helpful.

I will put a separate fix into the server to address it losing the candidates since there is no valid test result.[/quote]
I'm afraid I haven't encountered this issue myself on the client end, so I can't help you there; Lennart, could you possibly put all of your G3000 clients on level 1 debug logging, if they aren't already?


All times are UTC. The time now is 09:39.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.