mersenneforum.org  

Go Back   mersenneforum.org > Prime Search Projects > Conjectures 'R Us

Reply
 
Thread Tools
Old 2009-08-06, 23:09   #67
Lennart
 
Lennart's Avatar
 
"Lennart"
Jun 2007

46016 Posts
Default

Quote:
Originally Posted by rogue View Post
These are large files, can you narrow down the time frame for the issue? Can you verify that the issue is in one of these files?

I think Max can get the time from serverlog.

If i get that time i have many more i can look in.

Lennart
Lennart is offline   Reply With Quote
Old 2009-08-06, 23:14   #68
mdettweiler
A Sunny Moo
 
mdettweiler's Avatar
 
Aug 2007
USA (GMT-5)

11000011010012 Posts
Default

Quote:
Originally Posted by mdettweiler View Post
Based on the server logs, it seems there was a lot of barfing between [2009-08-06 15:47:02 GMT] and [2009-08-06 16:50:34 GMT].
Aha! I found something in debug2.log:
Code:
[2009-08-06 15:54:30 GMT] socket 3 >>>> FROM sm5ymt@pekhult.se _162 sm5ymt
[2009-08-06 15:54:30 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-06 15:54:30 GMT] socket 3 >>>> RETURNWORK 2.2.3
[2009-08-06 15:54:30 GMT] socket 3 >>>> WorkUnit: 121736*6^146599+1 1249572434
[2009-08-06 15:54:41 GMT] socket 3 (nothing received) 
[2009-08-06 15:54:41 GMT] socket 3 >>>> WorkUnit: 168610*6^146600+1 1249572434
[2009-08-06 15:54:52 GMT] socket 3 (nothing received) 
[2009-08-06 15:54:52 GMT] socket 3 >>>> WorkUnit: 36772*6^146604+1 1249572434
[2009-08-06 15:55:03 GMT] socket 3 (nothing received) 
[2009-08-06 15:55:03 GMT] socket 3 >>>> WorkUnit: 118147*6^146604+1 1249572434
[2009-08-06 15:55:14 GMT] socket 3 (nothing received) 
[2009-08-06 15:55:14 GMT] socket 3 >>>> WorkUnit: 124125*6^146606+1 1249572434
[2009-08-06 15:55:25 GMT] socket 3 (nothing received) 
[2009-08-06 15:55:25 GMT] socket 3 >>>> End of Message
[2009-08-06 15:55:36 GMT] socket 3 (nothing received) 
[2009-08-06 15:55:36 GMT] socket 3 >>>> QUIT
[2009-08-06 15:55:36 GMT] closing socket 3
[2009-08-06 15:55:36 GMT] closing socket 3
This seems to be the client end of a barfed batch. The client is sending five WorkUnit: lines in a row but without any actual results in between. It also doesn't seem to try to fetch the greeting when this happens.

Also of note, the client doesn't seem to consider these "accepted" by the server, but rather keeps on trying to return them again and again, with the same result. Eventually the server gets out of its "barfing mood" and lets the client commuicate normally--except that now the server thinks all of those tests are already accounted for (having registered them with blank results):
Code:
[2009-08-06 16:45:30 GMT] socket 3 >>>> FROM sm5ymt@pekhult.se _162 sm5ymt
[2009-08-06 16:45:30 GMT] socket 3 >>>> GETGREETING
[2009-08-06 16:45:35 GMT] socket 3 <<<< ############
[2009-08-06 16:45:35 GMT] socket 3 <<<< Welcome to the CRUS G3000 PRPnet beta test server! :-D
[2009-08-06 16:45:35 GMT] socket 3 <<<< Server is running PRPnet v2.2.3
[2009-08-06 16:45:35 GMT] socket 3 <<<< ############
[2009-08-06 16:45:35 GMT] socket 3 <<<< OK.
[2009-08-06 16:45:35 GMT] socket 3 >>>> QUIT
[2009-08-06 16:45:35 GMT] closing socket 3
[2009-08-06 16:45:35 GMT] Total Time:  9:55:42  Total Tests: 42  Total PRPs Found: 0
[2009-08-06 16:45:36 GMT] socket 3 >>>> FROM sm5ymt@pekhult.se _162 sm5ymt
[2009-08-06 16:45:36 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-06 16:45:36 GMT] socket 3 >>>> RETURNWORK 2.2.3
[2009-08-06 16:45:36 GMT] socket 3 >>>> WorkUnit: 121736*6^146599+1 1249572434
[2009-08-06 16:45:40 GMT] socket 3 <<<< ERROR: Workunit 121736*6^146599+1 not found on server
[2009-08-06 16:45:40 GMT] crus: ERROR: Workunit 121736*6^146599+1 not found on server
[2009-08-06 16:45:40 GMT] crus: The client will delete this workunit
[2009-08-06 16:45:40 GMT] socket 3 >>>> WorkUnit: 168610*6^146600+1 1249572434
[2009-08-06 16:45:41 GMT] socket 3 <<<< ERROR: Workunit 168610*6^146600+1 not found on server
[2009-08-06 16:45:41 GMT] crus: ERROR: Workunit 168610*6^146600+1 not found on server
[2009-08-06 16:45:41 GMT] crus: The client will delete this workunit
[2009-08-06 16:45:41 GMT] socket 3 >>>> WorkUnit: 36772*6^146604+1 1249572434
[2009-08-06 16:45:41 GMT] socket 3 <<<< ERROR: Workunit 36772*6^146604+1 not found on server
[2009-08-06 16:45:41 GMT] crus: ERROR: Workunit 36772*6^146604+1 not found on server
[2009-08-06 16:45:41 GMT] crus: The client will delete this workunit
[2009-08-06 16:45:41 GMT] socket 3 >>>> WorkUnit: 118147*6^146604+1 1249572434
[2009-08-06 16:45:41 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 16:45:41 GMT] socket 3 >>>> Test Result: pfgw CCC87E55A38FA5F5
[2009-08-06 16:45:41 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 16:45:42 GMT] socket 3 <<<< INFO: Test for candidate 118147*6^146604+1 accepted
[2009-08-06 16:45:42 GMT] crus: INFO: Test for candidate 118147*6^146604+1 accepted
[2009-08-06 16:45:42 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 16:45:42 GMT] socket 3 >>>> WorkUnit: 124125*6^146606+1 1249572434
[2009-08-06 16:45:42 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 16:45:42 GMT] socket 3 >>>> Test Result: pfgw B88C22D5DB33C4EA
[2009-08-06 16:45:42 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 16:45:43 GMT] socket 3 <<<< INFO: Test for candidate 124125*6^146606+1 accepted
[2009-08-06 16:45:43 GMT] crus: INFO: Test for candidate 124125*6^146606+1 accepted
[2009-08-06 16:45:43 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 16:45:43 GMT] socket 3 >>>> End of Message
[2009-08-06 16:45:43 GMT] socket 3 <<<< INFO: 2 of 5 test results were accepted
[2009-08-06 16:45:43 GMT] crus: INFO: 2 of 5 test results were accepted
[2009-08-06 16:45:43 GMT] socket 3 >>>> QUIT
[2009-08-06 16:45:43 GMT] closing socket 3
[2009-08-06 16:45:43 GMT] closing socket 3
mdettweiler is offline   Reply With Quote
Old 2009-08-07, 00:01   #69
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

24×397 Posts
Default

That is very useful. Now I would also like to see the server's debug log corresponding to these messages so that I can match them up.

What I am most concerned with, based upon what I see in the log, is that the server is not responding in a timely fashion to the client. The server does respond (sometimes), but not quickly enough for the client. The client expects a response within 10 seconds. Is there something on the server that is running at higher priority, thus taking cycles away from the PRPNet server? I hope to see some things within the server's debug log.

Last fiddled with by rogue on 2009-08-07 at 00:16
rogue is offline   Reply With Quote
Old 2009-08-07, 02:35   #70
mdettweiler
A Sunny Moo
 
mdettweiler's Avatar
 
Aug 2007
USA (GMT-5)

3·2,083 Posts
Default

Quote:
Originally Posted by rogue View Post
That is very useful. Now I would also like to see the server's debug log corresponding to these messages so that I can match them up.

What I am most concerned with, based upon what I see in the log, is that the server is not responding in a timely fashion to the client. The server does respond (sometimes), but not quickly enough for the client. The client expects a response within 10 seconds. Is there something on the server that is running at higher priority, thus taking cycles away from the PRPNet server? I hope to see some things within the server's debug log.
Okay, coming right up. First of all, though, I couldn't find something at 15:54 GMT from Lennart's machine _162 on the server. Based on correlations of the connection data from the server and client logfiles, it seems that box _162 is about 7 minutes slow. (Lennart, can you possibly check to see if _162 is ~7 minutes slow and thus that I did indeed get the right log excerpt? Thanks.) With that in mind, here's what the server saw at the same time as the first (barfed) log entry in my last post:
Code:
[2009-08-06 16:01:35 GMT] Message coming on socket 5
[2009-08-06 16:01:35 GMT] socket 5 <<<< FROM sm5ymt@pekhult.se _162 sm5ymt
[2009-08-06 16:01:35 GMT] sm5ymt@pekhult.se connecting from 91.149.39.143
[2009-08-06 16:01:35 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 121736*6^146599+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 168610*6^146600+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 36772*6^146604+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 118147*6^146604+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 124125*6^146606+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< End of Message
[2009-08-06 16:01:35 GMT] socket 5 <<<< QUIT
[2009-08-06 16:01:46 GMT] socket 5 (nothing received) 
[2009-08-06 16:01:46 GMT] socket 5 >>>> INFO: Test for candidate 121736*6^146599+1 accepted
[2009-08-06 16:01:46 GMT] Error sending <<INFO: Test for candidate 121736*6^146599+1 accepted>> to localhost:3000
[2009-08-06 16:01:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:01:46 GMT] 121736*6^146599+1: Test received by sm5ymt@pekhult.se at 91.149.39.143  Residue Residue: 
[2009-08-06 16:01:46 GMT] socket 5 >>>> End of Workunit Message
[2009-08-06 16:01:46 GMT] Error sending <<End of Workunit Message>> to localhost:3000
[2009-08-06 16:01:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:01:57 GMT] socket 5 (nothing received) 
[2009-08-06 16:01:57 GMT] socket 5 >>>> INFO: All 1 test results were accepted
[2009-08-06 16:01:57 GMT] Error sending <<INFO: All 1 test results were accepted>> to localhost:3000
[2009-08-06 16:01:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:01:57 GMT] socket 5 >>>> End of Message
[2009-08-06 16:01:57 GMT] Error sending <<End of Message>> to localhost:3000
[2009-08-06 16:01:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:02:08 GMT] socket 5 (nothing received) 
[2009-08-06 16:02:08 GMT] closing socket 5
It would seem that the server got the chance to send a "Workunit found" after the first "WorkUnit:" line from the client (so either the client's "nothing recieved" message on that one was in error, or my log file correlation skills aren't all they're cracked up to be and I'm looking at the wrong segment ). At any rate, the remaining four "WorkUnit" lines were apparently sent so fast that either a) the server didn't have the chance to respond with a "Workunit found"; or b) the server didn't know what to make of a client sending another WorkUnit line before it sent the last Test Result line.

The weird thing is, it would seem that the logs on the client and on the server both tell a completely different story; the server log indicates that all the "WorkUnit" messages were received in very short succession, but the client says that it waited 10 seconds for a response between each one. Possibly the client is goofing and thinks it didn't send the WorkUnit messages when it really did?

Then again, maybe I'm looking at the wrong times in the respective logfiles; I'm going to keep digging and see if I can find either a) correlation for the idea that the client logs and server logs tell a completely different story; or b) another instance of this that seems to make sense in the repspective logs.
mdettweiler is offline   Reply With Quote
Old 2009-08-07, 02:59   #71
mdettweiler
A Sunny Moo
 
mdettweiler's Avatar
 
Aug 2007
USA (GMT-5)

3×2,083 Posts
Default

Okay, here's another one. The client had just finished 5 G3000 workunits, and was going to return them:
Code:
[2009-08-06 05:54:51 GMT] socket 3 >>>> FROM sm5ymt@pekhult.se _162 sm5ymt
[2009-08-06 05:54:51 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-06 05:54:51 GMT] socket 3 >>>> RETURNWORK 2.2.3
[2009-08-06 05:54:51 GMT] socket 3 >>>> WorkUnit: 36772*6^141684+1 1249536452
[2009-08-06 05:55:02 GMT] socket 3 (nothing received) 
[2009-08-06 05:55:02 GMT] socket 3 >>>> WorkUnit: 123285*6^141684+1 1249536452
[2009-08-06 05:55:13 GMT] socket 3 (nothing received) 
[2009-08-06 05:55:13 GMT] socket 3 >>>> WorkUnit: 172257*6^141686+1 1249536452
[2009-08-06 05:55:24 GMT] socket 3 (nothing received) 
[2009-08-06 05:55:24 GMT] socket 3 >>>> WorkUnit: 98860*6^141689+1 1249536452
[2009-08-06 05:55:27 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 05:55:27 GMT] socket 3 >>>> Test Result: pfgw 98796D25133B4E62
[2009-08-06 05:55:27 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 05:55:28 GMT] socket 3 <<<< INFO: Test for candidate 36772*6^141684+1 accepted
[2009-08-06 05:55:28 GMT] crus: INFO: Test for candidate 36772*6^141684+1 accepted
[2009-08-06 05:55:28 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 05:55:28 GMT] socket 3 >>>> WorkUnit: 113966*6^141691+1 1249536452
[2009-08-06 05:55:28 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 05:55:28 GMT] socket 3 >>>> Test Result: pfgw C8AB5F044A322BB5
[2009-08-06 05:55:28 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 05:55:29 GMT] socket 3 <<<< INFO: Test for candidate 113966*6^141691+1 accepted
[2009-08-06 05:55:29 GMT] crus: INFO: Test for candidate 113966*6^141691+1 accepted
[2009-08-06 05:55:29 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 05:55:29 GMT] socket 3 >>>> End of Message
[2009-08-06 05:55:29 GMT] socket 3 <<<< INFO: All 2 test results were accepted
[2009-08-06 05:55:29 GMT] crus: INFO: All 2 test results were accepted
[2009-08-06 05:55:29 GMT] socket 3 >>>> QUIT
[2009-08-06 05:55:29 GMT] closing socket 3
[2009-08-06 05:55:29 GMT] closing socket 3
Hmm...interesting. Since this batch contained both results that were accepted and some that weren't, it would seem to point to the idea that there is in reality no bug on the client, but instead maybe just a bug on the server. It looks like the client tried to send the first three, but didn't get a response; within 3 seconds of sending the fourth one's WorkUnit line, it got a "Workunit found" response and sent a "Test Result" message. The server accepted that normally, then proceeded to accept the last result normally (with no abnormally long delays).

Now let's see how this looked on the server side of things. It seems that this time around, the times aren't quite as far off as they were before (now they're about half a minute off, which is within a normal range). I think last time I messed up on correlating the log file. Anyway, here we go:
Code:
[2009-08-06 05:55:30 GMT] Message coming on socket 5
[2009-08-06 05:55:30 GMT] socket 5 <<<< FROM sm5ymt@pekhult.se _162 sm5ymt
[2009-08-06 05:55:30 GMT] sm5ymt@pekhult.se connecting from 91.149.39.143
[2009-08-06 05:55:30 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 36772*6^141684+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 123285*6^141684+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 172257*6^141686+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 98860*6^141689+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 <<<< Test Result: pfgw 98796D25133B4E62
[2009-08-06 05:55:30 GMT] socket 5 <<<< End of WorkUnit
[2009-08-06 05:55:30 GMT] socket 5 >>>> INFO: Test for candidate 36772*6^141684+1 accepted
[2009-08-06 05:55:30 GMT] 36772*6^141684+1: Test received by sm5ymt@pekhult.se at 91.149.39.143  Residue Residue: 98796D25133B4E62
[2009-08-06 05:55:30 GMT] socket 5 >>>> End of Workunit Message
[2009-08-06 05:55:31 GMT] socket 5 <<<< WorkUnit: 113966*6^141691+1 1249536452
[2009-08-06 05:55:31 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-06 05:55:31 GMT] socket 5 <<<< Test Result: pfgw C8AB5F044A322BB5
[2009-08-06 05:55:31 GMT] socket 5 <<<< End of WorkUnit
[2009-08-06 05:55:31 GMT] socket 5 >>>> INFO: Test for candidate 113966*6^141691+1 accepted
[2009-08-06 05:55:31 GMT] 113966*6^141691+1: Test received by sm5ymt@pekhult.se at 91.149.39.143  Residue Residue: C8AB5F044A322BB5
[2009-08-06 05:55:31 GMT] socket 5 >>>> End of Workunit Message
[2009-08-06 05:55:32 GMT] socket 5 <<<< End of Message
[2009-08-06 05:55:32 GMT] socket 5 >>>> INFO: All 2 test results were accepted
[2009-08-06 05:55:32 GMT] socket 5 >>>> End of Message
[2009-08-06 05:55:32 GMT] socket 5 <<<< QUIT
[2009-08-06 05:55:32 GMT] closing socket 5
Ouch! It's almost like the server froze in time for all the while the client was sending the first three "WorkUnit" lines, then suddenly came back to life and gave the "Workunit found" line for the first test the client asked about--right after the client sent the line for the fourth workunit! So, the client thinks "okay, the server found the third workunit" and innocently transmits the data for the fourth WU. The server then goes and gobbles up the third WU's residual and logs it for the first WU!

At that point, the client sent the "WorkUnit" line for the fifth workunit, and the server accepted it normally and registered it for the correct test (apparently ignoring the second, third, and fourth "WorkUnit" lines now, since it had never gotten around to sending the "Workunit found" messages for those).

This would seem to lend credence to your idea that possibly something is stealing CPU cycles from the server. Right now the server has dried (yet again), so what I'm going to do is re-load the 140K-150K range into it, then change the server to a higher priority so that other applications can't steal CPU cycles from it. I'll see if that works.

Edit: Okay, the server is now loaded up again and running at nice value -10.

Last fiddled with by mdettweiler on 2009-08-07 at 03:02
mdettweiler is offline   Reply With Quote
Old 2009-08-07, 04:27   #72
gd_barnes
 
gd_barnes's Avatar
 
May 2007
Kansas; USA

242438 Posts
Default

Strange. I didn't know the server needed anything significant in the way of CPU cycles. Max, keep in mind that I have all 4 cores on that machine running LLRnet clients against David's port IB2000. I don't really want the server to take CPU cycles from my LLRing so maybe it's better that I only run 3 cores on LLRnet clients.

When the next round of testing comes back complete, let me know if you think that cutting back to 3 cores running LLRnet would help. BUT...that said, perhaps I can still get more total throughput running all 4 cores on LLRnet with the server set higher in priority. If that works OK, then we should probably bump all PRPnet servers on that machine up to a nice value of -10.

One more thing to remember: There are something like 6-7 servers on that machine, including both public and personal LLRnet and PRPnet servers. Rather than the issue being that LLRnet is running 4 clients, it could be that we just have too many servers running on one machine.


Gary

Last fiddled with by gd_barnes on 2009-08-07 at 04:28
gd_barnes is online now   Reply With Quote
Old 2009-08-07, 05:17   #73
mdettweiler
A Sunny Moo
 
mdettweiler's Avatar
 
Aug 2007
USA (GMT-5)

3×2,083 Posts
Default

Quote:
Originally Posted by gd_barnes View Post
Strange. I didn't know the server needed anything significant in the way of CPU cycles. Max, keep in mind that I have all 4 cores on that machine running LLRnet clients against David's port IB2000. I don't really want the server to take CPU cycles from my LLRing so maybe it's better that I only run 3 cores on LLRnet clients.

When the next round of testing comes back complete, let me know if you think that cutting back to 3 cores running LLRnet would help. BUT...that said, perhaps I can still get more total throughput running all 4 cores on LLRnet with the server set higher in priority. If that works OK, then we should probably bump all PRPnet servers on that machine up to a nice value of -10.

One more thing to remember: There are something like 6-7 servers on that machine, including both public and personal LLRnet and PRPnet servers. Rather than the issue being that LLRnet is running 4 clients, it could be that we just have too many servers running on one machine.


Gary
Nah, the server doesn't need much. In fact, theoretically it should be getting plenty just with a nice value of 0 (the level for "normal", non-background apps). The only thing is to make sure that what little it needs (since it obviously needs something in order to even move forward) isn't getting crowded out accidentally by the lowest-priority LLRnet clients. As an example of this: one day I was stopping a PRPnet client to switch that core over to something else. To avoid any idle "gaps" in processing, when switching a core over to a new type of work, I usually start the new program before stopping the old one--leaving the two to overlap for a brief time. However, this particular time, I had aliqueit.exe running Aliquot sequence work on the other core, and it was set to a nice value of 5 ("Below Normal" on Windows). PRPnet, meanwhile, was set to 19 ("Low"). I think I was starting Prime95 as the new application; Prime95 is one of those appliacations that tends to "bully" others of equal priority for its CPU time if it has to share a core with them. (Not sure why it does this, though I've noticed that gwnum-based apps such as PFGW and LLR also do the same thing over other apps.) Anyway, this meant that even though PRPnet and Prime95's worker thread were both at nice 19, Prime95 would hog all the CPU resources. Actually, PRPnet was running LLR at the moment, so since both Prime95 and LLR are equally "hoggy" they split the core evenly--until I stopped LLR. That meant that Prime95 was free to take the entire core, *except* PRPnet still needed to do a few last-minute tasks before shutdown, such as reporting results to the server. Of course, PRPnet isn't nearly as "macho" as Prime95 when it comes to battling for CPU time, so PRPnet practically froze in a limbo between "Detected LLR shutdown, exiting" and "reporting completed results to server". It would have crawled along like that for quite a while until it managed to gradually pick up enough spare cycles along the way to finally eke through the wee bit of CPU-work needed to simply return its results to the server, had I not went into Task Manager and manually set PRPnet to "High" priority to let it finish up and exit.

A similar thing could very well be happening with the PRPnet servers. However, the odd thing is, they're at nice 0, which is way ahead of all your LLRnet clients at nice 19. Theoretically, the servers should be getting all the CPU time they need, just like a regular desktop application (which usually runs at nice 0). However, it would look like that's not happening here, for reasons we don't quite understand yet. I'm pretty sure that it's not the other servers that are doing it; there's no way they could hog the CPU for long enough to keep G3000 "frozen" for 20-30 seconds as seems to have occurred here.

Mark, is the PRPnet server multithreaded at all? If so, do any of the threads by chance run at nice 19? If so, then that might be the problem here.

Last fiddled with by mdettweiler on 2009-08-07 at 05:19
mdettweiler is offline   Reply With Quote
Old 2009-08-07, 12:32   #74
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

24×397 Posts
Default

Quote:
Originally Posted by mdettweiler View Post
Mark, is the PRPnet server multithreaded at all? If so, do any of the threads by chance run at nice 19? If so, then that might be the problem here.
No, it is not multi-threaded.

By default the server is to run in normal (default) priority. This can be overridden by modifying the idle= setting in the prpserver.ini. Did you modify that?

I will modify the client to not send any more workunits if it doesn't get a reply from the server within 10 seconds. I will then bump that up to 30 seconds.

I have released 2.2.4. Read here.

Last fiddled with by rogue on 2009-08-07 at 12:48
rogue is offline   Reply With Quote
Old 2009-08-07, 14:50   #75
mdettweiler
A Sunny Moo
 
mdettweiler's Avatar
 
Aug 2007
USA (GMT-5)

3×2,083 Posts
Default

Quote:
Originally Posted by rogue View Post
No, it is not multi-threaded.

By default the server is to run in normal (default) priority. This can be overridden by modifying the idle= setting in the prpserver.ini. Did you modify that?

I will modify the client to not send any more workunits if it doesn't get a reply from the server within 10 seconds. I will then bump that up to 30 seconds.

I have released 2.2.4. Read here.
No, I didn't change the idle= option; it's at 0 for all servers.

BTW, I see in your 2.2.4 release notes that you made some tweaks to the web page code so that the connection is closed at the end of such a session; possibly that could have casued the server to have problems before? (Or is it capable of dealing with multiple connections simultaneously?)

If you've got all the data you need on this, I'll go ahead and upgrade all the servers to 2.2.4. I'll also post new client packages at NPLB.

Edit: Oh, and Gary, since it seems that the barfing could have caused quite a few residuals to get switched around, I'm going to re-do the entire 100K-150K range in the server. At the rate we're going, I doubt it will take very long to do it; and that way we can be sure that nothing is messed up and all the results are good.

Last fiddled with by mdettweiler on 2009-08-07 at 14:52
mdettweiler is offline   Reply With Quote
Old 2009-08-07, 16:32   #76
MyDogBuster
 
MyDogBuster's Avatar
 
May 2008
Wilmington, DE

22×23×31 Posts
Default

Hi guys,

A weird one. Looks like a barf but my server hung. I ctrl-c'd out of it when I noticed it 6 minutes later. I've highlighted the test in question.
It was the first test in the sequence BUT wasn't listed when the client started uploading. The first one shown was the second test in the sequence.

Now the strange part. When I checked everything out my Candidates filke was missing about 35K tests. It went from 165K tests to 130K tests.

I had to rebuild it.

From what I can tell, two clients were attempting to upload results at the same time. All the tests from the client with the goofy test errored out.

Methinks the server has timing problems accepting results from more than 1 client. The client does seem impatient for an answer.

I've seen this happen on and off for the last few days but when I checked the client logs, it stated that the results were all accepted (I have all clients cached at 20 units)

This setup has run for 2 days with no known problems (except whats looks like a barf, but with all tests were being accepted). This is the first server hangup. All other programs on the machine were running just fine while the server was hung.

FROM THE SERVER: (2.2.3)

[2009-08-07 14:23:18 GMT] IMGunn1654@gmail.com (Sophie#4) at 192.168.2.100: Sent 4894*24^40589+1
[2009-08-07 14:23:18 GMT] IMGunn1654@gmail.com (Sophie#4) at 192.168.2.100: Sent 656*24^40590+1
[2009-08-07 14:23:18 GMT] IMGunn1654@gmail.com (Sophie#4) at 192.168.2.100: Sent 18724*24^40589+1
[2009-08-07 14:23:18 GMT] IMGunn1654@gmail.com (Sophie#4) at 192.168.2.100: Sent 5324*24^40591+1
[2009-08-07 14:28:54 GMT] 17819*24^40551+1: Test received by imgunn1654@gmail.com at 192.168.2.8 Residue Residue:
[2009-08-07 14:35:00 GMT] Accepted force quit. Waiting to close sockets before exiting

[2009-08-07 14:35:00 GMT] 656*24^40326+1: Test for user IMGunn1654@gmail.com has expired
[2009-08-07 14:35:00 GMT] 3031*24^40322+1: Test for user IMGunn1654@gmail.com has expired
[2009-08-07 14:35:00 GMT] 3051*24^40324+1: Test for user IMGunn1654@gmail.com has expired


FROM THE CLIENT

[2009-08-07 14:05:50 GMT] Base24: 17819*24^40551+1 is not prime. Residue 20A05F9B77C0E566
[2009-08-07 14:06:45 GMT] Base24: 1099*24^40553+1 is not prime. Residue D342FD5C641F6722
[2009-08-07 14:08:12 GMT] Base24: 9726*24^40552+1 is not prime. Residue 9DC7E0FEF8810C84
[2009-08-07 14:09:08 GMT] Base24: 6181*24^40552+1 is not prime. Residue 9FD8141B967D6A3B
[2009-08-07 14:10:04 GMT] Base24: 5129*24^40553+1 is not prime. Residue 467417A0E33484E0
[2009-08-07 14:10:59 GMT] Base24: 7394*24^40553+1 is not prime. Residue 6E6E638320684226
[2009-08-07 14:11:55 GMT] Base24: 7481*24^40554+1 is not prime. Residue 75E9D6864F9D8ADD
[2009-08-07 14:12:49 GMT] Base24: 3526*24^40554+1 is not prime. Residue B6FE09C3477726A8
[2009-08-07 14:13:45 GMT] Base24: 4606*24^40554+1 is not prime. Residue 346919CC613EDC68
[2009-08-07 14:14:40 GMT] Base24: 5129*24^40555+1 is not prime. Residue 1BBF04CC5EBADD4E
[2009-08-07 14:16:08 GMT] Base24: 12799*24^40555+1 is not prime. Residue 3CC90EEFFF538DEE
[2009-08-07 14:17:36 GMT] Base24: 9279*24^40555+1 is not prime. Residue 5FBE0B1556DB1F86
[2009-08-07 14:19:03 GMT] Base24: 21439*24^40555+1 is not prime. Residue 43036915D93D1B27
[2009-08-07 14:20:31 GMT] Base24: 12969*24^40555+1 is not prime. Residue F5E5E5C0C95DFA2D
[2009-08-07 14:21:26 GMT] Base24: 7481*24^40556+1 is not prime. Residue E1C265FFF9A82304
[2009-08-07 14:22:22 GMT] Base24: 7746*24^40556+1 is not prime. Residue 7F3BAE3DBC691F02
[2009-08-07 14:23:50 GMT] Base24: 20731*24^40556+1 is not prime. Residue 2C2569D7BB19C6C2
[2009-08-07 14:25:18 GMT] Base24: 21776*24^40556+1 is not prime. Residue 506D029179A2E896
[2009-08-07 14:26:45 GMT] Base24: 29601*24^40556+1 is not prime. Residue D577DBEEA0C3A2D8
[2009-08-07 14:28:13 GMT] Base24: 22356*24^40556+1 is not prime. Residue 18C59E905E97B7EA
[2009-08-07 14:28:13 GMT] Total Time: 44:43:19 Total Tests: 2660 Total PRPs Found: 3
[2009-08-07 14:28:13 GMT] Base24: Returning work to server 192.168.2.7 at port 7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 5129*24^40553+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7394*24^40553+1 1249653889>> to 192.168.2.7:7102

[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7481*24^40554+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 3526*24^40554+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 4606*24^40554+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 5129*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 12799*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 9279*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 21439*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 12969*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7481*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7746*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 20731*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 21776*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 29601*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 22356*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<End of Message>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<QUIT>> to 192.168.2.7:7102
MyDogBuster is offline   Reply With Quote
Old 2009-08-07, 16:40   #77
mdettweiler
A Sunny Moo
 
mdettweiler's Avatar
 
Aug 2007
USA (GMT-5)

11000011010012 Posts
Default

Yes, that's a classic barf. I'd recommend upgrading the server and all clients to v2.2.4 as soon as possible; that should fix the barfing problem.

Regarding the server not being able to accept more than one connection at a time: yes, it looks sort of like that. It would almost seem that the server has to wait for one client to finish talking to it before another one can begin. This may have been the cause of some of the timeouts that seemed to happen during the barfs. However, since the 2.2.4 client changes the timeout to 30 seconds, and won't keep sending results if the server isn't ready for them, this should now be fixed.
mdettweiler is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
PRPNet server for personal use johnadam74 Software 2 2016-01-01 15:58
New SR5 PRPnet server online ltd Sierpinski/Riesel Base 5 15 2013-03-19 18:03
First PSP PRPnet 4.0.6 server online ltd Prime Sierpinski Project 9 2011-03-15 04:58
PRPnet 3.1.3 stress-test server mdettweiler No Prime Left Behind 40 2010-01-30 18:05
First pass PRPNet server out of work? opyrt Prime Sierpinski Project 6 2009-09-24 18:14

All times are UTC. The time now is 09:42.


Tue Jul 27 09:42:39 UTC 2021 up 4 days, 4:11, 0 users, load averages: 2.02, 1.95, 1.88

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.