PRPnet server bugs and barfing problem (https://www.mersenneforum.org/showthread.php?t=12256)

rogue 2009-08-03 23:04

Lennart, can you tell me what the client does with the workunits when this happens? Does it delete them or save them and try again?

Lennart 2009-08-03 23:33

[code][2009-08-03 16:44:08 GMT] Total Time: 2:12:11 Total Tests: 15 Total PRPs Found: 0
[2009-08-03 16:44:53 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:11 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:11 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:11 GMT] 27121: Getting work from server prpnet.primegrid.com at port 12006
[2009-08-03 17:49:36 GMT] 27121: 27*2^1543462+1 is not prime. Residue 2D44561896DD41CE
[2009-08-03 17:49:36 GMT] Total Time: 3:17:39 Total Tests: 16 Total PRPs Found: 0
[2009-08-03 17:49:36 GMT] 27121: Returning work to server prpnet.primegrid.com at port 12006
[2009-08-03 17:49:38 GMT] 27121: INFO: Test for candidate 27*2^1543462+1 accepted
[2009-08-03 17:49:38 GMT] 27121: INFO: All 1 test results were accepted
[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit
[2009-08-03 17:49:44 GMT] crus: INFO: Test for candidate 74612*6^148287+1 accepted
[2009-08-03 17:49:45 GMT] crus: INFO: Test for candidate 172257*6^148286+1 accepted
[2009-08-03 17:49:45 GMT] crus: INFO: 2 of 3 test results were accepted
[2009-08-03 17:49:46 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:47 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:48 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:49 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:50 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:51 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:52 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:53 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:54 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:55 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:56 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:57 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:57 GMT] 27121: Getting work from server prpnet.primegrid.com at port 12006
[2009-08-03 18:43:33 GMT] 27121: 27*2^1543856+1 is not prime. Residue DE6D07D9F6EA4450
[2009-08-03 18:43:33 GMT] Total Time: 4:11:36 Total Tests: 17 Total PRPs Found: 0
[2009-08-03 18:43:34 GMT] 27121: Returning work to server prpnet.primegrid.com at port 12006
[2009-08-03 18:43:37 GMT] 27121: INFO: Test for candidate 27*2^1543856+1 accepted
[/code]

Here is the log. As you can see, my clock was not correct.

I have started setting all clients to debug=1.

Lennart
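For anyone who wants to pull the accepted and rejected workunits out of a client log like the one above, here is a minimal sketch. The regular expressions are assumptions based only on the lines in this excerpt, not on PRPnet's source, and the function name `summarize` is made up for illustration.

```python
import re

# Assumed line layout, inferred from the excerpt above:
# "[YYYY-MM-DD HH:MM:SS GMT] <server id>: <message>"
LINE_RE = re.compile(
    r"^\[(?P<ts>[\d-]+ [\d:]+) GMT\] (?:(?P<server>\w+): )?(?P<msg>.*)$"
)
NOT_FOUND_RE = re.compile(r"ERROR: Workunit (?P<wu>\S+) not found on server")
ACCEPTED_RE = re.compile(r"INFO: Test for candidate (?P<wu>\S+) accepted")


def summarize(lines):
    """Group accepted and not-found workunits by server id (hypothetical helper)."""
    summary = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None or m.group("server") is None:
            continue  # lines without a "<server id>: " prefix are skipped
        entry = summary.setdefault(
            m.group("server"), {"accepted": [], "not_found": []}
        )
        msg = m.group("msg")
        hit = NOT_FOUND_RE.search(msg)
        if hit:
            entry["not_found"].append(hit.group("wu"))
            continue
        hit = ACCEPTED_RE.search(msg)
        if hit:
            entry["accepted"].append(hit.group("wu"))
    return summary


sample = [
    "[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed",
    "[2009-08-03 17:49:38 GMT] 27121: INFO: Test for candidate 27*2^1543462+1 accepted",
    "[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server",
    "[2009-08-03 17:49:44 GMT] crus: INFO: Test for candidate 74612*6^148287+1 accepted",
    "[2009-08-03 17:49:45 GMT] crus: INFO: Test for candidate 172257*6^148286+1 accepted",
]
result = summarize(sample)
print(result["crus"]["not_found"])
```

On the sample lines this reports one workunit for `crus` that the server did not recognise, which matches the "2 of 3 test results were accepted" line in the log.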

rogue 2009-08-03 23:42

[QUOTE=Lennart;183918]Here is the log. As you can see, my clock was not correct.

I have started setting all clients to debug=1.

Lennart[/QUOTE]

If you can send the debug log as soon as you have it, that would be great. There isn't enough information in these lines to give me a clear picture of what happened.

Lennart 2009-08-04 00:08

[quote=mdettweiler;183911]I'm afraid I haven't encountered this issue myself on the client end, so I can't help you there; Lennart, could you possibly put all of your G3000 clients on level 1 debug logging, if they aren't already?[/quote]


They are now :smile:

Lennart

gd_barnes 2009-08-04 04:27

[quote=mdettweiler;183842]All right, 140K-150K has been reloaded exactly as described above. I see Lennart's hungry machines have already swooped in and grabbed a bunch of work. :smile:

Checking the various web pages:
[URL]http://nplb-gb1.no-ip.org:3000/[/URL] (a.k.a. server_stats.html) checks out except for the "Min N" label on both the Min N and Max N columns, which is a known bug for Sierpinski/Riesel mode in PRPnet and will be fixed in a future release, but is only a cosmetic error for now.

[URL]http://nplb-gb1.no-ip.org:3000/server_status.html[/URL] still looks kind of weird:

It seems that this page is displaying inaccurately, due to a bug in PRPnet. However, as before, this is a cosmetic error (even if a bit more serious than the extra "Min N" label), and regardless of how the words seem to come out on the page, the basic information does line up with the following (as of when I pulled down this page):
-k's remaining: 30
-n's remaining: 4065
-min n remaining: 140005
The # of digits doesn't seem to be presented at all, regardless of the wording.

[URL]http://nplb-gb1.no-ip.org:3000/user_stats.html[/URL] checks out.

I'll report the bug on the server_status page to Mark. However, since all the information on that page can be gathered from the server_stats/home page anyway, it's somewhat less of a big deal since it's not affecting anything except the display of the data. So, we should be OK even with that bug present.

Max :smile:[/quote]


Gotta have all interfaces (as well as barfing) fixed, including grammar/spacing/clarity/etc., before we load n>150K, even if it requires a new PRPnet release. We'll keep rerunning n=140K-150K until we get a clean test. The "n's remaining" of 4065 is misleading. It should say something like "pairs remaining" (assuming that is what it is referring to).

Thanks for getting that going. It's good to see the # of k's remaining is correct now.


Thanks,
Gary

gd_barnes 2009-08-04 04:30

Could these problems with "barfing" be a result of my servers not being able to handle a very big load? That's quite a bit of crunching power on there by Lennart. (I have the equivalent of 2 cores on there, i.e. a 50-50 split with another effort on a full quad.)

We should definitively know about load being a possible problem when port G5000 at NPLB gets rolling with the very teeny tests that will only take a few secs. each. That should be a big load even with just a few quads on it! :smile:

gd_barnes 2009-08-04 04:42

Something new for this time around:

First, I believe this drive is being processed by n-value. Based on that, I see that at [URL]http://nplb-gb1.no-ip.org:3000/[/URL] k=124125 and 124221 have a min n of 140006 and 140005 respectively even though most of this testing effort is at n>143K. Could it be because someone has received some pairs that haven't been returned to the server in a long time? I'm trying to determine if the "min n" is updating properly on all k's.

Second, the max n is showing as n=~148.7K for nearly all k's. (~147.6K for a few k's, perhaps because they are lower weight?) It should be showing as n=~150K for all k's unless the full n=140K-150K range was not loaded. This looked correct last time. How come it doesn't look correct this time?


Final question: How frequently is the "min n" and "max n" by k page updated?


Thanks,
Gary

mdettweiler 2009-08-04 04:50

[quote=gd_barnes;183948]Something new for this time around:

First, I believe this drive is being processed by n-value. Based on that, I see that at [URL]http://nplb-gb1.no-ip.org:3000/[/URL] k=124125 and 124221 have a min n of 140006 and 140005 respectively even though most of this testing effort is at n>143K. Could it be because someone has received some pairs that haven't been returned to the server in a long time? I'm trying to determine if the "min n" is updating properly on all k's.

Second, the max n is showing as n=~148.7K for all k's. It should be showing as n=~150K unless the full n=140K-150K range was not loaded. This looked correct last time. How come it doesn't look correct this time?


Final question: How frequently is the "min n" and "max n" by k page updated?


Thanks,
Gary[/quote]
Min n and Max n are updated every 5 minutes; intermediate changes are shown in the columns to the right of those. (The intermediate changes are absorbed into the larger Min n and Max n columns at the 5 minute updates when completed tests are removed from the prpserver.candidates file.)
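
To make the update cycle described above concrete, here is a toy model of one k's row on the stats page. It is only a sketch of the behaviour as described in this post, not PRPnet's actual code; the class `KRow` and its method names are hypothetical, and the set `candidates` merely stands in for the entries in prpserver.candidates.

```python
class KRow:
    """Toy model of one k's Min n / Max n columns (hypothetical, not PRPnet code)."""

    def __init__(self, n_values):
        self.candidates = set(n_values)  # stands in for prpserver.candidates
        self.completed = set()           # tests finished since the last update
        self.min_n = min(self.candidates)
        self.max_n = max(self.candidates)

    def report_completed(self, n):
        # A finished test is only recorded here; Min n / Max n do not move yet.
        if n in self.candidates:
            self.completed.add(n)

    def periodic_update(self):
        # Assumed to run every 5 minutes: completed tests are removed and
        # Min n / Max n are recomputed from whatever is left.
        self.candidates -= self.completed
        self.completed.clear()
        if self.candidates:
            self.min_n = min(self.candidates)
            self.max_n = max(self.candidates)
        else:
            self.min_n = self.max_n = None


row = KRow([140005, 143000, 149990])
row.report_completed(149990)
before = (row.min_n, row.max_n)  # still the old values
row.periodic_update()
after = (row.min_n, row.max_n)   # Max n drops once the update runs
print(before, after)
```

In this model a returned test does not change Min n or Max n until the next periodic update, which is why the main columns can lag behind the intermediate ones for a few minutes.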

Regarding the barfing possibly being related to server stress: no, I've talked with Mark and he's definitely confirmed that there's a server bug that needs to be fixed, as well as possibly a client bug pending further investigation of debug.log files. He's sent me a fix for the server side of things (which should definitely fix the barfing), though he asked me not to apply the fixed version to the server yet so Lennart's clients can get a chance to catch a log of their end of the barfing in their debug.log files for him to examine and see if there's a bug in the client as well.

gd_barnes 2009-08-04 04:58

OK, great. I'm glad to hear we've nailed down the "server barfing" problems.

Thanks for the "absorbing of changes" to min/max n explanation. It's a little clearer to me now.

It seems we still have a "min n" and "max n" problem though; mainly "max n". The "min n" issue could be a result of a few pairs not having been returned yet, although that seems a little suspect since Lennart and I are the main ones on there and our machines have remained connected (I think). The "max n" should be n=~150K for all k's. Can you look into that? Is the full n=140K-150K file loaded into the server?


Gary

mdettweiler 2009-08-04 13:06

[quote=gd_barnes;183952]It seems we still have a "min n" and "max n" problem though; mainly "max n". The "min n" issue could be as a result of a few pairs not having been returned yet, although that seems a little suspect since Lennart and I are the main ones on there and our machines have remained connected (I think). The "max n" should be n=~150K for all k's. Can you look into that? Is the full n=140K-150K file loaded into the server?[/quote]
Look at the # of candidates left for each k--you'll notice that it's a very small amount. The reason why the "max n" is not quite up to ~150K is because most of the work has been completed; all you're seeing now are just whatever the min and max of any stragglers happen to be.
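
As a rough illustration of why the columns drift once only stragglers are left, the sketch below computes min and max n per k from a list of remaining candidates. The candidate strings use the k*b^n+c form seen in the client log; the list itself is made up for illustration (only the min n values 140005 and 140006 come from this thread), and `min_max_n_by_k` is a hypothetical helper, not part of PRPnet.

```python
import re

# Candidates in the form k*b^n+c, as they appear in the client log.
CANDIDATE_RE = re.compile(r"^(\d+)\*(\d+)\^(\d+)([+-]\d+)$")


def min_max_n_by_k(remaining):
    """Return {k: (min_n, max_n)} over the candidates still outstanding."""
    ranges = {}
    for cand in remaining:
        m = CANDIDATE_RE.match(cand)
        if m is None:
            raise ValueError(f"unrecognised candidate: {cand!r}")
        k, n = int(m.group(1)), int(m.group(3))
        lo, hi = ranges.get(k, (n, n))
        ranges[k] = (min(lo, n), max(hi, n))
    return ranges


# Hypothetical stragglers: nothing near n=150K is left for either k.
remaining = [
    "124221*6^140005+1",
    "124221*6^148700+1",
    "124125*6^140006+1",
]
print(min_max_n_by_k(remaining))
# {124221: (140005, 148700), 124125: (140006, 140006)}
```

With only a handful of pairs outstanding, the reported max n is simply the largest n among them, so it can sit well below 150K even though the full range was loaded.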

BTW, Lennart, did you catch the barfing in one of your client debug.log's? I see a small bit of barfing on the server from around August 3, 21:15 GMT; here are the clients involved: _31, _206, _162, _127, _71, _31, and last but not least, humpford (one of Gary's finest :wink:). (Of course since Gary doesn't have his clients set to debug logging, that last one is rather irrelevant; no big deal, there should be plenty of data from Lennart's logs.)

Interestingly enough, last night doesn't seem to be a big one for barfing; I had to go all the way back to the time of the abovementioned barf in order to find any instance of it.

Lennart 2009-08-04 13:24

[quote=mdettweiler;184004]Look at the # of candidates left for each k--you'll notice that it's a very small amount. The reason why the "max n" is not quite up to ~150K is because most of the work has been completed; all you're seeing now are just whatever the min and max of any stragglers happen to be.

BTW, Lennart, did you catch the barfing in one of your client debug.log's? I see a small bit of barfing on the server from around August 3, 21:15 GMT; here are the clients involved: _31, _206, _162, _127, _71, _31, and last but not least, humpford (one of Gary's finest :wink:). (Of course since Gary doesn't have his clients set to debug logging, that last one is rather irrelevant; no big deal, there should be plenty of data from Lennart's logs.)

Interestingly enough, last night doesn't seem to be a big one for barfing; I had to go all the way back to the time of the abovementioned barf in order to find any instance of it.[/quote]

Debug was not on that early. I enabled debug between 22:00 GMT and 23:45 GMT.

Lennart


All times are UTC.