mersenneforum.org

PRPnet server bugs and barfing problem (https://www.mersenneforum.org/showthread.php?t=12256)

gd_barnes 2009-08-04 20:50

[quote=mdettweiler;184004]Look at the # of candidates left for each k--you'll notice that it's a very small amount. The reason why the "max n" is not quite up to ~150K is because most of the work has been completed; all you're seeing now are just whatever the min and max of any stragglers happen to be.[/quote]


Wow, we got down to just the remaining stragglers fast. That makes sense now. Thanks for enlightening me. We still need to get the formatting fixed on that one web page. We also need to change "n remaining" to "pairs remaining".

The option to run percentages of certain servers is outstanding! Whenever we reload for another test, Lennart's and my machines just start gobbling them up. Very cool! :-)


Gary

gd_barnes 2009-08-04 20:53

When did the last run hand out its last pair? In other words, when did the server "dry" by my definition?

I'm wondering because there are still 350 stragglers remaining. Unless someone has shut off a machine, all stragglers should have been processed by now. Can that be checked? Thanks.

One more thing. Is there the equivalent of a "JobMaxTime" in PRPnet? If so, what has G3000 been set to?


Gary

mdettweiler 2009-08-04 21:01

[quote=gd_barnes;184109]When did the last run hand out its last pair? In other words, when did the server "dry" by my definition?

I'm wondering because there are still 350 stragglers remaining. Unless someone has shut off a machine, all stragglers should have been processed by now. Can that be checked? Thanks.

One more thing. Is there the equivalent of a "JobMaxTime" in PRPnet? If so, what has G3000 been set to?


Gary[/quote]
PRPnet does have an equivalent of a jobMaxTime value, specified in the file prpserver.delay on the server. It's been set to 3 days.

It's a little hard to tell you exactly when the server "dried" by your definition. I'm presuming that by that, you mean when did the server run out of its "main" stash of work until everything that's left was stragglers? If so, that will take a bit of digging to find out.

mdettweiler 2009-08-04 21:34

I just checked the server, and it seems that the last time any test was handed out was at 22:17 GMT, 8/3. Within the next couple of hours Lennart's various machines returned their various results, but since then there's been no activity besides the server sending out "no available candidates on server" messages. I've changed the time limit to 6 hours; I see now that a large number (possibly all, I didn't do an exact count) of the stragglers have been expired and are being reassigned.
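To illustrate why lowering the limit expired the stragglers right away, here is a minimal sketch of the expiry check. It is not PRPnet's actual code; the function name `is_expired` is made up, and the "now" value is assumed to be the timestamp of this post.

```python
from datetime import datetime, timedelta, timezone


def is_expired(assigned_at, now, limit):
    """Return True if an assignment has been outstanding longer than the limit."""
    return now - assigned_at > limit


# Last test handed out, per the post above (22:17 GMT on 8/3).
assigned_at = datetime(2009, 8, 3, 22, 17, tzinfo=timezone.utc)
# Assumption: the check happens at the time of this post (2009-08-04 21:34).
now = datetime(2009, 8, 4, 21, 34, tzinfo=timezone.utc)

old_limit = timedelta(days=3)   # original setting
new_limit = timedelta(hours=6)  # setting after the change

print(is_expired(assigned_at, now, old_limit))  # False
print(is_expired(assigned_at, now, new_limit))  # True
```

Roughly 23 hours had passed since the last handout, so nothing could expire under the 3-day limit, while everything still outstanding was past the 6-hour limit.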

gd_barnes 2009-08-04 23:51

[quote=mdettweiler;184115]I just checked the server, and it seems that the last time any test was handed out was at 22:17 GMT, 8/3. Within the next couple of hours Lennart's various machines returned their various results, but since then there's been no activity besides the server sending out "no available candidates on server" messages. I've changed the time limit to 6 hours; I see now that a large number (possibly all, I didn't do an exact count) of the stragglers have been expired and are being reassigned.[/quote]

Who were the stragglers originally assigned to? Could this be a PRPnet bug, or did someone receive some pairs and then disconnect or turn off their machines? I would think that if it is Lennart or me, then they should have been quickly processed.

This is important to figure out because in the first test, the same thing happened. I observed that the server was dry by my definition about 4-5 hours before you confirmed that it was really dry by your definition. If the pairs are coming back in a reasonable time frame, that difference should have been < 1 hour. If something is causing some pairs to "get stuck" for an extended period, we need to figure that out.

To clarify again:
Dry by my definition: No new work is available to hand out. Some straggling pairs still need results to be returned.

Dry by your definition: All pairs have been processed and returned.

Let me come up with a better way to state this: Your definition is probably more accurate in a purely technical sense so how about we call my definition "nominally dried" and stick with calling your definition simply "dried".
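The two definitions can be written down as a small sketch. The names `ServerState` and `classify` are made up for illustration and are not part of PRPnet; the two counts are assumed to be available from the server's stats.

```python
from enum import Enum


class ServerState(Enum):
    ACTIVE = "active"
    NOMINALLY_DRIED = "nominally dried"
    DRIED = "dried"


def classify(unassigned_pairs, outstanding_pairs):
    """Classify a server from its pair counts.

    unassigned_pairs: pairs never handed out to any client.
    outstanding_pairs: pairs handed out whose results have not come back.
    """
    if unassigned_pairs > 0:
        return ServerState.ACTIVE
    if outstanding_pairs > 0:
        return ServerState.NOMINALLY_DRIED
    return ServerState.DRIED


# The situation described earlier: nothing left to hand out, 350 stragglers.
print(classify(0, 350).value)  # nominally dried
print(classify(0, 0).value)    # dried
```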


Gary

mdettweiler 2009-08-05 01:46

[quote=gd_barnes;184142]Who were the stragglers originally assigned to? Could this be a PRPnet bug, or did someone receive some pairs and then disconnect or turn off their machines? I would think that if it is Lennart or me, then they should have been quickly processed.

This is important to figure out because in the first test, the same thing happened. I observed that the server was dry by my definition about 4-5 hours before you confirmed that it was really dry by your definition. If the pairs are coming back in a reasonable time frame, that difference should have been < 1 hour. If something is causing some pairs to "get stuck" for an extended period, we need to figure that out.

To clarify again:
Dry by my definition: No new work is available to hand out. Some straggling pairs still need results to be returned.

Dry by your definition: All pairs have been processed and returned.

Let me come up with a better way to state this: Your definition is probably more accurate in a purely technical sense so how about we call my definition "nominally dried" and stick with calling your definition simply "dried".


Gary[/quote]
Unless I missed something, all of the stragglers were from Lennart. Lennart, do you know of a particular reason why you had these abandoned pairs, or does this seem more like a bug?

Regarding dried vs. nominally dried: okay, that works. :smile:

Lennart 2009-08-05 02:01

[quote=mdettweiler;184153]Unless I missed something, all of the stragglers were from Lennart. Lennart, do you know of a particular reason why you had these abandoned pairs, or does this seem more like a bug?

Regarding dried vs. nominally dried: okay, that works. :smile:[/quote]

No, those are from when the server did not receive them (when I could not connect). They will not be handed out again until the time set in the delay file has passed.

[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit

Here you can see that the candidate was deleted when we had the connection error. So if the delay file is set to one day, I won't get them again until those 24 hours have passed.
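For anyone who wants to scan a client log for more cases like this, here is a rough sketch. The pattern is inferred only from the three lines quoted above, so it is an assumption that other client versions word the message the same way; `find_dropped_workunits` is a made-up helper name.

```python
import re
from datetime import datetime

# Pattern inferred from the three log lines quoted above; it is an assumption
# that every PRPnet client version uses exactly this wording.
NOT_FOUND = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) GMT\] "
    r"(?P<tag>\S+): ERROR: Workunit (?P<workunit>\S+) not found on server$"
)


def find_dropped_workunits(lines):
    """Return (timestamp, workunit) for every 'not found on server' error."""
    dropped = []
    for line in lines:
        match = NOT_FOUND.match(line.strip())
        if match:
            when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            dropped.append((when, match.group("workunit")))
    return dropped


log = [
    "[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000",
    "[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server",
    "[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit",
]

for when, workunit in find_dropped_workunits(log):
    print(when.isoformat(), workunit)
```

Comparing the timestamps it returns against the server log would show whether the server barfed on a test at the same moment.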

Lennart

mdettweiler 2009-08-05 03:35

[quote=Lennart;184159]No, those are from when the server did not receive them (when I could not connect). They will not be handed out again until the time set in the delay file has passed.

[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit

Here you can see that the candidate was deleted when we had the connection error. So if the delay file is set to one day, I won't get them again until those 24 hours have passed.

Lennart[/quote]
Okay, I see. Since I know that the client won't just delete a result if it merely can't connect, but rather will hang on to the result and try again later, it's got to be something a little different than just that. From what your logs reported, it would seem almost like those tests were already completed on the server end, but the client didn't know that...ooh! I just thought of something!

What if when the server barfs on a test, the client ends up deleting the *next* result, thinking that it wasn't found? This would fit with what we were seeing in the server's debug logs, where the client would seem to be sending the server a second workunit line at a time that would be out of place according to the PRPnet communication protocol. The server barfs on this, and registers the first test with a blank residual, but ignores the second one. Meanwhile, the client could quite believably think that the server is saying "the second test is not on the server" and therefore it deletes the result, thinking it's unneeded.

Bingo! We have our abandoned tests.

Lennart, could you provide debug.log excerpts from when your clients behaved like in your example? This might be just the key we're looking for. Meanwhile, I'll check and see if on the server, a test was barfed at the same time shown in your logs for your example.

Edit: oh, never mind, looks like that particular example was from before I put the server on debug logging. Lennart, do you have a similar example from somewhere after 2009-08-03 18:20:04 GMT?
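To make the hypothesis above concrete, here is a toy model of it. This is purely illustrative: the message tokens `WU`, `RES`, `OK`, and `ERROR`, the workunit names `A` and `B`, and the function `toy_server` are all invented and are not the real PRPnet protocol or code. It only shows how an out-of-place second workunit line could leave one test with a blank residue and another abandoned.

```python
def toy_server(lines, outstanding):
    """Toy server that expects alternating 'WU <name>' and 'RES <residue>' lines.

    Returns (completed, replies). Everything here is hypothetical.
    """
    completed = {}
    replies = []
    current = None
    for line in lines:
        kind, _, value = line.partition(" ")
        if kind == "WU":
            if current is not None:
                # Out-of-place second workunit line: the toy server "barfs",
                # closes the first test with a blank residue, ignores the second.
                completed[current] = ""
                outstanding.discard(current)
                replies.append("ERROR")
                current = None
            else:
                current = value
        elif kind == "RES" and current is not None:
            completed[current] = value
            outstanding.discard(current)
            replies.append("OK")
            current = None
        # A residue line with no open workunit is silently dropped.
    return completed, replies


# The server has handed out two tests; the client holds results for both.
outstanding = {"A", "B"}
held = {"A", "B"}

# Hypothesized bad exchange: the second workunit line arrives where the
# residue for A was expected.
completed, replies = toy_server(["WU A", "WU B", "RES rB"], outstanding)

held.discard("A")            # client considers A delivered
if "ERROR" in replies:
    held.discard("B")        # client misreads the error as "B not on server"

abandoned = outstanding - held
print(completed)             # {'A': ''}
print(sorted(abandoned))     # ['B']
```

In this model B stays assigned on the server while no client holds it any more, so it can only come back once the delay-file time limit expires, which matches the stragglers seen here.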

gd_barnes 2009-08-05 09:37

Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis, not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not, as all machines were connected at all times. Good luck to both Max and Rogue in figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time. lol


Gary

rogue 2009-08-05 12:25

[QUOTE=gd_barnes;184180]Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis, not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not, as all machines were connected at all times. Good luck to both Max and Rogue in figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time.[/QUOTE]

I have provided a server side patch to Max (which I've instructed him to not install), but I'm waiting to see a debug log from a client that is exhibiting the problem so that I can determine if there is a bug in the client as well.

mdettweiler 2009-08-05 12:48

Hey, looky! I just noticed something! G3000 has now *really* dried (not just nominally). It now shows just one insanely huge k with 999999999 as the min-n on the server_stats.html page, which is how dried PRPnet servers are "supposed" to react. (It's just a cosmetic error, though rest assured, yes, we're working on that too. :smile:)

This means, of course, that if there's any data to be had on the client side of things, it's already sitting in one of Lennart's debug.log files, waiting for us to collect. :smile:


All times are UTC.