mersenneforum.org  


Old 2009-08-04, 20:50   #45
gd_barnes
 
May 2007
Kansas; USA

101×103 Posts

Quote:
Originally Posted by mdettweiler View Post
Look at the # of candidates left for each k--you'll notice that it's a very small amount. The reason why the "max n" is not quite up to ~150K is because most of the work has been completed; all you're seeing now are just whatever the min and max of any stragglers happen to be.

Wow, that was fast that we got down to only remaining stragglers. That makes sense now. Thanks for enlightening me. We still need to get the formatting fixed on that one web page. We also need to change "n remaining" to "pairs remaining".

The option to run percentages of certain servers is outstanding! Whenever we reload for another test, Lennart's and my machines just start gobbling them up. Very cool! :-)


Gary
Old 2009-08-04, 20:53   #46
gd_barnes
 
May 2007
Kansas; USA

101×103 Posts

When did the last run hand out its last pair? In other words, when did the server "dry" by my definition?

I'm wondering because there are still 350 stragglers remaining. Unless someone has shut off a machine, all stragglers should have been processed by now. Can that be checked? Thanks.

One more thing. Is there the equivalent of a "JobMaxTime" in PRPnet? If so, what has G3000 been set to?


Gary

Last fiddled with by gd_barnes on 2009-08-04 at 20:54
Old 2009-08-04, 21:01   #47
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3×2,083 Posts

Quote:
Originally Posted by gd_barnes View Post
When did the last run hand out its last pair? In other words, when did the server "dry" by my definition?

I'm wondering because there are still 350 stragglers remaining. Unless someone has shut off a machine, all stragglers should have been processed by now. Can that be checked? Thanks.

One more thing. Is there the equivalent of a "JobMaxTime" in PRPnet? If so, what has G3000 been set to?


Gary
PRPnet does have an equivalent of a jobMaxTime value, specified in the file prpserver.delay on the server. It's been set to 3 days.

It's a little hard to tell you exactly when the server "dried" by your definition. I'm presuming that by that, you mean when did the server run out of its "main" stash of work until everything that's left was stragglers? If so, that will take a bit of digging to find out.
Old 2009-08-04, 21:34   #48
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

14151₈ Posts

I just checked the server, and it seems that the last time any test was handed out was at 22:17 GMT, 8/3. Within the next couple of hours Lennart's various machines returned their various results, but since then there's been no activity besides the server sending out "no available candidates on server" messages. I've changed the time limit to 6 hours; I see now that a large number (possibly all, I didn't do an exact count) of the stragglers have been expired and are being reassigned.
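To illustrate why shortening the time limit released the stragglers, here is a minimal Python sketch of an assignment timeout. It is not PRPnet's code: the `Assignment` class, the `expired_assignments` function, and the candidate label are hypothetical. Only the 3-day and 6-hour limits and the 22:17 GMT hand-out time come from the thread, and the check time is an illustrative assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Assignment:
    """One candidate handed to a client (hypothetical structure, not PRPnet's)."""
    candidate: str
    assigned_at: datetime


def expired_assignments(assignments, now, limit):
    """Return the assignments that have been outstanding longer than `limit`."""
    return [a for a in assignments if now - a.assigned_at > limit]


# Last test handed out at 22:17 GMT on 8/3 (from the thread).
last_handout = datetime(2009, 8, 3, 22, 17, tzinfo=timezone.utc)
# Assumed check time, roughly 23 hours later (illustrative only).
now = datetime(2009, 8, 4, 21, 34, tzinfo=timezone.utc)
stragglers = [Assignment("example-candidate", last_handout)]

# With a 3-day limit nothing has expired yet; with 6 hours the straggler has.
print(len(expired_assignments(stragglers, now, timedelta(days=3))))   # 0
print(len(expired_assignments(stragglers, now, timedelta(hours=6))))  # 1
```

Under this model a straggler stays locked to its original client until the limit passes, which would explain why nothing was reassigned while the limit was 3 days.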
Old 2009-08-04, 23:51   #49
gd_barnes
 
May 2007
Kansas; USA

10100010100011₂ Posts

Quote:
Originally Posted by mdettweiler View Post
I just checked the server, and it seems that the last time any test was handed out was at 22:17 GMT, 8/3. Within the next couple of hours Lennart's various machines returned their various results, but since then there's been no activity besides the server sending out "no available candidates on server" messages. I've changed the time limit to 6 hours; I see now that a large number (possibly all, I didn't do an exact count) of the stragglers have been expired and are being reassigned.
Who were the stragglers assigned to to begin with? Could this be a possible PRPnet bug or did someone receive some pairs and then disconnect or turn off their machines? I would think that if it is Lennart or me, then they should have been quickly processed.

This is important to figure out because in the first test, the same thing happened. I observed that the server was dry by my definition about 4-5 hours before you confirmed that it was really dry by your definition. If the pairs are coming back in a reasonable time frame, that difference should have been < 1 hour. If something is causing some pairs to "get stuck" for an extended period, we need to figure that out.

To clarify again:
Dry by my definition: No new work is available to hand out. Some straggling pairs still need results to be returned.

Dry by your definition: All pairs have been processed and returned.

Let me come up with a better way to state this: Your definition is probably more accurate in a purely technical sense so how about we call my definition "nominally dried" and stick with calling your definition simply "dried".
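The two definitions can be written down as a small Python sketch. The function name and the two counts are hypothetical and not taken from PRPnet; the 350 figure is the straggler count mentioned earlier in the thread, and the other numbers are made up for illustration.

```python
def server_state(unassigned_pairs, outstanding_pairs):
    """Classify a server using the two definitions from this thread.

    unassigned_pairs: pairs not yet handed to any client.
    outstanding_pairs: pairs handed out whose results have not come back.
    Both counts and this function are hypothetical, not taken from PRPnet.
    """
    if unassigned_pairs > 0:
        return "has work"
    if outstanding_pairs > 0:
        return "nominally dried"
    return "dried"


print(server_state(1200, 40))  # has work
print(server_state(0, 350))    # nominally dried
print(server_state(0, 0))      # dried
```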


Gary

Last fiddled with by gd_barnes on 2009-08-04 at 23:54
Old 2009-08-05, 01:46   #50
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

6249₁₀ Posts

Quote:
Originally Posted by gd_barnes View Post
Who were the stragglers assigned to to begin with? Could this be a possible PRPnet bug or did someone receive some pairs and then disconnect or turn off their machines? I would think that if it is Lennart or me, then they should have been quickly processed.

This is important to figure out because in the first test, the same thing happened. I observed that the server was dry by my definition about 4-5 hours before you confirmed that it was really dry by your definition. If the pairs are coming back in a reasonable time frame, that difference should have been < 1 hour. If something is causing some pairs to "get stuck" for an extended period, we need to figure that out.

To clarify again:
Dry by my definition: No new work is available to hand out. Some straggling pairs still need results to be returned.

Dry by your definition: All pairs have been processed and returned.

Let me come up with a better way to state this: Your definition is probably more accurate in a purely technical sense so how about we call my definition "nominally dried" and stick with calling your definition simply "dried".


Gary
Unless I missed something, all of the stragglers were from Lennart. Lennart, do you know of a particular reason why you had these abandoned pairs, or does this seem more like a bug?

Regarding dried vs. nominally dried: okay, that works.
Old 2009-08-05, 02:01   #51
Lennart
 
"Lennart"
Jun 2007

10001100000₂ Posts

Quote:
Originally Posted by mdettweiler View Post
Unless I missed something, all of the stragglers were from Lennart. Lennart, do you know of a particular reason why you had these abandoned pairs, or does this seem more like a bug?

Regarding dried vs. nominally dried: okay, that works.
No, those are from when the server did not receive them (when I could not connect). But they will not be handed out again until the time set in the delay file has passed.

[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit

Here you can see that the candidates were deleted when we had the connection error. So if the delay file is set to one day, I won't get them again until those 24 hours have passed.
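For anyone who wants to pull every such case out of a client log, a small Python sketch along these lines might help. The pattern is inferred only from the three log lines quoted above, so it may not match other client versions or messages, and `rejected_workunits` is a hypothetical helper name.

```python
import re

# Pattern for the client error line quoted above; the layout is inferred
# from those three lines only and may not cover other client messages.
NOT_FOUND = re.compile(
    r"^\[(?P<time>[^\]]+)\] (?P<server>\w+): "
    r"ERROR: Workunit (?P<workunit>\S+) not found on server$"
)


def rejected_workunits(lines):
    """Return (timestamp, workunit) for every 'not found on server' error."""
    hits = []
    for line in lines:
        match = NOT_FOUND.match(line.strip())
        if match:
            hits.append((match.group("time"), match.group("workunit")))
    return hits


sample = [
    "[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000",
    "[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server",
    "[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit",
]
print(rejected_workunits(sample))
# [('2009-08-03 17:49:43 GMT', '124221*6^148285+1')]
```

The timestamps collected this way could then be compared against the server's own log for the same moments.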

Lennart
Old 2009-08-05, 03:35   #52
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

14151₈ Posts

Quote:
Originally Posted by Lennart View Post
No, those are from when the server did not receive them (when I could not connect). But they will not be handed out again until the time set in the delay file has passed.

[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit

Here you can see that the candidates were deleted when we had the connection error. So if the delay file is set to one day, I won't get them again until those 24 hours have passed.

Lennart
Okay, I see. Since I know that the client won't just delete a result if it merely can't connect, but rather will hang on to the result and try again later, it's got to be something a little different than just that. From what your logs reported, it would seem almost like those tests were already completed on the server end, but the client didn't know that...ooh! I just thought of something! What if when the server barfs on a test, the client ends up deleting the *next* result, thinking that it wasn't found? This would fit with what we were seeing in the server's debug logs, where the client would seem to be sending the server a second workunit line at a time that would be out of place according to the PRPnet communication protocol. The server barfs on this, and registers the first test with a blank residual, but ignores the second one. Meanwhile, the client could quite believably think that the server is saying "the second test is not on the server" and therefore it deletes the result, thinking it's unneeded. Bingo! We have our abandoned tests.
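As a rough illustration of that hypothesis, here is a toy Python model of one exchange. It is only a guess at the failure sequence described above and does not reproduce PRPnet's real protocol or code; the function name and the `test-A`/`test-B` labels are made up.

```python
def return_results(finished, server_assigned):
    """Toy model of one client/server exchange under the hypothesised bug.

    finished: workunits the client tries to return, in order.
    server_assigned: set of workunits the server still considers assigned;
    it is modified in place.
    Returns the workunits the client deleted even though the server never
    recorded a result for them.
    """
    abandoned = set()
    for position, workunit in enumerate(finished):
        if position == 0:
            # The server registers the first result it sees.
            server_assigned.discard(workunit)
        else:
            # The out-of-place line is ignored by the server, but the client
            # reads the reply as "not found on server" and deletes its result.
            abandoned.add(workunit)
    return abandoned


server_assigned = {"test-A", "test-B"}
abandoned = return_results(["test-A", "test-B"], server_assigned)
print(abandoned)        # {'test-B'}
print(server_assigned)  # {'test-B'}
```

In this model `test-B` ends up deleted on the client yet still assigned on the server, so it can only come back once the time limit expires.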

Lennart, could you provide debug.log excerpts from when your clients behaved like in your example? This might be just the key we're looking for. Meanwhile, I'll check and see if on the server, a test was barfed at the same time shown in your logs for your example.

Edit: oh, never mind, looks like that particular example was from before I put the server on debug logging. Lennart, do you have a similar example from somewhere after 2009-08-03 18:20:04 GMT?

Last fiddled with by mdettweiler on 2009-08-05 at 03:38
Old 2009-08-05, 09:37   #53
gd_barnes
 
May 2007
Kansas; USA

101×103 Posts

Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not as all machines were connected at all times. Good luck both Max and Rogue figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time. lol


Gary

Last fiddled with by gd_barnes on 2009-08-05 at 09:44 Reason: thank me now or later :-)
Old 2009-08-05, 12:25   #54
rogue
 
"Mark"
Apr 2003
Between here and the

6352₁₀ Posts

Quote:
Originally Posted by gd_barnes View Post
Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not as all machines were connected at all times. Good luck both Max and Rogue figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time.
I have provided a server side patch to Max (which I've instructed him to not install), but I'm waiting to see a debug log from a client that is exhibiting the problem so that I can determine if there is a bug in the client as well.
Old 2009-08-05, 12:48   #55
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

1869₁₆ Posts

Quote:
Originally Posted by gd_barnes View Post
Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not as all machines were connected at all times. Good luck both Max and Rogue figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time. lol


Gary
Hey, looky! I just noticed something! G3000 has now *really* dried (not just nominally). It now shows just one insanely huge k with 999999999 as the min-n on the server_stats.html page, which is how dried PRPnet servers are "supposed" to react. (It's just a cosmetic error, though rest assured, yes, we're working on that too.)

This means, of course, that if there's any data to be had on the client side of things, it's already sitting in one of Lennart's debug.log's and waiting for us to collect.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.