mersenneforum.org PRPNet released!

2008-12-30, 23:55   #45
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Okay, I've switched the G3000 server over to v1.0.1. I've also attached my Linux build of the v1.0.1 client. Just drop it in over the old one and you should be good.
Attached Files
 prpclient.zip (29.4 KB, 64 views)

2008-12-31, 01:00   #46
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3×2,083 Posts

Hmm...I've just noticed something odd in the PRPnet server. Earlier today, I'd noticed that there were a lot of abandoned candidates, likely because the way the client shuts down sometimes ends up abandoning some k/n pairs along the way. So, I decreased the deadline for all candidates to 3 hours, and immediately the server began assigning older work, as expected.

However, when I looked through the prpserver.candidates file a few minutes ago, I was surprised to find that some of the lowest Sierp. base 6 numbers (they have the lowest k's of the bunch, so the server assigns work for those first when available) had two residuals collected! Of course, the residuals were identical, but since I had not set the server to assign doublechecks, needless to say this was quite surprising. So, I stopped the server, set the first-pass deadline back to 3 days, and it immediately went back to handing out first-pass Riesel base 3 work.

Rogue, do you know why it was handing out doublechecks even though I've got the doublecheck setting set to 0 in prpserver.ini?

Thanks,
Max

2008-12-31, 02:01   #47
rogue

"Mark"
Apr 2003
Between here and the

5,717 Posts

Quote:
 Originally Posted by mdettweiler Rogue, do you know why it was handing out doublechecks even though I've got the doublecheck setting set to 0 in prpserver.ini?
Do you know the order in which the two tests were sent to clients and then responded to? For example, if a test was sent to one client and got no response, was then sent to a second client, and both clients eventually responded, that could cause this to happen.

You could run the server with debugging on for a couple of days to see if there are any more odd results like this.

One question I have: is the client shutting down normally, or is there an issue causing it to terminate without sending a message to the server?

2008-12-31, 02:43   #48
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

1100001101001₂ Posts

Quote:
 Originally Posted by rogue Do you know the order in which the two tests were sent to clients and then responded to? For example, if a test was sent to one client and got no response, was then sent to a second client, and both clients eventually responded, that could cause this to happen. You could run the server with debugging on for a couple of days to see if there are any more odd results like this. One question I have: is the client shutting down normally, or is there an issue causing it to terminate without sending a message to the server?
The tests were sent to a client, but some were lost on the client side (as will be described in more detail in a moment) and thus were abandoned on the server. They had been sitting like that for a number of hours.

Then, after the tests had been abandoned for about 10-12 hours, I went in and changed the first-pass deadline to 3 hours. The server then immediately started handing out the Sierp. base 6 work again--*all* of it, from the beginning, regardless of whether a candidate had been abandoned or had already had a result returned. Thus there are many such candidates with more than one result listed in prpserver.candidates.

Once I noticed this, I changed the deadline back to 3 days (I have both firstpass and doublecheck deadlines set to 3 days, even though doublecheck is set to 0--I presume that means that it's turned off?). The server went back to handing out tests from the leading edge of the Riesel base 3 work that it had been handing out before I had changed the deadline to 3 hours earlier.
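
To make the expected behavior concrete: with doublechecking off, a shorter deadline should only release candidates that were sent out and never answered. The following is a minimal sketch of that rule, not the PRPNet server's actual logic; `CandidateState`, `needsFirstPass`, and the field names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical summary of one candidate's state on the server.
struct CandidateState {
    int     completedTests;  // results already returned for this candidate
    int64_t lastSentTime;    // Unix time of the last assignment, 0 if never sent
};

// A first-pass candidate should be handed out only if no result has come back
// and it is either unassigned or its last assignment is older than the deadline.
bool needsFirstPass(const CandidateState& c, int64_t now, int64_t deadlineSeconds) {
    if (c.completedTests > 0) return false;   // already tested: never resend
    if (c.lastSentTime == 0) return true;     // never assigned
    return now - c.lastSentTime > deadlineSeconds;
}
```

Under this rule, dropping the deadline from 3 days to 3 hours would re-release the abandoned k/n pairs but would leave any candidate with a returned residual alone, which is what the server was expected to do here.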

Now for why the clients keep abandoning all these candidates. What I've noticed is that, if a client is stopped while it's on the first candidate of its batch, it will immediately shut down, leaving the G3000.save and phrot.chkpt files in place as it should. When restarted, the client resumes from the in-progress test, as it should, and continues normally.

However, if there are results waiting around in G3000.save when the client is exited, it will stop, send any completed results to the server, then *delete the G3000.save file*, and exit. By deleting the G3000.save file, it essentially removes the client's memory of the remaining untested candidates, and thus they are abandoned.
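
One way to avoid losing the untested candidates would be to rewrite the save file with whatever is still pending instead of deleting it. The sketch below only illustrates that idea; the `WorkItem` record, the one-name-per-line file layout, and both function names are assumptions, not the real G3000.save format or client code.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical record for one entry of a client save file: a candidate name
// plus a flag saying whether its result has already been reported.
struct WorkItem {
    std::string name;
    bool reported;
};

// Keep only the items that still need testing.
std::vector<WorkItem> pendingItems(const std::vector<WorkItem>& items) {
    std::vector<WorkItem> pending;
    for (const WorkItem& item : items) {
        if (!item.reported) pending.push_back(item);
    }
    return pending;
}

// Rewrite the save file with the pending items instead of deleting it.
// Returns false if the file could not be written.
bool rewriteSaveFile(const std::string& path, const std::vector<WorkItem>& items) {
    std::ofstream out(path, std::ios::trunc);
    if (!out) return false;
    for (const WorkItem& item : pendingItems(items)) {
        out << item.name << '\n';
    }
    return static_cast<bool>(out);
}
```

On shutdown the client would then send its completed results, call something like `rewriteSaveFile` with the full batch, and find the remaining candidates still on disk at the next start.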

2008-12-31, 04:19   #49
rogue

"Mark"
Apr 2003
Between here and the

5,717 Posts

Quote:
You'll need to turn on debugging in order for me to determine why it thinks double-checking is necessary. BTW, was that an "O" or a "0"? It should be a "0" (zero). I do see an issue in the server code that doesn't validate the setting correctly, so if you have "O", then I could understand the behavior.
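
As an illustration of the kind of validation that would catch an "O" typed in place of a "0", here is a minimal sketch of strict integer parsing. It is not the server's ini-handling code; the function name `parseSetting` is hypothetical, and only the standard `std::strtol` call is assumed.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Parse a non-negative integer setting strictly. Returns true and stores the
// value only if the whole string is a decimal number; "O", "" and "3x" fail.
bool parseSetting(const std::string& text, long& value) {
    if (text.empty()) return false;
    char* end = nullptr;
    errno = 0;
    long parsed = std::strtol(text.c_str(), &end, 10);
    // Reject overflow, trailing garbage (including a non-numeric string,
    // where nothing is consumed), and negative values.
    if (errno != 0 || *end != '\0' || parsed < 0) return false;
    value = parsed;
    return true;
}
```

A caller could refuse to start, or log a warning, when `parseSetting` returns false rather than silently falling back to some default.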

I will modify the behavior where undone tests are lost. I presume that the client is stopped because you have terminated the process, not because it detected an error and terminated itself. If I am wrong on that I need to know.

Last fiddled with by rogue on 2008-12-31 at 04:22

2008-12-31, 05:12   #50
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by rogue You'll need to turn on debugging in order for me to determine why it thinks double-checking is necessary. BTW, was that an "O" or a "0"? It should be a "0" (zero). I do see an issue in the server code that doesn't validate the setting correctly, so if you have "O", then I could understand the behavior.
I do know for a fact that it has a zero, not an O, in the ini file--I re-typed it there myself, if memory serves. I'll go turn on debugging, though, and try again with setting the deadline to 3 hours, and see how it turns out.
Quote:
 I will modify the behavior where undone tests are lost. I presume that the client is stopped because you have terminated the process, not because it detected an error and terminated itself. If I am wrong on that I need to know.
Yes, you are correct--in all such instances, I'd stopped it with a Ctrl-C or pkill command (SIGINT or SIGTERM, respectively).

2008-12-31, 06:12   #51
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Okay, I turned on debugging, set the deadline to 3 hours, and here's what I got:

Code:
[2008-12-31 06:04:32 GMT] socket 4 <<<< FROM bugmesticky@googlemail.com Core2Duo
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com connecting from 74.37.226.253
[2008-12-31 06:04:32 GMT] socket 4 <<<< GETWORK 1.0.0 10
[2008-12-31 06:04:32 GMT] socket 4 >>>> ServerVersion: 1.0.0
[2008-12-31 06:04:32 GMT] First check candidate 0, 26375*6^125217+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125217+1 1230703472 26375 6 125217 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125217+1
[2008-12-31 06:04:32 GMT] First check candidate 1, 26375*6^125221+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125221+1 1230703472 26375 6 125221 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125221+1
[2008-12-31 06:04:32 GMT] First check candidate 2, 26375*6^125301+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125301+1 1230703472 26375 6 125301 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125301+1
[2008-12-31 06:04:32 GMT] First check candidate 3, 26375*6^125325+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125325+1 1230703472 26375 6 125325 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125325+1
[2008-12-31 06:04:32 GMT] First check candidate 4, 26375*6^125341+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125341+1 1230703472 26375 6 125341 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125341+1
[2008-12-31 06:04:32 GMT] First check candidate 5, 26375*6^125397+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125397+1 1230703472 26375 6 125397 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125397+1
[2008-12-31 06:04:32 GMT] First check candidate 6, 26375*6^125545+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125545+1 1230703472 26375 6 125545 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125545+1
[2008-12-31 06:04:32 GMT] First check candidate 7, 26375*6^125565+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125565+1 1230703472 26375 6 125565 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125565+1
[2008-12-31 06:04:32 GMT] First check candidate 8, 26375*6^125629+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125629+1 1230703472 26375 6 125629 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125629+1
[2008-12-31 06:04:32 GMT] First check candidate 9, 26375*6^125637+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125637+1 1230703472 26375 6 125637 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125637+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> End of Message
[2008-12-31 06:04:33 GMT] socket 4 <<<< GETGREETING
[2008-12-31 06:04:33 GMT] socket 4 >>>> ############
[2008-12-31 06:04:33 GMT] socket 4 >>>> Welcome to the CRUS G3000 PRPnet beta test server! :-D
[2008-12-31 06:04:33 GMT] socket 4 >>>> Server is running PRPnet v1.0.1
[2008-12-31 06:04:33 GMT] socket 4 >>>> ############
[2008-12-31 06:04:33 GMT] socket 4 >>>> OK.
[2008-12-31 06:04:33 GMT] socket 4 <<<< QUIT
[2008-12-31 06:04:33 GMT] closing socket 4

Sounds normal, right? Well, it would be, except that these 10 candidates it just handed out are the first 10 candidates from the lowest Sierp. base 6 k--and all have already had TWO residuals returned!

Maybe the server is forgetting which candidates have already been returned when it is stopped and restarted? Methinks it might work a little better if the server marked the finished numbers as "inactive" and dumped them to a "finished" file, sort of like it does when you find a PRP with the sierpinskiriesel=1 option set. In fact, having it handle finished candidates like this might even make processing the results a *lot* easier.

Edit: Meanwhile, I've put the server back on the Riesel base 3 numbers so that we're not throwing our CPU time out the window on unnecessary triple-checks.

Last fiddled with by mdettweiler on 2008-12-31 at 06:14

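
For anyone who wants to cross-check a debug log against prpserver.candidates, the `WorkUnit:` lines in the log above can be split into fields with a few lines of C++. This is only a sketch: the field names are inferred from the log (the name is followed by what looks like a test ID and then k, b, n and c of k*b^n+c), not taken from the PRPNet source, and `WorkUnit` and `parseWorkUnit` are hypothetical.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Fields of a "WorkUnit:" payload as they appear in the debug log. The
// meanings of testID, k, b, n and c are inferred from the log output.
struct WorkUnit {
    std::string name;
    uint64_t    testID;
    uint64_t    k;
    uint32_t    b;
    uint32_t    n;
    int32_t     c;
};

// Parse the text that follows "WorkUnit: " on a log line.
// Returns false, leaving wu untouched, if any field is missing or malformed.
bool parseWorkUnit(const std::string& payload, WorkUnit& wu) {
    std::istringstream in(payload);
    WorkUnit parsed;
    if (!(in >> parsed.name >> parsed.testID >> parsed.k
             >> parsed.b >> parsed.n >> parsed.c)) {
        return false;
    }
    wu = parsed;
    return true;
}
```

The value 1230703472 that appears in every work unit corresponds to 2008-12-31 06:04:32 GMT when read as a Unix timestamp, which matches the log time, so the test ID appears to be the time of assignment.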
2008-12-31, 06:27   #52
gd_barnes

May 2007
Kansas; USA

23623₈ Posts

Max,

Thanks for all of the excellent testing here! It's good to get the small issues weeded out in the Beta process.

Since we're using "production" files from actual drives here, I'll leave it up to you to balance the k/n pairs results returned by the server to the original sieve file. Also, spot-double-checking some of the residuals might be a good idea.

Gary

2008-12-31, 06:34   #53
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by gd_barnes Max, Thanks for all of the excellent testing here! It's good to get the small issues weeded out in the Beta process. Since we're using "production" files from actual drives here, I'll leave it up to you to balance the k/n pairs results returned by the server to the original sieve file. Also, spot-double-checking some of the residuals might be a good idea. Gary
Yep, definitely--I'll be sure that the results are of at least as good quality as those we'd get from an LLRnet server before I send 'em to you.

Specifically, as for the Sierp. base 6 numbers, those are just going to be, in turn, submitted back to the IB6 LLRnet server, so they'll end up being balanced with the original sieve file anyway when I process the results from that server.

2008-12-31, 13:54   #54
rogue

"Mark"
Apr 2003
Between here and the

1655₁₆ Posts

Quote:
 Originally Posted by mdettweiler Maybe the server is forgetting which candidates have already been returned when it is stopped and restarted? Methinks it might work a little better if the server marked the finished numbers as "inactive" and dumped them to a "finished" file, sort of like it does when you find a PRP with the sierpinskiriesel=1 option set. In fact, having it handle finished candidates like this might even make processing the results a *lot* easier. Edit: Meanwhile, I've put the server back on the Riesel base 3 numbers so that we're not throwing our CPU time out the window on unnecessary triple-checks.
I found the problem. In Candidates.cpp the i_TestsPerformed variable was not getting set correctly when the server is restarted. Replace the AddTest function with this:

Code:
int32_t   Candidate::AddTest(uint64_t testID, char *program, char *residue, char *emailID, char *machineID, uint32_t logTest)
{
   test_t   *tPtr;
   char      theMessage[BUFFER_SIZE];
   Log      *prpLog;

   if (!m_Test)
   {
      // First test for this candidate: start the list
      m_Test = new test_t;
      tPtr = m_Test;
   }
   else
   {
      // Walk the list of existing tests to the last entry
      tPtr = m_Test;
      while (tPtr)
      {
         // We already know about this test, so ignore this result
         if (testID == tPtr->l_TestID &&
             !strcmp(emailID, tPtr->s_EmailID) &&
             !strcmp(machineID, tPtr->s_ClientID))
            return RC_FAILURE;

         // We have two tests with the same residue, thus no more double-checking is needed
         if (strcmp(tPtr->s_Residue, "inprogress") && !strcmp(residue, tPtr->s_Residue))
            b_NeedsDoubleCheck = 0;

         if (!tPtr->m_Next)
            break;
         tPtr = (test_t *) tPtr->m_Next;
      }

      // Append a new entry after the last one
      tPtr->m_Next = new test_t;
      tPtr = (test_t *) tPtr->m_Next;
   }

   tPtr->l_TestID = testID;
   strcpy(tPtr->s_Program, program);
   strcpy(tPtr->s_Residue, residue);
   strcpy(tPtr->s_EmailID, emailID);
   strcpy(tPtr->s_ClientID, machineID);
   tPtr->m_Next = 0;

   // Count every test that has a result, i.e. anything other than "inprogress"
   if (strcmp(tPtr->s_Residue, "inprogress"))
      i_TestsPerformed++;

   if (!strcmp(tPtr->s_Residue, "PRP") || !strcmp(tPtr->s_Residue, "PRIME"))
   {
      b_IsPRP = 1;
      b_IsActive = 0;

      if (logTest)
      {
         if (!strcmp(tPtr->s_Residue, "PRP"))
            sprintf(theMessage, "%s: PRP returned by %s (%s) using %s!", s_Name, emailID, machineID, program);
         else
            sprintf(theMessage, "%s: Prime returned by %s (%s) using %s!", s_Name, emailID, machineID, program);

         prpLog = new Log(0, "PRP.log", 0, NULL);
         prpLog->LogMessage(theMessage);
         delete prpLog;
      }
   }

   return RC_SUCCESS;
}
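
The counting rule in the patch can be checked in isolation with a small stand-in model. The sketch below is not PRPNet code; `CandidateModel` and `reload` are hypothetical, and the model keeps only the one behavior of interest: every test replayed with a residue other than "inprogress" increments the performed-test count, so a restart that replays saved tests ends up with the right total.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a candidate: it stores residues only and counts
// the tests that have actually produced a result.
class CandidateModel {
public:
    void addTest(const std::string& residue) {
        residues_.push_back(residue);
        if (residue != "inprogress") ++testsPerformed_;
    }
    int testsPerformed() const { return testsPerformed_; }

private:
    std::vector<std::string> residues_;
    int testsPerformed_ = 0;
};

// Rebuild a candidate from its saved residues, as a server restart would
// when it reads its candidates file back in.
CandidateModel reload(const std::vector<std::string>& savedResidues) {
    CandidateModel candidate;
    for (const std::string& residue : savedResidues) {
        candidate.addTest(residue);
    }
    return candidate;
}
```

If the count were left at zero after a reload, a server that decides what to hand out based on that count would treat already-tested candidates as untested, which is consistent with the repeated Sierp. base 6 assignments seen earlier in the thread.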

2008-12-31, 17:28   #55
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by rogue I found the problem. In Candidates.cpp the i_TestsPerformed variable was not getting set correctly when the server is restarted. Replace the AddTest function with this:

Code:
int32_t   Candidate::AddTest(uint64_t testID, char *program, char *residue, char *emailID, char *machineID, uint32_t logTest)
{
   test_t   *tPtr;
   char      theMessage[BUFFER_SIZE];
   Log      *prpLog;

   if (!m_Test)
   {
      m_Test = new test_t;
      tPtr = m_Test;
   }
   else
   {
      tPtr = m_Test;
      while (tPtr)
      {
         // We already know about this test, so ignore this result
         if (testID == tPtr->l_TestID &&
             !strcmp(emailID, tPtr->s_EmailID) &&
             !strcmp(machineID, tPtr->s_ClientID))
            return RC_FAILURE;

         // We have two tests with the same residue, thus no more double-checking is needed
         if (strcmp(tPtr->s_Residue, "inprogress") && !strcmp(residue, tPtr->s_Residue))
            b_NeedsDoubleCheck = 0;

         if (!tPtr->m_Next)
            break;
         tPtr = (test_t *) tPtr->m_Next;
      }

      tPtr->m_Next = new test_t;
      tPtr = (test_t *) tPtr->m_Next;
   }

   tPtr->l_TestID = testID;
   strcpy(tPtr->s_Program, program);
   strcpy(tPtr->s_Residue, residue);
   strcpy(tPtr->s_EmailID, emailID);
   strcpy(tPtr->s_ClientID, machineID);
   tPtr->m_Next = 0;

   if (strcmp(tPtr->s_Residue, "inprogress"))
      i_TestsPerformed++;

   if (!strcmp(tPtr->s_Residue, "PRP") || !strcmp(tPtr->s_Residue, "PRIME"))
   {
      b_IsPRP = 1;
      b_IsActive = 0;

      if (logTest)
      {
         if (!strcmp(tPtr->s_Residue, "PRP"))
            sprintf(theMessage, "%s: PRP returned by %s (%s) using %s!", s_Name, emailID, machineID, program);
         else
            sprintf(theMessage, "%s: Prime returned by %s (%s) using %s!", s_Name, emailID, machineID, program);

         prpLog = new Log(0, "PRP.log", 0, NULL);
         prpLog->LogMessage(theMessage);
         delete prpLog;
      }
   }

   return RC_SUCCESS;
}

Thanks! I'll get the fix compiled and loaded into the server shortly. And then...back to a 3-hour deadline to finally clean out some of those pesky Sierp. base 6 numbers that we've got hanging around!

Last fiddled with by mdettweiler on 2008-12-31 at 17:29
