mersenneforum.org  

mersenneforum.org > Prime Search Projects > Conjectures 'R Us
Old 2008-12-30, 23:55   #45
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
3·2,083 Posts

Okay, I've switched the G3000 server over to v1.0.1. I've also attached my Linux build of the v1.0.1 client. Just drop it in over the old one and you should be good.
Attached Files
File Type: zip prpclient.zip (29.4 KB, 64 views)
Old 2008-12-31, 01:00   #46
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
3×2,083 Posts

Hmm...I've just noticed something odd in the PRPnet server. Earlier today, I'd noticed that there were a lot of abandoned candidates, likely because the client sometimes abandons some k/n pairs when it shuts down. So, I decreased the deadline for all candidates to 3 hours, and immediately the server began assigning older work, as expected.

However, when I looked through the prpserver.candidates file a few minutes ago, I was surprised to find that some of the lowest Sierp. base 6 numbers (they have the lowest k's of the bunch, so the server assigns work for those first when available) had two residuals collected! Of course, the residuals were identical, but since I had not set the server to assign doublechecks, this was quite surprising. So, I stopped the server, set the first-pass deadline back to 3 days, and it immediately went back to handing out first-pass Riesel base 3 work.

Rogue, do you know why it was handing out doublechecks even though I've got the doublecheck setting set to 0 in prpserver.ini?

Thanks,
Max
Old 2008-12-31, 02:01   #47
rogue
 
"Mark"
Apr 2003
Between here and the
5,717 Posts

Quote:
Originally Posted by mdettweiler View Post
Rogue, do you know why it was handing out doublechecks even though I've got the doublecheck setting set to 0 in prpserver.ini?
Do you know the order in which the two tests were sent to clients and then responded to? For example, if a test was sent to one client, but had no response, then sent to a second client, then both clients gave a response, that could cause that to happen.

You could run the server with debugging on for a couple of days to see if there are any more odd results like this.

A question I have is: "Is the client shutting down normally, or is there an issue causing it to terminate without sending a message to the server?"
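
The scenario described above (a test goes to one client, its deadline expires, it goes to a second client, and then both report) can be sketched as follows. This is only an illustration and not PRPNet's code: the `Candidate` struct, the function names and the assignment rule are all assumptions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified model of one candidate on the server.
struct Candidate {
    std::string name;
    int64_t lastSentAt = -1;            // time the test was last handed out, -1 if never
    std::vector<std::string> residues;  // residues returned so far
};

// Assumed rule: a candidate is handed out if it has no result yet and was
// either never sent or its deadline has expired.
bool shouldAssign(const Candidate &c, int64_t now, int64_t deadlineSeconds) {
    if (!c.residues.empty()) return false;
    if (c.lastSentAt < 0) return true;
    return now - c.lastSentAt >= deadlineSeconds;
}

// Hands the candidate out if the rule allows it; returns true when sent.
bool assign(Candidate &c, int64_t now, int64_t deadlineSeconds) {
    if (!shouldAssign(c, now, deadlineSeconds)) return false;
    c.lastSentAt = now;
    return true;
}

// Records a returned residue, no matter which client it came from.
void report(Candidate &c, const std::string &residue) {
    c.residues.push_back(residue);
}
```

With a 3 hour deadline, a second `assign` call four hours after the first succeeds, and if both clients later call `report`, the candidate ends up with two residues even though no doublecheck was ever requested.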
Old 2008-12-31, 02:43   #48
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
1100001101001₂ Posts

Quote:
Originally Posted by rogue View Post
Do you know the order in which the two tests were sent to clients and then responded to? For example, if a test was sent to one client, but had no response, then sent to a second client, then both clients gave a response, that could cause that to happen.

You could run the server with debugging on for a couple of days to see if there are any more odd results like this.

A question I have is: "Is the client shutting down normally, or is there an issue causing it to terminate without sending a message to the server?"
The tests were sent to a client, but some were lost on the client side (as will be described in more detail in a moment) and thus were abandoned on the server. They had been sitting like that for a number of hours.

Then, after the tests had been abandoned for about 10-12 hours, I went in and changed the first-pass deadline to 3 hours. The server then immediately got started on handing out the Sierp. base 6 work again--*all* of them, from the beginning, regardless of whether they had been abandoned, or had had a result returned. Thus there are many such candidates with more than one result listed in prpserver.candidates.

Once I noticed this, I changed the deadline back to 3 days (I have both firstpass and doublecheck deadlines set to 3 days, even though doublecheck is set to 0--I presume that means that it's turned off?). The server went back to handing out tests from the leading edge of the Riesel base 3 work that it had been handing out before I had changed the deadline to 3 hours earlier.

Now for why the clients keep abandoning all these candidates. What I've noticed is that, if a client is stopped while it's on the first candidate of its batch, it will immediately shut down, leaving the G3000.save and phrot.chkpt files in place as it should. When restarted, the client resumes from the in-progress test, as it should, and continues normally.

However, if there are results waiting around in G3000.save when the client is exited, it will stop, send any completed results to the server, then *delete the G3000.save file*, and exit. By deleting the G3000.save file, it essentially removes the client's memory of the remaining untested candidates, and thus they are abandoned.
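
One way to avoid losing the untested candidates would be to rewrite the save file with whatever is still pending, and delete it only when nothing is left. The sketch below is a minimal illustration under that assumption; `WorkUnit`, `remainingWork` and `saveOnExit` are hypothetical names, and the one-name-per-line layout is not the real G3000.save format.

```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical in-memory view of one entry of the client's save file.
struct WorkUnit {
    std::string name;   // e.g. "26375*6^125217+1"
    bool completed;     // true once a result exists and was sent to the server
};

// Keeps only the work units that still need testing.
std::vector<WorkUnit> remainingWork(const std::vector<WorkUnit> &units) {
    std::vector<WorkUnit> pending;
    for (const WorkUnit &u : units)
        if (!u.completed)
            pending.push_back(u);
    return pending;
}

// Called at shutdown after completed results were reported. Deletes the save
// file only when nothing is pending; otherwise rewrites it with what is left.
// Returns the number of work units preserved, or -1 on an I/O error.
int saveOnExit(const std::string &path, const std::vector<WorkUnit> &units) {
    std::vector<WorkUnit> pending = remainingWork(units);
    if (pending.empty()) {
        std::remove(path.c_str());
        return 0;
    }
    std::ofstream out(path.c_str(), std::ios::trunc);
    if (!out) return -1;
    for (const WorkUnit &u : pending)
        out << u.name << '\n';
    out.flush();
    if (!out) return -1;
    return static_cast<int>(pending.size());
}
```

On restart, the client would then find only the untested candidates in the file and could carry on with them instead of leaving them to expire on the server.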
Old 2008-12-31, 04:19   #49
rogue
 
"Mark"
Apr 2003
Between here and the
5,717 Posts

Quote:
Originally Posted by mdettweiler View Post
The tests were sent to a client, but some were lost on the client side (as will be described in more detail in a moment) and thus were abandoned on the server. They had been sitting like that for a number of hours.

Then, after the tests had been abandoned for about 10-12 hours, I went in and changed the first-pass deadline to 3 hours. The server then immediately got started on handing out the Sierp. base 6 work again--*all* of them, from the beginning, regardless of whether they had been abandoned, or had had a result returned. Thus there are many such candidates with more than one result listed in prpserver.candidates.

Once I noticed this, I changed the deadline back to 3 days (I have both firstpass and doublecheck deadlines set to 3 days, even though doublecheck is set to 0--I presume that means that it's turned off?). The server went back to handing out tests from the leading edge of the Riesel base 3 work that it had been handing out before I had changed the deadline to 3 hours earlier.

Now for why the clients keep abandoning all these candidates. What I've noticed is that, if a client is stopped while it's on the first candidate of its batch, it will immediately shut down, leaving the G3000.save and phrot.chkpt files in place as it should. When restarted, the client resumes from the in-progress test, as it should, and continues normally.

However, if there are results waiting around in G3000.save when the client is exited, it will stop, send any completed results to the server, then *delete the G3000.save file*, and exit. By deleting the G3000.save file, it essentially removes the client's memory of the remaining untested candidates, and thus they are abandoned.
You'll need to turn on debugging in order for me to determine why it thinks double-checking is necessary. BTW, was that an "O" or a "0"? It should be a "0" (zero). I do see an issue in the server code that doesn't validate the setting correctly, so if you have "O", then I could understand the behavior.
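
For illustration, a strict check that accepts "0" and rejects the letter "O" could look like the sketch below. It does not reproduce how the server actually reads prpserver.ini; `parseCount` is a hypothetical helper built on the standard `strtol`.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <string>

// Parses a non-negative integer setting strictly. Returns true and stores the
// value in 'out' only if the whole string consists of a decimal number that
// fits in a long; "O", "", " 0" and "1x" are all rejected and leave 'out'
// untouched.
bool parseCount(const std::string &text, long &out) {
    // Require a leading digit so that whitespace and signs are not accepted.
    if (text.empty() || !std::isdigit(static_cast<unsigned char>(text[0])))
        return false;
    errno = 0;
    char *end = nullptr;
    long value = std::strtol(text.c_str(), &end, 10);
    if (errno == ERANGE) return false;   // value does not fit in a long
    if (*end != '\0') return false;      // trailing characters such as "1x"
    out = value;
    return true;
}
```

A server could refuse to start, or at least log a warning, when such a check fails, rather than silently falling back to some default.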

I will modify the behavior where undone tests are lost. I presume that the client is stopped because you have terminated the process, not because it detected an error and terminated itself. If I am wrong on that I need to know.

Last fiddled with by rogue on 2008-12-31 at 04:22
Old 2008-12-31, 05:12   #50
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by rogue View Post
You'll need to turn on debugging in order for me to determine why it thinks double-checking is necessary. BTW, was that an "O" or a "0"? It should be a "0" (zero). I do see an issue in the server code that doesn't validate the setting correctly, so if you have "O", then I could understand the behavior.
I do know for a fact that it has a zero, not an O, in the ini file--I re-typed it there myself, if memory serves. I'll go turn on debugging, though, and try again with setting the deadline to 3 hours, and see how it turns out.
Quote:
I will modify the behavior where undone tests are lost. I presume that the client is stopped because you have terminated the process, not because it detected an error and terminated itself. If I am wrong on that I need to know.
Yes, you are correct--in all such instances, I'd stopped it with a Ctrl-C or pkill command (SIGINT or SIGTERM, respectively).
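
Since both Ctrl-C and pkill arrive as signals, a client can catch them and shut down in an orderly way. The sketch below shows the usual flag-based pattern with the standard `<csignal>` facilities; the function names and the placeholder loop are assumptions and not the PRPNet client's code.

```cpp
#include <csignal>

// Set by the handler, polled by the main loop. sig_atomic_t is the integer
// type that can safely be written from a signal handler.
volatile std::sig_atomic_t g_stopRequested = 0;

extern "C" void onStopSignal(int) {
    g_stopRequested = 1;
}

// Installs the same handler for Ctrl-C (SIGINT) and pkill's default (SIGTERM).
void installStopHandlers() {
    std::signal(SIGINT, onStopSignal);
    std::signal(SIGTERM, onStopSignal);
}

// Hypothetical main loop: runs up to 'maxSteps' units of work and stops early
// once a stop was requested. Returns the number of steps completed.
int runUntilStopped(int maxSteps) {
    int done = 0;
    while (done < maxSteps && !g_stopRequested) {
        ++done;   // placeholder for one slice of a PRP test
    }
    return done;
}
```

The handler only sets a flag; any file writing or talking to the server happens afterwards in normal code, once the loop has noticed the flag.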
Old 2008-12-31, 06:12   #51
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
3·2,083 Posts

Okay, I turned on debugging, set the deadline to 3 hours, and here's what I got:
Code:
[2008-12-31 06:04:32 GMT] socket 4 <<<< FROM bugmesticky@googlemail.com Core2Duo

[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com connecting from 74.37.226.253
[2008-12-31 06:04:32 GMT] socket 4 <<<< GETWORK 1.0.0 10

[2008-12-31 06:04:32 GMT] socket 4 >>>> ServerVersion: 1.0.0
[2008-12-31 06:04:32 GMT] First check candidate 0, 26375*6^125217+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125217+1 1230703472 26375 6 125217 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125217+1
[2008-12-31 06:04:32 GMT] First check candidate 1, 26375*6^125221+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125221+1 1230703472 26375 6 125221 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125221+1
[2008-12-31 06:04:32 GMT] First check candidate 2, 26375*6^125301+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125301+1 1230703472 26375 6 125301 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125301+1
[2008-12-31 06:04:32 GMT] First check candidate 3, 26375*6^125325+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125325+1 1230703472 26375 6 125325 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125325+1
[2008-12-31 06:04:32 GMT] First check candidate 4, 26375*6^125341+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125341+1 1230703472 26375 6 125341 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125341+1
[2008-12-31 06:04:32 GMT] First check candidate 5, 26375*6^125397+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125397+1 1230703472 26375 6 125397 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125397+1
[2008-12-31 06:04:32 GMT] First check candidate 6, 26375*6^125545+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125545+1 1230703472 26375 6 125545 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125545+1
[2008-12-31 06:04:32 GMT] First check candidate 7, 26375*6^125565+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125565+1 1230703472 26375 6 125565 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125565+1
[2008-12-31 06:04:32 GMT] First check candidate 8, 26375*6^125629+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125629+1 1230703472 26375 6 125629 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125629+1
[2008-12-31 06:04:32 GMT] First check candidate 9, 26375*6^125637+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> WorkUnit: 26375*6^125637+1 1230703472 26375 6 125637 1
[2008-12-31 06:04:32 GMT] bugmesticky@googlemail.com (Core2Duo) at 74.37.226.253: Sent 26375*6^125637+1
[2008-12-31 06:04:32 GMT] socket 4 >>>> End of Message
[2008-12-31 06:04:33 GMT] socket 4 <<<< GETGREETING

[2008-12-31 06:04:33 GMT] socket 4 >>>> ############
[2008-12-31 06:04:33 GMT] socket 4 >>>> Welcome to the CRUS G3000 PRPnet beta test server! :-D
[2008-12-31 06:04:33 GMT] socket 4 >>>> Server is running PRPnet v1.0.1
[2008-12-31 06:04:33 GMT] socket 4 >>>> ############
[2008-12-31 06:04:33 GMT] socket 4 >>>> OK.
[2008-12-31 06:04:33 GMT] socket 4 <<<< QUIT

[2008-12-31 06:04:33 GMT] closing socket 4
Sounds normal, right? Well, it would be, except that these 10 candidates it just handed out are the first 10 candidates from the lowest Sierp. base 6 k--and all have already had TWO residuals returned!

Maybe the server is forgetting which candidates have already been returned when it is stopped and restarted? Methinks it might work a little better if the server marked the finished numbers as "inactive" and dumped them to a "finished" file, sort of like it does when you find a PRP with the sierpinskiriesel=1 option set. In fact, having it handle finished candidates like this might even make processing the results a *lot* easier.

Edit: Meanwhile, I've put the server back on the Riesel base 3 numbers so that we're not throwing our CPU time out the window on unnecessary triple-checks.

Last fiddled with by mdettweiler on 2008-12-31 at 06:14
Old 2008-12-31, 06:27   #52
gd_barnes
 
May 2007
Kansas; USA
23623₈ Posts

Max,

Thanks for all of the excellent testing here! It's good to get the small issues weeded out in the Beta process.

Since we're using "production" files from actual drives here, I'll leave it up to you to balance the k/n pair results returned by the server against the original sieve file. Also, spot-double-checking some of the residuals might be a good idea.


Gary
Old 2008-12-31, 06:34   #53
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by gd_barnes View Post
Max,

Thanks for all of the excellent testing here! It's good to get the small issues weeded out in the Beta process.

Since we're using "production" files from actual drives here, I'll leave it up to you to balance the k/n pair results returned by the server against the original sieve file. Also, spot-double-checking some of the residuals might be a good idea.


Gary
Yep, definitely--I'll be sure that the results are of at least as good quality as those we'd get from an LLRnet server before I send 'em to you.

Specifically, as for the Sierp. base 6 numbers, those are just going to be, in turn, submitted back to the IB6 LLRnet server, so they'll end up being balanced with the original sieve file anyway when I process the results from that server.
Old 2008-12-31, 13:54   #54
rogue
 
"Mark"
Apr 2003
Between here and the
1655₁₆ Posts

Quote:
Originally Posted by mdettweiler View Post
Maybe the server is forgetting which candidates have already been returned when it is stopped and restarted? Methinks it might work a little better if the server marked the finished numbers as "inactive" and dumped them to a "finished" file, sort of like it does when you find a PRP with the sierpinskiriesel=1 option set. In fact, having it handle finished candidates like this might even make processing the results a *lot* easier.

Edit: Meanwhile, I've put the server back on the Riesel base 3 numbers so that we're not throwing our CPU time out the window on unnecessary triple-checks.
I found the problem. In Candidates.cpp the i_TestsPerformed variable was not getting set correctly when the server was restarted. Replace the AddTest function with this:

Code:
int32_t   Candidate::AddTest(uint64_t testID, char *program, char *residue, char *emailID, char *machineID, uint32_t logTest)
{
   test_t   *tPtr;
   char      theMessage[BUFFER_SIZE];
   Log      *prpLog;

   if (!m_Test)
   {
      m_Test = new test_t;
      tPtr = m_Test;
   }
   else
   {
      tPtr = m_Test;
      while (tPtr)
      {
         // We already know about this test, so ignore this result
         if (testID == tPtr->l_TestID &&
             !strcmp(emailID, tPtr->s_EmailID) &&
             !strcmp(machineID, tPtr->s_ClientID))
            return RC_FAILURE;

         // We have two tests with the same residue, thus no more double-checking is needed
         if (strcmp(tPtr->s_Residue, "inprogress") && !strcmp(residue, tPtr->s_Residue))
            b_NeedsDoubleCheck = 0;

         if (!tPtr->m_Next)
            break;
         tPtr = (test_t *) tPtr->m_Next;
      }

      tPtr->m_Next = new test_t;
      tPtr = (test_t *) tPtr->m_Next;
   }

   tPtr->l_TestID = testID;
   strcpy(tPtr->s_Program, program);
   strcpy(tPtr->s_Residue, residue);
   strcpy(tPtr->s_EmailID, emailID);
   strcpy(tPtr->s_ClientID, machineID);
   tPtr->m_Next = 0;

   if (strcmp(tPtr->s_Residue, "inprogress"))
      i_TestsPerformed++;

   if (!strcmp(tPtr->s_Residue, "PRP") || !strcmp(tPtr->s_Residue, "PRIME"))
   {
      b_IsPRP = 1;
      b_IsActive = 0;

      if (logTest)
      {
         if (!strcmp(tPtr->s_Residue, "PRP"))
            sprintf(theMessage, "%s: PRP returned by %s (%s) using %s!", s_Name, emailID, machineID, program);
         else
            sprintf(theMessage, "%s: Prime returned by %s (%s) using %s!", s_Name, emailID, machineID, program);

         prpLog = new Log(0, "PRP.log", 0, NULL);
         prpLog->LogMessage(theMessage);
         delete prpLog;
      }
   }

   return RC_SUCCESS;
}
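
As a standalone illustration of why that counter matters after a restart, the sketch below recounts completed tests from reloaded records, using the same comparison against "inprogress" as the function above. The `LoadedTest` struct and the `needsFirstPass` rule are assumptions for the example and not part of the PRPNet source.

```cpp
#include <cstring>
#include <vector>

// Hypothetical record for one test read back from the candidates file.
struct LoadedTest {
    const char *residue;   // "inprogress" while a client still holds the test
};

// Counts tests that actually returned a result, skipping any that are still
// marked "inprogress".
int countCompletedTests(const std::vector<LoadedTest> &tests) {
    int completed = 0;
    for (const LoadedTest &t : tests)
        if (std::strcmp(t.residue, "inprogress") != 0)
            ++completed;
    return completed;
}

// Assumed rule for illustration: with double-checking off, a candidate needs
// a first-pass test only while no completed test is on record.
bool needsFirstPass(const std::vector<LoadedTest> &tests) {
    return countCompletedTests(tests) == 0;
}
```

If the count stays at zero after a restart even though residues are on file, a rule like `needsFirstPass` would hand the same candidates out again, which matches the behavior seen in the debug log earlier in the thread.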
Old 2008-12-31, 17:28   #55
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by rogue View Post
I found the problem. In Candidates.cpp the i_TestsPerformed variable was not getting set correctly when the server was restarted. Replace the AddTest function with this:

Thanks! I'll get the fix compiled and loaded into the server shortly. And then...back to a 3-hour deadline to finally clean out some of those pesky Sierp. base 6 numbers that we've got hanging around!

Last fiddled with by mdettweiler on 2008-12-31 at 17:29

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.