mersenneforum.org  

2010-05-03, 05:15   #78
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by gd_barnes View Post
I only run one Windows machine against the servers, my I7, and it doesn't have the latest version of the Windows client on it that corrected the problem with pairs not being sent that had completed during the time in which there was a server/power outage. (Fortunately there has been no outage of late.) There is no way for you to test it? I'm knee deep in CRUS updates tonight and Monday. After getting back from my trip and having my kids for 3 days, this is the 1st extended period that I have to work on them in about 2 weeks.

I think that Karsten would be the better one to test it.

As for the Linux client, I long ago extensively verified it. I also subsequently cancelled many pairs on my end when I changed servers on many of my machines for the rally. I then spot checked about 10 of them and they were definitely showing up properly as cancelled in the joblist.txt file.

I do know one thing: There was an extreme amount of testing needed to fix the problem with it not losing completed pairs during an outage and it also involved the cancellation process. I'm sure that Karsten tested the heck out of his changes but it wouldn't surprise me that there was some lone scenario that might have been missed. I twice thought I was done and everything worked before I thought of one final scenario that turned out not to work and required more changes and testing. I spent 5-6 hours on the Linux client getting it right for the "dropped completed pairs during a server outage/disconnect" issue. The thing that took so long is that everything was intertwined. When I fixed those final 2 issues that I had, I had to retest all prior scenarios because I had already made at least one fix where I broke something else in my testing.


Gary
Yeah, you're right, I suppose I could do some more testing myself...I was originally thinking of you since, unlike me, you wouldn't be running the script through Cygwin, but then I remembered that, duh, there's nothing keeping me from simply not running it through Cygwin for testing purposes. Okay, I'll see what I can do.

2010-05-03, 06:22   #79
kar_bon
Mar 2006
Germany
101100001001₂ Posts

Please remember: LLRnet only updates the joblist when the server receives new results from clients! The joblist.txt contains something like this:
Code:
-- update [2010-04-16 15:54:24]
...
So it's the normal behaviour that after returning results, the joblist contains more than one entry for the same pair. Same with cancelling a pair.

The joblist will be cleared when the prunetime is over. I personally set this timing to a few hours or less (30 min), depending on the n-level of the pairs.

As Max's output of the history shows the cancellation to be correct, wait until the prunetime is over and all should be OK then.

Another thing to keep in mind: if the server does not receive any request from a client, things like pruning pairs will never happen! The server runs as a job waiting for requests from clients and does not do things on a timer.
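
A minimal sketch of that request-driven behaviour, written in Python rather than LLRnet's own Lua. The class, the entry layout and the pruning rule are assumptions made for illustration only; the real joblist.txt format is not shown in this thread. The point is that housekeeping runs inside the request handler, so without client requests nothing is ever pruned.

```python
from dataclasses import dataclass, field


@dataclass
class JobServer:
    """Toy model: housekeeping only happens while a client request is handled."""

    prune_after: float  # hypothetical "prunetime" in seconds
    # Append-only journal of (timestamp, pair, status); a pair may appear several times.
    journal: list = field(default_factory=list)

    def handle_request(self, now: float, pair: str, status: str) -> None:
        # Each request appends an update entry for the pair ...
        self.journal.append((now, pair, status))
        # ... and is also the only place where pruning is triggered (no timer).
        self._prune(now)

    def _prune(self, now: float) -> None:
        # Assumed rule: drop entries that are older than prune_after AND have
        # been superseded by a newer entry for the same pair.
        latest = {pair: ts for ts, pair, _ in self.journal}
        self.journal = [
            (ts, pair, status)
            for ts, pair, status in self.journal
            if ts == latest[pair] or now - ts < self.prune_after
        ]


server = JobServer(prune_after=30 * 60)
server.handle_request(0, "447/770657", "reserved")
server.handle_request(10, "447/770657", "cancelled")
# Two entries for the same pair exist now; that is normal until the
# prunetime has passed and another request arrives.
server.handle_request(4000, "other pair", "reserved")  # made-up second pair
```

After the last call, only the newest entry for 447/770657 and the entry for the second pair remain in the journal. If that last request never arrived, the two older entries would stay there indefinitely.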

Suggestion:
Max, wait until the prunetime is over, then look at the joblist again.
If this is an error, I'll have a look at it.

2010-05-03, 16:00   #80
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

But, I thought the server usually adds an "update" entry to the joblist for canceled pairs as well? (Otherwise it risks forgetting the cancellation in case of a power failure, etc.)

Edit: BTW, I just checked the server and the pair in question, 447/770657, is still listed as reserved by me in joblist.txt despite the server having surely pruned a number of times since last night.

Last fiddled with by mdettweiler on 2010-05-03 at 16:02

2010-05-03, 22:15   #81
kar_bon
Mar 2006
Germany
2825₁₀ Posts

I've found the issue in the cancellation of kn-pairs with the do.bat script:

The relevant part, where the script cancels pairs, looks like this:
Code:
:jobcancel3
llrnet -c
for %%F in (workfile.txt) do set wsize=%%~zF
if %wsize% GTR 5 goto jobcancel3
rem llrnet -c >nul
rem echo All jobs cancelled!
del tosend.txt, workfile.txt, workfile.bak, llr.ini, z*
goto bye
Remove the remark code that was highlighted in red in the original post (the 'rem' in front of 'llrnet -c >nul'), so that 'llrnet -c' is called again.
This fixes Max's problem.
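
The control flow of the corrected cancel section can be sketched in Python as follows. This is only an illustration of the loop above, not a replacement for do.bat: `cancel_once` is a hypothetical stand-in for running `llrnet -c`, and the `max_calls` guard is an addition that the batch script does not have.

```python
import os
from typing import Callable

EMPTY_WORKFILE_BYTES = 5  # same threshold as `if %wsize% GTR 5` in the script


def cancel_all(
    cancel_once: Callable[[], None],
    workfile: str = "workfile.txt",
    max_calls: int = 1000,  # safety guard, not part of the original script
) -> int:
    """Cancel until the workfile is (nearly) empty, then cancel once more.

    Returns the total number of times cancel_once was called.
    """
    calls = 0
    while True:
        cancel_once()
        calls += 1
        size = os.path.getsize(workfile) if os.path.exists(workfile) else 0
        if size <= EMPTY_WORKFILE_BYTES:
            break
        if calls >= max_calls:
            raise RuntimeError("workfile did not shrink; giving up")
    # The step that was remarked out in the script: one final call after the loop.
    cancel_once()
    return calls + 1
```

With a workfile holding three pairs and a `cancel_once` that removes one pair per call, the function makes four calls: three inside the loop and the final one that the fix reinstates.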

Note:
I'll update the Windows version in the next few days with this fix and enhancements like:

- no GUI code (not needed anymore)
- WUCacheSize = 0 still reserves one pair -> it must be cancelled with 'do -c'
- cancel all pairs with one call of 'llrnet -c' -> for now, 'llrnet -c' is called once for every pair
- delete 'serviceName' from llr-clientconfig.txt (not needed)

2010-05-03, 23:53   #82
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

That worked--thanks!

2010-05-04, 00:16   #83
gd_barnes
May 2007
Kansas; USA
2·5,077 Posts

OK, then I should probably do mostly the same for the Linux client. Hmm. Karsten, that's going to be a lot of work to make the llrnet -c command do that. Is that necessary? After all, do -c does what is needed. Won't more files than just the script need to be modified for the llrnet -c change?

When you're done with your changes, can you be specific about where you changed them? Then I'll tweak the Linux client. Here is what I'm thinking:

1. No GUI-code: Not a script change; just removal of stuff from other client files. Can you state which files will change?
2. WuCacheSize: A script change only to get 1 pair if WuCacheSize is 0 or 1.
3. Cancel all pairs with llrnet -c: Both a script change and possible changes to other client files. As asked above, do we really need to do this? If other client files need to be changed, do you know which ones?
4. Obvious quick change to delete a line in llr-clientconfig.txt.


Can you let me know if the above is approximately correct? I want to get an idea ahead of time of what I'll be looking at changing on the Linux client.


Thanks,
Gary

Last fiddled with by gd_barnes on 2010-05-04 at 00:24

2010-05-04, 00:20   #84
gd_barnes
May 2007
Kansas; USA
2×5,077 Posts

Quote:
Originally Posted by mdettweiler View Post
That worked--thanks!
Great, glad to see that that works. I'll tell you, that is a very difficult bug to find in testing. I had all kinds of issues like that when correcting the Linux client for the dropping of completed pairs during a server outage/downtime. The fixes would "appear" to work but then the appropriate update wouldn't get written to joblist.txt. That was a very tricky fix; both for the normal pair returns and for the pair cancellation.

Max, question for you: Does PRPnet have a way to cancel pairs? When I've run it in the past, I haven't been sure how to return pairs to the server. It seems to return pairs immediately that have not been started but if it's in the middle of processing one, of course (as it should), it does not return it. I'd like to be able to return a pair to the server that is already partially worked on. Is there a command for that?


Gary

Last fiddled with by gd_barnes on 2010-05-04 at 00:23

2010-05-04, 00:27   #85
kar_bon
Mar 2006
Germany
5²×113 Posts

Quote:
Originally Posted by gd_barnes View Post
When you're done with your changes, can you be specific about where you changed them? Then I'll tweak the Linux client. Here is what I'm thinking:

1. No GUI-code: Not a script change; just removal of stuff from other client files. Can you state which files will change?
2. WuCacheSize: A script change only to get 1 pair if WuCacheSize is 0 or 1.
3. Cancel all pairs with llrnet -c: Both a script change and possible changes to other client files. As asked above, do we really need to do this?
4. Obvious quick change to delete a line in llr-clientconfig.txt.

Can you let me know if the above is approximately correct? I want to get an idea ahead of time of what I'll be looking at changing on the Linux client.
1. I did this weeks ago for a version without GUI support. I have to check the diffs, but I'm sure I only changed 'llrnet.lua' and deleted 'gui.lua'!
2. A one-liner in 'llrnet.lua'.
3. Small change in 'llrnet.lua': instead of 'if' use 'while'! The call from the 'do'-script is easier then!
4. Correct! So 'llr-clientconfig.txt' only needs 4 parameters for server, port, username and WUCacheSize. Easier for users, especially new users to set up a client!

Hope this answers some questions.
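
The 'if' versus 'while' change in point 3 can be illustrated as follows. This is a Python stand-in, not the actual 'llrnet.lua' code (which is not shown in this thread); the queue of pairs and the `notify` callback are hypothetical.

```python
from collections import deque
from typing import Callable, Deque


def cancel_one(pairs: Deque[str], notify: Callable[[str], None]) -> None:
    # 'if': at most one pair is cancelled per call, so the calling
    # script has to invoke this repeatedly.
    if pairs:
        notify(pairs.popleft())


def cancel_everything(pairs: Deque[str], notify: Callable[[str], None]) -> None:
    # 'while': a single call drains the whole queue.
    while pairs:
        notify(pairs.popleft())
```

With the second form, the calling script only needs one invocation, which is why the call from the 'do' script becomes simpler.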

2010-05-04, 00:33   #86
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts

Quote:
Originally Posted by gd_barnes View Post
Max, question for you: Does PRPnet have a way to cancel pairs? When I've run it in the past, I haven't been sure how to return pairs to the server. It seems to return pairs immediately that have not been started but if it's in the middle of processing one, of course (as it should), it does not return it. I'd like to be able to return a pair to the server that is already partially worked on. Is there a command for that?
Well, the process is a little different than in LLRnet, though yes, the same end results can be achieved. Per this section in prpclient.ini:
Code:
// This option is used to default the startup option if the previous
// shutdown left uncompleted workunits.
//    0 - prompt
//    1 - Return completed work units and abandon the rest
//    2 - Complete assigned work units
startoption=2
 
// This option is used to default the stop option when the client
// is terminated
//    0 - prompt
//    1 - Return completed work units and abandon the rest
//    2 - Return completed work units
//    3 - Do nothing with current work units and terminate the process
stopoption=2
By default, both are set to 0, which means that when the client is stopped or started and there are work units in queue, it will ask you what to do with them. Alternatively, you can provide the answers to those queries in the ini file, as I've done above. Essentially, startoption=2 and stopoption=3 mimic do.bat/pl's normal operative behavior, i.e. running without any flags. (stopoption=2 differs only in that when shutting down, it reports anything already completed in the current batch.)

Setting startoption=1 is equivalent to running do.bat/pl with the -c flag. Setting stopoption=1 is akin to running do.bat/pl without any flags, then after you stop it with Ctrl-C, immediately running -c.
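
As a rough illustration, the option values can be read and described with a few lines of Python. This is not PRPnet's own parser; the function names are made up, and the description strings are copied from the comments in the ini excerpt above. The default of 0 follows the statement that both options default to "prompt".

```python
START_OPTIONS = {
    0: "prompt",
    1: "Return completed work units and abandon the rest",
    2: "Complete assigned work units",
}
STOP_OPTIONS = {
    0: "prompt",
    1: "Return completed work units and abandon the rest",
    2: "Return completed work units",
    3: "Do nothing with current work units and terminate the process",
}


def read_options(text: str) -> dict:
    """Pick startoption/stopoption out of ini text, skipping // comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("startoption", "stopoption"):
            options[key.strip()] = int(value)
    return options


def describe(options: dict) -> str:
    start = options.get("startoption", 0)  # 0 = prompt
    stop = options.get("stopoption", 0)
    return f"on start: {START_OPTIONS[start]}; on stop: {STOP_OPTIONS[stop]}"
```

For the settings shown above, `describe(read_options(...))` reports that assigned work units are completed on start and that completed work units are returned on stop.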

2010-05-04, 06:13   #87
gd_barnes
May 2007
Kansas; USA
10011110101010₂ Posts

Quote:
Originally Posted by kar_bon View Post
1. I did this weeks ago for a version without GUI support. I have to check the diffs, but I'm sure I only changed 'llrnet.lua' and deleted 'gui.lua'!
2. A one-liner in 'llrnet.lua'.
3. Small change in 'llrnet.lua': instead of 'if' use 'while'! The call from the 'do'-script is easier then!
4. Correct! So 'llr-clientconfig.txt' only needs 4 parameters for server, port, username and WUCacheSize. Easier for users, especially new users to set up a client!

Hope this answers some questions.
OK, here is what I'll do for the Linux client:

For #1 and #3, can you check the Linux client and see if those changes are already there? I don't know if they would be or not. I can't remember if we tried to sync up all of the client files before releasing the 1st versions of our scripts. If they are not synced up, it will be just a matter of me copying over your llrnet.lua file and deleting the gui.lua file from the Linux script and then running a quick test on it.

For #2, since there are no script changes, I'll just copy over your llrnet.lua file into the Linux client after you've made your changes to it.

For #4, as stated, a quick change to bring it down to only the 4 parameters.

Good, this should all go fairly quickly. Let me know when you've made the fixes and I'll get everything copied over to the Linux client.

There is one extremely minor fix that I will do for the Linux do.pl script. If no pairs are returned as completed when doing ./do.pl -c, the workfile.res is not deleted like it should be. I'll make that minor fix and update the do.pl version #. It will be good to have a new version # since we'll be making changes to several of the client files for the various items above. This way, it will be easy for people to know that they have an updated version of the Linux as well as the Windows client.

One more thing that you might test: I seem to remember that the workfile.res file is not deleted on the Windows client either whenever do -c is executed but there are no completed pairs in lresults.txt. That might have been fixed with the latest Windows client but I'm not sure. Can you check that?


Gary

Last fiddled with by gd_barnes on 2010-05-04 at 06:16

2010-05-04, 06:20   #88
gd_barnes
May 2007
Kansas; USA
23652₈ Posts

OK, I still need a clarification because I remember reading that before and still couldn't quite get it. If I'm merrily running along with stopoption=2 and then stop the client with Ctrl-C, obviously I know that it returns all completed results. What I don't know is if it properly cancels all remaining in-process and unprocessed pairs and lets the server know that they are no longer reserved.

To make the question clearer: If I'm running with stopoption=2 and then hit Ctrl-C, do I then still need to do a final run with STARToption=1 to return the in-process/untested pair(s) to the server so that they can be immediately handed back out for testing? If that is the case, then that is far from clear by reading the above.

To make it clearer: If startoption (or stopoption) 1 returns incomplete pairs to the server -or- if it somehow "tells" the server that the pairs are no longer reserved, then "abandon the rest" is very misleading. That needs to be reworded to show something like "cancel the rest" or "release the rest". The word "abandon" is effectively what people do when they don't do the "do -c" command on the LLRnet client. Those pairs are abandoned and so the server has to wait 2 days to hand them back out again. We do not want to "abandon" them, we want to return them to the server or tell the server that they are no longer reserved.
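
The difference between "abandoning" and "releasing" a pair can be sketched like this. It is a toy model in Python, not LLRnet or PRPnet server code; the class and method names are hypothetical, and the only figure taken from the post is the 2-day wait.

```python
from dataclasses import dataclass, field

RESERVATION_TIMEOUT = 2 * 24 * 3600  # the "2 days" mentioned above, in seconds


@dataclass
class Reservations:
    # pair -> time (in seconds) at which it was handed out
    held: dict = field(default_factory=dict)

    def reserve(self, pair: str, now: float) -> None:
        self.held[pair] = now

    def release(self, pair: str) -> None:
        """The client explicitly gives the pair back (what 'do -c' achieves)."""
        self.held.pop(pair, None)

    def is_available(self, pair: str, now: float) -> bool:
        started = self.held.get(pair)
        if started is None:
            return True  # never reserved, or explicitly released
        # An abandoned pair only frees up once the reservation times out.
        return now - started >= RESERVATION_TIMEOUT
```

A released pair is available again immediately, while a pair that was simply left behind stays blocked until the timeout has passed.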


Gary

Last fiddled with by gd_barnes on 2010-05-04 at 06:33
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.