mersenneforum.org (https://www.mersenneforum.org/index.php)
-   No Prime Left Behind (https://www.mersenneforum.org/forumdisplay.php?f=82)
-   -   LLRnet supports LLR V3.8! (LLRnet2010 V0.73L) (https://www.mersenneforum.org/showthread.php?t=13165)

kar_bon 2010-03-12 22:57

[CENTER][SIZE=+2]LLRnet supports LLR V3.8 and higher! (LLRnet2010 V0.73L)[/SIZE][/CENTER]


LLRnet is a Client/Server program to search for primes by using LLR.

LLRnet was programmed in 2004-2005 by Vincent Penné in Lua. Internally it uses LLR V3.5, written by Jean Penné.

The latest LLR version available is 3.8.1, which is 10%-20% faster than V3.7.1c, offers more possibilities for testing different values, and has fewer issues at small n-values.

Vincent does not support a newer version of LLRnet, so the idea is to make the two (LLRnet and LLR) work together through a script, keeping LLRnet's client/server communication while gaining the speed of the new LLR V3.8.1!

There are 2 versions of this script available:

[B]A Perl-script for Linux 32-bit [/B][URL="http://www.noprimeleftbehind.net/downloads/llrnet-script-perl-0.74-linux32.zip"]here[/URL] (V0.74 w/LLR 3.8.2).

[B]A DOS-script for Windows 32-bit[/B] [URL="http://www.rieselprime.de/dl/LLRnetV073.zip"]here[/URL] (V0.73 w/LLR 3.8.1).

Older versions:
[B]A Perl-script for Linux 32-bit [/B][URL="http://www.rieselprime.de/dl/llrnet-script-perl-0.71-linux32.zip"][COLOR=#000080]here[/COLOR][/URL] (V0.71 w/LLR 3.8.0).
[B]A DOS-script for Windows 32-bit[/B] [URL="http://www.rieselprime.de/dl/LLRnet_new.zip"]here[/URL] or [URL="http://noprimeleftbehind.net/downloads/llrnet-script-batch-0.72-win32.zip"]here[/URL] (V0.72 w/LLR 3.8.0).

Please read the README for more information about settings and handling.

[B]Many thanks to:[/B]
- Vincent Penné for the LLRnet version
- Jean Penné for the current LLR V3.8.0 (and further development).
- Max Dettweiler and Gary Barnes for the conversion of the script in Perl and testing
- Ian Gunn for testing.

This script was written over the last 5 weeks and has been tested in many scenarios.
If there are any issues or suggestions about this script, please post here.

[B]Notes:[/B]
- There were no changes on the server side of LLRnet; only the client was changed.
Please use all files included in the downloads so that everything works properly!

- To use a newer version of LLR, copy the latest cLLR (Win/Linux) into the client folder.

- The Win-version contains a small example for running a local LLRnet server.


[B]Screenshot and batches:[/B]

Here is a screenshot of 4 clients running under Windows, together with 2 batch files that show information for:

- Pairs done overall
- Primes found overall
- Currently reserved pairs
- Setting of the work units cache
- Current line cLLR is working on

There's also a batch file that starts all 4 clients by calling "do.bat" and gives each DOS box its own title.

The batches can be downloaded [URL="http://www.rieselprime.de/dl/LLRnet2010_Batches.zip"]here[/URL].


Happy hunting with a new Dimension in Prime Searching with LLRnet!

K.Bonath

em99010pepe 2010-03-13 09:28

I'm getting CRC errors when unpacking the zip. Could you please check it? I will try installing WinRAR to see if the issue is with IZArc.

EDIT: It was a download error, got it now. Sorry.

kar_bon 2010-03-13 09:31

I've downloaded all 3 zips and everything is OK!


AND extracted them all without error!

PCZ 2010-03-13 21:27

I've given this a try out using the dos script.
Working OK so far.

Noticed when running bases other than 2 that the results file in llrnet server always reports base 2.
The hist files in the client directories report the base correctly.

Good Job

kar_bon 2010-03-13 21:30

[QUOTE=PCZ;208296]I've given this a try out using the dos script.
Working OK so far.

Noticed when running bases other than 2 that the results file in llrnet server always reports base 2.
The hist files in the client directories report the base correctly.

Good Job[/QUOTE]

edit the line
[code]
displayFormat = "%s*2^%s-1"
[/code]

in the 'llr-serverconfig.txt' (folder 'LLRnet_server') and substitute the '2' with the base you're testing!
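
To illustrate why the server log keeps showing base 2, here is a minimal sketch in Python. It assumes that displayFormat is a printf-style template into which only k and n are substituted, so the base is whatever literal sits in the template. The function name is hypothetical, and this is not LLRnet's actual Lua code.

```python
# Hypothetical illustration only: LLRnet itself is written in Lua.
# Assumption: displayFormat is a printf-style template filled with k and n.
def display_pair(display_format: str, k: str, n: str) -> str:
    # Only k and n are substituted; the base is a literal inside the template.
    return display_format % (k, n)


base2_format = "%s*2^%s-1"  # the line as shipped in llr-serverconfig.txt
base5_format = "%s*5^%s-1"  # the same line edited for a base-5 search

print(display_pair(base2_format, "2221", "548899"))
print(display_pair(base5_format, "2221", "548899"))
```

Because the base never reaches the template as a parameter, the results file shows the right base only when the config line is edited by hand.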

mdettweiler 2010-03-13 21:31

[quote=kar_bon;208297]edit the line
[code]
displayFormat = "%s*2^%s-1"
[/code]

in the 'llr-serverconfig.txt' and substitute the '2' with the base you're testing![/quote]
Yes, this happens regardless of what client you're using if you have displayFormat set incorrectly.

PCZ 2010-03-13 22:11

Thanks that fixed it.

gd_barnes 2010-03-14 09:22

I believe in giving credit where it is due: This was Karsten's original idea so it was up to him how to present it here. Max was instrumental in converting the Windows script to the Linux script in very quick order. I did extensive detailed testing on the Linux side and made several modifications to the Linux script.

Many thanks to both Karsten and Max for their extensive efforts on this and picking up the slack after I had to leave on a business trip a little over a week ago.

One note to the public: While this has been extensively "alpha" tested, it should still be considered to be in a public "beta" testing phase. By that I mean, to fully prove itself, it needs to have 100s of cores running it.

There is one thing that I wanted to bring up. I know that many people, including myself, don't always read the documentation. There is one thing that is somewhat important to understand with the new LLRnet. It tests "in batch", meaning it will not return any pairs to the server until it is completely done with its cache. For that reason, whenever you stop one of your cores, please be sure and run:

llrnet -c

What that will do is send already-tested results to the server and return the untested pairs at the same time. The nice improvement that we made is that you only have to do it once to return all pairs; both completed and uncompleted. (This was one of the more difficult things to make work correctly during testing.)

As an example, suppose you cached 5 pairs and decide you want to stop after completing testing on 3 of them. The above command will return the 3 completed results PLUS the 2 remaining untested pairs to the server.
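
The 5-pairs example can be sketched in Python as a simple partition of the cache. This is only an illustration of the idea, not the script's real logic: the function name and the (k, n) values are made up.

```python
# Hypothetical sketch of the cancel behaviour described above.
# Pairs are (k, n) tuples; the values below are illustrative only.
def split_cache(cached_pairs, completed_pairs):
    """Partition a cached batch into results to submit and pairs to hand back."""
    completed = set(completed_pairs)
    done = [pair for pair in cached_pairs if pair in completed]
    untested = [pair for pair in cached_pairs if pair not in completed]
    return done, untested


cache = [(2001, 500001), (2003, 500001), (2005, 500001),
         (2007, 500001), (2009, 500001)]
finished = cache[:3]  # testing was stopped after 3 of the 5 pairs

to_submit, to_return = split_cache(cache, finished)
print(len(to_submit), "results submitted,", len(to_return), "pairs returned untested")
```

One call covers both groups, which is the "only have to do it once" improvement mentioned above.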

This is important because if you don't do that, not only will you miss credit for your completed results, you could have a prime in those results and miss getting credit for a prime!

Sometime in late March or early April, we will plan to have our first rally in quite some time and of course we plan to use the new client. In the week prior to that, perhaps I can combine with another heavy hitter such as Ian or Lennart to put at least 100 cores on it in preparation for the upcoming rally.

Everyone have fun with it and if you find anything, please let us know. I know firsthand that it is a whole lot faster than the "old klunky" LLRnet and I can personally attest to the fact that it works on at least 35 cores running at once.


Gary

kar_bon 2010-03-14 14:11

I've just become aware of an issue in the Win-batch V0.70:

Scenario:
- WUCacheSize is set to a value greater than 1 (for example 5)
- the script is stopped while LLR is testing, after a prime has already been found (for example, the 2nd pair was prime and the script was stopped after 3 pairs were done)
- the undone jobs are cancelled by calling 'do -c'

-> the prime was submitted to the server but not logged in the local file 'primes.txt'

I've updated the ZIP-file in the first post and attached the new script here.
Please rename the TXT-file to 'do.bat' and place it in the client-folder(s).

[b]Note on 2010-03-30: With the new version the attachment is obsolete and was deleted here! Please refer to the link in post #1! (kar_bon)[/b]

gd_barnes 2010-03-14 19:51

[quote=kar_bon;208363]I've just become aware of an issue in the Win-batch V0.70:

Scenario:
- WUCacheSize is set to a value greater than 1 (for example 5)
- the script is stopped while LLR is testing, after a prime has already been found (for example, the 2nd pair was prime and the script was stopped after 3 pairs were done)
- the undone jobs are cancelled by calling 'do -c'

-> the prime was submitted to the server but not logged in the local file 'primes.txt'

I've updated the ZIP-file in the first post and attached the new script here.
Please rename the TXT-file to 'do.bat' and place it in the client-folder(s).[/quote]


Nice catch Karsten. That's what we have beta testing for. It's almost impossible to think of every situation that will come up in alpha testing. Other testing scenarios that you might check out: Do the same thing but where the 1st pair is a prime -or- where the most recently completed pair is a prime. For example in the above scenario having stopped after 3 pairs were processed out of 5, try having pair #1 be a prime -and- also try having pair #3 be a prime. Bugs frequently come out at the beginning or end of a process, especially when the process is interrupted by something.

One question: Did you update the client in post #1 here? If not, that would be helpful. The hotel I'm at on my trip has a fairly severe limitation on downloading anything of any size and is very slow in doing so. After a minute of attempting to look at your DOS link, I decided to stop it as it was only barely above 10%.

While this is a small issue that doesn't affect reporting to the server, all issues need to be ironed out. I will test the same scenario on the Linux client by Tuesday after I get back from my trip.

For just small 1 or 2-line changes like this, I'm going to suggest that we don't update the version # just yet until we get all known small issues ironed out. Once they are all fixed, then we can "officially" make it version 0.71 or something like that. If there is a big issue as a result of something in large-scale stress testing, then I think it would be good to have a fix for that be either version 0.71 or 0.8 immediately, depending on what you guys think.

Does that sound reasonable?


Gary

kar_bon 2010-03-14 20:02

[QUOTE=gd_barnes;208384]Does that sound reasonable?[/QUOTE]

it's ok for me!

I've also moved the determination of the local prime logfile (primes.txt) to the start of the WIN-script, so it's done only once per batch run. Otherwise it would be done every time a prime is found. Not that big a timing issue, though.

mdettweiler 2010-03-14 20:11

[quote=kar_bon;208385]it's ok for me!

I've also moved the determination of the local prime logfile (primes.txt) to the start of the WIN-script, so it's done only once per batch run. Otherwise it would be done every time a prime is found. Not that big a timing issue, though.[/quote]
BTW, just to let you guys know: this bug should not exist in the Perl script. Since Gary copied over the same code for submitting/logging results, checking for primes, etc. that was used under "normal" circumstances, and used it for the -c code path as well, everything should be logged the same either way.

As for the version numbers, agreed, that sounds good.

gd_barnes 2010-03-14 20:59

So is the updated Windows script in the client in the 1st post here?

kar_bon: yes, it is!

MyDogBuster 2010-03-15 00:00

The old LLRNET client kept a zxxxxxxxx file maintained during a test in case the test was interrupted. cllr doesn't. One is created if a ctrl-c is enacted.

Any chance of maintaining a zxxxxxxx file for cllr just like LLRNET did?

The reason I ask is that I had a power failure and lost any work done on current tests.

mdettweiler 2010-03-15 00:37

[quote=MyDogBuster;208409]The old LLRNET client kept a zxxxxxxxx file maintained during a test in case the test was interrupted. cllr doesn't. One is created if a ctrl-c is enacted.

Any chance of maintaining a zxxxxxxx file for cllr just like LLRNET did.

The reason I ask is that I had a power failure and lost any work done on current tests.[/quote]
LLR/cllr updates its z* file per an amount of minutes specified in llr.ini (the DiskWriteTime= option). If unspecified, it defaults to 30 minutes. From what I've observed, I think old LLRnet saved at 50% through the test in addition to every 30 minutes (which may have been configurable there as well, depending on whether it actually read the llr.ini file that was usually included with the client--I don't think it did).

While it would be rather hard to change this in standard LLR/cllr, in the do.pl script you can change the frequency at which it saves its z* files by setting the $iniOptions setting as follows:
$iniOptions = "OutputIterations=10000\nDiskWriteTime=1\n";
...which would set it to 1 minute. Note that while this probably isn't possible in do.bat since it handles setting OutputIterations differently, do.pl does work on Windows as well, as long as you have Perl installed, so if it's particularly important you can use that.

kar_bon 2010-03-15 00:46

[QUOTE=mdettweiler;208412]Note that while this probably isn't possible in do.bat since it handles setting OutputIterations differently[/QUOTE]

you can do this also in the do.bat by inserting a line:
[code]
(...)
if not exist tosend.txt goto error_notos
echo OutputIterations=%op_Iterations% >llr.ini
[color=red]echo DiskWriteTime=1 >>llr.ini[/color]

:do_llrnet
llrnet
(...)
[/code]

that's it.
perhaps it's worth a new option at the top of the script the user can set when needed (when testing-time is more than say 30 min).

MyDogBuster 2010-03-15 01:52

Thanks guys. I tested BOTH solutions and they work.

gd_barnes 2010-03-25 12:01

Hi Karsten,

I'm running the Windows DOS client for the first time on my Windows I7. It appears that the option to change the output iterations is not working. I even tried tweaking the code and it didn't work.

No matter what I do, it will not write that line to the llr.ini file that allows it to be changed. Instead it keeps writing out that annoying line every 10,000 iterations with a percentage complete. Even testing at n=~550K, it fairly quickly fills up the screen because there's no way to make the DOS window wider.

I'd like to set it to 1,000,000 iterations like all of my other clients. Can you help?

Edit: I just noticed something. It corrects itself after the first batch, writes the line to llr.ini, and stops outputting every 10,000 iterations. I'm surprised that me trying to move the statement ahead of the "if" statement right ahead of it did not fix the problem. So it is a more minor bug than I originally thought but it still is a bit annoying.


Thanks,
Gary

Mini-Geek 2010-03-25 12:17

[quote=gd_barnes;209493]It appears that the option to change the output iterations is not working. I even tried tweaking the code and it didn't work.[/quote]
Works for me (Windows XP).
[quote=gd_barnes;209493]Instead it keeps writing out that annoying line every 10,000 iterations with a percentage complete. Even testing at n=~550K, it fairly quickly fills up the screen because there's no way to make the DOS window wider.[/quote]
To make the window wide, right click on the command window's title bar, click Properties, go to the Layout tab, and change both Width values to something higher (e.g. 90), then press OK, check "Save properties for future windows with the same title," and click OK. Ta-da!

kar_bon 2010-03-25 12:18

[QUOTE=gd_barnes;209493]
Edit: I just noticed something. It corrects itself after the first batch, writes the line to llr.ini, and stops outputting every 10,000 iterations. I'm surprised that me trying to move the statement ahead of the "if" statement right ahead of it did not fix the problem. So it is a more minor bug than I originally thought but it still is a bit annoying.
[/QUOTE]

Yep, the change only takes effect once LLR has tested all pairs in workfile.txt.
After getting the next set of pairs, the script writes a new llr.ini with the new iteration setting!

This could be done in the DOS-script by having it update llr.ini with code like this:
[code]
type llr.ini | find /v "OutputIterations=" > llr.new
echo OutputIterations=%1>> llr.new
move /Y llr.new llr.ini >nul
[/code]

with parameter %1 as the number of iterations.
That way the script could be stopped while LLR is running and then started again (with a parameter, or by comparing against the current setting), so that it switches to the new value and LLR continues testing.

i can make this modification, if needed.
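
For readers who do not know batch syntax, the same filter-and-append idea can be sketched in Python. This is a hypothetical stand-in, not part of do.bat or do.pl: it works on the text of llr.ini, drops every line containing the option name (as `find /v` does), and appends the new setting.

```python
# Hypothetical Python equivalent of the three batch lines above.
def set_ini_option(ini_text: str, key: str, value) -> str:
    # Keep every line that does not mention the option (like: find /v "key=").
    kept = [line for line in ini_text.splitlines() if f"{key}=" not in line]
    # Append the new setting (like: echo key=value >> llr.new).
    kept.append(f"{key}={value}")
    return "\n".join(kept) + "\n"


old_ini = "OutputIterations=10000\nDiskWriteTime=1\n"
new_ini = set_ini_option(old_ini, "OutputIterations", 1000000)
print(new_ini, end="")
```

Other options such as DiskWriteTime are left untouched, so a running setup keeps its remaining settings.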

gd_barnes 2010-03-25 14:03

[quote=kar_bon;209496]Yep, the change only takes effect once LLR has tested all pairs in workfile.txt.
After getting the next set of pairs, the script writes a new llr.ini with the new iteration setting!

This could be done in the DOS-script by having it update llr.ini with code like this:
[code]
type llr.ini | find /v "OutputIterations=" > llr.new
echo OutputIterations=%1>> llr.new
move /Y llr.new llr.ini >nul
[/code]with parameter %1 as the number of iterations.
That way the script could be stopped while LLR is running and then started again (with a parameter, or by comparing against the current setting), so that it switches to the new value and LLR continues testing.

i can make this modification, if needed.[/quote]


It seems a little bit cheesy to have it do something incorrectly for a short time and then do it correctly. If it's not too much trouble, yes, I think the change should be made.

Thanks for the info. on making the screen wider Tim. Regardless, I'm not looking at my machines 99% of the time so I'd rather not have the iterations show up.

One more thing: How do you easily change do.bat? When I made the change to the # of iterations, I renamed do.bat to do.txt, modified it in notepad, and then renamed it back to do.bat. Is there an easier way? A novice might wonder how to change the various options since you can't just edit do.bat.

Mini-Geek 2010-03-25 14:11

[quote=gd_barnes;209500]One more thing: How do you easily change do.bat? When I made the change to the # of iterations, I renamed do.bat to do.txt, modified it in notepad, and then renamed it back to do.bat. Is there an easier way? A novice might wonder how to change the various options since you can't just edit do.bat.[/quote]
Right click on do.bat > Edit, or
Open Notepad, browse to do.bat's location, set Files of type: All Files, open do.bat.

Not too hard. :smile:

gd_barnes 2010-03-25 14:31

[quote=Mini-Geek;209501]Right click on do.bat > Edit, or
Open Notepad, browse to do.bat's location, set Files of type: All Files, open do.bat.

Not too hard. :smile:[/quote]

I guess I'm that novice I'm talking about. :smile:

kar_bon 2010-03-25 18:25

I've updated the WIN-DOS script with that latest improvement:

To change the parameter "OutputIterations" for cLLR (# of iterations between outputs), just interrupt the script by pressing CTRL-C, edit the option at the top of 'do.bat' and start it again (after updating to this new script, of course!).

Default value is 10000. If the value differs from the default, it's written to 'llr.ini' every time a new set of pairs is processed.
If 'llr.ini' exists (because the batch was stopped while cLLR was testing), the value is updated there, so cLLR will immediately pick up the new value!

I've also updated the file in the link in the first post (only WIN-version).
The same script is attached here (rename it to 'do.bat').

Karsten

[b]Note on 2010-03-30: With the new version the attachment is obsolete and was deleted here! Please refer to the link in post #1! (kar_bon)[/b]

gd_barnes 2010-03-25 22:45

Thanks Karsten. Very cool. :smile:

kar_bon 2010-03-27 00:44

Any feedback from working with this new script?

Except for the issue I found (when cancelling, a found prime was not written to the local primes.txt) and
the two additions (OutputIterations and DiskWriteTime for cLLR) there seems to be no real bug in the script so far.

Are there any suggestions to make it even better?

gd_barnes 2010-03-27 08:42

[quote=kar_bon;209659]Any feedback from working with this new script?

Except for the issue I found (when cancelling, a found prime was not written to the local primes.txt) and
the two additions (OutputIterations and DiskWriteTime for cLLR) there seems to be no real bug in the script so far.

Are there any suggestions to make it even better?[/quote]

I think that first para. in the README documentation needs to be tweaked to look like the Linux README where the word "code" appears 3 times.

After changing that, I might suggest updating whatever is applicable to show the changes made since the public release and now call it version 0.71.

Other than that, the only thing I can think of is to remove all of the commented-out code in the various .lua files so that it doesn't appear so "hackish". I don't really see anything else that needs to be improved at this point.

I think I may have found a small bug in the Linux script that only occurs in rare situations. I'll check more into the details of it and report back later this weekend.


Gary

kar_bon 2010-03-27 11:04

I've uploaded the new V0.71!

NOTE: I've changed the link in the first post here (without Version-number in file-name).

This Version contains:
- Changed handling of option 'OutputIterations'
- History in the ReadMe.txt
- Changed the first paragraph wordings

I will tidy up the Lua files over the next weeks, with the changes made against the original version from Vincent Penné.

If nobody has any enhancements to suggest, I have one:

For now the server only records the time a pair took by counting the seconds between sending the pair to the client and receiving the result. So if I set my WUCacheSize to 50 and those 50 pairs take almost a day, the server will save them with timings of about 86000 seconds each, although cLLR only needed 1000!
To support real timings, both the server [b]and[/b] the client side have to be changed, and the server has to handle 'old' clients, too, which don't send the timings.
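
A minimal Python sketch of the proposed timing logic, purely as an illustration: the function and parameter names are hypothetical, and nothing like this exists in LLRnet yet. A client-reported time is preferred when present, and the server's elapsed time remains the fallback for 'old' clients.

```python
from typing import Optional


# Hypothetical sketch of the V0.8 idea; not existing LLRnet server code.
def recorded_time(sent_at: float, received_at: float,
                  client_time: Optional[float] = None) -> float:
    if client_time is not None:
        # New client: trust the test time reported by cLLR.
        return client_time
    # Old client: fall back to wall-clock time between send and receive.
    return received_at - sent_at


# With WUCacheSize=50 a pair can sit in the cache for almost a day.
print(recorded_time(0.0, 86000.0))          # old client
print(recorded_time(0.0, 86000.0, 1000.0))  # new client reporting cLLR's time
```

The fallback branch is what keeps old clients working against a changed server.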

This should be the next change in a Version 0.8.

henryzz 2010-03-27 11:51

[quote=kar_bon;209685] the server has to handle 'old' clients, too, which don't send the timings.[/quote]
It might help some people if the new client worked with the old servers as well.

kar_bon 2010-03-27 20:34

[QUOTE=henryzz;209686]It might help some people if the new client worked with the old servers as well.[/QUOTE]

The new client or script as it is now (V0.70 or V0.71) works with the 'old' server!
NPLB is running those servers unchanged; only the client side was edited!
And I'm still running the new script in the first version I wrote weeks ago!

For the mentioned enhancement with those timings, as I said, both server and client have to be changed.
So why should someone use a new client with such support when the (old) server would simply ignore it in order to keep functioning correctly?
And if such a thing should work (new client with timings and old server), the server [b]must[/b] be changed to [b]not[/b] support the new client! So it would be new/changed!

Mini-Geek 2010-03-28 22:19

I'm getting this a lot of the time when I try to run the LLRnet script: (it will also do it after it's run properly for some time, but does it most of the time)
(this is with do.pl on Windows)
[code]+----------------------------------------+
| LLRnet client v0.9b7 with LLR v3.8.0 |
| M.Dettweiler, 2010-02-20, version 0.7 |
+----------------------------------------+

Error: could not find lresults.txt.[/code](this appears instantly; if cLLR was called at all, it must've exited immediately)
I'm not sure exactly what's going on, but I'm guessing that LLR is not getting called correctly (and/or is rejecting the input it's getting), and exiting immediately without showing anything I can see (and then, of course, the script sees that no lresults.txt exists, because LLR didn't run correctly, and exits with the above message).
I haven't looked into why this is happening much yet, but I'm attaching all the non-exe files from the folder, so hopefully one of the script's writers can reproduce, troubleshoot, and fix this.

gd_barnes 2010-03-29 03:52

Unfortunately we did little testing of the do.pl client/script in Windows. It works great in Linux.

Guys, we should probably remove the possibility of people running the do.pl client/script in Windows. Carlos had a problem with it too. In the future, we should not be releasing something unless it has been fully tested in the architecture for which it is intended.

Sorry about that Tim. For now, I suggest downloading Karsten's Windows DOS client. I can confirm firsthand that it's working great in Windows because I have part of an i7 running it right now on port 6000. I can also confirm that the do.pl client/script works great in Linux.

Edit: Karsten, I updated the 1st post here to remove the do.pl client/script for Windows.


Gary

mdettweiler 2010-03-29 14:11

[quote=gd_barnes;209874]Unfortunately we did little testing of the do.pl client/script in Windows. It works great in Linux.

Guys, we should probably remove the possibility of people running the do.pl client/script in Windows. Carlos had a problem with it too. In the future, we should not be releasing something unless it has been fully tested in the architecture for which it is intended.

Sorry about that Tim. For now, I suggest downloading Karsten's Windows DOS client. I can confirm firsthand that it's working great in Windows because I have part of an i7 running it right now on port 6000. I can also confirm that the do.pl client/script works great in Linux.

Edit: Karsten, I updated the 1st post here to remove the do.pl client/script for Windows.


Gary[/quote]
Actually, that might be a tad premature. :smile: Tim, I see from your attachment that you don't have cllr.exe in your directory. That's needed for do.pl to work on Windows, and I just confirmed that it is included in the client package; did you accidentally delete it by chance? You might want to try again after putting it back.

Mini-Geek 2010-03-29 17:01

[quote=mdettweiler;209910]Actually, that might be a tad premature. :smile: Tim, I see from your attachment that you don't have cllr.exe in your directory. That's needed for do.pl to work on Windows, and I just confirmed that it is included in the client package; did you accidentally delete it by chance? You might want to try again after putting it back.[/quote]
Sorry, not that simple. Like I said, I removed the .exe's from the folder before attaching it, and it works sometimes, which it wouldn't if cllr.exe weren't there. cllr.exe is there. See:
[quote=Mini-Geek;209843]I'm getting this [B]a lot of the time[/B] when I try to run the LLRnet script: (it also will do it [B]after it's run properly for some time[/B], but does it most of the time)
....
...I'm attaching all the [B]non-exe files[/B] from the folder...[/quote]

I've just started using do.bat, which seems to work so far. Edit: Hm, not so fast. I had finished 2 of 3 numbers, used do -c to report/cancel them, and it didn't seem to have reported the first two. From what it outputted, it seems to have canceled all 3 without returning the results. Here's those results: [code]2221*2^548899-1 is not prime. LLR Res64: D2B93491410C1AE1 Time : 363.222 sec.
2401*2^548899-1 is not prime. LLR Res64: 132E13C16414CEBF Time : 365.859 sec.[/code]Can you check if those numbers were reported as complete in the DB? I can't find any indication on the pages that they were. I don't care too much about the credit for two numbers this size, and they're probably already assigned elsewhere, so I guess we'll just have a spot double check...
If it was indeed returned, the output should really be changed to reassure you that they were returned and not canceled (like the Perl version, IIRC).

Whenever you have an idea for me to check something to troubleshoot either of these things, you can tell me (here or in PM) and I'll try it. I'd like to get this worked out. :smile:

mdettweiler 2010-03-29 18:02

[quote=Mini-Geek;209929]Sorry, not that simple. Like I said, I removed the .exe's from the folder before attaching it, and it works sometimes, which it wouldn't if cllr.exe weren't there. cllr.exe is there. See:[/quote]
Ah, whoops, missed that. :rolleyes:
[quote]I've just started using do.bat, which seems to work so far. Edit: Hm, not so fast. I had finished 2 of 3 numbers, used do -c to report/cancel them, and it didn't seem to have reported the first two. From what it outputted, it seems to have canceled all 3 without returning the results. Here's those results: [code]2221*2^548899-1 is not prime. LLR Res64: D2B93491410C1AE1 Time : 363.222 sec.
2401*2^548899-1 is not prime. LLR Res64: 132E13C16414CEBF Time : 365.859 sec.[/code]Can you check if those numbers were reported as complete in the DB? I can't find any indication on the pages that they were. I don't care too much about the credit for two numbers this size, and they're probably already assigned elsewhere, so I guess we'll just have a spot double check...
If it was indeed returned, the output should really be changed to reassure you that they were returned and not canceled (like the Perl version, IIRC).

Whenever you have an idea for me to check something to troubleshoot either of these things, you can tell me (here or in PM) and I'll try it. I'd like to get this worked out. :smile:[/quote]
It seems both of those results were canceled; I have them listed in port 6000's results.txt as completed by Gary, so they must have been canceled and reassigned.

BTW, even if they weren't successfully canceled or submitted, the server would eventually reassign them in 2 days; so they wouldn't be "missed" per se, i.e. no need for a later spot doublecheck of them.

em99010pepe 2010-03-29 18:04

The DOS script will go down if:

a) you get "ERROR: SUM(INPUTS) != SUM(OUTPUTS)".....it is a cllr.exe issue.
b) your internet connection goes down while you upload the results.

Carlos

Mini-Geek 2010-03-29 18:14

[quote=mdettweiler;209943]BTW, even if they weren't successfully canceled or submitted, the server would eventually reassign them in 2 days; so they wouldn't be "missed" per se, i.e. no need for a later spot doublecheck of them.[/quote]
I meant that my results plus the results from the reassignment (Gary's results) would make a spot doublecheck, which it already has (though I don't know the result of it; do the residues match?).

mdettweiler 2010-03-29 18:25

[QUOTE=Mini-Geek;209947]I meant that my results plus the results from the reassignment (Gary's results) would make a spot doublecheck, which it already has (though I don't know the result of it; do the residues match?).[/QUOTE]
Ah, whoops, I see what you mean now. Yes, the residues do match:
[code]user=gd_barnes
[2010-03-29 12:43:25]
2221*2^548899-1 is not prime. Res64: D2B93491410C1AE1 Time : 849.0 sec.
user=gd_barnes
[2010-03-29 12:43:25]
2401*2^548899-1 is not prime. Res64: 132E13C16414CEBF Time : 849.0 sec.
[/code]
BTW, you don't have to have direct access to the server to see these; each server's results.txt file is updated at [URL]http://www.noprimeleftbehind.net/llrnet/[/URL] every 15 minutes. In this case I got it from [url=http://www.noprimeleftbehind.net/llrnet/todayresults_6000.txt]here[/url].
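
A residue comparison like this can also be done mechanically. The Python sketch below parses the result lines exactly as they were posted in this thread (the client log writes "LLR Res64:", the server log just "Res64:") and checks that both sides agree. The regular expression and function name are my own assumptions about the line format, not part of LLRnet.

```python
import re

# Assumed line format, based only on the result lines quoted in this thread.
RESULT_RE = re.compile(r"^(\S+) is not prime\.\s+(?:LLR )?Res64: ([0-9A-F]{16})")


def residues(lines):
    """Map each tested candidate to its Res64, skipping non-result lines."""
    found = {}
    for line in lines:
        match = RESULT_RE.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


client_log = [
    "2221*2^548899-1 is not prime. LLR Res64: D2B93491410C1AE1 Time : 363.222 sec.",
    "2401*2^548899-1 is not prime. LLR Res64: 132E13C16414CEBF Time : 365.859 sec.",
]
server_log = [
    "user=gd_barnes",
    "[2010-03-29 12:43:25]",
    "2221*2^548899-1 is not prime. Res64: D2B93491410C1AE1 Time : 849.0 sec.",
    "user=gd_barnes",
    "[2010-03-29 12:43:25]",
    "2401*2^548899-1 is not prime. Res64: 132E13C16414CEBF Time : 849.0 sec.",
]

client, server = residues(client_log), residues(server_log)
mismatches = [c for c in client if c in server and client[c] != server[c]]
print("mismatching residues:", mismatches)
```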

gd_barnes 2010-03-29 19:56

Karsten,

On the DOS script:

Tim is saying that do -c does not work with the Windows DOS client.

Carlos has 2 problems.

Can you check into those please?

I just also recently noticed that the Linux do.pl script will not return completed results to the server if the server goes down during OR BEFORE the time in which they are completed. The key is "or before" there and only applies if the server is STILL down when the batch completes. It "attempts" to send them, assumes they've been sent, deletes tosend.txt, and waits for the server to come back up to get new pairs. It never seems to know that the previous results had not been sent. I see the problem and will work on fixing it today. This problem seems to be the same or similar to the one that Carlos is experiencing on the DOS script.
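
The failure mode described here (results deleted although the send never succeeded) can be avoided by removing tosend.txt only after a confirmed send. The Python sketch below shows that pattern under stated assumptions: `flush_results` and the `send` callback are hypothetical names, and this is not the actual fix made in do.pl.

```python
import os
import tempfile


# Hypothetical sketch: `send` is any callable that returns True only when
# the server has confirmed receipt of the results.
def flush_results(path, send):
    """Send queued results; delete the queue file only after a confirmed send."""
    if not os.path.exists(path):
        return True  # nothing queued
    with open(path) as fh:
        lines = fh.read().splitlines()
    if send(lines):
        os.remove(path)
        return True
    return False  # keep the file so the next attempt can retry


with tempfile.TemporaryDirectory() as tmp:
    queue = os.path.join(tmp, "tosend.txt")
    with open(queue, "w") as fh:
        fh.write("2221*2^548899-1 is not prime.\n")
    flush_results(queue, lambda lines: False)  # server down: file is kept
    still_queued = os.path.exists(queue)
    flush_results(queue, lambda lines: True)   # server back: file is removed
    removed = not os.path.exists(queue)

print(still_queued, removed)
```

The key design choice is that a failed send leaves the queue file in place, so results completed while the server was down are retried instead of lost.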

Fortunately it seems that these 3 problems are situations not related to load that we did quite a bit of alpha testing on but to exception situations that we either did not think to alpha test or did not test enough. Karsten, any thoughts on how the problem with do -c got missed? I did extensive testing on do.pl -c on the Linux side and it definitely works. It shows the # of pairs returned to the server and the # of pairs cancelled.

I would suggest documenting the fixes in README or wherever and increasing the version # after this. Is README where we are showing fixes and new versions?

Max, I haven't responded to your Email suggestion for an upcoming rally yet because I didn't feel like we have properly beta tested everything yet. I want this thread to go "dry" with problems for a week before we have a rally.


Gary

gd_barnes 2010-03-29 19:58

[quote=mdettweiler;209948]Ah, whoops, I see what you mean now. Yes, the residues do match:
[code]user=gd_barnes
[2010-03-29 12:43:25]
2221*2^548899-1 is not prime. Res64: D2B93491410C1AE1 Time : 849.0 sec.
user=gd_barnes
[2010-03-29 12:43:25]
2401*2^548899-1 is not prime. Res64: 132E13C16414CEBF Time : 849.0 sec.
[/code]
BTW, you don't have to have direct access to the server to see these; each server's results.txt file is updated at [URL]http://www.noprimeleftbehind.net/llrnet/[/URL] every 15 minutes. In this case I got it from [URL="http://www.noprimeleftbehind.net/llrnet/todayresults_6000.txt"]here[/URL].[/quote]

I didn't feel you responded to what he implied. I think he was hoping that these would go in the DB as a doublecheck. Unfortunately...not possible: Since it's the same server, these results were rejected by the server. I confirmed as much.

Sorry about the problems Tim.

gd_barnes 2010-03-29 20:05

[quote=mdettweiler;209910]Actually, that might be a tad premature. :smile: Tim, I see from your attachment that you don't have cllr.exe in your directory. That's needed for do.pl to work on Windows, and I just confirmed that it is included in the client package; did you accidentally delete it by chance? You might want to try again after putting it back.[/quote]

Premature? How's that? Both Tim and Carlos (in the base 5 thread) have had problems with the Windows do.pl script. You can't be publicly posting something that hasn't been tested in the environment for which it was intended. That was a blunder on our part.

We should just stick with the DOS do script for Windows and do.pl script for Linux only. There is less testing that way. With version 0.71 of the do.pl Linux script, I will tweak the do.pl README to remove the part that says it "should" work with Windows. It "should" work if we had tested it, which we didn't.


Gary

gd_barnes 2010-03-29 20:44

I'd like to get a problem log working here so that nothing gets missed.

Everyone, please chime in if I am missing or misstating any known problem here.

Problems in the clients that need to be fixed:

1. Windows DOS, Tim says do -c does not return completed results to the server. It cancels all pairs instead.
[B]Resolution: This is a non-issue. It was an error in the testing environment.[/B]

2. Windows DOS, Carlos says the script will go down while returning completed results if your internet connection drops while doing so.
[B]Same issue as #6? Is the script going down the same as the pairs not being returned?[/B]

3. Windows DOS, Carlos is getting the ERROR: SUM issue in CLLR. Is that something that we should be able to fix? If not, we'll just list it as a known "feature" in the documentation.

4. Windows do.pl, various problems due to lack of testing of exception situations. I've removed the link in the 1st post here and suggest that we not attempt to maintain it.
[B]Resolution: Maintenance of the Windows do.pl client is not being done.[/B]

5. Linux do.pl, if the server goes down during or before results are completed, the clients will: complete the tests, create the tosend.txt file, attempt to send the tosend.txt file, assume that it has been correctly sent and delete it, and wait for the server to come back up to get new pairs to test. It needs to avoid deletion of the tosend.txt file if the server is down or the internet connection is lost. That way, it will send them when the server comes back up.
[B]Resolution: Solved by Gary in Version 0.71. The tosend.txt is not deleted until there is confirmation that the pairs are successfully sent.[/B]

6. Windows DOS, same issue as Linux do.pl #5.
[B]Resolution: Solved by Karsten in Version 0.72: The tosend.txt is not deleted anymore and a note on screen is displayed.[/B]

Karsten:
I will test #2 with a larger batch of pairs done at once and try to disconnect the server while the client is sending those results!
For #3, as I mentioned: I need more info to handle this.

Thanks Carlos and Tim for testing and posting known issues.

Gary

em99010pepe 2010-03-29 20:56

3.

It is due to the overclocking, but the client should keep testing from the last save point, as the LLR GUI version does. Anyway, this breaks the scripts.

mdettweiler 2010-03-29 20:57

[quote=gd_barnes;209960]Premature? How's that? Both Tim and Carlos (in the base 5 thread) have had problems with the Windows do.pl script. You can't be publicly posting something that hasn't been tested in the environment for which it was intended. That was a blunder on our part.

We should just stick with the DOS do script for Windows and do.pl script for Linux only. There is less testing that way. With version 0.71 of the do.pl Linux script, I will tweak the do.pl README to remove the part that says it "should" work with Windows. It "should" work if we had tested it, which we didn't.[/quote]
I did quite a bit of testing with do.pl on my own Windows setup; I had it run for a number of days straight on a "production" server and it worked great. I'm not sure what could have gone wrong here. You mentioned a couple posts up that you see where the problem is; could you point me to it?

Almost all of do.pl should be OS-independent. I wonder if Tim's problem with it on Windows is related to the dropped-connection issue: possibly his connection cut out somewhere along the way on the times where he got "could not find lresults.txt" errors? It might be another manifestation of the same problem.

Mini-Geek 2010-03-29 21:08

[quote=mdettweiler;209968]Almost all of do.pl should be OS-independent. I wonder if Tim's problem with it on Windows is related to the dropped-connection issue: possibly his connection cut out somewhere along the way on the times where he got "could not find lresults.txt" errors? It might be another manifestation of the same problem.[/quote]
I seriously doubt it. The bug still happens when the client/script already has work to do and is just resuming it (i.e. when I do no server communication in that running of the script).

kar_bon 2010-03-29 21:42

To issue #1:

I've tested the following with the DOS-script downloadable in the first post:
the client and server are there and set up for a local test, knpairs contains 20 pairs.

[code]
start:
- client/server as in V0.71-download
- server started with 'llrserver' in folder 'LLRnet_server'
- client started with 'do' in folder 'LLRnet_client'
- client completed first 5 workunits and sent to server
- client completed 3 from next 5 workunits, then stopped
- calling 'do -c'

proved:
- client-folder
- lresults_hist.txt contains 8 results and note "Cancelled 2 kn-pairs!"
- no file like workfile.txt, lresults.txt, tosend.txt or llr.ini
so this is the same as a new installed client-folder, ready to start again!
- tray-icon of server says: 5 connections, 8 results
- server-folder:
- results.txt contains 8 pairs, the same as in the client-folder
- knpairs.txt contains the remaining pairs (8 first deleted)
- joblist.txt contains 4 entries: 2 pairs with status 'working' AND 'abandonned' (cancelled)

do next:
- stopping server: tray -> 'Exit LLRserver'
- calling simplify.bat in the server-folder
- joblist.txt contains 2 cancelled pairs only (2 entries)

next:
- starting server again
- starting 'do' again
- client is working on all other pairs, the 2 cancelled, too

proved:
- client folder: lresults_hist.txt contains all 20 pairs
- server folder: after calling 'simplify.bat':
- results.txt contains 60 lines -> 20 results
- knpairs.txt contains no pair, only worktype in first line
- joblist.txt contains 12 pairs, all 'solved'
- calling 'simplify.bat' again:
- joblist.txt contains empty list
[/code]


-> all is OK!

So please Tim, test this again on your PC and post your result here!

kar_bon 2010-03-29 21:46

To issue #3:

An OC'ed PC is not the 'normal' way to use a script like this, nor LLR.

To handle this, I need the output LLR creates, or any file I can check to see whether such an error occurred.

I can only act in a script when I know what to look for, and the same goes for LLR!






(If I get more info, I will update this post.)

gd_barnes 2010-03-29 21:52

[quote=mdettweiler;209968]I did quite a bit of testing with do.pl on my own Windows setup; I had it run for a number of days straight on a "production" server and it worked great. I'm not sure what could have gone wrong here. You mentioned a couple posts up that you see where the problem is; could you point me to it?

Almost all of do.pl should be OS-independent. I wonder if Tim's problem with it on Windows is related to the dropped-connection issue: possibly his connection cut out somewhere along the way on the times where he got "could not find lresults.txt" errors? It might be another manifestation of the same problem.[/quote]

I didn't say that I know where Tim's problem is on Windows do.pl. I know where MY problem is on Linux do.pl. I give a lot of detail of it in problem #5. I'll start working on it within a couple of hours.

Don't worry about do.pl on the Windows side. We don't need two different Windows clients. It just complicates things. Also, having people use the DOS script is much better and easier since they don't have to download files related to running Perl.

Having it run for a # of days straight correctly proves little other than it works OK when there are no technical glitches along the way. It doesn't test exception situations like different bases, dropped internet connections, dropped servers, etc. You have to test the exceptions. I'm guilty of it on the Linux side. Although I ran a test with a stopped server on the Linux side, I failed to fully analyze all files. The client showed that the pairs were sent and that it was waiting for new pairs. I failed to check the server side and see if the pairs were actually received after the server came back up, which they were not; hence the problem.

Edit: I went ahead and modified the #4 problem from "due to lack of testing" to "due to lack of testing of exception situations".

Edit2: Please do not first assume that the problem is on the user's end as in "I wonder if it is a problem with Tim's connection". Usually it is not. I get so tired of hearing that from businesses and others when I happen to have problems with my internet connection, phone problems, software, etc. Most of the time when dealing with technically very competent users such as Tim, the problem is in the software itself, not with the user.


Gary

kar_bon 2010-03-29 21:59

To issue #2:

When stopping the server (same setting as for issue #1) while cLLR is testing pairs, the client will delete all files!
The work done while the server was down is logged only in the client's 'lresults_hist.txt'; there is no tosend.txt anymore!

I will change the script to fix this as soon as possible!
Perhaps there are other issues to solve first.

Mini-Geek 2010-03-29 22:18

[quote=kar_bon;209981]To issue #1:

I've tested the following with the DOS-script downloadable in the first post:
the client and server are there and set up for a local test, knpairs contains 20 pairs.

...
-> all is OK!

So please Tim, test this again on your PC and post your result here![/quote]
I have just confirmed that this works. All exactly as you said, except that you left out doing "do -c" ([B]kar_bon: corrected in that post, thanks![/B]), and another minor thing (detailed after the next quote). As the .bat is unchanged, I must conclude there was something wrong with one or more of:
my client files/folder (which may very well be, as I started it with the files of the do.pl and then added do.bat and the files along with it),
the server I'd been connecting to (seems unlikely)
the connection between my client and the server (seems unlikely, unless anything about the way do.bat communicates doesn't work with the LLRnet server G6000)

I'm currently using the test folder (the one I set up to run your test, which I know to be in working order) to run a number from G6000, so I can use do -c to return it and make sure it all works. Assuming it does, that means that whatever the problem was was probably mainly my fault (due to the aforementioned mix-and-match). Edit: yes, it worked correctly. It returned one candidate (of 5) and canceled the other 4. It doesn't really tell me everything that's going on, but at least it worked. Here's all it said:[code]+-------------------------------------+
| LLRnet client V0.9b7 with cLLR V3.8 |
| K.Bonath, 2010-02-10, Version 0.71 |
+-------------------------------------+

Current configuration:
server = "www.noprimeleftbehind.net"
port = 6000
username = "Mini-Geek"
WUCacheSize = 5

1 file(s) copied.
[2010-03-29 17:23:43]
Cancelling : 2201/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:43]
Cancelling : 2205/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:44]
Cancelling : 2295/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:45]
Cancelling : 2421/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:45]
No more job to cancel !
All jobs canceled![/code](those 4 were the canceled ones) It'd be preferable to say something like "Returning 2195/548954" before the cancellings, followed by a "1 pair returned, 4 pairs canceled" message.
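
A rough sketch of what such a summary could look like, written in Python only for illustration (the real client is a batch script). It assumes the "Cancelling : k/n" console format shown above. The list of returned pairs is passed in by the caller because the console log does not contain it; `cancel_report` and `plural` are hypothetical names.

```python
import re

# Matches console lines such as:
#   Cancelling : 2201/548954 (30000000000000:M:1:2:258)
CANCEL_LINE = re.compile(r"^Cancelling : (\d+)/(\d+)", re.MULTILINE)

def plural(count):
    """Format a pair count, e.g. '1 pair' or '4 pairs'."""
    return f"{count} pair" if count == 1 else f"{count} pairs"

def cancel_report(returned, console_log):
    """Build the suggested report from returned (k, n) pairs and the cancel log."""
    cancelled = CANCEL_LINE.findall(console_log)
    lines = [f"Returning {k}/{n}" for k, n in returned]
    lines += [f"Cancelling {k}/{n}" for k, n in cancelled]
    lines.append(
        f"{plural(len(returned))} returned, {plural(len(cancelled))} canceled"
    )
    return "\n".join(lines)
```

Called with the one returned pair and the console text above, the last line of the report would be the "1 pair returned, 4 pairs canceled" message.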
[quote=kar_bon;209981][code]- server folder: after calling 'simplify.bat':
- joblist.txt contains 12 pairs, all 'solved'
...
- calling 'simplify.bat' again:
- joblist.txt contains empty list
[/code][/quote]
I only called simplify.bat once, (after the server shut down, if that matters) but joblist.txt was immediately empty. I hope this isn't too big of a problem. :smile:
([B]kar_bon: I should make a note in the README.txt of the download-zip to use 'simplify' twice for that![/B])

gd_barnes 2010-03-29 22:26

[quote=Mini-Geek;209995]I have just confirmed that this works. All exactly as you said, except that you left out doing "do -c", and another minor thing (detailed after the next quote). As the .bat is unchanged, I must conclude there was something wrong with one or more of:
my client files/folder (which may very well be, as I started it with the files of the do.pl and then added do.bat and the files along with it),
the server I'd been connecting to (seems unlikely)
the connection between my client and the server (seems unlikely, unless anything about the way do.bat communicates doesn't work with the LLRnet server G6000)

I'm currently using the test folder (the one I set up to run your test, which I know to be in working order) to run a number from G6000, so I can use do -c to return it and make sure it all works. Assuming it does, that means that whatever the problem was was probably mainly my fault (due to the aforementioned mix-and-match).

I only called simplify.bat once, (after the server shut down, if that matters) but joblist.txt was immediately empty. I hope this isn't too big of a problem. :smile:[/quote]


Tim,

To put it all in a nutshell:

You suspect that problem #1 in the "problem log" was a result of possibly having some Windows do.pl files in your Windows DOS client? Is that correct?

I wasn't quite clear. Is there still a problem with do -c?

If you can confirm that everything is working correctly with the Windows DOS client after you ran it clean without the Windows do.pl client files in the same folder, I'll update the problem log post to show it as a "non issue".

On the Windows side, that still leaves open problems #2 and #3.

Karsten,

I wasn't clear on your test. Were you able to simulate Carlos's internet connection outage for the problem in #2 and confirm that it worked correctly?

I see that you are waiting to get some output from Carlos on the OC'd issue #3. I wasn't clear if it is related to a specific problem with LLR (CLLR?) that the clients need to be able to handle -or- if LLR (or CLLR) needs to be fixed. Can you clarify that?


Gary

Mini-Geek 2010-03-29 22:30

[quote=gd_barnes;209996]To put it all in a nutshell:

You suspect that problems #1 and #2 in the "problem log" were as a result of possibly having some Windows do.pl files in your Windows DOS client?

Is that correct?

I wasn't quite clear. Is there a problem with do -c?

If you can confirm that everything is working correctly with the Windows DOS client after you ran it clean without the Windows do.pl client files in the same folder, I'll update the problem log post to show them as "non issues".[/quote]
Yep, as I added to my previous post:
[quote=Mini-Geek;209995]I'm currently using the test folder (the one I set up to run your test, which I know to be in working order) to run a number from G6000, so I can use do -c to return it and make sure it all works. Assuming it does, that means that whatever the problem was was probably mainly my fault (due to the aforementioned mix-and-match). Edit: yes, it worked correctly. It returned one candidate (of 5) and canceled the other 4.[/quote]So yes, I believe #1 to be a non-issue. #2 is something I don't think I've seen (one that Carlos reported), so I can't really comment on that.

gd_barnes 2010-03-29 22:37

[quote=kar_bon;209989]To issue #2:

When cancelling the server (same setting as for issue #1) while cLLR is testing pairs, the client will delete all files!
Only the done work during server-down is logged in the client 'lresults_hist.txt' but no tosend.txt anymore!

To fix this i will change the script as soon as possible!
Perhaps other issues to solve first.[/quote]


THAT is the EXACT same issue as in the Linux do.pl client!!

I too will be working on that this evening.

I first noticed that something might be amiss when I dropped a server a few days ago to load more pairs. There ended up being some pairs that the server never received even though it appeared that the clients had sent them. I wasn't sure what the problem was at first. You'll see a posting from me from last Friday where I said there may be a problem in the Linux do.pl script related to that. It sure snowballed in a hurry as others found it also and it perhaps caused some other problems.

In our defense (lol), this was not an easy issue to see. I'm sure that like do.pl, do.bat "shows" that the pairs are sent to the server and sits waiting to get new pairs. The problem is that it never actually sends the previous results after the server comes up because it has already deleted tosend.txt. The server files needed to be more closely inspected when testing.

That's why we do beta testing. :-)


Gary

gd_barnes 2010-03-29 22:43

[quote=Mini-Geek;209997]Yep, as I added to my previous post:
So yes, I believe #1 to be a non-issue. #2 is something I don't think I've seen (one that Carlos reported), so I can't really comment on that.[/quote]

OK, I'm marking #1 as a non-issue. I goofed when I said #1 and #2 and had edited my post as such but you had already quoted it.

In other news, I'll mark an issue #6 for Windows do.bat that is the same as issue #5 for Linux do.pl. That is the issue with results not being sent to the server when a server (or internet connection) is dropped.

gd_barnes 2010-03-30 10:32

1 Attachment(s)
I have fixed the problem in the Linux client with results not being returned to the server whenever there is an internet outage or server problem at or before the time that the batch is finished.

Karsten, this took some extensive changes and several hours of testing over many different scenarios and conditions. The more difficult changes were in the pairs cancellation process. You have to test cancelled pairs/results being returned to the server in three situations: (1) the server is down at that moment and then comes back up; (2) the last batch completed while the server was down, and the server is currently up (or down); and (3) the more usual case where the user cancelled the last batch before it was done and wishes to return processed results and unprocessed pairs to the server while it is up and running. Personally, I didn't think to test multiple scenarios with the server being up or down in the cancellation process in alpha testing. I only did some of it in the regular process but obviously hadn't looked closely enough at the problem that came up.

The main thing that I did was put it into a loop whenever the results were being returned. Previously we only had a loop when retrieving pairs, but a similar loop is needed when returning pairs in both the main and cancellation processes. As it previously existed, the code always assumed that the pair return process worked correctly (that is, that the server was up and the internet connection was good). Although the results were showing up in lresults_hist, file cleanup subsequent to the presumed pair return caused the converted tosend.txt file to be very "quietly" lost whenever there was a communication issue with the server.
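
For readers who have not seen do.pl, the shape of such a send loop can be sketched as follows. This is Python rather than Perl and is not the actual do.pl code. It assumes the client leaves tosend.txt empty once the server has accepted the results (described later in this thread), and `send_once` is a hypothetical callable that runs one send attempt. The point is that nothing deletes the file before there is confirmation.

```python
import os
import time

def results_pending(path):
    """True while the file exists and is non-empty, i.e. results are unsent."""
    return os.path.exists(path) and os.path.getsize(path) > 0

def return_results(send_once, tosend_path, retry_delay=60, max_attempts=None):
    """Call send_once() until tosend_path is empty or attempts run out.

    send_once is a hypothetical stand-in for one run of the client's send
    step. This function never deletes tosend_path, so unsent results
    survive a server or connection outage and can be retried later.
    """
    attempts = 0
    while results_pending(tosend_path):
        if max_attempts is not None and attempts >= max_attempts:
            return False          # still unsent; the file is left in place
        if attempts:
            time.sleep(retry_delay)
        send_once()
        attempts += 1
    return True                   # confirmed sent; cleanup is now safe
```

Only when this returns True would it be safe to go on to file cleanup and to fetching new pairs.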

Testing was done using both Riesel and Sierp files for n<1000 and for n=10K-10.1K. I didn't feel that an all out stress test was necessary since the communication with the server was little changed. But right now, I'm loading it on all of my machines so the main public drives will somewhat stress test it.

Attached are the updated do.pl script and README.txt. It is now officially do.pl version V0.71. I have added versioning comments at the bottom of README. Please edit the 1st post here to incorporate the changes in the client files.

Whew, I'm glad that is done. But now it's back in beta testing phase. We can't assume that it is fully correct yet. :smile:


Gary

kar_bon 2010-03-30 12:31

I've just uploaded the new Version (now V0.72 for the DOS-script) with the no-connection issue solved.

- While testing pairs with cLLR, the script will not delete the results done when the server died. There's also a message on screen:
[b]The file 'tosend.txt' contains unsent pairs!
Starting this script will submit those pairs to the server.[/b]

As stated, running the script when the server is available again, the results in tosend.txt will be sent to the server first, then new pairs received.

- I've changed the logging when cancelling pairs: now the cancelled pairs are written in lresults_hist.txt (and on screen as before)!

- I set all date/times of every file in the zip to 2010-03-30 07:02 (date and Version).

- I had to change 'README.txt', 'do.bat' and 'do_clearwork.awk'.

Please refer to the link in the first post to get the current version!

I've deleted the attachments in some posts (old version of script) to avoid confusion and also updated the post with the issue-summary.

Karsten

mdettweiler 2010-03-30 14:35

I tried to upload the latest version of do.pl just now but I couldn't connect to the server. I'm not sure what's up, though hopefully it's just a transient problem. (Those occur from time to time; maybe somebody in the ISP slipped on a banana peel and knocked over a server rack. :wink:)

I'll post here as soon as I successfully upload the files.

gd_barnes 2010-03-30 19:23

[quote=mdettweiler;210062]I tried to upload the latest version of do.pl just now but I couldn't connect to the server. I'm not sure what's up, though hopefully it's just a transient problem. (Those occur from time to time; maybe somebody in the ISP slipped on a banana peel and knocked over a server rack. :wink:)

I'll post here as soon as I successfully upload the files.[/quote]

You couldn't connect to which server?

gd_barnes 2010-03-30 19:25

Karsten,

Nice work. Please upload the new Linux client version 0.71 to the 1st post link. Thanks.


Gary

gd_barnes 2010-03-30 19:55

Karsten or Carlos,

Can you confirm that issues #2 and #6 in the [URL="http://www.mersenneforum.org/showpost.php?p=209965&postcount=42"]problems log post[/URL] are the same issue? If so, I'll delete issue #2.

We'll just need more info. on issue #3. That should be everything.


Gary

mdettweiler 2010-03-30 20:18

[quote=gd_barnes;210093]You couldn't connect to which server?[/quote]
The noprimeleftbehind.net server. It's working again now; I'm not sure what happened.
[quote=gd_barnes;210094]Karsten,

Nice work. Please upload the new Linux client version 0.71 to the 1st post link. Thanks.


Gary[/quote]
Now that the server's back online I'll do that right now. Edit: done

gd_barnes 2010-03-30 20:37

Max,

Oh, I thought Karsten was uploading them to his pages. Heck, I could have uploaded it to the noprimeleftbehind server. One thing: I've changed all the commonly accessed links from no-IP to noprimeleftbehind on the 2 projects since they are synonymous and the latter is easier to remember. So I tweaked your link location.

Did you know that your quad did not connect to our Sierp base 9 port all night past about midnight CDT? I wonder if there was a connection issue on your end, which is why you couldn't get into the noprimeleftbehind pages this morning. I didn't experience any outages here.

A Perl hint for you: The -s file test checks for a file > 0 bytes. That way you don't have to check for both its existence (-e) and whether it is not empty (!-z). That allowed the simplification of one of your until statements.

On the send results looping process, I looped until the tosend.txt file was empty (-z). If it never becomes empty, then there is a connection issue. As you probably know, LLRnet returns an empty tosend.txt file after accepting results instead of deleting it, so I had to check whether it was empty instead of checking that it no longer exists.
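
To make the three file tests concrete for non-Perl readers, here is a small Python sketch of rough equivalents. The helper names are invented for illustration. It shows why a single "has data" check can replace the pair of "exists" and "not empty" checks.

```python
import os

def exists(path):
    """Rough equivalent of Perl's -e: the file exists."""
    return os.path.exists(path)

def is_empty(path):
    """Rough equivalent of Perl's -z: the file exists and has zero size."""
    return os.path.exists(path) and os.path.getsize(path) == 0

def has_data(path):
    """Rough equivalent of Perl's -s: the file exists and is non-empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0
```

For any path, `has_data(path)` gives the same answer as `exists(path) and not is_empty(path)`, and an empty tosend.txt left behind after a successful send is the case where `is_empty` is true while `exists` is also true.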


Gary

kar_bon 2010-03-30 20:44

[QUOTE=gd_barnes;210108]Oh, I thought Karsten was uploading them to his pages.
[/quote]

I can do this, too, so that if one server is offline, the scripts are still available on the other!

[QUOTE=gd_barnes;210108]
As you probably know, LLRnet returns an empty tosend.txt file after accepting results instead of deleting it so I had to check whether it was empty instead of for its lack of existence.
[/quote]

I do it the same way in the DOS script!

The error handling and the checking for and reacting to those exceptions make up most of the code now!

My first attempt for this script was 11 lines long!

gd_barnes 2010-03-30 21:11

[quote=kar_bon;210109]I can do this, too, to be sure if one server is offline, the scripts are available on the other!



I do this the same in the DOS-script!

The error handling and checking/reacting on those exceptions are most of the code now!

My first attempt for this script was 11 lines long![/quote]

Very good. The same situation now exists on the Linux script with a large portion of the code now being for exception situations.

As they say in programming: We spend 95% of our time coding for 5% of situations. (Or it could be 99%-1%!) This is certainly no exception. To take the point a step further: It's simple to write a program that works 95% (or 99%) of the time. It's very difficult to write one that works 100% of the time.


Gary

mdettweiler 2010-03-31 02:44

[quote=gd_barnes;210108]Did you know that your quad did not connect to our Sierp base 9 port all night past about midnight CDT? I wonder if there was a connection issue on your end, which is why you couldn't get into the noprimeleftbehind pages this morning. I didn't experience any outages here.[/quote]
Ah, that would sound about right. Last night it wasn't working but since I didn't have anything to upload at the time, I didn't worry about it; I then posted about it in the morning when I tried to upload and it was still down.

Regarding mirroring the Linux script on Karsten's rieselprime.de server, good idea. I'll do the same for his script and upload it to noprimeleftbehind.net--that way both scripts will always be available as long as at least one server is up. Edit: done

kar_bon 2010-03-31 03:34

[QUOTE=kar_bon;210109]I can do this, too, to be sure if one server is offline, the scripts are available on the other!
[/QUOTE]

Done.

gd_barnes 2010-03-31 06:29

Carlos,

We're down to issues #2 and #3 in the problem log in post #42 [URL="http://www.mersenneforum.org/showpost.php?p=209965&postcount=42"]here[/URL], which are yours.

We think that issue #2 is the same as issue #6. Can you download the new client and confirm that it has been fixed?

As requested by Karsten, we'll need more info. on problem #3.


Thanks,
Gary

mdettweiler 2010-05-01 14:58

I'm encountering a rather strange issue with do.bat on Windows Vista. On my quad, I have it set to run as a service (using a [url=http://free-dc.org/forum/showpost.php?p=46632&postcount=2]method[/url] I picked up on the Free-DC forum a while back that lets you run any application as a service) so that it will run as the "LOCAL SYSTEM" account whenever the computer is on, regardless of who's logged on (since I'm not the primary user of the computer and those who are commonly log on and off of their usernames).

For the rally I installed four copies of do.bat for the first time on this computer and set them up as services. (I'd never had occasion to run LLRnet on this machine before since the new client came out, so this was my first experiment with that combo.) I purposely set op_connect = TRUE in the configuration section so that the clients would keep trying to reconnect in case of an outage--because if they stopped, it could go unnoticed for hours until I noticed a drop in my output, logged into the machine remotely via VNC, and restarted the services manually.

Yet, despite this, it seems that something is happening to make the clients stop every so often. Unfortunately, I have no access whatsoever to the clients' console output due to them being run as services; the best clues I have are that when they stop, tosend.txt is present in the client directory but not workfile.txt or any of the other files indicating that the client is in the middle of a batch. Essentially, it looks just like what happens when the client gives up after a few failed connections with op_connect = FALSE--yet I have it set to TRUE, so that wouldn't make sense. Also, I'm sure it's not just in that "waiting period" where it's pausing 60 seconds before another reconnect--no way it would wait like that for hours on end when my connection is perfectly good and the other cores on the same machine are connecting without issue.

I won't be able to do it during this rally, but sometime afterwards I'm going to try changing the copies of do.bat on that machine to echo their output to a file instead of to console (since they have no visible console). That way hopefully I can get some clues as to what's going on.

So, stay tuned...I hope to have a better idea of what's going on here in the near future. :smile:

gd_barnes 2010-05-01 21:13

What is a service and why does one need to be used?

I realize it's a machine on which you prefer that the command prompt windows not be shown, but what does that have to do with a service (if anything)?

mdettweiler 2010-05-01 22:44

[quote=gd_barnes;213733]What is a service and why does one need to be used?

I realize it's a machine on which you prefer that the command prompt windows not be shown, but what does that have to do with a service (if anything)?[/quote]
Windows keeps a registry of "Services" behind the scenes that are started whenever the computer's booted--operative word [i]booted[/i], not logged on. A lot of these are actual Windows background components. However, the service functionality is not limited to Windows components, and can be used with other applications as well. Manual LLR (the GUI version), for instance, has a menu option that lets you install it as a service so that it runs automatically. The old LLRnet client has a similar option; the new one does not, but I can use the method I mentioned earlier to serve as a "wrapper" and make any application a service.

The reason why I run it as a service is so that it keeps running uninterrupted regardless of who's logged on (or not). Putting it in the Startup folder (or the equivalent registry keys) would just start the application at logon, which means that it only runs when a particular user is logged on. Even if I put it in the All Users Startup folder, it still wouldn't run when nobody's logged on, and even worse, if somebody's logged on and another logs on without the first logging off, the clients will be started a SECOND time leading to all sorts of mayhem.

The service method does work quite well; that's not the issue here. The problem is that, somehow, do.bat is exiting when it's not supposed to, and therefore has to be restarted. That would happen even if I was just running it normally. The tricky thing is, I can't see the console output since I have the command window hidden (which is not necessarily limited to services; it can be done with non-services with a program like runh.exe as well), so I can't see the exact messages telling me why it exited.

S485122 2010-05-02 07:07

Why do you use a ".bat" extension and not a ".cmd" extension ?

With the ".cmd" extension you can redirect the standard output to a file, as you plan to do; you can also redirect the error messages to another file (or to the same file).

For instance if the command for your service is "do.cmd" you can redirect all the output of the batch command to a message file with the command
"do.cmd 1> c:\do\do_out.txt 2> c:\do\do_err.txt"

To have all output go to one file use
"do.cmd 1> c:\do\do_out.txt 2> &1"

As you probably know, if you use >> instead of >, the files will not be overwritten each time the batch launches. You could also use output redirection on individual tasks of your batch and send output you do not need to the null device "nul":
"task 1> nul 2> c:\do\do_err.txt"

Once the batch file stops, just analyse the output files. You can even monitor work in progress.

Jacob
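For readers who wrap the client in a script rather than a service command line, the same redirection idea can be sketched in Python. This is only an illustration: the function name run_logged and the log paths are made up for the example, and nothing here is part of do.bat or LLRnet. Files are opened in append mode, which corresponds to >> rather than >.

```python
import subprocess


def run_logged(command, out_path, err_path=None):
    """Run `command`, appending its output to log files; return the exit code.

    With only out_path given, stderr is merged into the same file,
    which mirrors "1>> out.txt 2>&1". With err_path given, the two
    streams go to separate files.
    """
    with open(out_path, "a") as out:
        if err_path is None:
            result = subprocess.run(command, stdout=out, stderr=subprocess.STDOUT)
            return result.returncode
        with open(err_path, "a") as err:
            result = subprocess.run(command, stdout=out, stderr=err)
            return result.returncode
```

On Windows a call might look like run_logged(["cmd", "/c", "do.bat"], r"c:\do\do_out.txt", r"c:\do\do_err.txt"). The returned exit code is worth logging too, since the open question in this thread is why the script exits at all.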

mdettweiler 2010-05-02 15:47

[quote=S485122;213751]Why do you use a ".bat" extension and not a ".cmd" extension ?

With the ".cmd" extension you can redirect the standard output to a file as you plan to do, you can also redirect the error messages to another (or the same file) file.

For instance if the command for your service is "do.cmd" you can redirect all the output of the batch command to a message file with the command
"do.cmd 1> c:\do\do_out.txt 2> c:\do\do_err.txt"

To have all output go to one file use
"do.cmd 1> c:\do\do_out.txt 2> &1"

As you probably know if you use >> instead of > the files will not be overwritten each time the batch launches. You could also use output redirection on individual tasks of your batch and eliminate output you do not need to the null device "nul"
"task 1> nul 2> c:\do\do_err.txt"

Once the batch file stops, just analyse the output files. You can even monitor work in progress.

Jacob[/quote]
Hmm...I had no idea that didn't work with a .bat extension. I thought .bat and .cmd were essentially equivalent except that .cmd didn't work on Windows 9x--I guess not. :smile:

S485122 2010-05-02 17:09

My post was misleading : the redirection possibilities are the same.

The command.com window is launched through .bat .pif files or by running command.com. You get an error if you try to close the window. It uses the 8.3 file name format. It is initialised by the autoexec.nt and config.nt files. You need to include support for non USA keyboards in those files.

The "Command Prompt" window is launched through .cmd files or by running cmd.exe. You can close the window. It uses the NTFS file name format. It is initialised by the autoexec.bat and config.sys files (very confusing but it is MS.)

Jacob

mdettweiler 2010-05-02 17:46

[quote=S485122;213772]My post was misleading : the redirections possibilities are the same.

The command.com window is launched through .bat .pif files or by running command.com. You get an error if you try to close the window. It uses the 8.3 file name format. It is initialised by the autoexec.nt and config.nt files. You need to include support for non USA keyboards in those files.

The "Command Prompt" window is launched through .cmd files or by running cmd.exe. You can close the window. It uses the NTFS name file format. It is initialised by the autoexec.bat and config.sys files (very confusing but it is MS.)

Jacob[/quote]
Hmm...strange. In my experience, .bat files don't have any problems with non-8.3 file name formats; and I haven't run into any issues closing the window, either. Sure, there's the "Terminate batch file? Y/N" thing that comes up when you hit Ctrl-C, but doesn't that happen with .cmd files as well?

S485122 2010-05-02 20:26

Forget all my ranting about .bat and .cmd :-(

.bat files can also be run by cmd.exe (I finally looked up "Batch file" in Wikipedia...)[quote]Filename extensions
* .bat: The first extension used by Microsoft for batch files. This extension can be run in most Microsoft Operating Systems, including MS-DOS and most versions of Microsoft Windows.
* .cmd: Designates a Windows NT Command Script, which is written for the Cmd.exe shell, and is not backward-compatible with COMMAND.COM.

Differences
The only known difference between .cmd and .bat file processing is that in a .cmd file the ERRORLEVEL variable changes even on a successful command that is affected by Command Extensions (when Command Extensions are enabled), whereas in .bat files the ERRORLEVEL variable changes only upon errors.[/quote]

Jacob

mdettweiler 2010-05-03 03:51

Hmm, seems do.bat -c isn't working properly for me. I ran it with 3 workunits in queue, 2 of which had been completed; here's the lresults_hist.txt output:
[code]593*2^770656-1 is not prime. LLR Res64: 95AC38F43E705AF6 Time : 1110.256 sec.
597*2^770656-1 is not prime. LLR Res64: 3595733D25372C5C Time : 1098.580 sec.
[2010-02-05 23:35:15] Cancelled pair: 447 770657
Cancelled 1 pair(s)!
[/code]
The first two correctly show up in the server's joblist.txt as completed. 447/770657, however, does not--it's still reserved by me! Somehow, despite do.bat confirming "Cancelled 1 pair(s)!", the pair was not canceled but rather just deleted on my end.

I'm sure I have the latest do.bat; I extracted it directly from my copy of version 0.72 that I have on my hard drive, the very same one I downloaded from Karsten's website and uploaded to the noprimeleftbehind.net server. Yet -c is not working properly, which is something I thought we fixed long ago! :max:

Gary, since you have access to the server, can you verify this? What you'll need to do is, after canceling a pair, go to port 3000's joblist.txt and use the Find function to search for the text "k/n" where k and n are replaced with the k and n of the pair canceled. You can then use the Find function to step through all such instances found by the search. That should give you a summary of that k/n pair's life story. :smile:
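The manual Find procedure above can also be scripted. The following is a minimal sketch, assuming only what the post states: that a pair appears in joblist.txt as the literal text "k/n". The function name pair_history is invented for this example, and the rest of the joblist line format is not assumed.

```python
import re


def pair_history(joblist_lines, k, n):
    """Return (line_number, line) for every line mentioning the pair k/n.

    The digit guards keep 447/770657 from also matching
    1447/770657 or 447/7706570.
    """
    pattern = re.compile(rf"(?<!\d){k}/{n}(?!\d)")
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(joblist_lines, start=1)
        if pattern.search(line)
    ]
```

Called as pair_history(open("joblist.txt"), 447, 770657), it lists the entries for that pair in file order, which is the "life story" described above.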

It's possible this problem is just some weird effect of me running under Cygwin (though I can't imagine how that would affect this adversely). If it works just fine in a similar situation (3 in queue, first 2 complete) for Gary, then there must be something weird going on for me.

gd_barnes 2010-05-03 04:33

I only run one Windows machine against the servers, my I7, and it doesn't have the latest version of the Windows client on it that corrected the problem with pairs not being sent that had completed during the time in which there was a server/power outage. (Fortunately there has been no outage of late.) There is no way for you to test it? I'm knee deep in CRUS updates tonight and Monday. After getting back from my trip and having my kids for 3 days, this is the 1st extended period that I have to work on them in about 2 weeks.

I think that Karsten would be the better one to test it.

As for the Linux client, I long ago extensively verified it. I also subsequently cancelled many pairs on my end when I changed servers on many of my machines for the rally. I then spot checked about 10 of them and they were definitely showing up properly as cancelled in the joblist.txt file.

I do know one thing: There was an extreme amount of testing needed to fix the problem of completed pairs being lost during an outage, and the fix also involved the cancellation process. I'm sure that Karsten tested the heck out of his changes but it wouldn't surprise me if there was some lone scenario that might have been missed. I twice thought I was done and everything worked before I thought of one final scenario that turned out not to work and required more changes and testing. I spent 5-6 hours on the Linux client getting it right for the "dropped completed pairs during a server outage/disconnect" issue. The thing that took so long is that everything was intertwined. When I fixed those final 2 issues that I had, I had to retest all prior scenarios because I had already made at least one fix where I broke something else in my testing.


Gary

mdettweiler 2010-05-03 05:15

[quote=gd_barnes;213809]I only run one Windows machine against the servers, my I7, and it doesn't have the latest version of the Windows client on it that corrected the problem with pairs not being sent that had completed during the time in which there was a server/power outage. (Fortunately there has been no outage of late.) There is no way for you to test it? I'm knee deep in CRUS updates tonight and Monday. After getting back from my trip and having my kids for 3 days, this is the 1st extended period that I have to work on them in about 2 weeks.

I think that Karsten would be the better one to test it.

As for the Linux client, I long ago extensively verified it. I also subsequently cancelled many pairs on my end when I changed servers on many of my machines for the rally. I then spot checked about 10 of them and they were definitely showing up properly as cancelled in the joblist.txt file.

I do know one thing: There was an extreme amount of testing needed to fix the problem with it not losing completed pairs during an outage and it also involved the cancellation process. I'm sure that Karsten tested the heck out of his changes but it wouldn't surprise me that there was some lone scenario that might have been missed. I twice thought I was done and everything worked before I thought of one final scenario that turned out not to work and required more changes and testing. I spent 5-6 hours on the Linux client getting it right for the "dropped completed pairs during a server outage/disconnect" issue. The thing that took so long is that everything was intertwined. When I fixed those final 2 issues that I had, I had to retest all prior scenarios because I had already made at least one fix where I broke something else in my testing.


Gary[/quote]
Yeah, you're right, I suppose I could do some more testing myself...I was originally thinking of you since, unlike me, you wouldn't be running the script through Cygwin. But then I remembered that, duh, there's nothing keeping me from simply [i]not[/i] running it through Cygwin for testing purposes. :rolleyes: Okay, I'll see what I can do.

kar_bon 2010-05-03 06:22

Please remember: LLRnet only updates the joblist when the server receives new results from clients! The joblist.txt contains something like this:
[code]
-- update [2010-04-16 15:54:24]
...
[/code]

So it's the normal behaviour that after returning results, the joblist contains more than one entry for the same pair. Same with cancelling a pair.

The joblist will be cleared when the prunetime is over. I personally set this to a few hours or less (30 min), depending on the n-level of the pairs.

Since Max's history output shows the cancellation to be correct, wait until the prunetime is over and all should be OK then.

Another thing to keep in mind: if the server does not receive any request from a client, things like pruning pairs will never happen! The server runs as a job waiting for requests from clients; it does not do things on a timer.

Suggestion:
Max, wait until the prunetime is over, then look again at the joblist.
If this is an error, I'll have a look at it.
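The "no timer" point can be illustrated with a toy model. This is not LLRnet's code; the class LazyJobList, its fields and the notion of pruning by entry age are all assumptions made for the sketch. The only thing it is meant to show is that cleanup which runs as a side effect of a request never happens while no client connects.

```python
class LazyJobList:
    """Toy job list that prunes old entries only when a request arrives."""

    def __init__(self, prune_seconds):
        self.prune_seconds = prune_seconds
        self.entries = []  # list of (timestamp, text)

    def handle_request(self, now, text):
        # Pruning happens here and nowhere else: no background timer exists.
        self.entries = [
            (stamp, entry)
            for stamp, entry in self.entries
            if now - stamp < self.prune_seconds
        ]
        self.entries.append((now, text))
```

With prune_seconds=1800, an entry added at time 0 is still in the list at time 5000 if nobody has contacted the server since. It disappears only when the next request comes in.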

mdettweiler 2010-05-03 16:00

[quote=kar_bon;213815]Please remember: LLRnet is only updating the joblist when the server receives new results from clients! The joblist.txt contains something like this:
[code]
-- update [2010-04-16 15:54:24]
...
[/code]

So it's the normal behaviour that after returning results, the joblist contains more than one entry for the same pair. Same with cancelling a pair.

The joblist will cleared when the pruntime is over. I personally set this timing to about some hours or less (30 min), depends on the n-level of the pairs.

As Max' ouput of the history shows the cancellation to be correct., wait until the prunetime is over and all should be ok than.

Another thing to hold in mind: If the server do not receives any request from a client, things like pruning pairs will never happen! The server is running as a job waiting for requests from clients and not doing things with a timer.

Suggestion:
Max, wait until prunetime is over, than look again to the joblist.
If this is an error, I'll have a look at it.[/quote]
But, I thought the server usually adds an "update" entry to the joblist for canceled pairs as well? (Otherwise it risks forgetting the cancellation in case of a power failure, etc.)

Edit: BTW, I just checked the server and the pair in question, 447/770657, is still listed as reserved by me in joblist.txt despite the server having surely pruned a number of times since last night.

kar_bon 2010-05-03 22:15

I've found the issue in the cancellation of k/n pairs with the do.bat script:

The part of the script that cancels pairs looks like this:
[code]
:jobcancel3
llrnet -c
for %%F in (workfile.txt) do set wsize=%%~zF
if %wsize% GTR 5 goto jobcancel3
[color=red]rem[/color] llrnet -c >nul
rem echo All jobs cancelled!
del tosend.txt, workfile.txt, workfile.bak, llr.ini, z*
goto bye
[/code]

Remove the red 'rem', so that 'llrnet -c' is called one more time after the loop.
This fixes Max's problem.

Note:
I'll update the WIN-version in the next few days with this fix and enhancements like

- no GUI-code (not needed anymore)
- WUCacheSize = 0 still reserves one pair -> it must be cancelled with 'do -c'
- cancel all pairs with one call of 'llrnet -c' -> for now, 'llrnet -c' has to be called once for every pair
- delete 'serviceName' from llr-clientconfig.txt (not needed)
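The control flow of the corrected :jobcancel3 section can be restated as a short Python sketch. The 5-byte threshold and the workfile.txt name come from the batch code above; the function cancel_all, the injected run_cancel callable (standing in for one 'llrnet -c' call) and the max_calls safety limit are assumptions added for the example. The thread does not say why the extra call is needed, only that restoring it fixed the problem.

```python
import os


def cancel_all(run_cancel, workfile="workfile.txt", empty_size=5, max_calls=1000):
    """Call run_cancel() until the workfile is (nearly) empty, then once more.

    Returns the number of calls made. run_cancel stands in for a single
    invocation of 'llrnet -c', which cancels one pair per call.
    """
    calls = 0
    while True:
        run_cancel()
        calls += 1
        size = os.path.getsize(workfile) if os.path.exists(workfile) else 0
        if size <= empty_size:
            break
        if calls >= max_calls:
            raise RuntimeError("workfile is not shrinking; giving up")
    # The final call that the 'rem' had disabled.
    run_cancel()
    return calls + 1
```

With three pairs in the workfile this makes four calls: three inside the loop and the final one after it.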

mdettweiler 2010-05-03 23:53

[quote=kar_bon;213912]I've found the issue in the cancellation of kn-pairs with do.bat-script:

The special part, the script cancel pairs looks like this:
[code]
:jobcancel3
llrnet -c
for %%F in (workfile.txt) do set wsize=%%~zF
if %wsize% GTR 5 goto jobcancel3
[COLOR=red]rem[/COLOR] llrnet -c >nul
rem echo All jobs cancelled!
del tosend.txt, workfile.txt, workfile.bak, llr.ini, z*
goto bye
[/code]

Remove the red remark-code, so that 'llrnet -c' is called again.
This fixes Max' problem.

Note:
I'll update the WIN-version the next days with this and enhancements like

- no GUI-code (not needed anymore)
- WUChacheSize = 0 still reserve one pair -> must cancelled with 'do -c'
- cancel all pairs with one call of 'llrnet -c' -> for now, calling for every pair 'llrnet -c'
- delete 'serviceName' from llr-clientconfig.txt (not needed)[/quote]
That worked--thanks! :smile:

gd_barnes 2010-05-04 00:16

[quote=kar_bon;213912]I've found the issue in the cancellation of kn-pairs with do.bat-script:

The special part, the script cancel pairs looks like this:
[code]
:jobcancel3
llrnet -c
for %%F in (workfile.txt) do set wsize=%%~zF
if %wsize% GTR 5 goto jobcancel3
[COLOR=red]rem[/COLOR] llrnet -c >nul
rem echo All jobs cancelled!
del tosend.txt, workfile.txt, workfile.bak, llr.ini, z*
goto bye
[/code]Remove the red remark-code, so that 'llrnet -c' is called again.
This fixes Max' problem.

Note:
I'll update the WIN-version the next days with this and enhancements like

- no GUI-code (not needed anymore)
- WUChacheSize = 0 still reserve one pair -> must cancelled with 'do -c'
- cancel all pairs with one call of 'llrnet -c' -> for now, calling for every pair 'llrnet -c'
- delete 'serviceName' from llr-clientconfig.txt (not needed)[/quote]

OK, then I should probably do mostly the same for the Linux client. Hum. Karsten, that's going to be a lot of work to make the llrnet -c command do that. Is that necessary? After all do -c does what is needed. Won't more files than just the script need to be modified for the llrnet -c change?

When you're done with your changes, can you be specific about where you changed them? Then I'll tweak the Linux client. Here is what I'm thinking:

1. No GUI-code: Not a script change; just removal of stuff from other client files. Can you state which files will change?
2. WuCacheSize: A script change only to get 1 pair if WuCacheSize is 0 or 1.
3. Cancel all pairs with llrnet -c: Both a script change and possible changes to other client files. As asked above, do we really need to do this? If other client files need to be changed, do you know which ones?
4. Obvious quick change to delete a line in llr-clientconfig.txt.


Can you let me know if the above is approximately correct? I want to get an idea ahead of time of what I'll be looking at changing on the Linux client.


Thanks,
Gary

gd_barnes 2010-05-04 00:20

[quote=mdettweiler;213918]That worked--thanks! :smile:[/quote]

Great, glad to see that that works. I'll tell you, that is a very difficult bug to find in testing. I had all kinds of issues like that when correcting the Linux client for the dropping of completed pairs during a server outage/downtime. The fixes would "appear" to work but then the appropriate update wouldn't get written to joblist.txt. That was a very tricky fix; both for the normal pair returns and for the pair cancellation.

Max, question for you: Does PRPnet have a way to cancel pairs? When I've run it in the past, I haven't been sure how to return pairs to the server. It seems to return pairs immediately that have not been started but if it's in the middle of processing one, of course (as it should), it does not return it. I'd like to be able to return a pair to the server that is already partially worked on. Is there a command for that?


Gary

kar_bon 2010-05-04 00:27

[QUOTE=gd_barnes;213921]When you're done with your changes, can you be specific about where you changed them. Then I'll tweak the Linux client. Here is what I'm thinking:

1. No GUI-code: Not a script change; just removal of stuff from other client files. Can you state which files will change?
2. WuCacheSize: A script change only to get 1 pair if WuCacheSize is 0 or 1.
3. Cancel all pairs with llrnet -c: Both a script change and possible changes to other client files. As asked above, do we really need to do this?
4. Obvious quick change to delete a line in llr-clientconfig.txt.

Can you let me know if the above is approximately correct? I want to get an idea ahead of time of what I'll be looking at changing on the Linux client.
[/QUOTE]

1. I did this weeks ago for a version without GUI support. I have to check the diffs, but I'm sure I only changed 'llrnet.lua' and deleted 'gui.lua'!
2. A one-liner in 'llrnet.lua'.
3. Small change in 'llrnet.lua': instead of 'if' use 'while'! The call from the 'do'-script is easier then!
4. Correct! So 'llr-clientconfig.txt' only needs 4 parameters for server, port, username and WUCacheSize. Easier for users, especially new users, to set up a client!

Hope this answers some questions.

mdettweiler 2010-05-04 00:33

[quote=gd_barnes;213923]Max, question for you: Does PRPnet have a way to cancel pairs? When I've run it in the past, I haven't been sure how to return pairs to the server. It seems to return pairs immediately that have not been started but if it's in the middle of processing one, of course (as it should), it does not return it. I'd like to be able to return a pair to the server that is already partially worked on. Is there a command for that?[/quote]
Well, the process is a little different than in LLRnet, though yes, the same end results can be achieved. Per this section in prpclient.ini:
[code]// This option is used to default the startup option if the previous
// shutdown left uncompleted workunits.
// 0 - prompt
// 1 - Return completed work units and abandon the rest
// 2 - Complete assigned work units
startoption=2

// This option is used to default the stop option when the client
// is terminated
// 0 - prompt
// 1 - Return completed work units and abandon the rest
// 2 - Return completed work units
// 3 - Do nothing with current work units and terminate the process
stopoption=2[/code]
By default, both are set to 0, which means that when the client is stopped or started and there are work units in queue, it will ask you what to do with them. Alternatively, you can provide the answers to those queries in the ini file, as I've done above. Essentially, startoption=2 and stopoption=3 mimic do.bat/pl's normal operative behavior, i.e. running without any flags. (stopoption=2 differs only in that when shutting down, it reports anything already completed in the current batch.)

Setting startoption=1 is equivalent to running do.bat/pl with the -c flag. Setting stopoption=1 is akin to running do.bat/pl without any flags, then after you stop it with Ctrl-C, immediately running -c.
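The stop options can be summarised as a small decision table. This sketch only restates the prpclient.ini comments and the description above; it is not PRPnet code, and the function on_stop and its return shape are invented for the example. It treats "abandon" as telling the server the pairs are released, in line with the comparison to the -c flag.

```python
def on_stop(stopoption, completed, pending):
    """Return (reported_to_server, released_on_server, kept_in_local_queue)."""
    if stopoption == 1:
        # Return completed work units and abandon (release) the rest.
        return list(completed), list(pending), []
    if stopoption == 2:
        # Return completed work units; unfinished ones stay queued locally.
        return list(completed), [], list(pending)
    if stopoption == 3:
        # Do nothing: everything stays in the local queue.
        return [], [], list(completed) + list(pending)
    raise ValueError("stopoption 0 prompts the user; other values are not defined here")
```

For example, on_stop(2, ["done1"], ["todo1"]) reports done1 and keeps todo1 queued, so a later run with startoption=1 would still be needed to release todo1.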

gd_barnes 2010-05-04 06:13

[quote=kar_bon;213924]1. I've done this weeks ago for a version only without GUI-support. I have to check the diffs, but I'm sure I only changed 'llrnet.lua' and deleted 'gui.lua'!
2. A one-liner in 'llrnet.lua'.
3. Small change in 'llrnet.lua': instead of 'if' use 'while'! The call from the 'do'-script is easier then!
4. Correct! So 'llr-clientconfig.txt' only needs 4 parameters for server, port, username and WUCacheSize. Easier for users, especially new users to set up a client!

Hope this answer some questions.[/quote]

OK, here is what I'll do for the Linux client:

For #1 and #3, can you check the Linux client and see if those changes are already there? I don't know if they would be or not. I can't remember if we tried to sync up all of the client files before releasing the 1st versions of our scripts. If they are not synced up, it will be just a matter of me copying over your llrnet.lua file and deleting the gui.lua file from the Linux script and then running a quick test on it.

For #2, since there are no script changes, I'll just copy over your llrnet.lua file into the Linux client after you've made your changes to it.

For #4, as stated, a quick change to bring it down to only the 4 parameters.

Good, this should all go fairly quickly. Let me know when you've made the fixes and I'll get everything copied over to the Linux client.

There is one extremely minor fix that I will do for the Linux do.pl script. If no pairs are returned as completed when doing ./do.pl -c, the workfile.res is not deleted like it should be. I'll make that minor fix and update the do.pl version #. It will be good to have a new version # since we'll be making changes to several of the client files for the various items above. This way, it will be easy for people to know that they have an updated version of the Linux as well as the Windows client.

One more thing that you might test: I seem to remember that the workfile.res file is not deleted on the Windows client either whenever do -c is executed but there are no completed pairs in lresults.txt. That might have been fixed with the latest Windows client but I'm not sure. Can you check that?


Gary

gd_barnes 2010-05-04 06:20

[quote=mdettweiler;213925]Well, the process is a little different than in LLRnet, though yes, the same end results can be achieved. Per this section in prpclient.ini:
[code]// This option is used to default the startup option if the previous
// shutdown left uncompleted workunits.
// 0 - prompt
// 1 - Return completed work units and abandon the rest
// 2 - Complete assigned work units
startoption=2

// This option is used to default the stop option when the client
// is terminated
// 0 - prompt
// 1 - Return completed work units and abandon the rest
// 2 - Return completed work units
// 3 - Do nothing with current work units and terminate the process
stopoption=2[/code]
By default, both are set to 0, which means that when the client is stopped or started and there are work units in queue, it will ask you what to do with them. Alternatively, you can provide the answers to those queries in the ini file, as I've done above. Essentially, startoption=2 and stopoption=3 mimic do.bat/pl's normal operative behavior, i.e. running without any flags. (stopoption=2 differs only in that when shutting down, it reports anything already completed in the current batch.)

Setting startoption=1 is equivalent to running do.bat/pl with the -c flag. Setting stopoption=1 is akin to running do.bat/pl without any flags, then after you stop it with Ctrl-C, immediately running -c.[/quote]


OK, I still need a clarification because I remember reading that before and still couldn't quite get it. If I'm merrily running along with stopoption=2 and then stop the client with Ctrl-C, obviously I know that it returns all completed results. What I don't know is if it properly cancels all remaining in-process and unprocessed pairs and lets the server know that they are no longer reserved.

To make the question clearer: If I'm running with stopoption=2 and then hit Ctrl-C, do I then still need to do a final run with STARToption=1 to return the in-process/untested pair(s) to the server so that they can be immediately handed back out for testing? If that is the case, then that is far from clear by reading the above.

To make it clearer: If startoption (or stopoption) 1 returns incomplete pairs to the server -or- if it somehow "tells" the server that the pairs are no longer reserved, then "abandon the rest" is very misleading. That needs to be reworded to show something like "cancel the rest" or "release the rest". The word "abandon" is effectively what people do when they don't do the "do -c" command on the LLRnet client. Those pairs are abandoned and so the server has to wait 2 days to hand them back out again. We do not want to "abandon" them, we want to return them to the server or tell the server that they are no longer reserved.


Gary

mdettweiler 2010-05-04 06:38

[quote=gd_barnes;213942]One more thing that you might test: I seem to remember that the workfile.res file is not deleted on the Windows client either whenever do -c is executed but there are no completed pairs in lresults.txt. That might have been fixed with the latest Windows client but I'm not sure. Can you check that?[/quote]
I noticed this behavior just recently with the latest version of do.bat, so it is present in that version.

mdettweiler 2010-05-04 06:42

[quote=gd_barnes;213943]OK, I still need a clarfication because I remember reading that before and still couldn't quite get it. If I'm merrily running along with stopoption=2 and then stop the client with CTL-C, obviously I know that it returns all completed results. What I don't know is if it properly cancels all remaining inprocess and unprocess pairs and lets the server know that they are no longer reserved.

To make the question clearer: If I'm running with stopoption=2 and then hit CTL-C, do I then still need to do a final run with STARToption=1 to return the inprocess/untested pair(s) to the server so that they can be immediately handed back out for testing? If that is the case, then that is far from clear by reading the above.

To make it clearer: If startoption (or stopoption) 1 returns incompleted pairs to the server -or- if it somehow "tells" the server that the pairs are no longer reserved, then "abandon the rest" is very misleading. That needs to be reworded to show something like "cancel the rest" or "release the rest". The word "abandon" is effectively what people do when they don't do the "do -c" command on the LLRnet client. Those pairs are abandoned and so the server has to wait 2 days to hand them back out again. We do not want to "abandon" them, we want to return them to the server or tell the server that they are no longer reserved.


Gary[/quote]
If you're running stopoption=2 and Ctrl-C the client, it will return anything that's completed, and leave any incomplete tests in the queue. You would then need to do another run with startoption=1 to clean those out.

This is essentially just like how do.bat/pl behave; in fact with stopoption=3 they behave exactly the same. stopoption=2 only differs in that anything that's already done when you Ctrl-C will be returned to the server before shutdown; nothing else is touched.

Agreed that the "abandon" wording is a bit confusing. Mark, if you're reading this, can you change this to say "cancel" instead in the next version? That will hopefully be a little less confusing since "abandon" implies that they're just being dropped without any word to the server.

gd_barnes 2010-05-04 06:44

Very good. Thanks for the clarification. I'm glad to know that it's no more steps to return everything than with LLRnet.

kar_bon 2010-05-04 09:06

[quote=mdettweiler][quote=gd_barnes;213942]One more thing that you might test: I seem to remember that the workfile.res file is not deleted on the Windows client either whenever do -c is executed but there are no completed pairs in lresults.txt. That might have been fixed with the latest Windows client but I'm not sure. Can you check that?[/quote]
I noticed this behavior just recently with the latest version of do.bat, so it is present in that version.[/quote]

Yes, I noticed this, too. I've fixed this in the '-c' option, and it will be available with all the other fixes in the next few days!

mdettweiler 2010-05-10 01:01

With how port 3000 ran out of pairs and went idle for a couple of hours just now, it got me thinking about a possible feature we may want to consider adding to a future version of do.bat/pl: backup servers. One very handy feature of PRPnet is that if a server goes down and looks like it's going to stay down (i.e., two or three successive connection attempts have failed), it can automatically fall back to an alternate server until the main one is back up--ensuring that the client never is without work. I wonder how hard it would be to implement this in do.bat/pl? That way, for instance, somebody could configure port 3000 as a primary server, but if it runs out of work, it could get work from port 6000 until 3000 is refilled.

kar_bon 2010-05-10 06:10

This is certainly possible and could be a good way to avoid idle clients.

In the batch version I could add a second server as a new parameter, write it to llr-clientconfig.txt after a timeout connecting to the current server, and call llrnet again with the altered config file.
Another option: make a second llr-clientconfig.txt with settings for another server, rename it to the current config file, and call llrnet.
But if the second server doesn't respond either, there must be a way to terminate that, too.

PS: I can implement this in the next update, but have to test some other things, too, so could take some days until ready.

gd_barnes 2010-05-10 06:14

[quote=kar_bon;214505]This is sure possible and could be a good way to avoid idle clients then.

In the batch-version I could set a second server as a new parameter, writing this to llr-clientconfig.txt after timeout of connecting to current server and call llrnet again with altered config-file.
Another option: make a second llr-clientconfig.txt with settings for another server and rename this to the current config-file and calling llrnet.
But if the second server don't respond as well, there must be a termination of that, too.[/quote]

This will be tricky and will have to be thought out. How to handle it if the original server comes back online?

It's almost as if, with each attempt to get new pairs, you have to check the original server, and after whatever number of tries, you then go to the backup server. This sounds like what PRPnet does and if so, would be the way to go.

As for what to do if the second server is unavailable...if you are going to do this, I would make it keep going back and forth between the 2 servers. It could get kind of hairy trying more than 2 servers. That would be like an "initial" release of such a thing. Later, you could do an updated release for more than 2 servers.

BTW, before doing these, please make sure the previous fixes are in place, tested, and complete so that I can incorporate them in the Linux client. In other words...fixes before enhancements.

kar_bon 2010-05-10 06:20

[QUOTE=gd_barnes;214507]This will be tricky and will have to be thought out. How to handle it if the original server comes back online?

It's almost as if, with each attempt to get new pairs, you have to check the original server, and after whatever number of tries, you then go to the backup server. This sounds like what PRPnet does and if so, would be the way to go.
[/QUOTE]

I think we should not make it so complex: this feature would only be used if 'the' server is offline or out of pairs, so that the client doesn't go idle. I think the user should keep an eye on his clients and can switch back to the original server when it is available. I could also write a small script that works on many clients at once to cancel pairs/submit results to the second server and continue with the original server.

mdettweiler 2010-05-10 12:59

[quote=gd_barnes;214507]This will be tricky and will have to be thought out. How to handle it if the original server comes back online?

It's almost as if, with each attempt to get new pairs, you have to check the original server, and after whatever number of tries, you then go to the backup server. This sounds like what PRPnet does and if so, would be the way to go.[/quote]
Exactly--that's what PRPnet does and what I was pretty much thinking for this.

[quote]As for what to do if the second server is unavailable...if you are going to do this, I would make it keep going back and forth between the 2 servers. It could get kind of hairy trying more than 2 servers. That would be like an "initial" release of such a thing. Later, you could do an updated release for more than 2 servers.[/quote]
Yeah, I see what you mean. Essentially what PRPnet does is to try the first server twice, then if that doesn't work try the next one, etc. on down the list. When it runs out of servers, it sleeps 60 seconds (or whatever you've specified in the prpclient.ini file) and tries again from the top. It can sound a tad messy, but in the end it turns out to work pretty well.
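
A rough sketch of that rotation, in Python only for illustration. It is not PRPnet's or LLRnet's actual code: the function name, the try_connect callback, and max_rounds are hypothetical, and giving every server (not just the first) two tries is an assumption.

```python
import time


def pick_server(servers, try_connect, tries_per_server=2,
                sleep_seconds=60, max_rounds=None, sleep=time.sleep):
    """Return the first server that answers, or None if giving up.

    servers      -- list in priority order, original server first
    try_connect  -- hypothetical callback: server -> True if it has work
    max_rounds   -- stop after this many passes over the list
                    (None means retry forever)
    """
    rounds = 0
    while True:
        # Always start from the top, so the original server is
        # re-checked first on every pass.
        for server in servers:
            for _ in range(tries_per_server):
                if try_connect(server):
                    return server
        rounds += 1
        if max_rounds is not None and rounds >= max_rounds:
            return None
        sleep(sleep_seconds)  # whole list failed: wait, then retry
```

Because each pass starts at the top of the list, the client goes back to the original server by itself once it is online again. Setting max_rounds gives the termination case mentioned earlier in the thread.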

I can think of one thing, though, that might make it a little trickier for LLRnet: whereas PRPnet keeps its save files separate by server (for instance, work_G9000.save), LLRnet just uses one file (workfile.txt). That could lead to some potential confusion and mixed-up work unless the implementation is airtight on the backend; one possibility would be to change LLRnet to append the port # to the workfile.txt name a la PRPnet, though I imagine that would be tough to implement.
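
To illustrate the naming idea only (Python, hypothetical function name; LLRnet itself has no such option as far as this thread states, and the exact pattern is an assumption):

```python
import os


def workfile_for(port, base="workfile.txt"):
    """Build a per-server workfile name by appending the port number."""
    stem, ext = os.path.splitext(base)
    return f"{stem}_{port}{ext}"
```

With this pattern, a server on port 9000 would get workfile_9000.txt, so pairs from two servers never share a file.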

[quote]BTW, before doing these, please make sure the previous fixes are in place, tested, and complete so that I can incorporate them in the Linux client. In other words...fixes before enhancements.[/quote]
Indeed, most definitely. I primarily suggested this as a possibility to consider for the future; it may not turn out to be feasible due to the complexity of changing the existing LLRnet code (after all, it took at least a couple of major PRPnet releases to initially get the multiple-servers thing working in an airtight fashion, as I recall).

