|
|
#67 |
|
May 2007
Kansas; USA
10100010011011₂ Posts |
Carlos,
We're down to issues #2 and #3 in the problem log in post #42 here, which are yours. We think that issue #2 is the same as issue #6. Can you download the new client and confirm that it has been fixed? As requested by Karsten, we'll need more info on problem #3.
Thanks,
Gary
Last fiddled with by gd_barnes on 2010-03-31 at 06:30
|
|
|
|
|
#68 |
|
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts |
I'm encountering a rather strange issue with do.bat on Windows Vista. On my quad, I have it set to run as a service (using a method I picked up on the Free-DC forum a while back that lets you run any application as a service) so that it runs under the "LOCAL SYSTEM" account whenever the computer is on, regardless of who's logged on (I'm not the primary user of the computer, and those who are commonly log on and off of their usernames).
For the rally I installed four copies of do.bat for the first time on this computer and set them up as services. (I'd never had occasion to run LLRnet on this machine since the new client came out, so this was my first experiment with that combo.) I purposely set op_connect = TRUE in the configuration section so that the clients would keep trying to reconnect in case of an outage, because if they stopped, it could go unnoticed for hours until I noticed a drop in my output, logged into the machine remotely via VNC, and restarted the services manually.
Yet, despite this, something is making the clients stop every so often. Unfortunately, I have no access whatsoever to the clients' console output because they run as services; the best clue I have is that when they stop, tosend.txt is present in the client directory but not workfile.txt or any of the other files indicating that the client is in the middle of a batch. Essentially, it looks just like what happens when the client gives up after a few failed connections with op_connect = FALSE, yet I have it set to TRUE, so that wouldn't make sense. Also, I'm sure it's not just in the "waiting period" where it pauses 60 seconds before another reconnect: there's no way it would wait like that for hours on end when my connection is perfectly good and the other cores on the same machine are connecting without issue.
I won't be able to do it during this rally, but sometime afterwards I'm going to try changing the copies of do.bat on that machine to echo their output to a file instead of to the console (since they have no visible console). That way I can hopefully get some clues as to what's going on. So, stay tuned...I hope to have a better idea of what's going on here in the near future.
|
|
|
|
|
|
#69 |
|
May 2007
Kansas; USA
3³×5×7×11 Posts |
What is a service and why does one need to be used?
I realize it's a machine on which you prefer that the command prompt windows not be shown, but what does that have to do with a service (if anything)?
|
|
|
|
|
#70 | |
|
A Sunny Moo
Aug 2007
USA (GMT-5)
1869₁₆ Posts |
Quote:
The reason why I run it as a service is so that it keeps running uninterrupted regardless of who's logged on (or not). Putting it in the Startup folder (or the equivalent registry keys) would just start the application at logon, which means that it only runs when a particular user is logged on. Even if I put it in the All Users Startup folder, it still wouldn't run when nobody's logged on, and even worse, if somebody's logged on and another logs on without the first logging off, the clients will be started a SECOND time, leading to all sorts of mayhem.
The service method does work quite well; that's not the issue here. The problem is that, somehow, do.bat is exiting when it's not supposed to, and therefore has to be restarted. That would happen even if I was just running it normally. The tricky thing is, I can't see the console output since I have the command window hidden (which is not necessarily limited to services; it can be done with non-services with a program like runh.exe as well), so I can't see the exact messages telling me why it exited.
Last fiddled with by mdettweiler on 2010-05-01 at 22:49
|
|
|
|
|
|
#71 |
|
Sep 2006
Brussels, Belgium
2×3×281 Posts |
Why do you use a ".bat" extension and not a ".cmd" extension?
With the ".cmd" extension you can redirect the standard output to a file as you plan to do; you can also redirect the error messages to another file (or to the same one). For instance, if the command for your service is "do.cmd", you can redirect all the output of the batch command to message files with the command
"do.cmd 1> c:\do\do_out.txt 2> c:\do\do_err.txt"
To have all output go to one file use
"do.cmd 1> c:\do\do_out.txt 2>&1"
As you probably know, if you use >> instead of > the files will not be overwritten each time the batch launches.
You could also use output redirection on individual tasks of your batch and send output you do not need to the null device "nul":
"task 1> nul 2> c:\do\do_err.txt"
Once the batch file stops, just analyse the output files. You can even monitor work in progress.
Jacob
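For anyone who would rather do the logging from a small wrapper than on the service command line, here is a minimal sketch in Python. It assumes Python is installed on the machine; `run_logged` and the log file names are made up for illustration and are not part of LLRnet or do.bat. It appends the child's standard output and error messages to two separate files, much like `1>> ... 2>> ...` does:

```python
import subprocess
import sys


def run_logged(cmd, out_path, err_path):
    """Run cmd and append its stdout and stderr to two log files.

    Roughly equivalent to:  cmd 1>> out_path 2>> err_path
    Returns the exit code of the command.
    """
    # Mode "a" appends, so output from earlier runs is kept (like >>).
    with open(out_path, "a") as out, open(err_path, "a") as err:
        return subprocess.call(cmd, stdout=out, stderr=err)


# Hypothetical usage on Windows (paths are examples only):
#   run_logged(["cmd", "/c", "do.bat"], r"c:\do\do_out.txt", r"c:\do\do_err.txt")
```

Because the files are opened in append mode, the messages printed just before the client exits are still there after the service is restarted.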
|
|
|
|
|
#72 | |
|
A Sunny Moo
Aug 2007
USA (GMT-5)
1869₁₆ Posts |
Quote:
|
|
|
|
|
|
|
#73 |
|
Sep 2006
Brussels, Belgium
2×3×281 Posts |
My post was misleading: the redirection possibilities are the same.
The command.com window is launched through .bat and .pif files or by running command.com. You get an error if you try to close the window. It uses the 8.3 file name format. It is initialised by the autoexec.nt and config.nt files. You need to include support for non-USA keyboards in those files.
The "Command Prompt" window is launched through .cmd files or by running cmd.exe. You can close the window. It uses the NTFS file name format. It is initialised by the autoexec.bat and config.sys files (very confusing but it is MS.)
Jacob
|
|
|
|
|
#74 | |
|
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Quote:
|
|
|
|
|
|
|
#75 | |
|
Sep 2006
Brussels, Belgium
696₁₆ Posts |
Forget all my ranting about .bat and .cmd :-(
.bat files can also be run by cmd.exe (I finally looked up "Batch file" in Wikipedia...) Quote:
|
|
|
|
|
|
|
#76 |
|
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts |
Hmm, seems do.bat -c isn't working properly for me. I ran it with 3 workunits in queue, 2 of which had been completed; here's the lresults_hist.txt output:
Code:
593*2^770656-1 is not prime. LLR Res64: 95AC38F43E705AF6 Time : 1110.256 sec.
597*2^770656-1 is not prime. LLR Res64: 3595733D25372C5C Time : 1098.580 sec.
[2010-02-05 23:35:15] Cancelled pair: 447 770657
Cancelled 1 pair(s)!
I'm sure I have the latest do.bat; I extracted it directly from my copy of version 0.72 that I have on my hard drive, the very same one I downloaded from Karsten's website and uploaded to the noprimeleftbehind.net server. Yet -c is not working properly, which is something I thought we fixed long ago!
Gary, since you have access to the server, can you verify this? What you'll need to do is, after canceling a pair, go to port 3000's joblist.txt and use the Find function to search for the text "k/n", where k and n are replaced with the k and n of the pair canceled. You can then use the Find function to step through all such instances found by the search. That should give you a summary of that k/n pair's life story.
It's possible this problem is just some weird effect of me running under Cygwin (though I can't imagine how that would affect this adversely). If it works just fine in a similar situation (3 in queue, first 2 complete) for Gary, then there must be something weird going on for me.
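If stepping through Find results by hand gets tedious, the same search can be scripted. This is only a rough sketch in Python: it assumes joblist.txt is plain text and that a pair shows up literally as "k/n" as described above (the actual file layout isn't shown in this thread), and `find_pair` is a made-up helper name.

```python
def find_pair(joblist_path, k, n):
    """Return (line_number, line) for every line that mentions the pair k/n."""
    # Assumption: the pair appears literally as "k/n" in the file.
    # A plain substring match can also hit longer numbers such as
    # "1447/770657", so check the hits by eye.
    needle = f"{k}/{n}"
    hits = []
    with open(joblist_path, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if needle in line:
                hits.append((lineno, line.rstrip("\n")))
    return hits


# Hypothetical usage for the pair cancelled above:
#   for lineno, line in find_pair("joblist.txt", 447, 770657):
#       print(lineno, line)
```

The hits come back in file order, so reading them top to bottom gives the pair's history as the server recorded it.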
|
|
|
|
|
#77 |
|
May 2007
Kansas; USA
3³×5×7×11 Posts |
I only run one Windows machine against the servers, my I7, and it doesn't have the latest version of the Windows client on it, the one that corrected the problem with pairs not being sent that had completed during a server/power outage. (Fortunately there has been no outage of late.) Is there no way for you to test it? I'm knee deep in CRUS updates tonight and Monday. After getting back from my trip and having my kids for 3 days, this is the 1st extended period in about 2 weeks that I have to work on them.
I think that Karsten would be the better one to test it. As for the Linux client, I extensively verified it long ago. I also subsequently cancelled many pairs on my end when I changed servers on many of my machines for the rally. I then spot checked about 10 of them and they were definitely showing up properly as cancelled in the joblist.txt file.
I do know one thing: an extreme amount of testing was needed to fix the problem of losing completed pairs during an outage, and it also involved the cancellation process. I'm sure that Karsten tested the heck out of his changes, but it wouldn't surprise me if some lone scenario was missed. I twice thought I was done and everything worked before I thought of one final scenario that turned out not to work and required more changes and testing. I spent 5-6 hours on the Linux client getting it right for the "dropped completed pairs during a server outage/disconnect" issue. The thing that took so long is that everything was intertwined. When I fixed those final 2 issues that I had, I had to retest all prior scenarios because I had already made at least one fix where I broke something else in my testing.
Gary
Last fiddled with by gd_barnes on 2010-05-03 at 04:45
|
|
|