mersenneforum.org  

2010-03-31, 06:29   #67
gd_barnes
May 2007
Kansas; USA
10156₁₀ Posts

Carlos,

We're down to issues #2 and #3 in the problem log in post #42 here, which are yours.

We think that issue #2 is the same as issue #6. Can you download the new client and confirm that it has been fixed?

As requested by Karsten, we'll need more info. on problem #3.


Thanks,
Gary

Last fiddled with by gd_barnes on 2010-03-31 at 06:30

2010-05-01, 14:58   #68
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

I'm encountering a rather strange issue with do.bat on Windows Vista. On my quad, I have it set to run as a service (using a method I picked up on the Free-DC forum a while back that lets you run any application as a service) so that it runs under the "LOCAL SYSTEM" account whenever the computer is on, regardless of who's logged on (I'm not the primary user of the computer, and those who are commonly log on and off under their own usernames).

For the rally I installed four copies of do.bat for the first time on this computer and set them up as services. (I'd never had occasion to run LLRnet on this machine since the new client came out, so this was my first experiment with that combo.) I purposely set op_connect = TRUE in the configuration section so that the clients would keep trying to reconnect in case of an outage--because if they stopped, it could go unnoticed for hours until I noticed a drop in my output, logged into the machine remotely via VNC, and restarted the services manually.

Yet, despite this, it seems that something is happening to make the clients stop every so often. Unfortunately, I have no access whatsoever to the clients' console output because they run as services; the best clue I have is that when they stop, tosend.txt is present in the client directory but not workfile.txt or any of the other files indicating that the client is in the middle of a batch. Essentially, it looks just like what happens when the client gives up after a few failed connections with op_connect = FALSE--yet I have it set to TRUE, so that wouldn't make sense. Also, I'm sure it's not just in that "waiting period" where it pauses 60 seconds before another reconnect--no way it would wait like that for hours on end when my connection is perfectly good and the other cores on the same machine are connecting without issue.

I won't be able to do it during this rally, but sometime afterwards I'm going to try changing the copies of do.bat on that machine to echo their output to a file instead of to console (since they have no visible console). That way hopefully I can get some clues as to what's going on.

So, stay tuned...I hope to have a better idea of what's going on here in the near future.
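
For reference, the behaviour expected from op_connect in the post above can be sketched as follows. This is only a Python illustration of the logic as described, not do.bat's actual code: the function name, `try_connect`, and the `max_attempts` limit of 3 are hypothetical, and the 60-second pause is taken from the description above.

```python
import time
from typing import Callable


def connect_with_retry(
    try_connect: Callable[[], bool],
    op_connect: bool,
    max_attempts: int = 3,          # assumed value for "a few failed connections"
    pause_seconds: float = 60.0,    # pause between attempts, per the post above
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a connection succeeds.

    With op_connect False, give up (return False) after max_attempts failures.
    With op_connect True, keep retrying indefinitely.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_connect():
            return True
        if not op_connect and attempts >= max_attempts:
            return False
        sleep(pause_seconds)
```

Under this reading, a client with op_connect = TRUE can only leave the loop by connecting, which is why a client that simply stops looks like a different failure.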

2010-05-01, 21:13   #69
gd_barnes
May 2007
Kansas; USA
2²·2,539 Posts

What is a service and why does one need to be used?

I realize it's a machine on which you prefer that the command prompt windows not be shown, but what does that have to do with a service (if anything)?

2010-05-01, 22:44   #70
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts

Windows keeps a registry of "Services" behind the scenes that are started whenever the computer is booted--the operative word being booted, not logged on. A lot of these are actual Windows background components. However, the service functionality is not limited to Windows components, and can be used with other applications as well. Manual LLR (the GUI version), for instance, has a menu option that lets you install it as a service so that it runs automatically. The old LLRnet client has a similar option; the new one does not, but I can use the method I mentioned earlier to serve as a "wrapper" and make any application a service.

The reason why I run it as a service is so that it keeps running uninterrupted regardless of who's logged on (or not). Putting it in the Startup folder (or the equivalent registry keys) would just start the application at logon, which means that it only runs when a particular user is logged on. Even if I put it in the All Users Startup folder, it still wouldn't run when nobody's logged on, and even worse, if somebody's logged on and another logs on without the first logging off, the clients will be started a SECOND time leading to all sorts of mayhem.

The service method does work quite well; that's not the issue here. The problem is that, somehow, do.bat is exiting when it's not supposed to, and therefore has to be restarted. That would happen even if I was just running it normally. The tricky thing is, I can't see the console output since I have the command window hidden (which is not necessarily limited to services; it can be done with non-services with a program like runh.exe as well), so I can't see the exact messages telling me why it exited.

Last fiddled with by mdettweiler on 2010-05-01 at 22:49

2010-05-02, 07:07   #71
S485122
Sep 2006
Brussels, Belgium
2²×383 Posts

Why do you use a ".bat" extension and not a ".cmd" extension?

With the ".cmd" extension you can redirect the standard output to a file as you plan to do; you can also redirect the error messages to another file (or to the same file).

For instance if the command for your service is "do.cmd" you can redirect all the output of the batch command to a message file with the command
"do.cmd 1> c:\do\do_out.txt 2> c:\do\do_err.txt"

To have all output go to one file use
"do.cmd 1> c:\do\do_out.txt 2>&1"

As you probably know, if you use >> instead of > the files will not be overwritten each time the batch launches. You could also use output redirection on individual tasks of your batch and discard output you do not need by sending it to the null device "nul"
"task 1> nul 2> c:\do\do_err.txt"

Once the batch file stops, just analyse the output files. You can even monitor work in progress.

Jacob
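
The same redirections can be sketched in Python, which may help if the service wrapper makes shell redirection awkward. This is an illustration only, not part of do.bat or LLRnet: `run_logged` and its parameters are hypothetical names. Opening a file with mode "w" corresponds to > and mode "a" to >>.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def run_logged(
    command: Sequence[str],
    out_path: Path,
    err_path: Optional[Path] = None,
    append: bool = True,
) -> int:
    """Run command, sending stdout to out_path and stderr to err_path.

    If err_path is None, stderr goes to the same file as stdout (like 2>&1).
    append=True behaves like >>, append=False like >.
    Returns the command's exit code.
    """
    mode = "a" if append else "w"
    with open(out_path, mode) as out:
        if err_path is None:
            # Merge stderr into stdout's file.
            return subprocess.run(command, stdout=out, stderr=subprocess.STDOUT).returncode
        with open(err_path, mode) as err:
            return subprocess.run(command, stdout=out, stderr=err).returncode
```

Because the exit code is returned, the caller can also record why the batch stopped, not just what it printed.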

2010-05-02, 15:47   #72
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
6249₁₀ Posts

Hmm...I had no idea that didn't work with a .bat extension. I thought .bat and .cmd were essentially equivalent except that .cmd didn't work on Windows 9x--I guess not.

2010-05-02, 17:09   #73
S485122
Sep 2006
Brussels, Belgium
2²·383 Posts

My post was misleading: the redirection possibilities are the same.

The command.com window is launched through .bat and .pif files or by running command.com. You get an error if you try to close the window. It uses the 8.3 file name format. It is initialised by the autoexec.nt and config.nt files. You need to include support for non-USA keyboards in those files.

The "Command Prompt" window is launched through .cmd files or by running cmd.exe. You can close the window. It uses the NTFS file name format. It is initialised by the autoexec.bat and config.sys files (very confusing but it is MS.)

Jacob

2010-05-02, 17:46   #74
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
6249₁₀ Posts

Hmm...strange. In my experience, .bat files don't have any problems with non-8.3 file name formats; and I haven't run into any issues closing the window, either. Sure, there's the "Terminate batch file? Y/N" thing that comes up when you hit Ctrl-C, but doesn't that happen with .cmd files as well?

2010-05-02, 20:26   #75
S485122
Sep 2006
Brussels, Belgium
2²·383 Posts

Forget all my ranting about .bat and .cmd :-(

.bat files can also be run by cmd.exe (I finally looked up "Batch file" in Wikipedia...)
Quote:
Filename extensions
* .bat: The first extension used by Microsoft for batch files. This extension can be run in most Microsoft Operating Systems, including MS-DOS and most versions of Microsoft Windows.
* .cmd: Designates a Windows NT Command Script, which is written for the Cmd.exe shell, and is not backward-compatible with COMMAND.COM.

Differences
The only known difference between .cmd and .bat file processing is that in a .cmd file the ERRORLEVEL variable changes even on a successful command that is affected by Command Extensions (when Command Extensions are enabled), whereas in .bat files the ERRORLEVEL variable changes only upon errors.
Jacob

2010-05-03, 03:51   #76
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts

Hmm, seems do.bat -c isn't working properly for me. I ran it with 3 workunits in queue, 2 of which had been completed; here's the lresults_hist.txt output:
Code:
593*2^770656-1 is not prime.  LLR Res64: 95AC38F43E705AF6  Time : 1110.256 sec.
597*2^770656-1 is not prime.  LLR Res64: 3595733D25372C5C  Time : 1098.580 sec.
[2010-02-05 23:35:15] Cancelled pair: 447 770657
Cancelled 1 pair(s)!
The first two correctly show up in the server's joblist.txt as completed. 447/770657, however, does not--it's still reserved by me! Somehow, despite do.bat confirming "Cancelled 1 pair(s)!", the pair was not canceled on the server but rather just deleted on my end.
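
To compare the client's view with the server's, the history output can be summarised mechanically. A rough Python sketch, assuming the file contains only the two kinds of lines shown in the output above (result lines and "Cancelled pair" lines); the regular expressions and the `summarise_history` name are hypothetical and untested against other client versions.

```python
import re
from typing import Iterable, List, Tuple

# "593*2^770656-1 is not prime.  LLR Res64: ..." -> k=593, n=770656
RESULT_RE = re.compile(r"^(\d+)\*2\^(\d+)-1 is ")
# "[2010-02-05 23:35:15] Cancelled pair: 447 770657" -> k=447, n=770657
CANCEL_RE = re.compile(r"Cancelled pair: (\d+) (\d+)")

Pair = Tuple[int, int]


def summarise_history(lines: Iterable[str]) -> Tuple[List[Pair], List[Pair]]:
    """Return (completed, cancelled) lists of (k, n) pairs found in lines."""
    completed: List[Pair] = []
    cancelled: List[Pair] = []
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m:
            completed.append((int(m.group(1)), int(m.group(2))))
            continue
        m = CANCEL_RE.search(line)
        if m:
            cancelled.append((int(m.group(1)), int(m.group(2))))
    return completed, cancelled
```

Every pair in the cancelled list is one that the server's joblist.txt should no longer show as reserved.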

I'm sure I have the latest do.bat; I extracted it directly from my copy of version 0.72 that I have on my hard drive, the very same one I downloaded from Karsten's website and uploaded to the noprimeleftbehind.net server. Yet -c is not working properly, which is something I thought we fixed long ago!

Gary, since you have access to the server, can you verify this? What you'll need to do is, after canceling a pair, go to port 3000's joblist.txt and use the Find function to search for the text "k/n" where k and n are replaced with the k and n of the pair canceled. You can then use the Find function to step through all such instances found by the search. That should give you a summary of that k/n pair's life story.

It's possible this problem is just some weird effect of me running under Cygwin (though I can't imagine how that would affect this adversely). If it works just fine in a similar situation (3 in queue, first 2 complete) for Gary, then there must be something weird going on for me.
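
The joblist.txt check described above can be scripted instead of stepping through Find results by hand. A minimal Python sketch, assuming only that every relevant line of joblist.txt contains the pair as the literal text "k/n", as the search above implies; the real file layout is not shown in this thread, and `pair_history` is a hypothetical name.

```python
import re
from typing import Iterable, List, Tuple


def pair_history(lines: Iterable[str], k: int, n: int) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line mentioning the pair k/n.

    The pattern refuses to match inside longer numbers, so searching for
    447/770657 does not also hit 1447/770657 or 447/7706570.
    """
    pattern = re.compile(rf"(?<!\d){k}/{n}(?!\d)")
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

Passing an open file object for joblist.txt as `lines` yields the pair's entries in file order, which gives the same "life story" as the manual search.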

2010-05-03, 04:33   #77
gd_barnes
May 2007
Kansas; USA
2²·2,539 Posts

I only run one Windows machine against the servers, my i7, and it doesn't have the latest version of the Windows client on it, the one that corrected the problem with pairs not being sent that had completed during a server/power outage. (Fortunately there has been no outage of late.) Is there no way for you to test it? I'm knee deep in CRUS updates tonight and Monday. After getting back from my trip and having my kids for 3 days, this is the 1st extended period that I have had to work on them in about 2 weeks.

I think that Karsten would be the better one to test it.

As for the Linux client, I long ago extensively verified it. I also subsequently cancelled many pairs on my end when I changed servers on many of my machines for the rally. I then spot checked about 10 of them and they were definitely showing up properly as cancelled in the joblist.txt file.

I do know one thing: There was an extreme amount of testing needed to fix the problem of completed pairs being lost during an outage, and it also involved the cancellation process. I'm sure that Karsten tested the heck out of his changes, but it wouldn't surprise me if some lone scenario was missed. I twice thought I was done and everything worked before I thought of one final scenario that turned out not to work and required more changes and testing. I spent 5-6 hours on the Linux client getting it right for the "dropped completed pairs during a server outage/disconnect" issue. The thing that took so long is that everything was intertwined. When I fixed those final 2 issues that I had, I had to retest all prior scenarios because I had already made at least one fix where I broke something else in my testing.


Gary

Last fiddled with by gd_barnes on 2010-05-03 at 04:45