mersenneforum.org  

Old 2010-03-29, 21:08   #45
Mini-Geek
"Tim Sorbera"

Quote:
Originally Posted by mdettweiler
Almost all of do.pl should be OS-independent. I wonder if Tim's problem with it on Windows is related to the dropped-connection issue: possibly his connection cut out somewhere along the way on the times where he got "could not find lresults.txt" errors? It might be another manifestation of the same problem.
I seriously doubt it. The bug still happens when the client/script already has work to do and is just resuming it (i.e. when I do no server communication in that running of the script).

Last fiddled with by Mini-Geek on 2010-03-29 at 21:09
Old 2010-03-29, 21:42   #46
kar_bon
 

To issue #1:

I've tested the following with the DOS-script downloadable in the first post:
the client and server are there and set up for a local test, knpairs contains 20 pairs.

Code:
start:
- client/server as in V0.71-download
- server started with 'llrserver' in folder 'LLRnet_server'
- client started with 'do' in folder 'LLRnet_client'
- client completed first 5 workunits and sent to server
- client completed 3 from next 5 workunits, then stopped
- calling 'do -c'

proved:
- client-folder
  - lresults_hist.txt contains 8 results and note "Cancelled 2 kn-pairs!"
  - no file like workfile.txt, lresults.txt, tosend.txt or llr.ini
    so this is the same as a newly installed client folder, ready to start again!
- tray-icon of server says: 5 connections, 8 results
- server-folder:
  - results.txt contains 8 pairs, the same as in the client-folder
  - knpairs.txt contains the remaining pairs (8 first deleted)
  - joblist.txt contains 4 entries: each of the 2 cancelled pairs appears with status 'working' AND 'abandonned' (cancelled)

do next:
- stopping server: tray -> 'Exit LLRserver'
- calling simplify.bat in the server-folder
- joblist.txt contains 2 cancelled pairs only (2 entries)

next:
- starting server again
- starting 'do' again
- client works through all remaining pairs, including the 2 cancelled ones

proved:
- client folder: lresults_hist.txt contains all 20 pairs
- server folder: after calling 'simplify.bat':
  - results.txt contains 60 lines -> 20 results
  - knpairs.txt contains no pair, only worktype in first line
  - joblist.txt contains 12 pairs, all 'solved'
- calling 'simplify.bat' again:
  - joblist.txt is empty

-> all is OK!
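
For anyone who wants to automate the "client folder is clean" check from the notes above, here is a rough sketch in Python. It is not part of the V0.71 download. The helper name leftover_work_files is made up for illustration; only the four file names come from the test notes.

```python
import os
import tempfile

# File names taken from the test notes above; a cleanly cancelled client
# folder should contain none of them.
WORK_FILES = ("workfile.txt", "lresults.txt", "tosend.txt", "llr.ini")


def leftover_work_files(client_dir):
    """Return the work files still present in client_dir (hypothetical helper)."""
    return [name for name in WORK_FILES
            if os.path.exists(os.path.join(client_dir, name))]


if __name__ == "__main__":
    # Demo on a throwaway folder rather than a real client folder.
    with tempfile.TemporaryDirectory() as demo:
        open(os.path.join(demo, "lresults_hist.txt"), "w").close()
        open(os.path.join(demo, "tosend.txt"), "w").close()
        print(leftover_work_files(demo))
```

An empty list would mean the folder looks like a fresh install; lresults_hist.txt is deliberately not in the list, since it is expected to remain.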

So please Tim, test this again on your PC and post your result here!

Last fiddled with by kar_bon on 2010-03-29 at 22:22
Old 2010-03-29, 21:46   #47
kar_bon
 

To issue #3:

An overclocked (OC'ed) PC is not the 'normal' way to run a script like this, or LLR itself.

To handle this, I need the output that LLR creates, or any file I can check to see whether such an error occurred.

A script can only react when I know what to look for, and the same goes for LLR!

(If I get more info, I will update this post.)

Last fiddled with by kar_bon on 2010-03-29 at 21:54
Old 2010-03-29, 21:52   #48
gd_barnes
 

Quote:
Originally Posted by mdettweiler
I did quite a bit of testing with do.pl on my own Windows setup; I had it run for a number of days straight on a "production" server and it worked great. I'm not sure what could have gone wrong here. You mentioned a couple posts up that you see where the problem is; could you point me to it?

Almost all of do.pl should be OS-independent. I wonder if Tim's problem with it on Windows is related to the dropped-connection issue: possibly his connection cut out somewhere along the way on the times where he got "could not find lresults.txt" errors? It might be another manifestation of the same problem.
I didn't say that I know where Tim's problem is on Windows do.pl. I know where MY problem is on Linux do.pl. I give a lot of detail on it in problem #5. I'll start working on it within a couple of hours.

Don't worry about do.pl on the Windows side. We don't need two different Windows clients. It just complicates things. Also, having people use the DOS script is much better and easier since they don't have to download files related to running Perl.

Having it run for a # of days straight correctly proves little other than it works OK when there are no technical glitches along the way. It doesn't test exception situations like different bases, dropped internet connections, dropped servers, etc. You have to test the exceptions. I'm guilty of it on the Linux side. Although I ran a test with a stopped server on the Linux side, I failed to fully analyze all files. The client showed that the pairs were sent and that it was waiting for new pairs. I failed to check the server side and see if the pairs were actually received after the server came back up, which they were not; hence the problem.
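
The server-side check I skipped boils down to comparing what the client logged as done against what the server actually recorded. A minimal sketch in Python, assuming you have already pulled the (k, n) pairs out of the client's lresults_hist.txt and the server's results.txt yourself (the parsing is left out because the file layouts are not shown here); the function name and the sample values are made up:

```python
def missing_on_server(client_pairs, server_pairs):
    """Return pairs the client logged as done that the server never recorded."""
    received = set(server_pairs)
    return [pair for pair in client_pairs if pair not in received]


if __name__ == "__main__":
    # Made-up (k, n) sample values, purely for illustration.
    client_done = [(3, 1001), (5, 1001), (7, 1001)]
    server_got = [(3, 1001)]
    print(missing_on_server(client_done, server_got))
```

Anything this returns after the server has come back up is a result that was "sent" only as far as the client is concerned.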

Edit: I went ahead and modified the #4 problem from "due to lack of testing" to "due to lack of testing of exception situations".

Edit2: Please do not first assume that the problem is on the user's end as in "I wonder if it is a problem with Tim's connection". Usually it is not. I get so tired of hearing that from businesses and others when I happen to have problems with my internet connection, phone problems, software, etc. Most of the time when dealing with technically very competent users such as Tim, the problem is in the software itself, not with the user.


Gary

Last fiddled with by gd_barnes on 2010-03-29 at 22:18
Old 2010-03-29, 21:59   #49
kar_bon
 

To issue #2:

When the server is stopped (same setup as for issue #1) while cLLR is testing pairs, the client deletes all of its work files!
The work completed while the server was down is logged only in the client's 'lresults_hist.txt'; there is no tosend.txt anymore!

I will change the script to fix this as soon as possible, though there may be other issues to solve first.
Old 2010-03-29, 22:18   #50
Mini-Geek

Quote:
Originally Posted by kar_bon
To issue #1:

I've tested the following with the DOS-script downloadable in the first post:
the client and server are there and set up for a local test, knpairs contains 20 pairs.

...
-> all is OK!

So please Tim, test this again on your PC and post your result here!
I have just confirmed that this works. All exactly as you said, except that you left out doing "do -c" (kar_bon: corrected in that post, thanks!), and another minor thing (detailed after the next quote). As the .bat is unchanged, I must conclude there was something wrong with one or more of:
- my client files/folder (which may very well be, as I started it with the files of the do.pl and then added do.bat and the files along with it),
- the server I'd been connecting to (seems unlikely),
- the connection between my client and the server (seems unlikely, unless anything about the way do.bat communicates doesn't work with the LLRnet server G6000).

I'm currently using the test folder (the one I set up to run your test, which I know to be in working order) to run a number from G6000, so I can use do -c to return it and make sure it all works. Assuming it does, that means that whatever the problem was was probably mainly my fault (due to the aforementioned mix-and-match). Edit: yes, it worked correctly. It returned one candidate (of 5) and canceled the other 4. It doesn't really tell me everything that's going on, but at least it worked. Here's all it said:
Code:
+-------------------------------------+
| LLRnet client V0.9b7 with cLLR V3.8 |
| K.Bonath, 2010-02-10, Version 0.71  |
+-------------------------------------+

Current configuration:
server = "www.noprimeleftbehind.net"
port = 6000
username = "Mini-Geek"
WUCacheSize = 5

        1 file(s) copied.
[2010-03-29 17:23:43]
Cancelling : 2201/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:43]
Cancelling : 2205/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:44]
Cancelling : 2295/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:45]
Cancelling : 2421/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:45]
No more job to cancel !
All jobs canceled!
(Those 4 were the canceled ones.) It would be preferable for it to print something like "Returning 2195/548954" before the cancellations, followed by a "1 pair returned, 4 pairs canceled" message.
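
To show the kind of summary line I mean, here is a small Python sketch that builds it from the output above. This is not how do.bat works internally; cancel_summary and pairs are made-up names, and the number of returned pairs is passed in by hand because the current output does not report it.

```python
import re

# Matches lines such as "Cancelling : 2201/548954 (30000000000000:M:1:2:258)".
CANCEL_LINE = re.compile(r"^Cancelling : \d+/\d+")


def pairs(count):
    """Format a count as '1 pair' or 'N pairs'."""
    return f"{count} pair" if count == 1 else f"{count} pairs"


def cancel_summary(log_text, returned):
    """Count the 'Cancelling' lines in log_text and build a one-line summary."""
    cancelled = sum(1 for line in log_text.splitlines()
                    if CANCEL_LINE.match(line.strip()))
    return f"{pairs(returned)} returned, {pairs(cancelled)} canceled"


if __name__ == "__main__":
    sample = """\
[2010-03-29 17:23:43]
Cancelling : 2201/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:43]
Cancelling : 2205/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:44]
Cancelling : 2295/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:45]
Cancelling : 2421/548954 (30000000000000:M:1:2:258)
[2010-03-29 17:23:45]
No more job to cancel !
All jobs canceled!
"""
    print(cancel_summary(sample, returned=1))
```
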
Quote:
Originally Posted by kar_bon
Code:
- server folder: after calling  'simplify.bat':
  - joblist.txt contains 12 pairs, all 'solved'
...
- calling 'simplify.bat' again:
  - joblist.txt is empty
I only called simplify.bat once (after the server shut down, if that matters), but joblist.txt was immediately empty. I hope this isn't too big of a problem.
(kar_bon: I should make a note in the README.txt of the download-zip to use 'simplify' twice for that!)

Last fiddled with by Mini-Geek on 2010-03-29 at 22:29
Old 2010-03-29, 22:26   #51
gd_barnes
 

Quote:
Originally Posted by Mini-Geek
I have just confirmed that this works. All exactly as you said, except that you left out doing "do -c", and another minor thing (detailed after the next quote). As the .bat is unchanged, I must conclude there was something wrong with one or more of:
- my client files/folder (which may very well be, as I started it with the files of the do.pl and then added do.bat and the files along with it),
- the server I'd been connecting to (seems unlikely),
- the connection between my client and the server (seems unlikely, unless anything about the way do.bat communicates doesn't work with the LLRnet server G6000).

I'm currently using the test folder (the one I set up to run your test, which I know to be in working order) to run a number from G6000, so I can use do -c to return it and make sure it all works. Assuming it does, that means that whatever the problem was was probably mainly my fault (due to the aforementioned mix-and-match).

I only called simplify.bat once (after the server shut down, if that matters), but joblist.txt was immediately empty. I hope this isn't too big of a problem.

Tim,

To put it all in a nutshell:

You suspect that problem #1 in the "problem log" was the result of possibly having some Windows do.pl files in your Windows DOS client? Is that correct?

I wasn't quite clear. Is there still a problem with do -c?

If you can confirm that everything is working correctly with the Windows DOS client after you ran it clean without the Windows do.pl client files in the same folder, I'll update the problem log post to show it as a "non issue".

On the Windows side, that still leaves open problems #2 and #3.

Karsten,

Your test wasn't clear to me. Were you able to simulate Carlos's internet connection outage for problem #2 and confirm that it worked correctly?

I see that you are waiting to get some output from Carlos on the OC'd issue #3. It wasn't clear to me whether it is related to a specific problem with LLR (cLLR?) that the clients need to be able to handle, or whether LLR (or cLLR) itself needs to be fixed. Can you clarify that?


Gary

Last fiddled with by gd_barnes on 2010-03-29 at 22:32
Old 2010-03-29, 22:30   #52
Mini-Geek

Quote:
Originally Posted by gd_barnes
To put it all in a nutshell:

You suspect that problems #1 and #2 in the "problem log" were the result of possibly having some Windows do.pl files in your Windows DOS client?

Is that correct?

I wasn't quite clear. Is there a problem with do -c?

If you can confirm that everything is working correctly with the Windows DOS client after you ran it clean without the Windows do.pl client files in the same folder, I'll update the problem log post to show them as "non issues".
Yep, as I added to my previous post:
Quote:
Originally Posted by Mini-Geek
I'm currently using the test folder (the one I set up to run your test, which I know to be in working order) to run a number from G6000, so I can use do -c to return it and make sure it all works. Assuming it does, that means that whatever the problem was was probably mainly my fault (due to the aforementioned mix-and-match). Edit: yes, it worked correctly. It returned one candidate (of 5) and canceled the other 4.
So yes, I believe #1 to be a non-issue. #2 is something I don't think I've seen (one that Carlos reported), so I can't really comment on that.

Last fiddled with by Mini-Geek on 2010-03-29 at 22:31
Old 2010-03-29, 22:37   #53
gd_barnes
 

Quote:
Originally Posted by kar_bon
To issue #2:

When the server is stopped (same setup as for issue #1) while cLLR is testing pairs, the client deletes all of its work files!
Only the work completed while the server was down is logged in the client's 'lresults_hist.txt'; there is no tosend.txt anymore!

I will change the script to fix this as soon as possible, though there may be other issues to solve first.

THAT is the EXACT same issue as in the Linux do.pl client!!

I too will be working on that this evening.

I first noticed that something might be amiss when I dropped a server a few days ago to load more pairs. There ended up being some pairs that the server never received even though it appeared that the clients had sent them. I wasn't sure what the problem was at first. You'll see a posting from me from last Friday where I said there may be a problem in the Linux do.pl script related to that. It sure snowballed in a hurry as others found it also and it perhaps caused some other problems.

In our defense (lol), this was not an easy issue to see. I'm sure that like do.pl, do.bat "shows" that the pairs are sent to the server and sits waiting to get new pairs. The problem is that it never actually sends the previous results after the server comes up because it has already deleted tosend.txt. The server files needed to be more closely inspected when testing.

That's why we do beta testing. :-)


Gary

Last fiddled with by gd_barnes on 2010-03-29 at 22:40
Old 2010-03-29, 22:43   #54
gd_barnes
 

Quote:
Originally Posted by Mini-Geek
Yep, as I added to my previous post:
So yes, I believe #1 to be a non-issue. #2 is something I don't think I've seen (one that Carlos reported), so I can't really comment on that.
OK, I'm marking #1 as a non-issue. I goofed when I said #1 and #2; I had edited my post to fix that, but you had already quoted it.

In other news, I'll mark an issue #6 for Windows do.bat that is the same as issue #5 for Linux do.pl. That is the issue with results not being sent to the server when a server (or internet connection) is dropped.
Old 2010-03-30, 10:32   #55
gd_barnes
 

I have fixed the problem in the Linux client with results not being returned to the server whenever there is an internet outage or server problem at or before the time that the batch is finished.

Karsten, this took some extensive changes and several hours of testing over many different scenarios and conditions. The more difficult changes were in the pairs cancellation process. You have to test cancelled pairs/results being returned to the server in each of these cases:
- the server is down at that moment and then comes back up;
- the last batch completed while the server was down, and the server is currently up (or down);
- the more usual case, where the user cancelled the last batch before it was done and wishes to return processed results and unprocessed pairs to the server while it is up and running.
Personally, I didn't think to test multiple scenarios with the server being up or down in the cancellation process in alpha testing. I only did some of that in the regular process, and obviously hadn't looked closely enough to catch this problem.

The main thing that I did was put the code into a loop whenever the results are being returned. Previously we only had a loop when retrieving pairs, but a similar loop is needed when returning pairs in both the main and cancellation processes. As it previously existed, the code always assumed that the pair return process worked correctly (that is, that the server was up and the internet connection was good). Although the results were showing up in lresults_hist, the file cleanup that followed the presumed pair return caused the converted tosend.txt file to be very "quietly" lost whenever there was a communication issue with the server.
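
The idea of the loop can be sketched roughly as follows. This is Python rather than the Perl in do.pl, and it is not the actual code: return_results, the send callback, and the retry parameters are all assumptions for illustration. The one point it tries to show is that tosend.txt gets deleted only after the send is confirmed.

```python
import os
import tempfile
import time


def return_results(tosend_path, send, retry_delay=60, max_attempts=None,
                   sleep=time.sleep):
    """Retry send(tosend_path) until it succeeds, then delete the file.

    send is a hypothetical callable that returns True only once the server
    has accepted the results. Returns True if sent, False if max_attempts
    ran out (the file is then left on disk for a later run).
    """
    attempt = 0
    while True:
        attempt += 1
        if send(tosend_path):
            os.remove(tosend_path)  # clean up only after a confirmed send
            return True
        if max_attempts is not None and attempt >= max_attempts:
            return False
        sleep(retry_delay)  # server or connection is down: wait and retry


if __name__ == "__main__":
    # Demo: a fake server that is "down" for the first two attempts.
    calls = []

    def flaky_send(path):
        calls.append(path)
        return len(calls) >= 3

    demo_dir = tempfile.mkdtemp()
    tosend = os.path.join(demo_dir, "tosend.txt")
    with open(tosend, "w") as handle:
        handle.write("demo results\n")

    sent = return_results(tosend, flaky_send, retry_delay=0)
    print(sent, len(calls), os.path.exists(tosend))
```

In the demo the first two attempts fail, the third succeeds, and only then is the file removed. With the old behaviour the cleanup ran regardless, which is how the results went missing.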

Testing was done using both Riesel and Sierp files for n<1000 and for n=10K-10.1K. I didn't feel that an all-out stress test was necessary since the communication with the server was little changed. But right now, I'm loading it on all of my machines, so the main public drives will somewhat stress test it.

Attached are the updated do.pl script and README.txt. It is now officially do.pl version V0.71. I have added versioning comments at the bottom of README. Please edit the 1st post here to incorporate the changes in the client files.

Whew, I'm glad that is done. But now it's back in beta testing phase. We can't assume that it is fully correct yet.


Gary
Attached Files
File Type: gz do.pl-readme.txt.tar.gz (5.5 KB, 77 views)

Last fiddled with by kar_bon on 2010-03-30 at 12:43

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.