mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Conjectures 'R Us (https://www.mersenneforum.org/forumdisplay.php?f=81)
-   -   PRPnet server bugs and barfing problem (https://www.mersenneforum.org/showthread.php?t=12256)

henryzz 2009-07-29 08:39

PRPnet server bugs and barfing problem
 
[B]Note by mdettweiler: this thread was split off from the [url=http://www.mersenneforum.org/showthread.php?t=12225]main PRPnet thread[/url].[/B]

On [URL]http://nplb-gb1.no-ip.org:3000/server_stats.html[/URL] there are two columns named "min N" when one should be "max N".

rogue 2009-07-29 12:25

[QUOTE=henryzz;183249]on [url]http://nplb-gb1.no-ip.org:3000/server_stats.html[/url] there are two columns named min N when one should be max N[/QUOTE]

I'm aware of the problem. It's pretty minor, so until something more significant shows up, I won't put out another release.

mdettweiler 2009-08-01 15:07

Lennart just sent me a PM informing me that he couldn't reach the G3000 server. I checked the server, and it seems that somehow it got all messed up. There are a large number of results in completed_tests.log from the last few hours that have blank residuals! I'm not sure how this occurred; I'll need to do a bit more investigating. In the meantime, [b]G3000 is offline[/b].

mdettweiler 2009-08-01 15:41

Okay, I've got all that sorted out. Still no clue as to why it malfunctioned, but at least everything should be back to normal now. :smile:

mdettweiler 2009-08-01 15:54

[quote=mdettweiler;183658]Okay, I've got all that sorted out. Still no clue as to why it malfunctioned, but at least everything should be back to normal now. :smile:[/quote]
Upon further investigation of the log files, I've got a basic idea of what probably happened. My guess is that there was a brief interruption in a connection between the server and one of the clients around 8:55 AM GMT today. The interruption must have occurred while the server was in the middle of either sending or receiving (probably receiving, from what I could tell), which apparently confused the heck out of the server.

One unfortunate side effect of all this is that the prpserver.candidates file ended up getting wiped. I'm not sure how that happened, but presumably it was part of the whole mess-up. I was, however, able to rebuild the prpserver.candidates file from the original sieve file and remove all completed tests, so we shouldn't miss anything. The only remaining side effect is that a few people might encounter rejected tests (in this case, that would only be Lennart and me, since we're the only ones on the server right now).
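In case it's useful later, the rebuild step amounts to filtering the sieve file against completed_tests.log. Here's a minimal Python sketch; the file formats are assumptions (one candidate per sieve-file line, candidate string as the first field of each completed-tests line), not necessarily PRPnet's actual layout:

```python
# Hypothetical sketch of rebuilding a candidates file: keep every sieve
# candidate that does not already appear in completed_tests.log.
# Assumed formats (not verified against PRPnet): one candidate per
# sieve-file line; candidate is the first field of each completed line.
def rebuild_candidates(sieve_path, completed_path, out_path):
    with open(completed_path) as f:
        done = {line.split()[0] for line in f if line.strip()}
    kept = 0
    with open(sieve_path) as src, open(out_path, "w") as out:
        for line in src:
            if line.strip() and line.split()[0] not in done:
                out.write(line)
                kept += 1
    return kept  # number of candidates still to be tested
```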

Max :smile:

gd_barnes 2009-08-02 08:02

Why are these two links the same thing?:

[URL]http://nplb-gb1.no-ip.org:3000/[/URL]
[URL]http://nplb-gb1.no-ip.org:3000/server_stats.html[/URL]

Please remove one of them or change one of the pages to what it should be.

Other issues:
At 3 PM CDT this afternoon, I was receiving pairs at n=~108000. How come most k's in the above pages are showing a "min n" of 100200-100500? I don't think that anyone is caching THAT many pairs? How often is this page updating?

Max, in the 1st post here, I have updated the "n-value" loaded in the server to the range shown in your PM to me, as well as the "currently processing at" n-value from the n-range that is currently being handed out to my clients. You had it incorrectly marked as "n-value" 130K-150K and "currently processing at" ~140K. That misled me greatly into thinking we only had a nominal amount of work left. When I started getting pairs at n=~107K Sat. afternoon, I was very confused. Hence our PM exchange. If we're going to have that info. there, please make sure that it is correct. The "currently processing at" n-value needs to be updated every 2-3 days maximum, as is done at NPLB.

Can we fix the issue with the two "min n" columns?

Can we get a web page that shows in a similar format as the one at NPLB at [URL="http://nplb.ironbits.net/"]nplb.ironbits.net/[/URL]? We need more general info. like that.

I've done a lot of back office batch and online testing for legacy financial systems that had to be perfect in production. This is just a warning that I will pick you guys to death on some of this stuff so be prepared! :smile:


Thanks,
Gary

Lennart 2009-08-02 08:24

[quote=gd_barnes;183697]Why are these two links the same thing?:

[URL]http://nplb-gb1.no-ip.org:3000/[/URL]
[URL]http://nplb-gb1.no-ip.org:3000/server_stats.html[/URL]

Please remove one of them or change one of the pages to what it should be.

Other issues:
At 3 PM CDT this afternoon, I was receiving pairs at n=~108000. How come most k's in the above pages are showing a "min n" of 100200-100500? I don't think that anyone is caching THAT many pairs? How often is this page updating?

Max, in the 1st post here, I have updated the "n-value" loaded in the server to the range shown in your PM to me, as well as the "currently processing at" n-value from the n-range that is currently being handed out to my clients. You had it incorrectly marked as "n-value" 130K-150K and "currently processing at" ~140K. That misled me greatly into thinking we only had a nominal amount of work left. When I started getting pairs at n=~107K Sat. afternoon, I was very confused. Hence our PM exchange. If we're going to have that info. there, please make sure that it is correct. The "currently processing at" n-value needs to be updated every 2-3 days maximum, as is done at NPLB.

Can we fix the issue with the two "min n" columns?

Can we get a web page that shows in a similar format as the one at NPLB at [URL="http://nplb.ironbits.net/"]nplb.ironbits.net/[/URL]? We need more general info. like that.

I've done a lot of back office batch and online testing for legacy financial systems that had to be perfect in production. This is just a warning that I will pick you guys to death on some of this stuff so be prepared! :smile:


Thanks,
Gary[/quote]
Must be a typo. Change [URL="http://nplb-gb1.no-ip.org:3000/server_stats.html"]server_stats.html[/URL] to [URL]http://nplb-gb1.no-ip.org:3000/server_status.html[/URL].


Lennart

gd_barnes 2009-08-02 08:38

OK, corrected. Thanks Lennart for the correct link.

But the info. on the status link is incorrect. It shows exactly:

[quote]
Sierpinski Base 6 6 k remaining31 n remainingmin n is 1760 (100287 digits long)
[/quote]Can we please correct this? Almost everything is wrong, including grammar, n-value, and size. It should be something like:

[quote]
Sierp base 6, 28 k's remaining, pairs remaining 1760, min n is xxxxxx (xxxxxx digits).
[/quote]Obviously the "min n" cannot be 1760. The size can't be right either. A size of 100287 digits would be for an n-value of ~129000. We don't even have that range loaded in the server!


Thanks,
Gary

gd_barnes 2009-08-02 08:44

Holy crap. I just checked. Lennart, you must have multiple quads on Sierp base 6. We're close to drying it out! Unless I'm misunderstanding something, nice work! Have you received notification of any primes?

Lennart 2009-08-02 08:49

[quote=gd_barnes;183700]Holy crap. I just checked. Lennart, you must have multiple quads on Sierp base 6. We're close to drying it out! Unless I'm misunderstanding something, nice work! Have you received notification of any primes?[/quote]

I have some cores on it. :smile:

No, no mail. :smile:

Lennart

gd_barnes 2009-08-02 09:04

Cool. I just wanted to make sure I wasn't hallucinating there.

Guys, two more problems:

First:

[URL]http://nplb-gb1.no-ip.org:3000/[/URL] shows k's remaining for 36772, 118147, and 157473. These were found prime back in January and should not be on the list. If you remove those, you get the correct 28 k's remaining. The primes were:
36772*6^126672+1
118147*6^122688+1
157473*6^113124+1

That will correct part of the problem with [URL]http://nplb-gb1.no-ip.org:3000/server_status.html[/URL] regarding the # of k's remaining, which currently shows 31 vs. the correct 28.

Second:

Lennart received no emails, but he should have received them for 4 primes. The primes found in the current run, as shown in the PRP.log file on the server, are:

72785*6^118347+1
98860*6^119849+1
123285*6^147200+1
145076*6^149946+1


A lot of problems...a lot of work to do yet. I'm not willing to load n>150K into the server until we rerun an n=10K range of this that contains primes and all of these issues are corrected. That is testing with test data, which in this case means a double-check.


Gary

gd_barnes 2009-08-02 09:22

Lennart found 2 new primes!:

123285*6^147200+1 is prime
145076*6^149946+1 is prime

Unfortunately they are not quite big enough for top 5000 right now but they are nice primes for the conjectures nonetheless.

28 k's now remain for Sierp base 6 at n=150K.

gd_barnes 2009-08-02 09:37

More problems:

[quote]
[2009-08-02 03:03:47 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:06:56 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:07:52 GMT] Total Time: 6:27:55 Total Tests: 45 Total PRPs Found: 0
[2009-08-02 03:11:01 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:11:12 GMT] Total Time: 6:31:15 Total Tests: 45 Total PRPs Found: 0
[2009-08-02 03:11:15 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:11:41 GMT] Total Time: 6:31:44 Total Tests: 45 Total PRPs Found: 0
[2009-08-02 03:11:41 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:11:45 GMT] G3000: ERROR: Workunit 59506*6^114591+1 not found on server
[2009-08-02 03:11:45 GMT] G3000: The client will delete this workunit
[2009-08-02 03:11:46 GMT] G3000: INFO: 0 of 1 test results were accepted
[/quote][quote]
[2009-08-02 03:24:36 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:26:20 GMT] Total Time: 6:46:23 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:26:23 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:27:21 GMT] Total Time: 6:47:24 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:27:24 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:31:17 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:33:01 GMT] Total Time: 6:53:04 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:33:22 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:33:54 GMT] G3000: ERROR: Workunit 145076*6^115554+1 not found on server
[2009-08-02 03:33:54 GMT] G3000: The client will delete this workunit
[2009-08-02 03:33:54 GMT] G3000: INFO: Workunit found
[2009-08-02 03:34:08 GMT] Total Time: 6:54:11 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:35:41 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:37:05 GMT] Total Time: 6:57:08 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:37:05 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:37:07 GMT] G3000: ERROR: Workunit 145076*6^115554+1 not found on server
[2009-08-02 03:37:07 GMT] G3000: The client will delete this workunit
[2009-08-02 03:37:08 GMT] G3000: ERROR: Workunit 68195*6^115556+1 not found on server
[2009-08-02 03:37:08 GMT] G3000: The client will delete this workunit
[2009-08-02 03:37:08 GMT] G3000: INFO: 0 of 2 test results were accepted
[/quote]Some of my results are not being accepted. Then they are deleted! What's up with that? The problem only seems to be on port 3000. No rejections on port 1470.
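For reference, the rejections are easy to tally straight from the client log. A small Python sketch, with the message format taken from the log excerpts above:

```python
import re

# Sketch: list the workunits a PRPnet client log reports as rejected.
# The "not found on server" message format is taken from the excerpts
# quoted above; other PRPnet versions may word it differently.
def rejected_workunits(log_text):
    pattern = re.compile(r"ERROR: Workunit (\S+) not found on server")
    return pattern.findall(log_text)
```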

gd_barnes 2009-08-02 11:13

Max and Rogue:

Port 3000 has dried. Plan of action now:

1. Please correct all of the aforementioned problems.

2. Change all appropriate # of k's remaining, "min n" values for all k's, # of pairs remaining, etc. in all web pages to the state that they should be at n=140K. Also change the 1st post of this thread to its state at n=140K.

3. Load n=140K-150K back into the server and let it rip. When done, let's double check to make sure that all web pages update correctly and that all Emails are sent for primes found.

4. Verify that there are no rejected results and that the server doesn't barf for no reason like it did on Sat. morning.

5. Match all residuals with the 1st run.

Once this all runs correctly and everything is verified as correct, then we can consider the server and its interfacing with the web pages (statuses/stats/etc.) ready to go for n>150K or for other efforts. In other words, it will be production ready. It is the interfacing with these kinds of web pages and stats that I have the biggest concern about at NPLB.

That is what I would define as a test plan! :smile:

Thanks for the huge boost Lennart! :smile:


Gary

mdettweiler 2009-08-02 11:57

Whoa, I wake up, get online, and it turns out there's all this action waiting for me when I get on. :smile: I'll try to answer all of your questions below:
[quote=gd_barnes;183713]Max and Rogue:

Port 3000 has dried. Plan of action now:

1. Please correct all of the aforementioned problems.[/quote]
Okay, first of all, since we don't have an NPLB-like database set up for CRUS results, there is no mechanism in place for sending out emails when people find a prime. PRPnet has such a mechanism built in, but it will need to use an SMTP server that doesn't require a secured connection--that means we can't use Gmail's server. Essentially, we need to use an ISP server for it. Gary, I'll send you a PM shortly describing what will need to be done to get email notification active.
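To illustrate the constraint: a plain ISP relay speaks unauthenticated SMTP on port 25 with no STARTTLS/SSL, which is exactly what Gmail's server won't do. This is a hypothetical Python sketch only, not PRPnet's built-in mailer; the host and addresses are placeholders:

```python
import smtplib
from email.message import EmailMessage

# Hypothetical illustration only (not PRPnet's actual notification
# code): building and sending a prime notice through an unsecured
# ISP-style SMTP relay. All names here are placeholders.
def build_prime_notice(sender, recipient, candidate):
    msg = EmailMessage()
    msg["Subject"] = f"PRPnet: {candidate} is prime!"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"The server reports that {candidate} is prime.")
    return msg

def send_prime_notice(smtp_host, msg):
    # Port 25, no login, no STARTTLS/SSL -- the kind of connection an
    # ISP relay accepts and Gmail's server does not.
    with smtplib.SMTP(smtp_host, 25) as s:
        s.send_message(msg)
```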

[quote]2. Change all appropriate # of k's remaining, "min n" values for all k's, # of pairs remaining etc. in all web pages to the state that they should be at n=140K. Also change the 1st post of this thread to it's state at n=140K.[/quote]
The web pages are automatically generated by the server. If it says that's what's in the server, then usually that is what's in the server. (Well, except for the server_status page that came out all garbled; I'm not sure why it did that.)

The reason why there were k's left in the server that had already been eliminated is because I had to reload some stuff, as described via PM. As soon as the primes for them are re-found, those k's will be automatically removed from the server.

[quote]3. Load n=140K-150K back into the server and let it rip. When done, let's double check to make sure that all web pages update correctly and that all Emails are sent for primes found.[/quote]
140K-150K [B]is[/B] in the server, and it's been working just fine. It's just that Lennart's mostly dried out the server and thus the only remaining k/n pairs on the server report are stragglers. Most of the work has been completed, including many of the ones at the end of the n-range (so, since the server only shows what it's got left, its "max n" value ends up shrinking a bit).

[quote]4. Verify that there are no rejected results and that the server doesn't barf for no reason like it did on Sat. morning.[/quote]
I've been keeping a close eye on that. It's been doing small periods of barfing since then, but it's been recovering rather nicely all on its own, and I have a script set up to salvage any and all barfed results so we don't lose anything. I've got it all under control. :smile:

[quote]5. Match all residuals with the 1st run.[/quote]
Unnecessary, since everything is working correctly as described above. :smile:

[quote]Once this all runs correctly and everything is verified as correct, then we can consider the server and the interfacing with the web pages statuses/stats/etc. ready to go for n>150K or for other efforts. In other words, it will be production ready. It is the interfacing with these kinds of web pages and stats that I have the biggest concern about at NPLB.[/quote]
Okay, to summarize:

-Email notification still needs to be set up for CRUS PRPnet servers. (For NPLB, it goes automatically through our DB, but that doesn't apply for CRUS.)
-The server barfing problem is under control.
-The problem on the server_status page seems to be due to some sort of bug in the server; at any rate, that was always one of the less useful pages (since all that information can be gathered from the main/server_stats page anyway). We may just want to leave its link off the first post here.
-The server is functioning correctly, and *has* processed all of 140K-150K exactly as it should.

Max :smile:

mdettweiler 2009-08-02 19:44

Okay, looks like G3000 has dried out. Gary, shall I load up 150K-200K?

(FYI: since the server has been fully dried out at the end of this range, I'll be able to "fresh start" the server for 150K-200K. That should clear out any residual effects of the earlier problems.)

gd_barnes 2009-08-02 20:05

Max,

Please reconsider your response in the context of the current situation. You haven't responded to the situation as it currently exists.

1. OK, the SMTP stuff needs to be set up. That's fine. My bad. I forgot about that in your PM.

2. If it says something is in the server, then that is what is in the server?? What does that mean? The server has dried!! Those k's should not be in the server because they already have a prime! You say as soon as we find a prime for them, the pages will show correctly. How can we find a prime for them? We're done with the range you reloaded. That's my whole point. Let's get them out of there so the web pages show correctly. There should be 30 k's remaining at n=140K and 28 k's remaining at n=150K.

3. n=140K-150K IS in the server?? The server has dried! I said we need to RELOAD n=140K-150K in the server and run this thing again. Why are you saying that it IS in the server when I'm asking you to reload the range? Like I said, please consider the context of the current situation.

4. Matching of residuals is not necessary because everything is working correctly as shown above? Little is working correctly from what I can tell! If you don't feel that is necessary, please send them to me from the 1st run before we do a 2nd run and I will match them up after the 2nd run.

5. The situation with the barfing is not under control. We can't have scripts (i.e. patches) being written to fix problems. Let's use output from the problem to find its root cause and fix it there so that patches aren't necessary.

We are not communicating properly here at all. The only thing that I forgot about in a PM was the notification of primes. Nothing else from any PMs showed that the aforementioned problems should have occurred.

I don't feel that we're under control at all. Let's get the server reloaded with n=140K-150K, make the pages show the # of k's remaining correctly for n=140K (30 k's remaining) [do we need to delete some prior k/n pairs for k's that already have a prime?], and run that range again as though it were a first-pass test. At the end of that, there should be 28 k's remaining.

Reference "the server is functioning correctly": Not if it is still barfing. In my mind, "correctly" means that 100% of everything works correctly.

I've designed a test plan. There's nothing here that demonstrates conclusively that much of this will work correctly this next time around. Let's delete what's remaining in the server for n=140K-150K as though it had never been done, make the web pages look like it had never been done (if that requires deleting prior pairs so that some k's don't show as remaining, then let's please do so), and then do that range again. If the web pages update correctly, then I'll be convinced, but not before.

If my test plan is out of context given the situation, please design your own and let's discuss it before we proceed. A test plan should include:

1. The input into the system, i.e. the range that we test. You already stated that in the PM as what is shown in post #1 here so we're good there.
2. What the output should be and look like. That is: What should the web pages look like, what should the various prpnet files look like, etc. when the test is complete. That is the major problem that we are having.

If the output in #2 looks different in any manner, then the test is not correct.


Thank you,
Gary

gd_barnes 2009-08-02 20:06

[quote=mdettweiler;183745]Okay, looks like G3000 has dried out. Gary, shall I load up 150K-200K?

(FYI: since the server has been fully dried out at the end of this range, I'll be able to "fresh start" the server for 150K-200K. That should clear out any residual effects of the earlier problems.)[/quote]


What the heck? NO!!

It dried at 6 AM this morning. I've already stated that. Please stop, slow down, reread everything that I've previously written, and let's get it right.

Edit: Another problem: None of the links in the 1st post are working now!!

I'll be out later this afternoon and evening.

mdettweiler 2009-08-02 20:18

Okay, I think I could have communicated what I was trying to say a bit more clearly in my last post. Here we go again, this time hopefully clearer. :smile:

-The server did not dry at 6 AM. It still had quite a number of stragglers left. That is what I meant when I said that the server still had in it what the web pages said it did (which was correct). The web pages showed at the time that there were a few pairs left for each k, which was accurate at the time.

-The server [I]really[/I] dried shortly before I posted my latest message saying as such. That means the whole 100K-150K range is complete, stragglers and all.

-The barfing has ceased and desisted. I know what the problem was that caused the barfing, and it was due to a connection error, *not* a problem with PRPnet.

-All of the confusion regarding what the server had loaded into it, when it dried, etc. were not due at all to the barfing, but instead were fallout from a mess-up I made when I initially attempted to "fix" the barfing. I now have all that remedied.

-The links in the 1st post have stopped working because I shut down the server after it dried out. I did this so I could get the messes I made earlier under control better. This is of no particular consequence since the server has nothing in it anyway.

Long story short: none of the problems were caused by PRPnet itself; they were all due either to connection problems (which can happen to anybody and could cause other servers such as LLRnet to barf as well), or to human error on my part. The barfing has ceased (being somewhat of a random occurrence in the first place), and the human errors have been remedied.

There, that make sense now? :smile:

Max :smile:

gd_barnes 2009-08-03 06:17

Yes, that makes sense. But you still haven't addressed my concerns about the interfacing web pages yet. We still need a clean test and that hasn't happened yet. Just because we test, find problems, and "supposedly" fix those problems doesn't mean they are fixed. We have to retest! :smile:

On the "barfing", can you tell me what you fixed in the server to make it no longer "barf"? If you haven't done anything, then I see that you're saying that it was "due either to connection problems (which can happen to anybody and could cause other servers such as LLRnet to barf as well), or to human error on my part".

My response: I haven't seen LLRnet barf in ages. That tells me that it shouldn't happen to anyone on PRPnet and that it's either due to a PRPnet problem or to human error. If it was due to a previous human error, then please refer to the test plan below on how we need to do it to avoid any human error.

Also, all of its interfaces must be proven to update and work correctly. Sorry man, I have to see it. Saying "it's fixed" won't do it. This means all related web pages that show: results, k's remaining, pairs remaining, stats, etc. Therefore I stand by my original test plan. We need to redo n=140K-150K to demonstrate that all of these are working correctly.

This may only be a hobby, but it is mathematics, and math is very exacting. Continuing to run "production" data on something that isn't proven to have all of its interfaces working correctly is not something I can stomach.

Here is a restated test plan with a little additional detail:

1. Please correct all of the aforementioned problems with the correct updating of the interfacing web pages to their state at n=140K. (I realize the server likely pulls this info. as it goes but something is not interfacing correctly with how it is pulling them or they would show correctly.) This includes the following:
a) k's remaining. (remove prior pairs or results if needed)
b) min n values for the page that shows all k's.
c) # of pairs remaining. (Does appear correct already as I was able to see it drop appropriately as it handed them out Sat. morning.)
d) The second "min n" column heading should show "max n" in the appropriate column on that one web page.

There must be some "preliminary" fix that is needed before the test begins or the pages would have shown correctly immediately after it began previously. There may simply need to be some prior results deleted...not sure there.

2. Delete any hint anywhere of all n=140K-150K from the server. If the # of k's will not display correctly because prior results are still in the server for k's where no prime was ever found, then we need to delete those results also. When we begin this new test, the web page should show 30 k's remaining.

3. Load n=140K-150K back into the server and let it rip. When done, let's double check to make sure that all web pages update correctly. This includes checking that:
(a) # of pairs remaining is correct as it is processing. (estimating is close enough)
(b) # of k's remaining is 30 when the test initially starts and is 28 when the test ends.
(c) The min n for each k is correct as it goes (estimate is sufficient) and is either blank or equal to the max n when the test is finished.

4. Verify that there are no rejected results and that the server doesn't barf for any reason.

5. Match all residuals with the 1st run.
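Step 5 could even be scripted. A rough sketch, assuming (not verified against PRPnet's actual file layout) that each completed_tests.log line has the candidate as its first field and the residue as its last:

```python
# Rough sketch of residue matching between two runs. Assumed format
# (hypothetical, not verified against PRPnet): candidate is the first
# whitespace-separated field of each line, residue is the last.
def compare_residues(first_log, second_log):
    def load(path):
        residues = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    residues[parts[0]] = parts[-1]
        return residues
    a, b = load(first_log), load(second_log)
    # Candidates tested in both runs whose residues disagree:
    mismatches = {c: (a[c], b[c]) for c in a.keys() & b.keys() if a[c] != b[c]}
    # Candidates tested in one run but missing from the other:
    missing = set(a.keys() ^ b.keys())
    return mismatches, missing
```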

It'd be great if we could have help with the rerun but I'll put multiple quads on it if needed to finish it within a day. Lennart still gets credit for the 2 new primes found that are now shown in the Sierp base 6 drive and on the web pages.

Finally: Let's come to a mutual agreement on the definition of "dried": To me, dried means the server is handing out no new pairs, even if stragglers remain whose results have not come back yet. Just because stragglers remain doesn't mean anyone can get any work. That is, if a new person came here and tried to access the server, he could not get any work. By that definition, the server dried at ~6 AM CDT Sunday morning, because my machines were pulling no more new pairs after that. I'm not sure why it took hours for the server to receive back all of the results for the straggling pairs. Maybe somebody's machine was down for a while.


Thank you,
Gary

mdettweiler 2009-08-03 13:18

Okay, I see where you're coming from now. We'll re-run 140K-150K as a test to compare with this first run. However, before we do that, there's a couple things I'd like to clarify:

-I am quite sure that the barfing was *not* due to human error or PRPnet error in any way, but due to an essentially "random" connection fluctuation. I have no way of knowing which end of the line the fluctuation was on, but I do know this: there's not really any way to fix that. Though, they seem to happen rather rarely anyway, so it's probably not too big a deal in the big picture.

-Okay, I see what you're saying about the definition of a server being "dried". I'd always seen it used the other way around--to refer to a server being completely cleaned out. But, hey, whatever works. :smile:

With that out of the way, I'll go ahead and load 140K-150K into the server again. :smile:

mdettweiler 2009-08-03 13:42

All right, 140K-150K has been reloaded exactly as described above. I see Lennart's hungry machines have already swooped in and grabbed a bunch of work. :smile:

Checking the various web pages:
[URL]http://nplb-gb1.no-ip.org:3000/[/URL] (a.k.a. server_stats.html) checks out except for the "Min N" label on both the Min N and Max N columns. That is a known bug in Sierpinski/Riesel mode in PRPnet; it will be fixed in a future release, but it's only a cosmetic error for now.

[URL]http://nplb-gb1.no-ip.org:3000/server_status.html[/URL] still looks kind of weird:
[quote]Sierpinski Base 6 6 k remaining30 n remainingmin n is 4065 (140005 digits long)[/quote]
It seems that this page is displaying inaccurately, due to a bug in PRPnet. However, as before, this is a cosmetic error (even if a bit more serious than the extra "Min N" label), and regardless of how the words seem to come out on the page, the basic information does line up with the following (as of when I pulled down this page):
-k's remaining: 30
-n's remaining: 4065
-min n remaining: 140005
The # of digits doesn't seem to be presented at all, regardless of the wording.

[URL]http://nplb-gb1.no-ip.org:3000/user_stats.html[/URL] checks out.

I'll report the bug on the server_status page to Mark. However, since all the information on that page can be gathered from the server_stats/home page anyway, it's somewhat less of a big deal since it's not affecting anything except the display of the data. So, we should be OK even with that bug present.

Max :smile:

mdettweiler 2009-08-03 18:18

Server barfing again!
 
I just noticed that the G3000 server started "barfing" again about an hour ago. Like the last few times, it seems to have started doing so in response to what appears to be a communications error between the server and one of Lennart's boxes. Lennart, can you please check over your various boxes that you have on G3000 and see if any of them contain any clues to what might be happening? I'm afraid I can't narrow it down to a specific box from the logs.

I'm going to turn on "debug mode" on the server, which will have it log full socket communication data to a file. I don't usually like to use this for production servers since it produces extremely large logfiles, but I'll use it for now since it should allow us to pinpoint exactly which machine is causing this problem. It's possible that it's the same one every time, which would indicate a problem on Lennart's end rather than the server's.

Interestingly enough, though, I have never seen this barfing problem happen on any other PRPnet server. Possibly there is something specific to G3000 that's causing the problem.

Lennart 2009-08-03 18:31

[quote=mdettweiler;183872]I just noticed that the G3000 server started "barfing" again about an hour ago. Like the last few times, it seems to have started doing so in response to what appears to be a communications error between the server and one of Lennart's boxes. Lennart, can you please check over your various boxes that you have on G3000 and see if any of them contain any clues to what might be happening? I'm afraid I can't narrow it down to a specific box from the logs.

I'm going to turn on "debug mode" on the server, which will have it log full socket communication data to a file. I don't usually like to use this for production servers since it produces extremely large logfiles, but I'll use it for now since it should allow us to pinpoint exactly which machine is causing this problem. It's possible that it's the same one every time, which would indicate a problem on Lennart's end rather than the server's.

Interestingly enough, though, I have never seen this barfing problem happen on any other PRPnet server. Possibly there is something specific to G3000 that's causing the problem.[/quote]

I have checked them all and I can't find the one causing this.

I'll keep looking.

Lennart

mdettweiler 2009-08-03 19:34

[quote=Lennart;183876]I have checked them all and I can't find the one causing this.

I'll keep looking.

Lennart[/quote]
Okay, thanks. The server seems to be holding up OK since I last restarted it (right before I posted my last message), but if it happens again (which, given time, it probably will, given the track record we've been seeing), I should be able to catch exactly what machine is talking to the server at the time from the debug logs.

mdettweiler 2009-08-03 20:35

Looks like the server barfed again. I've restarted it to get things back in order, and I'll start looking through the logs for clues as to what caused it. :smile:

mdettweiler 2009-08-03 20:37

[quote=mdettweiler;183891]Looks like the server barfed again. I've restarted it to get things back in order, and I'll start looking through the logs for clues as to what caused it. :smile:[/quote]
I've looked through the debug log a bit, and while I can't find anything conclusively linking the problem to a particular machine, it does look like the problem may have started occurring when the server communicated with Lennart's machine "_207". Lennart, you may want to check the box with that ID and see if there's anything strange going on with it.

Lennart 2009-08-03 21:00

[quote=mdettweiler;183892]I've looked through the debug log a bit, and while I can't find anything conclusively linking the problem to a particular machine, it does look like the problem may have started occurring when the server communicated with Lennart's machine "_207". Lennart, you may want to check the box with that ID and see if there's anything strange going on with it.[/quote]

I have increased the number of candidates to cache and set the error timeout to 1 min.

Lennart

mdettweiler 2009-08-03 21:05

[quote=Lennart;183895]I have increased the number of candidates to cache and set the error timeout to 1 min.

Lennart[/quote]
Just curious, have you by chance been having any problems with your internet connection lately--sudden dropoffs, etc? Because if so, that might possibly explain why these problems are happening (say, if the connection gets cut during a communication with the server).

Lennart 2009-08-03 21:21

[quote=mdettweiler;183896]Just curious, have you by chance been having any problems with your internet connection lately--sudden dropoffs, etc? Because if so, that might possibly explain why these problems are happening (say, if the connection gets cut during a communication with the server).[/quote]

No, and I run against prpnet.primegrid at the same time and have no problems there.

Lennart

mdettweiler 2009-08-03 21:42

[quote=Lennart;183898]No, and I run against prpnet.primegrid at the same time and have no problems there.

Lennart[/quote]
Hmm, I guess that's ruled out then. I did, however, find something interesting upon further examination of the debug log. For reference, a normal result-reporting communication between client and server looks something like this. (<<<< represents data coming in from the client, and >>>> represents data going out from the server.)
[code][2009-08-03 18:20:13 GMT] Message coming on socket 5
[2009-08-03 18:20:13 GMT] socket 5 <<<< FROM sm5ymt@pekhult.se _153 sm5ymt
[2009-08-03 18:20:13 GMT] sm5ymt@pekhult.se connecting from *.*.*.*
[2009-08-03 18:20:13 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-03 18:20:13 GMT] socket 5 <<<< WorkUnit: 31340*6^145004+1 1249318058
[2009-08-03 18:20:13 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-03 18:20:13 GMT] socket 5 <<<< Test Result: pfgw BD78034F699566B1
[2009-08-03 18:20:13 GMT] socket 5 <<<< End of WorkUnit
[2009-08-03 18:20:13 GMT] socket 5 >>>> INFO: Test for candidate 31340*6^145004+1 accepted
[2009-08-03 18:20:13 GMT] 31340*6^145004+1: Test received by sm5ymt@pekhult.se at *.*.*.* Residue Residue: BD78034F699566B1
[2009-08-03 18:20:13 GMT] socket 5 >>>> End of Workunit Message
[2009-08-03 18:20:14 GMT] socket 5 <<<< WorkUnit: 124221*6^145005+1 1249318058
[2009-08-03 18:20:14 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-03 18:20:14 GMT] socket 5 <<<< Test Result: pfgw 30DF4273EBA52CE2
[2009-08-03 18:20:14 GMT] socket 5 <<<< End of WorkUnit
[2009-08-03 18:20:14 GMT] socket 5 >>>> INFO: Test for candidate 124221*6^145005+1 accepted
[2009-08-03 18:20:14 GMT] 124221*6^145005+1: Test received by sm5ymt@pekhult.se at *.*.*.* Residue Residue: 30DF4273EBA52CE2
[2009-08-03 18:20:14 GMT] socket 5 >>>> End of Workunit Message
[2009-08-03 18:20:15 GMT] socket 5 <<<< End of Message
[2009-08-03 18:20:15 GMT] socket 5 >>>> INFO: All 2 test results were accepted
[2009-08-03 18:20:15 GMT] socket 5 >>>> End of Message
[2009-08-03 18:20:15 GMT] socket 5 <<<< QUIT
[2009-08-03 18:20:15 GMT] closing socket 5[/code]
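As an aside, the framing of that normal exchange can be sketched with a small message builder. This is a reconstruction inferred purely from the logged lines above, not actual PRPnet client code; the field layout (email, machine ID, user name) is an assumption.

```python
def build_return_work(email, machine, user, version, results):
    """Compose a RETURNWORK message as inferred from the debug log.

    `results` is a list of (candidate, timestamp, program, residue)
    tuples. The field layout is guessed from the "<<<<" lines above.
    """
    lines = [f"FROM {email} {machine} {user}", f"RETURNWORK {version}"]
    for candidate, stamp, program, residue in results:
        lines.append(f"WorkUnit: {candidate} {stamp}")
        lines.append(f"Test Result: {program} {residue}")
        lines.append("End of WorkUnit")
    lines.append("End of Message")
    return lines

msg = build_return_work("sm5ymt@pekhult.se", "_153", "sm5ymt", "2.2.3",
                        [("31340*6^145004+1", 1249318058,
                          "pfgw", "BD78034F699566B1")])
```

Note that every "WorkUnit:" line is immediately followed by its "Test Result:" line, which is the ordering the server expects.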
But, by contrast, the sessions that seem to be throwing off the server look like this:
[code][2009-08-03 21:19:35 GMT] Message coming on socket 5
[2009-08-03 21:19:35 GMT] socket 5 <<<< FROM sm5ymt@pekhult.se _31 sm5ymt
[2009-08-03 21:19:35 GMT] sm5ymt@pekhult.se connecting from *.*.*.*
[2009-08-03 21:19:35 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-03 21:19:35 GMT] socket 5 <<<< WorkUnit: 124221*6^148285+1 1249333282
[2009-08-03 21:19:35 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-03 21:19:35 GMT] socket 5 <<<< WorkUnit: 74612*6^148287+1 1249333282
[2009-08-03 21:19:35 GMT] socket 5 <<<< WorkUnit: 172257*6^148286+1 1249333282
[2009-08-03 21:19:35 GMT] socket 5 <<<< End of Message
[2009-08-03 21:19:35 GMT] socket 5 <<<< QUIT
[2009-08-03 21:19:46 GMT] socket 5 (nothing received)
[2009-08-03 21:19:46 GMT] socket 5 >>>> INFO: Test for candidate 124221*6^148285+1 accepted
[2009-08-03 21:19:46 GMT] Error sending <<INFO: Test for candidate 124221*6^148285+1 accepted>> to localhost:3000
[2009-08-03 21:19:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:19:46 GMT] 124221*6^148285+1: Test received by sm5ymt@pekhult.se at *.*.*.* Residue Residue:
[2009-08-03 21:19:46 GMT] socket 5 >>>> End of Workunit Message
[2009-08-03 21:19:46 GMT] Error sending <<End of Workunit Message>> to localhost:3000
[2009-08-03 21:19:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:19:57 GMT] socket 5 (nothing received)
[2009-08-03 21:19:57 GMT] socket 5 >>>> INFO: All 1 test results were accepted
[2009-08-03 21:19:57 GMT] Error sending <<INFO: All 1 test results were accepted>> to localhost:3000
[2009-08-03 21:19:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:19:57 GMT] socket 5 >>>> End of Message
[2009-08-03 21:19:57 GMT] Error sending <<End of Message>> to localhost:3000
[2009-08-03 21:19:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-03 21:20:08 GMT] socket 5 (nothing received)
[2009-08-03 21:20:08 GMT] closing socket 5[/code]
It seems that the client is sending the "WorkUnit:" message, but then, instead of sending the "Test Result:" message that is supposed to directly follow it, it sends [i]another[/i] WorkUnit message before the first test's result! And then, after that, the client just terminates the connection with "End of Message" and "QUIT". Meanwhile the server is still waiting for a result, until it times out and gives up with a "(nothing received)" message. But then we have a problem: the server takes the incomplete result report and registers it as if it were valid, but with a blank residue and application name, since it was never told them--hence the blank results I was seeing in completed_tests.log!

This all makes perfect sense now. It is most definitely not a connection error like I thought earlier, but I was confused because connection errors also tend to produce those mysterious "Error sending <<x>> to localhost:3000" errors. Instead, it looks more like an odd bug in the client that is confusing the heck out of the server. I'll report it to Mark.
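To make the failure mode concrete, here is a toy model of how the server appears to register returned workunits, based only on the two log excerpts above. This is a hypothetical sketch, not PRPnet's actual source; the dictionary layout is invented for illustration.

```python
def parse_returned_work(lines):
    """Toy model of the server's result parsing, as suggested by the
    debug log: if a new "WorkUnit:" line arrives before the previous
    unit's "Test Result:", the previous unit is still recorded -- with
    a blank program name and residue.
    """
    results, current = [], None
    for line in lines:
        if line.startswith("WorkUnit: "):
            if current is not None:      # previous unit never got a result
                results.append(current)
            current = {"candidate": line.split()[1],
                       "program": "", "residue": ""}
        elif line.startswith("Test Result: ") and current is not None:
            _, _, program, residue = line.split()
            current["program"], current["residue"] = program, residue
        elif line == "End of WorkUnit" and current is not None:
            results.append(current)
            current = None
    if current is not None:              # connection ended mid-unit
        results.append(current)
    return results
```

Fed the normal session, every result carries its residue; fed the "barfing" session's three back-to-back WorkUnit lines, all three come out with blank residues, just like the entries in completed_tests.log.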

rogue 2009-08-03 22:26

If someone could append a debug log from a client that has experienced this issue, that would be helpful.

I will put a separate fix into the server to keep it from losing candidates when there is no valid test result.

mdettweiler 2009-08-03 22:56

[quote=rogue;183902]If someone could append a debug log from a client that has experienced this issue, that would be helpful.

I will put a separate fix into the server to keep it from losing candidates when there is no valid test result.[/quote]
I'm afraid I haven't encountered this issue myself on the client end, so I can't help you there; Lennart, could you possibly put all of your G3000 clients on level 1 debug logging, if they aren't already?

rogue 2009-08-03 23:04

Lennart, can you tell me what the client does with the workunits when this happens? Does it delete them or save them and try again?

Lennart 2009-08-03 23:33

[code][2009-08-03 16:44:08 GMT] Total Time: 2:12:11 Total Tests: 15 Total PRPs Found: 0
[2009-08-03 16:44:53 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:10 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:11 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:11 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-03 16:47:11 GMT] 27121: Getting work from server prpnet.primegrid.com at port 12006
[2009-08-03 17:49:36 GMT] 27121: 27*2^1543462+1 is not prime. Residue 2D44561896DD41CE
[2009-08-03 17:49:36 GMT] Total Time: 3:17:39 Total Tests: 16 Total PRPs Found: 0
[2009-08-03 17:49:36 GMT] 27121: Returning work to server prpnet.primegrid.com at port 12006
[2009-08-03 17:49:38 GMT] 27121: INFO: Test for candidate 27*2^1543462+1 accepted
[2009-08-03 17:49:38 GMT] 27121: INFO: All 1 test results were accepted
[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit
[2009-08-03 17:49:44 GMT] crus: INFO: Test for candidate 74612*6^148287+1 accepted
[2009-08-03 17:49:45 GMT] crus: INFO: Test for candidate 172257*6^148286+1 accepted
[2009-08-03 17:49:45 GMT] crus: INFO: 2 of 3 test results were accepted
[2009-08-03 17:49:46 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:47 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:48 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:49 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:50 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:51 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:52 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:53 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:54 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:55 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:56 GMT] crus: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:57 GMT] crus: INFO: No available candidates are left on this server.
[2009-08-03 17:49:57 GMT] 27121: Getting work from server prpnet.primegrid.com at port 12006
[2009-08-03 18:43:33 GMT] 27121: 27*2^1543856+1 is not prime. Residue DE6D07D9F6EA4450
[2009-08-03 18:43:33 GMT] Total Time: 4:11:36 Total Tests: 17 Total PRPs Found: 0
[2009-08-03 18:43:34 GMT] 27121: Returning work to server prpnet.primegrid.com at port 12006
[2009-08-03 18:43:37 GMT] 27121: INFO: Test for candidate 27*2^1543856+1 accepted
[/code]

Here is the log; as you can see, my clock was not correct.

I have started setting them all to debug=1.

Lennart

rogue 2009-08-03 23:42

[QUOTE=Lennart;183918]Here is the log; as you can see, my clock was not correct.

I have started setting them all to debug=1.

Lennart[/QUOTE]

As soon as you can get that log, that would be great. There isn't enough information in these lines to give me a clear picture of what happened.

Lennart 2009-08-04 00:08

[quote=mdettweiler;183911]I'm afraid I haven't encountered this issue myself on the client end, so I can't help you there; Lennart, could you possibly put all of your G3000 clients on level 1 debug logging, if they aren't already?[/quote]


They are now :smile:

Lennart

gd_barnes 2009-08-04 04:27

[quote=mdettweiler;183842]All right, 140K-150K has been reloaded exactly as described above. I see Lennart's hungry machines have already swooped in and grabbed a bunch of work. :smile:

Checking the various web pages:
[URL]http://nplb-gb1.no-ip.org:3000/[/URL] (a.k.a. server_stats.html) checks out except for the "Min N" label on both the Min N and Max N columns, which is a known bug for Sierpinski/Riesel mode in PRPnet and will be fixed in a future release, but is only a cosmetic error for now.

[URL]http://nplb-gb1.no-ip.org:3000/server_status.html[/URL] still looks kind of weird:

It seems that this page is displaying inaccurately, due to a bug in PRPnet. However, as before, this is a cosmetic error (even if a bit more serious than the extra "Min N" label), and regardless of how the words seem to come out on the page, the basic information does line up with the following (as of when I pulled down this page):
-k's remaining: 30
-n's remaining: 4065
-min n remaining: 140005
The # of digits doesn't seem to be presented at all, regardless of the wording.

[URL]http://nplb-gb1.no-ip.org:3000/user_stats.html[/URL] checks out.

I'll report the bug on the server_status page to Mark. However, since all the information on that page can be gathered from the server_stats/home page anyway, it's somewhat less of a big deal since it's not affecting anything except the display of the data. So, we should be OK even with that bug present.

Max :smile:[/quote]


Gotta have all interfaces (as well as barfing) fixed, including grammar/spacing/clarity/etc., before we load n>150K, even if it requires a new PRPnet release. We'll keep rerunning n=140K-150K until we get a clean test. The "n's remaining" of 4065 is misleading. It should say something like "pairs remaining" (assuming that is what it is referring to).

Thanks for getting that going. It's good to see the # of k's remaining is correct now.


Thanks,
Gary

gd_barnes 2009-08-04 04:30

Could these problems with "barfing" be a result of my servers not being able to handle a very big load? That's quite a bit of crunching power on there from Lennart. (I have the equivalent of 2 cores on there, i.e. a 50-50 split with another effort on a full quad.)

We should know definitively whether load is a possible problem when port G5000 at NPLB gets rolling with the very teeny tests that will only take a few secs. each. That should be a big load even with just a few quads on it! :smile:

gd_barnes 2009-08-04 04:42

Something new for this time around:

First, I believe this drive is being processed by n-value. Based on that, I see that at [URL]http://nplb-gb1.no-ip.org:3000/[/URL] k=124125 and 124221 have a min n of 140006 and 140005 respectively, even though most of this testing effort is at n>143K. Could it be because someone has received some pairs that haven't been returned to the server in a long time? I'm trying to determine if the "min n" is updating properly on all k's.

Second, the max n is showing as n=~148.7K for nearly all k's. (~147.6K for a few k's, perhaps because they are lower weight?) It should be showing as n=~150K for all k's unless the full n=140K-150K range was not loaded. This looked correct last time. How come it doesn't look correct this time?


Final question: How frequently is the "min n" and "max n" by k page updated?


Thanks,
Gary

mdettweiler 2009-08-04 04:50

[quote=gd_barnes;183948]Something new for this time around:

First, I believe this drive is being processed by n-value. Based on that, I see that at [URL]http://nplb-gb1.no-ip.org:3000/[/URL] k=124125 and 124221 have a min n of 140006 and 140005 respectively, even though most of this testing effort is at n>143K. Could it be because someone has received some pairs that haven't been returned to the server in a long time? I'm trying to determine if the "min n" is updating properly on all k's.

Second, the max n is showing as n=~148.7K for all k's. It should be showing as n=~150K unless the full n=140K-150K range was not loaded. This looked correct last time. How come it doesn't look correct this time?


Final question: How frequently is the "min n" and "max n" by k page updated?


Thanks,
Gary[/quote]
Min n and Max n are updated every 5 minutes; intermediate changes are shown in the columns to the right of those. (The intermediate changes are absorbed into the larger Min n and Max n columns at the 5 minute updates when completed tests are removed from the prpserver.candidates file.)

Regarding the barfing possibly being related to server stress: no, I've talked with Mark and he's definitely confirmed that there's a server bug that needs to be fixed, as well as possibly a client bug pending further investigation of debug.log files. He's sent me a fix for the server side of things (which should definitely fix the barfing), though he asked me not to apply the fixed version to the server yet so Lennart's clients can get a chance to catch a log of their end of the barfing in their debug.log files for him to examine and see if there's a bug in the client as well.
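The 5-minute absorption step described above could be sketched roughly like this (function and variable names are illustrative, not taken from PRPnet's source):

```python
def recompute_bounds(candidates, completed):
    """Drop completed tests from the candidate list (as happens with
    prpserver.candidates at each 5-minute update) and recompute the
    Min n / Max n columns from whatever remains."""
    remaining = sorted(set(candidates) - set(completed))
    return (remaining[0], remaining[-1]) if remaining else (None, None)

# With a nearly finished range, the bounds shrink to the stragglers' span:
bounds = recompute_bounds([140005, 143000, 148700, 150000],
                          [140005, 150000])
```

One consequence of this scheme is that the displayed bounds track only the outstanding candidates, not the range as originally loaded.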

gd_barnes 2009-08-04 04:58

OK, great. I'm glad to hear we've nailed down the "server barfing" problems.

Thanks for the "absorbing of changes" to min/max n explanation. It's a little clearer to me now.

It seems we still have a "min n" and "max n" problem, though; mainly "max n". The "min n" issue could be a result of a few pairs not having been returned yet, although that seems a little suspect since Lennart and I are the main ones on there and our machines have remained connected (I think). The "max n" should be n=~150K for all k's. Can you look into that? Is the full n=140K-150K file loaded into the server?


Gary

mdettweiler 2009-08-04 13:06

[quote=gd_barnes;183952]It seems we still have a "min n" and "max n" problem, though; mainly "max n". The "min n" issue could be a result of a few pairs not having been returned yet, although that seems a little suspect since Lennart and I are the main ones on there and our machines have remained connected (I think). The "max n" should be n=~150K for all k's. Can you look into that? Is the full n=140K-150K file loaded into the server?[/quote]
Look at the # of candidates left for each k--you'll notice that it's a very small number. The reason the "max n" is not quite up to ~150K is that most of the work has been completed; all you're seeing now is just whatever the min and max of any stragglers happen to be.

BTW, Lennart, did you catch the barfing in one of your client debug.log's? I see a small bit of barfing on the server from around August 3, 21:15 GMT; here are the clients involved: _31, _206, _162, _127, _71, _31, and last but not least, humpford (one of Gary's finest :wink:). (Of course since Gary doesn't have his clients set to debug logging, that last one is rather irrelevant; no big deal, there should be plenty of data from Lennart's logs.)

Interestingly enough, last night doesn't seem to be a big one for barfing; I had to go all the way back to the time of the abovementioned barf in order to find any instance of it.

Lennart 2009-08-04 13:24

[quote=mdettweiler;184004]Look at the # of candidates left for each k--you'll notice that it's a very small number. The reason the "max n" is not quite up to ~150K is that most of the work has been completed; all you're seeing now is just whatever the min and max of any stragglers happen to be.

BTW, Lennart, did you catch the barfing in one of your client debug.log's? I see a small bit of barfing on the server from around August 3, 21:15 GMT; here are the clients involved: _31, _206, _162, _127, _71, _31, and last but not least, humpford (one of Gary's finest :wink:). (Of course since Gary doesn't have his clients set to debug logging, that last one is rather irrelevant; no big deal, there should be plenty of data from Lennart's logs.)

Interestingly enough, last night doesn't seem to be a big one for barfing; I had to go all the way back to the time of the abovementioned barf in order to find any instance of it.[/quote]

Debug was not on that early; I enabled debug between 22:00 and 23:45 GMT.

Lennart

gd_barnes 2009-08-04 20:50

[quote=mdettweiler;184004]Look at the # of candidates left for each k--you'll notice that it's a very small number. The reason the "max n" is not quite up to ~150K is that most of the work has been completed; all you're seeing now is just whatever the min and max of any stragglers happen to be.[/quote]


Wow, that was fast; we're already down to only the remaining stragglers. That makes sense now. Thanks for enlightening me. We still need to get the formatting fixed on that one web page. We also need to change "n remaining" to "pairs remaining".

The option to run percentages of certain servers is outstanding! Whenever we reload for another test, Lennart's and my machines just start gobbling them up. Very cool! :-)


Gary

gd_barnes 2009-08-04 20:53

When did the last run hand out its last pair? In other words, when did the server "dry" by my definition?

I'm wondering because there are still 350 stragglers remaining. Unless someone has shut off a machine, all stragglers should have been processed by now. Can that be checked? Thanks.

One more thing. Is there the equivalent of a "JobMaxTime" in PRPnet? If so, what has G3000 been set to?


Gary

mdettweiler 2009-08-04 21:01

[quote=gd_barnes;184109]When did the last run hand out its last pair? In other words, when did the server "dry" by my definition?

I'm wondering because there are still 350 stragglers remaining. Unless someone has shut off a machine, all stragglers should have been processed by now. Can that be checked? Thanks.

One more thing. Is there the equivalent of a "JobMaxTime" in PRPnet? If so, what has G3000 been set to?


Gary[/quote]
PRPnet does have an equivalent of a jobMaxTime value, specified in the file prpserver.delay on the server. It's been set to 3 days.

It's a little hard to tell you exactly when the server "dried" by your definition. I'm presuming that by that, you mean when did the server run out of its "main" stash of work until everything that's left was stragglers? If so, that will take a bit of digging to find out.
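The expiry behaviour described here could be modelled like this; it is a sketch under the assumption that the delay value is simply a maximum assignment age in seconds, and all names are illustrative rather than from PRPnet:

```python
import time

def expired_assignments(assigned_at, max_age, now=None):
    """Return candidates held longer than `max_age` seconds, i.e. those
    considered abandoned and eligible for reassignment (a rough model
    of the prpserver.delay timeout described above)."""
    now = time.time() if now is None else now
    return [cand for cand, t in assigned_at.items() if now - t > max_age]
```

With the limit set to 3 days, a candidate handed out 4 days ago would expire while one handed out 2 days ago would not.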

mdettweiler 2009-08-04 21:34

I just checked the server, and it seems that the last time any test was handed out was at 22:17 GMT, 8/3. Within the next couple of hours Lennart's various machines returned their various results, but since then there's been no activity besides the server sending out "no available candidates on server" messages. I've changed the time limit to 6 hours; I see now that a large number (possibly all, I didn't do an exact count) of the stragglers have been expired and are being reassigned.

gd_barnes 2009-08-04 23:51

[quote=mdettweiler;184115]I just checked the server, and it seems that the last time any test was handed out was at 22:17 GMT, 8/3. Within the next couple of hours Lennart's various machines returned their various results, but since then there's been no activity besides the server sending out "no available candidates on server" messages. I've changed the time limit to 6 hours; I see now that a large number (possibly all, I didn't do an exact count) of the stragglers have been expired and are being reassigned.[/quote]

Who were the stragglers assigned to, to begin with? Could this be a possible PRPnet bug, or did someone receive some pairs and then disconnect or turn off their machines? I would think that if it was Lennart or me, they should have been quickly processed.

This is important to figure out because in the first test, the same thing happened. I observed that the server was dry by my definition about 4-5 hours before you confirmed that it was really dry by your definition. If the pairs are coming back in a reasonable time frame, that difference should have been < 1 hour. If something is causing some pairs to "get stuck" for an extended period, we need to figure that out.

To clarify again:
Dry by my definition: No new work is available to hand out. Some straggling pairs still need results to be returned.

Dry by your definition: All pairs have been processed and returned.

Let me come up with a better way to state this: Your definition is probably more accurate in a purely technical sense so how about we call my definition "nominally dried" and stick with calling your definition simply "dried".


Gary

mdettweiler 2009-08-05 01:46

[quote=gd_barnes;184142]Who were the stragglers assigned to, to begin with? Could this be a possible PRPnet bug, or did someone receive some pairs and then disconnect or turn off their machines? I would think that if it was Lennart or me, they should have been quickly processed.

This is important to figure out because in the first test, the same thing happened. I observed that the server was dry by my definition about 4-5 hours before you confirmed that it was really dry by your definition. If the pairs are coming back in a reasonable time frame, that difference should have been < 1 hour. If something is causing some pairs to "get stuck" for an extended period, we need to figure that out.

To clarify again:
Dry by my definition: No new work is available to hand out. Some straggling pairs still need results to be returned.

Dry by your definition: All pairs have been processed and returned.

Let me come up with a better way to state this: Your definition is probably more accurate in a purely technical sense so how about we call my definition "nominally dried" and stick with calling your definition simply "dried".


Gary[/quote]
Unless I missed something, all of the stragglers were from Lennart. Lennart, do you know of a particular reason why you had these abandoned pairs, or does this seem more like a bug?

Regarding dried vs. nominally dried: okay, that works. :smile:

Lennart 2009-08-05 02:01

[quote=mdettweiler;184153]Unless I missed something, all of the stragglers were from Lennart. Lennart, do you know of a particular reason why you had these abandoned pairs, or does this seem more like a bug?

Regarding dried vs. nominally dried: okay, that works. :smile:[/quote]

No, those are from when the server did not receive them (when I could not connect). But they will not come out again before the time set in the delay file is over.

[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit

Here you see that the candidate was deleted when we had the connection error. So if you have one day in the delay file, I don't get them again for 24 hours.

Lennart

mdettweiler 2009-08-05 03:35

[quote=Lennart;184159]No, those are from when the server did not receive them (when I could not connect). But they will not come out again before the time set in the delay file is over.

[2009-08-03 17:49:38 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-03 17:49:43 GMT] crus: ERROR: Workunit 124221*6^148285+1 not found on server
[2009-08-03 17:49:43 GMT] crus: The client will delete this workunit

Here you see that the candidate was deleted when we had the connection error. So if you have one day in the delay file, I don't get them again for 24 hours.

Lennart[/quote]
Okay, I see. Since I know that the client won't just delete a result if it merely can't connect, but rather will hang on to the result and try again later, it's got to be something a little different than just that. From what your logs reported, it would seem almost like those tests were already completed on the server end, but the client didn't know that...ooh! I just thought of something! What if when the server barfs on a test, the client ends up deleting the *next* result, thinking that it wasn't found? This would fit with what we were seeing in the server's debug logs, where the client would seem to be sending the server a second workunit line at a time that would be out of place according to the PRPnet communication protocol. The server barfs on this, and registers the first test with a blank residual, but ignores the second one. Meanwhile, the client could quite believably think that the server is saying "the second test is not on the server" and therefore it deletes the result, thinking it's unneeded. Bingo! We have our abandoned tests.
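The hypothesized desync can be put in code as a toy model (this illustrates the guess above, and is not actual PRPnet code): the server's completed list already contains the barfed candidate, registered with a blank residue, so the client's legitimate re-report of it is answered with "not found" and discarded.

```python
def report_results(server_completed, client_results):
    """Toy model of the hypothesis: for each (candidate, residue) the
    client returns, the server accepts it unless the candidate is
    already in its completed list -- in which case the client deletes
    its perfectly good result, creating an abandoned test."""
    kept, deleted = [], []
    for candidate, residue in client_results:
        if candidate in server_completed:   # server thinks it's done already
            deleted.append(candidate)       # client discards the result
        else:
            server_completed[candidate] = residue
            kept.append(candidate)
    return kept, deleted

# Mirrors Lennart's log: the barfed candidate sits in the completed
# list with a blank residue, so only 2 of 3 results are accepted.
completed = {"124221*6^148285+1": ""}
kept, deleted = report_results(completed, [
    ("124221*6^148285+1", "AAAA1111BBBB2222"),
    ("74612*6^148287+1", "CCCC3333DDDD4444"),
    ("172257*6^148286+1", "EEEE5555FFFF6666"),
])
```

The residues here are made-up placeholders; only the candidate names come from the log.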

Lennart, could you provide debug.log excerpts from when your clients behaved like in your example? This might be just the key we're looking for. Meanwhile, I'll check and see if on the server, a test was barfed at the same time shown in your logs for your example.

Edit: oh, never mind, looks like that particular example was from before I put the server on debug logging. Lennart, do you have a similar example from somewhere after 2009-08-03 18:20:04 GMT?

gd_barnes 2009-08-05 09:37

Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not as all machines were connected at all times. Good luck both Max and Rogue figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time. lol


Gary

rogue 2009-08-05 12:25

[QUOTE=gd_barnes;184180]Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not as all machines were connected at all times. Good luck both Max and Rogue figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time.[/QUOTE]

I have provided a server side patch to Max (which I've instructed him to not install), but I'm waiting to see a debug log from a client that is exhibiting the problem so that I can determine if there is a bug in the client as well.

mdettweiler 2009-08-05 12:48

[quote=gd_barnes;184180]Your last sentence notwithstanding, THAT is EXACTLY what I was afraid of and thought it might be!! (caps for emphasis not yelling) I had been seeing "result not accepted" and something about it being deleted in some of my files. Yet my one quad was definitely connected the entire time.

This clearly has to be a PRPnet bug -or- related to load on my server that either the server or the PRPnet software cannot handle. For some reason, it is not accepting some returned results even though it should be. It seems to think they are already done when in fact they are not.

Whew, and I thought I was hallucinating about the huge difference between the nominal drying time and actual drying time. It seems I was not as all machines were connected at all times. Good luck both Max and Rogue figuring it out. That doesn't sound easy. If you need examples from my machine, let me know.

You can thank me now or thank me later for observing the unusually large difference between the two drying times for machines that were connected the entire time. lol


Gary[/quote]
Hey, looky! I just noticed something! G3000 has now *really* dried (not just nominally). It now shows just one insanely huge k with 999999999 as the min-n on the server_stats.html page, which is how dried PRPnet servers are "supposed" to react. (It's just a cosmetic error, though rest assured, yes, we're working on that too. :smile:)

This means, of course, that if there's any data to be had on the client side of things, it's already sitting in one of Lennart's debug.log's and waiting for us to collect. :smile:

Lennart 2009-08-05 15:56

[quote=mdettweiler;184210]Hey, looky! I just noticed something! G3000 has now *really* dried (not just nominally). It now shows just one insanely huge k with 999999999 as the min-n on the server_stats.html page, which is how dried PRPnet servers are "supposed" to react. (It's just a cosmetic error, though rest assured, yes, we're working on that too. :smile:)

This means, of course, that if there's any data to be had on the client side of things, it's already sitting in one of Lennart's debug.log's and waiting for us to collect. :smile:[/quote]

It is not easy for me to find it: I would have to check ~80 logs on 20 computers.

Do you see anything strange in the server log that could nail it down to one computer? :smile:

Lennart

mdettweiler 2009-08-05 16:27

[quote=Lennart;184237]It is not easy for me to find it: I would have to check ~80 logs on 20 computers.

Do you see anything strange in the server log that could nail it down to one computer? :smile:

Lennart[/quote]
Okay, the last time the server's barfed was around the time of these log entries:

[code][2009-08-03 21:14:45 GMT] 172257*6^148184+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _31 Program: Residue:
[2009-08-03 21:15:18 GMT] 168610*6^148352+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _206 Program: Residue:
[2009-08-03 21:15:51 GMT] 87800*6^147867+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _162 Program: Residue:
[2009-08-03 21:16:14 GMT] 124125*6^148166+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _127 Program: pfgw Residue: 59589D7D3BD68AF3
[2009-08-03 21:16:28 GMT] 123285*6^148356+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _206 Program: Residue:
[2009-08-03 21:17:01 GMT] 108527*6^148607+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _71 Program: Residue:
[2009-08-03 21:17:34 GMT] 33706*6^147870+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _162 Program: Residue:
[2009-08-03 21:18:07 GMT] 59506*6^148227+1: Email: [EMAIL="gbarnes017@gmail.com"]gbarnes017@gmail.com[/EMAIL] User: gd_barnes Client: humpford Program: Residue:
[2009-08-03 21:18:40 GMT] 124125*6^148166+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _127 Program: Residue:
[2009-08-03 21:19:13 GMT] 108527*6^148607+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _71 Program: Residue:
[2009-08-03 21:19:46 GMT] 124221*6^148285+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _31 Program: Residue:
[2009-08-03 21:20:19 GMT] 13215*6^148275+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _127 Program: Residue: [/code]
Any of those with blank residuals above should do the trick. :smile:
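Incidentally, the blank-residue entries above suggest a simple server-side guard: refuse to register any result whose residue field came back empty, so a half-received message can't mark a test as done. A hypothetical sketch in Python (PRPnet itself is written in C++; the function and field names here are invented for illustration):

```python
def accept_result(candidate, program, residue, completed_log):
    """Hypothetical server-side check: only log a result if both the
    program name and the residue actually arrived with it."""
    if not program or not residue:
        return False  # leave the candidate outstanding for a re-test
    completed_log.append((candidate, program, residue))
    return True

log = []
accept_result("124125*6^148166+1", "pfgw", "59589D7D3BD68AF3", log)  # True
accept_result("123285*6^148356+1", "", "", log)                      # False
print(log)  # only the complete result was recorded
```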

Lennart 2009-08-05 18:18

[quote=mdettweiler;184241]Okay, the last time the server's barfed was around the time of these log entries:

[code][2009-08-03 21:14:45 GMT] 172257*6^148184+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _31 Program: Residue:
[2009-08-03 21:15:18 GMT] 168610*6^148352+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _206 Program: Residue:
[2009-08-03 21:15:51 GMT] 87800*6^147867+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _162 Program: Residue:
[2009-08-03 21:16:14 GMT] 124125*6^148166+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _127 Program: pfgw Residue: 59589D7D3BD68AF3
[2009-08-03 21:16:28 GMT] 123285*6^148356+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _206 Program: Residue:
[2009-08-03 21:17:01 GMT] 108527*6^148607+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _71 Program: Residue:
[2009-08-03 21:17:34 GMT] 33706*6^147870+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _162 Program: Residue:
[2009-08-03 21:18:07 GMT] 59506*6^148227+1: Email: [EMAIL="gbarnes017@gmail.com"]gbarnes017@gmail.com[/EMAIL] User: gd_barnes Client: humpford Program: Residue:
[2009-08-03 21:18:40 GMT] 124125*6^148166+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _127 Program: Residue:
[2009-08-03 21:19:13 GMT] 108527*6^148607+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _71 Program: Residue:
[2009-08-03 21:19:46 GMT] 124221*6^148285+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _31 Program: Residue:
[2009-08-03 21:20:19 GMT] 13215*6^148275+1: Email: [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] User: sm5ymt Client: _127 Program: Residue: [/code]Any of those with blank residuals above should do the trick. :smile:[/quote]

I started debug logging at 2009-08-03 22:00 UTC on almost all machines.

Do you have anything after that time?

Lennart

mdettweiler 2009-08-05 20:10

[quote=Lennart;184259]I started debug logging at 2009-08-03 22:00 UTC on almost all machines.

Do you have anything after that time?

Lennart[/quote]
Unfortunately, no. And since the server's already been dried...hmm...

Anyone up for another run-through of 140K-150K? :smile:

mdettweiler 2009-08-06 03:37

Okay, I've loaded 140K-150K into the server once again. Hopefully this time we'll get all the testing data we need and won't have to re-run this range any more. :smile:

gd_barnes 2009-08-06 04:00

Rerun it as many times as you need. With Lennart's and my machines on there, it goes very quickly, and if there is any anomaly with rejected or deleted results, we can try to isolate every one of them.

I'm not clear: was the patch that Mark referred to put in for this run? If not, I'm assuming we'll have some of the same issues, and of course we'll have to run it again with the patch (and/or additional patches) in place.


Gary

gd_barnes 2009-08-06 04:03

Oopsie...the one summary page is showing 35 k's remaining. I then went to the page with all of the k's showing "min n" and "max n" for each and it also shows 35 remaining.

With a run beginning at n=140K, there should only be 30 k's remaining.


Gary

mdettweiler 2009-08-06 14:24

[quote=gd_barnes;184308]Oopsie...the one summary page is showing 35 k's remaining. I then went to the page with all of the k's showing "min n" and "max n" for each and it also shows 35 remaining.

With a run beginning at n=140K, there should only be 30 k's remaining.


Gary[/quote]
Oops...sorry. :blush: This time around I forgot to remove all the k's that had been primed in n=100K-140K. Oh well, it's just a few more k's, and since the entire range is being rerun anyway, it will just give us that much more test data. :smile:

Edit: And yes, we're still NOT using the patched server version, so that we can catch the logs on both the server and client from when it barfs.

Lennart 2009-08-06 22:20

1 Attachment(s)
Here are some files.

Lennart

rogue 2009-08-06 22:55

[QUOTE=Lennart;184379]Here are some files.

Lennart[/QUOTE]

These are large files; can you narrow down the time frame for the issue? And can you verify that the issue is in one of these files?

mdettweiler 2009-08-06 23:08

[quote=rogue;184383]These are large files, can you narrow down the time frame for the issue? Can you verify that the issue is in one of these files?[/quote]
Based on the server logs, it seems there was a lot of barfing between [2009-08-06 15:47:02 GMT] and [2009-08-06 16:50:34 GMT].

Lennart 2009-08-06 23:09

[quote=rogue;184383]These are large files, can you narrow down the time frame for the issue? Can you verify that the issue is in one of these files?[/quote]


I think Max can get the time from the server log.

If I get that time, I have many more logs I can look in.

Lennart

mdettweiler 2009-08-06 23:14

[quote=mdettweiler;184384]Based on the server logs, it seems there was a lot of barfing between [2009-08-06 15:47:02 GMT] and [2009-08-06 16:50:34 GMT].[/quote]
Aha! I found something in debug2.log:
[code][2009-08-06 15:54:30 GMT] socket 3 >>>> FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _162 sm5ymt
[2009-08-06 15:54:30 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-06 15:54:30 GMT] socket 3 >>>> RETURNWORK 2.2.3
[2009-08-06 15:54:30 GMT] socket 3 >>>> WorkUnit: 121736*6^146599+1 1249572434
[2009-08-06 15:54:41 GMT] socket 3 (nothing received)
[2009-08-06 15:54:41 GMT] socket 3 >>>> WorkUnit: 168610*6^146600+1 1249572434
[2009-08-06 15:54:52 GMT] socket 3 (nothing received)
[2009-08-06 15:54:52 GMT] socket 3 >>>> WorkUnit: 36772*6^146604+1 1249572434
[2009-08-06 15:55:03 GMT] socket 3 (nothing received)
[2009-08-06 15:55:03 GMT] socket 3 >>>> WorkUnit: 118147*6^146604+1 1249572434
[2009-08-06 15:55:14 GMT] socket 3 (nothing received)
[2009-08-06 15:55:14 GMT] socket 3 >>>> WorkUnit: 124125*6^146606+1 1249572434
[2009-08-06 15:55:25 GMT] socket 3 (nothing received)
[2009-08-06 15:55:25 GMT] socket 3 >>>> End of Message
[2009-08-06 15:55:36 GMT] socket 3 (nothing received)
[2009-08-06 15:55:36 GMT] socket 3 >>>> QUIT
[2009-08-06 15:55:36 GMT] closing socket 3
[2009-08-06 15:55:36 GMT] closing socket 3[/code]
This seems to be the client end of a barfed batch. The client is sending five WorkUnit: lines in a row but without any actual results in between. It also doesn't seem to try to fetch the greeting when this happens.

Also of note, the client doesn't seem to consider these "accepted" by the server, but rather keeps on trying to return them again and again, with the same result. Eventually the server gets out of its "barfing mood" and lets the client communicate normally--except that now the server thinks all of those tests are already accounted for (having registered them with blank results):
[code][2009-08-06 16:45:30 GMT] socket 3 >>>> FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _162 sm5ymt
[2009-08-06 16:45:30 GMT] socket 3 >>>> GETGREETING
[2009-08-06 16:45:35 GMT] socket 3 <<<< ############
[2009-08-06 16:45:35 GMT] socket 3 <<<< Welcome to the CRUS G3000 PRPnet beta test server! :-D
[2009-08-06 16:45:35 GMT] socket 3 <<<< Server is running PRPnet v2.2.3
[2009-08-06 16:45:35 GMT] socket 3 <<<< ############
[2009-08-06 16:45:35 GMT] socket 3 <<<< OK.
[2009-08-06 16:45:35 GMT] socket 3 >>>> QUIT
[2009-08-06 16:45:35 GMT] closing socket 3
[2009-08-06 16:45:35 GMT] Total Time: 9:55:42 Total Tests: 42 Total PRPs Found: 0
[2009-08-06 16:45:36 GMT] socket 3 >>>> FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _162 sm5ymt
[2009-08-06 16:45:36 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-06 16:45:36 GMT] socket 3 >>>> RETURNWORK 2.2.3
[2009-08-06 16:45:36 GMT] socket 3 >>>> WorkUnit: 121736*6^146599+1 1249572434
[2009-08-06 16:45:40 GMT] socket 3 <<<< ERROR: Workunit 121736*6^146599+1 not found on server
[2009-08-06 16:45:40 GMT] crus: ERROR: Workunit 121736*6^146599+1 not found on server
[2009-08-06 16:45:40 GMT] crus: The client will delete this workunit
[2009-08-06 16:45:40 GMT] socket 3 >>>> WorkUnit: 168610*6^146600+1 1249572434
[2009-08-06 16:45:41 GMT] socket 3 <<<< ERROR: Workunit 168610*6^146600+1 not found on server
[2009-08-06 16:45:41 GMT] crus: ERROR: Workunit 168610*6^146600+1 not found on server
[2009-08-06 16:45:41 GMT] crus: The client will delete this workunit
[2009-08-06 16:45:41 GMT] socket 3 >>>> WorkUnit: 36772*6^146604+1 1249572434
[2009-08-06 16:45:41 GMT] socket 3 <<<< ERROR: Workunit 36772*6^146604+1 not found on server
[2009-08-06 16:45:41 GMT] crus: ERROR: Workunit 36772*6^146604+1 not found on server
[2009-08-06 16:45:41 GMT] crus: The client will delete this workunit
[2009-08-06 16:45:41 GMT] socket 3 >>>> WorkUnit: 118147*6^146604+1 1249572434
[2009-08-06 16:45:41 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 16:45:41 GMT] socket 3 >>>> Test Result: pfgw CCC87E55A38FA5F5
[2009-08-06 16:45:41 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 16:45:42 GMT] socket 3 <<<< INFO: Test for candidate 118147*6^146604+1 accepted
[2009-08-06 16:45:42 GMT] crus: INFO: Test for candidate 118147*6^146604+1 accepted
[2009-08-06 16:45:42 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 16:45:42 GMT] socket 3 >>>> WorkUnit: 124125*6^146606+1 1249572434
[2009-08-06 16:45:42 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 16:45:42 GMT] socket 3 >>>> Test Result: pfgw B88C22D5DB33C4EA
[2009-08-06 16:45:42 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 16:45:43 GMT] socket 3 <<<< INFO: Test for candidate 124125*6^146606+1 accepted
[2009-08-06 16:45:43 GMT] crus: INFO: Test for candidate 124125*6^146606+1 accepted
[2009-08-06 16:45:43 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 16:45:43 GMT] socket 3 >>>> End of Message
[2009-08-06 16:45:43 GMT] socket 3 <<<< INFO: 2 of 5 test results were accepted
[2009-08-06 16:45:43 GMT] crus: INFO: 2 of 5 test results were accepted
[2009-08-06 16:45:43 GMT] socket 3 >>>> QUIT
[2009-08-06 16:45:43 GMT] closing socket 3
[2009-08-06 16:45:43 GMT] closing socket 3[/code]

rogue 2009-08-07 00:01

That is very useful. Now I would also like to see the server's debug log corresponding to these messages so that I can match them up.

What I am most concerned with, based upon what I see in the log, is that the server is not responding in a timely fashion to the client. The server does respond (sometimes), but not quickly enough for the client. The client expects a response within 10 seconds. Is there something on the server that is running at higher priority, thus taking cycles away from the PRPNet server? I hope to see some things within the server's debug log.
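The 10-second window described here can be pictured with an ordinary socket receive timeout. This is only an illustrative Python model of the behavior seen in the logs, not PRPnet's actual C++ networking code:

```python
import socket

RESPONSE_TIMEOUT = 10  # seconds before the client logs "(nothing received)"

def recv_line(sock, timeout=RESPONSE_TIMEOUT):
    """Read one newline-terminated reply; return None if nothing arrives
    within the timeout (modeling the client's "(nothing received)" case)."""
    sock.settimeout(timeout)
    buf = b""
    try:
        while not buf.endswith(b"\n"):
            chunk = sock.recv(1)
            if not chunk:       # peer closed the connection
                return None
            buf += chunk
    except socket.timeout:
        return None             # timed out: nothing received
    return buf.decode().rstrip("\n")
```

A client built this way treats a slow server exactly like a silent one, which is why a server stall longer than the timeout can desynchronize the conversation.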

mdettweiler 2009-08-07 02:35

[quote=rogue;184389]That is very useful. Now I would also like to see the server's debug log corresponding to these messages so that I can match them up.

What I am most concerned with, based upon what I see in the log, is that the server is not responding in a timely fashion to the client. The server does respond (sometimes), but not quickly enough for the client. The client expects a response within 10 seconds. Is there something on the server that is running at higher priority, thus taking cycles away from the PRPNet server? I hope to see some things within the server's debug log.[/quote]
Okay, coming right up. First of all, though, I couldn't find anything at 15:54 GMT from Lennart's machine _162 on the server. Based on correlations of the connection data from the server and client logfiles, it seems that box _162 is about 7 minutes slow. (Lennart, can you possibly check to see if _162 is ~7 minutes slow and thus that I did indeed get the right log excerpt? Thanks.) With that in mind, here's what the server saw at the same time as the first (barfed) log entry in my last post:
[code][2009-08-06 16:01:35 GMT] Message coming on socket 5
[2009-08-06 16:01:35 GMT] socket 5 <<<< FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _162 sm5ymt
[2009-08-06 16:01:35 GMT] [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] connecting from 91.149.39.143
[2009-08-06 16:01:35 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 121736*6^146599+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 168610*6^146600+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 36772*6^146604+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 118147*6^146604+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< WorkUnit: 124125*6^146606+1 1249572434
[2009-08-06 16:01:35 GMT] socket 5 <<<< End of Message
[2009-08-06 16:01:35 GMT] socket 5 <<<< QUIT
[2009-08-06 16:01:46 GMT] socket 5 (nothing received)
[2009-08-06 16:01:46 GMT] socket 5 >>>> INFO: Test for candidate 121736*6^146599+1 accepted
[2009-08-06 16:01:46 GMT] Error sending <<INFO: Test for candidate 121736*6^146599+1 accepted>> to localhost:3000
[2009-08-06 16:01:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:01:46 GMT] 121736*6^146599+1: Test received by [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] at 91.149.39.143 Residue Residue:
[2009-08-06 16:01:46 GMT] socket 5 >>>> End of Workunit Message
[2009-08-06 16:01:46 GMT] Error sending <<End of Workunit Message>> to localhost:3000
[2009-08-06 16:01:46 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:01:57 GMT] socket 5 (nothing received)
[2009-08-06 16:01:57 GMT] socket 5 >>>> INFO: All 1 test results were accepted
[2009-08-06 16:01:57 GMT] Error sending <<INFO: All 1 test results were accepted>> to localhost:3000
[2009-08-06 16:01:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:01:57 GMT] socket 5 >>>> End of Message
[2009-08-06 16:01:57 GMT] Error sending <<End of Message>> to localhost:3000
[2009-08-06 16:01:57 GMT] socket 5 >>>> !!! send error !!!
[2009-08-06 16:02:08 GMT] socket 5 (nothing received)
[2009-08-06 16:02:08 GMT] closing socket 5[/code]
It would seem that the server got the chance to send a "Workunit found" after the first "WorkUnit:" line from the client (so either the client's "nothing received" message on that one was in error, or my log file correlation skills aren't all they're cracked up to be and I'm looking at the wrong segment :wink:). At any rate, the remaining four "WorkUnit" lines were apparently sent so fast that either a) the server didn't have the chance to respond with a "Workunit found"; or b) the server didn't know what to make of a client sending another WorkUnit line before it sent the last Test Result line.

The weird thing is, it would seem that the logs on the client and on the server both tell a completely different story; the server log indicates that all the "WorkUnit" messages were received in very short succession, but the client says that it waited 10 seconds for a response between each one. Possibly the client is goofing and thinks it didn't send the WorkUnit messages when it really did? :huh:

Then again, maybe I'm looking at the wrong times in the respective logfiles; I'm going to keep digging and see if I can find either a) corroboration for the idea that the client logs and server logs tell a completely different story; or b) another instance of this that seems to make sense in the respective logs.
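One way to picture that kind of mix-up: suppose the client pins whatever reply finally arrives on the workunit it has just sent, while the server answers strictly in the order it received the lines. A single long stall then shifts every later pairing by the length of the backlog. This toy Python model is purely illustrative, not PRPnet code:

```python
def attribute_replies(sent, replies_start_at):
    """sent: workunit names in send order.  replies_start_at: index of the
    workunit being sent when the server's (queued) replies begin arriving.
    Returns {workunit the client credits: workunit the server meant}."""
    server_queue = list(sent)        # server answers in receive order
    attribution = {}
    for i, wu in enumerate(sent):
        if i >= replies_start_at and server_queue:
            meant = server_queue.pop(0)   # reply really answers oldest WU
            attribution[wu] = meant       # client pins it on the newest WU
    return attribution

# Replies only start flowing while the fourth line is being sent, so the
# answer meant for the first workunit gets credited to the fourth:
print(attribute_replies(["WU1", "WU2", "WU3", "WU4", "WU5"], 3))
# {'WU4': 'WU1', 'WU5': 'WU2'}
```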

mdettweiler 2009-08-07 02:59

Okay, here's another one. The client had just finished 5 G3000 workunits, and was going to return them:
[code][2009-08-06 05:54:51 GMT] socket 3 >>>> FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _162 sm5ymt
[2009-08-06 05:54:51 GMT] crus: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-06 05:54:51 GMT] socket 3 >>>> RETURNWORK 2.2.3
[2009-08-06 05:54:51 GMT] socket 3 >>>> WorkUnit: 36772*6^141684+1 1249536452
[2009-08-06 05:55:02 GMT] socket 3 (nothing received)
[2009-08-06 05:55:02 GMT] socket 3 >>>> WorkUnit: 123285*6^141684+1 1249536452
[2009-08-06 05:55:13 GMT] socket 3 (nothing received)
[2009-08-06 05:55:13 GMT] socket 3 >>>> WorkUnit: 172257*6^141686+1 1249536452
[2009-08-06 05:55:24 GMT] socket 3 (nothing received)
[2009-08-06 05:55:24 GMT] socket 3 >>>> WorkUnit: 98860*6^141689+1 1249536452
[2009-08-06 05:55:27 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 05:55:27 GMT] socket 3 >>>> Test Result: pfgw 98796D25133B4E62
[2009-08-06 05:55:27 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 05:55:28 GMT] socket 3 <<<< INFO: Test for candidate 36772*6^141684+1 accepted
[2009-08-06 05:55:28 GMT] crus: INFO: Test for candidate 36772*6^141684+1 accepted
[2009-08-06 05:55:28 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 05:55:28 GMT] socket 3 >>>> WorkUnit: 113966*6^141691+1 1249536452
[2009-08-06 05:55:28 GMT] socket 3 <<<< INFO: Workunit found
[2009-08-06 05:55:28 GMT] socket 3 >>>> Test Result: pfgw C8AB5F044A322BB5
[2009-08-06 05:55:28 GMT] socket 3 >>>> End of WorkUnit
[2009-08-06 05:55:29 GMT] socket 3 <<<< INFO: Test for candidate 113966*6^141691+1 accepted
[2009-08-06 05:55:29 GMT] crus: INFO: Test for candidate 113966*6^141691+1 accepted
[2009-08-06 05:55:29 GMT] socket 3 <<<< End of Workunit Message
[2009-08-06 05:55:29 GMT] socket 3 >>>> End of Message
[2009-08-06 05:55:29 GMT] socket 3 <<<< INFO: All 2 test results were accepted
[2009-08-06 05:55:29 GMT] crus: INFO: All 2 test results were accepted
[2009-08-06 05:55:29 GMT] socket 3 >>>> QUIT
[2009-08-06 05:55:29 GMT] closing socket 3
[2009-08-06 05:55:29 GMT] closing socket 3[/code]
Hmm...interesting. Since this batch contained both results that were accepted and some that weren't, it would seem to point to the idea that there is in reality no bug on the client, but instead maybe just a bug on the server. It looks like the client tried to send the first three, but didn't get a response; within 3 seconds of sending the fourth one's WorkUnit line, it got a "Workunit found" response and sent a "Test Result" message. The server accepted that normally, then proceeded to accept the last result normally (with no abnormally long delays).

Now let's see how this looked on the server side of things. It seems that this time around, the times aren't quite as far off as they were before (now they're about half a minute off, which is within a normal range). I think last time I messed up on correlating the log file. Anyway, here we go:
[code][2009-08-06 05:55:30 GMT] Message coming on socket 5
[2009-08-06 05:55:30 GMT] socket 5 <<<< FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _162 sm5ymt
[2009-08-06 05:55:30 GMT] [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] connecting from 91.149.39.143
[2009-08-06 05:55:30 GMT] socket 5 <<<< RETURNWORK 2.2.3
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 36772*6^141684+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 123285*6^141684+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 172257*6^141686+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 <<<< WorkUnit: 98860*6^141689+1 1249536452
[2009-08-06 05:55:30 GMT] socket 5 <<<< Test Result: pfgw 98796D25133B4E62
[2009-08-06 05:55:30 GMT] socket 5 <<<< End of WorkUnit
[2009-08-06 05:55:30 GMT] socket 5 >>>> INFO: Test for candidate 36772*6^141684+1 accepted
[2009-08-06 05:55:30 GMT] 36772*6^141684+1: Test received by [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] at 91.149.39.143 Residue Residue: 98796D25133B4E62
[2009-08-06 05:55:30 GMT] socket 5 >>>> End of Workunit Message
[2009-08-06 05:55:31 GMT] socket 5 <<<< WorkUnit: 113966*6^141691+1 1249536452
[2009-08-06 05:55:31 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-06 05:55:31 GMT] socket 5 <<<< Test Result: pfgw C8AB5F044A322BB5
[2009-08-06 05:55:31 GMT] socket 5 <<<< End of WorkUnit
[2009-08-06 05:55:31 GMT] socket 5 >>>> INFO: Test for candidate 113966*6^141691+1 accepted
[2009-08-06 05:55:31 GMT] 113966*6^141691+1: Test received by [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] at 91.149.39.143 Residue Residue: C8AB5F044A322BB5
[2009-08-06 05:55:31 GMT] socket 5 >>>> End of Workunit Message
[2009-08-06 05:55:32 GMT] socket 5 <<<< End of Message
[2009-08-06 05:55:32 GMT] socket 5 >>>> INFO: All 2 test results were accepted
[2009-08-06 05:55:32 GMT] socket 5 >>>> End of Message
[2009-08-06 05:55:32 GMT] socket 5 <<<< QUIT
[2009-08-06 05:55:32 GMT] closing socket 5[/code]Ouch! It's almost like the server froze in time all the while the client was sending the first three "WorkUnit" lines, then suddenly came back to life and gave the "Workunit found" line for the [B]first[/B] test the client asked about--right after the client sent the line for the [B]fourth[/B] workunit! So, the client thinks "okay, the server found the third workunit" and innocently transmits the data for the fourth WU. The server then goes and gobbles up the third WU's residual and logs it for the first WU!

At that point, the client sent the "WorkUnit" line for the fifth workunit, and the server accepted it normally and registered it for the correct test (apparently ignoring the second, third, and fourth "WorkUnit" lines now, since it had never gotten around to sending the "Workunit found" messages for those).

This would seem to lend credence to your idea that possibly something is stealing CPU cycles from the server. Right now the server has dried (yet again), so what I'm going to do is re-load the 140K-150K range into it, then change the server to a higher priority so that other applications can't steal CPU cycles from it. I'll see if that works.

Edit: Okay, the server is now loaded up again and running at nice value -10.
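For what it's worth, a process can also renice itself; here's a hedged Python sketch (Unix only, with a made-up helper name, and note that a negative increment normally requires root or CAP_SYS_NICE):

```python
import os

def try_renice(increment=-10):
    """Try to change this process's niceness (lower = higher priority).
    Fall back to just reporting the current value if the OS refuses."""
    try:
        return os.nice(increment)
    except PermissionError:
        return os.nice(0)  # os.nice(0) returns the unchanged niceness

print(try_renice())  # new (or current) nice value as an integer
```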

gd_barnes 2009-08-07 04:27

Strange. I didn't know the server needed anything significant in the way of CPU cycles. Max, keep in mind that I have all 4 cores on that machine running LLRnet clients against David's port IB2000. I don't really want the server to take CPU cycles from my LLRing so maybe it's better that I only run 3 cores on LLRnet clients.

When the next round of testing comes back complete, let me know if you think that cutting back to 3 cores running LLRnet would help. BUT...that said, perhaps I can still get more total throughput running all 4 cores on LLRnet with the server set higher in priority. If that works OK, then we should probably bump all PRPnet servers on that machine up to a nice value of -10.

One more thing to remember: There are something like 6-7 servers on that machine, including both public and personal LLRnet and PRPnet servers. Rather than the issue being that LLRnet is running 4 clients, it could be that we just have too many servers running on one machine.


Gary

mdettweiler 2009-08-07 05:17

[quote=gd_barnes;184405]Strange. I didn't know the server needed anything significant in the way of CPU cycles. Max, keep in mind that I have all 4 cores on that machine running LLRnet clients against David's port IB2000. I don't really want the server to take CPU cycles from my LLRing so maybe it's better that I only run 3 cores on LLRnet clients.

When the next round of testing comes back complete, let me know if you think that cutting back to 3 cores running LLRnet would help. BUT...that said, perhaps I can still get more total throughput running all 4 cores on LLRnet with the server set higher in priority. If that works OK, then we should probably bump all PRPnet servers on that machine up to a nice value of -10.

One more thing to remember: There are something like 6-7 servers on that machine, including both public and personal LLRnet and PRPnet servers. Rather than the issue being that LLRnet is running 4 clients, it could be that we just have too many servers running on one machine.


Gary[/quote]
Nah, the server doesn't need much. In fact, theoretically it should be getting plenty just with a nice value of 0 (the level for "normal", non-background apps). The only thing is to make sure that what little it needs (since it obviously needs [I]something[/I] in order to even move forward) isn't getting crowded out accidentally by the lowest-priority LLRnet clients.

As an example: one day I was stopping a PRPnet client to switch that core over to something else. To avoid any idle "gaps" in processing when switching a core over to a new type of work, I usually start the new program before stopping the old one--leaving the two to overlap for a brief time. This particular time, I had aliqueit.exe running Aliquot sequence work on the other core at a nice value of 5 ("Below Normal" on Windows), while PRPnet was set to 19 ("Low"). I think I was starting Prime95 as the new application; Prime95 is one of those applications that tends to "bully" others of equal priority for CPU time if it has to share a core with them. (Not sure why it does this, though I've noticed that gwnum-based apps such as PFGW and LLR do the same thing to other apps.)

Anyway, this meant that even though PRPnet and Prime95's worker thread were both at nice 19, Prime95 would hog all the CPU resources. Actually, PRPnet was running LLR at the moment, so since Prime95 and LLR are equally "hoggy" they split the core evenly--until I stopped LLR. Prime95 was then free to take the entire core, *except* that PRPnet still needed to do a few last-minute tasks before shutdown, such as reporting results to the server. Of course, PRPnet isn't nearly as "macho" as Prime95 when it comes to battling for CPU time, so it practically froze in a limbo between "Detected LLR shutdown, exiting" and "reporting completed results to server". It would have crawled along like that for quite a while, gradually picking up enough spare cycles to eke through the wee bit of CPU work needed to simply return its results to the server, had I not gone into Task Manager and manually set PRPnet to "High" priority to let it finish up and exit. :smile:

A similar thing could very well be happening with the PRPnet servers. However, the odd thing is, they're at nice 0, which is way ahead of all your LLRnet clients at nice 19. Theoretically, the servers should be getting all the CPU time they need, just like a regular desktop application (which usually runs at nice 0). However, it would look like that's not happening here, for reasons we don't quite understand yet. I'm pretty sure that it's not the other servers that are doing it; there's no way they could hog the CPU for long enough to keep G3000 "frozen" for 20-30 seconds as seems to have occurred here.

Mark, is the PRPnet server multithreaded at all? If so, do any of the threads by chance run at nice 19? If so, then that might be the problem here.

rogue 2009-08-07 12:32

[QUOTE=mdettweiler;184410]Mark, is the PRPnet server multithreaded at all? If so, do any of the threads by chance run at nice 19? If so, then that might be the problem here.[/QUOTE]

No, it is not multi-threaded.

By default the server is to run in normal (default) priority. This can be overridden by modifying the idle= setting in the prpserver.ini. Did you modify that?

I will modify the client so that it does not send any more workunits when it fails to get a reply from the server within 10 seconds. I will also bump that timeout up to 30 seconds.
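That change amounts to a stop-and-wait loop: don't send the next WorkUnit line until the previous one has been answered. A schematic Python model with injected send/recv callables (my sketch of the fix as described, not the actual 2.2.4 client code):

```python
def return_work(workunits, send, recv):
    """Return workunits one at a time; stop pipelining as soon as the
    server fails to reply, leaving the rest queued for a later retry."""
    accepted, pending = [], list(workunits)
    while pending:
        send(f"WorkUnit: {pending[0]}")
        reply = recv()                 # None models a receive timeout
        if reply is None:
            break                      # server silent: don't send more
        accepted.append(pending.pop(0))
    return accepted, pending

# Server answers the first workunit, then goes quiet: only one result is
# handed over, and the other four are kept instead of being lost.
replies = iter(["INFO: Workunit found", None])
print(return_work(["WU1", "WU2", "WU3", "WU4", "WU5"],
                  send=lambda line: None,
                  recv=lambda: next(replies)))
# (['WU1'], ['WU2', 'WU3', 'WU4', 'WU5'])
```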

I have released 2.2.4. Read [URL="http://www.mersenneforum.org/showthread.php?p=184439#post184439"]here[/URL].

mdettweiler 2009-08-07 14:50

[quote=rogue;184436]No, it is not multi-threaded.

By default the server is to run in normal (default) priority. This can be overridden by modifying the idle= setting in the prpserver.ini. Did you modify that?

I will modify the client to not send any more workunits if it doesn't get a reply from the server within 10 seconds. I will then bump that up to 30 seconds.

I have released 2.2.4. Read [URL="http://www.mersenneforum.org/showthread.php?p=184439#post184439"]here[/URL].[/quote]
No, I didn't change the idle= option; it's at 0 for all servers.

BTW, I see in your 2.2.4 release notes that you made some tweaks to the web page code so that the connection is closed at the end of such a session; possibly that could have caused the server to have problems before? (Or is it capable of dealing with multiple connections simultaneously?)

If you've got all the data you need on this, I'll go ahead and upgrade all the servers to 2.2.4. I'll also post new client packages at NPLB.

Edit: Oh, and Gary, since it seems that the barfing could have caused quite a few residuals to get switched around, I'm going to re-do the entire 100K-150K range in the server. At the rate we're going, I doubt it will take very long to do it; and that way we can be sure that nothing is messed up and all the results are good.

MyDogBuster 2009-08-07 16:32

Hi guys,

A weird one. Looks like a barf but my server hung. I ctrl-c'd out of it when I noticed it 6 minutes later. I've highlighted the test in question.
It was the first test in the sequence BUT wasn't listed when the client started uploading. The first one shown was the second test in the sequence.

Now the strange part. When I checked everything out, my candidates file was missing about 35K tests. It went from 165K tests to 130K tests.

I had to rebuild it.

From what I can tell, two clients were attempting to upload results at the same time. All the tests from the client with the goofy test errored out.

Methinks the server has timing problems accepting results from more than 1 client. The client does seem impatient for an answer.

I've seen this happen on and off for the last few days but when I checked the client logs, it stated that the results were all accepted (I have all clients cached at 20 units)

This setup has run for 2 days with no known problems (except what looks like a barf, but with all tests being accepted). This is the first server hangup. All other programs on the machine were running just fine while the server was hung.

FROM THE SERVER: (2.2.3)

[2009-08-07 14:23:18 GMT] [email]IMGunn1654@gmail.com[/email] (Sophie#4) at 192.168.2.100: Sent 4894*24^40589+1
[2009-08-07 14:23:18 GMT] [email]IMGunn1654@gmail.com[/email] (Sophie#4) at 192.168.2.100: Sent 656*24^40590+1
[2009-08-07 14:23:18 GMT] [email]IMGunn1654@gmail.com[/email] (Sophie#4) at 192.168.2.100: Sent 18724*24^40589+1
[2009-08-07 14:23:18 GMT] [email]IMGunn1654@gmail.com[/email] (Sophie#4) at 192.168.2.100: Sent 5324*24^40591+1
[COLOR=SandyBrown][2009-08-07 14:28:54 GMT] 17819*24^40551+1: Test received by [email]imgunn1654@gmail.com[/email] at 192.168.2.8 Residue Residue:
[2009-08-07 14:35:00 GMT] Accepted force quit. Waiting to close sockets before exiting[/COLOR]
[2009-08-07 14:35:00 GMT] 656*24^40326+1: Test for user [email]IMGunn1654@gmail.com[/email] has expired
[2009-08-07 14:35:00 GMT] 3031*24^40322+1: Test for user [email]IMGunn1654@gmail.com[/email] has expired
[2009-08-07 14:35:00 GMT] 3051*24^40324+1: Test for user [email]IMGunn1654@gmail.com[/email] has expired


FROM THE CLIENT

[COLOR=DarkOrange][2009-08-07 14:05:50 GMT] Base24: 17819*24^40551+1 is not prime. Residue 20A05F9B77C0E566[/COLOR]
[2009-08-07 14:06:45 GMT] Base24: 1099*24^40553+1 is not prime. Residue D342FD5C641F6722
[2009-08-07 14:08:12 GMT] Base24: 9726*24^40552+1 is not prime. Residue 9DC7E0FEF8810C84
[2009-08-07 14:09:08 GMT] Base24: 6181*24^40552+1 is not prime. Residue 9FD8141B967D6A3B
[2009-08-07 14:10:04 GMT] Base24: 5129*24^40553+1 is not prime. Residue 467417A0E33484E0
[2009-08-07 14:10:59 GMT] Base24: 7394*24^40553+1 is not prime. Residue 6E6E638320684226
[2009-08-07 14:11:55 GMT] Base24: 7481*24^40554+1 is not prime. Residue 75E9D6864F9D8ADD
[2009-08-07 14:12:49 GMT] Base24: 3526*24^40554+1 is not prime. Residue B6FE09C3477726A8
[2009-08-07 14:13:45 GMT] Base24: 4606*24^40554+1 is not prime. Residue 346919CC613EDC68
[2009-08-07 14:14:40 GMT] Base24: 5129*24^40555+1 is not prime. Residue 1BBF04CC5EBADD4E
[2009-08-07 14:16:08 GMT] Base24: 12799*24^40555+1 is not prime. Residue 3CC90EEFFF538DEE
[2009-08-07 14:17:36 GMT] Base24: 9279*24^40555+1 is not prime. Residue 5FBE0B1556DB1F86
[2009-08-07 14:19:03 GMT] Base24: 21439*24^40555+1 is not prime. Residue 43036915D93D1B27
[2009-08-07 14:20:31 GMT] Base24: 12969*24^40555+1 is not prime. Residue F5E5E5C0C95DFA2D
[2009-08-07 14:21:26 GMT] Base24: 7481*24^40556+1 is not prime. Residue E1C265FFF9A82304
[2009-08-07 14:22:22 GMT] Base24: 7746*24^40556+1 is not prime. Residue 7F3BAE3DBC691F02
[2009-08-07 14:23:50 GMT] Base24: 20731*24^40556+1 is not prime. Residue 2C2569D7BB19C6C2
[2009-08-07 14:25:18 GMT] Base24: 21776*24^40556+1 is not prime. Residue 506D029179A2E896
[2009-08-07 14:26:45 GMT] Base24: 29601*24^40556+1 is not prime. Residue D577DBEEA0C3A2D8
[2009-08-07 14:28:13 GMT] Base24: 22356*24^40556+1 is not prime. Residue 18C59E905E97B7EA
[2009-08-07 14:28:13 GMT] Total Time: 44:43:19 Total Tests: 2660 Total PRPs Found: 3
[COLOR=DarkOrange][2009-08-07 14:28:13 GMT] Base24: Returning work to server 192.168.2.7 at port 7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 5129*24^40553+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7394*24^40553+1 1249653889>> to 192.168.2.7:7102[/COLOR]
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7481*24^40554+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 3526*24^40554+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 4606*24^40554+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 5129*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 12799*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 9279*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 21439*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 12969*24^40555+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7481*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 7746*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 20731*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 21776*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 29601*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<WorkUnit: 22356*24^40556+1 1249653889>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<End of Message>> to 192.168.2.7:7102
[2009-08-07 14:28:49 GMT] Error sending <<QUIT>> to 192.168.2.7:7102

mdettweiler 2009-08-07 16:40

Yes, that's a classic barf. I'd recommend upgrading the server and all clients to v2.2.4 as soon as possible; that should fix the barfing problem.

Regarding the server not being able to accept more than one connection at a time: yes, it looks sort of like that. It would almost seem that the server has to wait for one client to finish talking to it before another one can begin. This may have been the cause of some of the timeouts that seemed to happen during the barfs. However, since the 2.2.4 client changes the timeout to 30 seconds, and won't keep sending results if the server isn't ready for them, this should now be fixed.

MyDogBuster 2009-08-07 18:46

[QUOTE]However, since the 2.2.4 client changes the timeout to 30 seconds, and won't keep sending results if the server isn't ready for them, this should now be fixed.[/QUOTE]

If I were you guys, I'd set the timeout to 60 seconds. Ya never know what a machine might be doing, or get into, after starting to process something. The larger the cache, the longer an update will take. Chances are 60 seconds will never be reached, so what would it hurt?

Any ideas as to why I lost 1/5th of my candidates file during that barf?
Over 1MB of data right in the middle of the file.

MyDogBuster 2009-08-07 20:14

Another strange message, seen only after I upgraded to 2.2.4


[2009-08-07 20:12:12 GMT] Error sending <<<link rel="icon" type="image/ico" href="prpnet.ico">>> to localhost:7102

BTW, the barfs seem to have stopped. I still see 2 updates happening together, but no error messages.

MyDogBuster 2009-08-07 22:27

[quote]BTW, the barf's seems to have stopped. I still see 2 updates happening together, but no error messages.
[/quote]I'll take that back. The barfs are still happening, but less frequently. I think the problem is that timeout parameter. The server is running in command-line mode, and the program takes time to paint the command window with each message. I'm assuming it's doing so after updating the candidates file; when I watch it, it takes about a second to paint each line because of the update involved. I will bet you that if you up the timeout parameter to a minute or two, the problem will go away.

The time required is also dependent on the cache size. If someone sets the cache size to, say, 100, that parameter will have to be higher. 30 seconds wasn't bad: I forced 160 updates through my pipeline at once and only 15 got error messages. That was a lot higher before. Setting that parameter higher doesn't make the server run slower; it just allows more time for it to get its work done before problems occur. Maybe making it a changeable parameter would be good. That way someone who has big cache sizes can set it higher to allow for more processing time. On a public server, I would set it to at least 2 minutes. No one has any idea what updates are coming at any one time. It can't hurt, can it?
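A back-of-the-envelope version of that argument, with all numbers assumed for illustration only: if the server needs roughly a second to process each reported result, the client's reply timeout ought to scale with the cache size rather than being a flat 30 seconds.

```python
BASE_TIMEOUT = 30          # seconds of fixed slack (assumed; the 2.2.4 default)
SECONDS_PER_RESULT = 1.0   # assumed ~1 s for the server to log each result

def suggested_timeout(cache_size):
    """Reply timeout that grows with how many results a client reports."""
    return BASE_TIMEOUT + cache_size * SECONDS_PER_RESULT
```

Under those assumptions a 20-unit cache would need about 50 seconds and a 100-unit cache about 130 seconds, which lines up with suggesting one to two minutes for a public server where the incoming cache sizes are unknown.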

rogue 2009-08-07 23:02

[QUOTE=MyDogBuster;184511]I'll take that back. The barf's are still happening but on a less frequent basis. I think the problem is that timeout parameter. The server is running in command line mode. The program takes time to paint the command window with each message. I'm assuming it's doing so after updating the candidates file. When I watch it, it takes about a second to paint each line because of the update involved. I will bet you that if you up the timeout parameter to a minute or 2, the problem will go away. The time required is also dependent on the cache size set. If someone sets the cache size to say 100, that parameter will have to be higher. 30 seconds wasn't bad. I forced 160 updates thru my pipeline at once and only 15 got error messages. That was a lot higher before. Setting that parameter higher doesn't make the server run slower it just allows more time to get it's work done before problems occur. Maybe making it a changeable parameter would be good. That way someone who has big cache sizes can set it higher to allow for more processing time. On a public server, I would set it to at least 2 minutes. No one has any idea what updates are coming at any one time. It can't hurt, can it.[/QUOTE]

I'll think about it for the next release. To reduce file I/O, you can increase the setting of savefrequency= in the prpserver.ini file.

I should consider giving an administrator the ability to split the prpserver.candidates file so that it doesn't take as much memory and reduces the I/O. It could (theoretically) be striped across multiple files, only reading in a subsequent file when the server is running dry. Thoughts?
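The effect of a savefrequency= style setting can be sketched with a toy model (not PRPnet's implementation; the class and method names here are invented): completed results only dirty an in-memory table, and the candidates file is rewritten at most once per interval, so savefrequency=0 degenerates into a rewrite on every result, which is the heavy-I/O case.

```python
import time

class CandidateStore:
    """Toy model of periodic saving controlled by a savefrequency setting."""

    def __init__(self, save_frequency, clock=time.monotonic):
        self.save_frequency = save_frequency  # seconds between flushes
        self.clock = clock                    # injectable for testing
        self.last_save = clock()
        self.dirty = False
        self.writes = 0                       # stands in for file rewrites

    def mark_completed(self, candidate):
        self.dirty = True
        self.maybe_save()

    def maybe_save(self):
        # With save_frequency=0 this condition is always true when dirty,
        # so every completed result triggers a full rewrite.
        if self.dirty and self.clock() - self.last_save >= self.save_frequency:
            self.writes += 1
            self.dirty = False
            self.last_save = self.clock()
```

With save_frequency at five minutes, a burst of results costs one rewrite; at zero, the same burst costs one rewrite per result.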

mdettweiler 2009-08-07 23:19

[quote=MyDogBuster;184511]I'll take that back. The barf's are still happening but on a less frequent basis. I think the problem is that timeout parameter. The server is running in command line mode. The program takes time to paint the command window with each message. I'm assuming it's doing so after updating the candidates file. When I watch it, it takes about a second to paint each line because of the update involved. I will bet you that if you up the timeout parameter to a minute or 2, the problem will go away. The time required is also dependent on the cache size set. If someone sets the cache size to say 100, that parameter will have to be higher. 30 seconds wasn't bad. I forced 160 updates thru my pipeline at once and only 15 got error messages. That was a lot higher before. Setting that parameter higher doesn't make the server run slower it just allows more time to get it's work done before problems occur. Maybe making it a changeable parameter would be good. That way someone who has big cache sizes can set it higher to allow for more processing time. On a public server, I would set it to at least 2 minutes. No one has any idea what updates are coming at any one time. It can't hurt, can it.[/quote]
Hmm...just to verify, by "barfing" you mean that the server is still writing blank results to completed_tests.log? Or is it just giving error messages? Because as long as it's not writing blank results (or apparently scrambling other results at the same time) then the integrity of the results is still sound and it's not quite the same as the "barfing" we were dealing with on G3000 (and which you'd reported earlier on your server with version 2.2.3).

Also, you're saying that it has to update prpserver.candidates with EVERY new line it outputs? That doesn't seem right. What's your "savefrequency=" set to in prpserver.ini? At the default of 5 minutes, I don't get any problems with that kind of bottlenecking.

Mark, should I go ahead and upgrade the server to 2.2.4, or do you need more data with 2.2.3? Either way, considering that Ian's reporting further problems, I won't load all of 100K-150K but just 140K-150K as another test run.

MyDogBuster 2009-08-07 23:47

[QUOTE]Also, you're saying that it has to update prpserver.candidates with EVERY new line it outputs? That doesn't seem right. What's your "savefrequency=" set to in prpserver.ini? At the default of 5 minutes, I don't get any problems with that kind of bottlenecking.
[/QUOTE]

I have savefrequency set to 0 for a reason. It's a valid number for the parameter. When testing something we have to take into account ANYTHING that can happen. Someone is going to set it to 0. I'm it.

I'll reset it to 3 minutes and try my testing again. I was not getting any blank records this time.

Mark, thanks for considering changing the I/O to the file.

[QUOTE]I should consider allowing an administrator the ability to split the prpserver.candidates file so that it doesn't take as much memory and reducing the I/O. It could (theoretically) be striped across multiple files only reading in a subsequent file when the server is running dry. Thoughts?[/QUOTE]

The problem with splitting the candidates file is that the removal of k's would be messed up unless all the candidates for a k are in the same file.
I'm currently testing a candidates file with 196K tests. The removal works just fine. Great feature. If we split it into 3 files then I would have to do some manual deleting.

I guess doing more analysis before setting up a candidates file would have helped. But then again, stress testing should ferret out all problems. Maybe for future reference someone could set some reasonable limits on file sizes and stuff like that.

All in all, great program(s) Mark.

MyDogBuster 2009-08-08 00:11

Okay, setting the savefrequency to 5 did not get rid of the bottleneck problem. I still got the error messages but no blank records.

I checked the client log and it showed that all 20 tests were accepted. I waited the 5 minutes and then checked the candidates file for the errored tests and they were still marked as inprogress on that machine. That's an inconsistency. The client thinks all's okay with the upload, but the server didn't accept the bottlenecked ones.

rogue 2009-08-08 01:06

Max, upgrade to 2.2.4. I don't think anything else is necessary at this point.

MyDogBuster, I don't understand your last statement. Are you saying that all the workunits failed to be reported to the server and the server still had them marked as inprogress? If so, that is not a problem, as the server would still be waiting for valid test results. It is only a problem if the client dropped the workunits. If that is the case, I would need to see debug logs from both. It is possible that a bug is still lurking in the client.

savefrequency=0 would cause a huge amount of I/O. I think that setting it to an hour or more is reasonable as long as the server is stable.

I also suggest setting maxworkunits to a value that allows clients to build up an hour or more of work before reporting. That would reduce the number of times clients need to communicate with the server to get and report work. It should also reduce bottlenecks. I max out on PrimeGrid projects and get 5+ hours of work each time.

At this time the server is not multi-threaded. I'm not certain what happens if multiple clients try to connect at the same time. I presume that if one is connected, then the others have to wait. Multi-threading the server would not be easy, which is why I've been avoiding it.

MyDogBuster 2009-08-08 01:16

[quote]MyDogBuster, I don't understand your last statement. Are you saying that the all workunits failed to be reported to the server and the server still had them marked as inprogress? If so, that is not a problem as the server would still be waiting for valid test results. It is only a problem if the client dropped the workunits. If that is the case I would need to see debug logs from both. It is possible that a bug is still lurking in the client.[/quote]The client finished all 20 tests. The server did not accept all of them because of the bottleneck.

I see this message on the server:
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 16641*24^39588+1 1249618146 16641 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] [EMAIL="IMGunn1654@gmail.com"]IMGunn1654@gmail.com[/EMAIL] (Sophie#3) at 192.168.2.100: Sent 16641*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 28701*24^39588+1 1249618146 28701 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] [EMAIL="IMGunn1654@gmail.com"]IMGunn1654@gmail.com[/EMAIL] (Sophie#3) at 192.168.2.100: Sent 28701*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102

But I see this message on the client log.

[2009-08-07 04:09:08 GMT] Base24: INFO: All 20 test results were accepted

[quote]I also suggest setting maxworkunits to a value that allows clients to build up an hour or more of work before reporting. That would reduce the number of times clients need to communicate with the server to get and report work. It should also reduce bottlenecks. I max out on PrimeGrid projects and get 5+ hours of work each time.[/quote]I can do that. I was just seeing what kind of limits there are to some of the settings.

mdettweiler 2009-08-08 01:47

[quote=MyDogBuster;184531]The client finished all 20 tests. The server did not accept all of them because of the bottleneck.

I see this message on the server:
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 16641*24^39588+1 1249618146 16641 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] [EMAIL="IMGunn1654@gmail.com"]IMGunn1654@gmail.com[/EMAIL] (Sophie#3) at 192.168.2.100: Sent 16641*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 28701*24^39588+1 1249618146 28701 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] [EMAIL="IMGunn1654@gmail.com"]IMGunn1654@gmail.com[/EMAIL] (Sophie#3) at 192.168.2.100: Sent 28701*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102

But I see this message on the client log.

[2009-08-07 04:09:08 GMT] Base24: INFO: All 20 test results were accepted[/quote]
Hmm...the server log excerpt you posted shows the server sending *new* workunits to the client, while the client log shows successful *sending* of results. They're talking about two different events.

MyDogBuster 2009-08-08 01:55

[QUOTE] Error sending <<ServerType: 1>> to localhost:7102[/QUOTE]

Nope same event. "to localhost:7102" is the server not the client.

rogue 2009-08-08 02:33

I would like to see both logs. Could you e-mail to me?

mdettweiler 2009-08-08 02:57

[quote=MyDogBuster;184538]Nope same event. "to localhost:7102" is the server not the client.[/quote]
Yes, but the server log says that it sent the client new workunits to test, whereas the client says it just successfully sent 20 results to the server. Possibly the clocks are a little out of sync between the server and client machines?

MyDogBuster 2009-08-08 03:25

Sorry guys, Max was right. I misinterpreted the error message. It's an error of sending work to the clients. I was just matching times and not actual k/n pairs. DUH

So my bottleneck is timing out on sends to the clients. Let me watch it for a while.

I see how this works now. My bad. The server program is slicker than I thought.

mdettweiler 2009-08-08 04:41

Okay, I've upgraded all servers to 2.2.4 and re-loaded 140K-150K into G3000 for another test run. Hopefully all will work well this time. :smile:

Edit: Mark, I see that the server pages still don't display HTML <title>'s, even though this option is properly set in prpserver.ini. I can confirm that the "Max N" column header has been fixed, though.

gd_barnes 2009-08-08 07:42

Outstanding testing, Ian. Keep ferreting out every little issue you can find. That is what I've been looking for.

Good job on getting some fixes in quickly Mark and Max. This will be an amazing setup when it is all done and runs perfectly! :smile:

rogue 2009-08-08 11:10

[QUOTE=MyDogBuster;184549]Sorry guys, Max was right. I misinterpreted the error message. It's an error of sending work to the clients. I was just matching times and not actual k/n pairs. DUH

So my bottleneck is timing out on sends to the clients. Let me watch it for a while.

I see how this works now. My bad. The server program is slicker than I thought.[/QUOTE]

The message might just be misworded. I still need both logs (with debuglevel=1) to know what is going on.

Max, which page isn't showing the title?

MyDogBuster 2009-08-08 13:16

[quote]The message might just be misworded. I still need both logs (with debuglevel=1) to know what is going on.[/quote]I've deleted all the logs and restarted everything with debug on. As soon as I get the bottleneck error again, I'll send you everything. I can always force it to happen. Need some sleep.

server_stats.html - I get a blank page now under server v2.2.4

server_status.html - I get no HTML heading. Same with user_status.html. Also on user_status.html I get 0 for PRP even though I have a PRP file with 11 found primes

Edited: Files emailed

mdettweiler 2009-08-08 15:52

[quote=rogue;184577]Max, which page isn't showing the title?[/quote]
None of the pages are showing titles.

In other news, I'm seeing some rather weird behavior right now on the G3000 server. It's almost like the server "froze" at 6:00:04 GMT today (about 9 hours, 45 minutes ago), and in the middle of a communication with one of Lennart's boxes to boot. I had to kill the server manually (with -SIGKILL, since a regular -SIGTERM didn't do the trick) and restart it in order to fix it. Now it seems to be working OK.

Also of note, there was one lone test with a missing residue that showed up right before the server froze:
[code][2009-08-08 05:58:31 GMT] Message coming on socket 5
[2009-08-08 05:58:31 GMT] socket 5 <<<< FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _79 sm5ymt
[2009-08-08 05:58:31 GMT] [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] connecting from 93.179.39.71
[2009-08-08 05:58:31 GMT] socket 5 <<<< RETURNWORK 2.2.4
[2009-08-08 05:58:31 GMT] socket 5 <<<< WorkUnit: 168610*6^141180+1 1249710466
[2009-08-08 05:58:31 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-08 05:59:02 GMT] socket 5 (nothing received)
[2009-08-08 05:59:02 GMT] socket 5 >>>> ERROR: Test for 168610*6^141180+1 rejected. No residue reported
[2009-08-08 05:59:02 GMT] Error sending <<ERROR: Test for 168610*6^141180+1 rejected. No residue reported>> to localhost:3000
[2009-08-08 05:59:02 GMT] socket 5 >>>> !!! send error !!!
[2009-08-08 05:59:02 GMT] [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] (_79) at 93.179.39.71: Rejected test on 168610*6^141180+1 due to no residue
[2009-08-08 05:59:02 GMT] socket 5 >>>> INFO: Test for candidate 168610*6^141180+1 accepted
[2009-08-08 05:59:02 GMT] Error sending <<INFO: Test for candidate 168610*6^141180+1 accepted>> to localhost:3000
[2009-08-08 05:59:02 GMT] socket 5 >>>> !!! send error !!!
[2009-08-08 05:59:02 GMT] 168610*6^141180+1: Test received by [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] at 93.179.39.71 Residue Residue:
[2009-08-08 05:59:02 GMT] socket 5 >>>> End of Workunit Message
[2009-08-08 05:59:02 GMT] Error sending <<End of Workunit Message>> to localhost:3000
[2009-08-08 05:59:02 GMT] socket 5 >>>> !!! send error !!!
[2009-08-08 05:59:33 GMT] socket 5 (nothing received)
[2009-08-08 05:59:33 GMT] socket 5 >>>> INFO: All 1 test results were accepted
[2009-08-08 05:59:33 GMT] Error sending <<INFO: All 1 test results were accepted>> to localhost:3000
[2009-08-08 05:59:33 GMT] socket 5 >>>> !!! send error !!!
[2009-08-08 05:59:33 GMT] socket 5 >>>> End of Message
[2009-08-08 05:59:33 GMT] Error sending <<End of Message>> to localhost:3000
[2009-08-08 05:59:33 GMT] socket 5 >>>> !!! send error !!!
[2009-08-08 06:00:04 GMT] socket 5 (nothing received)
[2009-08-08 06:00:04 GMT] closing socket 5[/code]
Of note, the server did notice that no residue was reported, but marked down the test anyway.
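The inconsistency in the log above, where the server logs a rejection for the missing residue yet still records the test, suggests a fix along these lines. This is a sketch of the needed check with hypothetical names (`accept_result`), not PRPnet's code:

```python
def accept_result(candidate, residue, completed_log):
    """Record a test only if it carries a residue; otherwise reject it.

    Returns True if the result was accepted. A rejected result leaves the
    candidate in progress so a valid test can still come in for it.
    """
    if not residue or not residue.strip():
        return False                       # reject: no residue reported
    completed_log.append((candidate, residue))
    return True
```

Making acceptance conditional on a non-empty residue would prevent the blank "Residue Residue:" entries from reaching completed_tests.log in the first place.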

After that the server went back to business as usual for a while, then ran into this:
[code][2009-08-08 06:00:04 GMT] Message coming on socket 5
[2009-08-08 06:00:04 GMT] socket 5 <<<< FROM [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] _29 sm5ymt
[2009-08-08 06:00:04 GMT] [EMAIL="sm5ymt@pekhult.se"]sm5ymt@pekhult.se[/EMAIL] connecting from 93.179.39.71
[2009-08-08 06:00:04 GMT] socket 5 <<<< RETURNWORK 2.2.4
[2009-08-08 06:00:04 GMT] socket 5 <<<< WorkUnit: 108527*6^141131+1 1249710211
[2009-08-08 06:00:04 GMT] socket 5 >>>> INFO: Workunit found
[2009-08-08 15:42:57 GMT] Accepted force quit. Waiting to close sockets before exiting
[2009-08-08 15:43:44 GMT] Accepted force quit. Waiting to close sockets before exiting[/code]
Without any particular explanation why, it simply froze right after sending the "INFO: Workunit found" message. The next time it did anything was when I hit Ctrl-C and it gave the first "Accepted force quit" message. The second such message was from when I -SIGTERM'd the process by PID to make sure I was killing the right one with -SIGKILL immediately after (so I didn't abruptly kill all the other servers at the same time).

Since I've restarted the server just now, all seems to be working well.

Lennart, do you possibly have a debug.log file from client _29 around when this happened?

Lennart 2009-08-08 16:11

[code][2009-08-08 05:43:26 GMT] rps: Getting work from server nplb-gb1.no-ip.org at port 3000
[2009-08-08 05:43:26 GMT] socket 3 >>>> GETWORK 2.2.4 3
[2009-08-08 05:43:26 GMT] socket 3 >>>> llr
[2009-08-08 05:43:26 GMT] socket 3 >>>> phrot
[2009-08-08 05:43:26 GMT] socket 3 >>>> pfgw
[2009-08-08 05:43:26 GMT] socket 3 >>>> End of Message
[2009-08-08 05:43:31 GMT] socket 3 <<<< ServerVersion: 2.2.4
[2009-08-08 05:43:32 GMT] socket 3 <<<< ServerType: 1
[2009-08-08 05:43:32 GMT] socket 3 <<<< WorkUnit: 108527*6^141131+1 1249710211 108527 6 141131 1
[2009-08-08 05:43:32 GMT] socket 3 <<<< ServerType: 1
[2009-08-08 05:43:32 GMT] socket 3 <<<< WorkUnit: 87800*6^141133+1 1249710211 87800 6 141133 1
[2009-08-08 05:43:32 GMT] socket 3 <<<< ServerType: 1
[2009-08-08 05:43:32 GMT] socket 3 <<<< WorkUnit: 124125*6^141134+1 1249710211 124125 6 141134 1
[2009-08-08 05:43:32 GMT] socket 3 <<<< End of Message
[2009-08-08 05:43:32 GMT] socket 3 >>>> GETGREETING
[2009-08-08 05:43:32 GMT] socket 3 <<<< ############
[2009-08-08 05:43:32 GMT] socket 3 <<<< Welcome to the CRUS G3000 PRPnet beta test server! :-D
[2009-08-08 05:43:32 GMT] socket 3 <<<< Server is running PRPnet v2.2.3
[2009-08-08 05:43:32 GMT] socket 3 <<<< ############
[2009-08-08 05:43:32 GMT] socket 3 <<<< OK.
[2009-08-08 05:43:32 GMT] socket 3 >>>> QUIT
[2009-08-08 05:48:37 GMT] rps: 108527*6^141131+1 is not prime. Residue 490957BF01DB7E2F
[2009-08-08 05:53:41 GMT] rps: 87800*6^141133+1 is not prime. Residue B9EF669CBC119BEC
[2009-08-08 05:58:45 GMT] rps: 124125*6^141134+1 is not prime. Residue 9F52401A6B78FAE8
[2009-08-08 05:58:45 GMT] Total Time: 16:39:12 Total Tests: 153 Total PRPs Found: 1
[2009-08-08 05:58:48 GMT] socket 3 >>>> FROM sm5ymt@pekhult.se _29 sm5ymt
[2009-08-08 05:58:48 GMT] rps: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-08 05:58:48 GMT] socket 3 >>>> RETURNWORK 2.2.4
[2009-08-08 05:58:48 GMT] socket 3 >>>> WorkUnit: 108527*6^141131+1 1249710211
[2009-08-08 05:59:19 GMT] socket 3 (nothing received)
[2009-08-08 05:59:19 GMT] `À|ë¨: Count not verify existence of workunit on the server.[/code]
After this there is nothing more in the log. All my clients working on CRUS hung!!

Lennart

