mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Conjectures 'R Us (https://www.mersenneforum.org/forumdisplay.php?f=81)
-   -   PRPnet server bugs and barfing problem (https://www.mersenneforum.org/showthread.php?t=12256)

gd_barnes 2009-08-02 09:22

Lennart found 2 new primes!:

123285*6^147200+1 is prime
145076*6^149946+1 is prime

Unfortunately they are not quite big enough for top 5000 right now but they are nice primes for the conjectures nonetheless.
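
The top 5000 list ranks primes by their number of decimal digits, so the size of a candidate of the form k*b^n+1 is easy to check. A minimal sketch; `decimal_digits` is an illustrative helper and not part of any tool mentioned in this thread. It uses a logarithm for a first estimate and then corrects the estimate with exact integer comparisons, so floating-point rounding near a power of 10 cannot give a wrong answer.

```python
import math


def decimal_digits(k: int, b: int, n: int, c: int = 1) -> int:
    """Return the number of decimal digits of k*b^n + c (assumed positive)."""
    value = k * b**n + c
    # First estimate from logarithms; may be off by one near a power of 10.
    est = int(math.log10(k) + n * math.log10(b)) + 1
    # Correct the estimate using exact integer arithmetic.
    while 10**est <= value:
        est += 1
    while est > 1 and 10 ** (est - 1) > value:
        est -= 1
    return est


# The two primes reported above.
for k, n in [(123285, 147200), (145076, 149946)]:
    print(f"{k}*6^{n}+1 has {decimal_digits(k, 6, n)} digits")
```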

28 k's now remain for Sierp base 6 at n=150K.

gd_barnes 2009-08-02 09:37

More problems:

[quote]
[2009-08-02 03:03:47 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:06:56 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:07:52 GMT] Total Time: 6:27:55 Total Tests: 45 Total PRPs Found: 0
[2009-08-02 03:11:01 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:11:12 GMT] Total Time: 6:31:15 Total Tests: 45 Total PRPs Found: 0
[2009-08-02 03:11:15 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:11:41 GMT] Total Time: 6:31:44 Total Tests: 45 Total PRPs Found: 0
[2009-08-02 03:11:41 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:11:45 GMT] G3000: ERROR: Workunit 59506*6^114591+1 not found on server
[2009-08-02 03:11:45 GMT] G3000: The client will delete this workunit
[2009-08-02 03:11:46 GMT] G3000: INFO: 0 of 1 test results were accepted
[/quote][quote]
[2009-08-02 03:24:36 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:26:20 GMT] Total Time: 6:46:23 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:26:23 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:27:21 GMT] Total Time: 6:47:24 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:27:24 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:31:17 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed
[2009-08-02 03:33:01 GMT] Total Time: 6:53:04 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:33:22 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:33:54 GMT] G3000: ERROR: Workunit 145076*6^115554+1 not found on server
[2009-08-02 03:33:54 GMT] G3000: The client will delete this workunit
[2009-08-02 03:33:54 GMT] G3000: INFO: Workunit found
[2009-08-02 03:34:08 GMT] Total Time: 6:54:11 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:35:41 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:37:05 GMT] Total Time: 6:57:08 Total Tests: 48 Total PRPs Found: 0
[2009-08-02 03:37:05 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000
[2009-08-02 03:37:07 GMT] G3000: ERROR: Workunit 145076*6^115554+1 not found on server
[2009-08-02 03:37:07 GMT] G3000: The client will delete this workunit
[2009-08-02 03:37:08 GMT] G3000: ERROR: Workunit 68195*6^115556+1 not found on server
[2009-08-02 03:37:08 GMT] G3000: The client will delete this workunit
[2009-08-02 03:37:08 GMT] G3000: INFO: 0 of 2 test results were accepted
[/quote]
Some of my results are not being accepted. Then they are deleted! What's up with that? The problem only seems to be on port 3000. No rejections on port 1470.
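
One way to quantify the rejections is to scan the client log for the error lines. A minimal sketch, assuming only the line formats visible in the excerpts above; `summarize_rejections` is a made-up name for illustration and is not part of PRPnet, and other PRPnet versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns copied from the log excerpts quoted above.
NOT_FOUND = re.compile(r"ERROR: Workunit (\S+) not found on server")
ACCEPTED = re.compile(r"INFO: (\d+) of (\d+) test results were accepted")
SOCKET_FAIL = re.compile(r"connect to socket failed")


def summarize_rejections(lines):
    """Count socket failures and rejected workunits in client log lines."""
    rejected = Counter()
    socket_failures = accepted = returned = 0
    for line in lines:
        if SOCKET_FAIL.search(line):
            socket_failures += 1
        match = NOT_FOUND.search(line)
        if match:
            rejected[match.group(1)] += 1
        match = ACCEPTED.search(line)
        if match:
            accepted += int(match.group(1))
            returned += int(match.group(2))
    return {
        "socket_failures": socket_failures,
        "rejected": dict(rejected),
        "accepted": accepted,
        "returned": returned,
    }


# Lines taken from the second excerpt above.
sample = [
    "[2009-08-02 03:31:17 GMT] nplb-gb1.no-ip.org:3000 connect to socket failed",
    "[2009-08-02 03:37:05 GMT] G3000: Returning work to server nplb-gb1.no-ip.org at port 3000",
    "[2009-08-02 03:37:07 GMT] G3000: ERROR: Workunit 145076*6^115554+1 not found on server",
    "[2009-08-02 03:37:07 GMT] G3000: The client will delete this workunit",
    "[2009-08-02 03:37:08 GMT] G3000: ERROR: Workunit 68195*6^115556+1 not found on server",
    "[2009-08-02 03:37:08 GMT] G3000: The client will delete this workunit",
    "[2009-08-02 03:37:08 GMT] G3000: INFO: 0 of 2 test results were accepted",
]
summary = summarize_rejections(sample)
print(summary)
```

Running it over the full log for each port would show whether the rejections really are confined to port 3000.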

gd_barnes 2009-08-02 11:13

Max and Rogue:

Port 3000 has dried. Plan of action now:

1. Please correct all of the aforementioned problems.

2. Change all appropriate # of k's remaining, "min n" values for all k's, # of pairs remaining etc. in all web pages to the state that they should be at n=140K. Also change the 1st post of this thread to its state at n=140K.

3. Load n=140K-150K back into the server and let it rip. When done, let's double check to make sure that all web pages update correctly and that all Emails are sent for primes found.

4. Verify that there are no rejected results and that the server doesn't barf for no reason like it did on Sat. morning.

5. Match all residuals with the 1st run.

Once this all runs correctly and everything is verified as correct, then we can consider the server and the interfacing with the web pages statuses/stats/etc. ready to go for n>150K or for other efforts. In other words, it will be production ready. It is the interfacing with these kinds of web pages and stats that I have the biggest concern about at NPLB.

That is what I would define as a test plan! :smile:

Thanks for the huge boost Lennart! :smile:


Gary

mdettweiler 2009-08-02 11:57

Whoa, I wake up, get online, and it turns out there's all this action waiting for me when I get on. :smile: I'll try to answer all of your questions below:
[quote=gd_barnes;183713]Max and Rogue:

Port 3000 has dried. Plan of action now:

1. Please correct all of the aforementioned problems.[/quote]
Okay, first of all, since we don't have an NPLB-like database set up for CRUS results, there is no mechanism in place for sending out emails when people find a prime. PRPnet has such a mechanism built in, but it will need to use an SMTP server that doesn't require a secured connection--that means we can't use Gmail's server. Essentially, we need to use an ISP server for it. Gary, I'll send you a PM shortly describing what will need to be done to get email notification active.
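
For illustration only, this is roughly what sending a notification through an SMTP server that does not require a secured connection looks like with Python's standard library. It is not PRPnet's built-in mechanism; the function names, host name, and addresses are placeholders and would have to be replaced with the ISP's actual details.

```python
import smtplib
from email.message import EmailMessage


def build_prime_notice(candidate: str, sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text notification for a newly found prime."""
    msg = EmailMessage()
    msg["Subject"] = f"Prime found: {candidate}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{candidate} is prime")
    return msg


def send_plain_smtp(msg: EmailMessage, host: str, port: int = 25) -> None:
    """Send over an unsecured connection: no starttls() and no login()."""
    with smtplib.SMTP(host, port, timeout=30) as server:
        server.send_message(msg)


notice = build_prime_notice(
    "123285*6^147200+1", "server@example.org", "admin@example.org"
)
# send_plain_smtp(notice, "smtp.example.org")  # placeholder host, not a real server
```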

[quote]2. Change all appropriate # of k's remaining, "min n" values for all k's, # of pairs remaining etc. in all web pages to the state that they should be at n=140K. Also change the 1st post of this thread to its state at n=140K.[/quote]
The web pages are automatically generated by the server. If it says that's what's in the server, then usually that is what's in the server. (Well, except for the server_status page that came out all garbled; I'm not sure why it did that.)

The reason why there were k's left in the server that had already been eliminated is because I had to reload some stuff, as described via PM. As soon as the primes for them are re-found, those k's will be automatically removed from the server.

[quote]3. Load n=140K-150K back into the server and let it rip. When done, let's double check to make sure that all web pages update correctly and that all Emails are sent for primes found.[/quote]
140K-150K [B]is[/B] in the server, and it's been working just fine. It's just that Lennart's mostly dried out the server and thus the only remaining k/n pairs on the server report are stragglers. Most of the work has been completed, including many of the ones at the end of the n-range (so, since the server only shows what it's got left, its "max n" value ends up shrinking a bit).

[quote]4. Verify that there are no rejected results and that the server doesn't barf for no reason like it did on Sat. morning.[/quote]
I've been keeping a close eye on that. It's been doing small periods of barfing since then, but it's been recovering rather nicely all on its own, and I have a script set up to salvage any and all barfed results so we don't lose anything. I've got it all under control. :smile:

[quote]5. Match all residuals with the 1st run.[/quote]
Unnecessary, since everything is working correctly as described above. :smile:

[quote]Once this all runs correctly and everything is verified as correct, then we can consider the server and the interfacing with the web pages statuses/stats/etc. ready to go for n>150K or for other efforts. In other words, it will be production ready. It is the interfacing with these kinds of web pages and stats that I have the biggest concern about at NPLB.[/quote]
Okay, to summarize:

-Email notification still needs to be set up for CRUS PRPnet servers. (For NPLB, it goes automatically through our DB, but that doesn't apply for CRUS.)
-The server barfing problem is under control.
-The problem on the server_status page seems to be due to some sort of bug in the server; at any rate, that was always one of the less useful pages (since all that information can be gathered from the main/server_stats page anyway). We may just want to leave its link off the first post here.
-The server is functioning correctly, and *has* processed all of 140K-150K exactly as it should.

Max :smile:

mdettweiler 2009-08-02 19:44

Okay, looks like G3000 has dried out. Gary, shall I load up 150K-200K?

(FYI: since the server has been fully dried out at the end of this range, I'll be able to "fresh start" the server for 150K-200K. That should clear out any residual effects of the earlier problems.)

gd_barnes 2009-08-02 20:05

Max,

Please reconsider your response in the context of the current situation. You haven't responded to the situation as it currently exists.

1. OK, the SMTP stuff needs to be set up. That's fine. My bad. I forgot about that in your PM.

2. If it says something is in the server, then that is what is in the server?? What does that mean? The server has dried!! Those k's should not be in the server because they already have a prime! You say as soon as we find a prime for them, the pages will show correctly. How can we find a prime for them? We're done with the range you reloaded. That's my whole point. Let's get them out of there so the web pages show correctly. There should be 30 k's remaining at n=140K and 28 k's remaining at n=150K.

3. n=140K-150K IS in the server?? The server has dried! I said we need to RELOAD n=140K-150K in the server and run this thing again. Why are you saying that it IS in the server when I'm asking you to reload the range? Like I said, please consider the context of the current situation.

4. Matching of residuals is not necessary because everything is working correctly as shown above? Little is working correctly from what I can tell! If you don't feel that is necessary, please send them to me from the 1st run before we do a 2nd run and I will match them up after the 2nd run.

5. The situation with the barfing is not under control. We can't have scripts (i.e. patches) being written to fix problems. Let's use output from the problem to find its root cause and fix it at its root cause so that patches aren't necessary.

We are not communicating properly here at all. The only thing that I forgot about in a PM was the notification of primes. Nothing else from any PM's showed that the aforementioned problems should have occurred.

I don't feel that we're under control at all. Let's get the server reloaded with n=140K-150K, make the pages so that the # of k's remaining are showing correctly for n=140K (30 k's remaining) [do we need to delete some prior k/n pairs for k's that already have a prime?], and run that range again as though it was a first pass test. At the end of that, there should be 28 k's remaining.

Reference "the server is functioning correctly": Not if it is still barfing. In my mind, "correctly" means that 100% of everything works correctly.

I've designed a test plan. There's nothing here that demonstrates conclusively that much of this will work correctly this next time around. Let's delete what's remaining in the server for n=140K-150K as though it has never been done, make the web pages look like it had never been done (if that requires deleting prior pairs so that some k's don't show as remaining, then let's please do so), and then do that range again. If the web pages update correctly, then I'll be convinced but not before.

If my test plan is out of context given the situation, please design your own and let's discuss it before we proceed. A test plan should include:

1. The input into the system, i.e. the range that we test. You already stated that in the PM as what is shown in post #1 here so we're good there.
2. What the output should be and look like. That is: What should the web pages look like, what should the various prpnet files look like, etc. when the test is complete. That is the major problem that we are having.

If the output in #2 looks different in any manner, then the test is not correct.
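
The expected output for this test can be written down before the run and checked mechanically afterwards. A sketch, assuming the list of k's loaded into the server and the list of primes found are available as plain Python values; the helper names are hypothetical, and the three k's in the example are only a stand-in for the real list of 30, which is not given in this thread.

```python
import re

# Matches Sierpinski-form candidates such as 123285*6^147200+1.
CANDIDATE = re.compile(r"^(\d+)\*(\d+)\^(\d+)\+1$")


def k_of(candidate: str) -> int:
    """Extract k from a candidate written as k*b^n+1."""
    match = CANDIDATE.match(candidate)
    if match is None:
        raise ValueError(f"unrecognized candidate: {candidate!r}")
    return int(match.group(1))


def remaining_ks(start_ks, primes_found):
    """k's still without a prime after removing every k that has one."""
    eliminated = {k_of(p) for p in primes_found}
    return set(start_ks) - eliminated


def check_counts(start_ks, primes_found, expected_start, expected_end):
    """Return a list of human-readable problems; empty means the test passed."""
    problems = []
    if len(set(start_ks)) != expected_start:
        problems.append(
            f"start: expected {expected_start} k's, got {len(set(start_ks))}"
        )
    end = remaining_ks(start_ks, primes_found)
    if len(end) != expected_end:
        problems.append(f"end: expected {expected_end} k's, got {len(end)}")
    return problems


# Stand-in data: 3 k's instead of the real 30, so the expected counts are 3 and 1.
start = [123285, 145076, 68195]
primes = ["123285*6^147200+1", "145076*6^149946+1"]
print(check_counts(start, primes, expected_start=3, expected_end=1))
```

For the real run the expected counts would be 30 at the start and 28 at the end.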


Thank you,
Gary

gd_barnes 2009-08-02 20:06

[quote=mdettweiler;183745]Okay, looks like G3000 has dried out. Gary, shall I load up 150K-200K?

(FYI: since the server has been fully dried out at the end of this range, I'll be able to "fresh start" the server for 150K-200K. That should clear out any residual effects of the earlier problems.)[/quote]


What the heck? NO!!

It dried at 6 AM this morning. I've already stated that. Please stop, slow down, reread everything that I've previously written, and let's get it right.

Edit: Another problem: None of the links in the 1st post are working now!!

I'll be out later this afternoon and evening.

mdettweiler 2009-08-02 20:18

Okay, I think I could have communicated what I was trying to say a bit more clearly in my last post. Here we go again, this time hopefully clearer. :smile:

-The server did not dry at 6 AM. It still had quite a number of stragglers left. That is what I meant when I said that the server still had in it what the web pages said it did (which was correct). The web pages showed at the time that there were a few pairs left for each k, which was accurate at the time.

-The server [I]really[/I] dried shortly before I posted my latest message saying as such. That means the whole 100K-150K range is complete, stragglers and all.

-The barfing has ceased and desisted. I know what the problem was that caused the barfing, and it was due to a connection error, *not* a problem with PRPnet.

-All of the confusion regarding what the server had loaded into it, when it dried, etc. was not due at all to the barfing, but instead was fallout from a mess-up I made when I initially attempted to "fix" the barfing. I now have all that remedied.

-The links in the 1st post have stopped working because I shut down the server after it dried out. I did this so I could get the messes I made earlier under control better. This is of no particular consequence since the server has nothing in it anyway.

Long story short: none of the problems were caused by PRPnet itself; they were all due either to connection problems (which can happen to anybody and could cause other servers such as LLRnet to barf as well), or to human error on my part. The barfing has ceased (being somewhat of a random occurrence in the first place), and the human errors have been remedied.

There, that make sense now? :smile:

Max :smile:

gd_barnes 2009-08-03 06:17

Yes, that makes sense. But you still haven't addressed my concerns about the interfacing web pages yet. We still need a clean test and that hasn't happened yet. Just because we test, find problems, and "supposedly" fix those problems doesn't mean they are fixed. We have to retest! :smile:

On the "barfing", can you tell me what you fixed in the server to make it no longer "barf"? If you haven't done anything, then I see that you're saying that it was "due either to connection problems (which can happen to anybody and could cause other servers such as LLRnet to barf as well), or to human error on my part".

My response: I haven't seen LLRnet barf in ages. That tells me that it shouldn't happen to anyone on PRPnet and that it's either due to a PRPnet problem or to the human error. If it was due to a previous human error, then please refer to the test plan below on how we need to do it to avoid any human error.

Also, all of its interfaces must be proven to update and work correctly. Sorry man, I have to see it. Saying "it's fixed" won't do it. This means all related web pages that show: results, k's remaining, pairs remaining, stats, etc. Therefore I stand by my original test plan. We need to redo n=140K-150K to demonstrate that all of these are working correctly.

This may only be a hobby, but it is mathematics, and math is very exacting. Continuing to run "production" data on something that isn't proven to have all of its interfaces working correctly is not something that I can stomach.

Here is a restated test plan with a little additional detail:

1. Please correct all of the aforementioned problems with the correct updating of the interfacing web pages to their state at n=140K. (I realize the server likely pulls this info. as it goes but something is not interfacing correctly with how it is pulling them or they would show correctly.) This includes the following:
a) k's remaining. (remove prior pairs or results if needed)
b) min n values for the page that shows all k's.
c) # of pairs remaining. (Does appear correct already as I was able to see it drop appropriately as it handed them out Sat. morning.)
d) The second "min n" column heading should show "max n" in the appropriate column on that one web page.

There must be some "preliminary" fix that is needed before the test begins or the pages would have shown correctly immediately after it began previously. There may simply need to be some prior results deleted...not sure there.

2. Delete any hint anywhere of all n=140K-150K from the server. If the # of k's will not display correctly because prior results are still in the server for k's where no prime was ever found, then we need to delete those results also. When we begin this new test, the web page should show 30 k's remaining.

3. Load n=140K-150K back into the server and let it rip. When done, let's double check to make sure that all web pages update correctly. This includes checking that:
(a) # of pairs remaining is correct as it is processing. (estimating is close enough)
(b) # of k's remaining is 30 when the test initially starts and is 28 when the test ends.
(c) The min n for each k is correct as it goes (estimate is sufficient) and is either blank or equal to the max n when the test is finished.

4. Verify that there are no rejected results and that the server doesn't barf for any reason.

5. Match all residuals with the 1st run.
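
Step 5 could be checked along these lines. A sketch, assuming each run's results have been exported to a mapping from candidate to residue string; that export format and the residue values in the example are made up for illustration, and PRPnet's actual result files may look different.

```python
def compare_residues(first_run, second_run):
    """Compare two {candidate: residue} mappings.

    Returns (mismatched, only_first, only_second), where mismatched maps each
    candidate tested in both runs to its pair of differing residues.
    """
    common = first_run.keys() & second_run.keys()
    mismatched = {
        c: (first_run[c], second_run[c])
        for c in sorted(common)
        if first_run[c].upper() != second_run[c].upper()  # ignore hex case
    }
    only_first = sorted(first_run.keys() - second_run.keys())
    only_second = sorted(second_run.keys() - first_run.keys())
    return mismatched, only_first, only_second


# Residue values below are invented placeholders, not real results.
run1 = {
    "59506*6^114591+1": "0123456789ABCDEF",
    "68195*6^115556+1": "FEDCBA9876543210",
    "145076*6^115554+1": "00000000DEADBEEF",
}
run2 = {
    "59506*6^114591+1": "0123456789abcdef",  # same residue, different case
    "68195*6^115556+1": "FEDCBA9876543211",  # differs in the last digit
}
mismatched, only_first, only_second = compare_residues(run1, run2)
print(mismatched, only_first, only_second)
```

Any entry in `mismatched`, or any pair present in only one run, would mean the rerun did not reproduce the first run.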

It'd be great if we could have help with the rerun but I'll put multiple quads on it if needed to finish it within a day. Lennart still gets credit for the 2 new primes found that are now shown in the Sierp base 6 drive and on the web pages.

Finally: Let's come to a mutual agreement on the definition of "dried": To me, dried means the server is handing out no new pairs even if stragglers are remaining where the results have not come back yet. Just because stragglers are remaining doesn't mean anyone can get any work. That is, if a new person came here and tried to access the server, he could not get any work. By that definition, the server dried at ~6 AM CDT Sunday morning because my machines were pulling no more new pairs after that. I'm not sure why it took hours for the server to receive back all of the results for the straggling pairs. Maybe somebody's machine was down for a while.
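
The two readings of "dried" in this thread differ only in whether outstanding stragglers count. A small sketch to pin that down; the `ServerState` type and its field names are hypothetical and do not correspond to anything in PRPnet.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    unassigned: int   # pairs not yet handed out to any client
    outstanding: int  # pairs handed out whose results have not come back


def dried_no_new_work(state: ServerState) -> bool:
    """Dried in the sense above: a new client could not get any work."""
    return state.unassigned == 0


def dried_fully_complete(state: ServerState) -> bool:
    """Stricter reading: nothing left at all, stragglers included."""
    return state.unassigned == 0 and state.outstanding == 0


# A server with nothing left to hand out but a few stragglers still pending.
state = ServerState(unassigned=0, outstanding=5)
print(dried_no_new_work(state), dried_fully_complete(state))
```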


Thank you,
Gary

mdettweiler 2009-08-03 13:18

Okay, I see where you're coming from now. We'll re-run 140K-150K as a test to compare with this first run. However, before we do that, there's a couple things I'd like to clarify:

-I am quite sure that the barfing was *not* due to human error or PRPnet error in any way, but due to an essentially "random" connection fluctuation. I have no way of knowing which end of the line the fluctuation was on, but I do know this: there's not really any way to fix that. Though, they seem to happen rather rarely anyway, so it's probably not too big a deal in the big picture.

-Okay, I see what you're saying about the definition of a server being "dried". I'd always seen it used the other way around--to refer to a server being completely cleaned out. But, hey, whatever works. :smile:

With that out of the way, I'll go ahead and load 140K-150K into the server again. :smile:

mdettweiler 2009-08-03 13:42

All right, 140K-150K has been reloaded exactly as described above. I see Lennart's hungry machines have already swooped in and grabbed a bunch of work. :smile:

Checking the various web pages:
[URL]http://nplb-gb1.no-ip.org:3000/[/URL] (a.k.a. server_stats.html) checks out except for the "Min N" label on both the Min N and Max N columns, which is a known bug for Sierpinski/Riesel mode in PRPnet and will be fixed in a future release, but is only a cosmetic error for now.

[URL]http://nplb-gb1.no-ip.org:3000/server_status.html[/URL] still looks kind of weird:
[quote]Sierpinski Base 6 6 k remaining30 n remainingmin n is 4065 (140005 digits long)[/quote]
It seems that this page is displaying inaccurately, due to a bug in PRPnet. However, as before, this is a cosmetic error (even if a bit more serious than the extra "Min N" label), and regardless of how the words seem to come out on the page, the basic information does line up with the following (as of when I pulled down this page):
-k's remaining: 30
-n's remaining: 4065
-min n remaining: 140005
The # of digits doesn't seem to be presented at all, regardless of the wording.

[URL]http://nplb-gb1.no-ip.org:3000/user_stats.html[/URL] checks out.

I'll report the bug on the server_status page to Mark. However, since all the information on that page can be gathered from the server_stats/home page anyway, it's somewhat less of a big deal since it's not affecting anything except the display of the data. So, we should be OK even with that bug present.

Max :smile:


All times are UTC.
