mersenneforum.org > Prime Search Projects > Conjectures 'R Us

Old 2009-08-12, 22:32   #133
gd_barnes
 
May 2007
Kansas; USA
10100010100011₂ Posts

Quote:
Originally Posted by rogue
There is only so much I can glean from the posted information, even with debug turned on. I've looked through the code for problem spots. Some are easily found, others are much more elusive.

I have been a "one-man" team with writing the software. I can only dedicate so much time to it. If anyone would like to do a code review on the software and provide me feedback (via e-mail), I would appreciate it.
Point taken. I'll refer your request for a code review to Ian (MyDogBuster). Although he has mostly been a legacy programmer for 40 years, my feeling is that he may be able to come up with some thoughts on some things.

Another thought about this barfing: Could there be some problem with the way things are being updated in memory causing the server to "think" that a result has been returned when it hasn't or vice-versa? Updates in memory make me nervous, especially if there are a lot of them before the file is saved to disk.

Last fiddled with by gd_barnes on 2009-08-12 at 22:38
Old 2009-08-12, 22:55   #134
rogue
"Mark"
Apr 2003
Between here and the
14320₈ Posts

Quote:
Originally Posted by gd_barnes
Point taken. I'll refer your request for a code review to Ian (MyDogBuster). Although he has mostly been a legacy programmer for 40 years, my feeling is that he may be able to come up with some thoughts on some things.

Another thought about this barfing: Could there be some problem with the way things are being updated in memory causing the server to "think" that a result has been returned when it hasn't or vice-versa? Updates in memory make me nervous, especially if there are a lot of them before the file is saved to disk.
The code is in C++, so the software does not address memory arbitrarily. Typically the compiler will warn you when a variable is used before it is initialized or, if it doesn't have a valid value, it would crash the app immediately. I think it is unlikely that this is happening. It is not outside the realm of possibility, but not likely. I really need to have the client and server communicate in larger chunks, preferably a single chunk per communication, with a check digit to ensure clean communication. That will have to wait for the next major release.
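To make the "single chunk with a check" idea concrete, here is a minimal sketch in Python. It is not PRPNet's actual protocol: the header layout and the function names `frame` and `unframe` are assumptions for illustration, and a CRC-32 stands in for the "check digit".

```python
import struct
import zlib

# Hypothetical header: payload length and CRC-32 of the payload, network byte order.
HEADER = struct.Struct("!II")

def frame(lines):
    """Pack all lines of one message into a single length-prefixed, checksummed chunk."""
    payload = "\n".join(lines).encode("utf-8")
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def unframe(chunk):
    """Return the message lines, or raise ValueError if the chunk is cut short or corrupted."""
    if len(chunk) < HEADER.size:
        raise ValueError("truncated header")
    length, checksum = HEADER.unpack(chunk[:HEADER.size])
    payload = chunk[HEADER.size:]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch")
    return payload.decode("utf-8").split("\n")
```

With something like this, a receiver that gets only part of a message (for example because the other side was stopped mid-connection) can reject the whole chunk instead of acting on half of it.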
Old 2009-08-13, 22:51   #135
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts

Hey Mark, I think I just found something that might give us a clue to some of these problems. Just now, I stopped a PRPnet client on my computer seconds after it first started, without thinking of the potential consequences to the server, and noticed that it cut right in the middle of a connection. Here's the debug log:
Code:
[2009-08-13 22:47:46 GMT] PRPNet Client application v2.2.4 started
[2009-08-13 22:47:46 GMT] User name mdettweiler at email address is max@noprimeleftbehind.net
[2009-08-13 22:47:46 GMT] in FindNextServerForWork: total time for client=0 seconds
[2009-08-13 22:47:46 GMT] suffix: PGpps-smalln, no work done yet, target pct work done=100
[2009-08-13 22:47:46 GMT] socket 1848 >>>> FROM max@noprimeleftbehind.net Core2Duo mdettweiler
[2009-08-13 22:47:46 GMT] PGpps-smalln: Getting work from server pgllr.mine.nu at port 10000
[2009-08-13 22:47:46 GMT] socket 1848 >>>> GETWORK 2.2.4 20
[2009-08-13 22:47:46 GMT] socket 1848 >>>> llr
[2009-08-13 22:47:46 GMT] socket 1848 >>>> phrot
[2009-08-13 22:47:46 GMT] socket 1848 >>>> pfgw
[2009-08-13 22:47:46 GMT] socket 1848 >>>> End of Message
[2009-08-13 22:47:46 GMT] Accepted force quit.  Waiting to close sockets before exiting
Presumably it sent a "QUIT" right after the "End of Message" that didn't get reported in the log. At any rate, though, I'll bet it confused the heck out of the server; I'd guess that the server is right now spewing out "error sending test x to localhost:10000" errors.

Lennart, can you by chance get a debug log from this time frame on PrimeGrid port 10000 so we can confirm this?
Old 2009-08-18, 08:56   #136
MyDogBuster
May 2008
Wilmington, DE
101100100100₂ Posts

Got this error out of PFGW / PRPNet 2.2.6 client / PFGW 3.2 20090725

In the PFGWerr file:

Error occuring in PFGW at Mon Aug 17 02:49:09 2009
Expr = 61126*39^1277-1
Failed at bit 7 of 6765
Msg = Detected in MAXERR>0.45 (round off check) in prp_using_gwnum
Iteration: 7/6765 ERROR: ROUND OFF 0.5>0.45

The PRPNet client then went into a loop and kept testing that pair over and over again. Had to stop client and remove that test from the work save file. No messages in either the client log or the server log. Also removed the test from the candidates file and restarted server. It was an active test.


Also got this strange message in the client log in a separate incident.


[2009-08-18 07:58:09 GMT] Base24: 538844*39^1602-1 is not prime. Residue 4784E7D5D6F1F755
[2009-08-18 07:58:09 GMT] Base24: 892140*39^1602-1 is not prime. Residue C465D25FE8FB197A
[2009-08-18 07:58:09 GMT] Base24: 394774*39^1602-1 is not prime. Residue E043D984C3CE6B5C
[2009-08-18 07:58:09 GMT] Base24: 1081910*39^1602-1 is not prime. Residue C26F471C317401E2
[2009-08-18 07:58:10 GMT] Base24: 263818*39^1602-1 is not prime. Residue C20E2B8E0614B4D8
[2009-08-18 07:58:10 GMT] Base24: 378674*39^1602-1 is not prime. Residue A6612BFE632FB40D
[2009-08-18 07:58:10 GMT] Base24: Could not open file [work_Base24.in] for writing. Exiting program
[2009-08-18 08:01:50 GMT] PRPNet Client application v2.2.6 started
[2009-08-18 08:01:50 GMT] User name MyDogBuster at email address is IMGunn1654@gmail.com
[2009-08-18 08:01:50 GMT] Base24: 303664*39^1602-1 is not prime. Residue D39B88D622FF6046
[2009-08-18 08:01:50 GMT] Base24: 134698*39^1602-1 is not prime. Residue 527C07DCCEF4BE55


I was not messing with the files.

Last fiddled with by MyDogBuster on 2009-08-18 at 09:31
Old 2009-08-18, 12:27   #137
rogue
"Mark"
Apr 2003
Between here and the
1100011010000₂ Posts

Quote:
Originally Posted by MyDogBuster
Got this error out of PFGW / PRPNet 2.2.6 client / PFGW 3.2 20090725

In the PFGWerr file:

Error occuring in PFGW at Mon Aug 17 02:49:09 2009
Expr = 61126*39^1277-1
Failed at bit 7 of 6765
Msg = Detected in MAXERR>0.45 (round off check) in prp_using_gwnum
Iteration: 7/6765 ERROR: ROUND OFF 0.5>0.45

The PRPNet client then went into a loop and kept testing that pair over and over again. Had to stop client and remove that test from the work save file. No messages in either the client log or the server log. Also removed the test from the candidates file and restarted server. It was an active test.


Also got this strange message in the client log in a separate incident.


[2009-08-18 07:58:09 GMT] Base24: 538844*39^1602-1 is not prime. Residue 4784E7D5D6F1F755
[2009-08-18 07:58:09 GMT] Base24: 892140*39^1602-1 is not prime. Residue C465D25FE8FB197A
[2009-08-18 07:58:09 GMT] Base24: 394774*39^1602-1 is not prime. Residue E043D984C3CE6B5C
[2009-08-18 07:58:09 GMT] Base24: 1081910*39^1602-1 is not prime. Residue C26F471C317401E2
[2009-08-18 07:58:10 GMT] Base24: 263818*39^1602-1 is not prime. Residue C20E2B8E0614B4D8
[2009-08-18 07:58:10 GMT] Base24: 378674*39^1602-1 is not prime. Residue A6612BFE632FB40D
[2009-08-18 07:58:10 GMT] Base24: Could not open file [work_Base24.in] for writing. Exiting program
[2009-08-18 08:01:50 GMT] PRPNet Client application v2.2.6 started
[2009-08-18 08:01:50 GMT] User name MyDogBuster at email address is IMGunn1654@gmail.com
[2009-08-18 08:01:50 GMT] Base24: 303664*39^1602-1 is not prime. Residue D39B88D622FF6046
[2009-08-18 08:01:50 GMT] Base24: 134698*39^1602-1 is not prime. Residue 527C07DCCEF4BE55
As for the first error, the client is supposed to detect the error, then reissue the test with -a1. Can you run this number through PFGW with -a1 and -a2 to see if it fails with those as well? I could understand the loop if the -a1 and -a2 failed. That would be extremely unusual, but possible. I also recommend upgrading to PFGW 3.2.2 as FFT selection is less aggressive for some numbers. It is probably fixed with that release.

The client will terminate if it cannot open a file for I/O. I suppose that it could try again after a second or two.
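A retry like that could look roughly like the sketch below. It is in Python rather than the client's C++, and `open_with_retry` and its parameters are made up for illustration, not taken from PRPNet.

```python
import time

def open_with_retry(path, mode="w", attempts=3, delay=1.0):
    """Try to open path, sleeping between failed attempts; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return open(path, mode)
        except OSError:
            # Give up only after the final attempt; otherwise wait and retry,
            # in case something else held the file briefly.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

For example, `open_with_retry("work_Base24.in", "w", attempts=3, delay=2.0)` would try three times, two seconds apart, before giving up with the original error.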
Old 2009-08-18, 14:06   #138
MyDogBuster
May 2008
Wilmington, DE
2²·23·31 Posts

Quote:
As for the first error, the client is supposed to detect the error, then reissue the test with -a1. Can you run this number through PFGW with -a1 and -a2 to see if it fails with those as well? I could understand the loop if the -a1 and -a2 failed. That would be extremely unusual, but possible. I also recommend upgrading to PFGW 3.2.2 as FFT selection is less aggressive for some numbers. It is probably fixed with that release.
Worked with the -a1, so I don't know why it looped. You might want to check to be sure you issue it. Just to be sure, I'll switch to 3.2.2 if you can tell me where to get it. SourceForge still has only 3.2.

Last fiddled with by MyDogBuster on 2009-08-18 at 14:07
Old 2009-08-18, 15:21   #139
kar_bon
Mar 2006
Germany
101101011011₂ Posts

Quote:
Originally Posted by MyDogBuster
Just to be sure, I'll switch to 3.2.2 if you can tell me where to get it. SourceForge still has only 3.2.
see http://www.mersenneforum.org/showthread.php?t=12281
Old 2009-08-18, 17:16   #140
rogue
"Mark"
Apr 2003
Between here and the
14320₈ Posts

Quote:
Originally Posted by MyDogBuster
Got this error out of PFGW / PRPNet 2.2.6 client / PFGW 3.2 20090725

In the PFGWerr file:

Error occuring in PFGW at Mon Aug 17 02:49:09 2009
Expr = 61126*39^1277-1
Failed at bit 7 of 6765
Msg = Detected in MAXERR>0.45 (round off check) in prp_using_gwnum
Iteration: 7/6765 ERROR: ROUND OFF 0.5>0.45
I think that I found the problem behind this. The client wasn't deleting the output file from PFGW, so after each successive run of PFGW it kept finding the old error in that file and tried to redo the test.
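The general shape of the fix, sketched in Python: clear any leftover error file before each run, then step through the default settings, -a1 and -a2. This is only an illustration of the idea, not the client's code. `run_with_escalation`, `run_pfgw` and `err_path` are hypothetical names, and treating "an error file exists after the run" as the failure signal is an assumption.

```python
import os

def run_with_escalation(expr, run_pfgw, err_path, flags=("", "-a1", "-a2")):
    """Run expr once per flag until a run leaves no error file.

    run_pfgw is a caller-supplied function (hypothetical) that launches the
    test with the given flag. Returns the flag that worked, or None.
    """
    for flag in flags:
        # Remove any error file left by a previous run so that it cannot
        # be mistaken for a failure of the run about to start.
        if os.path.exists(err_path):
            os.remove(err_path)
        run_pfgw(expr, flag)
        if not os.path.exists(err_path):
            return flag
    return None
```

Without the cleanup step, a stale error file would make every run look like a failure, which matches the looping behaviour reported above.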
Old 2009-09-02, 01:24   #141
rogue
"Mark"
Apr 2003
Between here and the
14320₈ Posts

Admin edit: This post was copied over from the CRUS 4th drive thread for Riesel base 6.

I'm curious as to why PRPNet wouldn't be a good choice for this drive. I can understand base 3 due to the small n, but the n for this drive are much larger. It would be far less likely to trigger the same issues that it has with base 3. I could say the same for the Sierpinski base 6 drive.

And yes I am still working on the next release, but have higher priority items at home at this time which keep me away from active development.

Last fiddled with by gd_barnes on 2009-09-03 at 08:15 Reason: admin edit
Old 2009-09-02, 02:47   #142
mdettweiler
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts

Quote:
Originally Posted by rogue
I'm curious as to why PRPNet wouldn't be a good choice for this drive. I can understand base 3 due to the small n, but the n for this drive are much larger. It would be far less likely to trigger the same issues that it has with base 3. I could say the same for the Sierpinski base 6 drive.

And yes I am still working on the next release, but have higher priority items at home at this time which keep me away from active development.
We found that on the Sierp. base 6 drive, we got loads and loads of barfs. For big stuff like base 22 it seems to be keeping the barfs to a minimal level, but base 6 is nonetheless a bit hard unless it's just a personal server (as we recommended for people with large #'s of cores in the first post of this thread).

Fortunately, for now PRPnet's barfs seem to be partially fixed to the point where they don't actually contaminate the results at all (which happened with some of the earlier barfs); though nonetheless, they can be rather annoying to fix. For instance, when I was processing some Sierp. base 6 results from a largish chunk of work Lennart did a few days ago, I came across a whole pile of barfs that took me about 6 hours to re-do. And that was only for a personal server.
Old 2009-09-02, 07:48   #143
gd_barnes
May 2007
Kansas; USA
101·103 Posts

Max, when the tests get extremely long, say at n>500K, would trying PRPnet for a smallish range such as n=5K or 10K make sense? I think Lennart is essentially doing that right now for his personal ranges on this drive.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.