mersenneforum.org > Prime Search Projects > Conjectures 'R Us
Old 2009-08-07, 18:46   #78
MyDogBuster
 
Quote:
However, since the 2.2.4 client changes the timeout to 30 seconds, and won't keep sending results if the server isn't ready for them, this should now be fixed.
If I were you guys, I'd set the timeout to 60 seconds. You never know what a machine might be doing, or get into, after it starts processing something. The larger the cache, the longer an update will take. Chances are 60 seconds will never be reached, so what would it hurt?
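
For anyone wondering what that timeout amounts to in code, here's a minimal Python sketch (PRPNet itself is C++; the helper name and the 60-second value are just illustrative):

```python
import socket

# Hypothetical helper: give the server up to 60 seconds to respond
# before a connect/send/recv is treated as failed.  PRPNet is C++;
# this only illustrates the kind of setting being discussed.
def make_client_socket(timeout_seconds=60.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout_seconds)  # applies to connect, send, and recv
    return sock

sock = make_client_socket(60.0)
print(sock.gettimeout())  # 60.0
sock.close()
```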

Any ideas as to why I lost 1/5th of my candidates file during that barf?
Over 1MB of data right in the middle of the file.
Old 2009-08-07, 20:14   #79
MyDogBuster
 

Another strange message, seen only after I upgraded to 2.3.4:


[2009-08-07 20:12:12 GMT] Error sending <<<link rel="icon" type="image/ico" href="prpnet.ico">>> to localhost:7102

BTW, the barfs seem to have stopped. I still see two updates happening together, but no error messages.

Last fiddled with by MyDogBuster on 2009-08-07 at 20:15
Old 2009-08-07, 22:27   #80
MyDogBuster
 

Quote:
BTW, the barfs seem to have stopped. I still see two updates happening together, but no error messages.
I'll take that back. The barfs are still happening, but less frequently. I think the problem is that timeout parameter. The server is running in command-line mode, and the program takes time to paint the command window with each message; I'm assuming it does so after updating the candidates file. When I watch it, each line takes about a second to paint because of the update involved.

I will bet you that if you up the timeout parameter to a minute or two, the problem will go away. The time required also depends on the cache size: if someone sets the cache size to, say, 100, that parameter will have to be higher. 30 seconds wasn't bad; I forced 160 updates through my pipeline at once and only 15 got error messages. That was a lot higher before.

Setting that parameter higher doesn't make the server run slower; it just allows more time for it to get its work done before problems occur. Maybe making it a changeable parameter would be good; that way, someone with big cache sizes could set it higher to allow for more processing time. On a public server, I would set it to at least 2 minutes, since no one has any idea what updates are coming at any one time. It can't hurt, can it?
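
A sketch of what a changeable timeout could look like, assuming a hypothetical clienttimeout= key next to the existing savefrequency= in prpserver.ini (the [server] section header is also an assumption; the real file may not use sections):

```python
import configparser
import io

# Stand-in for prpserver.ini; "clienttimeout" is a made-up key.
sample_ini = io.StringIO("[server]\nclienttimeout=120\nsavefrequency=5\n")

config = configparser.ConfigParser()
config.read_file(sample_ini)

# Fall back to 30 seconds (the 2.2.4 default) when the key is absent.
timeout = config.getint("server", "clienttimeout", fallback=30)
print(timeout)  # 120
```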

Last fiddled with by MyDogBuster on 2009-08-07 at 22:30
Old 2009-08-07, 23:02   #81
rogue
 

Quote:
Originally Posted by MyDogBuster View Post
I'll take that back. The barfs are still happening, but less frequently. I think the problem is that timeout parameter. The server is running in command-line mode, and the program takes time to paint the command window with each message; I'm assuming it does so after updating the candidates file. When I watch it, each line takes about a second to paint because of the update involved.

I will bet you that if you up the timeout parameter to a minute or two, the problem will go away. The time required also depends on the cache size: if someone sets the cache size to, say, 100, that parameter will have to be higher. 30 seconds wasn't bad; I forced 160 updates through my pipeline at once and only 15 got error messages. That was a lot higher before.

Setting that parameter higher doesn't make the server run slower; it just allows more time for it to get its work done before problems occur. Maybe making it a changeable parameter would be good; that way, someone with big cache sizes could set it higher to allow for more processing time. On a public server, I would set it to at least 2 minutes, since no one has any idea what updates are coming at any one time. It can't hurt, can it?
I'll think about it for the next release. To reduce file I/O, you can increase the setting of savefrequency= in the prpserver.ini file.
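
A minimal sketch of what savefrequency= buys, with invented names (the real server is C++, and its internals may differ): results are recorded in memory only, so a burst of reports costs one write when the interval elapses instead of one write per report.

```python
import time

class CandidateStore:
    """Toy model: flush state to disk only every save_frequency minutes."""

    def __init__(self, save_frequency_minutes=5):
        self.save_interval = save_frequency_minutes * 60  # seconds
        self.last_save = time.monotonic()
        self.dirty = False

    def record_result(self, result):
        self.dirty = True  # state changes in memory only
        if time.monotonic() - self.last_save >= self.save_interval:
            self.save()

    def save(self):
        # the real server would rewrite prpserver.candidates here
        self.last_save = time.monotonic()
        self.dirty = False

store = CandidateStore(save_frequency_minutes=5)
store.record_result("16641*24^39588+1")
print(store.dirty)  # True: nothing has hit the disk yet
```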

I should consider giving an administrator the ability to split the prpserver.candidates file so that it doesn't take as much memory, reducing the I/O. It could (theoretically) be striped across multiple files, only reading in a subsequent file when the server is running dry. Thoughts?
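
The striping idea could look something like this sketch (file handling elided, names hypothetical): only the current slice lives in memory, and the next one is pulled in when the server runs dry.

```python
class StripedCandidates:
    """Toy model of candidates striped across several files."""

    def __init__(self, stripes):
        # each stripe stands in for one on-disk candidates file
        self.stripes = list(stripes)
        self.current = []

    def next_candidate(self):
        if not self.current and self.stripes:
            # running dry: load the next stripe into memory
            self.current = list(self.stripes.pop(0))
        return self.current.pop(0) if self.current else None

pool = StripedCandidates([["a", "b"], ["c"]])
print([pool.next_candidate() for _ in range(4)])  # ['a', 'b', 'c', None]
```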
Old 2009-08-07, 23:19   #82
mdettweiler

Quote:
Originally Posted by MyDogBuster View Post
I'll take that back. The barfs are still happening, but less frequently. I think the problem is that timeout parameter. The server is running in command-line mode, and the program takes time to paint the command window with each message; I'm assuming it does so after updating the candidates file. When I watch it, each line takes about a second to paint because of the update involved.

I will bet you that if you up the timeout parameter to a minute or two, the problem will go away. The time required also depends on the cache size: if someone sets the cache size to, say, 100, that parameter will have to be higher. 30 seconds wasn't bad; I forced 160 updates through my pipeline at once and only 15 got error messages. That was a lot higher before.

Setting that parameter higher doesn't make the server run slower; it just allows more time for it to get its work done before problems occur. Maybe making it a changeable parameter would be good; that way, someone with big cache sizes could set it higher to allow for more processing time. On a public server, I would set it to at least 2 minutes, since no one has any idea what updates are coming at any one time. It can't hurt, can it?
Hmm...just to verify, by "barfing" do you mean that the server is still writing blank results to completed_tests.log? Or is it just giving error messages? Because as long as it's not writing blank results (or apparently scrambling other results at the same time), the integrity of the results is still sound, and it's not quite the same as the "barfing" we were dealing with on G3000 (and which you'd reported earlier on your server with version 2.2.3).

Also, you're saying that it has to update prpserver.candidates with EVERY new line it outputs? That doesn't seem right. What's your "savefrequency=" set to in prpserver.ini? At the default of 5 minutes, I don't get any problems with that kind of bottlenecking.

Mark, should I go ahead and upgrade the server to 2.2.4, or do you need more data with 2.2.3? Either way, considering that Ian's reporting further problems, I won't load all of 100K-150K, just 140K-150K, as another test run.
Old 2009-08-07, 23:47   #83
MyDogBuster
 
MyDogBuster's Avatar
 
May 2008
Wilmington, DE

1011001001002 Posts
Default

Quote:
Also, you're saying that it has to update prpserver.candidates with EVERY new line it outputs? That doesn't seem right. What's your "savefrequency=" set to in prpserver.ini? At the default of 5 minutes, I don't get any problems with that kind of bottlenecking.
I have savefrequency set to 0 for a reason: it's a valid number for the parameter. When testing something, we have to take into account ANYTHING that can happen. Someone is going to set it to 0. I'm it.

I'll reset it to 3 minutes and try my testing again. I was not getting any blank records this time.

Mark, thanks for considering changing the I/O to the file.

Quote:
I should consider allowing an administrator the ability to split the prpserver.candidates file so that it doesn't take as much memory and reducing the I/O. It could (theoretically) be striped across multiple files only reading in a subsequent file when the server is running dry. Thoughts?
The problem with splitting the candidates file is that the removal of k's would be messed up unless all the candidates for a k are in the same file.
I'm currently testing a candidates file with 196K tests, and the removal works just fine. Great feature. If we split it into 3 files, then I would have to do some manual deleting.
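
That constraint could be met by grouping on k before splitting, so a k's candidates never straddle a file boundary. A sketch with made-up candidate strings (the format mirrors the k*b^n+1 strings in the logs):

```python
from collections import defaultdict

def split_by_k(candidates, max_per_file):
    """Split into stripes without ever separating the candidates of a k."""
    by_k = defaultdict(list)
    for cand in candidates:
        by_k[cand.split("*")[0]].append(cand)  # leading k of "k*b^n+1"

    files, current = [], []
    for group in by_k.values():
        if current and len(current) + len(group) > max_per_file:
            files.append(current)
            current = []
        current.extend(group)  # a whole k stays together
    if current:
        files.append(current)
    return files

stripes = split_by_k(
    ["5*24^10+1", "5*24^11+1", "7*24^10+1", "9*24^12+1"], max_per_file=2)
print(stripes)  # [['5*24^10+1', '5*24^11+1'], ['7*24^10+1', '9*24^12+1']]
```

A k with more candidates than max_per_file would simply get an oversized stripe of its own; removing that k would still touch only one file.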

I guess doing more analysis before setting up a candidates file would have helped. But then again, stress testing should ferret out all problems. Maybe for future reference someone could set some reasonable limits on file sizes and stuff like that.

All in all, great program(s) Mark.
Old 2009-08-08, 00:11   #84
MyDogBuster
 
MyDogBuster's Avatar
 
May 2008
Wilmington, DE

B2416 Posts
Default

Okay, setting the savefrequency to 5 did not get rid of the bottleneck problem. I still got the error messages but no blank records.

I checked the client log and it showed that all 20 tests were accepted. I waited the 5 minutes and then checked the candidates file for the errored tests; they were still marked as inprogress on that machine. That's an inconsistency: the client thinks all's okay with the upload, but the server didn't accept the bottlenecked ones.

Last fiddled with by MyDogBuster on 2009-08-08 at 00:12
Old 2009-08-08, 01:06   #85
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

143208 Posts
Default

Max, upgrade to 2.2.4. I don't think anything else is necessary at this point.

MyDogBuster, I don't understand your last statement. Are you saying that all the workunits failed to be reported to the server and the server still had them marked as inprogress? If so, that is not a problem, as the server would still be waiting for valid test results. It is only a problem if the client dropped the workunits. If that is the case, I would need to see debug logs from both. It is possible that a bug is still lurking in the client.

savefrequency=0 would cause a huge amount of I/O. I think that setting it to an hour or more is reasonable as long as the server is stable.

I also suggest setting maxworkunits to a value that allows clients to build up an hour or more of work before reporting. That would reduce the number of times clients need to communicate with the server to get and report work. It should also reduce bottlenecks. I max out on PrimeGrid projects and get 5+ hours of work each time.
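
That sizing is just arithmetic: divide the buffer of work you want by the average test time. A quick sketch (the 90-second per-test time is invented for illustration):

```python
def suggest_maxworkunits(buffer_hours, seconds_per_test):
    """How many workunits roughly fill the desired buffer of work."""
    return max(1, int(buffer_hours * 3600 // seconds_per_test))

print(suggest_maxworkunits(1, 90))  # 40: about an hour of 90-second tests
```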

At this time the server is not multi-threaded. I'm not certain what happens if multiple clients try to connect at the same time. I presume that if one is connected, then the others have to wait. Multi-threading the server would not be easy, which is why I've been avoiding it.
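
The "others have to wait" behavior can be demonstrated with Python's stock single-threaded TCPServer (a stand-in for the C++ server, not its actual code): while one request is being handled, further connections sit in the OS accept queue rather than being refused.

```python
import socket
import socketserver
import threading

class Echo(socketserver.StreamRequestHandler):
    def handle(self):
        # echo one line back to the client
        self.wfile.write(self.rfile.readline())

# The default TCPServer has no ThreadingMixIn: one request at a time.
server = socketserver.TCPServer(("127.0.0.1", 0), Echo)
port = server.server_address[1]
threading.Thread(target=server.handle_request, daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as conn:
    conn.sendall(b"hello\n")
    reply = conn.makefile().readline().strip()
server.server_close()
print(reply)  # hello
```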
Old 2009-08-08, 01:16   #86
MyDogBuster
 
MyDogBuster's Avatar
 
May 2008
Wilmington, DE

1011001001002 Posts
Default

Quote:
MyDogBuster, I don't understand your last statement. Are you saying that the all workunits failed to be reported to the server and the server still had them marked as inprogress? If so, that is not a problem as the server would still be waiting for valid test results. It is only a problem if the client dropped the workunits. If that is the case I would need to see debug logs from both. It is possible that a bug is still lurking in the client.
The client finished all 20 tests. The server did not accept all of them because of the bottleneck.

I see this message on the server:
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 16641*24^39588+1 1249618146 16641 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] IMGunn1654@gmail.com (Sophie#3) at 192.168.2.100: Sent 16641*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 28701*24^39588+1 1249618146 28701 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] IMGunn1654@gmail.com (Sophie#3) at 192.168.2.100: Sent 28701*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102

But I see this message on the client log.

[2009-08-07 04:09:08 GMT] Base24: INFO: All 20 test results were accepted

Quote:
I also suggest setting maxworkunits to a value that allows clients to build up an hour or more of work before reporting. That would reduce the number of times clients need to communicate with the server to get and report work. It should also reduce bottlenecks. I max out on PrimeGrid projects and get 5+ hours of work each time.
I can do that. I was just seeing what kind of limits there are to some of the settings.

Last fiddled with by MyDogBuster on 2009-08-08 at 01:17
Old 2009-08-08, 01:47   #87
mdettweiler

Quote:
Originally Posted by MyDogBuster View Post
The client finished all 20 tests. The server did not accept all of them because of the bottleneck.

I see this message on the server:
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 16641*24^39588+1 1249618146 16641 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] IMGunn1654@gmail.com (Sophie#3) at 192.168.2.100: Sent 16641*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] Error sending <<WorkUnit: 28701*24^39588+1 1249618146 28701 24 39588 1>> to localhost:7102
[2009-08-07 04:09:06 GMT] IMGunn1654@gmail.com (Sophie#3) at 192.168.2.100: Sent 28701*24^39588+1
[2009-08-07 04:09:06 GMT] Error sending <<ServerType: 1>> to localhost:7102

But I see this message on the client log.

[2009-08-07 04:09:08 GMT] Base24: INFO: All 20 test results were accepted
Hmm...the server log excerpt you posted is of the server sending *new* workunits to the client, while the client log shows successful *sending* of results. They're talking about two different events.
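
One way to make that distinction mechanical is to classify each server-log line by event type. A sketch based on my reading of the excerpts in this thread (the rules are inferred from the log text, not from any documented format):

```python
import re

def classify(line):
    """Rough event classification for the PRPNet server log lines above."""
    if "Error sending <<" in line:
        return "send-to-client failed"
    if re.search(r": Sent \S+$", line):
        return "workunit handed out"
    return "other"

line = ("[2009-08-07 04:09:06 GMT] IMGunn1654@gmail.com (Sophie#3) "
        "at 192.168.2.100: Sent 16641*24^39588+1")
print(classify(line))  # workunit handed out
```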

Last fiddled with by mdettweiler on 2009-08-08 at 01:47
Old 2009-08-08, 01:55   #88
MyDogBuster
 
MyDogBuster's Avatar
 
May 2008
Wilmington, DE

54448 Posts
Default

Quote:
Error sending <<ServerType: 1>> to localhost:7102
Nope, same event. "to localhost:7102" is the server, not the client.