mersenneforum.org  

2009-12-08, 19:25   #1
rogue

Requirements for PRPNet to replace LLRNet

There have been numerous items discussed in threads that make it clear that PRPNet cannot fully replace LLRNet. This thread is for users to post, in clear language, what I can do with PRPNet to help make the transition from LLRNet easier. As you guys are the users of PRPNet/LLRNet, you should have a good idea regarding the requirements for PRPNet so that LLRNet can be sunsetted. Please post those requirements in this thread.

2009-12-08, 23:35   #2
mdettweiler

Probably the biggest thing would be a multi-threaded server. Other than that, it's been working quite stably and does everything that LLRnet does, and more. The only thing left is to up its capacity so that it can handle extremely heavy loads.

That said, the single-threaded server can probably handle reasonably heavy loads as long as the candidates aren't minuscule; it's just that for now we're playing things extra safe. As Gary mentioned in another thread, if PRPnet runs stably for 6 months on all the larger-candidate drives, then we can start rolling it out for the smaller-yet-not-tiny stuff.

2009-12-09, 00:26   #3
rogue

Quote:
Originally Posted by mdettweiler
Probably the biggest thing would be a multi-threaded server. Other than that, it's been working quite stably and does everything that LLRnet does, and more. The only thing left is to up its capacity so that it can handle extremely heavy loads.

That said, the single-threaded server can probably handle reasonably heavy loads as long as the candidates aren't minuscule; it's just that for now we're playing things extra safe. As Gary mentioned in another thread, if PRPnet runs stably for 6 months on all the larger-candidate drives, then we can start rolling it out for the smaller-yet-not-tiny stuff.
It will most likely require me to use a database to make the server multi-threaded. I could do it without multithreading, but it would be much harder to maintain reasonable throughput with it.

Regarding multi-threading, how many concurrent clients should the server be able to support?

It is important to me what you mean by "stable". Stability has multiple components. Here are some of them: ability for clients to get/report work without having to wait for the server to respond, uptime of the server software, low probability of bugs causing lost results, ability of PRPNet to handle issues beyond its control (lost connections mid-stream, run-away process on the server, etc.), etc.

I want to make sure that the targets are reasonable. 24x7 uptime is not a reasonable requirement because there are other server issues (hardware or OS) that are beyond the control of PRPNet. It is definitely desirable and a target to attempt to reach, but probably not reasonable. For example, if the server crashes once a week due to an undiagnosed bug, is that acceptable? Or is it an issue of how much downtime there is? For example, if the server crashes and it takes a couple of hours to get going again (due to manual intervention), I would imagine that is not acceptable, but if a script can restart it within a couple of minutes, that might be okay.

I also want to make sure that the targets are not moving targets. I don't want someone to say that the next release can have no more than 5 lost tests a day and then express frustration because it loses 5 tests on a day and they expected it to lose no more than 3 per day.

Note that the above are examples. I do not have real-world stats on any of these things.

I will leave it to the users of PRPNet to define measurable requirements.

2009-12-09, 03:02   #4
mdettweiler

Quote:
Originally Posted by rogue
It will most likely require me to use a database to make the server multi-threaded. I could do it without multithreading, but it would be much harder to maintain reasonable throughput with it.

Regarding multi-threading, how many concurrent clients should the server be able to support?

It is important to me what you mean by "stable". Stability has multiple components. Here are some of them: ability for clients to get/report work without having to wait for the server to respond, uptime of the server software, low probability of bugs causing lost results, ability of PRPNet to handle issues beyond its control (lost connections mid-stream, run-away process on the server, etc.), etc.
A good target would probably be for it to do at least everything that LLRnet can do, and handle extenuating circumstances better than LLRnet does. Namely:

1. be able to handle as many concurrent connections as the OS can feed to it
2. server response doesn't have to be immediate under heavy load, though it should be able to respond to each client within 5-10 seconds
3. the server preferably shouldn't crash on a regular basis, though a crash or two over the course of a year due to extremely extenuating circumstances would be acceptable
4. if a result is lost due to circumstances beyond the server's control, the server should be able to ignore it and put the test back up for reassignment without mixing up residues or dropping the test completely
5. connection issues beyond PRPnet's control such as dropped connections should be recognized quickly so as not to tie up resources. As for a runaway process on the server...well, I doubt there's anything the server could hope to do about that.

PRPnet already handles #3, #4, and #5 quite well for the most part. #1 is the main bottleneck, and #2 depends somewhat on it. (#2 is handled well now, but that would of course change due to the extensive modifications needed to accommodate #1.)

Note that this is a highly idealized list of qualifications. Some slight deficiencies are to be expected as it's only one human programming this. Nonetheless, in order to be fully ready for any potential loads or situations it may encounter, the above qualifications can be useful as guidelines.
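
As an illustration of item 4 in the list above, here is a minimal Python sketch of a work queue that puts a test back up for reassignment once it has been out longer than a maximum time. This is not PRPNet or LLRnet code: the class, method names, candidate strings, and residue values are all hypothetical, and the two-day limit simply mirrors the JobMaxTime idea from LLRnet.

```python
import time


class WorkQueue:
    """Tracks which candidates are waiting, handed out, or finished (hypothetical)."""

    def __init__(self, candidates, job_max_time):
        self.known = set(candidates)       # every candidate this queue manages
        self.pending = list(candidates)    # not currently handed out
        self.assigned = {}                 # candidate -> time it was handed out
        self.results = {}                  # candidate -> reported residue
        self.job_max_time = job_max_time   # seconds before a test is reclaimed

    def reclaim_expired(self, now):
        """Move tests that have been out too long back to the front of the queue."""
        expired = sorted(c for c, t in self.assigned.items()
                         if now - t > self.job_max_time)
        for c in expired:
            del self.assigned[c]
        self.pending = expired + self.pending
        return expired

    def get_work(self, now=None):
        """Hand out the next candidate, or None if nothing is waiting."""
        now = time.time() if now is None else now
        self.reclaim_expired(now)
        if not self.pending:
            return None
        candidate = self.pending.pop(0)
        self.assigned[candidate] = now
        return candidate

    def report(self, candidate, residue):
        """Store a result once; unknown candidates and duplicates are ignored."""
        if candidate not in self.known or candidate in self.results:
            return False
        self.assigned.pop(candidate, None)
        if candidate in self.pending:
            self.pending.remove(candidate)  # late result for a reclaimed test
        self.results[candidate] = residue
        return True


TWO_DAYS = 2 * 24 * 3600
queue = WorkQueue(["1005*2^500001-1", "1005*2^500003-1"], job_max_time=TWO_DAYS)
first = queue.get_work(now=0)             # handed to a client that then vanishes
second = queue.get_work(now=10)
queue.report(second, "0123456789ABCDEF")  # this client reports normally
third = queue.get_work(now=3 * 24 * 3600) # three days later the first test is reissued
```

Because results are keyed by candidate and stored only once, a late or repeated report cannot overwrite an existing residue in this sketch.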

2009-12-23, 00:16   #5
gd_barnes

One thing that I would really like to see is a testing time added for each test on the client side, like the one shown in the lresults.txt file in LLRnet clients. That would allow me to better estimate the work load among my machines. Having the time taken on the server side would not be necessary nor really useful. For that, LLRnet just shows the time between when the candidate is handed out and when it is returned, which usually bears little resemblance to the actual processing time that it took the machine to test the candidate.

The way it is with PRPnet now, I have to take the cumulative time through the most recent batch, subtract the cumulative time through the batch before that, and divide the difference by the # of candidates in the batch; a lot of hoops to jump through. More recently, I went to batching just 1 candidate so I didn't have to wait as long to estimate my work load and it was more clear how long it was taking to do a single test, but I'd rather batch at least 3 candidates so that there is less micro-downtime due to fewer communications with the server.
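
The workaround described above amounts to a single subtraction and division. A small Python sketch, with made-up numbers and a hypothetical function name:

```python
def per_test_seconds(cumulative_before, cumulative_after, batch_size):
    """Average seconds per test in the most recent batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (cumulative_after - cumulative_before) / batch_size


# Hypothetical figures: cumulative time was 5400 s before the batch and
# 8100 s after it, with 3 candidates in the batch.
average = per_test_seconds(5400.0, 8100.0, 3)
print(average)  # 900.0
```

The result is only an average over the batch, which is why a batch of 1 gives the clearest reading of a single test.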

Since it is something that LLRnet has, I hope that wouldn't be too much work to add the feature. It would go a long way towards helping people with large #'s of clients estimate their work load and work flow better.


Thanks,
Gary

Last fiddled with by gd_barnes on 2009-12-23 at 00:38

2009-12-23, 00:29   #6
gd_barnes

Quote:
Originally Posted by rogue
Regarding multi-threading, how many concurrent clients should the server be able to support?

I also want to make sure that the targets are not moving targets. I don't want someone to say that the next release can have no more than 5 lost tests a day and then express frustration because it loses 5 tests on a day and they expected it to lose no more than 3 per day.

On the 1st para., I would say somewhere in the neighborhood of 300-500 clients. Is that reasonable? I believe the month-long LLRnet rally that we had pushed 300-400 clients at times. I'm sure PrimeGrid would like it to be 1000 clients or higher.

On the 2nd para., thanks for trying to manage expectations. :-) ...not an easy task I know.

Max, correct me if I'm wrong, but LLRnet rarely loses tests that I can recall, well...at least not in the last year or so when we were running rallies. Sometimes we'd have stragglers but it was because they hadn't been returned to the server after someone had them out for an extended period and the JobMaxTime was rather high.

If my perception is correct, I'd say that an initial target for PRPnet to shoot for is no more than 2-3 lost tests per week; working its way down to an avg. of < 1 lost test per week; and eventually ending up at < 1 lost test per MONTH. In other words, unexpected outages on both the client and server side need to be handled so that rarely is anything lost.

Example: Our 11th drive on David's LLRnet port 2000. We've run 100,000's of tests through that in the last few months, many times with 60-80 clients at a time. A large majority has been done on my machines. I can't recall a single lost test in the last 3-4 months. When we're ready to load the n=10K range, the first k/n pair remaining is always < 2 days old. Since 2 days is the JobMaxTime, that is what we would expect.


Gary

Last fiddled with by gd_barnes on 2009-12-23 at 00:36

2009-12-23, 02:57   #7
rogue

Quote:
Originally Posted by gd_barnes
On the 1st para., I would say somewhere in the neighborhood of 300-500 clients. Is that reasonable? I believe the month-long LLRnet rally that we had pushed 300-400 clients at times. I'm sure PrimeGrid would like it to be 1000 clients or higher.

On the 2nd para., thanks for trying to manage expectations. :-) ...not an easy task I know.

Max, correct me if I'm wrong, but LLRnet rarely loses tests that I can recall, well...at least not in the last year or so when we were running rallies. Sometimes we'd have stragglers but it was because they hadn't been returned to the server after someone had them out for an extended period and the JobMaxTime was rather high.

If my perception is correct, I'd say that an initial target for PRPnet to shoot for is no more than 2-3 lost tests per week; working its way down to an avg. of < 1 lost test per week; and eventually ending up at < 1 lost test per MONTH. In other words, unexpected outages on both the client and server side need to be handled so that rarely is anything lost.

Example: Our 11th drive on David's LLRnet port 2000. We've run 100,000's of tests through that in the last few months, many times with 60-80 clients at a time. A large majority has been done on my machines. I can't recall a single lost test in the last 3-4 months. When we're ready to load the n=10K range, the first k/n pair remaining is always < 2 days old. Since 2 days is the JobMaxTime, that is what we would expect.

Gary
It really is an issue of concurrent clients. I would not expect 300+ clients to hit the server within a few seconds of one another. PrimeGrid hasn't really complained. They are heavy users of BOINC, something that PRPNet is not really intended to compete with. I am trying to apply the KISS principle, both to development and deployment, which I think I have been successful with.

Once I have the database in place, results should not be lost unless there is a bug in the client or server. Crashing will not lead to lost tests, which can happen today.
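
To illustrate why a database helps here, the following is a minimal sketch of recording a result inside a transaction, so that a crash leaves the row either fully updated or untouched. It uses Python's built-in sqlite3 module purely to stay self-contained (the server discussed in this thread is being moved to MySQL), and the table name, column names, candidate, and residues are hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) candidate table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS candidate ("
        " name TEXT PRIMARY KEY,"
        " residue TEXT,"                      # NULL until a result is reported
        " completed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def record_result(conn, name, residue):
    """Store a result atomically; True only if an open test was updated."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE candidate SET residue = ?, completed = 1"
            " WHERE name = ? AND completed = 0",
            (residue, name),
        )
        return cur.rowcount == 1


conn = open_db()
with conn:
    conn.execute("INSERT INTO candidate (name) VALUES (?)", ("1005*2^500001-1",))
stored = record_result(conn, "1005*2^500001-1", "0123456789ABCDEF")
duplicate = record_result(conn, "1005*2^500001-1", "FFFFFFFFFFFFFFFF")
print(stored, duplicate)  # True False
```

The in-memory database is only for demonstration; surviving a crash requires an on-disk file or a database server.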

That being said, I was able to spend some time on the MySQL version of the server. So far I have the code that converts the current candidates file and imports it into a database. That is working smoothly. The next piece is to get the HTML generation down. It will be built upon stats tables which have rolled-up data. I might be able to get to that on Christmas day. The next step after that is the multi-threading, which won't be much fun to do. I have never written any multi-threading logic, so that is something for me to learn. Unfortunately none of the examples I've found to do it are easy. The problem is that Windows and Unix do it completely differently. When that is done I'll be able to tackle the client/server messages. Fortunately I don't think that I need to change anything with sockets. I have no ETA for an alpha/beta.

2009-12-23, 05:41   #8
gd_barnes

Thanks for the update Mark. Most of it is well above my head. It sounds like you're hard at work.

Any thoughts on my previous post about adding the test timings to the client side? I think that would be very useful. I'm hoping it would only be a small change on your end.

2009-12-23, 20:42   #9
MyDogBuster

I'd like to add one thing also. The server command window shows so much info that it is really hard to read. I'm sure that GMT date and time, what email it was sent to, and what program ran it are important to some, but I would just like to see what was tested, which client ran it, the residue and Gary's runtime. If I need the rest of the info I could always go to the log file. I, like Gary, need to manage many cores and I can't find the info I need without "searching" thru the command window with the scrolling bars. The client window is just fine.

Also, PRPAdmin can only run if there are tests currently in the candidates file. Is there any way to get that to run so that I can add tests to an empty candidates file? Kind of a chicken-and-egg thing. I can't add tests to something that has nothing in it. I know I can use the prpserver -i option to populate the file but I then have to switch to PRPAdmin for adding more sequences.

Good luck on the conversion Mark.

Last fiddled with by MyDogBuster on 2009-12-23 at 20:46

2009-12-25, 21:10   #10
rogue

Quote:
Originally Posted by MyDogBuster
I'd like to add one thing also. The server command window shows so much info that it is really hard to read. I'm sure that GMT date and time, what email it was sent to, and what program ran it are important to some, but I would just like to see what was tested, which client ran it, the residue and Gary's runtime. If I need the rest of the info I could always go to the log file. I, like Gary, need to manage many cores and I can't find the info I need without "searching" thru the command window with the scrolling bars. The client window is just fine.

Also, PRPAdmin can only run if there are tests currently in the candidates file. Is there any way to get that to run so that I can add tests to an empty candidates file? Kind of a chicken-and-egg thing. I can't add tests to something that has nothing in it. I know I can use the prpserver -i option to populate the file but I then have to switch to PRPAdmin for adding more sequences.
I have an update and a few items.

First, I forgot to respond to Gary's request about the time it takes to do a test. If I do such a thing, it is likely to be based upon clock time, not CPU time, because I cannot trust the time from the log files. I don't know yet how I will handle that, but I do know that it won't be in the initial release of 3.0.

Regarding the output on the server side, I'll have to think about it.

The new server won't care if the database is empty. It will still start up, thus allowing you to use prpadmin as you would like.

Now for the update. I finally have the multi-threading logic in place in the server. The only thing it is doing at this point is serving up the web pages. There is a lot more information on those pages than there is today. Eventually I will have to provide users a way (XSL?) to limit what they want to see. If there are hundreds of bases or k/b combos, loading the webpage takes a while because of the amount of data being served.

There are a few things that the server doesn't do yet. First, the new server does not handle any client connections for getting/returning work. I don't expect that to be too difficult. Second, getting it to build and run on *nix. Windows and *nix use completely different functions for multi-threading, so I will have to write a *nix version of the same code. Hopefully it won't be too difficult. Finally, the server can't handle connections from the admin tool. That shouldn't be too difficult to add either.
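
For readers unfamiliar with the idea, a thread-per-connection server can be sketched in a few lines of Python. This is not the PRPNet implementation, which is written against the native Windows and *nix threading functions; Python's standard library hides that platform difference. The GETWORK/WORK messages and the candidate string are invented for the example and are not the PRPNet protocol.

```python
import socket
import socketserver
import threading


class WorkHandler(socketserver.StreamRequestHandler):
    """Handles one client connection; each connection runs on its own thread."""

    timeout = 10  # seconds; applied to the socket so stalled clients are dropped

    def handle(self):
        try:
            line = self.rfile.readline().strip()
        except OSError:
            return  # timed out or connection lost mid-stream
        if line == b"GETWORK":
            self.wfile.write(b"WORK 1005*2^500001-1\n")
        else:
            self.wfile.write(b"ERROR unknown request\n")


class ThreadedServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True  # worker threads do not block interpreter exit


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = ThreadedServer((host, port), WorkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, request):
    """Send one request line and return the one-line reply."""
    with socket.create_connection(address, timeout=10) as sock:
        sock.sendall(request + b"\n")
        return sock.makefile("rb").readline().strip()


server = start_server()
reply = ask(server.server_address, b"GETWORK")
server.shutdown()
server.server_close()
```

With one thread per connection, a slow client delays only its own thread, which is what lets other clients get and return work without waiting.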

If I have time, I should be able to tackle some of these things on Monday.

2009-12-28, 10:41   #11
gd_barnes

Quote:
Originally Posted by rogue
I have an update and a few items.

First, I forgot to respond to Gary's request about the time it takes to do a test. If I do such a thing, it is likely to be based upon clock time, not CPU time, because I cannot trust the time from the log files. I don't know yet how I will handle that, but I do know that it won't be in the initial release of 3.0.
I'm confused. Why not just take the CPU time output from LLR or PFGW or whatever on the client side? That's what LLRnet does with LLR. It's quite accurate unless there is an interruption on the server or client side, in which case, the time starts back at 0 for the current test upon restart, which is no big deal because only the time for that one test is too low. Even with that, 99%+ of CPU timings are accurate and you don't have to do any kind of internal calculation.

If not all of the programs utilized by PRPnet have a CPU time, then just leave it blank or show zeros for the programs that don't.

It's that LLR or PFGW CPU time taken for an individual test that is badly needed to plan our work more easily. I would also be more confident that PRPnet has improved its timings vs. LLRnet by using the newest versions of LLR and PFGW. The big selling point for PRPnet at NPLB is not having to use the old LLR 3.5, which is ~5-10% slower than the newer versions. We need to be able to easily and quickly see that improvement.

This wouldn't be as important on the server side, although it would be nice to have. LLRnet just displays the time between when the pair was handed out and when it was returned, which is effectively useless. But if the LLR CPU time can also be shown on the server side like we'd like to have it for the client side, that would be good. But if all that can be easily shown is the same as what LLRnet shows, it's not worth it.

In other words, if any computation is necessary, it's not worth showing. The time should come from the output of LLR or PFGW or whatever program is running in PRPnet.
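
A rough Python sketch of what "take the time from the program's output" could look like. The sample line and the "Time : ... sec." / "ms" pattern are assumptions about what an LLR result line looks like, not a documented format, and the function name is hypothetical; programs that report no time yield None, matching the "leave it blank" suggestion above.

```python
import re

# Assumed pattern: "Time : <number> sec." or "Time : <number> ms."
TIME_RE = re.compile(r"Time\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*(ms|sec)")


def elapsed_seconds(result_line):
    """Return the reported test time in seconds, or None if no time is present."""
    match = TIME_RE.search(result_line)
    if match is None:
        return None
    value = float(match.group(1))
    return value / 1000.0 if match.group(2) == "ms" else value


# Hypothetical result line in the assumed format.
sample = ("1005*2^500001-1 is not prime.  "
          "LLR Res64: 0123456789ABCDEF  Time : 812.344 sec.")
seconds = elapsed_seconds(sample)
```

No arithmetic on handout and return times is involved; the value is simply read from what the testing program printed.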


Gary

Last fiddled with by gd_barnes on 2009-12-28 at 10:52
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.