mersenneforum.org

mdettweiler 2010-01-22 20:39

PRPnet 3.1.3 stress-test server
 
Hi all,

As I've previously mentioned on a few occasions, Gary and I had been planning for a while to run a stress test on PRPnet 3.1.3 with lots of cores and very small candidates to ensure that the latest PRPnet can handle high loads. I expect it to cope well, based on testing performed at PrimeGrid, but the testing done here will still be valuable because it will show us whether Gary's setup in particular can handle the load.

To this effect, I have set up a new PRPnet 3.1.3 server and loaded it with work from k=2000-2200, n=50K-250K--i.e., a doublecheck of the 12th Drive. The server info is as follows:

server = nplb-gb1.no-ip.org
port = 7465

Or, in terms of a server= line for use in prpclient.ini:
server=G7465:100:1:nplb-gb1.no-ip.org:7465

Note that in the above line I've set the batch size to 1. Normally this is NOT what you'd want to do for small tests like this, as for numbers this small the overhead actually adds up to a non-negligible amount of wasted CPU power. It also puts a much higher load on the server than, say, a batch size of 20 would. But in this case, high load is what we're aiming for. :smile:
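
For anyone who wants to check their own server= line, here is a minimal Python sketch that splits it into fields. This is not PRPnet's own parser. The field names (`suffix`, `percent`, `batch_size`) are labels chosen for illustration, and the field order is an assumption inferred from this post, which only confirms that one field is the batch size (1) and that the host and port come last.

```python
from typing import NamedTuple


class ServerEntry(NamedTuple):
    suffix: str       # assumed: short label the client prefixes to its log lines
    percent: int      # assumed: share of work to take from this server
    batch_size: int   # tests requested per connection (1 in the line above)
    host: str
    port: int


def parse_server_line(line: str) -> ServerEntry:
    """Split a 'server=' line from prpclient.ini into its five fields."""
    key, _, value = line.strip().partition("=")
    if key != "server":
        raise ValueError(f"not a server= line: {line!r}")
    parts = value.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(parts)}")
    suffix, percent, batch, host, port = parts
    return ServerEntry(suffix, int(percent), int(batch), host, int(port))


entry = parse_server_line("server=G7465:100:1:nplb-gb1.no-ip.org:7465")
print(entry.batch_size, entry.host, entry.port)
```

Changing the third field from 1 to 20 would, under the same assumption, request twenty tests per connection instead of one.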

Gary, as soon as I can get a Linux client package ready for PRPnet 3.1.3, I can send you a preconfigured package to drop on all of your quads. 12 quads * 4 cores = 48 cores, plus whatever I and anyone else can throw on there, so we should be over the magic number of 50.

If anyone else wants to put a few cores on the server, go right ahead--the more load, the better. :smile: Visit our [url=http://www.mersenneforum.org/showthread.php?t=12223]PRPnet thread[/url] for client download links and setup instructions. If the server can hold up for at least 6 hours or so with 50+ cores hammering away on it, then we can be quite confident in the server's capabilities for future rallies and the like.

Max :smile:

Mini-Geek 2010-01-22 20:44

Maybe I was just too quick, but I just set two Intel cores on it and they couldn't connect to the server. Are you still trying to get it set up?
Edit: I tried from another computer, and got this:[code][2010-01-22 20:49:33 GMT] PRPNet Client application v3.1.3 started
[2010-01-22 20:49:33 GMT] User name Mini-Geek at email address is tim.sorbera@gmail.com
[2010-01-22 20:49:36 GMT] G7465: Getting work from server nplb-gb1.no-ip.org at port 7465
[2010-01-22 20:49:45 GMT] G7465: PRPNet server is version 3.1.3
[2010-01-22 20:49:56 GMT] G7465: 2001*2^50150-1 is not prime. Residue 3A543F7AC1D6C29D
[2010-01-22 20:49:56 GMT] Total Time: 0:00:23 Total Tests: 1 Total PRPs Found: 0
[2010-01-22 20:49:56 GMT] G7465: Returning work to server nplb-gb1.no-ip.org at port 7465
[2010-01-22 20:50:06 GMT] Nothing was received on socket 368, therefore the socket was closed
[2010-01-22 20:50:07 GMT] Nothing was received on socket 364, therefore the socket was closed
[2010-01-22 20:50:07 GMT] Total Time: 0:00:34 Total Tests: 1 Total PRPs Found: 0
[2010-01-22 20:50:07 GMT] G7465: Returning work to server nplb-gb1.no-ip.org at port 7465
[2010-01-22 20:50:07 GMT] G7465: INFO: Test for 2001*2^50150-1 was ignored. Candidate and/or test was not found
[2010-01-22 20:50:07 GMT] G7465: INFO: 0 of 1 test results were accepted
[2010-01-22 20:50:07 GMT] G7465: Getting work from server nplb-gb1.no-ip.org at port 7465
[2010-01-22 20:50:18 GMT] Nothing was received on socket 1656, therefore the socket was closed
[2010-01-22 20:50:28 GMT] Nothing was received on socket 368, therefore the socket was closed
[2010-01-22 20:50:28 GMT] Could not verify connection to nplb-gb1.no-ip.org. Will try again later.
[2010-01-22 20:50:29 GMT] nplb-gb1.no-ip.org:7000 connect to socket failed
[2010-01-22 20:50:33 GMT] NPLB5thDrive: Getting work from server nplb-gb1.no-ip.org at port 3000
[/code](synopsis: got work without a problem, tried returning it without getting through twice, then got through properly, and the test was ignored because "Candidate and/or test was not found", then was unable to connect to get work, and moved on to another server.)
The first computer still has yet to make a connection, so I've stopped it for now.

mdettweiler 2010-01-22 20:56

[quote=Mini-Geek;202864]Maybe I was just too quick, but I just set two Intel cores on it and they couldn't connect to the server. Are you still trying to get it set up?[/quote]
Hmm...that's strange. I can access the web page at [URL]http://nplb-gb1.no-ip.org:7465/[/URL] just fine, so the server is definitely accessible from the outside.

I just looked at the server and I see your clients' requests for work, but it appears that they aren't being given candidates to test. Definitely strange. Upon looking back through debug.log, I'm seeing some messages that make me wonder if there's a problem with how the server's connecting to the DB. I'll look into it now.

mdettweiler 2010-01-22 20:56

[quote=Mini-Geek;202864]Edit: I tried from another computer, and got this: [...] The first computer still has yet to make a connection, so I've stopped it for now.[/quote]
Just saw your edit. This is definitely very strange. I'll look into it and see what's up.

mdettweiler 2010-01-22 21:03

Okay, I just took a look at the server, and while I didn't find much of particular interest, I did notice that the VNC connection would sometimes work at a normal speed, and other times it would be really slow.

I wonder if maybe Gary is hogging up the internet connection with something...though I can't fathom what would be quite this hoggish. Neither can that really explain the issue of the "Candidate and/or test was not found" message you got.

Anyway, I've restarted the server. It may be that this problem is due to some stupid mistake I made; give it a try now and see how it works.

mdettweiler 2010-01-22 21:12

Oh! I think I know what's going on. I just tried putting a client (Windows, 3.1.3) of my own on the server, and observed it behaving like this:

-It would ask for work from the server.
-About 9 seconds later, the server would respond with a test.
-Within 3 or 4 seconds, the client would finish the test. (Hey, they're pretty small.)
-The client would send the test back to the server.
-The server would accept it almost immediately.

Note that 9 seconds is a really long time for the server to respond. I think this has to do with the fact that I loaded the server with an absolutely enormous number of candidates. Methinks it's taking a while for the MySQL server to look in the database and come up with a test. It took about that long when I tried to view the Candidate table manually from the console.

Sometimes, though, normal variation would push the delay over the critical 10-second mark--which means the client's timeout kicks in and it gives up. Hence the problems Mini-Geek was seeing; I think he was getting timeouts a tad more often than me, probably due to small differences in the latency of his internet connection vs. mine.
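
To get a feel for why a roughly 9-second response is risky against a 10-second timeout, here is a rough simulation in Python. The 10-second timeout and the 9-second average come from the observations above. The Gaussian jitter and its 1-second spread are pure assumptions for illustration, not measurements, and the 2-second case is a hypothetical lighter load for comparison.

```python
import random


def timeout_rate(mean_s: float, jitter_s: float, timeout_s: float = 10.0,
                 trials: int = 100_000, seed: int = 1) -> float:
    """Fraction of simulated requests whose response time exceeds the timeout."""
    rng = random.Random(seed)
    late = sum(1 for _ in range(trials) if rng.gauss(mean_s, jitter_s) > timeout_s)
    return late / trials


# About 9 s mean (bloated database) versus about 2 s mean (hypothetical lighter load).
slow = timeout_rate(mean_s=9.0, jitter_s=1.0)
fast = timeout_rate(mean_s=2.0, jitter_s=1.0)
print(f"9 s mean: {slow:.1%} of requests time out")
print(f"2 s mean: {fast:.1%} of requests time out")
```

Under these assumed numbers a noticeable share of requests would miss the deadline at a 9-second mean, while at a 2-second mean essentially none would.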

I'm going to try re-loading the server with a smaller batch of work--say, k=2000-2050 instead of 2000-2200. That should make for a less bloated database and hopefully fix this problem. [B]Note: This means the server will be offline for up to 10 minutes or so.[/B]

rogue 2010-01-22 21:15

[QUOTE=mdettweiler;202872]Oh! I think I know what's going on. [...] I'm going to try re-loading the server with a smaller batch of work--say, k=2000-2050 instead of 2000-2200.[/QUOTE]

How many candidates were loaded into the server when you had the problems? I wonder if the database needs an index or two.
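
To illustrate what an index could buy here, below is a small sketch using Python's built-in sqlite3 as a stand-in for the MySQL server. The table and column names (`candidate`, `assigned`) and the "next test" query are hypothetical, not PRPnet's actual schema, and SQLite's query planner is not MySQL's, so this shows the general idea only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidate (name TEXT, k INTEGER, n INTEGER, assigned INTEGER)")
# Fill the table with made-up candidates of the form k*2^n-1.
rows = [(f"{k}*2^{n}-1", k, n, 0)
        for k in range(2001, 2021, 2) for n in range(50000, 50200)]
conn.executemany("INSERT INTO candidate VALUES (?, ?, ?, ?)", rows)

# Hypothetical query a server might run to pick the next unassigned test.
NEXT_TEST = "SELECT name FROM candidate WHERE assigned = 0 ORDER BY n, k LIMIT 1"


def query_plan(sql: str) -> str:
    """Return SQLite's description of how it would execute the query."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in plan_rows)


plan_before = query_plan(NEXT_TEST)
conn.execute("CREATE INDEX idx_candidate_next ON candidate (assigned, n, k)")
plan_after = query_plan(NEXT_TEST)

print("without index:", plan_before)
print("with index:   ", plan_after)
print("next test:    ", conn.execute(NEXT_TEST).fetchone()[0])
```

Without the index the planner has to walk and sort the whole table to find one row; with it, the lookup can go straight to the first matching entry. Whether the real server's queries look anything like this would need checking against its actual schema.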

mdettweiler 2010-01-22 21:18

[quote=rogue;202873]How many candidates were loaded into the server when you had the problems? I wonder if the database needs an index or two.[/quote]
There were about 726,000 candidates loaded at the time. I'm currently loading up about a quarter of that after having dumped out the server's DB.

mdettweiler 2010-01-22 21:23

Okay, I've now got the server loaded with just k=2000-2050. Let's see how that one works. :smile:

mdettweiler 2010-01-22 21:32

It seems to be working all right now--the server's taking about 2 seconds to respond to requests for work, which is pretty much perfect. I'll refrain from making hasty generalizations lest I be forced to later put my foot in my mouth...we'll see how it goes. :smile:
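
For a rough idea of the load this puts on the server, here is a back-of-the-envelope estimate in Python. The 50 cores, the 3-4 second test time and the 2-second response come from this thread. The model itself is an assumption: it counts one fetch and one return per batch, treats the return as instantaneous, and assumes the response time does not change with batch size.

```python
def requests_per_second(cores: int, test_s: float, response_s: float,
                        batch_size: int) -> float:
    """Estimate server requests/s: one fetch plus one return per batch, per core."""
    cycle_s = response_s + batch_size * test_s  # wait for work, then run the batch
    return cores * 2 / cycle_s


for batch in (1, 20):
    rate = requests_per_second(cores=50, test_s=3.5, response_s=2.0, batch_size=batch)
    print(f"batch size {batch:2d}: about {rate:.1f} requests/s")
```

Under this simple model a batch size of 20 cuts the request rate by more than a factor of ten, which is why a batch size of 1 is the right setting for a stress test and the wrong one for normal use.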

Mini-Geek 2010-01-22 21:34

I thought I should post my experiences with 3.1.3:
I've been using PRPnet 3.1.3 a bit, and while it all worked just fine in sending and receiving work on my local box, when I tried to run it from another machine on my network, the client would only understand a portion of what the server was sending. The rest were, like what is apparently happening here, being marked as reserved on the server, but not being received and run on the client. Getting "Candidate and/or test was not found" was pretty rare, but would happen with the entire batch whenever it did (4 such batches over ~4000 tests). I checked the server logs for one such candidate, and here's its story: (as the server knows it)
[code]prpserver.log:
[2010-01-21 18:05:42 GMT] 809997332*3^10319-1 sent to Email: tim.sorbera@gmail.com User: Mini-Geek Client: dad2
[2010-01-21 18:11:27 GMT] Test of 809997332*3^10319-1 for user tim.sorbera@gmail.com and client dad2 has expired.
(a few more assignments/expirations every 5 minutes, then:)
[2010-01-21 18:33:05 GMT] Test of 809997332*3^10319-1 for user tim.sorbera@gmail.com and client dad2 has expired.
[2010-01-21 18:33:08 GMT] 809997332*3^10319-1 sent to Email: tim.sorbera@gmail.com User: Mini-Geek Client: dad1
[2010-01-21 18:40:09 GMT] tim.sorbera@gmail.com (dad2): Test 1264098464 for candidate 809997332*3^10319-1 was not found

completed_tests.log:
[2010-01-21 18:33:22 GMT] 809997332*3^10319-1 received by Email: tim.sorbera@gmail.com User: Mini-Geek Client: dad1 Program: pfgw.exe Residue: A1599B2980DBDA7B
[/code]I had about 22000 candidates loaded into the server.
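
To make the order of events easier to see, here is a small Python sketch that merges the lines from the two logs above by timestamp and reduces each one to (time, client, event). The regular expressions are written against the lines quoted here only and are an assumption about the log format in general; the elided "a few more assignments/expirations" entries are left out rather than guessed.

```python
import re

# Lines copied from prpserver.log and completed_tests.log above, in time order.
LOG = """\
[2010-01-21 18:05:42 GMT] 809997332*3^10319-1 sent to Email: tim.sorbera@gmail.com User: Mini-Geek Client: dad2
[2010-01-21 18:11:27 GMT] Test of 809997332*3^10319-1 for user tim.sorbera@gmail.com and client dad2 has expired.
[2010-01-21 18:33:05 GMT] Test of 809997332*3^10319-1 for user tim.sorbera@gmail.com and client dad2 has expired.
[2010-01-21 18:33:08 GMT] 809997332*3^10319-1 sent to Email: tim.sorbera@gmail.com User: Mini-Geek Client: dad1
[2010-01-21 18:33:22 GMT] 809997332*3^10319-1 received by Email: tim.sorbera@gmail.com User: Mini-Geek Client: dad1 Program: pfgw.exe Residue: A1599B2980DBDA7B
[2010-01-21 18:40:09 GMT] tim.sorbera@gmail.com (dad2): Test 1264098464 for candidate 809997332*3^10319-1 was not found
"""

# One pattern per event type; each captures the client name.
PATTERNS = [
    ("sent", re.compile(r"sent to .* Client: (\S+)$")),
    ("expired", re.compile(r"and client (\S+) has expired\.$")),
    ("received", re.compile(r"received by .* Client: (\S+) Program:")),
    ("not found", re.compile(r"\((\S+)\): Test \d+ for candidate .* was not found$")),
]


def timeline(log: str) -> list[tuple[str, str, str]]:
    """Return (HH:MM:SS, client, event) for every recognised log line."""
    events = []
    for line in log.splitlines():
        stamp = line[12:20]  # the HH:MM:SS part of "[YYYY-MM-DD HH:MM:SS GMT]"
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((stamp, match.group(1), label))
                break
    return events


events = timeline(LOG)
for stamp, client, label in events:
    print(stamp, client, label)
```

One possible reading of the resulting timeline: by the time dad2 reported back at 18:40:09, its assignment had already expired and dad1 had returned a result for the same candidate, so the server no longer had a matching test on record.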

Now, in response to the recent change:
I can now get and return work on both computers, but now on both I sometimes (pretty often, maybe every 5-10 times I communicate) get the "No available candidates" message. I know there are fewer candidates, but it's not THAT much less! :grin:


All times are UTC.