mersenneforum.org  

Old 2015-08-05, 16:30   #419
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

14151₈ Posts

To address an issue raised in the CRUS forum, I will be upgrading PRPnet ports 2000 and 9000 to version 5.4.0 later today. This will result in a brief downtime for each port in turn, hopefully less than an hour.

The databases and config folders will be backed up beforehand so if anything goes wrong, I can quickly roll back to the current version and troubleshoot "offline". Version 5.4.0 has been around and used at PrimeGrid since January and has no regressions from 5.3.2 that I'm aware of.

Edit: I will also be upgrading NPLB port 1468. Almost forgot about that one.

Last fiddled with by mdettweiler on 2015-08-05 at 16:35
Old 2015-08-05, 22:53   #420
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3×2,083 Posts

Port 2000 is going down right now for upgrade. I will edit this post when it's back up.

I will only take down one port at a time, to ensure that the other can serve as a backup for people's clients.

Last fiddled with by mdettweiler on 2015-08-05 at 22:54
Old 2015-08-05, 23:57   #421
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3×2,083 Posts

Update (6:52 PM server time): Port 2000 is back up. Everything seems to be working fine, though there hasn't yet been much client activity to test it. I'll be keeping an eye on it and if anything goes terribly wrong, I can revert to the backups.

Dave: FYI, I needed to update the "prpnet-to-llrnet.pl" script to handle the changes to PRPnet's completed_tests.log format in version 5.2.0. Ever since I upgraded the CRUS servers last year, this script has been jumbling the LLRnet-formatted output files, but nobody's noticed because we don't actually use those files over at CRUS. At NPLB, however, the stats system is reading everything in LLRnet format, so this is important. I've tweaked the script so it should be able to handle both old- and new-format input files (and even both mixed in one file, as today's will be).

Now I'm taking port 9000 down for upgrade.

Last fiddled with by mdettweiler on 2015-08-05 at 23:58
Old 2015-08-06, 00:43   #422
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3×2,083 Posts

Update (7:35 PM server time): Port 9000 is back up.

For 9000, I upgraded only to 5.3.2 (the second-newest version, same as we're using at CRUS). I initially tried to upgrade it to 5.4.0, but the server was immediately besieged by "client too old, dropping connection" messages as Gary's old clients tried to talk to it. Since the older clients are unaware of this mechanism, they unwittingly kept hammering the server many times a second, effectively DoSing it. Oops... So I restored the backup I took before the upgrade, and re-upgraded to 5.3.2. Now it's working better.

5.3.2 is compatible with both the older and newer clients, so it will keep everyone happy until Gary can get his clients upgraded. 5.4.0 is a drop-in replacement for 5.3.2, so "finishing" the upgrade is very easy and quick.

Next, I will bring down port 1468 for upgrade. As with 9000, I will upgrade it only to 5.3.2 for now and hold off on 5.4.0 until Gary has upgraded his clients. (Port 2000 is already on 5.4.0, but I guess Gary will just have to stay off that server until he's upgraded. I don't believe anyone else is running clients that old, but if they are, the same goes for them.)
Old 2015-08-06, 00:52   #423
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3·2,083 Posts

Update (7:50 PM server time): Port 1468 is back up. Now all NPLB and CRUS servers are running PRPnet 5.x.

To summarize, the server versions are:
  • Port 2000: 5.4.0 (only 5.x clients can connect)
  • Port 9000: 5.3.2 (all clients can connect)
  • Port 1468: 5.3.2 (all clients can connect)
  • CRUS 1300 and 1400: 5.3.2 (all clients can connect)
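
The summary above can also be written down as a small lookup, which may be handy when deciding where to point a given client. This is only an illustrative Python sketch: `SERVERS`, `major`, and `can_connect` are hypothetical names, and the rule that a 5.4.0 server accepts only 5.x clients is taken from the list above, not from PRPnet's source code.

```python
# Hypothetical lookup built from the summary above; not part of PRPnet.
SERVERS = {
    2000: "5.4.0",
    9000: "5.3.2",
    1468: "5.3.2",
    1300: "5.3.2",  # CRUS
    1400: "5.3.2",  # CRUS
}


def major(version: str) -> int:
    """Return the major component of a dotted version string."""
    return int(version.split(".")[0])


def can_connect(port: int, client_version: str) -> bool:
    """Apply the rule from the summary: 5.4.0 servers accept only 5.x clients."""
    if SERVERS[port] == "5.4.0":
        return major(client_version) == 5
    # Per the summary, 5.3.2 servers accept older and newer clients alike.
    return True
```

For example, `can_connect(2000, "4.3.1")` is false under this rule, while `can_connect(9000, "4.3.1")` is true.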

I will continue keeping an eye on them over the next couple of days.

Also, Gary, when you get back from your trip, we can discuss upgrading your clients to 5.4.0.
Old 2015-08-06, 04:05   #424
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

1869₁₆ Posts
Port 9000 down for the moment

We are having some ongoing trouble with port 9000. Shortly after the upgrade, it was besieged by "The client is too old. The connection was dropped." messages repeating multiple times a second. This ultimately led to a server crash. Restarting the server doesn't seem to help, because it immediately gets stuck in the "The client is too old" loop again.

Please note that port 9000 is running PRPnet 5.3.2, the same version that has been performing stably and reliably at CRUS since last year. The only server running a newer version (5.4.0, which has also been around since January and is used on all of PrimeGrid's servers) is port 2000, which has not exhibited this issue yet.

I have a backup of port 9000's database and configuration from before the outage, so we always have the option of restoring it to version 4.3.6 and picking up where we left off. However, since we have other servers still operating reliably, I am going to try to diagnose this issue (off forum) with Mark before rolling back the server again. If it continues for more than a few days, I'll restore the backup and we'll investigate this on a test server.

The other servers (2000 and 1468, as well as 1300 and 1400 which have been on 5.3.2 since last year) are not exhibiting this problem, so if you have not done so already, please set your clients to use them as backups. Port 2000 is testing candidates very similar to what's in 9000 right now.

(P.S.: Gary, sorry to do this to you while you're away. Everything is under control since it's easy to restore from the backups...just wanted to give everyone a heads-up. Let me know if you want me to restore the backup at any time.)

Last fiddled with by mdettweiler on 2015-08-06 at 04:05
Old 2015-08-06, 04:30   #425
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

6249₁₀ Posts

As a follow-up...further troubleshooting seems to confirm that this is not an issue with Gary's old clients (which are running 4.3.1, not quite as old as I thought), or even necessarily with the PRPnet server code, since Gary's been running those clients without trouble on CRUS's v5.3.2 servers for a long time now. My hunch is that there's something especially "weird" going on with port 9000's database.

That said, I am just beginning debugging and can't say anything for sure. Long story short, all the other ports (both CRUS and NPLB) are doing fine and survived the upgrade apparently without issue. This is something weird with 9000.
Old 2015-08-06, 16:54   #426
mdettweiler
A Sunny Moo
 
Aug 2007
USA (GMT-5)

3×2,083 Posts
Port 9000 is back and better than ever!

Hi all,

The problems with port 9000 have been fixed. Now all of NPLB's servers have been successfully upgraded to v5.3.2 or newer.

It turns out that my original instinct was on the money. One "runaway" client of Gary's running the very old version 4.2.0 didn't know what to do with the new server, and "crashed": prpclient had gotten stuck in a loop trying to contact the server, over and over again. It was trying to do this so fast that it effectively DoS'd the server, bringing it to a halt. (Because Gary's clients are on the server's LAN, the bombardment had the full effect of a 100Mbps connection.)
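
For comparison, a client that waits longer after each failed attempt would not flood the server this way. The following is a minimal Python sketch of capped exponential backoff, not PRPnet's actual client code: `backoff_delays`, `connect_with_backoff`, and the `try_connect` callable are all hypothetical names, and the default delays are arbitrary.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=6):
    """Yield capped exponential delays (in seconds) between reconnect attempts."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def connect_with_backoff(try_connect, sleep=time.sleep, **kwargs):
    """Call try_connect() until it succeeds, sleeping longer after each failure.

    try_connect is a hypothetical callable that returns True on success.
    Returns True on success, or False once all attempts are used up.
    """
    for delay in backoff_delays(**kwargs):
        if try_connect():
            return True
        sleep(delay)
    return False
```

With these defaults a failing client makes six attempts spread over about a minute (delays of 1, 2, 4, 8, 16, and 32 seconds), rather than many attempts per second.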

I checked all of Gary's machines remotely (except his personal Windows box, which I don't have access to), and this appears to be the only one running a really old ("broken") version. The rest of his machines are all on 4.3.1 or 5.0.8 and playing nicely with the upgraded servers. (Well, I don't know how they'll react to port 2000 running 5.4.0...hopefully a bit more gracefully than the 4.2.0 client did. Gary doesn't have any clients on port 2000 right now.)

Max
Old 2015-08-07, 08:37   #427
AMDave
 
Jan 2006
deep in a while-loop

2·7·47 Posts

Good work. Thanks Max!
Old 2015-08-11, 23:40   #428
AMDave
 
Jan 2006
deep in a while-loop

2·7·47 Posts

Backups from 2015-08-10 02:10:38 restored successfully on the DR server, including all of the port upgrades.
Old 2015-12-24, 07:58   #429
AMDave
 
Jan 2006
deep in a while-loop

1222₈ Posts

NPLB update 2015-12-24.

Today I rolled out some subtle upgrades to the live stats pages. No outage was required.

Last week, the backups from 2015-12-18 02:08:40 were restored successfully on the DR server. The backup and restore process is still working as expected to both the local backups and the DR server.

Note that the NPLB DR server has been upgraded to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-65-generic x86_64) and everything is working well.

I am already at the point where I am happy that I can have the project operating on a replacement server anywhere in the world within a day or two. The only thing of concern would be the age of the last DR backup if a local backup could not be recovered from the previous day. I am still testing the DR restore manually every fortnight (or so) to make sure we are never too far behind.
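
The "age of the last DR backup" concern could be checked mechanically along these lines. This is a rough Python sketch under stated assumptions: `backup_is_stale` is a hypothetical helper, the timestamp format simply mirrors the one quoted in this post, and the 14-day default is my reading of "every fortnight (or so)", not a documented threshold.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching "2015-12-18 02:08:40" as quoted above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def backup_is_stale(last_backup: str, now: str, max_age_days: int = 14) -> bool:
    """Return True if the last backup is older than max_age_days at time `now`."""
    age = (datetime.strptime(now, TIMESTAMP_FORMAT)
           - datetime.strptime(last_backup, TIMESTAMP_FORMAT))
    return age > timedelta(days=max_age_days)
```

For instance, the 2015-12-18 backup checked on 2015-12-24 is about six days old, so it would not be flagged under the assumed 14-day limit.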

A couple of weeks ago I ran some operational recovery tests on the DR server. This did raise a remote connection config issue that I need to resolve at cut-over, but it looks like everything is working fine - although GB's remote desktop would look a bit different.

That's all the fiddling I am doing for 2015.

Best wishes to all for the holiday season

(PS - don't forget to clean out yer dust-bunnies)

Cheers.
AMDave

Last fiddled with by AMDave on 2015-12-24 at 08:23

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.