mersenneforum.org

Server outrages (https://www.mersenneforum.org/showthread.php?t=13840)

henryzz 2021-08-01 15:06

[QUOTE=gd_barnes;584542]The servers and NPLB pages are down. I am getting SQL errors in trying to restart them. I did a little bit of looking into it and could not come up with anything quickly. Unfortunately I am scheduled to leave town today and don't have time to fix it. I should be back by the end of the week. I'll begin looking at it then.

Sorry about the problem.[/QUOTE]


Have you considered hosting your servers in the cloud? You might find that more reliable, although there might be a bit of work to transition them.

gd_barnes 2021-08-02 06:18

Hell no! :-)

gd_barnes 2021-08-06 22:04

Help needed!

Here is the status for SQL databases:

The NPLB stats database is back up.

The PRPnet servers are not back up.

I ran into a space issue on my server machine so I deleted a bunch of old stuff and now there is plenty of space.

The PRPnet servers were previously corrupted as a result of attempting to update themselves when there was not enough space. There appears to be little to no loss of data except maybe the last few results that were submitted.

What I need to do now is dump the databases, drop them, recreate them, and reload them so that the corruption is fixed. This is similar to what I did in Jan. 2020 when there was a corruption. I've done it before with help from others, so I have a good idea of the process involved. Like before, I've run into various issues, such as MySQL not starting. I got that issue fixed with this:

in /etc/mysql/my.cnf I put:
innodb_force_recovery = 5

After rebooting that worked and got SQL restarted. Now I need to dump the databases. But I am stuck. I'm logged in as root. I ran the following to dump one of the databases:

root@jeepford:/etc/mysql# mysqldump -u gary -p prpnet2000 > dump2000.sql

After entering that it prompted me for my password. I entered it. But I'm getting access denied.

Here is the message:
mysqldump: Got error: 1045: Access denied for user 'gary'@'localhost' (using password: YES) when trying to connect

I've also tried it with -u root. Same problem.

I did not have this problem dumping the databases back in Jan. 2020. My password is a standard one that I've used for the life of the machine so that cannot be the issue.

I've done a lot of google searching for this and nothing I have tried seems to work.

I've tried it with SQL both started and stopped. The SQL status looks good at the moment.

Any help would be appreciated.
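For reference, the dump, drop, recreate, reload sequence described above can be laid out as a list of commands. This is only a minimal sketch: it prints the commands rather than running them, the function name `rebuild_plan` is made up for illustration, and the user, database, and file names are the ones from the command in this post. It uses only standard `mysqldump` and `mysql` client options.

```python
import shlex

def rebuild_plan(db, user, dump_file):
    """Return the shell commands for dump -> drop -> create -> reload, in order.

    Nothing is executed here; the caller decides whether to run each step.
    """
    q = shlex.quote
    drop_sql = "DROP DATABASE `" + db + "`"
    create_sql = "CREATE DATABASE `" + db + "`"
    return [
        # 1. Dump first, and check the dump file before touching anything else.
        f"mysqldump -u {q(user)} -p {q(db)} > {q(dump_file)}",
        # 2. Drop the corrupted database.
        f"mysql -u {q(user)} -p -e {q(drop_sql)}",
        # 3. Recreate it empty.
        f"mysql -u {q(user)} -p -e {q(create_sql)}",
        # 4. Reload the dump into the fresh database.
        f"mysql -u {q(user)} -p {q(db)} < {q(dump_file)}",
    ]

for command in rebuild_plan("prpnet2000", "gary", "dump2000.sql"):
    print(command)
```

One caveat, as far as I know from the MySQL documentation: recent MySQL versions put InnoDB into read-only mode when innodb_force_recovery is 4 or higher, so the drop and reload steps would likely need that line removed from my.cnf and the server restarted first.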

gd_barnes 2021-08-07 01:32

I have answered my own question in the last post. Now I have another problem. I have now dumped all of the databases. I am now trying to drop them. I get the following error message:

ERROR 1010 (HY000): Error dropping database (can't rmdir './prpnet1400', errno: 13)

Once again I've done a lot of googling. It did tell me that it is some sort of permission or access issue. But I've been unable to fix it.

Once again any help would be appreciated.
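The "errno: 13" at the end of that message is an operating system error number, and it can be looked up from Python's standard library. A small sketch, assuming a Linux machine (the numbers are platform-defined):

```python
import errno
import os

code = 13  # the errno reported in the MySQL error above

# Symbolic name of the error number on this platform.
print(errno.errorcode[code])
# Human-readable description from the C library.
print(os.strerror(code))
```

On Linux, 13 is EACCES (permission denied), which fits the googling result: the MySQL server process was probably not allowed to remove the ./prpnet1400 directory or something inside it. The exact cause on this machine is not stated in the thread.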

gd_barnes 2021-08-07 03:17

Once again I answered my own question in the last post.

All of the PRPnet servers are back up and running! No data appears lost. :smile:

It is not clear to me whether or not the NPLB stats were corrupted. To be safe I'm going to drop it, restart it, and reload it. It is a huge file (4.3 GB) that will likely take a couple of hours to reload. So the NPLB stats will be down for a couple hours while that is done.

I have extended the expiration time to 1 week for tests that were left in limbo while the servers were down. Since the servers have been down ~5-6 days that'll give everyone ~1-2 days to return their completed tests. If your clients have some completed tests you can restart them and they should return them before requesting new work.

Sorry that the problem occurred right before I left town.

AMDave 2021-08-17 15:54

Hi Gary. Got your DM. Hi there NPLB! Good to see you again. I will be home tomorrow night (AEST) to login, examine and report.
I have kept my DEV server and the NPLB DR server in 'stasis' in my RedBack rack. Ooh. A few years now I would guess - checking - yup. since late 2016. It has been a while.
I will fire them up for testing. Let's see if those capacitors are as good as the price I paid for them! Else I will yank the HDDs into the workstation. No time to waste.
If the NPLB backup files are good on your end, as we designed them to be, then we have a very high probability of a great outcome.
:)
Sorry everyone. I do not have any pleasant hold music to offer you. You will have to go online and stream your own. It's 2021. Otherwise I will hum. Hum hum hum hum hum. No. You don't want that. Come on. Seriously. Move along please. You can come back later.

AMDave 2021-08-17 17:21

Cursory remote examination from afar:
Log files showing the result reception was absent between 02-Aug and 07-Aug but since then all is good with the result reception and the processing mechanism itself appears to be OK also.
The project stats pages appear to be working (on face value)
I will need to check the port-based databases and statistics pages
From this thread it looks like you had an issue with port 2000 but it currently appears to be pumping data.

Maybe one of you fixed the principal issue?
Did you turn it off and on again? :)

I will still log in, [URL="https://www.google.com/search?q=shirt+front&oq=shirt+front"]shirt-front[/URL] it and look it up and down and ask it "Are you still a working server?"
Until it says yes ;)

gd_barnes 2021-08-17 20:36

[QUOTE=AMDave;585900]Cursory remote examination from afar:
Log files showing the result reception was absent between 02-Aug and 07-Aug but since then all is good with the result reception and the processing mechanism itself appears to be OK also.
The project stats pages appear to be working (on face value)
I will need to check the port-based databases and statistics pages
From this thread it looks like you had an issue with port 2000 but it currently appears to be pumping data.

Maybe one of you fixed the principal issue?
Did you turn it off and on again? :)

I will still log in, [URL="https://www.google.com/search?q=shirt+front&oq=shirt+front"]shirt-front[/URL] it and look it up and down and ask it "Are you still a working server?"
Until it says yes ;)[/QUOTE]

I'm very happy to hear from you Dave! Thank you for responding!

Around Aug. 1st the hard drive filled up. This caused a serious corruption of all SQL servers.

Here is what I did on or about Aug. 7th:
1. Deleted a lot of old files to clear some space up. There is now 80-100 GB of free space.
2. Backed up all the SQL servers.
3. Dropped them and re-created them.
4. Reloaded them from the backups.

Yep I've rebooted the machine several times.

This all appeared to work fine. Since then here is what is happening:
1. The PRPnet servers are working fine.
2. The hourly NPLB stats update is not working properly.
3. Sometimes the hourly Port Report works and sometimes it doesn't. At this moment there is nothing in the recent progress by port.
4. The hourly update appears to happen at random times. As of this moment it is ~15:35 local time. The last update was at 15:23, just a few mins ago. But the update prior to that was at 13:27. It should be updating once an hour and finish updating at around 5-10 minutes after the hour like it has always done.
5. At this moment the statistics of top participants, top teams, and stats by server have no data in them. For reference top participants are shown here: [URL]http://www.noprimeleftbehind.net/stats/index.php?content=participant_stats[/URL]
When they do work they only show the total stats of the last few months instead of the entire project.

When I re-loaded the NPLB stats DB the grand total stats were all there from Aug. 1st. But for some reason it wants to completely rebuild them each hour, or at whatever random time the update actually occurs. I have concluded that the statistics are somehow "added to" once an hour from the grand total that was calculated the previous hour using results received in that prior hour. But for some reason that grand total keeps getting wiped out.

My opinion as to what is happening:
1. Somehow the grand total stats from the previous update time keep getting wiped out. I suspect this is because it does not have a "pointer" to the cut-off at which the total stats were last calculated.
2. The server not sensing any previous stats attempts to bring in as many results as it can in order to update the total stats.
3. It realizes there are too many "new" results since it cannot sense the pointer in #1 so it finally stops processing at some random point. This seems to take as long as 2-3 hours sometimes hence the random times of the updates.
4. Based on #3 sometimes the new stats are left blank and sometimes they are just left with stats totals over the last few months. At this moment as I write this they are blank.

I have limited knowledge of SQL DBs. The process here for NPLB stats is complex due to the hourly update, but for the PRPnet servers it's not too bad, so I was able to get them all reset and running properly. We need some help on the NPLB stats.

Thank you! :smile:

Gary

AMDave 2021-08-18 15:10

Your port servers are still live and available so keep 'em coming ;)

Stats Maintenance in progress.
Some issues found.
Some tables rebuilt.
Stats refreshes are back on line.
Tomorrow I will add a preventative patch.

gd_barnes 2021-08-18 19:05

At this moment (14:05 local time) the hourly port report, top participants, and top teams all show nothing. This includes the progress_crosstab, participant_stats, and team_stats tables.

The stats by server (server_stats) only has the recent 3 servers and the stats are all duplicated.

The last update is at 13:09 local time so that is as expected.

***

Edit: Another update at 14:09. Things are different now. The hourly port report (progress_crosstab), top participants (participant_stats), and stats by server (server_stats) all show nothing. The top teams (team_stats) only has recent teams in it.

This is fairly similar to what was happening before. The main difference is that it updates hourly when it should.

***

Edit 2: Another update at 15:10. Different again. progress_crosstab is empty, participant_stats has only the last few months' data in it, team_stats is empty, and server_stats only has the last few months' servers in it.

It only seems to want to include recent stats. Older stats seem to be wiped out. Another thing that is different right now is that the hourly port report (progress_crosstab) has been empty every hour. Before (since Aug. 7th) it was working a majority of the time.

AMDave 2021-08-19 14:31

Patches applied.
Stats are back out of maintenance mode.
Monitoring in progress.

gd_barnes 2021-08-19 19:28

Looks great! :smile:

gd_barnes 2021-08-20 04:44

Stats are back in maintenance mode.

AMDave 2021-08-20 14:10

At some point there was an event that caused corruption in MySQL, and that corruption was retained in the backups, so restoring from backup could not excise the problem.
The old prpnet1470 database was already unrecoverable, but I had to forcibly remove it to restore integrity.

The processing logs were really quite perplexing until the timestamps revealed the picture.
Somewhere in MySQL was enough of a problem to cause the sql steps to take longer and longer to complete, until they were too long.
Eventually the daily and hourly processes overlapped.
As one process created the update tables another one removed them.
That is how the stats summaries disappeared.

I re-optimised the configuration of the MySQL server itself.
I have re-optimised all of the databases: both ports and stats.
It was close to optimal but I have refined it further.
Bounced it and benchmarked it several times.
It's looking better, for an old dog. (It is a great grandfather now!)

I have merged the daily and hourly stats processes so there is now only one process.
That comes at a small cost in processing time.
The original hourly process target was 3 minutes and the daily process time was 10 minutes.
The single process will now run every hour including midnight and take up to 20 minutes depending on how many results you can pump into it :)
I have removed the weakest point in the process (the 'blind-update-and-switch' summary step) which is where the web stats were disappearing from.

And finally I have added a 'hard' semaphore that prevents the process from running over itself.
So the process will attempt to run each hour, as before, but if a massive slowdown occurs again it will run to completion without interruption and will not start again until the start of the hour after it has completed.

I will have to come back and change that hard semaphore to a soft semaphore at a more convenient time.
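To illustrate the idea of a guard that stops an hourly job from running over itself, here is a minimal sketch using a non-blocking file lock. It is not the actual NPLB implementation, which is not shown in this thread. The lock path and the function names `try_acquire`, `release`, and `run_once` are made up for the example, and `fcntl` is Unix-only.

```python
import fcntl
import os
import tempfile

# Hypothetical lock location; any path writable by the stats job would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "nplb_stats_update.lock")

def try_acquire(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

def release(handle):
    """Give the lock back so the next scheduled run can start."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()

def run_once(update, path=LOCK_PATH):
    """Run update() unless a previous run still holds the lock."""
    handle = try_acquire(path)
    if handle is None:
        return False  # previous run still in progress; skip this hour
    try:
        update()
        return True
    finally:
        release(handle)
```

A scheduler would call `run_once(...)` every hour. If the previous run is still going, the call returns False straight away and the next attempt happens an hour later, which matches the behaviour described above. The operating system drops the lock if the process dies, so a crash does not leave the job blocked.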

Back out of maintenance mode.
Monitoring the log files closely.

gd_barnes 2021-08-20 20:09

Thank you very much, Dave, for that tremendous effort. :smile:

I had seen that prpnet1470 caused a big issue over a year ago. Hence I never attempted to run it again. It's too bad that its corruption caused the backups to become corrupted.

To everyone: The NPLB stats pages look good now!

odicin 2021-08-27 05:19

Very nice... :smile: I noticed that there are still some incorrect entries on the [URL="https://stats.free-dc.org/stats.php?page=proj&proj=nplb"]free-dc stats page[/URL] (Top 10 Users, Show All entries). Possibly, the export for this also has to be touched again.


Regards Odi

AMDave 2021-09-08 12:44

I have posted on FDC asking Bok to refresh the project stats from the latest NPLB extracts.

AMDave 2021-09-09 11:47

Bok advised that his new stats server no longer refreshes the stats of non-BOINC projects.
However he is looking into making NPLB an exception.
There is a bit of work and goodwill involved so if it happens don't forget to tell him thank you :)

Mini-Geek 2022-04-17 17:39

Looks like NPLB servers are down since 11am. :exclaim:

gd_barnes 2022-04-17 19:56

My internet service is down. I am putting in a call to Spectrum now.

Sorry about the outage.

gd_barnes 2022-04-17 21:54

I definitely am having a problem with my router. I’ve done all the usual reboots and restarts. Unplugging and plugging back in. Etc.

Spectrum confirmed the issue. Likely the earliest that they can be here is around 11am CDT US on Monday. (4pm GMT). There is a small chance that they may be able to come out this evening.

I will continue to keep everyone updated.

gd_barnes 2022-04-18 00:45

Spectrum will not be here until Monday around 11am local time.

In the meantime if you are running an NPLB PRPnet server you can try to connect with the following:
server=G9000:100:1:192.168.0.110:9000

This would be entered in your prpclient.ini file if you are running port 9000.

You can also try connecting to the main NPLB page by entering 192.168.0.110 for the URL.

I have extended the expiration on all the work units to 48 hours.

WraithX 2022-04-18 01:55

[QUOTE=gd_barnes;604179]You can also try connecting to the main NPLB page by entering 192.168.0.110 for the URL.[/QUOTE]

Just to let you know, 192.168.x.x is not routable on the internet; it is a local-only (private) subnet. No one but you will be able to reach that server via that IP address.

You can read more about private networks on Wikipedia: [url]https://en.wikipedia.org/wiki/Private_network[/url]
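A quick way to check whether an address is private before posting it is Python's standard `ipaddress` module. A small sketch; the helper name `is_lan_only` is made up, and 8.8.8.8 is just an arbitrary public address added for contrast:

```python
import ipaddress

def is_lan_only(text):
    """True if the address belongs to a private (non-internet-routable) range."""
    return ipaddress.ip_address(text).is_private

for candidate in ("192.168.0.110", "8.8.8.8"):
    kind = "private, LAN only" if is_lan_only(candidate) else "public"
    print(candidate, "->", kind)
```

192.168.0.110 falls inside 192.168.0.0/16, one of the private blocks, so it is reported as private.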

gd_barnes 2022-04-18 02:13

Thank you for the info! 😊

I should have known that!

gd_barnes 2022-04-18 14:11

Everything is back up! :-)

I had a sliced wire around the box in the back yard that has now been replaced.

My apologies for the outage.

gd_barnes 2022-04-19 18:12

My internet service is down again. Spectrum has been called. Someone should be here within 2 hours of this posting.

gd_barnes 2022-04-19 22:33

They had to replace my router. That means that I had to set up port forwarding to allow connections to the PRPnet servers. I've done that. But there may be some other things that I need to do to set up the new router successfully. It will likely be several hours or the better part of a day before I get everything figured out.

gd_barnes 2022-04-19 23:04

They are going to have to replace my router again due to lack of WiFi on my end. This will be done on Wednesday about 3-4 PM CDT (8-9 PM GMT).

After that it may be another day before everything on the new router is set up correctly so that the servers and noprimeleftbehind pages point to the correct place. Unfortunately I'm flying blind here somewhat.

gd_barnes 2022-04-20 09:36

I finally figured out how to correctly do the port forwarding on the new router. So all of the pages and servers are working correctly now. :smile:

Unfortunately they have to replace the router again later today as per my previous post due to WiFi issues here. So that means there will be another outage at that time.

The good news is now that I've figured out how to properly configure the port forwarding on a new router, the outage shouldn't be long...perhaps 2-3 hours if everything goes well.

gd_barnes 2022-04-20 20:13

Well...They replaced my router with a Spectrum brand router. Previously I had an ARRIS router. Now I have to teach myself how the port forwarding works on it. It will probably be several hours before the web pages and servers are available.

gd_barnes 2022-04-20 21:36

I got the port forwarding figured out for the PRPnet servers. So those are now up and running.

I'm still having problems getting the forwarding figured out for the noprimeleftbehind main and server pages on internal port 80 so those are not publicly viewable yet.

I do have the server machine (with all of the web pages and PRPnet servers) set up with a static internal IP address. I can confirm that the web and server pages are still there and updating properly by viewing them on that internal IP address. But the port forwarding on port 80 to allow public access is still an issue.

This is a big learning process.
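One way to test whether a forwarded port is reachable is a plain TCP connection attempt. A minimal sketch, with a made-up helper name `port_is_open`; it only checks that something accepts the connection, not that the right page is served:

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False

# Example call (not run here):
# print(port_is_open("www.noprimeleftbehind.net", 80))
```

The check is most meaningful when run from outside the home network. From a machine behind the same router the result can be misleading, since many routers handle connections to their own public address differently from connections arriving from the internet.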

Trilo 2022-05-07 01:46

Looks like port 9000 is not reporting on "recent progress".

gd_barnes 2022-05-07 02:27

[QUOTE=Trilo;605392]Looks like port 9000 is not reporting on "recent progress".[/QUOTE]
Weird. I'm not sure what causes that. All servers have been up the entire time. I've seen it "skip" an hour or two for just the stats for one or more ports at times and then fully update itself for all hours in the 2nd or 3rd hour. This time it skipped 3 hours for only port 9000 and then updated for all hours in the 4th hour. No info was lost. It just seems to be delayed at times. It's up to date now and looks accurate.

For future reference this occurred for the hours of 18, 19, and 20:00 local time and then fully updated everything for the 21:00 hour. I'm going to see if there is a pattern of the time of day that it happens.

gd_barnes 2022-05-18 11:49

We had a power blip around 4-4:30 AM local time (9-9:30 AM GMT). I got the servers back up and running within 5-10 minutes. Unfortunately I just now noticed that the NPLB stats pages have not been updating since then. I will see if I can fix the issue later today.

gd_barnes 2022-05-18 19:24

The hourly stats are now working again. :smile:

No data was lost.


All times are UTC.