mersenneforum.org (https://www.mersenneforum.org/index.php)
-   PrimeNet (https://www.mersenneforum.org/forumdisplay.php?f=11)
-   -   OFFICIAL "SERVER PROBLEMS" THREAD (https://www.mersenneforum.org/showthread.php?t=5758)

Madpoo 2014-08-25 17:51

[QUOTE=Madpoo;381370]Thanks for letting me know it's working again. I trunc'd the logfile to shrink it down.

The issue did start showing up at 8:01 AM UTC today so that corresponds to when you saw your thing start failing, give or take.

I'm not sure what caused the log to grow... if you're just doing things that result in reads from the DB then it shouldn't be logging squat, but maybe there's something internal I don't know about where it "touches" a timestamp in some table, resulting in excessive writing... I have no idea, really.

I guess I'll keep an eye on the log today and see if it recurs and maybe dig through the IIS logs to see if there were any odd patterns leading up to the issue.[/QUOTE]

My best guess is that it was a nightly job to reorganize indexes that runs at 7:45 AM UTC, just 15 minutes before the server started throwing messages about the log drive being full. It shouldn't have resulted in log growth, but stranger things have happened. I've run it previously, and it took a while to do the initial index defrag, but that was it. It's been running at the same time every day for the past week, since we moved to the new server, and hasn't had a problem... it only takes a couple of minutes to run now.

To be on the safe side I removed that task. It may have interfered with a recently scheduled log backup task that runs around the same time, so that could have been it: backing up the log while the reorg was taking place.

Once things have settled I can add the index task back and make sure things are scheduled without any possible overlapping time periods. At some point George is going to rebuild the indexes anyway and maybe we can optimize how they're stored.
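The suspected failure mode here, two maintenance windows colliding, comes down to a simple interval-overlap check. A minimal Python sketch (the start times are from this thread; the durations are illustrative assumptions, not measured values):

```python
from datetime import datetime, time, timedelta

def windows_overlap(start_a, dur_a_min, start_b, dur_b_min):
    """True if two daily maintenance windows overlap.

    Windows are given as a UTC start time plus a duration in minutes;
    they're compared on an arbitrary anchor date.
    """
    day = datetime(2014, 8, 25)  # anchor date, any date works
    a0 = datetime.combine(day, start_a)
    a1 = a0 + timedelta(minutes=dur_a_min)
    b0 = datetime.combine(day, start_b)
    b1 = b0 + timedelta(minutes=dur_b_min)
    # Two half-open intervals overlap iff each starts before the other ends.
    return a0 < b1 and b0 < a1

# Index reorg at 07:45 UTC, log backup at 08:00 UTC: if the reorg runs
# long (say 20 minutes instead of "a couple"), the windows collide.
print(windows_overlap(time(7, 45), 20, time(8, 0), 5))   # True
print(windows_overlap(time(7, 45), 10, time(8, 0), 5))   # False
```

Scheduling the two jobs so that the worst observed runtime of one still ends before the other begins avoids the collision without needing any locking.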

Sorry for the inconvenience folks.

TheMawn 2014-08-25 20:16

I don't know if this should go in the Server Problems thread or the software thread.

My Prime95 client was getting the database error 11 code a bunch of times since it attempted to resubmit the result every 70 minutes. At 3:24 AM GMT -6:00 (Saskatchewan Time) I got the first error and then I got them over and over until 9:14.

Then, instead of trying to communicate again, it "deleted unprocessed message from spool file".

I submitted the line of results manually and it went through (as opposed to giving me a result not needed error). If this isn't just something that happened to me, I think any results that were to be automatically communicated by Prime95 during the outage are going to be missed.

Prime95 2014-08-25 20:30

[QUOTE=TheMawn;381395]I don't know if this should go in the Server Problems thread or the software thread.

My Prime95 client was getting the database error 11 code a bunch of times since it attempted to resubmit the result every 70 minutes. At 3:24 AM GMT -6:00 (Saskatchewan Time) I got the first error and then I got them over and over until 9:14.

Then, instead of trying to communicate again, it "deleted unprocessed message from spool file".

I submitted the line of results manually and it went through (as opposed to giving me a result not needed error). If this isn't just something that happened to me, I think any results that were to be automatically communicated by Prime95 during the outage are going to be missed.[/QUOTE]

Can you email your prime.log file to me?

Madpoo 2014-08-26 03:37

[QUOTE=TheMawn;381395]I don't know if this should go in the Server Problems thread or the software thread.

My Prime95 client was getting the database error 11 code a bunch of times since it attempted to resubmit the result every 70 minutes. At 3:24 AM GMT -6:00 (Saskatchewan Time) I got the first error and then I got them over and over until 9:14.

Then, instead of trying to communicate again, it "deleted unprocessed message from spool file".

I submitted the line of results manually and it went through (as opposed to giving me a result not needed error). If this isn't just something that happened to me, I think any results that were to be automatically communicated by Prime95 during the outage are going to be missed.[/QUOTE]

I can actually take all of the result check-ins to the API during the time when SQL was having trouble and "replay" them. I'd have to think about this and understand the consequences before doing anything like that.

I'll have to look at the API documentation and figure out which of the queries are actually checking in a result compared to requesting new work... I don't want to request new work. :)

I'd definitely check with George and see if that's even an option, much less if it's a good idea or not. At the very least maybe I can get the data and then double-check in the database and see if clients actually checked the same thing in later on their own.
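Separating "result check-in" calls from "request new work" calls, as described above, amounts to classifying logged API requests by their operation code. A rough Python sketch; the `t=` query parameter and the op-code values used here are assumptions for illustration, not the documented PrimeNet API:

```python
# Hypothetical classifier for IIS log lines from the API during the outage.
# The "t=" parameter and these op codes are assumed, not the real protocol --
# substitute whatever the actual API documentation specifies.
RESULT_OPS = {"ar"}      # assumed: "assignment result" check-in
WORK_OPS = {"ga", "au"}  # assumed: "get assignment" / "assignment update"

def classify(log_line):
    """Return 'result', 'work', or 'other' based on the t= op code."""
    for field in log_line.split():
        if field.startswith("t="):
            op = field[2:]
            if op in RESULT_OPS:
                return "result"
            if op in WORK_OPS:
                return "work"
    return "other"

lines = [
    "2014-08-25 08:05:12 GET /v5server/ t=ar 500",
    "2014-08-25 08:06:40 GET /v5server/ t=ga 500",
    "2014-08-25 08:07:01 GET /robots.txt 200",
]
print([classify(line) for line in lines])  # ['result', 'work', 'other']
```

Only the lines classified as results would be candidates for replay; work requests and progress updates can safely be ignored, since clients re-send those anyway.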

LaurV 2014-08-26 04:05

That would be good as an exercise on your side, but from the "production" point of view, you should let the users deal with their own submissions. All my submissions went through in the end, and for the few that didn't, I checked and resubmitted. If you start tinkering with the database, we risk reporting work that was not actually done, which would be worse (imagine missing a prime due to some wrong LL residue; it would not be found for years!). The worst that can happen if you do nothing is that the work for [U]two days[/U] only (and only for a few "negligent" users who don't follow up on their submissions) would have to be redone, which is just two days of work lost. No other negative consequences. So, my advice: if it works, don't mess with it. :smile:

Madpoo 2014-08-26 04:18

[QUOTE=Madpoo;381425]I can actually take all of the result check-ins to the API during the time when SQL was having trouble and "replay" them. I'd have to think about this and understand the consequences before doing anything like that.

I'll have to look at the API documentation and figure out which of the queries are actually checking in a result compared to requesting new work... I don't want to request new work. :)

I'd definitely check with George and see if that's even an option, much less if it's a good idea or not. At the very least maybe I can get the data and then double-check in the database and see if clients actually checked the same thing in later on their own.[/QUOTE]

There were 4000 attempts made during the affected time period where a client was trying to do a final check-in of an assignment. There were also another 29 that were attempts to update things like "no factors from 2^65 to 2^66" but were just progress reports... those will check in later on their own.

Those 4000 contain duplicates... I didn't filter out the dupes, so it's not really 4000. That would be impressive if it were. :)

All of the rest of the connections were updating estimated completion, trying to get new assignments, etc. The type of stuff that clients would retry anyway later on.

(EDIT: I got my check in status backwards... fixed my info)
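Since clients retry the same result every ~70 minutes, collapsing those 4000 attempts down to unique check-ins is a matter of keying on the client and assignment identifiers. A minimal sketch, with field names assumed for illustration:

```python
# Sketch: collapse repeated check-in attempts (client retries roughly every
# 70 minutes) to unique (cpu_guid, assignment_guid) pairs.
# The field names are assumptions, not the actual server schema.
def unique_checkins(attempts):
    """Count distinct check-ins among possibly-duplicated attempts."""
    seen = set()
    for a in attempts:
        seen.add((a["cpu_guid"], a["assignment_guid"]))
    return len(seen)

attempts = [
    {"cpu_guid": "cpu-1", "assignment_guid": "asn-7"},
    {"cpu_guid": "cpu-1", "assignment_guid": "asn-7"},  # retry of the same work
    {"cpu_guid": "cpu-2", "assignment_guid": "asn-9"},
]
print(unique_checkins(attempts))  # 3 attempts, 2 unique
```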

Madpoo 2014-08-26 05:04

[QUOTE=LaurV;381426]That would be good as an exercise on your side, but from the "production" point of view, you should let the users deal with their own submissions. My all submissions went through at last, and if few were not, then I checked and resubmitted. If you start tweaking with the data base, then we risk reporting work which was not done, which would be worse (imagine missing a prime due to some wrong LL residue, this will not be found for years!). The worse can happen if you do nothing, is that the work for [U]two days[/U] only (and for few "negligent" users only, who don't follow their submissions), would have to be redone, which is just two days of work lost. No other negative consequences. So, my advice, if it works, don't freak with it. :smile:[/QUOTE]

Oh yeah, I wouldn't actually *do* something like that without actually seeing if it's needed and if George thought it was okay to use that info as a way to integrate anything that might be missing.

From a safety point of view, all the data is present that would be there if a user did a manual check-in, which is probably how it would be done (not just resubmitting it through the API). Just an option if it comes to it.

After removing duplicate attempts by clients to check in their results, there's something like 1150 during that time period.

It might not be an issue at all... I spot checked a few and they show up in the database just fine, meaning the client connected later successfully and checked everything in. All of the ones I looked at got checked in later in the day once SQL was fixed.

That's why I thought it might be good to at least run through them and just make sure nothing was missing in the DB, but I feel pretty good that the clients did what they should... the API should have given a result code back to the client that there was an error.

Manual results, same thing... it showed that ugly error (which it shouldn't have... the code needs some error handling) but otherwise it should have been clear to anyone to try again later.

Madpoo 2014-08-26 20:30

[QUOTE=Madpoo;381428]Oh yeah, I wouldn't actually *do* something like that without actually seeing if it's needed and if George thought it was okay to use that info as a way to integrate anything that might be missing.

From a safety point of view, all the data is present that would be there if a user did a manual check-in, which is probably how it would be done (not just resubmitting it through the API). Just an option if it comes to it.

After removing duplicate attempts by clients to check in their results, there's something like 1150 during that time period.

It might not be an issue at all... I spot checked a few and they show up in the database just fine, meaning the client connected later successfully and checked everything in. All of the ones I looked at got checked in later in the day once SQL was fixed.

That's why I thought it might be good to at least run through them and just make sure nothing was missing in the DB, but I feel pretty good that the clients did what they should... the API should have given a result code back to the client that there was an error.

Manual results, same thing... it showed that ugly error (which it shouldn't have... the code needs some error handling) but otherwise it should have been clear to anyone to try again later.[/QUOTE]

Well, I spent a little time today sifting those logs and comparing with what got checked into the database since it's been 24 hours.

There were 99 attempts to check in an LL result during the outage period, and 66 of them checked in their results later, so all is well with those.

The other 33 haven't checked in their results. I'm not sure if they'll retry, like maybe the client will do it on its own, or what, but there it is.

I haven't checked the factoring check-ins but that's my next stop.

I guess for now I'll send George a list of those 33 and we'll see if they check themselves in after a few more days or whatever. The list has all of the info: CPU GUID, assignment GUID, LL residues, error counts, etc... everything a client would send as part of its check-in, so it should be possible to manually take care of those if that's what George decides is the best course of action.

I imagine the API code might need a revisit to make sure errors from the SQL side are handled and it won't let the client think everything was okay when it really wasn't. :smile:

Meanwhile I'm getting the actual physical hardware set up this week, and let's just say, if this happened again for some reason (log file growing like crazy), the drive for the SQL log is 144 GB, so at the very least it will take longer to fill up and throw errors... and I really should put something like SQL Sentry on there to start emailing when things get ugly.
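The error-handling gap mentioned above, where the API could let a client believe a failed check-in succeeded, could be closed by mapping database failures to an explicit retryable error code. A hypothetical sketch; the code values and function names here are invented for illustration, not the real API:

```python
# Hypothetical sketch: an API handler that maps a database failure to an
# explicit "retry later" code, so the client keeps the result spooled
# instead of deleting the unprocessed message. Codes/names are invented.
PNERR_OK = 0
PNERR_DB_UNAVAILABLE = 3  # assumed "temporary server error, retry later"

class DatabaseError(Exception):
    """Raised when the result cannot be written (e.g. log drive full)."""

def handle_result_checkin(store_result, payload):
    """Store a result; on DB failure, return a retryable error code."""
    try:
        store_result(payload)
    except DatabaseError:
        return PNERR_DB_UNAVAILABLE  # client should retry, not discard
    return PNERR_OK

def broken_store(payload):
    raise DatabaseError("log drive full")

print(handle_result_checkin(broken_store, {"residue": "ABCD1234"}))  # 3
```

The key design point is that the client's "delete unprocessed message from spool file" path should only ever trigger on a definitive response, never on a transient server-side failure.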

kladner 2014-08-26 20:58

Many thanks for all your efforts, and the efforts of all who keep this system running. :cool:

Madpoo 2014-08-26 23:49

[QUOTE=Madpoo;381463]Well, I spent a little time today sifting those logs and comparing with what got checked into the database since it's been 24 hours.[/QUOTE]

Here's what I could determine:
7 "Factor found" missing out of 17
11 "P-1 results" missing out of 47 (no factor found)
20 "ECM results" missing out of 43 (no factor found)
33 "LL results" missing out of 99 (not prime)
272 "TF results" missing out of 944 (no factor found)

The totals add up to 1150 results, which matches what I found when I removed duplicate check-in attempts.

All the data is there including the user/CPU that tried to check it in, so I'll just pass this off to George for remediation.

I'll also check again in a while and see if any of those wayward clients check their stuff in later.
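The per-type tallies above can be reproduced by comparing the logged check-in attempts against what later landed in the database. A small sketch, with record fields assumed for illustration:

```python
# Sketch: for each result type, count logged check-ins that never showed up
# in the database afterwards. Record fields are assumptions, not the schema.
from collections import Counter

def missing_by_type(logged, in_db_ids):
    """Tally logged check-ins whose IDs are absent from the database."""
    missing = Counter()
    for rec in logged:
        if rec["id"] not in in_db_ids:
            missing[rec["type"]] += 1
    return dict(missing)

logged = [
    {"id": 1, "type": "LL"},
    {"id": 2, "type": "LL"},  # this one checked in later on its own
    {"id": 3, "type": "TF"},
]
print(missing_by_type(logged, in_db_ids={2}))  # {'LL': 1, 'TF': 1}
```

Re-running the same comparison a few days later shows which of the wayward clients eventually checked themselves in.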

TheMawn 2014-08-27 01:57

[QUOTE=Prime95;381397]Can you email your prime.log file to me?[/QUOTE]

Did you receive it, George? Not asking you to rush or anything, but I just want to make sure you're not still waiting because I misspelled the address.


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.