mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72
2018-09-05, 19:03   #4192
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

Quote:
Originally Posted by chalsall View Post
Hey all. FSCK!!!

The GPU72 server just crashed; looks like another bad Seagate drive.

1and1 Tech Support and I are working it. Please stand by....
No end of fun! Good luck with it.
2018-09-05, 20:05   #4193
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

3³×19² Posts

Quote:
Originally Posted by kladner View Post
No end of fun! Good luck with it.
Yeah... Thanks...

Another fscking Seagate drive!!! And this time it didn't even throw any SMART warnings!

I was editing a file on the server, and went to save it. The console hung. Went to another console and asked for a SMART report (smartctl -a /dev/sda) and that hung. Went to yet another console, and killed the smartctl command; kernel panic.
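For anyone scripting this kind of check, here is a minimal sketch (assuming Python 3.7+ and that smartctl is installed) of running the query with a timeout, so the calling console does not hang along with it. `run_with_timeout` is a hypothetical helper name, not part of any tool mentioned here. A process blocked on a dead drive may not actually die when killed, so the helper deliberately does not wait for it after the timeout; whether the kill is safe on a given machine is a separate question.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (returncode, output).

    Returns (None, "") if the command is missing or does not finish
    within timeout_s seconds.
    """
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
    except FileNotFoundError:
        # The binary is not installed or not on PATH.
        return None, ""
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in uninterruptible I/O may
        # ignore the signal, so we do not block waiting for it to exit.
        proc.kill()
        return None, ""
```

Usage would look like `run_with_timeout(["smartctl", "-a", "/dev/sda"], timeout_s=10.0)`; a `None` return code then means "no answer in time", which is itself a useful health signal.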

Logged into the serial console and asked for a reboot:
Code:
Sep  5 13:06:51 gpu72 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  5 13:06:51 gpu72 kernel: ata1.00: failed command: FLUSH CACHE EXT
Sep  5 13:06:51 gpu72 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2#012         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  5 13:06:51 gpu72 kernel: ata1.00: status: { DRDY }
Sep  5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep  5 13:06:51 gpu72 kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep  5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)
Sep  5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep  5 13:06:51 gpu72 kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep  5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)
Sep  5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep  5 13:06:51 gpu72 kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep  5 13:06:51 gpu72 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Sep  5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Sep  5 13:06:51 gpu72 kernel: ata1.00: qc timeout (cmd 0xec)
Sep  5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Sep  5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY after ACPI commands
Sep  5 13:06:51 gpu72 kernel: ata1.00: revalidation failed (errno=-5)
Sep  5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep  5 13:06:51 gpu72 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep  5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Sep  5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Sep  5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Sep  5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Sep  5 13:06:51 gpu72 kernel: ata1.00: configured for UDMA/133
Sep  5 13:06:51 gpu72 kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
Sep  5 13:06:51 gpu72 kernel: ata1: EH complete
TL;DR: The drive just died!!! Like, instantly!
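A rough sketch of how a log like the one above could be scanned automatically, assuming Python 3. The patterns are copied from the kernel messages quoted in this post and are illustrative, not an exhaustive list of libata errors; `summarize_ata_errors` and `FAILURE_PATTERNS` are made-up names.

```python
import re

# Failure signatures taken from the quoted kernel log (illustrative only).
FAILURE_PATTERNS = {
    "failed_command": re.compile(r"(ata\d+(?:\.\d+)?): failed command: "),
    "comreset_failed": re.compile(r"(ata\d+(?:\.\d+)?): COMRESET failed"),
    "identify_failed": re.compile(r"(ata\d+(?:\.\d+)?): failed to IDENTIFY"),
    "revalidation_failed": re.compile(r"(ata\d+(?:\.\d+)?): revalidation failed"),
}


def summarize_ata_errors(lines):
    """Count each failure signature per ATA port or device."""
    counts = {}
    for line in lines:
        for name, pattern in FAILURE_PATTERNS.items():
            match = pattern.search(line)
            if match:
                key = (match.group(1), name)
                counts[key] = counts.get(key, 0) + 1
    return counts


# A few lines lifted from the log above.
SAMPLE = """\
Sep  5 13:06:51 gpu72 kernel: ata1.00: failed command: FLUSH CACHE EXT
Sep  5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)
Sep  5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)
Sep  5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Sep  5 13:06:51 gpu72 kernel: ata1.00: revalidation failed (errno=-5)
Sep  5 13:06:51 gpu72 kernel: ata1: EH complete
"""

if __name__ == "__main__":
    for (device, kind), n in sorted(summarize_ata_errors(SAMPLE.splitlines()).items()):
        print(device, kind, n)
```

Repeated COMRESET failures followed by a failed IDENTIFY are the sort of pattern one might alert on, given that SMART gave no warning here.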

1&1 have scheduled a drive replacement (AGAIN!!!!). Third time for this machine. They also promised to check all the cabling, and if this ever happens again they'll replace the machine.

Have I ever mentioned I hate computers? (And don't even get me started on SeaCrap!)
2018-09-05, 21:47   #4194
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

3³·19² Posts

Quote:
Originally Posted by chalsall View Post
1&1 have scheduled a drive replacement (AGAIN!!!!).
OK. The girl is back... No data-loss (I don't think).

And although 1&1 insisted on installing yet another SeaCRAP drive, at least it's a different model this time: "Seagate Constellation CS" instead of a "Seagate Barracuda 7200.14 (AF)". They also changed the drive carriages to a newer model with better airflow.

RAID1 rebuild is currently under way, so there may be a bit of sluggishness for an hour or so.

Please let me know if anyone sees anything weird. You never know with this kind of thing....

Edit: OK. Except for the file I was editing at the time of the crash, everything looks good. DB sanity checks passed, etc.
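For watching a rebuild like this one, a minimal sketch in Python 3, assuming the array is Linux md software RAID (the post does not say which RAID implementation is in use). It parses the progress line that `/proc/mdstat` shows during a recovery. `rebuild_progress` is a hypothetical helper, and the sample text is invented for illustration, not taken from the GPU72 server.

```python
import re

# Matches the progress line /proc/mdstat prints during a rebuild, e.g.
#   [==>....]  recovery = 12.6% (123456789/976630336) finish=95.3min speed=149000K/sec
PROGRESS_RE = re.compile(
    r"(recovery|resync)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)"
    r"\s*finish=([\d.]+)min\s*speed=(\d+)K/sec"
)


def rebuild_progress(mdstat_text):
    """Return rebuild progress as a dict, or None if no rebuild is running."""
    match = PROGRESS_RE.search(mdstat_text)
    if match is None:
        return None
    return {
        "kind": match.group(1),
        "percent": float(match.group(2)),
        "done": int(match.group(3)),
        "total": int(match.group(4)),
        "finish_min": float(match.group(5)),
        "speed_k_per_sec": int(match.group(6)),
    }


# Invented sample in the usual /proc/mdstat layout (not real server output).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]
      [==>..................]  recovery = 12.6% (123456789/976630336) finish=95.3min speed=149000K/sec

unused devices: <none>
"""

if __name__ == "__main__":
    print(rebuild_progress(SAMPLE_MDSTAT))
```

On a live md system the input would come from reading `/proc/mdstat`; a `None` result means no recovery or resync line was found.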

Last fiddled with by chalsall on 2018-09-05 at 22:05
2018-09-07, 17:20   #4195
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

3³×19² Posts

Hey All.

Just to document the exchange I had with 1&1 after the outage, trying to give constructive criticism.

Quote:
Hi [REDACTED].

OK, thanks for being willing to interact via email. This is with regards to TTN [REDACTED].

First off, let me please tell you that I find 1and1's technical support people to be really quite good. I'm sure they have to deal with "noobs" who don't know what they're doing, and call only after both of their drives have failed (in a RAID1 configuration).

The three main points I wanted to bring forward:

1. Why does 1&1 insist on using Seagate drives? A proper configuration in a RAID1 array is to use two different manufacturers (or, at least, two different models), so the MTBF is unlikely to be exactly the same.

2. This is now the third time this machine has been given a new Seagate hard drive. The first time it was replaced it turned out the brand new hard drive was bad, so we had to go through the whole RAID1 rebuild exercise again.

3. Why was I not informed that the new drive was in place, and should be working? I didn't get an email, nor a phone call about this (except for a request to take a survey evaluating your service).

3.1. I had to call into the 1&1 Dedicated Server Help Desk to be told the work had been completed. I then informed the tech ([REDACTED]) that I wasn't seeing the machine. He did some magic, and then I was able to access the machine.

Again, please let me say that overall I'm happy with 1&1's service.

But yesterday's outage caused just a little bit more business continuity interruption than I would have liked, and if I had been informed the moment the machine was supposed to have been back online the outage would have been shorter.

Please let me know if you have questions or comments.
The typical corporate response:
Quote:
Thank you for contacting us.

After looking over your concern, I’ve included some helpful resources that may be of use to you.

1. Unfortunately, we are not provided the reasons behind specific hardware decisions as these are not made directly by the server department or the datacenter. These decisions in hardware options and availability are made from a higher level that we do not have direct influence over.

2. I do apologize for any inconvenience you have faced regarding hard disk failures. The drive test should reveal if a drive is bad prior to it being used in the server, and I am unsure how you ended up with a 2nd faulty drive being placed in the server. Again, I apologize for the inconvenience.

3. The typical drive replacement is done within 1-4 hours. I see this one was done quite quickly (within an hour), but unfortunately, there are no automated alerts that can be sent to you regarding this. It would require that the datacenter ticket is updated by the datacenter technician and that one of our agents manually checks it and updates you. While we always do our best to keep an eye on the tickets we escalate, there are occasionally some delays depending on staffing, call volume, and agent awareness during shift changes. I apologize that there was a delay in notifying you about the replacement.
As usual with large providers, their people can't actually do much more than apologise. That and four bucks will get you a coffee.

And, so, you manage the situation... Know the lay of the land, and always be prepared to deal with everything which could possibly happen....
2018-09-09, 04:37   #4196
LaurV
Romulan Interpreter
Jun 2011
Thailand

23×419 Posts

Hey Chris, I remember reading about this a few days ago, but I skipped it at the time, not being interested. It seems to be repeating, though: I also got assigned 39M TF from 71 to 74 (?!?), 18 (eighteen) of them. Was that supposed to happen? After changing the bitlevel to higher/lower, I started getting correct assignments. For now, I will let them finish, but for the future, if you offer assignments out of the expected ranges, please add them to the tables. I always look to the tables when I request assignments, and I don't like surprises. Mind that I do not question your reasons for assigning this or that; I only question the fact that you gave me some work which I didn't expect, because it was not shown as available. If you assign the 39M range for whatever reason, add it to the table, so I can see it.
2018-09-10, 18:23   #4197
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

3³·19² Posts

Quote:
Originally Posted by LaurV View Post
I also got assigned 39M TF from 71 to 74 (?!?), 18 (eighteen) of them. Was that supposed to happen?
No. What happened there is I was doing some "personal" TF'ing in that range, and was making GPU72 aware so it would collect the statistics.

I somehow managed to forget to remove the last few candidates (24#) from the DB, and you chose unusual criteria in your query which resulted in those being given to you.

Quote:
Originally Posted by LaurV View Post
For now, I will let them finish, but for the future, if you offer assignments out of the expected ranges, please add them to the tables. I always look to the tables when I request assignments, and I don't like surprises.
A fair criticism. And again, this was a mistake made by me (a Stupid Programmer Error (SPE)); the 39M assignments you received should not have been given.

For the last several months, whenever my GPUs are working at lower levels, I've been working "off the books" to avoid this kind of thing. And I've made sure that the GPU72 DB is clean of such potentially erroneous assignments.
2018-09-26, 22:19   #4198
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

3³·19² Posts
Tropical Storm Kirk...

Hey All. Just a heads up...

Barbados is about to be hit by Tropical Storm Kirk. We're prepared, but there's a chance I might lose power and/or connectivity for a while.

GPU72 is pretty steady-state at the moment, so there shouldn't be any issues. But if you don't see me around for a bit, you'll know why.
2018-09-27, 02:38   #4199
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

23656₈ Posts

Quote:
Originally Posted by chalsall View Post
Hey All. Just a heads up...

Barbados is about to be hit by Tropical Storm Kirk. We're prepared, but there's a chance I might lose power and/or connectivity for a while.

GPU72 is pretty steady-state at the moment, so there shouldn't be any issues. But if you don't see me around for a bit, you'll know why.
Best of luck. You are well above the waves, right?
2018-09-28, 14:50   #4200
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

3³·19² Posts

Quote:
Originally Posted by kladner View Post
Best of luck. You are well above the waves, right?
Thanks. And yeah, our home is about 100 m above sea level. Although Linda's office is only about 5 m above...

Kirk turned out to be a bit anticlimactic. It passed north of us, so we only saw about 50 km/h winds, and about 15 cm of rain over eight hours.

There's a saying around Bimshire: "God is a Bajan." This annoys the heck out of me, because people use it as a rationale for why they don't need to be properly prepared for storms. But, in Kirk's case, the empirical supports the argument: after passing us it then dropped back down south to our latitude and continued on east....

Last fiddled with by chalsall on 2018-09-28 at 14:54 Reason: s/shouldn't be/don't need to be/;
2018-09-29, 06:42   #4201
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

That's good to hear. Five meters could be a close thing on the bad side of a strong storm. I am happy for the miss.

Last fiddled with by kladner on 2018-09-29 at 06:43
2018-10-12, 03:35   #4202
petrw1
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada

11112₈ Posts
www.gpu72.com

AWOL?

All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.