[QUOTE=chalsall;495432]Hey all. FSCK!!!
The GPU72 server just crashed; looks like another bad Seagate drive. 1and1 Tech Support and I are working it. Please stand by....[/QUOTE] No end of fun! Good luck with it. |
[QUOTE=kladner;495436]No end of fun! Good luck with it.[/QUOTE]
Yeah... Thanks... Another fscking Seagate drive!!! And this time it didn't even throw any SMART warnings! I was editing a file on the server, and went to save it. The console hung. Went to another console and asked for a SMART report (smartctl -a /dev/sda) and that hung. Went to yet another console, and killed the smartctl command; kernel panic. Logged into the serial console and asked for a reboot:
[CODE]
Sep 5 13:06:51 gpu72 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 5 13:06:51 gpu72 kernel: ata1.00: failed command: FLUSH CACHE EXT
Sep 5 13:06:51 gpu72 kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2#012 res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 5 13:06:51 gpu72 kernel: ata1.00: status: { DRDY }
Sep 5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep 5 13:06:51 gpu72 kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)
Sep 5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep 5 13:06:51 gpu72 kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)
Sep 5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep 5 13:06:51 gpu72 kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 5 13:06:51 gpu72 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Sep 5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Sep 5 13:06:51 gpu72 kernel: ata1.00: qc timeout (cmd 0xec)
Sep 5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Sep 5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY after ACPI commands
Sep 5 13:06:51 gpu72 kernel: ata1.00: revalidation failed (errno=-5)
Sep 5 13:06:51 gpu72 kernel: ata1: hard resetting link
Sep 5 13:06:51 gpu72 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Sep 5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Sep 5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Sep 5 13:06:51 gpu72 kernel: ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Sep 5 13:06:51 gpu72 kernel: ata1.00: configured for UDMA/133
Sep 5 13:06:51 gpu72 kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
Sep 5 13:06:51 gpu72 kernel: ata1: EH complete
[/CODE]
TL;DR: The drive just died!!! Like, instantly! 1&1 have scheduled a drive replacement (AGAIN!!!!). Third time for this machine. They also promised to check all the cabling, and if this ever happens again they'll replace the machine. Have I ever mentioned I hate computers? (And don't even get me started on SeaCrap!) |
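For anyone who wants to spot this failure pattern in their own logs, here is a minimal sketch that counts the libata messages seen in the excerpt above. The pattern names and the `summarize_ata_errors` helper are made up for illustration, and the sample lines are copied from the quoted log; this is not how GPU72 or 1&1 monitor anything.

```python
import re
from collections import Counter

# Message patterns taken from the kernel log quoted in the post above.
# The category names are hypothetical labels chosen for this sketch.
PATTERNS = {
    "failed_command": re.compile(r"ata\d+(?:\.\d+)?: failed command: "),
    "comreset_failed": re.compile(r"ata\d+: COMRESET failed"),
    "identify_failed": re.compile(r"ata\d+\.\d+: failed to IDENTIFY"),
    "hard_reset": re.compile(r"ata\d+: hard resetting link"),
}


def summarize_ata_errors(lines):
    """Count how many log lines match each known libata failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# A few lines copied from the log excerpt in the post.
SAMPLE = [
    "Sep 5 13:06:51 gpu72 kernel: ata1.00: failed command: FLUSH CACHE EXT",
    "Sep 5 13:06:51 gpu72 kernel: ata1: hard resetting link",
    "Sep 5 13:06:51 gpu72 kernel: ata1: COMRESET failed (errno=-16)",
    "Sep 5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "Sep 5 13:06:51 gpu72 kernel: ata1.00: failed to IDENTIFY after ACPI commands",
    "Sep 5 13:06:51 gpu72 kernel: ata1: EH complete",
]

if __name__ == "__main__":
    for name, count in sorted(summarize_ata_errors(SAMPLE).items()):
        print(f"{name}: {count}")
```

Fed a real syslog file (for example via `summarize_ata_errors(open(path))`, with `path` being wherever your distribution writes kernel messages), repeated COMRESET or IDENTIFY failures are a reasonable hint that a drive or its cabling is in trouble even when SMART has said nothing.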
[QUOTE=chalsall;495441]1&1 have scheduled a drive replacement (AGAIN!!!!).[/QUOTE]
OK. The girl is back... No data-loss (I don't think). And although 1&1 insisted on installing yet another SeaCRAP drive, at least it's a different model this time: "Seagate Constellation CS" instead of a "Seagate Barracuda 7200.14 (AF)". They also changed the drive carriages to a newer model with better airflow. RAID1 rebuild is currently under way, so there may be a bit of sluggishness for an hour or so. Please let me know if anyone sees anything weird. You never know with this kind of thing.... Edit: OK. Except for the file I was editing at the time of the crash, everything looks good. DB sanity checks passed, etc. |
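If the array is Linux software RAID (md), which is an assumption here since the post does not say how the RAID1 is implemented, rebuild progress can be read from `/proc/mdstat`. A minimal sketch follows; the `rebuild_progress` helper and the sample text are illustrative only, not output from the GPU72 server.

```python
import re

# Matches the progress line md prints during a rebuild, e.g.
#   [==>......]  recovery = 12.6% (123/456) finish=127.5min speed=100K/sec
_PROGRESS = re.compile(
    r"(recovery|resync)\s*=\s*(\d+(?:\.\d+)?)%.*?finish=(\d+(?:\.\d+)?)min"
)


def rebuild_progress(mdstat_text):
    """Return (kind, percent_done, minutes_left), or None if no rebuild is running."""
    match = _PROGRESS.search(mdstat_text)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), float(match.group(3))


# Made-up sample in the usual /proc/mdstat layout (NOT from the real server).
SAMPLE_MDSTAT = """\
md0 : active raid1 sdb1[2] sda1[0]
      976630336 blocks super 1.2 [2/1] [U_]
      [==>..................]  recovery = 12.6% (123456789/976630336) finish=127.5min speed=111111K/sec
"""

if __name__ == "__main__":
    print(rebuild_progress(SAMPLE_MDSTAT))
```

On a live md system the call would be `rebuild_progress(open("/proc/mdstat").read())`; with a hardware RAID controller this file will not show the rebuild, and the vendor's own tool is needed instead.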
Hey All.
Just to document the exchange I had with 1&1 after the outage, trying to give constructive criticism.
[QUOTE]Hi [REDACTED].
OK, thanks for being willing to interact via email. This is with regard to TTN [REDACTED].
First off, let me please tell you that I find 1and1's technical support people to be really quite good. I'm sure they have to deal with "noobs" who don't know what they're doing, and call only after both of their drives have failed (in a RAID1 configuration).
The three main points I wanted to bring forward:
1. Why does 1&1 insist on using Seagate drives? A proper configuration in a RAID1 array is to use two different manufacturers (or, at least, two different models), so the MTBF is unlikely to be exactly the same.
2. This is now the third time this machine has been given a new Seagate hard drive. The first time it was replaced it turned out the brand new hard drive was bad, so we had to go through the whole RAID1 rebuild exercise again.
3. Why was I not informed that the new drive was in place, and should be working? I didn't get an email, nor a phone call about this (except for a request to take a survey evaluating your service).
3.1. I had to call into the 1&1 Dedicated Server Help Desk to be told the work had been completed. I then informed the tech ([REDACTED]) that I wasn't seeing the machine. He did some magic, and then I was able to access the machine.
Again, please let me say that overall I'm happy with 1&1's service. But yesterday's outage caused just a little bit more business continuity interruption than I would have liked, and if I had been informed the moment the machine was supposed to have been back online the outage would have been shorter.
Please let me know if you have questions or comments.[/QUOTE]
The typical corporate response:
[QUOTE]Thank you for contacting us. After looking over your concern, I’ve included some helpful resources that may be of use to you.
1. Unfortunately, we are not provided the reasons behind specific hardware decisions as these are not made directly by the server department or the datacenter. These decisions in hardware options and availability are made from a higher level that we do not have direct influence over.
2. I do apologize for any inconvenience you have faced regarding hard disk failures. The drive test should reveal if a drive is bad prior to it being used in the server, and I am unsure how you ended up with a 2nd faulty drive being placed in the server. Again, I apologize for the inconvenience.
3. The typical drive replacement is done within 1-4 hours. I see this one was done quite quickly (within an hour), but unfortunately, there are no automated alerts that can be sent to you regarding this. It would require that the datacenter ticket is updated by the datacenter technician and that one of our agents manually checks it and updates you. While we always do our best to keep an eye on the tickets we escalate, there are occasionally some delays depending on staffing, call volume, and agent awareness during shift changes. I apologize that there was a delay in notifying you about the replacement.[/QUOTE]
As usual with large providers, their people can't actually do much more than apologise. That and four bucks will get you a coffee. And, so, you manage the situation... Know the lay of the land, and always be prepared to deal with everything which could possibly happen.... |
Hey Chris, I remember I read this a few days ago, but I skipped it at the time, not being interested, but it seems it repeats: I also got assigned 39M TF from 71 to 74 (?!?), 18 (eighteen) of them. Was that supposed to happen? After changing the bitlevel to higher/lower, I started getting correct assignments. For now, I will let them finish, but for the future, if you offer assignments out of the expected ranges, please add them to the tables. I always look to the tables when I request assignments, and I don't like surprises. Mind that I do not question your reasons for assigning this or that, I only question the fact that you gave me some work which I didn't expect, because it was not exposed as available. If you assign the 39M range for whatever reason, add it to the table, so I can see it.
|
[QUOTE=LaurV;495716]I also got assigned 39M TF from 71 to 74 (?!?), 18 (eighteen) of them. Was that supposed to happen?[/QUOTE]
No. What happened there is I was doing some "personal" TF'ing in that range, and was making GPU72 aware so it would collect the statistics. I somehow managed to forget to remove the last few candidates (24#) from the DB, and you chose unusual criteria in your query which resulted in those being given to you. [QUOTE=LaurV;495716]For now, I will let them finish, but for the future, if you offer assignments out of the expected ranges, please add them to the tables. I always look to the tables when I request assignments, and I don't like surprises.[/QUOTE] A fair criticism. And again, this was a mistake made by me (a Stupid Programmer Error (SPE)); the 39M assignments you received should not have been given. For the last several months when my GPUs are working at lower levels I've been working "off the books", to avoid this kind of thing. And I've made sure that the GPU72 DB is clean of such potentially erroneous assignments. |
Tropical Storm Kirk...
Hey All. Just a heads up...
Barbados is about to be hit by Tropical Storm Kirk. We're prepared, but there's a chance I might lose power and/or connectivity for a while. GPU72 is pretty steady-state at the moment, so there shouldn't be any issues. But if you don't see me around for a bit, you'll know why. |
[QUOTE=chalsall;496863]Hey All. Just a heads up...
Barbados is about to be hit by Tropical Storm Kirk. We're prepared, but there's a chance I might lose power and/or connectivity for a while. GPU72 is pretty steady-state at the moment, so there shouldn't be any issues. But if you don't see me around for a bit, you'll know why.[/QUOTE] Best of luck. You are well above the waves, right? |
[QUOTE=kladner;496877]Best of luck. You are well above the waves, right?[/QUOTE]
Thanks. And yeah, our home is about 100 m above sea level. Although Linda's office is only about 5 m above... Kirk turned out to be a bit anticlimactic. It passed north of us, so we only saw about 50 km/h winds, and about 15 cm of rain over eight hours. There's a saying around Bimshire: "God is a Bajan." This annoys the heck out of me, because people use it as a rationale for why they don't need to be properly prepared for storms. But, in Kirk's case, the empirical evidence supports the argument: after passing us it then dropped back down south to our latitude and continued on east.... |
That's good to hear. Five meters could be a close thing on the bad side of a strong storm. I am happy for the miss.
|