mersenneforum.org

GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

chalsall 2017-12-04 19:21

[QUOTE=James Heinrich;473143]If I had RAID1 with two Seagate drives and one failed, I would expect the second one to also fail before the array could be rebuilt.[/QUOTE]

Yeah. That's why I always insist on different manufacturers for the drives in RAID arrays provisioned for my clients. MTBF becomes a bit of a problem when all of the drives fail right around the same time...

chalsall 2017-12-04 20:46

[QUOTE=chalsall;473139]It is scheduled to be replaced in the next hour or so. It's "hot-swappable", and one of the drives in a RAID1 set, so there should be no downtime.[/QUOTE]

OK. The One and One techs have just now removed the /dev/sda drive.

chalsall 2017-12-04 21:43

[QUOTE=chalsall;473162]OK. The One and One techs have just now removed the /dev/sda drive.[/QUOTE]

And then all hell broke loose. "Could I please speak to your supervisor?" "Sure, please let me put you on hold..."

chalsall 2017-12-04 22:25

[QUOTE=chalsall;473165]And then all hell broke loose. "Could I please speak to your supervisor?" "Sure, please let me put you on hold..."[/QUOTE]

And then I finally got to talk in real time with a level 3 tech, who understood what I was saying and laughed at my jokes.

Everything should be back to nominal. Please let me know if anyone sees anything odd.

chalsall 2017-12-05 17:33

[QUOTE=chalsall;473168]Everything should be back to nominal. Please let me know if anyone sees anything odd.[/QUOTE]

Just so everyone knows, the RAID array is rebuilding; should complete in about three days... (!) Things are a little sluggish because of this.

But the good news is the new drive is a different model (unfortunately still a Seagate), and the surviving drive in the array has zero non-recoverable read errors. Plus I have a script taking hourly snapshots of the DB mirrored to another server.
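For anyone wanting a similar safety net, here is a minimal sketch of how an hourly snapshot-and-mirror job could be assembled. It is not the script mentioned above: the database name, the remote target and the file naming scheme are all hypothetical. It assumes `mysqldump` and `rsync` are installed, and it only builds the two command lines, so it can be inspected without touching a live database.

```python
from datetime import datetime, timezone


def snapshot_commands(db_name, remote, now=None):
    """Build the dump and mirror commands for one hourly snapshot.

    db_name and remote are hypothetical placeholders supplied by the caller.
    """
    now = now or datetime.now(timezone.utc)
    # One file per hour, e.g. exampledb-20171205-1700.sql
    filename = f"{db_name}-{now:%Y%m%d-%H}00.sql"
    # --single-transaction gives a consistent dump of InnoDB tables
    # without locking them for the duration of the dump.
    dump = ["mysqldump", "--single-transaction",
            f"--result-file={filename}", db_name]
    # rsync -a copies the snapshot to the other server.
    mirror = ["rsync", "-a", filename, remote]
    return dump, mirror


if __name__ == "__main__":
    dump, mirror = snapshot_commands("exampledb",
                                     "backup.example.org:/srv/snapshots/")
    print(" ".join(dump))
    print(" ".join(mirror))
```

Run from cron once an hour, the two commands could be passed to `subprocess.run(..., check=True)` in turn, so that a failed dump is never mirrored.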

Edit: Just because it [URL="https://www.youtube.com/watch?v=fbiioBFkD_Q"]amuses my sorry ass...[/URL]

kladner 2017-12-10 12:43

[B]Is gpu72.com down?[/B]

It's not just you! [URL="http://gpu72.com"]gpu72.com[/URL] looks down from here.

ET_ 2017-12-10 14:01

[QUOTE=kladner;473626][B]Is gpu72.com down?[/B]

It's not just you! [URL="http://gpu72.com"]gpu72.com[/URL] looks down from here.[/QUOTE]

Back up.

James Heinrich 2017-12-10 14:01

Looks up here and now.

chalsall 2017-12-11 01:54

[QUOTE=James Heinrich;473629]Looks up here and now.[/QUOTE]

Sorry guys.

As I said before, the /dev/sda drive was failing hard, so I had the techs replace it. It was hot-swappable, but that didn't go so well.

Six days later (this Sunday morning) I woke up to the server not responding to anything. Even the serial console was non-responsive. A shame, since the RAID1 rebuild was up to 39.1% complete.

A hard reset later, and the machine was back. But not responding well.

TL;DR: Don't run mprime on a Linux server during a RAID1 rebuild. With mprime stopped, the rebuild estimate has dropped from two weeks to two days. Plus the server is a whole lot more responsive to HTTP and MySQL requests.

Oh, the irony!
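Rebuild progress and the time estimate quoted above are the kind of figures Linux software RAID reports in /proc/mdstat. As a rough sketch of how to pull them out programmatically, the snippet below parses a progress line. The sample line is invented for illustration (it is not output from this server), and the regular expression assumes the usual `resync = N% ... finish=Mmin` layout.

```python
import re

# Hypothetical progress line in the style of /proc/mdstat; the numbers are
# made up for illustration and do not come from the server in this thread.
SAMPLE = ("      [=======>.............]  resync = 39.1% "
          "(763000000/1951000000) finish=2880.0min speed=6875K/sec")

PROGRESS = re.compile(
    r"(resync|recovery)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min")


def parse_progress(text):
    """Return (operation, percent_done, days_remaining), or None when
    the text contains no resync/recovery progress line."""
    match = PROGRESS.search(text)
    if match is None:
        return None
    operation, percent, minutes = match.groups()
    return operation, float(percent), float(minutes) / (60 * 24)


if __name__ == "__main__":
    print(parse_progress(SAMPLE))
```

On a real machine the same function could be fed `open("/proc/mdstat").read()` and polled every few minutes to see whether stopping a CPU-heavy job changes the estimate.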

retina 2017-12-11 02:03

[QUOTE=chalsall;473673]TL;DR: Don't run mprime on a Linux server during a RAID1 rebuild. The rebuild estimation has dropped from two weeks to two days.[/QUOTE]

So the rebuild process has an equal or lower priority than mprime?

chalsall 2017-12-11 15:01

[QUOTE=retina;473674]So the rebuild process has an equal or lower priority than mprime?[/QUOTE]

The rebuild process (in this case, md3_resync) is at priority 20, nice of 0, while mprime is priority 30, nice of 10.

But it seems that because mprime is so incredibly efficient at saturating the CPUs, the OS wasn't able to prevent it from degrading the md3_resync process significantly.
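The two pairs of numbers above are consistent with each other if they were read from `top`, which is an assumption here. For ordinary (non-real-time) tasks, `top` shows PR as 20 plus the nice value, and a higher PR means a lower scheduling priority. A tiny sketch of that mapping:

```python
def top_priority(nice):
    """PR value that top displays for a normal (non-real-time) task
    with the given nice value. A higher PR means a lower priority."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be between -20 and 19")
    return 20 + nice


if __name__ == "__main__":
    # Nice values taken from the post above.
    for name, nice in (("md3_resync", 0), ("mprime", 10)):
        print(name, "nice", nice, "PR", top_priority(nice))
```

So on paper mprime was already the lower-priority task; the slowdown described above happened in spite of that.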


All times are UTC.
