[QUOTE=chalsall;473728]The rebuild process (in this case, md3_resync) is at priority 20, nice of 0, while mprime is priority 30, nice of 10.
But it seems that because mprime is so incredibly efficient at saturating the CPUs, the OS wasn't able to prevent it from degrading the md3_resync process significantly.[/QUOTE] I think it's just that modern OSs are really bad at actually enforcing minimum priority. For instance, I can't run yafu or msieve at nice -n 19 and, say, play a video game at the same time, and those are hardly efficient at saturating the CPU. (Well, maybe yafu's SIQS has inline assembly, as do the GGNFS sievers, but even those get ~20% gains from hyperthreading. Linear algebra saturates the CPU even less well, and even then I still can't truly do two things at once.)
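For what it's worth, lowering CPU priority alone often isn't enough; combining it with idle I/O priority is the usual approach. A minimal sketch (the job name is a hypothetical stand-in for yafu/msieve/mprime):

```shell
# Launch a CPU-hungry batch job at minimum CPU priority and idle I/O priority,
# so a competing task (or an md resync) wins both the CPU and the disk queue.
# "./factor_job" is a hypothetical stand-in, shown commented out:
#     nice -n 19 ionice -c 3 ./factor_job
#
# Demonstrate the effective niceness such a child would inherit:
nice -n 19 nice
# prints 19
```

Note that nice only biases the CPU scheduler's timeslice allocation; it doesn't reserve cache, memory bandwidth, or disk queue slots, which is part of why a nice-19 job can still visibly slow everything else down.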
[QUOTE=chalsall;473728]The rebuild process (in this case, md3_resync) is at priority 20, nice of 0, while mprime is priority 30, nice of 10.
But it seems that because mprime is so incredibly efficient at saturating the CPUs, the OS wasn't able to prevent it from degrading the md3_resync process significantly.[/QUOTE]I think the scheduler is broken. Is it putting two threads of different priorities onto one CPU because it got the SMT (aka hyperthreading) map wrong? Or maybe it thinks SMT is no issue at all and just allocates without caring.
[QUOTE=retina;473731]I think the scheduler is broken. Is it putting two threads of different priorities onto one CPU because it got the SMT (aka hyperthreading) map wrong? Or maybe it thinks SMT is no issue at all and just allocates without caring.[/QUOTE]
To be perfectly honest, I'm still trying to figure out what was/is going on, mostly through empirical testing.

Hyperthreading is not the issue; the CPU (an Intel(R) Xeon(R) E31220 @ 3.10GHz) has four (4) physical cores and doesn't support hyperthreading. I did have mprime configured to use all four cores in parallel on a single P-1 assignment. Memory usage (up to 8 GB of the 12 GB available) had no effect on the /proc/mdstat estimated completion time.

I could try spinning mprime up again using only one core, and see how the scheduler handles that. But this is a mission-critical server (not only for GPU72, but also for other clients), so I'm a bit reluctant to do so.
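One other empirical knob worth mentioning: the md driver exposes a floor and a ceiling for resync throughput, and raising the floor forces the rebuild to keep at least that rate even when other processes (like mprime) are hammering the machine. A sketch of a sysctl drop-in, with illustrative values rather than a recommendation:

```
# Hypothetical /etc/sysctl.d/90-md-resync.conf (file name and values are
# assumptions; the sysctls themselves are the standard md tunables):
dev.raid.speed_limit_min = 50000     # KB/s per device: resync won't drop below this
dev.raid.speed_limit_max = 200000    # KB/s per device: ceiling when the system is idle
```

Progress can then be watched in [C]/proc/mdstat[/C] as usual.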
Man, when it rains, it pours...
So, after all the hell we went through earlier with the failing Seagate drive, the hotswap (which didn't go well), and then trying to rebuild the RAID1 array twice, it turns out the *brand* *new* Seagate drive is bad!!! They've just swapped in ANOTHER Seagate drive, and we have to go through the whole process AGAIN!

We are going to have to reboot the server to get the hotswapped drive to show up as /dev/sda instead of /dev/sdc, so there will (hopefully) be a minute or two of downtime shortly (longer if things don't go well)...

If you have the choice, NEVER use Seagate drives. They are CRAP!!!!
That's not quite what I predicted, but certainly confirms my opinion of Seagate :censored: :bangheadonwall:
I still use them, but only because I was getting them at a very low cost. Would I trust one? Nope.
[QUOTE=James Heinrich;474311]That's not quite what I predicted, but certainly confirms my opinion of Seagate :censored: :bangheadonwall:[/QUOTE]
Yeah... This is a direct quote from the email I received from 1and1 about this:[QUOTE]In a datacenter, Hard disks are "consumables" and are changed out like tires on a car when they wear out. Your dedicated server is designed to have two "twins" or mirror drives. When one goes down, we [replace it].[/QUOTE] No shit, Sherlock... But clearly 1and1 haven't [URL="https://en.wikipedia.org/wiki/Firestone_and_Ford_tire_controversy"]learned the lessons of Ford[/URL].

Just so everyone knows, the reboot was cleanly successful, and the RAID1 rebuild is underway. [C]/proc/mdstat[/C] is reporting a rebuild roughly ten times faster than with the last "new" Seagate drive... I took the opportunity to test whether mprime still impacts the rebuild speed, and it does.

P.S. Just to say, while this has been a frustrating experience, I do generally find 1and1 to be quite good. Very inexpensive for truly dedicated servers, and their higher-level techs actually know their stuff.
I also use 1and1 for mersenne.ca and fortunately I've never had cause to test their tech support :smile:
[QUOTE=Mark Rose;474320]I still use them, but only because I was getting them at a very low cost. Would I trust one? Nope.[/QUOTE]
I will _sometimes_ include a single Seagate drive in a RAID6 array (because I like a mixture of manufacturers and models in such an array). Whenever someone asks me if they should buy an external USB hard drive, I answer "Yes, unless you see Seagate anywhere on the box".
[QUOTE=chalsall;474310]We are going to have to reboot the server to get the hotswapped drive to show up as /dev/sda instead of /dev/sdc, so there will (hopefully) be a minute or two of downtime shortly (longer if things don't go well)...[/QUOTE]
So, five days later, the RAID1 resync/rebuild completed successfully. I have learnt a few things from this:

1. Never trust Seagate. (I already knew this, but it has just been reconfirmed.)
2. Don't map most of your drive into a single RAID1 array and then subdivide it using LVM.
2.1. Instead, make an educated guess as to what your storage needs are going to be, and create multiple arrays and partitions to suit.
2.2. The reason this rebuild took so long (beyond the fact that the first replacement Seagate drive was bad) was that every time the server answered a network request, one or more blocks had to be written, which slowed down the resync process.
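Point 2.1 can be made concrete. A layout sketch, with hypothetical device names and sizes (untested, for illustration only):

```
# Several small mirrors, one per purpose, instead of one whole-disk RAID1
# carved up by LVM (partition sizes and mount points are illustrative):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1   # ~50G  -> /
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2   # rest  -> data
# After a disk replacement, the resync of the small md0 only has to copy
# ~50G, so the critical root mirror is redundant again quickly.
```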
[url]https://www.gpu72.com/[/url] is trying to use a self-signed certificate right now, which my MISFIT is, understandably, complaining about. Has something gone wrong with the server configuration?