Hacker News

Here's one we made recently: purchasing an array of hard drives (for storage servers) without making sure they weren't all from the same batch. Since they were made in the same batch, they shared the same defects, and when they failed, they failed one after another within a very short interval. Since all of them failed, RAID didn't help; we had to restore from the day-old offline backup.


I've thought about this for a while. Ideally, you want to have drives of different ages, so they're at different points of the 'bathtub curve'.

If you're mirroring, you should only ever mirror a "fresh, unproven" disk with an "old stalwart" disk. Doing that also means that when your "old stalwart" becomes senile, it's paired with a younger disk.

But doing that does mean rotating mirror sets when you buy a new tranche of disks. That rotation puts extra load on the drives, which can itself trigger failures.

Anyone have a good plan for doing this?
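Not a production answer, but the pairing rule above can be sketched in a few lines. This is a hypothetical helper (the function name, the age units, and the example ages are all illustrative): sort disks by age, then pair the oldest survivor with the newest arrival.

```python
# Hypothetical sketch: pair disks so each mirror combines the oldest
# remaining ("old stalwart") drive with the newest ("fresh, unproven") one.
def pair_mirrors(disk_ages):
    """disk_ages: dict of disk name -> age (e.g. power-on hours)."""
    ordered = sorted(disk_ages, key=disk_ages.get)  # youngest first
    pairs = []
    while len(ordered) >= 2:
        young = ordered.pop(0)   # newest unproven disk
        old = ordered.pop()      # oldest proven disk
        pairs.append((old, young))
    return pairs

ages = {"sda": 40000, "sdb": 35000, "sdc": 200, "sdd": 100}
print(pair_mirrors(ages))  # [('sda', 'sdd'), ('sdb', 'sdc')]
```

When a new tranche arrives you'd rerun this over the combined pool, which is exactly the rotation (and the extra load) the comment above worries about.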


(In theory, as I don't run a storage farm.) RAID6 might help here, since if the disk rotation triggered a failure you'd have a second parity disk to fall back on. Ideally, though, all the disks in an array would be from different lots.

Or, I believe that on some RAID controllers it is possible to add a third disk to a RAID1 pair, which means you can build the new disk before removing either of the old ones, so there is never a single point of failure even during the disk replacement operation.
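To see what that second parity disk buys you, here's a back-of-the-envelope calculation assuming independent drive failures. The per-disk probability is made up for illustration — and correlated batch failures are exactly what breaks the independence assumption this thread is about.

```python
from math import comb

# Assumed, illustrative numbers only. RAID5 survives 1 failure,
# RAID6 survives 2; data is lost once more drives than that fail.
def loss_prob(n_disks, tolerated, p):
    """P(more than `tolerated` of n_disks fail), failures independent."""
    return sum(comb(n_disks, k) * p**k * (1 - p)**(n_disks - k)
               for k in range(tolerated + 1, n_disks + 1))

p = 0.03  # assumed per-disk failure probability during a rebuild window
print(f"RAID5-like (8 disks): {loss_prob(8, 1, p):.4f}")
print(f"RAID6-like (8 disks): {loss_prob(8, 2, p):.4f}")
```

With same-batch drives, the effective p for the survivors jumps once the first one dies, which is why spreading lots matters at least as much as adding parity.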


Legend has it that one of the early ESS systems ran into something like this.

Telephone switching systems have functionally zero downtime. They're designed to be fully modular and entirely hot-swappable: the kind of thing Erlang was built for, and they give Z-series mainframes a run for their money. So you have this hulking, room-sized[1] brute of a telephone switch which can never fail, and if it does, it must always do so gracefully and with plenty of warning.

And one day it just falls over. No warning, no graceful failover to a redundant system, just poof gone. After much wailing and gnashing of teeth the root cause is identified: n drives were capable of failing without issue, n+1 failed simultaneously.

At this point, stories differ: this was either the beginning of a "no two drives from the same manufacturer" policy or the end of the career of a PHB who vetoed said policy on grounds of excessive cost.

[1] http://www.montagar.com/~patj/phone-switches.htm


You frequently see RAID cascade failures.

Drive A fails. You pop it out, put in a new disk. Machine starts rebuilding RAID array. This involves a ton of reads from other disks. Under the increased strain, Drive B, which was already on the edge, also fails.

And so on.
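A toy model of that cascade (all numbers are made up for illustration): a rebuild multiplies the per-day failure probability of the surviving drives, so the odds of finishing a rebuild without a second failure are worse than you'd naively expect.

```python
import random

# Toy model, assumed numbers: rebuild stress raises the per-day failure
# probability of each surviving drive, so one failure invites the next.
def survives_rebuild(n_remaining, base_p=0.001, stress_mult=10,
                     rebuild_days=3, seed=None):
    rng = random.Random(seed)
    for _ in range(rebuild_days):
        for _ in range(n_remaining):
            if rng.random() < base_p * stress_mult:
                return False  # a second drive died mid-rebuild
    return True

trials = 10_000
ok = sum(survives_rebuild(7, seed=i) for i in range(trials))
print(f"{ok / trials:.1%} of rebuilds complete without a second failure")
```

Even with these mild assumed numbers, roughly one rebuild in five hits a second failure; correlated batch defects only make it worse.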

I'm no longer a believer in redundant disks. Redundant machines would be better.


Good lesson that RAID != Backup!

RAID5 by chance?


ARRRRRHGH. That is NOT THE LESSON. I'm sorry, every time there is a discussion of hard disks and RAID, someone posts this same stupid comment. The GP comment didn't suggest that RAID == Backup. Not in any way. Your reply suggests that they did.

"RAID5 by chance?" How do you know they didn't use it? They might have had multiple failures too close together to rebuild the array.

Sorry for the caps, but I just snapped seeing this comment for the nth time and with so many upvotes.


That isn't something I'd ever thought about until now.

Would it make sense to use drives from more than one company, as they would have very different failure characteristics?


It's a terrible idea to mix and match drive models, because they'll have drastically different performance characteristics, and if you're lucky you'll only get a little worse than the lowest common denominator on all axes.

What quality OEMs do is make sure they never ship you drives from the same manufacturing batch in one enclosure.


How do you make sure to buy from a quality OEM?


Possibly. IBM had the infamous bad run of DeskStar (or was it another model?) drives about 8-9 years ago... We got a batch of them. In that case, buying from different distributors or otherwise trying to get a different batch wouldn't have helped, as the number of problematic drives was huge. At least IBM was extremely good about replacing them, no questions asked, and we got them all replaced before we lost any data.


> DeskStar (or was it another model?)

DeskStar is right. They earned the nickname "DeathStar" because they failed so often.


> Would it make sense to use drives from more than one company, as they would have very different failure characteristics?

This is probably difficult to take advantage of unless you set up your RAID layout very carefully.
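For example, here's one careful layout, as a hypothetical sketch: interleave drives round-robin across vendors, then chunk the result into mirror pairs, so no pair ends up with two drives from the same vendor (and hence the same batch).

```python
# Hypothetical sketch: round-robin drives across vendors so that no
# mirror pair contains two drives from the same vendor/batch.
def vendor_spread_pairs(drives):
    """drives: dict of vendor -> list of drive names (equal counts)."""
    pools = {vendor: list(names) for vendor, names in drives.items()}
    order = []
    while any(pools.values()):
        for vendor in pools:
            if pools[vendor]:
                order.append(pools[vendor].pop(0))
    # chunk the interleaved order into mirror pairs
    return [tuple(order[i:i + 2]) for i in range(0, len(order) - 1, 2)]

pairs = vendor_spread_pairs({"wd": ["wd0", "wd1"], "sg": ["sg0", "sg1"]})
print(pairs)  # [('wd0', 'sg0'), ('wd1', 'sg1')]
```

The trade-off mentioned upthread still applies: each pair now runs at the speed of its slower member.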



