From a reader:
Love to hear your take on this: Why RAID 5 stops working in 2009.
That article by Robin Harris is more than a year old, but was suddenly linked hither and yon in the last few days. Its thesis is that RAID arrays are now so big that the number of bits in one is approaching the number of bits a drive is rated to read, on average, before it hits an unrecoverable read error. So if one disk in your RAID fails, you’ll very probably get a read error from one of the other disks in the course of rebuilding the array, and lose data.
This basic claim is true, but there are three reasons why this problem is not as scary as it sounds.
1: Losing one bit shouldn’t lose you more than one file. Consumer RAID controllers may have a fit over small errors, but losing one file because a drive failed is seldom actually a serious problem.
2: Putting your valuable data on a RAID array and not backing it up is a bad idea even if disk errors never happen. One day, you are guaranteed to confidently instruct your computer to delete something that you actually want, and RAID won’t protect you from that, or from housefires, theft of the drives, and so on. You still need proper backups.
3: If you’re going to build a story on statistics, it helps a lot if you get the statistics right.
Robin Harris says it is “almost certain” that 12 terabytes of disk drives with a one-in-12TB read error rate will have an error if you read the whole capacity.
This statement is wrong.
Actually, the probability of one or more errors, in this situation, is only 63.2%. When you know why this is, you discover that there’s less to fear here than you’d think.
(Robin Harris is not the only sinner, here. This guy makes exactly the same mistake. This guy, on the other hand, says one error in ten to the fourteen reads gives you “a 56% chance” of an error in seven terabytes read; he’s actually done the maths correctly.)
The mistake people make over and over again when figuring out stuff like this is saying that if you’ve got (say) a one in a million chance of event Y occurring every time you take action X, and you do X a million times, the probability that Y will have happened is 1.
(If it were, then if you did X a million and one times, the probability that Y will have occurred would now be slightly more than one. This is unacceptably weird, even by mathematicians’ standards.)
What you do to figure out the real probabilities in this sort of situation is look at the probability that Y will never happen over your million trials.
(If it matters to you if Y happens more than once, then things get more complex. But usually the outcomes you’re interested in are “Y does not happen at all” and “Y happens one or more times”. That is the case here, and in many other “chance of failure” sorts of situations.)
To make this easier to understand, let’s look at a version of the problem using numbers that you can figure out on a piece of paper, without having to do anything a million times.
Let’s say that you’re throwing an ordinary (fair!) six-sided die, and you don’t want to get a one. The chance of getting a one is, of course, one in six, and let’s say you’re throwing the die six times.
For each throw, the probability of something other than one coming up is five in six. So the probability of something other than one coming up for all six throws is:
5/6 times 5/6 times 5/6 times 5/6 times 5/6 times 5/6.
This can more easily be written as five-sixths to the power of six, or (5/6)^6, and it’s equal to (5^6)/(6^6), or 15625/46656. That’s about 0.335, where 1 is certainty, and 0 is impossibility.
So six trials, in each of which an undesirable outcome has a one in six chance of happening, certainly do not make the undesirable outcome certain. You actually have about a one-third chance that the undesirable outcome will not happen at all.
It’s easy to adjust this for different probabilities and different numbers of trials. If you intend to throw the die 15 times instead of six, you calculate (5/6)^15, which gives you about a 0.065 chance that you’ll get away with no ones. And if you decide to toss a coin ten times, and want to know how likely it is that it’ll never come up tails, then the calculation will be (1/2)^10, a miserable 0.00098.
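If you’d rather let a computer do the multiplying, here’s a minimal Python sketch of the same sums. The function name chance_of_no_event is made up for illustration, and exact fractions are used so nothing gets rounded until the end.

```python
from fractions import Fraction

def chance_of_no_event(p, trials):
    """Chance that an event with per-trial probability p never happens
    in `trials` independent trials."""
    return (1 - p) ** trials

six_throws = chance_of_no_event(Fraction(1, 6), 6)
print(six_throws)                                       # the exact fraction, 15625/46656
print(float(six_throws))                                # about 0.335
print(float(chance_of_no_event(Fraction(1, 6), 15)))    # about 0.065
print(float(chance_of_no_event(Fraction(1, 2), 10)))    # about 0.00098
```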
In the one-in-a-million, one-million-times version, you figure out (1 - 1/1000000)^1000000, which is about 0.368. So there’s a 36.8% chance that the one-in-a-million event will never happen in one million trials, and a 63.2% chance that the event will happen one or more times.
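That 0.368 is no coincidence. As n gets bigger, (1 - 1/n)^n settles down towards 1/e, which is about 0.3679, so any “one-in-n chance, n tries” problem with a large n ends up at roughly 36.8%. A quick Python sketch to watch it happen (the function name is just for illustration):

```python
import math

def chance_of_dodging(n):
    """Chance that a one-in-n event never happens in n independent trials."""
    return (1 - 1 / n) ** n

# The result creeps towards 1/e as n grows
for n in (6, 100, 10_000, 1_000_000):
    print(n, round(chance_of_dodging(n), 4))

print("1/e =", round(math.exp(-1), 4))
```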
OK, on to the disk-drive example.
Let’s say that the chance of an unrecoverable read failure is indeed one in ten to the 14, or 1/100,000,000,000,000. I’ll express this big number, and the other big numbers to come, in the conventional computer-y form of scientific notation that doesn’t require little superscript numbers. One times ten to the power of 14, a one with 14 zeroes after it, is thus written “1E+14”.
The chance of no error occurring on any given read, given this error probability, is 1 - 1/(1E+14), which is 0.99999999999999. Very close to one, but not quite there.
(Note that if you start figuring this stuff out for yourself in a spreadsheet or something, really long numbers may cause you to start hitting precision problems, where the computer runs out of digits to express a number like 0.99999999999999999999999999999999999999 correctly, rounds it off to one, and breaks your calculation. Fortunately, the mere fourteen-digit numbers we’re working with here are well within normal computer precision.)
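Here’s a small Python sketch of that precision problem, and one common way around it. Python’s floats are ordinary double-precision numbers, good for roughly 15 to 17 significant digits. The standard library’s math.log1p(x) works out log(1 + x) without ever storing 1 + x, so a very small x doesn’t get rounded away. The 1e-20 probability here is only an illustration, not a real drive spec.

```python
import math

def chance_of_no_event(p, trials):
    """(1 - p) ** trials, computed via log1p so a tiny p is not rounded away."""
    return math.exp(trials * math.log1p(-p))

tiny = 1e-20
print(1 - tiny == 1.0)                   # True: the subtraction rounds straight to 1
print((1 - tiny) ** 1e20)                # 1.0, which is wrong
print(chance_of_no_event(tiny, 1e20))    # about 0.368, the right answer
```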
OK, now let’s say we’re reading the whole of a drive which just happens to have a capacity of exactly 1E+14 bits, at this error rate of one error in every 10^14 reads. So the chance of zero errors is:
(1 - 1/(1E+14))^1E+14
This equals about 0.368. Or, if you prefer, a 63.2% chance of one or more errors.
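To plug in your own array size, something like the following Python sketch should do. It assumes decimal terabytes (10^12 bytes each) and that every bit read is an independent trial with the same error chance, which real drives may not strictly honour. The function and parameter names are made up for illustration.

```python
import math

def chance_of_read_error(terabytes_read, one_error_in_bits=1e14):
    """Chance of one or more unrecoverable read errors while reading this much data.

    Assumes decimal terabytes (10**12 bytes) and independent, equally
    likely errors on every bit read.
    """
    bits_read = terabytes_read * 8e12
    # log(1 - p) via log1p keeps precision when p is tiny
    log_chance_of_no_error = bits_read * math.log1p(-1 / one_error_in_bits)
    # 1 - exp(x), computed as -expm1(x)
    return -math.expm1(log_chance_of_no_error)

# 12.5 decimal terabytes is exactly 1E+14 bits
print(f"{chance_of_read_error(12.5):.1%}")   # 63.2%
```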
Note that the basic statement about the probability of an error remains true - overall, a drive with an Unrecoverable Read Error Rate of one in ten to the fourteen will indeed have such an error once in every ten to the fourteen reads. But that doesn’t guarantee such an error in any particular ten to the fourteen reads, any more than the fact that a coin comes up evenly heads or tails guarantees that you’ll get one of each if you throw it twice.
Now, a RAID that’s 63.2% likely to have an error if one of its drives fails is still not a good thing. But there’s a big difference between 63.2% and “almost certain”.
(Note also that we’re talking about a lot of data, here. At fifty megabytes per second, ten to the fourteen bits will take about 2.8 days to read.)
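The read time is simple division, sketched here in Python. The answer wobbles a bit depending on whether you count a megabyte as 10^6 or 2^20 bytes: roughly 2.9 days with decimal megabytes, roughly 2.8 with binary ones. The function name is made up for illustration, and a steady transfer rate is assumed.

```python
def days_to_read(bits, megabytes_per_second, bytes_per_megabyte):
    """Days needed to read `bits` at a steady transfer rate."""
    bytes_to_read = bits / 8
    seconds = bytes_to_read / (megabytes_per_second * bytes_per_megabyte)
    return seconds / 86_400   # seconds in a day

print(round(days_to_read(1e14, 50, 10**6), 2))   # decimal megabytes: 2.89
print(round(days_to_read(1e14, 50, 2**20), 2))   # binary megabytes: 2.76
```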
Getting the statistics right makes the numbers look proportionally better if the error rate can be reduced.
If drive manufacturers manage to reduce the error rate by a factor of ten, for instance, so now it’s one in every ten to the fifteen reads instead of every 1E+14, the chance that you’ll get no such errors in a given ten to the fourteen reads improves to about 90.5%.
If they reduce the error rate all the way to one in ten to the sixteen, then ten to the fourteen reads are 99.0% likely to all be fine.
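Here’s a Python sketch that runs the three error rates side by side. It uses math.log1p rather than a bare (1 - p) ** n, because at one in ten to the sixteen the precision problem starts to bite: typing (1 - 1e-16) ** 1e14 straight into Python gives about 98.9%, since 1 - 1e-16 can’t be stored exactly in double precision, while the log1p version gives 99.0%. The function name is made up for illustration.

```python
import math

def chance_all_reads_ok(reads, one_error_in):
    """Chance of zero errors in `reads` reads at one error per `one_error_in` reads."""
    return math.exp(reads * math.log1p(-1 / one_error_in))

for one_error_in in (1e14, 1e15, 1e16):
    chance = chance_all_reads_ok(1e14, one_error_in)
    print(f"one error in {one_error_in:.0e} reads: {chance:.1%} chance of a clean read")
```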
I’m not saying it’s necessarily easy to make such an improvement in the read error rate, especially in the marketing-bulldust-soaked hard-drive industry.
But neither is the situation as dire as the “almost certain” article says.
All who commit such crimes against mathematical literacy are hereby sentenced to read John Allen Paulos’ classic Innumeracy.
(This is not a very severe sentence, since the book is actually rather entertaining.)