A story of two MTTDL models
By relling on Jan 17, 2007
Mean Time to Data Loss (MTTDL) is a metric that we find useful for comparing data storage systems. I think it is particularly useful for determining what sort of data protection you may want to use for a RAID system. For example, suppose you have a Sun Fire X4550 (aka Thumper) server with 48 internal disk drives. What would be the best way to configure the disks for redundancy? Previously, I explored space versus MTTDL and space versus unscheduled Mean Time Between System Interruptions (U_MTBSI) for the X4500 running ZFS. The same analysis works for SVM or LVM, too.
For this blog, I want to explore the calculation of MTTDL for a bunch of disks. It turns out, there are multiple models for calculating MTTDL. The one described previously here is the simplest and only considers the Mean Time Between Failure (MTBF) of a disk and the Mean Time to Repair (MTTR) of the repair and reconstruction process. I'll call that model #1 which solves for MTTDL. To quickly recap:
For non-protected schemes (dynamic striping, RAID-0)
MTTDL = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL = MTBF2 / (N \* (N-1) \* MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL = MTBF3 / (N \* (N-1) \* (N-2) \* MTTR2)
You can often get MTBF data from your drive vendor and you can measure or estimate your MTTR with reasonable accuracy. But MTTDL does not consider the Unrecoverable Error Rate (UER) for read operations on disk drives. It turns out that the UER is often easier to get from the disk drive data sheets, because sometimes the drive vendors don't list MTBF (or Annual Failure Rate, AFR) for all of their drive models. Typically, UER will be 1 per 1014 bits read for consumer class drives and 1 per 1015 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk. Worse yet, if we have the data protected with RAID and we have to replace a drive, we really do hope that the data reconstruction completes correctly. To add to our nightmare, the UER does not decrease by adding disks. If we can't rely on the data to be correctly read, we can't be sure that our data reconstruction will succeed, and we'll have data loss. Clearly, we need a model which takes this into account. Let's call that model #2, for MTTDL:
First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted):
Precon_fail = (N-1) \* size / UER
For single-disk failure protection:
MTTDL = MTBF / (N \* Precon_fail)
For double-disk failure protection:
MTTDL = MTBF2/ (N \* (N-1) \* MTTR \* Precon_fail)
Comparing the MTTDL model to the MTTDL model shows some interesting aspects of design. First, there is no MTTDL model for RAID-0 because there is no data reconstruction – any failure and you lose data. Second, the MTTR doesn't enter into the MTTDL model until you get to double-disk failure scenarios. You could nit pick about this, but as you'll soon see, it really doesn't make any difference for our design decision process. Third, you can see that the Precon_fail is a function of the size of the data set. This is because the UER doesn't change as you grow the data set. Or, to look at it from a different direction, if you use consumer class drives with 1 UER for 1014 bits, and you have 12.5 TBytes of data, the probability of an unrecoverable read during the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the MTTDL model looks a lot like the RAID-0 model and friends don't let friends use RAID-0! Maybe you could consider a smaller sized data set to offset this risk. Let's see how that looks in pictures.
2-way mirroring is an example of a configuration which provides single-disk failure protection. Each data point represents the space available when using 2-way mirrors in a zpool. Since this is for a X4500, we consider 46 total available disks and any disks not used for data are available as spares. In this graph you can clearly see that the MTTDL model encourages the use of hot spares. More importantly, although the results of the calculations of the two models are around 5 orders of magnitude different, the overall shape of the curve remains the same. Keep in mind that we are talking years here, perhaps 10 million years, which is well beyond the 5-year expected life span of a disk. This is the nature of the beast when using a constant MTBF. For models which consider the change in MTBF as the device ages, you should never see such large numbers. But the wish for more accurate models does not change the relative merits of the design decision, which is what we really care about – the best RAID configuration given a bunch of disks. Should I use single disk failure protection or double disk failure protection? To answer that, lets look at the model for raidz2.
From this graph you can see that double disk protection is clearly better than single disk protection above, regardless of which model we choose. Good, this makes sense. You can also see that with raidz2 we have a larger number of disk configuration options. A 3-disk raidz2 set is somewhat similar to a 3-way mirror with the best MTTDL, but doesn't offer much available space. A 4-disk set will offer better space, but not quite as good MTTDL. This pattern continues through 8 disks/set. Judging from the graphs, you should see that a 3-disk set will offer approximately an order of magnitude better MTTDL than an 8-disk, for either MTTDL model. This is because the UER remains constant while the data to be reconstructed increases.
I hope that these models give you an insight into how you can model systems for RAS. In my experience, most people get all jazzed up with the space and forget that they are often making a space vs. RAS trade-off. You can use these models to help you make good design decisions when configuring RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.
Just one more teaser... there are other MTTDL models, but it is unclear if they would help make better decisions, and I'll explore those in another blog.