Using MTBF and Time Dependent Reliability for disks

Some of the papers from FAST07 are still generating buzz.  David Morgenstern has just written another interesting article on hard disk MTBF. In the article he mentions the time dependent reliability (TDR) work we do at Sun.  In this blog, I'll share my perspective on the topic and how to best use the data you can get to design better systems.

MTBF is a constant source of confusion. If MTBF were well understood and easily communicated, Morgenstern wouldn't need to write articles about it. MTBF is a summary metric, which hides many important details. For example, data collected for the years 1996-1998 in the US showed that the annual death rate for children aged 5-14 was 20.8 per 100,000 resident population. This is an average failure rate of 0.0208% per year. Taking the reciprocal, the MTBF for children aged 5-14 in the US is approximately 4,808 years. Clearly, no human child could be expected to live 5,000 years. Similarly, if a vendor says that the disk MTBF is 1 Million hours (114 years), you cannot expect a disk to last that long. In fact, 114 years ago, disk drives did not exist. Yet, in a statistical analysis, it is quite possible for a disk vendor to see a field failure rate corresponding to 114 years MTBF.

Sometimes, you'll see a disk manufacturer showing the MTBF as an Annual Failure Rate (AFR) percentage. A year has 8,760 hours, so 1 Million hours MTBF = 8,760 / 1,000,000 = 0.88% AFR. Personally, I find annual failure rates to be easier to use when setting service expectations. A 0.88% AFR means that I could expect to replace a bit less than 1% of my disks per year, over the expected lifetime of the disks.
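The arithmetic above is easy to script. Here is a minimal sketch in Python, assuming a constant failure rate and an 8,760-hour year (no leap years). The function names and the 1,000-disk fleet are made up for illustration; they are not from any vendor tool.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def afr_from_mtbf_hours(mtbf_hours):
    """Annual failure rate (as a fraction) for a given MTBF in hours.

    Uses the simple linear approximation AFR = hours per year / MTBF,
    which assumes a constant failure rate.
    """
    return HOURS_PER_YEAR / mtbf_hours


def mtbf_years_from_annual_rate(annual_rate):
    """MTBF in years for a given annual failure rate (as a fraction)."""
    return 1.0 / annual_rate


def expected_replacements_per_year(disk_count, mtbf_hours):
    """Average number of disks replaced per year in a fleet."""
    return disk_count * afr_from_mtbf_hours(mtbf_hours)


# Data-sheet disk: 1 Million hours MTBF
print(f"AFR: {afr_from_mtbf_hours(1_000_000):.2%}")

# Children aged 5-14: 20.8 deaths per 100,000 per year
print(f"MTBF: {mtbf_years_from_annual_rate(20.8 / 100_000):.0f} years")

# Hypothetical fleet of 1,000 disks
print(f"Replacements: {expected_replacements_per_year(1000, 1_000_000):.2f} per year")
```

The fleet number is the one I find most useful in practice: it turns an abstract MTBF into how many spares to keep on the shelf.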

Is 1% AFR good enough? Some of the FAST07 papers showed measured disk AFR in the 4-6% range. Is 4-6% AFR good enough? This sort of question is difficult to answer. We'd like the AFR to be 0%, but that is unrealistic. We would also expect failure rates to increase as disks go beyond their useful life. Recall that no child will live to 4,800 years: for humans, the death rate increases dramatically as we approach 100 years, and very few people have lived past 110 years. Similarly, many disks are rated for a 5-year life span, after which you can expect the failure rate to increase. What we really want to know is, "is the reliability getting better (a good thing) or worse (a bad thing)?" At Sun, we use time dependent reliability (TDR) analysis to track the field reliability of our products. This allows us to see changes in reliability as we implement changes in processes, procedures, or products. If a product begins to show a worsening in field reliability, we dig into the data to see why, and fix the problem. Be sure to check out the TDR presentation and white papers for details; it is a very good method to implement in your own processes.

When is MTBF useful? IMHO, the problems with MTBF for determining service rates are not a good reason for tossing it out entirely. MTBF (or AFR) is often the only reliability metric which is available for hardware from a wide variety of vendors. But rather than worrying about the magnitude of the number, examine the relative relationships between the parts. For example, if you need a disk drive and one model has a data-sheet MTBF of 1 Million hours and another has a data-sheet MTBF of 1.6 Million hours, it is very likely that the 1.6 Million hour drive will be more reliable than the 1 Million hour drive.

OK, so that was a no-brainer analysis. Real life is never that easy. Suppose you've been stung by a failed disk and subsequently resolved to never get stung again by using RAID. For 2-disk mirrors, the simple MTBF analysis still makes sense: 2 disks with 1.6 Million hours MTBF will be more reliable than 2 disks with 1 Million hours MTBF. But the analysis becomes more difficult as you add disks (and possible configurations) and begin to consider deferred maintenance (to save on service costs). Once you start asking these questions, the simple MTBF of the resulting system becomes less interesting and other reliability measurements become more interesting. For example, you will often hear us talk about Mean Time Between System Interruption (MTBSI) or Mean Time To Data Loss (MTTDL), which are more representative of the pain associated with service failures in redundant systems. However, in order to calculate MTBSI or MTTDL, we need some measurement of the reliability of the individual components, and that measurement is MTBF. When we compare two different system design scenarios, the MTBSI or MTTDL calculations show the differences between the designs as a relative measure built from the component MTBFs. This is very useful for making system design decisions.
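To make the relative comparison concrete, here is a rough sketch in Python. It uses a common first-order approximation for a 2-disk mirror, MTTDL = MTBF^2 / (2 * MTTR), which assumes independent failures at a constant rate and a repair time much shorter than the MTBF. The model choice and the 24-hour MTTR are my assumptions for illustration, not measured values, and a real analysis would use a more detailed model.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mttdl_two_way_mirror(mtbf_hours, mttr_hours):
    """First-order MTTDL (in hours) for a 2-disk mirror.

    Data is lost only if the second disk fails while the first is
    still being replaced and resynchronized (the MTTR window).
    """
    return mtbf_hours ** 2 / (2 * mttr_hours)


MTTR_HOURS = 24  # assumed time to replace and resync a failed disk

mttdl_a = mttdl_two_way_mirror(1_000_000, MTTR_HOURS)  # 1 Million hour disks
mttdl_b = mttdl_two_way_mirror(1_600_000, MTTR_HOURS)  # 1.6 Million hour disks
ratio = mttdl_b / mttdl_a

print(f"Mirror A MTTDL: {mttdl_a / HOURS_PER_YEAR:,.0f} years")
print(f"Mirror B MTTDL: {mttdl_b / HOURS_PER_YEAR:,.0f} years")
print(f"B is about {ratio:.2f}x better than A")
```

As with disk MTBF, don't read the absolute MTTDL values as lifetimes. The interesting number is the ratio: under this model the MTBF is squared, so a 1.6x better disk yields roughly a 2.56x better mirror, and the ratio does not depend on the MTTR you assume.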

In conclusion, MTBF will remain misunderstood for a long time.  But MTBF and redundancy analyses using MTBF are still useful for comparing system design trade-offs.



