Thursday Apr 05, 2007

ZFS ported to FreeBSD!

The FreeBSD team has added ZFS to the FreeBSD-7.0 release!  This is excellent news and all of us are happy to share with the FreeBSD community.  Pawel Jakub Dawidek has posted this note to the FreeBSD and ZFS community. This will greatly expand the use of ZFS and will no doubt lead to more innovative developments in the community.  Well done!

Tuesday Mar 20, 2007

Using MTBF and Time Dependent Reliability for disks

Some of the papers from FAST07 are still generating buzz.  David Morgenstern has just written another interesting article on hard disk MTBF. In the article he mentions the time dependent reliability (TDR) work we do at Sun.  In this blog, I'll share my perspective on the topic and how to best use the data you can get to design better systems.

MTBF is a constant source of confusion. Empirically, if MTBF were well understood and easily communicated, then Morgenstern wouldn't need to write articles about it. MTBF is a summary metric, which hides many important details. For example, data collected for the years 1996-1998 in the US showed that the annual death rate for children aged 5-14 was 20.8 per 100,000 resident population. This shows an average failure rate of 0.0208% per year. Thus, the MTBF for children aged 5-14 in the US is approximately 4,807 years. Clearly, no human child could be expected to live 5,000 years. Similarly, if a vendor says that the disk MTBF is 1 Million hours (114 years), you cannot expect a disk to last that long. In fact, 114 years ago, disk drives did not exist. Yet, in a statistical analysis, it is quite possible for a disk vendor to see a field failure rate corresponding to 114 years MTBF. Sometimes, you'll see a disk manufacturer showing the MTBF as an Annual Failure Rate (AFR) percentage. 1 Million hours MTBF = 0.88% AFR. Personally, I find annual failure rates to be easier to use when setting service expectations. A 0.88% AFR means that I could expect to replace about 1% of my disks per year, for the expected lifetime of the disk.
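If you want to check the arithmetic yourself, here is a minimal sketch of the conversions used above. It assumes a constant failure rate and an 8,760-hour year; the function names are mine, for illustration only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_hours_to_afr(mtbf_hours):
    """Approximate annual failure rate (as a fraction), assuming a constant failure rate."""
    return HOURS_PER_YEAR / mtbf_hours


def annual_rate_to_mtbf_years(failures_per_unit_per_year):
    """MTBF in years implied by an average annual failure (or death) rate."""
    return 1.0 / failures_per_unit_per_year


# Disk data sheet example: 1 Million hours MTBF
print(f"{1_000_000 / HOURS_PER_YEAR:.0f} years")        # 114 years
print(f"{mtbf_hours_to_afr(1_000_000):.2%} AFR")        # 0.88% AFR

# Children aged 5-14: 20.8 deaths per 100,000 per year, roughly 4,800 years
print(f"{annual_rate_to_mtbf_years(20.8 / 100_000):.0f} years")
```

The same two lines of arithmetic turn any data-sheet MTBF into a replacement rate you can plan service around.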

Is 1% AFR good enough? Some of the FAST07 papers showed measured disk AFR in the 4-6% range. Is 4-6% AFR good enough? This sort of question is difficult to answer. We'd like the AFR to be 0%, but that is unrealistic. We would also expect that as the disks go beyond their useful life we should see an increase in failure rates. Recall that no child will live to 4,800 years -- for humans, the death rate increases dramatically as we approach 100 years and very, very few people have lived past 110 years. Similarly, many disks are rated for a 5-year life span, after which you can expect the failure rate to increase. What we really want to know is, "is the reliability getting better (a good thing) or worse (a bad thing)?" At Sun, we use time dependent reliability (TDR) analysis to track the field reliability of our products. This allows us to see changes in reliability as we implement changes in processes, procedures, or products. If a product begins to show a worsening in field reliability, we dig into the data to see why, and fix the problem. Be sure to check out the TDR presentation and white papers for details; it is a very good method to implement in your processes.

When is MTBF useful? IMHO, the problems with MTBF for determining service rates are not a good reason for tossing it out entirely. MTBF (or AFR) is often the only reliability metric which is available for hardware from a wide variety of vendors. But rather than worrying about the magnitude of the number, examine the relative relationships between the parts. For example, if you need a disk drive and one model has a data-sheet MTBF of 1 Million hours and another has a data-sheet MTBF of 1.6 Million hours, it is very likely that the 1.6 Million hour drive will be more reliable than the 1 Million hour drive.

OK, so that was a no-brainer analysis.  Real life is never that easy.  Suppose you've been stung by a failed disk and subsequently resolved to never get stung again by using RAID. For 2-disk mirrors, the simple MTBF analysis still makes sense: 2 disks with 1.6 Million hours MTBF will be more reliable than 2 disks with 1 Million hours MTBF. But this begins to become more difficult as you add disks (and possible configurations) and begin to consider deferred maintenance (to save on service costs). Once you start asking these questions, the simple MTBF of the resulting system becomes increasingly less interesting and other reliability measurements become more interesting. For example, you will often hear us talk about Mean Time Between System Interruption (MTBSI) or Mean Time To Data Loss (MTTDL), which are more representative of the pain associated with service failures in redundant systems. However, in order to calculate MTBSI or MTTDL, we need to have some measurement of the reliability of the individual components -- MTBF. When we compare two different system design scenarios, the MTBSI or MTTDL calculations will show the differences in design as a relative measure of the components. This is very useful for making system design decisions.

In conclusion, MTBF will remain misunderstood for a long time.  But MTBF and redundancy analyses using MTBF are still useful for comparing system design trade-offs.



Wednesday Jan 31, 2007

ZFS RAID recommendations: space, performance, and MTTDL all-in

Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using.  Here it is:

All-in graph 

This shows the data I've previously blogged about in scale. You can easily see that for MTTDL, double parity protection is better than single parity protection which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for MTTDL and small, random read iops. I call this the "all-in" slide because, in a sense, it puts everything in one pot.

While this sort of analysis is useful, the problem is that there are more dimensions of the problem. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph.  I'll try my best...

Tuesday Jan 30, 2007

ZFS RAID recommendations: space, performance, and MTTDL

In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.

The best thing about a model is that it is a simplification of real life.
The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance model. The calculations for the model can be made with data which is readily available from disk data sheets. We calculate the expected I/O operations per second (iops) based on the average read seek and rotational speed of the disk. We don't consider the command overhead, as it is generally small for modern drives and is not always specified in disk data sheets.

maximum rotational latency = 60,000 (ms/min) / rotational speed (rpm)

iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))

Since most disks use consistent rotational speeds, this small table may help you to see what the rotational speed contribution will be.

Rotational Speed (rpm)    Maximum Rotational Latency (ms)
7,200                     8.3
10,000                    6.0

For example, if we have a 73 GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms average read seek and rotational speed of 10,000 rpm:

iops = 1000 / (4.1 + (6.0 / 2)) = 140.8

By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive which has a 8.5 ms average read seek and rotational speed of 7,200 rpm:

iops = 1000 / (8.5 + (8.3 / 2)) = 79.0

I purposely used those two examples because people are always wondering why we tend to prefer smaller, faster, and (unfortunately) more expensive drives over larger, slower, less expensive drives - a 78% performance improvement is rather significant. The 3.5" drives also use about 25-75% more power than their smaller cousins, largely due to the rotating mass. Small is beautiful in a SWaP sense.
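The two formulas above are easy to wrap in a few lines of code. This is only a sketch of the model as described, not a measurement tool; the function names and the list of spindle speeds in the loop are my own choices for illustration.

```python
def max_rotational_latency_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000 / rpm


def small_random_read_iops(avg_read_seek_ms, rpm):
    """Expected small, random read iops: one seek plus half a revolution per I/O."""
    return 1000 / (avg_read_seek_ms + max_rotational_latency_ms(rpm) / 2)


# Illustrative spindle speeds (an assumption, not taken from a data sheet)
for rpm in (5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm: {max_rotational_latency_ms(rpm):.1f} ms per revolution")

savvio = small_random_read_iops(4.1, 10_000)     # 2.5" SAS drive from the example
barracuda = small_random_read_iops(8.5, 7_200)   # 3.5" SATA drive from the example
print(f"{savvio:.1f} iops")     # 140.8
print(f"{barracuda:.1f} iops")  # 78.9
print(f"{savvio / barracuda - 1:.0%} faster")
```

The SATA result comes out as 78.9 rather than 79.0 because the code carries the unrounded 8.33 ms latency instead of the 8.3 ms used in the hand calculation.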

Next we use the RAID set configuration information to calculate the total small, random read iops for the zpool or volume. Here we need to talk about sets of disks which may make up a multi-level zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of mirrored sets (RAID-1). RAID-0 is a stripe of disks.

  • For dynamic striping (RAID-0), add the iops for each set or disk. On average the iops are spread randomly across all sets or disks, gaining concurrency.

  • For mirroring (RAID-1), add the iops for each set or disk. For reads, any set or disk can satisfy a read, so we also get concurrency.

  • For single parity raidz (RAID-5), the set operates at the performance of one disk. See below.

  • For double parity raidz2 (RAID-6), the set operates at the performance of one disk. See below.

For example, if you have 6 disks, then there are many different ways you can configure them, with varying performance calculations:

RAID Configuration (6 disks)                     Small, Random Read Performance Relative to a Single Disk
6-disk dynamic stripe (RAID-0)                   6
3-set dynamic stripe, 2-way mirror (RAID-1+0)    6
2-set dynamic stripe, 3-way mirror (RAID-1+0)    6
6-disk raidz (RAID-5)                            1
2-set dynamic stripe, 3-disk raidz (RAID-5+0)    2
2-way mirror, 3-disk raidz (RAID-5+1)            2
6-disk raidz2 (RAID-6)                           1


Clearly, using mirrors improves both performance and data reliability. Using stripes increases performance, at the cost of data reliability. raidz and raidz2 offer data reliability, at the cost of performance. This leads us down a rathole...
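The four combining rules listed earlier can be applied mechanically to any nested layout. Here is a small sketch that does so for the 6-disk configurations; the helper names are hypothetical and each returns performance relative to a single disk.

```python
def stripe(*members):
    """Dynamic stripe (RAID-0): add the iops of each member set or disk."""
    return sum(members)


def mirror(*members):
    """Mirror (RAID-1): any side can satisfy a read, so add the iops."""
    return sum(members)


def raidz(n_disks):
    """Single parity raidz: the set reads like one disk, whatever its width."""
    return 1


def raidz2(n_disks):
    """Double parity raidz2: the set reads like one disk, whatever its width."""
    return 1


DISK = 1  # a single disk is the unit of performance

configs = {
    "6-disk dynamic stripe (RAID-0)": stripe(*[DISK] * 6),
    "3-set dynamic stripe, 2-way mirror (RAID-1+0)": stripe(*[mirror(DISK, DISK)] * 3),
    "2-set dynamic stripe, 3-way mirror (RAID-1+0)": stripe(*[mirror(DISK, DISK, DISK)] * 2),
    "6-disk raidz (RAID-5)": raidz(6),
    "2-set dynamic stripe, 3-disk raidz (RAID-5+0)": stripe(raidz(3), raidz(3)),
    "2-way mirror, 3-disk raidz (RAID-5+1)": mirror(raidz(3), raidz(3)),
    "6-disk raidz2 (RAID-6)": raidz2(6),
}

for name, relative_iops in configs.items():
    print(f"{relative_iops}x  {name}")
```

Multiply the relative figure by the single-disk iops from the data-sheet model to get an estimate for the whole zpool.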

The Parity Performance Rathole

Many people expect that data protection schemes based on parity, such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance of striped volumes, except for the parity disk. In other words, they expect that a 6-disk raidz zpool would have the same small, random read performance as a 5-disk dynamic stripe. Similarly, they expect that a 6-disk raidz2 zpool would have the same performance as a 4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a checksum to validate the contents of a block of data written. The block is spread across the disks (vdevs) in the set. In order to validate the checksum, ZFS must read the blocks from more than one disk, thus not taking advantage of spreading unrelated, random reads concurrently across the disks. In other words, the small, random read performance of a raidz or raidz2 set is, essentially, the same as the single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.

Many people also think that this is a design deficiency. As a RAS guy, I value the data validation offered by the checksum over the performance supposedly gained by RAID-5. Reasonable people can disagree, but perhaps some day a clever person will solve this for ZFS.

So, what do other logical volume managers or RAID arrays do? The results seem mixed. I have seen some RAID array performance characterization data which is very similar to the ZFS performance for parity sets. I have heard anecdotes that other implementations will read the blocks and only reconstruct a failed block as needed. The problem is, how do such systems know that a block has failed? Anecdotally, it seems that some of them trust what is read from the disk. To implement a per-disk block checksum verification, you'd still have to perform at least two reads from different disks, so it seems to me that you are trading off data integrity for performance. In ZFS, data integrity is paramount. Perhaps there is more room for research here, or perhaps it is just one of those engineering trade-offs that we must live with.

Other Performance Models

I'm also looking for other performance models which can be applied to generic disks with data that is readily available to the public. The reason that the small, random read iops model works is that it doesn't need to consider caching or channel resource utilization. Adding these variables would require some knowledge of the configuration topology and the cache policies (which may also change with firmware updates.) I've kicked around the idea of a total disk bandwidth model which will describe a range of possible bandwidths based upon the media speed of the drives, but it is not clear to me that it will offer any satisfaction. Drop me a line if you have a good model or further thoughts on this topic.

You should be cautious about extrapolating the performance results described here to other workloads. You could consider this to be a worst-case model because it assumes 0% disk cache hits. I would hope that most workloads exhibit better performance, but rather than guessing (hoping) the best way to find out is to run the workload and measure the performance. If you characterize a number of different configurations, then you might build your own performance graphs which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID disk configurations by evaluating space, performance, and MTTDL. First, let's look at single parity schemes such as 2-way mirrors and raidz on the Sun Fire X4500 (aka Thumper) server.

Single Parity Model Results 

Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better performance and MTTDL than raidz for any specific space requirement except for the case where we run out of hot spares for the 2-way mirror (using all 46 disks for data). By contrast, all of the raidz configurations here have hot spares. You can use this to help make design trade-offs by prioritizing space, performance, and MTTDL.

You'll also note that I did not label the left-side Y axis (MTTDL) again, but I did label the right-side Y axis (small, random read iops). I did this with mixed emotion. I didn't label the MTTDL axis values as I explained previously. But I did label the performance axis so that you can do a rough comparison to the double parity graph below. Note that in the double parity graph, the MTTDL axis is in units of Millions of years, instead of years above.

Double Parity Model Results

Here you can see the same sort of comparison between 3-way mirrors and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.

Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place.  If you want to be happier, you should use mirroring with at least one hot spare.


We can make design trade-offs between space, performance, and MTTDL for disk storage systems. As with most engineering decisions, there often is not a clear best solution given all of the possible solutions. By using some simple models, we can see the trade-offs more clearly.

Wednesday Jan 17, 2007

A story of two MTTDL models

Mean Time to Data Loss (MTTDL) is a metric that we find useful for comparing data storage systems. I think it is particularly useful for determining what sort of data protection you may want to use for a RAID system. For example, suppose you have a Sun Fire X4500 (aka Thumper) server with 48 internal disk drives. What would be the best way to configure the disks for redundancy? Previously, I explored space versus MTTDL and space versus unscheduled Mean Time Between System Interruptions (U_MTBSI) for the X4500 running ZFS. The same analysis works for SVM or LVM, too.

For this blog, I want to explore the calculation of MTTDL for a bunch of disks. It turns out, there are multiple models for calculating MTTDL. The one described previously here is the simplest and only considers the Mean Time Between Failure (MTBF) of a disk and the Mean Time to Repair (MTTR) of the repair and reconstruction process. I'll call that model #1 which solves for MTTDL[1]. To quickly recap:

For non-protected schemes (dynamic striping, RAID-0):
MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
You can often get MTBF data from your drive vendor and you can measure or estimate your MTTR with reasonable accuracy. But MTTDL[1] does not consider the Unrecoverable Error Rate (UER) for read operations on disk drives. It turns out that the UER is often easier to get from the disk drive data sheets, because sometimes the drive vendors don't list MTBF (or Annual Failure Rate, AFR) for all of their drive models. Typically, UER will be 1 per 10^14 bits read for consumer class drives and 1 per 10^15 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk. Worse yet, if we have the data protected with RAID and we have to replace a drive, we really do hope that the data reconstruction completes correctly. To add to our nightmare, the UER does not decrease by adding disks. If we can't rely on the data to be correctly read, we can't be sure that our data reconstruction will succeed, and we'll have data loss. Clearly, we need a model which takes this into account. Let's call that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted):
Precon_fail = (N-1) * size / UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N * Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
Comparing the MTTDL[1] model to the MTTDL[2] model shows some interesting aspects of design. First, there is no MTTDL[2] model for RAID-0 because there is no data reconstruction – any failure and you lose data. Second, the MTTR doesn't enter into the MTTDL[2] model until you get to double-disk failure scenarios. You could nit pick about this, but as you'll soon see, it really doesn't make any difference for our design decision process. Third, you can see that the Precon_fail is a function of the size of the data set. This is because the UER doesn't change as you grow the data set. Or, to look at it from a different direction, if you use consumer class drives with 1 UER for 10^14 bits, and you have 12.5 TBytes of data, the probability of an unrecoverable read during the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the MTTDL[2] model looks a lot like the RAID-0 model and friends don't let friends use RAID-0! Maybe you could consider a smaller sized data set to offset this risk. Let's see how that looks in pictures.
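For those who would rather compute than squint at graphs, here is a rough sketch of model #2 with the unit conversion filled in (sizes in bytes, UER in bits). Capping the probability at 1 is my own addition so the function behaves sensibly for very large sets, and the 24-hour MTTR is a made-up value for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def p_recon_fail(n_disks, disk_bytes, uer_bits):
    """Precon_fail = (N-1) * size / UER, with bytes converted to bits.

    The cap at 1.0 is an assumption added here; the formula is otherwise as given.
    """
    bits_to_read = (n_disks - 1) * disk_bytes * 8
    return min(1.0, bits_to_read / uer_bits)


def mttdl2_single(mtbf_hours, n_disks, p_fail):
    """MTTDL[2] for single-disk failure protection, in hours."""
    return mtbf_hours / (n_disks * p_fail)


def mttdl2_double(mtbf_hours, mttr_hours, n_disks, p_fail):
    """MTTDL[2] for double-disk failure protection, in hours."""
    return mtbf_hours**2 / (n_disks * (n_disks - 1) * mttr_hours * p_fail)


# 2-way mirror of 750 GByte consumer class drives (UER of 1 per 10^14 bits)
p = p_recon_fail(2, 750e9, 1e14)
print(f"Precon_fail = {p:.2f}")
print(f"MTTDL[2] = {mttdl2_single(1_000_000, 2, p) / HOURS_PER_YEAR:,.0f} years")

# 12.5 TBytes to read at the same UER: reconstruction failure is certain
print(p_recon_fail(2, 12.5e12, 1e14))  # 1.0

# 3-disk raidz2 set, hypothetical 24-hour MTTR
print(f"{mttdl2_double(1_000_000, 24, 3, p_recon_fail(3, 750e9, 1e14)):.3e} hours")
```

Try shrinking the disk size or switching to a UER of 1 per 10^15 bits to see how strongly Precon_fail drives the result.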

 MTTDL models for 2-way mirror

2-way mirroring is an example of a configuration which provides single-disk failure protection. Each data point represents the space available when using 2-way mirrors in a zpool. Since this is for an X4500, we consider 46 total available disks and any disks not used for data are available as spares. In this graph you can clearly see that the MTTDL[1] model encourages the use of hot spares. More importantly, although the results of the calculations of the two models are around 5 orders of magnitude different, the overall shape of the curve remains the same. Keep in mind that we are talking years here, perhaps 10 million years, which is well beyond the 5-year expected life span of a disk. This is the nature of the beast when using a constant MTBF. For models which consider the change in MTBF as the device ages, you should never see such large numbers. But the wish for more accurate models does not change the relative merits of the design decision, which is what we really care about – the best RAID configuration given a bunch of disks. Should I use single disk failure protection or double disk failure protection? To answer that, let's look at the model for raidz2.

MTTDL models for raidz2 

From this graph you can see that double disk protection is clearly better than single disk protection above, regardless of which model we choose. Good, this makes sense. You can also see that with raidz2 we have a larger number of disk configuration options. A 3-disk raidz2 set is somewhat similar to a 3-way mirror with the best MTTDL, but doesn't offer much available space. A 4-disk set will offer better space, but not quite as good MTTDL. This pattern continues through 8 disks/set. Judging from the graphs, you should see that a 3-disk set will offer approximately an order of magnitude better MTTDL than an 8-disk, for either MTTDL model. This is because the UER remains constant while the data to be reconstructed increases.
I hope that these models give you an insight into how you can model systems for RAS. In my experience, most people get all jazzed up with the space and forget that they are often making a space vs. RAS trade-off. You can use these models to help you make good design decisions when configuring RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.

Just one more teaser... there are other MTTDL models, but it is unclear if they would help make better decisions, and I'll explore those in another blog.

Friday Jan 12, 2007

ZFS RAID recommendations: space vs U_MTBSI

Some people get all wrapped around the axle worrying about disk controllers.  These same people often criticize innovative products like the Sun Fire X4500 (aka Thumper) server because it only has 6 SATA controllers for 48 disk drives (8 disks/controller). In this blog, I'll take a look at the various possible RAID configurations for an X4500 and see how they affect the Unscheduled Mean Time Between System Interruptions (U_MTBSI).

Get off the bus

I think this overabundance of worry originated years ago, when controllers were very expensive and the available I/O slots for a computer were quite limited. If you look at the old-school technologies such as parallel SCSI or IDE interfaces, which were buses, then the concerns were valid. Indeed, if you look at any buses, you'd see many of the same opportunities for problems: S100, VME, DBus, etc. What we fear the most about a parallel bus is that some device will grab ahold of it and not let go, thus wedging everything on the bus. For storage devices, this could mean that a single hardware fault could disrupt access to data and its redundancy, causing an outage. We will often describe this as a single fault zone. The diversity concept encourages distributing the redundant components across different fault zones. The pocketbook concept places the ultimate limits on how many components are available, though.

Another place where bus designs limit our choices is in performance. In general, only one device can be talking on a bus at any given time. So, if you have a bunch of devices sharing a bus, then you could have a performance bottleneck to go with your single fault zone. The obvious way to avoid this is to not use buses. (I usually take this opportunity to dis' fibre channel, but I'll spare you this time :-)  Today, we have many opportunities to replace buses with point-to-point technologies: parallel SCSI replaced by serial attached SCSI (SAS), IDE/ATA replaced by Serial ATA (SATA), DBus replaced by Safari, front side bus (FSB) replaced by HyperTransport, Ethernet hubs replaced by Ethernet switches, etc.

From a RAS perspective, point-to-point technologies are very cool, largely because there are more fault zones, one per pair, but any single fault zone only affects one pair. This has numerous advantages because we can build highly reliable protocols into the links and not have to deal with sharing. Basically, if I have only two devices, then I can construct the link such that each device logically has one transmit and one receive interface.  Simple. Simple is good. Simple allows us to do things like automatically know when a device is on the other end of the link (we hear something), as opposed to a shared bus where you have to place a request for an address on the bus and hope that a device responds, which it can't if the bus is wedged by some other failed device. In other parts of the system, going point-to-point has allowed even more RAS improvements. For example, we use point-to-point connections between the system boards in a Sun Fire E25K server, which is why we can implement dynamic reconfiguration.

We also gain from Moore's Law. A curious thing happens when you integrate more functions onto a single chip -- the failure rate tends to remain more or less constant. In other words, if you take the functions performed by 4 different chips and integrate them onto one chip, then you get a 4:1 parts count reduction, the per-part failure rate stays roughly the same, so you get a 4x increase in reliability. Putting this together, consider replacing 4 parallel SCSI buses, each with a single controller and drive (4 fault zones), with a single 4-port SAS controller. At first glance you'd say that you replaced 4 fault zones with one fault zone. But in order to analyze such a system, you must take into account the reliability of the components. The single SAS controller will have approximately the same failure rate as each of the parallel SCSI controllers, so we get a 4x increase in controller reliability. Now, the answer to the question of which is better isn't so clear. And the math can get rather complicated. Naturally, we developed a tool to perform such analyses, RASCAD.
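To put a number on the parts-count argument, here is a back-of-the-envelope sketch. It treats the controllers as a series system with constant failure rates, so the failure rates simply add. The 2 Million hour controller MTBF is a made-up figure for illustration, and this is far simpler than what RASCAD does.

```python
def combined_mtbf(part_mtbfs):
    """MTBF of a group of parts where any single failure counts.

    Assumes constant failure rates, so the rates (1/MTBF) add.
    """
    return 1 / sum(1 / mtbf for mtbf in part_mtbfs)


CONTROLLER_MTBF = 2_000_000  # hours; hypothetical value, not from a data sheet

four_scsi_controllers = combined_mtbf([CONTROLLER_MTBF] * 4)
one_sas_controller = combined_mtbf([CONTROLLER_MTBF])

print(f"4 parallel SCSI controllers: {four_scsi_controllers:,.0f} hours")
print(f"1 four-port SAS controller:  {one_sas_controller:,.0f} hours")
print(f"improvement: {one_sas_controller / four_scsi_controllers:.0f}x")
```

This only counts how often some controller fails. It says nothing about how much of the system each failure takes down, which is exactly why the full answer needs a fault-zone-aware model.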


RASCAD is an industry-leading tool we developed and use at Sun to easily answer design questions regarding reliability, availability, and serviceability (RAS) of complex systems. For the computationally intrigued, we build hierarchical Markov models, very cool.

At Sun, we evaluate systems for their Mean Time Between System Interruptions (MTBSI). Very simply, a system interruption occurs when a component failure causes a system interruption (reboot). This includes service events to repair the component. For example, if a component fails and the system reboots in a degraded state and repairing the component requires another reboot, then the failure causes two interruptions (e.g. CPUs). Obviously, some components can often be repaired without causing a second service outage (e.g. disks). MTBSI does include a serviceability component, which can be mitigated using planned outages and other processes. So, we often will try to stick with the unscheduled outages, U_MTBSI, which is an indicator of pain. Pain is not good, so we try to increase the time between painful events, U_MTBSI, whenever possible.

Analysis of X4500 and ZFS U_MTBSI

Previously, I discussed space versus MTTDL for an X4500. Given the 6 SATA controller configuration of the 48 available disks, how is the U_MTBSI affected by the various possible RAID configurations in ZFS? A dandy question. I took the same configurations and computed the U_MTBSI under the condition that I would strive for the best possible controller diversity. This gets a little tricky when you have only 6 controllers. For example, if you have a 7-disk raidz set, then at least two of the disks will share a controller. If you had a 6-disk raidz set, then you could place each drive in the set on a different controller. For the X4500, this gets a little more difficult because the two drives that you can boot from are on the same controller (BIOS limitation?) Also, what about spares? What happens when a spare shares the same controller as a data disk? Should I mirror adjacent or in opposition? Will the sun rise again tomorrow?  Anyway, you can see where you can easily begin to get wrapped around the axle worrying about this stuff. Let's see what the analysis shows.
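The 6-disk versus 7-disk observation is just the pigeonhole principle. Here is a tiny sketch that assumes the best case, where the disks of a set are dealt round-robin across the 6 controllers; it ignores the boot-disk and spare complications mentioned above, and the function names are mine.

```python
from math import ceil


def worst_case_sharing(set_size, n_controllers=6):
    """Fewest disks of one RAID set that the busiest controller must hold."""
    return ceil(set_size / n_controllers)


def round_robin(set_size, n_controllers=6):
    """Best-case placement: disk i of the set goes on controller i mod n_controllers."""
    return [i % n_controllers for i in range(set_size)]


for size in (3, 6, 7, 8):
    print(f"{size}-disk set: controllers {round_robin(size)}, "
          f"at most {worst_case_sharing(size)} disk(s) per controller")
```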

Plot of space vs U_MTBSI

The first thing you notice is that the same statement I made last time still applies - friends don't let friends use RAID-0!

The second thing you'll notice is that the RAID types are clumping again. But the clumping is not as diverse in the U_MTBSI axis as you might expect. This is because the reliability of the controller is much higher than the reliability of a single disk.  If you think about it, you are worried about the reliability of one chip versus a pile of chips, motors, amplifiers, heads, media, connectors, wires, and other stuff.  Since the reliability of the controller is much higher than the reliability of the disks (controller MTBF >> disk MTBF) the disk reliability dominates. This is also why we do see a significant difference between the RAID types used.

The third thing you'll notice is that once again I haven't labeled the U_MTBSI axis. If I had labeled it, then it would be a rat hole opportunity with a high probability of entrance.  In this case, all of the components are identical, the only change is the RAID configuration. So you could even consider the results normalized to some value and gain the same insights.

The explanation I'll offer for why RAID-Z (RAID-5) is worse than mirroring (RAID-1) or double parity RAID-Z2 (RAID-6) is that the probability of two disks failing remains the same, but the probability of two failed disks causing an interruption is very different. I think this clearly shows what David Patterson alluded to in a presentation he gave at the Sun Technical Conference once: single parity RAID-5 just doesn't give enough protection, double parity (e.g. RAID-Z2) would have been a better choice. He also mentioned that people hassle him because they've lost data with RAID-5. Needless to say, I'm not a big fan of RAID-5.

You can see that controller diversity doesn't make as big an impact on the U_MTBSI as the RAID type. For example, the U_MTBSI for raidz with 5+1 (one column per controller) is not that much different than for 6+1 (more than one column per controller). Similarly, the use of hot spares doesn't seem to make a big difference. You can more easily see the advantage of hot sparing when you look at MTTDL or Mean Time Between Service (MTBS, more on that later...)

I hope that this view of the trade-off between RAS and space will help you make better design decisions. There are other trade-offs to be considered as well, so I encourage you to look at all of the requirements and see how they can be satisfied. And don't get all wrapped around the axle worried about SAS/SATA controller diversity. Be diverse if you can, but don't worry too much if you can't.

Thursday Jan 11, 2007

ZFS RAID recommendations: space vs MTTDL

It is not always obvious what the best RAID set configuration should be for a given set of disks. This is even more difficult to see as the number of disks grows large, like on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets.  Which would be best? One analysis is the trade-off between space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward.  Configure the RAID sets and sum the total space available.

The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0):

MTTDL = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF^2 / (N \* (N-1) \* MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF^3 / (N \* (N-1) \* (N-2) \* MTTR^2)

Where MTBF is the Mean Time Between Failures, MTTR is the Mean Time To Recover, and N is the number of disks in the RAID set. You can get MTBF values from disk data sheets, which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it.  You also need to consider the data reconstruction time, which may be a long time for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.
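If you would rather not use a spreadsheet, the equations are easy to sketch in Python. This is only a minimal sketch: the function names and the example MTBF, MTTR, and N values are assumptions I picked for illustration, and the unprotected case uses the usual MTBF / N form for a stripe.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_unprotected(mtbf, n):
    """No redundancy (dynamic striping, RAID-0): the first failure loses data."""
    return mtbf / n


def mttdl_single_parity(mtbf, n, mttr):
    """Single parity (2-way mirror, raidz, RAID-1, RAID-5)."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


def mttdl_double_parity(mtbf, n, mttr):
    """Double parity (3-way mirror, raidz2, RAID-6)."""
    return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)


if __name__ == "__main__":
    mtbf = 1_000_000.0  # hours, an assumed data sheet value
    mttr = 24.0         # hours, an assumed repair plus reconstruction time
    n = 6               # disks in the set, an assumed value
    results = [
        ("unprotected", mttdl_unprotected(mtbf, n)),
        ("single parity", mttdl_single_parity(mtbf, n, mttr)),
        ("double parity", mttdl_double_parity(mtbf, n, mttr)),
    ]
    for name, hours in results:
        print(f"{name:>14}: {hours / HOURS_PER_YEAR:,.0f} years")
```

All times are in hours, so MTBF and MTTR must use the same unit before you divide.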

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no context of the data and therefore have to reconstruct the entire disk rather than just reconstruct the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously.  In the worst case (100% used) ZFS will have to reconstruct the entire disk, just like a traditional LVM.  This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.
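To make the difference concrete, here is a simple linear model in Python. It is a sketch under loud assumptions: the function name, the 500 GByte disk, and the 50 MBytes/s reconstruction rate are hypothetical, and real reconstruction rates depend on workload and data layout.

```python
def reconstruction_hours(disk_gbytes, used_fraction, rate_mbytes_per_s,
                         data_aware=True):
    """Estimate reconstruction time with a simple linear model.

    data_aware=True copies only the used data (the ZFS case described above);
    data_aware=False copies the whole disk (the traditional LVM case).
    """
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    gbytes_to_copy = disk_gbytes * used_fraction if data_aware else disk_gbytes
    seconds = gbytes_to_copy * 1000.0 / rate_mbytes_per_s
    return seconds / 3600.0


if __name__ == "__main__":
    # Assumed values: a 500 GByte disk rebuilt at 50 MBytes/s.
    for used in (0.0, 0.25, 1.0):
        aware = reconstruction_hours(500, used, 50)
        whole = reconstruction_hours(500, used, 50, data_aware=False)
        print(f"{used:4.0%} used: data-aware {aware:.2f} h, whole disk {whole:.2f} h")
```

The result feeds directly into MTTR, which is why a mostly empty pool gets a better MTTDL in this model.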

Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume. 
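One way to read that note for RAID-1+0 is sketched below in Python: compute the single parity MTTDL of one 2-way mirror, then treat each mirror as a "disk" in an unprotected stripe. The function names and example numbers are assumptions for illustration, and the unprotected step uses the usual MTBF / N form.

```python
def mttdl_unprotected(mtbf, n):
    """Stripe of n members with no redundancy between them."""
    return mtbf / n


def mttdl_single_parity(mtbf, n, mttr):
    """Single parity set of n disks (here, a 2-way mirror)."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


def mttdl_striped_mirrors(mtbf, mttr, mirrors):
    """RAID-1+0: a stripe across `mirrors` 2-way mirrors."""
    per_mirror = mttdl_single_parity(mtbf, 2, mttr)
    # Each mirror now plays the role of one "disk" in the stripe.
    return mttdl_unprotected(per_mirror, mirrors)


if __name__ == "__main__":
    mtbf, mttr = 1_000_000.0, 24.0  # hours, assumed values
    for mirrors in (1, 11, 23):
        hours = mttdl_striped_mirrors(mtbf, mttr, mirrors)
        print(f"{mirrors:2d} mirrors: {hours:.3e} hours")
```

Adding mirrors to the stripe adds space, but the top-level MTTDL drops in proportion to the number of mirrors.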

So, I took a large number of possible ZFS configurations for an X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.

Figure: X4500 MTTDL vs Space

The most obvious conclusion from the above data is that you shouldn't use simple dynamic striping or RAID-0. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You may have noticed that the MTTDL axis uses a log scale, which should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is that they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL, because the expected lifetime of a disk is on the order of 5 years.  I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole.

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time where you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.
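The effect of a hot spare on MTTR, and so on MTTDL, can be sketched in a few lines of Python. The function names, the 40 hour logistical delay, and the 8 hour reconstruction time are hypothetical values chosen only to show the shape of the result.

```python
def mttdl_single_parity(mtbf, n, mttr):
    """Single parity (2-way mirror, raidz, RAID-1, RAID-5)."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


def mttr_hours(reconstruction_hours, logistical_hours, hot_spare):
    """With a hot spare, reconstruction starts without waiting for a repair visit."""
    if hot_spare:
        return reconstruction_hours
    return logistical_hours + reconstruction_hours


if __name__ == "__main__":
    mtbf = 1_000_000.0   # hours, assumed
    n = 6                # disks in the raidz set, assumed
    logistical = 40.0    # hours to notice and swap the disk, assumed
    reconstruct = 8.0    # hours to reconstruct the data, assumed
    without = mttdl_single_parity(mtbf, n, mttr_hours(reconstruct, logistical, False))
    with_spare = mttdl_single_parity(mtbf, n, mttr_hours(reconstruct, logistical, True))
    print(f"MTTDL improvement with a hot spare: {with_spare / without:.1f}x")
```

For single parity the improvement is simply the ratio of the two MTTR values; for double parity MTTR is squared in the denominator, so the same reduction counts twice.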

The most interesting result of this analysis is that the RAID configurations will tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk raidz zpool and that of a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500.  You could use either raidz or raidz2 with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than any of the other options for the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.


There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.


Wednesday Nov 16, 2005

A short ramdisk & ZFS anecdote

In a status meeting one day, Jim Mauro was lamenting that his storage system kept breaking when he was trying to do ZFS performance testing. Later I dropped him a note suggesting using a ramdisk instead. Clearly, ramdisks are not persistent storage, so they aren't really usable for those things you wish to keep. But for exploring features of file systems, they can be quite handy.

# ramdiskadm -a whee 100m

# zpool create demo /dev/ramdisk/whee

# zfs create demo/forgetme

# zfs set mountpoint=/forgetme demo/forgetme

Now we can explore ZFS without actually using disk space. This might come in handy, if you don't care about the volatility of the data.

ZFS from a RAS point of view: context of data

I've been working on RAS analysis of ZFS for a while and I haven't been this excited about a new product launch for a very long time. I'll be blogging about it more over the next few months as it is a very deep, interesting, and detailed analysis.

Let's begin with a look at some of the history of data storage in computers. Long, long ago, persistent data storage was very costly (price/bit) and slow. This fact of life led file system designers to be efficient with space while also trying to optimize storage placement on the disk for performance. Take a look at the venerable UFS for example. When UFS was designed, disks were quite simple and mostly dumb devices. If you examine the newfs(1m) man page, you'll see all sorts of options for setting block sizes (which are fixed), cylinder groups, rotational delay, rotational speed, and number of tracks per cylinder. UFS wanted to know the details of the hardware so that it could optimize its utilization. Of course, over time the hardware changed rather dramatically: SCSI (which hides disk geometry from the host) became ubiquitous, new disk interfaces were developed, processors became much faster than the rotational delay, new storage devices were invented, and so on. A problem with this design was that, in practice, the hardware dictated the data storage design. As the hardware changed dramatically in the years since its invention, the effects of the early design philosophy became more apparent and required many modifications. Change is generally a good thing, and UFS's ability to survive more than 25 years is a testament to its good design at inception.

From a RAS perspective, having the hardware drive the file system design leads to some limitations. The most glaring is that UFS trusts the underlying hardware to deliver correct data. In hindsight, this is rather risky as the hardware was often unreliable with little error detection or correction in the data path from memory to media. But given that CPUs were so slow and expensive, using CPU cycles to provide enhanced error detection wasn't feasible. In other words, a data corruption problem caused by hardware is not handled by the file system very well. If you have ever had to manually fsck(1m), you'd know what I mean.

This approach had another effect on system design. The computer industry has spent enormous effort designing some very cool and complex devices to look like disk drives. Think about it. RAID arrays are perhaps the most glorious examples of this taken to the extreme: you'll see all sorts of disk virtualization, replication, backup, and other tricks taking place behind a thin veneer disguised as a disk. The problem with this is that by emulating a rather simple device, any context of the data is lost. From a data context perspective, a RAID array is basically dumbed down to the level of a simple rotating platter with a moving head. While many people seem to be happy with this state of affairs, it is really very limiting. For example, a RAID-5 volume can achieve its best write performance when full stripe writes occur. But the RAID-5 volume is really using disks to emulate a disk. If you have a 4+1 RAID volume then the minimum stripe size for best performance would be a multiple of 2 kBytes (N \* 4 \* 512 bytes); more likely it will be much larger. Regardless of the optimal stripe width, UFS doesn't know anything about it, and neither do applications. So UFS goes happily on its way dutifully placing application data in the file system. To their credit, performance experts have spent many hours manually trying to optimally match application, file system, and storage data alignment to reach peak performance. I'd rather be surfing...
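The 4+1 arithmetic above can be sketched in Python as a quick alignment check. The function names and the 512 byte chunk per data disk are assumptions for illustration; real arrays usually use much larger chunks.

```python
def full_stripe_bytes(data_columns, chunk_bytes=512):
    """Bytes of application data in one full stripe (parity excluded)."""
    return data_columns * chunk_bytes


def is_full_stripe_write(offset, length, data_columns, chunk_bytes=512):
    """True when a write starts on a stripe boundary and covers whole stripes."""
    width = full_stripe_bytes(data_columns, chunk_bytes)
    return length > 0 and offset % width == 0 and length % width == 0


if __name__ == "__main__":
    # 4+1 RAID-5: four data columns plus one parity column.
    print("full stripe:", full_stripe_bytes(4), "bytes")
    print("aligned 8 kByte write:", is_full_stripe_write(0, 8192, 4))
    print("same write shifted by one sector:", is_full_stripe_write(512, 8192, 4))
```

A file system that knows nothing about the stripe width has no way to run a check like this, which is exactly the lost context described above.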

Suppose we could do it all over again, and this chance may not arise for another 25 years. Rather than having a hardware design dictate the file system design, let's make the file system design fit our data requirements. Further, let's not trust the hardware to provide correct data. Let's call it ZFS. Suddenly we are liberated! When an application writes data, ZFS knows how big the request is, and can allocate the appropriate block size. CPU cycles are now very inexpensive (more so now that cores are almost free) so we can use the CPU to detect data corruption anywhere in the data path. We don't have to rely on a parity protected SCSI bus, or bug-free disk firmware (I've got the scars) to ensure that what is on persistent storage is what we get in memory. For that matter, we really don't care if the storage is a disk at all; we'll just treat it as a random access list of blocks. In any case, by distrusting everything in the storage data path we will build the reliability and redundancy into the file system. We'd really like applications to do this, too, but that is like boiling the ocean, and I digress. The key here is that ZFS knows the data, knows when it is bad, and knows how to add redundancy to make it reliable. This knowledge of the context of the file system data is very powerful. I'll be exploring the meaning of this in detail in later blogs. For now, my advice is to get on the OpenSolaris bandwagon and try it out. You will be pleasantly surprised.
