Friday May 04, 2007

ZFS, copies, and data protection

OpenSolaris build 61 (or later) is now available for download. ZFS has added a new feature that will improve data protection: redundant copies for data (aka ditto blocks for data). Previously, ZFS stored redundant copies of metadata. Now this feature is available for data, too.

This represents a new feature which is unique to ZFS: you can set the data protection policy on a per-file system basis, beyond that offered by the underlying device or volume. For single-device systems, like my laptop with its single disk drive, this is very powerful. I can have a different data protection policy for the files that I really care about (my personal files) than the files that I really don't care about or that can be easily reloaded from the OS installation DVD. For systems with multiple disks assembled in a RAID configuration, the data protection is not quite so obvious. Let's explore this feature, look under the hood, and then analyze some possible configurations.

Using Copies

To change the numbers of data copies, set the copies property. For example, suppose I have a zpool named "zwimming." The default number of data copies is 1. But you can change that to 2 quite easily.

# zfs set copies=2 zwimming

The copies property works for all new writes, so I recommend that you set that policy when you create the file system or immediately after you create a zpool.

You can verify the copies setting by looking at the properties.

# zfs get copies zwimming
NAME      PROPERTY  VALUE     SOURCE
zwimming  copies    2         local

ZFS will account for the space used. For example, suppose I create three new file systems and copy some data to them. You can then see that the space used reflects the number of copies. If you use quotas, then the copies will be charged against the quotas, too.

# zfs create -o copies=1 zwimming/single
# zfs create -o copies=2 zwimming/dual
# zfs create -o copies=3 zwimming/triple
# cp -rp /usr/share/man1 /zwimming/single
# cp -rp /usr/share/man1 /zwimming/dual
# cp -rp /usr/share/man1 /zwimming/triple
# zfs list -r zwimming                                                       
NAME USED AVAIL REFER MOUNTPOINT
zwimming 48.2M 310M 33.5K /zwimming
zwimming/dual 16.0M 310M 16.0M /zwimming/dual
zwimming/single 8.09M 310M 8.09M /zwimming/single
zwimming/triple 23.8M 310M 23.8M /zwimming/triple

This makes sense. Each file system has one, two, or three copies of the data and will use correspondingly one, two, or three times as much space to store the data.

Under the Covers

ZFS will spread the ditto blocks across the vdev or vdevs to provide spatial diversity. Bill Moore has previously blogged about this, or you can see it in the code for yourself. From a RAS perspective, this is a good thing. We want to reduce the possibility that a single failure, such as a drive head impact with media, could disturb both copies of our data. If we have multiple disks, ZFS will try to spread the copies across multiple disks. This is different than mirroring, in subtle ways. The actual placement is ultimately based upon available space. Let's look at some simplified examples. First, for the default file system configuration settings on a single disk.

Default, simple config

Note that there are two copies of the metadata, by default. If we have two or more copies of the data, the number of metadata copies is three.

ZFS, 2 copies 

Suppose you have a 2-disk stripe. In that case, ZFS will try to spread the copies across the disks.

ZFS, 2 copies, 2 disks

Since the copies are created above the zpool, a mirrored zpool will faithfully mirror the copies.

 

ZFS, copies=2, mirrored

Since the copies policy is set at the file system level, not the zpool level, a single zpool may contain multiple file systems, each with different policies. In other words, you could have data which is not copied allocated along with data that is copied.

 

ZFS, mixed copies

Using different policies for different file systems allows you to have different data protection policies, allows you to improve data protection, and offers many more permutations of configurations for you to weigh in your designs.

RAS Modeling

It is obvious that increasing the number of data copies will effectively reduce the amount of available space accordingly. But how will this affect reliability? To answer that question we use the MTTDL[2] model I previously described, with the following changes:

First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted). The number of copies decreases this probability. This makes sense as we could use another copy of the data for reconstruction and to completely fail, we'd need to lose all copies:
Precon_fail = ((N-1) \* size / UER)copies
For single-disk failure protection:
MTTDL[2] = MTBF / (N \* Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF2/ (N \* (N-1) \* MTTR \* Precon_fail)

Note that as the number of copies increases, Precon_fail approaches zero quickly. This will increase the MTTDL. We want higher MTTDL, so this is a good thing.

OK, now that we can calculate available space and MTTDL, let's look at some configurations for 46 disks available on a Sun Fire X4500 (aka Thumper). We'll look at single parity schemes, to reduce the clutter, but double parity schemes will show the same, relative improvements.

ZFS, X4500 single parity schemes with copies

bigger view 

You can see that we are trading off space for MTTDL. You can also see that for raidz zpools, having more disks in the sets reduces the MTTDL. It gets more interesting to see that the 2-way mirror with copies=2 is very similar in space and MTTDL to the 5-disk raidz with copies=3. Hmm. Also, the 2-way mirror with copies=1 is similar in MTTDL to the 7-disk raidz with copies=2, though the mirror configurations allow more space. This information may be useful as you make trade-offs. Since the copies parameter is set per file system, you can still set the data protection policy for important data separately from unimportant data. This might be a good idea for some situations where you might have permanent originals (eg. CDs, DVDs) and want to apply a different data protection policy.

In the future, once we have a better feel for the real performance considerations, we'll be able to add a performance component into the analysis.

Single Device Revisited

Now that we see how data protection is improved, let's revisit the single device case. I use the term device here because there is a significant change occurring in storage as we replace disk drives with solid state, non-volatile memory devices (eg. flash disks and future MRAM or PRAM devices). A large number of enterprise customers demand dual disk drives for mirroring root file systems in servers. However, there is also a growing demand for solid state boot devices, and we have some Sun servers with this option. Some believe that by 2009, the majority of laptops will also have solid state devices instead of disk drives. In the interim, there are also hybrid disk drives.

What affect will these devices have on data retention? We know that if the entire device completely fails, then the data is most likely unrecoverable. In real life, these devices can suffer many failures which result in data loss, but which are not complete device failures. For disks, we see the most common failure is an unrecoverable read where data is lost from one or more sector (bar 1 in the graph below). For flash memories, there is an endurance issue where repeated writes to a cell may reduce the probability of reading the data correctly. If you only have one copy of the data, then the data is lost, never to be read correctly again.

We captured disk error codes returned from a number of disk drives in the field. The Pareto chart below shows the relationship between the error codes. Bar 1 is the unrecoverable read which accounts for about 24% of the errors recorded. The violet bars show recoverable errors which did succeed. Examples of successfully recovered errors are: write error - recovered with block reallocation, read error - recovered by ECC using normal retries, etc. The recovered errors do not (immediately) indicate a data loss event, so they are largely transparent to applications. We worry more about the unrecoverable errors.

 

Disk error Pareto chart

Approximately 1/3 of the errors were unrecoverable. If such an error occurs in ZFS metadata, then ZFS will try to read alternate metadata copy and repair the metadata. If the data has multiple copies, then it is likely that we will not lose any data. This is a more detailed view of the storage device because we are not treating all failures as a full device failure.

Both real and anecdotal evidence suggests that unrecoverable errors can occur while the device is still largely operational. ZFS has the ability to survive such errors without data loss. Very cool. Murphy's Law will ultimately catch up with you, though. In the case where ZFS cannot recover the data, ZFS will tell you which file is corrupted. You can then decide whether or not you should recover it from backups or source media.

Another Single Device

Now that I've got you to think of the single device as a single device, I'd like to extend the thought to RAID arrays. There is much confusion amongst people about whether ZFS should or should not be used with RAID arrays. If you search, you'll find comments and recommendations both for and against using hardware RAID for ZFS. The main argument is centered around the ability of ZFS to correct errors. If you have a single device backed by a RAID array with some sort of data protection, then previous versions of ZFS could not recover data which was lost. Hold it right there, fella! Do I mean that RAID arrays and the channel from the array to main memory can have errors? Yes, of course! We have seen cases where errors were introduced somewhere along the path between disk media to main memory where data was lost or corrupted. Prior to ZFS, these were silent errors and blissfully ignored. With ZFS, the checksum now detects these errors and tries to recover. If you don't believe me, then watch the ZFS forum on opensolaris.org where we get reports like this about once a month or so. With ZFS copies, you can now recover from such errors without changing the RAID array configuration.

If ZFS can correct a data error, it will attempt to do so. You now have a the option to improve your data protection even when using a single RAID LUN. And this is the same mechanism we can use for a single disk or flash drive: data copies. You can implement the copies on a per-file system basis and thus have different data protection policies even though the data is physically stored on a RAID LUN in a hardware RAID array. I really hope we can put to rest the "ZFS prefers JBOD" argument and just concentrate our efforts on implementing the best data protection policies for the requirements.

ZFS with data copies is another tool in your toolbelt to improve your life, and the life of your data.



Wednesday Apr 18, 2007

Is eWeek living on internet time?

eWeek has published a nice article describing Sun's new, low-cost RAID array: the Sun StorageTek ST2500 Low Cost Array.  This is an interesting new product that has broad appeal and will be a heck of a good box to run under ZFS.

But I'm worried about eWeek.  It seems that they've lost track of time.  Many of us have been running on internet time for most of our lives.  This quote makes me wonder if eWeek forgot to update their timezone data:

"The ZFS [the speedy Zeta[sic] file system, recently released to the open-source community by Sun] is very interesting, and people are looking at it," [Henry] Baltazar told eWeek.

I'm pretty sure Henry Baltazar is running on internet time, and provided a very nice quote.  But whoever added the editorial clarification at eWeek spelled Zettabyte wrong and is running on island timeZFS was released to the open-source community on June 14, 2005 - nearly 2 years ago in real time.  Even in real time, 2 years can hardly be considered "recent."  Sigh.

 

Thursday Apr 05, 2007

ZFS ported to FreeBSD!

The FreeBSD team has added ZFS to the FreeBSD-7.0 release!  This is excellent news and all of us are happy to share with the FreeBSD community.  Pawel Jakub Dawidek has posted this note to the FreeBSD and ZFS community. This will greatly expand the use of ZFS and will no doubt lead to more innovative developments in the community.  Well done!


Wednesday Jan 31, 2007

ZFS RAID recommendations: space, performance, and MTTDL all-in

Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using.  Here it is:

All-in graph 

This shows the data I've previously blogged about in scale. You can easily see that for MTTDL, double parity protection is better than single parity protection which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for MTTDL and small, random read iops. I call this the "all-in" slide because, in a sense, it puts everything in one pot.

While this sort of analysis is useful, the problem is that there are more dimensions of the problem. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph.  I'll try my best...


Tuesday Jan 30, 2007

ZFS RAID recommendations: space, performance, and MTTDL

In this blog I connect the space vs MTTDL models with a performance model. This gives engough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems. 

The best thing about a model is that it is a simplification of real life.
The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance model. The calculations for the model can be made with data which is readily available from disk data sheets. We calculate the expected I/O operations per second (iops) based on the average read seek and rotational speed of the disk. We don't consider the command overhead, as it is generally small for modern drives and is not always specified in disk data sheets.

maximum rotational latency = 60,000 (ms/min) / rotational speed (rpm)

iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))

Since most disks use consistent rotational speeds, this small table may help you to see what the rotational speed contribution will be.

Rotational Speed (rpm)

Maximum Rotational Latency (ms)

4,200

14.3

5,400

11.1

7,200

8.3

10,000

6.0

15,000

4.0

For example, if we have a 73 GByte, 2.5" Seagate Saviio SAS drive which has a 4.1 ms average read seek and rotational speed of 10,000 rpm:

iops = 1000 / (4.1 + (6.0 / 2)) = 140.8

By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive which has a 8.5 ms average read seek and rotational speed of 7,200 rpm:

iops = 1000 / (8.5 + (8.3 / 2)) = 79.0

I purposely used those two examples because people are always wondering why we tend to prefer smaller, faster, and (unfortunately) more expensive drives over larger, slower, less expensive drives - a 78% performance improvement is rather significant. The 3.5" drives also use about 25-75% more power than their smaller cousins, largely due to the rotating mass. Small is beautiful in a SWaP sense.

Next we use the RAID set configuration information to calculate the total small, random read iops for the zpool or volume. Here we need to talk about sets of disks which may make up a multi-level zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of mirrored sets (RAID-1). RAID-0 is a stripe of disks.

  • For dynamic striping (RAID-0), add the iops for each set or disk. On average the iops are spread randomly across all sets or disks, gaining concurrency.

  • For mirroring (RAID-1), add the iops for each set or disk. For reads, any set or disk can satisfy a read, so we also get concurrency.

  • For single parity raidz (RAID-5), the set operates at the performance of one disk. See below.

  • For double parity raidz2 (RAID-6), the set operates at the performance of one disk. See below.

For example, if you have 6 disks, then there are many different ways you can configure them, with varying performance calculations

RAID Configuration (6 disks)

Small, Random Read Performance Relative to a Single Disk

6-disk dynamic stripe (RAID-0)

6

3-set dynamic stripe, 2-way mirror (RAID-1+0)

6

2-set dynamic stripe, 3-way mirror (RAID-1+0)

6

6-disk raidz (RAID-5)

1

2-set dynamic stripe, 3-disk raidz (RAID-5+0)

2

2-way mirror, 3-disk raidz (RAID-5+1)

2

6-disk raidz2 (RAID-6)

1

Clearly, using mirrors improves both performance and data reliability. Using stripes increases performance, at the cost of data reliability. raidz and raidz2 offer data reliability, at the cost of performance. This leads us down a rathole...

The Parity Performance Rathole

Many people expect that data protection schemes based on parity, such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance of striped volumes, except for the parity disk. In other words, they expect that a 6-disk raidz zpool would have the same small. random read performance as a 5-disk dynamic stripe. Similarly, they expect that a 6-disk raidz2 zpool would have the same performance as a 4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a checksum to validate the contents of a block of data written. The block is spread across the disks (vdevs) in the set. In order to validate the checksum, ZFS must read the blocks from more than one disk, thus not taking advantage of spreading unrelated, random reads concurrently across the disks. In other words, the small, random read performance of a raidz or raidz2 set is, essentially, the same as the single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.

Many people also think that this is a design deficiency. As a RAS guy, I value the data validation offered by the checksum over the performance supposedly gained by RAID-5. Reasonable people can disagree, but perhaps some day a clever person will solve this for ZFS.

So, what do other logical volume managers or RAID arrays do? The results seem mixed. I have seen some RAID array performance characterization data which is very similar to the ZFS performance for parity sets. I have heard anecdotes that other implementations will read the blocks and only reconstruct a failed block as needed. The problem is, how do such systems know that a block has failed? Anecdotally, it seems that some of them trust what is read from the disk. To implement a per-disk block checksum verification, you'd still have to perform at least two reads from different disks, so it seems to me that you are trading off data integrity for performance. In ZFS, data integrity is paramount. Perhaps there is more room for research here, or perhaps it is just one of those engineering trade-offs that we must live with.

Other Performance Models

I'm also looking for other performance models which can be applied to generic disks with data that is readily available to the public. The reason that the small, random read iops model works is that it doesn't need to consider caching or channel resource utilization. Adding these variables would require some knowledge of the configuration topology and the cache policies (which may also change with firmware updates.) I've kicked around the idea of a total disk bandwidth model which will describe a range of possible bandwidths based upon the media speed of the drives, but it is not clear to me that it will offer any satisfaction. Drop me a line if you have a good model or further thoughts on this topic.

You should be cautious about extrapolating the performance results described here to other workloads. You could consider this to be a worst-case model because it assumes 0% disk cache hits. I would hope that most workloads exhibit better performance, but rather than guessing (hoping) the best way to find out is to run the workload and measure the performance. If you characterize a number of different configurations, then you might build your own performance graphs which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID disk configurations by evaluating space, performance, and MTTDL. First, let's look at single parity schemes such as 2-way mirrors and raidz on the Sun Fire X4500 (aka Thumper) server.

Single Parity Model Results 

Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better performance and MTTDL than raidz for any specific space requirement except for the case where we run out of hot spares for the 2-way mirror (using all 46 disks for data). By contrast, all of the raidz configurations here have hot spares. You can use this to help make design trade-offs by prioritizing space, performance, and MTTDL.

You'll also note that I did not label the left-side Y axis (MTTDL) again, but I did label the right-side Y axis (small, random read iops). I did this with mixed emotion. I didn't label the MTTDL axis values as I explained previously. But I did label the performance axis so that you can do a rough comparison to the double parity graph below. Note that in the double parity graph, the MTTDL axis is in units of Millions of years, instead of years above.

Double Parity Model Results

Here you can see the same sort of comparison between 3-way mirrors and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.

Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place.  If you want to be happier, you should use mirroring with at least one hot spare.

Conclusion

We can make design trade-offs between space, performance, and MTTDL for disk storage systems. As with most engineering decisions, there often is not a clear best solution given all of the possible solutions. By using some simple models, we can see the trade-offs more clearly.


Wednesday Jan 17, 2007

A story of two MTTDL models

Mean Time to Data Loss (MTTDL) is a metric that we find useful for comparing data storage systems. I think it is particularly useful for determining what sort of data protection you may want to use for a RAID system. For example, suppose you have a Sun Fire X4550 (aka Thumper) server with 48 internal disk drives. What would be the best way to configure the disks for redundancy? Previously, I explored space versus MTTDL and space versus unscheduled Mean Time Between System Interruptions (U_MTBSI) for the X4500 running ZFS. The same analysis works for SVM or LVM, too.

For this blog, I want to explore the calculation of MTTDL for a bunch of disks. It turns out, there are multiple models for calculating MTTDL. The one described previously here is the simplest and only considers the Mean Time Between Failure (MTBF) of a disk and the Mean Time to Repair (MTTR) of the repair and reconstruction process. I'll call that model #1 which solves for MTTDL[1]. To quickly recap:

For non-protected schemes (dynamic striping, RAID-0)
MTTDL[1] = MTBF / N
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF2 / (N \* (N-1) \* MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF3 / (N \* (N-1) \* (N-2) \* MTTR2)
You can often get MTBF data from your drive vendor and you can measure or estimate your MTTR with reasonable accuracy. But MTTDL[1] does not consider the Unrecoverable Error Rate (UER) for read operations on disk drives. It turns out that the UER is often easier to get from the disk drive data sheets, because sometimes the drive vendors don't list MTBF (or Annual Failure Rate, AFR) for all of their drive models. Typically, UER will be 1 per 1014 bits read for consumer class drives and 1 per 1015 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk. Worse yet, if we have the data protected with RAID and we have to replace a drive, we really do hope that the data reconstruction completes correctly. To add to our nightmare, the UER does not decrease by adding disks. If we can't rely on the data to be correctly read, we can't be sure that our data reconstruction will succeed, and we'll have data loss. Clearly, we need a model which takes this into account. Let's call that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted):
Precon_fail = (N-1) \* size / UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N \* Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF2/ (N \* (N-1) \* MTTR \* Precon_fail)
Comparing the MTTDL[1] model to the MTTDL[2] model shows some interesting aspects of design. First, there is no MTTDL[2] model for RAID-0 because there is no data reconstruction – any failure and you lose data. Second, the MTTR doesn't enter into the MTTDL[2] model until you get to double-disk failure scenarios. You could nit pick about this, but as you'll soon see, it really doesn't make any difference for our design decision process. Third, you can see that the Precon_fail is a function of the size of the data set. This is because the UER doesn't change as you grow the data set. Or, to look at it from a different direction, if you use consumer class drives with 1 UER for 1014 bits, and you have 12.5 TBytes of data, the probability of an unrecoverable read during the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the MTTDL[2] model looks a lot like the RAID-0 model and friends don't let friends use RAID-0! Maybe you could consider a smaller sized data set to offset this risk. Let's see how that looks in pictures.

 MTTDL models for 2-way mirror

2-way mirroring is an example of a configuration which provides single-disk failure protection. Each data point represents the space available when using 2-way mirrors in a zpool. Since this is for a X4500, we consider 46 total available disks and any disks not used for data are available as spares. In this graph you can clearly see that the MTTDL[1] model encourages the use of hot spares. More importantly, although the results of the calculations of the two models are around 5 orders of magnitude different, the overall shape of the curve remains the same. Keep in mind that we are talking years here, perhaps 10 million years, which is well beyond the 5-year expected life span of a disk. This is the nature of the beast when using a constant MTBF. For models which consider the change in MTBF as the device ages, you should never see such large numbers. But the wish for more accurate models does not change the relative merits of the design decision, which is what we really care about – the best RAID configuration given a bunch of disks. Should I use single disk failure protection or double disk failure protection? To answer that, lets look at the model for raidz2.

MTTDL models for raidz2 

From this graph you can see that double disk protection is clearly better than single disk protection above, regardless of which model we choose. Good, this makes sense. You can also see that with raidz2 we have a larger number of disk configuration options. A 3-disk raidz2 set is somewhat similar to a 3-way mirror with the best MTTDL, but doesn't offer much available space. A 4-disk set will offer better space, but not quite as good MTTDL. This pattern continues through 8 disks/set. Judging from the graphs, you should see that a 3-disk set will offer approximately an order of magnitude better MTTDL than an 8-disk, for either MTTDL model. This is because the UER remains constant while the data to be reconstructed increases.
I hope that these models give you an insight into how you can model systems for RAS. In my experience, most people get all jazzed up with the space and forget that they are often making a space vs. RAS trade-off. You can use these models to help you make good design decisions when configuring RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.

Just one more teaser... there are other MTTDL models, but it is unclear if they would help make better decisions, and I'll explore those in another blog.

Friday Jan 12, 2007

ZFS RAID recommendations: space vs U_MTBSI

Some people get all wrapped around the axle worrying about disk controllers.  These same people often criticize innovative products like the Sun Fire X4500 (aka Thumper) server because it only has 6 SATA controllers for 48 disk drives (8 disks/controller). In this blog, I'll take a look at the various possible RAID configurations for an X4500 and see how they affect the Unscheduled Mean Time Between System Interruptions (U_MTBSI).

Get off the bus

I think this overabundance of worry originated years ago, when controllers were very expensive and the available I/O slots for a computer were quite limited. If you look at the old-school technologies such as parallel SCSI or IDE interfaces, which were buses, then the concerns were valid. Indeed, if you look at any buses, you'd see many of the same opportunities for problems: S100, VME, DBus, etc. What we fear the most about a parallel bus is that some device will grab ahold of it and not let go, thus wedging everything on the bus. For storage devices, this could mean that a single hardware fault could disrupt access to data and it's redundancy, causing an outage. We will often describe this as a single fault zone. The diversity concept encourages distributing the redundant components across different fault zones. The pocketbook concept places the ultimate limits on how many components are available, though.

Another place where bus designs limit our choices is in performance. In general, only one device can be talking on a bus at any given time. So, if you have a bunch of devices sharing a bus, then you could have a performance bottleneck to go with your single fault zone. The obvious way to avoid this is to not use buses. (I usually take this opportunity to dis' fibre channel, but I'll spare you this time :-)  Today, we have many opportunities to replace buses with point-to-point technologies: parallel SCSI replaced by serial attached SCSI (SAS), IDE/ATA replaced by Serial ATA (SATA), DBus replaced by Safari, front side bus (FSB) replaced by HyperTransport, Ethernet hubs replaced by Ethernet switches, etc.

From a RAS perspective, point-to-point technologies are very cool, largely because there are more fault zones, one per pair, but any single fault zone only affects one pair. This has numerous advantages because we can build highly reliable protocols into the links and not have to deal with sharing. Basically, if I have only two devices, then I can construct the link such that each device logically has one transmit and one receive interface.  Simple. Simple is good. Simple allows us to do things like automatically know when a device is on the other end of the link, we hear something, as opposed to a shared bus where you have to place a request for an address on the bus and hope that a device responds, which it can't if the bus is wedged by some other failed device. In other parts of the system, going point-to-point has allowed even more RAS improvements. For example, we use point-to-point connections between the system boards in a Sun Fire E25K server, which is why we can implement dynamic reconfiguration.

We also gain from Moore's Law. A curious thing happens when you integrate more functions onto a single chip -- the failure rate tends to remain more or less constant. In other words, if you take the functions performed by 4 different chips and integrate them onto one chip, then you get a 4:1 parts count reduction, the per-part failure rate stays roughly the same, so you get a 4x increase in reliability. Putting this together, consider replacing 4 parallel SCSI buses, each with a single controller and drive (4 fault zones), with a single 4-port SAS controller. At first glance you'd say that you replaced 4 fault zones with one fault zone. But in order to analyze such a system, you must take into account the reliability of the components. The single SAS controller will have approximately the same failure rate as each of the parallel SCSI controllers, so we get a 4x increase in controller reliability. Now, the answer to the question of which is better isn't so clear. And the math can get rather complicated. Naturally, we developed a tool to perform such analyses, RASCAD.

RASCAD 

RASCAD is an industry-leading tool we developed and use at Sun to easily answer design questions regarding reliability, availability, and serviceability (RAS) of complex systems. For the computationally intrigued, we build heirarchial Markov models, very cool.

At Sun, we evaluate systems for their Mean Time Between System Interruptions (MTBSI). Very simply, a system interruption occurs when a component failure causes a system interruption (reboot). This includes service events to repair the component. For example, if a component fails and the system reboots in a degraded state and repairing the component requires another reboot, then the failure causes two interruptions (e.g. CPUs). Obviously, some components can often be repaired without causing a second service outage (e.g. disks). MTBSI does include a serviceability component, which can be mitigated using planned outages and other processes. So, we often will try to stick with the unscheduled outages, U_MTBSI, which is an indicator of pain. Pain is not good, so we try to increase the time between painful events, U_MTBSI, whenever possible.

Analysis of X4500 and ZFS U_MTBSI

Previously, I discussed space versus MTTDL for an X4500. Given the 6 SATA controller configuration of the 48 available disks, how is the U_MTBSI affected by the various possible RAID configurations in ZFS? A dandy question. I took the same configurations and computed the U_MTBSI under the condition that I would strive for the best possible controller diversity. This gets a little tricky when you have only 6 controllers. For example, if you have a 7-disk raidz set, then at least two of the disks will share a controller. If you had a 6-disk raidz set, then you could place each drive in the set on a different controller. For the X4500, this gets a little more difficult because the two drives that you can boot from are on the same controller (BIOS limitation?) Also, what about spares? What happens when a spare shares the same controller as a data disk? Should I mirror adjacent or in opposition? Will the sun rise again tomorrow?  Anyway, you can see where you can easily begin to get wrapped around the axle worrying about this stuff. Let's see what the analysis shows.

Plot of space vs U_MTBSI

The first thing you notice is that the same statement I made last time still applies - friends don't let friends use RAID-0!

The second thing you'll notice is that the RAID types are clumping again. But the clumping is not as diverse in the U_MTBSI axis as you might expect. This is because the reliability of the controller is much higher than the reliability of a single disk.  If you think about it, you are worried about the reliability of one chip versus a pile of chips, motors, amplifiers, heads, media, connectors, wires, and other stuff.  Since the reliability of the controller is much higher than the reliability of the disks (controller MTBF >> disk MTBF) the disk reliability dominates. This is also why we do see a significant difference between the RAID types used.

The third thing you'll notice is that once again I haven't labeled the U_MTBSI axis. If I had labeled it, then it would be a rat hole opportunity with a high probability of entrance.  In this case, all of the components are identical, the only change is the RAID configuration. So you could even consider the results normalized to some value and gain the same insights.

The explanation I'll offer for why RAID-Z (RAID-5) is worse than mirroring (RAID-1) or double parity RAID-Z2 (RAID-6) is that the probability of two disks failing remains the same. But the probability that two disks failing and causing you to have an interruption is very different. I think this clearly shows what David Patterson alluded to in a presentation he gave at the Sun Technical Conference once: single parity RAID-5 just doesn't give enough protection, double parity (e.g. RAID-Z2) would have been a better choice. He also mentioned that people hassle him because they've lost data with RAID-5. Needless to say, I'm not a big fan of RAID-5.

You can see that controller diversity doesn't make as big an impact on the U_MTBSI as the RAID type. For example, the U_MTBSI for raidz with 5+1 (one column per controller) is not that much different than for 6+1 (more than one column per controller). Similarly, the use of hot spares doesn't seem to make a big difference. You can more easily see the advantage of hot sparing when you look at MTTDL or Mean Time Between Service (MTBS, more on that later...)

I hope that this view of the trade-off between RAS and space will help you make better design decisions. There are other trade-offs to be considered as well, so I encourage you to look at all of the requirements and see how they can be satisfied. And don't get all wrapped around the axle worried about SAS/SATA controller diversity. Be diverse if you can, but don't worry too much if you can't.

Thursday Jan 11, 2007

ZFS RAID recommendations: space vs MTTDL

It is not always obvious what the best RAID set configuration should be for a given set of disks. This is even more difficult to see as the number of disks grows large, like on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets.  Which would be best? One analysis is the trade-off space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward.  Configure the RAID sets and sum the total space available.

The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0)

MTTDL = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF2 / (N \* (N-1) \* MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF3 / (N \* (N-1) \* (N-2) \* MTTR2)

Where MTBF is the Mean Time Between Failure and MTTR is the Mean Time To Recover. You can get MTBF values from disk data sheets which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it.  You also need to consider the data reconstruction time, which may be a long time for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no context of the data and therefore have to reconstruct the entire disk rather than just reconstruct the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously.  In the worst case (100% used) ZFS will have to reconstruct the entire disk, just like a traditional LVM.  This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.

Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume. 

So, I took a large number of possible ZFS configurations for a X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.

X4500 MTTDL vs Space

The most obvious conclusion from the above data is that you shouldn't use simply dynamic striping or RAID-0. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You've noticed that the MTTDL axis uses a log scale, so that should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is because they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL because the expected lifetime of a disk is on the order of 5 years.  I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole. 

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time where you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.

The most interesting result of this analysis is that the RAID configurations will tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk zpool versus a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500.  You could use either raidz or raidz2 with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than any of the other options for the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.

 

There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.

 

About

relling

Search

Archives
« July 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today