Friday May 04, 2007

ZFS, copies, and data protection

OpenSolaris build 61 (or later) is now available for download. ZFS has added a new feature that will improve data protection: redundant copies for data (aka ditto blocks for data). Previously, ZFS stored redundant copies of metadata. Now this feature is available for data, too.

This feature is unique to ZFS: you can set the data protection policy on a per-file-system basis, beyond that offered by the underlying device or volume. For single-device systems, like my laptop with its single disk drive, this is very powerful. I can have a different data protection policy for the files that I really care about (my personal files) than for the files that I really don't care about or that can be easily reloaded from the OS installation DVD. For systems with multiple disks assembled in a RAID configuration, the benefit is not quite so obvious. Let's explore this feature, look under the hood, and then analyze some possible configurations.

Using Copies

To change the number of data copies, set the copies property. For example, suppose I have a zpool named "zwimming." The default number of data copies is 1, but you can change that to 2 quite easily.

# zfs set copies=2 zwimming

The copies property affects only new writes, so I recommend that you set the policy when you create the file system or immediately after you create the zpool. Data written earlier keeps the number of copies in effect at the time it was written; to gain the extra copies, it must be rewritten.

You can verify the copies setting by looking at the properties.

# zfs get copies zwimming
NAME      PROPERTY  VALUE     SOURCE
zwimming  copies    2         local

ZFS will account for the space used. For example, suppose I create three new file systems and copy some data to them. You can then see that the space used reflects the number of copies. If you use quotas, then the copies will be charged against the quotas, too.

# zfs create -o copies=1 zwimming/single
# zfs create -o copies=2 zwimming/dual
# zfs create -o copies=3 zwimming/triple
# cp -rp /usr/share/man/man1 /zwimming/single
# cp -rp /usr/share/man/man1 /zwimming/dual
# cp -rp /usr/share/man/man1 /zwimming/triple
# zfs list -r zwimming                                                       
NAME              USED  AVAIL  REFER  MOUNTPOINT
zwimming         48.2M   310M  33.5K  /zwimming
zwimming/dual    16.0M   310M  16.0M  /zwimming/dual
zwimming/single  8.09M   310M  8.09M  /zwimming/single
zwimming/triple  23.8M   310M  23.8M  /zwimming/triple

This makes sense. Each file system has one, two, or three copies of the data and will use correspondingly one, two, or three times as much space to store the data.
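To see the policy for every file system in the pool at once, you can ask for the copies property recursively. The output below is illustrative, based on the file systems created above:

# zfs get -r copies zwimming
NAME             PROPERTY  VALUE  SOURCE
zwimming         copies    2      local
zwimming/dual    copies    2      local
zwimming/single  copies    1      local
zwimming/triple  copies    3      local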

Under the Covers

ZFS will spread the ditto blocks across the vdev or vdevs to provide spatial diversity. Bill Moore has previously blogged about this, or you can see it in the code for yourself. From a RAS perspective, this is a good thing: we want to reduce the possibility that a single failure, such as a drive head impact with media, could disturb both copies of our data. If we have multiple disks, ZFS will try to spread the copies across multiple disks. This is different from mirroring, in subtle ways; the actual placement is ultimately based upon available space. Let's look at some simplified examples. First, consider the default file system configuration on a single disk.

Default, simple config

Note that there are two copies of the metadata, by default. If we have two or more copies of the data, the number of metadata copies is three.

ZFS, 2 copies 

Suppose you have a 2-disk stripe. In that case, ZFS will try to spread the copies across the disks.

ZFS, 2 copies, 2 disks

Since the copies are created above the zpool's mirroring layer, a mirrored zpool will faithfully mirror the copies.

ZFS, copies=2, mirrored

Since the copies policy is set at the file system level, not the zpool level, a single zpool may contain multiple file systems, each with a different policy. In other words, data with only a single copy can be allocated alongside data with multiple copies.

ZFS, mixed copies

Using different policies for different file systems lets you protect the data you care about more strongly, improves overall data protection, and offers many more permutations of configurations to weigh in your designs.

RAS Modeling

It is obvious that increasing the number of data copies reduces the amount of available space accordingly. But how will this affect reliability? To answer that question we use the MTTDL[2] model I previously described, with the following changes:

First, we calculate the probability of unsuccessful reconstruction due to an unrecoverable error (UER) for N disks of a given size (unit conversion omitted). The number of copies decreases this probability. This makes sense: another copy of the data can be used for the reconstruction, and to fail completely we'd need to lose all copies:
Precon_fail = ((N-1) * size / UER)^copies
For single-disk failure protection:
MTTDL[2] = MTBF / (N * Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)

Note that as the number of copies increases, Precon_fail approaches zero quickly. This will increase the MTTDL. We want higher MTTDL, so this is a good thing.
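To get a feel for the numbers, here is a minimal sketch that plugs illustrative values into the single-disk-failure-protection case: a hypothetical 5-disk raidz set of 500 GByte drives, a vendor-style UER of 1 error per 10^14 bits read, and a per-disk MTBF of 1,000,000 hours. All of these inputs are assumptions for illustration, not measured values.

# nawk 'BEGIN {
    N = 5                    # disks in the set
    size = 500e9 * 8         # disk size in bits (500 GBytes)
    uer = 1e14               # bits read per unrecoverable error (assumed)
    mtbf = 1e6               # per-disk MTBF in hours (assumed)
    for (copies = 1; copies <= 3; copies++) {
        precon_fail = ((N - 1) * size / uer) ^ copies
        mttdl = mtbf / (N * precon_fail)   # single-disk failure protection
        printf("copies=%d  Precon_fail=%.4g  MTTDL[2]=%.3g hours\n",
            copies, precon_fail, mttdl)
    }
}'

With these inputs, going from copies=1 to copies=2 drops Precon_fail from 0.16 to about 0.026, so MTTDL[2] improves roughly six-fold; copies=3 improves it further still.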

OK, now that we can calculate available space and MTTDL, let's look at some configurations for the 46 disks available on a Sun Fire X4500 (aka Thumper). We'll look at single-parity schemes, to reduce the clutter, but double-parity schemes show the same relative improvements.

ZFS, X4500 single parity schemes with copies


You can see that we are trading off space for MTTDL. You can also see that for raidz zpools, having more disks in the sets reduces the MTTDL. It gets more interesting to see that the 2-way mirror with copies=2 is very similar in space and MTTDL to the 5-disk raidz with copies=3. Hmm. Also, the 2-way mirror with copies=1 is similar in MTTDL to the 7-disk raidz with copies=2, though the mirror configurations allow more space. This information may be useful as you make trade-offs. Since the copies parameter is set per file system, you can still set the data protection policy for important data separately from unimportant data. This might be a good idea for situations where you have permanent originals (e.g., CDs, DVDs) and want to apply a different data protection policy.

In the future, once we have a better feel for the real performance considerations, we'll be able to add a performance component into the analysis.

Single Device Revisited

Now that we see how data protection is improved, let's revisit the single-device case. I use the term device here because there is a significant change occurring in storage as we replace disk drives with solid-state, non-volatile memory devices (e.g., flash disks and future MRAM or PRAM devices). A large number of enterprise customers demand dual disk drives for mirroring root file systems in servers. However, there is also a growing demand for solid-state boot devices, and we have some Sun servers with this option. Some believe that by 2009, the majority of laptops will also have solid-state devices instead of disk drives. In the interim, there are also hybrid disk drives.

What effect will these devices have on data retention? We know that if the entire device fails completely, then the data is most likely unrecoverable. In real life, these devices can suffer many failures which result in data loss but which are not complete device failures. For disks, the most common failure we see is an unrecoverable read, where data is lost from one or more sectors (bar 1 in the graph below). For flash memories, there is an endurance issue where repeated writes to a cell may reduce the probability of reading the data correctly. If you have only one copy of the data, then the data is lost, never to be read correctly again.

We captured disk error codes returned from a number of disk drives in the field. The Pareto chart below shows the relationship between the error codes. Bar 1 is the unrecoverable read, which accounts for about 24% of the errors recorded. The violet bars show errors from which the drive recovered successfully, for example a write error recovered by block reallocation or a read error recovered by ECC during normal retries. The recovered errors do not (immediately) indicate a data loss event, so they are largely transparent to applications. We worry more about the unrecoverable errors.

Disk error Pareto chart

Approximately 1/3 of the errors were unrecoverable. If such an error occurs in ZFS metadata, then ZFS will try to read an alternate metadata copy and repair the metadata. If the data has multiple copies, then it is likely that we will not lose any data. This is a more detailed view of the storage device, because we are not treating every failure as a full device failure.

Both field data and anecdotal evidence suggest that unrecoverable errors can occur while the device is still largely operational. ZFS has the ability to survive such errors without data loss. Very cool. Murphy's Law will ultimately catch up with you, though. In the case where ZFS cannot recover the data, it will tell you which file is corrupted. You can then decide whether or not to recover it from backups or source media.
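As a sketch of what this looks like in practice (using the example pool from above), a scrub forces ZFS to read and verify every block, and the verbose status output then lists any files with permanent errors:

# zpool scrub zwimming
# zpool status -v zwimming

Errors that can be repaired from another copy are fixed during the scrub; anything that cannot be repaired shows up as a named file, so you can restore just that file from backup or source media.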

Another Single Device

Now that I've got you to think of the single device as a single device, I'd like to extend the thought to RAID arrays. There is much confusion about whether ZFS should or should not be used with RAID arrays. If you search, you'll find comments and recommendations both for and against using hardware RAID with ZFS. The main argument centers on the ability of ZFS to correct errors. If you have a single device backed by a RAID array with some sort of data protection, then previous versions of ZFS could detect, but not recover, data which was lost. Hold it right there, fella! Do I mean that RAID arrays and the channel from the array to main memory can have errors? Yes, of course! We have seen cases where errors were introduced somewhere along the path from the disk media to main memory and data was lost or corrupted. Prior to ZFS, these were silent errors, blissfully ignored. With ZFS, the checksum detects these errors and ZFS tries to recover. If you don't believe me, then watch the ZFS forum on opensolaris.org, where we get reports like this about once a month or so. With ZFS copies, you can now recover from such errors without changing the RAID array configuration.

If ZFS can correct a data error, it will attempt to do so. You now have the option to improve your data protection even when using a single RAID LUN. And this is the same mechanism we can use for a single disk or flash drive: data copies. You can set copies on a per-file-system basis and thus have different data protection policies even though the data is physically stored on a RAID LUN in a hardware RAID array. I really hope we can put to rest the "ZFS prefers JBOD" argument and just concentrate our efforts on implementing the best data protection policies for the requirements.
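For example, here is a minimal sketch of that kind of policy split on a single array LUN (the pool name, file system names, and device name are hypothetical):

# zpool create tank c3t0d0                  # one LUN presented by the RAID array
# zfs create -o copies=2 tank/projects      # important data: two copies
# zfs create tank/scratch                   # easily recreated data: default single copy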

ZFS with data copies is another tool in your toolbelt to improve your life, and the life of your data.



Friday Jan 12, 2007

ZFS RAID recommendations: space vs U_MTBSI

Some people get all wrapped around the axle worrying about disk controllers.  These same people often criticize innovative products like the Sun Fire X4500 (aka Thumper) server because it only has 6 SATA controllers for 48 disk drives (8 disks/controller). In this blog, I'll take a look at the various possible RAID configurations for an X4500 and see how they affect the Unscheduled Mean Time Between System Interruptions (U_MTBSI).

Get off the bus

I think this overabundance of worry originated years ago, when controllers were very expensive and the available I/O slots in a computer were quite limited. If you look at the old-school technologies such as parallel SCSI or IDE interfaces, which were buses, then the concerns were valid. Indeed, if you look at other buses, you'd see many of the same opportunities for problems: S100, VME, DBus, etc. What we fear the most about a parallel bus is that some device will grab hold of it and not let go, thus wedging everything on the bus. For storage devices, this could mean that a single hardware fault could disrupt access to data and its redundancy, causing an outage. We will often describe this as a single fault zone. The diversity concept encourages distributing the redundant components across different fault zones. The pocketbook concept places the ultimate limits on how many components are available, though.

Another place where bus designs limit our choices is in performance. In general, only one device can be talking on a bus at any given time. So, if you have a bunch of devices sharing a bus, then you could have a performance bottleneck to go with your single fault zone. The obvious way to avoid this is to not use buses. (I usually take this opportunity to dis' fibre channel, but I'll spare you this time :-)  Today, we have many opportunities to replace buses with point-to-point technologies: parallel SCSI replaced by serial attached SCSI (SAS), IDE/ATA replaced by Serial ATA (SATA), DBus replaced by Safari, front side bus (FSB) replaced by HyperTransport, Ethernet hubs replaced by Ethernet switches, etc.

From a RAS perspective, point-to-point technologies are very cool, largely because there are more fault zones, one per pair, and any single fault zone affects only one pair. This has numerous advantages because we can build highly reliable protocols into the links and not have to deal with sharing. Basically, if I have only two devices, then I can construct the link such that each device logically has one transmit and one receive interface. Simple. Simple is good. Simple lets us do things like automatically know when a device is on the other end of the link, because we hear something; on a shared bus, you have to place a request for an address on the bus and hope that a device responds, which it can't do if the bus is wedged by some other failed device. In other parts of the system, going point-to-point has allowed even more RAS improvements. For example, we use point-to-point connections between the system boards in a Sun Fire E25K server, which is why we can implement dynamic reconfiguration.

We also gain from Moore's Law. A curious thing happens when you integrate more functions onto a single chip -- the failure rate tends to remain more or less constant. In other words, if you take the functions performed by 4 different chips and integrate them onto one chip, then you get a 4:1 parts count reduction, the per-part failure rate stays roughly the same, so you get a 4x increase in reliability. Putting this together, consider replacing 4 parallel SCSI buses, each with a single controller and drive (4 fault zones), with a single 4-port SAS controller. At first glance you'd say that you replaced 4 fault zones with one fault zone. But in order to analyze such a system, you must take into account the reliability of the components. The single SAS controller will have approximately the same failure rate as each of the parallel SCSI controllers, so we get a 4x increase in controller reliability. Now, the answer to the question of which is better isn't so clear. And the math can get rather complicated. Naturally, we developed a tool to perform such analyses, RASCAD.

RASCAD 

RASCAD is an industry-leading tool we developed and use at Sun to easily answer design questions regarding reliability, availability, and serviceability (RAS) of complex systems. For the computationally intrigued, we build hierarchical Markov models. Very cool.

At Sun, we evaluate systems for their Mean Time Between System Interruptions (MTBSI). Very simply, a system interruption (reboot) occurs when a component failure brings the system down. This includes service events to repair the component. For example, if a component fails, the system reboots in a degraded state, and repairing the component requires another reboot, then the failure causes two interruptions (e.g., CPUs). Obviously, some components can often be repaired without causing a second service outage (e.g., disks). MTBSI does include a serviceability component, which can be mitigated using planned outages and other processes. So we often stick with the unscheduled outages, U_MTBSI, which is an indicator of pain. Pain is not good, so we try to increase the time between painful events, U_MTBSI, whenever possible.

Analysis of X4500 and ZFS U_MTBSI

Previously, I discussed space versus MTTDL for an X4500. Given the 6-controller SATA configuration of the 48 available disks, how is the U_MTBSI affected by the various possible RAID configurations in ZFS? A dandy question. I took the same configurations and computed the U_MTBSI under the condition that I would strive for the best possible controller diversity. This gets a little tricky when you have only 6 controllers. For example, if you have a 7-disk raidz set, then at least two of the disks will share a controller. If you had a 6-disk raidz set, then you could place each drive in the set on a different controller. For the X4500, this gets a little more difficult because the two drives that you can boot from are on the same controller (a BIOS limitation?). Also, what about spares? What happens when a spare shares the same controller as a data disk? Should I mirror adjacent or in opposition? Will the sun rise again tomorrow? Anyway, you can see how easily you can get wrapped around the axle worrying about this stuff. Let's see what the analysis shows.

Plot of space vs U_MTBSI

The first thing you notice is that the same statement I made last time still applies - friends don't let friends use RAID-0!

The second thing you'll notice is that the RAID types are clumping again. But the clumping is not as diverse in the U_MTBSI axis as you might expect. This is because the reliability of the controller is much higher than the reliability of a single disk.  If you think about it, you are worried about the reliability of one chip versus a pile of chips, motors, amplifiers, heads, media, connectors, wires, and other stuff.  Since the reliability of the controller is much higher than the reliability of the disks (controller MTBF >> disk MTBF) the disk reliability dominates. This is also why we do see a significant difference between the RAID types used.

The third thing you'll notice is that once again I haven't labeled the U_MTBSI axis. If I had labeled it, then it would be a rat hole opportunity with a high probability of entrance.  In this case, all of the components are identical, the only change is the RAID configuration. So you could even consider the results normalized to some value and gain the same insights.

The explanation I'll offer for why RAID-Z (RAID-5) is worse than mirroring (RAID-1) or double-parity RAID-Z2 (RAID-6) is that the probability of two disks failing remains the same, but the probability that two disk failures cause an interruption is very different. I think this clearly shows what David Patterson alluded to in a presentation he gave at the Sun Technical Conference once: single-parity RAID-5 just doesn't give enough protection; double parity (e.g., RAID-Z2) would have been a better choice. He also mentioned that people hassle him because they've lost data with RAID-5. Needless to say, I'm not a big fan of RAID-5.

You can see that controller diversity doesn't make as big an impact on the U_MTBSI as the RAID type. For example, the U_MTBSI for raidz with 5+1 (one column per controller) is not that much different from that for 6+1 (more than one column per controller). Similarly, the use of hot spares doesn't seem to make a big difference. You can more easily see the advantage of hot sparing when you look at MTTDL or Mean Time Between Service (MTBS, more on that later...).
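For reference, here is a minimal sketch of what such a layout looks like: a 5+1 raidz set with one column per controller, plus a hot spare that necessarily shares a controller with one of the data disks. The cXtYdZ device names are hypothetical; the actual controller and target numbering on an X4500 will differ.

# zpool create tank raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0
# zpool add tank spare c0t1d0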

I hope that this view of the trade-off between RAS and space will help you make better design decisions. There are other trade-offs to be considered as well, so I encourage you to look at all of the requirements and see how they can be satisfied. And don't get all wrapped around the axle worried about SAS/SATA controller diversity. Be diverse if you can, but don't worry too much if you can't.
