ZFS RAID recommendations: space vs U_MTBSI
By relling on Jan 12, 2007
Some people get all wrapped around the axle worrying about disk controllers. These same people often criticize innovative products like the Sun Fire X4500 (aka Thumper) server because it only has 6 SATA controllers for 48 disk drives (8 disks/controller). In this blog, I'll take a look at the various possible RAID configurations for an X4500 and see how they affect the Unscheduled Mean Time Between System Interruptions (U_MTBSI).
Get off the bus
I think this overabundance of worry originated years ago, when controllers were very expensive and the available I/O slots for a computer were quite limited. If you look at the old-school technologies such as parallel SCSI or IDE interfaces, which were buses, then the concerns were valid. Indeed, if you look at any buses, you'd see many of the same opportunities for problems: S100, VME, DBus, etc. What we fear the most about a parallel bus is that some device will grab hold of it and not let go, thus wedging everything on the bus. For storage devices, this could mean that a single hardware fault could disrupt access to data and its redundancy, causing an outage. We will often describe this as a single fault zone. The diversity concept encourages distributing the redundant components across different fault zones. The pocketbook concept places the ultimate limits on how many components are available, though.
Another place where bus designs limit our choices is in performance. In general, only one device can be talking on a bus at any given time. So, if you have a bunch of devices sharing a bus, then you could have a performance bottleneck to go with your single fault zone. The obvious way to avoid this is to not use buses. (I usually take this opportunity to dis' fibre channel, but I'll spare you this time :-) Today, we have many opportunities to replace buses with point-to-point technologies: parallel SCSI replaced by serial attached SCSI (SAS), IDE/ATA replaced by Serial ATA (SATA), DBus replaced by Safari, front side bus (FSB) replaced by HyperTransport, Ethernet hubs replaced by Ethernet switches, etc.
From a RAS perspective, point-to-point technologies are very cool. There are more fault zones, one per pair of devices, but a fault in any single zone affects only that pair. This has numerous advantages because we can build highly reliable protocols into the links and not have to deal with sharing. Basically, if I have only two devices, then I can construct the link such that each device logically has one transmit and one receive interface. Simple. Simple is good. Simple allows us to do things like automatically know when a device is on the other end of the link: we hear something. Compare that with a shared bus, where you have to place a request for an address on the bus and hope that a device responds, which it can't if the bus is wedged by some other failed device. In other parts of the system, going point-to-point has allowed even more RAS improvements. For example, we use point-to-point connections between the system boards in a Sun Fire E25K server, which is why we can implement dynamic reconfiguration.
We also gain from Moore's Law. A curious thing happens when you integrate more functions onto a single chip -- the failure rate tends to remain more or less constant. In other words, if you take the functions performed by 4 different chips and integrate them onto one chip, then you get a 4:1 parts count reduction, the per-part failure rate stays roughly the same, so you get a 4x increase in reliability. Putting this together, consider replacing 4 parallel SCSI buses, each with a single controller and drive (4 fault zones), with a single 4-port SAS controller. At first glance you'd say that you replaced 4 fault zones with one fault zone. But in order to analyze such a system, you must take into account the reliability of the components. The single SAS controller will have approximately the same failure rate as each of the parallel SCSI controllers, so we get a 4x increase in controller reliability. Now, the answer to the question of which is better isn't so clear. And the math can get rather complicated. Naturally, we developed a tool to perform such analyses, RASCAD.
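To put rough numbers on the parts count argument, here is a minimal Python sketch. It assumes independent components with constant failure rates that simply add, and the per-chip failure rate is a made-up placeholder, not a measured value. It only captures the controller side of the trade-off; it says nothing about how many drives sit behind each fault zone, which is the part that needs a real model.

```python
# Illustrative only: CHIP_FAILURE_RATE is a hypothetical placeholder value.
CHIP_FAILURE_RATE = 1.0e-6  # failures per hour, per controller chip (assumed)


def total_failure_rate(n_chips, rate_per_chip=CHIP_FAILURE_RATE):
    """Combined failure rate of n chips, assuming independent,
    constant failure rates that add together."""
    return n_chips * rate_per_chip


def mtbf_hours(failure_rate):
    """Mean time between failures for a constant failure rate."""
    return 1.0 / failure_rate


# Four parallel SCSI controllers versus one 4-port SAS controller,
# assuming each chip has roughly the same failure rate.
four_scsi_rate = total_failure_rate(4)
one_sas_rate = total_failure_rate(1)

improvement = mtbf_hours(one_sas_rate) / mtbf_hours(four_scsi_rate)
print(f"controller MTBF improvement: {improvement:.1f}x")
```

Under these assumptions the ratio depends only on the parts count, so the placeholder rate cancels out.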
RASCAD is an industry-leading tool we developed and use at Sun to easily answer design questions regarding reliability, availability, and serviceability (RAS) of complex systems. For the computationally intrigued, we build hierarchical Markov models, very cool.
At Sun, we evaluate systems for their Mean Time Between System Interruptions (MTBSI). Very simply, a system interruption occurs when a component failure causes the system to reboot. This includes service events to repair the component. For example, if a component fails and the system reboots in a degraded state and repairing the component requires another reboot, then the failure causes two interruptions (e.g. CPUs). Obviously, some components can often be repaired without causing a second service outage (e.g. disks). MTBSI does include a serviceability component, which can be mitigated using planned outages and other processes. So, we often will try to stick with the unscheduled outages, U_MTBSI, which is an indicator of pain. Pain is not good, so we try to increase the time between painful events, U_MTBSI, whenever possible.
Analysis of X4500 and ZFS U_MTBSI
Previously, I discussed space versus MTTDL for an X4500. Given the 6 SATA controller configuration of the 48 available disks, how is the U_MTBSI affected by the various possible RAID configurations in ZFS? A dandy question. I took the same configurations and computed the U_MTBSI under the condition that I would strive for the best possible controller diversity. This gets a little tricky when you have only 6 controllers. For example, if you have a 7-disk raidz set, then at least two of the disks will share a controller. If you had a 6-disk raidz set, then you could place each drive in the set on a different controller. For the X4500, this gets a little more difficult because the two drives that you can boot from are on the same controller (BIOS limitation?). Also, what about spares? What happens when a spare shares the same controller as a data disk? Should I mirror adjacent or in opposition? Will the sun rise again tomorrow? Anyway, you can see where you can easily begin to get wrapped around the axle worrying about this stuff. Let's see what the analysis shows.
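The 6-disk versus 7-disk point is just the pigeonhole principle, and a few lines of Python make it concrete. This is a sketch of one obvious placement strategy (round-robin), not the layout I actually used in the analysis; the function names are mine, and it ignores boot disks and spares.

```python
from collections import Counter

NUM_CONTROLLERS = 6  # X4500: 6 SATA controllers, 8 disks each


def place_round_robin(set_width, num_controllers=NUM_CONTROLLERS):
    """Assign each disk (column) of one RAID set to a controller index,
    cycling through the controllers in order."""
    return [column % num_controllers for column in range(set_width)]


def max_disks_per_controller(set_width, num_controllers=NUM_CONTROLLERS):
    """Largest number of disks from one RAID set that end up sharing
    a single controller under round-robin placement."""
    placement = place_round_robin(set_width, num_controllers)
    return max(Counter(placement).values())


for width in (2, 6, 7):
    shared = max_disks_per_controller(width)
    print(f"{width}-disk set: at most {shared} disk(s) per controller")
```

With 6 controllers, any set of 6 or fewer disks can have one disk per controller, while a 7-disk set necessarily puts two disks on one controller.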
The first thing you notice is that the same statement I made last time still applies - friends don't let friends use RAID-0!
The second thing you'll notice is that the RAID types are clumping again. But the clumping is not as diverse in the U_MTBSI axis as you might expect. If you think about it, you are comparing the reliability of one chip versus a pile of chips, motors, amplifiers, heads, media, connectors, wires, and other stuff. Since the reliability of the controller is much higher than the reliability of the disks (controller MTBF >> disk MTBF), the disk reliability dominates. This is also why we do see a significant difference between the RAID types used.
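To see how lopsided this can be, here is a small Python sketch that compares the failure rate of one controller against the 8 disks hanging off it. Both MTBF values are hypothetical numbers picked only to satisfy controller MTBF >> disk MTBF; they are not vendor figures and not the inputs to the analysis. It also assumes independent, constant failure rates.

```python
# Hypothetical MTBF values, chosen only for illustration.
DISK_MTBF_HOURS = 500_000          # assumed
CONTROLLER_MTBF_HOURS = 5_000_000  # assumed, 10x the disk value
DISKS_PER_CONTROLLER = 8           # X4500: 48 disks / 6 controllers


def disk_share_of_failures(disk_mtbf, controller_mtbf, disks_per_controller):
    """Fraction of all hardware failures in one controller group
    (one controller plus its disks) that are disk failures."""
    disk_rate = disks_per_controller / disk_mtbf
    controller_rate = 1.0 / controller_mtbf
    return disk_rate / (disk_rate + controller_rate)


share = disk_share_of_failures(
    DISK_MTBF_HOURS, CONTROLLER_MTBF_HOURS, DISKS_PER_CONTROLLER
)
print(f"share of failures that are disk failures: {share:.1%}")
```

With these made-up numbers nearly all failures are disk failures, which is why how the RAID type handles disk failures matters more than where the controllers are.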
The third thing you'll notice is that once again I haven't labeled the U_MTBSI axis. If I had labeled it, then it would be a rat hole opportunity with a high probability of entrance. In this case, all of the components are identical, the only change is the RAID configuration. So you could even consider the results normalized to some value and gain the same insights.
The explanation I'll offer for why RAID-Z (RAID-5) is worse than mirroring (RAID-1) or double parity RAID-Z2 (RAID-6) is that the probability of two disks failing remains the same. But the probability that two disk failures cause an interruption is very different. I think this clearly shows what David Patterson alluded to in a presentation he gave at the Sun Technical Conference once: single parity RAID-5 just doesn't give enough protection, double parity (e.g. RAID-Z2) would have been a better choice. He also mentioned that people hassle him because they've lost data with RAID-5. Needless to say, I'm not a big fan of RAID-5.
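One rough way to see why the same two failures hurt so differently is to count pairs. The Python sketch below asks: of all possible pairs of failed disks, what fraction lands in the same RAID group and exceeds what that group can tolerate? This is a back-of-the-envelope counting argument, not the Markov model behind the analysis. It assumes all 48 disks are data disks (no boot disks or spares), uses example group sizes, and ignores timing, such as whether the second failure arrives before the first is repaired.

```python
from math import comb

TOTAL_DISKS = 48  # assumption: every disk is a data disk, no spares


def fatal_pair_fraction(group_size, tolerated_failures, total_disks=TOTAL_DISKS):
    """Fraction of all two-disk failure pairs that fall inside one RAID
    group and exceed the number of failures that group can tolerate."""
    if tolerated_failures >= 2:
        return 0.0  # a double-parity group survives any two failures
    groups = total_disks // group_size
    same_group_pairs = groups * comb(group_size, 2)
    return same_group_pairs / comb(total_disks, 2)


configs = {
    "2-way mirror": fatal_pair_fraction(group_size=2, tolerated_failures=1),
    "raidz 5+1": fatal_pair_fraction(group_size=6, tolerated_failures=1),
    "raidz2 4+2": fatal_pair_fraction(group_size=6, tolerated_failures=2),
}
for name, fraction in configs.items():
    print(f"{name}: {fraction:.2%} of two-disk failure pairs are fatal")
```

The wider the single-parity group, the more pairs fall inside it, so raidz comes out worse than mirroring here even though the chance of any two disks failing is unchanged.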
You can see that controller diversity doesn't make as big an impact on the U_MTBSI as the RAID type. For example, the U_MTBSI for raidz with 5+1 (one column per controller) is not that much different from the U_MTBSI for 6+1 (more than one column per controller). Similarly, the use of hot spares doesn't seem to make a big difference. You can more easily see the advantage of hot sparing when you look at MTTDL or Mean Time Between Service (MTBS, more on that later...).
I hope that this view of the trade-off between RAS and space will help you make better design decisions. There are other trade-offs to be considered as well, so I encourage you to look at all of the requirements and see how they can be satisfied. And don't get all wrapped around the axle worried about SAS/SATA controller diversity. Be diverse if you can, but don't worry too much if you can't.