The original promise of RAID (Redundant Arrays of Inexpensive Disks) was that it would provide fast, reliable storage using cheap disks. The key point was cheap; yet somehow we ended up here. Why?

RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd, and Row Diagonal Parity) never quite delivered on the RAID promise -- and can't -- due to a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero -- it's that equation that allows you to reconstruct data when a disk fails. The problem is that there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage.

To see this, suppose you lose power after writing a data block but before writing the corresponding parity block. Now the data and parity for that stripe are inconsistent, and they'll remain inconsistent forever (unless you happen to overwrite the old data with a full-stripe write at some point). Therefore, if a disk fails, the RAID reconstruction process will generate garbage the next time you read any block on that stripe. What's worse, it will do so silently -- it has no idea that it's giving you corrupt data.

There are software-only workarounds for this, but they're so slow that software RAID has died in the marketplace. Current RAID products all do the RAID logic in hardware, where they can use NVRAM to survive power loss. This works, but it's expensive.

There's also a nasty performance problem with existing RAID schemes. When you do a partial-stripe write -- that is, when you update less data than a single RAID stripe contains -- the RAID system must read the old data and parity in order to compute the new parity. That's a huge performance hit. Where a full-stripe write can simply issue all the writes asynchronously, a partial-stripe write must do synchronous reads before it can even start the writes.

Once again, expensive hardware offers a solution: a RAID array can buffer partial-stripe writes in NVRAM while it's waiting for the disk reads to complete, so the read latency is hidden from the user. Of course, this only works until the NVRAM buffer fills up. No problem, your storage vendor says! Just shell out even more cash for more NVRAM. There's no problem your wallet can't solve.

Partial-stripe writes pose an additional problem for a transactional filesystem like ZFS. A partial-stripe write necessarily modifies live data, which violates one of the rules that ensures transactional semantics. (It doesn't matter if you lose power during a full-stripe write for the same reason that it doesn't matter if you lose power during any other write in ZFS: none of the blocks you're writing to are live yet.)

If only we didn't have to do those evil partial-stripe writes...

Enter RAID-Z.

RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do read-modify-write.

Whoa, whoa, whoa -- that's it? Variable stripe width? Geez, that seems pretty obvious. If it's such a good idea, why doesn't everybody do it?

Well, the tricky bit here is RAID-Z reconstruction. Because the stripes are all different sizes, there's no simple formula like "all the disks XOR to zero." You have to traverse the filesystem metadata to determine the RAID-Z geometry. Note that this would be impossible if the filesystem and the RAID array were separate products, which is why there's nothing like RAID-Z in the storage market today. You really need an integrated view of the logical and physical structure of the data to pull it off.

But wait, you say: isn't that slow? Isn't it expensive to traverse all the metadata? Actually, it's a trade-off. If your storage pool is very close to full, then yes, it's slower. But if it's not too close to full, then metadata-driven reconstruction is actually faster because it only copies live data; it doesn't waste time copying unallocated disk space.

But far more important, going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes. Traditional RAID products can't do this; they simply XOR the data together blindly.

Which brings us to the coolest thing about RAID-Z: self-healing data. In addition to handling whole-disk failure, RAID-Z can also detect and correct silent data corruption. Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn't return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data. It then repairs the damaged disk and returns good data to the application. ZFS also reports the incident through Solaris FMA so that the system administrator knows that one of the disks is silently failing.

Finally, note that RAID-Z doesn't require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks.

For a real-world example of RAID-Z detecting and correcting silent data corruption on flaky hardware, check out Eric Lowe's SATA saga.

The current RAID-Z algorithm is single-parity, but the RAID-Z concept works for any RAID flavor. A double-parity version is in the works.

One last thing that fellow programmers will appreciate: the entire RAID-Z implementation is just 599 lines.

Technorati Tags:

I thought you might be gratified to see that RAID-Z is already in wikipedia :)

Posted by Dan Price on November 18, 2005 at 04:03 AM PST #

How does ZFS deal with sata drives that have the write cache turned on? Such that in the event of a crash, writes that were acked to the OS are lost?

Posted by Naveen Nalam on November 18, 2005 at 04:39 AM PST #

We issue the SYNCHRONIZE CACHE command to the disks after pushing all data in a transaction group, but before updating the uberblock to commit it. I'll describe the transaction group commit process in detail in a future post.

Posted by Jeff Bonwick on November 18, 2005 at 04:40 PM PST #

Interesting Stuff Jeff. Another point about conventional RAID-5 is that, in dedicated RAID-5 hardware, you quickly saturate the memory bus between the CPU, disk controllers and dedicated XOR hardware. Not only do you saturate it, but you also have bus contention - since the disks will often generate data simultaneously - but that data has to be, essentially, serialized over the single bus. The solution to achieving high performance in this situation is, as you've said, add more hardware and more system complexity (and cost). One of my little "tricks", is to determine the system bus characteristics of the RAID-5 hardware. Once you know this, you \*know\* the maximum performance capabilities of the product!

Posted by Al Hopper on November 20, 2005 at 07:20 PM PST #

Very interesting idea. Is any performance data available for RAID-Z?

Posted by Igor on November 27, 2005 at 12:47 AM PST #

FYI, a RAID system with variable sized parity stripes:

Posted by guest on December 20, 2005 at 10:36 AM PST #

Excellent information

Posted by Veera Swamy Vallabhaneni on April 13, 2006 at 07:27 PM PDT #

Post a Comment:
Comments are closed for this entry.



« July 2016