ZFS End-to-End Data Integrity

The job of any filesystem boils down to this: when asked to read a block, it should return the same data that was previously written to that block. If it can't do that -- because the disk is offline or the data has been damaged or tampered with -- it should detect this and return an error.

Incredibly, most filesystems fail this test. They depend on the underlying hardware to detect and report errors. If a disk simply returns bad data, the average filesystem won't even detect it.

Even if we could assume that all disks were perfect, the data would still be vulnerable to damage in transit: controller bugs, DMA parity errors, and so on. All you'd really know is that the data was intact when it left the platter. If you think of your data as a package, this would be like UPS saying, "We guarantee that your package wasn't damaged when we picked it up." Not quite the guarantee you were looking for.

In-flight damage is not a mere academic concern: even something as mundane as a bad power supply can cause silent data corruption.

Arbitrarily expensive storage arrays can't solve the problem. The I/O path remains just as vulnerable, but becomes even longer: after leaving the platter, the data has to survive whatever hardware and firmware bugs the array has to offer.

And if you're on a SAN, you're using a network designed by disk firmware writers. God help you.

What to do? One option is to store a checksum with every disk block. Most modern disk drives can be formatted with sectors that are slightly larger than the usual 512 bytes -- typically 520 or 528. These extra bytes can be used to hold a block checksum. But making good use of this checksum is harder than it sounds: the effectiveness of a checksum depends tremendously on where it's stored and when it's evaluated.

In many storage arrays (see the Dell|EMC PowerVault paper for a typical example with an excellent description of the issues), the data is compared to its checksum inside the array. Unfortunately this doesn't help much. It doesn't detect common firmware bugs such as phantom writes (the previous write never made it to disk) because the data and checksum are stored as a unit -- so they're self-consistent even when the disk returns stale data. And the rest of the I/O path from the array to the host remains unprotected. In short, this type of block checksum provides a good way to ensure that an array product is not any less reliable than the disks it contains, but that's about all.

NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.

This is a major improvement, but it's still not enough. A block-level checksum only proves that a block is self-consistent; it doesn't prove that it's the right block. Reprising our UPS analogy, "We guarantee that the package you received is not damaged. We do not guarantee that it's your package."

The fundamental problem with all of these schemes is that they don't provide fault isolation between the data and the checksum that protects it.
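
To make the missing fault isolation concrete, here is a minimal Python sketch of a plain block-appended checksum with no block identity attached. The toy `disk` dictionary, the `seal` and `is_self_consistent` helpers, and the use of SHA-256 are all assumptions made for illustration; this is not any vendor's actual on-disk format.

```python
import hashlib

def seal(data: bytes) -> bytes:
    """Store the data and its checksum together as one unit."""
    return data + hashlib.sha256(data).digest()

def is_self_consistent(sector: bytes) -> bool:
    """Check a sector against the checksum embedded in that same sector."""
    data, stored = sector[:-32], sector[-32:]
    return hashlib.sha256(data).digest() == stored

# Toy disk (hypothetical): block number -> sealed sector.
disk = {7: seal(b"old contents"), 8: seal(b"someone else's block")}

# Phantom write: the drive acknowledges an update to block 7 but never performs it,
# so disk[7] is deliberately left untouched here.
intended = seal(b"new contents")

stale = disk[7]        # what a later read of block 7 actually returns
misdirected = disk[8]  # what a misdirected read of block 7 might return

print(is_self_consistent(stale))        # True: stale data passes the check
print(is_self_consistent(misdirected))  # True: the wrong block passes too
print(stale == intended)                # False: yet it is not what was written
```

Both bad reads pass because the only witness to the block's correctness lives inside the block being questioned.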

ZFS Data Authentication

End-to-end data integrity requires that each data block be verified against an independent checksum, after the data has arrived in the host's memory. It's not enough to know that each block is merely consistent with itself, or that it was correct at some earlier point in the I/O path. Our goal is to detect every possible form of damage, including human mistakes like swapping on a filesystem disk or mistyping the arguments to dd(1). (Have you ever typed "of=" when you meant "if="?)

A ZFS storage pool is really just a tree of blocks. ZFS provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer -- not in the block itself. Every block in the tree contains the checksums for all its children, so the entire pool is self-validating. [The uberblock (the root of the tree) is a special case because it has no parent; more on how we handle that in another post.]

When the data and checksum disagree, ZFS knows that the checksum can be trusted because the checksum itself is part of some other block that's one level higher in the tree, and that block has already been validated.
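
The idea can be sketched in a few lines of Python. Everything named here (`BlockPointer`, `write_block`, `read_block`, the `lose_write` flag, the toy `disk`) is hypothetical and greatly simplified: real ZFS block pointers carry much more than an address and a checksum, and ZFS never overwrites live blocks in place as this toy does.

```python
import hashlib
from dataclasses import dataclass

class ChecksumError(Exception):
    """Raised when a block does not match the checksum held by its parent."""

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

@dataclass
class BlockPointer:
    """What a parent block records about one child (simplified)."""
    address: int
    checksum: bytes

disk = {}  # toy disk: address -> raw block contents, no embedded checksum

def write_block(address: int, data: bytes, lose_write: bool = False) -> BlockPointer:
    """Write a block and return the pointer the parent would store.

    lose_write=True simulates a phantom write: the drive reports success
    but the platter is never updated.
    """
    if not lose_write:
        disk[address] = data
    return BlockPointer(address, checksum(data))

def read_block(ptr: BlockPointer) -> bytes:
    """Read a block and verify it against the checksum held in the parent."""
    data = disk[ptr.address]
    if checksum(data) != ptr.checksum:
        raise ChecksumError(f"checksum mismatch at block {ptr.address}")
    return data

disk[7] = b"old contents"
ptr = write_block(7, b"new contents", lose_write=True)

try:
    read_block(ptr)
    detected = False
except ChecksumError:
    detected = True

print(detected)  # True: the stale block no longer matches its parent's checksum
```

Because the checksum travels with the pointer rather than with the data, a stale or misplaced block has no way to vouch for itself.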

ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy.
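
A rough sketch of that detect-and-repair logic for a two-way mirror follows. The `mirror_read` function and the dictionaries standing in for disks are illustrative assumptions, not ZFS code; the expected checksum is assumed to come from the already-validated parent block.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mirror_read(sides, address, expected):
    """Return a verified copy of the block and repair any damaged copies.

    sides    -- list of toy disks (dicts mapping address -> bytes)
    expected -- checksum taken from the parent block pointer
    """
    good = None
    damaged = []
    for side in sides:
        data = side[address]
        if checksum(data) == expected:
            good = data
        else:
            damaged.append(side)
    if good is None:
        raise IOError("all copies failed verification")
    for side in damaged:
        side[address] = good  # rewrite the bad copy from the good one
    return good

payload = b"important data"
expected = checksum(payload)

side_a = {0: b"important dat\x00"}  # silently corrupted copy
side_b = {0: payload}               # intact copy

result = mirror_read([side_a, side_b], 0, expected)
print(result == payload)     # True: the application gets good data
print(side_a[0] == payload)  # True: the damaged copy has been repaired
```

Without the independent checksum, a mirror that finds its two sides disagreeing has no way to tell which one is right.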

As always, note that ZFS end-to-end data integrity doesn't require any special hardware. You don't need pricey disks or arrays, you don't need to reformat drives with 520-byte sectors, and you don't have to modify applications to benefit from it. It's entirely automatic, and it works with cheap disks.

But wait, there's more!

The blocks of a ZFS storage pool form a Merkle tree in which each block validates all of its children. Merkle trees have been proven to provide cryptographically-strong authentication for any component of the tree, and for the tree as a whole. ZFS employs 256-bit checksums for every block, and offers checksum functions ranging from the simple-and-fast fletcher2 (the default) to the slower-but-secure SHA-256. When using a cryptographic hash like SHA-256, the uberblock checksum provides a constantly up-to-date digital signature for the entire storage pool.
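
The self-validating property of a Merkle tree is easy to demonstrate. The sketch below assumes a simple binary tree over four leaf blocks and uses SHA-256 throughout; the real pool's tree shape and fan-out are different, and `merkle_root` is a hypothetical helper, not a ZFS function.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the root hash of a binary Merkle tree over a non-empty list of blocks.

    Each parent is the hash of its children's concatenated hashes, so the
    root depends on every byte of every leaf.
    """
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [sha256(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block 0", b"block 1", b"block 2", b"block 3"]
root_before = merkle_root(blocks)

blocks[2] = b"block 2, tampered"
root_after = merkle_root(blocks)

print(root_before != root_after)  # True: any change below shows up at the root
```

Holding on to the root hash alone is therefore enough to notice later that something, anywhere in the tree, has changed.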

Which comes in handy if you ask UPS to move it.


Comments:

Is there a file-system/RAID equivalent of the Fallacies of Distributed Computing? Sounds like there should be. Maybe something like: (*) data redundancy guarantees data reliability, (*) checksums guarantee data accuracy, (*) bits only flip on disc platters, (*) firmware is bug-free, (*) NVRAM can persist in the long term, (*) only the file-system can write data.

Posted by Chris Rijk on December 08, 2005 at 08:08 PM PST #

While we're at it, Chris, maybe we can add "multiple writes issued at the same time can't fail separately" because that's how RAID-Z supposedly fixes the RAID-5 write hole. We could also add "calculating checksums is free" and "caching metadata on the host alone is better than caching on both the host and the storage" because those also seem to be common myths. Then there's "I/O bandwidth is the scarcest commodity on systems today so we're going to use up to half of it writing parity and break layering to save a few CPU cycles" but that's really more of an inconsistency than a myth.

But seriously, Jeff, thanks for writing about this. The transactional-write strategy and checksumming are IMO the truly magical parts of ZFS. Even if the former is foreshadowed by other systems (such as WAFL which you finally mention but still fail to credit as an inspiration) it's still very cool to see it in a general-purpose filesystem and it does solve data-integrity problems that can't possibly be solved at the external-hardware level. It's a great application of the end-to-end principle.

P.S. I'd love to see what you have to say about transactional updates and batching and so on. Are you planning to do that one next?

Posted by Platypus on December 08, 2005 at 08:46 PM PST #

Hmm... some comments for Platypus:

"calculating checksums is free"... not free, just essential if you want your data actually correct. Platypus, why aren't you railing about the absurd cost of adding ECC or parity to RAM? Remember: Performance is a goal, correctness is a constraint.

"I/O bandwidth is the scarcest commodity on systems today so we're going to use up to half of it writing parity and break layering to save a few CPU cycles".... Half of it writing parity? Sounds like you're using a mirror, not RAIDZ... Note that we're going to see doubling of the number of CPUs on a single chip every 18 to 24 months... and that networking bandwidth is growing even faster.

And breaking layering? Layers are fine - but when they have outlived their usefulness, it's time to consolidate functions together - is Intel going to complain that AMD's integration of a memory controller on their Opteron is breaking layering? Layering throws away information - in this case very valuable information. Forcing all redundancy in the IO subsystem to be hidden behind a block interface was an expedient design decision for the first volume managers - not something engraved in stone. Layering is for cakes, not for software.

Posted by Bart on December 09, 2005 at 01:07 AM PST #

"Half of it writing parity?"

According to a comment Jeff B left on my website explaining how RAID-Z varies the stripe size, if you're doing single-block writes each upper-layer write will be done as one data plus one parity on disk.

"Note that we're going to see doubling of the number of CPUs on a single chip every 18 to 24 months... and that networking bandwidth is growing even faster."

...all of which only reinforces the point about system balance. If the rate of improvement is lower for I/O bandwidth than for other factors such as CPU or networking, then a "feature" that uses up to 50% of that I/O bandwidth writing parity will become more and more of an issue as systems continue to evolve. That's how things can work out with RAID-Z, as a couple of people have noted in the OpenSolaris forums, but ZFS would work without RAID-Z. If any part of a "full stripe" write into as-yet-unattached space fails, it can simply be retried with no ill effect. That's the beauty of transactional updates. That leaves one free to do wider writes all the time, which could potentially waste a bit of space temporarily except that (a) disk space is likely to be the most expendable resource available, and (b) the waste will often be only temporary as the stripe is filled and earlier partial versions are reclaimed so that the final state will actually be more compact.

By the way, whenever the stripe width is two it's silly to write parity anyway. Fallback to mirroring in that case would be better.

"Layers are fine - but when they have outlived their usefulness, it's time to consolidate functions together"

It has yet to be proven that the layers involved in this case have outlived their usefulness. I reject that assumption, for reasons I'll get to in a moment.

"Forcing all redundancy in the IO subsystem to be hidden behind a block interface was an expedient design decision for the first volume managers - not something engraved in stone."

One can change the details of an interface without rearranging the layers entirely. The interface between filesystems and volume managers can - and in my opinion should - be made richer, not abandoned, and that would have been sufficient to meet ZFS's goals while retaining greater compatibility with the rest of the storage ecosystem. Of course, that would have meant negotiating with others instead of reinventing the wheel.

"Layering is for cakes, not for software."

It's a good thing there are smarter people than you, who don't think in simplistic slogans. Collapsing layers is a common trick in the embedded world, to wring every last bit of performance out of a CPU- or memory-constrained system at the cost of adaptability. In a system that is constrained by neither CPU nor memory and where future extension should be expected, it's a bad approach. If networking folks had thought as you do, it would be a lot harder to implement or deploy IPv6 or the current crop of transport protocols, not to mention packet filters and such that insert themselves between layers. In other words, it would be harder for technology to progress. In storage that same story has been played out with SCSI/FC and PATA/SATA/SAS.

Of course a layered implementation might require more CPU cycles, but we've already established that those are not the constraining resource. More importantly, it can be hard work to figure out how the interface should look to maximize functionality while minimizing performance impact, but that's why they pay us the big bucks. A designer who picks the easy way out isn't innovating; he's failing to do his job.

Layering also makes more rigorous testing possible, and provides other benefits, which - and here's the real kicker - is probably why ZFS itself is still layered. It's different layering, to be sure, but it's still layering. I'm sure Jeff B will claim there are good reasons why the new layering is better than the old, and those reasons might even be valid, but right now those justifications are not apparent. So far it looks a lot like when a junior engineer rewrites code before they fully understand it, and introduces unnecessary regression in the process as a result. A more experienced engineer will try to understand what's there first, and might end up rewriting it (properly) anyway, but will more often find a less disruptive fix. Which kind of engineer are you, Bart?

Posted by Platypus on December 09, 2005 at 03:44 AM PST #

Another thought also occurred to me with respect to the "most filesystems fail this test" and "correctness is a constraint" and such. Sun still sells a filesystem that fails this test, and even ZFS users still have to boot off of one that violates this constraint. It might not be wise to be too critical of "every other filesystem" while that's the case.

The real fact is that data integrity is about probabilities, not absolutes. Any hardware capable of corrupting your data between the time it's checked by your HBA and the time it's checked by ZFS is also in all probability capable of corrupting it after it's checked by ZFS, especially if any kind of caching is involved (as is almost certain to be the case). Someone at, say, Oracle might say that the integrity checking's not truly "end to end" unless one end is the application - not the filesystem. In the end there's always going to be a window of vulnerability somewhere, so the best one can do is try to address the most common causes of failure. I think Fibre Channel is overdesigned, for example, and I've never been a firmware engineer so if that dig was directed at me it was misplaced, but those awful FC SANs do manage to reduce the occurrence of what has in my experience been one of the most common failure modes - someone pulling out a cable while they're doing something totally unrelated in the data center. If you can re-zone everything from your desk, you spend a lot less time anywhere near the cables.

RAID does a good job dealing with bit-rot on idle media, FC and other protocols do a pretty good job of safeguarding the physical data path while bits are in transit, and ZFS adds yet another bit of protection from (roughly) the hardware up to the top of the filesystem interface. Yay. All good. Now what about something that rots in the buffer/page cache? What about an application that writes something other than what it meant to, as I've seen databases do on many occasions? Oh well? Too bad? Not My Problem? Maybe. You could say the filesystem did its job, but that would be cold comfort to the person whose data was lost - and a little disingenuous from anyone who claims layers don't matter. There are still more holes to be plugged, and more "ends" to which integrity checking could be extended. The game's not over yet.

Posted by Platypus on December 11, 2005 at 02:26 AM PST #

Platypus wrote:
According to a comment Jeff B left on my website explaining how RAID-Z varies the stripe size, if you're doing single-block writes each upper-layer write will be done as one data plus one parity on disk.

How often do single-block writes happen? This is a fallback; analyzing it requires knowing what the actual write patterns are. My guess is that the vast majority of writes are of the full-stripe variety, not the mirroring fallback.

"By the way, whenever the stripe width is two it's silly to write parity anyway. Fallback to mirroring in that case would be better."

If you use "even" parity (i.e. XORing all the blocks yields a zero block), then when the stripe-width is two, you get mirroring.
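
The identity is easy to check. The following Python sketch computes even (XOR) parity for a stripe; `xor_blocks` and `even_parity` are hypothetical helpers written for this illustration and say nothing about how RAID-Z itself lays out data.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def even_parity(data_blocks):
    """Parity block chosen so that XORing the whole stripe yields all zeros."""
    return reduce(xor_blocks, data_blocks)

# Stripe width two: one data block plus one parity block.
data = b"single data block"
parity = even_parity([data])

print(parity == data)                               # True: the parity is a mirror copy
print(xor_blocks(data, parity) == bytes(len(data))) # True: the stripe XORs to zero
```

With only one data block there is nothing else to XOR against, so the parity block comes out identical to the data.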

Posted by Jonathan Adams on December 11, 2005 at 07:53 AM PST #

"If you use 'even' parity (i.e. XORing all the blocks yields a zero block), then when the stripe-width is two, you get mirroring."

True, but in most systems doing it that way would mean touching data that could otherwise be left alone - with all of the obvious effects on cache pollution, memory bandwidth, etc. In the ZFS case, of course, the data has to be touched anyway to calculate a checksum so that price has already been paid - which was sort of one of my earlier points.

Posted by Platypus on December 15, 2005 at 11:06 PM PST #

The description of NetApp's block-appended checksum technology is incorrect: Not only does the checksum guarantee that the block is self-consistent, it also ensures that it is the right block. The checksum includes the logical identity of the block ("I am block 15 of LUN foo") and contains enough information to distinguish the most recently written contents from the previous (self-consistent but stale) contents of that block.

In terms of the UPS analogy, the NetApp checksum mechanism guarantees that the package is undamaged -and- that it's the right package.

Posted by Eric Hamilton on May 30, 2006 at 03:03 AM PDT #
