ZFS End-to-End Data Integrity
By bonwick on Dec 08, 2005
The job of any filesystem boils down to this: when asked to read a block, it should return the same data that was previously written to that block. If it can't do that -- because the disk is offline or the data has been damaged or tampered with -- it should detect this and return an error.
Incredibly, most filesystems fail this test. They depend on the underlying hardware to detect and report errors. If a disk simply returns bad data, the average filesystem won't even detect it.
Even if we could assume that all disks were perfect, the data would still be vulnerable to damage in transit: controller bugs, DMA parity errors, and so on. All you'd really know is that the data was intact when it left the platter. If you think of your data as a package, this would be like UPS saying, "We guarantee that your package wasn't damaged when we picked it up." Not quite the guarantee you were looking for.
In-flight damage is not a mere academic concern: even something as mundane as a bad power supply can cause silent data corruption.
Arbitrarily expensive storage arrays can't solve the problem. The I/O path remains just as vulnerable, but becomes even longer: after leaving the platter, the data has to survive whatever hardware and firmware bugs the array has to offer.
And if you're on a SAN, you're using a network designed by disk firmware writers. God help you.
What to do? One option is to store a checksum with every disk block. Most modern disk drives can be formatted with sectors that are slightly larger than the usual 512 bytes -- typically 520 or 528. These extra bytes can be used to hold a block checksum. But making good use of this checksum is harder than it sounds: the effectiveness of a checksum depends tremendously on where it's stored and when it's evaluated.
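To make that concrete, here's a minimal sketch of what such a sector might look like. This is a hypothetical layout, not any particular vendor's format:

    #include <stdint.h>

    #define SECTOR_DATA_SIZE  512
    #define SECTOR_SPARE_SIZE 8     /* 520-byte sector: 8 spare bytes */

    /*
     * Hypothetical 520-byte sector. The spare bytes hold a checksum
     * of the 512 data bytes. Note that the data and its checksum
     * travel together as a single unit -- which, as we'll see, is
     * exactly the weak point of this scheme.
     */
    typedef struct sector {
        uint8_t  s_data[SECTOR_DATA_SIZE];
        uint64_t s_checksum;        /* checksum of s_data */
    } sector_t;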
In many storage arrays (see the Dell|EMC PowerVault paper for a typical example with an excellent description of the issues), the data is compared to its checksum inside the array. Unfortunately this doesn't help much. It doesn't detect common firmware bugs such as phantom writes (the previous write never made it to disk) because the data and checksum are stored as a unit -- so they're self-consistent even when the disk returns stale data. And the rest of the I/O path from the array to the host remains unprotected. In short, this type of block checksum provides a good way to ensure that an array product is not any less reliable than the disks it contains, but that's about all.
NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.
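Schematically, the host-side check looks something like this. This is an illustration, not NetApp's code: the checksum here is a trivial stand-in that uses only 8 of the 64 spare bytes, and the actual algorithm doesn't matter for the argument. What matters is that verify_block() runs in the host, after the data has crossed the entire I/O path:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define WAFL_BLOCK_SIZE  4096   /* 8 sectors x 512 data bytes */
    #define CKSUM_AREA_SIZE  64     /* 8 sectors x 8 spare bytes  */

    /*
     * Eight 520-byte sectors, read into host memory as one unit:
     * 4K of filesystem data plus 64 bytes of checksum metadata.
     */
    typedef struct wafl_block {
        uint8_t wb_data[WAFL_BLOCK_SIZE];
        uint8_t wb_cksum[CKSUM_AREA_SIZE];
    } wafl_block_t;

    /* Stand-in checksum: a trivial 64-bit sum, not the real thing. */
    static uint64_t
    checksum(const uint8_t *buf, size_t len)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return (sum);
    }

    /*
     * Returns 0 iff the block survived the trip from platter to
     * host memory. Because this check runs in the host, after the
     * DMA, it catches damage anywhere in the I/O path -- but it
     * still can't tell a self-consistent stale block from the
     * right one.
     */
    static int
    verify_block(const wafl_block_t *wb)
    {
        uint64_t expected;
        memcpy(&expected, wb->wb_cksum, sizeof (expected));
        return (checksum(wb->wb_data, WAFL_BLOCK_SIZE) == expected ? 0 : -1);
    }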
This is a major improvement, but it's still not enough. A block-level checksum only proves that a block is self-consistent; it doesn't prove that it's the right block. Reprising our UPS analogy, "We guarantee that the package you received is not damaged. We do not guarantee that it's your package."
The fundamental problem with all of these schemes is that they don't provide fault isolation between the data and the checksum that protects it.
ZFS Data Authentication
End-to-end data integrity requires that each data block be verified against an independent checksum, after the data has arrived in the host's memory. It's not enough to know that each block is merely consistent with itself, or that it was correct at some earlier point in the I/O path. Our goal is to detect every possible form of damage, including human mistakes like accidentally enabling swap on a disk that holds a live filesystem, or mistyping the arguments to dd(1). (Have you ever typed "of=" when you meant "if="?)
A ZFS storage pool is really just a tree of blocks. ZFS provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer -- not in the block itself. Every block in the tree contains the checksums for all its children, so the entire pool is self-validating. [The uberblock (the root of the tree) is a special case because it has no parent; more on how we handle that in another post.]
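In grossly simplified form -- the real ZFS blkptr_t also carries multiple copies of the block's address, compression and checksum-algorithm metadata, birth time, and more -- the idea looks like this:

    #include <stdint.h>

    /* A 256-bit checksum, as four 64-bit words. */
    typedef struct cksum {
        uint64_t c_word[4];
    } cksum_t;

    /*
     * Simplified block pointer. The essential point: the checksum
     * of a block is stored here, in its *parent*, not in the block
     * it describes -- so the data and its checksum can never fail
     * together in a self-consistent way.
     */
    typedef struct blkptr {
        uint64_t bp_offset;     /* where the child block lives */
        uint64_t bp_size;       /* how big it is */
        cksum_t  bp_cksum;      /* checksum of the child's contents */
    } blkptr_t;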
When the data and checksum disagree, ZFS knows that the checksum can be trusted because the checksum itself is part of some other block that's one level higher in the tree, and that block has already been validated.
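The read path falls out naturally. A minimal sketch, reusing blkptr_t from above, with read_raw() and checksum_of() as hypothetical stand-ins for the real disk I/O and checksum routines:

    #include <string.h>

    /* Stand-ins for the real I/O path and checksum function. */
    extern int  read_raw(uint64_t offset, uint64_t size, void *buf);
    extern void checksum_of(const void *buf, uint64_t size, cksum_t *ck);

    /*
     * Read the child block that bp describes and verify it against
     * the checksum stored in bp -- that is, in the parent. By the
     * time we get here, bp itself was validated the same way by
     * *its* parent, and so on up to the uberblock.
     */
    int
    read_and_verify(const blkptr_t *bp, void *buf)
    {
        cksum_t actual;

        if (read_raw(bp->bp_offset, bp->bp_size, buf) != 0)
            return (-1);        /* the disk itself reported an error */

        checksum_of(buf, bp->bp_size, &actual);
        if (memcmp(&actual, &bp->bp_cksum, sizeof (actual)) != 0)
            return (-1);        /* silent corruption: data != checksum */

        return (0);
    }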
ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy.
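Here's a sketch of a self-healing read for a two-way mirror. It's a toy model -- an in-memory "mirror" and a one-word stand-in checksum, not the actual ZFS code -- but it captures the logic: because the expected checksum comes from the parent block pointer, we can tell which copy is good, something a traditional mirror (which can only compare its copies to each other) cannot do:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define NCOPIES 2           /* two-way mirror */
    #define BLKSIZE 4096

    /* Toy in-memory "mirror" standing in for the real vdev layer. */
    static uint8_t mirror[NCOPIES][BLKSIZE];

    static int
    vdev_read(int copy, void *buf)
    {
        memcpy(buf, mirror[copy], BLKSIZE);
        return (0);             /* a real disk could also return an error */
    }

    static void
    vdev_write(int copy, const void *buf)
    {
        memcpy(mirror[copy], buf, BLKSIZE);
    }

    /* Stand-in checksum: one 64-bit word for brevity; ZFS uses 256 bits. */
    static uint64_t
    checksum_of(const uint8_t *buf)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < BLKSIZE; i++)
            sum = sum * 31 + buf[i];
        return (sum);
    }

    /*
     * Self-healing read: 'expected' is the checksum from the parent
     * block pointer. Find a copy that verifies, return its data,
     * and repair any copy that didn't.
     */
    int
    mirror_read(uint64_t expected, uint8_t *buf)
    {
        int bad[NCOPIES] = { 0 };
        int good = -1;

        for (int c = 0; c < NCOPIES && good == -1; c++) {
            if (vdev_read(c, buf) != 0 || checksum_of(buf) != expected)
                bad[c] = 1;     /* I/O error or silent corruption */
            else
                good = c;       /* verified good copy */
        }

        if (good == -1)
            return (-1);        /* every copy is damaged: report an error */

        /* Self-healing: rewrite the copies that failed, using good data. */
        for (int c = 0; c < NCOPIES; c++)
            if (bad[c])
                vdev_write(c, buf);

        return (0);
    }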
As always, note that ZFS end-to-end data integrity doesn't require any special hardware. You don't need pricey disks or arrays, you don't need to reformat drives with 520-byte sectors, and you don't have to modify applications to benefit from it. It's entirely automatic, and it works with cheap disks.
But wait, there's more!
The blocks of a ZFS storage pool form a Merkle tree in which each block validates all of its children. Merkle trees have been proven to provide cryptographically strong authentication for any component of the tree, and for the tree as a whole. ZFS employs 256-bit checksums for every block, and offers checksum functions ranging from the simple-and-fast fletcher2 (the default) to the slower-but-secure SHA-256. When using a cryptographic hash like SHA-256, the uberblock checksum provides a constantly up-to-date digital signature for the entire storage pool.
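For the curious, fletcher2 really is simple and fast. Here's a sketch patterned after the OpenSolaris implementation (zio_cksum_t is the actual name of the 256-bit checksum type in ZFS). It runs two interleaved Fletcher-style sums over the data viewed as 64-bit words, which is why it assumes the size is a multiple of 16 bytes -- as ZFS block sizes always are:

    #include <stdint.h>

    /* A 256-bit checksum as four 64-bit words. */
    typedef struct zio_cksum {
        uint64_t zc_word[4];
    } zio_cksum_t;

    /*
     * fletcher2: two interleaved Fletcher-style sums. Fast, but
     * only a checksum -- choose SHA-256 when you want the
     * cryptographic strength described above.
     */
    void
    fletcher2(const void *buf, uint64_t size, zio_cksum_t *zcp)
    {
        const uint64_t *ip = buf;
        const uint64_t *ipend = ip + (size / sizeof (uint64_t));
        uint64_t a0, b0, a1, b1;

        for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
            a0 += ip[0];        /* even-word stream, first-order sum */
            a1 += ip[1];        /* odd-word stream, first-order sum */
            b0 += a0;           /* even-word stream, second-order sum */
            b1 += a1;           /* odd-word stream, second-order sum */
        }

        zcp->zc_word[0] = a0;
        zcp->zc_word[1] = a1;
        zcp->zc_word[2] = b0;
        zcp->zc_word[3] = b1;
    }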
Which comes in handy if you ask UPS to move it.