By bonwick on May 01, 2006
Resilvering -- also known as resyncing, rebuilding, or reconstructing -- is the process of repairing a damaged device using the contents of healthy devices. This is what every volume manager or RAID array must do when one of its disks dies, gets replaced, or suffers a transient outage.
For a mirror, resilvering can be as simple as a whole-disk copy. For RAID-5 it's only slightly more complicated: instead of copying one disk to another, all of the other disks in the RAID-5 stripe must be XORed together. But the basic idea is the same.
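To make the XOR trick concrete, here's a minimal sketch (not real RAID code) showing that any one missing disk in a stripe can be rebuilt by XORing all the survivors, including parity:

```python
# Illustrative sketch of RAID-5 reconstruction: the missing disk's data
# is the XOR of the corresponding blocks on every surviving disk.
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# A stripe with three data blocks and one parity block: parity = d0 ^ d1 ^ d2.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = xor_blocks([d0, d1, d2])

# If the disk holding d1 dies, XOR everything that survives to rebuild it.
rebuilt = xor_blocks([d0, d2, parity])
assert rebuilt == d1
```

This works because XOR is its own inverse: XORing the parity with the remaining data blocks cancels them out, leaving exactly the missing block.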
In a traditional storage system, resilvering happens either in the volume manager or in RAID hardware. Either way, it happens well below the filesystem.
But this is ZFS, so of course we just had to be different.
In a previous post I mentioned that RAID-Z resilvering requires a different approach, because it needs the filesystem metadata to determine the RAID-Z geometry. In effect, ZFS does a 'cp -r' of the storage pool's block tree from one disk to another. It sounds less efficient than a straight whole-disk copy, and traversing a live pool safely is definitely tricky (more on that in a future post). But it turns out that there are so many advantages to metadata-driven resilvering that we've chosen to use it even for simple mirrors.
The most compelling reason is data integrity. With a simple disk copy, there's no way to know whether the source disk is returning good data. End-to-end data integrity requires that each data block be verified against an independent checksum -- it's not enough to know that each block is merely consistent with itself, because that doesn't catch common hardware and firmware bugs like misdirected reads and phantom writes.
By traversing the metadata, ZFS can use its end-to-end checksums to detect and correct silent data corruption, just like it does during normal reads. If a disk returns bad data transiently, ZFS will detect it and retry the read. If it's a 3-way mirror and one of the two presumed-good disks is damaged, ZFS will use the checksum to determine which one is correct, copy the data to the new disk, and repair the damaged disk.
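The key point is that the checksum lives in the parent block pointer, independent of the data itself. A hedged sketch of the selection logic (the names and structures below are illustrative, not ZFS's actual code; ZFS uses fletcher/SHA-256 checksums stored in block pointers):

```python
# Sketch: picking the good copy on a mirror using an independent checksum
# stored with the parent block pointer. Names are hypothetical.
import hashlib

def checksum(data):
    return hashlib.sha256(data).digest()

def resilver_block(copies, expected_checksum):
    """Return the first copy whose checksum matches the parent's record.
    That copy can then be written to the new disk and used to repair
    any damaged copies."""
    for copy in copies:
        if checksum(copy) == expected_checksum:
            return copy
    raise IOError("no valid copy found -- block is unrecoverable")

good = b"important data"
corrupt = b"important dat\xff"  # silent corruption on one side of the mirror
assert resilver_block([corrupt, good], checksum(good)) == good
```

A self-consistent block (say, one with a valid internal CRC written to the wrong location) would fool a disk-copy resilver; it can't fool a checksum recorded by the parent.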
A simple whole-disk copy would bypass all of this data protection. For this reason alone, metadata-driven resilvering would be desirable even if it came at a significant cost in performance.
Fortunately, in most cases, it doesn't. In fact, there are several advantages to metadata-driven resilvering:
Live blocks only. ZFS doesn't waste time and I/O bandwidth copying free disk blocks because they're not part of the storage pool's block tree. If your pool is only 10-20% full, that's a big win.
Transactional pruning. If a disk suffers a transient outage, it's not necessary to resilver the entire disk -- only the parts that have changed. I'll describe this in more detail in a future post, but in short: ZFS uses the birth time of each block to determine whether there's anything lower in the tree that needs resilvering. This allows it to skip over huge branches of the tree and quickly discover the data that has actually changed since the outage began.
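The pruning logic can be sketched like this (the block layout is hypothetical; in ZFS the birth time is the transaction group number recorded in each block pointer). The invariant that makes it work: a block's birth time is at least as recent as that of every block beneath it, so an old branch can be skipped wholesale:

```python
# Illustrative sketch of birth-time pruning during resilver.
class Block:
    def __init__(self, birth_txg, children=()):
        self.birth_txg = birth_txg      # transaction group that wrote this block
        self.children = list(children)

def resilver(block, outage_start_txg, repaired):
    """Walk the block tree, repairing only what changed during the outage."""
    if block.birth_txg < outage_start_txg:
        return  # the entire subtree predates the outage -- skip it
    repaired.append(block)
    for child in block.children:
        resilver(child, outage_start_txg, repaired)

old_branch = Block(birth_txg=50)
new_leaf = Block(birth_txg=120)
root = Block(birth_txg=120,
             children=[old_branch, Block(birth_txg=120, children=[new_leaf])])

repaired = []
resilver(root, outage_start_txg=100, repaired=repaired)
assert old_branch not in repaired and new_leaf in repaired
```

Only the path to changed data is visited; everything written before the outage is pruned at the highest possible point in the tree.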
What this means in practice is that if a disk has a five-second outage, it will only take about five seconds to resilver it. And you don't pay extra for it -- in either dollars or performance -- like you do with Veritas change objects. Transactional pruning is an intrinsic architectural capability of ZFS.
Top-down resilvering. A storage pool is a tree of blocks. The higher up the tree you go, the more disastrous it is to lose a block there, because you lose access to everything beneath it.
Going through the metadata allows ZFS to do top-down resilvering. That is, the very first thing ZFS resilvers is the uberblock and the disk labels. Then it resilvers the pool-wide metadata; then each filesystem's metadata; and so on down the tree. Throughout the process ZFS obeys this rule: no block is resilvered until all of its ancestors have been resilvered.
It's hard to overstate how important this is. With a whole-disk copy, even when it's 99% done there's a good chance that one of the top 100 blocks in the tree hasn't been copied yet. This means that from an MTTR perspective, you haven't actually made any progress: a second disk failure at this point would still be catastrophic.
With top-down resilvering, every single block copied increases the amount of discoverable data. If you had a second disk failure, everything that had been resilvered up to that point would be available.
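The ordering rule above amounts to a breadth-first walk of the block tree. A minimal sketch (the tree shape is hypothetical) showing that parents are always resilvered before their children:

```python
# Sketch of top-down resilver ordering: a breadth-first walk guarantees
# no block is visited before its ancestors, so every block copied
# extends the reachable, discoverable portion of the tree.
from collections import deque

def top_down_resilver(root):
    """Yield blocks in an order where parents always precede children."""
    queue = deque([root])
    while queue:
        block = queue.popleft()
        yield block
        queue.extend(block.get("children", []))

uberblock = {"name": "uberblock", "children": [
    {"name": "pool metadata", "children": [
        {"name": "fs metadata", "children": [
            {"name": "file data"}]}]}]}

order = [b["name"] for b in top_down_resilver(uberblock)]
assert order == ["uberblock", "pool metadata", "fs metadata", "file data"]
```

If the resilver is interrupted at any point, the prefix already copied is a self-contained, fully connected subtree, which is exactly why partial progress counts toward MTTR.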
Priority-based resilvering. ZFS doesn't do this one yet, but it's in the pipeline. ZFS resilvering follows the logical structure of the data, so it would be pretty easy to tag individual filesystems or files with a specific resilver priority. For example, on a file server you might want to resilver calendars first (they're important yet very small), then /var/mail, then home directories, and so on.
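Since this feature didn't exist yet at the time of writing, here's only a speculative sketch of what priority ordering could look like (the datasets and priority values are made up):

```python
# Hypothetical sketch of priority-based resilvering: because the traversal
# follows the logical structure of the data, each dataset could carry a
# resilver priority and be repaired in that order (lower = sooner).
import heapq

def resilver_order(datasets):
    """Return dataset names in resilver order, highest priority first."""
    heap = [(priority, name) for name, priority in datasets.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

datasets = {"calendars": 0, "/var/mail": 1, "home": 2}
assert resilver_order(datasets) == ["calendars", "/var/mail", "home"]
```

A block-level resilver has no notion of files or filesystems, so this kind of policy is only possible when the resilver follows the metadata.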
What I hope to convey with each of these posts is not just the mechanics of how a particular feature is implemented, but to illustrate how all the parts of ZFS form an integrated whole. It's not immediately obvious, for example, that transactional semantics would have anything to do with resilvering -- yet transactional pruning makes recovery from transient outages literally orders of magnitude faster. More on how that works in the next post.