I was recently talking to the ZFS engineering team about a comment, received through another venue, on the lack of ZFS recovery tools. After reviewing mail archives, we think this comment was probably related to a pool failure in 2006. Prompted by that comment, and by 10 years of working with our customers to use ZFS successfully, let's review both the problems we saw when ZFS was in its infancy, particularly around 3rd-party hardware and drivers that either ignored cache flush commands or failed to generate or fabricate device IDs, and the tools we had then and have now.
Let's look at the problems associated with cache flushing first:
- ZFS data integrity depends upon explicit write ordering: first data, then metadata, and finally the uberblock
- A disk drive write cache is a small amount of memory on the drive's controller board
- ZFS enables this cache and flushes it out every time ZFS commits a transaction
- We found that some 3rd-party disk drives ignore the "synchronize cache" command and simply discard the request, which can lead to out-of-order writes. This issue can prevent pools from being imported.
The second problem is around pool device IDs:
- All file systems must rely on a close relationship with their underlying devices and ZFS is no exception.
- ZFS tracks pool devices with internal device IDs so that if any intervening hardware component changes, ZFS can find its own pool devices.
- Third-party storage drivers do not always generate or fabricate device IDs, so if the hardware is moved or changed, ZFS can't find the pool devices and you could end up with a broken ZFS pool. If the pool is exported or the system is shut down before the hardware is changed or moved, ZFS has a chance to reread the device information and recover, unless there is some other hardware or cabling problem.
- Oracle/Sun hardware generates device IDs that ZFS relies on but I still don't recommend moving or changing hardware under live pools.
10 years ago, we had the tools to recover a pool damaged by changed pool device IDs, but most of us didn't know the recovery steps back then. Because we saw a fondness in the ZFS community for moving hardware underneath ZFS storage pools, with the unfortunate result of changing pool device IDs, we learned the steps to recover using an existing tool, zdb, which (thankfully) was integrated with ZFS back in 2005.
In summary, the steps are: use zdb -l to identify the pool device labels (and device IDs), create symbolic links in /dev/dsk to the previous device names, and then try to import the pool with the simulated device names. Alternatively, use dd to make a physical block copy of the pool devices and try to lofi-mount the copies under the original pool device names so that the data can be recovered. Many early ZFS engineers, support engineers, and community members worked tirelessly to recover broken pools, and we are all grateful. Many still do.
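As a rough sketch, those recovery steps look like this on the command line. The device names (c1t0d0s0, c2t0d0s0), paths, and the pool name tank are placeholders for illustration; adapt them to the labels that zdb -l actually reports on your system:

```shell
# 1. Read the ZFS label from a pool device to see the device paths
#    and device IDs the pool expects (repeat for each device).
zdb -l /dev/rdsk/c2t0d0s0

# 2. If the label records an old device name, recreate that name as
#    a symbolic link pointing at the device's current location.
ln -s /dev/dsk/c2t0d0s0 /dev/dsk/c1t0d0s0

# 3. Try importing the pool using the simulated device names.
zpool import tank

# Alternative: copy each pool device block-for-block, then work on
# the copies via lofi so the originals stay untouched.
dd if=/dev/rdsk/c2t0d0s0 of=/export/copy-c1t0d0s0 bs=1024k
lofiadm -a /export/copy-c1t0d0s0     # prints a /dev/lofi/N device
zpool import -d /dev/lofi tank
```

These commands must run against real pool devices, so treat them as a template rather than a script to paste.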
Then and Now
- Then (2005) - The ZFS debugger (zdb) became an invaluable tool not only for identifying and then reconstructing pool device information, but also for recovering ZFS pool data. Ben Rockwood wrote a blog in 2008 on how to use zdb to recover ZFS data. I've used it myself to verify that ZFS data is actually encrypted and also to recover data.
- 2009 - Pool recovery (zpool clear -FXn or zpool import -FXn) attempts to rewind the pool to a previous transaction so that a damaged pool can be cleared (if it is already imported) or imported after a reboot. This feature allows pool recovery, but should be used as a first step and carries the risk of losing a small amount of recent data because the pool is rewound to an earlier transaction.
- 2011 - Read-only pool import allows a pool on broken hardware to be imported so that at least the data can be recovered.
- Now - Along with the above tools that have been enhanced along the way, Oracle ZFS experts (many from the early days) are available to help recover ZFS pools and data. The best way to engage is to go through MOS support.
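As a hedged sketch of the 2009 and 2011 recovery paths above (tank is a placeholder pool name):

```shell
# Dry run first: with -n, the rewind is simulated and nothing is
# changed, so you can see whether recovery would succeed.
zpool import -Fn tank

# Rewind the pool to an earlier importable transaction. -X permits a
# more extreme rewind and may discard a small amount of recent data.
zpool import -FX tank

# If the damaged pool is already imported, the same rewind recovery
# is available through zpool clear.
zpool clear -F tank

# Read-only import (2011): brings up the pool without writing to it,
# so data can be copied off hardware that can no longer be trusted.
zpool import -o readonly=on tank
```

Again, these only make sense against a real (and damaged) pool; they are shown here to illustrate the shape of each recovery feature.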
- System hardware recommendations:
- Provide ECC memory and redundant pool devices, following the latest ZFS Best Practices.
- Ensure hardware respects cache flush commands.
- If you are using storage arrays with battery-backed cache, make sure you monitor battery life.
- ZFS pools built on SATA disks behind some SAS expanders were problematic in that if one disk failed, other disks around the failed disk also failed (as if in sympathy).
- Don't move or change hardware under live pools. Export the pool first or shut down the system. Keep a listing of zdb -l output and compare it to the device information for the pool about to be imported. Be prepared to repair the pool device links if there is a mismatch.
- Maximizing for space generally means maximizing pain upon a hardware failure. If your pool is too big to back up, reconfigure it so it can be backed up.
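A minimal pre-move checklist following the recommendations above, assuming a pool named tank and example device names that you would replace with your own:

```shell
# Before moving or changing hardware: export the pool cleanly...
zpool export tank

# ...and save each device's label for later comparison.
for dev in /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t1d0s0; do
    zdb -l "$dev" > /var/tmp/zdb-l.$(basename "$dev").txt
done

# After the move, compare the saved labels against what the devices
# report now; repair the /dev/dsk links if paths changed, then:
zpool import tank
```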