In the initial days of ZFS some pointed out that ZFS resilvering was metadata driven and was therefore super fast : after all we only had to resilver data that was in-use
compared to traditional storage that has to resilver entire disk even if there is no actual data stored. And indeed on newly created pools ZFS was super fast for resilvering.
But of course storage pools rarely stay empty. So what happened when pools grew to store large quantities of data ? Well we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of a advantage, in real life, for ZFS.
And while ZFS based storage grew in importance, so did disk sizes. The disk sizes that people put in production are growing very fast showing the appetite of customers to store vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here ZFSSA storage arrays with TB class caches are certainly helping the trend. Disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly tradeoff IOPS for capacity given that ZFSSA deliver tons of cached IOPS and ultra cheap GB of storage.
And then comes resilvering...
So when a disk goes bad, one has to resilver all of the data on it. It is assured at that point that we will be accessing all of the data from surviving disks in the raid group and that this is not a highly cached set. And here was the rub with old style ZFS resilvering : the metadata driven algorithm was actually generating small random IOPS. The old algorithm was actually going through all of the blocks file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation. Because of the nature of ZFS, the populating of those blocks didn't lead to a sequential workload on the resilvering disks.
So in a worst case scenario, we would have to issue small random IOPS covering 100% of what was stored on the failed disk and issue small random writes to the new disk coming in as a replacement. With big disks and very low
IOPS rating comes ugly resilvering times. That effect was also compounded by a voluntary design balance that was strongly biased to protect application load. The compounded effect was month long resilvering.
To solve this, we designed a subtly modified version of resilvering. We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged over the previous algorithm except that, when encountering a block to resilver, instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered we now can sort these blocks by physical disk offset and issue the I/O in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/O more efficiently leading to fewer larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order
it allows the disk to serve these IOPS at the streaming limit of the disks (say 100MB/sec) rather than being IOPS limited (say 200 IOPS).
So we hold a strategy that allows us to resilver nearly as fast as physically possible by the given disk hardware. With that newly acquired capability of ZFS, comes the requirement to service application load with a limited impact from resilvering. We therefore have some mechanism to limit resilvering load in the presence of application load. Our stated goal is to be able to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.
As disks are getting bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2 and ZFSSA since 2013.1.2
, ZFS is now able to run resilvering with much of the same disk throughput limits as the rest of non-ZFS based storage.
The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer saying It is really good to see the new feature work so well in practice