To fix a longstanding bug (6343667
scrub/resilver has to start over
when snapshot is taken), I ended up rewriting the ZFS scrub code
(putback in OpenSolaris build 94). Mark
Maybee and I were already working on
"bp (block pointer) rewrite (or relocate)",
functionality that will allow us to traverse all the blocks in the pool
and move them around. This will be used to move all the used space off
a disk so that it can be removed. But we realized that since bp
relocate has to visit all the blocks in the pool, it can also be used
to scrub or resilver the pool. In fact, it's simpler to use the bp
relocate algorithm just for scrubbing, since we don't have to actually
move anything around. So in the interest of fixing 6343667 as quickly
as possible (in particular, in time for Fishworks),
we decided to first implement only the part of our
design necessary to fix that bug, and later complete the entire bp relocate
work to support device removal.
Similarities Between Old and New Scrub Code
Both the old and new scrub code work by
traversing all the metadata in the pool to find all the block
pointers. This is necessary because we need the checksums, which are
stored in the block pointers (rather than in a structure addressed by
disk offset, or in the (larger, 520-byte) disk sector itself). Because
we have snapshots and clones, there can be many pointers to the same
block. So an algorithm which simply follows every
pointer can end up scrubbing each snapshotted block many times, causing
the scrub to take many times longer than it could. Instead, we take
advantage of the birth time, which is stored in the block pointer. We
visit each block only from the dataset that it was born in. In
particular, if the block's birth time is before (less than) the
previous snapshot's, then we know that the block will be / has been
visited when we traverse an earlier snapshot, so we can ignore it. The
key here is that this applies to indirect blocks as well as data
blocks. When an indirect block does not need to be visited, we also
know that nothing under it needs to be visited, because everything
under it must have a birth time less than or equal to the indirect
block's (because if something under it was written, then all blocks
above it including this indirect block would also be written).
To recap, in both old and new code, we will scrub by traversing each
dataset, visiting the blocks with birth time after that dataset's
previous snapshot, and ignoring other blocks.
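
The pruning rule above can be sketched in a few lines. This is only an illustration, not the ZFS code: the `Block` class, its fields, and `blocks_to_scrub()` are hypothetical names, and a real block pointer carries much more than a birth time.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a block pointer and the block it points to."""
    name: str
    birth: int                                     # txg in which the block was written
    children: list = field(default_factory=list)   # empty for data blocks


def blocks_to_scrub(root, prev_snap_txg):
    """Return the names of blocks born after the dataset's previous snapshot."""
    visited = []
    stack = [root]
    while stack:
        blk = stack.pop()
        if blk.birth <= prev_snap_txg:
            # Covered by an earlier snapshot. Nothing beneath it can be newer
            # than it is, so the whole subtree is skipped.
            continue
        visited.append(blk.name)
        stack.extend(reversed(blk.children))
    return visited


# A dataset whose previous snapshot was taken at txg 20.
example_root = Block("root", 30, [
    Block("ind_a", 10, [Block("d1", 10), Block("d2", 8)]),
    Block("ind_b", 30, [Block("d3", 12), Block("d4", 30)]),
])
print(blocks_to_scrub(example_root, 20))  # ['root', 'ind_b', 'd4']
```

Note that `ind_a` is rejected without ever looking at `d1` or `d2`, which is where most of the savings come from.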
Differences Between Old and New Scrub Code
There are two fundamental differences between the old and new scrub
code: the old code works in open context, and visits datasets in no
particular order[\*]; the new code works in syncing context, and visits
datasets in order from oldest to newest. The key here is that because
of the order that we visit the datasets, we know what has already been
done, and what is left to do. Because we are working in syncing
context, we know that while we are actively scrubbing, nothing can be
changing (snapshots can't be taken or destroyed, blocks can't be freed,
etc). However, we can't do the entire scrub in one sync (because we
need to do new syncs every so often to push out dirty data, otherwise
all writes will be blocked), so we need to pause and allow each sync to
complete, and then resume when the next sync occurs. With these
constraints, it is not too hard to deal with new snapshots being taken
(and other changes) while the scrub is paused.
[\*] It's in a defined order, by object-id, but that doesn't correspond
to anything in particular. When visiting a dataset, we don't know if
the datasets before or after it have already been visited. And if a
snapshot is taken, we don't know if its object-id will happen to be
before or after the current one; this is partly the cause of 6343667.
So we want to scrub each dataset from oldest to newest, and take into
account snapshots and other changes that can occur while the scrub is
paused. How do we do that?
Order of Visiting Datasets
Here is a valid ordering for a filesystem (4), its snapshots (1, 2, 3),
and a clone (6) and its snapshot (5):

    1-->2-->3-->4 filesystem
        |
        -->5-->6 clone

Note that an equally valid order would be 1, 2, 5, 3, 6, 4. What
matters is that we visit everything before a given dataset before
we visit that dataset itself.
In the existing on-disk format (zpool version 10 and earlier), there is
no easy way to determine this order. There are two problems: first,
each dataset has a single forward pointer (ds_next_snap_obj), but if
there are clones of that snapshot, then we don't have the additional
forward pointers (shown above, there is no pointer to snapshot 5).
Second, there is no easy way to find the first snapshot (#1 above).
The best solution is to add some more information to the on-disk
format, which is what zpool version 11 does. We added multiple forward
pointers (ds_next_clones_obj), and we made all filesystems be clones of
the $ORIGIN (0 below):

    0-->1-->2-->3-->4 filesystem
    |       |
    |       -->5-->6 clone
    |
    ------>7-->8-->9 another filesystem

In this diagram, 2's ds_next_clones_obj
will point to 5, and 0's (the $ORIGIN's)
will point to 1 and 7. When we do a "zpool upgrade" to
version 11 or later, we will walk all the datasets and add this
information in. The new scrub code works even on earlier versions, but
whenever we need this extra information, we have to walk a bunch of
datasets to find it, so it's a little slower (but this performance hit
is usually small compared to the total time taken to perform the scrub).
Visitation Order Implementation
Now that we have (or will dig up on demand) the information we need,
how do we go about scrubbing in the correct order? We store a
pool-global "scrub queue" which is all the datasets that we
can do next. In particular, after we finish visiting each dataset, we put all
its next pointers (both the main next snapshot as well as any clones)
in the queue. Then we pull a dataset out of the queue and visit it.
We start with the $ORIGIN, which has no data, so it will just
add every non-clone filesystem (or zvol, of course) to the queue. The
queue is implemented as a zap object, so the order we pull things out
of it is not well defined, but that is fine.
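
Here is a minimal sketch of that queue discipline, assuming the next pointers are already available as a plain dictionary. The names (`scrub_order`, `next_ptrs`) are made up for illustration, and a Python set stands in for the zap object because it likewise gives no ordering guarantee.

```python
def scrub_order(next_ptrs, origin="$ORIGIN"):
    """Return one valid visitation order.

    next_ptrs maps a dataset to the datasets that follow it: its next
    snapshot plus any clones of it.
    """
    queue = {origin}   # unordered, like the zap-backed scrub queue
    order = []
    while queue:
        ds = queue.pop()                      # pull out any queued dataset
        order.append(ds)                      # "visit" it
        queue.update(next_ptrs.get(ds, []))   # then queue everything after it
    return order


# The datasets from the diagram: two filesystems and one clone.
example_ptrs = {
    "$ORIGIN": ["1", "7"],
    "1": ["2"],
    "2": ["3", "5"],
    "3": ["4"],
    "5": ["6"],
    "7": ["8"],
    "8": ["9"],
}
example_order = scrub_order(example_ptrs)
print(example_order)
```

The exact order printed can differ from run to run, but a dataset is only queued once its predecessor has been visited, so every result is a valid oldest-to-newest order.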
Now, we are visiting everything in syncing context in the right order.
If nothing changes (e.g., snapshots taken or destroyed) then our
algorithm is complete. But as I mentioned, we just scrub a little bit
in each transaction group (aka sync phase), and pause the scrub to
allow the sync to complete and then restart next sync. Each time we
visit a block, we check to see if we should pause, based on how much
time has passed, and if anything is waiting for the sync to complete.
When we pause, we remember where we left off with a zbookmark_t, which
specifies the location by dataset, object, level, and blockid. When we
continue from a paused scrub, we traverse back to that point before we
continue scrubbing. Check out scrub_pause().
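
To make the pause-and-resume idea concrete, here is a rough sketch. It is not how scrub_pause() is written: `Bookmark`, `scrub_one_sync()`, and the block-count budget are assumptions for illustration (the real check is based on elapsed time and on whether anything is waiting for the sync), and the traversal is flattened into a list.

```python
from typing import NamedTuple


class Bookmark(NamedTuple):
    """Hypothetical analogue of zbookmark_t: where we are in the traversal."""
    dataset: int
    obj: int
    level: int
    blkid: int


def scrub_one_sync(blocks, resume_from, should_pause):
    """Scrub part of `blocks` (bookmarks in traversal order) during one sync.

    Returns (scrubbed, bookmark); bookmark is None once the scrub is done.
    """
    scrubbed = []
    resuming = resume_from is not None
    for bm in blocks:
        if resuming:
            if bm != resume_from:
                continue            # walk back to the bookmark without scrubbing
            resuming = False
        if should_pause(len(scrubbed)):
            return scrubbed, bm     # remember the first block not yet scrubbed
        scrubbed.append(bm)
    return scrubbed, None


example_blocks = [Bookmark(dataset=1, obj=7, level=0, blkid=i) for i in range(5)]
bookmark, done, syncs = None, [], 0
while True:
    # Assumed budget: pause after two blocks per sync.
    part, bookmark = scrub_one_sync(example_blocks, bookmark, lambda n: n >= 2)
    done += part
    syncs += 1
    if bookmark is None:
        break
print(syncs, len(done))  # 3 5
```

Each block is scrubbed exactly once even though the work is spread over three syncs, because every sync starts by skipping ahead to the saved bookmark.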
Dealing with Changes
Now, what if something changed while we were paused? There are a few
cases we need to worry about: dataset deletion, snapshot creation, and
"clone swap", which happens at the end of an incremental "zfs recv".
When a dataset is destroyed, we need to see if any of our saved state
references this dataset, and if so then remove the dataset and add its
next pointer (the dataset following it). It's as though we "magically"
finished visiting that dataset and are ready to move on to its next
pointer. This applies whether this is the dataset that we paused in
the middle of, or if it's a dataset that's in the queue. If it's the
dataset that we paused in the middle of, we need to reset the bookmark
to indicate that we are not in the middle of any dataset, so that we
will find a new dataset to scrub (by pulling something out of the
queue). (Note that any dataset being destroyed can't be cloned, so
there's only one next pointer.) You can see this happening in
If we snapshot a dataset that our saved state references, what could
happen? If we don't do anything special, then when we visit that
dataset, we'll see that its previous snapshot was created very
recently, so we won't scrub very much of it. In fact, we won't scrub
anything useful, because we only need to scrub blocks that were created
before the scrub started. So we will neglect to scrub everything that
we should have done for that dataset. To correct this problem, in our
saved state we replace the snapshotted dataset with its newly-created
snapshot. Note that this works whether the saved dataset was in the
queue or was the dataset we paused in the middle of, in which case we
will pick up where we left off but in the new snapshot. You can see
this happening in
The last change we need to worry about is a little tricky. When we do
an incremental "zfs recv", we create a temporary clone of the most
recent snapshot, and initially receive the new data into that clone.
This way, the filesystem that's being logically received into can
remain mounted and accessible during the receive operation. But when
the receive completes, we have to switch the filesystem and its
temporary clone, so that the filesystem sees the new data. Although
this is logically just a "zfs promote" and some "zfs rename" and "zfs
set" operations, we implement it more simply by swapping the contents
(objsets) of the datasets. If the scrub code was unaware of this, then
it would not scrub the right thing -- for example, imagine that F and C
are the filesystem and its clone, and we have already scrubbed F, but C
is in the scrub queue, and we swap their contents. Then we would still
scrub C, but now its contents would be what F's were, which we already
scrubbed! The new contents of F would never be scrubbed. To handle
this, when we do the clone contents swap operation, we also swap the
dataset IDs if either of them is stored in our saved state (again,
either the dataset we paused in the middle of, or the scrub queue).
You can see this happening in
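
The three adjustments described in this section can be summarized in a small sketch. To be clear, this is a simplified model and not the ZFS source: `ScrubState` and the three `on_*` functions are hypothetical names, and datasets are identified by plain strings rather than object IDs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScrubState:
    """Hypothetical saved scrub state that persists across syncs."""
    current: Optional[str] = None        # dataset we paused in the middle of
    bookmark: Optional[tuple] = None     # position within `current`
    queue: set = field(default_factory=set)


def on_destroy(st, ds, next_ds):
    """ds is being destroyed; act as if we had just finished visiting it."""
    if st.current == ds:
        st.current, st.bookmark = None, None   # no longer inside any dataset
    elif ds in st.queue:
        st.queue.discard(ds)
    else:
        return                                 # saved state does not reference ds
    if next_ds is not None:
        st.queue.add(next_ds)                  # a destroyed dataset has one next pointer


def on_snapshot(st, ds, new_snap):
    """new_snap was just taken of ds; the blocks we still owe now live there."""
    if st.current == ds:
        st.current = new_snap                  # bookmark kept: same place, new snapshot
    elif ds in st.queue:
        st.queue.discard(ds)
        st.queue.add(new_snap)


def on_clone_swap(st, a, b):
    """The contents of a and b were swapped, so swap their IDs in our state."""
    swap = {a: b, b: a}
    st.current = swap.get(st.current, st.current)
    st.queue = {swap.get(d, d) for d in st.queue}


# The F/C example: F already scrubbed, C still queued, then their contents swap.
example_state = ScrubState(queue={"C"})
on_clone_swap(example_state, "F", "C")
print(example_state.queue)  # {'F'}
```

After the swap the queue names F, so the contents that F now holds (the ones that had not been scrubbed yet) are what gets visited.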
We have solved a nasty bug by rewriting the scrub code to do its work
in syncing context, visiting datasets in order from oldest to newest.
I had to introduce a few new on-disk data structures so we can quickly
determine this order. And there are a few operations that the scrub
code needs to know about and account for. This work lays a bunch of
infrastructure that will be used by the upcoming device removal
feature. Stay tuned!