Monday Dec 15, 2008

New Scrub Code

To fix a longstanding bug (6343667, scrub/resilver has to start over when snapshot is taken), I ended up rewriting the ZFS scrub code (putback in OpenSolaris build 94). Mark Maybee and I were already working on "bp (block pointer) rewrite (or relocate)", functionality that will allow us to traverse all the blocks in the pool and move them around. This will be used to move all the used space off a disk so that it can be removed. But we realized that since bp relocate has to visit all the blocks in the pool, it can also be used to scrub or resilver the pool. In fact, it's simpler to use the bp relocate algorithm just for scrubbing, since we don't have to actually move anything around. So in the interest of fixing 6343667 as quickly as possible (in particular, in time for the Fishworks NAS appliance), we decided to first implement only the part of our design necessary to fix that bug, and later complete the entire bp relocate work to support device removal.

Similarities Between Old and New Scrub Code

Both the old and new scrub code work by traversing all the metadata in the pool to find all the block pointers. This is necessary because we need the checksums, which are stored in the block pointers (rather than in a structure addressed by disk offset, or in the (larger, 520-byte) disk sector itself). Because we have snapshots and clones, there can be many pointers to the same block. So an algorithm which simply follows every block pointer can end up scrubbing each snapshotted block many times, causing the scrub to take many times longer than it could. Instead, we take advantage of the birth time, which is stored in the block pointer. We visit each block only from the dataset that it was born in. In particular, if the block's birth time is before (less than) the previous snapshot's, then we know that the block will be / has been visited when we traverse an earlier snapshot, so we can ignore it. The key here is that this applies to indirect blocks as well as data blocks. When an indirect block does not need to be visited, we also know that nothing under it needs to be visited, because everything under it must have a birth time less than or equal to the indirect block's (because if something under it was written, then all blocks above it including this indirect block would also be written).

To recap, in both old and new code, we will scrub by traversing each dataset, visiting the blocks with birth time after that dataset's previous snapshot, and ignoring other blocks.
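The pruning rule above can be captured in a one-line predicate. This is a minimal Python model, not the actual C implementation; the names are illustrative (in ZFS, the birth time lives in the block pointer's blk_birth field and the previous snapshot's birth txg in ds_prev_snap_txg):

```python
# Illustrative model (not ZFS source): should the scrub visit this block
# from the current dataset?
def must_visit(blk_birth, prev_snap_txg):
    # A block born at or before the previous snapshot was (or will be)
    # visited while traversing an earlier dataset, so we skip it here.
    # Skipping an indirect block prunes its entire subtree, since every
    # block below it must have an equal or earlier birth time.
    return blk_birth > prev_snap_txg
```

The same check prunes both data and indirect blocks, which is what makes the traversal efficient in the presence of many snapshots.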

Differences Between Old and New Scrub Code

There are two fundamental differences between the old and new scrub code: the old code works in open context, and visits datasets in no particular order[*]; the new code works in syncing context, and visits datasets in order from oldest to newest. The key here is that because of the order that we visit the datasets, we know what has already been done, and what is left to do. Because we are working in syncing context, we know that while we are actively scrubbing, nothing can be changing (snapshots can't be taken or destroyed, blocks can't be freed, etc). However, we can't do the entire scrub in one sync (because we need to do new syncs every so often to push out dirty data, otherwise all writes will be blocked), so we need to pause and allow each sync to complete, and then resume when the next sync occurs. With these constraints, it is not too hard to deal with new snapshots being taken (and other changes) while the scrub is paused.

[*] It's in a defined order, by object-id, but that doesn't correspond to anything in particular. When visiting a dataset, we don't know if the datasets before or after it have already been visited. And if a snapshot is taken, we don't know if its object-id will happen to be before or after the current one; this is partly the cause of 6343667.

So we want to scrub each dataset from oldest to newest, and take into account snapshots and other changes that can occur while the scrub is paused. How do we do that?

Order of Visiting Datasets

Here is a valid ordering for a filesystem (4), its snapshots (1, 2, 3) and a clone (6), and its snapshot (5).
    time -->

    1-->2-->3-->4  filesystem
         \
          -->5-->6 clone
Note that an equally valid order would be 1, 2, 5, 3, 6, 4. What matters is that we visit everything before a given dataset before visiting it.

In the existing on-disk format (zpool version 10 and earlier), there is no easy way to determine this order. There are two problems: first, each dataset has a single forward pointer (ds_next_snap_obj), but if there are clones of that snapshot, then we don't have the additional forward pointers (shown above, there is no pointer to snapshot 5). Second, there is no easy way to find the first snapshot (#1 above).

The best solution is to add some more information to the on-disk format, which is what zpool version 11 does. We added multiple forward pointers (ds_next_clones_obj), and we made all filesystems be clones of the "$ORIGIN" snapshot.
    |-->1-->2-->3-->4 filesystem
    |       \
    |        -->5-->6 clone
     ------>7-->8-->9  another filesystem
In this diagram, 2's ds_next_clones_obj will point to 5, and 0's (the $ORIGIN) will point to 1 and 7. When we do a "zpool upgrade" to version 11 or later, we will walk all the datasets and add this information in. The new scrub code works even on earlier versions, but whenever we need this extra information, we have to walk a bunch of datasets to find it, so it's a little slower (but this performance hit is usually small compared to the total time taken to perform the scrub).

Visitation Order Implementation

Now that we have (or will dig up on demand) the information we need, how do we go about scrubbing in the correct order? We store a pool-global "scrub queue" which is all the datasets that we can do next. In particular, after we finish visiting each dataset, we put all its next pointers (both the main next snapshot as well as any clones) in the queue. Then we pull a dataset out of the queue and visit it. We start with the $ORIGIN, which has no data, so it will just add every non-clone filesystem (or zvol, of course) to the queue. The queue is implemented as a zap object, so the order we pull things out of it is not well defined, but that is fine.
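The queue-driven visitation order can be sketched in a few lines. This is a Python model of the algorithm as described, not the real implementation (which stores the queue persistently as a zap object); dataset names and the next_ptrs map are illustrative:

```python
# Illustrative sketch (not ZFS source) of the scrub visitation order:
# after finishing a dataset, enqueue all its next pointers (the next
# snapshot plus any clones), then pull an arbitrary entry and visit it.
def scrub_order(next_ptrs, origin):
    """next_ptrs maps dataset -> list of next pointers."""
    queue = {origin}     # a set models the zap object: pull order undefined
    visited = []
    while queue:
        ds = queue.pop()
        visited.append(ds)
        queue.update(next_ptrs.get(ds, []))
    return visited
```

Because each dataset is the "next" of exactly one older dataset, every dataset is enqueued exactly once, and it is enqueued only after everything older than it has been visited; that invariant is all the scrub needs, so the undefined pull order of the zap object is harmless.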

Now, we are visiting everything in syncing context in the right order. If nothing changes (eg, snapshots taken or destroyed) then our algorithm is complete. But as I mentioned, we just scrub a little bit in each transaction group (aka sync phase), and pause the scrub to allow the sync to complete and then resume in the next sync. Each time we visit a block, we check to see if we should pause, based on how much time has passed, and whether anything is waiting for the sync to complete. When we pause, we remember where we left off with a zbookmark_t, which specifies the location by dataset, object, level, and blockid. When we continue from a paused scrub, we traverse back to that point before we continue scrubbing. Check out scrub_pause().
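A minimal model of the pause decision and the bookmark record might look like this (not ZFS source; the field names mirror zbookmark_t, but the time budget and function names are assumptions for illustration):

```python
# Illustrative sketch (not ZFS source) of pausing a scrub mid-traversal.
from collections import namedtuple

# Where we left off: dataset, object, indirection level, block id.
Bookmark = namedtuple("Bookmark", "dataset obj level blockid")

def should_pause(elapsed_ms, sync_waiting, budget_ms=100):
    # Pause if we've used up our time slice for this transaction group,
    # or if someone is waiting for the sync to complete. (budget_ms is a
    # made-up number, not the actual tunable.)
    return sync_waiting or elapsed_ms > budget_ms
```

On resume, the traversal walks back down to the bookmarked location, skipping everything already visited, before scrubbing continues.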

Dealing with Changes

Now, what if something changed while we were paused? There are a few cases we need to worry about: dataset deletion, snapshot creation, and "clone swap" which happens at the end of an incremental "zfs recv".

When a dataset is destroyed, we need to see if any of our saved state references this dataset, and if so then remove the dataset and add its next pointer (the dataset following it). It's as though we "magically" finished visiting that dataset and are ready to move on to its next pointer. This applies whether this is the dataset that we paused in the middle of, or if it's a dataset that's in the queue. If it's the dataset that we paused in the middle of, we need to reset the bookmark to indicate that we are not in the middle of any dataset, so that we will find a new dataset to scrub (by pulling something out of the queue). (Note that any dataset being destroyed can't be cloned, so there's only one next pointer.) You can see this happening in dsl_pool_ds_destroyed().

If we snapshot a dataset that our saved state references, what could happen? If we don't do anything special, then when we visit that dataset, we'll see that its previous snapshot was created very recently, so we won't scrub very much of it. In fact, we won't scrub anything useful, because we only need to scrub blocks that were created before the scrub started. So we will neglect to scrub everything that we should have done for that dataset. To correct this problem, in our saved state we replace the snapshotted dataset with its newly-created snapshot. Note that this works whether the saved dataset was in the queue or was the dataset we paused in the middle of, in which case we will pick up where we left off but in the new snapshot. You can see this happening in dsl_pool_ds_snapshotted().

The last change we need to worry about is a little tricky. When we do an incremental "zfs recv", we create a temporary clone of the most recent snapshot, and initially receive the new data into that clone. This way, the filesystem that's being logically received into can remain mounted and accessible during the receive operation. But when the receive completes, we have to switch the filesystem and its temporary clone, so that the filesystem sees the new data. Although this is logically just a "zfs promote" and some "zfs rename" and "zfs set" operations, we implement it more simply by swapping the contents (objsets) of the datasets. If the scrub code was unaware of this, then it would not scrub the right thing -- for example, imagine that F and C are the filesystem and its clone, and we have already scrubbed F, but C is in the scrub queue, and we swap their contents. Then we would still scrub C, but now its contents would be what F's were, which we already scrubbed! The new contents of F would never be scrubbed. To handle this, when we do the clone contents swap operation, we also swap the dataset ID if either of them are stored in our saved state (again, either the dataset we paused in the middle of, or the scrub queue). You can see this happening in dsl_pool_ds_clone_swapped().
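All three fixups act on the same saved state: the dataset we paused in the middle of, plus the queue. Here's a Python model of the three handlers described above (not ZFS source; the class and method names are made up to mirror dsl_pool_ds_destroyed(), dsl_pool_ds_snapshotted(), and dsl_pool_ds_clone_swapped()):

```python
# Illustrative sketch (not ZFS source) of patching up the paused scrub's
# saved state when datasets change underneath it.
class ScrubState:
    def __init__(self):
        self.bookmarked_ds = None   # dataset we paused in the middle of
        self.queue = set()          # datasets waiting to be visited

    def ds_destroyed(self, ds, next_ds):
        # Pretend we finished the destroyed dataset: move to its one next
        # pointer (a dataset being destroyed can't have clones).
        if self.bookmarked_ds == ds:
            self.bookmarked_ds = None    # resume by pulling from the queue
            self.queue.add(next_ds)
        elif ds in self.queue:
            self.queue.remove(ds)
            self.queue.add(next_ds)

    def ds_snapshotted(self, ds, new_snap):
        # Scrub the new snapshot instead, so blocks written before the
        # scrub started aren't skipped.
        if self.bookmarked_ds == ds:
            self.bookmarked_ds = new_snap
        elif ds in self.queue:
            self.queue.remove(ds)
            self.queue.add(new_snap)

    def clone_swapped(self, ds1, ds2):
        # The datasets' contents were exchanged, so saved references must
        # follow the contents, not the dataset IDs.
        swap = {ds1: ds2, ds2: ds1}
        if self.bookmarked_ds in swap:
            self.bookmarked_ds = swap[self.bookmarked_ds]
        self.queue = {swap.get(d, d) for d in self.queue}
```

In each case the rule is the same: the saved state must keep pointing at the *data* that still needs scrubbing, wherever it ends up living.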


We have solved a nasty bug by rewriting the scrub code to do its work in syncing context, visiting datasets in order from oldest to newest. I had to introduce a few new on-disk data structures so we can quickly determine this order. And there are a few operations that the scrub code needs to know about and account for. This work lays a bunch of infrastructure that will be used by the upcoming device removal feature. Stay tuned!

Thursday Nov 17, 2005

Is it magic?

ZFS provides an impressive array of features, many of which are not available in traditional storage products. My partner-in-crime Mark Maybee has written about the utility of ZFS's quotas and reservations. I'd like to talk about another feature which we implemented: snapshots.


Snapshots are a familiar concept from many other filesystems: a snapshot is a view of a filesystem as it was at a particular point in time. ZFS's snapshots are useful in the same way that some other filesystems' snapshots are: by doing a backup of a snapshot, you have a consistent, non-changing target for the backup program to work with. Snapshots can also be used to recover from recent mistakes, by copying the fudged files from the snapshot.

What makes ZFS snapshots different is that we've removed all the limits. You can have as many snapshots as you want, taken as often as you like, and named as you choose. Taking a snapshot is a constant-time operation. The presence of snapshots doesn't slow down any operations. Deleting snapshots takes time proportional to the number of blocks that the delete will free, and is very efficient.

The awesome thing about ZFS being open source is that all these great properties of snapshots aren't magic -- you can see how it works for yourself! Snapshots are implemented in the DSL layer.


Our entire filesystem is represented on disk as a giant tree of blocks with the leaf blocks containing data and the interior blocks containing metadata (mostly indirect blocks, but also dnode_phys_t's). To write a modified data block to disk, we write the modified data block to a new (previously unused) location on disk, rather than over-writing its existing location on disk. Now we must modify its parent block to point to the new on-disk location of the modified data block. We also write the modified parent block to a new (previously unused) location on disk. We continue this procedure of writing out parents to new locations until we get to the root of the tree, which is stored at a fixed location and is overwritten.
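The copy-on-write path described above can be modeled compactly. This is an illustrative Python sketch, not ZFS code: the tree is a nested dict standing in for indirect blocks, and "writing to a new location" is modeled by allocating new dict objects:

```python
# Illustrative sketch (not ZFS source) of copy-on-write: modifying a leaf
# produces new copies of the leaf and every ancestor up to the root, while
# the old tree remains intact.
def cow_write(tree, path, new_data):
    """tree: nested dicts keyed by child index; path: indices root->leaf."""
    if not path:
        return new_data                  # a new copy of the data block
    new_node = dict(tree)                # a new copy of this indirect block
    new_node[path[0]] = cow_write(tree[path[0]], path[1:], new_data)
    return new_node                      # the parent must now point here
```

Note that subtrees off the modified path are shared, not copied; only the blocks along the path from the modified leaf to the root are rewritten.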

Data structures

Snapshots are a type of dataset, represented on-disk by a dsl_dataset_phys_t. To support snapshots, we maintain two data structures: a birth time associated with each block (blk_birth in the blkptr_t), and a list of "dead" blocks associated with each filesystem and snapshot (pointed to by ds_deadlist_obj in dsl_dataset_phys_t).

When writing the location of a block (the "block pointer") to its parent (eg. an indirect block), we also write the time that the child block was written -- the block's "birth time". This time is not literally the wall-clock time, but rather the value of a counter which increments each time we sync out a whole "transaction group" (aka "txg") and update the root of the block tree, the uberblock.

The dead list is an array of blkptr_t's which were referenced (or "live") in the previous snapshot, but are not referenced in this snapshot (or filesystem). These are the blocks that this snapshot "killed". I'll talk more about how this list is maintained and used in a bit.

Snapshot Creation

Conceptually, to take a snapshot, all we need to do is save the old uberblock before overwriting it, since it still points to valid, unmodified data. In fact, we do this not to the uberblock, but to the root of the sub-tree which represents a single filesystem, which is the objset_phys_t (or more generally, whatever ds_bp points to in dsl_dataset_phys_t). So each filesystem has its own snapshots, independent of other filesystems. The filesystem's snapshots are tracked in a doubly-linked list (the pointers are ds_prev_snap_obj and ds_next_snap_obj), sorted by the time they were taken, with the filesystem at the tail of the list. The snapshots also have administrator-chosen names, which are stored in a directory-like structure, maintained by the ZAP object pointed to by ds_snapnames_zapobj.

When a snapshot is created, its dead list is set to the filesystem's dead list, and the filesystem's dead list is set to a new, empty list.

Snapshot creation happens in dsl_dataset_snapshot_sync().
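A minimal model of the constant-time part of snapshot creation, as described above (illustrative Python, not ZFS source; the dict keys loosely mirror dsl_dataset_phys_t fields):

```python
# Illustrative sketch (not ZFS source): taking a snapshot saves the current
# root block pointer and hands the filesystem's dead list to the snapshot.
def take_snapshot(fs):
    snap = {
        "bp": fs["bp"],              # old root still points at valid data
        "deadlist": fs["deadlist"],  # blocks killed since the prev snapshot
        "birth_txg": fs["txg"],
    }
    fs["deadlist"] = []              # filesystem starts a fresh, empty list
    return snap
```

No data is copied and no block tree is walked, which is why snapshot creation is constant-time regardless of filesystem size.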

Freeing blocks

When the filesystem is modified such that it no longer references a given block (eg. that block is overwritten, or the object that contains it is freed), the DMU will call dsl_dataset_block_kill, which will determine whether we can actually free that block, reclaiming its storage space for other uses. We can free the block if and only if there are no other references to it. We can determine this by comparing the block's birth time (blk_birth) with the birth time of the most recent snapshot (ds_prev_snap_txg). If the block was born before the most recent snapshot, then that snapshot references the block and we can not free it. Otherwise, it was born after the most recent snapshot, and thus no snapshot can reference it, so we must free it.

When we can not free a block because the most recent snapshot was referencing it, we add its block pointer to the filesystem's dead list.

To summarize, there are two cases to consider when a block becomes no longer referenced by a filesystem:

                                   ---------------> time

block A:		    [... -----------------]

block B:				      [---]

[optional previous snapshots] ... ----- snap ------ fs

Block A was live when the most recent snapshot was taken, so that snapshot references it and thus it can not be freed. Block B was born after the most recent snapshot, and thus no snapshot references it, so it must be freed.
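The two cases above reduce to a single birth-time comparison. Here is an illustrative Python sketch (not ZFS source; the blkptr is modeled as a dict and the function name mirrors, but is not, the real one):

```python
# Illustrative sketch (not ZFS source): when the filesystem drops its last
# reference to a block, either free it or put it on the dead list.
def block_killed(bp, prev_snap_txg, deadlist, free_list):
    if bp["birth"] > prev_snap_txg:
        # Case B: born after the last snapshot, so nothing else can
        # reference it; reclaim the space now.
        free_list.append(bp)
    else:
        # Case A: the most recent snapshot still references it; record
        # the kill so it can be freed when that snapshot goes away.
        deadlist.append(bp)
```

This is the whole reason the dead list exists: it defers the freeing decision until the snapshots that hold the block are themselves deleted.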

Snapshot deletion

When a snapshot is deleted, dsl_dataset_destroy_sync will be called, which must determine which blocks to free, and also maintain the dead lists. It's useful to think of 4 classes of blocks:

                                   ---------------> time

block A: [... --------------------------------]

block B:                       [--------------]

block C: [..................... ------------------------------------- ...]

block D: [... -------------------]

         ... ----- prev snap ----- this snap ------ next snap (or fs) ----- ...

To accomplish this, we iterate over the next snapshot's dead list (those in case A and B), and compare each block's birth time to the birth time of our previous snapshot. If the block was born before our previous snapshot (case A), then we do not free it, and we add it to our dead list. Otherwise, the block was born after our previous snapshot (case B), and we must free it. Then we delete the next snapshot's dead list, and set the next snapshot's dead list to our dead list. Finally, we can remove this snapshot from the linked list of snapshots, and the directory of snapshot names.
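The deletion step above can be sketched as a partition of the next snapshot's dead list by birth time (illustrative Python, not ZFS source; blocks are modeled as dicts with a "birth" field):

```python
# Illustrative sketch (not ZFS source) of deleting "this snapshot": split
# the next snapshot's dead list (cases A and B) on the previous snapshot's
# birth txg, then merge the survivors with this snapshot's dead list.
def delete_snapshot(this_deadlist, next_deadlist, prev_snap_txg):
    freed = []
    merged = list(this_deadlist)     # case D: still dead relative to prev
    for bp in next_deadlist:
        if bp["birth"] > prev_snap_txg:
            freed.append(bp)         # case B: only this snapshot used it
        else:
            merged.append(bp)        # case A: previous snapshot still does
    # merged becomes the next snapshot's new dead list
    return freed, merged
```

Note the cost is proportional to the length of the next snapshot's dead list plus the blocks freed, not to the size of the filesystem.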

While the implementation is relatively simple, the algorithm is pretty subtle. How do we know that it is correct? First, did we free the correct blocks? The blocks we must free are those that are referenced by only the snapshot we are deleting (case B). Those blocks are the blocks which meet 4 constraints: (1) they were born after the previous snapshot, (2) they were born before this snapshot, (3) they died after this snapshot, and (4) they died before the next snapshot.

The blocks on the next snapshot's dead list are those that meet constraints (2) and (3) and (4) -- they are live in this snapshot, but dead in the next snapshot. (Note, the same applies if the next snapshot is actually the filesystem.) So to find the blocks that meet all constraints, we examine all the blocks on the next snapshot's dead list, and find those that meet constraint (1) -- ie. if the block's birth time is after the previous snapshot.

Now, did we leave the correct blocks on the next snapshot's dead list? This snapshot's dead list contains the blocks that were live in the previous snapshot, and dead in this snapshot (case D). If this snapshot did not exist, then they would be live in the previous snapshot and dead in the next snapshot, and therefore should be on the next snapshot's dead list. Additionally, the blocks which were live for both the previous snapshot and this snapshot, but dead in the next snapshot (case A) should be on the next snapshot's dead list.


Hopefully this gives you a glimpse into how the DSL operates. For further reading, you might be interested in how zfs rollback works (see dsl_dataset_rollback_sync()). Clones are handled as a slightly special case of regular filesystems -- check out dsl_dataset_create_sync(). If you have any questions, don't hesitate to ask!


Monday Sep 27, 2004

Jeff Bonwick on 128-bits

Fellow ZFS engineer Jeff Bonwick has posted a blog entry on why 128 bit storage addresses are necessary and sufficient.

Tuesday Sep 14, 2004

ZFS featured on homepage

The homepage features a new article on ZFS. Check it out, and feel free to post comments with the form at the bottom of the article, on this blog, or at the Expert Exchange tomorrow.

Monday Sep 13, 2004

Expert Exchange (chat session) with ZFS engineers

Sun is hosting an Expert Exchange on ZFS. Basically this is an interactive chat session where you can ask fellow ZFS engineer Bill Moore and me questions about ZFS. The session will be this Wednesday, September 15 from 10-11am PDT. You can find more information here.

A transcript of the chat session is now available (click on 'archives' from the expert exchange website).

Monday Aug 30, 2004

What is ZFS?

It occurs to me that before I can really talk much about ZFS, you need to know what it is, and how it's generally arranged. So here's an overview of what ZFS is, reproduced from our internal webpage:
ZFS is a vertically integrated storage system that provides end-to-end data integrity, immense (128-bit) capacity, and very simple administration.

To applications, ZFS looks like a standard POSIX filesystem. No porting is required.

To administrators, ZFS presents a pooled storage model that completely eliminates the concept of volumes and the associated problems of partition management, provisioning, and filesystem grow/shrink. Thousands or even millions of filesystems can all draw from a common storage pool, each one consuming only as much space as it actually needs. Moreover, the combined I/O bandwidth of all devices in the pool is available to all filesystems at all times.

All operations are copy-on-write transactions, so the on-disk state is always valid. There is no need to fsck(1M) a ZFS filesystem, ever.

Every block is checksummed to prevent silent data corruption, and the data is self-healing in mirrored or RAID configurations. When one copy is damaged, ZFS detects it (via the checksum) and uses another copy to repair it.

ZFS provides an unlimited number of snapshots, which are created in constant-time, and do not require additional copies of any data. Snapshots provide point-in-time copies of the data for backups and end-user recovery from fat-finger mistakes.

ZFS provides clones, which provide a fast, space-efficient way of making "copies" of filesystems. Clones are extremely useful when many almost-identical "copies" of a set of data are required -- for example, multiple code sourcebases, one for each engineer or bug being fixed; or multiple system images, one for each zone or netboot-ed machine.

ZFS also provides quotas, to limit space consumption; reservations, to guarantee availability of space in the future; compression, to reduce both disk space and I/O bandwidth requirements; and supports the full range of NFSv4/NT-style ACLs.
What exactly is a "pooled storage model"? Basically it means rather than stacking one FS on top of one volume on top of some disks, you stack many filesystems on top of one storage pool on top of lots of disks. Take a home directory server with a few thousand users and a few terabytes of data. Traditionally, you'd probably set it up so that there are a few filesystems, each a few hundred gigabytes, and put a couple hundred users on each filesystem.

That seems odd -- why is there an arbitrary grouping of users into filesystems? It would be more logical to have either one filesystem for all users, or one filesystem for each user. We can rule out the latter because it would require that we statically partition our storage and decide up front how much space each user got -- ugh. Using one big filesystem would be plausible, but performance may become a problem with large filesystems -- both common run-time performance and performance of administrative tasks. Many backup tools are filesystem-based. The run-time of fsck(1M) is not linear in the size of the filesystem, so it could take a lot longer to fsck that one 8TB filesystem than it would to fsck 80 100GB filesystems. Furthermore, some filesystems simply don't support more than a terabyte or so of storage.

It's inconvenient to run out of space in a traditional filesystem, and it happens all too often. You might have lots of free space in a different filesystem, but you can't easily use it. You could manually migrate users to different filesystems to balance the free space... (hope your users don't mind downtime! hope you find the right backup tape when it comes time to restore!) Eventually you'll have to install new disks, make a new volume and filesystem out of them, and then migrate some users over to the new filesystem, incurring downtime. I experienced these kinds of problems with my home directory when I was attending school (using VxFS on VxVM on Solaris), and they still plague some home directory and other file servers at Sun (using UFS on SVM on Solaris).

With ZFS, you can have one storage pool which encompasses all the storage attached to your server. Then you can easily create one filesystem for each user. When you run low on storage, simply attach more disks and add them to the pool. No downtime. This is the scenario on the home directory server that I use at Sun, which uses ZFS on Solaris.

Thus concludes ZFS lesson 1.

Wednesday Jul 28, 2004


I am an engineer in the Solaris kernel group. I've been working for Sun for the past three years, primarily on ZFS, the Zettabyte File System. Expect posting frequency to increase as ZFS becomes increasingly available to the outside world.


