ZFS
    Thursday, November 17, 2005

Is it magic?

ZFS provides an impressive array of features, many of which are not available in traditional storage products. My partner-in-crime Mark Maybee has written about the utility of ZFS's quotas and reservations. I'd like to talk about another feature which we implemented: snapshots.


Snapshots are a concept familiar from many other filesystems: a snapshot is a view of a filesystem as it was at a particular point in time. ZFS's snapshots are useful in the same ways that other filesystems' snapshots are: by backing up a snapshot, you give the backup program a consistent, non-changing target to work with. Snapshots can also be used to recover from recent mistakes, by copying the fudged files back out of the snapshot.

What makes ZFS snapshots different is that we've removed all the limits. You can have as many snapshots as you want, taken as often as you like, and named as you choose. Taking a snapshot is a constant-time operation. The presence of snapshots doesn't slow down any operations. Deleting snapshots takes time proportional to the number of blocks that the delete will free, and is very efficient.

The awesome thing about ZFS being open source is that all these great properties of snapshots aren't magic -- you can see how it works for yourself! Snapshots are implemented in the DSL layer.


Our entire filesystem is represented on disk as a giant tree of blocks,
with the leaf blocks containing data and the interior blocks containing
metadata (mostly indirect blocks, but also dnode_phys_t's). To write a modified data block to
disk, we write it to a new (previously unused)
location on disk, rather than over-writing its existing location on
disk. Now we must modify its parent block to point to the new on-disk
location of the modified data block, so we also write the modified parent
block to a new (previously unused) location on disk. We continue this
procedure of writing out parents to new locations until we get to the
root of the tree, the uberblock, which is stored at a fixed location and is
updated atomically.
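This copy-on-write path can be sketched in a few lines. The following is an illustrative Python model (the `Block` class and in-memory "locations" are made up for the example; the real code operates on disk blocks, in C):

```python
# Illustrative model of copy-on-write: modifying a leaf rewrites every
# block on the path from that leaf to the root, each as a brand-new
# object, while the old tree is left completely untouched.

class Block:
    def __init__(self, data=None, children=None):
        self.data = data          # leaf payload
        self.children = children  # list of child Blocks (interior block)

def cow_write(block, path, new_data):
    """Return a new root reflecting new_data at the leaf reached by
    following the child indexes in `path`; the old tree is untouched."""
    if not path:                        # reached the leaf
        return Block(data=new_data)     # new block at a "new location"
    idx, rest = path[0], path[1:]
    new_children = list(block.children)             # copy the parent
    new_children[idx] = cow_write(block.children[idx], rest, new_data)
    return Block(children=new_children)             # parent also rewritten

# Writing one leaf leaves the old root (and old leaf) intact.
old_root = Block(children=[Block(data="a"), Block(data="b")])
new_root = cow_write(old_root, [1], "B")
assert old_root.children[1].data == "b"              # old tree still valid
assert new_root.children[1].data == "B"
assert new_root.children[0] is old_root.children[0]  # unmodified subtree shared
```

Because the old root is never modified, it remains a valid, self-consistent view of the tree -- which is exactly what makes snapshots cheap.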

Data structures

Snapshots are a type of dataset, represented on-disk by a dsl_dataset_phys_t.
To support snapshots, we maintain two data structures: a
birth time associated with each block (blk_birth in the blkptr_t), and a list of "dead" blocks
associated with each filesystem and snapshot (pointed to by ds_deadlist_obj in dsl_dataset_phys_t).

When writing the location of a block (the "block pointer") to its parent
(eg. an indirect block), we also write the time that the child block was
written -- the block's "birth time". This time is not literally the
wall-clock time, but rather the value of a counter which increments each
time we sync out a whole "transaction group" (aka "txg") and update the root of the block tree, the uberblock.
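A minimal sketch of how birth times work, using a hypothetical `Pool` class to stand in for the pool-wide txg counter:

```python
# Illustrative sketch: every block pointer carries the txg in which the
# block was written (its blk_birth); the pool-wide txg counter advances
# each time a transaction group syncs and the uberblock is updated.

class Pool:
    def __init__(self):
        self.txg = 1
    def write_block(self, data):
        # stamp the new block pointer with the current txg
        return {"data": data, "blk_birth": self.txg}
    def sync(self):
        self.txg += 1   # txg syncs; counter advances

pool = Pool()
a = pool.write_block("a")
pool.sync()
b = pool.write_block("b")
assert a["blk_birth"] == 1
assert b["blk_birth"] == 2   # b was born in a later txg than a
```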

The dead list is an array of blkptr_t's which were referenced (or "live") in the previous snapshot, but
are not referenced in this snapshot (or filesystem). These are the
blocks that this snapshot "killed". I'll talk more about how this list is maintained and used in a bit.

Snapshot Creation

Conceptually, to take a snapshot, all we need to do is save the old uberblock before
overwriting it, since it still points to valid, unmodified data. In
fact, we do this not to the uberblock, but to the root of the
sub-tree which represents a single filesystem, which is the objset_phys_t (or more generally, whatever ds_bp points to in dsl_dataset_phys_t). So each filesystem has
its own snapshots, independent of other filesystems. The filesystem's
snapshots are tracked in a doubly-linked list (the pointers are ds_prev_snap_obj and ds_next_snap_obj), sorted by the time they
were taken, with the filesystem at the tail of the list. The snapshots
also have administrator-chosen names, which are stored in a
directory-like structure, maintained by the ZAP object pointed to by ds_snapnames_zapobj.

When a snapshot is created, its dead list is set to the filesystem's
dead list, and the filesystem's dead list is set to a new, empty list.

Snapshot creation happens in dsl_dataset_snapshot_sync().
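The deadlist handoff at snapshot creation can be sketched like this (hypothetical Python structures; the real work is in dsl_dataset_snapshot_sync()):

```python
# Sketch of snapshot creation: save the filesystem's current root block
# pointer, hand the filesystem's dead list to the new snapshot, and give
# the filesystem a fresh, empty dead list.

def take_snapshot(fs, snapshots, root_bp):
    snap = {"bp": root_bp,                 # saved root of the filesystem tree
            "deadlist": fs["deadlist"]}    # snapshot inherits the dead list
    fs["deadlist"] = []                    # filesystem starts a new, empty list
    snapshots.append(snap)                 # newest snapshot at the tail
    return snap

fs = {"deadlist": ["bp1", "bp2"]}
snaps = []
s = take_snapshot(fs, snaps, root_bp="root@txg5")
assert s["deadlist"] == ["bp1", "bp2"]
assert fs["deadlist"] == []
```

Note that nothing here is proportional to the size of the filesystem -- which is why taking a snapshot is a constant-time operation.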

Freeing blocks

When the filesystem is modified such that it no longer references a
given block (eg. that block is overwritten, or the object that contains
it is freed), the DMU will call dsl_dataset_block_kill,
which will determine whether we can actually free that block,
reclaiming its storage space for other uses. We can free the block if
and only if there are no other references to it. We can determine this
by comparing the block's birth time (blk_birth) with the birth time of the most
recent snapshot (ds_prev_snap_txg). If the block was born before the most recent snapshot,
then that snapshot will reference the block and we can not free it.
Otherwise, it was born after the most recent snapshot, and thus that
snapshot (and all older ones) can not reference it, so we must free it.

When we can not free a block because the most recent snapshot was
referencing it, we add its block pointer to the
filesystem's dead list.
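The decision can be sketched as follows (an illustrative Python model of the free-or-deadlist choice; the field names loosely follow blkptr_t and dsl_dataset_phys_t):

```python
# Sketch of the free-or-deadlist decision: a block born at or before the
# most recent snapshot is still referenced by that snapshot, so it goes
# on the filesystem's dead list; otherwise it can actually be freed.

def block_kill(bp, fs, prev_snap_txg, freed):
    if bp["blk_birth"] <= prev_snap_txg:
        fs["deadlist"].append(bp)   # most recent snapshot still references it
    else:
        freed.append(bp)            # born after the snapshot: free it

fs = {"deadlist": []}
freed = []
block_kill({"blk_birth": 3}, fs, prev_snap_txg=5, freed=freed)  # block A below
block_kill({"blk_birth": 7}, fs, prev_snap_txg=5, freed=freed)  # block B below
assert [bp["blk_birth"] for bp in fs["deadlist"]] == [3]
assert [bp["blk_birth"] for bp in freed] == [7]
```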

To summarize, there are two cases to consider when a block becomes no
longer referenced by a filesystem:

                                   ---------------> time
block A: [... ------------------------------]
block B:                                     [----]
         [previous snapshots] ... ----- snap ------ fs

Block A was live when the most recent snapshot was taken, so that
snapshot references it and thus it can not be freed; instead it goes on
the filesystem's dead list. Block B was born after the most recent
snapshot, so no snapshot references it and it must be freed.

Snapshot deletion

When a snapshot is deleted, dsl_dataset_destroy_sync will be called, which must determine which blocks we must free,
and also maintain the dead lists. It's useful to think of 4 classes of blocks:

                                   ---------------> time
block A: [... ---------------------------------]
block B:                      [------------------]
block C:                       [---------------------------- ...]
block D: [... -----------------]
         ... ----- prev snap ----- this snap ------ next snap (or fs) ----- ...

To accomplish this, we iterate over the next
snapshot's dead list (those in case A and B), and
compare each block's birth time to the birth time of our previous
snapshot. If the block was born before our previous snapshot (case A),
then we do not free it, and we add it to our dead list. Otherwise, the
block was born after our previous snapshot (case B), and we must free
it. Then we delete the next snapshot's dead list, and set the next
snapshot's dead list to our dead list. Finally, we can remove this
snapshot from the linked list of snapshots, and the directory of
snapshot names.
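Sketched in Python (illustrative structures only; the real logic lives in dsl_dataset_destroy_sync()):

```python
# Sketch of snapshot deletion: walk the next snapshot's dead list, free
# the blocks born after the previous snapshot (case B), keep the rest
# (case A), then hand this snapshot's dead list to the next snapshot.

def destroy_snapshot(this_snap, next_snap, prev_snap_txg, freed):
    for bp in next_snap["deadlist"]:
        if bp["blk_birth"] > prev_snap_txg:
            freed.append(bp)                   # case B: only we referenced it
        else:
            this_snap["deadlist"].append(bp)   # case A: previous snap has it
    # cases C and D carry over: hand our dead list to the next snapshot
    next_snap["deadlist"] = this_snap["deadlist"]

prev_txg = 5                                    # txg of the previous snapshot
this_snap = {"deadlist": [{"blk_birth": 2}]}    # case D block
next_snap = {"deadlist": [{"blk_birth": 3},     # case A: born before prev snap
                          {"blk_birth": 8}]}    # case B: born after prev snap
freed = []
destroy_snapshot(this_snap, next_snap, prev_txg, freed)
assert [bp["blk_birth"] for bp in freed] == [8]
assert sorted(bp["blk_birth"] for bp in next_snap["deadlist"]) == [2, 3]
```

Note that the work done is proportional to the length of the next snapshot's dead list, not to the size of the filesystem.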

While the implementation is relatively simple, the algorithm is pretty subtle. How do we know that it is correct? First, did we free the
correct blocks? The blocks we must free are those that are referenced
by only the snapshot we are deleting (case B). Those blocks are the
blocks which meet 4 constraints: (1) they were born after the previous
snapshot, (2) they were born before this snapshot, (3) they died after
this snapshot, and (4) they died before the next snapshot.

The blocks on the next snapshot's dead list are those that meet
constraints (2) and (3) and (4) -- they are live in this snapshot, but
dead in the next snapshot. (Note, the same applies if the next snapshot
is actually the filesystem.) So to find the blocks that meet all
constraints, we examine all the blocks on the next snapshot's dead list,
and find those that meet constraint (1) -- ie. if the block's birth time
is after the previous snapshot.

Now, did we leave the correct blocks on the next snapshot's dead list?
This snapshot's dead list contains the blocks that were live in the
previous snapshot, and dead in this snapshot (case D). If this snapshot
did not exist, then they would be live in the previous snapshot and dead
in the next snapshot, and therefore should be on the next snapshot's
dead list. Additionally, the blocks which were live for both the
previous snapshot and this snapshot, but dead in the next snapshot (case
A) should be on the next snapshot's dead list.


Hopefully this gives you a glimpse into how the DSL operates. For further reading, you might be interested in how zfs rollback works (see
dsl_dataset_rollback_sync()). Clones are handled as a slightly special case of regular filesystems -- check out dsl_dataset_create_sync().
If you have any questions, don't hesitate to ask!

Technorati Tags: OpenSolaris, Solaris, ZFS


Comments (11)
  • Glenn Monday, November 21, 2005
    Okay, here's a question. All the hype around ZFS talks about "no limits". But then I see examples of using "zfs backup" on a snapshot that involve redirecting standard output to a raw tape drive. This arrangement will not handle EOM well. What kind of support is there for multi-volume backup tapes? What kind of support is there for remote tape drives, and for that matter, multi-volume remote backups? I would expect no less capability in this area than is available with ufsdump.
  • Glenn Monday, November 21, 2005
    Here are more questions. In the public info on ZFS (e.g., in Jeff Bonwick's slides, and in the ZFS Administration Guide), I see examples of using snapshots for remote replication that use commands like "zfs backup -i tank/fs@11:31 tank/fs@11:32 | ssh host zfs restore -d /tank/fs" executed on a regular periodic basis. This looks, ah, a bit optimistic to me. What happens if the network goes down and a particular instance of this command fails? One needs a much-more-robust solution that will always re-sync correctly in the presence of such failures. Also, the best uses of snapshots would be when their creation is synchronized with transaction boundaries within my application. Is there some mechanism provided for that (e.g., a library call to synchronously create a snapshot)?
  • Matt Monday, November 21, 2005
    Glenn, you are absolutely correct that 'zfs backup' / 'zfs restore' is not a fully functional solution for either backup or remote replication. However, when combined with some simple shell scripts, it can be useful in many common cases. I'll probably be posting some of those scripts here in the next few months.

    We're also working on more complete solutions, which will address the shortcomings you mentioned above, and will eventually ship as part of Solaris.

    I don't think of these shortcomings as "limits", but they do point to the need for more polished tools. The functionality that we have today is a building block that can and will be used to produce more complete solutions.

  • Raymond Page Sunday, November 27, 2005
    Essentially you seem to assume that your backend media is traditionally the same, unlimited random write, free reads. Is there any support for describing limited write, serial write, serial vs random read, limited read types of media?
    I read mention of taking into account latency and bandwidth, which points that you've considered some of the above, however shouldn't there be a way of forcibly specifying some of that?
  • Klavs Klavsen Wednesday, October 24, 2007

    I was wondering about clones... When you destroy a clone, how do you determine the blocks that have been freed? You write that clones are handled as slightly special case of regular filesystems, so I'm guessing you have some sort of free map for each clone?

  • Matt Sunday, October 28, 2007

    Klavs, to destroy a non-clone filesystem, we traverse all the blocks and free them. For a clone, we also traverse all the blocks, but we only free the blocks that were born since the most recent (ie. origin) snapshot. Actually, we don't need to traverse *all* the blocks for a clone, because for any given block, if it was not modified after the most recent snapshot, we know that nothing underneath it was either, so we can therefore "prune" the traversal and not look at its children. You can see this in action where dsl_dataset_destroy_sync() calls traverse_dsl_dataset() with the kill_blkptr() callback.
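    (A rough Python sketch of the pruned traversal described above -- illustrative structures only; the real code is traverse_dsl_dataset() with the kill_blkptr() callback:)

```python
# Sketch of pruned traversal for clone destroy: skip any subtree whose
# root block was not modified after the origin snapshot, since nothing
# below an unmodified block can have been modified either.

def destroy_clone(block, origin_txg, freed):
    if block["blk_birth"] <= origin_txg:
        return          # unmodified since origin: prune the whole subtree
    freed.append(block)  # born after origin: the clone owns this block
    for child in block.get("children", []):
        destroy_clone(child, origin_txg, freed)

tree = {"blk_birth": 9, "children": [
    {"blk_birth": 3, "children": [{"blk_birth": 2}]},  # pruned, never visited
    {"blk_birth": 8, "children": [{"blk_birth": 7}]},  # modified since origin
]}
freed = []
destroy_clone(tree, origin_txg=5, freed=freed)
assert sorted(b["blk_birth"] for b in freed) == [7, 8, 9]
```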

  • Klavs Klavsen Monday, October 29, 2007

    Matt, thanks a lot for your reply! I had wondered if you just traversed the block tree, which is nice because it's so simple, but I worried that you potentially had to traverse way too many blocks that should not be released (even after pruning as you describe)... But thinking about it again, it probably isn't so bad, since you probably don't have a very high fan-out in your b+ tree because of the 256-bit checksum included in block pointers (but I'm not sure what block size you use in your b+ tree nodes, so it's hard to speculate on fan-out).

    When an app performs a transaction, you probably are very reluctant to store the state of filesystem changes to persistent storage until the transaction is committed, but what if the set of changes in the transaction becomes very large (bigger than your in-memory buffer)? You will have to save that state to disk, even though it is not committed yet. I thought that you used a snapshot+clone for that, and after the transaction was committed would copy the changes to the "real" filesystem, and delete the snapshot+clone. But that can be expensive with the way clones are destroyed, or am I exaggerating the performance implications?

  • Frank Tuesday, November 6, 2007

    i've got a suggestion. a snapshot represents a state of the filesystem and is also a SET of files.

    are there any operations available, or on the road, to handle sets like this?

    basic set operations are

    merging (A|B) (A union B)

    deselecting (A\B) (A minus B)

    selecting (A&B) (A intersect B)

    combining (AxB) (Cartesian product)

    combine path bases with path extensions.

    there are many further useful ops.

    i think something like this would be very useful:

    transform a snapshot into SQL db(s)

    perform logical-semantic operations on the db(s)

    transform the output data set into snapshot format.

    i think this would add high-level db facilities even to stupid backup tools capable of handling snapshots.

  • Troy Wilson Tuesday, February 26, 2008

    How does the ZFS snapshot/clone approach affect fragmentation? I know this is a challenge with NetApp, which does similar snaps. This seems to be an effect that is experienced over time, as multiple snaps are created/destroyed and locality of reference is lost. Curious how ZFS addresses this.

  • James Friday, June 5, 2009

    Hi Matt, I have the same question as Troy regarding fragmentation. We are considering heavily using ZFS snapshots, but need to know more about the fragmentation implications. My feeling is that a shared block can never be in the proximity of all snapshots. Appreciate any help on this. Thanks.

  • Joydeep Thursday, October 1, 2009

    log structured/write-anywhere file systems fragment. netapp has a built-in defragmenter - but it's tough to keep up under high modification rates. not sure if zfs has a built-in defragmenter. the act of reading an entire file and writing it out all over again acts as a poor man's defragmenter of sorts - albeit one that will unnecessarily consume space via references to overwritten blocks inside older snapshots. a built-in defragmenter can avoid this. there is also free space fragmentation, which causes fewer and fewer full stripes to be available over time and can cause io performance to degrade.
