Is it magic?
By ahrens on Nov 17, 2005
Snapshots are a concept familiar from many other filesystems: a snapshot is a view of a filesystem as it was at a particular point in time. ZFS's snapshots are useful in the same way that other filesystems' snapshots are: by doing a backup of a snapshot, you have a consistent, non-changing target for the backup program to work with. Snapshots can also be used to recover from recent mistakes, by copying the fudged files back out of the snapshot.
What makes ZFS snapshots different is that we've removed all the limits. You can have as many snapshots as you want, taken as often as you like, and named as you choose. Taking a snapshot is a constant-time operation. The presence of snapshots doesn't slow down any operations. Deleting snapshots takes time proportional to the number of blocks that the delete will free, and is very efficient.
The awesome thing about ZFS being open source is that all these great properties of snapshots aren't magic -- you can see how it works for yourself! Snapshots are implemented in the DSL layer.
Our entire filesystem is represented on disk as a giant tree of blocks with the leaf blocks containing data and the interior blocks containing metadata (mostly indirect blocks, but also dnode_phys_t's). To write a modified data block to disk, we write the modified data block to a new (previously unused) location on disk, rather than over-writing its existing location on disk. Now we must modify its parent block to point to the new on-disk location of the modified data block. We also write the modified parent block to a new (previously unused) location on disk. We continue this procedure of writing out parents to new locations until we get to the root of the tree, which is stored at a fixed location and is overwritten.
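To make the copy-on-write walk concrete, here's a toy model in Python -- my own sketch, not ZFS code. The "disk" is a dict from block address to contents, and modifying a leaf allocates fresh addresses for the leaf and every ancestor up to the root. (In ZFS itself the final step overwrites the uberblock in place at its fixed location, rather than returning a new root address as this sketch does.)

```python
def cow_write(disk, path, new_data, next_addr):
    """Copy-on-write a leaf block. `path` is a list of addresses from
    root to leaf; `next_addr` is the first unused address. Returns the
    new root address; the old blocks are left untouched on "disk"."""
    addr = next_addr
    disk[addr] = new_data                # write the modified leaf anew
    child_old, child_new = path[-1], addr
    # Rewrite each ancestor so it points at its child's new location.
    for parent_old in reversed(path[:-1]):
        addr += 1
        parent = dict(disk[parent_old])  # copy, never overwrite in place
        for key, val in parent.items():
            if val == child_old:         # repoint the moved child
                parent[key] = child_new
        disk[addr] = parent
        child_old, child_new = parent_old, addr
    return child_new                     # address of the new root
```

Note that nothing along the old path is modified -- which is exactly why the old root still describes a valid, self-consistent tree, and why saving it gives you a snapshot for free.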
Snapshots are a type of dataset, represented on-disk by a dsl_dataset_phys_t. To support snapshots, we maintain two data structures: a birth time associated with each block (blk_birth in the blkptr_t), and a list of "dead" blocks associated with each filesystem and snapshot (pointed to by ds_deadlist_obj in dsl_dataset_phys_t).
When writing the location of a block (the "block pointer") to its parent (eg. an indirect block), we also write the time that the child block was written -- the block's "birth time". This time is not literally the wall-clock time, but rather the value of a counter which increments each time we sync out a whole "transaction group" (aka "txg") and update the root of the block tree, the uberblock.
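A minimal sketch of that counter, again as an illustrative toy model rather than the real data structures: every block written during a given transaction group gets that txg number as its birth time, and the counter only advances when the txg syncs out.

```python
class ToyPool:
    """Toy model of block birth times (what blk_birth records)."""

    def __init__(self):
        self.txg = 1        # current open transaction group number
        self.birth = {}     # block address -> birth txg

    def write_block(self, addr):
        # A block's birth time is the txg in which it was written.
        self.birth[addr] = self.txg

    def sync(self):
        # Syncing writes out the txg, updates the uberblock, and
        # advances the counter; later writes get a larger birth time.
        self.txg += 1
```

So comparing two birth times tells you their order relative to any sync point -- which is all the snapshot logic below ever needs.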
The dead list is an array of blkptr_t's which were referenced (or "live") in the previous snapshot, but are not referenced in this snapshot (or filesystem). These are the blocks that this snapshot "killed". I'll talk more about how this list is maintained and used in a bit.
Conceptually, to take a snapshot, all we need to do is save the old uberblock before overwriting it, since it still points to valid, unmodified data. In fact, we do this not to the uberblock, but to the root of the sub-tree which represents a single filesystem, which is the objset_phys_t (or more generally, whatever ds_bp points to in dsl_dataset_phys_t). So each filesystem has its own snapshots, independent of other filesystems. The filesystem's snapshots are tracked in a doubly-linked list (the pointers are ds_prev_snap_obj and ds_next_snap_obj), sorted by the time they were taken, with the filesystem at the tail of the list. The snapshots also have administrator-chosen names, which are stored in a directory-like structure maintained by the ZAP object pointed to by ds_snapnames_zapobj.
When a snapshot is created, its dead list is set to the filesystem's dead list, and the filesystem's dead list is set to a new, empty list.
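That hand-off is the entire cost of snapshot creation, which is why it's constant-time. A toy sketch (illustrative names, not the real structures):

```python
class ToyDataset:
    """Toy stand-in for a dataset: just a dead list and snapshot list."""

    def __init__(self):
        self.deadlist = []     # block pointers killed since the last snapshot
        self.snapshots = []    # this filesystem's snapshots, oldest first

def take_snapshot(fs):
    """Sketch of the dead-list hand-off in dsl_dataset_snapshot_sync."""
    snap = ToyDataset()
    snap.deadlist = fs.deadlist    # snapshot inherits the fs's dead list
    fs.deadlist = []               # fs starts over with an empty one
    fs.snapshots.append(snap)
    return snap
```

No block pointers are copied or walked; the existing list simply changes owners.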
Snapshot creation happens in dsl_dataset_snapshot_sync().
When the filesystem is modified such that it no longer references a given block (eg. that block is overwritten, or the object that contains it is freed), the DMU will call dsl_dataset_block_kill(), which must determine whether we can actually free that block, reclaiming its storage space for other uses. We can free the block if and only if there are no other references to it. We can determine this by comparing the block's birth time (blk_birth) with the birth time of the most recent snapshot (ds_prev_snap_txg). If the block was born before the most recent snapshot, then that snapshot references the block and we can not free it. Otherwise, it was born after the most recent snapshot, and thus no snapshot can reference it, so we must free it.
When we can not free a block because the most recent snapshot was referencing it, we add its block pointer to the filesystem's dead list.
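The decision above boils down to a single comparison. Here's a hedged sketch of it (the function and parameter names are mine, standing in for the logic of dsl_dataset_block_kill; a block is modeled as an (address, birth_txg) pair):

```python
def kill_block(block, prev_snap_txg, fs_deadlist, free_list):
    """Handle a block the filesystem no longer references: either
    free it now, or defer it to the filesystem's dead list."""
    addr, birth = block
    if birth > prev_snap_txg:
        # Born after the most recent snapshot: no snapshot can
        # reference it, so its space is reclaimed immediately.
        free_list.append(addr)
    else:
        # The most recent snapshot still references it: defer the
        # free by recording it on the filesystem's dead list.
        fs_deadlist.append(block)
```

Note that no snapshot is ever consulted directly -- the birth time alone tells us whether any snapshot could possibly hold a reference.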
To summarize, there are two cases to consider when a block becomes no longer referenced by a filesystem:
                                  ---------------> time
block A: [... --------------------------------]
block B:                                     [---]
[optional previous snapshots] ... ----- snap ------ fs
Block A was live when the most recent snapshot was taken, so that snapshot references it and thus it can not be freed. Block B was born after the most recent snapshot, so no snapshot references it and it must be freed.
When a snapshot is deleted, dsl_dataset_destroy_sync() will be called, which must determine which blocks we must free, and also maintain the dead lists. It's useful to think of 4 classes of blocks:
                                                          ---------------> time
block A: [... ----------------------------------]
block B:              [--------------]
block C: [..................... ------------------------------------------ ...]
block D: [... -------------------]
         ... ----- prev snap ----- this snap ------ next snap (or fs) ----- ...
To accomplish this -- freeing the blocks referenced only by the snapshot being deleted (case B), while keeping the dead lists correct -- we iterate over the next snapshot's dead list (the blocks in cases A and B), and compare each block's birth time to the birth time of our previous snapshot. If the block was born before our previous snapshot (case A), then we do not free it, and we add it to our dead list. Otherwise, the block was born after our previous snapshot (case B), and we must free it. Then we delete the next snapshot's dead list, and set the next snapshot's dead list to our dead list. Finally, we can remove this snapshot from the linked list of snapshots, and from the directory of snapshot names.
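The whole deletion walk fits in a few lines. A hedged sketch (my own illustrative names, modeling the dead-list walk described above rather than the real dsl_dataset_destroy_sync code; dead lists hold (address, birth_txg) pairs):

```python
def destroy_snapshot(this_deadlist, next_deadlist, prev_snap_txg):
    """Delete a snapshot: return (addresses to free, the dead list
    the next snapshot should adopt)."""
    freed = []
    merged = list(this_deadlist)        # case D blocks stay dead, unchanged
    for addr, birth in next_deadlist:
        if birth <= prev_snap_txg:
            # Case A: the previous snapshot still references it,
            # so it stays dead relative to the next snapshot.
            merged.append((addr, birth))
        else:
            # Case B: only the deleted snapshot referenced it; free it.
            freed.append(addr)
    # The caller then installs `merged` as the next snapshot's dead
    # list and unlinks this snapshot from the list and name directory.
    return freed, merged
```

For example, with the previous snapshot at txg 10: a block born at txg 15 that is on the next snapshot's dead list is freed (case B), while one born at txg 5 migrates onto the merged dead list (case A). This is why deletion costs time proportional only to the blocks examined and freed, not to the size of the filesystem.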
While the implementation is relatively simple, the algorithm is pretty subtle. How do we know that it is correct? First, did we free the correct blocks? The blocks we must free are those that are referenced only by the snapshot we are deleting (case B). Those are the blocks which meet 4 constraints: (1) they were born after the previous snapshot, (2) they were born before this snapshot, (3) they died after this snapshot, and (4) they died before the next snapshot.
The blocks on the next snapshot's dead list are those that meet constraints (2) and (3) and (4) -- they are live in this snapshot, but dead in the next snapshot. (Note, the same applies if the next snapshot is actually the filesystem.) So to find the blocks that meet all constraints, we examine all the blocks on the next snapshot's dead list, and find those that meet constraint (1) -- ie. if the block's birth time is after the previous snapshot.
Now, did we leave the correct blocks on the next snapshot's dead list? This snapshot's dead list contains the blocks that were live in the previous snapshot, and dead in this snapshot (case D). If this snapshot did not exist, then they would be live in the previous snapshot and dead in the next snapshot, and therefore should be on the next snapshot's dead list. Additionally, the blocks which were live for both the previous snapshot and this snapshot, but dead in the next snapshot (case A) should be on the next snapshot's dead list.
Hopefully this gives you a glimpse into how the DSL operates. For further reading, you might be interested in how zfs rollback works (see dsl_dataset_rollback_sync()). Clones are handled as a slightly special case of regular filesystems -- check out dsl_dataset_create_sync(). If you have any questions, don't hesitate to ask!