Is it magic?

ZFS provides an impressive array of features, many of which are not available in traditional storage products. My partner-in-crime Mark Maybee has written about the utility of ZFS's quotas and reservations. I'd like to talk about another feature we implemented: snapshots.

Snapshots

Snapshots are a familiar concept from many other filesystems: a snapshot is a view of a filesystem as it was at a particular point in time. ZFS's snapshots are useful in the same ways that other filesystems' snapshots are: by doing a backup of a snapshot, you have a consistent, non-changing target for the backup program to work with. Snapshots can also be used to recover from recent mistakes, by copying the fudged files out of the snapshot.

What makes ZFS snapshots different is that we've removed all the limits. You can have as many snapshots as you want, taken as often as you like, and named as you choose. Taking a snapshot is a constant-time operation. The presence of snapshots doesn't slow down any operations. Deleting a snapshot takes time proportional to the number of blocks that the delete will free, and is very efficient.

The awesome thing about ZFS being open source is that all these great properties of snapshots aren't magic -- you can see how it works for yourself! Snapshots are implemented in the DSL layer.

Background

Our entire filesystem is represented on disk as a giant tree of blocks, with the leaf blocks containing data and the interior blocks containing metadata (mostly indirect blocks, but also dnode_phys_t's). To write a modified data block to disk, we write it to a new (previously unused) location on disk, rather than over-writing its existing location. Now we must modify its parent block to point to the new on-disk location of the modified data block. We also write the modified parent block to a new (previously unused) location on disk.
We continue this procedure of writing out parents to new locations until we get to the root of the tree, which is stored at a fixed location and is overwritten.

Data structures

Snapshots are a type of dataset, represented on-disk by a dsl_dataset_phys_t. To support snapshots, we maintain two data structures: a birth time associated with each block (blk_birth in the blkptr_t), and a list of "dead" blocks associated with each filesystem and snapshot (pointed to by ds_deadlist_obj in dsl_dataset_phys_t).

When writing the location of a block (the "block pointer") to its parent (eg. an indirect block), we also write the time that the child block was written -- the block's "birth time". This time is not literally the wall-clock time, but rather the value of a counter which increments each time we sync out a whole "transaction group" (aka "txg") and update the root of the block tree, the uberblock.

The dead list is an array of blkptr_t's which were referenced (or "live") in the previous snapshot, but are not referenced in this snapshot (or filesystem). These are the blocks that this snapshot "killed". I'll talk more about how this list is maintained and used in a bit.

Snapshot Creation

Conceptually, to take a snapshot, all we need to do is save the old uberblock before overwriting it, since it still points to valid, unmodified data. In fact, we do this not to the uberblock, but to the root of the sub-tree which represents a single filesystem, which is the objset_phys_t (or more generally, whatever ds_bp points to in dsl_dataset_phys_t). So each filesystem has its own snapshots, independent of other filesystems. The filesystem's snapshots are tracked in a doubly-linked list (the pointers are ds_prev_snap_obj and ds_next_snap_obj), sorted by the time they were taken, with the filesystem at the tail of the list.
The snapshots also have administrator-chosen names, which are stored in a directory-like structure, maintained by the ZAP object pointed to by ds_snapnames_zapobj. When a snapshot is created, its dead list is set to the filesystem's dead list, and the filesystem's dead list is set to a new, empty list. Snapshot creation happens in dsl_dataset_snapshot_sync().

Freeing blocks

When the filesystem is modified such that it no longer references a given block (eg. that block is overwritten, or the object that contains it is freed), the DMU will call dsl_dataset_block_kill(), which will determine whether we can actually free that block, reclaiming its storage space for other uses. We can free the block if and only if there are no other references to it. We can determine this by comparing the block's birth time (blk_birth) with the birth time of the most recent snapshot (ds_prev_snap_txg). If the block was born before the most recent snapshot, then that snapshot references the block and we can not free it. Otherwise, it was born after the most recent snapshot, and thus that snapshot (and all others) can not reference it, so we must free it. When we can not free a block because the most recent snapshot references it, we add its block pointer to the filesystem's dead list.

To summarize, there are two cases to consider when a block becomes no longer referenced by a filesystem:

                                   ---------------> time
block A:              [... -----------------]
block B:                                     [---]
[optional previous snapshots] ... --- snap ------------ fs

Block A was live when the most recent snapshot was taken, so that snapshot references it and thus it can not be freed. Block B was born after the most recent snapshot, so no snapshot references it, and it must be freed.

Snapshot deletion

When a snapshot is deleted, dsl_dataset_destroy_sync() will be called, which must determine which blocks we must free, and also maintain the dead lists. It's useful to think of 4 classes of blocks:

                 ---------------> time
block A:  [... --------------------------]
block B:               [--------------]
block C:  [... --------------------------------------------------- ...]
block D:  [... ---------]
  ... ----- prev snap ----- this snap ------ next snap (or fs) ----- ...

To accomplish this, we iterate over the next snapshot's dead list (the blocks in cases A and B), and compare each block's birth time to the birth time of our previous snapshot. If the block was born before our previous snapshot (case A), then we do not free it, and we add it to our dead list. Otherwise, the block was born after our previous snapshot (case B), and we must free it. Then we delete the next snapshot's dead list, and set the next snapshot's dead list to our dead list. Finally, we can remove this snapshot from the linked list of snapshots, and from the directory of snapshot names.

While the implementation is relatively simple, the algorithm is pretty subtle. How do we know that it is correct? First, did we free the correct blocks? The blocks we must free are those that are referenced only by the snapshot we are deleting (case B). Those blocks are the blocks which meet 4 constraints: (1) they were born after the previous snapshot, (2) they were born before this snapshot, (3) they died after this snapshot, and (4) they died before the next snapshot.

The blocks on the next snapshot's dead list are those that meet constraints (2), (3), and (4) -- they are live in this snapshot, but dead in the next snapshot. (Note, the same applies if the next snapshot is actually the filesystem.) So to find the blocks that meet all the constraints, we examine all the blocks on the next snapshot's dead list, and find those that meet constraint (1) -- ie. those whose birth time is after the previous snapshot.

Now, did we leave the correct blocks on the next snapshot's dead list? This snapshot's dead list contains the blocks that were live in the previous snapshot, and dead in this snapshot (case D).
If this snapshot did not exist, then they would be live in the previous snapshot and dead in the next snapshot, and therefore should be on the next snapshot's dead list. Additionally, the blocks which were live for both the previous snapshot and this snapshot, but dead in the next snapshot (case A), should be on the next snapshot's dead list.

Magic?

Hopefully this gives you a glimpse into how the DSL operates. For further reading, you might be interested in how zfs rollback works (see dsl_dataset_rollback_sync()). Clones are handled as a slightly special case of regular filesystems -- check out dsl_dataset_create_sync(). If you have any questions, don't hesitate to ask!

Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: ZFS



What is ZFS?

It occurs to me that before I can really talk much about ZFS, you need to know what it is, and how it's generally arranged. So here's an overview of what ZFS is, reproduced from our internal webpage:

ZFS is a vertically integrated storage system that provides end-to-end data integrity, immense (128-bit) capacity, and very simple administration. To applications, ZFS looks like a standard POSIX filesystem. No porting is required.

To administrators, ZFS presents a pooled storage model that completely eliminates the concept of volumes and the associated problems of partition management, provisioning, and filesystem grow/shrink. Thousands or even millions of filesystems can all draw from a common storage pool, each one consuming only as much space as it actually needs. Moreover, the combined I/O bandwidth of all devices in the pool is available to all filesystems at all times.

All operations are copy-on-write transactions, so the on-disk state is always valid. There is no need to fsck(1M) a ZFS filesystem, ever. Every block is checksummed to prevent silent data corruption, and the data is self-healing in mirrored or RAID configurations. When one copy is damaged, ZFS detects it (via the checksum) and uses another copy to repair it.

ZFS provides an unlimited number of snapshots, which are created in constant time, and do not require additional copies of any data. Snapshots provide point-in-time copies of the data for backups and end-user recovery from fat-finger mistakes.

ZFS provides clones, which provide a fast, space-efficient way of making "copies" of filesystems.
Clones are extremely useful when many almost-identical "copies" of a set of data are required -- for example, multiple source codebases, one for each engineer or bug being fixed; or multiple system images, one for each zone or netboot-ed machine.

ZFS also provides quotas, to limit space consumption; reservations, to guarantee availability of space in the future; compression, to reduce both disk space and I/O bandwidth requirements; and support for the full range of NFSv4/NT-style ACLs.

What exactly is a "pooled storage model"? Basically it means that rather than stacking one filesystem on top of one volume on top of some disks, you stack many filesystems on top of one storage pool on top of lots of disks.

Take a home directory server with a few thousand users and a few terabytes of data. Traditionally, you'd probably set it up so that there are a few filesystems, each a few hundred gigabytes, and put a couple hundred users on each filesystem. That seems odd -- why is there an arbitrary grouping of users into filesystems? It would be more logical to have either one filesystem for all users, or one filesystem for each user. We can rule out the latter because it would require that we statically partition our storage and decide up front how much space each user got -- ugh. Using one big filesystem would be plausible, but performance may become a problem with large filesystems -- both common run-time performance and the performance of administrative tasks. Many backup tools are filesystem-based. The run-time of fsck(1M) is not linear in the size of the filesystem, so it could take a lot longer to fsck one 8TB filesystem than it would to fsck 80 100GB filesystems. Furthermore, some filesystems simply don't support more than a terabyte or so of storage.

It's inconvenient to run out of space in a traditional filesystem, and it happens all too often. You might have lots of free space in a different filesystem, but you can't easily use it.
You could manually migrate users to different filesystems to balance the free space... (hope your users don't mind downtime! hope you find the right backup tape when it comes time to restore!) Eventually you'll have to install new disks, make a new volume and filesystem out of them, and then migrate some users over to the new filesystem, incurring downtime. I experienced these kinds of problems with my home directory when I was attending school (using VxFS on VxVM on Solaris), and they still plague some home directory and other file servers at Sun (using UFS on SVM on Solaris).

With ZFS, you can have one storage pool which encompasses all the storage attached to your server. Then you can easily create one filesystem for each user. When you run low on storage, simply attach more disks and add them to the pool. No downtime. This is how the home directory server I use at Sun is set up, running ZFS on Solaris.

Thus concludes ZFS lesson 1.
