Every filesystem must keep track of two basic things: where your data is, and where the free space is.
In principle, keeping track of free space is not strictly necessary: every block is either allocated or free, so the free space can be computed by assuming everything is free and then subtracting out everything that's allocated; and the allocated space can be found by traversing the entire filesystem from the root. Any block that cannot be found by traversal from the root is, by definition, free.
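To make that concrete, here's a toy sketch of the principle (hypothetical structures, nothing like a real on-disk format): assume everything is free, mark every block reachable from the root as allocated, and count the rest.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define NBLOCKS 64
    static uint8_t is_allocated[NBLOCKS];

    /* Toy "filesystem": each block lists the blocks it references. */
    typedef struct block { int nrefs; int refs[4]; } block_t;
    static block_t blocks[NBLOCKS];

    static void mark_reachable(int blk) {
        if (is_allocated[blk])
            return;
        is_allocated[blk] = 1;
        for (int i = 0; i < blocks[blk].nrefs; i++)
            mark_reachable(blocks[blk].refs[i]);
    }

    int main(void) {
        blocks[0] = (block_t){ 2, { 5, 9 } };  /* root references 5 and 9 */
        blocks[5] = (block_t){ 1, { 17 } };    /* 5 references 17 */

        memset(is_allocated, 0, sizeof (is_allocated)); /* assume all free */
        mark_reachable(0);                              /* subtract allocated */

        int nfree = 0;
        for (int b = 0; b < NBLOCKS; b++)
            nfree += !is_allocated[b];
        printf("%d of %d blocks free\n", nfree, NBLOCKS); /* 60 of 64 */
        return 0;
    }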
In practice, finding free space this way would be insufferable because it would take far too long for any filesystem of non-trivial size. To make the allocation and freeing of blocks fast, the filesystem needs an efficient way to keep track of free space. In this post we'll examine the most common methods, why they don't scale well, and the new approach we devised for ZFS. The two most common representations are bitmaps, with one bit per block, and B-trees of free extents; both behave well when frees arrive in order, but degrade badly when blocks are freed at random offsets, because each random free forces an update to a random part of a structure far too large to keep in memory.
One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks. When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality. Not ideal, but it helps.
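A minimal sketch of that deferred-free idea (hypothetical names throughout): buffer frees in memory, and when the buffer fills, sort it by offset before applying it, so the underlying bitmap or B-tree is updated with much better locality than the original random frees had.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct extent { uint64_t ext_offset, ext_length; } extent_t;

    #define DEFERRED_MAX 4096
    static extent_t deferred[DEFERRED_MAX];
    static int ndeferred;

    /* Stand-in for updating the underlying bitmap or B-tree. */
    static void bitmap_clear(uint64_t offset, uint64_t length) {
        (void)offset; (void)length;
    }

    static int extent_cmp(const void *a, const void *b) {
        const extent_t *x = a, *y = b;
        return (x->ext_offset > y->ext_offset) - (x->ext_offset < y->ext_offset);
    }

    void free_extent_deferred(uint64_t offset, uint64_t length) {
        deferred[ndeferred].ext_offset = offset;
        deferred[ndeferred].ext_length = length;
        if (++ndeferred < DEFERRED_MAX)
            return;
        /* Sort in memory, then flush: nearby regions are touched together. */
        qsort(deferred, ndeferred, sizeof (extent_t), extent_cmp);
        for (int i = 0; i < ndeferred; i++)
            bitmap_clear(deferred[i].ext_offset, deferred[i].ext_length);
        ndeferred = 0;
    }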
But what if we went further?
Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?
Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?
That is precisely what ZFS does. ZFS divides the space on each virtual device into a few hundred regions called metaslabs. Each metaslab has an associated space map, which describes that metaslab's free space. The space map is simply a log of allocations and frees, in time order. Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality. Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free).
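Here's a sketch of the idea in simplified C (hypothetical names; the real on-disk format is more compact, packing each entry into a single word, and this sketch does no bounds checking). The only operation that ever touches disk is an append, so allocations and frees cost the same no matter where the extent lives.

    #include <stdint.h>

    typedef enum { SM_ALLOC, SM_FREE } sm_type_t;

    typedef struct sm_entry {
        uint64_t  sme_offset;   /* start of the extent */
        uint64_t  sme_length;   /* length of the extent */
        sm_type_t sme_type;     /* allocation or free */
    } sm_entry_t;

    /* Stand-in for the on-disk space map object: an append-only log. */
    static sm_entry_t sm_log[1024];
    static int sm_nentries;

    static void sm_append(uint64_t offset, uint64_t length, sm_type_t type) {
        sm_entry_t *e = &sm_log[sm_nentries++];
        e->sme_offset = offset;
        e->sme_length = length;
        e->sme_type = type;
        /* In real life: write e to the tail of the space map object. */
    }

    void space_map_alloc(uint64_t offset, uint64_t length) {
        sm_append(offset, length, SM_ALLOC);
    }

    void space_map_free(uint64_t offset, uint64_t length) {
        sm_append(offset, length, SM_FREE);  /* random frees append too */
    }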
When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab's space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset. This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space. ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.
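Sketched below (again with hypothetical names, reusing the sm_entry_t from the sketch above; the factor-of-two condense threshold is an arbitrary assumption, not ZFS's actual heuristic): replay the log into the tree, then decide whether to condense. The divide/join mechanics of the tree operations themselves are sketched in the comments further down.

    /* Assumed operations on a tree of free extents sorted by offset. */
    extern void     tree_add_free(uint64_t offset, uint64_t length);
    extern void     tree_remove_range(uint64_t offset, uint64_t length);
    extern uint64_t tree_num_extents(void);

    void metaslab_load(uint64_t ms_start, uint64_t ms_size,
        const sm_entry_t *log, uint64_t nentries)
    {
        /* Start from "everything free", then replay the log in order. */
        tree_add_free(ms_start, ms_size);
        for (uint64_t i = 0; i < nentries; i++) {
            if (log[i].sme_type == SM_ALLOC)
                tree_remove_range(log[i].sme_offset, log[i].sme_length);
            else
                tree_add_free(log[i].sme_offset, log[i].sme_length);
        }
        /*
         * Condense: if many alloc/free pairs cancelled out, the tree
         * is a much smaller description of the same state, so rewrite
         * the on-disk log as one SM_FREE entry per extent in the tree.
         */
        if (nentries > 2 * tree_num_extents()) {
            /* ... truncate the space map object, append the tree ... */
        }
    }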
Space maps have several nice properties:

- They don't require initialization: a space map with no entries indicates that there have been no allocations and no frees, so all space is free.

- They scale: because space maps are append-only, only the last block of the space map object needs to be in memory to ensure excellent performance, no matter how large the space map gets.

- They have no pathological performance: appending an extent costs the same regardless of the pool's size, the pattern of allocations and frees, or the degree of fragmentation.
Finally, note that when a space map is completely full, it is represented by a single extent. Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.
Very interesting. But I do not understand the "sorted by offset" part of the AVL tree. What are the nodes of the tree? When replaying the transactions, is ZFS dividing/joining nodes?
Very interesting stuff, thanks for the read. I'm curious though, what prompted the choice of an AVL tree over one of the other balanced trees such as red-black? Is the stricter balancing enforced by AVL a requirement for efficiency here?
Ah, good questions. The nodes of the in-memory AVL tree describe free extents. The tree is initialized with a single extent for the whole metaslab, indicating that everything is free. Then, for each allocation record in the space map, we excise that extent from the tree (which may entail dividing a free extent); and for each free record, we add that extent to the tree (which may entail joining adjacent extents into a single larger one). The code in question is all here: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/space_map.c
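Here is a condensed sketch of that divide/join logic. To keep it short it uses a sorted linked list rather than the kernel's AVL tree, and hypothetical names, so it's illustrative rather than the actual space_map.c code; each node is one free extent [start, end).

    #include <stdint.h>
    #include <stdlib.h>
    #include <assert.h>

    typedef struct seg {
        uint64_t s_start, s_end;
        struct seg *s_next;
    } seg_t;

    static seg_t *head;  /* free extents, sorted by offset */

    /* Replay an allocation record: excise [start, end) from the list. */
    void seg_remove(uint64_t start, uint64_t end) {
        for (seg_t **pp = &head; *pp; pp = &(*pp)->s_next) {
            seg_t *s = *pp;
            if (start < s->s_start || end > s->s_end)
                continue;
            if (start == s->s_start && end == s->s_end) {  /* exact match */
                *pp = s->s_next;
                free(s);
            } else if (start == s->s_start) {              /* trim front */
                s->s_start = end;
            } else if (end == s->s_end) {                  /* trim back */
                s->s_end = start;
            } else {                                       /* divide in two */
                seg_t *tail = malloc(sizeof (seg_t));
                tail->s_start = end;
                tail->s_end = s->s_end;
                tail->s_next = s->s_next;
                s->s_end = start;
                s->s_next = tail;
            }
            return;
        }
        assert(!"allocating space that isn't free");
    }

    /* Replay a free record: insert [start, end), joining neighbors. */
    void seg_add(uint64_t start, uint64_t end) {
        seg_t **pp = &head;
        while (*pp && (*pp)->s_end < start)
            pp = &(*pp)->s_next;
        seg_t *next = *pp;
        if (next && next->s_start <= end) {        /* join with a neighbor */
            if (start < next->s_start)
                next->s_start = start;
            if (end > next->s_end)
                next->s_end = end;
            /* the join may have made it touch the following segment too */
            if (next->s_next && next->s_next->s_start <= next->s_end) {
                seg_t *n2 = next->s_next;
                next->s_end = n2->s_end;
                next->s_next = n2->s_next;
                free(n2);
            }
            return;
        }
        seg_t *s = malloc(sizeof (seg_t));         /* no neighbors: new node */
        s->s_start = start; s->s_end = end; s->s_next = next;
        *pp = s;
    }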
Rene: there was no particular reason for choosing AVL trees over red-black. AVL is just what we happen to have support for in the Solaris kernel. Any data structure that efficiently keeps things sorted would suffice.
Long-awaited description of what a space map is. Thanks! It finally cleared up my understanding of space maps.
Since the previous post about block allocation, I played a bit with space maps and made an implementation based on red-black trees with NIL nodes that has a couple of interesting properties: the size of the largest free segment in the space map is always known, and allocation is basically a variant of avl_find() with an alternate comparison function. The only drawback is that it currently uses 8 more bytes for each space segment. I'd like to share this with the ZFS community.
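For readers curious what such an augmentation can look like, here is one way to do it (a hypothetical sketch, not the commenter's actual patch): each node carries the size of the largest free segment in its subtree, recomputed on the way back up from every insert, delete, and rotation; a search for a segment of at least a given size then descends only into subtrees whose maximum is big enough.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct space_seg {
        uint64_t ss_start;
        uint64_t ss_end;
        uint64_t ss_maxsize;   /* the extra 8 bytes per segment */
        struct space_seg *ss_left, *ss_right;
    } space_seg_t;

    static uint64_t seg_size(const space_seg_t *s) {
        return s->ss_end - s->ss_start;
    }
    static uint64_t maxsize(const space_seg_t *s) {
        return s ? s->ss_maxsize : 0;
    }

    /* Recompute after any structural change (insert, delete, rotation). */
    static void seg_fixup(space_seg_t *s) {
        uint64_t m = seg_size(s);
        if (maxsize(s->ss_left) > m) m = maxsize(s->ss_left);
        if (maxsize(s->ss_right) > m) m = maxsize(s->ss_right);
        s->ss_maxsize = m;
    }

    /* Find a segment of at least `want` bytes, or NULL, in O(log n). */
    space_seg_t *seg_find_size(space_seg_t *root, uint64_t want) {
        for (space_seg_t *s = root; s != NULL; ) {
            if (seg_size(s) >= want)
                return s;
            else if (maxsize(s->ss_left) >= want)
                s = s->ss_left;
            else if (maxsize(s->ss_right) >= want)
                s = s->ss_right;
            else
                return NULL;
        }
        return NULL;
    }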
How exactly do the copy-on-write/always-valid-on-disk concepts line up with the idea of space maps (or any other free space tracking mechanism, for that matter)? Is the map update flushed to disk before the empty blocks are filled and referenced, or after? I am curious how ZFS can claim both not to leak blocks on crashes and not to need (substantial, at least) fs checks after a crash.

Also, where does locality come into play? The mention above that only one block of the map normally resides in memory (the linear update log and not the AVL tree, I assume) would imply to me that the update could be chained sequentially with the blocks whose allocations it -also- describes, but I can't shake the feeling that I'm missing a lot of the picture in my understanding. Thanks for any explanation!
In summary, space maps use a free list along with a tree. Once the AVL tree is created for the metaslab, is it possible to modify the AVL tree in a lazy manner, i.e., deferred frees? I would imagine that all allocations and deallocations must be written to the transaction log, so it's persistent.
Thanks for explaining this, Jeff. If you want to see what space maps look like from space, see blogs.sun.com/relling/entry/space_maps_from_space
Very interesting! I have a question related to removed elements...
Since multiple snapshots can reuse the same block, how can ZFS know when the last reference to a block is removed, and the block can be freed? I'm guessing some sort of reference counting, but I don't see where you would put it.
Anyway, thanks for an interesting blog :-)
Ahh... you have a space map for every snapshot/clone?
Klavs -- no, there is only one space map for each metaslab. No distinction is made between blocks which are referenced multiple times (e.g., by snapshots/clones) vs. just once. The DSL layer keeps track of when a block is no longer referenced anywhere. I describe how this works in my blog -- http://blogs.sun.com/ahrens/entry/is_it_magic
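Roughly, as a sketch with hypothetical names (the linked post has the real details): every block pointer records the transaction group in which the block was born, so a single comparison against the newest snapshot's birth time decides whether the block can be freed immediately or must wait on a snapshot.

    #include <stdint.h>

    typedef struct blkptr {
        uint64_t blk_birth;   /* txg in which the block was written */
    } blkptr_t;

    /* Stand-ins for the real free paths. */
    static void metaslab_free(uint64_t offset, uint64_t length) {
        (void)offset; (void)length;   /* append a free to the space map */
    }
    static void deadlist_insert(const blkptr_t *bp) {
        (void)bp;                     /* remember it for later */
    }

    void block_no_longer_referenced(const blkptr_t *bp, uint64_t offset,
        uint64_t length, uint64_t newest_snapshot_txg)
    {
        if (bp->blk_birth > newest_snapshot_txg) {
            /* Born after the newest snapshot, so no snapshot can
             * reference it: free it right away. */
            metaslab_free(offset, length);
        } else {
            /* A snapshot may still reference it: defer it to the
             * dataset's deadlist until those snapshots are destroyed. */
            deadlist_insert(bp);
        }
    }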
Hey Matt, thanks a lot for that explanation!
Using internal timestamps on rows is a common implementation technique in relational databases to handle concurrency control, but the solution the ZFS team has come up with for freeing blocks takes it to another level. Very clever!