By bonwick on Sep 13, 2007
Every filesystem must keep track of two basic things: where your data is, and where the free space is.
In principle, keeping track of free space is not strictly necessary: every block is either allocated or free, so the free space can be computed by assuming everything is free and then subtracting out everything that's allocated; and the allocated space can be found by traversing the entire filesystem from the root. Any block that cannot be found by traversal from the root is, by definition, free.
In practice, finding free space this way would be insufferable because it would take far too long for any filesystem of non-trivial size. To make the allocation and freeing of blocks fast, the filesystem needs an efficient way to keep track of free space. In this post we'll examine the most common methods, why they don't scale well, and the new approach we devised for ZFS.
BitmapsThe most common way to represent free space is by using a bitmap. A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free. The overhead for a bitmap is quite low: 1 bit per block. For a 4K blocksize, that's 1/(4096\*8) = 0.003%. (The 8 comes from 8 bits per byte.)
For a 1GB filesystem, the bitmap is 32KB -- something that easily fits in memory, and can be scanned quickly to find free space. For a 1TB filesystem, the bitmap is 32MB -- still stuffable in memory, but no longer trivial in either size or scan time. For a 1PB filesystem, the bitmap is 32GB, and that simply won't fit in memory on most machines. This means that scanning the bitmap requires reading it from disk, which is slower still.
Clearly, this doesn't scale.
One seemingly obvious remedy is to break the bitmap into small chunks, and keep track of the number of bits set in each chunk. For example, for a 1PB filesystem using 4K blocks, the free space can be divided into a million bitmaps, each 32KB in size. The summary information (the million integers indicating how much space is in each bitmap) fits in memory, so it's easy to find a bitmap with free space, and it's quick to scan that bitmap.
But there's still a fundamental problem: the bitmap(s) must be updated not only when a new block is allocated, but also when an old block is freed. The filesystem controls the locality of allocations (it decides which blocks to put new data into), but it has no control over the locality of frees. Something as simple as 'rm -rf' can cause blocks all over the platter to be freed. With our 1PB filesystem example, in the worst case, removing 4GB of data (a million 4K blocks) could require each of the million bitmaps to be read, modified, and written out again. That's two million disk I/Os to free a measly 4GB -- and that's just not reasonable, even as worst-case behavior.
More than any other single factor, this is why bitmaps don't scale: because frees are often random, and bitmaps that don't fit in memory perform pathologically when they are accessed randomly.
B-treesAnother common way to represent free space is with a B-tree of extents. An extent is a contiguous region of free space described by two integers: offset and length. The B-tree sorts the extents by offset so that contiguous space allocation is efficient. Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees.
What to do?
One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks. When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality. Not ideal, but it helps.
But what if we went further?
Space maps: log-structured free lists
Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?
Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?
That is precisely what ZFS does. ZFS divides the space on each virtual device into a few hundred regions called metaslabs. Each metaslab has an associated space map, which describes that metaslab's free space. The space map is simply a log of allocations and frees, in time order. Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality. Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free).
When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab's space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset. This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space. ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.
Space maps have several nice properties:
- They don't require initialization: a space map with no entries indicates that there have been no allocations and no frees, so all space is free.
- They scale: because space maps are append-only, only the last block of the space map object needs to be in memory to ensure excellent performance, no matter how much space is being managed.
- They have no pathologies: space maps are efficient to update regardless of the pattern of allocations and frees.
- They are equally efficient at finding free space whether the pool is empty or full (unlike bitmaps, which take longer to scan as they fill up).
Finally, note that when a space map is completely full, it is represented by a single extent. Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.