Space Maps


Every filesystem must keep track of two basic things: where your data is, and where the free space is.

In principle, keeping track of free space is not strictly necessary: every block is either allocated or free, so the free space can be computed by assuming everything is free and then subtracting out everything that's allocated; and the allocated space can be found by traversing the entire filesystem from the root.  Any block that cannot be found by traversal from the root is, by definition, free.

In practice, finding free space this way would be insufferable because it would take far too long for any filesystem of non-trivial size.  To make the allocation and freeing of blocks fast, the filesystem needs an efficient way to keep track of free space.  In this post we'll examine the most common methods, why they don't scale well, and the new approach we devised for ZFS.

Bitmaps

The most common way to represent free space is by using a bitmap.  A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free.  The overhead for a bitmap is quite low: 1 bit per block.  For a 4K blocksize, that's 1/(4096 * 8) = 0.003%.  (The 8 comes from 8 bits per byte.)

For a 1GB filesystem, the bitmap is 32KB -- something that easily fits in memory, and can be scanned quickly to find free space.  For a 1TB filesystem, the bitmap is 32MB -- still stuffable in memory, but no longer trivial in either size or scan time.  For a 1PB filesystem, the bitmap is 32GB, and that simply won't fit in memory on most machines.  This means that scanning the bitmap requires reading it from disk, which is slower still.
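
To make the bitmap approach concrete, here is a minimal sketch in C -- illustrative only, not actual filesystem code, with invented names: bit N records whether block N is allocated, and finding free space means scanning for a zero bit.

    #include <stdint.h>

    /* Illustrative sketch: bit N is 1 if block N is allocated, 0 if free. */
    typedef struct bitmap {
        uint8_t  *bm_bits;     /* 1 bit per block */
        uint64_t  bm_nblocks;  /* total blocks tracked */
    } bitmap_t;

    static void
    bitmap_alloc_block(bitmap_t *bm, uint64_t b)    /* mark allocated */
    {
        bm->bm_bits[b >> 3] |= (uint8_t)(1 << (b & 7));
    }

    static void
    bitmap_free_block(bitmap_t *bm, uint64_t b)     /* mark free */
    {
        bm->bm_bits[b >> 3] &= (uint8_t)~(1 << (b & 7));
    }

    /* Finding a free block means scanning -- O(size of the bitmap). */
    static int64_t
    bitmap_find_free(const bitmap_t *bm)
    {
        for (uint64_t b = 0; b < bm->bm_nblocks; b++)
            if ((bm->bm_bits[b >> 3] & (1 << (b & 7))) == 0)
                return ((int64_t)b);
        return (-1);  /* no free space */
    }

At 1PB, that scan walks the full 32GB array in the worst case -- which is exactly the problem.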

Clearly, this doesn't scale.

One seemingly obvious remedy is to break the bitmap into small chunks, and keep track of the number of bits set in each chunk.  For example, for a 1PB filesystem using 4K blocks, the free space can be divided into a million bitmaps, each 32KB in size.  The summary information (the million integers indicating how much space is in each bitmap) fits in memory, so it's easy to find a bitmap with free space, and it's quick to scan that bitmap.
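
A sketch of that two-level scheme, building on the bitmap type above (again, the names are invented for illustration): only the summary array has to stay in memory.

    /* Hypothetical two-level bitmap: 32KB chunks on disk, plus an
       in-memory count of free blocks per chunk. */
    #define CHUNK_BLOCKS  (32 * 1024 * 8)  /* blocks per 32KB chunk */

    typedef struct chunked_bitmap {
        bitmap_t *cb_chunks;      /* per-chunk bitmaps, mostly on disk */
        uint32_t *cb_free_count;  /* in-memory summary, one per chunk */
        uint64_t  cb_nchunks;
    } chunked_bitmap_t;

    /* The summary fits in memory, so picking a chunk to scan is cheap;
       only that one 32KB chunk then needs to be read and scanned. */
    static int64_t
    chunked_find_chunk(const chunked_bitmap_t *cb)
    {
        for (uint64_t c = 0; c < cb->cb_nchunks; c++)
            if (cb->cb_free_count[c] > 0)
                return ((int64_t)c);
        return (-1);
    }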

But there's still a fundamental problem: the bitmap(s) must be updated not only when a new block is allocated, but also when an old block is freed.  The filesystem controls the locality of allocations (it decides which blocks to put new data into), but it has no control over the locality of frees.  Something as simple as 'rm -rf' can cause blocks all over the platter to be freed.  With our 1PB filesystem example, in the worst case, removing 4GB of data (a million 4K blocks) could require each of the million bitmaps to be read, modified, and written out again.  That's two million disk I/Os to free a measly 4GB -- and that's just not reasonable, even as worst-case behavior.

More than any other single factor, this is why bitmaps don't scale: because frees are often random, and bitmaps that don't fit in memory perform pathologically when they are accessed randomly.

B-trees

Another common way to represent free space is with a B-tree of extents.  An extent is a contiguous region of free space described by two integers: offset and length.  The B-tree sorts the extents by offset so that contiguous space allocation is efficient.  Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees.
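
For reference, the extent and the offset ordering such a B-tree maintains might look like this (illustrative only):

    #include <stdint.h>

    /* An extent: a contiguous run of free space. */
    typedef struct extent {
        uint64_t ext_offset;  /* starting block */
        uint64_t ext_length;  /* number of blocks */
    } extent_t;

    /* Sort key: extents ordered by offset, so adjacent free space is
       adjacent in the tree and contiguous allocation is efficient. */
    static int
    extent_compare(const void *a, const void *b)
    {
        const extent_t *ea = a, *eb = b;

        if (ea->ext_offset < eb->ext_offset)
            return (-1);
        if (ea->ext_offset > eb->ext_offset)
            return (1);
        return (0);
    }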

What to do?

Deferred frees

One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks.  When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality.  Not ideal, but it helps.
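
A sketch of that idea, reusing the extent type above (DEFER_MAX and apply_free() are invented for this sketch): frees pile up in memory, then get sorted by offset and applied in one batch, so nearby frees touch the same bitmap chunks or B-tree nodes.

    #include <stdlib.h>

    #define DEFER_MAX  8192  /* arbitrary batch size for this sketch */

    static extent_t deferred[DEFER_MAX];
    static int ndeferred;

    extern void apply_free(const extent_t *);  /* updates the bitmap/B-tree */

    void
    defer_free(uint64_t offset, uint64_t length)
    {
        deferred[ndeferred].ext_offset = offset;
        deferred[ndeferred].ext_length = length;

        if (++ndeferred == DEFER_MAX) {
            /* Sort by offset so nearby frees hit the same on-disk data. */
            qsort(deferred, ndeferred, sizeof (extent_t), extent_compare);
            for (int i = 0; i < ndeferred; i++)
                apply_free(&deferred[i]);
            ndeferred = 0;
        }
    }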

But what if we went further?

Space maps:  log-structured free lists

Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?

Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?

That is precisely what ZFS does.  ZFS divides the space on each virtual device into a few hundred regions called metaslabs.  Each metaslab has an associated space map, which describes that metaslab's free space.  The space map is simply a log of allocations and frees, in time order.  Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality.  Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free).
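
A simplified sketch of a space-map record and its append path. This is not the actual on-disk encoding -- ZFS packs each entry much more tightly (see the space_map.c link in the comments below) -- but it captures the idea: recording an allocation or a free is a constant-time append, no matter where the extent lives.

    #include <stdint.h>
    #include <unistd.h>

    typedef enum { SM_ALLOC, SM_FREE } sm_type_t;

    /* Simplified record: an extent plus an alloc/free flag. */
    typedef struct sm_entry {
        uint64_t  sme_offset;
        uint64_t  sme_length;
        sm_type_t sme_type;   /* allocation or free */
    } sm_entry_t;

    /* O(1) and perfectly local: just append to the end of the log. */
    void
    space_map_append(int log_fd, uint64_t off, uint64_t len, sm_type_t type)
    {
        sm_entry_t e = { off, len, type };
        (void) write(log_fd, &e, sizeof (e));
    }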

When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab's space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset.  This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space.  ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.
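
Here is a sketch of that replay, using the record type from the previous sketch. ZFS keeps the free extents in an AVL tree; to keep this example short it uses a sorted linked list instead, but the excise-and-split / insert-and-merge logic is the same idea.

    #include <stdlib.h>

    typedef struct seg {
        uint64_t s_start, s_end;  /* free range [start, end) */
        struct seg *s_next;       /* list kept sorted by s_start */
    } seg_t;

    /* An allocation record excises [start, end) from the free list,
       splitting an extent if the allocation came from its middle. */
    static void
    replay_alloc(seg_t **list, uint64_t start, uint64_t end)
    {
        for (seg_t **p = list; *p != NULL; p = &(*p)->s_next) {
            seg_t *s = *p;
            if (start < s->s_start || end > s->s_end)
                continue;  /* not the containing extent */
            if (start > s->s_start && end < s->s_end) {
                seg_t *tail = malloc(sizeof (seg_t));  /* split in two */
                tail->s_start = end;
                tail->s_end = s->s_end;
                tail->s_next = s->s_next;
                s->s_end = start;
                s->s_next = tail;
            } else if (start == s->s_start && end == s->s_end) {
                *p = s->s_next;  /* exact match: drop the extent */
                free(s);
            } else if (start == s->s_start) {
                s->s_start = end;   /* trim the front */
            } else {
                s->s_end = start;   /* trim the back */
            }
            return;
        }
    }

    /* A free record inserts [start, end), merging with any adjacent
       free extents so the list stays fully coalesced. */
    static void
    replay_free(seg_t **list, uint64_t start, uint64_t end)
    {
        seg_t **p = list;

        while (*p != NULL && (*p)->s_end < start)
            p = &(*p)->s_next;
        if (*p != NULL && (*p)->s_end == start) {
            seg_t *next = (*p)->s_next;
            (*p)->s_end = end;  /* merge with predecessor */
            if (next != NULL && next->s_start == end) {
                (*p)->s_end = next->s_end;  /* bridged a gap: merge both */
                (*p)->s_next = next->s_next;
                free(next);
            }
        } else if (*p != NULL && (*p)->s_start == end) {
            (*p)->s_start = start;  /* merge with successor */
        } else {
            seg_t *s = malloc(sizeof (seg_t));  /* standalone extent */
            s->s_start = start;
            s->s_end = end;
            s->s_next = *p;
            *p = s;
        }
    }

    /* Replay the log in time order.  The list starts with one extent
       covering the whole metaslab; each record is then applied. */
    void
    space_map_replay(seg_t **free_list, const sm_entry_t *log, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint64_t s = log[i].sme_offset;
            uint64_t e = s + log[i].sme_length;

            if (log[i].sme_type == SM_ALLOC)
                replay_alloc(free_list, s, e);
            else
                replay_free(free_list, s, e);
        }
    }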

Space maps have several nice properties:

  • They don't require initialization: a space map with no entries indicates that there have been no allocations and no frees, so all space is free.

  • They scale: because space maps are append-only, only the last block of the space map object needs to be in memory to ensure excellent performance, no matter how much space is being managed.

  • They have no pathologies: space maps are efficient to update regardless of the pattern of allocations and frees.

  • They are equally efficient at finding free space whether the pool is empty or full (unlike bitmaps, which take longer to scan as they fill up).

Finally, note that when a space map is completely full, it is represented by a single extent.  Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.

Comments:

Very interesting. But I do not understand the "sorted by offset" of the AVL. What are the nodes of the tree? When replaying the transactions, is ZFS dividing/joining nodes?

Posted by Pierre Chatelier on September 14, 2007 at 03:49 AM PDT #

Very interesting stuff, thanks for the read. I'm curious though, what prompted the choice of an AVL tree over one of the other balanced trees such as red-black? Is the stricter balancing enforced by AVL a requirement for efficiency here?

Posted by Rene Gollent on September 14, 2007 at 05:13 PM PDT #

Ah, good questions. The nodes of the in-memory AVL tree describe free extents. The tree is initialized with a single extent for the whole metaslab, indicating that everything is free. Then, for each allocation record in the space map, we excise that extent from the tree (which may entail dividing a free extent); and for each free record, we add that extent to the tree (which may entail joining two adjacent extents into a single larger one). The code in question is all here: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/space_map.c

Posted by Jeff Bonwick on September 15, 2007 at 07:58 PM PDT #

Rene: there was no particular reason for choosing AVL trees over red-black. AVL is just what we happen to have support for in the Solaris kernel. Any data structure that efficiently keeps things sorted would suffice.

Posted by Jeff Bonwick on September 15, 2007 at 08:09 PM PDT #

Long-awaited description of what a space map is. Thanks! It finally cleared up my understanding of space maps.

Since the previous post about block allocation, I have played a bit with space maps and made an implementation based on red-black trees with NIL nodes that has a couple of interesting properties: the size of the largest free segment in a space map is always known, and allocation is basically a variant of avl_find() with an alternate comparison function. The only drawback is that, for now, it uses 8 more bytes for each space segment. I'd like to share this with the ZFS community.

Posted by Victor Latushkin on September 15, 2007 at 08:55 PM PDT #

How exactly do the copy-on-write/always-valid-on-disk concepts line up with the idea of space maps (or any other free space tracking mechanism, for that matter)? Is the map update flushed to disk before empty blocks are filled and referenced, or after? I am curious as to how ZFS can claim both not to leak blocks on crashes and not to need (substantial, at least) fs checks after a crash.

Also, where does locality come into play? The mention above that only one block of the map normally resides in memory (the linear update log and not the AVL tree, I assume) would imply to me that the update could be chained sequentially with the blocks whose allocations it -also- describes, but I can't shake the feeling that I'm missing a lot of the picture in my understanding. Thanks for any explanation!

Posted by Jeff Smith on September 16, 2007 at 12:19 PM PDT #

In summary, space maps use a free list along with a tree. Once the AVL tree is created for the metaslab, is it possible to modify the AVL tree in a lazy manner, i.e., with deferred frees? I would imagine that all allocations and deallocations must be written to the transaction log, so it's persistent.

Posted by Ashish M on September 17, 2007 at 04:26 AM PDT #

Thanks for explaining this, Jeff. If you want to see what space maps look like from space, see blogs.sun.com/relling/entry/space_maps_from_space

Posted by Richard Elling on September 18, 2007 at 09:58 AM PDT #

Very interesting! I have a question related to removed elements...

Since multiple snapshots can reuse the same block, how can ZFS know when the last reference to a block is removed, and the block can be freed? I'm guessing some sort of reference counting, but I don't see where you can put it?

Anyway, thanks for an interesting blog :-)

Posted by Klavs Klavsen on September 19, 2007 at 12:49 AM PDT #

Ahh... you have a space map for every snapshot/clone?

Posted by Klavs Klavsen on September 19, 2007 at 01:40 AM PDT #

Klavs -- no, there is only one space map for each metaslab. No distinction is made between blocks which are referenced multiple times (e.g., by snapshots/clones) vs. just once. The DSL layer keeps track of when a block is no longer referenced anywhere. I describe how this works in my blog -- http://blogs.sun.com/ahrens/entry/is_it_magic

Posted by Matt Ahrens on September 20, 2007 at 11:10 AM PDT #

Hey Matt, thanks a lot for that explanation!

Using internal timestamps on rows is a common implementation technique in relational databases to handle concurrency control, but the solution that the ZFS team has come up with for freeing blocks takes it to another level. Very clever!

Posted by Klavs Klavsen on September 20, 2007 at 06:33 PM PDT #
