Block allocation is central to any filesystem. It affects not only performance, but also the administrative model (e.g. stripe configuration) and even some core capabilities like transactional semantics, compression, and block sharing between snapshots. So it's important to get it right.
There are three components to the block allocation policy in ZFS: device selection, metaslab selection, and block selection.
By design, these three policies are independent and pluggable. They can be changed at will without altering the on-disk format, which gives us lots of flexibility in the years ahead.
So... let's go allocate a block!
1. Device selection (aka dynamic striping). Our first task is device selection. The goal is to spread the load across all devices in the pool so that we get maximum bandwidth without needing any notion of stripe groups. You add more disks, you get more bandwidth. We call this dynamic striping -- the point being that it's done on the fly by the filesystem, rather than at configuration time by the administrator.
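To make the idea of choosing a device on the fly concrete, here is a minimal sketch in C. Everything in it is hypothetical: toy_dev_t, toy_pool_t and toy_select_device() are illustration-only names, not ZFS structures or functions, and plain round-robin is assumed only because it is the simplest policy that spreads load across every device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device state; not a ZFS structure. */
typedef struct toy_dev {
	uint64_t td_free;	/* bytes still unallocated on this device */
} toy_dev_t;

/* Hypothetical pool: an array of devices plus a rotating cursor. */
typedef struct toy_pool {
	toy_dev_t *tp_devs;
	size_t tp_ndevs;
	size_t tp_next;		/* device to try first on the next allocation */
} toy_pool_t;

/*
 * Pick a device for an allocation of `size` bytes, round-robin.
 * Starts at the cursor, skips devices without enough free space,
 * charges the allocation to the chosen device, and moves the cursor
 * just past it. Returns the device index, or -1 if nothing fits.
 */
int
toy_select_device(toy_pool_t *tp, uint64_t size)
{
	assert(tp != NULL && tp->tp_ndevs > 0);

	for (size_t i = 0; i < tp->tp_ndevs; i++) {
		size_t d = (tp->tp_next + i) % tp->tp_ndevs;

		if (tp->tp_devs[d].td_free >= size) {
			tp->tp_devs[d].td_free -= size;
			tp->tp_next = (d + 1) % tp->tp_ndevs;
			return ((int)d);
		}
	}
	return (-1);
}
```

Note that nothing in the sketch refers to a stripe group: a newly added device would simply become one more entry in tp_devs and start receiving its share of allocations.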
There are many ways to select a device. Any policy would work, including just picking one at random. But there are several practical considerations:
2. Metaslab selection. We divide each device into a few hundred regions, called metaslabs, because the overall scheme was inspired by the slab allocator. Having selected a device, which metaslab should we use? Intuitively it seems that we'd always want the one with the most free space, but there are other factors to consider:
All of these considerations can be seen in the function metaslab_weight(). Having defined a weighting scheme, the selection algorithm is simple: always select the metaslab with the highest weight.
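The "highest weight wins" rule can be sketched in a few lines of C. This is not the real metaslab_weight(): toy_metaslab_t, toy_weight() and toy_select_metaslab() are hypothetical names, and the weight here is assumed to be just free space plus an opaque bonus that stands in for whatever other factors a real weighting scheme would fold in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metaslab summary; not the ZFS metaslab structure. */
typedef struct toy_metaslab {
	uint64_t tm_free;	/* free bytes in this region */
	uint64_t tm_bonus;	/* stand-in for any other weighting factors */
} toy_metaslab_t;

/*
 * Toy weight: free space plus bonus. An assumption for illustration;
 * the real metaslab_weight() is not reproduced here.
 */
uint64_t
toy_weight(const toy_metaslab_t *tm)
{
	return (tm->tm_free + tm->tm_bonus);
}

/*
 * Return the index of the metaslab with the highest weight.
 * Ties go to the lowest index because the comparison is strict.
 */
size_t
toy_select_metaslab(const toy_metaslab_t *ms, size_t n)
{
	assert(ms != NULL && n > 0);

	size_t best = 0;
	for (size_t i = 1; i < n; i++) {
		if (toy_weight(&ms[i]) > toy_weight(&ms[best]))
			best = i;
	}
	return (best);
}
```

Keeping the weighting function separate from the selection loop means the policy can change without touching the selection code, which is the kind of independence the three-part design is meant to provide.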
3. Block selection. Having selected a metaslab, we must choose a block within that metaslab. The current allocation policy is a simple variation on first-fit; it seems likely that we can do better. In the future I expect that we'll have not only a better algorithm, but a whole collection of algorithms, each optimized for a specific workload. Anticipating this, the block allocation code is fully vectorized; see space_map_ops_t for details.
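As a rough illustration of what "vectorized" means here, the sketch below puts a plain first-fit allocator behind a table of function pointers. It is a sketch under stated assumptions, not the ZFS code: toy_seg_t, toy_map_t, toy_alloc_ops_t and the functions are hypothetical names, the free list is assumed to be a small array sorted by offset, and the policy shown is textbook first-fit rather than the variation ZFS actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free segment covering [ts_start, ts_start + ts_size). */
typedef struct toy_seg {
	uint64_t ts_start;
	uint64_t ts_size;
} toy_seg_t;

/* Hypothetical free-space view of one metaslab, sorted by offset. */
typedef struct toy_map {
	toy_seg_t *tm_segs;
	size_t tm_nsegs;
} toy_map_t;

#define	TOY_NOSPACE	UINT64_MAX

/* Hypothetical ops vector: swap the pointer to swap the policy. */
typedef struct toy_alloc_ops {
	uint64_t (*tao_alloc)(toy_map_t *tm, uint64_t size);
} toy_alloc_ops_t;

/*
 * Plain first-fit: take `size` bytes from the front of the first
 * segment that is large enough. Returns the allocated offset, or
 * TOY_NOSPACE if no segment fits.
 */
uint64_t
toy_first_fit(toy_map_t *tm, uint64_t size)
{
	assert(tm != NULL && size > 0);

	for (size_t i = 0; i < tm->tm_nsegs; i++) {
		toy_seg_t *s = &tm->tm_segs[i];

		if (s->ts_size >= size) {
			uint64_t off = s->ts_start;

			s->ts_start += size;
			s->ts_size -= size;
			return (off);
		}
	}
	return (TOY_NOSPACE);
}

const toy_alloc_ops_t toy_first_fit_ops = { toy_first_fit };

/* Callers go through the ops vector, never a specific algorithm. */
uint64_t
toy_alloc(const toy_alloc_ops_t *ops, toy_map_t *tm, uint64_t size)
{
	return (ops->tao_alloc(tm, size));
}
```

A workload-specific algorithm would be added by defining another function with the same signature and another toy_alloc_ops_t that points at it; callers of toy_alloc() would not change.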
The mechanism (as opposed to policy) for keeping track of free space in a metaslab is a new data structure called a space map, which I'll describe in the next post.