ZFS Block Allocation
By bonwick on Nov 04, 2006
Block allocation is central to any filesystem. It affects not only performance, but also the administrative model (e.g. stripe configuration) and even some core capabilities like transactional semantics, compression, and block sharing between snapshots. So it's important to get it right.
There are three components to the block allocation policy in ZFS:
- Device selection (dynamic striping)
- Metaslab selection
- Block selection
By design, these three policies are independent and pluggable. They can be changed at will without altering the on-disk format, which gives us lots of flexibility in the years ahead.
So... let's go allocate a block!
1. Device selection (aka dynamic striping). Our first task is device selection. The goal is to spread the load across all devices in the pool so that we get maximum bandwidth without needing any notion of stripe groups. You add more disks, you get more bandwidth. We call this dynamic striping -- the point being that it's done on the fly by the filesystem, rather than at configuration time by the administrator.
There are many ways to select a device. Any policy would work, including just picking one at random. But there are several practical considerations:
- If a device was recently added to the pool, it'll be relatively empty. To address such imbalances, we bias the allocation slightly in favor of underutilized devices. This keeps space usage uniform across all devices.
- All else being equal, round-robin is a fine policy, but it's critical to get the granularity right. If the granularity is too coarse (e.g. 1GB), we'll only get one device worth of bandwidth when doing sequential reads and writes. If the granularity is too fine (e.g. one block), we'll waste any read buffering the device can do for us. In practice, we've found that switching from one device to the next every 512K works well for the current generation of disk drives.
- That said, for intent log blocks, it's better to round-robin between devices each time we write a log block. That's because they're very short-lived, so we don't expect to ever need to read them; therefore it's better to optimize for maximum IOPs when writing log blocks. Neil Perrin integrated support for this earlier today.
- More generally, we'll probably want different striping policies for different types of data: large/sequential, small/random, transient (like the intent log), and dnodes (clumping for spatial density might be good). This is fertile ground for investigation.
- If one of the devices is slow or degraded for some reason, it should be avoided. This is work in progress.
2. Metaslab selection. We divide each device into a few hundred regions, called metaslabs, because the overall scheme was inspired by the slab allocator. Having selected a device, which metaslab should we use? Intuitively it seems that we'd always want the one with the most free space, but there are other factors to consider:
- Modern disks have uniform bit density and constant angular velocity. Therefore, the outer recording zones are faster (higher bandwidth) than the inner zones by the ratio of outer to inner track diameter, which is typically around 2:1. We account for this by assigning a higher weight to the free space in lower-LBA metaslabs. In effect, this means that we'll select the metaslab with the most free bandwidth rather than simply the one with the most free space.
- When a pool is relatively empty, we want to keep allocations in the outer (faster) regions of the disk; this improves both bandwidth and latency (by minimizing seek distances). Therefore, we assign a higher weight to metaslabs that have been used before.
All of these considerations can be seen in the function metaslab_weight(). Having defined a weighting scheme, the selection algorithm is simple: always select the metaslab with the highest weight.
3. Block selection. Having selected a metaslab, we must choose a block within that metaslab. The current allocation policy is a simple variation on first-fit; it seems likely that we can do better. In the future I expect that we'll have not only a better algorithm, but a whole collection of algorithms, each optimized for a specific workload. Anticipating this, the block allocation code is fully vectorized; see space_map_ops_t for details.
The mechanism (as opposed to policy) for keeping track of free space in a metaslab is a new data structure called a space map, which I'll describe in the next post.