In the final article of this series, I walk through the fascinating topic of block picking. As is well known, ZFS is an allocate-on-write storage system: every time it updates the on-disk structure, it writes the new data wherever it chooses. Within reasonable bounds, ZFS also controls the timing of when to write data out to devices (outside of ZIL blocks, whose timing is governed by applications). The current bounds are set about 5 seconds apart; when that clock ticks, we bundle up all recent changes into a transaction group (TXG) handled by spa_sync.
Armed with this herd of pending data blocks, ZFS issues a highly concurrent workload dedicated to running all CPU intensive tasks. For any individual I/O, after going through compression, encryption and checksumming, we move on to the allocation task and finally device level I/O scheduling.
The first task of allocation involves selecting a device in a pool. Then, within that device, selecting a sub-region called a metaslab, and finally, within that metaslab, selecting a block where the data is stored. Our guiding principles through the process are to:
- Foster write I/O aggregation
- Ensure devices are used efficiently
- Avoid fragmenting the on-disk space
- Limit the core memory required
- Serve concurrent allocations quickly
- Do all this with as little CPU resources as possible
Let's see how ZFS solves this tough equation.
When a TXG is triggered, we want devices to receive a set of I/Os such that the I/Os:
- Aggregate with each other
- Stream on the media
I/O aggregation occurs downstream when 2 I/Os, which are ready to issue, have just been allocated on the same device in adjacent locations. The per-device I/O pipeline is then able to merge both of them (N of them, actually) into a single device I/O. This is an incredible feature as it applies to items that have no relationship to each other, other than their allocation proximity, as first described in Need Inodes?
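The merging described above can be sketched in a few lines. This is a minimal illustration, not the actual ZFS pipeline code: the `(offset, size)` tuples, the function name, and the 1MB merge cap are all assumptions made for the example.

```python
def aggregate(ios, max_agg=1 << 20):
    """Merge adjacent (offset, size) writes on one device, capping each
    merged I/O at max_agg bytes (an illustrative limit)."""
    merged = []
    for off, size in sorted(ios):
        prev = merged[-1] if merged else None
        if prev and prev[0] + prev[1] == off and prev[1] + size <= max_agg:
            # This write starts exactly where the previous one ends:
            # extend the previous I/O instead of issuing a new one.
            merged[-1] = (prev[0], prev[1] + size)
        else:
            merged.append((off, size))
    return merged

# Three unrelated 4K writes that happen to be adjacent become one 12K I/O;
# the distant fourth write stays separate.
print(aggregate([(0, 4096), (4096, 4096), (8192, 4096), (65536, 4096)]))
# → [(0, 12288), (65536, 4096)]
```

The point of the sketch is that the writes need no relationship to each other: adjacency alone, a product of the allocator's choices, is what enables the merge.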
Streaming the media is a close cousin to aggregation. I/Os aggregate when they are adjacent, but even when they are not, we still want to avoid seeking to a far-away area of the disk on every I/O. A disk seek is, after all, an eternity. So, while laying data onto the media, we like to keep logical block addresses (LBAs) as packed as possible, leading to streaming efficiency. Similarly for SSDs, doing so avoids fragmenting the internal space mapping done by the Flash Translation Layer (FTL). Even logical volume software welcomes this model.
ZFS allocates from a device until about 1MB of data is handled. The first 1MB of blocks that reach the allocation stage (after the CPU-heavy transformations) are directed to a first device. The following 1MB of blocks move on to the next one, and so on. After iterating round-robin through every device, we are in a state where every device in the pool is busy with I/O. At the same time, other blocks are being processed through the CPU-intensive stages, building up an I/O backlog on every device. At that moment, even if CPUs are heavily used, the system is issuing I/O to all devices, keeping them 100% busy.
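The rotor behavior above can be sketched as follows. The `Rotor` class and the exact 1MB quantum are illustrative stand-ins for the real per-pool state, which the article only describes at this level of detail.

```python
ROTOR_QUANTUM = 1 << 20  # roughly 1MB handed to a device before moving on

class Rotor:
    """Hypothetical round-robin device picker, one per pool."""
    def __init__(self, ndevices):
        self.ndevices = ndevices
        self.current = 0      # device currently receiving allocations
        self.handed_out = 0   # bytes directed to it so far

    def pick_device(self, size):
        if self.handed_out >= ROTOR_QUANTUM:
            # Quantum exhausted: advance to the next device in the pool.
            self.current = (self.current + 1) % self.ndevices
            self.handed_out = 0
        self.handed_out += size
        return self.current

rotor = Rotor(ndevices=3)
devices = [rotor.pick_device(128 * 1024) for _ in range(24)]
# 8 × 128K fills the ~1MB quantum, so each device gets 8 writes in turn.
print(devices)
# → [0]*8 + [1]*8 + [2]*8
```

With enough blocks in flight, every device ends up with its own 1MB batch of adjacent work, which is exactly what keeps them all streaming at once.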
Once we have directed 1MB of allocation to a specific device, we still want the I/Os to target a localized subarea of the device. A metaslab has an in-core structure representing the free space of a portion of a device (very roughly 1%). At any one time, ZFS only allocates from a single metaslab of a device, ensuring dense packing of all writes and therefore avoiding long disk head excursions. The other benefit is that, for the purpose of allocation, we strictly only need to keep in core a structure representing the free space of that subarea. This is the active metaslab. During a given TXG, we thus only write to the active metaslab of every device.
And now, going for the kill. We have the device and the metaslab subarea within it to service our allocation of size X. We finally have to choose a specific device location to fit this allocation. A simple strategy would be to allocate all blocks back-to-back regardless of size. That would lead to maximum aggregation, but we must be considerate of the space fragmentation implications.
Blocks we are allocating together at this instant may well be freed later, each on a different schedule. Frankly, we have no way to know when a block free occurs since that is entirely driven by the workload. In ZFS, our bias is to consider that blocks of similar size have a better chance of having similar life expectancy. We exploit this by maintaining a set of separate pointers within our metaslab and allocating blocks of similar size near each other.
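A toy illustration of that bias, assuming nothing about the real metaslab data structures: one bump cursor per size class, so blocks of the same size land next to each other while different size classes occupy different regions. The starting offsets are made up for the example.

```python
# Hypothetical per-size-class cursors within one metaslab; the region
# bases (0 and 16MB) are arbitrary values chosen for illustration.
cursors = {4096: 0, 131072: 1 << 24}

def alloc(size):
    """Place a block at the current cursor for its size class and bump it."""
    off = cursors[size]
    cursors[size] += size
    return off

a = alloc(4096)     # small blocks pack together at low offsets...
b = alloc(4096)
c = alloc(131072)   # ...large blocks pack together in their own region
print(a, b, c)
# → 0 4096 16777216
```

If the workload frees all its 4K blocks at once, the holes it leaves are contiguous and same-sized, which is far easier to reuse than a region riddled with mixed-size gaps.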
When ZFS first came of age, it had 2 strategies to pick blocks. The regular one led to good performance through aggregation, while another strategy was aimed at defragmenting and led to terrible performance. We would switch strategy when a metaslab started to have less than 30% free space within it. Customers voiced their discontent loudly. The 30% parameter was later reduced to 4%, but that didn't reduce the complaints1.
The other problem we had in the past was that when we needed to switch to a metaslab whose structure was not yet in core, we would block all allocating threads waiting on the in-core loading of the metaslab data. If that took too much time, we could leave devices idling in the process.
Finally, we would switch only when an allocation could not be satisfied, usually because a request was larger than the largest free block available. Forced to switch, we would then select the metaslab with the best2 free space. This meant that we would keep using metaslabs past their prime capacity in order to foster aggregated and streaming I/Os.
Today is a Better World
Fast-forward to today: we have evolved this whole process in very significant ways.
The thread-blocking problem was simple enough to fix: while a metaslab is being loaded, we quickly direct allocating threads to other devices. We therefore keep feeding the other devices more smoothly.
But the most important advance is that we no longer use an allocator that switches strategy based on the available space in the metaslab. Allocations of a given size are serviced from a chunk chosen such that the I/Os aggregate 16-fold: 4K allocations tend to consume 64K chunks, while 128K allocations look for 2MB chunks. Allocations of different sizes do not compete for the same-sized chunks. Finally, as the maximum size available in a metaslab is reduced, we simply and gradually scale down our potential for aggregation from that metaslab.
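The 16-fold rule and the graceful degradation can be sketched together. The free list here is just a sorted list of free-chunk sizes and `pick_chunk` is a hypothetical helper; the real allocator works on range trees, but the policy is the same: ask for 16× the allocation size, and settle for the largest chunk available when no such chunk exists.

```python
import bisect

def pick_chunk(free_sizes, alloc_size, agg_factor=16):
    """free_sizes: sorted list of free-chunk sizes in the metaslab.
    Return the size of the chunk this allocation would draw from."""
    want = alloc_size * agg_factor
    i = bisect.bisect_left(free_sizes, want)
    if i < len(free_sizes):
        return free_sizes[i]   # smallest chunk allowing full 16x aggregation
    # No chunk is big enough: degrade gracefully to the largest one left,
    # trading some aggregation potential rather than failing or switching.
    return free_sizes[-1]

free = [16 << 10, 64 << 10, 256 << 10, 2 << 20]
print(pick_chunk(free, 4096))        # 4K wants a 64K chunk → 65536
print(pick_chunk(free, 131072))      # 128K wants a 2MB chunk → 2097152
print(pick_chunk(free, 131072, 32))  # wants 4MB, settles for 2MB → 2097152
```

Because each size class draws from its own chunk, a stream of 4K writes and a stream of 128K writes never interleave within a chunk, and each stream still aggregates up to 16-fold on its own.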
Alongside this change, we decided to switch away from a metaslab as soon as it started to show signs of fatigue. As long as a metaslab is able to serve blocks of approximately 1MB, we keep allocating from it. But as soon as its biggest block size drops below this threshold, we go and pick the metaslab with the best free space.
Finally, to account for more frequent switches, we also decided to unload metaslabs less aggressively than before. This policy allows us to reuse a metaslab without incurring the cost of loading it, since that comes with both a CPU and an I/O cost.
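The switching policy of the last two paragraphs can be condensed into one decision function. The `Metaslab` class, the `weight` field, and the exact 1MB threshold are assumptions standing in for the real in-core state; `weight` plays the role of the "best free space" metric mentioned above.

```python
SWITCH_THRESHOLD = 1 << 20  # keep a metaslab while it can serve ~1MB blocks

class Metaslab:
    """Illustrative stand-in for a metaslab's allocation-relevant state."""
    def __init__(self, name, largest_free, weight):
        self.name = name
        self.largest_free = largest_free  # biggest free block, in bytes
        self.weight = weight              # desirability ("best" free space)

def next_active(active, candidates):
    if active.largest_free >= SWITCH_THRESHOLD:
        return active  # still healthy: keep streaming writes into it
    # Fatigued: switch to the most desirable candidate instead of
    # grinding the old one down to its last free fragments.
    return max(candidates, key=lambda m: m.weight)

ms_a = Metaslab("A", largest_free=512 << 10, weight=10)
ms_b = Metaslab("B", largest_free=8 << 20, weight=50)
print(next_active(ms_a, [ms_a, ms_b]).name)  # A can't serve 1MB → "B"
```

Switching early keeps aggregation potential high; keeping the old metaslab loaded for a while means a later switch back costs neither CPU nor I/O.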
With these changes in, we have an allocator that fosters aggregation very effectively and leads to performance that degrades gracefully as the pool fills up. This allocator has served us well over the many years it's been in place.
ZFS gives you great performance by handling writes in a way that streams the data to devices. This is effective and delivers maximum performance as long as there is large sequential free space on devices. For users, the real test is to monitor the average I/O size for writes: if, for the same workload mix, the write size starts to creep down, then it's time to consider adding storage space.
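As a closing sketch of that health check: average write size is just cumulative bytes written divided by cumulative write operations, the kind of counters tools such as `zpool iostat` report. The sample numbers below are invented for illustration.

```python
def avg_write_size(bytes_written, write_ops):
    """Average bytes per write from cumulative counters."""
    return bytes_written / write_ops if write_ops else 0.0

# Two hypothetical samples of the same workload, taken weeks apart:
early = avg_write_size(128 << 30, 1_200_000)  # ~112K per write
later = avg_write_size(128 << 30, 2_600_000)  # ~52K per write
print(later < early)
# → True: shrinking writes suggest free space is fragmenting
```

A steady decline in this number, with the workload unchanged, is the signal that large sequential free space is running out and more storage should be added.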
1 Some still remember this 30% factor and use it as a rule of thumb not to exceed, even though we have not used this allocator in years; tuning metaslab_df_free_pct has no effect on our systems.

2 I say best free space and not most free space since we actually boost the desirability of metaslabs with low addresses. For physical disks, outer tracks fly faster under a disk head, and that translates into more throughput. Even for SSDs we see a benefit: favoring one side of the address range means we reuse freed space more aggressively. Overwriting an SSD LBA range means that the flash cells holding the overwritten data can be recycled quickly by the FTL, which greatly simplifies its operation.