Concurrent Metaslab Syncing
As hinted in my previous article, spa_sync() is the function that runs whenever a pool needs to update its internal state. The thread running it is the master of ceremonies for the whole TXG syncing process. As such, it is the most visible of threads. At the same time, it's the thread we actually want to see idling. The spa_sync thread is set up to generate work for taskqs and then wait for that work to happen. That's why we often see spa_sync waiting in zio_wait or taskq_wait. This is what we expect that thread to be doing.
Let's dig into this process a bit more. While we do expect spa_sync to mostly be waiting, waiting is not the only thing it does. Before it waits, it has to farm out work to those taskqs. Every TXG, spa_sync wakes up and starts to create work for the zio taskq threads. Those threads immediately pick up the initial tasks posted by spa_sync and just as quickly generate load for the pool devices. Our goal is simply to keep the taskqs and, more importantly, the devices fed with work.
And so we have this single spa_sync thread quickly posting work to the zio taskqs, whose threads work on checksum computation and other CPU-intensive tasks. This model ensures that the disk queues are non-empty for the duration of the data update portion of a TXG.
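To make the dispatch-then-wait pattern concrete, here is a minimal sketch in Python. It is only an illustration of the shape of the model, not ZFS code: the names sync_txg and checksum_block are hypothetical, a thread pool stands in for the zio taskqs, and a SHA-256 digest stands in for the CPU-intensive stages.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, wait


def checksum_block(block: bytes) -> str:
    """CPU-bound work done by a worker (stand-in for a zio pipeline stage)."""
    return hashlib.sha256(block).hexdigest()


def sync_txg(dirty_blocks, workers=4):
    """Dispatcher: post every task first, then block until all are done."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Posting is cheap: the dispatcher only enqueues work.
        futures = [pool.submit(checksum_block, b) for b in dirty_blocks]
        # From here on the dispatcher idles, much like spa_sync
        # sitting in zio_wait or taskq_wait.
        wait(futures)
        return [f.result() for f in futures]


# Eight distinct 4 KiB blocks as hypothetical dirty data for one TXG.
blocks = [bytes([i]) * 4096 for i in range(8)]
checksums = sync_txg(blocks)
```

The point of the sketch is where the dispatcher spends its time: almost all of it inside wait(), while the workers do the real work.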
In practice, that single spa_sync thread is able to generate enough tasks to service the most demanding environments. When we hit some form of pool saturation, we typically see spa_sync waiting on a zio, and that is just the expected sign that something at the I/O level below ZFS is the current limiting factor.
But, not too long ago, there was a grain of sand in this beautiful clockwork. After spa_sync was all done with ... well, waiting ... it had a final step to run before updating the uberblock. It would walk through all the devices and process all the space map updates, keeping track of all the allocs and frees. In many cases, this was a quick on-CPU operation done by the spa_sync thread. But when dealing with a large number of deletions, it could take a significant amount of time. And it was definitely something that spa_sync was tackling itself, as opposed to farming it out to workers.
A project was spawned to fix this, and during the evaluation the ZFS engineer figured out that a lot of the work could be handled in the earlier stages of the zio processing, further reducing the amount of work we would have to wait on in the later stages of spa_sync.
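The difference between the old and new behavior can be sketched with a toy model. Everything here is an assumption made for illustration: the per-device logs of ("alloc" or "free", size) records, and the names sync_device, sync_serial and sync_concurrent, are hypothetical and do not reflect the real space map format or the actual fix. The sketch only shows the dispatcher doing the per-device accounting itself versus handing it to workers and waiting.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_device(records):
    """Net space change for one device from its (op, size) records."""
    allocated = sum(size for op, size in records if op == "alloc")
    freed = sum(size for op, size in records if op == "free")
    return allocated - freed


def sync_serial(devices):
    """Old shape: the dispatcher walks every device itself."""
    return {name: sync_device(recs) for name, recs in devices.items()}


def sync_concurrent(devices, workers=4):
    """New shape: one task per device, the dispatcher only waits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(sync_device, recs)
                   for name, recs in devices.items()}
        return {name: f.result() for name, f in futures.items()}


# Hypothetical allocs and frees recorded during one TXG.
devices = {
    "disk0": [("alloc", 8192), ("free", 4096)],
    "disk1": [("free", 16384), ("alloc", 4096), ("alloc", 4096)],
}
serial_result = sync_serial(devices)
concurrent_result = sync_concurrent(devices)
```

Both functions compute the same totals. What changes is who does the accounting: in the concurrent version the dispatcher goes back to waiting.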
This fix was a very important step in making sure that the critical thread running spa_sync spends most of its time ... waiting.