System Duty Cycle Scheduling Class
It's well known that ZFS uses a bulk update model to keep the information stored on disk consistent. This is referred to as a transaction group (TXG) update, or internally as a spa_sync(), after the function that orchestrates the task. The task ultimately updates the uberblock, which moves the pool from one consistent ZFS state to the next.
Today these tasks are expected to run on a 5-second schedule with some leeway. Internally, ZFS builds up the data structures such that when a new TXG is ready to be issued it can do so in the most efficient way possible. That method turned out to be a mixed blessing.
The story is that when ZFS is ready, it uses zio taskqs to execute all of the heavy-lifting, CPU-intensive jobs necessary to complete the TXG. This includes checksumming every modified block and possibly compressing and encrypting it. The same taskqs also perform on-disk allocation and issue I/O to the disk drivers. So there is a lot of CPU-intensive work to do when a TXG is ready to go. The zio subsystem was crafted in such a way that when this activity does show up, the taskqs that manage the work never need to context switch out. The taskq threads can run on CPU for seconds on end. That created a new headache for the Solaris scheduler.
Things would not have been so bad if ZFS were the only service being provided. But our systems, of course, deliver a variety of services, and non-ZFS clients were being shortchanged by the scheduler. It turns out that before this use case, most kernel threads had short spans of execution. Kernel threads were therefore never made preemptible, and nothing would prevent them from executing continuously (seconds is the same as infinity for a computer). With ZFS, we now had a new type of kernel thread, one that frequently consumed significant amounts of CPU time.
A team of Solaris engineers went on to design a new scheduling class specifically targeting this kind of bulk processing. Putting the zio taskqs in this class allowed those threads to become preemptible when they used too much CPU. We also changed our model to limit the number of CPUs dedicated to these intensive taskqs. Today, each pool may use at most 50% of the system's CPUs to run these tasks. This is managed by the kernel parameter zio_taskq_batch_pct, whose default was reduced from 100% to 50%.
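To make the two mechanisms concrete, here is a minimal sketch of the arithmetic involved. It is not the Solaris implementation. The function names, the rounding rule (round down, but never below one thread), and the 80% duty-cycle target are all assumptions made for illustration. Only the zio_taskq_batch_pct values of 100 and 50 come from the text above.

```python
def batch_taskq_threads(ncpus: int, batch_pct: int = 50) -> int:
    """Hypothetical helper: threads a percentage-sized taskq would get.

    Assumes the count is batch_pct percent of the CPUs, rounded down,
    with a floor of one thread so a single-CPU system still makes progress.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    if not 0 <= batch_pct <= 100:
        raise ValueError("batch_pct must be between 0 and 100")
    return max(1, ncpus * batch_pct // 100)


def over_duty_cycle(on_cpu_ms: int, window_ms: int, target_pct: int = 80) -> bool:
    """Hypothetical check: did a thread exceed its CPU share in a window?

    The 80% default target is an illustrative assumption, not a documented value.
    A thread over its target is a candidate for being preempted.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    # Compare on_cpu/window against target/100 without floating point.
    return on_cpu_ms * 100 > target_pct * window_ms


if __name__ == "__main__":
    for ncpus in (1, 8, 64):
        old = batch_taskq_threads(ncpus, 100)  # previous setting: 100%
        new = batch_taskq_threads(ncpus, 50)   # current setting: 50%
        print(f"{ncpus} CPUs: {old} threads at 100%, {new} threads at 50%")
```

Under these assumptions, a 64-CPU system goes from 64 batch threads per pool to 32, and a thread that spent 900 ms on CPU in a 1000 ms window would be flagged as over an 80% target.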
With these two features, we are now much better equipped to let the TXG proceed at top speed without starving applications of CPU access. In the end, running applications is all that matters.