A very significant improvement is coming soon to ZFS. A change that will increase the general quality of service delivered by ZFS. Interestingly it's a change that might also slow down your microbenchmark but nevertheless it's a change you should be eager for.
For a filesystem, write throttling designates the act of blocking application for some amount of time, as short as possible, waiting for the proper conditions to allow the write system calls to succeed. Write throttling is normally required because applications can write to memory (dirty memory pages) at a rate significantly faster than the kernel can flush the data to disk. Many workloads dirty memory pages by writing to the filesystem page cache at near memory copy speed, possibly using multiple threads issuing high-rates of filesystem writes. Concurrently, the filesystem is doing it's best to drain all that data to the disk subsystem.
Given the constraints, the time to empty the filesystem cache to disk can be longer than the time required for applications to dirty the cache. Even if one considers storage with fast NVRAM, under sustained load, that NVRAM will fill up to a point where it needs to wait for a slow disk I/O to make room for more data to get in.
When committing data to a filesystem in bursts, it can be quite desirable to push the data at memory speed and then drain the cache to disk during the lapses between bursts. But when data is generated at a sustained high rate, lack of throttling leads to total memory depletion. We thus need at some point to try and match the application data rate with that of the I/O subsystem. This is the primary goal of write throttling.
A secondary goal of write throttling is to prevent massive data loss. When applications do not manage I/O synchronization (i.e don't use O_DSYNC and fsync), data ends up cached in the filesystem and the contract is that there is no guarantee that the data will still be there if a system crash were to occur. So even if the filesystem cannot be blamed for such data loss, it is still a nice feature to help prevent such massive losses.
Case in point : UFS Write throttling
For instance UFS would use the fsflush daemon to try to keep data exposed for no more than 30 seconds (default value of autoup). Also, UFS would keep track of the amount of I/O outstanding for each file. Once too much I/O was pending, UFS would throttle writers for that file. This was controlled through ufs_HW, ufs_LW and their values were commonly tuned (a bad sign). Eventually old defaults values were updated and seem to work nicely today. UFS write throttling thus operates on a per file basis. While there are some merits to this approach, it can be defeated as it does not manage the imbalance between memory and disks at a system level.
ZFS Previous write throttling
ZFS is designed around the concept of transaction groups (txg). Normally, every 5 seconds an _open_ txg goes to the quiesced state. From that state the quiesced txg will go to the syncing state which sends dirty data to the I/O subsystem. For each pool, there are at most 1 txg in each of the 3 states, open, quiescing, syncing. Write throttling used to occur when the 5 second txg clock would fire while the syncing txg had not yet completed. The open group would wait on the quiesced one which waits on the syncing one. Application writers (write system call) would block, possibly a few seconds, waiting for a txg to open. In other words, if a txg took more than 5 seconds to sync to disk, we would globally block writers thus matching their speed with that of the I/O. But if a workload had a bursty write behavior that could be synced during the allotted 5 seconds, application would never be throttled.
But ZFS did not sufficiently controled the amount of data that could get in an open txg. As long as the ARC cache was no more than half dirty, ZFS would accept data. For a large memory machine or one with weak storage, this was likely to cause long txg sync times. The downsides were many :
- if we did ended up throttled, long sync times meant the system
behavior would be sluggish for seconds at a time.
- long txg sync times also meant that our granularity at which
we could generate snapshots would be impacted.
- we ended up with lots of pending data in the cache all of
which could be lost in the event of a crash.
- the ZFS I/O scheduler which prioritizes operations was also
- By not throttling we had the possibility that
sequential writes on large files could displace from the ARC
a very large number of smaller objects. Refilling
that data meant very large number of disk I/Os.
Not throttling can paradoxically end up as very
costly for performance.
- the previous code also could at times, not be issuing I/Os
to disk for seconds even though the workload was
critically dependant of storage speed.
- And foremost, lack of throttling depleted memory and prevented
ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the results of the the previous throttling code. Once a proper solution is in place, it will be interesting to see if we behave better on that front.
The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory.
And to avoid the system wide and seconds long throttle effect, the new code will detect when we are dangerously close to that situation (7/8th of the limit) and will insert 1 tick delays for applications issuing writes. This prevents a write intensive thread from hogging the available space starving out other threads. This delay should also generally prevent the system wide throttle.
So the new steady state behavior of write intensive workloads is that, starting with an empty TXG, all threads will be allowed to dirty memory at full speed until a first threshold of bytes in the TXG is reached. At that time, every write system call will be delayed by 1 tick thus significantly slowing down the pace of writes. If the previous TXG completes it's I/Os, then the current TXG will then be allowed to resume at full speed. But in the unlikely event that a workload, despite the per write 1-tick delay, manages to fill up the TXG up to the full threshold we will be forced to throttle all writes in order to allow the storage to catch up.
It should make the system much better behaved and generally more performant under sustained write stress.
If you are owner of an unlucky workload that ends up as slowed by more throttling, do consider the other benefits that you get from the new code. If that does not compensate for the loss, get in touch and tell us what your needs are on that front.