The new ZFS write throttle

A very significant improvement is coming soon to ZFS: a change that will increase the general quality of service it delivers. Interestingly, it is a change that might also slow down your microbenchmark, but it is nevertheless one you should be eager for.

Write throttling

For a filesystem, write throttling designates the act of blocking an application for some amount of time, as short as possible, while waiting for the proper conditions to allow the write system calls to succeed. Write throttling is normally required because applications can write to memory (dirty memory pages) at a rate significantly faster than the kernel can flush the data to disk. Many workloads dirty memory pages by writing to the filesystem page cache at near memory copy speed, possibly using multiple threads issuing high rates of filesystem writes. Concurrently, the filesystem is doing its best to drain all that data to the disk subsystem.

Given the constraints, the time to empty the filesystem cache to disk can be longer than the time required for applications to dirty the cache. Even if one considers storage with fast NVRAM, under sustained load, that NVRAM will fill up to a point where it needs to wait for a slow disk I/O to make room for more data to get in.

When committing data to a filesystem in bursts, it can be quite desirable to push the data at memory speed and then drain the cache to disk during the lapses between bursts. But when data is generated at a sustained high rate, lack of throttling leads to total memory depletion. We thus need at some point to try and match the application data rate with that of the I/O subsystem. This is the primary goal of write throttling.
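
To put rough numbers on that imbalance, here is a minimal sketch of the arithmetic. The function name and the figures (16 GiB of free memory, 2 GiB/s of application writes, 200 MiB/s of disk throughput) are hypothetical, chosen only for illustration.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def seconds_until_full(free_bytes, write_rate, drain_rate):
    """Seconds until the cache fills up, or None if the disks keep up.

    write_rate and drain_rate are in bytes per second.
    """
    net_fill_rate = write_rate - drain_rate
    if net_fill_rate <= 0:
        return None  # the I/O subsystem drains at least as fast as we dirty
    return free_bytes / net_fill_rate


# Hypothetical sustained load: memory is gone in well under ten seconds.
print(seconds_until_full(16 * GiB, 2 * GiB, 200 * MiB))
```

With a bursty workload the same function returns None as soon as the average write rate drops below the drain rate, which is the case where no throttling is needed.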

A secondary goal of write throttling is to prevent massive data loss. When applications do not manage I/O synchronization (i.e., they don't use O_DSYNC or fsync), data ends up cached in the filesystem, and the contract is that there is no guarantee that the data will still be there if a system crash were to occur. So even if the filesystem cannot be blamed for such data loss, it is still a nice feature to help prevent such massive losses.

Case in point : UFS Write throttling

For instance, UFS would use the fsflush daemon to try to keep data exposed for no more than 30 seconds (the default value of autoup). Also, UFS would keep track of the amount of I/O outstanding for each file. Once too much I/O was pending, UFS would throttle writers for that file. This was controlled through ufs_HW and ufs_LW, and their values were commonly tuned (a bad sign). Eventually the old default values were updated, and they seem to work nicely today. UFS write throttling thus operates on a per-file basis. While there are some merits to this approach, it can be defeated, as it does not manage the imbalance between memory and disks at a system level.
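
A per-file throttle of this kind can be sketched as follows. This is an illustration, not the UFS code: the class and method names are hypothetical, and the hysteresis (block above a high-water mark, resume at or below a low-water mark) is an assumption suggested by the names ufs_HW and ufs_LW.

```python
class PerFileThrottle:
    """Toy model of a per-file write throttle with two watermarks."""

    def __init__(self, high_water, low_water):
        self.high_water = high_water  # plays the role of ufs_HW (assumed)
        self.low_water = low_water    # plays the role of ufs_LW (assumed)
        self.pending = 0              # bytes of I/O outstanding for this file
        self.throttled = False

    def issue_write(self, nbytes):
        """Return True if the write is accepted, False if the writer must wait."""
        if self.throttled:
            return False
        self.pending += nbytes
        if self.pending > self.high_water:
            self.throttled = True  # later writers wait until we drain
        return True

    def complete_io(self, nbytes):
        """Account for I/O that reached the disk and maybe release writers."""
        self.pending = max(0, self.pending - nbytes)
        if self.throttled and self.pending <= self.low_water:
            self.throttled = False
```

Note that each file gets its own instance: a thousand files each sitting just below the high-water mark are never throttled, which is the system-level weakness described above.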

ZFS Previous write throttling

ZFS is designed around the concept of transaction groups (txg). Normally, every 5 seconds an _open_ txg goes to the quiesced state. From that state the quiesced txg will go to the syncing state, which sends dirty data to the I/O subsystem. For each pool, there is at most one txg in each of the 3 states: open, quiescing, syncing. Write throttling used to occur when the 5 second txg clock would fire while the syncing txg had not yet completed. The open group would wait on the quiesced one, which waits on the syncing one. Application writers (write system call) would block, possibly for a few seconds, waiting for a txg to open. In other words, if a txg took more than 5 seconds to sync to disk, we would globally block writers, thus matching their speed with that of the I/O. But if a workload had a bursty write behavior that could be synced during the allotted 5 seconds, applications would never be throttled.
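
As a very crude model of that behavior (an assumption made for illustration, not the actual txg code), writers stall for however long a sync overruns the 5 second clock. The function name is hypothetical.

```python
TXG_INTERVAL = 5.0  # seconds between txg clock ticks, as described above


def writer_stalls(sync_times, interval=TXG_INTERVAL):
    """For each txg sync duration (seconds), return the time writers block.

    Simplification: a sync that finishes within the interval costs writers
    nothing; one that overruns blocks them for the length of the overrun.
    """
    return [max(0.0, t - interval) for t in sync_times]


# Two syncs that fit in the window, then one that overruns by 3.5 seconds.
print(writer_stalls([2.0, 5.0, 8.5]))
```

The model makes the weakness visible: nothing bounds how long a sync can take, so nothing bounds the stall either.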

The Issue

But ZFS did not sufficiently control the amount of data that could get into an open txg. As long as the ARC cache was no more than half dirty, ZFS would accept data. For a large memory machine or one with weak storage, this was likely to cause long txg sync times. The downsides were many :

	- if we did end up throttled, long sync times meant the system
	behavior would be sluggish for seconds at a time.

	- long txg sync times also meant that the granularity at which
	we could generate snapshots would be impacted.

	- we ended up with lots of pending data in the cache, all of
	which could be lost in the event of a crash.

	- the ZFS I/O scheduler, which prioritizes operations, was also
	negatively impacted.

	- by not throttling, we had the possibility that sequential
	writes on large files could displace from the ARC a very large
	number of smaller objects. Refilling that data meant a very
	large number of disk I/Os. Not throttling can paradoxically
	end up being very costly for performance.

	- the previous code could also, at times, not be issuing I/Os
	to disk for seconds even though the workload was
	critically dependent on storage speed.

	- and foremost, lack of throttling depleted memory and prevented
	ZFS from reacting to memory pressure.

That ZFS is considered a memory hog is most likely the result of the previous throttling code. Once a proper solution is in place, it will be interesting to see if we behave better on that front.

The Solutions

The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory.
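
The adjustment can be pictured with a minimal sketch. It assumes a simple proportional rule (measured throughput times target sync time) and is not the ZFS implementation, which may smooth or bound the value differently; the function and parameter names are hypothetical, and only txg_time and the 1/8th clamp come from the description above.

```python
TARGET_SYNC_SECONDS = 5.0  # the txg_time target mentioned above


def next_write_limit(bytes_synced, sync_seconds, physmem_bytes,
                     target=TARGET_SYNC_SECONDS):
    """Estimate how many bytes the next txg may accept.

    bytes_synced / sync_seconds is the throughput observed during the last
    sync; scaling it by the target time gives the amount of data that should
    take about `target` seconds to sync. The result is clamped to 1/8th of
    physical memory.
    """
    cap = physmem_bytes // 8
    if sync_seconds <= 0:
        return cap  # no usable measurement yet: fall back to the clamp
    throughput = bytes_synced / sync_seconds
    return int(min(throughput * target, cap))


# Last txg: 1 GiB synced in 10 seconds on a 64 GiB machine, so the next
# txg is limited to about half a GiB.
print(next_write_limit(1024 ** 3, 10.0, 64 * 1024 ** 3))
```

On fast storage the proportional estimate can exceed the clamp, in which case the 1/8th of memory bound is what limits the txg.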

And to avoid the system-wide, seconds-long throttle effect, the new code will detect when we are dangerously close to that situation (7/8th of the limit) and will insert 1 tick delays for applications issuing writes. This prevents a write-intensive thread from hogging the available space and starving out other threads. This delay should also generally prevent the system-wide throttle.

So the new steady state behavior of write-intensive workloads is that, starting with an empty TXG, all threads will be allowed to dirty memory at full speed until a first threshold of bytes in the TXG is reached. At that point, every write system call will be delayed by 1 tick, thus significantly slowing down the pace of writes. If the previous TXG completes its I/Os, the current TXG will be allowed to resume at full speed. But in the unlikely event that a workload, despite the per-write 1 tick delay, manages to fill the TXG up to the full threshold, we will be forced to throttle all writes in order to allow the storage to catch up.
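
The three regimes can be summarized as a small decision function. This is a sketch under stated assumptions rather than the actual code: the names are hypothetical, and the 7/8th threshold is taken from the description above.

```python
from enum import Enum


class Action(Enum):
    FULL_SPEED = "full speed"
    DELAY_ONE_TICK = "delay one tick"
    BLOCK = "block until the storage catches up"


def write_action(dirty_bytes, write_limit, previous_txg_syncing=True):
    """Decide how to treat a write given the bytes already in the open TXG."""
    if dirty_bytes >= write_limit:
        return Action.BLOCK  # full threshold reached: throttle everyone
    # Integer comparison equivalent to dirty_bytes >= 7/8 * write_limit.
    near_limit = dirty_bytes * 8 >= write_limit * 7
    if near_limit and previous_txg_syncing:
        return Action.DELAY_ONE_TICK
    return Action.FULL_SPEED


# With a hypothetical limit of 800 bytes, the delay starts at 700 bytes.
for dirty in (0, 699, 700, 800):
    print(dirty, write_action(dirty, 800).value)
```

Passing previous_txg_syncing=False models the case where the previous TXG has completed its I/Os and writers resume at full speed.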

It should make the system much better behaved and generally more performant under sustained write stress.

If you are the owner of an unlucky workload that ends up slowed by more throttling, do consider the other benefits that you get from the new code. If that does not compensate for the loss, get in touch and tell us what your needs are on that front.


It sounds great, though I hope that the 1/8 number is tunable.

Posted by Marc on May 14, 2008 at 08:19 AM MEST #

Yep. There are tunables for the target sync times and the ratio of memory that can be used, as well as an override of that value, and we can turn off the whole thing if necessary.

Posted by Roch on May 14, 2008 at 08:59 AM MEST #

For our situation we do the following and look at write throughput:
create a little sybase database and pick an i/o size (we use 16k i/o). Then start adding devices to the database. We use 32GB devices, which usually blows through any write cache a storage array has. A dba does an alter database to add the storage, and we see how the storage system handles i/o.

Posted by dean ross-smith on May 14, 2008 at 09:01 AM MEST #

Excellent news. I came across this problem in Fall 2006 and have been waiting for a solution. Any idea when we can expect it in the Solaris (OpenSolaris won't do for us) release?

Posted by Anantha on May 14, 2008 at 10:49 AM MEST #

Good news. As high speed disks and arrays keep developing quickly, the 1/8 number might need tuning. I also hope to see another variable to disable it. On the other hand, a zfs pool might hold several kinds of storage (from legacy to the latest) in the same pool. Doesn't this throttling hurt zfs?

Posted by Bonghwan Kim on May 14, 2008 at 11:59 AM MEST #

Which build do you expect this to integrate into Nevada?

Also, is the timer that decides when to sync to disk (as distinct from the timer checking how long a sync takes to complete), the 5 second timer, now a tunable? Or does this now dynamically adjust?

Posted by andrewk8 on May 14, 2008 at 02:34 PM MEST #

Bonghwan : my take is that the circumstances where this will hurt will be very, very specific, while the benefits will be felt much more generally. But yes, there will be tunables to control how this all works.

Andrewk8 : there is a timer that decides when to sync. And there is code to measure how long a sync takes which I would not call a timer. At low load, a txg will be cut every 30 seconds (tunable). Under stress a txg will be cut every time ZFS estimates that the sync time will match the target time of 5 seconds (tunable).

Posted by Roch on May 14, 2008 at 04:11 PM MEST #

Still, write throttling implemented at the file system level
cannot address "the imbalance between memory and disks at
a system level" fully, because there are other sources of dirty
pages, like anonymous pages. It seems that the only way to
handle this correctly is to put generic throttling mechanism in
VM, isn't it?

Posted by nikita on May 15, 2008 at 04:26 AM MEST #

Excellent writeup!

Is there a PSARC case or BugID associated that I can track progress with?

Posted by benr on May 15, 2008 at 05:21 AM MEST #

Nikita : unless I misunderstand what you mean, the VM already has its throttling mechanism for dealing with too much demand and memory shortfall. A FS would like to avoid being the cause of VM shortfall, and so this is about taking measures to avoid that.

Benr : the bug IDs are 6429205 and 6415647.

Posted by Roch on May 16, 2008 at 08:30 AM MEST #

Roch, I looked up the bug IDs and they don't seem to indicate what build of Nevada these new features will be available in. IHAC with the exact problem you describe using Nevada today and seeking help. Thanks, Tim.

Posted by Tim Thomas on June 12, 2008 at 05:50 AM MEST #

Ref my previous comment. It looks like these changes are in snv b87 and later.

Posted by Tim Thomas on June 12, 2008 at 09:08 AM MEST #

How is the 1 tick delay implemented?

Posted by tildeleb on July 16, 2008 at 05:59 PM MEST #
