The Wonders of ZFS Storage
Performance for your Data

  • ZFS
    May 14, 2008

The new ZFS write throttle

Roch Bourbonnais
Principal Performance Engineer

A very significant improvement is coming soon to ZFS: a change that will increase the general quality of service delivered by ZFS. Interestingly, it is a change that might also slow down your microbenchmark, but it is nevertheless one you should be eager for.

Write throttling

For a filesystem, write throttling designates the act of blocking an application for some amount of time, as short as possible, while waiting for the proper conditions to allow its write system calls to succeed. Write throttling is normally required because applications can write to memory (dirtying memory pages) at a rate significantly faster than the kernel can flush the data to disk. Many workloads dirty memory pages by writing to the filesystem page cache at near memory-copy speed, possibly using multiple threads issuing high rates of filesystem writes. Concurrently, the filesystem is doing its best to drain all that data to the disk subsystem.

Given the constraints, the time to empty the filesystem cache to disk can be longer than the time required for applications to dirty the cache. Even if one considers storage with fast NVRAM, under sustained load, that NVRAM will fill up to a point where it needs to wait for a slow disk I/O to make room for more data to get in.

When committing data to a filesystem in bursts, it can be quite desirable to push the data at memory speed and then drain the cache to disk during the lapses between bursts. But when data is generated at a sustained high rate, lack of throttling leads to total memory depletion. We thus need at some point to try and match the application data rate with that of the I/O subsystem. This is the primary goal of write throttling.

A secondary goal of write throttling is to prevent massive data loss. When applications do not manage I/O synchronization (i.e., don't use O_DSYNC or fsync), data ends up cached in the filesystem, and the contract is that there is no guarantee the data will still be there if the system crashes. So even if the filesystem cannot be blamed for such data loss, it is still a nice feature to help prevent such massive losses.

Case in point: UFS write throttling

For instance, UFS would use the fsflush daemon to try to keep data exposed for no more than 30 seconds (the default value of autoup). UFS would also keep track of the amount of I/O outstanding for each file. Once too much I/O was pending, UFS would throttle writers for that file. This was controlled through ufs_HW and ufs_LW, whose values were commonly tuned (a bad sign). Eventually the old default values were updated, and they seem to work nicely today. UFS write throttling thus operates on a per-file basis. While there are some merits to this approach, it can be defeated because it does not manage the imbalance between memory and disks at a system level.
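The per-file scheme can be sketched roughly as follows. The text above only names the tunables ufs_HW and ufs_LW; the hysteresis shown here (throttle once pending bytes exceed a high-water mark, release once they fall below a low-water mark), the class name, and the byte counts are assumptions made for illustration, not the UFS code.

```python
class FileWriteThrottle:
    """Hypothetical per-file throttle with high/low water marks."""

    def __init__(self, high_water, low_water):
        if low_water > high_water:
            raise ValueError("low_water must not exceed high_water")
        self.high_water = high_water   # plays the role of ufs_HW (assumed)
        self.low_water = low_water     # plays the role of ufs_LW (assumed)
        self.pending = 0               # bytes of I/O outstanding for this file
        self.throttled = False

    def issue(self, nbytes):
        """Account for a new write; return True if writers must now wait."""
        self.pending += nbytes
        if self.pending > self.high_water:
            self.throttled = True
        return self.throttled

    def complete(self, nbytes):
        """Account for finished I/O; return True if writers must still wait."""
        self.pending = max(0, self.pending - nbytes)
        if self.pending < self.low_water:
            self.throttled = False
        return self.throttled


f = FileWriteThrottle(high_water=100, low_water=50)
f.issue(60)      # 60 bytes pending, below the high mark: no throttle
f.issue(60)      # 120 bytes pending, above the high mark: throttle
f.complete(40)   # 80 bytes pending, still above the low mark: keep throttling
f.complete(40)   # 40 bytes pending, below the low mark: release
```

Note that nothing in this state is shared between files, which is why such a scheme cannot see a system-wide imbalance between memory and disks.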

Previous ZFS write throttling

ZFS is designed around the concept of transaction groups (txg). Normally, every 5 seconds an _open_ txg goes to the quiescing state. From there, the quiesced txg moves to the syncing state, which sends dirty data to the I/O subsystem. For each pool, there is at most one txg in each of the three states: open, quiescing, and syncing. Write throttling used to occur when the 5-second txg clock fired while the syncing txg had not yet completed. The open group would wait on the quiesced one, which in turn waits on the syncing one. Application writers (write system calls) would block, possibly for a few seconds, waiting for a txg to open. In other words, if a txg took more than 5 seconds to sync to disk, we would globally block writers, thus matching their speed with that of the I/O. But if a workload had bursty write behavior that could be synced during the allotted 5 seconds, the application would never be throttled.

The Issue

But ZFS did not sufficiently control the amount of data that could go into an open txg. As long as the ARC cache was no more than half dirty, ZFS would accept data. For a large-memory machine or one with weak storage, this was likely to cause long txg sync times. The downsides were many:

- If we did end up throttled, long sync times meant the system behavior would be sluggish for seconds at a time.
- Long txg sync times also meant that the granularity at which we could generate snapshots would be impacted.
- We ended up with lots of pending data in the cache, all of which could be lost in the event of a crash.
- The ZFS I/O scheduler, which prioritizes operations, was also negatively impacted.
- By not throttling, we had the possibility that sequential writes on large files could displace a very large number of smaller objects from the ARC. Refilling that data meant a very large number of disk I/Os. Not throttling can paradoxically end up being very costly for performance.
- The previous code could also, at times, issue no I/Os to disk for seconds even though the workload was critically dependent on storage speed.
- And foremost, lack of throttling depleted memory and prevented ZFS from reacting to memory pressure.

That ZFS is considered a memory hog is most likely the result of the previous throttling code. Once a proper solution is in place, it will be interesting to see if we behave better on that front.
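To put rough numbers on the long sync times mentioned above, here is a back-of-the-envelope calculation. The machine size and disk rate are made-up figures, and the model (flushing half of the ARC at a constant disk rate) is a simplification, not a measurement.

```python
GIB = 2**30
MIB = 2**20


def worst_case_sync_seconds(arc_bytes, disk_bytes_per_sec, dirty_fraction=0.5):
    """Time to flush the dirty part of the cache at a steady disk rate."""
    return arc_bytes * dirty_fraction / disk_bytes_per_sec


# Hypothetical machine: 32 GiB of ARC, storage sustaining 200 MiB/s.
seconds = worst_case_sync_seconds(32 * GIB, 200 * MIB)
print(f"{seconds:.1f} s")  # prints 81.9 s
```

Under these assumed figures, a single txg could take well over a minute to sync, far beyond the 5-second cadence.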

The Solutions

The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory.
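The adjustment described above can be sketched as follows. This is a minimal illustration, not the ZFS code: the function and parameter names are hypothetical, and falling back to the memory ceiling when no sync has been measured yet is an assumption.

```python
def next_txg_limit(bytes_synced, sync_seconds, physmem_bytes,
                   target_seconds=5.0, mem_fraction=1 / 8):
    """Pick how many bytes the next txg may accept.

    The limit is the amount the pool could sync in target_seconds at the
    throughput just observed, clamped to a fraction of physical memory.
    """
    ceiling = int(physmem_bytes * mem_fraction)
    if bytes_synced <= 0 or sync_seconds <= 0:
        # Nothing measured yet: assume the memory ceiling (an assumption).
        return ceiling
    throughput = bytes_synced / sync_seconds        # bytes per second
    return min(int(throughput * target_seconds), ceiling)


GIB = 2**30
MB = 10**6

# Slow storage: 1000 MB took 10 s to sync, so 100 MB/s, so 500 MB per txg.
slow = next_txg_limit(1000 * MB, 10.0, physmem_bytes=16 * GIB)

# Fast storage: 8 GiB synced in 2 s would allow 20 GiB, but the clamp
# holds the limit at 1/8th of the 16 GiB of memory, i.e. 2 GiB.
fast = next_txg_limit(8 * GIB, 2.0, physmem_bytes=16 * GIB)
```

The clamp matters mostly on fast storage, where the throughput-based estimate alone would let a txg grow to a large share of memory.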

To avoid the system-wide, seconds-long throttle effect, the new code detects when we are dangerously close to that situation (7/8th of the limit) and inserts 1-tick delays for applications issuing writes. This prevents a write-intensive thread from hogging the available space and starving out other threads. This delay should also generally prevent the system-wide throttle.

So the new steady-state behavior of write-intensive workloads is that, starting with an empty TXG, all threads are allowed to dirty memory at full speed until a first threshold of bytes in the TXG is reached. At that point, every write system call is delayed by 1 tick, significantly slowing down the pace of writes. If the previous TXG completes its I/Os, the current TXG is allowed to resume at full speed. But in the unlikely event that a workload, despite the per-write 1-tick delay, manages to fill the TXG up to the full threshold, we are forced to throttle all writes in order to allow the storage to catch up.
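The three regimes above can be summarized in a small decision function. This is a sketch under stated assumptions: the names are hypothetical, the first threshold is taken to be 7/8th of the limit as mentioned earlier, and blocking at the full limit regardless of the previous TXG's state is a simplification.

```python
from enum import Enum


class Action(Enum):
    FULL_SPEED = "full speed"
    DELAY_ONE_TICK = "delay one tick"
    BLOCK = "block until the storage catches up"


def throttle_action(dirty_bytes, limit_bytes, previous_txg_syncing=True):
    """Decide what happens to a write given how full the open TXG is."""
    if dirty_bytes >= limit_bytes:
        # Full threshold reached: all writers are throttled.
        return Action.BLOCK
    if previous_txg_syncing and dirty_bytes * 8 >= limit_bytes * 7:
        # Past 7/8th of the limit while the previous TXG is still syncing.
        return Action.DELAY_ONE_TICK
    return Action.FULL_SPEED


limit = 800
print(throttle_action(100, limit).value)   # full speed
print(throttle_action(700, limit).value)   # delay one tick
print(throttle_action(800, limit).value)   # block until the storage catches up
```

Integer arithmetic (`dirty_bytes * 8 >= limit_bytes * 7`) keeps the 7/8th comparison exact for any byte count.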

The new code should make the system much better behaved and generally more performant under sustained write stress.

If you are the owner of an unlucky workload that ends up slowed down by the additional throttling, do consider the other benefits you get from the new code. If they do not compensate for the loss, get in touch and tell us what your needs are on that front.

Join the discussion

Comments ( 13 )
  • Marc Wednesday, May 14, 2008

    It sounds great, though I hope that the 1/8 number is tunable.

  • Roch Wednesday, May 14, 2008

    Yep. There are tunables for the target sync time and the ratio of memory that can be used, plus an override of that value, and we can turn off the whole thing if necessary.

  • dean ross-smith Wednesday, May 14, 2008

    For our situation we do the following and look at write throughput:
    create a little Sybase database and pick an I/O size (we use 16k I/O). Then start adding devices to the database. We use 32GB devices, which usually blows through any write cache a storage array has. Once a DBA does an alter database to add the storage, we see how the storage system handles I/O.

  • Anantha Wednesday, May 14, 2008

    Excellent news. I came across this problem in Fall 2006 and have been waiting for a solution. Any idea when we can expect it in the Solaris (OpenSolaris won't do for us) release?

  • Bonghwan Kim Wednesday, May 14, 2008

    Good news. As high-speed disks and arrays keep getting faster, the 1/8 figure may need tuning. I also hope to see another variable to disable it. On the other hand, a ZFS pool may hold several kinds of storage (from legacy to the latest) in the same pool. Doesn't this throttling hurt ZFS in that case?

  • andrewk8 Wednesday, May 14, 2008

    Which build do you expect this to integrate into Nevada?

    Also, is the timer that decides when to sync to disk (as distinct from the timer checking how long a sync takes to complete) now a tunable - the 5 second timer. Or does this now dynamically adjust?



  • Roch Wednesday, May 14, 2008

    Bonghwan: my take is that the circumstances where this will hurt will be very, very specific, while the benefits will be felt much more generally. But yes, there will be tunables to control how this all works.

    Andrewk8 : there is a timer that decides when to sync. And there is code to measure how long a sync takes which I would not call a timer. At low load, a txg will be cut every 30 seconds (tunable). Under stress a txg will be cut every time ZFS estimates that the sync time will match the target time of 5 seconds (tunable).

  • nikita Thursday, May 15, 2008

    Still, write throttling implemented at the file system level
    cannot address "the imbalance between memory and disks at
    a system level" fully, because there are other sources of dirty
    pages, like anonymous pages. It seems that the only way to
    handle this correctly is to put generic throttling mechanism in
    VM, isn't it?

  • benr Thursday, May 15, 2008

    Excellent writeup!

    Is there a PSARC case or BugID associated that I can track progress with?

  • Roch Friday, May 16, 2008

    Nikita: unless I misunderstand what you mean, the VM already has its own throttling mechanism for dealing with too much demand and memory shortfall. A FS would like to avoid being the cause of VM shortfall, so this is about taking measures to avoid that.

    Benr: the bug IDs are 6429205 and 6415647.

  • Tim Thomas Thursday, June 12, 2008

    Roch, I looked up the bug id's and they don't seem to indicate what build of Nevada these new features will be available in. IHAC with the exact problem you describe using Nevada today and seeking help. Thanks, Tim.

  • Tim Thomas Thursday, June 12, 2008

    Ref my previous comment. It looks like these changes are in snv b87 and later.

  • tildeleb Wednesday, July 16, 2008

    How is the 1 tick delay implemented?
