The Wonders of ZFS Storage
Performance for your Data

ZFS and Directio

Roch Bourbonnais
Principal Performance Engineer


Given the large performance gains that UFS gets out of the 'Directio' (DIO) feature, it is interesting to ask where exactly those gains come from and whether ZFS can be tweaked to benefit from them in the same way.

UFS Directio

UFS Directio is actually a set of things bundled together that improve the performance of very specific workloads, most notably databases. Directio is a performance hint to the filesystem and, apart from relaxing POSIX requirements, does not carry any change in filesystem semantics. Users of directio assert the condition at the level of the full filesystem or of an individual file, and the filesystem code is then given extra freedom to run the tuned DIO code path or not.

What does that tuned code path get us? A few things:

- Output goes directly from the application buffer to disk, bypassing the filesystem's in-memory cache.
- The FS is no longer constrained to strictly obey POSIX write ordering. It can thus allow multiple threads to concurrently issue I/Os to a single file.
- On input, UFS DIO refrains from doing any form of readahead.

In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os.

People often consider that the great gains DIO provides come from avoiding the CPU cost of the copy into system caches and from avoiding the double buffering, once in the DB and once in the FS, that one gets in the non-directio case.

I would argue that while the CPU cost associated with a copy certainly does exist, the copy runs very quickly compared to the time the ensuing I/O takes. So the impact of the copy would only appear on systems whose CPUs are close to saturated, notably in industry-standard benchmarks. Real systems, however, are more likely to be I/O constrained than CPU constrained and should not pay a heavy toll for it.

As for double buffering, I note that databases (or applications in general) are normally set up to consume a given amount of memory, and the FS operates using the remaining portion. Filesystems cache data in memory for lack of a better use of that memory, and they give up their hold whenever necessary. So the data is not so much double buffered as 'free' memory keeping a hold on recently issued I/O. Buffering data in 2 locations does not look like a performance issue to me.

Anything for ZFS ?

So what does that leave us with? Why is DIO so good? This tells me that we gain a lot from these 2 mantras:

- don't do any more I/O than requested
- allow multiple concurrent I/Os to a file.

I note that UFS readahead is particularly bad for certain usage: when UFS sees accesses to 2 consecutive pages, it reads a full cluster, and those are typically 1MB in size today. So avoiding UFS readahead has probably contributed greatly to the success of DIO.

As for ZFS, there are 2 levels of readahead (a.k.a. prefetching): one that is file based and one that is device based. Both are being reworked at this stage. I note that the file-based readahead code has not behaved and will not behave like UFS. On the other hand, device-level prefetching is probably overly aggressive for DB-type loads and should be avoided. While I have not given up hope that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior.

DIO for input does not otherwise appear an interesting proposition since if the data is cached, I don't really see the gains in bypassing it (apart from slowing down the reads).

As for writes, ZFS, out of the box, does not suffer from the single-writer lock that UFS needs to implement the POSIX ordering rules. The transaction groups (TXG) are sufficient for that purpose (see The Dynamics of ZFS).

This leaves us with the amount of I/O needed by the 2 filesystems when many concurrent O_DSYNC writers issue small writes to random file offsets.
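To make that workload concrete, here is a minimal sketch of it in Python. It is not a benchmark, and the function name, thread count, and file size are illustrative assumptions of mine. It assumes a Unix-like system where `os.pwrite` is available, and it falls back to `O_SYNC` where `O_DSYNC` is not defined.

```python
import os
import random
import tempfile
import threading

BLOCK = 8192  # assumed DB block size (8K), as in the text


def run_writers(path, file_blocks=64, writers=4, writes_per_writer=16, seed=0):
    """Issue small synchronous writes at random block-aligned offsets.

    Returns the total number of bytes written by all threads.
    """
    # With O_DSYNC each write returns only once the data is on stable storage.
    # Assumption: fall back to O_SYNC on platforms that lack O_DSYNC.
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | sync_flag, 0o600)
    totals = [0] * writers
    try:
        os.ftruncate(fd, file_blocks * BLOCK)

        def worker(idx):
            rng = random.Random(seed + idx)
            payload = bytes([idx % 256]) * BLOCK
            for _ in range(writes_per_writer):
                offset = rng.randrange(file_blocks) * BLOCK
                # pwrite takes an explicit offset, so the threads can share
                # one descriptor without racing on a file position.
                totals[idx] += os.pwrite(fd, payload, offset)

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(writers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
    return sum(totals)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        total = run_writers(os.path.join(d, "datafile"))
        print(total, "bytes written synchronously")
```

Each thread blocks in its own write call until the data is safe, which is exactly the situation where the filesystem's ability to accept concurrent writers to one file matters.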

UFS handles this load by overwriting the data in its preallocated disk locations. Every 8K page is associated with a set place on the storage, and a write to that location means a disk head movement and an 8K output I/O. This load should scale well with the number of disks in the storage and the 'random' IOPS capability of each drive. If a drive handles 150 random IOPS, then we can handle about 1MB/s/drive of output (150 x 8K).
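The arithmetic behind that per-drive figure can be written out as a small sketch. The function name is mine, and the model simply assumes that every random write costs exactly one 8K I/O.

```python
def ufs_random_write_throughput(iops_per_drive, page_bytes=8 * 1024, drives=1):
    """Bytes per second when every random write costs exactly one I/O."""
    return iops_per_drive * page_bytes * drives


per_drive = ufs_random_write_throughput(150)
print(f"{per_drive / 2**20:.2f} MiB/s per drive")  # prints "1.17 MiB/s per drive"
```

Under this model, throughput grows linearly with the number of drives, which is the scaling behavior described above.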

Now ZFS will behave quite differently. ZFS does not have preallocation of file blocks and will not, ever, overwrite live data. The handling of the O_DSYNC writes in ZFS will occur in 2 stages.

The 2 stages of ZFS

First at the ZFS Intent Log (ZIL) level, where we need to write the data to disk in order to release the application blocked in a write call. Here the ZIL has the ability to aggregate data from multiple writes and issue fewer, larger I/Os than UFS would. Given the ZFS strategy of block allocation, we also expect those I/Os to be able to stream to the disk at high speed. We don't expect to be restrained by the random IOPS capabilities of the disks but rather by their streaming performance.

Next, at the TXG level, we clean up the state of the filesystem, and here again the block allocation should allow a high rate of data transfer. At this stage there are 2 things we have to care about.

With the current state of things, we will probably see the data sent to disk twice, once to the ZIL and once to the pool. While this appears suboptimal at first, the aggregation and streaming characteristics of ZFS make the current situation probably already better than what UFS can achieve. We're also looking to see if we can make this even better by avoiding the 2 copies while preserving the full streaming performance characteristics.
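As a rough illustration of why aggregation helps, here is a toy model that packs pending synchronous writes into larger log I/Os. This is not the actual ZIL algorithm. The greedy packing, the function name, and the 128K limit per log I/O are all assumptions made for the sake of the example.

```python
def log_ios_needed(write_sizes, max_log_io=128 * 1024):
    """Greedily pack pending writes into log I/Os of at most max_log_io bytes.

    Toy model only: returns how many log I/Os the batch would need.
    """
    ios = 0
    room = 0  # bytes still free in the log I/O currently being filled
    for size in write_sizes:
        if size > max_log_io:
            raise ValueError("write larger than one log I/O")
        if size > room:
            # Current log I/O is full (or none started yet): open a new one.
            ios += 1
            room = max_log_io
        room -= size
    return ios


pending = [8192] * 32  # 32 concurrent 8K writers
print(log_ios_needed(pending))  # prints 2
```

In this model 32 pending 8K writes cost 2 sequential log I/Os, where an overwrite-in-place scheme would issue 32 random ones.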

For pool-level I/O we must take care not to inflate the amount of data sent to disk, which could eventually cause early storage saturation. ZFS works out of the box with 128K records for large files. However, for DB workloads we expect this to be tuned so that the ZFS recordsize matches the DB block size. We also expect the DB block size to be at least 8K. Matching the ZFS recordsize to the DB block size is a recommendation in line with what UFS DIO has taught us: don't do any more I/O than necessary.
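The inflation in question is easy to quantify with a small sketch. The function below is hypothetical and assumes DB blocks are aligned to record boundaries, so that an update rewrites every record it touches in full. The property itself is typically set with something like `zfs set recordsize=8k` on the dataset holding the data files.

```python
def pool_write_inflation(db_block, recordsize):
    """Bytes written to the pool per byte the DB actually changed.

    Assumes the DB block is aligned to a record boundary and that each
    touched record is rewritten in full.
    """
    records_touched = -(-db_block // recordsize)  # ceiling division
    return records_touched * recordsize / db_block


print(pool_write_inflation(8 * 1024, 128 * 1024))  # prints 16.0
print(pool_write_inflation(8 * 1024, 8 * 1024))    # prints 1.0
```

With the default 128K record, each 8K update sends 16 times more data to the pool than the DB changed. With a matching 8K record the ratio drops to 1.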

Note also that with ZFS, because we don't overwrite live data, every block output needs to bubble up into updates of the metadata blocks that point to it, and so on up the tree. So there is some extra I/O that ZFS has to do, and depending on the exact test conditions the gains of ZFS can be offset by the extra metadata I/Os.
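A toy model shows how this metadata cost depends on the test conditions. The sketch below counts the distinct indirect blocks that must be rewritten when a batch of data blocks is dirtied. The function name, the fanout of 128 pointers per indirect block, and the 2 levels of indirection are illustrative assumptions, not a description of the on-disk layout of any particular pool.

```python
def dirty_indirect_blocks(dirty_data_blocks, fanout=128, levels=2):
    """Count distinct indirect blocks rewritten for one batch of dirty blocks.

    Toy model: each indirect block points to `fanout` children, and a dirty
    child forces its parent to be rewritten, up through `levels` levels.
    """
    total = 0
    current = set(dirty_data_blocks)
    for _ in range(levels):
        # Map each dirty block to its parent; shared parents are counted once.
        current = {idx // fanout for idx in current}
        total += len(current)
    return total


print(dirty_indirect_blocks([5]))            # prints 2
print(dirty_indirect_blocks([0, 1, 2]))      # prints 2
print(dirty_indirect_blocks([0, 128, 256]))  # prints 4
```

An isolated write pays for the full chain of parents. Writes that land close together in the same batch share parents, so the overhead per write shrinks as more of them are grouped.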

ZFS Performance and DB

Despite all the advantages of ZFS, the reason that performance data has been hard to come by is that we have to clear the road and get past the few side issues that currently affect performance on large DB loads. At this stage, we do have to spend some time and apply magic recipes to get ZFS performance on databases to behave the way it's intended to.

But when the dust settles, we should be right up there in terms of performance compared to UFS/DIO. Ideas for improvement are still plentiful; if you have some more, I'm interested.

Comments ( 4 )
  • Myron Scott Wednesday, July 12, 2006
    ZFS Slides

    Slide 31 at the above link seems to suggest that one day there will be a userland interface into the DMU. Am I misreading the slide? It would be nice to see what kind of performance is possible by having the database written directly to the DMU.

  • Jason Ozolins Wednesday, July 12, 2006
    DIO for input does not otherwise appear an interesting proposition since if the data is cached, I don't really see the gains in bypassing it (apart from slowing down the reads)

    Bypassing the cache for a read makes sense if your locality of reference is so low that you never get cache hits on that data, but do have some other concurrent workload which can benefit from cached filesystem data. Not just for databases: high speed linear reads through huge files will completely trash the buffer cache quickly enough that useful data can be thrown out.

    Example: Linux 2.4's tendency to drop executable pages when heavy file activity was going on. You'd start some heavy file munging and go to get coffee, come back to your machine after five minutes and it would take thirty seconds after the screen unlocked to get all your windows redrawn because the executables were no longer resident in memory.

  • PJ Wednesday, July 12, 2006
    "we should be right up there in terms of performance compared to UFS/DIO" -- how does the performance of ZFS or UFS/DIO compare to raw devices for databases? The management of raw devices may be a pain, but I wonder if it is worth the pain to squeeze out all the performance that we can get with them?

  • James Mansion Thursday, July 13, 2006
    Would it be helpful for the prelogging write stream to be able to tell ZFS that the data should compress well? Or just try to compress it anyway as you go? Even if the data is expanded back to the raw form when moved to the target blocks, this may improve the effective write performance to the log by a factor of 5 or more, and have no further cost if the caches can retain an association between the compressed record and the equivalent cached raw data. Is it possible to dedicate a spindle to the write-ahead log and then distribute data to other target spindles asynchronously? (Hmm, starting to sound a bit like a database!) If I mflush a large mapped region with multiple (discontinuous) dirty pages, what happens? Cheers, James