By User13278091-Oracle on juil. 12, 2006
ZFS AND DIRECTIO
In view of the great performance gains that UFS gets out of the 'Directio' (DIO) feature, it is interesting to ask ourselves, where exactly do those gains come from and if ZFS can be tweaked to benefit from them in the same way.
UFS Directio is actually a set of things bundled together that improves performance of very specific workloads most notably that of Database. Directio is actually a performance hint to the filesystem and apart from relaxing posix requirements does not carry any change in filesystem semantics. The users of directio actually assert the condition on the full Filesystem or individual file level and the filesystem code if given extra freedom to run or not the tuned DIO codepath.
What does that tuned code path gets us ? A few things:
- output goes directly from application buffer to disk bypassing the filesystem core memory cache. - the FS is not constrained anymore to strictly obey the POSIX write ordering. The FS is thus able to allow multiple thread concurrently issuing some I/Os to a single file. - On input UFS DIO refrains from doing any form of readahead.In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os.
People often consider that the great gains that DIO provides comes from avoiding the CPU cost of the copy into system caches and from the avoiding the double buffering, once in the DB, once in the FS, that one gets in the non-directio case.
I would argue that while the CPU cost associated with a copy certainly does exists, the copy will run very very quickly compared to the time the ensuing I/O takes. So the impact of the copy would only appear on systems that have their CPU quite saturated, notably for industry standard benchmarks. However real systems, which are more likely to be I/O constrained than CPU constrained should not pay a huge toll to this effect.
As for double buffering, I note that Databases (or applications in general), are normally setup to consume a given amount of memory and the FS operates using the remaining portion. Filesystems caches data in memory for lack of better use of that memory. And FS give up their hold whenever necessary. So the data is not double buffered but rather 'free' memory keeps a hold on recently issued I/O. Buffering data in 2 locations does not look like a performance issue to me.
Anything for ZFS ?
So what does that leaves us with ? Why is DIO so good ? This tells me that we gain a lot from those 2 mantras
don't do any more I/O that requested allow multiple concurrent I/O to a file.I note that UFS readahead is particularly bad for certain usage; when UFS sees access to 2 consecutive pages, it will read a full cluster and those are typically 1MB in sizes today. So avoiding UFS readahead has probably contributed greatly to the success of DIO. As for ZFS there are 2 levels of readahead (a.k.a prefetching). One that is filebased and one device based. Both are being reworked at this stage. I note that filebased readahead code has not and will not behave like UFS. On the other hand device level prefetching probably is being over agressive for DB type loads and it should be avoided. While I have not given hope of that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior.
DIO for input does not otherwise appear an interesting proposition since if the data is cached, I don't really see the gains in bypassing it (apart from slowing down the reads).
As for writes, ZFS, out of the box, does not suffer from the single writer lock that UFS needs to implement the posix ordering rules. The transaction groups (TXG) are sufficient for that purpose (see The Dynamics of ZFS).
This leaves us to the amount of I/O needed by the 2 filesystems when running many concurrent O_DSYNC writers running small writes to random file offsets.
UFS actually handles this load by overwriting the data in it's preallocated disk locations. Every 8K pages is associated with set place on the storage and a write to that location means a disk head movement and an 8K output I/O. This loads should scale well with number of disks in the storage and the 'random' IOPS capability of each drives. If a drives handle 150 random IOPS, then we can handle about 1MB/s/drive of output.
Now ZFS will behave quite differently. ZFS does not have preallocation of file blocks and will not, ever, overwrite live data. The handling of the O_DSYNC writes in ZFS will occur in 2 stages.
The 2 stages of ZFS
First at the ZFS Intent Log (ZIL) level where we need to I/O the data in order to release the application blocked in a write call. Here the ZIL has the ability of aggregating data from multiple writes and issue fewer/larger I/Os than UFS would. Given the ZFS strategy of block allocation we also expect those I/O to be able to stream to the disk at high speed. We don't expect to be restrained by the random IOPS capabilities of disk but more by their streaming performance.
Next at the TXG level, we clean up the state of the filesystem and here again the block allocation should allow high rate of data transfer. At this stage there are 2 things we have to care about.
With current state of things, we probably will see the data sent to disk twice, once to the ZIL once to the pool. While this appears suboptimal at first, the aggregation and streaming characteristics of ZFS makes the current situation already probably better than what UFS can achieve. We're also looking to see if we can make this even better by avoiding the 2 copies while preserving the full streaming performance characteristics.
For pool level I/O we must take care to not inflate the amount of data sent to disk which could eventually cause early storage saturation. ZFS works out of the box with 128K records for large files. However for DB workloads, we expect this will be tuned such that the ZFS recordsize matches the DB block size. We also expect the DB blocksize to be at least 8K in sizes. Matching the ZFS recordize to the DB block size is a recommendation that is inline with what UFS DIO has taught us: don't do any more I/O than necessary.
Note also that with ZFS, because we don't overwrite live data, every block output needs to bubble up into metadata block updates etc... So there are some extra I/O that ZFS has to do. So depending on the exact test conditions the gains of ZFS can be offset by the extra metadata I/Os.
ZFS Performance and DB
Despite all the advantage of ZFS, the reason that performance data has been hard to come by is that we have to clear up the road and bypass the few side issues that currently affects performance on large DB loads. At this stage, we do have to spend some time and apply magic recipes to get ZFS performance on Database to behave the way it's intended to.
But when the dust settles, we should be right up there in terms of performance compared to UFS/DIO, and improvements ideas are still plenty, if you have some more I'm interested....