In this paper I’d like to review the performance data we have gathered comparing the initial release of ZFS (Nov 16, 2005) with the Solaris legacy, optimized-beyond-reason UFS filesystem. The data we will be reviewing is based on 14 unit tests, each designed to stress a specific usage pattern of filesystem operations. Working with these well-contained usage scenarios greatly facilitates subsequent performance engineering analysis.
Our focus was to make a fair head-to-head comparison between UFS and ZFS, not to produce the biggest, meanest marketing numbers. Since ZFS is also a volume manager, we actually compared ZFS to a UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted to figure out why and how to improve ZFS.
We are also currently focusing on data-intensive operations. Metadata-intensive tests are being developed and we will report on those in a later study.
Looking ahead to our results, we find that of the 12 filesystem unit tests that were successfully run:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
In this paper, we will take a closer look at the tests where UFS is ahead and propose ways to improve those numbers.
THE SYSTEM UNDER TEST
Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At this point we are not yet monitoring the CPU utilization of the different tests, although we plan to do so in the future. The storage is an insanely large 300-disk array; the disks are rather old technology, small and slow 9 GB drives. None of the tests currently stresses the array very much; the idea was mostly to take the storage configuration out of the equation. Working with old-technology disks, the absolute throughput numbers are not necessarily of interest; they are presented in an appendix.
Every disk in our configuration is partitioned into 2 slices, and a simple SVM or zpool striped volume is built across all spindles. We then build a filesystem on top of the volume. All commands are run with default parameters. Both filesystems are mounted and we can run our test suite on either one.
Every test is rerun multiple times in succession; the tests are defined and developed to avoid variability between runs. Some of the current test definitions require that file data not be present in the filesystem cache. Since we currently do not have a convenient way to control this for ZFS, the results for those tests are omitted from this report.
THE FILESYSTEM UNIT TESTS
Here are the definitions of the 14 data-intensive tests we have currently identified. Note that we are very open to new test definitions; if you know of a data-intensive application that uses a filesystem in a very different pattern (and there must be tons of them), we would dearly like to hear from you.
Test 1
This is the simplest way to create a file: we open/creat a file, then issue 1 MB writes until the file size reaches 128 MB; we then close the file.
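To make the access pattern concrete, here is a minimal sketch of what such a test program could look like. This is not the actual test code (which we cannot easily share); the file path and buffer contents are purely illustrative.

```c
/* Sketch of the Test 1 pattern: allocate a 128 MB file with 1 MB writes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SZ  (1024 * 1024)            /* 1 MB per write() */
#define FILE_SZ   (128L * 1024 * 1024)     /* 128 MB target size */

int
main(void)
{
        static char buf[WRITE_SZ];
        long written = 0;
        int fd;

        (void) memset(buf, 'a', sizeof (buf));
        fd = open("/test/unit1.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        while (written < FILE_SZ) {
                if (write(fd, buf, WRITE_SZ) != WRITE_SZ) {
                        perror("write");
                        return (1);
                }
                written += WRITE_SZ;
        }
        (void) close(fd);
        return (0);
}
```

Test 2, below, is essentially the same loop with the file opened O_DSYNC and 128K writes.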
Test 2
In this test we also create a new file, but here the file is opened with the O_DSYNC flag and we issue 128K write system calls. This maps to some database file creation schemes.
Test 3
This test also relates to file creation, but with writes that are much smaller and of varying sizes. In this test, we create a 50 MB file using writes of a size picked randomly in [1K,8K]. The file is opened with default flags (no O_*SYNC), but after every 10 MB of written data we issue an fsync() call for the whole file. This form of access can be used for log files that have data integrity requirements.
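A sketch of this write loop, again purely illustrative rather than the real test code (the function name is mine, and fd is assumed to be a descriptor opened as in the previous sketch):

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Illustration of the Test 3 pattern: write 50 MB with sizes drawn
 * uniformly from [1K,8K], calling fsync() after every 10 MB of new data.
 */
void
write_with_periodic_fsync(int fd)
{
        char   buf[8192] = { 0 };
        long   written = 0, since_fsync = 0;
        size_t sz;

        while (written < 50L * 1024 * 1024) {
                sz = 1024 + lrand48() % (7 * 1024 + 1);  /* uniform in [1K,8K] */
                if (write(fd, buf, sz) < 0)
                        break;
                written += sz;
                since_fsync += sz;
                if (since_fsync >= 10L * 1024 * 1024) {
                        (void) fsync(fd);        /* force the data to stable storage */
                        since_fsync = 0;
                }
        }
}
```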
Test 4
Moving now to a read test: we read a 1 GB file (assumed to be in the cache) with 32K read system calls. This is a rather simple test to keep everybody honest.
Test 5
This is the same test as Test 4, except that the file is assumed not to be present in the filesystem cache. We currently have no such control for ZFS, so we will not be reporting performance numbers for this test. This is a basic streaming read sequence that should test the readahead capacity of a filesystem.
Test 6
Our previous write tests were allocating writes. In this test we verify the ability of a filesystem to rewrite over an existing file. We look at 32K writes to a file opened with O_DSYNC.
Test 7
Here we also test the ability to rewrite existing files. The sizes are randomly picked in the [1K,8K] range. No special control over data integrity (no O_*SYNC, no fsync()).
Test 8
In this test we create a very large file (10 GB) with 1 MB writes, followed by 2 full-pass sequential reads. This test is still evolving, but we want to verify the ability of the filesystem to work with files whose size is close to or larger than the available free memory.
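Since the results table below sizes this file relative to free memory, one plausible way for a test to pick such a size is sketched here; this is an assumption about the approach, not the framework's code:

```c
#include <unistd.h>

/* Return roughly half of the currently free physical memory, in bytes. */
long long
half_of_freemem(void)
{
        long long pages = sysconf(_SC_AVPHYS_PAGES);   /* free physical pages */
        long long pgsz  = sysconf(_SC_PAGESIZE);

        return ((pages * pgsz) / 2);
}
```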
Test 9
In this test, we issue 8K writes at random 8K-aligned offsets in a 1 GB file. Once 128 MB of data has been written, we issue an fsync().
Test 10
Here, we issue 2K writes at random (unaligned) offsets to a file opened with O_DSYNC.
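Sketched out, the pattern is a loop of small synchronous pwrite() calls at random offsets. As before, the path, sizes, and function name are illustrative assumptions rather than the real test code.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_SZ     2048                      /* 2K per write */
#define FILE_SZ   (100L * 1024 * 1024)      /* existing 100 MB file */
#define TOTAL     (1L * 1024 * 1024)        /* 1 MB of writes in total */

void
random_odsync_writes(const char *path)
{
        char  buf[IO_SZ] = { 0 };
        long  done;
        int   fd = open(path, O_WRONLY | O_DSYNC);

        if (fd < 0)
                return;
        for (done = 0; done < TOTAL; done += IO_SZ) {
                off_t off = lrand48() % (FILE_SZ - IO_SZ);   /* unaligned offset */
                (void) pwrite(fd, buf, IO_SZ, off);          /* each call waits on the disk */
        }
        (void) close(fd);
}
```

Test 9 differs mainly in that its offsets are 8K-aligned (for instance (lrand48() % nblocks) * 8192) and the flush is deferred to a single fsync() rather than relying on O_DSYNC.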
Test 11
Same test as Test 10, but using 4 cooperating threads all working on a single file.
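For illustration, here is how 4 cooperating threads could share a single file descriptor for this kind of test; the names and constants are my own assumptions, not the framework code.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS    4
#define IO_SZ       2048
#define PER_THREAD  (1L * 1024 * 1024)      /* 1 MB of writes per thread */
#define FILE_SZ     (100L * 1024 * 1024)

static int shared_fd;                        /* opened elsewhere with O_DSYNC */

static void *
writer(void *unused)
{
        char buf[IO_SZ] = { 0 };
        long done;

        for (done = 0; done < PER_THREAD; done += IO_SZ) {
                off_t off = lrand48() % (FILE_SZ - IO_SZ);
                (void) pwrite(shared_fd, buf, IO_SZ, off);   /* all threads hit one file */
        }
        return (NULL);
}

void
run_writers(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                (void) pthread_create(&tid[i], NULL, writer, NULL);
        for (i = 0; i < NTHREADS; i++)
                (void) pthread_join(tid[i], NULL);
}
```

The point to note is that all four threads write through the same descriptor, which is exactly the situation where a per-file single-writer lock (discussed under Test 11 below) serializes the work.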
Test 12
Here we attempt to simulate a mixed read/write pattern. Working with an existing file, we loop through a pattern of 3 reads at 3 randomly selected 8K-aligned offsets, followed by an 8K write to the last block read.
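A sketch of that access pattern (illustrative only; fd is assumed to refer to the already-opened 1 GB file):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK      8192
#define NBLOCKS  (1024L * 1024 * 1024 / BLK)    /* 8K blocks in the 1 GB file */

void
mixed_read_write(int fd)
{
        char  buf[BLK];
        off_t off = 0;
        long  written;
        int   i;

        for (written = 0; written < 128L * 1024 * 1024; written += BLK) {
                for (i = 0; i < 3; i++) {
                        off = (lrand48() % NBLOCKS) * BLK;   /* 8K-aligned offset */
                        (void) pread(fd, buf, BLK, off);
                }
                (void) pwrite(fd, buf, BLK, off);            /* rewrite the last block read */
        }
}
```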
Test 13
In this test we issue 2K pread() calls at random unaligned offsets. The file is assumed not to be in the cache. Since we currently have no such control for ZFS, we won’t report data for this test.
Test 14
We have 4 cooperating threads (working on a single file) issuing 2K pread() calls at random unaligned offsets. The file is present in the cache.
THE RESULTS
We have a common testing framework to generate the performance data. Each test is written as a simple C program, and the framework is responsible for creating threads and files, timing the runs, and reporting. We are currently discussing merging this test framework with the Filebench suite. We regret that we cannot easily share the test code; however, the above descriptions should be sufficiently precise to allow someone to reproduce our data. In my mind, a simple 10-to-20-disk array and any small server should be enough to generate similar numbers. If anyone finds very different results, I would be very interested in knowing about it.
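Since the framework times each run and reports throughput, a helper along the following lines is all that is conceptually needed; this is only a sketch of the idea (the function name is mine), not the framework's actual reporting code.

```c
#include <sys/time.h>

/* Convert bytes moved over a timed interval into MB/s. */
double
throughput_mbps(long long bytes, struct timeval *start, struct timeval *end)
{
        double secs = (end->tv_sec - start->tv_sec) +
            (end->tv_usec - start->tv_usec) / 1e6;

        return ((bytes / (1024.0 * 1024.0)) / secs);
}
```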
Our framework reports all timing results as a throughput measure. The absolute value of the throughput is highly test-case dependent: a 2K O_DSYNC write will not have the same throughput as a 1 MB cached read. Some tests would be better described in terms of operations per second. However, since our focus is a relative ZFS to UFS/SVM comparison, we will concentrate here on the delta in throughput between the 2 filesystems (for the curious, the full throughput data is posted in the appendix).
Drumroll….
Task ID | Description | Winning FS / Performance Delta
1 | open() and allocation of a 128.00 MB file with write(1024K) then close() | ZFS / 3.4X
2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close() | ZFS / 5.3X
3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB | UFS / 1.8X
4 | Sequential read(32K) of a 1024.00 MB file, cached | ZFS / 1.1X
5 | Sequential read(32K) of a 1024.00 MB file, uncached | no data
6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | ZFS / 2.6X
7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close() | UFS / 1.3X
8 | Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB); no special cache manipulation | ZFS / 2.3X
9 | 128.00 MB worth of random 8K-aligned writes to a 1024.00 MB file, followed by fsync(); cached | UFS / 2.3X
10 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, cached | draw (UFS == ZFS)
11 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB | ZFS / 5.8X
12 | 128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, pattern of 3 reads then a write to the last block read, random offset, cached | draw (UFS == ZFS)
13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | no data
14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached; 4 threads | UFS / 6.9X
As stated in the introduction:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
The performance differences can be sizable; let’s have a closer look at some of them.
PERFORMANCE DEBRIEF
Let’s look at each test and try to understand the cause of the performance differences.
Test 1 (ZFS 3.4X)
open() and allocation of a 128.00 MB file with write(1024K) then close().
This test is not fully analyzed. We note that in this situation UFS will regularly kick off some I/O from the context of the write system call; this occurs whenever a cluster of writes (typically of size 128K or 1 MB) has completed. The initiation of I/O by UFS slows down the process. ZFS, on the other hand, can zoom through the test at a rate much closer to memcpy speed. The ZFS I/Os to disk are actually generated internally by the ZFS transaction group mechanism: every few seconds a transaction group comes along and flushes the dirty data to disk, and this occurs without throttling the test.
Test 2 (ZFS 5.3X)
open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close().
Here ZFS shows an even bigger advantage. Because of its design and complexity, UFS is actually somewhat limited in its capacity to write-allocate files in O_DSYNC mode. Every new UFS write requires some disk block allocation, which must occur one block at a time when O_DSYNC is set. ZFS can easily outperform UFS on this test.
Test 3 (UFS 1.8X)
open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB.
Here ZFS pays for the advantage it had in Test 1. In this test, we issue a great many writes to a file; those are cached while the process races along. When the fsync() hits (every 10 MB of outstanding data, per the test definition), the filesystem must guarantee that all of that data has reached stable storage. Since UFS kicks off I/O more regularly, it has a smaller amount of data left to sync up when the fsync() hits. What saves the day for ZFS is that, for that leftover data, UFS slows down to a crawl. ZFS, on the other hand, has accumulated a large amount of data in the cache, all of which must be pushed out when the fsync() hits.
Fortunately, ZFS is able to issue much larger I/Os to disk and catches up some of the lag that has built up. But the final result shows that UFS wins the horse race (at least in this specific test); the details of the test definition will influence the final result here.
However, the ZFS team is working on ways to make fsync() much better. We actually have 2 possible avenues of improvement. We can borrow from the UFS behavior and kick off some I/Os when too much outstanding data is cached. UFS does this at a very regular interval, which does not look right either; but clearly, if a file has many MB of outstanding dirty data, sending it off to disk may be beneficial. On the other hand, keeping the data in cache is interesting when the write pattern is such that the same file offsets are written and rewritten over and over again: sending the data to disk is wasteful if it is rewritten shortly afterward. Basically, the filesystem must place a bet on whether a future fsync() will occur before a new write to the block. We cannot win this bet on all tests all the time.
Given that fsync() performance is important, I would like to see us asynchronously kick off I/O when we reach many MB of outstanding data for a file. This is nevertheless debatable.
Even if we don’t do this, we have another area of improvement that the ZFS team is looking into. When the fsync() finally hits the fan, even with a lot of outstanding data, the current implementation does not issue disk I/Os very efficiently. The proper way to do this is to kick off all required I/Os and then wait for them all to complete. Currently, in the intricacies of the code, some I/Os are issued and waited upon one after the other. This is not yet optimal, but we certainly should see improvements coming in the future, and I truly expect ZFS fsync() performance to be ahead all the time.
Test 4 (ZFS 1.1X)
Sequential read(32K) of a 1024.00 MB file, cached.
A rather simple test, mostly running close to memcpy speed between the filesystem cache and the user buffer. The contest is almost a wash, with ZFS slightly on top. Not yet analyzed.
Test 5 (N/A)
Sequential read(32K) of a 1024.00 MB file, uncached.
No results, due to the lack of control over ZFS file-level caching.
Test 6 (ZFS 2.6X)
Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached.
Due to its write-anywhere (WAFL-like) layout, a ZFS rewrite is not very different from an initial write, and ZFS seems to perform very well on this test. Presumably UFS performance is hindered by the need to synchronize the cached data. Result not yet analyzed.
Test 7 (UFS 1.3X)
Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close().
In this test we are not timing any disk I/O; this is merely a test of unrolling the filesystem code for 1K to 8K cached writes. The UFS codepath wins on simplicity and years of performance tuning. The ZFS codepath here somewhat suffers from its youth. Understandably, the current ZFS implementation is very well layered, and we can easily imagine that the locking strategies of the different layers are independent of one another. We have found (thanks, DTrace) that a small ZFS cached write uses about 3 times as many lock acquisitions as an equivalent UFS call. Mutex rationalization within or between layers certainly seems to be an area of potential improvement for ZFS that would help this particular test. We also realised that the very clean and layered code implementation causes the callstack to take many elevator rides up and down between layers. On a SPARC CPU, going 6 or 7 frames up and down the callstack causes a spill/fill trap, plus one additional trap for every additional floor travelled. Fortunately there are many places where ZFS will be able to merge different functions into a single one, or possibly exploit tail calls, to regain some of the lost performance. All in all, we find that the performance difference is small enough not to be worrisome at this point, especially in view of the possible improvements we have already identified.
Test 8 (ZFS 2.3X)
Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB); no special cache manipulation.
This test needs to be analyzed further. We note that UFS will proactively free-behind read blocks. While this is a very responsible use of memory (give it back after use), it potentially impacts UFS re-read performance. While we’re happy to see ZFS performance on top, some investigation is warranted to make sure that ZFS does not overconsume memory in some situations.
Test 9 (UFS 2.3X)
128.00 MB worth of random 8K-aligned writes to a 1024.00 MB file, followed by fsync(); cached.
In this test we expect a rationale similar to that of Test 3 to be at play. The same cure should also apply.
Test 10 (draw)
1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, cached.
Both filesystems must issue and wait for a 2K I/O on each write. They both do this as efficiently as possible.
Test 11 (ZFS 5.8X)
1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB.
This test is similar to the previous one except for the 4 cooperating threads. ZFS coming out on top highlights a key feature of ZFS: the absence of a single-writer lock. UFS can only allow a single writing thread per file. The only exception is when directio is enabled, and then only under rather restrictive conditions. UFS with directio allows concurrent writers, with the implied restriction that it does not honor full POSIX semantics regarding write atomicity. ZFS, out of the box, allows concurrent writers without requiring any special setup and without giving up full POSIX semantics. All great news for simplicity of deployment and great database performance.
Test 12 (draw)
128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, pattern of 3 reads then a write to the last block read, random offset, cached.
Both filesystems perform appropriately. This test still requires analysis.
Test 13 (N/A)
5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached.
No results, due to the lack of control over ZFS file-level caching.
Test 14 (UFS 6.9X)
5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached; 4 threads.
This test inexplicably shows UFS on top. The UFS code can perform rather well here given that its filesystem cache is stored in the page cache; servicing reads from that cache can be made very scalable. We are just starting our analysis of the ZFS performance characteristics for this test. We have identified a serialization construct in the buffer management code: reclaiming the buffers into which to put the cached data is acting as a serial throttle. This is truly the only test where the ZFS performance disappoints, although there is no doubt that we will find a cure for this implementation issue.
THE TAKEAWAY
ZFS comes out on top in very many of our tests, often by a significant factor. Where UFS is ahead, we have a clear view of how to improve the ZFS implementation. The case of shared readers of a single file is the one that requires special attention.
Given the youth of the ZFS implementation, the performance picture presented in this paper shows that the ZFS design decisions are totally validated from a performance perspective.
FUTURE DIRECTIONS
Clearly, we should now expand the unit test coverage. We would like to study more metadata-intensive workloads. We also would like to see how ZFS features such as compression and RAID-Z perform. Other interesting studies could focus on CPU consumption and memory efficiency. We also need to find a solution for running the existing unit tests that require files not to be cached in the filesystem.
APPENDIX: THROUGHPUT MEASURES
Here are the raw throughput measures for each of the 14 unit tests.
Task ID | Description | ZFS latest+nv25 (MB/s) | UFS+nv25 (MB/s) | Winner
1 | open() and allocation of a 128.00 MB file with write(1024K) then close() | 486.01572 | 145.94098 | ZFS 3.4X
2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close() | 4.5637 | 0.86565 | ZFS 5.3X
3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB | 27.3327 | 50.09027 | UFS 1.8X
4 | Sequential read(32K) of a 1024.00 MB file, cached | 674.77396 | 612.92737 | ZFS 1.1X
5 | Sequential read(32K) of a 1024.00 MB file, uncached | 1756.57637 | 17.53705 | not comparable (ZFS caching not controlled)
6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | 2.20641 | 0.85497 | ZFS 2.6X
7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close() | 204.31557 | 257.22829 | UFS 1.3X
8 | Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB); no special cache manipulation | 698.18182 | 298.25243 | ZFS 2.3X
9 | 128.00 MB worth of random 8K-aligned writes to a 1024.00 MB file, followed by fsync(); cached | 42.75208 | 100.35258 | UFS 2.3X
10 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, cached | 0.117925 | 0.116375 | draw
11 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB | 0.42673 | 0.07391 | ZFS 5.8X
12 | 128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, pattern of 3 reads then a write to the last block read, random offset, cached | 264.84151 | 266.78044 | draw
13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | 75.98432 | 0.11684 | not comparable (ZFS caching not controlled)
14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached; 4 threads | 56.38486 | 386.70305 | UFS 6.9X
OpenSolaris, ZFS