In this paper I’d like to review the performance data we have gathered comparing the initial release of ZFS (Nov 16, 2005) with the Solaris legacy, optimized-beyond-reason UFS filesystem. The data we will be reviewing is based on 14 unit tests, each designed to stress a specific usage pattern of filesystem operations. Working with these well-contained usage scenarios greatly facilitates subsequent performance engineering analysis.
Our focus was to make a fair head-to-head comparison between UFS and ZFS, not to produce the biggest, meanest marketing numbers. Since ZFS is also a volume manager, we actually compared ZFS to a UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted to figure out why and how to improve ZFS.
We are also currently focusing on data-intensive operations. Metadata-intensive tests are being developed and we will report on those in a later study.
Looking ahead to our results, we find that of the 12 filesystem unit tests that were successfully run:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
In this paper, we will take a closer look at the tests where UFS is ahead and propose ways to improve those numbers.
THE SYSTEM UNDER TEST
Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At this point we are not yet monitoring the CPU utilization of the different tests, although we plan to do so in the future. The storage is an insanely large 300-disk array; the disks are rather old technology, small and slow 9 GB drives. None of the tests currently stresses the array very much; the idea was mostly to take the storage configuration out of the equation. Working with old-technology disks, the absolute throughput numbers are not necessarily of interest; they are presented in an appendix.
Every disk in our configuration is partitioned into 2 slices, and a simple SVM or zpool striped volume is built across all spindles. We then build a filesystem on top of the volume. All commands are run with default parameters. Both filesystems are mounted and we can run our test suite on either one.
Every test is rerun multiple times in succession; the tests are defined and developed to avoid variability between runs. Some of the current test definitions require that file data not be present in the filesystem cache. Since we currently do not have a convenient way to control this for ZFS, the results for those tests are omitted from this report.
THE FILESYSTEM UNIT TESTS
Here are the definitions of the 14 data-intensive tests we have currently identified. Note that we are very open to new test definitions; if you know of a data-intensive application that uses a filesystem in a very different pattern (and there must be tons of them), we would dearly like to hear from you.
Test 1
This is the simplest way to create a file: we open/creat a file, then issue 1 MB writes until the file size reaches 128 MB; we then close the file.
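To make the access pattern concrete, here is a minimal sketch of what such a test program could look like. This is not the actual test code (which we cannot easily share); the file path and buffer contents are purely illustrative.

```c
/* Sketch of the Test 1 pattern: allocate a 128 MB file with 1 MB writes. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SZ  (1024 * 1024)            /* 1 MB per write() */
#define FILE_SZ   (128L * 1024 * 1024)     /* 128 MB target size */

int
main(void)
{
        static char buf[WRITE_SZ];
        long written = 0;
        int fd;

        (void) memset(buf, 'a', sizeof (buf));
        fd = open("/test/unit1.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        while (written < FILE_SZ) {
                if (write(fd, buf, WRITE_SZ) != WRITE_SZ) {
                        perror("write");
                        return (1);
                }
                written += WRITE_SZ;
        }
        (void) close(fd);
        return (0);
}
```

Test 2, below, is essentially the same loop with the file opened O_DSYNC and 128K writes.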
Test 2
In this test we also create a new file, but here the file is opened with the O_DSYNC flag and we issue 128K write system calls. This maps to some database file creation schemes.
Test 3
This test also relates to file creation, but with writes that are much smaller and of varying sizes. In this test, we create a 50 MB file using writes of a size picked randomly in [1K,8K]. The file is opened with default flags (no O_*SYNC), but after every 10 MB of written data we issue an fsync() call for the whole file. This form of access can be used for log files that have data integrity requirements.
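A sketch of this write loop, again purely illustrative rather than the real test code (the function name is mine, and fd is assumed to be a descriptor opened as in the previous sketch):

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Illustration of the Test 3 pattern: write 50 MB with sizes drawn
 * uniformly from [1K,8K], calling fsync() after every 10 MB of new data.
 */
void
write_with_periodic_fsync(int fd)
{
        char   buf[8192] = { 0 };
        long   written = 0, since_fsync = 0;
        size_t sz;

        while (written < 50L * 1024 * 1024) {
                sz = 1024 + lrand48() % (7 * 1024 + 1);  /* uniform in [1K,8K] */
                if (write(fd, buf, sz) < 0)
                        break;
                written += sz;
                since_fsync += sz;
                if (since_fsync >= 10L * 1024 * 1024) {
                        (void) fsync(fd);        /* force the data to stable storage */
                        since_fsync = 0;
                }
        }
}
```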
Test 4
Moving now to a read test: we read a 1 GB file (assumed to be in the cache) with 32K read system calls. This is a rather simple test to keep everybody honest.
Test 5
This is the same test as Test 4, except that the file is assumed not to be present in the filesystem cache. We currently have no such control for ZFS, so we will not be reporting performance numbers for this test. This is a basic streaming read sequence that should test the readahead capacity of a filesystem.
Test 6
Our previous write tests were allocating writes. In this test we verify the ability of a filesystem to rewrite over an existing file. We look at 32K writes to a file opened with O_DSYNC.
Test 7
Here we also test the ability to rewrite existing files. The sizes are randomly picked in the [1K,8K] range. No special control over data integrity (no O_*SYNC, no fsync()).
Test 8
In this test we create a very large file (10 GB) with 1 MB writes, followed by 2 full-pass sequential reads. This test is still evolving, but we want to verify the ability of the filesystem to work with files whose size is close to or larger than the available free memory.
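Since the results table below sizes this file relative to free memory, one plausible way for a test to pick such a size is sketched here; this is an assumption about the approach, not the framework's code:

```c
#include <unistd.h>

/* Return roughly half of the currently free physical memory, in bytes. */
long long
half_of_freemem(void)
{
        long long pages = sysconf(_SC_AVPHYS_PAGES);   /* free physical pages */
        long long pgsz  = sysconf(_SC_PAGESIZE);

        return ((pages * pgsz) / 2);
}
```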
Test 9
In this test, we issue 8K writes at random 8K-aligned offsets in a 1 GB file. Once 128 MB of data has been written, we issue an fsync().
Test 10
Here, we issue 2K writes at random (unaligned) offsets to a file opened with O_DSYNC.
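Sketched out, the pattern is a loop of small synchronous pwrite() calls at random offsets. As before, the path, sizes, and function name are illustrative assumptions rather than the real test code.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_SZ     2048                      /* 2K per write */
#define FILE_SZ   (100L * 1024 * 1024)      /* existing 100 MB file */
#define TOTAL     (1L * 1024 * 1024)        /* 1 MB of writes in total */

void
random_odsync_writes(const char *path)
{
        char  buf[IO_SZ] = { 0 };
        long  done;
        int   fd = open(path, O_WRONLY | O_DSYNC);

        if (fd < 0)
                return;
        for (done = 0; done < TOTAL; done += IO_SZ) {
                off_t off = lrand48() % (FILE_SZ - IO_SZ);   /* unaligned offset */
                (void) pwrite(fd, buf, IO_SZ, off);          /* each call waits on the disk */
        }
        (void) close(fd);
}
```

Test 9 differs mainly in that its offsets are 8K-aligned (for instance (lrand48() % nblocks) * 8192) and the flush is deferred to a single fsync() rather than relying on O_DSYNC.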
Test 11
Same test as Test 10, but using 4 cooperating threads all working on a single file.
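For illustration, here is how 4 cooperating threads could share a single file descriptor for this kind of test; the names and constants are my own assumptions, not the framework code.

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS    4
#define IO_SZ       2048
#define PER_THREAD  (1L * 1024 * 1024)      /* 1 MB of writes per thread */
#define FILE_SZ     (100L * 1024 * 1024)

static int shared_fd;                        /* opened elsewhere with O_DSYNC */

static void *
writer(void *unused)
{
        char buf[IO_SZ] = { 0 };
        long done;

        for (done = 0; done < PER_THREAD; done += IO_SZ) {
                off_t off = lrand48() % (FILE_SZ - IO_SZ);
                (void) pwrite(shared_fd, buf, IO_SZ, off);   /* all threads hit one file */
        }
        return (NULL);
}

void
run_writers(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                (void) pthread_create(&tid[i], NULL, writer, NULL);
        for (i = 0; i < NTHREADS; i++)
                (void) pthread_join(tid[i], NULL);
}
```

The point to note is that all four threads write through the same descriptor, which is exactly the situation where a per-file single-writer lock (discussed under Test 11 below) serializes the work.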
Test 12
Here we attempt to simulate a mixed read/write pattern. Working with an existing file, we loop through a pattern of 3 reads at 3 randomly selected 8K-aligned offsets, followed by an 8K write to the last block read.
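A sketch of that access pattern (illustrative only; fd is assumed to refer to the already-opened 1 GB file):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK      8192
#define NBLOCKS  (1024L * 1024 * 1024 / BLK)    /* 8K blocks in the 1 GB file */

void
mixed_read_write(int fd)
{
        char  buf[BLK];
        off_t off = 0;
        long  written;
        int   i;

        for (written = 0; written < 128L * 1024 * 1024; written += BLK) {
                for (i = 0; i < 3; i++) {
                        off = (lrand48() % NBLOCKS) * BLK;   /* 8K-aligned offset */
                        (void) pread(fd, buf, BLK, off);
                }
                (void) pwrite(fd, buf, BLK, off);            /* rewrite the last block read */
        }
}
```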
Test 13
In this test we issue 2K pread() calls at random unaligned offsets. The file is assumed not to be in the cache. Since we currently have no such control for ZFS, we won’t report data for this test.
Test 14
We have 4 cooperating threads (working on a single file) issuing 2K pread() calls at random unaligned offsets. The file is present in the cache.
THE RESULTS
We have a common testing framework to generate the performance data. Each test is written as a simple C program, and the framework is responsible for creating threads and files, timing the runs, and reporting. We are currently discussing merging this test framework with the Filebench suite. We regret that we cannot easily share the test code; however, the above descriptions should be sufficiently precise to allow someone to reproduce our data. In my mind, a simple 10-to-20-disk array and any small server should be enough to generate similar numbers. If anyone finds very different results, I would be very interested in knowing about it.
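Since the framework times each run and reports throughput, a helper along the following lines is all that is conceptually needed; this is only a sketch of the idea (the function name is mine), not the framework's actual reporting code.

```c
#include <sys/time.h>

/* Convert bytes moved over a timed interval into MB/s. */
double
throughput_mbps(long long bytes, struct timeval *start, struct timeval *end)
{
        double secs = (end->tv_sec - start->tv_sec) +
            (end->tv_usec - start->tv_usec) / 1e6;

        return ((bytes / (1024.0 * 1024.0)) / secs);
}
```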
Our framework reports all timing results as a throughput measure. The absolute value of the throughput is highly test-case dependent: a 2K O_DSYNC write will not have the same throughput as a 1 MB cached read. Some tests would be better described in terms of operations per second. However, since our focus is a relative ZFS to UFS/SVM comparison, we will concentrate here on the delta in throughput between the 2 filesystems (for the curious, the full throughput data is posted in the appendix).
Drumroll….
Task ID | Description | Winning FS / Performance Delta
1 | open() and allocation of a 128.00 MB file with write(1024K) then close() | ZFS / 3.4X
2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close() | ZFS / 5.3X
3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB | UFS / 1.8X
4 | Sequential read(32K) of a 1024.00 MB file, cached | ZFS / 1.1X
5 | Sequential read(32K) of a 1024.00 MB file, uncached | no data
6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | ZFS / 2.6X
7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close() | UFS / 1.3X
8 | Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB); no special cache manipulation | ZFS / 2.3X
9 | 128.00 MB worth of random 8K-aligned writes to a 1024.00 MB file, followed by fsync(); cached | UFS / 2.3X
10 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, cached | draw (UFS == ZFS)
11 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB | ZFS / 5.8X
12 | 128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, pattern of 3 reads then a write to the last block read, random offset, cached | draw (UFS == ZFS)
13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | no data
14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached; 4 threads | UFS / 6.9X
As stated in the introduction:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
The performance differences can be sizable; let’s have a closer look at some of them.
PERFORMANCE DEBRIEF
Let’s look at each test and try to understand the cause of the performance differences.
Test 1 (ZFS 3.4X)
open() and allocation of a 128.00 MB file with write(1024K) then close().
This test is not fully analyzed. We note that in this situation UFS will regularly kick off some I/O from the context of the write system call; this occurs whenever a cluster of writes (typically of size 128K or 1 MB) has completed. The initiation of I/O by UFS slows down the process. ZFS, on the other hand, can zoom through the test at a rate much closer to memcpy speed. The ZFS I/Os to disk are actually generated internally by the ZFS transaction group mechanism: every few seconds a transaction group comes along and flushes the dirty data to disk, and this occurs without throttling the test.
Test 2 (ZFS 5.3X)
open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close().
Here ZFS shows an even bigger advantage. Because of its design and complexity, UFS is actually somewhat limited in its capacity to write-allocate files in O_DSYNC mode. Every new UFS write requires some disk block allocation, which must occur one block at a time when O_DSYNC is set. ZFS can easily outperform UFS on this test.
Test 3 (UFS 1.8X)
open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB.
Here ZFS pays for the advantage it had in Test 1. In this test, we issue a great many writes to a file; those are cached while the process races along. When the fsync() hits (every 10 MB of outstanding data, per the test definition), the filesystem must guarantee that all of that data has reached stable storage. Since UFS kicks off I/O more regularly, it has a smaller amount of data left to sync up when the fsync() hits. What saves the day for ZFS is that, for that leftover data, UFS slows down to a crawl. ZFS, on the other hand, has accumulated a large amount of data in the cache, all of which must be pushed out when the fsync() hits.
Fortunately, ZFS is able to issue much larger I/Os to disk and catches up some of the lag that has built up. But the final result shows that UFS wins the horse race (at least in this specific test); the details of the test definition will influence the final result here.
However, the ZFS team is working on ways to make fsync() much better. We actually have 2 possible avenues of improvement. We can borrow from the UFS behavior and kick off some I/Os when too much outstanding data is cached. UFS does this at a very regular interval, which does not look right either; but clearly, if a file has many MB of outstanding dirty data, sending it off to disk may be beneficial. On the other hand, keeping the data in cache is interesting when the write pattern is such that the same file offsets are written and rewritten over and over again: sending the data to disk is wasteful if it is rewritten shortly afterward. Basically, the filesystem must place a bet on whether a future fsync() will occur before a new write to the block. We cannot win this bet on all tests all the time.
Given that fsync() performance is important, I would like to see us asynchronously kick off I/O when we reach many MB of outstanding data for a file. This is nevertheless debatable.
Even if we don’t do this, we have another area of improvement that the ZFS team is looking into. When the fsync() finally hits the fan, even with a lot of outstanding data, the current implementation does not issue disk I/Os very efficiently. The proper way to do this is to kick off all required I/Os and then wait for them all to complete. Currently, in the intricacies of the code, some I/Os are issued and waited upon one after the other. This is not yet optimal, but we certainly should see improvements coming in the future, and I truly expect ZFS fsync() performance to be ahead all the time.
Test 4 (ZFS 1.1X)
Sequential read(32K) of a 1024.00 MB file, cached.
A rather simple test, mostly running close to memcpy speed between the filesystem cache and the user buffer. The contest is almost a wash, with ZFS slightly on top. Not yet analyzed.
Test 5 (N/A)
Sequential read(32K) of a 1024.00 MB file, uncached.
No results, due to the lack of control over ZFS file-level caching.
Test 6 (ZFS 2.6X)
Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached.
Due to its write-anywhere (WAFL-like) layout, a ZFS rewrite is not very different from an initial write, and ZFS seems to perform very well on this test. Presumably UFS performance is hindered by the need to synchronize the cached data. Result not yet analyzed.
Test 7 (UFS 1.3X)
Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close().
In this test we are not timing any disk I/O; this is merely a test of unrolling the filesystem code for 1K to 8K cached writes. The UFS codepath wins on simplicity and years of performance tuning. The ZFS codepath here somewhat suffers from its youth. Understandably, the current ZFS implementation is very well layered, and we can easily imagine that the locking strategies of the different layers are independent of one another. We have found (thanks, DTrace) that a small ZFS cached write uses about 3 times as many lock acquisitions as an equivalent UFS call. Mutex rationalization within or between layers certainly seems to be an area of potential improvement for ZFS that would help this particular test. We also realised that the very clean and layered code implementation causes the callstack to take many elevator rides up and down between layers. On a SPARC CPU, going 6 or 7 frames up and down the callstack causes a spill/fill trap, plus one additional trap for every additional floor travelled. Fortunately there are many places where ZFS will be able to merge different functions into a single one, or possibly exploit tail calls, to regain some of the lost performance. All in all, we find that the performance difference is small enough not to be worrisome at this point, especially in view of the possible improvements we have already identified.
Test 8 (ZFS 2.3X)
Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB); no special cache manipulation.
This test needs to be analyzed further. We note that UFS will proactively free-behind read blocks. While this is a very responsible use of memory (give it back after use), it potentially impacts UFS re-read performance. While we’re happy to see ZFS performance on top, some investigation is warranted to make sure that ZFS does not overconsume memory in some situations.
Test 9 (UFS 2.3X)
128.00 MB worth of random 8K-aligned writes to a 1024.00 MB file, followed by fsync(); cached.
In this test we expect a rationale similar to that of Test 3 to be at play. The same cure should also apply.
Test 10 (draw)
1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, cached.
Both filesystems must issue and wait for a 2K I/O on each write. They both do this as efficiently as possible.
Test 11 (ZFS 5.8X)
1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB.
This test is similar to the previous one except for the 4 cooperating threads. ZFS coming out on top highlights a key feature of ZFS: the absence of a single-writer lock. UFS can only allow a single writing thread per file. The only exception is when directio is enabled, and then only under rather restrictive conditions. UFS with directio allows concurrent writers, with the implied restriction that it does not honor full POSIX semantics regarding write atomicity. ZFS, out of the box, allows concurrent writers without requiring any special setup and without giving up full POSIX semantics. All great news for simplicity of deployment and great database performance.
Test 12 (draw)
128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, pattern of 3 reads then a write to the last block read, random offset, cached.
Both filesystems perform appropriately. This test still requires analysis.
Test 13 (N/A)
5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached.
No results, due to the lack of control over ZFS file-level caching.
Test 14 (UFS 6.9X)
5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached; 4 threads.
This test inexplicably shows UFS on top. The UFS code can perform rather well here given that its filesystem cache is stored in the page cache; servicing reads from that cache can be made very scalable. We are just starting our analysis of the ZFS performance characteristics for this test. We have identified a serialization construct in the buffer management code: reclaiming the buffers into which to put the cached data is acting as a serial throttle. This is truly the only test where the ZFS performance disappoints, although there is no doubt that we will find a cure for this implementation issue.
THE TAKEAWAY
ZFS comes out on top in very many of our tests, often by a significant factor. Where UFS is ahead, we have a clear view of how to improve the ZFS implementation. The case of shared readers of a single file is the one that requires special attention.
Given the youth of the ZFS implementation, the performance picture presented in this paper shows that the ZFS design decisions are totally validated from a performance perspective.
FUTURE DIRECTIONS
Clearly, we should now expand the unit test coverage. We would like to study more metadata-intensive workloads. We also would like to see how ZFS features such as compression and RAID-Z perform. Other interesting studies could focus on CPU consumption and memory efficiency. We also need to find a solution for running the existing unit tests that require files not to be cached in the filesystem.
APPENDIX: THROUGHPUT MEASURES
Here are the raw throughput measures for each of the 14 unit tests.
Task ID | Description | ZFS latest+nv25 (MB/s) | UFS+nv25 (MB/s) | Winner
1 | open() and allocation of a 128.00 MB file with write(1024K) then close() | 486.01572 | 145.94098 | ZFS 3.4X
2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close() | 4.5637 | 0.86565 | ZFS 5.3X
3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB | 27.3327 | 50.09027 | UFS 1.8X
4 | Sequential read(32K) of a 1024.00 MB file, cached | 674.77396 | 612.92737 | ZFS 1.1X
5 | Sequential read(32K) of a 1024.00 MB file, uncached | 1756.57637 | 17.53705 | not comparable (ZFS caching not controlled)
6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | 2.20641 | 0.85497 | ZFS 2.6X
7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close() | 204.31557 | 257.22829 | UFS 1.3X
8 | Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB); no special cache manipulation | 698.18182 | 298.25243 | ZFS 2.3X
9 | 128.00 MB worth of random 8K-aligned writes to a 1024.00 MB file, followed by fsync(); cached | 42.75208 | 100.35258 | UFS 2.3X
10 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, cached | 0.117925 | 0.116375 | draw
11 | 1.00 MB worth of 2K writes to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB | 0.42673 | 0.07391 | ZFS 5.8X
12 | 128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, pattern of 3 reads then a write to the last block read, random offset, cached | 264.84151 | 266.78044 | draw
13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | 75.98432 | 0.11684 | not comparable (ZFS caching not controlled)
14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached; 4 threads | 56.38486 | 386.70305 | UFS 6.9X
OpenSolaris, ZFS