With special thanks to Chaoyue Xiong for her help in this work.
        
In this paper I'd like to review the performance data we have gathered comparing this initial release of ZFS (Nov 16, 2005) with the legacy Solaris UFS filesystem, which has been optimized beyond reason over the years. The data we will be reviewing is based on 14 unit tests designed to stress specific usage patterns of filesystem operations. Working with these well-contained usage scenarios greatly facilitates subsequent performance engineering analysis.

Our focus was to make a fair head-to-head comparison between UFS and ZFS, not to produce the biggest, meanest marketing numbers. Since ZFS is also a volume manager, we actually compared ZFS to a UFS/SVM combination. In the cases where ZFS underperforms UFS, we wanted to figure out why and how to improve ZFS.

We are also currently focusing on data-intensive operations. Metadata-intensive tests are being developed and we will report on those in a later study.

Looking ahead to our results, we find that of the 12 filesystem unit tests that were successfully run:
  •     ZFS outpaces UFS in 6 tests by a mean factor of 3.4
  •     UFS outpaces ZFS in 4 tests by a mean factor of 3.0
  •     ZFS equals UFS in 2 tests.

In this paper, we will take a closer look at the tests where UFS is ahead and propose ways to improve those numbers.

THE SYSTEM UNDER TEST

Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At this point we are not yet monitoring the CPU utilization of the different tests, although we plan to do so in the future. The storage is an insanely large 300-disk array; the disks themselves are rather old technology, small and slow 9 GB drives. None of the tests currently stresses the array very much; the idea was mostly to take the storage configuration out of the equation. Given the old-technology disks, the absolute throughput numbers are not necessarily of interest; they are presented in an appendix.

Every disk in our configuration is partitioned into 2 slices, and a simple SVM or zpool striped volume is made across all spindles. We then build a filesystem on top of the volume. All commands are run with default parameters. Both filesystems are mounted and we can run our test suite on either one.

Every test is rerun multiple times in succession; the tests are defined and developed to avoid variability between runs. Some of the current test definitions require that file data not be present in the filesystem cache. Since we currently do not have a convenient way to control this for ZFS, the results for those tests are omitted from this report.

THE FILESYSTEM UNIT TESTS

Here is the definition of the 14 data-intensive tests we have currently identified. Note that we are very open to new test definitions; if you know of a data-intensive application that uses a filesystem in a very different pattern, and there must be tons of them, we would dearly like to hear from you.

Test 1

This is the simplest way to create a file: we open/creat a file, then issue 1 MB writes until the file size reaches 128 MB; we then close the file.
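For concreteness, here is a minimal sketch of what such a unit test can look like; the file path, fill pattern, and error handling are illustrative assumptions of ours, not the actual harness code.

    /* Illustrative sketch of the Test 1 pattern (not the actual harness):
     * create a file, append 1 MB writes until it reaches 128 MB, close it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define WRITE_SZ  (1024 * 1024)          /* 1 MB per write() call */
    #define FILE_SZ   (128 * 1024 * 1024)    /* stop at 128 MB */

    int main(void)
    {
        static char buf[WRITE_SZ];
        off_t written = 0;
        int fd = open("/tank/test1.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 'a', sizeof(buf));
        while (written < FILE_SZ) {
            ssize_t n = write(fd, buf, sizeof(buf));
            if (n < 0) {
                perror("write");
                return 1;
            }
            written += n;
        }
        close(fd);                           /* Test 2 is the same pattern except
                                              * open() adds O_DSYNC and the test
                                              * uses 128K writes to a 5 MB file. */
        return 0;
    }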

Test 2

In this test we also create a new file, although here we work with a file opened with the O_DSYNC flag and we use 128K write system calls. This maps to some database file creation schemes.

Test 3

This test also relates to file creation, but with writes that are much smaller and of varying sizes. In this test, we create a 50 MB file using writes of a size picked randomly in the [1K,8K] range. The file is opened with default flags (no O_*SYNC), but every 10 MB of written data we issue an fsync() call for the whole file. This form of access can be used for log files that have data integrity requirements.
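A sketch of this pattern, under the same kind of illustrative assumptions as above (file path, seeding, error handling), might look like this:

    /* Illustrative sketch of the Test 3 pattern: append writes of a size
     * picked uniformly in [1K,8K] and fsync() after every 10 MB written,
     * until the file reaches 50 MB. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define FILE_SZ    (50 * 1024 * 1024)    /* 50 MB target size */
    #define SYNC_EVERY (10 * 1024 * 1024)    /* fsync() every 10 MB */

    int main(void)
    {
        static char buf[8 * 1024];           /* large enough for any write */
        off_t written = 0, since_sync = 0;
        int fd = open("/tank/test3.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        srand(1);                            /* fixed seed keeps runs repeatable */
        while (written < FILE_SZ) {
            size_t sz = 1024 + rand() % (7 * 1024 + 1);  /* uniform in [1K,8K] */
            if (write(fd, buf, sz) < 0) {
                perror("write");
                return 1;
            }
            written += sz;
            since_sync += sz;
            if (since_sync >= SYNC_EVERY) {
                fsync(fd);                   /* push the outstanding data to disk */
                since_sync = 0;
            }
        }
        close(fd);
        return 0;
    }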

Test 4

Moving now to a read test: we read a 1 GB file (assumed to be in cache) with 32K read system calls. This is a rather simple test to keep everybody honest.

Test 5

This is the same test as Test 4, but here the file is assumed not to be present in the filesystem cache. We currently have no control over this for ZFS, so we will not be reporting performance numbers for this test. This is a basic streaming read sequence that should test the readahead capability of a filesystem.

Test 6

Our previous write tests were allocating writes. In this test we verify the ability of a filesystem to rewrite over an existing file. We look at 32K writes to a file opened with O_DSYNC.

Test 7

Here we also test the ability to rewrite existing files. The write sizes are randomly picked in the [1K,8K] range. No special control over data integrity (no O_*SYNC, no fsync()).

Test 8

In this test we create a very large file (10 GB) with 1 MB writes, followed by 2 full-pass sequential reads. This test is still evolving, but we want to verify the ability of the filesystem to work with files whose size is close to, or larger than, the available free memory.

Test 9

In this test, we issue 8K writes at random 8K-aligned offsets in a 1 GB file. When 128 MB of data has been written, we issue an fsync().

Test 10

Here, we issue 2K writes at random (unaligned) offsets to a file opened with O_DSYNC.

Test 11

Same test as Test 10, but using 4 cooperating threads all working on a single file.

Test 12

Here we attempt to simulate a mixed read/write pattern. Working with an existing file, we loop through a pattern of 3 reads at 3 randomly selected 8K-aligned offsets, followed by an 8K write to the last block read.
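A minimal sketch of this read/read/read/write loop, again with an illustrative file path and our own error handling, could be:

    /* Illustrative sketch of the Test 12 pattern: 3 pread() calls at random
     * 8K-aligned offsets in an existing 1 GB file, then an 8K pwrite() to
     * the last block read, until 128 MB has been written. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK         (8 * 1024)
    #define FILE_SZ     (1024LL * 1024 * 1024)   /* existing 1 GB file */
    #define WRITE_TOTAL (128LL * 1024 * 1024)    /* stop after 128 MB written */
    #define NBLOCKS     (FILE_SZ / BLK)

    int main(void)
    {
        char buf[BLK];
        long long written = 0;
        int fd = open("/tank/test12.dat", O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        srand(1);
        while (written < WRITE_TOTAL) {
            off_t off = 0;
            for (int i = 0; i < 3; i++) {        /* 3 reads at random aligned offsets */
                off = (off_t)(rand() % NBLOCKS) * BLK;
                if (pread(fd, buf, BLK, off) < 0) {
                    perror("pread");
                    return 1;
                }
            }
            if (pwrite(fd, buf, BLK, off) < 0) { /* rewrite the last block read */
                perror("pwrite");
                return 1;
            }
            written += BLK;
        }
        close(fd);
        return 0;
    }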
 
Test 13

In this test we issue 2K pread() calls at random unaligned offsets. The file is asserted not to be in the cache. Since we currently have no such control for ZFS, we won't report data for this test.

Test 14

We have 4 cooperating threads (working on a single file) issuing 2K pread() calls at random unaligned offsets. The file is present in the cache.
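Such a multi-threaded reader can be sketched as follows; sharing one descriptor works because pread() carries its own offset. The file path and per-thread seeding are illustrative assumptions.

    /* Illustrative sketch of the Test 14 pattern: 4 threads share one file
     * descriptor and each issues 2K pread() calls at random unaligned
     * offsets until it has read 5 MB worth of data. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NTHREADS    4
    #define IO_SZ       (2 * 1024)               /* 2K reads */
    #define PER_THREAD  (5LL * 1024 * 1024)      /* 5 MB per thread */
    #define FILE_SZ     (1024LL * 1024 * 1024)   /* shared 1 GB file */

    static int fd;                               /* shared; pread() is offset-safe */

    static void *reader(void *arg)
    {
        char buf[IO_SZ];
        unsigned seed = (unsigned)(uintptr_t)arg;                /* per-thread seed */

        for (long long done = 0; done < PER_THREAD; done += IO_SZ) {
            off_t off = (off_t)(rand_r(&seed) % (FILE_SZ - IO_SZ));  /* unaligned */
            if (pread(fd, buf, IO_SZ, off) < 0) {
                perror("pread");
                break;
            }
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        fd = open("/tank/test14.dat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, reader, (void *)(uintptr_t)(i + 1));
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        close(fd);
        return 0;
    }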

THE RESULTS

We have a common testing framework to generate the performance data. Each test is written as a simple C program, and the framework is responsible for creating threads and files, timing the runs, and reporting. We are currently discussing merging this test framework with the Filebench suite. We regret that we cannot easily share the test code; however, the above descriptions should be sufficiently precise to allow someone to reproduce our data. In my mind, a simple 10- to 20-disk array and any small server should be enough to generate similar numbers. If anyone finds very different results, I would be very interested in knowing about it.
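For those attempting a reproduction, here is a rough sketch, under our own assumptions rather than the real framework code, of how a test body can be timed and turned into the MB/s figures reported below (gethrtime() is the Solaris high-resolution timer; run_test() is a hypothetical stand-in for any of the 14 test bodies):

    /* Rough sketch (an assumption of ours, not the actual framework) of how
     * a unit test is timed and reported as a throughput figure on Solaris. */
    #include <stdio.h>
    #include <sys/time.h>                        /* gethrtime() on Solaris */

    /* Hypothetical stand-in for one of the 14 test bodies; returns bytes moved. */
    static long long run_test(void)
    {
        /* ... the actual test code (see the descriptions above) goes here ... */
        return 128LL * 1024 * 1024;
    }

    int main(void)
    {
        hrtime_t start = gethrtime();            /* nanosecond timestamp */
        long long bytes = run_test();
        hrtime_t end = gethrtime();

        double secs = (double)(end - start) / 1e9;
        printf("%.5f MB/s\n", (double)bytes / (1024 * 1024) / secs);
        return 0;
    }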

Our framework reports all timing results as a throughput measure. Absolute throughput values are highly test-case dependent: a 2K O_DSYNC write will not have the same throughput as a 1 MB cached read, and some tests would be better described in terms of operations per second. However, since our focus is a relative ZFS to UFS/SVM comparison, we will concentrate on the delta in throughput between the 2 filesystems (for the curious, the full throughput data is posted in the appendix).

Drumroll….


Task ID   Description                                              Winning FS / Performance Delta

1         open() and allocation of a 128.00 MB file with           ZFS / 3.4X
          write(1024K), then close().

2         open(O_DSYNC) and allocation of a 5.00 MB file           ZFS / 5.3X
          with write(128K), then close().

3         open() and allocation of a 50.00 MB file with            UFS / 1.8X
          write() of size picked uniformly in [1K,8K],
          issuing fsync() every 10.00 MB.

4         Sequential read(32K) of a 1024.00 MB file, cached.       ZFS / 1.1X

5         Sequential read(32K) of a 1024.00 MB file, uncached.     no data

6         Sequential rewrite(32K) of a 10.00 MB file,              ZFS / 2.6X
          O_DSYNC, uncached.

7         Sequential rewrite() of a 1000.00 MB cached file,        UFS / 1.3X
          size picked uniformly in the [1K,8K] range,
          then close().

8         Create a file of size 1/2 of freemem using               ZFS / 2.3X
          write(1MB), followed by 2 full-pass sequential
          read(1MB). No special cache manipulation.

9         128.00 MB worth of random 8K-aligned write to a          UFS / 2.3X
          1024.00 MB file, followed by fsync(); cached.

10        1.00 MB worth of 2K write to a 100.00 MB file,           draw (UFS == ZFS)
          O_DSYNC, random offset, cached.

11        1.00 MB worth of 2K write to a 100.00 MB file,           ZFS / 5.8X
          O_DSYNC, random offset, uncached; 4 cooperating
          threads each writing 1 MB.

12        128.00 MB worth of 8K-aligned read & write to a          draw (UFS == ZFS)
          1024.00 MB file, pattern of 3 x read, then a
          write to the last read page, random offset,
          cached.

13        5.00 MB worth of pread(2K) per thread within a           no data
          shared 1024.00 MB file, random offset, uncached.

14        5.00 MB worth of pread(2K) per thread within a           UFS / 6.9X
          shared 1024.00 MB file, random offset, cached;
          4 threads.

As stated in the abstract:

  •     ZFS outpaces UFS in 6 tests by a mean factor of 3.4
  •     UFS outpaces ZFS in 4 tests by a mean factor of 3.0
  •     ZFS equals UFS in 2 tests.

The performance differences can be sizable; let's have a closer look at some of them.

PERFORMANCE DEBRIEF

Let's look at each test to try and understand the cause of the performance differences.

Test 1 (ZFS 3.4X)

    open() and allocation of a 128.00 MB file
    with write(1024K), then close().

This test is not fully analyzed. We note that in this situation UFS will regularly kick off some I/O from the context of the write system call; this occurs whenever a cluster of writes (typically of size 128K or 1 MB) has completed. The initiation of I/O by UFS slows down the process. ZFS, on the other hand, can zoom through the test at a rate much closer to a memory copy. The ZFS I/Os to disk are actually generated internally by the ZFS transaction group mechanism: every few seconds a transaction group comes along and flushes the dirty data to disk, and this occurs without throttling the test.

Test 2 (ZFS 5.3X)

    open(O_DSYNC) and allocation of a 5.00 MB file
    with write(128K), then close().

Here ZFS shows an even bigger advantage. Because of its design and complexity, UFS is actually somewhat limited in its capacity to write-allocate files in O_DSYNC mode: every new UFS write requires some disk block allocation, which must occur one block at a time when O_DSYNC is set. ZFS easily outperforms UFS on this test.

Test 3 (UFS 1.8X)

    open() and allocation of a 50.00 MB file with
    write() of size picked uniformly in [1K,8K],
    issuing fsync() every 10.00 MB.

Here ZFS pays for the advantage it had in Test 1. In this test, we issue very many writes to a file, and those are cached as the process races along. When the fsync() hits (every 10 MB of outstanding data, per the test definition), the FS must now guarantee that all the data has reached stable storage. Since UFS kicks off I/O more regularly, it has a smaller amount of data left to sync up when the fsync() hits; what saves the day for ZFS is that, for that leftover data, UFS slows down to a crawl. ZFS, on the other hand, has accumulated a large amount of data in the cache by the time the fsync() hits.

Fortunately ZFS is able to issue much larger I/Os to disk and catches up some of the lag it has built up. But the final result shows that UFS wins the horse race (at least in this specific test); the details of the test will influence the final result here.

However, the ZFS team is working on ways to make fsync() much better. We actually have 2 possible avenues of improvement. We can borrow from the UFS behavior and kick off some I/Os when too much outstanding data is cached. UFS does this at a very regular interval, which does not look right either, but clearly, if a file has many MB of outstanding dirty data, sending them off to disk might be beneficial. On the other hand, keeping the data in cache is interesting when the write pattern is such that the same file offsets are written and rewritten over and over again; sending the data to disk is wasteful if that data is subsequently rewritten shortly after. Basically, the FS must place a bet on whether a future fsync() will occur before a new write to the block. We cannot win this bet on all tests all the time.

Given that fsync() performance is important, I would like to see us asynchronously kick off I/O once we reach many MB of outstanding data for a file. This is nevertheless debatable.

Even if we don't do this, we have another area of improvement that the ZFS team is looking into. When the fsync() finally hits the fan, even with a lot of outstanding data, the current implementation does not issue disk I/Os very efficiently. The proper way to do this is to kick off all required I/Os and then wait for them all to complete. Currently, in the intricacies of the code, some I/Os are issued and waited upon one after the other. This is not yet optimal, but we certainly should see improvements coming in the future, and I truly expect ZFS fsync() performance to be ahead all the time.
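The fix itself is internal to the ZFS kernel code, but the batching idea can be illustrated at user level with POSIX AIO: lio_listio() with LIO_WAIT queues every write in a batch at once and returns only when all of them have completed, rather than issuing and waiting for the I/Os one at a time. The file path and sizes below are illustrative assumptions.

    /* Illustration only: issue a whole batch of writes, then wait for all
     * of them, instead of issuing and waiting one I/O at a time. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NIO    8
    #define IO_SZ  (128 * 1024)

    int main(void)
    {
        static char buf[NIO][IO_SZ];
        struct aiocb cbs[NIO];
        struct aiocb *list[NIO];
        int fd = open("/tank/batch.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(cbs, 0, sizeof(cbs));
        for (int i = 0; i < NIO; i++) {
            cbs[i].aio_fildes     = fd;
            cbs[i].aio_buf        = buf[i];
            cbs[i].aio_nbytes     = IO_SZ;
            cbs[i].aio_offset     = (off_t)i * IO_SZ;
            cbs[i].aio_lio_opcode = LIO_WRITE;
            list[i] = &cbs[i];
        }
        if (lio_listio(LIO_WAIT, list, NIO, NULL) < 0) { /* issue all, wait for all */
            perror("lio_listio");
            return 1;
        }
        close(fd);
        return 0;
    }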
   
Test 4 (ZFS 1.1X)

    Sequential read(32K) of a 1024.00 MB file, cached.

A rather simple test, running mostly at memory-copy speed between the filesystem cache and the user buffer. The contest is almost a wash, with ZFS slightly on top. Not yet analyzed.

Test 5 (N/A)

    Sequential read(32K) of a 1024.00 MB file, uncached.

No results, due to the lack of control over ZFS file-level caching.

Test 6 (ZFS 2.6X)

    Sequential rewrite(32K) of a 10.00 MB file,
    O_DSYNC, uncached.

Thanks to its write-anywhere layout (in the spirit of WAFL, the Write Anywhere File Layout), a ZFS rewrite is not very different from an initial write, and ZFS seems to perform very well on this test. Presumably UFS performance is hindered by the need to synchronize the cached data. Result not yet analyzed.

Test 7 (UFS 1.3X)

    Sequential rewrite() of a 1000.00 MB cached file,
    size picked uniformly in the [1K,8K] range,
    then close().

In this test we are not timing any disk I/O; this is merely a test of unrolling the filesystem code for 1K to 8K cached writes. The UFS codepath wins through its simplicity and years of performance tuning. The ZFS codepath here somewhat suffers from its youth. Understandably, the current ZFS implementation is very cleanly layered, and we can easily imagine that the locking strategies of the different layers are independent of one another. We have found (thanks, DTrace) that a small ZFS cached write uses about 3 times as many lock acquisitions as an equivalent UFS call. Mutex rationalization within or between layers certainly seems to be an area of potential improvement for ZFS that would help this particular test. We also realised that the very clean, layered code implementation causes the callstack to take very many elevator rides up and down between layers. On a SPARC CPU, going 6 or 7 layers deep in the callstack causes a spill/fill trap, plus one additional trap for every additional floor travelled. Fortunately there are very many places where ZFS will be able to merge different functions into a single one, or possibly exploit tail calls, to regain some of the lost performance. All in all, we find that the performance difference is small enough not to be worrisome at this point, especially in view of the possible improvements we have already identified.

Test 8 (ZFS 2.3X)

    Create a file of size 1/2 of freemem using
    write(1MB), followed by 2 full-pass sequential
    read(1MB). No special cache manipulation.

This test needs to be analyzed further. We note that UFS will proactively free-behind read blocks. While this is a very responsible use of memory (give it back after use), it potentially impacts UFS re-read performance. While we're happy to see ZFS performance on top, some investigation is warranted to make sure that ZFS does not overconsume memory in some situations.

Test 9 (UFS 2.3X)

    128.00 MB worth of random 8K-aligned write to a
    1024.00 MB file, followed by fsync(); cached.

In this test we expect a rationale similar to that of Test 3 to take effect. The same cure should also apply.

Test 10 (draw)

    1.00 MB worth of 2K write to a 100.00 MB file,
    O_DSYNC, random offset, cached.

Both filesystems must issue and wait for a 2K I/O on each write. They both do this as efficiently as possible.

Test 11 (ZFS 5.8X)

    1.00 MB worth of 2K write to a 100.00 MB file,
    O_DSYNC, random offset, uncached; 4 cooperating
    threads each writing 1 MB.

This test is similar to the previous one, except for the 4 cooperating threads. ZFS being on top highlights a key ZFS feature: the absence of a single-writer lock. UFS can only allow a single writing thread per file; the only exception is when directio is enabled, and then only under rather restrictive conditions. UFS with directio allows concurrent writers, with the implied restriction that it does not honor full POSIX semantics regarding write atomicity. ZFS, out of the box, allows concurrent writers without requiring any special setup or giving up full POSIX semantics. All great news for simplicity of deployment and great database performance.

Test 12 (draw)

    128.00 MB worth of 8K-aligned read & write to a
    1024.00 MB file, pattern of 3 x read, then a
    write to the last read page, random offset,
    cached.

Both filesystems perform appropriately. This test still requires analysis.

Test 13 (N/A)

    5.00 MB worth of pread(2K) per thread within a
    shared 1024.00 MB file, random offset, uncached.

No results, due to the lack of control over ZFS file-level caching.

Test 14 (UFS 6.9X)

    5.00 MB worth of pread(2K) per thread within a
    shared 1024.00 MB file, random offset, cached;
    4 threads.

This test inexplicably shows UFS on top. The UFS code can perform rather well here given that the FS cache is stored in the page cache; servicing reads from that cache can be made very scalable. We are just starting our analysis of the performance characteristics of ZFS for this test. We have identified a serialization construct in the buffer management code: reclaiming the buffers into which to put the cached data acts as a serial throttle. This is truly the only test where the ZFS performance disappoints, although there is no doubt that we will find a cure for this implementation issue.

THE TAKEAWAY

ZFS is on top in very many of our tests, often by a significant factor. Where UFS is ahead, we have a clear view of how to improve the ZFS implementation. The case of shared readers of a single file is the test that requires special attention.

Given the youth of the ZFS implementation, the performance picture presented in this paper shows that the ZFS design decisions are fully validated from a performance perspective.

FUTURE DIRECTIONS

Clearly, we should now expand the unit test coverage. We would like to study more metadata-intensive workloads. We also would like to see how ZFS features such as compression and RaidZ perform. Other interesting studies could focus on CPU consumption and memory efficiency. We also need to find a solution for running the existing unit tests that require the files not to be cached in the filesystem.

APPENDIX: THROUGHPUT MEASURES

Here are the raw throughput measures for each of the 14 unit tests.

Task ID   Description                                        ZFS latest+nv25 (MB/s)   UFS+nv25 (MB/s)   Result

1         open() and allocation of a 128.00 MB file          486.01572                145.94098         ZFS 3.4X
          with write(1024K), then close().

2         open(O_DSYNC) and allocation of a 5.00 MB          4.5637                   0.86565           ZFS 5.3X
          file with write(128K), then close().

3         open() and allocation of a 50.00 MB file with      27.3327                  50.09027          UFS 1.8X
          write() of size picked uniformly in [1K,8K],
          issuing fsync() every 10.00 MB.

4         Sequential read(32K) of a 1024.00 MB file,         674.77396                612.92737         ZFS 1.1X
          cached.

5         Sequential read(32K) of a 1024.00 MB file,         1756.57637               17.53705          no valid comparison
          uncached.

6         Sequential rewrite(32K) of a 10.00 MB file,        2.20641                  0.85497           ZFS 2.6X
          O_DSYNC, uncached.

7         Sequential rewrite() of a 1000.00 MB cached        204.31557                257.22829         UFS 1.3X
          file, size picked uniformly in the [1K,8K]
          range, then close().

8         Create a file of size 1/2 of freemem using         698.18182                298.25243         ZFS 2.3X
          write(1MB), followed by 2 full-pass
          sequential read(1MB). No special cache
          manipulation.

9         128.00 MB worth of random 8K-aligned write         42.75208                 100.35258         UFS 2.3X
          to a 1024.00 MB file, followed by fsync();
          cached.

10        1.00 MB worth of 2K write to a 100.00 MB           0.117925                 0.116375          draw
          file, O_DSYNC, random offset, cached.

11        1.00 MB worth of 2K write to a 100.00 MB           0.42673                  0.07391           ZFS 5.8X
          file, O_DSYNC, random offset, uncached;
          4 cooperating threads each writing 1 MB.

12        128.00 MB worth of 8K-aligned read & write         264.84151                266.78044         draw
          to a 1024.00 MB file, pattern of 3 x read,
          then a write to the last read page, random
          offset, cached.

13        5.00 MB worth of pread(2K) per thread within       75.98432                 0.11684           no valid comparison
          a shared 1024.00 MB file, random offset,
          uncached.

14        5.00 MB worth of pread(2K) per thread within       56.38486                 386.70305         UFS 6.9X
          a shared 1024.00 MB file, random offset,
          cached; 4 threads.