Losing My Fear of ZFS


The original ZFS vision was enormously ambitious: End the Suffering, Free Your Mind, providing simplicity, power, safety, and speed. As is common with new technologies, this ambitious vision was not completely fulfilled in the initial versions. Initial usage showed that although it did have useful and convenient features, for some workloads, such as the memory-intensive SPEC CPU benchmarks, there were reasons for concern. Now that ZFS has had time to grow, more of the vision is fulfilled. This article, told from the personal perspective of one performance engineer, describes some of the improvements, and provides examples of use.

Rumors: Does This Sound Familiar?

Have you heard some of these about ZFS?

"ZFS? You can't use that - it will eat all your memory!"

"ZFS? That's a software disk striping/RAID solution. You don't want that. You want hardware RAID."

"ZFS? Be afraid."

Can I Please Just Forget About IO? (NO)

As a performance engineer, my primary concern is for the SPEC CPU benchmarks - which intentionally do relatively little IO. Usually.

To a first approximation, IO can be ignored in this context. Usually.

To a first approximation, it's fine if my ZFS "knowledge" is limited to rumors / innuendo as quoted above. Until....

Until there comes the second approximation, the re-education, and the beginner loses some fear of ZFS.

Why a SPEC CPU Benchmarker Might Care About IO

Although the SPEC CPU benchmarks intentionally try to avoid doing IO, some amount inevitably remains. An analysis of the IO in the benchmarks shows that one benchmark, 450.soplex, reads 300 MB of data. Most of that comes from a single 1/4 GB file, ref.mps, which is read during the second invocation of the benchmark.

Given the speed of today's disk drives, is that a lot? Using an internal drive (Seagate ST973402SSUN72G), a T5220 with a Niagara 2 processor reads the 1/4 GB file at about 50 MB/sec. It takes about 5.5 seconds to read one copy of the file, which is a tiny amount of time compared to how long it takes to run one copy of the actual benchmark - about 3000 seconds.

But 1/4 GB becomes a concern when one takes into account that we do not, in fact, read one copy of the file when testing SPEC CPU2006, because we are interested in the SPECrate metrics, which run multiple copies of the benchmarks. On a single-chip T5220 system, which supports 64 threads, 63 copies of the benchmark are run. An 8-chip M5000, which supports 8 threads per chip, also runs 63 copies.

On such systems, it is not uncommon to see 10 to 30 minutes of time when the CPU is sitting idle - which is not the desired behavior for a CPU benchmark.

For example, on the M5000, as shown in the graph below, it takes about 18 minutes before the CPU reaches the desired 99% User time. During those 18 minutes, a single disk with ufs on it is, according to iostat, 100% busy. It reads about 16 MB/sec, doing about 725 reads/sec.
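The 18 minutes can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes a POSIX shell with awk, takes the 267 MB size quoted for ref.mps elsewhere in this article, and ignores every other file the benchmark reads. The function name read_minutes is made up for illustration.

```shell
# Back-of-the-envelope estimate: minutes needed to read N copies of a file.
# Usage: read_minutes COPIES FILE_MB RATE_MB_PER_SEC
read_minutes() {
    awk -v n="$1" -v mb="$2" -v rate="$3" \
        'BEGIN { printf "%.1f\n", (n * mb) / rate / 60 }'
}

# 63 copies of the 267 MB ref.mps at the observed 16 MB/sec (M5000, ufs)
read_minutes 63 267 16
```

This prints 17.5, close to the observed 18 minutes, which suggests that the idle time is dominated by reading this one file.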

Note that in this graph, and all other graphs in this article, the program being tested is only one of the benchmarks drawn from a larger suite, and only one of its inputs. Therefore, no statements in this article should be taken as indicative of "compliant" (aka "reportable") runs of an entire suite. SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPEC and the CPU benchmarks, see www.spec.org/cpu2006.

Graph 1: ufs takes 18 min on M5000

ZFS Makes its Dramatic Entrance

Although this tester has heard concerns raised by people who have passed along rumors of ZFS limitations, there have been other teachers who have sung its praises, including one who has pointed out that 450.soplex's 1/4 GB input file is highly compressible, going from 267 MB to 20 MB with gzip.

The best IO is the IO that you never have to do at all. By using the ZFS compression feature, we can make 90% of the IO go away:

      $ zpool create -f tank c0t1d0
      $ zfs create tank/spec-zfs-gzip
      $ zfs set compression=gzip tank/spec-zfs-gzip
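As a rough check on the "90%" figure, the saving can be computed from the two file sizes quoted above. This is a sketch under assumptions: it needs a POSIX shell with awk, the function name pct_saved is made up for illustration, and the sizes come from running gzip on the whole file. ZFS compresses each record separately, so the ratio achieved on disk may differ; on a live system, zfs get compressratio tank/spec-zfs-gzip reports the actual value.

```shell
# Percentage of IO avoided when a file shrinks from ORIG_MB to COMPRESSED_MB.
# Usage: pct_saved ORIG_MB COMPRESSED_MB
pct_saved() {
    awk -v o="$1" -v c="$2" 'BEGIN { printf "%.1f\n", (o - c) * 100 / o }'
}

# ref.mps: 267 MB uncompressed, about 20 MB after gzip
pct_saved 267 20
```

This prints 92.5, so "90% of the IO" is, if anything, slightly conservative for this file.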

Graph 2: zfs pegs the CPU almost immediately

The improvement from ZFS gzip compression is indeed dramatic.

The careful reader may note that there are actually two lines on the far left: one measured with Solaris 10 Update 7, the other with Solaris Express. The version of Solaris did not appear to be a significant variable for the tests reported in this article, as can be seen by the fact that the two lines are right on top of each other.

What About Memory Consumption?

Although ZFS has done a great job above, what about its memory consumption? Concerns have been raised that it is memory-hungry, and indeed the "Best Practices" Guide plainly says that it will use all the memory on the system if it thinks it can get away with it:

The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC relinquishes memory.

ZFS memory usage is an important concern when running the SPEC CPU benchmarks, which are designed to stress the CPU, the memory system, and the compiler. Some of the benchmarks in the suite use just under 1 GB of physical memory, and it is desirable to run (n - 1) copies on a system with (n) threads and (n) GB of memory. Fortunately, there is a tuning knob available to control the size of the ARC: set zfs:zfs_arc_max = 0x(size) can be added to /etc/system.
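The value is given in bytes, in hexadecimal, which is easy to get wrong by hand. Here is a small sketch for generating the line, assuming a POSIX shell (ksh or bash; the legacy Bourne shell lacks arithmetic expansion). The function name arc_max_line and the 1 GB cap are illustrative only; this article does not state the ARC size actually used in the tests.

```shell
# Print an /etc/system line that caps the ZFS ARC at a given size in megabytes.
# Usage: arc_max_line SIZE_MB
arc_max_line() {
    printf 'set zfs:zfs_arc_max = 0x%x\n' $(( $1 * 1024 * 1024 ))
}

# Example: cap the ARC at 1 GB (1024 MB); the size is illustrative only.
arc_max_line 1024
```

This prints set zfs:zfs_arc_max = 0x40000000. Settings in /etc/system are read at boot, so the cap takes effect after the next reboot.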

The tests reported on this page all use a limited ARC cache.

It should also be noted that all tests are done after a fresh reboot, so presumably the ARC cache is not contributing to the reported performance. More details about methods may be found at the end of the article.

ZFS on T5440: Good, But Not As Dramatic

Although the above simple commands are enough to remove the IO idle time on the M5000, for the 4-chip T5440 there is a bigger problem: this system supports 256 threads, and 255 copies of the benchmark are run. Therefore, it needs to quickly inhale on the order of 64 GB.

A somewhat older RAID system was made available for this test: an SE3510 with 12x 15K 72GB drives. Using this device with ufs, it takes 30 minutes before the system hits the maximum user time, as shown by the line on the right in the graph below:

Graph 3: ufs takes about 30 min, zfs about 15 min

In the ufs test above, the SE3510 is configured as 12x drives in a RAID-5 logical drive, with a simple ufs filesystem (newfs -f 8192 /dev/dsk/c2t40d0s6). Despite the large number of drives, the SE3510 sustains a steady read rate of only about 45 MB/sec, processing about 3000 IO/sec according to iostat. (Aside: the IO expert may question why the hardware RAID provides only 45 MB/sec, but please bear in mind we are following the path of the IO beginner here. This topic is revisited below.)
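The ufs numbers above hang together reasonably well, as a rough calculation shows. This is a sketch only: it assumes a POSIX shell with awk, uses the 267 MB size of ref.mps quoted earlier, and ignores all other IO. The function name ufs_estimate is made up for illustration.

```shell
# Rough model of a rate run that must read COPIES x FILE_MB at RATE_MB/sec
# while the device handles IOPS requests per second.
# Usage: ufs_estimate COPIES FILE_MB RATE_MB_PER_SEC IOPS
ufs_estimate() {
    awk -v n="$1" -v mb="$2" -v rate="$3" -v iops="$4" 'BEGIN {
        total = n * mb                      # MB to read in total
        printf "%d MB total, %.1f min at %d MB/sec, %.1f KB per IO\n",
               total, total / rate / 60, rate, rate * 1024 / iops
    }'
}

# T5440: 255 copies of the 267 MB file, 45 MB/sec, about 3000 IO/sec
ufs_estimate 255 267 45 3000
```

This prints "68085 MB total, 25.2 min at 45 MB/sec, 15.4 KB per IO". The estimate is in the same ballpark as the roughly 30 minutes observed, and the average request is only about 15 KB.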

The zfs file system reads about 16 MB/sec, doing about 4500 IO/sec, but takes less than 1/2 as long to peg the CPU, since it is reading compressed data.

The zfs file system also used an SE3510 with SUN72G 15k RPM drives. On that unit, 12 individual "NRAID" drives were created, and made visible to the host as 12 separate units. Then, 10 of them were strung together as zfs RAID-Z using:

   # zpool create zf-raidz10 raidz \
     c3t40d0  c3t40d1  c3t40d2 c3t40d3  c3t40d4  \
     c3t40d5  c3t40d6  c3t40d7 c3t40d8  c3t40d9

   # zpool status       
      pool: zf-raidz10
     state: ONLINE
     scrub: none requested

           NAME         STATE     READ WRITE CKSUM
           zf-raidz10   ONLINE       0     0     0
             raidz1     ONLINE       0     0     0
               c3t40d0  ONLINE       0     0     0
               c3t40d1  ONLINE       0     0     0
               c3t40d2  ONLINE       0     0     0
               c3t40d3  ONLINE       0     0     0
               c3t40d4  ONLINE       0     0     0
               c3t40d5  ONLINE       0     0     0
               c3t40d6  ONLINE       0     0     0
               c3t40d7  ONLINE       0     0     0
               c3t40d8  ONLINE       0     0     0
               c3t40d9  ONLINE       0     0     0

Compression was enabled at a later time, but before the experiment shown above. The setting can be confirmed with:

      $ zfs list -o compression zf-raidz10

Why Is the T5440 Improvement Not As Dramatic As the M5000?

The improvement from zfs is helpful to the T5440, but unlike the M5000, nearly 15 minutes of clock time is spent on IO. Let's look at some statistics from iostat:

$ iostat -xncz 30
 us sy wt id
  8  4  0 88
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
  469.3    0.0 1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
  401.8    0.0 1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
  471.1    0.0 1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
  416.1    0.0 1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
  452.1    0.0 1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
  417.8    0.0 1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
  461.0    0.0 1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
  418.9    0.0 1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
  433.6    0.0 1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
  420.0    0.0 1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0

A kind ZFS expert notes that "with RAID-Z the disks are saturated delivering >400 iops. The problem of RAID-Z is that those iops carry small amount of data and throughput is low." For more information, see this popular reference: https://blogs.oracle.com/roch/entry/when_to_and_not_to.
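The expert's point can be read directly off the iostat sample above by totalling the r/s and kr/s columns for the ten RAID-Z disks. The following is a minimal sketch, assuming a POSIX shell with awk and the column layout of iostat -xn shown above (r/s in column 1, kr/s in column 3, device name last). The function name summarize_reads is made up for illustration.

```shell
# Sum reads/sec and KB read/sec over the c3t40d* disks in iostat -xn output,
# then report the average size of a single read.
summarize_reads() {
    awk '$NF ~ /^c3t40d/ { r += $1; kr += $3 }
         END { printf "%.1f reads/sec, %.1f KB/sec, %.1f KB per read\n", r, kr, kr / r }'
}

# The device lines from the sample above (c0t0d0 is the boot disk and is skipped)
summarize_reads <<'EOF'
    0.0    0.0    2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
  469.3    0.0 1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
  401.8    0.0 1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
  471.1    0.0 1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
  416.1    0.0 1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
  452.1    0.0 1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
  417.8    0.0 1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
  461.0    0.0 1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
  418.9    0.0 1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
  433.6    0.0 1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
  420.0    0.0 1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0
EOF
```

This prints "4361.7 reads/sec, 18336.2 KB/sec, 4.2 KB per read". Each disk is busy delivering more than 400 reads per second, but each read carries only about 4 KB, so the ten disks together move only about 18 MB/sec.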

A secondary reason might be that as the reads are done, ZFS is decompressing the gzip'd data on a system where single thread performance is much slower than the one in Graph #2. On the M5000, 'gunzip ref.mps' requires about 2 seconds of CPU time; on the T5440, about 7 seconds. It should be emphasized that this is only a secondary concern for the read statistics described in this article, although it can become more important for write workloads, since compression is harder than decompression. Doing 'gzip ref.mps' takes ~12 seconds on the M5000, and ~51 seconds on the T5440. Furthermore, although the T5440 has 256 threads available, as of Solaris 10 s10s_u7, and Solaris Express snv_112, it is only willing to spend 8 threads doing gzip/gunzip operations. (This limitation may change in a future version of Solaris.)

Solution: Mirrors, No Gzip

The kind ZFS expert suggested trying mirrored drives without gzip. When this is done, the %b (busy) time, which is about 90% in the iostat report just above, changes to 98-100%. The %w (queue non-empty) time, which shows wide variability just above, also pushes 90-100%. Because we are reading much more data, elapsed time is actually slower - the red line in the graph below:

Graph 4: uncompressed mirrors slower than gzip/raidz, until we have 24 drives

Adding 12 more drives, configured as 8x three-way mirrors, does the trick: the leftmost line shows the desired slope. We spend about 3-4 minutes reading the file, an acceptable amount given that the benchmark as a whole runs for more than 120 minutes.

The file system for the leftmost line was created using:

      # zpool create dev8-m3-ngz \
      > mirror c2t40d0  c2t40d1     c3t40d0 \
      > mirror c2t40d2  c2t40d3     c3t40d1 \
      > mirror c2t40d4  c2t40d5     c3t40d2 \
      > mirror c2t40d6  c2t40d7     c3t40d3 \
      > mirror c2t40d8              c3t40d4  c3t40d5 \
      > mirror c2t40d9              c3t40d6  c3t40d7 \
      > mirror c2t40d10             c3t40d8  c3t40d9 \
      > mirror c2t40d11             c3t40d10 c3t40d11

The command creates 3-way mirrors, splitting each mirror across the two available controllers (c2 and c3). There are 8 of these 3-way mirrors, and zfs will dynamically stripe data across the 8 mirrors.

Were These Tests Fair?

The hardware IO expert may be bothered by the data presented here for the RAID-5 ufs configuration. Why would the hardware RAID system, with 12x drives, deliver only 45 MB/sec? In addition, it may seem odd that the tests use a RAID device which is now 5 years old, and compare it versus contemporary ZFS.

This is a fair point. In fact, a more modern RAID device has been observed delivering 97 MB/sec to 450.soplex, although with a very different system under test.

On the other hand, it should be emphasized that all the T5440 tests reported in this article used SE3510/SUN72G/15k. For the ufs tests, the SE3510 on-board software did the RAID-5 work. For the zfs tests, the SE3510 simply presented its disks to the Solaris system as 12 separate ("NRAID") logical units, and zfs did the RAID-Z and mirroring work.

Could there be something wrong with the particular SE3510 used for ufs? That seems unlikely. Although Graph 3 compares two different SE3510s (both connected to the same HBA, both configured with SUN72G 15k drives), a later test repeated the RAID-5 run on the exact same SE3510 unit as had been used for zfs. The time did not improve.

Is it possible that the SE3510 was mis-configured? Maybe. The author does not claim to be an IO expert, and, in particular, relied on the SE3510 menu system, not its command line interface (sccli). The menus provide limited access to the disk block size setting, and the tester did not at first realize that the disk block size depends on another parameter, located elsewhere in the menus.

For this particular controller, default block sizes are controlled indirectly by whether that parameter is set to "Yes" or "No". Changing it to "No" makes the default block size larger (128 KB rather than 32 KB). Once this was discovered, various tests were repeated. The hardware RAID-5 test was repeated with explicit selection of a larger size; however, it did not improve. On the other hand, the NRAID devices, controlled by zfs, did improve.

Finally, in order to isolate any overhead from RAID-5, the SE3510 was configured as 12 x drives in a RAID-0 stripe (256 KB stripe size). The time required to start 450.soplex was still over 30 minutes.


As usual, your mileage may vary depending on your workload. This is especially true for IO workloads.

Summary / Basic Lessons

Some basic lessons about ZFS emerge:

1) ZFS can be easily taught not to hog memory.

2) Selecting gzip compression can be a big win, especially on systems with relatively faster CPUs.

3) Setting up mirrored drives with dynamic striping is straightforward.

4) ZFS is not so scary, after all.

Notes on Methods

During an actual test of a "reportable" run of SPEC CPU2006, file caches are normally not effective for 450.soplex, because its data files are set up many hours prior to their use, with many intervening programs competing for memory usage. Therefore, for all tests reported here, it was important to avoid unwanted file caching effects that would not be present in a reportable run, which was accomplished as summarized below:

    runspec -a setup --rate 450.soplex
    cd 450.soplex/run/run*000
    specinvoke -nnr > doit.sh
    # edit doit.sh, changing each 'sh dobmk' to 'sh dobmk &'
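The last step, putting each 'sh dobmk' into the background, was done by hand for this article, but it could also be scripted. Here is a sketch using sed, assuming a POSIX shell. The function name background_dobmk and the two sample input lines are hypothetical; the real contents of doit.sh are whatever specinvoke -nnr generates.

```shell
# Append " &" to every 'sh dobmk' so that all copies start in the background.
# Reads a script on stdin and writes the modified script to stdout.
# Note: not idempotent; running it twice would append a second " &".
background_dobmk() {
    sed 's/sh dobmk/sh dobmk \&/'
}

# Demonstration on two made-up lines standing in for doit.sh
printf 'cd run_0000\nsh dobmk\n' | background_dobmk
```

In practice this would be used as background_dobmk < doit.sh > doit-bg.sh, keeping the original file for comparison.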

The tests noted as Solaris 10 used:

      # head -1 /etc/release
                          Solaris 10 5/09 s10s_u7wos_08 SPARC

The tests noted as SNV used:

      # head -1 /etc/release
                       Solaris Express Community Edition snv_112 SPARC

The tests on the M5000 used 72GB 10K RPM disks. The ufs disk was a FUJITSU MBB2073RCSUN72G (SAS); the zfs disk was a SEAGATE ST973401LSUN72G (Ultra320 SCSI). The tests on the T5440 used 72GB 15K RPM disks: FUJITSU MAU3073FCSUN72 (Fibre Channel).


My IO teachers include Senthil Ramanujam, Cloyce Spradling, and Roch Bourbonnais, none of whom should be blamed for this beginner's ignorance. Karsten Guthridge was the first to point out the usefulness of ZFS gzip compression for 450.soplex.


Very interesting, I'd be interested in seeing as well if using lzjb (the default ZFS compression) provided a different result. It isn't as aggressive as gzip can be in getting the data smaller but it isn't as CPU intensive either.

Posted by Darren Moffat on May 28, 2009 at 12:20 AM EDT #

Interesting article.

I'm curious why you didn't try using lzjb for compression as well. I've seen benchmarks showing that its CPU time is almost negligible (and it obviously uses less CPU than gzip).

I've never seen LVM do well with software raid, and would argue you're probably getting worse performance by trying this.

Posted by Eric on June 22, 2009 at 07:20 AM EDT #

Great piece of work, this answers a number of questions I had over the performance of ZFS.

One question is why did the MAUs on the T2+ chip not have more of an impact on performance? Would the ZFS compression load be automatically offloaded to the MAU?

Posted by Paul Thomson on July 20, 2009 at 12:22 AM EDT #

Just inherited a 3510 and setup ZFS. So my googling found your blog post.

I did all your same permutations of NRAID, HW RAID 5, RAID 0 on the 3510, messing with 32k vs 128k blocksize (& zfs recordsize) while testing zfs throughput with iozone on the Solaris host (v240 with 16GB RAM).

I'm getting the same 38-45mb/sec write throughput. What should I be getting? Any tips for improvement?

Posted by svrocket on October 19, 2009 at 06:48 PM EDT #

" ZFS can be easily taught not to hog memory."

I am doing some evaluation on ZFS.

I limited the ARC size at /etc/system as
set zfs:zfs_arc_max = 0x20000000
set zfs:zfs_arc_min = 0x10000000

The system has OpenSolaris 111b 12 GB of RAM and 12 CPUs.

The test consists of 10'000 files (threads) @ 1 MB doing 4K random WRITES.

This is what I observe:
mdb reports 1.5 GB of ZFS file data
kstat reports 1 GB of ARC used (even though the limit is set to 512 MB; not sure why kstat reports 536 as the target max).

I also see 120 MB/s of back end READS and only 0.2 MB/s writes - this is a RANDOM WRITES.

It seems that the limit is not a real limit ... unless I am missing something.

It also seems that the kernel module can grow as well - not sure how to set limits there.

If I do the same with only 1000 files open, I am getting 650 MB of ARC used and about 5 MB/s IO. My back end supports 80 MB/s random I/O at 128 KB and about 50 MB/s at 4 K.

So far I see that the memory usage is unpredictable in the corner cases - and this is really where you want to set the limits.

Posted by Mike on April 02, 2010 at 04:13 AM EDT #
