Losing My Fear of ZFS
By jhenning on May 26, 2009
The original ZFS vision was enormously ambitious: End the Suffering, Free Your Mind, providing simplicity, power, safety, and speed. As is common with most new technologies, this ambitious vision was not completely fulfilled in the initial versions. Initial usage showed that although it did have useful and convenient features, for some workloads, such as the memory-intensive SPEC CPU benchmarks, there were reasons for concern. Now that ZFS has had time to grow, more of the vision is fulfilled. This article, told from the personal perspective of one performance engineer, describes some of the improvements, and provides examples of use.
Rumors: Does This Sound Familiar?
Have you heard some of these about ZFS?
"ZFS? You can't use that - it will eat all your memory!"
"ZFS? That's a software disk striping/RAID solution. You don't want that. You want hardware RAID."
"ZFS? Be afraid."
Can I Please Just Forget About IO? (NO)
As a performance engineer, my primary concern is for the SPEC CPU benchmarks - which intentionally do relatively little IO. Usually.
To a first approximation, IO can be ignored in this context. Usually.
To a first approximation, it's fine if my ZFS "knowledge" is limited to rumors / innuendo as quoted above. Until....
Until there comes the second approximation, the re-education, and the beginner loses some fear of ZFS.
Why a SPEC CPU Benchmarker Might Care About IO
Although the SPEC CPU benchmarks intentionally try to avoid doing IO, some amount inevitably remains. An analysis of the IO in the benchmarks shows that one benchmark, 450.soplex, reads 300 MB of data. Most of that comes from a single 1/4 GB file, ref.mps, which is read during the second invocation of the benchmark.
Given the speed of today's disk drives, is that a lot? Using an internal drive (Seagate ST973402SSUN72G), a T5220 with a Niagara 2 processor reads the 1/4 GB file at about 50 MB/sec. It takes about 5.5 seconds to read one copy of the file, which is a tiny amount of time compared to how long it takes to run one copy of the actual benchmark - about 3000 seconds.
But 1/4 GB becomes a concern when one takes into account that we do not, in fact, read one copy of the file when testing SPEC CPU2006, because we are interested in the SPECrate metrics, which run multiple copies of the benchmarks. On a single-chip T5220 system, which supports 64 threads, 63 copies of the benchmark are run. An 8-chip M5000, which supports 8 threads per chip, also runs 63 copies.
On such systems, it is not uncommon to see 10 to 30 minutes of time when the CPU is sitting idle - which is not the desired behavior for a CPU benchmark.
For example, on the M5000, as shown in the graph below, it takes about 18 minutes before the CPU reaches the desired 99% User time. During those 18 minutes, a single disk with ufs on it is, according to iostat, 100% busy. It reads about 16 MB/sec, doing about 725 reads/sec.
Note that in this graph, and all other graphs in this article, the program being tested is only one of the benchmarks drawn from a larger suite, and only one of its inputs. Therefore, no statements in this article should be taken as indicative of "compliant" (aka "reportable") runs of an entire suite.
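The 18 minutes of idle time can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope estimate, not a measurement: it assumes every copy reads its own 1/4 GB file (taken here as 256 MB), that nothing is served from cache, and that the disk sustains a constant rate. The function name read_minutes is made up for this illustration.

```shell
#!/bin/sh
# Estimate the minutes needed to read the input file once per benchmark copy.
# Arguments: number of copies, file size in MB, sustained read rate in MB/sec.
# Assumes no caching and a constant read rate (both are simplifications).
read_minutes() {
    awk -v copies="$1" -v mb="$2" -v rate="$3" \
        'BEGIN { printf "%.1f\n", copies * mb / rate / 60 }'
}

# M5000 case from the text: 63 copies, 256 MB each, about 16 MB/sec.
read_minutes 63 256 16    # prints 16.8
```

An estimate of roughly 17 minutes is in the same ballpark as the 18 minutes observed, which suggests that the idle time is almost entirely the disk reading 63 copies of one file.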
ZFS Makes its Dramatic Entrance
Although this tester has heard concerns raised by people who have passed along rumors of ZFS limitations, there have been other teachers who have sung its praises, including one who has pointed out that 450.soplex's 1/4 GB input file is highly compressible, going from 267 MB to 20 MB with gzip.
The best IO is the IO that you never have to do at all. By using the ZFS compression feature, we can make 90% of the IO go away:
    $ zpool create -f tank c0t1d0
    $ zfs create tank/spec-zfs-gzip
    $ zfs set compression=gzip tank/spec-zfs-gzip
The improvement from ZFS gzip compression is indeed dramatic.
The careful reader may note that there are actually two lines on the far left: one measured with Solaris 10 Update 7, the other with Solaris Express. The version of Solaris did not appear to be a significant variable for the tests reported in this paper, as can be seen by the fact that the two lines are right on top of each other.
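Before turning on compression, it is easy to check how compressible a given input file is. The sketch below is one way to do that with ordinary gzip; the function name gzip_savings is made up for this illustration. Whole-file gzip only approximates what ZFS will achieve, because ZFS compresses each record separately, so treat the result as a hint rather than a prediction.

```shell
#!/bin/sh
# Print original size, gzip-compressed size (both in bytes), and percent saved.
# Whole-file gzip is only an approximation of ZFS gzip compression, since ZFS
# compresses each record on its own.
gzip_savings() {
    orig=$(wc -c < "$1")
    comp=$(gzip -c "$1" | wc -c)
    awk -v o="$orig" -v c="$comp" \
        'BEGIN { printf "%d %d %.1f%%\n", o, c, (o > 0 ? (o - c) * 100 / o : 0) }'
}

# Usage (ref.mps is the input file named in the article):
#   gzip_savings ref.mps
```

For the sizes quoted above (267 MB shrinking to 20 MB), the saving works out to a little over 92%, which is where the "90% of the IO" figure comes from.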
What About Memory Consumption?
Although ZFS has done a great job above, what about its memory consumption? Concerns have been raised that it is memory-hungry, and indeed the "Best Practices" Guide plainly says that it will use all the memory on the system if it thinks it can get away with it:
The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC relinquishes memory.
ZFS memory usage is an important concern when running the SPEC CPU benchmarks, which are designed to stress the CPU, the memory system, and the compiler. Some of the benchmarks in the suite use just under 1 GB of physical memory, and it is desirable to run (n - 1) copies on a system with (n) threads and (n) GB of memory. Fortunately, there is a tuning knob available to control the size of the ARC: the line set zfs:zfs_arc_max = 0x(size) can be added to /etc/system.
The tests reported on this page all use a limited ARC cache.
It should also be noted that all tests are done after a fresh reboot, so presumably the ARC cache is not contributing to the reported performance. More details about methods may be found at the end of the article.
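Since zfs_arc_max is given in bytes and usually written in hex, it is easy to get a digit wrong. The sketch below prints the /etc/system line for a cap expressed in GB. The function name arc_max_line and the 4 GB figure are assumptions for illustration only; the article does not say which value was used for these tests.

```shell
#!/bin/sh
# Print an /etc/system line that caps the ZFS ARC at the given number of GB.
# Needs a shell with 64-bit arithmetic for sizes of 2 GB and above.
arc_max_line() {
    printf 'set zfs:zfs_arc_max = 0x%x\n' "$(( $1 * 1024 * 1024 * 1024 ))"
}

# Hypothetical 4 GB cap; the value actually used in the tests is not stated.
arc_max_line 4    # prints: set zfs:zfs_arc_max = 0x100000000
```

The output can be appended to /etc/system by hand. Because /etc/system is read at boot, the new limit takes effect after the next reboot.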
ZFS on T5440: Good, But Not As Dramatic
Although the above simple commands are enough to remove the IO idle time on the M5000, for the 4-chip T5440 there is a bigger problem: this system supports 256 threads, and 255 copies of the benchmark are run. Therefore, it needs to quickly inhale on the order of 64 GB.
A somewhat older RAID system was made available for this test: an SE3510 with 12x 15K 72GB drives. Using this device with ufs, it takes 30 minutes before the system hits the maximum user time, as shown by the line on the right in the graph below:
In the ufs test above, the SE3510 is configured as 12x drives in a RAID-5 logical drive, with a simple ufs filesystem (newfs -f 8192 /dev/dsk/c2t40d0s6). Despite the large number of drives, the SE3510 sustains a steady read rate of only about 45 MB/sec, processing about 3000 IO/sec according to iostat. (Aside: the IO expert may question why the hardware RAID provides only 45 MB/sec, but please bear in mind we are following the path of the IO beginner here. This topic is revisited below.)
The zfs file system reads about 16 MB/sec, doing about 4500 IO/sec, but takes less than 1/2 as long to peg the CPU, since it is reading compressed data.
The zfs file system also used an SE3510 with SUN72G 15k RPM drives. On that unit, 12 individual "NRAID" drives were created, and made visible to the host as 12 separate units. Then, 10 of them were strung together as zfs RAID-Z using:
    # zpool create zf-raidz10 raidz \
        c3t40d0 c3t40d1 c3t40d2 c3t40d3 c3t40d4 \
        c3t40d5 c3t40d6 c3t40d7 c3t40d8 c3t40d9
    # zpool status
      pool: zf-raidz10
     state: ONLINE
     scrub: none requested
    config:

            NAME          STATE     READ WRITE CKSUM
            zf-raidz10    ONLINE       0     0     0
              raidz1      ONLINE       0     0     0
                c3t40d0   ONLINE       0     0     0
                c3t40d1   ONLINE       0     0     0
                c3t40d2   ONLINE       0     0     0
                c3t40d3   ONLINE       0     0     0
                c3t40d4   ONLINE       0     0     0
                c3t40d5   ONLINE       0     0     0
                c3t40d6   ONLINE       0     0     0
                c3t40d7   ONLINE       0     0     0
                c3t40d8   ONLINE       0     0     0
                c3t40d9   ONLINE       0     0     0
Compression was added at a later time, but before the experiment shown above:
    $ zfs list -o compression zf-raidz10
    COMPRESS
        gzip
Why Is the T5440 Improvement Not As Dramatic As the M5000?
The improvement from zfs is helpful to the T5440, but unlike the M5000, nearly 15 minutes of clock time is spent on IO. Let's look at some statistics from iostat:
    $ iostat -xncz 30
    . . .
         cpu
     us sy wt id
      8  4  0 88
                        extended device statistics
        r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
        0.0    0.0     2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
      469.3    0.0  1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
      401.8    0.0  1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
      471.1    0.0  1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
      416.1    0.0  1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
      452.1    0.0  1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
      417.8    0.0  1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
      461.0    0.0  1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
      418.9    0.0  1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
      433.6    0.0  1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
      420.0    0.0  1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0
A kind ZFS expert notes that "with RAID-Z the disks are saturated delivering >400 iops. The problem of RAID-Z is that those iops carry small amount of data and throughput is low." For more information, see this popular reference: https://blogs.oracle.com/roch/entry/when_to_and_not_to.
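The expert's point about small reads can be seen directly in the iostat numbers: dividing kr/s by r/s gives the average amount of data per read. The sketch below does that division for each device line. The function name kb_per_read is made up for this illustration, and it assumes the 11-column layout that iostat -xn prints in the report above.

```shell
#!/bin/sh
# Read `iostat -xn` device lines on stdin; print device name and average KB per read.
# Assumed columns: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
# Header lines and devices with no reads are skipped.
kb_per_read() {
    awk 'NF == 11 && $1 ~ /^[0-9.]+$/ && $1 > 0 { printf "%s %.1f\n", $11, $3 / $1 }'
}

# Two of the device lines from the report above:
kb_per_read <<'EOF'
469.3 0.0 1851.2 0.0 0.9 18.5 1.9 39.4 25 88 c3t40d9
401.8 0.0 1893.2 0.0 19.7 9.8 48.9 24.5 88 96 c3t40d8
EOF
# prints:
#   c3t40d9 3.9
#   c3t40d8 4.7
```

Each disk is delivering only about 4 KB per read while running at 86% to 99% busy, so the disks are out of operations long before they are out of bandwidth.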
A secondary reason might be that as the reads are done, ZFS is decompressing the gzip'd data on a system where single thread performance is much slower than the one in Graph #2. On the M5000, 'gunzip ref.mps' requires about 2 seconds of CPU time; on the T5440, about 7 seconds. It should be emphasized that this is only a secondary concern for the read statistics described in this article, although it can become more important for write workloads, since compression is harder than decompression. Doing 'gzip ref.mps' takes ~12 seconds on the M5000, and ~51 seconds on the T5440. Furthermore, although the T5440 has 256 threads available, as of Solaris 10 s10s_u7, and Solaris Express snv_112, it is only willing to spend 8 threads doing gzip/gunzip operations. (This limitation may change in a future version of Solaris.)
Solution: Mirrors, No Gzip
The kind ZFS expert suggested trying mirrored drives without gzip. When this is done, the %b (busy) time, which is about 90% in the iostat report just above, changes to 98-100%. The %w (queue non-empty) time, which shows wide variability just above, also pushes 90-100%. Because we are reading much more data, elapsed time is actually slower - the red line in the graph below:
Adding 12 more drives, configured as 8x three-way mirrors, does the trick: the leftmost line shows the desired slope. We spend about 3-4 minutes reading the file, an acceptable amount given that the benchmark as a whole runs for more than 120 minutes.
The file system for the leftmost line was created using:
    # zpool create dev8-m3-ngz \
    > mirror c2t40d0 c2t40d1 c3t40d0 \
    > mirror c2t40d2 c2t40d3 c3t40d1 \
    > mirror c2t40d4 c2t40d5 c3t40d2 \
    > mirror c2t40d6 c2t40d7 c3t40d3 \
    > mirror c2t40d8 c3t40d4 c3t40d5 \
    > mirror c2t40d9 c3t40d6 c3t40d7 \
    > mirror c2t40d10 c3t40d8 c3t40d9 \
    > mirror c2t40d11 c3t40d10 c3t40d11
    #
The command creates 3-way mirrors, splitting each mirror across the two available controllers (c2 and c3). There are 8 of these 3-way mirrors, and zfs will dynamically stripe data across the 8 mirrors.
Were These Tests Fair?
The hardware IO expert may be bothered by the data presented here for the RAID-5 ufs configuration. Why would the hardware RAID system, with 12x drives, deliver only 45 MB/sec? In addition, it may seem odd that the tests use a RAID device which is now 5 years old, and compare it versus contemporary ZFS.
This is a fair point. In fact, a more modern RAID device has been observed delivering 97 MB/sec to 450.soplex, although with a very different system under test.
On the other hand, it should be emphasized that all the T5440 tests reported in this article used SE3510/SUN72G/15k. For the ufs tests, the SE3510 on-board software did the RAID-5 work. For the zfs tests, the SE3510 simply presented its disks to the Solaris system as 12 separate ("NRAID") logical units, and zfs did the RAID-Z and mirroring work.
Could there be something wrong with the particular SE3510 used for ufs? That seems unlikely. Although Graph 3 compares two different SE3510s (both connected to the same HBA, both configured with SUN72G 15k drives), a later test repeated the RAID-5 run on the exact same SE3510 unit as had been used for zfs. The time did not improve.
Is it possible that the SE3510 was mis-configured? Maybe. The author does not claim to be an IO expert, and, in particular, relied on the SE3510 menu system, not its command line interface (sccli). The menus provide limited access to disk block size setting, and the tester did not at first realize that the disk block size depends on this other parameter .... located over here in the menus ...
For this particular controller, default block sizes are controlled indirectly by whether this setting is yes or no. Changing it to "No" makes the default block size larger (32 KB vs. 128 KB). Once this was discovered, various tests were repeated. The hw RAID-5 test was repeated with explicit selection of a larger size; however, it did not improve. On the other hand, the NRAID devices, controlled by zfs, did improve.
Finally, in order to isolate any overhead from RAID-5, the SE3510 was configured as 12 x drives in a RAID-0 stripe (256 KB stripe size). The time required to start 450.soplex was still over 30 minutes.
As usual, your mileage may vary depending on your workload. This is especially true for IO workloads.
Summary / Basic Lessons
Some basic lessons about ZFS emerge:
1) ZFS can be easily taught not to hog memory.
2) Selecting gzip compression can be a big win, especially on systems with relatively faster CPUs.
3) Setting up mirrored drives with dynamic striping is straightforward.
4) ZFS is not so scary, after all.
Notes on Methods
During an actual test of a "reportable" run of SPEC CPU2006, file caches are normally not effective for 450.soplex, because its data files are set up many hours prior to their use, with many intervening programs competing for memory usage. Therefore, for all tests reported here, it was important to avoid unwanted file caching effects that would not be present in a reportable run, which was accomplished as summarized below:
    runspec -a setup --rate 450.soplex
    reboot
    cd 450.soplex/run/run*000
    specinvoke -nnr > doit.sh
    convert 'sh dobmk' in doit.sh to 'sh dobmk &'
    doit.sh
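The "convert" step above was described in words, not as a command. One way to do it is with sed, as in the sketch below. This is an assumption about how the edit could be scripted, not a record of what was actually run: the article does not show what the specinvoke output looks like, and the function name background_dobmk is made up for this illustration.

```shell
#!/bin/sh
# Append " &" to each "sh dobmk" invocation read on stdin, so that the
# benchmark copies start in parallel instead of one after another.
# The real layout of the specinvoke output is not shown in the article.
background_dobmk() {
    sed 's/sh dobmk/sh dobmk \&/'
}

# Usage (doit-bg.sh is a hypothetical output name):
#   background_dobmk < doit.sh > doit-bg.sh
```

In the sed replacement text a bare & means "the matched string", so the ampersand has to be escaped to get a literal one.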
The tests noted as Solaris 10 used:
    # head -1 /etc/release
    Solaris 10 5/09 s10s_u7wos_08 SPARC
The tests noted as SNV used:
    # head -1 /etc/release
    Solaris Express Community Edition snv_112 SPARC
The tests on the M5000 used 72GB 10K RPM disks. The ufs disk was a FUJITSU MBB2073RCSUN72G (SAS); the zfs disk was a SEAGATE ST973401LSUN72G (Ultra320 SCSI). The tests on the T5440 used 72GB 15K RPM disks: FUJITSU MAU3073FCSUN72 (Fibre Channel).
My IO teachers include Senthil Ramanujam, Cloyce Spradling, and Roch Bourbonnais, none of whom should be blamed for this beginner's ignorance. Karsten Guthridge was the first to point out the usefulness of ZFS gzip compression for 450.soplex.