ZFS vs. The Benchmark


It's been said that developing a new filesystem is an over-constrained problem. The interfaces are fixed (POSIX system calls at the top, block devices at the bottom), correctness in the face of bad hardware is not optional, and it must be fast, or nobody cares. Well, gather 'round, settle back, and listen to (or read) the tale of ZFS vs. "The Benchmark".


Once, not long ago, we had a customer write a benchmark for Solaris that was meant to stress the VM system. The idea was fairly straightforward: mmap(2) a large file, then randomly dirty it, forcing the system to do some paging in the process. Pretty simple, right?
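
For the curious, the shape of such a benchmark is roughly this (a minimal sketch, not the customer's actual code; the file path and the run-until-interrupted loop are just for illustration):

```c
/*
 * Sketch of an mmap-and-dirty benchmark: map a large file and store to
 * random pages, forcing the VM system to page.  Illustrative only.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "/test/bigfile";
	int fd = open(path, O_RDWR);
	struct stat st;

	if (fd == -1 || fstat(fd, &st) == -1) {
		perror(path);
		return (1);
	}

	long pagesize = sysconf(_SC_PAGESIZE);
	size_t npages = (size_t)(st.st_size / pagesize);

	char *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return (1);
	}

	/* Dirty random pages until interrupted; every store dirties a page. */
	for (;;) {
		size_t pg = (size_t)(drand48() * npages);
		base[pg * pagesize] ^= 1;
	}
	/* NOTREACHED */
}
```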


Our beloved customer then goes to run this on an appropriately large file, stored on a UFS filesystem. The benchmark then proceeds to grind the system to a near standstill. Cursory inspection shows that the single disk containing the UFS filesystem has 24,000 outstanding I/Os to it. This is the same disk that contains the root filesystem. Not a good sign.


As you might well imagine, the system was pretty unresponsive. Not hung, but so slow it was beyond tolerable for even the most patient of us. The benchmark had caused enough memory pressure that most non-dirty pages (like our shell, ls, vmstat, etc.) were evicted from memory. Whenever we wanted to type a command, these pages had to be fetched off of disk. That's right, the same disk with 24,000 outstanding I/Os. The disk was a real champ, cranking out about 400 IOPS, which meant that any read request (for, say, the ls command) took about one minute to work its way to the front of the I/O queue. Like I said, a trying situation for someone attempting to figure out what the system is doing.


The score so far? The Benchmark: 1, UFS: 0

And in this corner...

At this point in time, ZFS was still in its early stages, but I had just finished writing some fancy I/O scheduling code based on some ideas that had been bouncing around in my head for a couple of years. After I explained what I had just done to Bryan Cantrill, he had the bright idea of running this same benchmark using ZFS. While I was still busy waving my hands and trying to back down from my bold claims, he had already started up the benchmark using ZFS.


The difference in system response was breathtaking. While running the same workload, the system was perfectly responsive and behaved like a machine without a care in the world. Wow. I then promptly went back to making bold claims.


One of the features of our advanced I/O scheduler in ZFS is that each I/O has both a priority and a deadline associated with it. Higher-priority I/Os get sooner deadlines than lower-priority ones. Our issue policy is to pick the oldest deadline group and, within that group, issue the I/O with the lowest LBA (logical block address). This means that for I/Os that share the same deadline, we do a linear pass across the disk, kind of like half of an elevator algorithm. We don't do a full elevator algorithm, since it tends to over-service the middle of the disk and starve the outer edges.
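
To make the ordering concrete, here's a rough sketch of the comparison function such a scheduler could use (illustrative only, not the actual ZFS code; the struct and field names are made up):

```c
/*
 * Illustrative issue-order comparator: sort I/Os by deadline first, and
 * within the same deadline group by LBA, so the issue pass sweeps
 * linearly across the disk (half of an elevator algorithm).
 */
#include <stdint.h>

typedef struct io {
	uint64_t	io_deadline;	/* smaller value == more urgent */
	uint64_t	io_lba;		/* logical block address */
} io_t;

/* Returns <0, 0, or >0, like a qsort(3C)/AVL comparator. */
int
io_compare(const io_t *a, const io_t *b)
{
	/* Oldest (soonest) deadline group goes first. */
	if (a->io_deadline != b->io_deadline)
		return (a->io_deadline < b->io_deadline ? -1 : 1);

	/* Within a deadline group, issue in increasing LBA order. */
	if (a->io_lba != b->io_lba)
		return (a->io_lba < b->io_lba ? -1 : 1);

	return (0);
}
```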


How does this fancy I/O scheduling help, you ask? In general, a write(2) system call is asynchronous. The kernel buffers the data for the write in memory, returns from the system call, and later sends the data to disk. On the other hand, a read(2) is inherently synchronous. An application is blocked from making forward progress until the read(2) system call returns, which it can't do until the I/O has come back from disk. As a result, we give writes a lower priority than reads. So if we have, say, several thousand writes queued up to a disk, and then a read comes in, the read will effectively cut to the front of the line and get issued right away. This leads to a much more responsive system, even when it's under heavy load. Since we have deadlines associated with each I/O, a constant stream of reads won't permanently starve the outstanding writes.
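
As a rough sketch of how that plays out (the priority names and deadline offsets below are illustrative assumptions, not the actual ZFS values): a read gets a deadline of "now", so it sorts ahead of the thousands of queued writes, while the writes' deadlines still age toward the front, so they can't be starved forever.

```c
/*
 * Illustrative mapping from I/O priority to deadline.  A synchronous read
 * gets an immediate deadline and jumps ahead of buffered writes; writes
 * get a later, but finite, deadline so they eventually get issued even
 * under a constant stream of reads.  Constants are made up.
 */
#include <stdint.h>

typedef enum io_priority {
	IO_PRI_SYNC_READ = 0,	/* an application is blocked on this */
	IO_PRI_ASYNC_WRITE	/* buffered write(2); nobody is waiting */
} io_priority_t;

uint64_t
io_deadline(uint64_t now_ms, io_priority_t pri)
{
	/* Higher priority => smaller offset => sooner deadline. */
	static const uint64_t offset_ms[] = {
		[IO_PRI_SYNC_READ]   = 0,	/* issue as soon as possible */
		[IO_PRI_ASYNC_WRITE] = 2500,	/* may wait behind reads */
	};

	return (now_ms + offset_ms[pri]);
}
```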


At this point, the clouds cleared, the sun came out, birds started singing, and all was Goodness and Light. The Benchmark had gone from being a "thrash the disk" test to being an actual VM test, like it was originally intended.


Final score? The Benchmark: 0, ZFS: 1

Comments:

Nice, but why hasn't the deadline / I/O priority scheduling been implemented in the "I/O layer" (i.e., FS-independent) instead of being ZFS-only? Just curious; it looks like the right place for it.

Posted by Diego on November 16, 2005 at 02:53 AM PST #

Just curious; it looks like the right place for it.

So Sun employees can brag about ZFS, instead of getting depressed about their crappy drivers?

Posted by Alex on November 17, 2005 at 12:34 AM PST #

As a result, we give writes a lower priority than reads. So if we have, say, several thousand writes queued up to a disk, and then a read comes in, the read will effectively cut to the front of the line and get issued right away.

What if the read(2) reads a portion of the file that a pending write(2) would have affected?

Isn't that a Bad Thing?

Posted by Ron on November 17, 2005 at 07:33 AM PST #

So Sun employees can brag about ZFS, instead of getting depressed about their crappy drivers?

I'll ignore the inflammatory tone of your comment and try to answer the question briefly. More will follow in a blog entry.


The whole purpose of an I/O scheduler (especially one with advanced features like deadlines and inheritable priority) is to re-order I/O requests. Without knowledge of the interdependencies, it is impossible to do this without sacrificing correctness. In the read case, you can generally do this, but you still have to be careful so that you're consistent with interleaved writes to the same block.

Posted by Bill Moore on November 17, 2005 at 01:18 PM PST #

What if the read(2) reads a portion of the file that a pending write(2) would have affected?

We handle that case at a higher layer in ZFS. If a read is issued for a write that is on its way to disk, the read is satisfied by the in-memory copy of that data. It would never make it down to the actual I/O layer, so it isn't a problem.
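
Roughly, the flow looks like this (a self-contained sketch of the idea, not the actual ZFS code; the dirty-buffer table and disk_read() below are illustrative stand-ins):

```c
/*
 * Self-contained sketch of "satisfy reads from the in-memory copy"
 * (illustrative, not the actual ZFS code).  A read first checks a table
 * of dirty, not-yet-written blocks; only on a miss does it become a disk
 * I/O that enters the scheduler's queue.
 */
#include <stdint.h>
#include <string.h>

#define	BLOCK_SIZE	512
#define	MAX_DIRTY	64

typedef struct dirty_buf {
	int		db_valid;		/* slot holds a pending write */
	uint64_t	db_blkid;
	char		db_data[BLOCK_SIZE];
} dirty_buf_t;

static dirty_buf_t dirty_table[MAX_DIRTY];	/* writes on their way to disk */

/* Stand-in for queueing a real read I/O to the disk. */
static void
disk_read(uint64_t blkid, void *dst)
{
	(void) blkid;
	memset(dst, 0, BLOCK_SIZE);
}

/* Satisfy the read from memory if a pending write covers it, else hit disk. */
void
block_read(uint64_t blkid, void *dst)
{
	for (int i = 0; i < MAX_DIRTY; i++) {
		if (dirty_table[i].db_valid &&
		    dirty_table[i].db_blkid == blkid) {
			memcpy(dst, dirty_table[i].db_data, BLOCK_SIZE);
			return;		/* no disk I/O needed */
		}
	}
	disk_read(blkid, dst);		/* miss: this read goes to the queue */
}
```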

Posted by Bill Moore on November 17, 2005 at 01:20 PM PST #

This comparison is not fair because UFS is rather old. Why don't you guys benchmark against a newer FS like Reiser4?

Posted by Vladislav Mikhailikov on November 17, 2005 at 07:17 PM PST #

This comparison is not fair because UFS is rather old. Why don't you guys benchmark against a newer FS like Reiser4?

I don't think your statement is correct. This is a perfectly fair comparison. It just may not be the comparison that you're most interested in.


That said, let me assure you that we're working on benchmarking against Reiser4, and so far, we're doing quite well. I'm working to make sure the comparison is as fair as possible, since benchmarking different filesystems on different OSes is a tricky business. It's very easy to conflate the OS differences with the filesystem ones.

Posted by Bill Moore on November 18, 2005 at 01:15 PM PST #

Hi Bill, what will happen when a system is under memory pressure, a file gets flushed out to disk, and then a read for the same file gets issued? I take it that the read will have to wait until the write has completed, since there's no in-memory copy, right? Thanks, James.

Posted by James on April 12, 2006 at 02:16 AM PDT #

James,


Hopefully the system will behave intelligently and actually write the file back to disk instead of paging it out. I could see things getting tricky here if you have some big, slow data storage and a small, fast disk for your swap/root device, since paging the file out would then be much faster than just writing it back.


I'm actually very curious how ZFS handles this, or, more broadly, how ZFS+Solaris handles it, since this is a problem of interaction between the filesystem and the VM system.


BTW, Mr. Moore, are you still planning on updating this blog at some point? I can't complain about getting busy and not posting for long periods of time, since I do the same on my blog, but I want to know if I should subscribe to your feed.

Posted by logicnazi on April 30, 2006 at 02:43 AM PDT #

I can't claim to fully understand all the interactions of VM, swap, and filesystem, but as I understood it, the benchmark basically forced the system to do a lot of writes to disk, while reads are important for shell responsiveness, and you said ZFS specifically prefers reads over writes. What would have happened if the benchmark had caused lots of reads to disk? Then the read for "ls" wouldn't have had priority over the I/O the benchmark caused, making the system with ZFS behave just as sluggishly as (or even more sluggishly than) with UFS?

Posted by Ben Bucksch on May 01, 2006 at 02:39 AM PDT #

Interesting... But

OK, so you've posted this "benchmark" showing how ZFS is better than UFS at this task. It's highly likely that ZFS is better than UFS at most things. I might even be persuaded to say ZFS is better than most filesystems at most things... but I see three big problems.

  1. The first problem I would like to point out, and that others have as well, is that there are absolutely NO benchmarks here that compare the actual performance of ZFS to any other filesystem.
  2. ZFS seems to have been optimized for I/O to a single device. Statements like "Using batches of LBA (logical block address)" seem to indicate that you actually believe you know where the data is being physically written on the disk. Maybe you've heard of this new thing called 'virtualization'? It's been around for a while; in fact, HDD makers have been using it for at LEAST 10 years now, heck, more like 30. More to the point, what about doing I/O to a real storage device, like, say, a Sun 9990? Don't tell me you think you actually know where the data is going... so why waste time optimizing for something any real performance user can tell you won't matter?
  3. Finally, any real HPTC customer will tell you parallel is the answer. Sun figured it out with processing, so why not figure it out with I/O? The real way to get massive performance is to have a multi-host parallel filesystem over a fast interconnect. Why, something very similar to Sun's QFS. In fact, why NOT use QFS? Wasting all this time on ZFS seems somewhat pointless, but that's just one paying customer's opinion.

The fact that Sun thinks something like ZFS is the answer to storage problems tells me just how little Sun understands about storage... very sad indeed...

Posted by fdr on November 21, 2006 at 12:35 AM PST #

This is what I thought I would see here: a simple comparison of UFS, ZFS, ext3fs, and reiserfs on the same physical hardware, done by Chad Mynhier: http://cmynhier.blogspot.com/2006/05/zfs-benchmarking.html

Posted by Paul Connolly on January 16, 2007 at 08:11 PM PST #

I'm following along here in hopes of finding the bottleneck in our ZFS implementations.
Bulk write performance is just terrible (for example, a 400GB database backup). Writes to tape are 400% faster than to our ZFS filesystems (tested, consistent), and this is to a 6130/40 SAN disk array (with disk sets and virtual disks not shared and allocated specifically to the backup op).

400% faster to tape than to ZFS disks on a fast array?
Makes me wonder.

Posted by Db2dude on October 29, 2007 at 09:20 AM PDT #

I have read some articles about this, and I can understand all the interactions of VM, swap, and filesystem. As I understood it, the benchmark basically forced the system to do a lot of writes to disk, while reads are important for shell responsiveness.

Posted by virus protection on March 27, 2011 at 01:59 PM PDT #
