ZFS vs. The Benchmark
By bill on Nov 16, 2005
It's been said that developing a new filesystem is an over-constrained problem. The interfaces are fixed (POSIX system calls at the top, block devices at the bottom), correctness in the face of bad hardware is not optional, and it must be fast, or nobody cares. Well, gather 'round, settle back, and listen to (or read) the tale of ZFS vs. "The Benchmark".
Once, not long ago, we had a customer write a benchmark for Solaris that was meant to stress the VM system. The idea was fairly straightforward: mmap(2) a large file, then randomly dirty it, forcing the system to do some paging in the process. Pretty simple, right?
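For the curious, here's a minimal sketch of that kind of benchmark in C. This is my reconstruction, not the customer's actual code; the file path, size, and endless loop are all invented for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t size = (size_t)8 << 30;          /* hypothetical 8 GB file */
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/ufs/bigfile", O_RDWR);  /* illustrative path */
    char *base;

    if (fd == -1) {
        perror("open");
        return (1);
    }
    base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    /* Dirty one byte on a randomly chosen page, over and over. */
    for (;;) {
        size_t page = (size_t)(drand48() * (double)(size / pagesize));
        base[page * pagesize]++;
    }
}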
Our beloved customer then ran this on an appropriately large file stored on a UFS filesystem, and the benchmark proceeded to grind the system to a near standstill. A cursory inspection showed that the single disk containing the UFS filesystem had 24,000 outstanding I/Os queued against it. This was the same disk that contained the root filesystem. Not a good sign.
As you might well imagine, the system was pretty unresponsive. Not hung, but so slow it was beyond tolerable for even the most patient of us. The benchmark had caused enough memory pressure that most non-dirty pages (like our shell, ls, vmstat, etc.) were evicted from memory. Whenever we wanted to type a command, those pages had to be fetched from disk. That's right, the same disk with 24,000 outstanding I/Os. The disk was a real champ, cranking out about 400 IOPS, which meant that any read request (for, say, the ls command) took about one minute to work its way to the front of the I/O queue: 24,000 queued I/Os divided by 400 IOPS is 60 seconds. Like I said, a trying situation for someone attempting to figure out what the system was doing.
The score so far? The Benchmark: 1, UFS: 0
And in this corner...
At this point, ZFS was still in its early stages, but I had just finished writing some fancy I/O scheduling code based on ideas that had been bouncing around in my head for a couple of years. After I explained what I had just done to Bryan Cantrill, he had the bright idea of running this same benchmark on ZFS. While I was still busy waving my hands and trying to back down from my bold claims, he had already started it up.
The difference in system response was breathtaking. While running the same workload, the system was perfectly responsive and behaved like a machine without a care in the world. Wow. I then promptly went back to making bold claims.
One of the features of our advanced I/O scheduler in ZFS is that each I/O has both a priority and a deadline associated with it. Higher-priority I/Os get sooner deadlines than lower-priority ones. Our issue policy is then to pick, from the oldest deadline group, the pending I/O with the lowest LBA (logical block address). This means that for I/Os sharing the same deadline, we do a linear pass across the disk, kind of like half of an elevator algorithm. We don't do a full elevator algorithm since it tends to over-service the middle of the disk and starve the outer edges.
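To make that issue policy concrete, here's a sketch of the sort order it implies. The struct and function names are invented, and the real ZFS code differs; this is just the idea, not the implementation.

#include <stdint.h>

typedef struct io {
    uint64_t io_deadline;   /* quantized deadline ("group") */
    uint64_t io_lba;        /* logical block address */
} io_t;

/*
 * Sort order for the pending-I/O tree: oldest deadline group first,
 * ties broken by ascending LBA. Always issuing whatever sorts first
 * produces the linear, half-elevator sweep described above.
 */
static int
io_compare(const io_t *a, const io_t *b)
{
    if (a->io_deadline != b->io_deadline)
        return (a->io_deadline < b->io_deadline ? -1 : 1);
    if (a->io_lba != b->io_lba)
        return (a->io_lba < b->io_lba ? -1 : 1);
    return (0);
}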
How does this fancy I/O scheduling help, you ask? In general, a write(2) system call is asynchronous. The kernel buffers the data for the write in memory, returns from the system call, and later sends the data to disk. On the other hand, a read(2) is inherently synchronous. An application is blocked from making forward progress until the read(2) system call returns, which it can't do until the I/O has come back from disk. As a result, we give writes a lower priority than reads. So if we have, say, several thousand writes queued up to a disk, and then a read comes in, the read will effectively cut to the front of the line and get issued right away. This leads to a much more responsive system, even when it's under heavy load. Since we have deadlines associated with each I/O, a constant stream of reads won't permanently starve the outstanding writes.
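Here's how that priority-to-deadline mapping might look, continuing the sketch above. The constants are hypothetical, chosen only to show the shape of the policy.

#include <stdint.h>

#define IO_READ_DEADLINE_MS   10    /* hypothetical value */
#define IO_WRITE_DEADLINE_MS  2500  /* hypothetical value */

/*
 * Reads get a near deadline, writes a far one, so a freshly queued
 * read sorts ahead of a big backlog of writes, while old writes
 * still age toward the front of the queue.
 */
static uint64_t
io_deadline(uint64_t now_ms, int is_read)
{
    return (now_ms + (is_read ? IO_READ_DEADLINE_MS : IO_WRITE_DEADLINE_MS));
}

With these example numbers and the io_compare() ordering from the previous sketch, a read queued right now beats any write queued within the last 2,490 ms, which is the "cut to the front of the line" behavior; a write that has waited longer than that carries the older deadline and gets issued first, which is why a constant stream of reads can't starve writes forever.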
At this point, the clouds cleared, the sun came out, birds started singing, and all was Goodness and Light. The Benchmark had gone from being a "thrash the disk" test to being an actual VM test, as originally intended.
Final score? The Benchmark: 0, ZFS: 1