ZFS and the all-singing, all-dancing test suite

"A product is only as good as its test suite"

This is a saying that I first heard while helping to get a storage company off the ground. The full import of this statement didn't quite hit me until a few months later.

Software is a very human artifact, and as such it's limited by our cognitive capacity for correctness. I'm sure there are people out there who are smarter than I am and make fewer mistakes. But no matter how good they are, their bug count is still greater than zero. There will always be bugs, because there is a limit to how much even the smartest of us can hold in our heads at any given time.

So what do you do to drive the number of bugs in a given piece of software toward zero? You write tests. And the end quality of your product depends very directly on how good those tests are at exploring the boundary conditions of your code. If you only test the simple cases, only the simple things will actually work.

Naturally, when you're developing something that other people entrust their livelihood to, like a filesystem or a high-end storage array, you want to make darn sure that their trust is not misplaced. To that end, we wrote a very aggressive test suite for ZFS.

But first, a little background. As Eric has diagrammed on his Source Code Tour, ZFS is broken up into several layers: ZPL (ZFS Posix Layer), DMU (Data Management Unit), and SPA (Storage Pool Allocator). In terms of external dependencies, the DMU and the SPA (about 80% of the kernel code) need very little from the kernel. The ZPL, which plugs into the Solaris VFS and vnode layer, depends on many interfaces from throughout the kernel. Fortunately, it tends to be less snarly than the rest of the code.

What we've done then is to allow the DMU and SPA to be compiled in both kernel and user context (the ZPL is kernel-only). The user-level code can't be used to mount filesystems, but we can plug a test harness, called ztest, into the DMU and SPA (in place of the ZPL) and run the whole shebang in user-land. Running in user-land gives us fast test cycles (no reboot if we botched the code), and allows us to use a less extreme debugging environment.

Now that we have our code in user-land all ready for a test harness to drive it, what do we do? We do what would typically be called white-box testing, where we purposefully bang on sore spots in our code, along with the standard datapath testing. What kind of tests? Anything and everything, all in parallel, all as fast as we can:

• Read, write, create, and delete files and directories
  Just what you would expect from your standard filesystem stress test.
• Create and destroy entire filesystems and storage pools
  It's important to test not only the data path, but the administrative path as well, since that's where a lot of problems typically occur -- doing some administrative operation while the filesystem is getting the crap beat out of it.
• Turn compression on/off while data is in flight
  Since all ZFS administrative operations are on-line, things can change at any time, and we have to cope.
• Change checksum algorithms
  Same deal. If the checksum algorithm changes mid-flight, all old blocks are still fine (our checksums and compression functions are on a per-block basis), and any new blocks go out with the new checksum algorithm applied.
• Add, replace, attach, and detach devices in the pool
  This is a big one. Since all ZFS operations are 100% on-line, we can't pause, wait a moment, grab a big fat lock, or anything that would slow us down. Any administrative operations must not disrupt the flight of data and (of course) must ensure that data integrity is in no way compromised. Since adding disks, replacing disks, or upgrading/downgrading mirrors are "big deals" administratively, we must make sure that we don't trip over any edge conditions in our data path.
• Change I/O caching and scheduling policies on-the-fly
  The I/O subsystem in ZFS is nothing to sneeze at, and there are many tunables internal to the code. One day, we hope to have the code adaptively tune these on-the-fly in response to workload. In the meantime, we verify that fiddling with them doesn't lead to bad behavior.
• Scribble random garbage on any redundant data
  Whenever our random device addition happens to have created a redundant configuration (either a mirror or RAID-Z), we take the opportunity to test our self-healing data (automatically repair corrupted data). We destroy data in a random, checkerboard pattern since destroying all copies of a given piece of data is not a particularly interesting data point. We're not PsychicFS, after all.
• Scrub, verify, and resilver devices
  This ensures that as we're running, we can traverse the contents of the pool in a safe way, verifying all copies of our data along the way, and repairing anything we notice to be out of place.
• Force violent crashes to simulate power loss
  Since we're in user-land, a "kill -9" will simulate a violent crash due to power loss. No "sync your data to disk", no "here it comes", just "bang, you're dead". We even break up our low-level writes to disk so that they're non-atomic at the syscall level. This way, we simulate partial writes in the event of a power failure. Furthermore, when we restart the test suite and resume from the simulated power failure, we first verify the integrity of the entire storage pool to ensure everything is totally consistent.

Remember, we're doing all of the above operations as fast as we can. To make things even racier, we put the backing store of the pool under test in /tmp, which is an in-memory filesystem under Solaris. This way, we can expose many race conditions that simply would not occur at the speed of normal disks.

All in all, this means that we put our code through more abuse in about 20 seconds than most users will see in a lifetime. We run this test suite every night as part of our nightly testing, as well as when we're developing code in our private workspaces. And it obeys Bill's first law of test suites: the effectiveness of a test suite is directly proportional to its ease of use. We simply type "ztest" and off it goes.

Of course, all this wonderful testing leaves out the poor ZPL. Fortunately, we have a dedicated test team that also does more traditional testing, running ZFS in the kernel, acting as a real filesystem. They run a whole slew of tests that aim for maximal code coverage, verify standards compliance, and generally abuse the system as much as possible. While we strive to have ztest cover as much of our code base as possible, nothing beats the real deal in terms of giving ourselves warm, fuzzy feelings.

As our friends at Veritas once said, "If it's not tested, it doesn't work." And I couldn't agree with them more.


A great read! It seems your blog is young, but I'll bookmark it in case you keep delivering interesting stuff like this. I'd like to hear about the design process you went through to come up with a filesystem that can do all of the stuff mentioned, much less do it under the load you've described, with all sorts of major administrative chores happening in real time. That's not the sort of thing you can get a Wrox or O'Reilly book on how to do.

Posted by John DeHope 3 on November 16, 2005 at 02:14 AM PST #

Your description of the userland testing leaves me unconvinced that you've covered all failure modes. The fact is that a power loss or power surge can cause severe data corruption not simulated by your software-only, sudden-kill procedure. For instance, on a real-world disk, you may get anomalous results if the power goes out while the drive is in the middle of a physical write of a particular sector. Furthermore, there are other ways a power loss might damage many seemingly random sectors. A famous power-drop failure at UMich years ago, for instance, left a spiral pattern of corrupted sectors as the head moved across the disk, still writing, because the drive's circuitry left the write current active while the head was retracting. Sudden shock to the drive cabinet might have similar effects.

Also, have you tried running your tests on actual failed disks returned from the field, to find out how failures manifest in the real world? Another thing you could do is destructive testing. Take one of those failed disks which is mostly working, and put it inside an air-sealed chamber (but still cabled over to your machine). Now introduce some corrosive atmospheric contaminant like ammonia or acetic acid vapors into the chamber. As the fumes seep into the drive through the pressure-equalization filter and the media starts to degrade, see how your code reacts to those failures. Do the same with excessive heat (inadequate cooling) or out-of-spec humidity. This would essentially be accelerated-life testing that would better simulate what your customers will experience with their hardware.

One can also imagine other tests to validate correct behavior in the presence of controller failures, intermittent cable connections, bad bus termination, and so forth. I've been reading most of the ZFS descriptive material available so far, and I haven't seen any evidence of testing against such real-world device failures.

Finally, on the topic of putting the backing store in /tmp, note that a configuration like that changes the timing around in such a way that it might not exercise the same race conditions as would be experienced in a real system. Such tests might be helpful but I wouldn't consider them definitive; you still have to test the timing expected on a production system.

Please do such testing before ZFS makes it into S10. Not doing so would be like not testing chemicals on animals -- which simply means that the first organisms that get exposed to possibly dangerous materials are people, and chances are, everybody stays ignorant and in denial about actual toxicity. Is that really an ethical stance?

Posted by Glenn on November 19, 2005 at 05:21 PM PST #

Looking at the wikipedia page, it says the maximum file size of a file in ZFS is 2^64. Is off_t going to be modified to be an integer larger than a signed 64-bit integer to account for this and if not, how does lseek, read, write, and ftruncate work?

Posted by asdf on February 08, 2006 at 01:42 PM PST #

Good entry, I read it late... but this spam in the comments sucks.

Posted by J. on May 08, 2007 at 04:51 AM PDT #
