Friday Jan 29, 2010


As many folks know, Sun and GreenBytes have been having some discussions related to deduplication technology in ZFS. In large part, this was due to some unfortunate communication breakdowns between us.

Our discussions began with the fact that GreenBytes was able to develop their own implementation of deduplication on top of ZFS. This is a powerful engineering accomplishment, especially considering that they did it without the help of the ZFS team.

Over the months of 2008, the engineers at GreenBytes and Sun had several productive discussions about deduplication and ZFS. Ultimately, we concluded that each engineering team had different needs to meet with regard to the behavior and implementation of deduplication in ZFS.

While each of our deduplication technologies is unique, each has its own strengths, and that is what Open Source development is all about - nobody has to wholly agree on the right approach, and everyone is free to do their own thing. I think both Sun and GreenBytes have a lot of respect for each other's engineering teams due to these discussions over the past two years.

Sun definitely looks forward to possible future contributions to OpenSolaris and ZFS from GreenBytes, as they represent the spirit of Open Source that Sun hoped to cultivate with the CDDL.

We're glad we were able to resolve this misunderstanding, and we are looking forward to cooperatively working together through the OpenSolaris community in the future.

Friday May 12, 2006

Ditto Blocks - The Amazing Tape Repellent

I still remember the smell...

Before Xerox took over the copying business, the world used the ditto machine. You know the one - blue ink, nice smell. I still remember popping a few brain cells sniffing my fresh-off-the-press homework papers in grade school. Alas, I'm sure many readers of this blog won't have had the chance to have the ditto machine experience, but that's progress for you.

Anyway, let's take a left at the next intersection and turn off of memory lane and rejoin the present.

One block to rule them all

In the abstract, you can think of ZFS (or any other filesystem, for that matter) as a tree of blocks. By this, I mean that there is a root block from which all other blocks are discoverable. Let's now imagine a case where you have a petabyte of data in your filesystem (or storage pool, in ZFS' case) and think about having enough failures such that a single block becomes unavailable. What happens?

If the failure is at a leaf block, that one block is now unreadable and an application will get EIO if it tries to access that block. Ok, fine, that doesn't seem too bad. Also, since typically more than 98% of the data in a storage pool is user data blocks (leaf blocks), this is the most likely scenario.

But let's consider what happens if that single block failure is near the top of the tree. Now, we've got a problem. This single failed block casts an expanding shadow of undiscoverable blocks all the way down the tree. If you have a lot of data, a significant portion of it may now be unavailable. Potentially hundreds of terabytes of data at the mercy of a single block. Probably not what you had in mind.
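To put rough numbers on that expanding shadow, here is a back-of-the-envelope sketch (the fanout and depth are illustrative, not ZFS's actual on-disk layout):

```python
# Illustrative only: model a filesystem as a tree with a fixed fanout,
# with user data in the leaves. Losing one block at a given depth makes
# every leaf beneath it undiscoverable.
def leaves_lost(fanout, leaf_depth, failed_depth):
    """Number of leaf blocks cut off when a block at failed_depth dies."""
    return fanout ** (leaf_depth - failed_depth)

print(leaves_lost(1024, 3, 3))  # a leaf failure loses only itself: 1
print(leaves_lost(1024, 3, 0))  # a root failure loses every leaf: 1073741824
```

With a fanout of 1024 and three levels of indirection, one dead block near the root takes over a billion leaf blocks with it, which is exactly the scenario the next section addresses.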

Ditto Blocks

What is a user to do? Set up a 3-way mirror for all my data? Even though storage is cheap, it's not that cheap. A while back, we decided we needed to do better than this for ZFS. The result? Ditto blocks.

What are ditto blocks, you ask? ZFS has block pointers, which as you might imagine, point to blocks on disk. We call these Disk Virtual Addresses (DVAs). Before we did our initial integration of ZFS into Solaris, I made room in the block pointers for not just one DVA, but up to three DVAs. Using these extra DVAs, we can store up to three copies of a block in three separate locations. Mind you, this is on top of whatever replication the pool already has. If you were to use a 3-way ditto block in a pool with mirrored disks, that means that there would be six physical copies of that block.

We use ditto blocks to ensure that the more "important" a filesystem block is (the closer to the root of the tree), the more replicated it becomes. Our current policy is that we store one DVA for user data, two DVAs for filesystem metadata, and three DVAs for metadata that's global across all filesystems in the storage pool.
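The policy above can be sketched as a simple table; the names here are illustrative, not the actual ZFS identifiers, and the real logic lives in C inside the kernel:

```python
# A sketch of the ditto-block replication policy described above.
# Block classes and names are illustrative, not real ZFS identifiers.
DITTO_POLICY = {
    "user_data": 1,       # one DVA for plain file contents
    "fs_metadata": 2,     # two DVAs for per-filesystem metadata
    "pool_metadata": 3,   # three DVAs for pool-global metadata
}

def physical_copies(block_class, pool_replication=1):
    """Total on-disk copies: ditto copies times pool-level replication."""
    return DITTO_POLICY[block_class] * pool_replication

# 3-way ditto block on a 2-way mirrored pool: six physical copies
print(physical_copies("pool_metadata", pool_replication=2))  # 6
```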

This has several nice properties. First, the blocks that are more critical to the health of the pool are more replicated, making them far less likely to suffer catastrophic failure. If P is the chance a given block will suffer an unrecoverable error in a given unit of time, then P³ is the chance that a 3-way ditto block will lose all three copies in that same amount of time. This approaches zero very quickly.
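To make that concrete with a made-up (but not unreasonable) per-block failure rate:

```python
# Illustrative numbers only: assume each stored copy independently has a
# one-in-a-million chance of an unrecoverable error in some interval.
p = 1e-6
# For a 3-way ditto block, all three copies must fail in that interval,
# so the combined chance is on the order of p cubed: about 1e-18.
print(p ** 3)
```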

Second, since almost every storage pool has the vast majority of data in user blocks (well over 98%), there is very little impact in terms of I/O and space consumption for utilizing ditto blocks. Very little data is global to all filesystems (for which we store three copies), and usually about 1.5-2% of the data is per-filesystem metadata, which means that there is about a 2% hit in terms of space and I/O for this added redundancy.

Spread 'em

Once we had ditto blocks, the next question was: Where should we put the extra copies? The answer seems pretty obvious: As far apart as possible.

In a storage pool with only a single disk, we spread the blocks physically across the disk. Our policy aims to place the blocks at least 1/8 of the disk apart. This way, if there is a localized media failure (not all that uncommon on today's drives), you still have a copy elsewhere on that disk.

In a storage pool with multiple devices (vdevs), things get a little spicier. We allocate each copy of a block on a separate vdev. So even if an entire top-level vdev fails (a mirror or RAID-Z stripe), we can still access data. Furthermore, if you think of all the vdevs in your pool as forming a ring, we always try to allocate ditto blocks on the vdevs adjacent to the first copy. The reasoning behind this is a little subtle.

Imagine you have 100 vdevs making up your storage pool. To simplify things, further assume that all blocks in the pool are 2-way ditto blocks. If you just randomly allocated the two copies of each block on two random vdevs, any two-vdev failure would virtually guarantee that you lose at least some data. Now consider our policy of only mirroring using neighboring vdevs. If two top-level vdevs fail, the only way you could possibly lose data is if they happen to be adjacent. This means that given two failures, the probability of data loss goes from nearly 100% to about 2%.
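You can check that 2% figure by brute force: count how many of the possible two-vdev failures land on an adjacent pair of the ring.

```python
from itertools import combinations

# 100 vdevs arranged in a ring; a pair is "adjacent" if the indices differ
# by 1, or if it is the wrap-around pair (0, 99).
n = 100
pairs = list(combinations(range(n), 2))
adjacent = [(a, b) for a, b in pairs if b - a == 1 or b - a == n - 1]

print(len(adjacent), "of", len(pairs), "pairs")        # 100 of 4950 pairs
print(round(len(adjacent) / len(pairs), 4))            # 0.0202, about 2%
```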

Don't try this at home

After writing this code and testing it, I thought it would be fun to see it in action on my laptop. I created a new storage pool using a slice on my laptop drive, put a bunch of data on there, then wiped clean the first 1GB of that slice. As you might imagine, any of the file blocks that were unlucky enough to be allocated in that first 1GB were unreadable. However, I could still navigate the entire filesystem, typing "ls", "rm" and creating new files as much as I wanted. Pretty damn sweet. ZFS just survived a failure scenario that would send any other filesystem to tape. I know you'd still have to go to tape for the file contents that were damaged, but the filesystem was still 100% usable, and I could get a list of the damaged files by running zpool status -v. Careful readers will note that this command currently only gives you the object number, but it will give you the actual filename in the near future.

It's all good, mate

In the future, ditto blocks will be available for user data as well, on a per-filesystem basis. Imagine having your laptop with you on vacation (as I recently did). A single disk. You can create a filesystem for your digital photos and specify that you want it to use 2- or 3-way ditto blocks for your pictures. ZFS will spread each copy of each block far apart on the disk so that you wind up with an effect very close to mirroring with only a single disk. Furthermore, since we have a pretty decent I/O scheduler, it doesn't rattle the crap out of your disk drive.

This can lead to even more fun if you create a storage pool with several non-replicated disks. You'll be able to mix non-replicated data (for a build area or web cache) with ditto-block-mirrored "important" data in the same storage pool. How cool would that be?

But wait! What if a filesystem requests 2-way ditto blocks for user data? Wouldn't that mean that the filesystem metadata is no more replicated than its contents? Actually, we calculate the per-filesystem and per-pool metadata replication to be +1 and +2 compared to the user data (capped at 3, of course). So we do our best to preserve the same semantics, even when user data utilizes ditto blocks. More fun than you can shake a stick at.
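The "+1 / +2, capped at 3" rule works out like this (a sketch of the arithmetic only, not the real implementation):

```python
# Metadata replication derived from the user-data ditto count, per the
# "+1 for filesystem metadata, +2 for pool metadata, capped at 3" rule.
def metadata_copies(user_copies):
    fs_meta = min(user_copies + 1, 3)    # per-filesystem metadata
    pool_meta = min(user_copies + 2, 3)  # pool-global metadata
    return fs_meta, pool_meta

print(metadata_copies(1))  # (2, 3): the default policy described above
print(metadata_copies(2))  # (3, 3): user data at 2-way ditto
print(metadata_copies(3))  # (3, 3): everything at the cap
```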

Finally, you have to admit that the name is kinda catchy. I can smell the blue ink from here...

Wednesday Nov 16, 2005

ZFS vs. The Benchmark

It's been said that developing a new filesystem is an over-constrained problem. The interfaces are fixed (POSIX system calls at the top, block devices at the bottom), correctness in the face of bad hardware is not optional, and it must be fast, or nobody cares. Well, gather 'round, settle back, and listen to (or read) the tale of ZFS vs. "The Benchmark".

Once, not long ago, we had a customer write a benchmark for Solaris that was meant to stress the VM system. The idea was fairly straightforward: mmap(2) a large file, then randomly dirty it, forcing the system to do some paging in the process. Pretty simple, right?

Our beloved customer then goes to run this on an appropriately large file, stored on a UFS filesystem. The benchmark then proceeds to grind the system to a near standstill. Cursory inspection shows that the single disk containing the UFS filesystem has 24,000 outstanding I/Os to it. This is the same disk that contains the root filesystem. Not a good sign.

As you might well imagine, the system was pretty unresponsive. Not hung, but so slow it was beyond tolerable for even the most patient of us. The benchmark had caused enough memory pressure that most non-dirty pages (like our shell, ls, vmstat, etc.) were evicted from memory. Whenever we wanted to type a command, these pages had to be fetched off of disk. That's right, the same disk with 24,000 outstanding I/Os. The disk was a real champ, cranking out about 400 IOPS, which meant that any read request (for, say, the ls command) took about one minute to work its way to the front of the I/O queue. Like I said, a trying situation for someone attempting to figure out what the system is doing.

The score so far? The Benchmark: 1 UFS: 0

And in this corner...

At this point in time, ZFS was still in its early stages, but I had just finished writing some fancy I/O scheduling code based on some ideas that had been bouncing around in my head for a couple of years. Having explained what I had just done to Bryan Cantrill, he had the bright idea of running this same benchmark using ZFS. While I was still busy waving my hands and trying to back down on my bold claims, he had already started up the benchmark using ZFS.

The difference in system response was breathtaking. While running the same workload, the system was perfectly responsive and behaved like a machine without a care in the world. Wow. I then promptly went back to making bold claims.

One of the features of our advanced I/O scheduler in ZFS is that each I/O has both a priority and a deadline associated with it. Higher priority I/Os get deadlines that are sooner than lower priority ones. Our I/O issue policy is then to issue the I/O within the oldest deadline group that has the lowest LBA (logical block address). This means that for I/Os that share the same deadline, we do a linear pass across the disk, kind of like half of an elevator algorithm. We don't do a full elevator algorithm since it tends to lead to over-servicing the middle of the disk and starving the outer edges.

How does this fancy I/O scheduling help, you ask? In general, a write(2) system call is asynchronous. The kernel buffers the data for the write in memory, returns from the system call, and later sends the data to disk. On the other hand, a read(2) is inherently synchronous. An application is blocked from making forward progress until the read(2) system call returns, which it can't do until the I/O has come back from disk. As a result, we give writes a lower priority than reads. So if we have, say, several thousand writes queued up to a disk, and then a read comes in, the read will effectively cut to the front of the line and get issued right away. This leads to a much more responsive system, even when it's under heavy load. Since we have deadlines associated with each I/O, a constant stream of reads won't permanently starve the outstanding writes.
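Here is a toy model of that policy (the real scheduler is C code in the kernel, and the deadline values here are made up): each I/O is keyed by (deadline, LBA), reads get sooner deadlines than writes, and we always issue the I/O with the smallest key.

```python
import heapq

# Toy model of the priority+deadline scheduler described above. Reads get
# sooner deadlines than writes; within a deadline group, we issue in
# ascending LBA order (half an elevator sweep). Offsets are illustrative.
DEADLINE_OFFSET = {"read": 0, "write": 100}  # writes mature 100 ticks later

def enqueue(queue, now, kind, lba):
    """Key each I/O by (deadline, LBA); the heap pops the smallest key."""
    heapq.heappush(queue, (now + DEADLINE_OFFSET[kind], lba, kind))

queue = []
for lba in (500, 200, 900):          # a backlog of writes at time 0
    enqueue(queue, now=0, kind="write", lba=lba)
enqueue(queue, now=5, kind="read", lba=700)  # arrives later, cuts the line

order = [heapq.heappop(queue) for _ in range(len(queue))]
print([(kind, lba) for _, lba, kind in order])
# [('read', 700), ('write', 200), ('write', 500), ('write', 900)]
```

The read issues first despite arriving last, and the writes then go out in a single LBA-ordered pass, which is the behavior the benchmark story above hinges on.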

At this point, the clouds cleared, the sun came out, birds started singing, and all was Goodness and Light. The Benchmark had gone from being a "thrash the disk" test to being an actual VM test, like it was originally intended.

Final score? The Benchmark: 0 ZFS: 1

ZFS and the all-singing, all-dancing test suite

"A product is only as good as its test suite"

This is a saying that I first heard while helping to get a storage company off the ground. The full import of this statement didn't quite hit me until a few months later.

Whenever you write software, it's a very human thing, and as such, limited by our own cognitive capacity to provide correctness. I'm sure there are people out there who are smarter than me and make fewer mistakes. But no matter how good they are, the number is still greater than zero. There will always be bugs because there is a limit to how much even the smartest of us can fit into our heads at a given time.

So what do you do in order to drive the number of bugs in a given piece of software towards zero? You write tests. And the end quality of your product depends very directly on how good these tests are at exploring the boundary conditions of your code. If you only test the simple cases, only the simple things will actually work.

Naturally, when you're developing something that other people entrust their livelihood to, like a filesystem or a high-end storage array, you want to make darn sure that their trust is not misplaced. To that end, we wrote a very aggressive test suite for ZFS.

But first, a little background. As Eric has diagrammed on his Source Code Tour, ZFS is broken up into several layers: ZPL (ZFS Posix Layer), DMU (Data Management Unit), and SPA (Storage Pool Allocator). In terms of external dependencies, the DMU and the SPA (about 80% of the kernel code) need very little from the kernel. The ZPL, which plugs into the Solaris VFS and vnode layer, depends on many interfaces from throughout the kernel. Fortunately, it tends to be less snarly than the rest of the code.

What we've done then is to allow the DMU and SPA to be compiled in both kernel and user context (the ZPL is kernel-only). The user-level code can't be used to mount filesystems, but we can plug a test harness, called ztest, into the DMU and SPA (in place of the ZPL) and run the whole shebang in user-land. Running in user-land gives us fast test cycles (no reboot if we botched the code), and allows us to use a less extreme debugging environment.

Now that we have our code in user-land all ready for a test harness to drive it, what do we do? We do what would typically be called white-box testing, where we purposefully bang on sore spots in our code, along with the standard datapath testing. What kind of tests? Anything and everything, all in parallel, all as fast as we can:

Read, write, create, and delete files and directories
Just what you would expect from your standard filesystem stress test.
Create and destroy entire filesystems and storage pools
It's important to test not only the data path, but the administrative path as well, since that's where a lot of problems typically occur -- doing some administrative operation while the filesystem is getting the crap beat out of it.
Turn compression on/off while data is in flight
Since all ZFS administrative operations are on-line, things can change at any time, and we have to cope.
Change checksum algorithms
Same deal. If the checksum algorithm changes mid-flight, all old blocks are still fine (our checksums and compression functions are on a per-block basis), and any new blocks go out with the new checksum algorithm applied.
Add, replace, attach, and detach devices in the pool
This is a big one. Since all ZFS operations are 100% on-line, we can't pause, wait a moment, grab a big fat lock, or anything that would slow us down. Any administrative operations must not disrupt the flight of data and (of course) must ensure that data integrity is in no way compromised. Since adding disks, replacing disks, or upgrading/downgrading mirrors are "big deals" administratively, we must make sure that we don't trip over any edge conditions in our data path.
Change I/O caching and scheduling policies on-the-fly
The I/O subsystem in ZFS is nothing to sneeze at, and there are many tunables internal to the code. One day, we hope to have the code adaptively tune these on-the-fly in response to workload. In the meantime, we verify that fiddling with them doesn't lead to bad behavior.
Scribble random garbage on any redundant data
Whenever our random device addition happens to have created a redundant configuration (either a mirror or RAID-Z), we take the opportunity to test our self-healing data (automatically repair corrupted data). We destroy data in a random, checkerboard pattern since destroying all copies of a given piece of data is not a particularly interesting data point. We're not PsychicFS, after all.
Scrub, verify, and resilver devices
This ensures that as we're running, we can traverse the contents of the pool in a safe way, verifying all copies of our data along the way, and repairing anything we notice to be out of place.
Force violent crashes to simulate power loss
Since we're in user-land, a "kill -9" will simulate a violent crash due to power loss. No "sync your data to disk", no "here it comes", just "bang, you're dead". We even break up our low-level writes to disk so that they're non-atomic at the syscall level. This way, we simulate partial writes in the event of a power failure. Furthermore, when we re-start the test suite and resume from the simulated power failure, we first verify the integrity of the entire storage pool to ensure everything is totally consistent.

Remember, we're doing all of the above operations as fast as we can. To make things even racier, we put the backing store of the pool under test in /tmp, which is an in-memory filesystem under Solaris. This way, we can expose many race conditions that simply would not occur at the speed of normal disks.

All in all, this means that we put our code through more abuse in about 20 seconds than most users will see in a lifetime. We run this test suite every night as part of our nightly testing, as well as when we're developing code in our private workspaces. And it obeys Bill's first law of test suites: the effectiveness of a test suite is directly proportional to its ease of use. We simply just type "ztest" and off it goes.

Of course all this wonderful test suite leaves out the poor ZPL. Fortunately we have a dedicated test team that also does more traditional testing, running ZFS in the kernel, acting as a real filesystem. They run a whole slew of tests that also try to get maximal code coverage, verify standards compliance, and generally abuse the system as much as possible. While we strive to get ztest covering as much of our code base as possible, nothing beats the real deal in terms of giving ourselves warm, fuzzy feelings.

As our friends at Veritas once said, "If it's not tested, it doesn't work." And I couldn't agree with them more.

Wednesday Aug 31, 2005

Who the heck am I?

A question I ask myself every day. My job description is that I'm a Senior Staff Engineer in the Solaris Kernel group. At present, I'm working on a new filesystem/storage architecture called ZFS. No, it hasn't been shipped yet, but we hope it will soon.

A little history

I've got a BS/EE and MS/CS degrees from universities over in Michigan. After graduating, I went to work for good ol' Sun Microsystems in the Server OS Software group, where I got to watch SunFire (Ex000/Ex500) go out the door and started work on Serengeti (Ex800). As part of my work on Serengeti, I did a lot of HW/SW bringup work. I ported 4 operating systems (Linux, JavaOS, VxWorks, and Chorus) to the Service Processor, and wrote an interactive Java debugger for it. I also did much of the initial UltraSPARC-III bringup work, tripping over some nasty bugs in the first rev of silicon (but that's a story for another day).

By this time, the .com boom was in full swing, and like many other people, I left Sun to join a startup: 3PARdata. I was employee number 6 at 3PAR, being in the group of 3 non-founders that helped get the company off the ground. Being at a small startup that later grew to around 200 employees was one of the most educational experiences I've ever been through. I did everything from design HW (the first FCAL interface board for 3PAR's JBOD), SW architecture (overall design and implementation of the stack), bringup (both board-level and ASIC), and even went so far as to write an x86 BIOS implementation, from scratch, in C (also a story for another day). It was also a very in-depth look at what it takes to get a company off the ground and launch a product; not just from the engineering side, but from the business and operational view as well.

Unfortunately, all good things must come to an end, and I wound up leaving 3PAR. I then did a short stint at a small company called BitMover, working on BitKeeper, their source code control product. As many people can probably imagine, it was a real hoot working with Larry McVoy, even though we don't always agree on things. Even though there were many interesting problems to work on within BitMover, the lack of an office environment (everyone worked form home) was clearly not for me.

At this time, I returned to Sun and joined the Solaris Kernel group, working with a bunch of really talented folks with whom I'd kept in touch with after leaving Sun. Since that time, I've been working on ZFS (more on that later), applying the knowledge I've gained by thinking about storage for the past 6 years.




« April 2014