Rampant Layering Violation?
By bonwick on May 03, 2007
Andrew Morton has famously called ZFS a "rampant layering violation" because it combines the functionality of a filesystem, volume manager, and RAID controller. I suppose it depends what the meaning of the word violate is. While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.
An example from mathematics (my actual background) provides a useful prologue.
Suppose you had to compute the sum, from n=1 to infinity, of 1/(n(n+1)).
Expanding that out term by term, we have:
1/(1*2) + 1/(2*3) + 1/(3*4) + 1/(4*5) + ...
= 1/2 + 1/6 + 1/12 + 1/20 + ...
What does that infinite series add up to? It may seem like a hard problem, but that's only because we're not looking at it right. If you're clever, you might notice that there's a different way to express each term:
1/(n(n+1)) = 1/n - 1/(n+1)
For example:
1/(1*2) = 1/1 - 1/2
1/(2*3) = 1/2 - 1/3
1/(3*4) = 1/3 - 1/4
Thus, our sum can be expressed as:
(1/1 - 1/2) + (1/2 - 1/3) + (1/3 - 1/4) + (1/4 - 1/5) + ...
Now, notice the pattern: each term that we subtract, we add back. Only in Congress does that count as work. So if we just rearrange the parentheses -- that is, if we rampantly violate the layering of the original problem by using associativity to refactor the arithmetic across adjacent terms of the series -- we get this:
1/1 + (-1/2 + 1/2) + (-1/3 + 1/3) + (-1/4 + 1/4) + ...
1/1 + 0 + 0 + 0 + ...
In other words, the sum from n=1 to infinity of 1/(n(n+1)) is exactly 1.
Isn't that cool?
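The cancellation is easy to check for yourself. The short Python sketch below is an illustration only, and its function names are made up for this example. It adds up the first N terms using exact rational arithmetic and compares the result with what is left after the inner terms cancel, namely 1 - 1/(N+1).

```python
from fractions import Fraction


def partial_sum(n_terms: int) -> Fraction:
    """Add up 1/(n(n+1)) for n = 1..n_terms, one term at a time."""
    return sum(
        (Fraction(1, n * (n + 1)) for n in range(1, n_terms + 1)),
        Fraction(0),
    )


def telescoped(n_terms: int) -> Fraction:
    """What survives after the inner terms cancel: 1 - 1/(N+1)."""
    return 1 - Fraction(1, n_terms + 1)


if __name__ == "__main__":
    for n_terms in (1, 2, 3, 10, 1000):
        # The brute-force sum and the telescoped form agree exactly.
        assert partial_sum(n_terms) == telescoped(n_terms)
        print(n_terms, partial_sum(n_terms), float(partial_sum(n_terms)))
```

The first N terms always add up to N/(N+1), which gets as close to 1 as you like as N grows. That limit is what the rearranged series is expressing.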
Mathematicians have a term for this. When you rearrange the terms of a series so that they cancel out, it's called telescoping -- by analogy with a collapsible hand-held telescope. In a nutshell, that's what ZFS does: it telescopes the storage stack. That's what allows us to have a filesystem, volume manager, single- and double-parity RAID, compression, snapshots, clones, and a ton of other useful stuff in just 80,000 lines of code.
A storage system is more complex than this simple analogy, but at a high level the same idea really does apply. You can think of any storage stack as a series of translations from one naming scheme to another -- ultimately translating a filename to a disk LBA (logical block address). Typically it looks like this:
filesystem (upper): filename to object (inode)
filesystem (lower): object to volume LBA
volume manager: volume LBA to array LBA
RAID controller: array LBA to disk LBA
This is the stack we're about to refactor.
First, note that the traditional filesystem layer is too monolithic. It would be better to separate the filename-to-object part (the upper half) from the object-to-volume-LBA part (the lower half) so that we could reuse the same lower-half code to support other kinds of storage, like objects and iSCSI targets, which don't have filenames. These storage classes could then speak to the object layer directly. This is more efficient than going through something like /dev/lofi, which makes a POSIX file look like a device. But more importantly, it provides a powerful new programming model -- object storage -- without any additional code.
Second, note that the volume LBA is completely useless. Adding a layer of indirection often adds flexibility, but not in this case: in effect we're translating from English to French to German when we could just as easily translate from English to German directly. The intermediate French has no intrinsic value. It's not visible to applications, it's not visible to the RAID array, and it doesn't provide any administrative function. It's just overhead.
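To make the English-to-French-to-German point concrete, here is a toy Python sketch. The constants and function names are invented for illustration and do not correspond to any real filesystem or volume manager; real translations are far more involved than adding an offset. The point is only that two back-to-back address translations can be replaced by a single direct one, after which the intermediate volume LBA never needs to exist.

```python
BLOCKS_PER_OBJECT = 8    # hypothetical: every object gets a fixed 8-block extent
VOLUME_BASE_LBA = 4096   # hypothetical: where the volume starts on the array


def object_to_volume_lba(obj: int, block: int) -> int:
    """Filesystem lower half (toy): object block -> volume LBA."""
    return obj * BLOCKS_PER_OBJECT + block


def volume_to_array_lba(volume_lba: int) -> int:
    """Volume manager (toy): volume LBA -> array LBA."""
    return VOLUME_BASE_LBA + volume_lba


def object_to_array_lba(obj: int, block: int) -> int:
    """Telescoped version: one translation, no volume LBA in between."""
    return VOLUME_BASE_LBA + obj * BLOCKS_PER_OBJECT + block


if __name__ == "__main__":
    for obj in range(4):
        for block in range(BLOCKS_PER_OBJECT):
            two_step = volume_to_array_lba(object_to_volume_lba(obj, block))
            # Both routes land on the same array address.
            assert two_step == object_to_array_lba(obj, block)
    print("two-step and direct translations agree")
```

In this toy model the direct function is just the composition written out by hand. In a real stack the gain is larger, because the intermediate layer also carries its own metadata and bookkeeping that disappear along with it.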
So ZFS telescoped that entire layer away. There are just three distinct layers in ZFS: the ZPL (ZFS POSIX Layer), which provides traditional POSIX filesystem semantics; the DMU (Data Management Unit), which provides a general-purpose transactional object store; and the SPA (Storage Pool Allocator), which provides virtual block allocation and data transformations (replication, compression, and soon encryption). The overall ZFS translation stack looks like this:
ZPL: filename to object
DMU: object to DVA (data virtual address)
SPA: DVA to disk LBA
The DMU provides both file and block access to a common pool of physical storage. File access goes through the ZPL, while block access is just a direct mapping to a single DMU object. We're also developing new data access methods that use the DMU's transactional capabilities in more interesting ways -- more about that another day.
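The three translations above can be sketched as a toy model in Python. This is not ZFS code: the class and method names are hypothetical, the striping policy is an assumption chosen for brevity, and transactions, checksums, and replication are left out entirely. It only shows the shape of the stack, with file access going through the filename layer and block access talking to a single object directly.

```python
from __future__ import annotations


class SPA:
    """Toy stand-in for the pool allocator: maps a DVA to a (disk, LBA) pair."""

    def __init__(self, n_disks: int) -> None:
        self.disks: list[dict[int, bytes]] = [{} for _ in range(n_disks)]
        self.next_dva = 0

    def translate(self, dva: int) -> tuple[int, int]:
        # Assumed policy: stripe consecutive DVAs across the disks.
        return dva % len(self.disks), dva // len(self.disks)

    def write(self, data: bytes) -> int:
        dva = self.next_dva
        self.next_dva += 1
        disk, lba = self.translate(dva)
        self.disks[disk][lba] = data
        return dva

    def read(self, dva: int) -> bytes:
        disk, lba = self.translate(dva)
        return self.disks[disk][lba]


class DMU:
    """Toy object store: maps (object id, block index) to a DVA."""

    def __init__(self, spa: SPA) -> None:
        self.spa = spa
        self.objects: dict[int, dict[int, int]] = {}

    def create(self) -> int:
        obj = len(self.objects)
        self.objects[obj] = {}
        return obj

    def write(self, obj: int, block: int, data: bytes) -> None:
        # Every write gets a fresh DVA; the old block is simply left behind here.
        self.objects[obj][block] = self.spa.write(data)

    def read(self, obj: int, block: int) -> bytes:
        return self.spa.read(self.objects[obj][block])


class ZPL:
    """Toy filename layer: maps a filename to an object id."""

    def __init__(self, dmu: DMU) -> None:
        self.dmu = dmu
        self.names: dict[str, int] = {}

    def write_file(self, name: str, data: bytes) -> None:
        if name not in self.names:
            self.names[name] = self.dmu.create()
        self.dmu.write(self.names[name], 0, data)

    def read_file(self, name: str) -> bytes:
        return self.dmu.read(self.names[name], 0)


if __name__ == "__main__":
    spa = SPA(n_disks=2)
    dmu = DMU(spa)
    zpl = ZPL(dmu)
    zpl.write_file("notes.txt", b"hello")   # file access: name -> object -> DVA -> disk
    volume = dmu.create()                   # block access: one object, no filenames
    dmu.write(volume, 7, b"raw block")
    print(zpl.read_file("notes.txt"), dmu.read(volume, 7))
```

Note that the block-access path never touches the filename layer at all, which is the reuse the lower-half split is meant to allow.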
The ZFS architecture eliminates an entire layer of translation -- and along with it, an entire class of metadata (volume LBAs). It also eliminates the need for hardware RAID controllers. At the same time, it provides a useful new interface -- object storage -- that was previously inaccessible because it was buried inside a monolithic filesystem.
I certainly don't feel violated. Do you?