Wednesday May 03, 2006

You say zeta, I say zetta

On a lighter note, what is the Z in ZFS?

I've seen the (non-)word 'zetabyte' many times in the press. It's also been noted that zeta is the last letter of the Greek alphabet, which will no doubt surprise many Greeks.

So, here's the real story. When I began the (nameless) project, it needed a name. I had no idea what to call it. I knew that the Big Idea was to bring virtual memory concepts to disks, so things like VSS (Virtual Storage System) came to mind. The problem is, everything I came up with sounded like vapor. I don't know about you, but when I hear "Tiered Web Deployment Virtualization Engine" I assume it's either a vast steaming pile of complexity or a big fluffy cloud of nothing -- which is correct north of 98% of the time. So I didn't want a name that screamed "BS!" out of the gate. That actually ruled out a lot of stuff.

At first I was hesitant to pick a name that ended in FS, because that would pigeon-hole the project in a way that's not quite accurate. It's much more than a filesystem, but then again, it is partly a filesystem, and at least people know what that is. It's a real thing. It's not a Tiered Web Deployment Virtualization Engine. I figured it would be better to correct any initial misimpression than to make no impression at all.

So the hunt was on for something-FS. The first thing I did was to Google all 26 three-letter options, from AFS to ZFS. As it turns out, they've all been used by current or previous products, often many times over. But ZFS hadn't been used much, and certainly not by anything mainstream like (for example) AFS, DFS, NFS or UFS.

I should mention that I ruled out BFS or BonwickFS, because it implies either sole authorship or a failure to share credit. Neither would be good.

So in the end, I picked ZFS for the simplest of reasons: it sounds cool. It doesn't come with any baggage, like YFS (Why FS? Ha ha!). And you can even say it in hex (2f5).

The next task was to retrofit the acronym -- that is, I had to come up with something for the Z to stand for. I actually got out my trusty old 1980 Random House Unabridged and looked through the entire Z section, and found nothing that spoke to me. Zero had a certain appeal (as in zero administrative hassle), but then again -- eh.

OK, what does the web have to say? I don't recall exactly how I stumbled on the zetta prefix, but it seemed ideal: ZFS was to be a 128-bit filesystem, and the next unit after exabyte (the 64-bit limit is 16EB) is zettabyte. Perfect: the zettabyte filesystem.

A bit of trivia (in case the rest of this post isn't quite trivial enough for you): the prefix 'zetta' is not of Greek or Latin origin -- at least, not directly. The original SI prefixes were as follows:

	milli	kilo	(1000^1)
	micro	mega	(1000^2)
	nano	giga	(1000^3)
	pico	tera	(1000^4)
	femto	peta	(1000^5)
	atto	exa	(1000^6)

The demand for more extreme units actually began at the small end, so the SI people needed names for 1000^-7 and 1000^-8. For seven, they took the Latin "sept" and made "zepto" (septo wouldn't work because s for second is ubiquitous); for eight, they took the Greek "okto" and made "yocto" (octo wouldn't work because o looks like 0). For symmetry with zepto and yocto on the small side, they defined zetta and yotta on the big side. (I also suspect they were drinking heavily, and thought yotta was a real hoot.)

But zettabyte wasn't perfect, actually. We (we were a team by now) found that when you call it the zettabyte filesystem, you have to explain what a zettabyte is, and by then the elevator has reached the top floor and all people know is that you're doing large capacity. Which is true, but it's not the main point.

So we finally decided to unpimp the name back to ZFS, which doesn't stand for anything. It's just a pseudo-acronym that vaguely suggests a way to store files that gets you a lot of points in Scrabble.

And it is, of course, "the last word in filesystems".

Monday May 01, 2006

Smokin' Mirrors

Resilvering -- also known as resyncing, rebuilding, or reconstructing -- is the process of repairing a damaged device using the contents of healthy devices. This is what every volume manager or RAID array must do when one of its disks dies, gets replaced, or suffers a transient outage.

For a mirror, resilvering can be as simple as a whole-disk copy. For RAID-5 it's only slightly more complicated: instead of copying one disk to another, all of the other disks in the RAID-5 stripe must be XORed together. But the basic idea is the same.
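
In code, the core of it is just XOR (a toy sketch; the names are mine, not any real volume manager's):

```c
/* Sketch: rebuilding a failed disk's block in a RAID-5 stripe by
 * XORing the surviving disks together. Illustrative names only. */
#include <stddef.h>
#include <stdint.h>

void raid5_reconstruct(uint8_t **disks, int ndisks, int failed,
                       uint8_t *out, size_t blocksize)
{
    for (size_t i = 0; i < blocksize; i++) {
        uint8_t x = 0;
        for (int d = 0; d < ndisks; d++)
            if (d != failed)
                x ^= disks[d][i];
        out[i] = x;     /* data ^ data ^ ... ^ parity = missing block */
    }
}
```

Since all disks in a healthy stripe XOR to zero, XORing the survivors yields exactly the missing disk's contents.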

In a traditional storage system, resilvering happens either in the volume manager or in RAID hardware. Either way, it happens well below the filesystem.

But this is ZFS, so of course we just had to be different.

In a previous post I mentioned that RAID-Z resilvering requires a different approach, because it needs the filesystem metadata to determine the RAID-Z geometry. In effect, ZFS does a 'cp -r' of the storage pool's block tree from one disk to another. It sounds less efficient than a straight whole-disk copy, and traversing a live pool safely is definitely tricky (more on that in a future post). But it turns out that there are so many advantages to metadata-driven resilvering that we've chosen to use it even for simple mirrors.

The most compelling reason is data integrity. With a simple disk copy, there's no way to know whether the source disk is returning good data. End-to-end data integrity requires that each data block be verified against an independent checksum -- it's not enough to know that each block is merely consistent with itself, because that doesn't catch common hardware and firmware bugs like misdirected reads and phantom writes.

By traversing the metadata, ZFS can use its end-to-end checksums to detect and correct silent data corruption, just like it does during normal reads. If a disk returns bad data transiently, ZFS will detect it and retry the read. If it's a 3-way mirror and one of the two presumed-good disks is damaged, ZFS will use the checksum to determine which one is correct, copy the data to the new disk, and repair the damaged disk.

A simple whole-disk copy would bypass all of this data protection. For this reason alone, metadata-driven resilvering would be desirable even if it came at a significant cost in performance.

Fortunately, in most cases, it doesn't. In fact, there are several advantages to metadata-driven resilvering:

Live blocks only. ZFS doesn't waste time and I/O bandwidth copying free disk blocks because they're not part of the storage pool's block tree. If your pool is only 10-20% full, that's a big win.

Transactional pruning. If a disk suffers a transient outage, it's not necessary to resilver the entire disk -- only the parts that have changed. I'll describe this in more detail in a future post, but in short: ZFS uses the birth time of each block to determine whether there's anything lower in the tree that needs resilvering. This allows it to skip over huge branches of the tree and quickly discover the data that has actually changed since the outage began.

What this means in practice is that if a disk has a five-second outage, it will only take about five seconds to resilver it. And you don't pay extra for it -- in either dollars or performance -- like you do with Veritas change objects. Transactional pruning is an intrinsic architectural capability of ZFS.
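
Here's a sketch of the pruning idea in C. The structures and names are invented for illustration -- the real ZFS code is more involved -- but the key invariant is real: under copy-on-write, rewriting a block rewrites all of its ancestors, so a parent's birth time is always at least as new as anything beneath it, and an old birth time lets you skip the whole subtree.

```c
/* A sketch of birth-time pruning. The structures and names here are
 * invented for illustration, not ZFS's actual ones. */
#include <stddef.h>
#include <stdint.h>

typedef struct blkptr {
    uint64_t       birth_txg;   /* transaction group that wrote this block */
    int            nchildren;   /* 0 for a leaf (data) block */
    struct blkptr *children;
} blkptr_t;

/* Count the blocks that need resilvering after an outage. A subtree
 * whose root was born at or before the outage cannot contain any
 * changed data, so the whole branch is skipped. */
uint64_t resilver_count(const blkptr_t *bp, uint64_t outage_txg)
{
    if (bp->birth_txg <= outage_txg)
        return 0;               /* nothing below here has changed */
    uint64_t n = 1;             /* this block itself must be resilvered */
    for (int i = 0; i < bp->nchildren; i++)
        n += resilver_count(&bp->children[i], outage_txg);
    return n;
}
```

The traversal only descends into branches that were actually written during the outage, which is why a five-second outage costs seconds, not hours.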

Top-down resilvering. A storage pool is a tree of blocks. The higher up the tree you go, the more disastrous it is to lose a block there, because you lose access to everything beneath it.

Going through the metadata allows ZFS to do top-down resilvering. That is, the very first thing ZFS resilvers is the uberblock and the disk labels. Then it resilvers the pool-wide metadata; then each filesystem's metadata; and so on down the tree. Throughout the process ZFS obeys this rule: no block is resilvered until all of its ancestors have been resilvered.

It's hard to overstate how important this is. With a whole-disk copy, even when it's 99% done there's a good chance that one of the top 100 blocks in the tree hasn't been copied yet. This means that from an MTTR perspective, you haven't actually made any progress: a second disk failure at this point would still be catastrophic.

With top-down resilvering, every single block copied increases the amount of discoverable data. If you had a second disk failure, everything that had been resilvered up to that point would be available.

Priority-based resilvering. ZFS doesn't do this one yet, but it's in the pipeline. ZFS resilvering follows the logical structure of the data, so it would be pretty easy to tag individual filesystems or files with a specific resilver priority. For example, on a file server you might want to resilver calendars first (they're important yet very small), then /var/mail, then home directories, and so on.

What I hope to convey with each of these posts is not just the mechanics of how a particular feature is implemented, but how all the parts of ZFS form an integrated whole. It's not immediately obvious, for example, that transactional semantics would have anything to do with resilvering -- yet transactional pruning makes recovery from transient outages literally orders of magnitude faster. More on how that works in the next post.

Tuesday Dec 13, 2005


The ZFS team is hosting a BoF at FAST in San Francisco at 9PM tonight.

We'll begin with an in-depth look at what ZFS is and how it works. This will be followed by the customary Airing of Grievances. (I'll get you a free ZFS T-shirt if you can get a Festivus pole past security and into our BoF. (Seriously.))

Quite a few of us will be there, so come on up and meet the folks who put the dot in vmcore.0.

Monday Dec 12, 2005

SEEK_HOLE and SEEK_DATA for sparse files

A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you've just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes.

File-based backup and archiving programs like tar, cpio, and rsync can detect runs of zeroes and ignore them, so that the archives they produce aren't filled with zeroes. Still, this is terribly inefficient. If you're a backup program, what you really want is a list of the non-zero segments in the file. But historically there's been no way to get this information without looking at filesystem-specific metadata.

Desperately seeking segments

As part of the ZFS project, we introduced two general extensions to lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of sparse files. Quoting from the new man page:

       o  If whence is SEEK_HOLE, the offset of the start of  the
          next  hole greater than or equal to the supplied offset
          is returned. The definition of a hole is provided  near
          the end of the DESCRIPTION.

       o  If whence is SEEK_DATA, the file pointer is set to  the
          start  of the next non-hole file region greater than or
          equal to the supplied offset.


     A "hole" is defined as a contiguous  range  of  bytes  in  a
     file,  all  having the value of zero, but not all zeros in a
     file are guaranteed to be represented as holes returned with
     SEEK_HOLE. Filesystems are allowed to expose ranges of zeros
     with SEEK_HOLE, but not required to.  Applications  can  use
     SEEK_HOLE  to  optimise  their behavior for ranges of zeros,
     but must not depend on it to find all such ranges in a file.
     The  existence  of  a  hole  at the end of every data region
     allows for easy programming and implies that a virtual  hole
     exists  at  the  end  of  the  file. Applications should use
     fpathconf(_PC_MIN_HOLE_SIZE) or  pathconf(_PC_MIN_HOLE_SIZE)
     to  determine if a filesystem supports SEEK_HOLE. See fpathconf(2).

     For filesystems that do not supply information about  holes,
     the file will be represented as one entire data region.

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers much faster than it can copy out a bunch of zeroes. Even if the search algorithm is linear-time, at least the constant is thousands of times smaller.

Sparse file navigation in ZFS

A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating the number of data blocks beneath it in the tree (see below). Fill counts reveal where holes and data reside, so that ZFS can find both holes and data in guaranteed logarithmic time.

  L4                           6
  L3          5                0                1
         -----------      -----------      -----------
  L2     3    0    2                       0    0    1
        ---  ---  ---                     ---  ---  ---
  L1    111       101                               001

An indirect block tree for a sparse file, showing the fill count in each block pointer. In this example there are three block pointers per indirect block. The lowest-level (L1) block pointers describe either one block of data or one block-sized hole, indicated by a fill count of 1 or 0, respectively. At L2 and above, each block pointer's fill count is the sum of the fill counts of its children. At any level, a non-zero fill count indicates that there's data below. A fill count less than the maximum (3^(L-1) in this example) indicates that there are holes below. From any offset in the file, ZFS can find the next hole or next data in logarithmic time by following the fill counts in the indirect block tree.
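
The search itself can be sketched in a few lines of C. This toy model (invented names; three pointers per indirect block as in the figure; leaves covering exactly one block) shows how fill counts prune the descent:

```c
/* A toy model of fill-count navigation. An internal pointer's fill is
 * the number of data blocks anywhere beneath it; a leaf's fill is 0
 * (hole) or 1 (data). Structure and names are invented. */
#include <stddef.h>
#include <stdint.h>

#define NPTRS 3                 /* block pointers per indirect block */

typedef struct ibp {
    uint64_t    fill;           /* data blocks beneath this pointer */
    struct ibp *children;       /* NULL at L1 (leaf level) */
} ibp_t;

/* Find the first data block at offset >= off (offsets in blocks) under
 * 'bp', which covers [base, base + span). Returns -1 if none. */
int64_t next_data(const ibp_t *bp, int64_t base, int64_t span, int64_t off)
{
    if (bp->fill == 0 || base + span <= off)
        return -1;              /* all hole, or entirely before off */
    if (bp->children == NULL)
        return base;            /* one data block; base >= off here */
    int64_t child_span = span / NPTRS;
    for (int i = 0; i < NPTRS; i++) {
        int64_t r = next_data(&bp->children[i],
                              base + i * child_span, child_span, off);
        if (r >= 0)
            return r;           /* leftmost data at or after off */
    }
    return -1;                  /* all the data under bp was before off */
}
```

Because entire hole subtrees are rejected with a single fill-count test, the search visits at most one path per level -- hence the logarithmic bound.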


At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific. I encourage (implore? beg?) other operating systems to adopt these lseek(2) extensions verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous feature that every backup and archiving program can rely on. It's long overdue.

Thursday Dec 08, 2005

ZFS End-to-End Data Integrity

The job of any filesystem boils down to this: when asked to read a block, it should return the same data that was previously written to that block. If it can't do that -- because the disk is offline or the data has been damaged or tampered with -- it should detect this and return an error.

Incredibly, most filesystems fail this test. They depend on the underlying hardware to detect and report errors. If a disk simply returns bad data, the average filesystem won't even detect it.

Even if we could assume that all disks were perfect, the data would still be vulnerable to damage in transit: controller bugs, DMA parity errors, and so on. All you'd really know is that the data was intact when it left the platter. If you think of your data as a package, this would be like UPS saying, "We guarantee that your package wasn't damaged when we picked it up." Not quite the guarantee you were looking for.

In-flight damage is not a mere academic concern: even something as mundane as a bad power supply can cause silent data corruption.

Arbitrarily expensive storage arrays can't solve the problem. The I/O path remains just as vulnerable, but becomes even longer: after leaving the platter, the data has to survive whatever hardware and firmware bugs the array has to offer.

And if you're on a SAN, you're using a network designed by disk firmware writers. God help you.

What to do? One option is to store a checksum with every disk block. Most modern disk drives can be formatted with sectors that are slightly larger than the usual 512 bytes -- typically 520 or 528. These extra bytes can be used to hold a block checksum. But making good use of this checksum is harder than it sounds: the effectiveness of a checksum depends tremendously on where it's stored and when it's evaluated.

In many storage arrays (see the Dell|EMC PowerVault paper for a typical example with an excellent description of the issues), the data is compared to its checksum inside the array. Unfortunately this doesn't help much. It doesn't detect common firmware bugs such as phantom writes (the previous write never made it to disk) because the data and checksum are stored as a unit -- so they're self-consistent even when the disk returns stale data. And the rest of the I/O path from the array to the host remains unprotected. In short, this type of block checksum provides a good way to ensure that an array product is not any less reliable than the disks it contains, but that's about all.

NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.

This is a major improvement, but it's still not enough. A block-level checksum only proves that a block is self-consistent; it doesn't prove that it's the right block. Reprising our UPS analogy, "We guarantee that the package you received is not damaged. We do not guarantee that it's your package."

The fundamental problem with all of these schemes is that they don't provide fault isolation between the data and the checksum that protects it.

ZFS Data Authentication

End-to-end data integrity requires that each data block be verified against an independent checksum, after the data has arrived in the host's memory. It's not enough to know that each block is merely consistent with itself, or that it was correct at some earlier point in the I/O path. Our goal is to detect every possible form of damage, including human mistakes like swapping on a filesystem disk or mistyping the arguments to dd(1). (Have you ever typed "of=" when you meant "if="?)

A ZFS storage pool is really just a tree of blocks. ZFS provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer -- not in the block itself. Every block in the tree contains the checksums for all its children, so the entire pool is self-validating. [The uberblock (the root of the tree) is a special case because it has no parent; more on how we handle that in another post.]

When the data and checksum disagree, ZFS knows that the checksum can be trusted because the checksum itself is part of some other block that's one level higher in the tree, and that block has already been validated.
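
The idea reduces to something like this sketch. The names and the stand-in checksum function are mine, not ZFS's; the point is only that the checksum used to validate a block is stored in, and arrives via, its already-validated parent:

```c
/* Sketch: a block's checksum lives in its PARENT's block pointer, so
 * data and checksum fail independently. Illustrative names only. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t cksum;     /* checksum of the child block this points to */
    uint64_t offset;    /* where the child lives on "disk" */
} blkptr_t;

/* A trivial stand-in checksum (ZFS uses fletcher2/4 or SHA-256). */
static uint64_t cksum(const uint8_t *buf, size_t len)
{
    uint64_t h = 14695981039346656037ULL;       /* FNV-1a */
    for (size_t i = 0; i < len; i++)
        h = (h ^ buf[i]) * 1099511628211ULL;
    return h;
}

/* Read a child block and verify it against the checksum stored in the
 * already-validated parent pointer. Returns 0 on success. */
int read_block(const blkptr_t *parent_bp, const uint8_t *disk,
               uint8_t *buf, size_t len)
{
    memcpy(buf, disk + parent_bp->offset, len); /* the "disk read" */
    return cksum(buf, len) == parent_bp->cksum ? 0 : -1;
}
```

A self-checksummed block can't tell you about a phantom write or misdirected read; a parent-checksummed block can, because stale or misplaced data no longer matches what the parent expected.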

ZFS uses its end-to-end checksums to detect and correct silent data corruption. If a disk returns bad data transiently, ZFS will detect it and retry the read. If the disk is part of a mirror or RAID-Z group, ZFS will both detect and correct the error: it will use the checksum to determine which copy is correct, provide good data to the application, and repair the damaged copy.

As always, note that ZFS end-to-end data integrity doesn't require any special hardware. You don't need pricey disks or arrays, you don't need to reformat drives with 520-byte sectors, and you don't have to modify applications to benefit from it. It's entirely automatic, and it works with cheap disks.

But wait, there's more!

The blocks of a ZFS storage pool form a Merkle tree in which each block validates all of its children. Merkle trees have been proven to provide cryptographically-strong authentication for any component of the tree, and for the tree as a whole. ZFS employs 256-bit checksums for every block, and offers checksum functions ranging from the simple-and-fast fletcher2 (the default) to the slower-but-secure SHA-256. When using a cryptographic hash like SHA-256, the uberblock checksum provides a constantly up-to-date digital signature for the entire storage pool.

Which comes in handy if you ask UPS to move it.
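
For the curious, a Fletcher-style checksum is just running sums plus sums-of-sums. This sketch is in the spirit of fletcher2 -- four 64-bit accumulators over 64-bit words -- but is not guaranteed bit-identical to ZFS's implementation:

```c
/* A Fletcher-style 256-bit checksum over 64-bit words, in the spirit
 * of ZFS's fletcher2 (not guaranteed bit-identical to it). */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t word[4]; } cksum256_t;

cksum256_t fletcher2ish(const uint64_t *buf, size_t nwords)
{
    uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;
    for (size_t i = 0; i + 1 < nwords; i += 2) {
        a0 += buf[i];           /* running sums ...                    */
        a1 += buf[i + 1];
        b0 += a0;               /* ... and sums-of-sums, which make    */
        b1 += a1;               /* the checksum sensitive to ordering  */
    }
    return (cksum256_t){ { a0, a1, b0, b1 } };
}
```

The second-order sums are what distinguish this from a plain additive checksum: swapping two words leaves the simple sums unchanged but perturbs the sums-of-sums.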

Thursday Nov 17, 2005


The original promise of RAID (Redundant Arrays of Inexpensive Disks) was that it would provide fast, reliable storage using cheap disks. The key point was cheap; yet somehow we ended up here. Why?

RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, EVENODD, and Row-Diagonal Parity) never quite delivered on the RAID promise -- and can't -- due to a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero -- it's that equation that allows you to reconstruct data when a disk fails. The problem is that there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage.

To see this, suppose you lose power after writing a data block but before writing the corresponding parity block. Now the data and parity for that stripe are inconsistent, and they'll remain inconsistent forever (unless you happen to overwrite the old data with a full-stripe write at some point). Therefore, if a disk fails, the RAID reconstruction process will generate garbage the next time you read any block on that stripe. What's worse, it will do so silently -- it has no idea that it's giving you corrupt data.

There are software-only workarounds for this, but they're so slow that software RAID has died in the marketplace. Current RAID products all do the RAID logic in hardware, where they can use NVRAM to survive power loss. This works, but it's expensive.

There's also a nasty performance problem with existing RAID schemes. When you do a partial-stripe write -- that is, when you update less data than a single RAID stripe contains -- the RAID system must read the old data and parity in order to compute the new parity. That's a huge performance hit. Where a full-stripe write can simply issue all the writes asynchronously, a partial-stripe write must do synchronous reads before it can even start the writes.
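
Concretely, the parity update for a partial-stripe write is new_parity = old_parity XOR old_data XOR new_data -- which is why two synchronous reads must finish before the writes can even begin. A toy sketch:

```c
/* Sketch of a partial-stripe write under RAID-5-style parity.
 * In a real array, reading data_disk and parity_disk means two
 * synchronous disk reads before any write can be issued. */
#include <stddef.h>
#include <stdint.h>

void partial_stripe_write(uint8_t *data_disk, uint8_t *parity_disk,
                          const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* new_parity = old_parity ^ old_data ^ new_data */
        parity_disk[i] ^= data_disk[i] ^ new_data[i];
        data_disk[i] = new_data[i];
    }
}
```

Lose power between the two assignments (on real disks, between the two writes) and the stripe no longer XORs to zero: that's the write hole.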

Once again, expensive hardware offers a solution: a RAID array can buffer partial-stripe writes in NVRAM while it's waiting for the disk reads to complete, so the read latency is hidden from the user. Of course, this only works until the NVRAM buffer fills up. No problem, your storage vendor says! Just shell out even more cash for more NVRAM. There's no problem your wallet can't solve.

Partial-stripe writes pose an additional problem for a transactional filesystem like ZFS. A partial-stripe write necessarily modifies live data, which violates one of the rules that ensures transactional semantics. (It doesn't matter if you lose power during a full-stripe write for the same reason that it doesn't matter if you lose power during any other write in ZFS: none of the blocks you're writing to are live yet.)

If only we didn't have to do those evil partial-stripe writes...

Enter RAID-Z.

RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do read-modify-write.

Whoa, whoa, whoa -- that's it? Variable stripe width? Geez, that seems pretty obvious. If it's such a good idea, why doesn't everybody do it?

Well, the tricky bit here is RAID-Z reconstruction. Because the stripes are all different sizes, there's no simple formula like "all the disks XOR to zero." You have to traverse the filesystem metadata to determine the RAID-Z geometry. Note that this would be impossible if the filesystem and the RAID array were separate products, which is why there's nothing like RAID-Z in the storage market today. You really need an integrated view of the logical and physical structure of the data to pull it off.

But wait, you say: isn't that slow? Isn't it expensive to traverse all the metadata? Actually, it's a trade-off. If your storage pool is very close to full, then yes, it's slower. But if it's not too close to full, then metadata-driven reconstruction is actually faster because it only copies live data; it doesn't waste time copying unallocated disk space.

But far more important, going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes. Traditional RAID products can't do this; they simply XOR the data together blindly.

Which brings us to the coolest thing about RAID-Z: self-healing data. In addition to handling whole-disk failure, RAID-Z can also detect and correct silent data corruption. Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn't return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data. It then repairs the damaged disk and returns good data to the application. ZFS also reports the incident through Solaris FMA so that the system administrator knows that one of the disks is silently failing.
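
For a single-parity group, the combinatorial part can be sketched like this (invented names and a stand-in checksum; real RAID-Z also handles variable stripe geometry):

```c
/* Sketch of combinatorial reconstruction for a single-parity stripe:
 * if the checksum fails, assume each disk in turn returned bad data,
 * rebuild it from the others, and accept the candidate that makes the
 * checksum pass. Names and checksum are illustrative, not ZFS's. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAXLEN 64   /* sketch only: per-disk block size limit */

static uint64_t cksum64(const uint8_t *buf, size_t len)
{
    uint64_t h = 14695981039346656037ULL;       /* FNV-1a */
    for (size_t i = 0; i < len; i++)
        h = (h ^ buf[i]) * 1099511628211ULL;
    return h;
}

/* Checksum of the logical data: the data disks concatenated in order.
 * (disks[0..ndisks-2] are data, disks[ndisks-1] is parity.) */
static uint64_t data_cksum(uint8_t **disks, int ndisks, size_t len)
{
    uint8_t data[MAXLEN * 8];   /* sketch assumes <= 8 data disks */
    for (int d = 0; d < ndisks - 1; d++)
        memcpy(data + d * len, disks[d], len);
    return cksum64(data, (size_t)(ndisks - 1) * len);
}

/* Returns the index of the disk found bad (repaired in place), or -1
 * if no single-disk repair makes the checksum pass. */
int raidz1_self_heal(uint8_t **disks, int ndisks, size_t len,
                     uint64_t expected)
{
    uint8_t cand[MAXLEN], save[MAXLEN];

    for (int bad = 0; bad < ndisks; bad++) {
        memset(cand, 0, len);
        for (int d = 0; d < ndisks; d++)
            if (d != bad)
                for (size_t i = 0; i < len; i++)
                    cand[i] ^= disks[d][i];     /* XOR of the others */
        memcpy(save, disks[bad], len);
        memcpy(disks[bad], cand, len);          /* tentative repair */
        if (data_cksum(disks, ndisks, len) == expected)
            return bad;                         /* the checksum agrees */
        memcpy(disks[bad], save, len);          /* nope; put it back */
    }
    return -1;
}
```

The checksum is the arbiter: a candidate reconstruction is accepted only if it makes the independently stored block checksum verify, which is how ZFS can tell *which* disk lied.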

Finally, note that RAID-Z doesn't require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks.

For a real-world example of RAID-Z detecting and correcting silent data corruption on flaky hardware, check out Eric Lowe's SATA saga.

The current RAID-Z algorithm is single-parity, but the RAID-Z concept works for any RAID flavor. A double-parity version is in the works.

One last thing that fellow programmers will appreciate: the entire RAID-Z implementation is just 599 lines.

Wednesday Nov 16, 2005

Thank You

I was originally planning to blog about RAID-Z today, but I can't.

There's been such a tremendous outpouring of energy and passion for the ZFS launch today that the only thing I can think about is this: thank you.

Eric Schrock and Dan Price have been tireless in their efforts to get the OpenSolaris ZFS community up and running. Bryan Cantrill has rounded up all the ZFS blog entries from today, which run the gamut from informative to surprising to downright funny. And for me, as the papa bear, every one of them is also personally touching. Thank you for using ZFS during the early days, for providing insights on how to make it better, and for taking the time to write up your personal experiences and spread the word.

I'm incredibly proud of the ZFS team, and deeply grateful: no one person can do something on this scale. Thank you for making the leap of faith to join me in this effort, to create something out of nothing, to take an idea from vision to reality. It has been an amazing and wonderful experience.

To Cathy, Andrew, David, and Galen: thank you for your support, encouragement, patience, and love. Six years is a long time to wait for Daddy to finish his project. But somehow, you managed to understand how important this was to me, and for you, that was good enough: you've been a constant source of energy the whole way. I can't imagine doing it without you.

Awww, now I'm getting all weepy.

Please come back tomorrow and I promise we'll have that chat about RAID-Z.

Monday Oct 31, 2005

ZFS: The Last Word in Filesystems


Halloween has been a special day for ZFS since its inception.

On 10/31/2001, we got the user-level prototype working.

On 10/31/2002, we got the first in-kernel mount.

And today, 10/31/2005, we integrated ZFS into Solaris. ZFS will hit the street in a couple of weeks via Solaris Express.

We (the ZFS team) will have much more to say about ZFS in the coming weeks. But tonight, I just want to tell you what it was like to drive this thing home.

The ZFS team is distributed: we have people working in Menlo Park, California; Broomfield, Colorado; and several other places. This week, we flew everyone in to Menlo Park and took over a giant conference room.

The first thing you notice is the heat. These rooms are made for 100-watt humans, not multi-gigahertz CPUs and associated paraphernalia. And, like any big company, Sun is all about saving money in the dumbest ways -- like turning off the A/C at night, and outsourcing the people who could possibly turn it back on.

At first, things went pretty well. We commandeered the Nob Hill conference room, which has a long table and lots of power and network taps. We brought in a bunch of machines and created 'Camp ZFS'. Each new box amped up the heat level, to the point that it became difficult to think. So we grabbed a 36-inch fan from one of the labs to get the air flowing. That was a huge help, although it sounded like you were wearing a pair of lawn mowers as headphones.

On Sunday, we plugged in one more laptop. That was it -- we blew the circuit breaker! So here we are, less than 24 hours from our scheduled integration date, and all of our computers are without power -- and the room containing the circuit breakers was locked. (Thanks, OSHA!)

So we took a three-pronged approach: (1) went through the Approved Process to get power restored (ETA: April); (2) hunted down someone from campus security to get us key access to the electrical room (ETA: Sunday night); and (3) sent our manager to Home Depot to buy a bunch of 100-foot extension cords so we could, if necessary, run Nob Hill off of the adjacent lab's power grid (ETA: 30 minutes).

All three came through. We ran half of the load over extension cords to the lab, the other half on the Nob Hill circuit. It took a bit of experimentation to find a load balance that would stay up without tripping the breaker again. Apparently, we had angered it. (Even now, I'm typing this blog entry courtesy of a shiny new yellow extension cord.)

Meanwhile, the clock was ticking.

At the end of a large project like this, it's never the technology that kills you -- it's the process, the cleanup of home-grown customizations, the packaging, the late-breaking code review comments, the collapsing of SCCS deltas, stuff like that. With power back up, we slogged on until about 4AM. Everything looked good, so we went home to sleep. Actually, some people just crashed on the couches in my office, and Bill's office next door.

By 10AM Monday we were back, making sure that all the final tests had run successfully, and working through the last bits of paperwork with the Solaris release team. After five years of effort, it was time to type 'putback'.

Denied! The permissions on a directory were wrong. Fix them up, try again.

Denied! One more TeamWare directory with wrong permissions. Fine, fix that too.

Last try... and there it goes! 584 files, 92,000 lines of change, 56 patents, 5 years... and there it is. Just like that.

Fortunately we were prepared for either success or failure. We had brought in a massive array of vodka, tequila, wine... you name it, we had it. And not in small quantities.

As I said at the beginning, we'll have lots more to say in the coming days. But right now, it's time to sleep!


Tuesday Jun 14, 2005

Now It Can Be Told

For thirty years, people have speculated on the identity of Deep Throat. Last week the mystery was finally revealed: the man called Deep Throat was not John Dean, not G. Gordon Liddy, not Al Haig -- no, it was none other than Mark Felt. That's right, Mark Felt.

I'm sorry -- who? Who the f*** is Mark Felt? Felt, F-E-L-T? Let me think for a minute... Felt... Felt... nope, never heard of him. Ever. You've got to be kidding me. I've waited thirty years for this?

Well, if you think that revelation was a letdown, just keep scrolling: because while the chattering classes in Washington were playing the Deep Throat parlor game, an even less compelling question was nagging at far fewer people: where did the slab allocator get its name?

About 20 years ago Kellogg's ran a commercial for Special K cereal with the tag line, "Can you pinch an inch?" The implication was that you were overweight if you could pinch more than an inch of fat on your waist -- and that hoovering a bowl of corn flakes would help.

Back in high school I was watching TV with some friends, and this commercial came on. The voice intoned: "Can you pinch an inch?" Without missing a beat, Tommy, who weighed about 250 pounds, reached for his midsection and offered his response: "Hell, I can grab a slab!"

Fast forward 10 years. I was debugging a heap corruption bug in the Solaris 2.2 kernel. The kernel had actually crashed in the allocator itself, as often happens when something corrupts the heap. Trying to disentangle the state, I had the crash dump in one window and the source code for the Solaris 2.2 kernel memory allocator in another.

The code had all the hallmarks of what we in the industry call "swill": mysterious hard-coded constants, global locks, fixed-size arrays, unions, macros that implicitly reference global state, and variable names that are at once long and uninformative. A small taste:

if ((listp->fl_slack >= 2) || (tmpsiz == max)) {
        listp->fl_slack -= 2;
        HEADFREE(listp, bufp);
        if (tmpsiz < max)
                bufp->fb_union.fb_state = DELAYED;
        else {
                LOOKUP(addr, bufp->fb_union.fb_poolp);
                if (--(bufp->fb_union.fb_poolp->bp_inuse) == 0 &&
                   bufp->fb_union.fb_poolp != Km_NewSmall &&
                   bufp->fb_union.fb_poolp != Km_NewBig) {
                        if (Km_IdleI < NIDLEP) {
                                if ((bufp->fb_union.fb_poolp->bp_status &
                                    BP_IDLE) == 0) {
                                        Km_Idlep[Km_IdleI++] = ...

and on and on it went.

All of this might have been tolerable if the allocator performed well, or was space-efficient, or had some other redeeming virtue. But no. It was big, fat, slow, and had to die. I went home that night determined to burn it to the ground. (The allocator, not my home.)

The basic idea for the new allocator was to grab a page of memory, carve it up into equal-size chunks, and use a reference count to keep track of how many chunks were allocated. When the reference count went to zero, the page could be freed.

That worked fine for small buffers (say 1/8 of a page or smaller), but wouldn't quite work for larger buffers: those would require more than one page. So I couldn't just grab a page; I'd have to grab a ... something. "Grab" must have been a mental trigger: all of a sudden I remembered Tommy, and how much I loved that line, and that sealed it: we would grab a slab, and this would be called the slab allocator.
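That core idea fits in a few dozen lines of C. This is a minimal sketch with hypothetical names, not the actual Solaris implementation (which adds object caching, per-CPU magazines, coloring, and multi-page slabs): carve one page into equal-size chunks, thread a free list through them, and hand the page back when the reference count drops to zero.

```c
#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE 4096

typedef struct slab {
    size_t chunk_size;   /* size of each equal-size chunk */
    int    refcnt;       /* number of chunks currently allocated */
    void  *free_list;    /* free list threaded through the free chunks */
    char   mem[PAGE_SIZE];
} slab_t;

slab_t *slab_create(size_t chunk_size)
{
    slab_t *s = malloc(sizeof (slab_t));
    if (s == NULL)
        return NULL;
    s->chunk_size = chunk_size;
    s->refcnt = 0;
    s->free_list = NULL;
    /* Carve the page into chunks, pushing each onto the free list. */
    for (size_t off = 0; off + chunk_size <= PAGE_SIZE; off += chunk_size) {
        void *chunk = s->mem + off;
        *(void **)chunk = s->free_list;
        s->free_list = chunk;
    }
    return s;
}

void *slab_alloc(slab_t *s)
{
    void *chunk = s->free_list;
    if (chunk != NULL) {
        s->free_list = *(void **)chunk;
        s->refcnt++;
    }
    return chunk;
}

/* Returns 1 when the last chunk comes back, i.e. the slab can be freed. */
int slab_free(slab_t *s, void *chunk)
{
    *(void **)chunk = s->free_list;
    s->free_list = chunk;
    return (--s->refcnt == 0);
}
```

No searching, no coalescing, no mysterious constants: allocation and free are both a couple of pointer moves, which is a big part of why the real thing was so much faster than the swill it replaced.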

Now you know. And Mark Felt is looking pretty dang interesting.


Friday Sep 24, 2004

128-bit storage: are you high?

One gentle reader offered this feedback on our recent ZFS article:

64 bits would have been plenty ... but then you can't talk out of your ass about boiling oceans then, can you?

Well, it's a fair question. Why did we make ZFS a 128-bit storage system? What on earth made us think it's necessary? And how do we know it's sufficient?

Let's start with the easy one: how do we know it's necessary?

Some customers already have datasets on the order of a petabyte, or 2^50 bytes. Thus the 64-bit capacity limit of 2^64 bytes is only 14 doublings away. Moore's Law for storage predicts that capacity will continue to double every 9-12 months, which means we'll start to hit the 64-bit limit in about a decade. Storage systems tend to live for several decades, so it would be foolish to create a new one without anticipating the needs that will surely arise within its projected lifetime.

If 64 bits isn't enough, the next logical step is 128 bits. That's enough to survive Moore's Law until I'm dead, and after that, it's not my problem. But it does raise the question: what are the theoretical limits to storage capacity?

Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some fundamental limits on the computation rate and information capacity of any physical device. In particular, it has been shown that 1 kilogram of matter confined to 1 liter of space can perform at most 10^51 operations per second on at most 10^31 bits of information [see Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)]. A fully-populated 128-bit storage pool would contain 2^128 blocks = 2^137 bytes = 2^140 bits; therefore the minimum mass required to hold the bits would be (2^140 bits) / (10^31 bits/kg) = 136 billion kg.

That's a lot of gear.

To operate at the 10^31 bits/kg limit, however, the entire mass of the computer must be in the form of pure energy. By E=mc^2, the rest energy of 136 billion kg is 1.2x10^28 J. The mass of the oceans is about 1.4x10^21 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about 2.4x10^6 J/kg * 1.4x10^21 kg = 3.4x10^27 J. In short, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans.

Thursday Sep 16, 2004


Welcome aboard! I'm Jeff Bonwick, a Distinguished Engineer (la-de-da!) at Sun. I'm guessing you're here because you recently read about ZFS.

Let me begin with a note of thanks.

According to Sun's website staff, the ZFS article has generated the highest reader response ever -- thank you! The ZFS team gets to see all the feedback you provide, so please keep it coming. I'll respond to some of the more interesting comments in this blog.

My favorite comment thus far was a caustic remark about 128-bit storage, which will be the subject of the next post...



