Monday Sep 27, 2010

And now, page 2

To my team:

After 20 incredible years at Sun/Oracle, I have decided to try something new.

This was a very hard decision, and not one made lightly.  I have always enjoyed my work, and still do -- everything from MTS-2 to Sun Fellow to Oracle VP.  I love the people I work with and the technology we've created together, which is why I've been doing it for so long.  But I have always wanted to try doing a startup, and recently identified an opportunity that I just can't resist.  (We are in stealth mode, so that's all I can say for now.)

This team will always have a special place in my heart.  Being part of the Solaris team means doing the Right Thing, innovating, changing the rules, and being thought leaders -- creating the ideas that everyone else wants to copy. Add to that Oracle's unmatched market reach and ability to execute, and you have a combination that I believe will succeed in ways we couldn't have imagined two years ago.  I hope that Solaris and ZFS Storage are wildly successful, and that you have fun making it happen.

To the ZFS community:

Thank you for being behind us from Day One.  After a decade in the making, ZFS is now an adult.  Of course there's always more to do, and from this point forward, I look forward to watching you all do it.  There is a great quote whose origin I have never found: "Your ideas will go further if you don't insist on going with them."  That has proven correct many times in my life, and I am confident that it will prove true again.

For me, it's time to try the Next Big Thing.  Something I haven't fully fleshed out yet.  Something I don't fully understand yet.  Something way outside my comfort zone.  Something I might fail at.  Everything worth doing begins that way.  I'll let you know how it goes.

My last day at Oracle will be this Thursday, September 30, 2010.  After that you can reach me at my personal e-mail, with the usual first-dot-last construction.

It has truly been a wonderful couple of decades.  To everyone who taught me, worked with me, learned from me, and supported my efforts in countless ways large and small:  Thank you.

Sunday Nov 01, 2009

ZFS Deduplication

You knew this day was coming: ZFS now has built-in deduplication.

If you already know what dedup is and why you want it, you can skip the next couple of sections. For everyone else, let's start with a little background.

What is it?

Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2\^-256 = 10\^-77 or, in more familiar notation, 0.00000000000000000000000000000000000000000000000000000000000000000000000000001. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.

Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.

What to dedup: Files, blocks, or bytes?

Data can be deduplicated at the level of files, blocks, or bytes.

File-level assigns a hash signature to an entire file. File-level dedup has the lowest overhead when the natural granularity of data duplication is whole files, but it also has significant limitations: any change to any block in the file requires recomputing the checksum of the whole file, which means that if even one block changes, any space savings is lost because the two versions of the file are no longer identical. This is fine when the expected workload is something like JPEG or MPEG files, but is completely ineffective when managing things like virtual machine images, which are mostly identical but differ in a few blocks.

Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated, but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well. Most of a VM image is duplicated data -- namely, a copy of the guest operating system -- but some blocks are unique to each VM. With block-level dedup, only the blocks that are unique to each VM consume additional storage space. All other blocks are shared.

Byte-level dedup is in principle the most general, but it is also the most costly because the dedup code must compute 'anchor points' to determine where the regions of duplicated vs. unique data begin and end. Nevertheless, this approach is ideal for certain mail servers, in which an attachment may appear many times but not necessary be block-aligned in each user's inbox. This type of deduplication is generally best left to the application (e.g. Exchange server), because the application understands the data it's managing and can easily eliminate duplicates internally rather than relying on the storage system to find them after the fact.

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS's 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

When to dedup: now or later?

In addition to the file/block/byte-level distinction described above, deduplication can be either synchronous (aka real-time or in-line) or asynchronous (aka batch or off-line). In synchronous dedup, duplicates are eliminated as they appear. In asynchronous dedup, duplicates are stored on disk and eliminated later (e.g. at night). Asynchronous dedup is typically employed on storage systems that have limited CPU power and/or limited multithreading to minimize the impact on daytime performance. Given sufficient computing power, synchronous dedup is preferable because it never wastes space and never does needless disk writes of already-existing data.

ZFS deduplication is synchronous. ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

How do I use it?

Ah, finally, the part you've really been waiting for.

If you have a storage pool named 'tank' and you want to use dedup, just type this:

zfs set dedup=on tank

That's it.

Like all zfs properties, the 'dedup' property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis.

What are the tradeoffs?

It all depends on your data.

If your data doesn't contain any duplicates, enabling dedup will add overhead (a more CPU-intensive checksum and on-disk dedup table entries) without providing any benefit. If your data does contain duplicates, enabling dedup will both save space and increase performance. The space savings are obvious; the performance improvement is due to the elimination of disk writes when storing duplicate data, plus the reduced memory footprint due to many applications sharing the same pages of memory.

Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help. For example, suppose you have a storage pool containing home directories, virtual machine images, and source code repositories. You might choose to enable dedup follows:

zfs set dedup=off tank/home

zfs set dedup=on tank/vm

zfs set dedup=on tank/src

Trust or verify?

If you accept the mathematical claim that a secure hash like SHA256 has only a 2\^-256 probability of producing the same output given two different inputs, then it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provies a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':

zfs set dedup=verify tank

Selecting a checksum

Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the 'verify' option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the 'verify' option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.

Unfortunately, because there are so many variables that affect performance, I cannot offer any absolute guidance on which is better. However, if you are willing to make the investment to experiment with different checksum/verify options on your data, the payoff may be substantial. Otherwise, just stick with the default provided by setting dedup=on; it's cryptograhically strong and it's still pretty fast.

Scalability and performance

Most dedup solutions only work on a limited amount of data -- a handful of terabytes -- because they require their dedup tables to be resident in memory.

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The performace of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk. The topic of dedup performance could easily fill many blog entries -- and it will over time -- but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.


Bill Moore and I developed the first dedup prototype in two very intense days in December 2008. Mark Maybee and Matt Ahrens helped us navigate the interactions of this mostly-SPA code change with the ARC and DMU. Our initial prototype was quite primitive: it didn't support gang blocks, ditto blocks, out-of-space, and various other real-world conditions. However, it confirmed that the basic approach we'd been planning for several years was sound: namely, to use the 256-bit block checksums in ZFS as hash signatures for dedup.

Over the next several months Bill and I tag-teamed the work so that at least one of us could make forward progress while the other dealt with some random interrupt of the day.

As we approached the end game, Matt Ahrens and Adam Leventhal developed several optimizations for the ZAP to minimize DDT space consumption both on disk and in memory, key factors in dedup performance. George Wilson stepped in to help with, well, just about everything, as he always does.

For final code review George and I flew to Colorado where many folks generously lent their time and expertise: Mark Maybee, Neil Perrin, Lori Alt, Eric Taylor, and Tim Haley.

Our test team, led by Robin Guo, pounded on the code and made a couple of great finds -- which were actually latent bugs exposed by some new, tighter ASSERTs in the dedup code.

My family (Cathy, Andrew, David, and Galen) demonstrated enormous patience as the project became all-consuming for the last few months. On more than one occasion one of the kids has asked whether we can do something and then immediately followed their own question with, "Let me guess: after dedup is done."

Well, kids, dedup is done. We're going to have some fun now.

Monday Jun 09, 2008

ZFS in MacOS X Snow Leopard

It's official!

Cheers to Noel, Don, Bertrand, and all the great folks at Apple.

Now when can I get this in Time Capsule?  :-)

Monday May 26, 2008


Chocolate on my peanut butter?

No, peanut butter on my chocolate!

All I can say for the moment is... stay tuned.

Thursday Sep 13, 2007

Space Maps

Every filesystem must keep track of two basic things: where your data is, and where the free space is.

In principle, keeping track of free space is not strictly necessary: every block is either allocated or free, so the free space can be computed by assuming everything is free and then subtracting out everything that's allocated; and the allocated space can be found by traversing the entire filesystem from the root.  Any block that cannot be found by traversal from the root is, by definition, free.

In practice, finding free space this way would be insufferable because it would take far too long for any filesystem of non-trivial size.  To make the allocation and freeing of blocks fast, the filesystem needs an efficient way to keep track of free space.  In this post we'll examine the most common methods, why they don't scale well, and the new approach we devised for ZFS.


The most common way to represent free space is by using a bitmap.  A bitmap is simply an array of bits, with the Nth bit indicating whether the Nth block is allocated or free.  The overhead for a bitmap is quite low: 1 bit per block.  For a 4K blocksize, that's 1/(4096\*8) = 0.003%.  (The 8 comes from 8 bits per byte.)

For a 1GB filesystem, the bitmap is 32KB -- something that easily fits in memory, and can be scanned quickly to find free space.  For a 1TB filesystem, the bitmap is 32MB -- still stuffable in memory, but no longer trivial in either size or scan time.  For a 1PB filesystem, the bitmap is 32GB, and that simply won't fit in memory on most machines.  This means that scanning the bitmap requires reading it from disk, which is slower still.

Clearly, this doesn't scale.

One seemingly obvious remedy is to break the bitmap into small chunks, and keep track of the number of bits set in each chunk.  For example, for a 1PB filesystem using 4K blocks, the free space can be divided into a million bitmaps, each 32KB in size.  The summary information (the million integers indicating how much space is in each bitmap) fits in memory, so it's easy to find a bitmap with free space, and it's quick to scan that bitmap.

But there's still a fundamental problem: the bitmap(s) must be updated not only when a new block is allocated, but also when an old block is freed.  The filesystem controls the locality of allocations (it decides which blocks to put new data into), but it has no control over the locality of frees.  Something as simple as 'rm -rf' can cause blocks all over the platter to be freed.  With our 1PB filesystem example, in the worst case, removing 4GB of data (a million 4K blocks) could require each of the million bitmaps to be read, modified, and written out again.  That's two million disk I/Os to free a measly 4GB -- and that's just not reasonable, even as worst-case behavior.

More than any other single factor, this is why bitmaps don't scale: because frees are often random, and bitmaps that don't fit in memory perform pathologically when they are accessed randomly.


Another common way to represent free space is with a B-tree of extents.  An extent is a contiguous region of free space described by two integers: offset and length.  The B-tree sorts the extents by offset so that contiguous space allocation is efficient.  Unfortunately, B-trees of extents suffer the same pathology as bitmaps when confronted with random frees.

What to do?

Deferred frees

One way to mitigate the pathology of random frees is to defer the update of the bitmaps or B-trees, and instead keep a list of recently freed blocks.  When this deferred free list reaches a certain size, it can be sorted, in memory, and then freed to the underlying bitmaps or B-trees with somewhat better locality.  Not ideal, but it helps.

But what if we went further?

Space maps:  log-structured free lists

Recall that log-structured filesystems long ago posed this question: what if, instead of periodically folding a transaction log back into the filesystem, we made the transaction log be the filesystem?

Well, the same question could be asked of our deferred free list: what if, instead of folding it into a bitmap or B-tree, we made the deferred free list be the free space representation?

That is precisely what ZFS does.  ZFS divides the space on each virtual device into a few hundred regions called metaslabs.  Each metaslab has an associated space map, which describes that metaslab's free space.  The space map is simply a log of allocations and frees, in time order.  Space maps make random frees just as efficient as sequential frees, because regardless of which extent is being freed, it's represented on disk by appending the extent (a couple of integers) to the space map object -- and appends have perfect locality.  Allocations, similarly, are represented on disk as extents appended to the space map object (with, of course, a bit set indicating that it's an allocation, not a free).

When ZFS decides to allocate blocks from a particular metaslab, it first reads that metaslab's space map from disk and replays the allocations and frees into an in-memory AVL tree of free space, sorted by offset.  This yields a compact in-memory representation of free space that supports efficient allocation of contiguous space.  ZFS also takes this opportunity to condense the space map: if there are many allocation-free pairs that cancel out, ZFS replaces the on-disk space map with the smaller in-memory version.

Space maps have several nice properties:

  • They don't require initialization: a space map with no entries indicates that there have been no allocations and no frees, so all space is free.

  • They scale: because space maps are append-only, only the last block of the space map object needs to be in memory to ensure excellent performance, no matter how much space is being managed.

  • They have no pathologies: space maps are efficient to update regardless of the pattern of allocations and frees.

  • They are equally efficient at finding free space whether the pool is empty or full (unlike bitmaps, which take longer to scan as they fill up).

Finally, note that when a space map is completely full, it is represented by a single extent.  Space maps therefore have the appealing property that as your storage pool approaches 100% full, the space maps start to evaporate, thus making every last drop of disk space available to hold useful information.




« July 2014