
ZFS Deduplication

Guest Author

You knew this day was coming: ZFS now has built-in deduplication.

If you already know what dedup is and why you want it, you can skip
the next couple of sections. For everyone else, let's start with
a little background.

What is it?

Deduplication is the process of eliminating duplicate copies of data.
Dedup is generally either file-level, block-level, or byte-level.
Chunks of data -- files, blocks, or byte ranges -- are checksummed
using some hash function that uniquely identifies data with very high
probability. When using a secure hash like SHA256, the probability of a
hash collision is about 2^-256 = 10^-77 or, in more familiar notation,
0.00000000000000000000000000000000000000000000000000000000000000000000000000001.
For reference, this is 50 orders of magnitude less likely than an undetected,
uncorrected ECC memory error on the most reliable hardware you can buy.
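
As a rough check on the arithmetic above, here is a small Python sketch (an illustration only, not ZFS code). It evaluates 2^-256 and, as an added assumption that goes beyond the text, the standard birthday-style estimate for the chance that any two of n distinct blocks collide, roughly n(n-1)/2 times 2^-256. The function name and the example block count are hypothetical.

```python
import math

HASH_BITS = 256  # width of a SHA256 digest

# Chance that two specific, different blocks share a checksum,
# treating the hash output as uniformly random.
pair_probability = 2.0 ** -HASH_BITS


def log2_any_collision(num_blocks: int, bits: int = HASH_BITS) -> float:
    """Return log2 of the birthday-style estimate n(n-1)/2 * 2^-bits.

    The estimate is only meaningful while the result is far below
    zero, i.e. while the probability itself is tiny.
    """
    pairs = num_blocks * (num_blocks - 1) // 2
    return math.log2(pairs) - bits


if __name__ == "__main__":
    print(f"2^-256 is about {pair_probability:.2e}")
    # Hypothetical pool holding 2^53 unique blocks.
    print(f"log2 P(any collision) is about {log2_any_collision(2 ** 53):.1f}")
```

The point of the second function is that the pool-wide figure depends on how many blocks are stored, yet stays extremely small for any realistic block count.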

Chunks of data are remembered in a table of some sort that maps the
data's checksum to its storage location and reference count. When you
store another copy of existing data, instead of allocating new space
on disk, the dedup code just increments the reference count on the
existing data. When data is highly replicated, which is typical of
backup servers, virtual machine images, and source code repositories,
deduplication can reduce space consumption not just by percentages,
but by multiples.
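
To make the bookkeeping concrete, here is a minimal Python sketch of a table that maps a checksum to a storage location and a reference count. It is a toy in-memory model under stated assumptions, not the ZFS on-disk format: the class names, the list standing in for disk space, and the sample data are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    location: int  # hypothetical index of the stored chunk
    refcount: int  # how many logical copies point at it


class DedupStore:
    """Toy model of a checksum -> (location, refcount) table."""

    def __init__(self) -> None:
        self.table: dict[bytes, Entry] = {}
        self.chunks: list[bytes] = []  # stands in for space on disk

    def write(self, chunk: bytes) -> int:
        key = hashlib.sha256(chunk).digest()
        entry = self.table.get(key)
        if entry is not None:
            entry.refcount += 1  # duplicate: no new allocation
            return entry.location
        self.chunks.append(chunk)  # first copy: allocate space
        self.table[key] = Entry(len(self.chunks) - 1, 1)
        return self.table[key].location


store = DedupStore()
first = store.write(b"guest OS block")
second = store.write(b"guest OS block")  # same data again
third = store.write(b"unique block")
```

After these three writes the model holds two stored chunks, and the entry for the repeated chunk has a reference count of 2.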

What to dedup: Files, blocks, or bytes?

Data can be deduplicated at the level of files, blocks, or bytes.

File-level assigns a hash signature to an entire file. File-level
dedup has the lowest overhead when the natural granularity of data
duplication is whole files, but it also has significant limitations:
any change to any block in the file requires recomputing the checksum
of the whole file, which means that if even one block changes, any space
savings is lost because the two versions of the file are no longer identical.
This is fine when the expected workload is something like JPEG or MPEG files,
but is completely ineffective when managing things like virtual machine
images, which are mostly identical but differ in a few blocks.

Block-level dedup has somewhat higher overhead than file-level dedup when
whole files are duplicated, but unlike file-level dedup, it handles block-level
data such as virtual machine images extremely well. Most of a VM image is
duplicated data -- namely, a copy of the guest operating system -- but some
blocks are unique to each VM. With block-level dedup, only the blocks that
are unique to each VM consume additional storage space. All other blocks
are shared.
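
The difference between the two granularities can be sketched in a few lines of Python. This is an illustration under assumptions, not how ZFS lays out data: the 4-byte block size, the function names, and the two tiny "images" are hypothetical, chosen so that the images differ only in their final block.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, purely for illustration


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def stored_bytes_file_level(files: list[bytes]) -> int:
    """Bytes kept when only whole, identical files are shared."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def stored_bytes_block_level(files: list[bytes]) -> int:
    """Bytes kept when identical blocks are shared across files."""
    unique = {}
    for f in files:
        for block in split_blocks(f):
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# Two hypothetical VM images: identical except for the last block.
image_a = b"AAAABBBBCCCCDDDD"
image_b = b"AAAABBBBCCCCEEEE"

file_level = stored_bytes_file_level([image_a, image_b])
block_level = stored_bytes_block_level([image_a, image_b])
```

With file-level hashing the two images count as unrelated, so both are stored in full; with block-level hashing only the one differing block costs extra space.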

Byte-level dedup is in principle the most general, but it is also the most
costly because the dedup code must compute 'anchor points' to determine
where the regions of duplicated vs. unique data begin and end.
Nevertheless, this approach is ideal for certain mail servers, in which an
attachment may appear many times but not necessarily be block-aligned in each
user's inbox. This type of deduplication is generally best left to the
application (e.g. Exchange server), because the application understands
the data it's managing and can easily eliminate duplicates internally
rather than relying on the storage system to find them after the fact.

ZFS provides block-level deduplication because this is the finest
granularity that makes sense for a general-purpose storage system.
Block-level dedup also maps naturally to ZFS's 256-bit block checksums,
which provide unique block signatures for all blocks in a storage pool
as long as the checksum function is cryptographically strong (e.g. SHA256).

When to dedup: now or later?

In addition to the file/block/byte-level distinction described above,
deduplication can be either synchronous (aka real-time or in-line)
or asynchronous (aka batch or off-line). In synchronous dedup,
duplicates are eliminated as they appear. In asynchronous dedup,
duplicates are stored on disk and eliminated later (e.g. at night).
Asynchronous dedup is typically employed on storage systems that have
limited CPU power and/or limited multithreading to minimize the
impact on daytime performance. Given sufficient computing power,
synchronous dedup is preferable because it never wastes space
and never does needless disk writes of already-existing data.

ZFS deduplication is synchronous. ZFS assumes a highly multithreaded
operating system (Solaris) and a hardware environment in which CPU cycles
(GHz times cores times sockets) are proliferating much faster than I/O.
This has been the general trend for the last twenty years, and the
underlying physics suggests that it will continue.

How do I use it?

Ah, finally, the part you've really been waiting for.

If you have a storage pool named 'tank' and you want to use dedup,
just type this:

zfs set dedup=on tank

That's it.

Like all ZFS properties, the 'dedup' property follows the usual rules
for ZFS dataset property inheritance. Thus, even though deduplication
has pool-wide scope, you can opt in or opt out on a per-dataset basis.

What are the tradeoffs?

It all depends on your data.

If your data doesn't contain any duplicates, enabling dedup will add
overhead (a more CPU-intensive checksum and on-disk dedup table entries)
without providing any benefit. If your data does contain duplicates,
enabling dedup will both save space and increase performance. The
space savings are obvious; the performance improvement is due to the
elimination of disk writes when storing duplicate data, plus the
reduced memory footprint due to many applications sharing the same
pages of memory.

Most storage environments contain a mix of data that is mostly unique
and data that is mostly replicated. ZFS deduplication is per-dataset,
which means you can selectively enable dedup only where it is likely
to help. For example, suppose you have a storage pool containing
home directories, virtual machine images, and source code repositories.
You might choose to enable dedup as follows:

zfs set dedup=off tank/home

zfs set dedup=on tank/vm

zfs set dedup=on tank/src
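
How a per-dataset setting reaches child datasets can be modeled with a short Python sketch. This is a simplified illustration of property inheritance, not the zfs implementation: the function name is hypothetical, child datasets such as tank/vm/guest1 are invented for the example, and the sketch assumes the property is 'off' when nothing in the hierarchy sets it.

```python
DEFAULT_DEDUP = "off"  # assumption: value used when no dataset sets it


def effective_dedup(dataset: str, local: dict[str, str]) -> str:
    """Walk from a dataset up through its parents to find its setting.

    `local` holds only the values set explicitly on a dataset.
    """
    name = dataset
    while True:
        if name in local:
            return local[name]
        if "/" not in name:  # reached the pool's root dataset
            return DEFAULT_DEDUP
        name = name.rsplit("/", 1)[0]  # step up to the parent


# The three settings from the example above.
local_settings = {"tank/home": "off", "tank/vm": "on", "tank/src": "on"}

vm_guest = effective_dedup("tank/vm/guest1", local_settings)
home_user = effective_dedup("tank/home/alice", local_settings)
other = effective_dedup("tank/scratch", local_settings)
```

In this model a dataset created under tank/vm picks up dedup=on without any further command, while one under tank/home stays off.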

Trust or verify?

If you accept the mathematical claim that a secure hash like SHA256 has
only a 2^-256 probability of producing the same output given two different
inputs, then it is reasonable to assume that when two blocks have the
same checksum, they are in fact the same block. You can trust the hash.
An enormous amount of the world's commerce operates on this assumption,
including your daily credit card transactions. However, if this makes
you uneasy, that's OK: ZFS provides a 'verify' option that performs
a full comparison of every incoming block with any alleged duplicate to
ensure that they really are the same, and ZFS resolves the conflict if not.
To enable this variant of dedup, just specify 'verify' instead of 'on':

zfs set dedup=verify tank

Selecting a checksum

Given the ability to detect hash collisions as described above, it is
possible to use much weaker (but faster) hash functions in combination
with the 'verify' option to provide faster dedup. ZFS offers this
option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random
hash function, and therefore cannot be trusted not to collide. It is
therefore only suitable for dedup when combined with the 'verify' option,
which detects and resolves hash collisions. On systems with a very high
data ingest rate of largely duplicate data, this may provide better
overall performance than a secure hash without collision verification.
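
The idea of pairing a weak checksum with verification can be sketched in Python as follows. This is a toy model under stated assumptions: the checksum is a deliberately weak stand-in (a byte sum, NOT fletcher4), the class and method names are hypothetical, and the way a mismatch is handled here (storing the new block separately) is one plausible choice rather than a description of what ZFS does internally.

```python
def weak_checksum(block: bytes) -> int:
    """Deliberately weak stand-in (NOT fletcher4): byte sum mod 2^16."""
    return sum(block) & 0xFFFF


class VerifyingStore:
    """Toy dedup store that byte-compares blocks whose checksums match."""

    def __init__(self) -> None:
        # checksum -> list of [block, refcount] pairs sharing that checksum
        self.table: dict[int, list[list]] = {}

    def write(self, block: bytes) -> bool:
        """Store a block; return True if it was deduplicated."""
        candidates = self.table.setdefault(weak_checksum(block), [])
        for entry in candidates:
            if entry[0] == block:  # verify: full byte comparison
                entry[1] += 1
                return True
        # Either new data or a checksum collision: keep it separately.
        candidates.append([block, 1])
        return False

    def stored_blocks(self) -> int:
        return sum(len(c) for c in self.table.values())


store = VerifyingStore()
results = [store.write(b"ab"), store.write(b"ba"), store.write(b"ab")]
```

Here b"ab" and b"ba" share a checksum but differ in content, so the byte comparison keeps them apart; only the second write of b"ab" is treated as a duplicate.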

Unfortunately, because there are so many variables that affect performance,
I cannot offer any absolute guidance on which is better. However, if
you are willing to make the investment to experiment with different
checksum/verify options on your data, the payoff may be substantial.
Otherwise, just stick with the default provided by setting dedup=on;
it's cryptographically strong and it's still pretty fast.

Scalability and performance

Most dedup solutions only work on a limited amount of data -- a handful
of terabytes -- because they require their dedup tables to be resident
in memory.

ZFS places no restrictions on your ability to dedup. You can dedup
a petabyte if you're so inclined. The performance of ZFS dedup will
follow the obvious trajectory: it will be fastest when the DDTs
(dedup tables) fit in memory, a little slower when they spill over
into the L2ARC, and much slower when they have to be read from disk.
The topic of dedup performance could easily fill many blog entries -- and
it will over time -- but the point I want to emphasize here is that there
are no limits in ZFS dedup. ZFS dedup scales to any capacity on any
platform, even a laptop; it just goes faster as you give it more hardware.
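
For a rough feel of when the DDTs might stop fitting in memory, a back-of-the-envelope estimate can be written as a short Python sketch: one table entry per unique block, times a per-entry size. Everything specific here is an assumption: the function name is hypothetical, and the 320-byte entry size is a placeholder value chosen for illustration, not a figure given in this post. Measure on your own pool before relying on any such number.

```python
def estimate_ddt_bytes(unique_data_bytes: int, block_size: int,
                       entry_bytes: int) -> int:
    """Estimate dedup table size as (number of unique blocks) * entry size.

    entry_bytes is a caller-supplied assumption, not a ZFS constant.
    """
    blocks = -(-unique_data_bytes // block_size)  # ceiling division
    return blocks * entry_bytes


TIB = 2 ** 40
GIB = 2 ** 30

# Hypothetical case: 1 TiB of unique data in 128 KiB blocks,
# with a placeholder entry size of 320 bytes.
estimate = estimate_ddt_bytes(1 * TIB, 128 * 1024, entry_bytes=320)
estimate_gib = estimate / GIB
```

The estimate scales inversely with block size, so the same amount of data stored in smaller blocks needs a proportionally larger table.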

Acknowledgements

Bill Moore and I developed the first dedup prototype in two very intense
days in December 2008. Mark Maybee and Matt Ahrens helped us navigate
the interactions of this mostly-SPA code change with the ARC and DMU.
Our initial prototype was quite primitive: it didn't support gang blocks,
ditto blocks, out-of-space, and various other real-world conditions.
However, it confirmed that the basic approach we'd been planning for
several years was sound: namely, to use the 256-bit block checksums
in ZFS as hash signatures for dedup.

Over the next several months Bill and I tag-teamed the work so that
at least one of us could make forward progress while the other dealt
with some random interrupt of the day.

As we approached the end game, Matt Ahrens and Adam Leventhal developed
several optimizations for the ZAP to minimize DDT space consumption both
on disk and in memory, key factors in dedup performance. George Wilson
stepped in to help with, well, just about everything, as he always does.

For final code review George and I flew to Colorado where many folks
generously lent their time and expertise: Mark Maybee, Neil Perrin,
Lori Alt, Eric Taylor, and Tim Haley.

Our test team, led by Robin Guo, pounded on the code and made a couple
of great finds -- which were actually latent bugs exposed by some new,
tighter ASSERTs in the dedup code.

My family (Cathy, Andrew, David, and Galen) demonstrated enormous
patience as the project became all-consuming for the last few months.
On more than one occasion one of the kids has asked whether we can do
something and then immediately followed their own question with,
"Let me guess: after dedup is done."

Well, kids, dedup is done. We're going to have some fun now.


Comments ( 84 )
  • Thommy M. Malmström Monday, November 2, 2009

    Can't wait to test it...


  • c0t0d0s0 Monday, November 2, 2009

    Really great news. While other filesystems try to get at the point where ZFS was yesterday, ZFS moves ahead.


  • max Monday, November 2, 2009

    "When using a secure hash like SHA256, the probability of a hash collision is about 2^-256"

    Jeff, what about this?

    http://en.wikipedia.org/wiki/Birthday_problem


  • Cristian Yxen Monday, November 2, 2009

    Will dedup speed up copying or moving files from one dataset to another? If yes, will it result in read-only activity on the disks when moving the file or, even better, in only increasing the reference count of the blocks?


  • Gregg Wonderly Monday, November 2, 2009

    Well, now you just need to dedup your time schedule, and you'll have a lot of spare time blocks laying around that you can use for your kids!


  • Bahamat Monday, November 2, 2009

    Birthday collisions, having only 365 values, fully utilize less than 15 bits. 256 bit offers substantially more values. Feel free to recalculate the birthday formulas with 2^256 instead of 365 and post the result if you like.


  • bastien Monday, November 2, 2009

    " the probability of a hash collision is about 2^-256 = 10^-77 "

    Consider 2^(-256/2) to simplify.


  • Ross Monday, November 2, 2009

    Does Dedup also apply to data in the L2ARC? If so... wow o_0


  • guest Monday, November 2, 2009

    You mentioned that probably everyone wants to have the DDTs in memory. So a little formula like zfs_size/blocksize * N ~ DDT size would be helpful as well...


  • Michael S. Monday, November 2, 2009

    Bahamat: did you read the link? It includes probabilities for 256 bit hashes. With a 256 bit hash, there's a 1-in-a-billion chance of a collision with 1.5 × 10^34 blocks. How many blocks might there be in a ZFS array?


  • Dick Davies Monday, November 2, 2009

    Excellent news; for starters this should give us the space benefits of a sparse zone with the flexibility of a full root one.


  • Eric Schrock Monday, November 2, 2009

    @max, @bastien -

    There is a discussion of the overall chance of hash collision when factoring in the total number of blocks in the ARC thread for a related, but orthogonal case:

    http://arc.opensolaris.org/caselog/PSARC/2009/557/mail

    Which refers to the following table:

    http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table

    To have a collision probability of 10^-18 (already more reliable than almost anything else in the system), this would require approximately 2^98 unique blocks (2^115 bytes @128k) to be written, well beyond the limits for any foreseeable storage platform.


  • newsham Monday, November 2, 2009

    The problem with using a hash function is that attackers control a lot of data on your filesystem on a modern OS. Consider the web cache on a machine where a user browses the web. This allows an attacker a platform to intentionally try to cause collisions that can cause the filesystem to malfunction. These sorts of attacks are still hard, but there has been a lot of progress in attacking several popular hash functions lately. A solution to prevent this is to use a keyed hash function and keep the key secret.


  • Alan Burlison Monday, November 2, 2009

    Do you have any advice or observations on how dedup interacts with ZFS compression?


  • Dennis Clarke Monday, November 2, 2009

    If the implementation of the SHA256 ( or possibly SHA512 at some point ) algorithm is well threaded then one would be able to leverage those massively multi-core Niagara T2 servers. The SHA256 hash is based on six 32-bit functions whereas SHA512 is based on six 64-bit functions. The CMT Niagara T2 can easily process those 64-bit hash functions and the multi-core CMT trend is well established. So long as context switch times are very low one would think that IO with a SHA512 based de-dupe implementation would be possible and even realistic. That would solve the hash collision concern I would think.


  • Brian White Monday, November 2, 2009

    This is great news.

    How long until we can actually use this feature in the development builds? Can’t wait to try it out!


  • Patrick Georgi Monday, November 2, 2009

    "Excellent news; for starters this should give us the space benefits of a sparse zone with the flexibility of a full root one."

    Unless the deduplication also spills over into the memory management, it doesn't: Running two deduplicated full root zones still requires twice the RAM than running only one of them, while running two sparse zones only require twice the read/write pages, with the read-only pages being shared.

    So the questions are: How often does Solaris load a couple of identical pages to memory, when they're deduplicated on disk? Are there plans to get that to "once", if it isn't already?

    Real-life scenario: We run 60 sparse zones vs. the 16 full zones that the system could manage before we're out of RAM (4GB installed).

    And the difference only grows with more RAM.


  • Don MacAskill Monday, November 2, 2009

    Great news!! wtg Jeff & Co!

    Does the deduping apply to the ARC/L2ARC as well (ie, only pointers to duplicate blocks reside, rather than the whole block)?

    And I assume it works well with compression?

    Can't wait to play. :)


  • Daniel Monday, November 2, 2009

    Any ideas when this will show up in releases of OpenSolaris or Solaris 10?


  • Mark Weber Monday, November 2, 2009

    One question I thought of while reading is do different pools recognize the data in the other pool for deduplication. For example, if I have pools A and B (deduplication is enabled in each) and I have an identical block X in both pools, will it be deduplicated? There are pro's and con's to deduplicating across pools and not deduplicating across pools. Which did you choose and will there be options in the future to do the opposite?


  • Bill Moore Monday, November 2, 2009

    @Don - The ARC work (so that we deduplicate in-core) is forthcoming. Mark Maybee is making good progress on that.

    And yes, it works perfectly well with compression - just as you would imagine. :)


  • Mike La Spina Monday, November 2, 2009

    Congrats Jeff,

    Another great step for OpenSolaris and OpenStorage.

    I'm looking forward to snv_128 on IPS. Is there any way you can hurry the binary images over to IPS?


  • Mike La Spina Monday, November 2, 2009

    My Apologies,

    Congrats Jeff, Bill and all the great team members.


  • Zooko Monday, November 2, 2009

    newsham: as far as the world's open cryptographic community knows, it is impossible for anyone to generate collisions in SHA-256 even if they are deliberately trying to do so.

    Dennis: Hm, can't ZFS use the hardware implementation of SHA-256 in the T2? Anyway, "the hash collision concern" is already solved by SHA-256 -- see Eric Schrock's comment.

    all: it seems like there are some funny interactions between dedupe and crypto: http://mail.opensolaris.org/pipermail/zfs-crypto-discuss/2009-November/002947.html . I'm glad to learn from this blog post that Nicolas Williams was wrong to say that dedupe will always require full block comparison.


  • Deniz Monday, November 2, 2009

    So, is this going to be included in the next update of solaris or are we going to have to download some kind of a kernel patch soon?

    Thanks.


  • Pierre Lebeaupin Monday, November 2, 2009

    This is something I always wished. Some people rant about differential copies (copy on write of files) or sparse files, but the problem of these schemes is that they tend to not survive the file being copied on a different disk, or at any rate they work only if the filesystem has knowledge of the original creation of the redundancy. It is not the case here, and so is much more general. Imagine, one can now write a file of one petabyte of zeroes - without the application telling the filesystem (other than by writing said zeroes).

    Fun question: would it be possible for a malicious user to try and write blocks that he suspects are the same as blocks from data he's not supposed to be able to read, and figure out if a deduplication occurred by timing the process? Would be an interesting attack. (note: has nothing to with finding a collision with whatever hash function is used, though these attacks are interesting as well; as far as I am aware, so far ZFS was only dependent on the hash function resistance to pre-image attacks, if of course ZFS was supposed to guarantee cryptographic integrity; but now the hash function must also resist collision attacks or fun things might occur...)

    One drawback of the current support however: given a huge array such that the dedup table has to spill to disk, which is used in bursts for, say, backup, and since browsing the table (and adding entries to it) has no locality whatsoever (otherwise, it means there is a problem with the hash function, if I'm not mistaken), it will have to hit the disk with the same proportion as the proportion of data not in memory over the total size of the table (or in other words: you cannot efficiently cache that table in memory), then an offline option would have been useful for that use case.

    Pierre (still in bargaining stage Mac user)


  • Chris Monday, November 2, 2009

    Congrats to everyone involved!

    I'm curious about the claim of increased performance across the board -- doesn't read performance suffer from the transformation from what was once a continuous read into many short reads in different areas of the disk? Or perhaps you have a strategy to deal with that, or don't find this fragmentation to be a problem in practice?

    Thanks.


  • dasmo Monday, November 2, 2009

    Well done Jeff. You seem to be involved with all the best work ;)

    Re:"One question I thought of while reading is do different pools recognize the data in the other pool for deduplication. "

    I doubt that dedup has the ability to be done across pools. Reason a is that it's a ZFS flag to turn it on rather than a zpool one. Reason b is that you can't be sure that the second pool will not be removed or even exported. I prefer it this way.


  • Keith Adams Monday, November 2, 2009

    Congrats on this achievement, but I also have a question.

    Birthday paradox arguments aside, cryptographic hashes have a long history of being broken. If this ever happens to SHA-256, the Mother of All Remote privilege escalations will immediately apply. E.g., a collision with a block of the Windows kernel would let *a web site* run privileged code on however-many-hundreds of VMs hang off the corrupted physical block (assuming web caches get written to virtual disk).

    The verify flag provides an out here, but it is not the default, and most users will take that hint. Unfortunately, there is no way to know whether you should have used verify until it is too late; if SHA-256 is ever broken, then a non-verify pool is defective.

    I realize you all are smart cookies, and have thought much more deeply about this than I have just reading this blog post. So what's the punchline? Are we that much more confident about SHA-256 than its predecessors? Is the performance hit to verify enough to make the system useless for important domains? Some combination of the two?


  • Patrick Georgi Monday, November 2, 2009

    @Keith Adams:

    It's not _that_ bad. For your exploit to work, you'd need to have the faulty file around _first_.

    Deduplication doesn't overwrite existing files with a duplicate, but avoids writing duplicates in the first place. (at least as far as I understand it)

    So, the attack vector would require you to know beforehand of a new component that's normally run with elevated rights. Then push a file with the same hash to the system, and then wait for deduplication to kick in.

    That's a whole lot harder than "hey, I can write a file with the same content as the kernel".

    It's possible to scan the hash space: Write all kinds of files, hash with a different hash in RAM and on disk, if they differ, you got a collision with another file.

    And then hope for it to be something truly secret ;-)

    But given the size of the hash space, that's not a productive use of your time, and it would be too easy to figure out that there's something fishy going on (who's creating and deleting billions of files all the time?)


  • Nico Monday, November 2, 2009

    @zooko: indeed, ZFS dedup does have the option of not verifying block contents when hashes match -- I spoke too quickly. A minor error, I think.

    @{Zooko, various others}: The point is that if you don't want to trust the hash function, well, you don't have to.

    @Zooko: Back to the ZFS crypto issue that Darren was asking for advice on: by MACing every block in addition to hashing it we don't depend on the hash function's collision resistance for security, though, clearly, for dedup you'll want to enable block contents verification if you do fear attackers that can create hash collisions. IMO, not depending on a hash function's collision resistance, is a good thing.


  • Andrey Kuzmin Monday, November 2, 2009

    Congratulations on rolling this out of the door. Quite an achievement.

    A quick&simple suggestion wrt "Trust or verify?" and, specifically, using fast fletcher hash with subsequent positve(s) verification by byte-comparison. There is third option: always look-up by fletcher hash (assuming blocks are indexed by both this AND sha-256), and use sha to verify positives.

    This has slight advantage over verification by byte-comparison, especially with multiple positives, and huge - fletcher over sha - win in negative cases. The latter yields a nice property: the less duplicated is the data, the lesser is dedupe's overhead.


  • Jacob Monday, November 2, 2009

    Where are we going to see it first, Solaris, OpenSolaris or S7000?


  • Philip Monday, November 2, 2009

    That's nice and all... but for my purposes, it would be far more of a "win", to allow for cross-zfs(but same pool) hardlinks. And/or mv's.

    Not to mention some kind of zfs-aware rsync. For fast, efficient remote replication (or restorals, for that matter!), that does not require keeping a matching "full-filesystem snapshot" around).

    Maybe with this dup-detection stuff, you will be closer to having that happen now?


  • AppGirl Monday, November 2, 2009

    Fabulous info! Can't wait to try this out on my VM.


  • Wrex Monday, November 2, 2009

    What a lot of others asked: Where and when are we, the general public, going to be able to see it and test it ourselves? Your blog speaks as if it's readily available, yet I don't see it in Update 8 of Sol10, nor in recent snapshots of OpenSolaris. Am I missing something?

    Thanks


  • Wrex Monday, November 2, 2009

    I stand corrected:

    http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html

    So when can we expect it in Solaris 10? :-)


  • Felipe Alfaro Solana Monday, November 2, 2009

    The probability of a collision for an equiprobable hash function is ~ 2^(-n/2), where "n" is the size of the output hash in bits. The probability is _not_ 2^-n (that's the probability of getting a single output, not the probability of two input documents producing the same output which is exactly a collision).

    For more information about collisions for cryptographical functions please read about the Birthday attack. For example: http://en.wikipedia.org/wiki/Birthday_attack.


  • Klaus Borchers Monday, November 2, 2009

    Congrats, really nice feature!

    Did you consider side-channel attacks? Let us assume that an attacker that is a user on a machine knows that somewhere on disk there is a block that contains just a username and a password, say "john:abc"

    Now he could write his own combinations "john:aaa", "john:aab", .. ,"john:zzz" each to a unique block in a file on disk and notice that the timing of writing "john:abc" is different to all the others, because the block "john:abc" is deduped?


  • Dan Price Monday, November 2, 2009

    @Wrex: You should see it in OpenSolaris dev builds (i.e. http://pkg.opensolaris.org/dev) in roughly a month. The current "build" (build 128) closes for code changes on 9 November, then it gets some QA, and then is published. We just pushed build 126 publicly last week.


  • Philip Monday, November 2, 2009

    More thoughts: How about using dedup processing, to then enable synthetic snapshot creation across systems?

    Example: two separate solaris systems, both running zfs filesystems that have "mostly similar" content, and need to be kept near-synced in future.

    Rather than having to completely blow away and rebuild one, to then have a shared full snapshot for incremental zsends... how about some kind of tool that would create one?

    Reasons this would be worthwhile, could be large datasets, and bandwidth-constrained interconnects.

    and/or: user error. They were previously kept in sync with zsync, but some admin accidentally deletes the "wrong" snapshot. or all snapshots. That admin is going to have some very very unhappy users, unless there's a nice neat way to regen the common snapshot without long downtimes for rebuild.


  • Laird Monday, November 2, 2009

    For all the people that will climb down the hole of hash function probability it would be interesting to contrast that with just bit rot on modern drives.


  • John Monday, November 2, 2009

    @Klaus:

    Timing it may be unreliable, especially on a busy system but someone could just dtrace it to figure it out. It might make sense to have a nodedup option applied at file level so developers have a chance to address such concerns for sensitive files.


  • pvw Monday, November 2, 2009

    We have no idea on the suitability of our data for dedup. Is there any method available that can scan the data and report on the suitability of switching dedup on?


  • Dave Monday, November 2, 2009

    Super news!! :D


  • Nicko Monday, November 2, 2009

    @Felipe Alfaro Solana: The "probability of a collision" is NOT about 2^(n/2). That is the approximate number of items one needs to hash with an n-bit hash function for the probability of a collision to be about 50%.

    The critical issue, which a few people have touched upon, is that the probability of "a collision" depends on how much data is being hashed. If we have a zettabyte of data and a 128K block size we can end up needing to hash up to 2^53 (about 10^16) blocks. The probability of getting any collisions on a 256 bit hash function with this many tries is about 1 in 2^151 (less than 1 in 10^45). This is dozens of orders of magnitude safer than the underlying disc drives.


  • Greg Broiles Monday, November 2, 2009

    Re side channel attacks - seems to me they are plausible both by measuring system response time and by measuring available space on the volume(s) after writing the suspect block(s). Even on busy system, it will be possible to identify trends. (e.g., if we write block "X" 1000 times, it is always much faster than if we write block "Y" 1000 times; same for space usage.) These attacks could be mitigated by making the actual response time equal to the expected worst-case scenario, and by limiting "space available" responses to "available under quota" versus "absolute space available", but both approaches would create other side effects such as slower response time(s) and forcing the use of quotas.

    It could also be used to identify media files (music, video, whatever) if there's a typical/canonical encoding that's likely to be used.

    Might be useful for identifying executable/library/data files installed on a machine (perhaps at the version level to identify vulnerabilities) if the attacker can work with a known example; or for iteratively determining the contents of a sensitive file such as a *nix password file (discussed above by Klaus Borchers) or a file containing a database access password, passphrase, .htaccess file (or password file pointed to by .htaccess file), etc.

    I am a fan of the deduplication idea, but I think it has significant security implications that may be tough to identify early.


  • vince Monday, November 2, 2009

    Tanks... I am waiting for this since zfs was made. :)


  • Joseph Kotran Monday, November 2, 2009

    Jeff,

    Congratulations to you and your team! You have reenergized my enthusiasm for Solaris and you have given my business a strong reason to go w/ Sun Solaris in lieu of the competition.

    God bless you. Take time for your family.

    Joe


  • TA Tuesday, November 3, 2009

    I have a large ZFS pool (on a Storage 7000 system) containing virtual machine images. I'm sure the dedup stuff will be very useful for this kind of data. I'm wondering, though, will there be some way of forcing existing data through the algorithm, or will only data that was written after dedup was turned on benefit?


  • sophie Tuesday, November 3, 2009

    Congratulations for a job well done.

    After reading all the comments so far, would you consider the following suggestion: allow the user to select the checksum method (SHA-256, SHA-512, fletcher4, or some future method)?

    Making the checksum functions dynamically loadable modules would allow in-place dedup improvements.


  • Thanos Makatos Tuesday, November 3, 2009

    Excellent work. I have two quick questions:

    1. What happens when the DDTs cannot fit in memory? How many extra I/Os will be needed in this case for each application I/O?

    2. Given the common case that two blocks are very similar but not identical, how does ZFS handle this?


  • abc Tuesday, November 3, 2009

    Will there be an asynchronous version in the future?

    Thanks for the info.


  • sophed Tuesday, November 3, 2009

    Scary is all I can say.

    So what happens when you lose a block on the disk that happened to be the reference block for 12 others?

    The first thing I'd want to know before letting this option anywhere near my production filesystems is how much redundancy there is and how easy it is to configure.


  • Knut Grunwald Tuesday, November 3, 2009

    Does this open a backup poisoning attack? If I write the same blocks in sequence in one big file or several small files, with a sequence just long enough to fool the backup's compression system, would this enable an attacker to exhaust backup space?

    If the backup space is another deduped ZFS system, would this enable an attacker to exhaust the communications capacity? If there is a quota on the accounts, this is not a real problem, but with unlimited accounts, it may be.

    Of course, most of the time it is not an attacker but an idiot running untested software for extended periods of time without having a look at it.

    Can ZFS trigger a warning if a block is referenced more than a settable number of times?


  • Kebabbert Tuesday, November 3, 2009

    Sophed,

    There is a mechanism for that. If a block is referenced by 1000 other blocks, you have a severe problem if that block is corrupted. Therefore you will be able to specify how many references are allowed to a block, saying something like "a block cannot have more than 10 references" or similar.

    But ZFS is very secure, and for redundancy you use raidz2 or raidz3, of course. With raidz2 (raid-6) you will get lots of redundancy.

    Jeff,

    Good work! BTW, jeff, did you know that Solaris is the best OS out there? :o) Truly.


  • guest Tuesday, November 3, 2009

    Dreaming about a BitTorrent client that uses dedup to find chunks on disk before trying to download them.


  • guest Tuesday, November 3, 2009

    A couple of perhaps stupid questions (if so, apologies):

    Does the p value for collisions hold true for blocks that only differ in a single bit?

    and

    Do you/have you empirically tested your implementation in some way to verify collision frequency...?
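
The single-bit question above can be explored empirically. The following is a minimal sketch, not a statement about the ZFS implementation: it hashes a 4 KiB block of zeros (an arbitrary illustrative size), flips one input bit, and counts how many of the 256 SHA-256 output bits change. The helper name `bit_difference` is made up for this example.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

block = bytes(4096)            # 4 KiB of zeros; the size is only illustrative
mutable = bytearray(block)
mutable[0] ^= 0x01             # flip exactly one bit of the input
flipped = bytes(mutable)

d1 = hashlib.sha256(block).digest()
d2 = hashlib.sha256(flipped).digest()

differing = bit_difference(d1, d2)
print(f"input bits changed: 1, digest bits changed: {differing} of 256")
```

For a well-behaved cryptographic hash, roughly half of the output bits are expected to change, so two blocks that differ in one bit should be no more likely to collide than two unrelated blocks.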


  • David Strom Tuesday, November 3, 2009

    When will we see this in Solaris? (not OpenSolaris)

    In Solaris 10 next or Solaris +++ (11?)?

    --

    David Strom


  • Lalith Suresh Tuesday, November 3, 2009

    Awesome work! Congratulations! Can't wait to try this out. By the way, does this mean that data deduplication software is going to be pushed aside?


  • UX-admin Tuesday, November 3, 2009

    "You should see it in OpenSolaris dev builds (i.e. http://pkg.opensolaris.org/dev) in roughly a month. The current "build" (build 128) closes for code changes on 9 November, then it gets some QA, and then is published. We just pushed build 126 publicly last week."

    That's great, but OpenSolaris is so GNU/Linux bastardized and changing so fast that it's both illogical and impractical to use it in production.

    Let's try the question another way:

    will the ZFS deduplication feature make it into Solaris 10 u9?


  • Mike Tuesday, November 3, 2009

    So I have to ask... Why use an expensive hashing algorithm at all? Why not use a cheaper hash, but use trust + verify before actually performing a deduplication? This will reduce processor load and eliminate the risk from collisions in the event two pieces of data hash the same but are in fact different.

    On my system, MD5 costs 1/3 of what SHA256 costs in CPU time, and while it may be more likely to cause collisions, ZFS should always do a byte-to-byte comparison on any block before deduping one to make sure they are, in fact, identical.


  • Max Tuesday, November 3, 2009

    @Mike: That's what dedup=fletcher4,verify is for...
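
The "cheap hash plus verify" idea discussed in the two comments above can be sketched as a toy in-memory block store. This is only an illustration under stated assumptions: the class `VerifyingDedupStore` is hypothetical, `zlib.adler32` merely stands in for a cheap checksum (it is not fletcher4), and nothing here reflects how ZFS actually lays out its dedup table.

```python
import zlib

class VerifyingDedupStore:
    """Toy block store: cheap checksum for lookup, byte comparison before sharing."""

    def __init__(self, checksum=zlib.adler32):
        self.checksum = checksum
        # checksum value -> list of [data, refcount] entries sharing that checksum
        self.table = {}

    def write(self, data: bytes) -> bool:
        """Store a block; return True if it was deduplicated against an existing copy."""
        bucket = self.table.setdefault(self.checksum(data), [])
        for entry in bucket:
            if entry[0] == data:       # the "verify" step: compare the actual bytes
                entry[1] += 1          # identical block: just bump the refcount
                return True
        bucket.append([data, 1])       # new block, or a checksum collision
        return False

    def unique_blocks(self) -> int:
        """Number of distinct blocks physically held by the store."""
        return sum(len(bucket) for bucket in self.table.values())


store = VerifyingDedupStore()
print(store.write(b"a" * 4096))   # first copy is stored
print(store.write(b"a" * 4096))   # second copy is deduplicated
print(store.unique_blocks())
```

Because the bytes are compared before a reference is shared, a checksum collision costs an extra read and comparison but never merges two different blocks.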


  • Sean Reifschneider Tuesday, November 3, 2009

    For some real world hash collision examples you may want to try using backuppc on your data. It includes deduplication, but runs at a file level.

    I was rather surprised to find that in a small system (a few TB), I'm running into quite a few files that have the same hash but different contents.

    For example, on one of my backup servers:

    # Pool is 2698.23GB comprising 1232264 files and 4369 directories (as of 11/3 10:16),

    # Pool hashing gives 2651 repeated files with longest chain 51,

    So, I have 3TB of data, and one of the checksums has 51 collisions that were detected (51 different files that had the same hash but different contents).

    At first I was surprised that there were any collisions, but then I remembered the birthday problem...

    In any case, it seems like "verify" is the most conservative setting, and I'm surprised that the file-system that basically promises to never corrupt your data defaults to a setting in which this could happen.

    Sean
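
For readers who want to put numbers on the birthday problem mentioned above, here is a minimal sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/2 / 2^b) for n items and a b-bit hash. It assumes hash values are uniformly random over full file contents, which may not match what any particular backup tool does; the hash widths are illustrative, and only the file count is taken from the comment.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: P(at least one collision) among n_items
    uniformly random hash values of hash_bits bits each."""
    pairs = n_items * (n_items - 1) // 2      # number of distinct pairs
    # 1 - exp(-pairs / 2^bits); expm1 keeps precision when the result is tiny
    return -math.expm1(-pairs / 2 ** hash_bits)

n_files = 1_232_264                           # file count quoted in the comment
for bits in (32, 64, 128, 256):               # illustrative hash widths
    print(f"{bits:3d}-bit hash: p ~ {collision_probability(n_files, bits):.3e}")
```

Under these assumptions the probability falls off very quickly as the hash gets wider, so how many bits the hash has, and how much of each file it covers, matters a great deal when interpreting collision counts like the ones reported.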


  • Garen Tuesday, November 3, 2009

    By default, fletcher4 is used when creating new zfs file-systems. Many users like myself have already filled up a bunch of zfs file-systems using fletcher4. What happens if a user tries to enable dedup but doesn't use verify for an existing fletcher4 filesystem? Is there a warning/error?

    Are there any plans for asynchronous deduplication at some point?

    For the paranoid and performance conscious, I could see wanting to do dedup with verification in a batch job, run at least nightly.


  • Bartek Pelc Tuesday, November 3, 2009

    @Sean Reifschneider

    What hash function is being used by software you mentioned?


  • Klaus Borchers Tuesday, November 3, 2009

    @Greg: Come to think of it, the easiest exploit would probably be to use the "readahead" features of the system, where not single blocks (say 4K) are read from disk, but larger sectors (say 64K).

    Reading sixteen 4k-blocks on an even sector boundary, the first block, if not in cache, will take at least several msecs (disk access time), while the 15 subsequent blocks will be available in usecs.

    In order to find out if a block with a certain content exists anywhere on disk, from within, say, a VM, you just write a sector with 15 blocks of random garbage onto the disk, but one block somewhere in the middle, e.g. #9, contains the contents you want to check. After a few minutes to hours, when the data you have written has been evicted from cache in ram, you read back the first block, which takes a long time, and then the others, which follow almost immediately - except for #9, which was deduplicated, and must be fetched from somewhere else on disk.

    In practice, I guess, with all the optimisations and layers in the system, it may be far more complicated, but securing the system against this kind of attack will be even more difficult.


  • Jason Herring Tuesday, November 3, 2009

    If this is turned on for an existing ZFS file system are pre-existing segments able to be deduplicated? If ZFS only supports synchronous dedupe does the data have to be pulled off and repopulated?


  • Michael Widmann Tuesday, November 3, 2009

    Hi

    As we rely heavily on ZFS for more than 2 billion files in different filesystems, my question is: does dedup work across filesystems?

    And is there a way to get the non-dedup ratio compared to the dedup ratio on a ZFS filesystem?

    Could we get the "not deduped" size of a filesystem with df, or do we receive the deduped size?

    Just some management questions.

    Hope to test your good work soon!

    Michael


  • Jim Klimov Wednesday, November 4, 2009

    First of all, thank you all for this feature, and I hope you and your families will get some quality time together before hacking on all the questions and requests in these comments :)

    And I also thank the commenters for raising some interesting questions and ideas.

    Concerning the matter of deduplicating data that's already on our disks, in our existing systems, I'm afraid we'd have to make do for a while with a trick I use to compress previously uncompressed files (say, local zones). We shut down the zone, move its files to a subdirectory, and use Sun cpio (keeping ZFS ACLs) to copy the files back to their expected location. Upon write, they are compressed (and nowadays they'd also be deduped). This can be tweaked to do per-file copies/removals to minimize the free-space pressure when remaking existing systems.

    Needless to say, a supported utility that can (un-)compress and (un-)dedup existing files in place (like setting the Compressed flag on NTFS objects) would be very welcome and is long-awaited :) The already de facto working capability of using different compression algorithms within the same ZFS dataset is also a bonus versus NTFS, and I'd love to see that in said utility. (In the example of our local zones, the fresh install of binary files can be done with gzip-9, then new files like data and logs are written with the faster lzjb.)

    Another question arises: what if we have the same files (blocks) compressed with different algorithms? On-disk blocks apparently contain different (compressed) data and have different checksums for the same original data, and a different number of on-disk blocks for the same original files.

    These would probably not dedup ultimately to one block, but to at least as many as there are different compression algorithms?

    And for compressed blocks inside different original files (including VM images) the block-alignment would make it even less probable that we have dedupable whole blocks? Even a one-byte offset would not let us save space from otherwise same original data?

    In short - does it mean we would (probably) save more space by not compressing certain filesystems (i.e. VM image containers) but rather only deduping them?

    PS: I was surprised to see no follow-up to this comment:

    > Well, now you just need to dedup your time schedule, and you'll have a lot of spare time blocks laying around that you can use for your kids!

    Apparently, this strikes the family time too. Instead of going with kids to a zoo 10 times, "Jeff" only goes once and tells the family that they should remember it as 10 different trips ;)
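
The two concerns raised in this comment (different compression settings and misaligned data) can be illustrated with a small sketch. It assumes, as the comment does, that the checksum is taken over the stored bytes of fixed-size aligned blocks; the 4 KiB `BLOCK` size, the use of zlib levels as stand-ins for different algorithms, and the helper `block_hashes` are all choices made for this example, not ZFS specifics.

```python
import hashlib
import os
import zlib

BLOCK = 4096  # illustrative block size, not a ZFS default

def block_hashes(data: bytes) -> set:
    """SHA-256 digests of each fixed-size, aligned block of data."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

# 1. Same logical data under two compression settings: the stored bytes differ,
#    so checksums computed over the stored bytes differ as well.
payload = os.urandom(16) * 4096               # 64 KiB of repetitive data
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)
print("stored bytes identical:", fast == best)
print("stored hashes identical:",
      hashlib.sha256(fast).digest() == hashlib.sha256(best).digest())

# 2. Same data shifted by one byte: no aligned block lines up any more.
image = os.urandom(16 * BLOCK)                # 64 KiB of non-repetitive data
shifted = b"\x00" + image
shared = block_hashes(image) & block_hashes(shifted)
print("blocks shared after a 1-byte shift:", len(shared))

# 3. An aligned second copy, by contrast, adds no new unique blocks.
print("aligned copy adds blocks:",
      block_hashes(image + image) != block_hashes(image))
```

In this toy model, block-level dedup finds duplicates only when both the stored bytes and the block alignment match, which is the trade-off the comment is asking about.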


  • QuAzI Wednesday, November 4, 2009

    Can ZFS use sha256 for self-healing instead of deduplication?


  • Robert Milkowski Wednesday, November 4, 2009

    QuAzI - zfs set checksum=sha256 dataset

    See man zfs for more details.

    It's been there forever.


  • QuAzI Wednesday, November 4, 2009

    I know about checksum and how it works for RAID-Z. But if ZFS can use sha256 to search for duplicates, maybe it can use it for self-healing instead of deduplicating?


  • Bartek Pelc Wednesday, November 4, 2009

    @QuAzI

    SHA256 in ZFS IS used for self-healing if you set it as the checksum algorithm. It is used, instead of fletcher, to check the integrity of every block in datasets, not only in RAIDZ. It has been like that since the creation of ZFS, and now this same checksum field can be used for two purposes: integrity and deduplication.


  • QuAzI Wednesday, November 4, 2009

    Thanks. I didn't find that in the overviews and documents, only examples of self-healing of mirrors. It's good.


  • francisco Wednesday, November 4, 2009

    Out of curiosity, how does dedup interact with userquota/groupquota? Will the full size of the deduped block count against the quota? I'm guessing that's the case as it's more efficient, though really the user/group isn't using all that space.

    Dedup sounds great. I'm really looking forward to that and comstar making their way into the 7000 series. Thanks for your work.


  • okky Thursday, November 5, 2009

    The bad thing about the synchronous dedup feature is that file blocks will be fragmented. If a file rarely sees I/O, this is fine. But if a file is still rather HOT (meaning many reads/writes occur, but not frequently enough for its blocks to stay cached), a deduped file may cause a performance drop.

    So it would be wonderful to have an asynchronous dedup feature, with some way to write identification code (deciding which files should count as dedup targets) in user mode.

    Probably needless to say, the characteristics mentioned above only matter for media like HDDs, where fragmentation overhead is quite large. For media like SSDs, synchronous dedup is the best solution.

    So my great wish is to have dedup in both synchronous and asynchronous modes, and for asynchronous mode, an interface for identification code.


  • Drop It Low Girl Thursday, November 5, 2009

    Can now definitely understand Apple not wanting to include ZFS if what they would have shipped was going to suck compared to what's going out the door in OpenSolaris, not to mention Sun's new owner pushing its own BTRFS.


  • Ediscovery Trends Thursday, November 5, 2009

    This is great news - I can't wait to see how this is going to manifest itself in other industries, especially the ediscovery and litigation support industry.


  • QuAzI Friday, November 6, 2009

    Does ZFS control any collisions like 22 collisions by Somitra Kumar Sanadhya and Palash Sarkar ( http://arxiv.org/abs/0803.1220 )?


  • KKovacs Sunday, November 8, 2009

    Great news!

    One question though: What does this mean on the caching level? Are the blocks cached before or after the deduping?

    What I mean is: if I have 450 copies of the same 120 MB, will ZFS keep "virtually" all that 54GB in the memory cache, even on a machine with only 8 GB of RAM?


  • rado watches Monday, November 9, 2009

    Great post!

    Thank you!


  • muralidhar Monday, November 9, 2009

    From which version onward is this feature enabled?

    Regards,

    Muralidhar

