Deduplication now in ZFS

I've been waiting for this ever since it was announced - and deduplication is now available in ZFS! (hence this hastily written blog...). The basic concept is that when data is written to a ZFS filesystem with dedup turned on, ZFS stores only the blocks that are unique within the ZFS pool, rather than storing redundant copies of identical data. See Jeff Bonwick's blog for more information on concepts and implementation.
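
If you just want the heart of it before the full walk-through below, it comes down to two commands - a minimal sketch with placeholder pool and dataset names:

$ pfexec zfs set dedup=on mypool/mydata     # dedup=verify additionally byte-compares blocks whose checksums match
$ zpool get dedupratio mypool               # pool-wide savings ratio; 1.00x until duplicate blocks show up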

Getting started with ZFS dedup

If you're running OpenSolaris 2009.06 and are pointing to the hot-off-the-presses repository you can upgrade by using System -> Package Manager -> File -> Updates, or use the CLI example shown in Roman Ivanov's blog. Some painless downloading and a reboot, and you have access to the bits that contain ZFS dedup.
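
For the record, the CLI route looks roughly like this - a sketch from memory, so treat the repository URL as an assumption and see Roman's post for the exact steps:

$ pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
$ pfexec pkg image-update

After the reboot, uname and /etc/release confirm the new build: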

$ uname -a
SunOS suzhou 5.11 snv_128a i86pc i386 i86pc Solaris
$ cat /etc/release
                       OpenSolaris Development snv_128a X86
           Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 23 November 2009 

Now, I'm a rather paranoid guy, and the machine in question is my main desktop, so I'm going to experiment on a ZFS pool in a ramdisk, rather than the pool where I keep all my data. At least for today! Here, I create a 100MB ramdisk, put a ZFS pool on it, create a ZFS dataset in it, turn on dedup (and compression) for the dataset at the top of the pool, and hand the new dataset to my normal, non-root userid.

Note: the error message below is because you don't turn on 'dedup' at the pool level - you turn it on at the ZFS dataset level. This is despite the (possibly confusing) fact that candidate blocks for deduplication come from the entire ZFS pool (not just the dataset), and that the space savings ratio due to deduplication is reported at the pool level. So, in this case, the zpool command thinks I'm asking about the deduplication ratio, which is a read-only attribute. After that error, I turn dedup on for the ZFS dataset at the top of the pool. Out of habit, I turn on compression too.

# ramdiskadm -a dimmdisk 100m
/dev/ramdisk/dimmdisk
# zpool create dimmpool /dev/ramdisk/dimmdisk
# zfs create dimmpool/fast
# zpool set dedup=on dimmpool
cannot set property for 'dimmpool': 'dedup' is readonly
# zfs set dedup=on dimmpool
# zfs set compression=on dimmpool
# chown savit /dimmpool/fast
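
Before handing the dataset over, it doesn't hurt to confirm that the child inherited the properties - a quick sanity check, not strictly necessary:

# zfs get -r dedup,compression dimmpool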

Now, I'll populate that ZFS dataset with a few directories and copy some files into the first of them:

$ mkdir /dimmpool/fast/v1
$ mkdir /dimmpool/fast/v2
$ mkdir /dimmpool/fast/v3
$ mkdir /dimmpool/fast/v4
$ mkdir /dimmpool/fast/v5
$ cp *gz /dimmpool/fast/v1
$ zpool list dimmpool
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
dimmpool  95.5M  5.32M  90.2M     5%  1.00x  ONLINE  -

No duplicate data so far, which isn't a surprise. Now for the real test: let's copy the same data into different directories and see what happens:

$ cp *gz /dimmpool/fast/v2
$ cp *gz /dimmpool/fast/v3
$ cp *gz /dimmpool/fast/v4
$ cp *gz /dimmpool/fast/v5
$ zpool list dimmpool
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
dimmpool  95.5M  6.55M  89.0M     6%  5.00x  ONLINE  -

Wow! I have 5 copies of the same data and ZFS trimmed away the excess copies when storing on disk.
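
To see the savings without reverse-engineering the pool numbers, you can also compare the logical space the dataset is charged for against the pool-wide ratio:

$ zfs list dimmpool/fast        # USED/REFER are logical, as if nothing were deduplicated
$ zpool get dedupratio dimmpool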

The mystery of the growing disk

One surprise with dedup is how traditional tools like df respond. They are obviously unaware of deduplication, but have to somehow cope with the idea that the same filesystem is storing more (user-accessible) bytes. This is very well explained in Joerg's blog at df considered problematic (and a tip of the hat to Craig for provoking me to comment on this).

To illustrate this, I'll create a small ZFS pool backed by a file (a file on ZFS, which is perfectly valid). I'm going to fill it with multiple copies of the same CD image (an Ubuntu 9.10 ISO), which neatly get deduped away. As I do this, watch the output of the df command:

# mkfile 1g /var/tmp/TEMP
# zpool create temp /var/tmp/TEMP
# zfs set dedup=on temp; zfs set compression=on temp
# zpool list temp
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
temp  1016M   141K  1016M     0%  1.00x  ONLINE  -
# cp ubuntu-9.10-desktop-i386.iso /temp
# zpool list temp
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
temp  1016M   685M   331M    67%  1.00x  ONLINE  -
# df -h /temp
Filesystem            Size  Used Avail Use% Mounted on
temp                  983M  683M  300M  70% /temp
# cp ubuntu-9.10-desktop-i386.iso /temp/iso2
# df -h /temp
Filesystem            Size  Used Avail Use% Mounted on
temp                  1.7G  1.4G  295M  83% /temp
# cp ubuntu-9.10-desktop-i386.iso /temp/iso3
# df -h /temp
Filesystem            Size  Used Avail Use% Mounted on
temp                  2.3G  2.1G  289M  88% /temp
# cp ubuntu-9.10-desktop-i386.iso /temp/iso4
# zpool list temp 
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
temp  1016M   692M   324M    68%  4.00x  ONLINE  -
# df -h /temp
Filesystem            Size  Used Avail Use% Mounted on
temp                  3.0G  2.7G  278M  91% /temp

Notice that the ZFS pool shows the deduplication ratio, while df acts as if the disk is getting bigger - growing in this case to 3.0GB. That's because df reports Size as Used plus Avail: Used counts the logical (pre-dedup) bytes stored in the filesystem, while Avail reflects the physical space actually left, so 2.7G plus 278M comes out to roughly the 3.0G shown. Otherwise, how could df explain 2.7GB of data in a filesystem whose original size was only 983MB?
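
To see exactly where the two views diverge, look at the same pool from both sides - the dataset-level accounting (the same "zfs list -o space" Craig mentions in the comments below) versus the pool-level allocation:

# zfs list -o space temp     # per-dataset accounting: logical, before dedup
# zpool list temp            # pool allocation: physical, after dedup (and compression)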

A bigger test

Well, the above tests are probably "best case scenarios". Let's try something bigger, and use a real disk.

I'll use a spare ZFS pool called 'tpool'. Since this is a ZFS pool I've had for a while, I have to upgrade it to the latest on-disk format, which I show here. Note which version (21, in the listing below) provides deduplication. For a change of pace, I'll do this via pfexec from my userid, instead of being root.

$ zpool list tpool
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tpool     736G  87.1G   649G    11%  1.00x  ONLINE  -
$ pfexec zpool status tpool
  pool: tpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on older software versions.
 scrub: none requested
config:

	NAME         STATE     READ WRITE CKSUM
	tpool        ONLINE       0     0     0
	  c18t0d0p2  ONLINE       0     0     0

errors: No known data errors
$ zpool upgrade -v
This system is currently running ZFS pool version 22.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Deduplication
 22  Received properties

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.
$ pfexec zpool upgrade tpool
This system is currently running ZFS pool version 22.

Successfully upgraded 'tpool' from version 16 to version 22

Now I can turn dedup on. As I mentioned before, you turn it on at the ZFS dataset level. Here, I want all ZFS datasets in this pool to inherit this property, so I set it at the topmost level.

$ pfexec zfs set dedup=on tpool
$ zfs get dedup tpool
NAME   PROPERTY  VALUE          SOURCE
tpool  dedup     on             local
$ zpool get dedupratio tpool
NAME   PROPERTY    VALUE  SOURCE
tpool  dedupratio  1.00x  -

Now I'll copy about 28GB worth of data - note how the free disk space is only reduced by 5GB!

$ zpool list tpool
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tpool     736G   206G   530G    28%  1.02x  ONLINE  -
$ cp -rp /usbpool/Music /tpool/
$ zfs list tpool/Music
NAME          USED  AVAIL  REFER  MOUNTPOINT
tpool/Music  27.9G   513G  27.9G  /tpool/Music
$ zpool list tpool
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tpool     736G   211G   525G    28%  1.20x  ONLINE  -
ZFS found that the written data was a duplicate of existing data, and deduplicated it on the fly. Only one copy of the 28GB occupies space on disk.
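
If you're curious what the dedup table itself looks like, zdb will dump its statistics (more detail with extra -D's). It's read-only and safe on a live pool, though it can take a while on a big one:

$ pfexec zdb -DD tpool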

Putting this in context

Now, let's understand this correctly - somewhere in tpool I already had the same 28GB of music files. With a small amount of effort I could have found the original copy (I know where I tend to put things), and just stored new or changed data. But that's not the point!

Imagine if this were a file store serving hundreds of users. They obviously can't search through one another's directories to see if the exact same data was already stored by someone else (and somehow know that the data would be retained unchanged, for exactly as long as they themselves needed it). Consider the case where a group of employees receive the same e-mail attachments and store them in their private directories: without deduplication the same data is stored many times; with deduplication it is stored only once.

ZFS deduplication works at the block level, not on an individual file basis, so even "mostly the same" files can enjoy the benefits of deduplication and reduced disk space consumption. Imagine even a single user working with a series of medical image files from CAT scans, or different versions of an animated film. The files might be almost identical in content, and potentially the majority of disk blocks could be the same - stored only once instead of storing the same contents multiple times.
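
A quick way to convince yourself of the block-level behavior - the file and dataset names here are made up, and this assumes the copies line up on recordsize boundaries, which plain copies and appends do:

$ cp some.iso /tpool/scratch/a.iso
$ zpool list tpool                              # note ALLOC after the first copy
$ cp some.iso /tpool/scratch/b.iso
$ cat small-extra.bin >> /tpool/scratch/b.iso   # b.iso now differs only at the tail
$ zpool list tpool                              # ALLOC grows only by roughly the appended tail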

Caveats and Considerations

There are important considerations to keep in mind, so please don't blindly turn on deduplication. First, dedup is available in OpenSolaris (build snv_128 and later, as shown above) and in Solaris 11 Express - it is not available in Solaris 10.

Second, deduplication requires a great deal of RAM (or an SSD-based L2ARC) to hold the ZFS dedup table - the metadata that contains the block checksums and reference counts. If you don't have enough RAM, file operations (especially writes and, in particular, file deletion) will be extremely slow. On a large server-class machine this may not be a problem, but be very cautious about deploying it on a desktop-class system with a few GB of RAM. See Roch Bourbonnais' blog on dedup performance for technical details and results of sizing experiments.
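
If you do have a spare SSD, adding it to the pool as an L2ARC cache device is a one-liner - the device name below is just a placeholder for whatever format(1M) reports for your SSD:

$ pfexec zpool add tpool cache c9t1d0
$ zpool status tpool     # the SSD now shows up under a 'cache' section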

Given those considerations - when would you use deduplication? Consider it when you have business processes that create a lot of duplicate data on disk, hosted on a server with suitable RAM and I/O. Then you can get substantial space benefits while maintaining good performance.

This is not quite the same advice that I give for ZFS compression: I recommend turning on the default ZFS compression in almost all cases, since the CPU cost is small and the savings in disk space can be valuable (in fact, the reduced disk footprint can decrease overall cost, since fewer physical I/Os are needed) - there is very little downside risk. Deciding on dedup is more like deciding whether to turn on ZFS gzip-9 compression, which provides greater potential space savings at a much higher CPU cost. In general, evaluate your data and your server configuration before turning on deduplication.
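
For reference, the compression knobs sit alongside dedup as per-dataset properties - a quick sketch, using tpool again:

$ pfexec zfs set compression=on tpool       # the default (lzjb) algorithm: cheap, almost always worth it
$ pfexec zfs set compression=gzip-9 tpool   # maximum squeeze at a much higher CPU cost
$ zfs get compressratio tpool               # shows what you're actually getting back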

Conclusions

This is a powerful addition to ZFS - deduplication as a native feature of the most advanced file system, without add-on license fees and with a simple user interface. It is currently a feature of Solaris 11 Express, not Solaris 10. Users who want to make use of it now can certainly do so with Solaris 11 Express and can obtain formal support.

However, you don't have to upgrade to Solaris 11 Express to get the benefits of deduplication (and other features introduced in OpenSolaris). Sun leverages Solaris 11 in the Sun Storage 7000 Unified Storage Systems (a bit wordy, that name). These storage appliances provide a fully-supported way to gain the benefits of our rapid introduction of advanced features, delivering unique solutions for the systems and storage markets.

Comments:

Thanks Jeff. I have also been playing with snv_128a bits, with an eye towards how space is reported. I'm curious about what objective value to correlate dedupratio against.

IMHO the ratio isn't very meaningful by itself -- I could have dedup'ed 10 bytes, or 10TB. Either way, the ratio just says, 10.00x. But my pool has a real size somewhere -- how do I gauge the effectiveness?

Things make sense with compression on: I get copies * compressed size in USEDDS when I run "zfs list -o space".

Next up, I will try messing with the dedupditto property. More info in the comments for Joerg's "df" post:

http://www.c0t0d0s0.org/index.php?url=archives/6168-df-considered-problematic.htm

Thanks... -cheers, CSB

Posted by Craig S. Bell on December 07, 2009 at 07:40 AM MST #

Thanks for your comment. I think it will take us all a while to figure out how to interpret the different commands' output, and Joerg's note really provides the right insight. Watching the "size of a file system" reported by "df" climb as you add deduped data is surprising, but probably the right way to handle it. I'm going to add a paragraph in the blog entry to illustrate that. Please comment when you have a chance on your own research!

Note to all: CSB sent me a private note pointing out that there is a missing "l" on the URL he posted above (it's "html", not "htm").

Posted by Jeffrey Savit on December 07, 2009 at 10:08 AM MST #

I don't have any problem with the perceived filesystem / dataset size growing, that's perfectly acceptable. Still, the df available percentage drops as dedup increases. So you have to stop and think it over, since these tools make assumptions that are no longer valid.

It seems like it will become harder to tell just how close you are to full. Maybe that's not a bad thing -- but my automated monitoring tools won't dig it. I have installers that freak out when df shows the ZFS zonepath size as zero bytes, even when there's plenty in the available column.

Imagine fueling your vehicle without being able to measure just how large the tank is at that moment. The example isn't entirely inapt -- how much gasoline you receive per measured gallon depends greatly on the ambient temperature. When it's hot out, you get less for your money. That night, the needle drops a bit -- the tank is observed to be relatively less full than it was before.

Either the tank grew, or the fuel shrank. Either way, you can't predict just how much range you have left, based upon things that you can easily observe (such as miles driven). Will you run out of gas more frequently? Probably not.

An update on auto-ditto: I tried setting the dedupditto property, but it appears to be read-only at this point. Oh well. =-(

Since the default value is zero, I'll assume that the feature simply hasn't been enabled yet. Maybe it will become public (and adjustable) in some later build.

Of course, this feature will continue the trend of making observation seem non-deterministic, but then we're already consulting a soothsayer as it is. Tools aside, I think this is a paradigm shift that users like me will have to get comfortable with.

My testing continues... thanks again Jeff. -cheers, CSB

P.S. The correct link to Joerg's df article should be:

http://www.c0t0d0s0.org/index.php?url=archives/6168-df-considered-problematic.html

Posted by Craig Bell on December 09, 2009 at 05:54 AM MST #

http://opensolaris.org/jive/thread.jspa?threadID=119823&tstart=0

Another interesting discussion of ZFS space accounting. Bonwick points out how to calculate a "diluted" dedup ratio. That would be a nice property to see.

Mike points out that ZFS isn't consistent with accounting even before dedup -- for example, compressratio doesn't account for runs of zeroes, as they are treated as sparse file holes -- but only when compression is on.

For me, it all comes back to: How much do I need to know in order to reasonably avoid running out of "available" before I can get to the vdev station?

Would the extra gauges help me predict what will happen next, or just confuse me along the way?

I would continue testing, but my snv_128a system seems to have run afoul of CR 6905936. The system goes off into space while importing my test pool... -c

Posted by Craig S. Bell on December 15, 2009 at 05:26 AM MST #

Update -- after a few reboots, I was able to wrest control of my snv_128a system back. There seem to be some performance and/or memory issues with importing pools with any dedup=on. This occurs even when I used a temporary pool (based on a ramdisk vdev) for testing.

http://opensolaris.org/jive/thread.jspa?threadID=120106

Going to snv_129 didn't seem to help much with that issue, but I did discover that you can set dedupditto there. The minimum value is 100, which answers my question about people using this feature on a single-vdev pool (i.e. laptop, mobile device, &c.) with low values (=2).

Interestingly, when I get to 10,000x (or perhaps 100 \* 100) then the copies increase again, from 2 to 3. When I drop back down by one instance, it stays at 3 copies until I am down below 100x dedupratio again. Then it goes directly from 3 to 1 copy, as I cross the threshold back to 99x. Bug or feature?

Generally, it seems like dedup is currently intended or optimized for use on beefier systems with more memory and larger redundant pools, and not intended for use absolutely everywhere that ZFS is supported. Maybe low-memory performance will improve in a later release...

I haven't updated to snv_130 yet, due to some of the issues related to the rush to add features before the restrictions kick in leading up to the 2010.03 release. I did see a note that people dealing with the "system goes south" issues were able to wrest control back using a genunix rescue boot disk based on snv_130, which is encouraging:

http://opensolaris.org/jive/thread.jspa?threadID=119465

So I reluctantly choose to postpone further dedup testing until some of the unrelated issues with build 130 are fixed. Maybe snv_131 will be more favorable for ZFS as well.

Thanks... -cheers, CSB

Posted by Craig S. Bell on December 31, 2009 at 03:16 AM MST #

Hey, very cool research, Craig. I have had to use my (supposed) week off doing things less exciting than trying this out, so appreciate hearing about your experiences here. When the logjam clears, I'll kick the tires again!

regards, Jeff

Posted by Jeffrey Savit on December 31, 2009 at 03:24 AM MST #
