
News, tips, partners, and perspectives for Oracle’s virtualization offerings

Deduplication now in ZFS

Jeff Savit
Product Management Senior Manager
I've been waiting for this ever since it was announced - and deduplication is now available in ZFS! (Hence this hastily written blog...)
The basic concept: when data is written to a ZFS filesystem with dedup turned on, ZFS stores only the blocks that are unique within the ZFS pool, rather than storing redundant copies of identical data.
See Jeff Bonwick's blog for more information on concepts and implementation.
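
One design detail worth knowing up front: ZFS identifies candidate duplicates by block checksum. If you'd like ZFS to also compare the data byte-for-byte before treating a block as a duplicate, the dedup property accepts a verify option. A minimal sketch (tank/data is just a placeholder dataset name):

# zfs set dedup=on tank/data
# zfs set dedup=verify tank/data

The second form trades a little extra read I/O for protection against the astronomically unlikely case of a checksum collision.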

Getting started with ZFS dedup

If you're running OpenSolaris 2009.06 and are pointing to the hot-off-the-presses development repository, you can upgrade
using System -> Package Manager -> File -> Updates, or use the CLI example shown in
Roman Ivanov's blog.
Some painless downloading and a reboot later, and you have access to the bits that contain ZFS dedup.

$ uname -a
SunOS suzhou 5.11 snv_128a i86pc i386 i86pc Solaris
$ cat /etc/release
OpenSolaris Development snv_128a X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 23 November 2009

Now, I'm a rather paranoid guy, and the machine in question is my main desktop, so I'm going
to experiment on a ZFS pool in a ramdisk, rather than the pool where I keep all my data. At least for today!
Here, I create a 100MB ramdisk, put a ZFS pool in it, create a ZFS dataset within the pool,
turn on dedup for the pool's top-level dataset, and hand the new dataset to
my normal, non-root userid.

Note: the error message below appears because you don't turn on 'dedup' at the pool level - you turn it on at the ZFS dataset level. This is despite the (possibly confusing) fact that candidate blocks for deduplication come from
the entire ZFS pool (not just the dataset), and the space savings due to deduplication are reported as a ratio at the pool level.
So, in this case, the zpool command thinks I'm asking for the deduplication ratio, which is a read-only attribute.
After that error, I turn dedup on for the ZFS dataset at the top of the pool. Out of habit, I turn on compression too.

# ramdiskadm -a dimmdisk 100m
/dev/ramdisk/dimmdisk
# zpool create dimmpool /dev/ramdisk/dimmdisk
# zfs create dimmpool/fast
# zpool set dedup=on dimmpool
cannot set property for 'dimmpool': 'dedup' is readonly
# zfs set dedup=on dimmpool
# zfs set compression=on dimmpool
# chown savit /dimmpool/fast
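
If you want to confirm what the datasets actually inherited, zfs get with -r walks the whole hierarchy; a quick sanity check, shown here as a sketch rather than captured output:

$ zfs get -r dedup,compression dimmpool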

Now, I'll populate that ZFS dataset with a few directories and copy some files into the first of them:
$ mkdir /dimmpool/fast/v1
$ mkdir /dimmpool/fast/v2
$ mkdir /dimmpool/fast/v3
$ mkdir /dimmpool/fast/v4
$ mkdir /dimmpool/fast/v5
$ cp *gz /dimmpool/fast/v1
$ zpool list dimmpool
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
dimmpool 95.5M 5.32M 90.2M 5% 1.00x ONLINE -

No duplicate data so far, which isn't a surprise.
Now for the real test: let's copy the same data into different directories
and see what happens:
$ cp *gz /dimmpool/fast/v2
$ cp *gz /dimmpool/fast/v3
$ cp *gz /dimmpool/fast/v4
$ cp *gz /dimmpool/fast/v5
$ zpool list dimmpool
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
dimmpool 95.5M 6.55M 89.0M 6% 5.00x ONLINE -

Wow! I have 5 copies of the same data and ZFS trimmed away the excess copies when storing on disk.
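
Since this pool lives in a ramdisk it will vanish at the next reboot anyway, but if you want to tidy up explicitly, the teardown is just the reverse of the setup (a sketch - zpool destroy is irreversible, so double-check the pool name):

# zpool destroy dimmpool
# ramdiskadm -d dimmdisk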

The mystery of the growing disk


One surprise with dedup is how traditional tools like df respond. They are obviously unaware of deduplication, but somehow have to cope with the idea that the same filesystem is storing more (user-accessible) bytes. This is very well explained in Joerg's blog at df considered problematic (and a tip of the hat to Craig for provoking me to comment on this).

To illustrate this, I'll create a small ZFS pool backed by a file (a file on ZFS, which is perfectly valid). I'm going to fill it with multiple copies of the same CD image (an Ubuntu 9.10 ISO), which neatly get deduped away. As I do this, watch the output of the df command:

# mkfile 1g /var/tmp/TEMP
# zpool create temp /var/tmp/TEMP
# zfs set dedup=on temp; zfs set compression=on temp
# zpool list temp
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
temp 1016M 141K 1016M 0% 1.00x ONLINE -
# cp ubuntu-9.10-desktop-i386.iso /temp
# zpool list temp
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
temp 1016M 685M 331M 67% 1.00x ONLINE -
# df -h /temp
Filesystem Size Used Avail Use% Mounted on
temp 983M 683M 300M 70% /temp
# cp ubuntu-9.10-desktop-i386.iso /temp/iso2
# df -h /temp
Filesystem Size Used Avail Use% Mounted on
temp 1.7G 1.4G 295M 83% /temp
# cp ubuntu-9.10-desktop-i386.iso /temp/iso3
# df -h /temp
Filesystem Size Used Avail Use% Mounted on
temp 2.3G 2.1G 289M 88% /temp
# cp ubuntu-9.10-desktop-i386.iso /temp/iso4
# zpool list temp
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
temp 1016M 692M 324M 68% 4.00x ONLINE -
# df -h /temp
Filesystem Size Used Avail Use% Mounted on
temp 3.0G 2.7G 278M 91% /temp

Notice that the ZFS pool shows the deduplication ratio, while df acts as if the disk is getting bigger - growing in this case to 3.0GB. Otherwise, how could it explain 2.7GB of data in a filesystem whose original size was only 983MB?
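
If the df arithmetic bothers you, the ZFS-native views are more direct about what's going on: zpool list reports physical allocation and the dedup ratio, while zfs get reports space accounting from the dataset's point of view. A sketch of the commands (output omitted here):

$ zpool list temp
$ zfs get used,available,referenced temp
$ zpool get dedupratio temp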

A bigger test


Well, the above tests are probably "best case scenarios". Let's try something bigger, and use a real disk.

I'll use a spare ZFS pool called 'tpool'.
Since this is a ZFS pool I've had for a while, I
have to upgrade the pool to the latest on-disk format, which I show here.
Note which pool version (21, in the listing below) provides deduplication.
For a change of pace, I'll do this via pfexec from my userid, instead of being root.

$ zpool list tpool
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tpool 736G 87.1G 649G 11% 1.00x ONLINE -
$ pfexec zpool status tpool
pool: tpool
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
tpool ONLINE 0 0 0
c18t0d0p2 ONLINE 0 0 0
errors: No known data errors
$ zpool upgrade -v
This system is currently running ZFS pool version 22.
The following versions are supported:
VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Deduplication
22 Received properties
For more information on a particular version, including supported releases, see:
http://www.opensolaris.org/os/community/zfs/version/N
Where 'N' is the version number.
$ pfexec zpool upgrade tpool
This system is currently running ZFS pool version 22.
Successfully upgraded 'tpool' from version 16 to version 22
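
To double-check the result, the on-disk version is itself a pool property (a quick sketch):

$ zpool get version tpool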

Now I can turn dedup on. As I mentioned before, you turn it on at the ZFS dataset level.
Here, I want all ZFS datasets in this pool to inherit this property, so I set it at the topmost level.
$ pfexec zfs set dedup=on tpool
$ zfs get dedup tpool
NAME PROPERTY VALUE SOURCE
tpool dedup on local
$ zpool get dedupratio tpool
NAME PROPERTY VALUE SOURCE
tpool dedupratio 1.00x -

Now I'll copy about 28GB worth of data - note how the free disk space is reduced by only 5GB!
$ zpool list tpool
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tpool 736G 206G 530G 28% 1.02x ONLINE -
$ cp -rp /usbpool/Music /tpool/
$ zfs list tpool/Music
NAME USED AVAIL REFER MOUNTPOINT
tpool/Music 27.9G 513G 27.9G /tpool/Music
$ zpool list tpool
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tpool 736G 211G 525G 28% 1.20x ONLINE -

ZFS found that the written data was a duplicate of existing data, and deduplicated it on the fly. Only one copy of the 28GB occupies space on disk.
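
If you're curious about what the dedup table itself looks like, zdb can print a histogram of DDT entries and their reference counts. It's a diagnostic tool and its output format varies by build, so treat this as a sketch:

# zdb -DD tpool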

Putting this in context

Now, let's understand this correctly - somewhere in the tpool pool I already had the same 28GB of music files. With a small amount of effort I could have found the original copy (I know where I tend to put things) and stored only the new or changed data. But that's not the point!

Imagine if this were a file store serving hundreds of users. They obviously can't search through one another's directories to see whether the exact same data was already stored by someone else (and somehow know that the data would be retained unchanged for exactly as long as they themselves needed it). Consider the case where a group of employees receive the same e-mail attachments and store them in their private directories: without deduplication the same data is stored many times; with deduplication it is stored only once.

ZFS deduplication works at the block level, not on an individual file basis, so "mostly the same" files can also enjoy the benefits of deduplication and reduced disk space consumption. Imagine even a single user working with a series of medical image files from CAT scans, or different versions of an animated film. The different files might be almost identical in content, and potentially the majority of disk blocks could be the same - stored only once instead of multiple times.
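
Because dedup operates on ZFS blocks, the dataset's recordsize (128K by default) sets the granularity at which "mostly the same" files can share storage - two files dedup only where whole blocks line up identically. You can inspect or tune the property per dataset; a sketch, with tpool/Music standing in for whatever dataset you care about (smaller blocks can match more often, but they also enlarge the dedup table, and a change affects only newly written files):

$ zfs get recordsize tpool/Music
# zfs set recordsize=64K tpool/Music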

Caveats and Considerations

There are important considerations to keep in mind, so please don't blindly turn on deduplication. First, dedup requires a recent build: it appears in OpenSolaris development builds (snv_128 and later) and in Solaris 11 Express - it is not available in Solaris 10.

Second, deduplication requires a great deal of RAM (or an SSD-based L2ARC) to hold the ZFS dedup table - the metadata that contains the block signatures and describes ownership. If there isn't enough RAM, file operations (especially writes and, in particular, file deletions) become extremely slow. On a large server-class machine this may not be a problem, but be very cautious about deploying it on a desktop-class system with a few GB of RAM. See Roch Bourbonnais' blog on dedup performance for technical details and results of sizing experiments.
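
As a rough way to think about sizing: the dedup table needs an entry for every unique block in the pool, and each in-core entry costs on the order of a few hundred bytes (figures around 320 bytes per entry are commonly quoted; the exact number is build-dependent). A back-of-the-envelope sketch:

1TB of unique data / 128K average block size = roughly 8 million blocks
8 million blocks x ~320 bytes per DDT entry = roughly 2.5GB of RAM (or L2ARC) just for the dedup table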

Given those considerations - when would you use deduplication? Consider it when you have business processes that create a lot of duplicate data on disk, hosted on a server with suitable RAM and I/O. Then you can get substantial space benefits while maintaining good performance.
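
One way to evaluate your own data before committing: recent builds of zdb can simulate deduplication against an existing pool and report the ratio you would get, without changing anything on disk. It can take a long time and use a lot of memory on a large pool, so treat this as a sketch to try on a test system first:

# zdb -S tpool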

The advice is not quite the same as what I give for ZFS compression: I recommend turning on default ZFS compression in almost all cases, since the CPU cost is small and the savings in disk space can be valuable (in fact, the reduced data size can decrease overall CPU cost, since fewer physical I/Os are needed) - there is very little downside risk. Deciding whether to use deduplication is more like deciding whether to turn on ZFS gzip-9 compression, which provides greater potential space savings at a much higher CPU cost. In general, evaluate your data and your server configuration before turning on deduplication.
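
For completeness, here is what those compression choices look like as commands. compression=on selects the default lightweight algorithm (lzjb in these builds), while gzip-9 is the aggressive end of the scale; tpool/Music and tpool/archive are just placeholder dataset names:

# zfs set compression=on tpool/Music
# zfs set compression=gzip-9 tpool/archive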

Conclusions

This is a powerful new capability for ZFS - deduplication as a native feature of the most advanced file system, without add-on license fees and with a simple user interface. It is currently a feature of Solaris 11 Express, not Solaris 10. Users who want to make use of it now can certainly do so with Solaris 11 Express and can obtain formal support.

However, you don't have to upgrade to Solaris 11 Express to get the benefits of deduplication (and other features introduced in OpenSolaris). Sun leverages the same technology in the Sun Storage 7000 Unified Storage Systems (a bit wordy, that name). These appliances provide a fully supported way to gain the benefits of our rapid introduction of advanced features, providing unique solutions for the systems and storage markets.




Comments (2)
  • Jeffrey Savit Monday, December 7, 2009

    Thanks for your comment. I think it will take us all a while to figure out how to interpret the different commands' output, and Joerg's note really provides the right insight. Watching the "size of a file system" reported by "df" climb as you add deduped data is surprising, but probably the right way to handle it. I'm going to add a paragraph in the blog entry to illustrate that. Please comment when you have a chance on your own research!

    Note to all: CSB sent me a private note pointing out that there is a missing "l" on the URL he posted above (it's "html", not "htm").


  • Jeffrey Savit Thursday, December 31, 2009

    Hey, very cool research, Craig. I have had to use my (supposed) week off doing things less exciting than trying this out, so appreciate hearing about your experiences here. When the logjam clears, I'll kick the tires again!

    regards, Jeff

