Deduplication now in ZFS
By Jsavit-Oracle on Dec 06, 2009
Getting started with ZFS dedup
If you're running OpenSolaris 2009.06 and are pointing to the hot-off-the-presses repository you can upgrade by using System -> Package Manager -> File -> Updates, or use the CLI example shown in Roman Ivanov's blog. Some painless downloading and a reboot, and you have access to the bits that contain ZFS dedup.
    $ uname -a
    SunOS suzhou 5.11 snv_128a i86pc i386 i86pc Solaris
    $ cat /etc/release
    OpenSolaris Development snv_128a X86
    Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
    Use is subject to license terms.
    Assembled 23 November 2009
Now, I'm a rather paranoid guy, and the machine in question is my main desktop, so I'm going to experiment on a ZFS pool in a ramdisk, rather than the pool where I keep all my data. At least for today! Here, I create a 100MB ramdisk, put a ZFS pool in it, and turn on dedup for its top-level ZFS dataset. I also create a ZFS dataset in it, and hand it to my normal, non-root userid.
Note: the error message below is because you don't turn on 'dedup' at the pool level - you turn it on at the ZFS dataset level. This is despite the (possibly confusing) fact that candidate blocks for deduplication are drawn from the entire ZFS pool (not just the dataset), and the space savings ratio due to deduplication is reported at the pool level. So, in this case, the zpool command thinks I'm trying to set the deduplication ratio, which is a read-only attribute. After that error, I turn dedup on for the ZFS dataset at the top of the pool. Out of habit, I turn on compression too.
    # ramdiskadm -a dimmdisk 100m
    /dev/ramdisk/dimmdisk
    # zpool create dimmpool /dev/ramdisk/dimmdisk
    # zfs create dimmpool/fast
    # zpool set dedup=on dimmpool
    cannot set property for 'dimmpool': 'dedup' is readonly
    # zfs set dedup=on dimmpool
    # zfs set compression=on dimmpool
    # chown savit /dimmpool/fast

Now, I'll populate that ZFS dataset with a few directories and copy some files into the first of them:
    $ mkdir /dimmpool/fast/v1
    $ mkdir /dimmpool/fast/v2
    $ mkdir /dimmpool/fast/v3
    $ mkdir /dimmpool/fast/v4
    $ mkdir /dimmpool/fast/v5
    $ cp *gz /dimmpool/fast/v1
    $ zpool list dimmpool
    NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    dimmpool  95.5M  5.32M  90.2M     5%  1.00x  ONLINE  -

No duplicate data so far, which isn't a surprise. Now for the real test: let's copy the same data into different directories and see what happens:
    $ cp *gz /dimmpool/fast/v2
    $ cp *gz /dimmpool/fast/v3
    $ cp *gz /dimmpool/fast/v4
    $ cp *gz /dimmpool/fast/v5
    $ zpool list dimmpool
    NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    dimmpool  95.5M  6.55M  89.0M     6%  5.00x  ONLINE  -
Wow! I have 5 copies of the same data and ZFS trimmed away the excess copies when storing on disk.
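To make the 5.00x figure concrete, here is a minimal Python sketch of how a dedup ratio falls out of reference counting. It is a toy model, not ZFS code: the 4096-byte block size, the SHA-256 checksum, and the names `split_blocks` and `dedup_ratio` are all assumptions made for illustration, and block counts stand in for the byte counts that the pool actually reports.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4096  # toy block size, chosen only for this sketch


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_ratio(files: list) -> float:
    """Referenced blocks divided by unique blocks across all files."""
    refs = Counter()
    for data in files:
        for block in split_blocks(data):
            # Identical blocks hash to the same key and share one entry.
            refs[hashlib.sha256(block).digest()] += 1
    unique = len(refs)
    return sum(refs.values()) / unique if unique else 1.0


# A stand-in "archive" made of 16 distinct blocks.
archive = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(16))

one_copy = dedup_ratio([archive])
five_copies = dedup_ratio([archive] * 5)
print(f"{one_copy:.2f}x {five_copies:.2f}x")
```

In this model, five identical copies reference every block five times while storing it once, which is the same shape as the listing: the ratio goes to five while the allocated space barely moves.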
The mystery of the growing disk
One surprise with dedup is how traditional tools like df respond. They obviously are unaware of deduplication, but have to somehow cope with the idea that the same filesystem is storing more (user accessible) bytes. This is very well explained in Joerg's blog at df considered problematic (and a tip of the hat to Craig for provoking me to comment on this).
To illustrate this, I'll create a small ZFS pool backed by a file (a file on ZFS, which is perfectly valid). I'm going to fill it with multiple copies of the same CD image (an Ubuntu 9.10 ISO), which neatly get deduped away. As I do this, look at the output of the df command:
    # mkfile 1g /var/tmp/TEMP
    # zpool create temp /var/tmp/TEMP
    # zfs set dedup=on temp; zfs set compression=on temp
    # zpool list temp
    NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    temp  1016M   141K  1016M     0%  1.00x  ONLINE  -
    # cp ubuntu-9.10-desktop-i386.iso /temp
    # zpool list temp
    NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    temp  1016M   685M   331M    67%  1.00x  ONLINE  -
    # df -h /temp
    Filesystem            Size  Used Avail Use% Mounted on
    temp                  983M  683M  300M  70% /temp
    # cp ubuntu-9.10-desktop-i386.iso /temp/iso2
    # df -h /temp
    Filesystem            Size  Used Avail Use% Mounted on
    temp                  1.7G  1.4G  295M  83% /temp
    # cp ubuntu-9.10-desktop-i386.iso /temp/iso3
    # df -h /temp
    Filesystem            Size  Used Avail Use% Mounted on
    temp                  2.3G  2.1G  289M  88% /temp
    # cp ubuntu-9.10-desktop-i386.iso /temp/iso4
    # zpool list temp
    NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    temp  1016M   692M   324M    68%  4.00x  ONLINE  -
    # df -h /temp
    Filesystem            Size  Used Avail Use% Mounted on
    temp                  3.0G  2.7G  278M  91% /temp
Notice that the ZFS pool shows the deduplication ratio, and df acts as if the disk is getting bigger - growing in this case to 3GB. Otherwise, how could it explain 2.7GB of data in a filesystem that was only the original size of 983M?
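The arithmetic behind the "growing disk" can be sketched in a few lines of Python. This is a toy model under stated assumptions: that the reported Size is simply Used plus Avail, that each copy adds the 683M seen after the first copy, and that the human-readable figures are rounded up to one decimal place. The Avail values are taken from the listing; the names `df_size_mb` and `human` are hypothetical.

```python
import math

ISO_MB = 683  # "Used" after the first copy, taken from the listing


def df_size_mb(copies: int, avail_mb: int) -> tuple:
    """Return (used, size) in MB, modelling Size as Used + Avail."""
    used = copies * ISO_MB  # logical bytes, counted once per copy
    return used, used + avail_mb


def human(mb: int) -> str:
    """Format MB the way the listing does (assumed: round up to 0.1G)."""
    if mb < 1024:
        return f"{mb}M"
    return f"{math.ceil(mb / 1024 * 10) / 10:.1f}G"


# "Avail" reported by df after each copy, taken from the listing.
avail_after_copy = {1: 300, 2: 295, 3: 289, 4: 278}

report = {}
for copies, avail in avail_after_copy.items():
    used, size = df_size_mb(copies, avail)
    report[copies] = (human(size), human(used))
    print(f"{copies} copies: Size {human(size)}  Used {human(used)}  Avail {avail}M")
```

Under those assumptions the model reproduces the Size and Used columns above: Avail shrinks only slightly with each copy, Used grows by a full ISO each time, and Size has to grow to keep the sum consistent.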
A bigger test
Well, the above tests are probably "best case scenarios". Let's try something bigger, and use a real disk.
I'll use a spare ZFS pool called 'tpool'. Since this is a ZFS pool I've had for a while, I have to upgrade the pool to the latest on-disk format, which I show here. Note which level (21, in listing below) provides deduplication. For a change of pace, I'll do this via pfexec from my userid, instead of being root.
    $ zpool list tpool
    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    tpool   736G  87.1G   649G    11%  1.00x  ONLINE  --
    $ pfexec zpool status tpool
      pool: tpool
     state: ONLINE
    status: The pool is formatted using an older on-disk format. The pool can
            still be used, but some features are unavailable.
    action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
            pool will no longer be accessible on older software versions.
     scrub: none requested
    config:

            NAME         STATE     READ WRITE CKSUM
            tpool        ONLINE       0     0     0
              c18t0d0p2  ONLINE       0     0     0

    errors: No known data errors
    $ zpool upgrade -v
    This system is currently running ZFS pool version 22.

    The following versions are supported:

    VER  DESCRIPTION
    ---  --------------------------------------------------------
     1   Initial ZFS version
     2   Ditto blocks (replicated metadata)
     3   Hot spares and double parity RAID-Z
     4   zpool history
     5   Compression using the gzip algorithm
     6   bootfs pool property
     7   Separate intent log devices
     8   Delegated administration
     9   refquota and refreservation properties
     10  Cache devices
     11  Improved scrub performance
     12  Snapshot properties
     13  snapused property
     14  passthrough-x aclinherit
     15  user/group space accounting
     16  stmf property support
     17  Triple-parity RAID-Z
     18  Snapshot user holds
     19  Log device removal
     20  Compression using zle (zero-length encoding)
     21  Deduplication
     22  Received properties

    For more information on a particular version, including supported releases, see:

    http://www.opensolaris.org/os/community/zfs/version/N

    Where 'N' is the version number.
    $ pfexec zpool upgrade tpool
    This system is currently running ZFS pool version 22.

    Successfully upgraded 'tpool' from version 16 to version 22

Now I can turn dedup on. As I mentioned before, you turn it on at the ZFS dataset level. Here, I want all ZFS datasets in this pool to inherit this property, so I set it at the topmost level.
    $ pfexec zfs set dedup=on tpool
    $ zfs get dedup tpool
    NAME   PROPERTY  VALUE  SOURCE
    tpool  dedup     on     local
    $ zpool get dedupratio tpool
    NAME   PROPERTY    VALUE  SOURCE
    tpool  dedupratio  1.00x  -

Now I'll copy about 28GB worth of data. Note how the free disk space is reduced by only 5GB!
    $ zpool list tpool
    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    tpool   736G   206G   530G    28%  1.02x  ONLINE  -
    $ cp -rp /usbpool/Music /tpool/
    $ zfs list tpool/Music
    NAME          USED  AVAIL  REFER  MOUNTPOINT
    tpool/Music  27.9G   513G  27.9G  /tpool/Music
    $ zpool list tpool
    NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    tpool   736G   211G   525G    28%  1.20x  ONLINE  -

ZFS found that the written data was a duplicate of existing data, and deduplicated it on the fly. Only one copy of the 28GB occupies space on disk.
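As a quick check of those numbers, the following Python sketch works out how much of the copy was absorbed by deduplication. The figures come straight from the listing; because zpool list rounds to three significant digits, the result is only approximate, and the function name `dedup_savings` is an assumption made for this example.

```python
def dedup_savings(logical_gb: float, alloc_before_gb: float, alloc_after_gb: float) -> tuple:
    """Return (physical growth, space saved, fraction saved) for one copy."""
    growth = alloc_after_gb - alloc_before_gb  # what the pool actually allocated
    saved = logical_gb - growth                # what dedup avoided writing
    return growth, saved, saved / logical_gb


# USED for tpool/Music, and ALLOC before and after the copy, from the listing.
growth, saved, fraction = dedup_savings(27.9, 206, 211)
print(f"pool grew {growth}G, saved about {saved:.1f}G ({fraction:.0%} of the copy)")
```

The 5GB that was still allocated may be data that had no existing duplicate, metadata, or rounding in the report; the listing alone cannot tell those apart.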
Putting this in context
Now, let's understand this correctly - somewhere in the tpool pool I had the same 28GB of music files. With a small amount of effort I could have found the original copy (I know where I tend to put things), and just stored new or changed data. But that's not the point!
Imagine if this were a file store serving hundreds of users. They obviously can't search through one another's directories and see if the exact same data was already stored by someone else (and somehow know that the data would be retained unchanged, for exactly as long as they themselves needed it). Consider the case where a group of employees receive the same e-mail attachments and store them in their private directories: without deduplication the same data is stored many times; with deduplication it is stored only once.
ZFS deduplication works at the block level, not on an individual file basis, so it is possible that "mostly the same" files can enjoy benefits of deduplication and reduced disk space consumption. Imagine even a single user working with a series of medical image files from CAT scans, or different versions of an animated film. The different files might be almost identical in contents, and potentially the majority of disk blocks could be the same - and stored only once instead of storing the same contents multiple times.
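The "mostly the same" case can be illustrated with a small Python sketch. It is a toy model rather than ZFS itself: the 4096-byte block size, the SHA-256 checksum, and the helper names `block_hashes` and `stored_blocks` are assumptions made for this example. Two versions of a file differ in a single block that was overwritten in place, and only that block needs extra storage.

```python
import hashlib

BLOCK_SIZE = 4096  # toy block size, chosen only for this sketch


def block_hashes(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Checksum every fixed-size block of a file."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]


def stored_blocks(*files: bytes) -> int:
    """Number of distinct blocks a deduplicating store would keep."""
    unique = set()
    for data in files:
        unique.update(block_hashes(data))
    return len(unique)


# Version 1: a stand-in file made of 100 distinct blocks.
v1 = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(100))

# Version 2: the same file with block 42 overwritten in place.
edited = bytearray(v1)
edited[42 * BLOCK_SIZE:43 * BLOCK_SIZE] = b"\xff" * BLOCK_SIZE
v2 = bytes(edited)

referenced = len(block_hashes(v1)) + len(block_hashes(v2))
unique = stored_blocks(v1, v2)
print(f"{referenced} blocks referenced, {unique} blocks stored")
```

A file-level scheme would see two different files and store both in full; comparing block by block lets the unchanged blocks be shared.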
Caveats and Considerations
There are important considerations to keep in mind, so please don't blindly turn on deduplication. First, dedup is available in OpenSolaris 2009.06 once it has been updated to a recent development build (snv_128a in the example above), and in Solaris 11 Express - it is not available in Solaris 10.
Second, deduplication requires a great deal of RAM (or an SSD-based L2ARC) to store the ZFS dedup table - the metadata that contains the block signatures and describes ownership. If you don't have enough RAM, then file operations (especially writes and in particular, file deletion) will be extremely slow. On a large server class machine this may not be a problem, but be very cautious about deploying it on a desktop class system with a few GB of RAM. See Roch Bourbonnais' blog on dedup performance for technical details and results of sizing experiments.
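For a rough sense of scale, here is a back-of-the-envelope Python sketch of dedup table size. Treat every constant as an assumption: the figure of about 320 bytes of memory per table entry is a commonly quoted rule of thumb and does not come from this article, the average block sizes are guesses, and the function name `ddt_ram_bytes` is hypothetical. The 736G figure is simply the size of the tpool pool used earlier, imagined full of unique data.

```python
GIB = 2 ** 30


def ddt_ram_bytes(unique_data_bytes: int,
                  avg_block_bytes: int = 64 * 1024,
                  bytes_per_entry: int = 320) -> int:
    """Estimate dedup table size: one entry per unique block (rule of thumb)."""
    entries = -(-unique_data_bytes // avg_block_bytes)  # ceiling division
    return entries * bytes_per_entry


pool_bytes = 736 * GIB  # size of 'tpool' from the listing, assumed all unique

estimates = {}
for block_kib in (8, 64, 128):
    estimates[block_kib] = ddt_ram_bytes(pool_bytes, block_kib * 1024)
    print(f"{block_kib:>3}K average block: about {estimates[block_kib] / GIB:.1f} GiB of table")
```

The estimate is inversely proportional to the average block size, so pools holding many small blocks need far more memory for the table than pools holding large ones. Measure on real data before relying on any such number.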
Given those considerations - when would you use deduplication? Consider it when you have business processes that create a lot of duplicate data on disk, hosted on a server with suitable RAM and I/O. Then you can get substantial space benefits while maintaining good performance.
This is not quite the same advice that I give for ZFS compression: I recommend turning default ZFS compression on in almost all cases, since the CPU cost is small and the savings in disk space can be valuable (in fact, the reduced disk space can decrease overall CPU cost, since fewer physical I/Os are needed) - very little downside risk is involved. It's more like deciding whether to turn on ZFS gzip-9 compression, which provides greater potential space savings at a much higher CPU cost. In general, evaluate your data and your server configuration before turning on deduplication.
This is a powerful new feature for ZFS - providing deduplication as a new native feature of the most advanced file system, without imposing add-on license fees and with a simple user interface. This currently is a feature of Solaris 11 Express, not Solaris 10. Users who want to make use of this feature now can certainly do so with Solaris 11 Express and can obtain formal support.