how dedupalicious is your pool?

With the putback of 6656655 "zdb should be able to display blkptr signatures", we can now get the "signature" of the block pointers in a pool. To see an example, let's first put some content into an empty pool:

heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
bigIO   928G  95.5K   928G     0%  ONLINE  -
heavy# mkfile 1m /bigIO/1m.txt
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt

Now let's run zdb with the new option "-S". We pass in "user:all", where "user" tells zdb that we only want user data blocks (as opposed to both user and metadata) and "all" tells zdb to print out all blocks (skipping any checksum algorithm strength comparisons).

heavy# zdb -L -S user:all bigIO
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
0       512     1       ZFS plain file  fletcher2       uncompressed    7075646564206f6e:a:eac8cac840dedc0:140
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
heavy# 

This displays the signature of each block pointer; the columns are level, physical size (PSIZE), number of DVAs, object type, checksum algorithm, compression type, and finally the actual checksum of the block.
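For reference, here's a minimal Perl sketch (not part of zdb itself) of pulling one of these tab-separated lines apart; the variable names are just descriptive labels I've chosen:

#!/usr/bin/perl
# Split a single "zdb -S" output line into its seven tab-separated fields.
# The sample line is copied from the output above.
my $line = "0\t512\t1\tZFS plain file\tfletcher2\tuncompressed\t" .
           "656d207075646564:a:ada40e0eac8cac80:140";
my ($level, $psize, $ndvas, $type, $cksum_alg, $compress, $cksum) =
        split /\t/, $line, 7;
print "level=$level psize=$psize ndvas=$ndvas type=\"$type\"\n";
print "cksum_alg=$cksum_alg compress=$compress cksum=$cksum\n";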

So this is interesting, but what could we do with this information? Well, one thing we could do is figure out how much your pool could take advantage of dedup. Let's assume that the dedup implementation matches blocks based on the actual checksum, and that any checksum algorithm is strong enough (in reality, we'd need sha256 or stronger). So starting with the above pool and using a simple perl script 'line_by_line_process.pl' (shown at the end of this blog), we find:

heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt 
total PSIZE:               0t1050624
total unique PSIZE:        0t132096
total that can be duped:   0t918528
percent that can be duped  87.4269005847953%
heavy#   
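A quick sanity check of those numbers against the zdb output above: the eight 131072-byte zero blocks from mkfile plus the four 512-byte text blocks give 8 × 131072 + 4 × 512 = 1050624 bytes of total PSIZE. Only one copy of the zero block, one "dedup me" block, and one "no dedup" block are unique (131072 + 512 + 512 = 132096 bytes), leaving 918528 bytes that could be dedup'd, or about 87.4% of the total.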

In our trivial case, we can see that we could get a huge win - 87% of the pool can be dedup'd! Upon closer examination, though, we notice that mkfile writes out all-zero blocks. If you had compression enabled, there wouldn't be any actual blocks for this file at all. So let's look at a case where just the "ejk.txt" contents are getting dedup'd:

heavy# zpool destroy bigIO
heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# dd if=/dev/random of=/bigIO/1m.txt bs=1024 count=5
5+0 records in
5+0 records out
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt
heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt       
total PSIZE:               0t7168
total unique PSIZE:        0t6144
total that can be duped:   0t1024
percent that can be duped  14.2857142857143%
heavy# 
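Again the numbers line up: the 5K of random data accounts for 5120 bytes of PSIZE, and the four 512-byte text blocks bring the total to 7168 bytes. Only the two extra copies of "dedup me" (2 × 512 = 1024 bytes) are duplicates, and 1024 / 7168 is about 14.3%.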

OK, in this different setup we can see that ~14% of the data could actually be dedup'd - still a nice savings in capacity.

So the question becomes - how dedupalicious is your pool?




ps: here is the simple perl script 'line_by_line_process.pl':

#!/usr/bin/perl

# Run this script on the sorted zdb output, for example:
#   ./line_by_line_process.pl /tmp/zdb_out_sorted.txt

# total PSIZE
$totalps = 0;

# total unique PSIZE (each checksum counted only once)
$totalups = 0;

$last_cksum = -1;

$path = $ARGV[0];
open(FH, $path) or die "Can't open $path: $!";

while (<FH>) {
        my $line = $_;
        ($level, $psize, $ndvas, $type, $cksum_alg, $compress, $cksum) = split /\t/, $line, 7;
        # Input is sorted by checksum, so a block is unique if its
        # checksum differs from the previous line's.
        if ($cksum ne $last_cksum) {
                $totalups += $psize;
        }
        $last_cksum = $cksum;
        $totalps += $psize;
}
close(FH);

print "total PSIZE:               0t".$totalps."\n";
print "total unique PSIZE:        0t".$totalups."\n";
print "total that can be duped:   0t".($totalps - $totalups)."\n";
print "percent that can be duped  ".(($totalps - $totalups) / $totalps * 100)."%\n";
Comments:

Very cool. Deduping is a pretty interesting feature, and I'm interested to see how you do with implementing it. Using zdb to gather some data so that you can properly prioritize it is a good first step. :-)

I don't have any real-world ZFS pools to test against, sadly, but I have to think there's quite a bit of duplication going on out there. Version-control of binary assets is a HUGE factor ... multiple users * multiple branches checked out per user * GB of data per branch = whoa.

In my work at SCEA we've found that this is a situation that game studios bang their heads against all the time. Just heard somebody complaining about it again last month, in fact.

Posted by Drew Thaler on March 19, 2008 at 03:21 PM PDT #

I still don't get it.
What is "dedup" supposed to be??? What does it do???

Posted by UX-admin on March 19, 2008 at 10:40 PM PDT #

Very interesting. At the very least, we can run this
against existing sites and see what happens... I think
I'll add it to some experiments I've been making
lately...

Posted by Richard Elling on March 27, 2008 at 07:06 AM PDT #

UX-admin:

dedup is just an abbreviation of "deduplication" - Theoretically you could make ZFS smart enough to realize that two files have blocks of data in common and store only one copy on disk, with the block pointers in each file pointing to the same location. This would be incredibly useful for developers that check out multiple branches of a source tree, since most files would be identical.

If you remember running chkdsk on a damaged MSDOS filesystem, you could see "cross-linked file" errors; these would be similar, but intentional :)

Posted by Dan Nelson on March 30, 2008 at 12:03 PM PDT #

Hi - this sounds pretty good, almost like RLE on a PCX file, but much more complex. One question - say there are 2 blkptrs pointing to the same block - what happens if the file itself changes, and that change happens to invalidate what the other blkptr expects to be pointing at - is this "orphan" blkptr re-pointed at another identical block, and if one is not available, where does it "return" to?

Posted by Danny Webster on March 31, 2008 at 07:58 PM PDT #

Danny Webster:
ZFS is copy-on-write, so when you change a file, you make a copy of the block and point to the new version. The old version is still referenced by the other files so it is not garbage-collected. In other words, this is *not* the same thing as a hard link in a filesystem.

Question:
when you say we need a stronger hash, I guess you only mean we need a strong hash if we want an exact answer. But a hash that is not so strong seems sufficient to me to detect potential dups, assuming that we always check afterwards whether the blocks really are the same.

It also seems to me that there will be a trade off in the implementation of a dedup algorithm, where we can either use a LOT of memory to store all the hashes (and corresponding pointers) and do the dedup in one run (this may make the system unusable), or we can run a partial dedup algorithm (say only on the blocks with a hash that starts with 'aa', and move on to 'ab' when we are done), which involves reading the disks many more times but could more easily be run in the background. There is also a possibility to use the fact that blocks older than some age have already been deduped to reduce the computation required. Seems like a very interesting algorithmic problem.

Now it still comes later in my list of priorities than a way to have zfs write files without compression (or just lzjb) and have a background process that recompresses files with gzip -9. And maybe even auto-tune this behavior depending on the load...

Posted by Marc on April 01, 2008 at 11:23 PM PDT #
