Silly ZFS Dedup Experiment

Just for grins, I thought it would be fun to do some "extreme" deduping.  I started out by creating a pool from a pair of mirrored drives on a system running OpenSolaris build 129.  We'll call the pool p1.  Notice that everyone more or less agrees on the size when we first create it: zfs list and df -h both show 134G available, and zpool list shows the raw 136G pool size.  Notice also that when we created the pool, we turned deduplication on from the very start.

# zpool create -O dedup=on p1 mirror c0t2d0 c0t3d0
# zfs list p1
NAME   USED  AVAIL  REFER  MOUNTPOINT
p1      72K   134G    21K  /p1
# zpool list p1
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
p1     136G   126K   136G     0%  1.00x  ONLINE  -
# df -h /p1
Filesystem             size   used  avail capacity  Mounted on
p1                     134G    21K   134G     1%    /p1

So, what if we start copying a file over and over?  Well, we would expect that to dedup pretty well.  Let's get some data to play with.  We will create a set of 8 files, each one being made up of 128K of random data.  Then we will cat these together over and over and over and over and see what we get.

Why choose 128K for my file size?  Remember that we are trying to deduplicate as much as possible within this dataset.  As it turns out, the default recordsize for ZFS is 128K, and ZFS deduplication works at the ZFS block level.  By selecting a file size of 128K, each of the files I create fits exactly into a single ZFS block.  What if we picked a file size different from the ZFS block size?  Then the blocks spanning the boundaries, where one file was cat-ed onto the next, would generally not match one another, and the data would not deduplicate nearly as well.

Here's an example.  Assume we have a file A whose contents are "aaaaaaaa", a file B containing "bbbbbbbb", and a file C containing "cccccccc".  If our blocksize is 6 while our files all have length 8, then each file spans more than one block.  (In the listings below, the row of digits under each file's contents shows which 6-byte block each character falls into.)

# cat A B C > f1
# cat f1
aaaaaaaabbbbbbbbcccccccc
111111222222333333444444
# cat B A C > f2
# cat f2
bbbbbbbbaaaaaaaacccccccc
111111222222333333444444

The combined contents of the three files span 4 blocks.  Notice that the only duplicated block in this example is block 4, which is identical in f1 and f2.  All of the other blocks end up being different, even though the underlying files were the same.  Think about how this would play out as the number of files grew.
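
You can see the same effect with ordinary tools - a quick sketch (it assumes a split that supports -b for byte-sized chunks, and uses the 6-byte "blocksize" from the example above):

# printf aaaaaaaa > A ; printf bbbbbbbb > B ; printf cccccccc > C
# cat A B C > f1 ; cat B A C > f2
# split -b 6 f1 f1. ; split -b 6 f2 f2.
# cksum f1.a? f2.a?

Only the last 6-byte chunk of each file (the run of c's) comes back with a matching checksum; every other chunk differs, even though A, B, and C themselves are identical in both files.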

So, if we want to make an example where things are guaranteed to dedup as well as possible, our files need to always line up on block boundaries (remember, we're not trying to be realistic here - we're trying to get silly dedup ratios).  So, let's create a set of files that all match the ZFS blocksize.  We'll just create files b1-b8, each filled with 128K of data from /dev/random.

# zfs get recordsize p1
NAME  PROPERTY    VALUE    SOURCE
p1    recordsize  128K     default
# dd if=/dev/random bs=1024 count=128 of=/p1/b1
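
The dd above creates just b1; b2 through b8 need the same treatment.  A one-line loop (a sketch using the same dd parameters) covers all eight:

# for i in 1 2 3 4 5 6 7 8; do dd if=/dev/random of=/p1/b$i bs=1024 count=128; done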

# ls -ls b1 b2 b3 b4 b5 b6 b7 b8
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b1
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b2
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b3
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b4
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b5
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b6
 257 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b7
 205 -rw-r--r--   1 root     root      131072 Dec 14 15:28 b8

Now, let's make some big files out of these.  Each cat below strings together eight copies of the previous file, so every step multiplies the size by 8.

# cat b1 b2 b3 b4 b5 b6 b7 b8 > f1
# cat f1 f1 f1 f1 f1 f1 f1 f1 > f2
# cat f2 f2 f2 f2 f2 f2 f2 f2 > f3
# cat f3 f3 f3 f3 f3 f3 f3 f3 > f4
# cat f4 f4 f4 f4 f4 f4 f4 f4 > f5
# cat f5 f5 f5 f5 f5 f5 f5 f5 > f6
# cat f6 f6 f6 f6 f6 f6 f6 f6 > f7
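
The same construction can be written as a loop - an equivalent sketch in plain Bourne shell that does exactly what the cat commands above do:

# i=1; while [ $i -lt 7 ]; do j=`expr $i + 1`; cat f$i f$i f$i f$i f$i f$i f$i f$i > f$j; i=$j; done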

# ls -lh
total 614027307
-rw-r--r--   1 root     root        128K Dec 14 15:28 b1
-rw-r--r--   1 root     root        128K Dec 14 15:28 b2
-rw-r--r--   1 root     root        128K Dec 14 15:28 b3
-rw-r--r--   1 root     root        128K Dec 14 15:28 b4
-rw-r--r--   1 root     root        128K Dec 14 15:28 b5
-rw-r--r--   1 root     root        128K Dec 14 15:28 b6
-rw-r--r--   1 root     root        128K Dec 14 15:28 b7
-rw-r--r--   1 root     root        128K Dec 14 15:28 b8
-rw-r--r--   1 root     root        1.0M Dec 14 15:28 f1
-rw-r--r--   1 root     root        8.0M Dec 14 15:28 f2
-rw-r--r--   1 root     root         64M Dec 14 15:28 f3
-rw-r--r--   1 root     root        512M Dec 14 15:28 f4
-rw-r--r--   1 root     root        4.0G Dec 14 15:28 f5
-rw-r--r--   1 root     root         32G Dec 14 15:30 f6
-rw-r--r--   1 root     root        256G Dec 14 15:49 f7

This looks pretty weird.  Remember our pool is only 134GB big.  Already the file f7 is 256G and we are not using any sort of compression.  What does df tell us?

# df -h /p1
Filesystem             size   used  avail capacity  Mounted on
p1                     422G   293G   129G    70%    /p1

Somehow, df now believes that the filesystem is 422GB instead of 134GB.  Why is that?  Well, rather than reporting a fixed size and subtracting used space from it, df now calculates the size dynamically as the sum of the space used plus the space available: 293G used + 129G available comes out to the 422G shown.  And we still have lots of space available, because all of those many, many duplicate blocks are just references to the same few blocks on disk.

# zfs list p1
NAME   USED  AVAIL  REFER  MOUNTPOINT
p1     293G   129G   293G  /p1
# zpool list p1
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
p1     136G   225M   136G     0%  299594.00x  ONLINE  -

zpool list tells us the actual size of the pool, along with the amount of space that it views as being allocated and the amount free.  So, the pool really has not changed size.  But the pool says that 225M are in use.  Metadata and pointer blocks, I presume.

Notice that the dedupratio is 299594!  That means that, on average, there are almost 300,000 references to each actual block on the disk.  With only eight unique 128K blocks actually stored, that works out to the 293G of logical data reported above (8 x 128K x 299,594 is roughly 293G).
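
If you just want the ratio by itself, dedupratio is a read-only pool property, so you can also query it directly:

# zpool get dedupratio p1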

One last bit of interesting output comes from zdb.  Try zdb -DD on the pool; this gives you a histogram of how many blocks are referenced how many times.  zdb is not for the faint of heart - it will dump lots of ugly internal info about the pool and its datasets.  In the output below, the single refcnt bucket shows our 8 allocated blocks (1M on disk) being referenced 2.29M times, which accounts for the 293G of logical data.

# zdb -DD p1
DDT-sha256-zap-duplicate: 8 entries, size 768 on disk, 1024 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced         
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
  256K        8      1M      1M      1M    2.29M    293G    293G    293G
 Total        8      1M      1M      1M    2.29M    293G    293G    293G

dedup = 299594.00, compress = 1.00, copies = 1.00, dedup * compress / copies = 299594.00

So, what's my point?  I guess the point is that dedup really does work.  For data that has some commonality, it can save space.  For data that has a lot of commonality, it can save a lot of space.  Along with that come some surprises in how familiar commands have had to adjust to the changing (or perceived) size of the storage they report on.

My suggestion?  Take a look at zfs dedup.  Think about where it might be helpful.  And then give it a try!
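
Turning it on is just a dataset property.  A minimal sketch, assuming a dataset named tank/mydata (the names here are placeholders):

# zfs set dedup=on tank/mydata
# zfs get dedup tank/mydata

Keep in mind that only blocks written after the property is set are deduplicated; existing data is left as-is.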
