Demonstrating ZFS Self-Healing

I'm the kind of guy who likes to tinker. To see under the bonnet. I used to have a go at "fixing" TVs by taking the back off and seeing what could be adjusted (which is kind of anathema to one of the philosophies of ZFS).

So, when I'm presenting and demonstrating ZFS to customers, the thing I really like to show is what ZFS does when I inject "silent data corruption" into one device of a mirrored storage pool.

This is cool, because ZFS does a couple of things that are not done by any comparable product:

  • It detects the corruption by using checksums on all data and metadata.
  • It automatically repairs the damage using the data from the other side of the mirror, assuming the checksums on that side are OK.

This all happens before the data is passed off to the process that asked for it. This is how it looks in slideware:

[Slide: Self-Healing ZFS]

The key to demonstrating this live is injecting corruption without having to apply a magnet or a lightning bolt to my disk. Here is my version of such a demonstration:
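
(A note on setup: the pool below is built on two small files under /export/zfs rather than real disks, which keeps the corruption step harmless. Their creation isn't shown here; something like mkfile would do it, with the 256MB size being my guess, based on the AVAIL figures in step 2:)

    cleek[bash]# mkfile 256m /export/zfs/zd0 /export/zfs/zd1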

  1. Create a mirrored storage pool and a filesystem

    cleek[bash]# zpool create demo mirror /export/zfs/zd0 /export/zfs/zd1
    cleek[bash]# zfs create demo/ccs
    

  2. Load up some data into that filesystem, and see how we are doing

    cleek[bash]# cp -pr /usr/ccs/bin /demo/ccs
    cleek[bash]# zfs list
    NAME                   USED  AVAIL  REFER  MOUNTPOINT
    demo                  2.57M   231M  9.00K  /demo
    demo/ccs              2.51M   231M  2.51M  /demo/ccs
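
    (For a pool-level view of the same numbers, "zpool list" works as well; its output is omitted here.)

    cleek[bash]# zpool list demo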
    

  3. Get a personal checksum of all the data in the files - the "find/cat" outputs the contents of every file, and I pipe all of that data into "cksum"

    cleek[bash]# cd /demo/ccs
    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928      2416605
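
    (A variation: checksum each file individually and save the result, so that a diff later pinpoints any file that changed; the /tmp file name is just a placeholder.)

    cleek[bash]# find . -type f -exec cksum {} + | sort -k3 > /tmp/cksum.before

    Repeating that into, say, /tmp/cksum.after once the demo is finished, and diffing the two files, should show no differences.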
    

  4. Now for the fun part. I will inject some corruption by writing some zeroes onto the start of one of the mirrors.

    cleek[bash]# dd bs=1024k count=32 conv=notrunc if=/dev/zero of=/export/zfs/zd0
    32+0 records in
    32+0 records out
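
    (Why doesn't 32MB of zeroes at the front of the file kill the pool outright? Each vdev keeps four copies of its label, two at the front and two at the end, and the other half of the mirror is untouched. For a peek under the bonnet, zdb can dump what is left of the labels on the damaged file; the exact output varies by release.)

    cleek[bash]# zdb -l /export/zfs/zd0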
    

  5. Now, if I re-read the data, ZFS will not find any problems, and I can verify this at any time using "zpool status"

    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928      2416605
    cleek[bash]# zpool status demo
      pool: demo
     state: ONLINE
     scrub: none requested
    config:
    
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0     0
                /export/zfs/zd1  ONLINE       0     0     0
    

    The reason for this is that ZFS still has all the data for this filesystem cached, so it does not need to read anything from the storage pool's devices.
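
    (An alternative to flushing the cache is to make ZFS read every block regardless: a scrub walks the whole pool, verifies every checksum and repairs whatever it can from the good side of the mirror. The steps below stick with flushing the cache, though.)

    cleek[bash]# zpool scrub demo
    cleek[bash]# zpool status demo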

  6. To force ZFS' cached data to be flushed, I export and re-import my storage pool

    cleek[bash]# cd /
    cleek[bash]# zpool export -f demo
    cleek[bash]# zpool import -d /export/zfs demo
    cleek[bash]# cd -
    /demo/ccs
    

  7. At this point, I should see that ZFS has found some corrupt metadata

    cleek[bash]# zpool status demo
      pool: demo
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool online' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:
    
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0     7
                /export/zfs/zd1  ONLINE       0     0     0
    

  8. Cool - Solaris Fault Manager at work. I'll bring that mirror back online, so ZFS will try using it for what I plan to do next...

    cleek[bash]# zpool online demo /export/zfs/zd0
    Bringing device /export/zfs/zd0 online
    

  9. Now I can repeat my read of the data to generate my checksum, and see what happens

    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928      2416605    <- note that my checksum is the same
    cleek[bash]# zpool status
    [...]
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0    63
                /export/zfs/zd1  ONLINE       0     0     0
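
    To finish cleaning up, a scrub will repair any damaged blocks that the reads above never touched, and the error counts can then be cleared with "zpool online", as the "action" text in step 7 suggests.

    cleek[bash]# zpool scrub demo
    cleek[bash]# zpool status demo
    cleek[bash]# zpool online demo /export/zfs/zd0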
    

Of course, if I wanted to know the instant things happened, I could also use DTrace (in another window):

cleek[bash]# dtrace -n :zfs:zio_checksum_error:entry
dtrace: description ':zfs:zio_checksum_error:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  40650         zio_checksum_error:entry
  0  40650         zio_checksum_error:entry
  0  40650         zio_checksum_error:entry
  0  40650         zio_checksum_error:entry
[...]
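
If I wanted a running total rather than one line per detection, a small aggregation on the same probe does it (a sketch; Ctrl-C prints the count):

cleek[bash]# dtrace -q -n ':zfs:zio_checksum_error:entry { @["checksum errors detected"] = count(); }'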


Comments:

Looks fantastic, when do we get it? ;)

Posted by Jay Fenton on November 14, 2005 at 08:41 AM PST #

You already did: http://www.opensolaris.org/os/community/zfs/

Posted by Ceri Davies on November 16, 2005 at 07:50 PM PST #

Tim, In step 7, the faulted device still reads as "ONLINE", so it seems somewhat counter-intuitive that it could be re-enabled with "zpool online". Is that an error in the output of "zpool status"?

Posted by Ceri Davies on November 17, 2005 at 08:02 PM PST #

The ZFS self-healing feature sounds really good. But based upon my experience, there are always some catches with automatic recovery, or at least something to be aware of.

Posted by April from Bakersfield on January 10, 2011 at 08:22 AM PST #

Step 6 is pretty critical. Otherwise, ZFS still has the data cached.

Posted by Eric Los Angeles on February 24, 2011 at 10:50 AM PST #
