Friday Apr 17, 2009


I tripped over this while looking at something else in mdb. My curious mind wondered what it was and whether it might be useful at some point in the future.

# zpool scrub pool
# mdb -k
> ::help zfs_blkstats

  zfs_blkstats - given a spa_t, print block type stats from last scrub

  addr ::zfs_blkstats [-v]


  Target: kvm
  Module: zfs
  Interface Stability: Unstable

> ::walk spa | ::zfs_blkstats -v
Dittoed blocks on same vdev: 2194

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     1    512     512   1.50K   1.50K    1.00     0.00  object directory
     2     1K      1K   3.00K   1.50K    1.00     0.00  object array
     1    16K   1.50K   4.50K   4.50K   10.66     0.00  packed nvlist
     1    16K      1K   3.00K   3.00K   16.00     0.00  bplist
    20   128K   42.5K    128K   6.37K    3.01     0.00  SPA space map
    16    64K   38.5K    116K   7.21K    1.66     0.00    L0 SPA space map
     4    64K      4K   12.0K   3.00K   16.00     0.00    L1 SPA space map
    58   928K   60.0K    124K   2.13K   15.46     0.00  DMU dnode
    10   160K   12.0K   28.0K   2.79K   13.33     0.00    L0 DMU dnode
     8   128K      8K     16K      2K   16.00     0.00    L1 DMU dnode
     8   128K      8K     16K      2K   16.00     0.00    L2 DMU dnode
     8   128K      8K     16K      2K   16.00     0.00    L3 DMU dnode
     8   128K      8K     16K      2K   16.00     0.00    L4 DMU dnode
     8   128K      8K     16K      2K   16.00     0.00    L5 DMU dnode
     8   128K      8K     16K      2K   16.00     0.00    L6 DMU dnode
     9  9.00K   4.50K    9.5K   1.05K    2.00     0.00  DMU objset
     4     2K      2K   6.00K   1.50K    1.00     0.00  DSL directory child map
     3  1.50K   1.50K   4.50K   1.50K    1.00     0.00  DSL dataset snap map
     6  65.0K   8.50K   25.5K   4.25K    7.64     0.00  DSL props
     1    512     512      1K      1K    1.00     0.00  ZFS directory
     1    512     512      1K      1K    1.00     0.00  ZFS master node
     1    512     512      1K      1K    1.00     0.00  ZFS delete queue
  258K  2.03G   2.00G   2.00G   7.95K    1.01    99.98  zvol object
  256K  1.99G   1.99G   1.99G      8K    1.00    99.72    L0 zvol object
 2.00K  32.0M   2.52M   5.05M   2.52K   12.68     0.24    L1 zvol object
    22   352K    152K    303K   13.7K    2.32     0.01    L2 zvol object
     7   112K   10.5K   21.0K   3.00K   10.66     0.00    L3 zvol object
     1    512     512      1K      1K    1.00     0.00  zvol prop
     1   128K   6.00K   18.0K   18.0K   21.33     0.00  SPA history
     1    512     512   1.50K   1.50K    1.00     0.00  DSL dataset next clones
  258K  2.03G   2.00G   2.00G   7.95K    1.01    100.0  Total
  256K  2.00G   1.99G   2.00G   7.99K    1.00    99.73    L0 Total
 2.01K  32.2M   2.54M   5.08M   2.52K   12.70     0.24    L1 Total
    30   480K    160K    319K   10.6K    3.00     0.01    L2 Total
    15   240K   18.5K   37.0K   2.46K   12.97     0.00    L3 Total
     8   128K      8K     16K      2K   16.00     0.00    L4 Total
     8   128K      8K     16K      2K   16.00     0.00    L5 Total
     8   128K      8K     16K      2K   16.00     0.00    L6 Total

Tuesday May 13, 2008

Remote replication using ZFS

So the question I was asked by one of our UK Academic Sales Account Managers was: "Can you use ZFS for replication between remote sites?"

The answer is: it depends.

It depends on:

  • How big a window of data you can afford to lose
  • How much data gets written to the filesystem
  • How much data you can send over ssh

So, if you cannot afford to lose a single transaction in the event of needing to fail over, then ZFS replication is not for you. Look at the vast range of SNDR-type products, which do synchronous data replication across remote sites.

If the number of transactions you can afford to lose is non-zero, then ZFS may open up an exciting world at no extra cost. Let's start by finding a few figures:

  • What is the peak change rate on your filesystem (now and projected)?
  • What transaction loss window can be tolerated?
  • How many GB/s can you send over ssh between your two candidate machines?

I have been working with Geoff Bell at the University of Bradford, who manages their mail service. The rate of change of the mail server's filestore has been observed at 20GB of change over a 6 day period. This is in the region of 135MB an hour, or close to 2MB a minute on average.
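As a rough check of that arithmetic, here is a minimal sketch that turns the observed figures into hourly and per-minute rates. The function name `change_rate` is made up for illustration, and decimal units (1GB = 1000MB) are assumed, so the result lands near, not exactly on, the rounded figures quoted above.

```shell
# Average change rate from an observed volume of change.
# Assumes decimal units: 1 GB = 1000 MB.
change_rate() {
    # $1 = GB changed, $2 = days over which the change was observed
    awk -v gb="$1" -v days="$2" 'BEGIN {
        mb_per_hour = gb * 1000 / (days * 24)
        printf "%.0f MB/hour, %.1f MB/minute\n", mb_per_hour, mb_per_hour / 60
    }'
}

# The figures from the mail server: 20 GB over 6 days.
change_rate 20 6
```

Plug in your own peak figures rather than the average if the load is bursty, since the peak is what the replication loop has to keep up with.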

The mail servers that Geoff manages get backed up every night, so the current transaction loss window is up to 24 hours. If an improbable event occurs during the day, such as the disk array catching fire, then all messages received that day may be lost.

The command

ptime dd if=/dev/zero bs=16k count=10000 | ssh <hostname> dd of=/dev/null

shows that we can get just over 2GB a minute between the two X4500s using ssh. This improves by about 30% if we add -c blowfish to ssh.

So, with a link that carries about 2GB a minute and a filesystem that changes by about 2MB a minute, we have headroom for error and growth in the region of 1000 times.
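The two calculations behind those numbers can be sketched as follows. The dd test moves 16k x 10000 bytes, so the link rate follows from the elapsed "real" time that ptime reports. The function names are hypothetical, and the 4.8 second elapsed time is an illustrative value chosen to land near 2GB a minute, not a measurement from the test.

```shell
# Bytes moved by the dd test: bs=16k count=10000.
DD_BYTES=$((16 * 1024 * 10000))

# Link rate in GB/minute (decimal GB) from the elapsed time of the dd test.
gb_per_minute() {
    # $1 = elapsed seconds, as reported by ptime ("real")
    awk -v bytes="$DD_BYTES" -v secs="$1" \
        'BEGIN { printf "%.2f\n", bytes / secs * 60 / 1000000000 }'
}

# Headroom factor: how many times the link rate exceeds the change rate.
headroom() {
    # $1 = link rate in GB/minute, $2 = change rate in MB/minute
    awk -v gb="$1" -v mb="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / mb }'
}

gb_per_minute 4.8   # 4.8 s is an assumed, illustrative elapsed time
headroom 2 2        # 2 GB/minute link against 2 MB/minute of change
```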

I put together this script to manage a loop of zfs snapshot and zfs send/recv. The experimental results show that it was good for up to 2GB of filesystem change per minute.

The script is simple. It looks for a snapshot on the failover system. If there is none, it takes a snapshot and sends it in full. If there is an existing snapshot, it takes the script's argument and works from it.

It then works in a loop, taking a snapshot and doing an incremental send/recv, until the end of time.
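For readers without the script to hand, the loop described above might look something like the sketch below. It is not the original script: the filesystem and host names, the snapshot naming scheme, the interval and the DRY_RUN switch are all assumptions made for illustration, and it simplifies the start-up check by deciding between a full and an incremental send purely on whether a starting snapshot was given. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. All names below are hypothetical placeholders.
SRC_FS=${SRC_FS:-tank/mail}       # filesystem to replicate
DST_HOST=${DST_HOST:-standby}     # failover machine
DST_FS=${DST_FS:-tank/mail}       # filesystem on the failover machine
INTERVAL=${INTERVAL:-60}          # seconds between snapshots
DRY_RUN=${DRY_RUN:-1}             # 1 = print commands, 0 = execute them
PREV=""                           # last snapshot known to be on the standby

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

# Take one snapshot and ship it: in full if PREV is empty,
# otherwise as an increment from PREV. Updates PREV on success.
replicate_once() {
    new="$SRC_FS@repl-$(date +%Y%m%d%H%M%S)"
    run "zfs snapshot $new" || return 1
    if [ -z "$PREV" ]; then
        run "zfs send $new | ssh $DST_HOST zfs recv -F $DST_FS" || return 1
    else
        run "zfs send -i $PREV $new | ssh $DST_HOST zfs recv -F $DST_FS" || return 1
    fi
    PREV=$new
}

# Loop until a step fails. $1 = snapshot to restart from (optional).
replicate_forever() {
    PREV=$1
    while replicate_once; do
        sleep "$INTERVAL"
    done
}

# Usage (commented out so that sourcing this file has no side effects):
#   replicate_forever                              # start with a full send
#   replicate_forever tank/mail@repl-20080513120000   # restart incrementally
```

Stopping on the first failure is a deliberate choice in this sketch: an incremental stream only applies on top of the snapshot it was generated from, so carrying on after a failed transfer would just produce more failures.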

The biggest downside is that with 1.4TB of existing mail, the first send/recv will take in the region of 8 hours! Still, it should only have to be done once.

I have left Geoff the open problem of working out which snapshots to delete, but pointed him at Chris Gerhard's blog which gives a solution to this very problem.
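One very simple policy, shown here only as a sketch and not as the solution from Chris Gerhard's blog, is to keep the newest N snapshots and treat the rest as candidates for deletion. The function name `prune_candidates` is made up for illustration; it only filters a list of names and destroys nothing itself.

```shell
# Read snapshot names on stdin, oldest first, and print all but the
# newest $1 of them. Prints nothing if there are $1 or fewer.
prune_candidates() {
    awk -v keep="$1" '
        { name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }
    '
}

# Example with a fixed list: keep the newest two, so only the oldest is printed.
printf 'tank/mail@repl-1\ntank/mail@repl-2\ntank/mail@repl-3\n' | prune_candidates 2

# Possible use against a real pool (commented out; the names are assumptions):
#   zfs list -H -t snapshot -o name -s creation -r tank/mail |
#       prune_candidates 10 |
#       while read snap; do zfs destroy "$snap"; done
```

Whatever policy is used, the most recent snapshot that exists on both machines has to survive, because the next incremental send starts from it.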

Failover would of course be manual, but on the standby machine it would only require the most current complete snapshot to be promoted and renamed, and the service restarted on the standby node.
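One possible reading of "promoted and renamed" is to clone the newest complete snapshot, promote the clone, and swap it into place under the original name. The sketch below only prints the commands for an operator to review. The function name, the `-live` and `-old` suffixes and the example names are assumptions, and the real procedure will depend on how the standby is laid out.

```shell
# Print the commands for one possible manual failover on the standby.
# $1 = replicated filesystem, $2 = name of the newest complete snapshot
failover_commands() {
    fs=$1
    snap=$2
    echo "zfs clone $fs@$snap $fs-live"    # writable copy of the snapshot
    echo "zfs promote $fs-live"            # make the clone independent
    echo "zfs rename $fs $fs-old"          # move the replica out of the way
    echo "zfs rename $fs-live $fs"         # put the clone under the service name
}

# Hypothetical names, for illustration only.
failover_commands tank/mail repl-20080513120000
```

Restarting the mail service itself is left out, since that depends entirely on the site.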

Each site will have different needs in terms of filesystem layout, interval, etc. I can only really provide a template that worked in one place. The script does not need an argument, but if you want to restart from the last snapshot transferred, then just give that as an argument to the script. Any changes or improvements are very welcome.

ZFS snapshots and the send/recv mechanism open up some novel options for keeping the data on a standby more current in case of failover, at very little extra cost.



