Remote replication using ZFS
By clive on May 13, 2008
So the question I was asked by one of our UK Academic Sales Account Managers was: "Can you use ZFS for replication between remote sites?"
The answer is: it depends.
It depends on:
- How big is the window of data you can afford to lose?
- How much data gets written to the filesystem?
- How much data can you send over ssh?
So, if you cannot afford to lose a single transaction in the event of needing to fail over, then ZFS replication is not for you. Look at the vast range of SNDR-type products, which do synchronous data replication across remote sites.
If the number of transactions you can afford to lose is non-zero, then ZFS may open up an exciting world at no extra cost. Let's start by finding a few figures:
- What is the peak change rate on your filesystem (now and projected)?
- What transaction loss window can be tolerated?
- How many GB/s can you send over ssh between your two candidate machines?
I have been working with Geoff Bell at the University of Bradford, who manages their mail service. The rate of change of the mail servers' filestore has been observed at 20GB of change over a 6-day period. This is in the region of 135MB an hour, or close to 2MB a minute, average change.
The mail servers that Geoff manages get backed up every night. So the current transaction loss window is up to 24 hours, meaning that if an email comes in during the day and an improbable event occurs, such as the disk array catching fire, then all messages sent that day may be lost.
A quick test:
    ptime dd if=/dev/zero bs=16k count=10000 | ssh <hostname> dd of=/dev/null
shows that we can get just over 2GB a minute between the two X4500s using ssh. This improves by about 30% if we add -c blowfish to ssh.
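To turn the timing from that test into a rate, divide the bytes moved by the elapsed time. The test sends 16k x 10000 = 163,840,000 bytes. A small sketch of the conversion follows; gb_per_min is a made-up helper name, and the elapsed time is whatever "real" figure ptime reports on your machines.

```shell
#!/bin/sh
# Hypothetical helper: convert a dd-over-ssh timing into GB per minute.
# $1 = bytes transferred, $2 = elapsed seconds (the "real" time from ptime)
gb_per_min() {
    awk -v b="$1" -v s="$2" \
        'BEGIN { printf "%.2f\n", b / s * 60 / (1024 * 1024 * 1024) }'
}

# Bytes moved by: dd if=/dev/zero bs=16k count=10000
TEST_BYTES=$((16 * 1024 * 10000))
```

For example, `gb_per_min "$TEST_BYTES" 4.5` prints 2.03, so a run of about four and a half seconds corresponds to the "just over 2GB a minute" figure. On older Solaris releases /usr/bin/awk may not accept -v, in which case nawk is the usual substitute.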
So we have headroom for error/growth in the region of around 1000 times.
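The headroom figure is just the link rate divided by the change rate. Here is a rough sketch of the sum using the numbers quoted above; the function names are made up for illustration, and 2GB a minute is taken as 2048MB a minute.

```shell
#!/bin/sh
# Figures quoted in the post; substitute your own measurements.
CHANGE_GB=20          # observed filesystem change, in GB
PERIOD_DAYS=6         # observation period, in days
LINK_MB_PER_MIN=2048  # measured ssh throughput, about 2GB a minute

# $1 = GB changed, $2 = days observed; prints the average MB per minute
rate_mb_per_min() {
    awk -v gb="$1" -v days="$2" \
        'BEGIN { printf "%.2f\n", gb * 1024 / (days * 24 * 60) }'
}

# $1 = link MB/min, $2 = change MB/min; prints the ratio as a whole number
headroom() {
    awk -v link="$1" -v change="$2" 'BEGIN { printf "%.0f\n", link / change }'
}
```

With these inputs, `rate_mb_per_min 20 6` prints 2.37 and `headroom 2048 2.37` prints 864. That is the same order of magnitude as the round figure of 1000, which comes from rounding the change rate down to 2MB a minute.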
I put together a script to manage a loop of zfs snapshot and zfs send/recv. The experimental results show that it was good up to 2GB of filesystem change per minute.
The script is simple. It looks for a snapshot on the failover system. If there is none, it takes a snapshot and sends it in full. If there is an existing snapshot, it takes the script's argument and works from it.
It then works in a loop, taking a snapshot and doing an incremental send/recv, until the end of time.
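The shape of such a loop can be sketched as below. This is not the original script, only a minimal outline of the same idea. The filesystem name, host name, interval and snapshot prefix are all assumptions. It also decides between a full and an incremental send purely on whether an argument was given, rather than by looking on the failover system.

```shell
#!/bin/sh
# Hypothetical names: adjust to your own pool, filesystem and standby host.
FS="tank/mail"          # filesystem to replicate (assumed name)
REMOTE="standby-host"   # failover machine reachable over ssh (assumed name)
INTERVAL=60             # seconds between snapshots (assumed)
PREFIX="repl"           # snapshot name prefix (assumed)

# Build a snapshot name such as tank/mail@repl-20080513120000
snap_name() {
    printf '%s@%s-%s\n' "$FS" "$PREFIX" "$1"
}

# First run: send the whole filesystem up to snapshot $1
full_send() {
    zfs send "$1" | ssh "$REMOTE" zfs recv -F "$FS"
}

# Later runs: send only the changes between snapshots $1 and $2
incr_send() {
    zfs send -i "$1" "$2" | ssh "$REMOTE" zfs recv -F "$FS"
}

# $1 (optional) = last snapshot already transferred to the standby
replicate() {
    prev="$1"
    if [ -z "$prev" ]; then
        prev=$(snap_name "$(date +%Y%m%d%H%M%S)")
        zfs snapshot "$prev" && full_send "$prev" || return 1
    fi
    while :; do
        sleep "$INTERVAL"
        cur=$(snap_name "$(date +%Y%m%d%H%M%S)")
        zfs snapshot "$cur" && incr_send "$prev" "$cur" || return 1
        prev="$cur"
    done
}
```

Calling `replicate` with no argument starts with a full send. Calling `replicate tank/mail@repl-20080513120000` resumes from that snapshot, provided it still exists on both machines. The loop stops at the first failed snapshot or transfer, so that a broken link does not leave a gap in the incremental chain.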
The biggest downside is that with 1.4TB of existing mail, the first send/recv will take in the region of 8 hours! Still, we should only have to do it once.
I have left Geoff the open problem of working out which snapshots to delete, but pointed him at Chris Gerhard's blog which gives a solution to this very problem.
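One simple policy, which is not Chris Gerhard's solution but only a sketch of the idea, is to keep the newest few snapshots and destroy the rest. The retention count and function names here are assumptions. Whatever policy you pick must never remove the snapshot that the next incremental send will start from.

```shell
#!/bin/sh
KEEP=10   # number of most recent snapshots to retain (assumed policy)

# Reads snapshot names on stdin, oldest first; prints all but the last $1
old_snapshots() {
    awk -v keep="$1" \
        '{ name[NR] = $0 } END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# Lists the snapshots of filesystem $1 itself, oldest first, names only
list_snapshots() {
    zfs list -H -t snapshot -o name -s creation -r "$1" | grep "^$1@"
}

# Destroys everything except the newest $KEEP snapshots of filesystem $1
prune() {
    list_snapshots "$1" | old_snapshots "$KEEP" | while read -r snap; do
        zfs destroy "$snap"
    done
}
```

Running `prune tank/mail` on each machine after a successful transfer keeps the snapshot count bounded, where tank/mail is again a made-up filesystem name. Before trusting it, run `list_snapshots tank/mail | old_snapshots "$KEEP"` on its own to see what would be destroyed.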
Failover would of course be manual, but on the standby machine it would only require the most current complete snapshot to be promoted and renamed, and the service restarted on the standby node.
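One way to read "promoted and renamed" is clone, promote, then rename. The sketch below only prints the commands, so an operator can review them before running anything. The _live and _old suffixes and the function names are assumptions, and your site may prefer a different sequence.

```shell
#!/bin/sh
# Prints the newest snapshot of filesystem $1 on the standby
latest_snapshot() {
    zfs list -H -t snapshot -o name -s creation -r "$1" | grep "^$1@" | tail -1
}

# Prints (does not run) one possible failover sequence.
# $1 = replicated filesystem, $2 = newest complete snapshot
failover_plan() {
    fs="$1"
    snap="$2"
    echo "zfs clone $snap ${fs}_live"
    echo "zfs promote ${fs}_live"
    echo "zfs rename $fs ${fs}_old"
    echo "zfs rename ${fs}_live $fs"
}
```

On the standby, `failover_plan tank/mail "$(latest_snapshot tank/mail)"` prints four commands to check and then run by hand, after which the mail service can be restarted against the renamed filesystem.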
Each site will have different needs in terms of filesystem layout, interval, etc. I can only really provide a template that worked in one place. The script does not need an argument, but if you want to restart from the last snapshot transferred, then just give that as an argument to the script. Any changes/improvements are very welcome.
ZFS snapshots and the send/recv mechanism open up some novel options, for very little extra cost, to improve the currency of the data in case of failover.