In this update, we share Btrfs functionality that helps make moving data between Btrfs volumes faster and more efficient. It is not a new feature, but it is an underutilized one that showcases the unique capabilities of Btrfs as the native Linux copy-on-write filesystem.
Btrfs send was introduced in Linux v3.6, and the amazing part is that it supports incremental updates. Here I'll go through the command as a user and try to understand it as a btrfs developer. A user can transfer one whole subvolume tree to another btrfs filesystem by using 'send'. Keep in mind that the subvolume tree must be _readonly_, so the steps can be as simple as a few commands: take a read-only snapshot, then pipe 'btrfs send' into 'btrfs receive'.
By 'whole subvolume tree' I mean that both data and metadata are transferred to the receive side. To do this, 'send' uses pipe(2), which creates two file descriptors, one for the reader and one for the writer. The kernel writes send's instructions to the writer fd, and the userspace progs retrieve those instructions from the reader fd and write them to stdout by default. In the send/receive example, we created another pipe to redirect stdout to the receive side.
$ man btrfs-send
usage: btrfs send [-ve] [-p <parent>] [-c <clone-src>] [-f <outfile>] <subvol> [<subvol>...]

    Send the subvolume(s) to stdout.

    Sends the subvolume(s) specified by <subvol> to stdout.
    <subvol> should be read-only here.
    By default, this will send the whole subvolume. To do an incremental
    send, use '-p <parent>'. If you want to allow btrfs to clone from
    any additional local snapshots, use '-c <clone-src>' (multiple times
    where applicable). You must not specify clone sources unless you
    guarantee that these snapshots are exactly in the same state on both
    sides, the sender and the receiver. It is allowed to omit the
    '-p <parent>' option when '-c <clone-src>' options are given, in
    which case 'btrfs send' will determine a suitable parent among the
    clone sources itself.

    -e               If sending multiple subvols at once, use the new
                     format and omit the end-cmd between the subvols.
    -p <parent>      Send an incremental stream from <parent> to
                     <subvol>.
    -c <clone-src>   Use this snapshot as a clone source for an
                     incremental send (multiple allowed)
    -f <outfile>     Output is normally written to stdout. To write to
                     a file, use this option. An alternative would be to
                     use pipes.
    --no-data        send in NO_FILE_DATA mode, Note: the output stream
                     does not contain any file data and thus cannot be used
                     to transfer changes. This mode is faster and useful to
                     show the differences in metadata.
    -v|--verbose     enable verbose output to stderr, each occurrence of
                     this option increases verbosity
    -q|--quiet       suppress all messages, except errors

$ btrfs subvolume snapshot -r /mnt/send/subvol /mnt/send/snapshot
$ btrfs send /mnt/send/snapshot | btrfs receive /mnt/recv/
# then, we get an identical 'snapshot' under /mnt/recv
$ ls /mnt/recv
snapshot
Then on the receive side, 'btrfs receive' is used to create a new subvolume (/mnt/recv/snapshot) and apply the instructions in the send stream to make it look like the one on the send side (/mnt/send/snapshot).
This feature is often helpful for regular filesystem backups because it combines built-in, easy, and cheap snapshots with incremental updates. Paired with out-of-band deduplication, btrfs provides all the features needed to build a powerful backup appliance.
Last but not least, please note that nothing comes for free. Although creating a snapshot is easy, fast, and cheap, deleting snapshots can slow down the whole filesystem. It takes a good amount of effort to traverse several btrees and remove references on everything, and this can consume CPU quite intensively. The problem is also known as the "snowball effect of wandering trees". It's highly recommended to keep only the snapshots that are necessary.
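As a rough illustration of keeping only the necessary snapshots, here is a minimal retention sketch. It assumes snapshot names sort chronologically (for example, timestamped names); the function name `prune_candidates` and the `/mnt/send/snapshots` directory are hypothetical. The function only prints the names that fall outside the newest N, and the actual removal would be done with 'btrfs subvolume delete'.

```shell
#!/bin/sh
# Print the snapshot names that fall outside the newest $1 entries.
# Assumption: names arrive on stdin, one per line, and sort
# chronologically (for example snap-2024-03-06). Names are hypothetical.
prune_candidates() {
    keep=$1
    sort | awk -v keep="$keep" '
        { names[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print names[i] }
    '
}

# Example wiring (not run here): delete everything except the newest 3.
# ls /mnt/send/snapshots | prune_candidates 3 | while read -r name; do
#     btrfs subvolume delete "/mnt/send/snapshots/$name"
# done
```

Keeping the selection separate from the deletion makes it easy to review the list before any snapshot is removed.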
Although stdout is used by default, its file descriptor often refers to a tty (terminal), and then we get this error:
$ btrfs send /mnt/btrfs/snap2
ERROR: not dumping send stream into a terminal, redirect it into a file

# Fix this error with one of the following commands:
$ btrfs send /mnt/snap > output
$ btrfs send -f output /mnt/snap
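The same kind of guard can be reproduced in a few lines of shell, which may help when wrapping 'btrfs send' in a script. This is only a sketch: `emit_stream` is a hypothetical stand-in for a command that produces a binary stream, not code from btrfs-progs.

```shell
#!/bin/sh
# emit_stream is a hypothetical stand-in for a command that produces a
# binary stream; it is not part of btrfs-progs.
emit_stream() {
    # [ -t 1 ] is true when file descriptor 1 (stdout) is a terminal.
    if [ -t 1 ]; then
        echo "ERROR: not dumping stream into a terminal, redirect it into a file" >&2
        return 1
    fi
    printf 'stream-bytes'
}
```

Redirecting stdout to a file or a pipe makes the check pass, which is why both fixes shown with the error message work.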
The '-p <parent>' option can potentially speed up a send-receive process because it tells the receiver to create a snapshot of <parent> before applying the changes passed in the send stream. It assumes that a previous send-receive has happened, so that <parent> exists on both the sender side and the receiver side.
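A backup round built on '-p <parent>' might look like the sketch below. The `run` and `backup_round` helpers, the `DRY_RUN` switch, and the paths in the usage note are illustrative assumptions; only the 'btrfs subvolume snapshot -r', 'btrfs send -p', and 'btrfs receive' invocations come from the tool itself.

```shell
#!/bin/sh
# One backup round: take a read-only snapshot, then send it, incrementally
# if a parent snapshot name is given. Helper names, paths, and the DRY_RUN
# switch are illustrative assumptions.
run() {
    # With DRY_RUN=1, log the command to stderr instead of executing it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*" >&2
    else
        "$@"
    fi
}

# Usage: backup_round <subvol> <snapdir> <recvdir> <new> [<parent>]
backup_round() {
    subvol=$1; snapdir=$2; recvdir=$3; new=$4; parent=${5:-}
    run btrfs subvolume snapshot -r "$subvol" "$snapdir/$new" || return 1
    if [ -n "$parent" ]; then
        # <parent> must already exist, unmodified, under <recvdir>.
        run btrfs send -p "$snapdir/$parent" "$snapdir/$new" |
            run btrfs receive "$recvdir"
    else
        # First round: no parent yet, so send the whole subvolume.
        run btrfs send "$snapdir/$new" | run btrfs receive "$recvdir"
    fi
}
```

For example, 'backup_round /mnt/send/sub /mnt/send /mnt/recv update parent' would snapshot /mnt/send/sub as /mnt/send/update and send only the differences from /mnt/send/parent. Setting DRY_RUN=1 first prints the commands without touching the filesystem.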
Incremental updates can be applied with minimal effort by making a snapshot of <parent> on the receiver side. It mostly works as expected, except for one problem I observed: the receiver doesn't check whether <parent> is read-only or read-write. You can see this if you
a) toggle off the RO bit of <parent> with 'btrfs property set -t subvol <parent> ro false'
b) add or remove files/directories under <parent>
Then the snapshot on the sender side will not be identical to the snapshot on the receive side. Here is an example:
$ btrfs sub create /mnt/send/sub
$ touch /mnt/send/sub/foo
$ btrfs sub snap -r /mnt/send/sub /mnt/send/parent

# send parent out
$ btrfs send /mnt/send/parent | btrfs receive /mnt/recv/

# change parent and file under it
$ btrfs property set -t subvol /mnt/recv/parent ro false
$ truncate -s 4096 /mnt/recv/parent/foo

$ btrfs sub snap -r /mnt/send/sub /mnt/send/update
$ btrfs send -p /mnt/send/parent /mnt/send/update | btrfs receive /mnt/recv

$ ls -l /mnt/send/update
total 0
-rw-r--r-- 1 root root 0 Mar 6 11:13 foo

$ ls -l /mnt/recv/update
total 0
-rw-r--r-- 1 root root 4096 Mar 6 11:14 foo
However, if 'foo' in the new snapshot on the sender side has a non-zero size, the receiver side shows the correct size:
$ truncate -s 8192 /mnt/send/sub/foo
$ btrfs sub snap -r /mnt/send/sub /mnt/send/update-new
$ btrfs send -p /mnt/send/parent /mnt/send/update-new | btrfs receive /mnt/recv

$ ls -l /mnt/send/update-new
total 0
-rw-r--r-- 1 root root 8192 Mar 6 11:21 foo

$ ls -l /mnt/recv/update-new
total 0
-rw-r--r-- 1 root root 8192 Mar 6 11:21 foo
'btrfs receive' doesn't apply the file size if the size is zero.
Fixes for these issues are under development. The correct way to make changes to a read-only snapshot is to create another snapshot of it that has write access.
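Until the receiver performs this check itself, a script can do it. The sketch below assumes that 'btrfs property get -t subvol <path> ro' prints 'ro=true' for a read-only subvolume; the helper names, the overridable `BTRFS` variable, and the '-rw' suffix are hypothetical conventions.

```shell
#!/bin/sh
# BTRFS can be overridden (for example with a stub) for testing; the
# default is the real tool.
BTRFS=${BTRFS:-btrfs}

# Succeeds only if the subvolume at $1 still has its read-only flag set.
is_readonly() {
    [ "$("$BTRFS" property get -t subvol "$1" ro)" = "ro=true" ]
}

# Create a writable snapshot of a read-only one instead of flipping the
# flag on the original. The "-rw" suffix is a hypothetical naming choice.
writable_copy() {
    "$BTRFS" subvolume snapshot "$1" "$1-rw"
}
```

Calling 'is_readonly /mnt/recv/parent' before each incremental receive catches a parent whose RO bit was toggled, and 'writable_copy /mnt/recv/parent' gives a place to make changes without disturbing the parent.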
To understand the '-c <clone-src>' option, we need to explain clone first.
Clone refers to an operation that allows two files (or two different parts of the same file) to share the same piece of data on disk; copy-on-write happens if any part of the shared data gets changed.
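From userspace, a clone can be requested with the '--reflink' option of GNU cp. The sketch below assumes GNU coreutils and uses '--reflink=auto', which shares extents where the filesystem supports it (as Btrfs does) and falls back to a regular copy elsewhere; '--reflink=always' would fail instead of falling back. The temporary directory and file names are illustrative.

```shell
#!/bin/sh
# Clone a file with cp --reflink (GNU coreutils assumed).
workdir=$(mktemp -d)
printf 'hello btrfs\n' > "$workdir/original"
cp --reflink=auto "$workdir/original" "$workdir/clone"

# Appending to the clone triggers copy-on-write on a reflink-capable
# filesystem; the original keeps its content either way.
printf 'changed\n' >> "$workdir/clone"
cat "$workdir/original"
```

The original still prints 'hello btrfs' after the clone is modified, because only the changed part of the clone gets its own blocks.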
With the '-c' option, the send-receive process can avoid transferring data in the send stream because the required data is already available on the receiver side; all it needs to do is reflink from <clone-src>.
Similar to '-p <parent>', it also assumes that <clone-src> exists on both the sender side and the receiver side. The difference is that '-c <clone-src>' only avoids transferring data, while '-p <parent>' avoids both data and metadata.
For the best result, multiple <clone-src> options can be given, and 'btrfs send' will try to figure out the best-fit parent to use. If it fails to do so, an error is printed: 'parent determination failed for xxx'.
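Passing several clone sources from a script could be sketched as follows. The function name `send_with_clones` and the overridable `BTRFS` variable are hypothetical; the only btrfs behaviour relied on is that '-c <clone-src>' may be repeated.

```shell
#!/bin/sh
# BTRFS can be overridden (for example with a stub) for testing; the
# default is the real tool.
BTRFS=${BTRFS:-btrfs}

# Usage: send_with_clones <snapshot> <clone-src>...
# Passes every <clone-src> as its own -c option and leaves the choice of
# parent to 'btrfs send'. The function name is hypothetical.
send_with_clones() {
    snapshot=$1
    shift
    for src in "$@"; do
        # Rotate the argument list: append "-c <src>", drop the old entry.
        set -- "$@" -c "$src"
        shift
    done
    "$BTRFS" send "$@" "$snapshot"
}
```

A typical call would be 'send_with_clones /mnt/send/update /mnt/send/parent /mnt/send/other | btrfs receive /mnt/recv', where every listed clone source must be in exactly the same state on both sides.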