Seven Years of Good Luck: Splitting Mirrors

The Problem

Imagine you had a nice zfs pool holding all the data for your business application. Regular back-ups are a must, right? But imagine further that you want to back it up without impacting the application, without going directly to the pool and reading the data, without having to incur the overhead of all those additional I/O operations. What's the solution?

Traditionally, it is been the practice to mirror the data locally, break the mirror, and then move the "broken-off" disks to a new machine for backup. This is possible to do with ZFS but, until now, has been very awkward:

Consider the case of a pool, "tank", composed of a mirror of two disks, "c0t0d0" and "c0t1d0". First, remove the disk from the pool:

# zpool offline tank c0t0d0
# zpool detach tank c0t0d0
Then, physically move disk c0t0d0 to a new machine, and use zfs import -f to find it. The -f is necessary because the pool still thinks it's imported on the original machine:
# zpool import -f tank
But even after that, the "tank" pool that was on the new machine still remembered being attached to the other disk, so some clean-up is in order:
# zpool detach tank c0t1d0
A bit awkward, but still possible, right? Now extend that to a pool with many n-way mirrors, and you can see how this is risky and prone to error. And there are certain configs where this won't actually work, which I'll go into detail on later.

In any case, using the technique above, the pool cannot be imported on the same machine because of something called a "Pool GUID". By using the offline/detach sequence means the kernel will find two copies of the GUID: one in the original pool and one in the offlined/detached disk, and this stops the GUID from honouring the "U" bit in its acronym.

The Solution

With the integration of PSARC 2009/511, we've introduced a new command: "zpool split". In the simplest case, zpool split takes two arguments: the existing pool to split disks from, and a name for the new pool. Consider again the "tank" example above. The two disks, c0t0d0 and c0t1d0 are mirrors and each is identical to the other. Running the command:

# zpool split tank vat
will result in two pools: the original pool "tank" with the c0t0d0 disk, and the new pool "vat" with the c0t1d0 disk.

That's it. The c0t1d0 disk can immediately be removed and plugged into a new machine. A "zpool import vat" will find it on the new machine and import it.

Behind the scenes, several things went on: first, zfs evaluated the configuration -- only certain configurations will work -- and chose a disk to split off. Next, the in-memory data was flushed out to the mirror. Incidentally, this is one of many reasons it's REALLY important to have disks that honour the Flush Write Cache command instead of ignore it. After flushing the data out, the disk can be detached from the pool and given a new label with a new Pool GUID. By generating a new pool GUID, zfs allows the pool to be imported on the same machine that it was split from. (See below for more detail).

In fact, there's an option, -R, that tells the split command to go ahead and import it after the split is complete:

# zpool split -R /vatroot tank vat
This command imports the "vat" pool under the altroot directory /vatroot. The only reason for having to specify an altroot is if there are any non-default mountpoints on any of the datasets in "tank". Because "vat" is an exact copy of tank, all the dataset properties will be exactly the same. If all the mountpoints are the default mountpoints (e.g. tank/foo is mounted at /tank/foo), then there is no need for an altroot. However, if dataset tank/foo is mounted at /etc/foo instead, then the "vat" pool's vat/foo dataset will also have a mountpoint of /etc/foo, and they will conflict. So the simplest thing to do for split is to require an altroot if the split-off pool is to be mounted.

By specifying an altroot /vatroot, the dataset vat/foo will instead be mounted under /vatroot/etc/foo, and there will be no conflict. Moreover, when the disk is moved to a new machine, it can be mounted without the need for an altroot, and all the mountpoints will be correct.

The split code has another bit of flexibility in it: you can specify which disks to split off. Normally the split code simply choose the "last" disk in each mirror to use for the new pool. But if there are specific disks that you're planning on moving to the new machine, you simply put that on the command line. For example, let's say you had created a pool with the command: "zpool create tank mirror c0t0d0 c1t0d0 c2t0d0 mirror c0t1d0 c1t1d0 c2t1d0". If you just run "zpool split tank vat", then the "vat" pool will be composed of c2t0d0 and c2t1d0, leaving the remaining four disks as part of the "tank" pool.

Let's say you wanted to use controller 1's disks to move. The command would instead be:

# zpool split tank vat c1t0d0 c1t1d0
and the split code will use those disks instead. The "vat" pool will get c1t0d0 and c1t1d0, leaving the other four as part of "tank".

To verify this before doing any actual splitting, you can use the -n option: this option goes through all the configuration validation that is normally done for zpool split, but does not do the actual split. Instead, this displays what the new pool would look like if the split command were to succeed.

The Gory Details

So is it that simple, then? Just offline and detach the mirrors, and give them a new pool GUID? Not quite, alas. Not if you want to do it right. We need to contend with real-world problems, and therefore we cannot assume that all operations succeed, or even that the machine doesn't die part way through the process. The typical way to handle these situations is to use something known as a "three-phase commit", which boils down to: (1) state your intentions, (2) perform the operation, and (3) remove the statement of intentions. That way, after any one of the phases, your system is in a known state, and you can either roll back to the previous state, or forge ahead and complete the task.

For split, these steps are: (1) create a list of disks being split off, and the offline them, (2) create a new pool using those offlined disks, and (3) go back to the original pool and detach the disks. If we die after step 1, then on resume we know we can just change the disks back online, and we've successfully rolled back. If we die after step 2, the new pool is already created, so all we have to do when we start back up again is complete step 3.

The obvious question to ask is: what happens if we die part-way through one of the steps? That's where ZFS's transaction model saves us: we only commit operations at the end of each step, so if we die part-way through a step, it's as if that step never got started. However, there are two things that are not covered by the transaction model and, unfortunately, the split code is required to touch one of them explicitly: the vdev label. The other non-transactional block of data is the /etc/zfs/zpool.cache file, which we don't have to worry about for splitting, because the innards of the kernel handles that for us.

What the vdev label holds is a number of pieces of information, and these include the pool to which it belongs, and the other members of the top-level vdev to which it belongs. The vdev label is not part of the pool data. It resides outside the pool, and is written to separately from the pool. In order to keep the label from being corrupt, not only does it get checksummed, but it gets written in four different locations on the disk. If the zfs kernel has to change the vdev configuration, a new nvlist is generated with all the in-memory configuration information, and then that is written out on the next sync operation to all four label locations. The bottom line is that it is possible for the vdev label's idea of the configuration to be out of sync with the configuration stored within the pool, known as the "spa config", depending on the timing of the writes and when the disk loses power.

So, for step (1), as stated above we offline the disks and generate a list of disks to be split off. This list is written to the spa config. If we die at this point then, on reboot, the vdev labels have not yet been updated, so the split is incomplete. The remedy is to throw away the list and put the disks back online.

For step (2) we update the vdev labels on the offlined disks, and generate a new spa config for them. If we die at this point, then, on reboot, we see the vdev labels are updated, and we remove the disks from the original pool in order to compelte the split.

For step (3) we just clean things up, removing the list of disks from the spa config, and removing the disks themselves from the in-core data structures.

So how does this actually work? You can see the heart of the code in spa.c, with the new function spa_vdev_split_mirror.

One tricky area is how we deal with log devices. Log (and "cache" and "spare") disks are not part of the split and, normally this would not be a problem. However, due to the way block pointers work, it is possible to generate a configuration that cannot be easily split. Consider the following sequence of commands:

# zpool create tank mirror c0t0d0 c1t0d0
# zpool add tank log c0t0d1
# zpool add tank mirror c0t1d0 c1t1d0

For the first line, zfs creates a new pool. The mirror is the top-level vdev, and it has a vdev ID of 0. The second line adds a log device. This device is also a top-level vdev, and gets an ID of 1. Finally, the third line adds a mirror as a new stripe to the pool. It gets a vdev ID of 2. A zpool status command confirms this numbering:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
          c0t1d0    ONLINE       0     0     0

errors: No known data errors

What this means is that any block pointer pointing to data on the second stripe is going to have a vdev ID of 2. If the second and third commands were swapped -- in other words, if the stripe was added before the log device was added -- then the stripe would be "mirror-1" instead of "mirror-2", and the vdev ID would be 1. But we need to be able to handle both cases.

Why is this important? After a split, there will only be two stripes and no logs. Somehow, we need to ensure that these are vdev ID 0 and vdev ID 2, or all the block pointers that point to ID 2 will result in panicking zfs. How do we tell zfs to skip over ID 1?

With George Wilson's putback of CR 6574286, he introduced the concept of "holes". A hole is a top-level vdev that cannot be allocated from, and provides no data, but it takes up a slot in the vdev list. This made it possible for log devices to be removed, with the hole device taking over. The split code leverages this feature to do its form of log device removal, inserting holes in the new config wherever log devices are. And of course, it's smart about it: if the log device is the last device in the configuration, there's no need to put in a hole. This is done in libzfs, in the new function zpool_vdev_split. Look how the "lastlog" variable is used.

And that's it. That's splitting in a nutshell. Or at least a few nutshells. I'm guessing the information density in this post is pretty high, but splitting up zpools is rather complex. There are still things it doesn't do that would be nice to do in the future, such as splitting off mirrors from mirrors, or even rejoining a split config to its parent. I hope I get to work on those.


It should be zpool import -f instead of zfs import....

It is a cool feature! Comparable to Veritas SFS5.0 MP3 boot disk snapshot/resize feature. Is there any document already available to use this feature in boot disk management?

Posted by kevin on January 05, 2010 at 10:09 AM GMT #

Oops fixed. Thanks. Docs are forthcoming.

Posted by Mark Musante on January 05, 2010 at 10:23 AM GMT #

Since we got this far, is it possible to add another feature to zpool similar to
"mksnap" option as in SFS5.0:

"vxrootadm -g <src-dg> mksnap <target-disks> <target-dg>"

Posted by kevin on January 05, 2010 at 10:42 AM GMT #

Post a Comment:
Comments are closed for this entry.

Known throughout Sun as a man of infinite wit, of jovial attitude, and of making things up about himself at the slightest whim.


« January 2017