Tuesday Jan 05, 2010

Seven Years of Good Luck: Splitting Mirrors

The Problem

Imagine you had a nice zfs pool holding all the data for your business application. Regular back-ups are a must, right? But imagine further that you want to back it up without impacting the application, without going directly to the pool and reading the data, without having to incur the overhead of all those additional I/O operations. What's the solution?

Traditionally, it has been the practice to mirror the data locally, break the mirror, and then move the "broken-off" disks to a new machine for backup. This is possible to do with ZFS but, until now, has been very awkward:

Consider the case of a pool, "tank", composed of a mirror of two disks, "c0t0d0" and "c0t1d0". First, offline and detach one of the disks from the pool:

# zpool offline tank c0t0d0
# zpool detach tank c0t0d0
Then, physically move disk c0t0d0 to a new machine, and use zpool import -f to find it. The -f is necessary because the pool still thinks it's imported on the original machine:
# zpool import -f tank
But even after that, the "tank" pool that was on the new machine still remembered being attached to the other disk, so some clean-up is in order:
# zpool detach tank c0t1d0
A bit awkward, but still possible, right? Now extend that to a pool with many n-way mirrors, and you can see how this is risky and prone to error. And there are certain configs where this won't actually work, which I'll go into detail on later.

In any case, using the technique above, the pool cannot be imported on the same machine because of something called a "Pool GUID". Using the offline/detach sequence means the kernel will find two copies of the same GUID: one on the original pool and one on the offlined/detached disk, which stops the GUID from honouring the "U" in its acronym.

The Solution

With the integration of PSARC 2009/511, we've introduced a new command: "zpool split". In the simplest case, zpool split takes two arguments: the existing pool to split disks from, and a name for the new pool. Consider again the "tank" example above. The two disks, c0t0d0 and c0t1d0, are mirrored, so each holds an identical copy of the data. Running the command:

# zpool split tank vat
will result in two pools: the original pool "tank" with the c0t0d0 disk, and the new pool "vat" with the c0t1d0 disk.

That's it. The c0t1d0 disk can immediately be removed and plugged into a new machine. A "zpool import vat" will find it on the new machine and import it.

Behind the scenes, several things went on: first, zfs evaluated the configuration -- only certain configurations will work -- and chose a disk to split off. Next, the in-memory data was flushed out to the mirror. Incidentally, this is one of many reasons it's REALLY important to have disks that honour the Flush Write Cache command instead of ignoring it. After flushing the data out, the disk can be detached from the pool and given a new label with a new Pool GUID. By generating a new pool GUID, zfs allows the pool to be imported on the same machine it was split from. (See below for more detail.)

In fact, there's an option, -R, that tells the split command to go ahead and import it after the split is complete:

# zpool split -R /vatroot tank vat
This command imports the "vat" pool under the altroot directory /vatroot. The only reason an altroot is needed is if any of the datasets in "tank" have non-default mountpoints. Because "vat" is an exact copy of tank, all the dataset properties will be exactly the same. If all the mountpoints are the defaults (e.g. tank/foo is mounted at /tank/foo), then there is no need for an altroot. However, if dataset tank/foo is mounted at /etc/foo instead, then the "vat" pool's vat/foo dataset will also have a mountpoint of /etc/foo, and the two will conflict. So the simplest thing for split to do is require an altroot whenever the split-off pool is to be mounted.

By specifying an altroot /vatroot, the dataset vat/foo will instead be mounted under /vatroot/etc/foo, and there will be no conflict. Moreover, when the disk is moved to a new machine, it can be mounted without the need for an altroot, and all the mountpoints will be correct.
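Here's a quick sketch of that scenario; tank/foo with its /etc/foo mountpoint is a hypothetical dataset, and the comments describe the expected behaviour rather than captured output:

# zfs set mountpoint=/etc/foo tank/foo    # hypothetical dataset with a non-default mountpoint
# zpool split -R /vatroot tank vat        # vat/foo keeps the same mountpoint property...
# df -h /vatroot/etc/foo                  # ...but is actually mounted under the altroot, clear of tank/foo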

The split code has another bit of flexibility: you can specify which disks to split off. Normally the split code simply chooses the "last" disk in each mirror for the new pool. But if there are specific disks that you're planning on moving to the new machine, you can simply list them on the command line. For example, let's say you had created a pool with the command: "zpool create tank mirror c0t0d0 c1t0d0 c2t0d0 mirror c0t1d0 c1t1d0 c2t1d0". If you just run "zpool split tank vat", then the "vat" pool will be composed of c2t0d0 and c2t1d0, leaving the remaining four disks as part of the "tank" pool.

Let's say you wanted to move controller 1's disks instead. The command would be:

# zpool split tank vat c1t0d0 c1t1d0
and the split code will use those disks instead. The "vat" pool will get c1t0d0 and c1t1d0, leaving the other four as part of "tank".

To verify this before doing any actual splitting, you can use the -n option: it goes through all the configuration validation normally done for zpool split, but stops short of the split itself, instead displaying what the new pool would look like if the command were to succeed.
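For example, a dry run of the controller-1 split above prints the proposed layout without touching the pool (the output shown here is approximate):

# zpool split -n tank vat c1t0d0 c1t1d0
would create 'vat' with the following layout:

        vat
          c1t0d0
          c1t1d0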

The Gory Details

So is it that simple, then? Just offline and detach the mirrors, and give them a new pool GUID? Not quite, alas. Not if you want to do it right. We need to contend with real-world problems, and therefore we cannot assume that all operations succeed, or even that the machine doesn't die part way through the process. The typical way to handle these situations is to use something known as a "three-phase commit", which boils down to: (1) state your intentions, (2) perform the operation, and (3) remove the statement of intentions. That way, after any one of the phases, your system is in a known state, and you can either roll back to the previous state, or forge ahead and complete the task.

For split, these steps are: (1) create a list of the disks being split off, and then offline them, (2) create a new pool using those offlined disks, and (3) go back to the original pool and detach the disks. If we die after step 1, then on resume we know we can just bring the disks back online, and we've successfully rolled back. If we die after step 2, the new pool is already created, so all we have to do when we start back up again is complete step 3.

The obvious question to ask is: what happens if we die part-way through one of the steps? That's where ZFS's transaction model saves us: we only commit operations at the end of each step, so if we die part-way through a step, it's as if that step never got started. However, there are two things that are not covered by the transaction model, and unfortunately the split code is required to touch one of them explicitly: the vdev label. The other non-transactional block of data is the /etc/zfs/zpool.cache file, which we don't have to worry about for splitting, because the kernel innards handle that for us.

The vdev label holds a number of pieces of information, including the pool to which the disk belongs and the other members of its top-level vdev. The vdev label is not part of the pool data: it resides outside the pool and is written separately from it. To keep the label from becoming corrupt, not only does it get checksummed, it also gets written to four different locations on the disk. If the zfs kernel has to change the vdev configuration, a new nvlist is generated from the in-memory configuration information, and that is written out to all four label locations on the next sync operation. The bottom line is that it is possible for the vdev label's idea of the configuration to be out of sync with the configuration stored within the pool, known as the "spa config", depending on the timing of the writes and when the disk loses power.
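If you want to see a label for yourself, zdb can dump all four copies from a device (substitute the appropriate slice for your own disk):

# zdb -l /dev/dsk/c0t0d0s0    # prints LABEL 0 through LABEL 3, each with the pool guid and the vdev tree nvlist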

So, for step (1), as stated above we offline the disks and generate a list of disks to be split off. This list is written to the spa config. If we die at this point then, on reboot, the vdev labels have not yet been updated, so the split is incomplete. The remedy is to throw away the list and put the disks back online.

For step (2) we update the vdev labels on the offlined disks, and generate a new spa config for them. If we die at this point, then, on reboot, we see the vdev labels are updated, and we remove the disks from the original pool in order to complete the split.

For step (3) we just clean things up, removing the list of disks from the spa config, and removing the disks themselves from the in-core data structures.

So how does this actually work? You can see the heart of the code in spa.c, with the new function spa_vdev_split_mirror.

One tricky area is how we deal with log devices. Log (and "cache" and "spare") disks are not part of the split, and normally this would not be a problem. However, due to the way block pointers work, it is possible to create a configuration that cannot be easily split. Consider the following sequence of commands:

# zpool create tank mirror c0t0d0 c1t0d0
# zpool add tank log c0t0d1
# zpool add tank mirror c0t1d0 c1t1d0

For the first line, zfs creates a new pool. The mirror is the top-level vdev, and it has a vdev ID of 0. The second line adds a log device. This device is also a top-level vdev, and gets an ID of 1. Finally, the third line adds a mirror as a new stripe to the pool. It gets a vdev ID of 2. A zpool status command confirms this numbering:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
        logs
          c0t0d1    ONLINE       0     0     0

errors: No known data errors
#

What this means is that any block pointer pointing to data on the second stripe is going to have a vdev ID of 2. If the second and third commands were swapped -- in other words, if the stripe was added before the log device was added -- then the stripe would be "mirror-1" instead of "mirror-2", and the vdev ID would be 1. But we need to be able to handle both cases.

Why is this important? After a split, there will be only two stripes and no logs. Somehow, we need to ensure that these keep vdev IDs 0 and 2, or all the block pointers that point to ID 2 will cause zfs to panic. How do we tell zfs to skip over ID 1?

With George Wilson's putback of CR 6574286, he introduced the concept of "holes". A hole is a top-level vdev that cannot be allocated from and provides no data, but it takes up a slot in the vdev list. This made it possible for log devices to be removed, with a hole device taking over the slot. The split code leverages this feature to do its own form of log device removal, inserting holes in the new config wherever log devices are. And of course, it's smart about it: if the log device is the last device in the configuration, there's no need to put in a hole. This is done in libzfs, in the new function zpool_vdev_split. Look at how the "lastlog" variable is used.
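As a hedged illustration using the log-at-ID-1 pool from the commands above: after the split, a hole quietly occupies slot 1 in the new pool, so the disk that came from mirror-2 keeps vdev ID 2 (the hole itself doesn't show up in zpool status):

# zpool split tank vat    # takes c1t0d0 and c1t1d0; the log device stays with tank
# zpool import vat        # re-import it here or on another machine
# zpool status vat        # shows c1t0d0 and c1t1d0 as the two top-level data vdevs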

And that's it. That's splitting in a nutshell. Or at least a few nutshells. I'm guessing the information density in this post is pretty high, but splitting up zpools is rather complex. There are still things it doesn't do that would be nice to do in the future, such as splitting off mirrors from mirrors, or even rejoining a split config to its parent. I hope I get to work on those.

Tuesday Dec 08, 2009

Backing up a zvol

Over at Spiceworks, Michael2024 asks, "Anybody know how to get rsync to backup a ZFS zvol?"

My response is: "That's the wrong question." In fact, someone replied to Michael2024 already saying that rsync was not the right tool, but no one suggested the best tool for backing up zvols: snapshots.

"But Mark," you say (because we're on first-name terms, and that is in fact my first name). "The snapshot is right there on the device that I'm trying to back up! How can that possibly help me?"

I'm glad you asked.

If you try to "back up" a zvol using a tool like dd, you're going to have to copy the whole volume, even the blocks that contain no data. But zvols are ZFS constructs, which means they follow the copy-on-write paradigm, which in turn means that ZFS knows exactly what's data and what's not.

So that means that any snapshot will only contain the data that is actually on the disk. That's right: a snapshot of a 100TB volume that has 10MB of data will only contain those 10MB of data. And therefore, any "zfs send" stream will only contain real data and not a bunch of unwritten garbage.

To demonstrate, let's create a 100MB volume and snapshot it:

-bash-4.0# zfs create -V 100m tank/vol
-bash-4.0# zfs snapshot tank/vol@snap
How big is the send stream? Easy enough to check:
-bash-4.0# zfs send tank/vol@snap | wc -c
4256
Just a smidge over 4k. Let's write some data:
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=10
10+0 records in
10+0 records out
-bash-4.0# zfs snapshot tank/vol@snap2
-bash-4.0# zfs send tank/vol@snap2 | wc -c
21264
OK, we wrote 10k of data, and the send stream is about 20k. With such a small amount of data, the overhead is about half the stream. But what if we write to the same blocks again?
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=10
10+0 records in
10+0 records out
-bash-4.0# zfs snapshot tank/vol@snap3
-bash-4.0# zfs send tank/vol@snap3 | wc -c
21264
The exact same amount! So ZFS knows exactly how much data there is on the zvol. Let's write 1MB instead:
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=1024
1024+0 records in
1024+0 records out
-bash-4.0# zfs snapshot tank/vol@snap4
-bash-4.0# zfs send tank/vol@snap4 | wc -c
1092768
-bash-4.0#
And now the overhead is quite a bit smaller than the data, around 4%.

The question then is: which is more efficient? Doing a full block-by-block copy using something "higher up in the stack" (quoting from Michael2024 there), or creating another pool and doing a "zfs send | zfs recv"? On top of that, add the under-appreciated feature of incremental send streams, and you have a full backup solution that does not require any external tools.
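Here's a minimal sketch of that approach; the "backup" pool and its disk c3t0d0 are hypothetical:

# zpool create backup c3t0d0                                # hypothetical second pool to receive into
# zfs send tank/vol@snap4 | zfs recv backup/vol             # full stream seeds the copy
# zfs snapshot tank/vol@snap5
# zfs send -i @snap4 tank/vol@snap5 | zfs recv backup/vol   # incremental: only blocks changed since snap4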

I would respond on the Spiceworks website, but alas the site is members-only and requires you to download a Windows client just to register. Lame!

Thursday Mar 26, 2009

Cloning Humans: Bad; Cloning Filesystems: Good

Imagine you could snapshot yourself and generate a clone from that snapshot. On the surface, that would seem like a cool idea, but scratch the surface and you quickly run into problems. These problems could be ethical (what rights does your clone have to your own possessions?), or they could be practical (how can the planet sustain the population explosion?). Fortunately, with zfs datasets, snapshotting and cloning not only are permitted, they are actively encouraged.

Over at the Xerces Tech site, a recent article outlines how to use zfs clones in order to safely do an apt-get on their Nexenta box. Take a look at the article called Unbreakable upgrades, ZFS and apt-get to see how it's done.

Wednesday Jan 28, 2009

Building a ZFS server

Have you ever wanted to try building your own ZFS-based file server? Jermaine Maree has done just that, and blogged about it. Start with part 1.

Thursday Jan 01, 2009

Don't shout at your JBODs

They don't like it!

Tuesday Nov 18, 2008

ZFS can boost performance

Even a suboptimal configuration can result in a performance boost. The most interesting thing, I think, is the ease with which the zpool was created.

I wonder what kind of performance numbers this user would see with the 7110 compared with the Dell PowerVault. The 7110 can hold 14 146GB SAS drives, whereas the Dell uses 14 146GB SCSI drives, so comparing power utilization would be interesting as well.

Wednesday Oct 22, 2008

ZFS User Directories on OS X

Check out this blog post on setting up OS X with zfs-based user directories.

Monday Jun 16, 2008

ZFS In The Wild, Part 5

It's been over a year since I last posted sightings of ZFS around the web, so it's high time I offered another list.

Thursday Jun 12, 2008

Time Flies

It was nearly a year ago that I first made this screenshot:

Since then, I have done quite a number of different things, all related in some way to getting zfs to install and boot. Some of these things also involved teaching Live Upgrade to understand zfs datasets.

But now I'm starting to see that screenshot elsewhere, virtually unchanged from that fateful day long ago when I used the original to help design the changes needed for the text based installer.

Here are some: The Sect of Rama | Number 9 | Otmanix' Blog | Osamu Sayama's Weblog

It's really exciting to see it get out there and for it to be used outside of the development and test teams. Of course, there are some CRs being filed, and there are some things we'll need to address, but it's great nevertheless.

Wednesday Jan 02, 2008

That Seattle Fireworks thing...

It's too bad they weren't using a filesystem that could detect and automatically repair corruption. Ah well.

Tuesday Oct 16, 2007

ZFS and automatically growing pools

The question of replacing disks in ZFS pools comes up every so often. The most common question is whether ZFS will see the larger size when bigger disks replace smaller ones. Let's go through an example:

First, we'll create some files to use as pool storage, and create a zpool out of the smaller two.

bash-3.00# mkfile 64m /var/tmp/a0 /var/tmp/b0
bash-3.00# mkfile 128m /var/tmp/a1 /var/tmp/b1
bash-3.00# zpool create tank /var/tmp/a0 /var/tmp/b0
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   119M   111K   119M     0%  ONLINE  -
bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

	NAME           STATE     READ WRITE CKSUM
	tank           ONLINE       0     0     0
	  /var/tmp/a0  ONLINE       0     0     0
	  /var/tmp/b0  ONLINE       0     0     0

errors: No known data errors

Here we've striped a pair of 64MB files for our pool. Now we'll replace the two disks in our stripe with their 128MB counterparts:

bash-3.00# zpool replace tank /var/tmp/a0 /var/tmp/a1
bash-3.00# zpool replace tank /var/tmp/b0 /var/tmp/b1

We wait a few moments, and then check to see that we're done:

bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Oct 15 15:47:58 2007
config:

	NAME           STATE     READ WRITE CKSUM
	tank           ONLINE       0     0     0
	  /var/tmp/a1  ONLINE       0     0     0
	  /var/tmp/b1  ONLINE       0     0     0

errors: No known data errors

Everything seems to have gone well, and the resilvering is complete. Let's take a look at the pool now:

bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   247M   231K   247M     0%  ONLINE  -

This shows that it works with stripes. Will it work with raidz? Let's create a few more files and test.

bash-3.00# mkfile 64m /var/tmp/c0 /var/tmp/d0
bash-3.00# mkfile 128m /var/tmp/c1 /var/tmp/d1
bash-3.00# zpool destroy tank
bash-3.00# zpool create tank raidz /var/tmp/a0 /var/tmp/b0 /var/tmp/c0 /var/tmp/d0
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   238M   177K   238M     0%  ONLINE  -
bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

	NAME             STATE     READ WRITE CKSUM
	tank             ONLINE       0     0     0
	  raidz1         ONLINE       0     0     0
	    /var/tmp/a0  ONLINE       0     0     0
	    /var/tmp/b0  ONLINE       0     0     0
	    /var/tmp/c0  ONLINE       0     0     0
	    /var/tmp/d0  ONLINE       0     0     0

errors: No known data errors

And now do the replace:

bash-3.00# for f in a b c d; do zpool replace tank /var/tmp/${f}0 /var/tmp/${f}1; done

We wait a little bit for the resilver to complete, and then check the status and size:

bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: resilver completed with 0 errors on Tue Oct 16 08:01:00 2007
config:

	NAME             STATE     READ WRITE CKSUM
	tank             ONLINE       0     0     0
	  raidz1         ONLINE       0     0     0
	    /var/tmp/a1  ONLINE       0     0     0
	    /var/tmp/b1  ONLINE       0     0     0
	    /var/tmp/c1  ONLINE       0     0     0
	    /var/tmp/d1  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   238M   408K   238M     0%  ONLINE  -

OK, so that didn't exactly work. The device list is correct, but the size is the same. Let's try export-import to see if that will allow ZFS to see the new size:

bash-3.00# zpool export tank
bash-3.00# zpool import -d /var/tmp tank
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   494M   189K   494M     0%  ONLINE  -
bash-3.00# 

And it works! Of course, if you've got shared filesystems or volumes, via nfs or iscsi, exporting and reimporting is a bit trickier - you'd need to wait until your users have gone home for the day, or just reboot the machine (which does an implicit export/import). It'd be nice if this could happen automatically, as in the striping case above. A bug has been written for this (6606879).

The final case is mirroring:

bash-3.00# zpool destroy tank
bash-3.00# zpool create tank mirror /var/tmp/a0 /var/tmp/b0
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank  59.5M    94K  59.4M     0%  ONLINE  -
bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

	NAME             STATE     READ WRITE CKSUM
	tank             ONLINE       0     0     0
	  mirror         ONLINE       0     0     0
	    /var/tmp/a0  ONLINE       0     0     0
	    /var/tmp/b0  ONLINE       0     0     0

errors: No known data errors

OK, now we'll do the replace:

bash-3.00# zpool replace tank /var/tmp/a0 /var/tmp/a1
bash-3.00# zpool replace tank /var/tmp/b0 /var/tmp/b1
bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Oct 15 16:09:10 2007
config:

	NAME             STATE     READ WRITE CKSUM
	tank             ONLINE       0     0     0
	  mirror         ONLINE       0     0     0
	    /var/tmp/a1  ONLINE       0     0     0
	    /var/tmp/b1  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank  59.5M   218K  59.3M     0%  ONLINE  -

The size is still 59.5M. As in the raidz case above, this will take an export/import in order to effect the size change:

bash-3.00# zpool export tank
bash-3.00# zpool import -d /var/tmp tank
bash-3.00# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

	NAME             STATE     READ WRITE CKSUM
	tank             ONLINE       0     0     0
	  mirror         ONLINE       0     0     0
	    /var/tmp/a1  ONLINE       0     0     0
	    /var/tmp/b1  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zpool list
NAME   SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
tank   124M   116K   123M     0%  ONLINE  -
bash-3.00# 

To summarise: for plain stripes, also known as RAID-0, ZFS can automatically grow the pool after a replace. For mirroring (a.k.a. RAID-1) and raidz/raidz2 (an improved RAID-5/6), you need to export and reimport (or reboot) to get the new size until 6606879 is fixed.

Thursday Sep 13, 2007

Data Recovery Done Right

Phase One, in which Alec gets his data back after a terror-inducing message.

Lesson learned: Don't Panic!

Tuesday Aug 28, 2007

ZFS and ease of use

jamesd_wi likes how easy it is to use zfs.

Tuesday Aug 07, 2007

Mason on Btrfs

Over on liquidat's wordpress blog, Chris Mason talks about how work on Btrfs is progressing.