Thursday Jun 30, 2011

ZFS Basics

Stage 1 basics: creating a pool

# zpool create $NAME $REDUNDANCY $DISK1_0..N [$REDUNDANCY $DISK2_0..N]...

$NAME = name of the pool you're creating. This will also be the name of the first filesystem which, by default, is mounted at "/$NAME".

$REDUNDANCY = either mirror or raidzN, and N can be 1, 2, or 3. If you leave N off, then it defaults to 1.

$DISK1_0..N = the disks assigned to the pool.

Example 1: zpool create tank mirror c4t1d0 c4t2d0

name of pool: tank

redundancy: mirroring

disks being mirrored: c4t1d0 and c4t2d0

Capacity: size of a single disk

Example 2: zpool create tank raidz c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0

Here the redundancy is raidz, and there are five disks, in a 4+1 (4 data, 1 parity) config. This means that the capacity is 4 times the disk size. If the command used "raidz2" instead, then the config would be 3+2. Likewise, "raidz3" would be a 2+3 config.
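As a quick check on the arithmetic above, here is a small shell sketch. The function name raidz_capacity is my own invention, not a ZFS command, and it ignores metadata and allocation overhead, so treat the result as an upper bound rather than what the pool will actually report.

```shell
# Rough usable capacity of one raidz group: (disks - parity) * disk size.
# usage: raidz_capacity DISKS PARITY DISK_SIZE
raidz_capacity() (
    disks=$1 parity=$2 size=$3
    if [ "$parity" -lt 1 ] || [ "$parity" -gt 3 ]; then
        echo "parity must be 1, 2, or 3" >&2
        exit 1
    fi
    if [ "$disks" -le "$parity" ]; then
        echo "need more disks than parity devices" >&2
        exit 1
    fi
    echo $(( (disks - parity) * size ))
)

# Five 1000 GB disks, as in Example 2:
raidz_capacity 5 1 1000   # raidz  (4+1) prints 4000
raidz_capacity 5 2 1000   # raidz2 (3+2) prints 3000
raidz_capacity 5 3 1000   # raidz3 (2+3) prints 2000
```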

Example 3: zpool create tank mirror c4t1d0 c4t2d0 mirror c4t3d0 c4t4d0

This is the same as the first mirror example, except there are two mirrors now. ZFS will stripe data across both mirrors, which means that writing data will go a bit faster.

Note: you cannot create a mirror of two raidzs. You can create a raidz of mirrors, but to do that requires trickery.

Friday Mar 18, 2011


#pragma redirect

Friday Mar 12, 2010

Ten years ahoy

Tomorrow, the 13th, would have been my ten-year anniversary at Sun Microsystems. Further bulletins as events warrant.

Tuesday Jan 05, 2010

Seven Years of Good Luck: Splitting Mirrors

The Problem

Imagine you had a nice zfs pool holding all the data for your business application. Regular back-ups are a must, right? But imagine further that you want to back it up without impacting the application, without going directly to the pool and reading the data, without having to incur the overhead of all those additional I/O operations. What's the solution?

Traditionally, it has been the practice to mirror the data locally, break the mirror, and then move the "broken-off" disks to a new machine for backup. This is possible to do with ZFS but, until now, has been very awkward:

Consider the case of a pool, "tank", composed of a mirror of two disks, "c0t0d0" and "c0t1d0". First, remove the disk from the pool:

# zpool offline tank c0t0d0
# zpool detach tank c0t0d0
Then, physically move disk c0t0d0 to a new machine, and use zpool import -f to find it. The -f is necessary because the pool still thinks it's imported on the original machine:
# zpool import -f tank
But even after that, the "tank" pool on the new machine still remembers being attached to the other disk, so some clean-up is in order:
# zpool detach tank c0t1d0
A bit awkward, but still possible, right? Now extend that to a pool with many n-way mirrors, and you can see how this is risky and prone to error. And there are certain configs where this won't actually work, which I'll go into detail on later.

In any case, using the technique above, the pool cannot be imported on the same machine because of something called a "Pool GUID". Using the offline/detach sequence means the kernel will find two copies of the GUID: one in the original pool and one in the offlined/detached disk, and this stops the GUID from honouring the "U" bit in its acronym.

The Solution

With the integration of PSARC 2009/511, we've introduced a new command: "zpool split". In the simplest case, zpool split takes two arguments: the existing pool to split disks from, and a name for the new pool. Consider again the "tank" example above. The two disks, c0t0d0 and c0t1d0 are mirrors and each is identical to the other. Running the command:

# zpool split tank vat
will result in two pools: the original pool "tank" with the c0t0d0 disk, and the new pool "vat" with the c0t1d0 disk.

That's it. The c0t1d0 disk can immediately be removed and plugged into a new machine. A "zpool import vat" will find it on the new machine and import it.

Behind the scenes, several things went on: first, zfs evaluated the configuration -- only certain configurations will work -- and chose a disk to split off. Next, the in-memory data was flushed out to the mirror. Incidentally, this is one of many reasons it's REALLY important to have disks that honour the Flush Write Cache command instead of ignoring it. After flushing the data out, the disk can be detached from the pool and given a new label with a new Pool GUID. By generating a new pool GUID, zfs allows the pool to be imported on the same machine that it was split from. (See below for more detail).

In fact, there's an option, -R, that tells the split command to go ahead and import it after the split is complete:

# zpool split -R /vatroot tank vat
This command imports the "vat" pool under the altroot directory /vatroot. The only reason for having to specify an altroot is if there are any non-default mountpoints on any of the datasets in "tank". Because "vat" is an exact copy of tank, all the dataset properties will be exactly the same. If all the mountpoints are the default mountpoints (e.g. tank/foo is mounted at /tank/foo), then there is no need for an altroot. However, if dataset tank/foo is mounted at /etc/foo instead, then the "vat" pool's vat/foo dataset will also have a mountpoint of /etc/foo, and they will conflict. So the simplest thing to do for split is to require an altroot if the split-off pool is to be mounted.

By specifying an altroot /vatroot, the dataset vat/foo will instead be mounted under /vatroot/etc/foo, and there will be no conflict. Moreover, when the disk is moved to a new machine, it can be mounted without the need for an altroot, and all the mountpoints will be correct.
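The altroot rule is simple enough to model in a few lines of shell: the altroot is prepended to each dataset's mountpoint. The function name effective_mountpoint is hypothetical, purely for illustration, and this only mimics the path arithmetic; it doesn't touch any pool.

```shell
# Where a dataset ends up mounted when its pool is imported with an altroot (-R).
# usage: effective_mountpoint ALTROOT MOUNTPOINT
effective_mountpoint() (
    altroot=${1%/}      # drop one trailing slash so "/vatroot/" and "/vatroot" agree
    mountpoint=$2
    echo "${altroot}${mountpoint}"
)

effective_mountpoint /vatroot /etc/foo    # prints /vatroot/etc/foo
effective_mountpoint /vatroot /vat/foo    # prints /vatroot/vat/foo
```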

The split code has another bit of flexibility in it: you can specify which disks to split off. Normally the split code simply chooses the "last" disk in each mirror to use for the new pool. But if there are specific disks that you're planning on moving to the new machine, you simply put them on the command line. For example, let's say you had created a pool with the command: "zpool create tank mirror c0t0d0 c1t0d0 c2t0d0 mirror c0t1d0 c1t1d0 c2t1d0". If you just run "zpool split tank vat", then the "vat" pool will be composed of c2t0d0 and c2t1d0, leaving the remaining four disks as part of the "tank" pool.

Let's say you wanted to use controller 1's disks to move. The command would instead be:

# zpool split tank vat c1t0d0 c1t1d0
and the split code will use those disks instead. The "vat" pool will get c1t0d0 and c1t1d0, leaving the other four as part of "tank".

To verify this before doing any actual splitting, you can use the -n option: this option goes through all the configuration validation that is normally done for zpool split, but does not do the actual split. Instead, this displays what the new pool would look like if the split command were to succeed.
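If you want to reason about the default choice without a pool to hand, the "last disk in each mirror" rule described above can be modelled in shell. This is only a sketch of the rule as stated in this post, not the actual split code, and default_split_disks is a made-up name.

```shell
# Model of the default choice: take the last disk listed in each mirror.
# Each argument is one mirror, given as a space-separated list of disks.
default_split_disks() (
    picked=
    for mirror in "$@"; do
        last=
        for disk in $mirror; do
            last=$disk      # ends up holding the final disk of this mirror
        done
        picked="${picked:+$picked }$last"
    done
    echo "$picked"
)

default_split_disks "c0t0d0 c1t0d0 c2t0d0" "c0t1d0 c1t1d0 c2t1d0"
# prints: c2t0d0 c2t1d0
```

On a real system, "zpool split -n tank vat" is still the authoritative answer, since it runs the full configuration validation.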

The Gory Details

So is it that simple, then? Just offline and detach the mirrors, and give them a new pool GUID? Not quite, alas. Not if you want to do it right. We need to contend with real-world problems, and therefore we cannot assume that all operations succeed, or even that the machine doesn't die part way through the process. The typical way to handle these situations is to use something known as a "three-phase commit", which boils down to: (1) state your intentions, (2) perform the operation, and (3) remove the statement of intentions. That way, after any one of the phases, your system is in a known state, and you can either roll back to the previous state, or forge ahead and complete the task.

For split, these steps are: (1) create a list of disks being split off, and then offline them, (2) create a new pool using those offlined disks, and (3) go back to the original pool and detach the disks. If we die after step 1, then on resume we know we can just bring the disks back online, and we've successfully rolled back. If we die after step 2, the new pool is already created, so all we have to do when we start back up again is complete step 3.

The obvious question to ask is: what happens if we die part-way through one of the steps? That's where ZFS's transaction model saves us: we only commit operations at the end of each step, so if we die part-way through a step, it's as if that step never got started. However, there are two things that are not covered by the transaction model and, unfortunately, the split code is required to touch one of them explicitly: the vdev label. The other non-transactional block of data is the /etc/zfs/zpool.cache file, which we don't have to worry about for splitting, because the innards of the kernel handle that for us.

What the vdev label holds is a number of pieces of information, and these include the pool to which it belongs, and the other members of the top-level vdev to which it belongs. The vdev label is not part of the pool data. It resides outside the pool, and is written to separately from the pool. In order to keep the label from being corrupt, not only does it get checksummed, but it gets written in four different locations on the disk. If the zfs kernel has to change the vdev configuration, a new nvlist is generated with all the in-memory configuration information, and then that is written out on the next sync operation to all four label locations. The bottom line is that it is possible for the vdev label's idea of the configuration to be out of sync with the configuration stored within the pool, known as the "spa config", depending on the timing of the writes and when the disk loses power.

So, for step (1), as stated above we offline the disks and generate a list of disks to be split off. This list is written to the spa config. If we die at this point then, on reboot, the vdev labels have not yet been updated, so the split is incomplete. The remedy is to throw away the list and put the disks back online.

For step (2) we update the vdev labels on the offlined disks, and generate a new spa config for them. If we die at this point, then, on reboot, we see the vdev labels are updated, and we remove the disks from the original pool in order to complete the split.

For step (3) we just clean things up, removing the list of disks from the spa config, and removing the disks themselves from the in-core data structures.
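The recovery rules in the three steps above boil down to a small decision table. Here's a shell sketch of it; the function name, the yes/no arguments, and the action names are all invented for illustration and don't correspond to anything in the real code.

```shell
# What to do on restart, given what survived the crash.
#   $1: "yes" if the spa config still holds the list of disks being split off
#   $2: "yes" if the vdev labels on those disks were already updated
split_recovery_action() (
    split_list_present=$1 labels_updated=$2
    if [ "$split_list_present" != yes ]; then
        echo "nothing-to-do"     # no split in flight, or step 3 already finished
    elif [ "$labels_updated" = yes ]; then
        echo "complete-split"    # died after step 2: remove the disks, drop the list
    else
        echo "roll-back"         # died after step 1: drop the list, online the disks
    fi
)

split_recovery_action yes no     # prints roll-back
split_recovery_action yes yes    # prints complete-split
```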

So how does this actually work? You can see the heart of the code in spa.c, with the new function spa_vdev_split_mirror.

One tricky area is how we deal with log devices. Log (and "cache" and "spare") disks are not part of the split and, normally, this would not be a problem. However, due to the way block pointers work, it is possible to generate a configuration that cannot be easily split. Consider the following sequence of commands:

# zpool create tank mirror c0t0d0 c1t0d0
# zpool add tank log c0t0d1
# zpool add tank mirror c0t1d0 c1t1d0

For the first line, zfs creates a new pool. The mirror is the top-level vdev, and it has a vdev ID of 0. The second line adds a log device. This device is also a top-level vdev, and gets an ID of 1. Finally, the third line adds a mirror as a new stripe to the pool. It gets a vdev ID of 2. A zpool status command confirms this numbering:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
          c0t0d1    ONLINE       0     0     0

errors: No known data errors

What this means is that any block pointer pointing to data on the second stripe is going to have a vdev ID of 2. If the second and third commands were swapped -- in other words, if the stripe was added before the log device was added -- then the stripe would be "mirror-1" instead of "mirror-2", and the vdev ID would be 1. But we need to be able to handle both cases.

Why is this important? After a split, there will only be two stripes and no logs. Somehow, we need to ensure that these are vdev ID 0 and vdev ID 2, or all the block pointers that point to ID 2 will result in panicking zfs. How do we tell zfs to skip over ID 1?

With George Wilson's putback of CR 6574286, he introduced the concept of "holes". A hole is a top-level vdev that cannot be allocated from, and provides no data, but it takes up a slot in the vdev list. This made it possible for log devices to be removed, with the hole device taking over. The split code leverages this feature to do its form of log device removal, inserting holes in the new config wherever log devices are. And of course, it's smart about it: if the log device is the last device in the configuration, there's no need to put in a hole. This is done in libzfs, in the new function zpool_vdev_split. Look how the "lastlog" variable is used.
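To make the hole placement concrete, here's a toy model in shell. It's my own simplification of the behaviour described above, not a transcription of zpool_vdev_split: it takes the original top-level vdevs in ID order, turns each log into a hole, and drops any logs at the end of the list.

```shell
# Build the top-level vdev layout for the new pool from the original layout.
# Arguments are the original top-level vdevs in ID order: "mirror" or "log".
# A log ahead of a mirror becomes a "hole" so that mirror keeps its vdev ID;
# trailing logs are simply dropped.
split_layout() (
    out=
    pending=        # holes seen since the last mirror, not yet emitted
    for vdev in "$@"; do
        case $vdev in
            log)    pending="$pending hole" ;;
            mirror) out="$out$pending mirror"; pending= ;;
            *)      echo "unknown vdev type: $vdev" >&2; exit 1 ;;
        esac
    done
    echo $out       # unquoted on purpose: trims the leading space
)

split_layout mirror log mirror    # prints: mirror hole mirror
split_layout mirror mirror log    # prints: mirror mirror
```

In the first case the second mirror stays at vdev ID 2, which is exactly what the block pointers expect.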

And that's it. That's splitting in a nutshell. Or at least a few nutshells. I'm guessing the information density in this post is pretty high, but splitting up zpools is rather complex. There are still things it doesn't do that would be nice to do in the future, such as splitting off mirrors from mirrors, or even rejoining a split config to its parent. I hope I get to work on those.

Tuesday Dec 08, 2009

Backing up a zvol

Over at Spiceworks, Michael2024 asks, "Anybody know how to get rsync to backup a ZFS zvol?"

My response is: "That's the wrong question." In fact, someone replied to Michael2024 already saying that rsync was not the right tool, but no one suggested the best tool for backing up zvols: snapshots.

"But Mark," you say (because we're on first-name terms, and that is in fact my first name). "The snapshot is right there on the device that I'm trying to back up! How can that possibly help me?"

I'm glad you asked.

If you try to "back up" a zvol using a tool like dd, you're going to have to copy the whole volume, even the blocks that contain no data. But zvols are ZFS constructs which means they follow the copy-on-write paradigm which, in turn, means that ZFS needs to know what's data and what's not.

So that means that any snapshot will only contain the data that is actually on the disk. That's right: a snapshot of a 100TB volume that has 10MB of data will only contain those 10MB of data. And therefore, any "zfs send" stream will only contain real data and not a bunch of unwritten garbage.

To demonstrate, let's create a 100MB volume and snapshot it:

-bash-4.0# zfs create -V 100m tank/vol
-bash-4.0# zfs snapshot tank/vol@snap
How big is the send stream? Easy enough to check:
-bash-4.0# zfs send tank/vol@snap | wc -c
Just a smidge over 4k. Let's write some data:
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=10
10+0 records in
10+0 records out
-bash-4.0# zfs snapshot tank/vol@snap2
-bash-4.0# zfs send tank/vol@snap2 | wc -c
OK, we wrote 10k of data, and the send stream is 20k. With such a small amount of data, the overhead is about half the stream. But, what if we write to the same blocks again?
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=10
10+0 records in
10+0 records out
-bash-4.0# zfs snapshot tank/vol@snap3
-bash-4.0# zfs send tank/vol@snap3 | wc -c
The exact same amount! So ZFS knows exactly how much data there is on the zvol. Let's write 1MB instead:
-bash-4.0# dd if=/dev/random of=/dev/zvol/rdsk/tank/vol bs=1k count=1024
1024+0 records in
1024+0 records out
-bash-4.0# zfs snapshot tank/vol@snap4
-bash-4.0# zfs send tank/vol@snap4 | wc -c
And now the overhead is quite a bit smaller than the data, around 3-4%.

The question then is: which is more efficient? Doing a full block-by-block copy using something "higher up in the stack" (quoting from Michael2024 there), or creating another pool and doing a "zfs send | zfs recv"? On top of that, add the under-appreciated feature of incremental send streams, and you have a full backup solution that does not require any external tools.
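For the incremental case, the pipeline has the same shape, with -i naming the earlier snapshot. The sketch below only prints the command rather than running it, so it's safe to try anywhere; the function name is made up, and "backup/vol" is a hypothetical dataset on a second pool.

```shell
# Print (not run) the pipeline for an incremental backup of a dataset.
# usage: incremental_send_cmd DATASET FROM_SNAP TO_SNAP DEST_DATASET
incremental_send_cmd() (
    dataset=$1 from=$2 to=$3 dest=$4
    echo "zfs send -i ${dataset}@${from} ${dataset}@${to} | zfs recv ${dest}"
)

incremental_send_cmd tank/vol snap snap2 backup/vol
# prints: zfs send -i tank/vol@snap tank/vol@snap2 | zfs recv backup/vol
```

The receiving side needs to already hold the earlier snapshot (from a previous full send) for the incremental stream to apply.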

I would respond on the Spiceworks website, but alas they are both members-only and require you to download a Windows client just to register. Lame!

Thursday Mar 26, 2009

Cloning Humans: Bad; Cloning Filesystems: Good

Imagine you could snapshot yourself and generate a clone from that snapshot. On the surface, that would seem like a cool idea, but scratch the surface and you quickly run into problems. These problems could be ethical (what rights does your clone have to your own possessions?), or they could be practical (how can the planet sustain the population explosion?). Fortunately, with zfs datasets, snapshotting and cloning not only are permitted, they are actively encouraged.

Over at the Xerces Tech site, a recent article outlines how to use zfs clones in order to safely do an apt-get on their Nexenta box. Take a look at the article called Unbreakable upgrades, ZFS and apt-get to see how it's done.

Sunday Feb 15, 2009

ZFS Mandates

Even with ZFS being made a requirement for court and police NAS/SAN devices, it would be nice if people used ZFS because it is so obviously better, not because its use is mandated.

Wednesday Jan 28, 2009

Building a ZFS server

Have you ever wanted to try building your own ZFS-based file server? Jermaine Maree has done just that, and blogged about it. Start with part 1.

Thursday Jan 01, 2009

Don't shout at your JBODs

They don't like it!

Tuesday Dec 23, 2008

Environmental Disaster

A TVA ash pond has flooded the Tennessee River valley. This is a huge disaster, but I didn't hear about it through the normal news channels. Why not?

Tuesday Nov 18, 2008

ZFS can boost performance

Even a suboptimal configuration can result in a performance boost. The most interesting thing, I think, is the ease with which the zpool was created.

I wonder what kind of performance numbers this user would see with the 7110 compared with the Dell PowerVault. The 7110 can hold 14 146GB SAS drives, whereas the Dell uses 14 146GB SCSI drives, so comparing power utilization would be interesting as well.

Wednesday Oct 22, 2008

ZFS User Directories on OS X

Check out this blog post on setting up OS X with zfs-based user directories.

Tuesday Oct 07, 2008

Why it's going to take me forever...

One of the things I'm trying to do, besides have a full time job at Sun and raise three kids (another full-time job, even with my wife's help), is to learn Japanese. I'm not in any great rush, so I make time where I can and I try to spend at least a few minutes a day on it. I've started by trying to learn the kana writing system before I move onto the more complex kanji. There are two primary kinds of kana, hiragana and katakana. The former is used when writing Japanese, and the latter is used when writing "foreign" words, including words that have become part of the Japanese language but were borrowed from, for example, English.

Sun, as you're probably aware, encourages blogging from its employees around the globe, and more than a few of these are from Japan. Here's an example of one: キャンパス アンバサダ (I hope that came through).

Here's how the entry looks in my browser - I'll include it here in case your browser doesn't show the characters correctly

What I've been doing, in order to practice my kana, is to pick out the katakana symbols from the entries and see if I can work out what word it is in English. Here's the title of the entry I linked to, broken down kana by kana:

Kana    English pronunciation
(first word)
キャ    kya
ン      n or m
パ      pa
ス      su
(second word)
ア      a
ン      n or m
バ      ba
サ      sa
ダ      da

Spelling it phonetically, we get kyanpasu or kyampasu for the first word, and anbasada or ambasada for the second. And I was puzzled. Contrast that with another katakana word that appears in the blog entry: "オリエンテーション", which is o-ri-e-n-teh-sho-n ... orientehshon ... orientation. Obvious, right? What's kyanpasu? Words that end with ス tend to have the final 'oo' sound dropped, so it becomes kyanpas when saying it aloud. I give up, so Google Translate tells me: Campus. This is where I hit my head on the desk. Why does Campus start with キャ? The a in kya sounds like the a in father, not the a in campus, so maybe that's why kya is used to differentiate it from the カ character ('ka') which also sounds like the a in father?

Of course, once I saw the first word was 'campus', it was easy to figure out the second word is Ambassador. Which I would have spelled, apparently incorrectly, アンバサドル.

Tuesday Aug 19, 2008

Cache on hand

When we think of a cache, we think of a way of storing information "closer" to the place it's needed. Most general-purpose CPUs, for example, have an on-board cache which is used to avoid accessing main memory - after all, the memory that's on the same die as the CPU is going to be quicker to access than the RAM chips. Filesystems use caches of RAM to make disk accesses appear to be quicker, as RAM chips are much faster than moving a mechanical arm across a spinning disk. If we're lucky, and if we've got a good caching algorithm, we can get an impressive speed boost by keeping the right bits in RAM. Likewise, CPUs get a speed boost by keeping the right instructions and data on chip.

Caching is not limited to CPUs and filesystems, of course. Most browsers maintain a cache of pages, of images, of CSS files, of JavaScript, and of any other bit of information that is useful for displaying web pages. By using a local on-disk cache (some of which is going to be in RAM anyway, thanks to the filesystem), browsing appears to be much quicker than it would if the browser had to re-load every single image from a distant web site. The browser does check to see if any files need to be retrieved again (see HTTP's 304 Not Modified response code), so there is some over-the-wire activity, but that's about it.

All of this is a long-winded way of saying I was amused by the second bullet item here (from Apple Insider):

Either the cache is poorly implemented, or the users reporting this information are confused.


Known throughout Sun as a man of infinite wit, of jovial attitude, and of making things up about himself at the slightest whim.

