Pool discovery and 'zpool import'

In the later months of ZFS development, the mechanism used to open and import pools was drastically changed. The reasons behind this change make an interesting case study in complexity management and how a few careful observations can make all the difference.

The original spa_open()

Skipping past some of the very early prototypes, we'll start with the scheme that Jeff implemented back in 2003. The scheme had a lofty goal:

Be able to open a pool from an arbitrary subset of devices within the pool.

Of course, given limited label space, it was not possible to guarantee this entirely, but we created a scheme which would work in all but the most pessimistic scenarios. Basically, each device was part of a strongly connected graph made up of each toplevel vdev. The graph began with a circle of all the toplevel vdevs. Each device had its parent vdev config, the vdev configs of the nearest neighbors, and up to 5 other vdev configs for good measure. When we went to open the pool, we read in this graph and constructed the complete vdev tree which we then used to open the pool.

You can see the original 'big theory' comment from the top of vdev_graph.c here.

First signs of problems

The simplest problem to notice was that these vdev configs were stored in a special one-off ASCII representation that required a special parser in the kernel. In a world where we were rapidly transitioning to nvlists, this wasn't really a good idea.

The first signs of real problems came when I tried to implement the import functionality. For import, we needed the ability to know if a pool could be imported without actually opening it. This had existed in a rather disgusting form previously by linking to libzpool (which is the userland port of the SPA and DMU code), and hijacking the versions of these functions to construct a vdev tree in userland. In the new userland model, it wasn't acceptable to use libzpool in this manner. So, I had to construct spa_open_by_dev() that parsed the configuration into a vdev tree and tried to open the vdevs, but not actually load the pool metadata.

This required a fair amount of hackery, but it was nowhere near as bad as when I had to make the resulting system tolerant of all faults. For both 'zpool import' and 'zpool status', it wasn't enough just to know that a pool couldn't be opened. We needed to know why, and exactly which devices were at fault. While the original scheme worked well when a single device was missing in a toplevel vdev, it failed miserably when an entire toplevel vdev was missing. In particular, it relied on the fact that it had at least a complete ring of toplevel vdevs to work with. For example, if you were missing a single device in an unmirrored pool, there was no way to know which device was missing because the entire graph parsing algorithm would break down. So, I went in and hacked on the code to understand multiple "versions" of a vdev config. If we had two neighbors referring to a missing toplevel vdev, we could surmise its config even though all its devices were absent.

At this point, things were already getting brittle. The code was enormously complex, and hard to change in any well-defined way. On top of that, things got even worse when our test guys started getting really creative. In particular, if you disconnected a device and then exported a pool, or created a new pool over parts of an exported pool, things would get ugly fast. The labels on the disks would be technically valid, but semantically invalid. I had to make even more changes to the algorithms to accommodate all these edge cases. Eventually, it got to the point where every change I made was prefixed by /* XXX kludge */. Jeff and I decided something needed to be done.

On top of all this, the system still had a single point of failure. Since we didn't want to scan every attached device on boot, we kept an /etc/system file around that described the first device in the pool as a 'hint' for where to get started. If this file was corrupt, or that particular toplevel vdev was not available, the pool configuration could not be determined.

vdev graph 2.0

At this point we had a complex, brittle, and functionally insufficient system for discovering pools. As Jeff, Bill, and I talked about this problem for a while, we made two key observations:

  • The kernel didn't have to parse the graph. We already had the case where we were dependent on a cached file for opening our pool, so why not keep the whole configuration there? The kernel can do (relatively) atomic updates to this file on configuration changes, and then open the resulting vdev tree without having to construct it based on on-disk data.
  • During import, we already need to check all devices in the system. We don't have to worry about 'discovering' other toplevel vdevs, because we know that we will, by definition, look at the device during the discovery phase.

With these two observations under our belt, we knew what we had to do. For both open and import, the SPA would know only how to take a fully constructed config and parse it into a working vdev tree. Whether that config came from the new /etc/zfs/zpool.cache file, or whether it was constructed during a 'zpool import' scan, it didn't matter. The code would be exactly the same. And best of all, no complicated discovery phase - the config was taken at face value [1]. And the config was simply an nvlist. No more special parsing of custom 'vdev spec' strings.
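To make the "one code path" idea concrete, here is a minimal sketch in Python rather than the kernel's C. It is not the actual SPA code: the nested dict stands in for an nvlist, and the names Vdev, parse_config, and leaf_paths, along with the "type", "guid", "path", and "children" keys, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vdev:
    """One node of an in-memory vdev tree (illustrative, not the kernel's vdev_t)."""
    type: str
    guid: int
    path: Optional[str] = None
    children: List["Vdev"] = field(default_factory=list)


def parse_config(nv: dict) -> Vdev:
    """Turn a fully constructed config into a vdev tree.

    The function neither knows nor cares whether `nv` came from a cache
    file or was assembled during an import scan; it takes it at face value.
    """
    return Vdev(
        type=nv["type"],
        guid=nv["guid"],
        path=nv.get("path"),
        children=[parse_config(child) for child in nv.get("children", [])],
    )


def leaf_paths(vd: Vdev) -> List[Optional[str]]:
    """Collect the device paths of all leaf vdevs, left to right."""
    if not vd.children:
        return [vd.path]
    paths: List[Optional[str]] = []
    for child in vd.children:
        paths.extend(leaf_paths(child))
    return paths
```

The point of the sketch is only that the parser has a single input type. Everything that used to be "discovery" happens before this function is called, or not at all.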

So how does import work?

The config cache was all well and good, but how would import work? And how would it fare in the face of total device failure? You can find all of the logic for 'zpool import' in libzfs_import.c.

Each device keeps the complete config for its toplevel vdev. No neighbors, no graph, no nothing. During pool discovery, we keep track of all toplevel vdevs for a given pool that we find during a scan. Using some simple heuristics, we construct the 'best' version of the pool's config, and even go through and update the path and devid information based on the unique GUID for each device. Once that's done, we have the same nvlist we would have as if we had read it from zpool.cache. The kernel happily goes off and tries to open the vdevs (if this is just a scan) or open it for real (if we're doing the import).
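The assembly step can be sketched roughly as follows. This is a toy model in Python, not the code in libzfs_import.c: the label layout (keys "pool_guid", "txg", "guid", "vdev_tree") and the function names are hypothetical, and "prefer the label with the highest txg" is an assumed stand-in for the simple heuristics mentioned above.

```python
import copy


def _fix_paths(vd, found_path):
    """Rewrite each leaf's path to wherever its GUID was actually seen."""
    kids = vd.get("children", [])
    if not kids and vd["guid"] in found_path:
        vd["path"] = found_path[vd["guid"]]
    for kid in kids:
        _fix_paths(kid, found_path)


def assemble_pool_configs(labels):
    """Build one config per pool from (device_path, label) pairs.

    Each label carries the complete config of its own toplevel vdev only.
    Assumption: when several labels describe the same toplevel vdev, the
    one with the highest txg wins.
    """
    found_path = {}  # leaf guid -> path where this scan found it
    best = {}        # pool guid -> {toplevel guid -> best label so far}
    for path, label in labels:
        found_path[label["guid"]] = path
        tops = best.setdefault(label["pool_guid"], {})
        top_guid = label["vdev_tree"]["guid"]
        current = tops.get(top_guid)
        if current is None or label["txg"] > current["txg"]:
            tops[top_guid] = label

    configs = {}
    for pool_guid, tops in best.items():
        children = []
        for top_guid in sorted(tops):
            tree = copy.deepcopy(tops[top_guid]["vdev_tree"])
            _fix_paths(tree, found_path)
            children.append(tree)
        configs[pool_guid] = {
            "type": "root",
            "guid": pool_guid,
            "children": children,
        }
    return configs
```

The result has the same shape a cached config would have, which is what lets the open path stay ignorant of where the config came from.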

So what about missing toplevel vdevs? In the online (cached) state, we'll have the complete config and tell you which device is missing. For the import case, we'll be a little worse off because we'll never see any vdevs indicating that there is another toplevel vdev in the pool. The most important thing is that we're able to detect this case and error out properly. To do this, we have a 'vdev guid sum' stored in the uberblock that indicates the sum of all the vdev guids for every device in the config. If this doesn't match, we know that we have missed a toplevel vdev somewhere. Unfortunately, we can't tell you what device it is. In the future, we hope to improve this by adding the concept of 'neighbor lists' - arbitrary lists of devices without entire vdev configs. Unlike the previous incarnation of vdev graph, these will be purely suggestions, and not actually be required for correctness. There will, of course, be cases where we can never provide you enough information about all your neighbors, such as plugging in a single disk from a thousand-disk unreplicated pool.
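The guid sum check is simple enough to sketch in a few lines. This is an illustration under assumptions, not the kernel code: the function names are made up, the sketch sums the guid of every node in the tree (interior vdevs included), and it wraps the total to 64 bits on the assumption that the stored sum is an unsigned 64-bit value.

```python
MASK64 = (1 << 64) - 1  # assumed width of the stored sum


def vdev_guid_sum(vd):
    """Sum the guids of this vdev and everything below it, modulo 2**64."""
    total = vd["guid"]
    for child in vd.get("children", []):
        total += vdev_guid_sum(child)
    return total & MASK64


def config_is_complete(config, uberblock_guid_sum):
    """True if the assembled config accounts for every vdev in the pool.

    A mismatch says that something is missing, but not what.
    """
    return vdev_guid_sum(config) == uberblock_guid_sum
```

Note what the check cannot do: a sum tells you that the books don't balance, but it carries no information about which guid is absent. That is exactly the limitation described above.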

Conclusions

So what did we learn from this? The main thing is that phrasing the problem slightly differently can cause you to overdesign a system beyond the point of maintainability. As Bryan is fond of saying, one of the most difficult parts of solving hard problems is not just working within a highly constrained environment, but identifying which constraints aren't needed at all. By realizing that opening a pool was a very different operation than discovering a pool, we were able to redesign our interfaces into a much more robust and maintainable state, with more features to boot.

It's no surprise that there were more than a few putbacks like "SPA 2.0", "DMU 2.0", or "ZIL 3.0". Sometimes you just need to take a hammer to the foundation to incorporate all that you've learned from years of using the previous version.


[1] Actually, Jeff made this a little more robust by adding the config as part of the MOS (meta object set), which is stored transactionally with the rest of the pool data. So even if we added two devices, but the /etc cache file didn't get updated correctly, we'll still be able to open the MOS config and realize that there are two new devices to be had.

Comments:

Hi Eric, I wonder if there's a cluster version of ZFS in the works? I'm almost certain you're probably planning on it because it's a hell of disk you could address with 128 bits so one would think that a) people will want to access it through more than one system at a time and b) they'll want it highly available. I'll appreciate if you could comment in a future blog.

Posted by Dimitar Ivanov on November 28, 2005 at 02:59 PM PST #

Dimitar -

Yes, we have two plans in this area. The first is an HA-ZFS project which allows failover between SunCluster nodes. This should be finished in the near future. In the long term, we want to make ZFS a natively supported clustered filesystem. This will be much more work, and currently hasn't been scoped.

Posted by Eric Schrock on November 30, 2005 at 01:46 PM PST #

yikes

Posted by guest on June 20, 2007 at 04:48 AM PDT #
