Wednesday Mar 14, 2007

DTrace sysevent provider

I've been heads down for a long time on a new project, but occasionally I do put something back to ON worth blogging about. Recently I've been working on some problems which leverage sysevents (libsysevent(3LIB)) as a common transport mechanism. While trying to understand exactly what sysevents were being generated from where, I found the lack of observability astounding. After poking around with DTrace, I found that tracking down the exact semantics was not exactly straightforward. First of all, we have two orthogonal sysevent mechanisms, the original syseventd legacy mechanism, and the more recent general purpose event channel (GPEC) mechanism, used by FMA. On top of this, the sysevent_impl_t structure isn't exactly straightforward, because all the data is packed together in a single block of memory. Knowing that this would be important for my upcoming work, I decided that adding a stable DTrace sysevent provider would be useful.

The provider has a single probe, sysevent:::post, which fires whenever a sysevent post attempt is made. It doesn't necessarily indicate that the sysevent was successfully queued or received. The probe has the following semantics:

# dtrace -lvP sysevent
   ID   PROVIDER            MODULE                          FUNCTION NAME
44528   sysevent           genunix                    queue_sysevent post

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: syseventchaninfo_t *
                args[1]: syseventinfo_t *

The 'syseventchaninfo_t' translator has a single member, 'ec_name', which is the name of the event channel. If this is being posted via the legacy sysevent mechanism, then this member will be NULL. The 'syseventinfo_t' translator has three members, 'se_publisher', 'se_class', and 'se_subclass'. These mirror the arguments to sysevent_post(). The following script will dump all sysevents posted to syseventd(1M):

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
	printf("%-30s  %-20s  %s\\n", "PUBLISHER", "CLASS",
	    "SUBCLASS");
}

sysevent:::post
/args[0]->ec_name == NULL/
{
	printf("%-30s  %-20s  %s\\n", args[1]->se_publisher,
	    args[1]->se_class, args[1]->se_subclass);
}

And the output during a cfgadm -c unconfigure:

PUBLISHER                       CLASS                 SUBCLASS 
SUNW:usr:devfsadmd:100237       EC_dev_remove         disk
SUNW:usr:devfsadmd:100237       EC_dev_branch         ESC_dev_branch_remove
SUNW:kern:ddi                   EC_devfs              ESC_devfs_devi_remove
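The same probe also fires for GPEC channels. For example, this one-liner (a quick sketch built only from the probe arguments described above) counts posts by channel, class, and subclass:

# dtrace -n 'sysevent:::post /args[0]->ec_name != NULL/
    { @[args[0]->ec_name, args[1]->se_class, args[1]->se_subclass] = count(); }'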

This has already proven quite useful in my ongoing work, and hopefully some other developers out there will also find it useful.

Friday Sep 22, 2006

New ZFS Features

I've been meaning to get around to blogging about these features that I putback a while ago, but have been caught up in a few too many things. In any case, the following new ZFS features were putback to build 48 of Nevada, and should be available in the next Solaris Express.

Create Time Properties

An old RFE has been to provide a way to specify properties at create time. For users, this simplifies administration by reducing the number of commands which need to be run. It also allows some race conditions to be eliminated. For example, if you want to create a new dataset with a mountpoint of 'none', you first had to create it (and its inherited mountpoint), only to remove the mountpoint afterwards with 'zfs set mountpoint=none'.
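For the 'mountpoint=none' example, the difference looks something like this (a small sketch; 'tank/container' is a made-up dataset name):

        # zfs create tank/container
        # zfs set mountpoint=none tank/container

versus the single step, with no window where the filesystem is ever mounted:

        # zfs create -o mountpoint=none tank/container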

From an implementation perspective, this allows us to unify our implementation of the 'volsize' and 'volblocksize' properties, and pave the way for future create-time only properties. Instead of having a separate ioctl() to create a volume and passing in the two size parameters, we simply pass them down as create-time options.

The end result is pretty straightforward:

        # zfs create -o compression=on tank/home
        # zfs create -o mountpoint=/export -o atime=off tank/export

'canmount' property

The 'canmount' property allows you to create a ZFS dataset that serves solely as a mechanism for inheriting properties. When we first created the hierarchical dataset model, we had the notion of 'containers' - filesystems with no associated data. Only these datasets could contain other datasets, and you had to make the decision at create time.

This turned out to be a bad idea for a number of reasons. It complicated the CLI, forced the user to make a create-time decision that could not be changed, and led to confusion when files were accidentally created on the underlying filesystem. So we made every filesystem able to have child filesystems, and all seemed well.

However, there is power in having a dataset that exists in the hierarchy but has no associated filesystem data (or effectively none, by preventing it from being mounted). One can do this today by setting the 'mountpoint' property to 'none'. However, this property is inherited by child datasets, and the administrator cannot leverage the power of inherited mountpoints. In particular, some users have expressed a desire to have two sets of directories, belonging to different ZFS parents (or even to UFS filesystems), share the same inherited directory. With the new 'canmount' property, this becomes trivial:

        # zfs create -o mountpoint=/export -o canmount=off tank/accounting
        # zfs create -o mountpoint=/export -o canmount=off tank/engineering
        # zfs create tank/accounting/bob
        # zfs create tank/engineering/anne

Now, both anne and bob have directories under '/export', except that they are inheriting ZFS properties from different datasets in the hierarchy. The administrator may decide to turn compression on for one group or another, or set a quota to limit the amount of space consumed by the group, or simply use it as a way to view the total amount of space consumed by each group without resorting to scripted du(1).
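For example, building on the datasets above (a small sketch; the property values are arbitrary):

        # zfs set compression=on tank/engineering
        # zfs set quota=10G tank/accounting
        # zfs list -r -o name,used,quota,compression tank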

User Defined Properties

The last major RFE in this wad added the ability to set arbitrary properties on ZFS datasets. This provides a way for administrators to annotate their own filesystems, as well as ISVs to layer intelligent software without having to modify the ZFS code to introduce a new property.

A user-defined property name is one which contains a colon (:). This provides a unique namespace which is guaranteed not to overlap with native ZFS properties. The convention is to use the colon to separate a module and property name, where 'module' should be a reverse DNS name. For example, a theoretical Sun backup product might do:

        # zfs set com.sun.sunbackup:frequency=1hr tank/home

The property value is an arbitrary string, and no additional validation is done on it. These values are always inherited. A local administrator might do:

        # zfs set localhost:backedup=9/19/06 tank/home
        # zfs list -o name,localhost:backedup
        NAME            LOCALHOST:BACKEDUP
        tank            -
        tank/home       9/19/06
        tank/ws         9/10/06
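Because user properties behave like any other inherited property, a value can also be cleared again with 'zfs inherit' (a quick sketch using the dataset above):

        # zfs inherit localhost:backedup tank/home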

The hope is that this will serve as a basis for some innovative products and home grown solutions which interact with ZFS datasets in a well-defined manner.

Tuesday Aug 22, 2006

ZFS on FreeBSD

More exciting news on the ZFS OpenSolaris front. In addition to the existing ZFS on FUSE/Linux work, we now have a second active port of ZFS, this time to FreeBSD. Pawel Dawidek has been hard at work, and has made astounding progress after just 10 days (!). This is a testament both to his ability and to the portability of ZFS. As with any port, the hard part comes down to integrating the VFS layer, but Pawel has already made good progress there. The current prototype can already mount filesystems, create files, and list directory contents. Of course, our code isn't completely without portability headaches, but thanks to Pawel (and Ricardo on FUSE/Linux), we can take patches and implement the changes upstream to ease future maintenance. You can find the FreeBSD repository here. If you're a FreeBSD developer or user, please give Pawel whatever support you can, whether it's code contributions, testing, or just plain old compliments. We'll be helping out where we can on the OpenSolaris side.

In related news, Ricardo Correia has made significant progress on the FUSE/Linux port. All the management functionality of zfs(1M) and zpool(1M) is there, and he's working on mounting ZFS filesystems. All in all, it's an exciting time, and we're all crossing our fingers that ZFS will follow in the footsteps of its older brother DTrace.

Monday Jun 12, 2006

ztest on Linux

As Jeff mentioned previously, Ricardo Correia has been working on porting ZFS to FUSE/Linux as part of Google SoC. Last week, Ricardo got libzpool and ztest running on Linux, which is a major first step of the project.

The interesting part is the set of changes that he had to make in order to get it working. libzpool was designed from the start to be run from both userland and the kernel, so we had already done most of the work of separating out the OS-dependent interfaces. The most prolific changes were to satisfy GCC warnings. We do compile ON with gcc, but not using the default options. I've since updated the ZFS porting page with info about gcc use in ON, which should make future ports easier. The second most common change involved header files that are available in both userland and the kernel on Solaris, but which should nevertheless be included via zfs_context.h, concentrating platform-specific knowledge in that one file. Finally, there were some simple changes we could make (such as using pthread_create() instead of thr_create()) to make ports of the tools easier. It would also be helpful to have ports of libnvpair and libavl, much like some have done for libumem, so that developers don't have to continually port the same libraries over and over.

The next step (getting zfs(1M) and zpool(1M) working) is going to require significantly more changes to our source code. Unlike libzpool, these tools (libzfs in particular) were not designed to be portable. They include a number of Solaris specific interfaces (such as zones and NFS shares) that will be totally different on other platforms. I look forward to seeing Ricardo's progress to know how this will work out.

Tuesday Jun 06, 2006

ZFS Hot Spares

It's been a long time since the last time I wrote a blog entry. I've been working heads-down on a new project and haven't had the time to keep up my regular blogging. Hopefully I'll be able to keep something going from now on.

Last week the ZFS team put the following back to ON:

PSARC 2006/223 ZFS Hot Spares
PSARC 2006/303 ZFS Clone Promotion
6276916 support for "clone swap"
6288488 du reports misleading size on RAID-Z
6393490 libzfs should be a real library
6397148 fbufs debug code should be removed from buf_hash_insert()
6405966 Hot Spare support in ZFS
6409302 passing a non-root vdev via zpool_create() panics system
6415739 assertion failed: !(zio->io_flags & 0x00040)
6416759 ::dbufs does not find bonus buffers anymore
6417978 double parity RAID-Z a.k.a. RAID6
6424554 full block re-writes need not read data in
6425111 detaching an offline device can result in import confusion

There are a couple of cool features mixed in here. Most importantly, hot spares, clone swap, and double-parity RAID-Z. I'll focus this entry on hot spares, since I wrote the code for that feature. If you want to see the original ARC case and some of the discussion behind the feature, you should check out the original zfs-discuss thread.

The following features make up hot spare support:

Associating hot spares with pools

Hot spares can be specified when creating a pool or adding devices by using the spare vdev type. For example, you could create a mirrored pool with a single hot spare by doing:

# zpool create test mirror c0t0d0 c0t1d0 spare c0t2d0
# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
        spares
          c0t2d0    AVAIL   

errors: No known data errors

Notice that there is one spare, and it is currently available for use. Spares can be shared between multiple pools, allowing for a single set of global spares on systems with multiple pools.
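Sharing a spare is just a matter of adding the same device as a spare to each pool (a quick sketch; the second pool and its mirror devices are made up):

# zpool create test2 mirror c0t4d0 c0t5d0
# zpool add test2 spare c0t2d0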

Replacing a device with a hot spare

There is now an FMA agent, zfs-retire, which subscribes to vdev failure faults and automatically initiates replacements if there are any hot spares available. But if you want to play around with this yourself (without forcibly faulting drives), you can just use 'zpool replace'. For example:

# zpool offline test c0t0d0
Bringing device c0t0d0 offline
# zpool replace test c0t0d0 c0t2d0
# zpool status test
  pool: test
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver completed with 0 errors on Tue Jun  6 08:48:41 2006
config:

        NAME          STATE     READ WRITE CKSUM
        test          DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              c0t0d0  OFFLINE      0     0     0
              c0t2d0  ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
        spares
          c0t2d0      INUSE     currently in use

errors: No known data errors

Note that the offline is optional, but it helps visualize what the pool would look like should an actual device fail. Note that even though the resilver is complete, the 'spare' vdev stays in place (unlike a 'replacing' vdev). This is because the replacement is only temporary. Once the original device is replaced, the spare will be returned to the pool.

Relieving a hot spare

A hot spare can be returned to its previous state by replacing the original faulted drive. For example:

# zpool replace test c0t0d0 c0t3d0
# zpool status test
  pool: test
 state: DEGRADED
 scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:

        NAME             STATE     READ WRITE CKSUM
        test             DEGRADED     0     0     0
          mirror         DEGRADED     0     0     0
            spare        DEGRADED     0     0     0
              replacing  DEGRADED     0     0     0
                c0t0d0   OFFLINE      0     0     0
                c0t3d0   ONLINE       0     0     0
              c0t2d0     ONLINE       0     0     0
            c0t1d0       ONLINE       0     0     0
        spares
          c0t2d0         INUSE     currently in use

errors: No known data errors
# zpool status test
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
        spares
          c0t2d0    AVAIL   

errors: No known data errors

The drive is actively being replaced for a short period of time. Once the replacement is completed, the old device is removed, and the hot spare is returned to the list of available spares. If you want a hot spare replacement to become permanent, you can zpool detach the original device, at which point the spare will be removed from the hot spare list of any active pools. You can also zpool detach the spare itself to cancel the hot spare operation.
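Going back to the earlier degraded state, where c0t2d0 was standing in for c0t0d0, the two outcomes look like this (a quick sketch; run one or the other):

# zpool detach test c0t0d0
# zpool detach test c0t2d0

The first makes the hot spare replacement permanent; the second cancels it and returns c0t2d0 to the AVAIL list.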

Removing a spare from a pool

To remove a hot spare from a pool, simply use the zpool remove command. For example:

# zpool remove test c0t2d0
# zpool status
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Tue Jun  6 08:51:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0

errors: No known data errors

Unfortunately, we don't yet support removing anything other than hot spares (it's on our list, we swear). But you can see how hot spares naturally fit into the existing ZFS scheme. Keep in mind that to use hot spares, you will need to upgrade your pools (via 'zpool upgrade') to version 3 or later.
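The upgrade itself is a single command; 'zpool upgrade -v' lists the available on-disk versions (a quick sketch):

# zpool upgrade -v
# zpool upgrade test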

Next Steps

Despite the obvious usefulness of this feature, there is one more step that needs to be done for it to be truly useful. This involves phase two of the ZFS/FMA integration. Currently, a drive is only considered faulted if it 'goes away' completely (i.e. ldi_open() fails). This covers only a subset of known drive failure modes. It's possible for a drive to continually return errors, and yet be openable. The next phase of ZFS and FMA will introduce a more intelligent diagnosis engine to watch I/O and checksum errors, as well as the SMART predictive failure bit, in order to proactively offline devices when they are experiencing an abnormal number of errors or appear to be failing. With this functionality, ZFS will be able to respond better to failing drives, thereby making hot spare replacement much more valuable.

Monday Dec 12, 2005

Nobel Aftermath

What a party.

I can't say much more - I'm afraid I won't do it justice. Suffice it to say that we rolled back into our hotel room at 5:00 AM after hanging out with Swedish royalty and Nobel laureates at what has to be one of the most amazing ceremonies and banquets ever conceived.

You'll have to ask me in person for some of the details, but a few highlights include my father escorting Princess Madeleine down the staircase, as well as my mother being escorted by (and talking with) the King of Sweden at the more private dinner the next night. Not to mention way too many late night drinking escapades with the likes of Grubbs and Nocera.

The only downside is that my flight home was delayed 5 hours (while we were on the plane), so I missed my connection and am now hanging out in a Newark hotel for a night. At least I get a midway point to adjust to the new timezone...

Hopefully the banquet footage will be available soon at nobelprize.org. Check it out when it is.

Thursday Dec 08, 2005

Nobel day 5

Things have picked up over here in Stockholm. The rest of the troops have arrived, filling up the Grand Hotel to the best of our ability. My father's been running around to press conferences, interviews, and various preparations. The rest of us only have one or two events each day to keep track of. Thankfully, my father has a personal attendant to make sure he doesn't miss a thing, not that my mom would ever allow such a thing to happen.

On the 6th, there was a small reception at the Nobel museum. They used the opportunity to show an "instructional" video to make sure we all knew how to behave during the ceremony and the reception. My father even signed one of the cafeteria chairs, apparently a tradition at the museum:

It was a real treat to arrive in a limo with dozens of cheering children peering through the windows and waiting to see who got out. My father has even met his adoring fans outside the hotel:

The whole experience is rather surreal. The next day we had a reception for all the science (physics, chemistry, and economics) laureates and their families at the Royal Academy of Sciences. Besides having the chance to get our whole family together in one place, it was interesting to talk with other families in the same boat. There were a few amusing moments when we were standing with Bob Grubbs and his children, who are all above 6'3" tall (including his daughter Katy), as my father, brother, and I are all 6'3" or taller. The seven of us are able to make almost anyone feel incredibly short.

Today was the day of lectures. Quite apart from my being hung over from some ill-advised late night wine via room service, the lectures were definitely intended for colleagues, not family members. I'd like to say that I understood my father's lecture, but when you can't pronounce half the words coming out of his mouth, it makes it rather difficult to keep up. At least there were some molecular diagrams that I could pretend to understand, though even those were quite a bit more complex than the ones I learned in high school chemistry.

Of course, Stockholm is a beautiful city. Lots of time is spent walking around the streets and poking our heads in the little shops. Tomorrow we'll try to hit up a few more museums in the little spare time we have.

Monday Dec 05, 2005

Nobel day 2

I'm off on vacation for a while as I attend the Nobel festivities. While this is my second trip to Stockholm, it will be rather different as a Nobel guest staying in a 5-star hotel. Last time I was here, I was a poor, recently graduated college student at the end of his backpacking trip. While pizza and kebabs were nice, I think I'll get a better taste of Swedish food this time around.

I came rather early with my parents, thanks to a convenient business trip to the east coast. So while today is my second day in Stockholm, nothing really starts until tomorrow (when there is a tour of the Nobel museum). Eventually, we'll end up with 28 family and friends here when all is said and done. But arriving with my parents did have its advantages. We got to hang out in a private room in the SAS business lounge during our layover in Newark. When we arrived in Stockholm, we were whisked away from the gate via limousine to a private VIP lounge with a representative of the Royal Academy of Sciences and my father's personal handler. We handed over our passports and baggage tags, and half an hour later we were climbing back into the limo, pre-packed with all our bags.

I haven't done much as of yet, but I did walk around Skansen with my mother, aunt and uncle. This is probably the last time I'll have a stretch of free time with my parents, as starting tomorrow my father is busy with receptions, meetings, rehearsals, and interviews. I'll leave you with a picture from the top of Skansen, taken at 2:30 PM - we don't get much light here this time of year:

I'm sure I'll have more interesting things to report in the coming days. Be sure to check out videos from last year to get an idea of what I'm in for.

Monday Nov 28, 2005

Pool discovery and 'zpool import'

In the later months of ZFS development, the mechanism used to open and import pools was drastically changed. The reasons behind this change make an interesting case study in complexity management and how a few careful observations can make all the difference.

The original spa_open()

Skipping past some of the very early prototypes, we'll start with the scheme that Jeff implemented back in 2003. The scheme had a lofty goal:

Be able to open a pool from an arbitrary subset of devices within the pool.

Of course, given limited label space, it was not possible to guarantee this entirely, but we created a scheme which would work in all but the most pessimistic scenarios. Basically, each device was part of a strongly connected graph made up of each toplevel vdev. The graph began with a circle of all the toplevel vdevs. Each device had its parent vdev config, the vdev configs of the nearest neighbors, and up to 5 other vdev configs for good measure. When we went to open the pool, we read in this graph and constructed the complete vdev tree, which we then used to open the pool.

You can see the original 'big theory' comment from the top of vdev_graph.c here.

First signs of problems

The simplest problem to notice was that these vdev configs were stored in a special one-off ASCII representation that required a special parser in the kernel. In a world where we were rapidly transitioning to nvlists, this wasn't really a good idea.

The first signs of real problems came when I tried to implement the import functionality. For import, we needed the ability to know whether a pool could be imported without actually opening it. This had previously existed in a rather disgusting form: linking to libzpool (the userland port of the SPA and DMU code) and hijacking those functions to construct a vdev tree in userland. In the new userland model, it wasn't acceptable to use libzpool in this manner. So I had to construct spa_open_by_dev(), which parsed the configuration into a vdev tree and tried to open the vdevs, but didn't actually load the pool metadata.

This required a fair amount of hackery, but it was nowhere near as bad as when I had to make the resulting system tolerant of all faults. For both 'zpool import' and 'zpool status', it wasn't enough just to know that a pool couldn't be opened. We needed to know why, and exactly which devices were at fault. While the original scheme worked well when a single device was missing in a toplevel vdev, it failed miserably when an entire toplevel vdev was missing. In particular, it relied on the fact that it had at least a complete ring of toplevel vdevs to work with. For example, if you were missing a single device in an unmirrored pool, there was no way to know what device was missing, because the entire graph parsing algorithm would break down. So I went in and hacked on the code to understand multiple "versions" of a vdev config. If we had two neighbors referring to a missing toplevel vdev, we could surmise its config even though all its devices were absent.

At this point, things were already getting brittle. The code was enormously complex, and hard to change in any well-defined way. On top of that, things got even worse when our test guys started getting really creative. In particular, if you disconnected a device and then exported a pool, or created a new pool over parts of an exported pool, things would get ugly fast. The labels on the disks would be technically valid, but semantically invalid. I had to make even more changes to the algorithms to accommodate all these edge cases. Eventually, it got to the point where every change I made was prefixed by /* XXX kludge */. Jeff and I decided something needed to be done.

On top of all this, the system still had a single point of failure. Since we didn't want to scan every attached device on boot, we kept an /etc/system file around that described the first device in the pool as a 'hint' for where to get started. If this file was corrupt, or that particular toplevel vdev was not available, the pool configuration could not be determined.

vdev graph 2.0

At this point we had a complex, brittle, and functionally insufficient system for discovering pools. As Jeff, Bill, and I talked about this problem for a while, we made two key observations:

  • The kernel didn't have to parse the graph. We already had the case where we were dependent on a cached file for opening our pool, so why not keep the whole configuration there? The kernel can do (relatively) atomic updates to this file on configuration changes, and then open the resulting vdev tree without having to construct it from on-disk data.
  • During import, we already need to check all devices in the system. We don't have to worry about 'discovering' other toplevel vdevs, because we know that we will, by definition, look at the device during the discovery phase.

With these two observations under our belt, we knew what we had to do. For both open and import, the SPA would know only how to take a fully constructed config and parse it into a working vdev tree. Whether that config came from the new /etc/zfs/zpool.cache file, or whether it was constructed during a 'zpool import' scan, it didn't matter. The code would be exactly the same. And best of all, no complicated discovery phase - the config was taken at face value [1]. And the config was simply an nvlist. No more special parsing of custom 'vdev spec' strings.
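If you're curious what the cached config actually contains, zdb(1M) can dump it in nvlist form; something like the following should show it (the exact option is an assumption and may vary by build):

# zdb -C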

So how does import work?

The config cache was all well and good, but how would import work? And how would it fare in the face of total device failure? You can find all of the logic for 'zpool import' in libzfs_import.c.

Each device keeps the complete config for its toplevel vdev. No neighbors, no graph, no nothing. During pool discovery, we keep track of all toplevel vdevs for a given pool that we find during a scan. Using some simple heuristics, we construct the 'best' version of the pool's config, and even go through and update the path and devid information based on the unique GUID for each device. Once that's done, we have the same nvlist we would have as if we had read it from zpool.cache. The kernel happily goes off and tries to open the vdevs (if this is just a scan) or open it for real (if we're doing the import).
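From the administrator's side, all of this sits behind 'zpool import': run with no arguments it scans and lists importable pools, '-d' points it at an alternate device directory, and naming a pool (or its numeric id) performs the actual import. A quick sketch (the directory is just an example):

# zpool import
# zpool import -d /mydevices
# zpool import tank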

So what about missing toplevel vdevs? In the online (cached) state, we'll have the complete config and tell you which device is missing. For the import case, we'll be a little worse off because we'll never see any vdevs indicating that there is another toplevel vdev in the pool. The most important thing is that we're able to detect this case and error out properly. To do this, we have a 'vdev guid sum' stored in the uberblock that indicates the sum of all the vdev guids for every device in the config. If this doesn't match, we know that we have missed a toplevel vdev somewhere. Unfortunately, we can't tell you what device it is. In the future, we hope to improve this by adding the concept of 'neighbor lists' - arbitrary lists of devices without entire vdev configs. Unlike the previous incarnation of vdev graph, these will be purely suggestions, and not actually be required for correctness. There will, of course, be cases where we can never provide you enough information about all your neighbors, such as plugging in a single disk from a thousand disk unreplicated pool.

Conclusions

So what did we learn from this? The main thing is that phrasing the problem slightly differently can cause you to overdesign a system beyond the point of maintainability. As Bryan is fond of saying, one of the most difficult parts of solving hard problems is not just working within a highly constrained environment, but identifying which constraints aren't needed at all. By realizing that opening a pool was a very different operation than discovering a pool, we were able to redesign our interfaces into a much more robust and maintainable state, with more features to boot.

It's no surprise that there were more than a few putbacks like "SPA 2.0", "DMU 2.0", or "ZIL 3.0". Sometimes you just need to take a hammer to the foundation to incorporate all that you've learned from years of using the previous version.


[1] Actually, Jeff made this a little more robust by adding the config as part of the MOS (meta objset), which is stored transactionally with the rest of the pool data. So even if we added two devices, but the /etc cache file didn't get updated correctly, we'll still be able to open the MOS config and realize that there are two new devices to be had.

Monday Nov 21, 2005

ZFS and FMA

In this post I'll describe the interactions between ZFS and FMA (Fault Management Architecture). I'll cover the support that's present today, as well as what we're working on and where we're headed.

ZFS Today (phase zero)

The FMA support in ZFS today is what we like to call "phase zero". It's basically the minimal amount of integration needed in order to leverage (arguably) the most useful feature in FMA: knowledge articles. One of the key FMA concepts is to present faults in a readable, consistent manner across all subsystems. Error messages are human readable, contain precise descriptions of what's wrong and how to fix it, and point the user to a website that can be updated more frequently with the latest information.

ZFS makes use of this strategy in the 'zpool status' command:

$ zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s0  ONLINE       0     0     3
            c0d0s0  ONLINE       0     0     0

In this example, one of our disks has experienced multiple checksum errors (because I dd(1)ed over most of the disk), but they were automatically corrected thanks to the mirrored configuration. The error message describes exactly what happened (we tried to self-heal the data from the other side of the mirror) and the impact (none - applications are unaffected). It also directs the user to the appropriate repair procedure, which is either to clear the errors (if they are not indicative of a hardware fault) or replace the device.

It's worth noting here the ambiguity of the fault. We don't actually know if the errors are due to bad hardware, transient phenomena, or administrator error. More on this later.

ZFS tomorrow (phase one)

If we look at the implementation of 'zpool status', we see that there is no actual interaction with the FMA subsystem apart from the link to the knowledge article. There are no generated events or faults, and no diagnosis engine or agent to subscribe to the events. The implementation is entirely static, and contained within libzfs_status.c.

This is obviously not an ideal solution. It doesn't give the administrator any notification of errors as they occur, nor does it allow them to leverage other pieces of the FMA framework (such as upcoming SNMP trap support). This is going to be addressed in the near term by the first "real" phase of FMA integration. The goal of this phase, in addition to a number of other fault capabilities under the hood, is the following:

  • Event Generation - I/O errors and vdev transitions will result in true FMA ereports. These will be generated by the SPA and fed through the FMA framework for further analysis.
  • Simple Diagnosis Engine - An extremely dumb diagnosis engine will be provided to consume these ereports. It will not perform any sort of predictive analysis, but will be able to keep track of whether these errors have been seen before, and pass them off to the appropriate agent.
  • Syslog agent - The results from the DE will be fed to an agent that simply forwards the faults to syslog for the administrator's benefit. This will give the same messages as seen in 'zpool status' (with slightly less information), synchronous with the fault event. Future work to generate SNMP traps will allow the administrator to be emailed, paged, or to implement a poor man's hot spare system. (A sketch of the corresponding FMA commands follows this list.)
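Once these ereports exist, the standard FMA tooling applies to ZFS just like any other subsystem: 'fmdump -e' summarizes the incoming error telemetry, 'fmdump -eV' shows each ereport in detail, and 'fmadm faulty' lists whatever faults have been diagnosed. A rough sketch (the ZFS ereport classes themselves aren't final):

# fmdump -e
# fmdump -eV
# fmadm faulty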

ZFS in the future (phase next)

Where we go from here is rather an open road. The careful observer will notice that ZFS never makes any determination that a device is faulted due to errors. If the device fails to reopen after an I/O error, we will mark it as faulted, but this only catches cases where a device has gotten completely out of whack. Even if your device is experiencing a hundred uncorrectable I/O errors per second, ZFS will continue along its merry way, notifying the administrator but otherwise doing nothing. This is not because we don't want to take the device offline; it's just that getting the behavior right is hard.

What we'd like to see is some kind of predictive analysis of the error rates, in an attempt to determine if a device is truly damaged, or whether it was just a random event. The diagnosis engines provided by FMA are designed for exactly this, though the hard part is determining the algorithms for making this distinction. ZFS is both a consumer of FMA faults (in order to take proactive action) as well as a producer of ereports (detecting checksum errors). To be done right, we need to harden all of our I/O drivers to generate proper FMA ereports, implement a generic I/O retire mechanism, and link it in with all the additional data from ZFS. We also want to gather SMART data from the drives to notice correctable errors fixed by the firmware, as well as doing experiments to determine the failure rates and pathologies of common storage drives.

As you can tell, this is not easy. I/O fault diagnosis is much more complicated than CPU and memory diagnosis. There are simply more components, as well as more chances for administrative error. But between FMA and ZFS, we have laid the groundwork for a truly fault-tolerant system capable of predictive self-healing and automated recovery. Once we get past the initial phase above, we'll start to think about this in more detail, and make our ideas public as well.

Thursday Nov 17, 2005

UFS/SVM vs. ZFS: Code Complexity

A lot of comparisons have been done, and will continue to be done, between ZFS and other filesystems. People tend to focus on performance, features, and CLI tools, as they are easier to compare. I thought I'd take a moment to look at the differences in code complexity between UFS and ZFS. It is well known within the kernel group that UFS is about as brittle as code can get. Twenty years of ongoing development, with feature after feature being bolted on, tends to result in a rather complicated system. Even the smallest changes can have wide-ranging effects, resulting in a huge amount of testing and inevitable panics and escalations. And while SVM is considerably newer, it is a huge beast with its own set of problems. Since ZFS is both a volume manager and a filesystem, we can use this script written by Jeff to count the lines of source code in each component. Not a true measure of complexity, but a reasonable approximation to be sure. Running it on the latest version of the gate yields:

-------------------------------------------------
  UFS: kernel= 46806   user= 40147   total= 86953
  SVM: kernel= 75917   user=161984   total=237901
TOTAL: kernel=122723   user=202131   total=324854
-------------------------------------------------
  ZFS: kernel= 50239   user= 21073   total= 71312
-------------------------------------------------

The numbers are rather astounding. Having written most of the ZFS CLI, I found the most horrifying number to be the 162,000 lines of userland code to support SVM. This is more than twice the size of all the ZFS code (kernel and user) put together! And in the end, ZFS is about 1/5th the size of UFS and SVM. I wonder what those ZFS numbers will look like in 20 years...
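Jeff's actual script isn't reproduced here, but the idea is easy to approximate with a few lines of shell (a rough sketch; the source paths and patterns are assumptions, and the real script is more thorough):

#!/bin/sh
# Rough line count for one component, split into kernel and userland pieces.
# The paths and the 'ufs' pattern below are illustrative only.
kernel=`find usr/src/uts -name '*.[ch]' | grep ufs | xargs cat | wc -l`
user=`find usr/src/cmd usr/src/lib -name '*.[ch]' | grep ufs | xargs cat | wc -l`
echo "  UFS: kernel=$kernel user=$user total=`expr $kernel + $user`"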

Wednesday Nov 16, 2005

Principles of the ZFS CLI

Well, I'm back. I've been holding off blogging for a while due to ZFS. Now that it's been released, I've got tons of stuff lined up to talk about in the coming weeks.

I first started working on ZFS about nine months ago, and my primary task from the beginning was to redesign the CLI for managing storage pools and filesystems. The existing CLI at the time had evolved rather organically over the previous 3 years, each successive feature providing some immediate benefit but lacking in long-term or overarching design goals. To be fair, it was entirely capable in the job it was intended to do - it just needed to be rethought in the larger scheme of things.

I have some plans for detailed blog posts about some of the features, but I thought I'd make the first post a little more general and describe how I approached the CLI design, and some of the major principles behind it.

Simple but powerful

One of the hardest parts of designing an effective CLI is to make it simple enough for new users to understand, but powerful enough so that veterans can tweak everything they need to. With that in mind, we adopted a common design philosophy:

"Simple enough for 90% of the users to understand, powerful enough for the other 10% to use

A good example of this philosophy is the 'zfs list' command. I plan to delve into some of the history behind its development at a later point, but you can quickly see the difference between the two audiences. Most users will just use 'zfs list':

$ zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
tank                  55.5K  73.9G   9.5K  /tank
tank/bar                 8K  73.9G     8K  /tank/bar
tank/foo                 8K  73.9G     8K  /tank/foo

But a closer look at the usage reveals a lot more power under the hood:

        list [-rH] [-o property[,property]...] [-t type[,type]...]
            [filesystem|volume|snapshot] ...

In particular, you can ask questions like 'what is the amount of space used by all snapshots under tank/home?' We made sure that sufficient options existed so that power users could script whatever custom tools they wanted.
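For instance, the snapshot question above can be answered directly from the command line; a quick sketch using the -r, -t, and -o options from the usage:

$ zfs list -r -t snapshot -o name,used tank/home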

Solution driven error messages

Having good error messages is a requirement for any reasonably complicated system. The Solaris Fault Management Architecture has proven that users understand and appreciate error messages that tell you exactly what is wrong in plain English, along with how it can be fixed.

A great example of this is the 'zpool status' output. Once again, I'll go into more detail about the FMA integration in a future post, but you can quickly see how even basic integration allows the user to get meaningful diagnostics on their pool:

$ zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s0  ONLINE       0     0     3
            c0d0s0  ONLINE       0     0     0

Consistent command syntax

When it comes to command line syntax, everyone seems to have a different idea of what makes the most sense. When we started redesigning the CLI, we took a look at a bunch of other tools in Solaris, focusing on some of the more recent ones which had undergone a more rigorous design. In the end, our primary source of inspiration was the SMF (Service Management Facility) commands. To that end, every zfs(1M) and zpool(1M) command has the following syntax:

<command> <verb> <options> <noun> ...

There are no "required options". We tried to avoid positional parameters at all costs, but there are certain subcommands (zfs get, zfs get, zfs clone, zpool replace, etc) that fundamentally require multiple operands. In these cases, we try to direct the user with informative error messages indicating that they may have forgotten a parameter:

# zpool create c1d0 c0d0
cannot create 'c1d0': pool name is reserved
pool name may have been omitted

If you mistype something and find that the error message is confusing, please let us know - we take error messages very seriously. We've already had some feedback for certain commands (such as 'zfs clone') that we're working on.

Modular interface design

On a source level, the initial code had some serious issues around interface boundaries. The problem was that the user/kernel interface is managed through ioctl(2) calls to /dev/zfs. While this is a perfectly fine solution, we wound up with multiple consumers all issuing these ioctl() calls directly, making it very difficult to evolve the interface cleanly. Since we knew that we were going to have multiple userland consumers (zpool and zfs), it made much more sense to construct a library (libzfs) which was responsible for managing this interface and presenting a unified, object-based access method for consumers. This allowed us to centralize logic in one place, and the commands themselves became little more than glorified argument parsers around this library.

Wednesday Oct 05, 2005

To my father

ZFS work and RSI issues have prevented me from making regular postings to this blog, but I do want to make a brief personal entry in honor of my father:

Nobel Prize Announcement
AP article
For the scientifically inclined

I've been so proud of you for so long, and now I think the rest of the world finally understands how I feel.

Monday Sep 12, 2005

First sponsored bugfix

Yes, I am still here. And yes, I'm still working on ZFS as fast as I can. But I do have a small amount of free time, and managed to pitch in with some of the OpenSolaris bug sponsor efforts, putting back two basic code cleanup fixes into the Nevada gate. Nothing spectacular, but worthy of a proof of concept, and it adds another name to the list of contributors who have had fixes putback into Nevada. Next week I'll try to grab one of the remaining bugfixes to lend a hand. Maybe someday I'll have enough time to blog for real, but don't expect much until ZFS is back in the gate.

Also, check out the Nevada putback logs for build 22. Very cool stuff - kudos to Steve and the rest of the OpenSolaris team. Pay attention to the fixes contributed by Shawn Walker and Jeremy Teo - it's nice to see active work being done, despite the fact that we still have so much work left to do in building an effective community.


Thursday Aug 11, 2005

Fame, Glory, and a free iPod shuffle

Thanks to Jarod Jenson (of DTrace and Aeysis fame), I now have a shiny new 512MB iPod shuffle to give away to a worthy OpenSolaris cause. Back when I posted my original MDB challenge, I had no cool stuff to entice potential suitors. So now I'll offer this iPod shuffle to the first person who submits an acceptable solution to the problem and follows through to integrate the code into OpenSolaris (I will sponsor any such RFE). Send your diffs against the latest OpenSolaris source to me at eric dot schrock at sun dot com. We'll put a time limit of, say, a month and a half (until 10/1) so that I can safely recycle the iPod shuffle into another challenge should no one respond.

Once again, the original challenge is here.

So besides the fame and glory of integrating the first non-bite size RFE into OpenSolaris, you'll also walk away with a cool toy. Not to mention all the MDB knowledge you'll have under your belt. Feel free to email me questions, or head over to the mdb-discuss forum. Good Luck!


About

Musings about Fishworks, Operating Systems, and the software that runs on them.
