Simplified Pseudo-Filesystem Implementation

A common complaint we hear from Linux users who try Solaris for the first time is that our /proc sucks. "Sucks" in this case usually refers to either the fact that you can't cat files under /proc and get text out, or the fact that we don't have things like /proc/pci. The former is a flamewar for another day; right now I want to drill down on the latter.

The idea behind the Solaris /proc is simple: export information about the process model (hence its name). We try not to put other stuff there. This raises the question: "Where does other stuff go?" Some functionality is available from libraries or devices; other functionality might not be available at all. But some functionality may be best suited for a hierarchical namespace like... a filesystem.

To satisfy the need for pseudo-filesystem interfaces and the desire for consistent nomenclature, we've introduced a /system directory in Solaris 10. /system is intended to contain mount points for filesystems which export non-process system information. Under /system we currently mount two new filesystems: the contract filesystem (ctfs(7FS), at /system/contract) and the kernel object filesystem (objfs(7FS), at /system/object).

Now that we have a home for this kind of thing, and now that OpenSolaris is open for business, there are many opportunities for the industrious to fortify Solaris with the filesystem conveniences they desire. Of course, just because there's an opportunity doesn't mean it's easy. There's a lot of work involved in interfacing with the parts of the kernel you care about, and a lot of work involved in writing the FS glue. The latter is actually the insidious part; if you've spent any time perusing the source under usr/src/uts/common/fs, you've probably noticed that most of the filesystems are copy-and-pasted from each other and often reimplement a lot of complex algorithms. I consider this a badge of shame, but the practice dates back to the earliest days of the OS (in other words, there's no one left to pin this badge on).

One of the projects I worked on in Solaris 10 was SMF, for which I was responsible for creating a process tracking mechanism. The result was process contracts, and we decided that the most appropriate interface for process contracts was a filesystem.[1]

Having spent a lot of time reading filesystem code, I was familiar with the mistakes of the past and for the sanity of my successors was determined not to repeat them. To make a long, uninteresting story short, I ultimately created a library of abstractions called gfs which I used when implementing ctfs. With these abstractions I was able to tuck away a lot of the complexity of implementing common fs entry points such as VOP_READDIR, leaving behind a relatively simple filesystem-specific implementation. When Eric later wrote the objfs filesystem, he spent a lot of time refining the gfs interfaces to be even more streamlined.

The end result is that it is now pretty darn easy to create pseudo filesystems on Solaris. For a rough before/after comparison, check out this behemoth, weighing in at over 70 lines of grungy C code. I hope we never have to write something like it again. The equivalent from ctfs is much cleaner:

static int
ctfs_tdir_do_readdir(vnode_t *vp, struct dirent64 *dp, int *eofp,
    offset_t *offp, offset_t *nextp, void *data)
{
        uint64_t zuniqid;
        ctid_t next;
        ct_type_t *ty = ct_types[gfs_file_index(vp)];

        zuniqid = VTOZ(vp)->zone_uniqid;
        next = contract_type_lookup(ty, zuniqid, *offp);

        if (next == -1) {
                *eofp = 1;
                return (0);
        }

        dp->d_ino = CTFS_INO_CT_DIR(next);
        numtos(next, dp->d_name);
        *offp = next;
        *nextp = next + 1;

        return (0);
}

As you can see, there's little boilerplate. In fact, the function's body is almost completely specific to reading a ctfs template directory.[2]
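To make the division of labor concrete, here is a minimal user-space sketch of the pattern: a generic driver owns the loop, the buffer bounds, and the end-of-directory handling, and the filesystem supplies only a callback with the same general shape as the function above. This is not the gfs code. Every name in it (demo_dirent_t, demo_readdir, demo_idset_cb, and so on) is made up for illustration, and the real kernel interfaces deal with vnodes, uio structures, and locking that are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical user-space stand-ins; none of these names come from gfs. */
#define DEMO_NAMELEN 32

typedef struct demo_dirent {
        unsigned long d_ino;
        char d_name[DEMO_NAMELEN];
} demo_dirent_t;

/*
 * Filesystem-specific callback: find the first entry at or after *offp.
 * On success, set *offp to that entry's offset and *nextp to where the
 * following search should begin. Set *eofp when no entries remain.
 */
typedef int (*demo_readdir_cb_t)(demo_dirent_t *dp, int *eofp,
    long *offp, long *nextp, void *data);

/*
 * Generic driver: fills buf with up to cap entries starting at *offp.
 * Returns the number of entries stored, or -1 if the callback fails.
 * On return *offp holds the offset at which a later call should resume.
 */
static int
demo_readdir(demo_dirent_t *buf, size_t cap, long *offp, int *eofp,
    demo_readdir_cb_t cb, void *data)
{
        size_t n = 0;
        long off = *offp;

        assert(buf != NULL && cap > 0);
        *eofp = 0;
        while (n < cap) {
                long next = off;

                if (cb(&buf[n], eofp, &off, &next, data) != 0)
                        return (-1);
                if (*eofp)
                        break;
                off = next;
                n++;
        }
        *offp = off;
        return ((int)n);
}

/* Example callback state: a sparse, sorted set of numeric IDs. */
typedef struct demo_idset {
        const long *ids;
        size_t count;
} demo_idset_t;

static int
demo_idset_cb(demo_dirent_t *dp, int *eofp, long *offp, long *nextp,
    void *data)
{
        const demo_idset_t *set = data;
        size_t i;

        for (i = 0; i < set->count; i++) {
                if (set->ids[i] >= *offp) {
                        dp->d_ino = (unsigned long)set->ids[i];
                        (void) snprintf(dp->d_name, sizeof (dp->d_name),
                            "%ld", set->ids[i]);
                        *offp = set->ids[i];
                        *nextp = set->ids[i] + 1;
                        return (0);
                }
        }
        *eofp = 1;
        return (0);
}
```

Calling demo_readdir with a two-entry buffer over the IDs 3, 7, and 42 would return the first two entries and leave the offset just past 7; a second call would return 42 and report end-of-directory. The callback never sees the buffer size or the loop, which is the point of the split.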

The next step is obvious: we need to take some time to refactor existing filesystems to use these interfaces wherever possible. As part of my initial putback I also reimplemented parts[3] of /proc to use them.[4] A more thorough eye needs to be turned to /proc to finish the job (and perhaps elevate its low-level gfs usage to complete gfs management[5]), and things like fdfs have little reason to be ignored.

What have we learned here?

  • /system is our home for new pseudo filesystems.
  • Copying old code is a bad idea.
  • Factoring code is a good idea (and fun, because you usually get to delete the aforementioned old code).
  • ctfs and objfs are good examples to follow when writing your spiffy new pseudo filesystem.
  • I like footnotes.

We've made other refinements to our file system implementation in Solaris 10. Things like the new file system interfaces (which replace the crusty, fixed-length vnode and vfs definitions with an opaque structure initialized with a variable length parameter list) have done much to improve the sanity of filesystem developers. I encourage you to explore our filesystem code; perhaps you'll find a new class of improvements we should make (or even make them yourself!).
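For readers who haven't seen the "opaque structure initialized from a variable-length list" idea before, here is a rough user-space sketch of it. It is an illustration under assumptions, not the Solaris interface: the names demo_ops_t, demo_opdef_t, demo_make_ops, and the three operations are all hypothetical. The filesystem lists only the operations it implements, by name, and a constructor fills in defaults for the rest, so adding a new operation to the table doesn't break existing filesystems.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operation signature and slots; not the Solaris definitions. */
typedef int (*demo_op_t)(void *arg);

enum { DEMO_OP_OPEN, DEMO_OP_READ, DEMO_OP_READDIR, DEMO_OP_COUNT };

static const char *const demo_op_names[DEMO_OP_COUNT] = {
        "open", "read", "readdir"
};

/* Treated as opaque by filesystems: only demo_make_ops() fills it in. */
typedef struct demo_ops {
        demo_op_t op[DEMO_OP_COUNT];
} demo_ops_t;

/* One template entry; a list of these ends with a NULL name. */
typedef struct demo_opdef {
        const char *name;
        demo_op_t func;
} demo_opdef_t;

/* Default for any operation the filesystem does not list. */
static int
demo_nosys(void *arg)
{
        (void) arg;
        return (-1);
}

/* Sample implementation a filesystem might supply. */
static int
demo_open(void *arg)
{
        (void) arg;
        return (0);
}

/*
 * Build an ops table from a NULL-terminated template list.
 * Returns 0 on success, or -1 if the template names an unknown operation.
 */
static int
demo_make_ops(const demo_opdef_t *templ, demo_ops_t *ops)
{
        size_t i;

        assert(templ != NULL && ops != NULL);
        for (i = 0; i < DEMO_OP_COUNT; i++)
                ops->op[i] = demo_nosys;

        for (; templ->name != NULL; templ++) {
                for (i = 0; i < DEMO_OP_COUNT; i++) {
                        if (strcmp(templ->name, demo_op_names[i]) == 0)
                                break;
                }
                if (i == DEMO_OP_COUNT)
                        return (-1);
                ops->op[i] = templ->func;
        }
        return (0);
}
```

A filesystem that only implements "open" would pass a two-entry template (its open routine plus the NULL terminator) and get demo_nosys for everything else. Matching by name rather than by position is what lets the table grow without recompiling every filesystem against a new fixed layout.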


[1] I'm not sure this was the wisest choice, but it seems to have worked out all right.

[2] "But what is a template directory?" you may be asking yourself. I'll elaborate on contracts another day.

[3] At the time I was already modifying a bajillion popular kernel files, and wasn't eager to add more to the list. To avoid unnecessarily complicating my merge process, I only rewrote those few functions which didn't require coordinated changes in other files.

[4] I had already added a bunch of code for some contract-related entries in /proc, and wanted to "restore balance to the force" by removing an equivalent amount of code. Perhaps surprisingly, we tend to get more excited about an opportunity to delete code than an opportunity to add more...

[5] See the comment at the top of gfs.c.


Thanks Dave; for this good Articel. It answered a lot questions I asked my self. PS: In this line is a simplif yed typo :-)

Posted by Fabian Otto on February 07, 2006 at 05:58 PM PST #

Oops, I'd better go file a bug. Thanks, Fabian.

Posted by Dave on February 07, 2006 at 06:57 PM PST #
