Sunday Dec 27, 2009

Five cscope Tips

As software becomes increasingly complex and codebases continue to sprawl, source code cross-reference tools have become a critical component of a software engineer's toolbox. Indeed, since most of us are tasked with enhancing an existing codebase (rather than writing from scratch), proficiency in use of a cross-reference tool can mean the difference between understanding the subtleties of a subsystem in an afternoon and spending weeks battling "unforeseen" complications.

At Sun, we primarily use a tweaked version of the venerable cscope utility, which has origins going back to AT&T in the 1980s and is now freely available as open source. As with many UNIX utilities, despite its age it has remained popular because of its efficiency and flexibility, which are especially important when understanding (and optionally modifying) source trees with several million lines of code.

Despite cscope's importance and popularity, I've been surprised to discover that few are familiar with anything beyond the basics. As such, in the interest of increasing cscope proficiency, here's my list of five features every cscope user should know:

  1. Display more than 9 search results per page with -r.

    Back in the 1980s the default behavior may have made sense, but with modern xterms often configured to have 50-70 rows the default is simply inefficient and tedious. By passing the -r option to cscope at startup (or including -r in the CSCOPEOPTIONS environment variable), cscope will display as many search results as will fit. The only caveat is that selecting an entry from the results must include explicitly pressing return (e.g., "3 [return]" instead of "3") so that entries greater than 9 can be selected. I find this tradeoff more than acceptable. (Apparently, the current open-source version of cscope uses letters to represent search results beyond 9 and thus does not require -r.)

  2. Display more pathname components in search results with -pN.

    By default, cscope only displays the basename of a given matching file. In large codebases, files in different parts of the source tree can often have the same name (consider main.c), which makes for confusing search results. By passing the -pN option to cscope at startup (or including -pN in the CSCOPEOPTIONS environment variable) -- where N is the number of pathname components to display -- this confusion can be eliminated. I've generally found -p4 to be a good middle-ground. Note that -p0 will cause pathnames to be omitted entirely from search results, which can also be useful for certain specialized queries.

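
    Both options can be made permanent via the environment; a minimal sketch (sh/ksh/bash syntax, values to taste):

```shell
# Make -r and -p4 the defaults for every cscope session by placing
# them in CSCOPEOPTIONS, which cscope reads at startup.
CSCOPEOPTIONS="-r -p4"
export CSCOPEOPTIONS
```
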
  3. Use regular expressions when searching.

    While it is clear that one can enter a regexp when using "Find this egrep pattern", it's less apparent that almost all search fields will accept regexps. For instance, to find all definitions starting with ipmp_ and ending with ill, just specify ipmp_.*ill to "Find this definition". In addition to allowing groups of related functions to be quickly found, I find this feature quite useful when I cannot remember the exact name of a given symbol but can recall specific parts of its name. Note that this feature is not limited to symbols -- e.g., passing .*ipmp.* to "Find files #including this file" returns all files in the cscope database that #include a file with ipmp somewhere in its name.

  4. Use filtering to refine previous searches.

    cscope provides several mechanisms for refining searches. The most powerful is the ability to filter previous searches through an arbitrary shell command via ^. For instance, suppose you want to find all calls to GLDv3 functions (which all start with mac_) from the nge driver (which has a set of source files starting with nge). You might first specify a search pattern of mac_.* to "Find functions calling this function". With ON's cscope database, this returns a daunting 2400 matches; filtering with "^grep common/io/nge" quickly pares the results down to the 12 calls that exist within the nge driver. Note that this can be repeated any number of times -- e.g., "^sort -k2" alphabetizes the remaining search results by calling function.

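
    For those who prefer scripting to interactive use, the same refinement can be sketched with cscope's line-oriented mode. This is only a sketch: it assumes a prebuilt cscope database in the current directory, the function name is illustrative, and -L3 is the line-mode equivalent of "Find functions calling this function".

```shell
# Non-interactive approximation of the ^grep / ^sort refinement above.
# Each line of -L output is "file function line text", so standard
# filters apply directly.
nge_mac_calls() {
        cscope -dq -L3 'mac_.*' |
            grep common/io/nge |      # keep only matches in the nge driver
            sort -k2                  # alphabetize by calling function
}
```
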
  5. Use the built-in history mechanisms.

    You can quickly restore previous search queries by using ^b (control-b); ^f will move forward through the history. This feature is especially useful when performing depth-first exploration of a given function hierarchy. You can also use ^a to replay the most recent search pattern (e.g., in a different search field), and the > and < commands to save and restore the results of a given search. Thus, you could save search results prior to refining them using ^ (as per the previous tip) and restore them later, or restore results from a past cscope session.

Of course, this is just my top-five list -- there are many other powerful features, such as the ability to make changes en masse, build custom cscope databases using the xref utility, embed command-line mode in scripts (mentioned in a previous blog entry), and employ numerous extensions that provide seamless interaction with popular editors such as XEmacs and vim. Along these lines, I'm eager to hear from others who have found ways to improve their productivity with this exceptional utility.

Sunday May 31, 2009

Clearview IPMP in Production

When I was first getting obsessed with programming in my early teens, I recall waking up on many a Saturday to the gleeful realization I had the whole day to improve some crazy piece of home-grown software. Back then, the excitement was simply in the journey itself -- I was completely content in being the entire userbase.

Of course I still love writing software (though the idea of being able to devote a whole day to it seems quaint) -- but it pales in comparison to the thrill of knowing that real people are using that software to solve their real problems. Unfortunately, with enterprise-class products such as Solaris, release schedules have historically meant that a completed project may have to wait years until it gets to solve its first real-world problem. By then, several other projects may have run their course and I'm invariably under another one's spell and not in the right frame of mind to even reminisce, let alone rejoice.

Thankfully, times have changed. First, courtesy of OpenSolaris's ipkg /dev repository, only a few weeks after January's integration, Clearview IPMP was available for bleeding-edge customers to experiment with (and based on feedback I've received, quite a few have successfully done so). Second, for the vast majority who need a supported release, Clearview IPMP can now be found in the brand-new OpenSolaris 2009.06 release. Third, thanks to the clustering team, Clearview IPMP also works with the current version of OpenSolaris Open HA Cluster.

Further, there is one little-known but immensely important release vehicle for Clearview IPMP: the Sun Storage 7000 Q2 release. Indeed, in the months since the integration of Clearview IPMP, I've partnered with the Fishworks team on getting all of the latest and greatest networking technologies from OpenSolaris into the Sun Storage 7000 appliances. As such, the Q2 release contains all of the Solaris networking projects delivered up to OpenSolaris build 106 (most notably Volo and Crossbow), plus Clearview IPMP from build 107. Of course, these projects also open up a range of new opportunities for the appliance -- especially around Networking QoS and simplified HA configuration -- which will find their way into subsequent quarterly releases.

Needless to say, all of this is immensely satisfying for me personally -- especially the idea that some of our most demanding enterprise customers are relying on Clearview IPMP to ensure their mission-critical storage remains available when networking hardware or upstream switches fail. As per my blog entry announcing Clearview IPMP in OpenSolaris, it's clear I'm a proud parent, but given the thrashing we've given it internally and its track-record thus far with customers, I'm confident it's ready for prime time.

For those exploring IPMP for the first time, Xiang Zhou (the co-author of its extensive test suite) has put together a great blog entry, including step-by-step instructions. Additionally, Raoul Carag and I extensively revised the IPMP administrative overview and IPMP tasks guide.

Those familiar with Solaris 10 IPMP may wish to check out a short slide deck that highlights the core differences and new utilities (if nothing else, I'd recommend scanning slides 12-21).

Have fun -- and of course, I (and the rest of the Clearview team) am eager to hear how it stacks up against your real-world networking high-availability problems!

Tuesday May 12, 2009

Hunting Cruft

It's no secret that I am borderline-O.C.D. in many aspects of my life -- and especially so when it comes to developing software. However, large-scale software development is inherently a messy process, and even with the most disciplined engineering practices, remnants from aborted or bygone designs often remain, lying in wait to confuse future developers.

Thankfully, many of the more obvious remnants can be identified with automated programs. For instance, the venerable lint utility can identify unused functions within an application. Many moons ago, I applied a similar concept to the OS/Net nightly build process with a utility called findunref that allows us to automatically identify files in the source tree that are not used during a build. (Frighteningly, it also identified 1100 unreferenced files in the sourcebase. That is, roughly 4% of the files we were dutifully maintaining had no bearing whatsoever on our final product. Of course, some of these should have been used, such as disconnected localization files and packaging scripts.)

Cruft-wise, Clearview IPMP posed a particular challenge: the old IPMP implementation was peanut-buttered through 135,000 lines of code in the TCP/IP stack, and I was determined to leave no trace of it behind. As such, over time I amassed a collection of programs, run as cron jobs, that mined the sourcebase for possible vestiges (note that this was an ongoing task because the sourcebase Clearview IPMP replaced was still undergoing change to address critical customer needs). Some of these programs were simple (e.g., text-based searches for old IPMP-related abbreviations such as "ill group" and "ifgrp"), but others were a bit more evolved.

For instance, one key problem is the identification of unused functions. As I mentioned earlier, lint can identify unused functions in a program, but for a kernel module like ip things are more complex because other kernel modules may be the lone consumers of symbols provided by it. While it is possible to identify all the dependent modules, build lint libraries for each of them and perform a lint crosscheck across them (and in fact, we do these during the nightly build, though not for unused functions), it is also quite time-consuming and as such a bit heavyweight for my needs.

Thinking about the problem further, another solution emerged: during development, it is customary to maintain a source code cross-reference database, typically built with the classic cscope utility. A little-known aspect of cscope is that it can be scripted. For instance, to find the definition for symbol foo, one can do cscope -dq -L1 foo. Indeed, a common way to check that a symbol is unused is to (interactively) look for all uses of the symbol in cscope. Thus, for a given kernel module, it is straightforward to write a script to find unused functions: use nm(1) to obtain the module's symbol table and then check whether each of those symbols is used via cscope's scripting interface. In fact, that is exactly what my tiny dead-funcs utility does. Clearly, this requires the kernel module to be built from the same source base as the cscope database, and only identifies already-extant cruft (in addition to interfaces that may have consumers outside of the OS/Net source tree), but it nonetheless proved quite useful during development (and has been valuable to others as well).
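
A hypothetical sketch of the approach (the real dead-funcs utility may well differ; the nm output format assumed here is GNU-style "address type name", and Solaris nm prints different columns, so the awk would need adjusting):

```shell
# For each global text symbol in the module, ask cscope's line-oriented
# mode how many places reference it.
dead_funcs() {
        nm "$1" | awk '$2 == "T" { print $3 }' |
        while read -r sym; do
                # -d: use the existing database; -L0: find all uses of sym
                uses=$(cscope -dq -L0 "$sym" | wc -l)
                # a lone match is typically just the definition itself
                [ "$uses" -le 1 ] && echo "$sym"
        done
}
```
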

A similar approach can be followed to ensnare dead declarations, though some creativity may be needed to build the list of function/variable names to feed to cscope: the compiler will have already pruned them out prior to constructing the kernel module, and header files require care to parse properly. I resorted to convincing lint to build a lint library out of the header file in question (via PROTOLIB1), then using lintdump (another utility I contributed to the OS/NET tool chain) to dump out the symbol list -- admittedly a clunky approach, but effective nonetheless.

Unfortunately, scripts such as dead-funcs are too restrictive to become general-purpose tools in our chain, though perhaps you will find them (or their approaches) useful for your own O.C.D. development.

Wednesday Jan 21, 2009

Clearview IPMP in OpenSolaris

At long last, and somewhat belatedly, I'm thrilled to announce that the Clearview IPMP Rearchitecture has integrated into Solaris Nevada (build 107)! Build 107 has just closed internally, so internal WOS images will be available in the next few days (unfortunately, it will likely be a few weeks before the bits are available via OpenSolaris packages). For more on the new administrative experience, please check out the revised IPMP documentation, or Steffen Weiberle's in-depth blog entry. For more on the internal design, there's an extensive high-level design document, internals slides and numerous low-level design discussions in the code itself.

Here, I'd like to get a bit more personal as the designer and developer of Clearview IPMP. The project has been a real labor of love, borne both from the challenges many of Sun's top enterprise customers have faced trying to deploy IPMP, and from the formidable internal effort needed to keep the pre-Clearview IPMP implementation chugging along for the past decade. That is, it became clear that IPMP was simultaneously a critical high-availability technology for our top customers and an increasing cost on both our engineering and support organizations -- we either needed to kill it or fix it. Ever the optimist and buoyed by a growing customer interest in IPMP, I convinced management that I could tackle this work as part of the broader Clearview initiative that Seb and I were in the process of scoping (and moreover, either killing or fixing IPMP was required to meet Clearview's Umbrella Objectives).

From an engineering standpoint, IPMP is a case study in how much it matters to have the right abstractions. Specifically, the old (pre-Clearview) model was a struggle in large part because it introduced a new "group" abstraction to represent the IPMP group as a whole, rather than modeling an IPMP group as an IP interface (more on core network interface abstractions). This meant that every technology that interacted directly with IP interfaces (e.g., routing, filtering, QoS, monitoring, ...) required heaps of special-case code to deal with IPMP, which introduced significant complexity and a neverending stream of corner cases, some of which were unresolvable. It also made certain technologies (e.g., DHCP) downright impossible to implement, because their design was based on assumptions that held in *all* cases other than IPMP (e.g., that a given IP address would not move between IP interfaces). More broadly, with each new networking technology, significant effort was needed to consider how it could be made to work with IPMP, which simply does not scale.

The real tragedy of the old implementation is that the actual semantics -- while often misunderstood by customers and Sun engineers alike -- actually acted as if each IPMP group had an IP interface. For instance, if one placed two IP interfaces into an IPMP group, then added a route over one of those IP interfaces, it was as if a route had been added over the IPMP group. I say "tragedy" because this was wholly unobvious, and thus understandably led to numerous support calls. Similar surprises came from the fact that a packet with a source IP address from one IP interface could be sent out through another IP interface. In short, the implementation had cobbled together various other abstractions to build something that acted mostly like an IPMP group IP interface, but wasn't actually one.

From this one central mistake came a raft of related problems that impacted both the programmatic and administrative models. For instance, in addition to having to teach technologies about IPMP groups, consider what happens when an IP interface fails. In concept, this should be a simple operation: the IP addresses that were mapped to the failed interface's hardware address need to be remapped to the hardware address of a functioning interface in the group. This remapping can occur entirely within IP itself -- applications using those IP addresses should not need to know or care. However, in the old IPMP implementation, this was actually a very disruptive operation: the IP addresses had to be visibly moved from the failed IP interface to a functioning IP interface, confusing applications that either interacted with the IP interface namespace or listened to routing sockets. Moreover, the application had to be specially coded to know that while the IP interface had failed, it should not react to the failure because another IP interface had taken over responsibility. Similar problems abounded in areas both far and near; an interesting recent example is the issue Steffen found with the new defrouter feature and Solaris 10 IPMP. That problem doesn't exist with Clearview IPMP not because we overpowered it with reams of code but simply because the Clearview IPMP design precludes it.

Speaking of "reams of code", one of the aspects I'm most proud of with Clearview IPMP is the size of the codebase. In terms of raw numbers, the kernel implementation has shrunk by more than 35%, from roughly 8500 lines of code to 5500 lines (roughly 1000 lines of that are comments), and the lion's share of that code is isolated behind a simple kernel API of a few dozen functions (in contrast, the old IPMP codebase was sprawling and often written in-line). More importantly, the work needed to integrate the Clearview IPMP code with related technology was minimal: packet monitoring across the group required 15 lines of code; IP filter support required 5 lines of code; dynamic routing required no additional code. The new model also opened up unexpected opportunities, such as allowing the IPSQ framework (the core synchronization framework inside IP) to be massively simplified. Further, as a side effect of the new model, Clearview IPMP was able to fix many longstanding bugs -- some as old as IPMP itself -- such as 5015757, 6184000, 6359536, 6516992, 6591186, 6698480, 6752560, and 6787091 (among others).

Anyway, it's obvious that I'm a proud and biased parent. Whether my pride is justified will only become clear once Clearview IPMP has ten years of production use under its belt and an objective comparison is possible. However, I encourage you all to take it for a spin now and make your own assessment -- and of course feedback is welcome, in private or in public.


Tuesday Sep 02, 2008

Creating Shell-Friendly Parsable Output


Being able to easily write scripts from the command-line has long been regarded as one of UNIX's core strengths. However, over the years, surprisingly little attention has been paid to writing CLIs whose output lends itself to scripting. Indeed, even modern CLIs often fail to consider parsable output as a distinct concept from human output, leading to overwrought and fragile scripts which inevitably break as the CLI is enhanced over time. Some recent CLIs have "solved" the parsable format problem by using popular formats such as XML and JSON. These are fine formats for sophisticated scripting languages, but a poor match for traditional UNIX line-oriented tools (e.g. grep, cut, head) that form the foundation of shell-scripting.

Even those CLIs that consider the shell when designing a parsable output format often fall short of the mark. For dladm(1M), it took us (Sun) three tries to create a format that can be easily parsed from the shell. So, while the final format we settled on may seem simple and obvious, as is often the case, making things simple can prove to be surprisingly hard. Further, there are a number of alternative output formats that seem compelling at first blush but ultimately prove to be unworkable.

So that others working on similar problems may benefit, below I've summarized our set of guidelines -- some obvious, some not -- that we arrived at while working on dladm. As each CLI has its own constraints, not all of them may prove applicable, but I'd urge anyone designing a CLI with parsable output to consider each one carefully.

To provide some specifics to hang our guidelines on, first, here's an example of the dladm human output format:

  # dladm show-link -o link,class,over
  LINK        CLASS    OVER
  eth0        phys     --
  eth1        phys     --
  eth2        phys     --
  default0    aggr     eth0 eth1 eth2
  cyan0       vlan     default0
... and here's the equivalent parsable output format:
  # dladm show-link -p -o link,class,over
  eth0:phys:
  eth1:phys:
  eth2:phys:
  default0:aggr:eth0 eth1 eth2
  cyan0:vlan:default0
Now, the guidelines:
  1. Design CLIs that output in a regular format -- even in human output mode.

    Once your human output mode ceases to be regular (ifconfig(1M) output is a prime example of an irregular format), later adding a parsable output mode becomes difficult if not impossible. (As an aside, I've often found that irregular output suggests deeper design flaws, either in the CLI itself or the objects it operates on.)

  2. Prefer tabular formats in parsable output mode.

    Because traditional shell scripting works best with lines of information, tabular formats where each line both identifies and describes a unique object are ideal. For example, above, the link field uniquely identifies the object, and the class and over fields describe that object. In some cases, multiple fields may be required to uniquely identify the object (e.g., with dladm show-linkprop, both the link and the property field are needed). As an aside: in the multiple-field case, the human output mode may choose to use visual nesting (e.g., by grouping all of a link's properties together on successive lines and omitting the link value entirely), but it's important this not be done in parsable output mode so that the shell script can remain simple.

  3. Require output fields to be specified.

    Unlike humans, scripts always invoke a CLI with a specific purpose in mind. Also unlike humans, scripts are not facile at adapting to change (e.g., the addition of new fields). Thus, it's imperative that scripts be forced to explicitly specify the fields they need (with dladm, attempting to use -p without -o yields an error message). With this approach, new fields can be added to a CLI without any risk of breaking existing consumers. Further, if a field used by a script is removed, the failure mode becomes hard (the CLI will produce an error), rather than soft (the consumer misparses the CLI's output and does something unpredictable). Note that for similar reasons, if your CLI provides a way to print field subsets that may change over time (e.g., -o all), those must also fail in parsable output mode.

  4. Leverage field specifiers to infer field names.

    Because field names must be specified in a particular order, it's natural to use that same order as the output order, and thus avoid having to explicitly identify the field names in the parsable output format. That is, as shown above, dladm can omit indicating which field name corresponds with which value because the order is inferred from the invocation of -o link,class,over. This may seem a minor point, but in practice it saves a lot of grotty work in the shell to otherwise correlate each field name with its value.
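
    The savings are easy to see in a sketch that binds fields positionally; the canned sample line below merely stands in for real `dladm show-link -p -o link,class,over` output:

```shell
# Because the -o order fixes the output order, read can bind each
# field directly to a named variable -- no header correlation needed.
printf 'default0:aggr:eth0 eth1 eth2\n' |
while IFS=: read -r link class over; do
        echo "$link ($class) uses: $over"
done
# prints: default0 (aggr) uses: eth0 eth1 eth2
```
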

  5. Omit headers.

    Similarly, because the field order is known (and no human will be staring at the output) there is no utility in providing a header in parsable output mode, and indeed its presence would only complicate parsing. As shown above, dladm omits the header in parsable output mode.

  6. Do not adorn your field values.

    In human output mode, it can be useful to give visual indications for specific field values. For instance, as shown above, dladm shows empty values as "--" in human output mode so that the table does not look malformed. In parsable output mode, such embellishments only complicate and confuse consumers of the data (and may in fact make it ambiguous), and thus should be avoided. As above, in parsable output format, empty values are shown as actually being empty.

  7. Do not use whitespace as a field separator.

    Whitespace may seem like a natural field separator, but in practice it's problematic. Specifically, many shells treat whitespace separators specially by merging consecutive instances into a single instance. For example, consider representing three consecutive empty values. With a non-whitespace field separator such as ":", this would be output as "::" (empty value 1, : separator, empty value 2, :, empty value 3). With the shell's IFS variable set to ":", the shell will parse this as three separate empty values, as intended. With space as the field separator, this would be output as "   ", and with IFS set to " " the shell would misparse this as a single empty value.
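
    A quick demonstration of the difference in plain sh:

```shell
# With ":" as the separator, an empty middle field survives parsing:
IFS=: read -r f1 f2 f3 <<EOF
one::three
EOF
echo "[$f1][$f2][$f3]"    # prints [one][][three]

# With whitespace, the shell collapses consecutive separators, so the
# empty middle field is silently lost and the last field comes up empty:
IFS=' ' read -r g1 g2 g3 <<EOF
one  three
EOF
echo "[$g1][$g2][$g3]"    # prints [one][three][]
```
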

  8. Do not restrict your allowed field values.

    While some fields may be controlled directly by the CLI (e.g., the class field above), others are either outside of your direct control (e.g., the link field above), or outside of even your system's control (e.g., the essid field output by dladm show-wifi). As such, aside from ensuring the field value is printable ASCII (where newline is considered unprintable), no values should be filtered out or forbidden[1].

    Thus, any values that have special meaning should generally be escaped. For instance, with ":" as a field delimiter, IPv6 address "fe80::1" would become "fe80\:\:1" when displayed in parsable output mode. Thankfully, escaping does not complicate shell parsing because all popular scripting shells have read builtins that will automatically strip escapes. Thus, the common idiom of piping the output to a read/while loop works as expected without any special-purpose logic. For instance, even though the BSSID field will contain embedded colons, this will loop through each BSSID on each link, trying to connect to one until it succeeds:

          dladm scan-wifi -p -o link,bssid |
          while IFS=: read link bssid; do
                  dladm connect-wifi -i $bssid $link && break
          done
    That said, if only a single field has been requested, the field separator is not needed. Since no ambiguity exists in that case, there's no need to escape it -- and not doing so can make things more convenient for other shell idioms -- e.g., to collect all in-range BSSIDs:
          bssids=`dladm scan-wifi -p -o bssid`
I'd welcome hearing back from others who have tackled this problem.

[1] If unprintable ASCII values can legitimately occur in a given field's output, you need to use another encoding format.

Wednesday Aug 20, 2008



Yes, it's been a whole year since I last posted a blog entry. Between moving from Boston to San Francisco (metro, anyway), countless urgent matters (both professional and personal), and wrapping up Clearview IPMP development (more on that real soon), blogging hasn't exactly been top priority. That said, I have amassed a really nice list of topics for future blog entries over the coming weeks (OK, maybe months ;-).

Before we get to all that though, I have an urgent tip for those who are using GNOME's Nautilus on OpenSolaris build 94 or later. It seems that the GNOME development team (not inside Sun) decided to change the Open Terminal menu item (available by right clicking on the desktop) to Open in Terminal, and correspondingly changed things so that the GNOME terminal will open in your ~/Desktop directory, rather than ~. The unmitigated idiocy and arrogance of this change is beyond comprehension, and the pain associated with it only intensifies with each opened terminal. Nonetheless, thankfully, there is a simple way to restore the previous (correct) behavior:

  gconftool-2 -s /apps/nautilus-open-terminal/desktop_opens_home_dir \
              -t bool true
Hope this saves some other poor soul from spending half a day digging through the GNOME sources for a solution.


Thursday Aug 16, 2007

Solaris Networking Abstractions


Solaris draws clear boundaries between IP interfaces, data-links, devices, and physical hardware. However, these boundaries are a frequent source of confusion, especially for migrants from other operating systems that do not have such clear delineations. Further, with data-link abstractions becoming ever-richer (via link aggregations, VLANs, IP tunnels -- and soon VNICs, vswitches, and vbridges), people have become increasingly confused about how the abstractions within and across each layer relate. As such, the Clearview team has been working closely with Sun's documentation writers to provide a background chapter (including illustrations) that illuminates the core abstractions.

Needless to say, I was thrilled to see my original scrawls turned into wonderful images like this one:

Above, one can see the flexible and powerful networking topologies that can be created simply from two common Sun networking cards (in this case, ce and qfe). Above the hardware layer, we see five devices -- one for the ce card, and four for the qfe card (the "q" stands for "quad"; qfe has four network ports on one card, which appear to the operating system as four independent devices).

Above the device layer, we see four physical links (shown in blue) that have been instantiated using those devices (the qfe1 device is unused). These links (as with all links) have been named by the administrator using Clearview's upcoming vanity naming feature. As illustrated, VLANs can be created over the links -- as can aggregations. Further, any of the links can also be instantiated at the IP layer (with their link name) using the ifconfig plumb subcommand. We also see that some links can exist independently of any specific underlying hardware -- such as vpn1, which uses the IP routing table to determine the actual link to direct a given packet to.

Finally, at the IP layer, we see that while most IP interfaces have a one-to-one relationship with an underlying datalink, some (such as lo0) have no underlying datalink, and others (such as eml3) group IP interfaces on the same IP broadcast domain together using IPMP (at least, they will once Clearview IPMP is complete).


Wednesday Apr 25, 2007

IPMP Development Update #2


Several folks have again (understandably) asked for updates on the Next-Generation IPMP work. Significant progress has been made since my last update. Notably:

  • Probe-based failure detection is operational (in addition to the earlier support for link-based failure detection).
  • DR support of interfaces using IPMP through RCM works. Thanks to the new architecture, the code is almost 1000 lines more compact than Solaris's current implementation -- and more robust.
  • Boot support is now complete. That is, any number of interfaces (including all of them) can be missing at boot and then transparently repaired during operation.
  • At long last, ipmpstat. As discussed in the high-level design document, this is a new utility that allows the IPMP subsystem to be compactly examined.

Since ipmpstat allows other aspects of the architecture to be succinctly examined, let's take a quick look at a simple two-interface group on my test system:

  # ipmpstat -g
  net57       a           ok        10000ms   ce1 ce0

As we can see, the "-g" (group) output mode tells us all the basics about the group: the group interface name and group name (these will usually be the same, but differ above for illustrative purposes), its current state ("ok", indicating that all of the interfaces are operational), the maximum time needed to detect a failure (10 seconds), and the interfaces that comprise the group.

We can get a more detailed look at the health and configuration of the interfaces under IPMP using the "-i" (interface) output mode:

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

Here, we can see that ce0 has probe-based failure detection disabled. We can also see issues that prevent an interface from being used (aka being "active") -- e.g., suppose we enable standby on ce0:

  # ifconfig ce0 standby

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         no      net57       si----  up        disabled  ok

We can see that ce0 is now no longer active, because it's an inactive standby (indicated by the "i" and "s" flags). This means that all of the addresses in the group must be restricted to ce1 (unless ce1 becomes unusable), which we can see via the "-a" (address) output mode ("-n" turns off address-to-hostname resolution):

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
                      net57       up      ce1         ce1
                      net57       up      ce1         ce1

For fun, we can offline ce1 and observe the failover to ce0:

  # if_mpadm -d ce1

  # ipmpstat -i
  ce1         no      net57       ----d-  disabled  disabled  offline
  ce0         yes     net57       s-----  up        disabled  ok

[ In addition to the "offline" state, the "d" flag also indicates that all of the addresses on ce1 are down, preventing it from receiving any traffic. ]

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
                      net57       up      ce0         ce0
                      net57       up      ce0         ce0

We can also convert ce0 back to a "normal" interface, online ce1, and observe the load-spreading configuration:

  # ifconfig ce0 -standby
  # if_mpadm -r ce1

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
                      net57       up      ce0         ce1 ce0
                      net57       up      ce1         ce1 ce0

In particular, this indicates that inbound traffic to one of the data addresses will go to ce0 and inbound traffic to the other will go to ce1 (as per the ARP mappings). However, outbound traffic will potentially flow over either interface (though to sidestep packet-ordering issues, a given connection will remain latched to one interface unless that interface becomes unusable).
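
The latching behavior described above can be sketched as follows. This is a hypothetical illustration of the idea only -- the names (conn_select_if, bound_if, the usable[] array) are invented for the example, not Solaris IP internals:

```c
#include <assert.h>

/*
 * Hypothetical sketch of per-connection latching: once a connection is
 * bound to an interface, it stays there unless that interface becomes
 * unusable.  Illustrative code, not from Solaris IP.
 */
typedef struct conn {
	int	bound_if;	/* index of latched interface, or -1 */
} conn_t;

/*
 * Pick the outbound interface for `conn': honor an existing latch if that
 * interface is still usable; otherwise latch onto a usable one, starting
 * the search at a per-connection hash (`hint') to spread the load.
 */
int
conn_select_if(conn_t *conn, const int usable[], int nifs, int hint)
{
	if (conn->bound_if != -1 && usable[conn->bound_if])
		return (conn->bound_if);

	for (int i = 0; i < nifs; i++) {
		int cand = (hint + i) % nifs;
		if (usable[cand]) {
			conn->bound_if = cand;	/* latch */
			return (cand);
		}
	}
	return (-1);		/* no usable interface in the group */
}
```

A connection only "moves" when its latched interface fails, which is what keeps packets of one connection flowing out a single interface in order.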

This also highlights another aspect of the new IPMP design: the kernel is responsible for spreading the IP addresses across the interfaces (rather than the administrator). The current algorithm simply attempts to keep the number of IP addresses "evenly" distributed over the set of interfaces, but more sophisticated policies (e.g., based on load measurements) could be added in the future.
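
The "evenly distributed" policy can be sketched as a simple greedy assignment. All names and data structures here are hypothetical illustrations of the idea, not the kernel's actual code:

```c
#include <assert.h>

/*
 * Sketch of an "even spread" policy: bind each data address to whichever
 * active interface currently hosts the fewest addresses.  Hypothetical
 * names, for illustration only.
 */
typedef struct {
	int	naddrs;		/* data addresses currently bound */
} ipmp_if_t;

void
ipmp_spread(ipmp_if_t ifs[], int nifs, int naddrs, int binding[])
{
	for (int a = 0; a < naddrs; a++) {
		int best = 0;
		for (int i = 1; i < nifs; i++) {
			if (ifs[i].naddrs < ifs[best].naddrs)
				best = i;
		}
		ifs[best].naddrs++;
		binding[a] = best;	/* address a -> interface best */
	}
}
```

When an interface fails or is offlined, rerunning the same assignment over the surviving interfaces yields the kind of rebinding visible in the ipmpstat -an output shown earlier.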

To round out the ipmpstat feature set, one can also monitor the targets and probes used during probe-based failure detection:

  # ipmpstat -tn
  ce1         mcast
  ce0         disabled  --                  --

Above, we can see that ce1 is using "mcast" (multicast) mode to discover its probe targets, and we can see the targets it has decided to probe, in firing order. We can also look at the probes themselves, in real time:

  # ipmpstat -pn
  TIME      INTERFACE   PROBE     TARGET              RTT       RTTAVG    RTTDEV
  1.15s     ce1         112         1.09ms    1.14ms    0.11ms
  2.33s     ce1         113         1.11ms    1.18ms    0.13ms
  3.94s     ce1         114         1.07ms    2.10ms    2.00ms
  5.38s     ce1         115         1.08ms    1.14ms    0.10ms
  6.19s     ce1         116         1.43ms    1.20ms    0.19ms
  7.73s     ce1         117         1.04ms    1.13ms    0.11ms
  9.47s     ce1         118         1.04ms    1.16ms    0.13ms
  10.67s    ce1         119         1.06ms    1.97ms    1.76ms

Above, the inflated RTT average and standard deviation for one of the targets indicate that something went wrong with it in the not-too-distant past. (As an aside: "-p" also revealed a subtle longstanding bug in in.mpathd that was causing inflated jitter times for probe targets; see 6549950.)
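
The RTTAVG/RTTDEV columns behave like smoothed estimates. As an assumption on my part (this is not in.mpathd's actual code), the classic TCP-style estimator -- gain 1/8 for the average, 1/4 for the deviation -- produces exactly this kind of lingering inflation after one slow probe:

```c
#include <assert.h>

/*
 * A TCP-style smoothed RTT/deviation estimator (alpha = 1/8, beta = 1/4).
 * This is a guess at the kind of bookkeeping behind RTTAVG and RTTDEV,
 * not in.mpathd source.
 */
typedef struct {
	double	rttavg;		/* smoothed round-trip time */
	double	rttdev;		/* smoothed mean deviation */
	int	nsamples;
} rtt_est_t;

void
rtt_update(rtt_est_t *est, double rtt)
{
	if (est->nsamples++ == 0) {
		est->rttavg = rtt;
		est->rttdev = rtt / 2;
		return;
	}
	double err = rtt - est->rttavg;
	est->rttavg += err / 8;		/* alpha = 1/8 */
	if (err < 0)
		err = -err;
	est->rttdev += (err - est->rttdev) / 4;	/* beta = 1/4 */
}
```

A single slow probe bumps both estimates, and they decay back only gradually -- which is why the trouble remains visible in the output well after the fact.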

Anyway, hopefully all this gives you not only a feel for ipmpstat, but a feel for how development is progressing. It should be noted that several key features are still missing, such as:

  • Broadcast and multicast support on IPMP interfaces.
  • IPv6 traffic on IPMP interfaces.
  • IP Filter support on IPMP interfaces.
  • MIB and kstat support on IPMP interfaces.
  • DHCP on IPMP interfaces.
  • Sun Cluster support.

All of these are currently being worked on. In the meantime, we will be making early-access BFU archives, based on what we have so far, available to those who are interested in kicking the tires. (And a big thanks to those customers who have already volunteered!)


Tuesday Feb 06, 2007

IPMP Development Update


A number of people have sent me emails asking for updates on the Next-Generation IPMP work. In short, there's a lot to do, but development is progressing smoothly and early-access bits are on the horizon[1]. At this point, one can:

  • Create, destroy, and reconfigure IPMP groups with arbitrary numbers of interfaces and IP addresses, using either the legacy or new administrative model.
  • Load-spread inbound and outbound traffic across the interfaces and addresses. As per the new model, all IP addresses are hosted on "IPMP" interfaces and the kernel handles the binding of IP addresses to interfaces in the group internally. There is no longer a visible concept of failover or failback.
  • Use in.mpathd to track the failure and repair of interfaces. It notifies the kernel of these changes so that the kernel can update its interface-to-address bindings.
  • Use if_mpadm to offline and undo-offline interfaces. Again, this causes the kernel to update its interface-to-address bindings.

To illustrate where I'm at, let me use last night's build to show the lay of the land. (What's been implemented is almost identical to what was proposed in the high-level design document -- so please consult that document for additional background.) For starters, one can use the old IPMP administrative commands as before -- e.g., to create a two-interface group with two IP data addresses:

# ifconfig ce0 plumb group ipmp0 up
# ifconfig ce1 plumb group ipmp0 up

But what you end up with looks a bit different:

# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet netmask ff000000
ce0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet netmask ff000000
        groupname ipmp0
        ether 0:3:ba:94:3b:74
ce1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
        inet netmask ff000000
        groupname ipmp0
        ether 0:3:ba:94:3b:75
ipmp0: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet netmask ffffff00 broadcast
        groupname ipmp0
ipmp0:1: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet netmask ffffff00 broadcast

Above, we can see that ifconfig has created an ipmp0 IP interface for our IPMP group and placed the two data addresses that we configured onto it. The ce0 and ce1 interfaces have no actual addresses configured on them (though they would if we'd configured test addresses), but are marked UP so that they can be used to send and receive traffic. Note that ipmp0 is marked with a special IPMP flag to indicate that it is an IP interface that represents an IPMP group.

Though the legacy configuration works, we will recommend configuring IPMP through the new model, since it better expresses the intent. The same configuration as above would be achieved instead by doing:

# ifconfig ipmp0 ipmp up addif up
# ifconfig ce0 plumb group ipmp0 up
# ifconfig ce1 plumb group ipmp0 up

Note the presence of the ipmp keyword, which tells ifconfig that the interface represents an IPMP group. Because of this keyword, an IPMP interface can actually be given any valid unused IP interface name -- e.g., ifconfig xyzzy0 ipmp will create an IPMP interface named xyzzy0. This follows the Project Clearview tenet that IP interface names must not be tied to the interface type -- which in turn allows one to roll out new networking technologies without disturbing the system's higher-level network configuration.

In general, an IPMP interface can be used like any other IP interface -- e.g., to create a default route through ipmp0, we can do:

# route add default -ifp ipmp0 

We can also examine the ARP table to see the current distribution of ipmp0's IP addresses to IP interfaces in the group (once development is complete, ipmpstat will make this easier):

# arp -an | grep ipmp0
ipmp0  SPLA     00:03:ba:94:3b:74
ipmp0 SPLA     00:03:ba:94:3b:75

Here, we see that one of the data addresses is using ce0's hardware address, and the other is using ce1's hardware address. If we offline ce0, we can see that the kernel will change the binding:

# if_mpadm -d ce0
# arp -an | grep ipmp0
ipmp0  SPLA     00:03:ba:94:3b:75
ipmp0 SPLA     00:03:ba:94:3b:75

One interesting consequence of the new design is that it's possible to remove all of the interfaces in a group and still preserve the IPMP group configuration. For instance:

# ifconfig ce0 unplumb
# ifconfig ce1 unplumb
# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet netmask ff000000
ipmp0: flags=8001000803<UP,BROADCAST,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet netmask ffffff00 broadcast
        groupname ipmp0
ipmp0:1: flags=8001000803<UP,BROADCAST,MULTICAST,IPv4,IPMP> mtu 1500 index 3
        inet netmask ffffff00 broadcast

Since all of the network configuration (e.g., the routing table) is tied to ipmp0 rather than to the underlying interfaces, it's unaffected. Of course, no network traffic can flow through ipmp0 until another interface is placed back into the group -- as evidenced by the fact that the RUNNING flag has been cleared on ipmp0.

Those familiar with the existing IPMP implementation may be asking themselves what's left to do. The answer is "quite a bit". Notable current omissions include:

  • Broadcast and multicast support on IPMP interfaces.
  • IPv6 traffic on IPMP interfaces.
  • Probe-based failure detection.
  • DR support of interfaces using IPMP through RCM.
  • MIB and kstat support on IPMP interfaces.
  • DHCP over IPMP interfaces.
  • ipmpstat.

Of the above, the first four are supported by the existing IPMP implementation and are (with minor exceptions) requirements for any early-access candidate. That said, as I mentioned earlier, development is proceeding at a good clip -- especially now that the hairy IP configuration multithreading model has been tamed, and several lethal bugs in IP have been nailed[2]. So stay tuned.


[1] If you're a Sun customer interested in kicking the tires in a pre-production environment, please send an email to meem AT eng DOT sun DOT com.
[2] For instance, see or


Monday Dec 04, 2006



Now that the new Solaris WiFi architecture has integrated into Build 54 of Nevada, this seemed an appropriate time to interrupt the static with a fresh blog entry. In short, WiFi's integration represents both a new beginning for WiFi on Solaris and a milestone for the new architecture of Solaris networking. I realize such hyperbole may set off more than a few BS-meters, so let me back these statements up with some specifics.

First, with regard to WiFi:

  1. The kernel now has first-class support for the WiFi link-layer protocols. Previously, WiFi drivers in Solaris masqueraded as Ethernet drivers, and internally performed header translations to send and receive actual WiFi link-layer frames.

    Having first-class support for WiFi simplifies our codebase, reduces per-packet overhead, introduces WiFi-specific kstats, and opens the door for network sniffers like snoop and Ethereal to directly interpret WiFi frames on Solaris via the new DLIOCNATIVE ioctl.

  2. Similarly, the kernel's GLDv3 networking driver framework now natively supports WiFi, allowing WiFi to be seamlessly handled by the protocol stack. Accordingly, the bundled ath driver has been ported to GLDv3, and all unbundled drivers have either been ported to GLDv3 or are in the process of being ported. Previously, WiFi drivers were relegated to the GLDv2 framework, which is no longer under active development.

  3. The kernel now has a dedicated net80211 kernel module which facilitates code sharing across WiFi drivers. This kernel module is based closely on the mature and robust FreeBSD 7 WLAN module, allowing us to easily incorporate enhancements as they become available. Previously, different versions of the WLAN framework had been directly linked into each driver, which was a significant maintenance and support hazard.

  4. As a result of (2), WiFi drivers can now be managed using our GLDv3 administration command, dladm. For instance, running dladm show-link on a laptop now shows any available WiFi links alongside the Ethernet links. In addition, new dladm subcommands have been added to allow WiFi administration -- e.g., to connect to the most optimal unencrypted WiFi link in-range, just run dladm connect-wifi. Or you can create keys with dladm create-secobj and then pass those keys to connect-wifi. Check out the EXAMPLES section of the latest dladm manpage for specifics.

Collectively -- and in combination with other smaller improvements -- I hope you agree that this adds up to a solid foundation for WiFi on Solaris. Moreover, this foundation greatly benefits the development of our two follow-on WiFi projects -- specifically, WPA/WPA2 support, and bundled support for many more WiFi chipsets.

Now, with regard to Solaris Networking, WiFi both builds on recent work by other projects and paves the way for ongoing and future projects:

  1. The new WiFi support in GLDv3 makes use of the MAC Type plugin architecture integrated by Project Clearview into Build 44 of Nevada. In fact, two plugin operations -- mtops_header_cook() and mtops_header_uncook() -- were designed around WiFi's requirements. The entire mac_wifi plugin source is just 415 lines of straightforward code, including comments. Without the MAC Type plugin architecture, native WiFi support would have been significantly more complex and less elegant.

  2. The userland WiFi architecture was designed to be easily used by future administrative tools -- especially Network Auto-Magic (NWAM). Specifically, the heavy lifting for the new WiFi dladm subcommands is actually done by the new libwladm library, rendering dladm mostly a trivial wrapper around the libwladm routines.

    This separation of WiFi mechanism from UI policy allows other tools such as NWAM to make use of the WiFi framework without having to resort to ugly calls to dladm or (gag) code duplication. Note that libwladm is currently Consolidation Private, and thus is only safe to call from code in the ON consolidation.

  3. Enhancing dladm to support WiFi required new facilities for administering WEP keys and WiFi properties (e.g., radio and powermode settings). However, to keep the dladm administrative model simple and extensible, we introduced two new generic dladm facilities: link properties and secure objects.

    While these facilities are only used by WiFi at present, future projects will extend them in new directions. For instance, both Project Clearview and IP Instances already have new link properties planned (autopush and zone respectively), and an upcoming project to take administration out of the ndd(1M) stone-age will likely build extensively on link properties. Similarly, future secure objects will exist for WPA/WPA2 keys and perhaps other secure data such as certificates.

Looking ahead to 2007, Project Clearview, NWAM, Crossbow, and IP Instances will all make use of features introduced by WiFi -- and of features introduced by one another -- to collectively realize a "new world order" for Solaris networking. Stay tuned; it's going to rock.

Peter Memishian
Last modified: Mon Dec 4 16:21:48 EST 2006

Sunday Jul 30, 2006

On Locking


Recently, I've been hip-deep in the 105,000 lines of heavily-multithreaded code that comprise Solaris's IPv4/IPv6 implementation, finishing the bring-up of our new IP Network Multipathing (IPMP) implementation for Clearview. Along the way, I've been reminded of some collected wisdom regarding locking that could use wider dissemination:

  1. An object cannot synchronize its own visibility.

    Most long-lived objects are put into a structure such as a hash table that allows them to be looked up at some point in the future. In order to track the number of threads that currently have an object "checked out", objects are usually reference counted, with the reference count itself being manipulated under the object's lock.

    Things get tricky when the object must go away. In particular, many make the mistake of trying to synchronize the object's removal from visibility under the object's lock when the reference count reaches one. This is not possible: any thread looking up the object -- by definition -- does not yet have the object and thus cannot hold the object's lock during the lookup operation (unless the object lock is not tied to the object itself -- see (2)). As a result, another thread can race and acquire another reference at the same time the object is being destroyed, leading to incorrect behavior. Instead, whatever higher-level synchronization is used to coordinate the threads looking up the object must also be used as part of removing the object from visibility.

    Once it has been removed from visibility, an object can indeed synchronize its own destruction [1]. The simplest approach is to have the object's destruction done by whatever thread causes the object's reference count to reach zero [2] -- that is, "if you're the last one out, turn off the lights". Note that this logic can be part of the standard code that decrements the object's reference count, but is guaranteed to be unreachable until the object is removed from visibility. This is because the object's reference count is incremented when it is made visible through another structure (e.g., the hash table providing its visibility), and that reference remains until the object is removed from visibility.

  2. An object should not synchronize itself without due cause.

    This is a generalization of (1). When building objects that are intended to be used in a multithreaded environment, it is tempting to build the locks into the objects themselves. For instance, a stack object might contain an internal lock to ensure that multiple threads issuing a push or pop operation simultaneously will operate without corrupting the underlying stack object or returning erroneous results. Languages like Java have promoted this sort of locking to a first-class concept through the "synchronized" keyword.

    While this "works" in the small, it misses the big picture. Specifically, each of those aforementioned threads working on the stack object was performing its pushes and pops as part of accomplishing some larger task. Those larger tasks are indeed what need to be synchronized with one another, so that they appear atomic to each other as a whole. However, only the callers (the threads using the stack objects) have insight into the granularity and semantics of those tasks and the objects that comprise them -- so only they can implement that locking. However, once that locking has been implemented, any internal object locks performing similar functions become superfluous, and only end up complicating the object's implementation (e.g., to avoid recursive mutex locking).

    Thus, many objects are better off leaving their synchronization to their callers, since those users will have to synchronize between each other anyway. Of course, objects will in turn use other objects -- so it's still quite likely that an object will have embedded locks. However, those locks will be used to synchronize access to the objects it's using, rather than attempting to synchronize its own use across multiple threads. In short, locks are best kept external to the data structures they manage, unless the structure itself must support high-performance concurrent access.

  3. A condition can be signaled without holding the condition's lock.

    There is a longstanding and deeply embedded superstition that signaling a condition without holding that condition's lock will compromise correctness. In fact, Solaris's own cond_signal(3C) manpage contains:

          Both functions should be called under the protection of the same
          mutex that is used with the condition variable being signaled.
          Otherwise, the condition variable may be signaled between the test
          of the associated condition and blocking in cond_wait().  This can
          cause an infinite wait.
    This is completely false: the thread heading into cond_wait() must have already tested the condition under the lock and concluded it was false in order to decide to cond_wait(). Since any thread changing state that would affect the condition must also be holding the lock, there is no way for the state (and thus the outcome of the test) to change between the test and the cond_wait(), and thus any cond_signal() sent during that window would end up being spurious anyway. Accordingly, 6437070 has been filed.

    All that said, it may be true that signaling the condition while holding the lock improves overall determinism, since it eliminates a possible avenue for priority inversion. It may also waste cycles, since the thread being signaled may not be able to grab the lock (if the signaling thread has not yet dropped it). This recent thread on mdb-discuss contains more on this issue.
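
To make (1) concrete, here is a minimal C sketch of the pattern described above: lookups take a reference under the table's lock, and whichever thread drops the last reference turns off the lights. The names (obj_lookup, obj_rele, obj_remove) and the one-entry "table" are hypothetical, purely for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/*
 * Visibility is synchronized by the table's lock; destruction is
 * synchronized by the reference count.  Generic illustration, not
 * code from any particular subsystem.
 */
typedef struct obj {
	pthread_mutex_t	lock;
	int		refcnt;		/* includes the table's reference */
	int		*destroyed;	/* hook set by the "destructor" */
} obj_t;

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static obj_t *table_slot;		/* a one-entry "hash table" */

/* Look up the object, taking a reference under the table's lock. */
obj_t *
obj_lookup(void)
{
	pthread_mutex_lock(&table_lock);
	obj_t *obj = table_slot;
	if (obj != NULL) {
		pthread_mutex_lock(&obj->lock);
		obj->refcnt++;
		pthread_mutex_unlock(&obj->lock);
	}
	pthread_mutex_unlock(&table_lock);
	return (obj);
}

/* Drop a reference; the thread that drops the last one destroys. */
void
obj_rele(obj_t *obj)
{
	pthread_mutex_lock(&obj->lock);
	int last = (--obj->refcnt == 0);
	pthread_mutex_unlock(&obj->lock);
	if (last) {
		*obj->destroyed = 1;	/* real destructor logic here */
		pthread_mutex_destroy(&obj->lock);
		free(obj);
	}
}

/* Remove from visibility under the table's lock; drop its reference. */
void
obj_remove(obj_t *obj)
{
	pthread_mutex_lock(&table_lock);
	table_slot = NULL;
	pthread_mutex_unlock(&table_lock);
	obj_rele(obj);			/* the reference visibility held */
}
```

Note that the destruction path in obj_rele() is unreachable while the table still holds its reference, exactly as argued above.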
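
Point (2) can be sketched the same way: a stack with no internal lock, where the callers' own lock makes an entire multi-step task atomic. The names and the task_lock are invented for the example:

```c
#include <assert.h>
#include <pthread.h>

/*
 * An intentionally unsynchronized stack: correctness under concurrency
 * is the caller's business.  Hypothetical illustration only.
 */
#define	STK_MAX	64

typedef struct {
	int	items[STK_MAX];
	int	depth;
} istack_t;

int
istack_push(istack_t *sp, int v)
{
	if (sp->depth == STK_MAX)
		return (-1);
	sp->items[sp->depth++] = v;
	return (0);
}

int
istack_pop(istack_t *sp, int *vp)
{
	if (sp->depth == 0)
		return (-1);
	*vp = sp->items[--sp->depth];
	return (0);
}

static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * A larger task: move the top element from one stack to another.  The
 * caller-level lock makes the pop-then-push atomic as a whole --
 * something per-operation locks inside the stack could never guarantee.
 */
int
istack_move_top(istack_t *from, istack_t *to)
{
	int v, ret = -1;

	pthread_mutex_lock(&task_lock);
	if (istack_pop(from, &v) == 0)
		ret = istack_push(to, v);
	pthread_mutex_unlock(&task_lock);
	return (ret);
}
```

Had the stack carried its own internal lock, istack_move_top() would still need task_lock for atomicity of the whole move, and the internal lock would be pure overhead.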
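
And a generic pthreads sketch of (3): the producer changes the state under the mutex but signals after dropping it. Because the consumer tests the condition under the same mutex, no wakeup can be lost; this is illustrative code, not anything from Solaris:

```c
#include <assert.h>
#include <pthread.h>

/*
 * Signaling a condition without holding its mutex.  The state change
 * happens under the lock; only the signal is sent lock-free.
 */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int done = 0;

void
producer(void)
{
	pthread_mutex_lock(&m);
	done = 1;			/* state change: under the lock */
	pthread_mutex_unlock(&m);
	pthread_cond_signal(&cv);	/* signal: lock already dropped */
}

void
consumer(void)
{
	pthread_mutex_lock(&m);
	while (!done)			/* test under the lock */
		pthread_cond_wait(&cv, &m);
	pthread_mutex_unlock(&m);
}
```

If the consumer reaches the test first, it holds the mutex until it is queued on the condition variable, so the producer cannot slip its state change and signal into that window.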


[1] As David Powell mentioned to me while discussing this blog entry, "This problem is frequently misrepresented as an inability to synchronize against one's own destruction. To be precise, the problem is that an object can't synchronize its own visibility."
[2] The other common option is to have the thread removing the object to wait for the reference counts to drop, but that forces the thread to block for a potentially unbounded period of time, and should only be used if required for correctness.


Friday May 12, 2006

Clearview Updates


Clearview development has been proceeding at a rapid pace -- here's a quick update on the milestones reached over the past month:

  • Thanks especially to Sebastien's hard work, the Nemo Binary Compatibility and Nemo Generalization components have been reviewed by the OpenSolaris community, approved by our internal architectural review board, and are nearing integration into OpenSolaris. With these, non-Ethernet Nemo drivers (such as the Clearview IP Tunneling driver) can be written -- not to mention Nemo WiFi drivers (which are almost ready for integration into OpenSolaris as well). This work also brings us a step closer to making the Nemo interfaces available for third-party use, and paves the way for TCP LSO support.
  • Cathy and Dan Groves have published a proposal for improving the observability of VLANs, which will be submitted to our architectural review board shortly. These changes are necessary for Clearview's Nemo Unification component, but are also quite useful in their own right since they make it significantly easier to track down networking problems that occur on VLANs. Code that implements the proposal is already running internally, and is also destined for a nearby build of OpenSolaris.
  • Phil Kirk has published our proposed architecture for IP Observability. With this work comes the vital ability to debug intra-zone and inter-zone networking problems using traditional utilities such as ethereal and snoop -- along with opening the door for interesting possibilities such as inter-zone IDS's. Again, the code that implements this proposal is already running internally.
  • Sagun Shakya has published our proposal for a public library that can be used to communicate with link-layer devices via DLPI. This work is necessary for Vanity Naming, but also allows us to centralize all application-level DLPI handling -- and to torch thousands of lines of tedious and obscure code. Expect a revised proposal -- based on our experiences of porting the Solaris DHCP client to use it -- to be posted to OpenSolaris shortly.
I'm also happy to report that we will shortly be making builds of the Clearview gate available to the OpenSolaris community. These early-access bits will include everything mentioned above.

And finally, as promised, here are my photos from Sichuan. Thanks again to Cathy Zhou and her family for taking me on this amazing trip.

Technorati Tag:
Technorati Tag:

Tuesday Jan 03, 2006

Private vs. Secret


Some of the personal responses I got about why lsof does not build in OpenSolaris build 27 made it clear that the distinction we make between private and secret has not been well-communicated.

Specifically, several were confused by my comment that <inet/udp_impl.h> "is not shipped". By this, I meant that it is not part of any OpenSolaris package, and thus that it will not be installed as /usr/include/inet/udp_impl.h on a machine running OpenSolaris. Thus, <inet/udp_impl.h> is what we consider to be a private header file: its contents represent private interfaces that we do not want external software to develop dependencies on[1]. This is the essence of good software engineering: making sure that each software layer depends only on well-defined interfaces from other layers. Moreover, well-defined interfaces allow stability levels to be specified (see attributes(5)) which allow the volatility of each interface to be known by its consumers.

Unfortunately, for a variety of (mostly historical) reasons, many header files containing only private interfaces have been shipped over the years. Moreover, there is not widespread consensus that shipping private header files is a bad idea -- many feel that there is a de facto understanding that undocumented header files are private, and thus that all headers, private or not, should be shipped with the system[2].

Regardless, while <inet/udp_impl.h> is private, it is not secret. That is, although <inet/udp_impl.h> is not shipped with OpenSolaris, it is part of the OpenSolaris source distribution, and available for anyone to examine. In contrast, a secret header file is one that we are legally prohibited from making available as part of OpenSolaris. For instance, the LLC2 driver was developed by a third party, and the terms of the agreement grant only Sun employees (and those under NDA) access to its source. Thus, <sys/llc2.h> is a secret header file, as we are legally prohibited from including <sys/llc2.h> with OpenSolaris. By contrast with open, we term all of our secret source files closed. These files reside in a parallel directory hierarchy of the Solaris ON gate rooted at usr/closed. The OpenSolaris source tree contains everything in the ON gate except for those files under usr/closed.

So, to summarize: private header files are not installed under /usr/include to keep the product more robust and flexible. However, they are part of the OpenSolaris source tree. In contrast, secret (or closed) header files are not installed under /usr/include because we are legally prohibited from doing so. Further, they are not part of the OpenSolaris source tree because we are legally prohibited from doing so.


[1] These dependencies usually happen accidentally, but sometimes they are on purpose. For instance, the aforementioned lsof utility uses many data structures that are contained in private header files, including <inet/ipclassifier.h>. The right long-term answer is to provide public, well-defined interfaces that these utilities can depend on.
[2] This is something I vehemently disagree with. As the lsof example makes clear, once an interface is shipped, external software is tempted to make use of it. These dependencies often remain unknown until an innocent developer needs to change the private interfaces and breaks a popular software package, affecting myriad end-users. Moreover, the "relief" provided by these private interfaces lowers the internal priority of developing proper well-defined interfaces.


Saturday Nov 26, 2005

Voodoo vs. Engineering


Even at my stately pace for blog entries, it's been a while. Truthfully, the last few months have been a blur, as I've been completely consumed by two projects: Clearview, and the future of WiFi on Solaris (more on that in a future blog entry). On the Clearview front, the IPMP design review on OpenSolaris has wrapped up, the IP Tunnel Rearchitecture and IP Observability Device design reviews are nearing completion, and Cathy's formidable "Nemo Unification and Vanity Naming" design is about to get underway. Thanks again to everyone who has contributed -- we really appreciate the feedback -- and keep it coming!

Each time I go through the design phase[1] of a project, I'm reminded of just how hard a process it is to do with rigor and integrity. Performed as intended, the design phase forces us to confront and address the critical flaws at the start of a project, rather than right before integration -- or in the case of the technologies Clearview is rearchitecting, long after shipment. With Clearview, we've spent many months carefully working through the design process, constructing crisp, precise, and thorough design materials that collectively top 200 pages.

Along the way, I occasionally find it necessary to seek more real-world examples of why we are willing to endure such pain. The most entertaining method is to don the hazmat suit, wade into the toxic anarchy of subsystems from years gone by, and bear witness to the nightmare that results from inadequate design.

Though under-designed (or just plain never-designed) subsystems make themselves known in many capacities, one of their hallmarks is the presence of voodoo coding. Specifically, because the forces shaping the design space were never really understood, the developer must iterate until something finally "works". Code from previous iterations or stillborn ideas of how to solve unexpected problems are left behind as the logic is debugged into existence. Eventually, persistence prevails over common sense, and the code is checked in, complete with ritual sacrifices. As the years go by, subsequent developers become convinced that there is something tricky going on that they simply aren't capable of understanding, and the subsystem in question becomes steeped in mysticism.

As I've mentioned before, one subsystem that has more than its fair share of arcana is STREAMS -- which is undoubtedly why I find it so fascinating. About once a release, I like to go in and haul out a few thousand lines[2] of nonsense. While I started on the periphery, at this point I've plucked the low-hanging fruit and must instead target the wizened roots that comprise the core paths. One such routine is str_mate(), which "twists" two streams together so that they can act as a pipe or a FIFO. Since its introduction more than a decade ago, it has looked like this (line numbers added for subsequent discussion):

     1    /*
     2     *  No activity allowed on streams
     3     *  Input: 2 write driver end write queues
     4     *  Return 0 if error
     5     *  If wrq1 == wrq2 then we are doing a loop back,
     6     *  otherwise just mate two queues.
     7     *  If these queues are not the stream head they must have
     8     *  a service procedure
     9     *  It is up to caller to ensure that neither queue goes away.
    10     *  XXX str_mate() and str_unmate() should be moved to new file kstream.c
    11     *  XXX these routines need to be a little more general
    12     */
    13    int
    14    str_mate(queue_t *wrq1, queue_t *wrq2)
    15    {
    16            /*
    17             * Loop back?
    18             */
    19            if (wrq2 == 0 || wrq1 == wrq2) {
    20                    /*
    21                     * driver end of stream?
    22                     */
    23                    if (! (wrq1->q_flag & QEND))
    24                            return (EINVAL);
    26                    ASSERT(wrq1->q_next == 0);      /* sanity */
    28                    /*
    29                     * connect write queue to read queue
    30                     */
    31                    wrq1->q_next = _RD(wrq1);
    33                    /*
    34                     * If write queue does not have a service routine,
    35                     * assign the forward service procedure from the
    36                     * read queue
    37                     */
    39                    if (! wrq1->q_qinfo->qi_srvp)
    40                            wrq1->q_nfsrv = _RD(wrq1)->q_nfsrv;
    41                    /*
    42                     * set back service procedure..
    43                     * XXX - note back service procedure is not implemented
    44                     * this may cause a race condition, breaking it
    45                     * a bit more.
    46                     */
    47                    _RD(wrq1)->q_nbsrv = wrq1;
    48            } else {
    49                    /*
    50                     * driver end of stream?
    51                     */
    52                    if (! (wrq1->q_flag & QEND))
    53                            return (EINVAL);
    54                    if (! (wrq2->q_flag & QEND))
    55                            return (EINVAL);
    57                    ASSERT(wrq1->q_next == NULL);   /* sanity */
    58                    ASSERT(wrq2->q_next == NULL);   /* sanity */
    61                    /*
    62                     * if first queue is a stream head, second must
    63                     * must also be one
    64                     */
    65                    if (! (_RD(wrq1)->q_flag & QEND)) {
    66                            if (_RD(wrq2)->q_flag & QEND)
    67                                    return (EINVAL);
    68                    } else if (! (_RD(wrq2)->q_flag & QEND))
    69                            return (EINVAL);
    70                    /*
    71                     * Twist the stream head queues so that the write queue
    72                     * points to the other stream's read queue.
    73                     */
    74                    wrq1->q_next = _RD(wrq2);
    75                    wrq2->q_next = _RD(wrq1);
    77                    if (! wrq1->q_qinfo->qi_srvp)
    78                            wrq1->q_nfsrv = _RD(wrq2)->q_nfsrv;
    79                    if (! wrq2->q_qinfo->qi_srvp)
    80                            wrq2->q_nfsrv = _RD(wrq1)->q_nfsrv;
    82                    /*
    83                     * Nothing really uses the back service routines,
    84                     * but fill them in for completeness
    85                     */
    87                    _RD(wrq1)->q_nbsrv = wrq2;
    88                    _RD(wrq2)->q_nbsrv = wrq1;
    90                    SETMATED(STREAM(wrq1), STREAM(wrq2));
    91            }
    92            return (0);
    93    }

As a developer, this is the kind of ghetto you hope you never have to enhance or debug: gnarly logic, unsettling comments, and suspect codepaths abound. Further, to a code minimalist like me, it's a personal affront to everything I hold dear.

The function's biggest problem is that it is pointlessly general-purpose (despite the contrary claim on line 11), undoubtedly because no one ever bolted down its requirements. Specifically, the function signature and comments suggest that it can mate any two STREAMS driver queues together at any time. However, because of the limitations expressed on lines 2, 7, 8, and 9, this was never possible in practice -- instead, the function was only used to mate stream head queues before they were put into use. In fact, all of its callers go through extra work to accommodate this needless generality:

                if (dotwist && firstopen) {
                        queue_t *wq = strvp2wq(*vpp);
                        (void) str_mate(wq, wq);

Adding insult to injury, str_mate() can never fail when passed stream head queues -- so, as shown above, all callers simply discard the return value. Indeed, by simply changing str_mate() to accept two vnode pointers and letting it derive the stream heads to mate, we can (a) simplify its callers, (b) simplify its interface, and (c) simplify its implementation. With this change, callers will just do:

                if (dotwist && firstopen)
                        str_mate(*vpp, *vpp);

Regarding (c):

  • Because the queues are guaranteed to be stream heads, lines 61-69 can be removed
  • Because stream heads always have QEND set, lines 20-23 and 49-55 can be removed
  • Because stream heads always have a service procedure, lines 33-40 and 77-81 can be removed. However, since one day this may not be true, an ASSERT() should be added to avoid debugging headaches.

After our first pass, things are starting to look more comprehensible -- and with no loss of functionality:

     1    /*
     2     * Mate the stream heads of two vnodes together. If the two vnodes are the
     3     * same, we just make the write-side point at the read-side -- otherwise,
     4     * we do a full mate.  Only works on vnodes associated with streams that
     5     * are still being built and thus have only a stream head.
     6     */
     7    void
     8    str_mate(vnode_t *vp1, vnode_t *vp2)
     9    {
    10            queue_t *wrq1 = strvp2wq(vp1);
    11            queue_t *wrq2 = strvp2wq(vp2);
    13            /*
    14             * We rely on the stream head always having a service procedure
    15             * to avoid tweaking q_nfsrv.
    16             */
    17            ASSERT(wrq1->q_qinfo->qi_srvp != NULL);
    18            ASSERT(wrq2->q_qinfo->qi_srvp != NULL);
    20            /*
    21             * Loop back?
    22             */
    23            if (wrq2 == 0 || wrq1 == wrq2) {
    24                    ASSERT(wrq1->q_next == 0);      /* sanity */
    26                    /*
    27                     * connect write queue to read queue
    28                     */
    29                    wrq1->q_next = _RD(wrq1);
    31                    /*
    32                     * set back service procedure..
    33                     * XXX - note back service procedure is not implemented
    34                     * this may cause a race condition, breaking it
    35                     * a bit more.
    36                     */
    37                    _RD(wrq1)->q_nbsrv = wrq1;
    38            } else {
    39                    ASSERT(wrq1->q_next == NULL);   /* sanity */
    40                    ASSERT(wrq2->q_next == NULL);   /* sanity */
    42                    /*
    43                     * Twist the stream head queues so that the write queue
    44                     * points to the other stream's read queue.
    45                     */
    46                    wrq1->q_next = _RD(wrq2);
    47                    wrq2->q_next = _RD(wrq1);
    49                    /*
    50                     * Nothing really uses the back service routines,
    51                     * but fill them in for completeness
    52                     */
    54                    _RD(wrq1)->q_nbsrv = wrq2;
    55                    _RD(wrq2)->q_nbsrv = wrq1;
    57                    SETMATED(STREAM(wrq1), STREAM(wrq2));
    58            }
    59    }

That hauled out a third of it, but there's plenty more to go. First, since wrq2 is no longer passed in, there is no risk that it can be NULL, so the check on line 23 can be disposed of. Next, as boldly proclaimed in the comments on lines 32-35 and 50-51, q_nbsrv is indeed unused (a weed which I hauled out as part of 6267823) -- so it can be torched as well. Finally, the "sanity" checks on lines 24, 39, and 40 can be hoisted to neighbor the qi_srvp ASSERT()s at minimal cost (one extra compare on DEBUG systems). Our result:

     1    /*
     2     * Mate the stream heads of two vnodes together. If the two vnodes are the
     3     * same, we just make the write-side point at the read-side -- otherwise,
     4     * we do a full mate.  Only works on vnodes associated with streams that
     5     * are still being built and thus have only a stream head.
     6     */
     7    void
     8    str_mate(vnode_t *vp1, vnode_t *vp2)
     9    {
    10            queue_t *wrq1 = strvp2wq(vp1);
    11            queue_t *wrq2 = strvp2wq(vp2);
    13            /*
    14             * Verify that there are no modules on the stream yet.  We also
    15             * rely on the stream head always having a service procedure to
    16             * avoid tweaking q_nfsrv.
    17             */
    18            ASSERT(wrq1->q_next == NULL && wrq2->q_next == NULL);
    19            ASSERT(wrq1->q_qinfo->qi_srvp != NULL);
    20            ASSERT(wrq2->q_qinfo->qi_srvp != NULL);
    22            /*
    23             * If the queues are the same, just twist; else do a full mate.
    24             */
    25            if (wrq1 == wrq2) {
    26                    wrq1->q_next = _RD(wrq1);
    27            } else {
    28                    wrq1->q_next = _RD(wrq2);
    29                    wrq2->q_next = _RD(wrq1);
    30                    SETMATED(STREAM(wrq1), STREAM(wrq2));
    31            }
    32    }

Behold, under all that cargo-cult programming: a scraggly, unintimidating function -- but functionally identical to the original 92-line thug. Now, we can trivially follow the logic:

  1. If the two vnodes are the same (pipe), then the single stream head has its write-side pointed to its read-side.
  2. If the two vnodes are different (FIFO), then each stream head's write-side is pointed to the other's read-side.

In the second case, the SETMATED() macro is also called in order to link the two stream heads (via sd_mate) and to set the STRMATE sd_flag on each stream head[3]. Since str_mate() is the only routine that establishes mates, it is also the only user of the SETMATED() macro. Because the macro neither adds semantic value nor reduces code duplication, our final version inlines it (so its definition can then be removed), preferring clarity to brevity. In addition, str_mate() has been renamed to strmate() for consistency with other STREAMS functions:

     1    /*
     2     * Mate the stream heads of two vnodes together. If the two vnodes are the
     3     * same, we just make the write-side point at the read-side -- otherwise,
     4     * we do a full mate.  Only works on vnodes associated with streams that
     5     * are still being built and thus have only a stream head.
     6     */
     7    void
     8    strmate(vnode_t *vp1, vnode_t *vp2)
     9    {
    10            queue_t *wrq1 = strvp2wq(vp1);
    11            queue_t *wrq2 = strvp2wq(vp2);
    13            /*
    14             * Verify that there are no modules on the stream yet.  We also
    15             * rely on the stream head always having a service procedure to
    16             * avoid tweaking q_nfsrv.
    17             */
    18            ASSERT(wrq1->q_next == NULL && wrq2->q_next == NULL);
    19            ASSERT(wrq1->q_qinfo->qi_srvp != NULL);
    20            ASSERT(wrq2->q_qinfo->qi_srvp != NULL);
    22            /*
    23             * If the queues are the same, just twist; else do a full mate.
    24             */
    25            if (wrq1 == wrq2) {
    26                    wrq1->q_next = _RD(wrq1);
    27            } else {
    28                    wrq1->q_next = _RD(wrq2);
    29                    wrq2->q_next = _RD(wrq1);
    30                    STREAM(wrq1)->sd_mate = STREAM(wrq2);
    31                    STREAM(wrq1)->sd_flag |= STRMATE;
    32                    STREAM(wrq2)->sd_mate = STREAM(wrq1);
    33                    STREAM(wrq2)->sd_flag |= STRMATE;
    34            }
    35    }

The above is identical to the version in OpenSolaris.
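For readers who want to poke at the twist logic outside of a kernel, here is a small user-space sketch of what strmate() does. The queue and stdata types below are minimal mocks invented for illustration (the q_peer field stands in for the real _RD() relationship, and mock_strmate() takes write queues directly rather than vnodes) -- the real definitions live in the Solaris STREAMS headers:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Minimal user-space mocks of the STREAMS structures -- these are NOT
 * the real Solaris definitions, just enough state to model the pointer
 * twisting that strmate() performs.
 */
#define STRMATE 0x1

typedef struct stdata {
	struct stdata	*sd_mate;	/* the mated stream head, if any */
	int		sd_flag;	/* STRMATE is set once mated */
} stdata_t;

typedef struct queue {
	struct queue	*q_next;	/* next queue in the stream */
	struct queue	*q_peer;	/* paired read queue (mock of _RD()) */
	stdata_t	*q_stream;	/* owning stream head (mock of STREAM()) */
} queue_t;

#define _RD(q)		((q)->q_peer)
#define STREAM(q)	((q)->q_stream)

/*
 * Model of the final strmate() logic, operating on write queues
 * directly rather than deriving them from vnodes via strvp2wq().
 */
static void
mock_strmate(queue_t *wrq1, queue_t *wrq2)
{
	/* No modules on either stream yet. */
	assert(wrq1->q_next == NULL && wrq2->q_next == NULL);

	if (wrq1 == wrq2) {
		/* Same stream: point the write-side at its own read-side (pipe). */
		wrq1->q_next = _RD(wrq1);
	} else {
		/* Distinct streams: full mate (FIFO), then link the heads. */
		wrq1->q_next = _RD(wrq2);
		wrq2->q_next = _RD(wrq1);
		STREAM(wrq1)->sd_mate = STREAM(wrq2);
		STREAM(wrq1)->sd_flag |= STRMATE;
		STREAM(wrq2)->sd_mate = STREAM(wrq1);
		STREAM(wrq2)->sd_flag |= STRMATE;
	}
}
```

Exercising the two cases shows the difference directly: mating a queue to itself only sets q_next, while mating two distinct queues cross-links the read-sides and marks both stream heads as mated.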


[1] Of course, no one who has actually done design believes in a pure waterfall model -- instead, designs must be revised over the lifetime of the project. Regardless, the more rigorous the initial design is, the fewer changes that will have to be made later in the project.
[2] While most developers pride themselves on the "lines of code" they've *produced*, I'm quite the opposite: I feel a profound sense of failure if the codebase is larger after I've integrated a bugfix or feature.
[3] STRMATE isn't strictly needed, but since sd_flag is often checked to accept or reject certain STREAMS operations, it's more convenient than constantly checking whether sd_mate != NULL.


Sunday Sep 18, 2005

Openly Excited

The past few weeks have been some of the most exciting (and hectic!) of my professional career, as I acclimate to the power and responsibility of an increasingly open software development model.

To rewind for a second: Clearview is a project that Sebastien and I are spearheading to rationalize, unify, and enhance the way network interfaces are handled in Solaris. "What?", you say? Well, by defining and implementing a set of core unifying attributes (or unalienable rights)[1] for all interfaces, we can address a wide variety of administrative, programmatic, and diagnostic problems. For example, after this work, one will be able to:

  • Observe all IP layer network traffic, including loopback, IPMP group and IP tunnel traffic.
  • Observe all IP layer network traffic flowing to and from a zone.
  • Administer all network interfaces using dladm(1M).
  • Use VLANs and form link aggregations on all Ethernet devices.
  • Use IPMP with technologies such as DHCP and routing protocols.

... though that list is far from comprehensive (nor does the above list constitute a commitment -- this is work-in-progress, mind you!)

Before I start hearing cries of "vaporware!", let's get back to the part that has me stoked: Clearview is one of the first Solaris projects to directly involve the OpenSolaris community in its design. To wit, our team is in the process of releasing design materials for each component of the Clearview project, and is actively soliciting input from the community at large.

In fact, the first document -- covering our massive overhaul of IPMP -- was sent out a week ago, and has already received some great feedback (keep it coming!). Further, just this morning, Sebastien unveiled our proposed Tunnelling Device Driver -- if you've spent much time with tunnels in Solaris, this will be a sight for sore eyes. Of course, that's not all: in the coming weeks, Phil Kirk will provide our design for IP-Level Observability Devices, and Cathy Zhou will detail our sinister plan to bring the power of dladm to all Solaris network interfaces (and more!).

When we say directly involve, we mean it. That is, this review is instead of, rather than in addition to, our traditional Sun-internal design review. As such, other project teams inside Sun are seeing these materials for the first time as well -- and we have encouraged all of them to provide their feedback to the community forum. We hope that this underscores not only our commitment to open up our development process, but our desire to actively involve the community in each step, from design through integration, and beyond.

Speaking of design: we hope these hefty documents make it clear that we do not believe in leaving the design or architecture of Solaris to the whims of late-night hackery. Moreover, although technical documents are rarely known for their humanity, we hope that our materials provide a glimpse into the deep pride we each have in both Solaris and our craft.

Needless to say, your participation is most welcome -- and rest assured that we will continue to seek your involvement as the components proceed through various stages of our software development process.


[1] The rights themselves and their implications are too important to serve as a sidebar to this blog entry, and will surely get one of their own in the future.
