Sunday May 31, 2009

Clearview IPMP in Production

When I was first getting obsessed with programming in my early teens, I recall waking up on many a Saturday to the gleeful realization I had the whole day to improve some crazy piece of home-grown software. Back then, the excitement was simply in the journey itself -- I was completely content in being the entire userbase.

Of course I still love writing software (though the idea of being able to devote a whole day to it seems quaint) -- but it pales in comparison to the thrill of knowing that real people are using that software to solve their real problems. Unfortunately, with enterprise-class products such as Solaris, release schedules have historically meant that a completed project may have to wait years until it gets to solve its first real-world problem. By then, several other projects may have run their course and I'm invariably under another one's spell and not in the right frame of mind to even reminisce, let alone rejoice.

Thankfully, times have changed. First, courtesy of OpenSolaris's ipkg /dev repository, only a few weeks after January's integration, Clearview IPMP was available for bleeding-edge customers to experiment with (and based on feedback I've received, quite a few have successfully done so). Second, for the vast majority who need a supported release, Clearview IPMP can now be found in the brand-new OpenSolaris 2009.06 release. Third, thanks to the clustering team, Clearview IPMP also works with the current version of OpenSolaris Open HA Cluster.

Further, there is one little-known but immensely important release vehicle for Clearview IPMP: the Sun Storage 7000 Q2 release. Indeed, in the months since the integration of Clearview IPMP, I've partnered with the Fishworks team on getting all of the latest and greatest networking technologies from OpenSolaris into the Sun Storage 7000 appliances. As such, the Q2 release contains all of the Solaris networking projects delivered up to OpenSolaris build 106 (most notably Volo and Crossbow), plus Clearview IPMP from build 107. Of course, these projects also open up a range of new opportunities for the appliance -- especially around Networking QoS and simplified HA configuration -- which will find their way into subsequent quarterly releases.

Needless to say, all of this is immensely satisfying for me personally -- especially the idea that some of our most demanding enterprise customers are relying on Clearview IPMP to ensure their mission-critical storage remains available when networking hardware or upstream switches fail. As per my blog entry announcing Clearview IPMP in OpenSolaris, it's clear I'm a proud parent, but given the thrashing we've given it internally and its track-record thus far with customers, I'm confident it's ready for prime time.

For those exploring IPMP for the first time, Xiang Zhou (the co-author of its extensive test suite) has put together a great blog entry, including step-by-step instructions. Additionally, Raoul Carag and I extensively revised the IPMP administrative overview and IPMP tasks guide.

Those familiar with Solaris 10 IPMP may wish to check out a short slide deck that highlights the core differences and new utilities (if nothing else, I'd recommend scanning slides 12-21).

Have fun -- and of course, I (and the rest of the Clearview team) am eager to hear how it stacks up against your real-world networking high-availability problems!

Tuesday May 12, 2009

Hunting Cruft

It's no secret that I am borderline-O.C.D. in many aspects of my life -- and especially so when it comes to developing software. However, large-scale software development is inherently a messy process, and even with the most disciplined engineering practices, remnants from aborted or bygone designs often remain, lying in wait to confuse future developers.

Thankfully, many of the more obvious remnants can be identified with automated programs. For instance, the venerable lint utility can identify unused functions within an application. Many moons ago, I applied a similar concept to the OS/Net nightly build process with a utility called findunref that allows us to automatically identify files in the source tree that are not used during a build. (Frighteningly, it also identified 1100 unreferenced files in the sourcebase. That is, roughly 4% of the files we were dutifully maintaining had no bearing whatsoever on our final product. Of course, some of these should have been used, such as disconnected localization files and packaging scripts.)

Cruft-wise, Clearview IPMP posed a particular challenge: the old IPMP implementation was peanut-buttered through 135,000 lines of code in the TCP/IP stack, and I was determined to leave no trace of it behind. As such, over time I amassed collection of programs which were run as cron jobs that mined the sourcebase for possible vestiges (note that this was an ongoing task because the sourcebase Clearview IPMP replaced was still undergoing change to address critical customer needs). Some of these programs were simple (e.g., text-based searches for old IPMP-related abbreviations such as "ill group" and "ifgrp"), but others were a bit more evolved.

For instance, one key problem is the identification of unused functions. As I mentioned earlier, lint can identify unused functions in a program, but for a kernel module like ip things are more complex because other kernel modules may be the lone consumers of symbols provided by it. While it is possible to identify all the dependent modules, build lint libraries for each of them and perform a lint crosscheck across them (and in fact, we do these during the nightly build, though not for unused functions), it is also quite time-consuming and as such a bit heavyweight for my needs.

Thinking about the problem further, another solution emerged: during development, it is customary to maintain a source code cross-reference database, typically built with the classic cscope utility. A little-known aspect of cscope is that it can be scripted. For instance, to find the definition for symbol foo, one can do cscope -dq -L1 foo. Indeed, a common way to check that a symbol is unused is to (interactively) look for all uses of the symbol in cscope. Thus, for a given kernel module, it is straightforward to write a script to find unused functions: use nm(1) to obtain the module's symbol table and then check whether each of those symbols is used via cscope's scripting interface. In fact, that is exactly what my tiny dead-funcs utility does. Clearly, this requires the kernel module to be build from the same source base as the cscope database, and identifies already-extant cruft (in addition to interfaces that may have consumers outside of the OS/Net source tree), but it nonetheless proved quite useful during development (and has been valuable to others as well).

A similar approach can be followed to ensnare dead declarations, though some creativity may be needed to build the list of function/variable names to feed to cscope, as the compiler will have already pruned them out prior to constructing the kernel module and header files require care to properly parse. I resorted to convincing lint to build a lint library out of the header file in question (via PROTOLIB1), then using lintdump (another utility I contributed to the OS/NET tool chain) to dump out the symbol list -- admittedly a clunky approach, but effective nonetheless.

Unfortunately, scripts such as dead-funcs are too restrictive to become general-purpose tools in our chain, though perhaps you will find them (or their approaches) useful for your own O.C.D. development.

Wednesday Jan 21, 2009

Clearview IPMP in OpenSolaris

Clearview IPMP in OpenSolaris At long last, and somewhat belatedly, I'm thrilled to announce that the Clearview IPMP Rearchitecture has integrated into Solaris Nevada (build 107)! Build 107 has just closed internally, so internal WOS images will be available in the next few days (unfortunately, it will likely be a few weeks before the bits are available via OpenSolaris packages). For more on the new administrative experience, please check out the revised IPMP documentation, or Steffen Weiberle's in-depth blog entry. For more on the internal design, there's an extensive high-level design document, internals slides and numerous low-level design discussions in the code itself.

Here, I'd like to get a bit more personal as the designer and developer of Clearview IPMP. The project has been a real labor of love, borne both from the challenges many of Sun's top enterprise customers have faced trying to deploy IPMP, and from the formidable internal effort needed to keep the pre-Clearview IPMP implementation chugging along for the past decade. That is, it became clear that IPMP was both simultaneously a critical high-availability technology for our top customers and also an increasing cost on both our engineering and support organizations -- we either needed to kill it or fix it. Ever the optimist and buoyed by a growing customer interest in IPMP, I convinced management that I could tackle this work as part of the broader Clearview initiative that Seb and I were in the process of scoping (and moreover, either killing or fixing IPMP was required to meet Clearview's Umbrella Objectives).

From an engineering standpoint, IPMP is a case study in how much it matters to have the right abstractions. Specifically, the old (pre-Clearview) model was a struggle in large part because it introduced a new "group" abstraction to represent the IPMP group as a whole, rather than modeling an IPMP group as an IP interface (more on core network interface abstractions). This meant that every technology that interacted directly with IP interfaces (e.g., routing, filtering, QoS, monitoring, ...), required heaps of special-case code to deal with IPMP, which introduced significant complexity and a neverending stream of corner cases, some of which were unresolvable. It also made certain technologies (e.g., DHCP) downright impossible to implement, because their design was based on assumptions that held in \*all\* cases other than IPMP (e.g, that a given IP address would not move between IP interfaces). More broadly, with each new networking technology, significant effort was needed to consider how it could be made to work with IPMP, which simply does not scale.

The real tragedy of the old implementation is that the actual semantics -- while often misunderstood by customers and Sun engineers alike -- actually acted as if each IPMP group had an IP interface. For instance, if one placed two IP interfaces into an IPMP group, then added a route over one of those IP interfaces, it was as if a route had been added over the IPMP group. I say "tragedy" because this was wholly unobvious, and thus understandably led to numerous support calls. Similar surprises came from the fact that a packet with a source IP address from one IP interface could be sent out through another IP interface. In short, the implementation had cobbled together various other abstractions to build something that acted mostly like an IPMP group IP interface, but wasn't actually one.

From this one central mistake came a raft of related problems that impacted both the programmatic and administrative models. For instance, in addition to having to teach technologies about IPMP groups, consider what happens when an IP interface fails. In concept, this should be a simple operation: the IP addresses that were mapped to the failed interface's hardware address need to be remapped to the hardware address of a functioning interface in the group. This remapping can occur entirely within IP itself -- applications using those IP addresses should not need to know or care. However, in the old IPMP implementation, this was actually a very disruptive operation: the IP addresses had to be visibly moved from the failed IP interface to a functioning IP interface, confusing applications that either interacted with the IP interface namespace or listened to routing sockets. Moreover, the application had to be specially coded to know that while the IP interface had failed, it should not react to the failure because another IP interface had taken over responsibility. Similar problems abounded in areas both far and near; an interesting recent example is the issue Steffen found with the new defrouter feature and Solaris 10 IPMP. That problem doesn't exist with Clearview IPMP not because we overpowered it with reams of code but simply because the Clearview IPMP design precludes it.

Speaking of "reams of code", one of the aspects I'm most proud of with Clearview IPMP is the size of the codebase. In terms of raw numbers, the kernel implementation has shrunk by more than 35%, from roughly 8500 lines of code to 5500 lines (roughly 1000 lines of that are comments), and the lion's share of that code is isolated behind a simple kernel API of a few dozen functions (in contrast, the old IPMP codebase was sprawling and often written in-line). More importantly, the work needed to integrate the Clearview IPMP code with related technology was minimal: packet monitoring across the group required 15 lines of code; IP filter support required 5 lines of code; dynamic routing required no additional code. The new model also opened up unexpected opportunities, such as allowing the IPSQ framework (the core synchronization framework inside IP) to be massively simplified. Further, as a side effect of the new model, Clearview IPMP was able to fix many longstanding bugs -- some as old as IPMP itself -- such as 5015757, 6184000, 6359536, 6516992, 6591186, 6698480, 6752560, and 6787091 (among others).

Anyway, it's obvious that I'm a proud and biased parent. Whether my pride is justified will only become clear once Clearview IPMP has ten years of production use under its belt and an objective comparison is possible. However, I encourage you all to take it for a spin now and make your own assessment -- and of course feedback is welcome, either to me in private or on

Technorati Tag:
Technorati Tag:
Technorati Tag:

Tuesday Sep 02, 2008

Creating Shell-Friendly Parsable Output

Creating Shell-Friendly Parsable Output

Being able to easily write scripts from the command-line has long been regarded as one of UNIX's core strengths. However, over the years, surprisingly little attention has been paid to writing CLIs whose output lend themselves to scripting. Indeed, even modern CLIs often fail to consider parsable output as a distinct concept from human output, leading to overwrought and fragile scripts which inevitably break as the CLI is enhanced over time. Some recent CLIs have "solved" the parsable format problem by using popular formats such as XML and JSON. These are fine formats for sophisticated scripting languages, but a poor match for traditional UNIX line-oriented tools (e.g. grep, cut, head) that form the foundation of shell-scripting.

Even those CLIs that consider the shell when designing a parsable output format often fall short of the mark. For dladm(1M), it took us (Sun) three tries to create a format that can be easily parsed from the shell. So, while the final format we settled on may seem simple and obvious, as is often the case, making things simple can prove to be surprisingly hard. Further, there are a number of alternative output formats that seem compelling at first blush but ultimately prove to be unworkable.

So that others working on similar problems may benefit, below I've summarized our set of guidelines -- some obvious, some not -- that we arrived at while working on dladm. As each CLI has its own constraints, not all of them may prove applicable, but I'd urge anyone designing a CLI with parsable output to consider each one carefully.

To provide some specifics to hang our guidelines on, first, here's an example of the dladm human output format:

  # dladm show-link -o link,class,over
  LINK        CLASS    OVER
  eth0        phys     --
  eth1        phys     --
  eth2        phys     --
  default0    aggr     eth0 eth1 eth2
  cyan0       vlan     default0
... and here's the equivalent parsable output format:
  # dladm show-link -p -o link,class,over
  default0:aggr:eth0 eth1 eth2
Now, the guidelines:
  1. Design CLIs that output in a regular format -- even in human output mode.

    Once your human output mode ceases to be regular (ifconfig(1M) output is a prime example of an irregular format), later adding a parsable output mode becomes difficult if not impossible. (As an aside, I've often found that irregular output suggests deeper design flaws, either in the CLI itself or the objects it operates on.)

  2. Prefer tabular formats in parsable output mode.

    Because traditional shell scripting works best with lines of information, tabular formats where each line both identifies and describes a unique object are ideal. For example, above, the link field uniquely identifies the object, and the class and over fields describe that object. In some cases, multiple fields may be required to uniquely identify the object (e.g., with dladm show-linkprop, both the link and the property field are needed). As an aside: in the multiple-field case, the human output mode may choose to use visual nesting (e.g., by grouping all of a link's properties together on successive lines and omitting the link value entirely), but it's important this not be done in parsable output mode so that the shell script can remain simple.

  3. Require output fields to be specified.

    Unlike humans, scripts always invoke a CLI with a specific purpose in mind. Also unlike humans, scripts are not facile at adapting to change (e.g., the addition of new fields). Thus, it's imperative that scripts be forced to explicitly specify the fields they need (with dladm, attempting to use -p without -o yields an error message). With this approach, new fields can be added to a CLI without any risk of breaking existing consumers. Further, if a field used by a script is removed, the failure mode becomes hard (the CLI will produce an error), rather than soft (the consumer misparses the CLI's output and does something unpredictable). Note that for similar reasons, if your CLI provides a way to print field subsets that may change over time (e.g., -o all), those must also fail in parsable output mode.

  4. Leverage field specifiers to infer field names.

    Because field names must be specified in an order, it's natural to use that same order as the output order, and thus avoid having to explicitly identify the field names in the parsable output format. That is, as shown above, dladm can omit indicating which field name corresponds with which value because the order is inferred from the invocation of -olink,class,over. This may seem a minor point, but in practice it saves a lot of grotty work in the shell to otherwise correlate each field name with its value.

  5. Omit headers.

    Similarly, because the field order is known (and no human will be staring at the output) there is no utility in providing a header in parsable output mode, and indeed its presence would only complicate parsing. As shown above, dladm omits the header in parsable output mode.

  6. Do not adorn your field values.

    In human output mode, it can be useful to give visual indications for specific field values. For instance, as shown above, dladm shows empty values as "--" in human output mode so that the table does not look malformed. In parsable output mode, such embellishments only complicate and confuse consumers of the data (and may in fact make it ambiguous), and thus should be avoided. As above, in parsable output format, empty values are shown as actually being empty.

  7. Do not use whitespace as a field separator.

    Whitespace may seem like a natural field separator, but in practice it's problematic. Specifically, many shells treat whitespace separators specially by merging consecutive instances into a single instance. For example, consider representing three consecutive empty values. With a non-whitespace field separator such as ":", this would be output as "::" (empty value 1, : separator, empty value 2, :, empty value 3). With the shell's IFS variable set to ":", the shell will parse this as three separate empty values, as intended. With space as the field separator, this would be output as "   ", and with IFS set to " " the shell would misparse this as a single empty value.

  8. Do not restrict your allowed field values.

    While some fields may be controlled directly by the CLI (e.g., the class field above), others are either outside of your direct control (e.g., the link field above), or outside of even your system's control (e.g., the essid field output by dladm show-wifi). As such, aside from ensuring the field value is printable ASCII (where newline is considered as unprintable), no values should be filtered out or forbidden[1].

    Thus, any values that have special meaning should generally be escaped. For instance, with ":" as a field delimiter, IPv6 address "fe80::1" would become "fe80\\:\\:1" when displayed in parsable output mode. Thankfully, escaping does not complicate shell parsing because all popular scripting shells have read builtins that will automatically strip escapes. Thus, the common idiom of piping the output to a read/while loop works as expected without any special-purpose logic. For instance, even though the BSSID field will contain embedded colons, this will loop through each BSSID on each link, trying to connect to one until it succeeds:

          dladm scan-wifi -p -o link,bssid |
          while IFS=: read link bssid; do
                  dladm connect-wifi -i $bssid $link && break
    That said, if only a single field has been requested, the field separator is not needed. Since no ambiguity exists in that case, there's no need to escape it -- and not doing so can make things more convenient for other shell idioms -- e.g., to collect all in-range SSIDs:
          ssids=`dladm scan-wifi -p -o bssid`
I'd welcome hearing back from others who have tackled this problem.

[1] If unprintable ASCII values can legitimately occur in a given field's output, you need to use another encoding format.

Wednesday Aug 20, 2008



Yes, it's been a whole year since I last posted a blog entry. Between moving from Boston to San Francisco (metro, anyway), countless urgent matters (both professional and personal), and wrapping up Clearview IPMP development (more on that real soon), blogging hasn't exactly been top priority. That said, I have amassed a really nice list of topics for future blog entries over the coming weeks (OK, maybe months ;-).

Before we get to all that though, I have an urgent tip for those who are using GNOME's Nautilus on OpenSolaris build 94 or later. It seems that the GNOME development team (not inside Sun) decided to change the Open Terminal menu item (available by right clicking on the desktop) to Open in Terminal, and correspondingly changed things so that the GNOME terminal will open in your ~/Desktop directory, rather than ~. The unmitigated idiocy and arrogance of this change is beyond comprehension, and the pain associated with it only intensifies with each opened terminal. Nonetheless, thankfully, there is a simple way to restore the previous (correct) behavior:

  gconftool-2 -s /apps/nautilus-open-terminal/desktop_opens_home_dir
              -t bool true 
Hope this saves some other poor soul from spending half a day digging through the GNOME sources for a solution.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Wednesday Apr 25, 2007

IPMP Development Update #2

IPMP Development Follow-up

Several folks have again (understandably) asked for updates on the Next-Generation IPMP work. Significant progress has been made since my last update. Notably:

  • Probe-based failure detection is operational (in addition to the earlier support for link-based failure detection).
  • DR support of interfaces using IPMP through RCM works. Thanks to the new architecture, the code is almost 1000 lines more compact than Solaris's current implementation -- and more robust.
  • Boot support is now complete. That is any number (including all) interfaces can be missing at boot and then transparently repaired during operation.
  • At long last, ipmpstat. As discussed in the high-level design document, this is a new utility that allows the IPMP subsystem to be compactly examined.

Since ipmpstat allows other aspects of the architecture to be succinctly examined, let's take a quick look at a simple two-interface group on my test system:

  # ipmpstat -g
  net57       a           ok        10000ms   ce1 ce0

As we can see, the "-g" (group) output mode tells us all the basics about the group: the group interface name and group name (these will usually be the same, but differ above for illustrative purposes), its current state ("ok", indicating that all of the interfaces are operational), the maximum time needed to detect a failure (10 seconds), and the interfaces that comprise the group.

We can get a more detailed look at the IPMP health and configuration of the interfaces under IPMP using the "-i" (interface) output mode:

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

Here, we can see that ce0 has probe-based failure detection disabled. We can also see issues that prevent an interface from being used (aka being "active") -- e.g., if suppose we enable standby on ce0:

  # ifconfig ce0 standby

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         no      net57       si----  up        disabled  ok

We can see that ce0 is now no longer active, because it's an inactive standby (indicated by the "i" and "s" flags). This means that all of the addresses in the group must be restricted to ce1 (unless ce1 becomes unusable), which we can see via the "-a" (address) output mode ("-n" turns off address-to-hostname resolution):

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND         net57       up      ce1         ce1          net57       up      ce1         ce1

For fun, we can offline ce1 and observe the failover to ce0:

  # if_mpadm -d ce1

  # ipmpstat -i
  ce1         no      net57       ----d-  disabled  disabled  offline
  ce0         yes     net57       s-----  up        disabled  ok
[ In addition to the "offline" state, the "d" flag also indicates that all of the addresses on ce0 are down, preventing it from receiving any traffic. ]
  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND         net57       up      ce0         ce0          net57       up      ce0         ce0
We can also convert ce0 back to a "normal" interface, online ce1 and observe the load spreading configurations:
  # ifconfig ce0 -standby
  # if_mpadm -r ce1

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND         net57       up      ce0         ce1 ce0          net57       up      ce1         ce1 ce0
In particular, this indicates that incoming traffic to will go to ce0 and inbound traffic to will go to ce1 (as per the ARP mappings). However, outbound traffic will potentially flow over either interface (though to sidestep packet ordering issues, a given connection will remain latched unless the interface becomes unusable).

This also highlights another aspect of the new IPMP design: the kernel is responsible for spreading the IP addresses across the interfaces (rather than the administrator). The current algorithm simply attempts to keep the number of IP addresses "evenly" distributed over the set of interfaces, but more sophisticated policies (e.g., based on load measurements) could be added in the future.

To round out the ipmpstat feature set, one can also monitor the targets and probes used during probe-based failure detection:

  # ipmpstat -tn
  ce1         mcast
  ce0         disabled  --                  --
Above, we can see that ce1 is using "mcast" (multicast) mode to discover its probe targets, and we can see the targets it has decided to probe, in firing order. We can also look at the probes themselves, in real-time:
  # ipmpstat -pn
  TIME      INTERFACE   PROBE     TARGET              RTT       RTTAVG    RTTDEV
  1.15s     ce1         112         1.09ms    1.14ms    0.11ms
  2.33s     ce1         113         1.11ms    1.18ms    0.13ms
  3.94s     ce1         114         1.07ms    2.10ms    2.00ms
  5.38s     ce1         115         1.08ms    1.14ms    0.10ms
  6.19s     ce1         116         1.43ms    1.20ms    0.19ms
  7.73s     ce1         117         1.04ms    1.13ms    0.11ms
  9.47s     ce1         118         1.04ms    1.16ms    0.13ms
  10.67s    ce1         119         1.06ms    1.97ms    1.76ms
Above, the inflated RTT average and standard deviation for indicate that something went wrong with in the not-too-distant past. (As an aside: "-p" also revealed a subtle longstanding bug in in.mpathd that was causing inflated jitter times for probe targets; see 6549950.)

Anyway, hopefully all this gives you not only a feel for ipmpstat, but a feel for how development is progressing. It should be noted that several key features are still missing, such as:

  • Broadcast and multicast support on IPMP interfaces.
  • IPv6 traffic on IPMP interfaces.
  • IP Filter support on IPMP interfaces.
  • MIB and kstat support on IPMP interfaces.
  • DHCP on IPMP interfaces.
  • Sun Cluster support.
All of these are currently being worked on. In the meantime, we will be making early-access BFU archives based on what we have so far to those who are interested in kicking the tires. (And a big thanks to those customers who have already volunteered!)

Technorati Tag:
Technorati Tag:
Technorati Tag:




« June 2016

No bookmarks in folder


No bookmarks in folder