Sunday May 31, 2009

Clearview IPMP in Production

When I was first getting obsessed with programming in my early teens, I recall waking up on many a Saturday to the gleeful realization I had the whole day to improve some crazy piece of home-grown software. Back then, the excitement was simply in the journey itself -- I was completely content in being the entire userbase.

Of course I still love writing software (though the idea of being able to devote a whole day to it seems quaint) -- but it pales in comparison to the thrill of knowing that real people are using that software to solve their real problems. Unfortunately, with enterprise-class products such as Solaris, release schedules have historically meant that a completed project may have to wait years until it gets to solve its first real-world problem. By then, several other projects may have run their course and I'm invariably under another one's spell and not in the right frame of mind to even reminisce, let alone rejoice.

Thankfully, times have changed. First, courtesy of OpenSolaris's ipkg /dev repository, only a few weeks after January's integration, Clearview IPMP was available for bleeding-edge customers to experiment with (and based on feedback I've received, quite a few have successfully done so). Second, for the vast majority who need a supported release, Clearview IPMP can now be found in the brand-new OpenSolaris 2009.06 release. Third, thanks to the clustering team, Clearview IPMP also works with the current version of OpenSolaris Open HA Cluster.

Further, there is one little-known but immensely important release vehicle for Clearview IPMP: the Sun Storage 7000 Q2 release. Indeed, in the months since the integration of Clearview IPMP, I've partnered with the Fishworks team on getting all of the latest and greatest networking technologies from OpenSolaris into the Sun Storage 7000 appliances. As such, the Q2 release contains all of the Solaris networking projects delivered up to OpenSolaris build 106 (most notably Volo and Crossbow), plus Clearview IPMP from build 107. Of course, these projects also open up a range of new opportunities for the appliance -- especially around Networking QoS and simplified HA configuration -- which will find their way into subsequent quarterly releases.

Needless to say, all of this is immensely satisfying for me personally -- especially the idea that some of our most demanding enterprise customers are relying on Clearview IPMP to ensure their mission-critical storage remains available when networking hardware or upstream switches fail. As per my blog entry announcing Clearview IPMP in OpenSolaris, it's clear I'm a proud parent, but given the thrashing we've given it internally and its track record thus far with customers, I'm confident it's ready for prime time.

For those exploring IPMP for the first time, Xiang Zhou (the co-author of its extensive test suite) has put together a great blog entry, including step-by-step instructions. Additionally, Raoul Carag and I extensively revised the IPMP administrative overview and IPMP tasks guide.

Those familiar with Solaris 10 IPMP may wish to check out a short slide deck that highlights the core differences and new utilities (if nothing else, I'd recommend scanning slides 12-21).

Have fun -- and of course, I (and the rest of the Clearview team) am eager to hear how it stacks up against your real-world networking high-availability problems!

Tuesday May 12, 2009

Hunting Cruft

It's no secret that I am borderline-O.C.D. in many aspects of my life -- and especially so when it comes to developing software. However, large-scale software development is inherently a messy process, and even with the most disciplined engineering practices, remnants from aborted or bygone designs often remain, lying in wait to confuse future developers.

Thankfully, many of the more obvious remnants can be identified with automated programs. For instance, the venerable lint utility can identify unused functions within an application. Many moons ago, I applied a similar concept to the OS/Net nightly build process with a utility called findunref that allows us to automatically identify files in the source tree that are not used during a build. (Frighteningly, it also identified 1100 unreferenced files in the sourcebase. That is, roughly 4% of the files we were dutifully maintaining had no bearing whatsoever on our final product. Of course, some of these should have been used, such as disconnected localization files and packaging scripts.)

Cruft-wise, Clearview IPMP posed a particular challenge: the old IPMP implementation was peanut-buttered through 135,000 lines of code in the TCP/IP stack, and I was determined to leave no trace of it behind. As such, over time I amassed a collection of programs, run as cron jobs, that mined the sourcebase for possible vestiges (note that this was an ongoing task because the sourcebase Clearview IPMP replaced was still undergoing change to address critical customer needs). Some of these programs were simple (e.g., text-based searches for old IPMP-related abbreviations such as "ill group" and "ifgrp"), but others were a bit more evolved.
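As a rough illustration of the simpler kind of check, here is a minimal sketch of a text-based vestige search. It is written in Python purely for illustration and is not one of the original cron jobs; the function name find_vestiges and the choice of file suffixes are assumptions, and only the two abbreviations come from the text above.

```python
import os
import re

# The two abbreviations named in the text; extend the tuple as needed.
VESTIGE_TERMS = ("ill group", "ifgrp")

def find_vestiges(root, terms=VESTIGE_TERMS, suffixes=(".c", ".h")):
    """Yield (path, line_number, line) for each source line mentioning a term."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue  # only scan the assumed C source and header suffixes
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as src:
                for lineno, line in enumerate(src, start=1):
                    if pattern.search(line):
                        yield path, lineno, line.rstrip("\n")
```

Run periodically against a moving sourcebase, a search like this reports any newly introduced mention along with its file and line.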

For instance, one key problem is the identification of unused functions. As I mentioned earlier, lint can identify unused functions in a program, but for a kernel module like ip things are more complex because other kernel modules may be the lone consumers of symbols provided by it. While it is possible to identify all the dependent modules, build lint libraries for each of them and perform a lint crosscheck across them (and in fact, we do these during the nightly build, though not for unused functions), it is also quite time-consuming and as such a bit heavyweight for my needs.

Thinking about the problem further, another solution emerged: during development, it is customary to maintain a source code cross-reference database, typically built with the classic cscope utility. A little-known aspect of cscope is that it can be scripted. For instance, to find the definition for symbol foo, one can do cscope -dq -L1 foo. Indeed, a common way to check that a symbol is unused is to (interactively) look for all uses of the symbol in cscope. Thus, for a given kernel module, it is straightforward to write a script to find unused functions: use nm(1) to obtain the module's symbol table and then check whether each of those symbols is used via cscope's scripting interface. In fact, that is exactly what my tiny dead-funcs utility does. Clearly, this requires the kernel module to be built from the same source base as the cscope database, and it also flags already-extant cruft (in addition to interfaces that may have consumers outside of the OS/Net source tree), but it nonetheless proved quite useful during development (and has been valuable to others as well).
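The nm-plus-cscope approach can be sketched roughly as follows. This is not the dead-funcs utility itself, only an illustration under stated assumptions: that nm -p prints lines of the form "value type name", that text symbols are marked T or t, and that cscope's "find functions calling this function" query (-L3) is an adequate test for use. That last assumption means a function referenced only through a pointer would be misreported as dead. All function names below are hypothetical.

```python
import subprocess

def parse_nm_functions(nm_output):
    """Return function names from nm -p style lines: [value] type name."""
    names = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Keep only text (code) symbols; skip data and undefined symbols.
        if len(fields) >= 2 and fields[-2] in ("T", "t"):
            names.append(fields[-1])
    return names

def module_functions(module_path):
    """Run nm on a module and return the functions it defines."""
    nm = subprocess.run(["nm", "-p", module_path],
                        capture_output=True, text=True, check=True)
    return parse_nm_functions(nm.stdout)

def cscope_callers(symbol, cscope_dir="."):
    """Return cscope's line-oriented hits for callers of symbol."""
    result = subprocess.run(["cscope", "-dq", "-L3", symbol],
                            cwd=cscope_dir, capture_output=True,
                            text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]

def dead_functions(symbols, callers=cscope_callers):
    """Return the symbols for which the lookup reports no callers."""
    return [sym for sym in symbols if not callers(sym)]
```

Something like dead_functions(module_functions(path)) would then be run from the directory holding the cscope database; the callers parameter exists so the lookup can be swapped out, for example with a cached or stubbed query.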

A similar approach can be followed to ensnare dead declarations, though some creativity may be needed to build the list of function/variable names to feed to cscope, as the compiler will have already pruned them out prior to constructing the kernel module, and header files require care to parse properly. I resorted to convincing lint to build a lint library out of the header file in question (via PROTOLIB1), then using lintdump (another utility I contributed to the OS/Net tool chain) to dump out the symbol list -- admittedly a clunky approach, but effective nonetheless.

Unfortunately, scripts such as dead-funcs are too restrictive to become general-purpose tools in our chain, though perhaps you will find them (or their approaches) useful for your own O.C.D. development.

Wednesday Jan 21, 2009

Clearview IPMP in OpenSolaris

At long last, and somewhat belatedly, I'm thrilled to announce that the Clearview IPMP Rearchitecture has integrated into Solaris Nevada (build 107)! Build 107 has just closed internally, so internal WOS images will be available in the next few days (unfortunately, it will likely be a few weeks before the bits are available via OpenSolaris packages). For more on the new administrative experience, please check out the revised IPMP documentation, or Steffen Weiberle's in-depth blog entry. For more on the internal design, there's an extensive high-level design document, internals slides and numerous low-level design discussions in the code itself.

Here, I'd like to get a bit more personal as the designer and developer of Clearview IPMP. The project has been a real labor of love, born both from the challenges many of Sun's top enterprise customers have faced trying to deploy IPMP, and from the formidable internal effort needed to keep the pre-Clearview IPMP implementation chugging along for the past decade. That is, it became clear that IPMP was simultaneously a critical high-availability technology for our top customers and an increasing cost on both our engineering and support organizations -- we either needed to kill it or fix it. Ever the optimist and buoyed by a growing customer interest in IPMP, I convinced management that I could tackle this work as part of the broader Clearview initiative that Seb and I were in the process of scoping (and moreover, either killing or fixing IPMP was required to meet Clearview's Umbrella Objectives).

From an engineering standpoint, IPMP is a case study in how much it matters to have the right abstractions. Specifically, the old (pre-Clearview) model was a struggle in large part because it introduced a new "group" abstraction to represent the IPMP group as a whole, rather than modeling an IPMP group as an IP interface (more on core network interface abstractions). This meant that every technology that interacted directly with IP interfaces (e.g., routing, filtering, QoS, monitoring, ...) required heaps of special-case code to deal with IPMP, which introduced significant complexity and a neverending stream of corner cases, some of which were unresolvable. It also made certain technologies (e.g., DHCP) downright impossible to implement, because their design was based on assumptions that held in *all* cases other than IPMP (e.g., that a given IP address would not move between IP interfaces). More broadly, with each new networking technology, significant effort was needed to consider how it could be made to work with IPMP, which simply does not scale.

The real tragedy of the old implementation is that the actual semantics -- while often misunderstood by customers and Sun engineers alike -- actually acted as if each IPMP group had an IP interface. For instance, if one placed two IP interfaces into an IPMP group, then added a route over one of those IP interfaces, it was as if a route had been added over the IPMP group. I say "tragedy" because this was wholly unobvious, and thus understandably led to numerous support calls. Similar surprises came from the fact that a packet with a source IP address from one IP interface could be sent out through another IP interface. In short, the implementation had cobbled together various other abstractions to build something that acted mostly like an IPMP group IP interface, but wasn't actually one.

From this one central mistake came a raft of related problems that impacted both the programmatic and administrative models. For instance, in addition to having to teach technologies about IPMP groups, consider what happens when an IP interface fails. In concept, this should be a simple operation: the IP addresses that were mapped to the failed interface's hardware address need to be remapped to the hardware address of a functioning interface in the group. This remapping can occur entirely within IP itself -- applications using those IP addresses should not need to know or care. However, in the old IPMP implementation, this was actually a very disruptive operation: the IP addresses had to be visibly moved from the failed IP interface to a functioning IP interface, confusing applications that either interacted with the IP interface namespace or listened to routing sockets. Moreover, the application had to be specially coded to know that while the IP interface had failed, it should not react to the failure because another IP interface had taken over responsibility. Similar problems abounded in areas both far and near; an interesting recent example is the issue Steffen found with the new defrouter feature and Solaris 10 IPMP. That problem doesn't exist with Clearview IPMP not because we overpowered it with reams of code but simply because the Clearview IPMP design precludes it.
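To make the contrast concrete, here is a toy model of the "addresses live on the group interface" idea. It is a sketch for illustration only, not the kernel's data structures: the class name, its methods, and the policy of rebinding to the first usable interface are all assumptions. The point it shows is that a failure changes only the internal address-to-interface binding, while what an application can see stays the same.

```python
class IpmpGroupModel:
    """Toy model: addresses are hosted on the group; bindings are internal."""

    def __init__(self, group_ifname, underlying):
        self.group_ifname = group_ifname
        self.usable = list(underlying)   # functioning underlying interfaces
        self.binding = {}                # address -> underlying interface

    def add_address(self, address):
        # Simplest possible policy (an assumption): use the first usable one.
        self.binding[address] = self.usable[0]

    def visible_addresses(self):
        """What an application sees: every address is on the group interface."""
        return {addr: self.group_ifname for addr in self.binding}

    def fail(self, ifname):
        """Rebind the failed interface's addresses without moving them."""
        self.usable.remove(ifname)
        target = self.usable[0] if self.usable else None  # None: group failed
        for addr, bound in list(self.binding.items()):
            if bound == ifname:
                self.binding[addr] = target
```

In this model, failing an interface leaves visible_addresses() unchanged, which is the property the old implementation lacked: there, the addresses themselves had to move between IP interfaces where applications could observe it.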

Speaking of "reams of code", one of the aspects I'm most proud of with Clearview IPMP is the size of the codebase. In terms of raw numbers, the kernel implementation has shrunk by more than 35%, from roughly 8500 lines of code to 5500 lines (roughly 1000 lines of that are comments), and the lion's share of that code is isolated behind a simple kernel API of a few dozen functions (in contrast, the old IPMP codebase was sprawling and often written in-line). More importantly, the work needed to integrate the Clearview IPMP code with related technology was minimal: packet monitoring across the group required 15 lines of code; IP filter support required 5 lines of code; dynamic routing required no additional code. The new model also opened up unexpected opportunities, such as allowing the IPSQ framework (the core synchronization framework inside IP) to be massively simplified. Further, as a side effect of the new model, Clearview IPMP was able to fix many longstanding bugs -- some as old as IPMP itself -- such as 5015757, 6184000, 6359536, 6516992, 6591186, 6698480, 6752560, and 6787091 (among others).

Anyway, it's obvious that I'm a proud and biased parent. Whether my pride is justified will only become clear once Clearview IPMP has ten years of production use under its belt and an objective comparison is possible. However, I encourage you all to take it for a spin now and make your own assessment -- and of course feedback is welcome, either to me in private or on


Wednesday Apr 25, 2007

IPMP Development Update #2

IPMP Development Follow-up

Several folks have again (understandably) asked for updates on the Next-Generation IPMP work. Significant progress has been made since my last update. Notably:

  • Probe-based failure detection is operational (in addition to the earlier support for link-based failure detection).
  • DR support of interfaces using IPMP through RCM works. Thanks to the new architecture, the code is almost 1000 lines more compact than Solaris's current implementation -- and more robust.
  • Boot support is now complete. That is, any number of interfaces (including all of them) can be missing at boot and then transparently repaired during operation.
  • At long last, ipmpstat. As discussed in the high-level design document, this is a new utility that allows the IPMP subsystem to be compactly examined.

Since ipmpstat allows other aspects of the architecture to be succinctly examined, let's take a quick look at a simple two-interface group on my test system:

  # ipmpstat -g
  net57       a           ok        10000ms   ce1 ce0

As we can see, the "-g" (group) output mode tells us all the basics about the group: the group interface name and group name (these will usually be the same, but differ above for illustrative purposes), its current state ("ok", indicating that all of the interfaces are operational), the maximum time needed to detect a failure (10 seconds), and the interfaces that comprise the group.

We can get a more detailed look at the IPMP health and configuration of the interfaces under IPMP using the "-i" (interface) output mode:

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

Here, we can see that ce0 has probe-based failure detection disabled. We can also see issues that prevent an interface from being used (aka being "active") -- e.g., suppose we enable standby on ce0:

  # ifconfig ce0 standby

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         no      net57       si----  up        disabled  ok

We can see that ce0 is now no longer active, because it's an inactive standby (indicated by the "i" and "s" flags). This means that all of the addresses in the group must be restricted to ce1 (unless ce1 becomes unusable), which we can see via the "-a" (address) output mode ("-n" turns off address-to-hostname resolution):

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
                      net57       up      ce1         ce1
                      net57       up      ce1         ce1

For fun, we can offline ce1 and observe the failover to ce0:

  # if_mpadm -d ce1

  # ipmpstat -i
  ce1         no      net57       ----d-  disabled  disabled  offline
  ce0         yes     net57       s-----  up        disabled  ok
[ In addition to the "offline" state, the "d" flag also indicates that all of the addresses on ce1 are down, preventing it from receiving any traffic. ]
  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
                      net57       up      ce0         ce0
                      net57       up      ce0         ce0
We can also convert ce0 back to a "normal" interface, online ce1 and observe the load spreading configurations:
  # ifconfig ce0 -standby
  # if_mpadm -r ce1

  # ipmpstat -i
  ce1         yes     net57       ------  up        ok        ok
  ce0         yes     net57       ------  up        disabled  ok

  # ipmpstat -an
  ADDRESS             GROUP       STATE   INBOUND     OUTBOUND
                      net57       up      ce0         ce1 ce0
                      net57       up      ce1         ce1 ce0

In particular, this indicates that inbound traffic to the first address listed will go to ce0 and inbound traffic to the second will go to ce1 (as per the ARP mappings). However, outbound traffic will potentially flow over either interface (though to sidestep packet ordering issues, a given connection will remain latched to its interface unless that interface becomes unusable).

This also highlights another aspect of the new IPMP design: the kernel is responsible for spreading the IP addresses across the interfaces (rather than the administrator). The current algorithm simply attempts to keep the number of IP addresses "evenly" distributed over the set of interfaces, but more sophisticated policies (e.g., based on load measurements) could be added in the future.
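As an illustration of what "evenly distributed" can mean, the sketch below rebalances an address-to-interface map until per-interface counts differ by at most one. It is a user-level model written under assumptions, not the kernel's algorithm: the function name rebalance, the dictionary representation, and the tie-breaking order are all made up for the example.

```python
from collections import Counter

def rebalance(binding, interfaces):
    """Return a new address->interface map whose counts differ by at most 1."""
    if not interfaces:
        return {addr: None for addr in binding}
    new = dict(binding)
    load = Counter({ifname: 0 for ifname in interfaces})
    orphans = []
    for addr, ifname in binding.items():
        if ifname in load:
            load[ifname] += 1
        else:
            orphans.append(addr)  # bound to an interface that is not usable
    # First, place addresses whose interface is no longer usable.
    for addr in orphans:
        target = min(interfaces, key=lambda i: load[i])
        new[addr] = target
        load[target] += 1
    # Then shift addresses from the busiest interface to the idlest one.
    while True:
        busiest = max(interfaces, key=lambda i: load[i])
        idlest = min(interfaces, key=lambda i: load[i])
        if load[busiest] - load[idlest] <= 1:
            break
        addr = next(a for a, i in new.items() if i == busiest)
        new[addr] = idlest
        load[busiest] -= 1
        load[idlest] += 1
    return new
```

With two addresses bound to one interface and a second interface becoming usable, the model moves one address across; if an interface drops out of the list, its addresses land on whichever remaining interface has the fewest.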

To round out the ipmpstat feature set, one can also monitor the targets and probes used during probe-based failure detection:

  # ipmpstat -tn
  ce1         mcast
  ce0         disabled  --                  --
Above, we can see that ce1 is using "mcast" (multicast) mode to discover its probe targets, and we can see the targets it has decided to probe, in firing order. We can also look at the probes themselves, in real-time:
  # ipmpstat -pn
  TIME      INTERFACE   PROBE     TARGET              RTT       RTTAVG    RTTDEV
  1.15s     ce1         112                           1.09ms    1.14ms    0.11ms
  2.33s     ce1         113                           1.11ms    1.18ms    0.13ms
  3.94s     ce1         114                           1.07ms    2.10ms    2.00ms
  5.38s     ce1         115                           1.08ms    1.14ms    0.10ms
  6.19s     ce1         116                           1.43ms    1.20ms    0.19ms
  7.73s     ce1         117                           1.04ms    1.13ms    0.11ms
  9.47s     ce1         118                           1.04ms    1.16ms    0.13ms
  10.67s    ce1         119                           1.06ms    1.97ms    1.76ms
Above, the inflated RTT average and standard deviation on probes 114 and 119 indicate that something went wrong with the target of those probes in the not-too-distant past. (As an aside: "-p" also revealed a subtle longstanding bug in in.mpathd that was causing inflated jitter times for probe targets; see 6549950.)
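For those who would rather not eyeball the columns, here is a small sketch that flags probes with an inflated deviation in output shaped like the "-p" listing above. It is only an illustration: the function names and the 1.0ms threshold are arbitrary assumptions, and it relies solely on the probe ID being the third field and RTTDEV being the last.

```python
def parse_ms(field):
    """Convert a field such as '1.09ms' to a float number of milliseconds."""
    if not field.endswith("ms"):
        raise ValueError("expected a value in ms: %r" % field)
    return float(field[:-2])

def inflated_probes(lines, dev_limit_ms=1.0):
    """Return the probe IDs whose RTTDEV column exceeds dev_limit_ms."""
    flagged = []
    for line in lines:
        fields = line.split()
        # Skip blank or short lines and the column header.
        if len(fields) < 6 or fields[0] == "TIME":
            continue
        probe_id = fields[2]
        if parse_ms(fields[-1]) > dev_limit_ms:
            flagged.append(probe_id)
    return flagged
```

Fed the eight rows shown above, this picks out probes 114 and 119, the same two that stand out by inspection.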

Anyway, hopefully all this gives you not only a feel for ipmpstat, but a feel for how development is progressing. It should be noted that several key features are still missing, such as:

  • Broadcast and multicast support on IPMP interfaces.
  • IPv6 traffic on IPMP interfaces.
  • IP Filter support on IPMP interfaces.
  • MIB and kstat support on IPMP interfaces.
  • DHCP on IPMP interfaces.
  • Sun Cluster support.
All of these are currently being worked on. In the meantime, we will be making early-access BFU archives based on what we have so far available to those who are interested in kicking the tires. (And a big thanks to those customers who have already volunteered!)
