Thursday Jul 19, 2007

Observations on packaging

Over the past few months, a bunch of us have been exploring various options for packaging. (Actually, I suppose I've been pursuing this on a part-time basis for a year now, but it's only recently that it's made it to the top of the stack.) I've looked at a bunch of packaging systems, ported a few to OpenSolaris, run a bunch on their native platforms, and read a slew of manual pages, FAQs, blogs, and the like. In parallel with those efforts, Bart, Sanjay, and others have been analyzing the complexity around patching in the Solaris 10 patch stream, and improving the toolset to address the forms of risk that the larger features in the update releases presented.

In the course of those investigations, we've come up with a number of different approaches to understanding requirements and possibilities; I'll probably write those out more fully in a proper design document, but I thought it would be helpful to outline some of those constraints here. For instance, one way to look at how we might improve packaging is to separate the list of "design inputs" for a packaging system into "old" and "new".

When I make a list of this kind, I know it's bound to offend. (I already know it's more scattershot than the argument we'll present in a design document.) Feel free to send me the important inputs I've omitted, as I have a few follow-up posts on requirements—lower level, more specific intentions—and architectural thoughts where I can cover any issues not mentioned here.

"Old" inputs

The "old" inputs are those that are derived from current parts of the feature set around software packaging and installation. Many of these inputs will really be satisfied by the new installer work ("Caiman") Dave's leading, but the capabilities of the packaging system will change some from difficult and fragile to straightforward and robust. With effort, some might even achieve some elegance.

  1. Hands off. For a long time, via the JumpStart facility, it's been easy to provision systems via scripted installs. This technology is still sound, particularly when we consider that the Flash archives allow a mix of image-based provisioning with the customizations JumpStart offered.
  2. Virtualized systems. As long or longer, we've supported the notion of diskless systems, where the installed image is shared out, in portions, between multiple running environments. Zones was a direct evolutionary successor of the diskless environments, and shows that this approach can lead to some very economical deployment models. Lower-level virtualization methods can also benefit from the administrative precision that comes out of sharing known portions of the installation image.
  3. Availability and liveness. With Live Upgrade, it's been possible for a while now to update a second image on a running system while the first image is active—a command to switch images and a reboot is all that's required to be running the new software. This approach requires too much feature-specific knowledge at present, but provides a very safe approach to installing an upgraded kernel or larger software stack, as reverting to the previous version is just a matter of switching back to the previous image. So, a package installation option that doesn't ruin a working system is a must.
  4. Change control. In principle, with the current set of package and patch operations, it is possible to create a very specific configuration—that may never have been tested at any other site, mind you—from the issued stream of updates. From a service perspective, the variety of possible configuration combinations is too broad, but the ability to precisely know your configuration and make it a fixed point remains important.
  5. Migration and compatibility. There's a very large library of software packaged in the System V format that will run on most OpenSolaris-based systems. Providing well-defined compatibility and migration paths is an obvious constraint, given the goals of the development process.
  6. Networked behaviour. Believe it or not, the current installer and packaging support a large set of network-based operations—you can install specific packages and systems over the network, if you know where everything is. That is, the current components need to be assembled into some locally relevant architecture by each site to be useful—any replacement needs to make this assembly unnecessary, potentially via a number of mechanisms, but definitely via a list of known (trusted?) public package repositories.

"New" inputs

But, as you might expect, the efforts around Approachability, Modernization, and Indiana have brought to light some new qualities a packaging system must possess.

  1. Safety. One of the real complications from virtualized systems, at least in current packaging, is that a developer has to understand each of the image types his or her package might reach, and make sure that the set of pre- and post-install scripts, class action scripts, and the like are correct for each of these cases. When that doesn't happen, the package is at a minimum incorrectly installed; in certain real cases, this class of failures compromises other data on the system. Restrictions in this space, particularly during package operations on an inert image, seem like a promising trade-off to achieve greater safety.
  2. Developer burden. Current packaging requires the developer provide a description of the package, its dependencies, and contents across a minimum of three files, in addition to knowing a collection of rules and guidelines around files, directories, and package boundaries. Most of these conventions should be encoded and enforced by the package creation and installation tools themselves, and it should be possible to construct a package with only a single file—or from an existing package or archive.
  3. smf(5) aware. OpenSolaris-based systems have an understanding of services and service interdependencies, derived from the Service Management Facility's service descriptions. smf(5) service management during packaging operations is awkward under current packaging and very much needs to be improved but, more importantly, the service graph provides a rich set of relationships among larger components that should lead to a better understanding of the rules around consistent system assembly.
  4. Minimization. For a while, the current system has had some very coarse package "blocks", from which one could construct a system—the various metaclusters. These, with the exception of the minimally required metacluster and the entire metacluster, split the system along boundaries of interest only to workstation installs. Any suitable packaging system must provide query and image management tools to make image construction more flexible and less error-prone (and eliminate the need for things like a "developer" metacluster, for that matter).

    It's also pretty clear that the package boundaries aren't optimized in any fashion, as evidenced by the differing rates of change of the binaries they currently enclose—in the form of the issued patches against Solaris 10.

  5. Multiple streams of change, of a single kind. Although we noted the continued need to control change above, it's also important to be able to subscribe to a stream of change consistent with one's expectations around system stability. The package system needs to allow one to bind to one or more streams of change, and limit how the interfaces of the aggregate binaries from those streams evolve. That is, it should be possible to subscribe to a stream of only the most important updates, like security and data corruption fixes, or to the development stream, so that one's system changes in a way consistent with one's expectations.

    Conversely, the tradeoff between complexity and space optimization in current patches—which introduce a separate and difficult-to-calculate dependency graph and distinct namespace entries for each platform, among other issues—has slid much too far, given the increase in system richness and the increases in disk space and bandwidth. There seems to be little long-term benefit in preserving the current patch mechanism, particularly since Sun never offered it in a form useful outside of Sun's own software.

  6. ZFS aligned. ZFS offers the administrator access to so many sophisticated options around deployment and data management that it would be foolish for a packaging system to not explore how to take advantage of them—zfs snapshot and zfs clone are the most obvious capabilities that allow the packaging system to limit the risk of its modifications to an image, without excessive consumption of storage space or system time.
  7. Prevent bad stuff early. Another classic OpenSolaris-style requirement is that the set of available packages be known to be self-consistent and correct at all times, to spare the propagation of incomplete configurations. In a packaging system, this input expands on our intent to reduce the developer burden to assist the developer in writing as correct a package as possible, and to enable the repository operator to block incomplete packages from being made available to clients. There's a rich set of data here, as we noted for smf(5) service descriptions above.
  8. Friendly user deployment. Direct from Indiana, but sensible and apparent to all, is the observation that packaging systems have advanced to the point where a usable client is expected, and is no longer an innovation. I haven't got the complete timeline of packaging developments—the literature survey continues—but it's clear that Debian's apt system sets the current expectations for client-side capability and ease of use.

In the course of the Indiana discussions, Bryan raised one point, which I'll paraphrase as "it's not an idea until there's code". That's a sound position, which I also happen to hold—Danek and I (and Bart and Daniel, I hope) have been quietly prototyping for a little while now, to see if we had a handle on this collection of inputs. I'd like to give a bit more background, in the form of requirements and assertions, in an upcoming post or two. Then we're hoping to start a project on to take the idea all the way from notional fancy to functioning code.

Wednesday Jun 06, 2007

OpenSolaris: Five updates conservative developers should make

It's been almost two-and-a-half years since Solaris 10 was released, and if we look at Nevada (via Developer Edition or one of the other distributions), we can see that many of the technologies introduced in S10 are becoming still more capable. At this point, even the most conservative software developer can assume that certain features are always present. So, for the conservative OpenSolaris application developer, here are the five low-risk, high-reward updates you should make to your application:

  1. Provide x86 and SPARC versions. OpenSolaris has two primary instruction set targets, i386 and sparc. Each of these has both a 32- and a 64-bit variant. The metrics on Solaris 10 and SX:CE/DE downloads tell us that the Solaris volume is substantial on both targets so, for maximum uptake, you should attempt to offer software on both.

    On x86, you should consider delivering both 32- and 64-bit versions, if your application can take advantage of a 64-bit address space. But there is a large contingent of 32-bit only users, so don't stop delivering appropriate binaries prematurely.

    Of course, if you're writing at a hardware-independent level, like on a Java language platform, then you get x86 and SPARC (and presumably others) for free.

  2. Make packages that deliver into sparse zones. The primary software delivery mechanism is still System V packages—but your software's already packaged properly, so that's not an issue. (Right?) With Solaris 10, the Zones feature offers a sparse variant that requires package support. Roughly, this support means that the package author shouldn't deliver into /usr and should add the three properties needed to the pkginfo file.

    There are some fairly serious Zones deployers out there; Joyent is probably the most public example, but there are plenty of corporate datacentres using Zones to their advantage. If you want your software run by them or their customers, providing a Zones-compatible package seems like the easiest way to get it into their hands and onto their Zones.

  3. Replace your init.d scripts with smf(5) manifests. The Service Management Facility (smf(5)) provides a collection of capabilities that make service administration easier, while also reducing the development burden on application authors. Converting your service from the rc*.d script to a service description and methods means that administrators get automatic restart (and higher service availability), an easy on/off switch, and a place to make site-specific annotations (using the various template properties). There's a free competitive advantage here, if your service runs under smf(5) and a rival's doesn't.

    Of course, you can do more: placing key configuration values in the service repository means that various administrative utilities can be taught to make manipulating your application's feature set easy for the deploying administrator. But that won't happen without an initial service conversion.

    (Once you write a manifest for your service, you'll also probably want to write a rights profile, so that administrative authority for your service and its instances can be easily delegated.)

  4. Understand needed privileges. One of the more interesting features in Solaris 10 and later is the work Casper did to split out the absolute privilege owned by root into a specific collection of privileges. That means that you can take away a process's ability to fork or exec, change file ownership, or manipulate or utilize various subsystems of the operating system. If your application runs with the minimal set of privileges it needs to function, then the set of actions a hypothetical exploit against your application can invoke becomes limited, which reduces the impact of an intrusion. You can reduce your privileges via the smf(5) manifest you wrote for #3, via the role-based access control (RBAC) configuration, or via the privileges API.

  5. Don't unnecessarily ship duplicate components. The various OpenSolaris distributions include a lot of software; most of these offer one or more update mechanisms for the components they include. Whether or not you prefer minimal patches to wholesale package replacements, if you ship a duplicate component, it's your responsibility to update it if a defect or security hole is found. Sometimes you have to ship a component—the distros don't update it often enough—but private libraries (or private copies of the Java runtime) have a collection of costs, many of which are imposed on your customer.

For specific kinds of software, there's more to investigate. Language interpreters and byte-code virtual machines (and probably complex daemons) should have DTrace providers. Network device drivers should write to the latest version of the generic LAN device (gld) interface. Archival programs should be ZFS-compatible—there's going to be a lot of data on ZFS. Daemons should investigate using libumem for memory allocation (and event ports in place of poll(2) or select(3C)). And so on.

There are OpenSolaris communities for each of these topics but, if you're having trouble getting started, I would suggest an email to opensolaris-code, that reads something like: "I have a [daemon/driver/app] that does [practical purpose/amazing feat/curious entertainment]. Are there any OpenSolaris features I can use to make it better?"

Looking forward to your mail.

Thanks to Dave for #5. Dave also confesses to being keen on #3.

Tuesday Jun 05, 2007

SFW: Integrating coreutils and which variants

Last night, I finished up another task, in an attempt to reduce my current multitasking factor: I integrated initial versions of coreutils and which from the GNU Project into the Freeware consolidation. As a lower priority task, it took longer than a more dedicated developer might have managed, but it's reasonably pleasing to look back:

There's still a bunch of process associated with SFW that requires redesign—legal review and Section 508 compliance, in particular—but I think, barring the latent intervals, this sequence was a reasonable consensus-driven open development experience.

If you look at other mailing lists during June – November and February – April, you'll be able to verify that I was indeed working—just on other things, and not just surfing the Web...

Of course, now that I know that these commands will start to show up more widely when Build 67 is released, I can update my dotfiles, so I get the versions I prefer:

$ svn diff
Index: sh-functions
--- sh-functions        (revision 91)
+++ sh-functions        (working copy)
@@ -36,6 +36,10 @@
                        PATH=$HOME/bin:$HOME/bin/$(/usr/bin/uname -p):$PATH
+               gnu)    # PREPEND: Bundled GNU command variants
+                       PATH=/usr/gnu/bin:$PATH
+                       MANPATH=/usr/gnu/share/man:$MANPATH
+                       ;;
Index: bashrc
--- bashrc      (revision 91)
+++ bashrc      (working copy)
@@ -49,8 +49,12 @@
        path clear home sfw csw

-if hash gls > /dev/null 2>&1; then
-       alias ls="gls --color -CF"
+if [ -x /usr/gnu/bin/ls ]; then
+       alias ls="/usr/gnu/bin/ls --color -CF"

+if [ -x /usr/gnu/bin/which ]; then
+       alias which="/usr/gnu/bin/which"

If you're using a distribution that offers SUNWgnu-coreutils, SUNWgnu-which, and the other /usr/gnu packages, do share your feedback with the maintainers on sfwnv-discuss—or become one and pick your favourite package.

Tuesday Mar 28, 2006

Bespoke services: network/rmi/registry

Gary and I were recently prototyping an application that uses Java RMI, and so I ended up searching around to see if anyone has done a service conversion for rmiregistry(1). (rmiregistry(1) is the daemon that lets RMI clients find the available remote objects being served by various virtual machines on a given system.) Turns out no one has (or no one's published it), which means it's time to rev up the convert-o-tron.

Since we're still developing our application and it's likely we'll change a definition or two, and since we need to restart the registry for changed remote objects to be picked up, we're going to make our prototype service restart automatically if we restart the registry. That means our prototype service has a dependency on network/rmi/registry with specific restart_on behaviour, meaning that its service description has a fragment like the following:

    <!--
    As an RMI server application, we expect to be able to
    register our RMI classes with the registry server.
    -->
        <service_fmri value='svc:/network/rmi/registry' />

Inject that fragment into your various RMI servers' descriptions (or the equivalent property group into the repository) and you'll save a bit of time on application reinitializations.

So, if you're interested, please feel free to take a copy of network/rmi/registry; comments and corrections welcome.

Wednesday Dec 14, 2005

to Niagara/Solaris 10

James Dickens spied a new BluePrint on the planned redeployment of [PDF] to SunFire T2000 systems running Solaris 10. Beyond the tremendous reduction in occupied space and the estimated reduction of around 90 percent in input power and output heat, the document describes the use of Solaris Containers to consolidate the middle tier servers, complete with invocations of zonecfg(1M) (for the zones the applications run inside of) and poolcfg(1M) (for the resource pools the zones sit upon).

Business applications are complex—maybe some smf(5) service conversions will appear in the next version to make the dependency and failure handling more precise.

Wednesday Dec 07, 2005

LISA05 Tuesday: device errors, iostat, and logging

One of the questions raised at Tuesday night's BoF was "why do some of the statistics that iostat -E displays result in a console message while others do not?" I was sitting in the back with a copy of Mike Kupfer's split ON source tree, and decided to have a look. iostat(1M) is a kstat reader, with some simple processing and formatted output. The output function is show_disk_errors(), but following it requires understanding how iostat groups the disks and statistics in its implementation; the code that acquires the device statistics is located in cmd/stat/common/acquire_iodevs.c. Searching for "error" shows that the critical function is acquire_iodev_errors(), and that two classes of kstats contribute to the error output: device_error and iopath_error.

The most direct way to see these statistics is to invoke kstat(1M) with these classes. On my laptop, the result for device_error is

$ kstat -p -c device_error
sderr:0:sd0,err:Device Not Ready        0
sderr:0:sd0,err:Hard Errors     0
sderr:0:sd0,err:Illegal Request 1
sderr:0:sd0,err:Media Error     0
sderr:0:sd0,err:No Device       0
sderr:0:sd0,err:Predictive Failure Analysis     0
sderr:0:sd0,err:Product UJ-832D         Revision
sderr:0:sd0,err:Recoverable     0
sderr:0:sd0,err:Revision        1.50
sderr:0:sd0,err:Serial No       
sderr:0:sd0,err:Size    0
sderr:0:sd0,err:Soft Errors     1
sderr:0:sd0,err:Transport Errors        0
sderr:0:sd0,err:Vendor  MATSHITA
sderr:0:sd0,err:class   device_error
sderr:0:sd0,err:crtime  76.139658104
sderr:0:sd0,err:snaptime        1857.960128997

(The laptop has no kstats of class iopath_error.)
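
If you want to post-process these statistics rather than read them by eye, the parseable form splits mechanically. Here's a minimal sketch in C, assuming each line has the shape module:instance:name:statistic, then a tab, then the value (the tabs show up as runs of spaces in the listing above). The kstat_line_t type and kstat_line_parse() are names invented for this example; they are not part of libkstat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of `kstat -p` output. */
typedef struct kstat_line {
	char	kl_module[32];
	int	kl_instance;
	char	kl_name[32];
	char	kl_stat[64];	/* statistic names may contain spaces */
	char	kl_value[64];	/* kept as text; may be empty */
} kstat_line_t;

/*
 * Parse "module:instance:name:statistic<TAB>value" (assumed format).
 * Returns 0 on success and -1 if the four leading fields are not all
 * present.  A missing value is treated as an empty string.
 */
int
kstat_line_parse(const char *line, kstat_line_t *out)
{
	int n;

	assert(line != NULL && out != NULL);
	(void) memset(out, 0, sizeof (*out));

	n = sscanf(line, "%31[^:]:%d:%31[^:]:%63[^\t]\t%63[^\n]",
	    out->kl_module, &out->kl_instance, out->kl_name,
	    out->kl_stat, out->kl_value);

	return (n >= 4 ? 0 : -1);
}
```

Feeding it a line such as the "Illegal Request" entry above should yield the module sderr, instance 0, name sd0,err, and the value as text, which leaves the caller free to decide which statistics are counters and which are strings like the vendor name.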

We then look for the creation of the named kstat for each of these strings—invocations of kstat_create() to identify the structure member names associated with each. And then we can look for statements that involve those member names; this leads us to the various SD_UPDATE_ERRSTATS() invocations throughout uts/common/io/scsi/targets/sd.c.

The discrepancy between updates and logging arises because the macro, SD_UPDATE_ERRSTATS(), which bumps the counters and the function which displays the error, sd_print_sense_msg(), are sometimes both invoked, and sometimes are not. I don't know the details of the SCSI error categories, but the decision to make some of these errors silent and some not appears arbitrary. So, unfortunately, the only answer today to determine "why is this messaged" is to look at the code. (Perhaps an 'sd' expert can offer an enlightening comment.)

If you're trying to anticipate and avoid potential failures, having random messages emitted arbitrarily isn't very helpful: that's why Mike and the FMA team developed the fault management architecture to have a framework in which errors are processed in a predictable fashion, resulting in the proper diagnosis of faults. Eric described one possible scenario involving disk errors, FMA, and ZFS a couple of weeks ago, but there's a smaller step that seems useful involving only sd and FMA: the error increments could be converted to error events, and the decision to issue a notice deferred until a series of errors is diagnosed into a fault except, I suppose, if the error can be immediately diagnosed as a critical fault. Taking this step would result in consistent reporting for all disk consumers, including less sophisticated consumers than ZFS, like older filesystems and raw disk accessors.
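
To make the deferred-decision idea concrete, here is a small sketch of one possible diagnosis policy: record each error event, and declare a fault only when N of them arrive within T seconds. This is an illustration under stated assumptions, not fmd's interface and not anything sd does today; err_window_t and its functions are hypothetical names, and a real diagnosis would weigh far more than arrival times.

```c
#include <assert.h>
#include <stddef.h>

#define	ERR_WINDOW_MAX	16	/* arbitrary cap for this sketch */

/* Hypothetical "N events within T seconds" detector. */
typedef struct err_window {
	unsigned	ew_n;		/* events needed to declare a fault */
	long		ew_t;		/* window length, in seconds */
	long		ew_times[ERR_WINDOW_MAX]; /* ring of recent event times */
	unsigned	ew_count;	/* events stored, saturating at ew_n */
	unsigned	ew_head;	/* next slot to overwrite */
} err_window_t;

/* Returns 0 on success, -1 if the parameters are out of range. */
int
err_window_init(err_window_t *ew, unsigned n, long t)
{
	assert(ew != NULL);
	if (n == 0 || n > ERR_WINDOW_MAX || t <= 0)
		return (-1);
	ew->ew_n = n;
	ew->ew_t = t;
	ew->ew_count = 0;
	ew->ew_head = 0;
	return (0);
}

/*
 * Record one error event at time `now` (seconds, non-decreasing).
 * Returns 1 if the most recent ew_n events all fall within ew_t
 * seconds of each other, meaning the policy would diagnose a fault;
 * otherwise returns 0 and nothing is reported.
 */
int
err_window_record(err_window_t *ew, long now)
{
	long oldest;

	assert(ew != NULL);
	ew->ew_times[ew->ew_head] = now;
	ew->ew_head = (ew->ew_head + 1) % ew->ew_n;
	if (ew->ew_count < ew->ew_n)
		ew->ew_count++;
	if (ew->ew_count < ew->ew_n)
		return (0);

	/* The ring is full, so ew_head now indexes the oldest event. */
	oldest = ew->ew_times[ew->ew_head];
	return (now - oldest <= ew->ew_t);
}
```

The point of the shape is that the code path that observes the error only records it; whether anyone is told is decided in one place, by a policy that can be tuned later without touching the driver.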

In a software engineering sense, the FMA approach, where error issuance and diagnosis are separated, is much more sound: at the initial driver software composition, the field experience with the hardware device is typically limited. Over time, the actual impact of errors on system practice becomes better known, and the diagnosis and the actions associated with it can be refined. fmd(1M) can handle on-the-fly module updates gracefully, and also deal with overlapping event flows so that both primitive and ZFS-specific fault handling policies can be implemented, depending on the use of a particular device.

Now, the community that's discussing technical issues and directions for fault management is aptly called the Fault Management community; if you are interested in how this work is going to proceed, and ways to contribute, I suggest joining it.

Monday Oct 10, 2005

smf(5): Stepping through an rc.d conversion

Over on, I see Bob Netherton has posted a nice tutorial from Solaris Boot Camp, entitled "Migrating a legacy RC service". The presentation covers the hiccoughs you might run into during your first conversion, and its step-by-step approach is very soothing.

Tuesday Jun 14, 2005

libuutil and designing for debuggability

Going into Solaris 10, I knew we were planning to develop a troupe of new daemons; we ultimately ended up with svc.startd(1M), svc.configd(1M), and a new implementation of inetd(1M). I wanted to make sure we made some progress on daemon implementation practice, and bounced some ideas around with the afternoon coffee group and also with Mike, and probably some others—I wander around a bit.

We anticipated that most of the daemons would be multithreaded, and it became apparent that they would all present large, complicated images for postmortem debugging [1]. To reduce the time to acquire familiarity with each of these daemons, we worked out three common requirements:

  • include Compact C Type Format (CTF) information with each daemon,
  • use libumem(3LIB) for memory allocation, and
  • use standard, debuggable, MT-safe implementations of data structures.

The problem was, of course, that there wasn't a library with such data structures in Solaris at the time [2, 3]. So we began to design libuutil, which combines a number of established utility functions used in authoring Solaris commands with these new "good" implementations of useful data structures.

The library in question was named in sympathy with libumem(3LIB)—libuutil for "userland utility functions". libuutil provides both a doubly linked list implementation and an AVL tree implementation. The list implementation is mostly located in lib/libuutil/common/uu_list.c; we'll use that to explore the debugging assistance we designed in.

The model used is that each program is likely to have multiple lists of common structures, and that there would be multiple such structures. This led us to create an interface that is expressed in terms of pools of lists. So, for each structure, you create a list pool using uu_list_pool_create(). Then, for each list of that structure, you create a list in the respective pool using uu_list_create().

That sounds complicated, but it's for a good reason: at each call to uu_list_pool_create(), we register the newly created pool on a global list, headed by the "null pool", uu_null_lpool:

uu_list_pool_t *
uu_list_pool_create(const char *name, size_t objsize,
    size_t nodeoffset, uu_compare_fn_t *compare_func, uint32_t flags)
{
	uu_list_pool_t *pp, *next, *prev;

	/* validate name, allocate storage, initialize members */

	(void) pthread_mutex_init(&pp->ulp_lock, NULL);

	pp->ulp_null_list.ul_next = &pp->ulp_null_list;
	pp->ulp_null_list.ul_prev = &pp->ulp_null_list;

	(void) pthread_mutex_lock(&uu_lpool_list_lock);
	pp->ulp_next = next = &uu_null_lpool;
	pp->ulp_prev = prev = next->ulp_prev;
	next->ulp_prev = pp;
	prev->ulp_next = pp;
	(void) pthread_mutex_unlock(&uu_lpool_list_lock);

	return (pp);
}

with similar code being used to connect each list to its pool on calls to uu_list_create().
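
The registration pattern is easy to try in isolation. The following is a minimal, self-contained sketch, not the libuutil source: pool_t, null_pool, pool_create(), pool_destroy(), and pool_count() are simplified stand-ins invented for illustration. It shows the same idea, a circular doubly linked list headed by a sentinel at a well-known symbol, so that everything registered can be found from one starting address.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, much-reduced stand-in for uu_list_pool_t. */
typedef struct pool {
	struct pool	*p_next;
	struct pool	*p_prev;
	const char	*p_name;
} pool_t;

/* The sentinel ("null pool"): an empty registry points back at itself. */
static pool_t null_pool = { &null_pool, &null_pool, "(null pool)" };
static pthread_mutex_t pool_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate a pool and link it in at the tail, just before the sentinel. */
pool_t *
pool_create(const char *name)
{
	pool_t *pp, *next, *prev;

	assert(name != NULL);
	if ((pp = calloc(1, sizeof (*pp))) == NULL)
		return (NULL);
	pp->p_name = name;

	(void) pthread_mutex_lock(&pool_list_lock);
	pp->p_next = next = &null_pool;
	pp->p_prev = prev = next->p_prev;
	next->p_prev = pp;
	prev->p_next = pp;
	(void) pthread_mutex_unlock(&pool_list_lock);

	return (pp);
}

/* Unlink a pool from the registry and free it. */
void
pool_destroy(pool_t *pp)
{
	assert(pp != NULL && pp != &null_pool);

	(void) pthread_mutex_lock(&pool_list_lock);
	pp->p_next->p_prev = pp->p_prev;
	pp->p_prev->p_next = pp->p_next;
	(void) pthread_mutex_unlock(&pool_list_lock);

	free(pp);
}

/* Count registered pools by walking from the sentinel back to itself. */
size_t
pool_count(void)
{
	size_t n = 0;
	pool_t *pp;

	(void) pthread_mutex_lock(&pool_list_lock);
	for (pp = null_pool.p_next; pp != &null_pool; pp = pp->p_next)
		n++;
	(void) pthread_mutex_unlock(&pool_list_lock);

	return (n);
}
```

Because the sentinel is never removed, insertion and removal need no special cases for an empty list or for the first and last elements, and a reader that knows only the sentinel's address can enumerate every pool.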

So now we have an address space where each list pool is linked in a list, and each list in a pool is linked to a list headed at that pool. This leads us to the second part, which is to use the encoded information in a debugger. The typical debugger for kernel work in Solaris is mdb(1), the modular debugger. It's been shipping with Solaris since 5.8, and has a rich set of extensions for kernel debugging. For userland, the modules are rarer: libumem is probably the best known [4].

The source code for the libuutil module (or "dmod") is located at cmd/mdb/common/modules/libuutil/libuutil.c; the function that provides the dcmd itself, uutil_listpool, is just a wrapper around the walker for uu_list_pool_t structures. The pertinent portion is the initialization function, uutil_listpool_walk_init() [5]:

int
uutil_listpool_walk_init(mdb_walk_state_t *wsp)
{
        uu_list_pool_t null_lpool;
        uutil_listpool_walk_t *ulpw;
        GElf_Sym sym;

        bzero(&null_lpool, sizeof (uu_list_pool_t));

        if (mdb_lookup_by_obj("", "uu_null_lpool", &sym) ==
            -1) {
                mdb_warn("failed to find 'uu_null_lpool'\n");
                return (WALK_ERR);
        }

        if (mdb_vread(&null_lpool, sym.st_size, (uintptr_t)sym.st_value) ==
            -1) {
                mdb_warn("failed to read data from 'uu_null_lpool' address\n");
                return (WALK_ERR);
        }

        ulpw = mdb_alloc(sizeof (uutil_listpool_walk_t), UM_SLEEP);

        ulpw->ulpw_final = (uintptr_t)null_lpool.ulp_prev;
        ulpw->ulpw_current = (uintptr_t)null_lpool.ulp_next;
        wsp->walk_data = ulpw;

        return (WALK_NEXT);
}

which safely pulls out the value of the uu_null_lpool head element, and the relevant pieces we'll need to walk the list.
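
The shape of such a walker can be sketched outside of mdb. In the following illustration, target_read() is a hypothetical stand-in for mdb_vread() that simply copies from the current process, and the step function is a guess at the natural counterpart to the initialization above, not a copy of the module's code. All of the type and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical node in a sentinel-headed circular list. */
typedef struct node {
	struct node	*n_next;
	struct node	*n_prev;
	const char	*n_name;
} node_t;

/* Walker state: addresses in the "target", never dereferenced directly. */
typedef struct walk {
	uintptr_t	w_current;	/* next node to visit */
	uintptr_t	w_final;	/* last real node on the list */
	int		w_done;
} walk_t;

enum { SK_WALK_NEXT, SK_WALK_DONE };

typedef void (*visit_fn_t)(uintptr_t addr, const node_t *copy, void *arg);

/* Stand-in for mdb_vread(): copy one node out of the target. */
static void
target_read(node_t *dst, uintptr_t addr)
{
	(void) memcpy(dst, (const void *)addr, sizeof (*dst));
}

/* Read the sentinel once and remember where to start and stop. */
void
walk_init(walk_t *wp, uintptr_t head_addr)
{
	node_t head;

	target_read(&head, head_addr);
	wp->w_current = (uintptr_t)head.n_next;
	wp->w_final = (uintptr_t)head.n_prev;
	wp->w_done = (wp->w_current == head_addr);	/* empty list */
}

/* Visit one node from a private copy, then advance or finish. */
int
walk_step(walk_t *wp, visit_fn_t visit, void *arg)
{
	node_t copy;
	uintptr_t addr = wp->w_current;

	if (wp->w_done)
		return (SK_WALK_DONE);

	target_read(&copy, addr);
	visit(addr, &copy, arg);

	if (addr == wp->w_final) {
		wp->w_done = 1;
		return (SK_WALK_DONE);
	}
	wp->w_current = (uintptr_t)copy.n_next;
	return (SK_WALK_NEXT);
}

/* Example visitor: counts nodes through the opaque argument. */
void
count_visit(uintptr_t addr, const node_t *copy, void *arg)
{
	(void) addr;
	(void) copy;
	(*(size_t *)arg)++;
}

/* Drive the walker to completion and report how many nodes it saw. */
size_t
walk_count(uintptr_t head_addr)
{
	walk_t w;
	size_t n = 0;

	walk_init(&w, head_addr);
	while (walk_step(&w, count_visit, &n) == SK_WALK_NEXT)
		continue;
	return (n);
}
```

Working from copies matters in a debugger, where the target may be a core file or a stopped process whose pointers are not valid in the debugger's own address space; the walker only ever follows addresses it has read out of the target.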

This means that, for any program linked with libuutil, we can attach with mdb(1) and display its list pools:

# mdb -p `pgrep -z global startd`
Loading modules: [ svc.startd ]
> ::uu_list_pool
ADDR     NAME                            COMPARE FLAGS
080dcf08 wait_info                      00000000     D
080dce08 SUNW,libscf_datael             00000000     D
080dcd08 SUNW,libscf_iter               00000000     D
080dcc08 SUNW,libscf_transaction_entity c2b0476c     D
080dc808 dict                           0805749c     D
080dc908 timeouts                       0806ffab     D
080dca08 restarter_protocol_events      00000000     D
080dcb08 restarter_instances            0806ccd7     D
080dc708 restarter_instance_queue       00000000     D
080dc608 contract_list                  00000000     D
080dc508 graph_protocol_events          00000000     D
080dc408 graph_edges                    00000000     D
080dc308 graph_vertices                 08059844     D

and then drill down into constituent lists of interest.

Additional walkers are also provided, such that the lists and list nodes can be visited from the command line or programmatically. As an example, the ::vertex dcmd from the svc.startd module uses the walkers to display the various service graph nodes in a quasi-readable format [5].

So, by providing extra structured information in the library and support to consume that information in the debugger, we end up with a set of data structures that, if used, leads to more debuggable programs. More work up front for less later: welcome to OpenSolaris.


1. By postmortem debugging, I'm referring to the operation of debugging a failed application after its failure, from a core file or other memory image captured as soon after that failure as possible. Suitability for postmortem debugging is a standard expectation for software design in Solaris, as it reduces the time to diagnose and fix software failures. In particular, multiple engineers can debug a core file in parallel; this can be contrasted with the cost of setting up a duplicate installation and trying to reproduce the failure, let alone expecting the customer to risk further downtime experimenting with "try this" scenarios.

2. Please remember that we were making these decisions three years ago, and that this choice had to fit the then-applicable constraints on the product.

3. In contrast, the kernel has had a generic, modular hash table since 5.8/2000 (uts/common/os/modhash.c), a generic AVL tree since 5.9/2002 (common/avl/avl.c), and a generic list implementation early in 5.10/2005 (uts/common/os/list.c). Of course, the kernel has used the slab allocator (uts/common/os/kmem.c) since 5.4/1994.

4. A quick listing in /usr/lib/mdb/proc/ will display the other modules valid in the process target: beyond libumem and libuutil, there's support for the linker, libc, name-value pairs, system event, and the two main smf(5) daemons.

5. As an example, here's the output of ::vertex on my current system, for those services related to my VNC server (and the service itself):

> ::vertex ! grep vnc
0x85d3380  212 I 1 svc:/application/vncserver:sch
0x85d3320  213 s - svc:/application/vncserver
0x85d3200  214 R - svc:/application/vncserver:sch>milestone
0x85d3260  215 R - svc:/application/vncserver:sch>autofs
0x85d32c0  216 R - svc:/application/vncserver:sch>nis


Wednesday Jun 01, 2005

Bespoke services: application/vncserver

In honour of the "Mugs for Manifests" contest, I thought I would spin out another custom service description I wrote some months ago.

My setup for working from home (key during the last six months of Solaris 10) is to tunnel into Sun's network via one implementation or another of a virtual private network (VPN). In all cases, the VPN solution runs on Solaris. Although the VPN lets your system participate more or less like a regular host, I find it's easier to use VNC to remotely present an X11 display from my main workstation, muskoka. But, of course, machines running pre-production bits can fail or be rebooted or be reinstalled regularly, so I wanted the VNC server on my system to always be up: I wanted a VNC service.

What's distinct about running the VNC server is that it should run as me, with my environment, and not as root with init(1M)'s. svc.startd(1M), while it can run methods according to smf_method(5), doesn't populate the environment fully in the sense of login(1). So we will need to extract some data from the name service, which is cumbersome to perform in a shell script. We'll write our method in Perl, which implies

Tip 1: Methods need not be shell scripts.

In fact, the start method and the stop method can be totally separate commands: you could write one in Python, and one can be an executable Java .jar archive, or some even more bizarre combination.

The other trick is that, if VNC fails for some reason, I want to be aggressive about cleaning up its various leftover temporary files. For this purpose, I run the stop method with a different credential—the default of root—than the start method, which is done in our brief manifest by locating the <method_context> element on only the start method.

Tip 2: Methods need not be run with identical method contexts. Credentials, privileges, and the like may all differ from method to method.

Our manifest then looks like:

<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='application/vncserver' type='service' version='0'>
    <instance name='sch' enabled='true'>
      <dependency name='milestone' grouping='require_all' restart_on='none' type='service'>
        <service_fmri value='svc:/milestone/multi-user:default'/>
      </dependency>
      <dependency name='autofs' grouping='require_all' restart_on='none' type='service'>
        <service_fmri value='svc:/system/filesystem/autofs:default'/>
      </dependency>
      <dependency name='nis' grouping='require_all' restart_on='none' type='service'>
        <service_fmri value='svc:/network/nis/client:default'/>
      </dependency>
      <exec_method name='stop' type='method' exec='/home/sch/bin/vncserver_method stop' timeout_seconds='60'/>
      <exec_method name='start' type='method' exec='/home/sch/bin/vncserver_method start' timeout_seconds='300'>
        <method_context>
          <method_credential user='sch' group='staff' />
        </method_context>
      </exec_method>
    </instance>
  </service>
</service_bundle>

The dependencies above are needed if you use NFS for home directories and NIS for name services; they could be reduced for less networked setups.

And, for the method, we have a short Perl program. The complete list of environment variables in login(1) would include LOGNAME, PATH, MAIL, and TZ (timezone), and exclude my silly setting of LANG, but most of these will be set up by the shell that the VNC startup script (its analogue to .xinitrc) invokes. The various print calls are just to let the service log show a little activity, and could be removed.


#!/usr/bin/perl

require 5.8.3;
use strict;
use warnings;

use locale;

# Look up the passwd entry for the real user ID, since svc.startd(1M)
# does not populate the environment the way login(1) would.
my ($name, $passwd, $uid, $gid,  $quota, $comment, $gcos, $dir, $shell,
    $expire) = getpwuid "$<";

$ENV{USER} = $name;
$ENV{HOME} = $dir;
$ENV{SHELL} = $shell;
$ENV{LANG} = "en_CA";           # Just to create havoc (i.e. expose bugs).

# The stop method is run as root so that it can cleanup.
if (defined($ARGV[0]) && $ARGV[0] eq "stop") {
        # ksh and sh specific
        print "stop method\n";
        system("$ENV{SHELL}", "-c", "/opt/csw/bin/vncserver -kill :1");

        if (-S "/tmp/.X11-unix/X1") {
                # Assumed cleanup step: remove the leftover socket.
                unlink("/tmp/.X11-unix/X1");
        }

        exit 0;
}

# The start method is run with the user's identity.
print "start method\n";

if (-f "/tmp/.X1-lock") {
        # Assumed cleanup step: remove a stale lock file.
        unlink("/tmp/.X1-lock");
}

if (-S "/tmp/.X11-unix/X1") {
        system("logger -p 1 application/vncserver requires " .
            "/tmp/.X11-unix/X1 be removed");
        exit 0;
}

# ksh and sh specific
{ exec "$ENV{SHELL}", "-c",
    "/opt/csw/bin/vncserver -pn -geometry 1600x1200 -depth 24 :1" };
# Only reached if exec failed.  The list form of system() bypasses the
# shell, so the apostrophe in the message cannot break quoting.
system("logger", "-p", "1",
    "application/vncserver can't exec /opt/csw/bin/vncserver");
exit 1;

And now we have an always-on VNC service for the regular telecommuter:

$ svcs -p vncserver
STATE          STIME    FMRI
online         13:01:01 svc:/application/vncserver:sch
	       13:01:00   100577 Xvnc
	       13:01:17   100625 xwrits
	       13:01:17   100626 ctrun
	       13:01:17   100632 xautolock
	       13:11:18   102348 xlock
$ uptime
 12:00pm  up 23 hr(s),  4 users,  load average: 0.04, 0.07, 0.07


Possible improvements:

  1. Remove the hard-coded display numbering (":1", "X1", etc.).
  2. Make the resolution, display depth, RGB encoding, and other standard options into properties.
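
As a starting point for removing the hard-coded display numbering, the lock file and socket paths used in the method above follow directly from the display number. A minimal sketch in plain sh; the function name x_display_paths is hypothetical, and the path layout is simply the one the method already uses for display :1:

```shell
# Print the X lock file and socket paths for a display such as ":1".
# x_display_paths is a hypothetical helper, not part of VNC or smf(5).
x_display_paths() {
    display=${1##*:}          # keep what follows the last colon (":1" -> "1")
    display=${display%%.*}    # drop any screen suffix ("1.0" -> "1")
    printf '%s\n' "/tmp/.X${display}-lock" "/tmp/.X11-unix/X${display}"
}

x_display_paths ":1"
```

The method could then read the display from a service property instead of embedding ":1" and "X1" in several places.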


Friday May 27, 2005

smf(5) not-quite-free stuff

I'm in the middle of some longish, and one rather preachy, blog posts. These will need editing, so to pep things up...

Like one of these?

We had a bunch of custom mugs made up, to commemorate the completion of smf(5)'s integration into Solaris 10. If you've been at a customer or community presentation on S10 or smf(5), you might have received one: for asking a good question, for answering one from me, or for physical attendance. But these mugs—fine, solid, large capacity, high quality mugs for coffee, tea, or even pens—are heavy: too heavy for us to lug a box to all the conferences we might attend.

So instead we're going to run a little contest.

Liane summarized our understanding of other service conversions circulating a few months ago. I'd like to get another batch done, and there's no incentive like a ceramic container incentive, so I'm going to suggest a few categories:

  1. Historical: Convert one (or more) of the unconverted services in /etc/rc*.d in Solaris 10.
  2. Free/Open: Convert a F/OSS daemon to be an smf(5) service.
  3. Commercial: Convert a commercial software package to be one or more smf(5) services.
  4. Artistic/Offbeat: Convert something unexpected into a particularly elegant service.

The conditions are pretty simple: there are 36 mugs in the box, so the first round can have 36 winners. One mug for each converted service; the winning entry for a specific service will be judged by completeness (dependencies in particular), correctness (methods), utility (will anyone else use this?), and date received. I'll give some no-prize honorable mentions in each category as well. This round will be quick: entries must be received by June 15th.

An entry should disclose:

  • Your name,
  • preferred email,
  • blog URL (optional),
  • mailing address,
  • description of the software (plus details if obscure) and
  • the service manifest and method(s) (if any), or
  • an accessible URL to same.
Send it to sch AT I'll assemble a few smf(5) keeners to help me evaluate the submissions.

Services on the list Liane gave are not eligible, unless you think your conversion is substantially better by the criteria above.

If your conversion wins, I'll send you your mug via an amazing cooperative, potentially international, mechanism composed of government-granted-monopoly package delivery agencies. Winners, and their entries (or pointers) will be posted here.


Wednesday Apr 27, 2005

Banging on multiple heads

I spent a bit of time each of the past few weeks trying out different graphics cards to drive two displays—a multihead configuration. Presently, I'm using an older ATI Radeon 7000-based card to run two displays at 1600 × 1200 each. The radeon driver included with the X server can knit these together into a single display with an effective resolution of 3200 × 1200.

Other folks are using nVidia drivers to get similar configurations.

Once I had this setup running, I started to see familiar applications fail with a pleasant message:

$ gvim
The program 'gvim' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadWindow (invalid Window parameter)'.
  (Details: serial 13 error_code 3 request_code 128 minor_code 2)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)

Dave Powell helped me to use xscope to watch the X11 protocol requests. This, and a little code browsing, allowed a diagnosis.

It turns out that Xsun and Xorg use different implementations of the Xinerama extension (but with similar names). As far as I can tell, the standard behaviour changed after Sun developed support for a draft proposal. Now, a few years later, the applications know how to deal with both versions. With Xorg though, you now have a Sun system which doesn't speak Sun's Xinerama variant--hence our error message. Alan knows how to fix this for real but, after looking at libgdk startup, it's pretty easy to work around this with a preloaded shared object.

All we do is pretend that our display can't do Sun's Xinerama. That means we need a function like

$ cat > xin_shim.c
        return (0);

We then compile it into a shared object

$ cc -o -G -Kpic xin_shim.c

or we could freely compile it into a shared object

$ which gcc
$ gcc -o -G -fpic xin_shim.c
Then, to use your shim, you use LD_PRELOAD (with an absolute path)

$ LD_PRELOAD=`pwd`/ gvim
[happy editing...]

Because the shim isn't in /usr/lib/secure, you may see messages from setuid processes as the linker refuses to preload your potentially unsafe object. The message looks something like

/usr/lib/utmp_update: warning: /home/sch/src/preloads/ open failed: illegal insecure pathname

Tuesday Mar 29, 2005

Bespoke services: application/catman

For various reasons—some reasonable, some suspect—Solaris doesn't ship with a compiled set of windex databases for its manual pages. The unfortunate result is that helpful commands like apropos(1) or man -k are unhelpful:

$ apropos sort
/usr/man/windex: No such file or directory

smf(5) provides one way to address this shortcoming, via a transient service to be run during startup. Our service description would be roughly equivalent to the following:

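<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle type='manifest' name='sch:catman'>

<!--
  Assumed values: the dependency names, their grouping and restart_on
  settings, the service version, and the stop method's exec string.
-->
<service
        name='application/catman'
        type='service'
        version='1'>

        <create_default_instance enabled='false' />

        <single_instance />

        <!--
          By default, application/catman will run in the background
          during boot.  If you want to run it periodically, execute

             # /usr/sbin/svcadm restart catman

          If you wish to augment the default MANPATH, use the setenv
          subcommand to svccfg(1M).  For instance, to add the Java
          manual pages to the build:

             # /usr/sbin/svccfg -s application/catman
             > setenv MANPATH /usr/share/man:/usr/java/man
             > exit

             # /usr/sbin/svcadm refresh catman

          If MANPATH is not defined, the default manual path is
          /usr/share/man, as per catman(1M).
        -->

        <dependency
                name='local-filesystems'
                type='service'
                grouping='require_all'
                restart_on='none'>
                <service_fmri value='svc:/system/filesystem/local' />
        </dependency>

        <dependency
                name='remote-filesystems'
                type='service'
                grouping='optional_all'
                restart_on='none'>
                <service_fmri value='svc:/network/nfs/client' />
                <service_fmri value='svc:/system/filesystem/autofs' />
        </dependency>

        <exec_method
                type='method'
                name='start'
                exec='/usr/bin/catman -w'
                timeout_seconds='0' />

        <exec_method
                type='method'
                name='stop'
                exec=':true'
                timeout_seconds='0' />

        <property_group name='startd' type='framework'>
                <propval name='duration' type='astring' value='transient' />
        </property_group>

        <stability value='Unstable' />

        <template>
                <common_name>
                        <loctext xml:lang='C'>
                                manual page index generation
                        </loctext>
                </common_name>
                <documentation>
                        <manpage title='catman' section='1M'
                                manpath='/usr/share/man' />
                </documentation>
        </template>
</service>

</service_bundle>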


Following my own instructions in the comment block, I defined a value for MANPATH and refreshed the service. My setting can be double-checked with svcprop(1) like so:

$ svcprop -p start application/catman
start/exec astring /usr/bin/catman\ -w
start/timeout_seconds count 0
start/type astring method
start/environment astring MANPATH=/usr/share/man:/usr/openwin/man:/usr/sfw/man:/usr/dt/man:/usr/perl5/man:/usr/java/man:/usr/apache/man:/usr/X11/man:/opt/sfw/man:/opt/csw/man

Issuing "svcadm enable catman" will cause the service to be executed immediately, and upon each subsequent boot. Our earlier query becomes fecund:

$ apropos sort
FcFontSort      FcFontSort (3fontconfig)    - Return list of matching fonts
aclsort         aclsort (3sec)  - sort an ACL
alphasort       scandir (3c)    - scan a directory
alphasort       scandir (3ucb)  - scan a directory
bsearch         bsearch (3c)    - binary search a sorted table
bunzip2         bzip2 (1)       - a block-sorting file compressor and associated utilities
bzcat           bzip2 (1)       - a block-sorting file compressor and associated utilities
bzip2           bzip2 (1)       - a block-sorting file compressor and associated utilities
bzip2recover    bzip2 (1)       - a block-sorting file compressor and associated utilities
disksort        disksort (9f)   - single direction elevator seek sort for buffers
ldap_sort       ldap_sort (3ldap)   - LDAP entry sorting functions
ldap_sort_entries               ldap_sort (3ldap)   - LDAP entry sorting functions
ldap_sort_strcasecmp            ldap_sort (3ldap)   - LDAP entry sorting functions
ldap_sort_values                ldap_sort (3ldap)   - LDAP entry sorting functions
libbz2          libbz2 (3)      - library for block-sorting data compression
look            look (1)        - find words in the system dictionary or lines in a sorted list
qsort           qsort (3c)      - quick sort
sort            sort (1)        - sort, merge, or sequence check text files
sortbib         sortbib (1)     - sort a bibliographic database
tsort           tsort (1)       - topological sort
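
To see which components of a manual path still lack an index (for instance, after extending MANPATH as above), a small shell function along these lines may help. It is only a sketch: the name missing_windex is hypothetical, and it assumes each component keeps its index in a file named windex at the top of the directory, as the earlier /usr/man/windex error suggests.

```shell
# List the directories in a colon-separated manual path that have no
# windex file.  Defaults to /usr/share/man, the default manual path
# mentioned above.  missing_windex is a hypothetical helper.
missing_windex() {
    manpath=${1:-/usr/share/man}
    (
        # Split on colons inside a subshell so the caller's IFS is untouched.
        IFS=:
        for dir in $manpath; do
            [ -f "$dir/windex" ] || printf '%s\n' "$dir"
        done
    )
}

missing_windex "${MANPATH:-/usr/share/man}"
```

An empty result means every component already has an index; any directory printed would be a candidate for another catman run.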

Possible improvements:

  1. Add a configuration property that makes the service also rebuild the nroffed versions of the manual pages, if set to true.
  2. Make the service regenerate only in the case that components in the path have changed.

Tie knot: Knot 54 (Hanover).

Monday Mar 14, 2005

smf(5): manifest editing assistance

We had a productive wrap-up meeting for Solaris 10 Platinum Beta last week, with lots of good feedback on smf(5). One point raised is that few people like to hand-edit XML (or, maybe, many people hate to), so tools for composing service manifests are needed. We'll need to percolate on how best to improve or extend the current set of tools, but there are a few tricks out there already.

A bad manifest. Let's take a well-formed and valid manifest file and add the nonsensical line

<french_fry>I,m a bad element.</french_fries>
to simulate a developer making a composition error during service development. How do we determine that our manifest is now broken?

svccfg(1M) validation. As I mentioned, the basic tools aren't helpful. The logical svccfg(1M) subcommand to check a manifest for correctness is validate. Its output on our manifest is

$ svccfg validate /tmp/gdm2-login.xml
svccfg: couldn't parse document
which accurately tells us the manifest is broken but does not indicate how (at all).

xmllint(1). The XML parser implementation of svccfg(1M) is the GNOME libxml2, which includes a general validation tool in the form of xmllint(1). If we invoke this command with its --valid long option, we get

$ xmllint --valid /tmp/gdm2-login.xml
/tmp/gdm2-login.xml:26: parser error : Opening and ending tag mismatch: french_fry line 26 and french_fries
        <french_fry>I,m a bad element.</french_fries>
which isn't validating the document, but is telling us where and how it is not well-formed.
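
The two kinds of failure can be separated by running xmllint twice: once to parse only, and once with --valid. The following is a sketch rather than a finished tool; check_manifest is a hypothetical name, and it assumes the manifest's DOCTYPE points at a DTD that xmllint can read.

```shell
# Report whether a manifest is not well-formed, well-formed but invalid,
# or valid.  xmllint's own diagnostics go to stderr; --noout suppresses
# the echoed document.  check_manifest is a hypothetical helper.
check_manifest() {
    file=$1
    # Pass 1: parse only, so mismatched tags surface first.
    if ! xmllint --noout "$file"; then
        echo "not well-formed"
        return 1
    fi
    # Pass 2: validate against the DTD named in the DOCTYPE declaration.
    if ! xmllint --noout --valid "$file"; then
        echo "well-formed, but not valid"
        return 2
    fi
    echo "valid"
}
```

Running check_manifest /tmp/gdm2-login.xml on the broken manifest would stop at the first pass, with the parser error shown above on stderr.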

Graphically clear. An interesting option is to use the jEdit editor, with its XML plugin. With our document, the XML plugin will validate on save and highlight the incorrect line with red underlining:
jEdit main window
Moreover, the error window shows both the non-well-formedness and the invalid <french_fry> element (which is absent from the non-fast food-oriented service bundle DTD).
jEdit error window
So we see both the immediate and the deeper error, plus the plugin highlights matching tags and provides completion menus for tag selection. Civilization to most, I expect.

I happily use vim for development, but it's important to note the value in jEdit just from using a different XML implementation. Using other tools to ease your composition of service descriptions (or profiles)? Let us know—and, rest assured, we're working to make that svccfg(1M) output more useful.

Tie knot: Knot 6 (Victoria).

Tuesday Dec 21, 2004

smf(5) on /.

smf(5) ended up in two stories on slashdot today. In "Torvalds on Opening Solaris", elmegil observed

I'm rather amused to see Sun be the first to implement a replacement for the old init and have it done. I can't say I know who thought it up first, but Solaris 10 SMF is the first working implementation I'm aware of that's going to get any kind of wide deployment. I saw some linux-head saying this needed to be done a year or more ago, but I can't even find their website in google now. And obviously if Solaris has it now, the implementation started a while back (probably more than a year)...

I suspect elmegil is referring to Seth Nickell's System Services work, which ended up being discussed in an article in October 2003. But there is other work in the parallel startup area that's been cited on slashdot, and elsewhere.

The second story is "A Diagnosis of Self-Healing Systems", which is a discussion around Mike Shapiro's recent overview in ACM Queue of the problems we're working to solve in the Predictive Self-Healing effort. The comments range across a number of topics in deployment and architecture, but I was interested in the observations that self-healing in a general purpose system is a different proposition than in a limited purpose system. (I probably would also contrast open software systems and closed ones—perhaps a distinction of the past.)

Wednesday Dec 01, 2004

Personal restarters and ctrun(1)

(I know I need to wrap up the cal(1) contest. I also need to finish about three smf(5) blog entries. I am also mostly keeping up with the forum/list/newsgroup traffic on smf(5). I also have a few more bugfixes to get into Solaris 10 first. But a small dispatch seems necessary.)

One of the neat things about GNOME is that it provides restarter operations for specified applications—your application references a bad address and dies, it gets restarted. I'm pretty attached to ion, however, and don't really need all of the GNOME environment all the time. But I do want restart for a couple of applications in my session.

For instance, I wrote a little C program, osdclock, using libxosd that displays the date, the time, and those of my mail folders containing new mail. It looks like this:

Upper right-hand screenshot

Occasionally, and for reasons I lack the time to debug, new ion workspaces will obscure the on-screen display. Rather than write code into osdclock to call exec(2) (of itself) on a received signal, it's easier to use the new commands associated with the contracts subsystem of Solaris 10, specifically the ctrun(1) command. By adding the invocation

/usr/bin/ctrun -r 0 -o noorphan -f signal,hwerr $HOME/bin/osdclock &

to my .xinitrc file, I get an osdclock that is restarted from any fatal external signal (or an uninterceptable hardware error of some kind), but is not restarted if a core file is generated (from a software error) or if the ctrun process is itself killed (like on session exit). So I avoid a home directory filled with core files. (You are using coreadm(1M) to have a meaningful core dump pattern, right?)

Since the cause of restart is ctrun's awareness of the process contract becoming empty, this same ctrun invocation could be used with an application that prefers to daemonize, like a personal web proxy. It could also be used to set up a restartable group of processes, like an application suite or widget collection. (ctrun has some other interesting options, and is a versatile lightweight restarter on its own—try it out.)
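
For comparison, on a system without contracts the restart-on-signal part of this policy can be roughly approximated in plain sh. This is a sketch only and not how ctrun(1) works: the function name restart_on_signal is hypothetical, it watches a single foreground child rather than a process contract, and it cannot tell a core-dumping signal from any other or follow a program that daemonizes.

```shell
# restart_on_signal LIMIT COMMAND [ARG...]
# Re-run COMMAND whenever it is killed by a signal.  A LIMIT of 0 means
# "no limit", mirroring the -r 0 in the ctrun invocation above.
# restart_on_signal is a hypothetical helper.
restart_on_signal() {
    limit=$1
    shift
    restarts=0
    while :; do
        "$@"
        status=$?
        # POSIX shells report a child killed by a signal as a status above 128.
        if [ "$status" -le 128 ]; then
            return "$status"        # ordinary exit: do not restart
        fi
        if [ "$limit" -gt 0 ] && [ "$restarts" -ge "$limit" ]; then
            return "$status"        # restart budget exhausted
        fi
        restarts=$((restarts + 1))
        echo "restart $restarts after exit status $status" >&2
    done
}
```

A line such as restart_on_signal 0 "$HOME/bin/osdclock" & in .xinitrc would give similar behaviour for a simple program, though the contract-based approach remains the more robust one.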

Having applications always available leads to predictable computing, but I think a more dramatic way to express it in this medium (but imagine appropriate stormy weather sound effects) would be

<mad_scientist>Restarters, restarters everywhere! <laughter glee="maniacal" /></mad_scientist>


