Thursday Mar 05, 2009

Speeding to a Halt

On Sunday I committed changes into Nevada build 110 (and hence, into OpenSolaris) to improve the speed at which OpenSolaris systems shut down.  On a generic x86 test system, it used to take about 41 seconds to shut the system down, starting from the moment you typed init 6, ran shutdown(1m), or pressed the shutdown button in GNOME.  This form of shutdown is known as gentle shutdown, because SMF takes care to stop each software service in dependency order.  In contrast, when you use reboot, halt, or poweroff, you're using what I'll call violent shutdown.  In the latter, the boot archive is brought up to date and that's about it.  It has traditionally been much, much faster than gentle shutdown.

This relatively long shutdown time has had an interesting effect: Solaris developers almost universally cheat and use violent shutdown.  Typing reboot is terribly ingrained in my fingertips.  This is bad, because it means less test coverage for the shutdown method that actually matters to customers.  I recently began to be bothered by this because the GNOME GUI uses init 6, and so shutting down via the nice shiny "Shut Down" button is also terribly slow.

On something of a whim, I dusted off our variant of the bootchart project, which we ported in-house several years ago, to get some initial information on what was happening during system shutdown.  Click here to see a graphical representation of a typical system shutting down (note: big image).  To read the image, note that time is on the X axis, from left to right.  Processes appear as horizontal bars whose length represents their duration.  At the rightmost side of the chart, the system has stopped.

In the image I've highlighted a couple of points of interest:

  • The pppd shutdown script seems to sleep 1, always, even if you aren't using ppp; since pppd isn't converted to SMF (bug 6310547), we will try to stop it on all systems on every shutdown.
  • The wbem service seems to sleep 1 while shutting down, and the webconsole service takes a while to shut down.  However, these services are present only on Nevada, and not on OpenSolaris, so I chose not to pursue fixing them.
  • The deferred patching script, installupdates, is really slow.  And needlessly so-- it can run in a few milliseconds with a simple fix; I filed a bug.
  • There are some long calls to /usr/bin/sleep.  In the chart linked above, you can see svc-autofs, rpc-bind, and svc-syseventd each taking five seconds to stop.  Five seconds is a really long time!
  • There's a call to something called killall near the end of shutdown.  Then, 5 seconds later, another.  Then, 10 seconds later, things proceed again.  I wondered what the killall was all about.  Did it really need 15 seconds to do its work?

After a bit of effort (ok, a lot of effort), I've cleaned up these problems, and some others I spotted along the way.  It turns out that the five-second delays come from some poor quality shell code in /lib/share/:

smf_kill_contract() {
        # Kill contract.
        /usr/bin/pkill -$2 -c $1
        # If contract does not empty, keep killing the contract to catch
        # any child processes missed because they were forking
        /usr/bin/sleep 5
        /usr/bin/pgrep -c $1 > /dev/null 2>&1
        while [ $? -eq 0 ] ; do
                ...

Ugh.  So this shell function rather bizarrely always waits five seconds, even if the contract empties out in a tenth of a second!  I fixed it to use a smarter algorithm, one which keeps checking at much more frequent intervals (every 0.2 seconds).
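The shape of the fix can be sketched like this (a minimal, portable rendition of the idea, not the actual committed code; poll_until and the tick counts are illustrative):

```shell
# Illustrative sketch of the revised approach: instead of a flat
# "sleep 5", poll a condition at 0.2-second intervals and return as
# soon as it holds, with an overall cap so we cannot loop forever.
poll_until() {
        # $1 = maximum number of ticks; remaining args = condition command
        ticks=$1; shift
        until "$@"; do
                [ "$ticks" -le 0 ] && return 1  # gave up: timed out
                sleep 0.2
                ticks=$((ticks - 1))
        done
        return 0
}
```

In smf_kill_contract, the condition would be the pgrep -c test on the contract id, so a contract that empties in a tenth of a second costs one short tick rather than a full five seconds.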

I discovered that the calls to killall(1m) were really ancient history, and probably did not need to occupy 15 seconds worth of system shutdown.  I have shortened the interval substantially.

Another problem we faced was that, in the last moments before shutdown, startd runs some commands using system(3c).  This can be a problem if one of those commands, for some reason, wedges up.  So, I've replaced the calls to system with calls that time out after a set number of seconds.  This is some nice insurance that the system continues to make progress as it shuts down.  Since I wound up with so much extra time available at shutdown, I've taken the chance to add a call to lockfs(1m) in an effort to get as much on-disk UFS consistency as possible.
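The actual change lives in svc.startd's C code, but the concept can be sketched in shell (run_with_timeout is a made-up name for illustration):

```shell
# Sketch of running a command with a deadline: start the command and a
# watchdog; if the command outlives the deadline, the watchdog kills it,
# so shutdown keeps making progress even if a command wedges.
run_with_timeout() {
        secs=$1; shift
        "$@" &
        cmd=$!
        ( sleep "$secs"; kill "$cmd" 2>/dev/null ) &
        watchdog=$!
        wait "$cmd"
        status=$?
        kill "$watchdog" 2>/dev/null   # command finished; stop the watchdog
        return $status
}
```

A command that finishes promptly returns its normal exit status; one that wedges is killed at the deadline and reports a failure status instead of hanging shutdown forever.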

So, here is the intermediate picture.  I've slightly revised the shutdown messaging, as well, to include a datestamp and a measurement of how long the shutdown took:

    svc.startd: The system is coming down.  Please wait.
    svc.startd: 83 system services are now being stopped.
    Mar  5 19:43:34 soe-x4100-3 syslogd: going down on signal 15
    svc.startd: Killing user processes.
    Mar  5 19:43:40 The system is down.  Shutdown took 17 seconds.

But wait, there's more!  On OpenSolaris, we don't have the time-consuming wbem or webconsole services.  So, we can disable those and try again.  And, we use ZFS, for which the time-consuming lockfs call at the end of shutdown is a no-op (on UFS, it takes at least two seconds).  This slimmed-down stack results in an impressive shutdown time:

    Mar  6 02:51:51 The system is down.  Shutdown took 7 seconds.

And here is what it looks like.  If you want to see the complete set of changes, the codereview is also available.  As you can see, revising the way we kill off processes at the end of the life of the system is the big unrealized win.  And doing so would likely shave about 3 more seconds off, for a gentle shutdown of 4-5 seconds.  I ran out of time to do that this time around.

Some caveats:

  • Your mileage may vary: you might run a different mix of services on your system, and perhaps one of those has a slow shutdown method which will gum up the works.  If you want to test how long a service takes to stop, try ptime svcadm disable -s <servicename>.
  • Your performance improvement is likely to be less dramatic on systems with less available parallelism.  Most of my test systems have two or four CPUs.

I should add a coda here: this work is greatly improved by recent bootadm performance work by Enrico.  While building the boot archive is still sometimes triggered on shutdown, it takes a lot less time than it did previously.

I had a good time working on this project; I hope you'll enjoy using it.

Sunday May 11, 2008

A field guide to Zones in OpenSolaris 2008.05

I have had a busy couple of months. After wrapping up work on Solaris 8 Containers (my teammate Steve ran the Solaris 9 Containers effort), I turned my attention to helping the Image Packaging team (rogue's gallery) with their efforts to get OpenSolaris 2008.05 out the door.

Among other things, I have been working hard to provide a basic level of zones functionality for OpenSolaris 2008.05. I wish I could have gotten more done, but today I want to cover what does and does not work. I want to be clear that Zones support in OpenSolaris 2008.05 and beyond will evolve substantially. To start, here's an example of configuring a zone on 2008.05:

# zonecfg -z donutshop
donutshop: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:donutshop> create
zonecfg:donutshop> set zonepath=/zones/donutshop
zonecfg:donutshop> add net
zonecfg:donutshop:net> set physical=e1000g0
zonecfg:donutshop:net> set address=
zonecfg:donutshop:net> end
zonecfg:donutshop> add capped-cpu
zonecfg:donutshop:capped-cpu> set ncpus=1.5
zonecfg:donutshop:capped-cpu> end
zonecfg:donutshop> commit
zonecfg:donutshop> exit

# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   - donutshop        configured /zones/donutshop               ipkg     shared

If you're familiar with deploying zones, you can see that there is a lot which is familiar here.  But you can also see that donutshop isn't, as you would normally expect, using the native brand.  Here we're using the ipkg brand.  The reason is that commands like zoneadm and zonecfg have some special behaviors for native zones which presume that you're using a System V packaging based OS.  In the future, we'll make native less magical, and the zones you install will be branded native as you would expect.  Jerry is actually working on that right now.  Note also that I used the relatively new CPU Caps resource management feature to put some resource limits on the zone-- it's easy to do!  Now let's install the zone:

# zoneadm -z donutshop install
A ZFS file system has been created for this zone.

      Image: Preparing at /zones/donutshop/root ... done.
    Catalog: Retrieving from ... done.
 Installing: (output follows)
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  49/49   7634/7634 206.85/206.85 

PHASE                                        ACTIONS
Install Phase                            12602/12602 

       Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around
Postinstall: Working around
       Done: Installation completed in 208.535 seconds.

 Next Steps: Boot the zone, then log into the zone console
             (zlogin -C) to complete the configuration process

There are a couple of things to notice, both in the configuration and in the install:

Non-global zones are not sparse, for now
Zones are said to be sparse if /usr, /lib, /platform, /sbin and optionally /opt are looped back, read-only, from the global zone. This allows a substantial disk space savings in the traditional zones model (in which the zones have the same software installed as the global zone).

Whether we will ultimately choose to implement sparse zones, or not, is an open question. I plan to bring this question to the Zones community, and to some key customers, in the near future.

Zones are installed from a network repository
Unlike traditional zones, which are sourced by copying bits from the global zone, ipkg zones simply spool their contents from the network repository. The upside is that this was easy to implement; the downside is that you must be connected to the network to deploy a zone. Getting the bits from the global zone is still desirable, but we don't have that implemented yet.

By default, zones are installed using the system's preferred authority (use pkg authority to see what that is set to). The preferred authority is then propagated into the zone. If you want to override that, you can specify a different repository using the new -a argument to zoneadm install:

# zoneadm -z donutshop install -a ipkg=http://ipkg.eng:80

Non-global zones are small
Traditionally, zones are installed with all of the same software that the global zone contains. In the case of "whole root" zones (the opposite of sparse), this means that non-global zones are about the same size as global zones-- easily at least a gigabyte in size.

Since we're not supporting sparse zones, I decided to pare down the install as much as I could, within reason: the default zone installation is just 206MB, and has a decent set of basic tools. But you have to add other stuff you might need. And we can do even more: some package refactoring should yield another 30-40MB of savings, as packages like Tcl and Tk should not be needed by default. For example, Tk (5MB) gets dragged in as a dependency of python (the packaging system is written in python); Tcl (another 5MB) is dragged in by Tk. Tk then pulls in parts of X11. Smallness yields speed: when connected to a fast package repository server, I can install a zone in just 24 seconds!

I'm really curious to know what reaction people will have to such minimalist environments. What do you think?

Once you start thinking about such small environments, some new concerns surface: vim (which in 2008.05 we're using as our vi implementation) is 17MB, or almost 9% of the disk space used by the zone!

Non-global zones are independent of the global zone
Because ipkg zones are branded, they exist independently of the global zone. This means that if you do an image-update of the global zone, you'll also need to update each of your zones, and ensure that they are kept in sync. For now this is a manual process-- in the future we'll make it less so.

ZFS support notes
OpenSolaris 2008.05 makes extensive use of ZFS, and enforces ZFS as the root filesystem. Additional filesystems are created for /export, /export/home and /opt. Non-global zones don't yet follow this convention. Additionally, I have sometimes seen our auto-zfs file system creation fail to work (you can see it working properly in the example above). We haven't yet tracked down that problem-- my suspicion is that there is a bad interaction with the 2008.05 filesystem layout's use of ZFS legacy mounts.

As a result of this (and for other reasons too, probably), zones don't participate in the boot-environment subsystem. This means that you won't get an automatic snapshot when you image-update your zone or install packages. That means no automatic rollback for zones. Again, this is something we will endeavor to fix.

Beware of bug 6684810
You may see a message like the following when you boot your zone:

    zoneadm: zone 'donutshop': Unable to set route for interface lo0 to éÞùÞ$
    zoneadm: zone 'donutshop': 

This is a known bug (6684810); fortunately the message is harmless.

In the next month, I hope to: take a vacation, launch a discussion with our community about sparse root zones, and make a solid plan for the overall support of zones on OpenSolaris. I've got a lot to do, but that's easily balanced by the fact that I've been having a blast working on this project...

Friday Mar 28, 2008

OpenSolaris Elections Disappointment

The results of the OpenSolaris Elections are in.  Congratulations to the new board members!

So why did I entitle this post Elections Disappointment?  My complaint is with the "community priorities" question which was asked; this was a chance for core contributors to indicate what they felt were pressing priorities.  I was not happy with the OGB's choice (which I didn't know about until I voted) to reuse the same set of priorities questions from last year, one of which was:

  • Deploy a public code review facility on

That item was subsequently identified by voters in this election as the #3 priority.  Did the OGB believe that we did not accomplish this goal in the past year?  I know that I took the results of the last poll (it was voted #3 in that poll, too) to heart, worked hard to make that goal a reality, publicized it using the official channels, and have been enhancing it and taking care of it since then, mostly on my own time.  I felt that my contribution was really undermined by the question.

Second: Who voted for this item as a priority?  If you voted for this as a priority, I would like to hear why you did (anonymously is fine).  Are you unaware that exists?  Are you unsatisfied with the service it provides?

I hope the new OGB will find a way to reformulate a more cogent poll question about priorities.

Monday Mar 24, 2008 gets an ATOM feed

For the past couple of weeks, I have been working late at night and on weekends to add an ATOM feed (i.e. blog feed) to, so that as people post new code reviews, they are automatically discovered and published.  Stephen has been heckling me to do this work for more than a year.  This weekend I managed to finish it, despite the incredibly nice weather in the bay area: I was stuck inside with a nasty cold.

As an aside, I'm looking for help with  This is a great opportunity for someone to step up and help out with an important part of our community infrastructure.  Send me mail.

You can check out the results of my hacking on  Or you can subscribe to the feed.  If you want to opt-out of publishing reviews, you can create a file called "opt-out" in the same directory as your webrev's index.html file.  Or you can create a file called "opt-out" in your home directory, if you'd like to opt out of all reviews.
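The opt-out mechanism amounts to dropping a marker file; roughly (the paths here are illustrative stand-ins, not the server's real layout):

```shell
# Illustrative only: $WEBREV stands in for the directory on the server
# that holds your webrev's index.html.
WEBREV=${WEBREV:-$(mktemp -d)}
touch "$WEBREV/opt-out"       # hide just this one review from the feed
# To hide every review you post, put the marker in your home
# directory instead:
#     touch "$HOME/opt-out"
```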

Implementation Notes 

This was an interesting learning experience for me, since I had to learn a lot about ATOM in the process.  I also learned the XSLT language along the way, and how to process HTML using python.  All in all, I'd say this project took about 20 hours of effort, and resulted in about 500 lines of python code.  The most difficult problems to solve were:

  • I wanted the feed to include some meaningful information about the codereview.  If you subscribe to the feed using your favorite reader, you'll see that a portion of the "index.html" file from each webrev is included.  This is done using a somewhat tricky piece of python code.  In retrospect, using XSL for this might have been a better choice, although I've found that people have a tendency to introduce non-standard HTML artifacts into their webrev index.html files, and I don't know how well XSL would cope with that.

  • ATOM has some rules about generating unique and lasting IDs for things-- this is the contents of the <id> tag in the ATOM specification.  I found a lot of valuable information on dive-into-mark.  For, this was complicated by the fact that the user might log in and move their codereview around, or might copy one review over another.  In the end, I solved this by remembering the <id> tag in a dot-file which rides along with the codereview.  A cronjob roves around the filesystem looking for new reviews, and adds the special tag-file.  By storing the original <id> tag value, and looking at the modtime of the index.html file, I can correctly compute both the value of the <id> and <updated> fields for each entry.  If a user deletes a codereview, the dot-file will go away with it.

  • Once I had an ATOM feed I needed to transform it back into HTML for display on the home page.  The only problem was that there aren't a lot of good examples of this on the web-- many of the ATOM-to-HTML conversions only work with ATOM 0.3, not the 1.0 specification, and I didn't know the first thing about XPATH or XSL.  In the end, I only needed 25 lines or so of XSLT code.
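The id-stamping scheme from the second bullet can be sketched in shell, though the real code is Python (the .atom-id file name and the tag format here are made up for illustration):

```shell
# Record a permanent ATOM <id> for a webrev the first time the cron job
# sees it; later passes reuse the stored value, so the id survives the
# review being re-uploaded or copied over.
stamp_review() {
        dir=$1
        tagfile="$dir/.atom-id"
        if [ ! -f "$tagfile" ]; then
                # Mint the id exactly once and never rewrite it.
                echo "tag:example.invalid,$(date +%Y-%m-%d):$dir" > "$tagfile"
        fi
        cat "$tagfile"
}
```

The <updated> field would then come from the modification time of the review's index.html, which the dot-file scheme leaves untouched; deleting the review deletes the dot-file along with it.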

Future Work 

I think of the current implementation as a "1.0"-- it'll probably last us pretty well for a while.  One thing I'd like to research for a future revision is actually placing the entries into a lightweight blog engine, and letting it do the rest of the work: Using an excellent list from Social Desire I took a quick look at Blosxom, Flatpress, Nanoblogger, and some others.

Saturday Oct 13, 2007 now live, exits beta

On Thursday I posted this announcement to opensolaris-announce, which notes that I have taken out of beta status. Please take a look, use, and enjoy...

I am pleased to announce that, the OpenSolaris
Code Review site, is now officially part of the
infrastructure. I encourage everyone in the community to use it.

In the March 2007 community priorities poll, the community declared:

#3 Deploy a public code review facility on
... was designed to meet that goal. It has been in Beta
test for some time, and after some final tweaking[1] it is now ready for

You will need an SSH key linked to your account in order to use it.
For more information on SSH keys, see

Once you have a registered key, see for
information about how to gain access to this system. We request
that users review the terms of service before using it. Problems can be
reported to website-discuss at opensolaris dot org.

Finally, if you would like to help out with the coding/maintenance of
this site, please visit the site and read the "Call for Help" section.


Dan Price

[1]: Specifically, concerns about who may access the system have been

Thursday May 17, 2007 goes to Beta

I am happy to announce that, the OpenSolaris code review site, is now in Beta. For now, the site automatically grants an account to anyone with a contributor grant or better. For more on the "grants" system, see article III, Section 3.3 of the OpenSolaris Constitution.

For now, new accounts get created periodically by me running a script -- eventually this will update hourly. So if you don't seem to have an account now, let me know when you've got your grant.  Update 5/27/07: I set this up to auto-update every 20 minutes.

So take a look, give it a spin, and see if it works for you-- and please let me know, good or bad. Keep in mind that this is a Beta, so things might be in flux over the next few weeks. Also: if you think you should have an account (because you have a grant) but don't, please let me know.

Happy code reviewing!

Monday May 07, 2007, a work in progress

A while back I wrote about online codereview.  And Stephen has been hassling me to move my existing code review site, (hosting graciously provided by Steve Lau) over to the infrastructure.  So I have been slowly chipping away at all this, and now have some progress to report.

Gary was kind enough to provision me a zone in the OpenSolaris infrastructure.  As such, I've started to bring up the new codereview site.  Thus far, I have been working on getting mod_layout working so that we can decorate user supplied content with a little header and footer.  This was a bit of a pain because the Apache bundled with Solaris is compiled with the Sun compilers, which are not installed on the datacenter systems.  Only gcc is available (as it is in general on Solaris 10, under /usr/sfw/bin/gcc).  So, I did the build elsewhere-- but I wonder what customers do when faced with this problem?  I wonder if we could use our "compiler wrapper" technology (which we use when building the OS under both compilers) to help us to bridge this gap?

Once I got that squared away, I brought up Apache with some stubbed out content.  Here is the beginning of the home page, and here is a sample code review with the decorations in place (that bar at the top is being dynamically inserted at page transmit time by mod_layout).  What do you think?  I'm pretty pleased about the way it looks, which was inspired by a similar "link bar" feature in gmail.

There is a bunch of stuff left to do: I'm going to change the account management around so that participates in the SSH key management scheme.  My working idea is that anyone with a contributor grant will be able to upload files.  The rssh stuff needs to be moved over from grommit.  Existing user data needs to move over.  The main page needs an overhaul... and probably a bunch of other stuff I haven't thought of yet needs to be done.  Hopefully in another couple of weeks it will be fully up and running.

