Wednesday Apr 08, 2009

An irritating bug in bash

Here's what kernel developers talk about at coffee: bugs in our command shells.  At least, that was the topic the other day when Stephen, Dave and I were complaining about the various troubles we'd had with different command shells.

While others moved to zsh some years ago, I have been a bash user since kicking the tcsh habit.  But for years I have been plagued by a subtle nuisance in bash: sometimes it doesn't catch terminal window resizes properly.  The result is that command line editing works very poorly until bash finally figures this out. After a while, I worked out that this behavior happens only when the window size change happens while you're in some application spawned by bash.  So if you're in an editor like vim, change the window size, and then exit (or suspend) the editor, bash will be confused about the terminal size.

While this has always annoyed me, it never quite reached the threshold for me to do anything about it.  But recently it has been bugging me more and more.  After we returned from coffee, I dug into the bash manual and discovered a little-known shell option, checkwinsize.  In a nutshell, you can set this option as follows:

    shopt -s checkwinsize

The bash manual describes it this way: If set, Bash checks the window size after each command and, if necessary, updates the values of LINES and COLUMNS. Sounds great!  As an aside, I think that as a modern shell, bash should set this option by default.  (Others think so too.)

With much self-satisfaction I set this option and got ready for line editing bliss.  But, no joy.  I checked and rechecked, and finally started using truss, and then DTrace, to try to understand the problem.  After some digging I eventually discovered the following bug in the shell.  Here's the meat of the writeup I submitted to the bash-bug list:

On Solaris/OpenSolaris platforms, I have discovered what I believe is a
bug in lib/sh/winsize.c.

I discovered with a debugger that the get_new_window_size() function
has no effect on Solaris.  In fact, here is what this file looks like if
you compile it:

$ dis winsize.o
disassembly for winsize.o
section .text
    get_new_window_size:     c3                 ret

That's it-- an empty function.  The problem is that the appropriate header
file is not getting pulled in, in order to #define TIOCGWINSZ.

As a result, even with 'shopt -s checkwinsize' set on Solaris, bash
does not check the win size on suspend of a program, or on program
exit.  This is massively frustrating, and I know of several Solaris
users who have switched to zsh as a result of this bug.

I have not tried bash 4.0, but looking at the source code, it appears
that the bug is present there as well.


I added an ifdef clause which looks to see if the HAVE_TERMIOS_H define
is set, after the #include of config.h.  If it is, then I #include the
termios.h header file.  This solves the problem, which I confirmed by
rebuilding and dis'ing the function.  I also ran my recompiled bash
and confirmed that it now worked correctly.
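
For readers who haven't met TIOCGWINSZ: it is the ioctl request a program uses to ask the terminal driver for the current window dimensions, which is why a build that never sees its #define can end up with an empty function.  The following is a minimal Python sketch of that query, not bash's code; the function name window_size is made up for illustration.

```python
import fcntl
import struct
import termios


def window_size(fd=0):
    """Return (rows, cols) for the terminal on fd, or None if fd is not a terminal."""
    try:
        # struct winsize is four unsigned shorts: rows, cols, xpixel, ypixel.
        packed = fcntl.ioctl(fd, termios.TIOCGWINSZ, b"\0" * 8)
    except OSError:
        # Pipes, regular files and closed descriptors do not support the request.
        return None
    rows, cols, _xpixel, _ypixel = struct.unpack("HHHH", packed)
    return rows, cols
```

A shell with checkwinsize set would make an equivalent call after each command and copy the result into LINES and COLUMNS.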

Hopefully the bash maintainers will take note and fix this bug.  In the meantime, I'm going to see if we can get the fix for this applied to the Nevada (and hence, OpenSolaris) version of bash.

Update: The bash maintainers have fixed this bug in the following patch to bash 4.x.  Hurray!

Thursday Mar 05, 2009

Speeding to a Halt

On Sunday I committed changes into Nevada build 110 (and hence, into OpenSolaris) to improve the speed at which OpenSolaris systems shut down.  On a generic x86 test system, it used to take about 41 seconds to shut the system down starting from the time you typed init 6, shutdown(1m), or pressed the shutdown button in GNOME.  This form of shutdown is known as gentle shutdown because SMF takes care to stop each software service in dependency order.  In contrast, when you use reboot, halt, or poweroff, you're using what I'll call violent shutdown.  In the latter, the boot archive is brought up to date and that's about it.  It has traditionally been much, much faster than gentle shutdown.

This relatively long shutdown time has had an interesting effect: Solaris developers almost universally cheat, and use violent shutdown.  Typing reboot is terribly ingrained in my fingertips.  This is bad, because it means less test coverage for the shutdown method which actually matters to customers.  I recently began to be bothered by this because the GNOME GUI uses init 6, and so shutting down via the nice shiny "Shut Down" button is also terribly slow.

On something of a whim, I dusted off our variant of the bootchart project which we ported in-house several years ago to get some initial information on what was happening during system shutdown.  Click here to see a graphical representation of a typical system shutting down (note: big image).  To read the image, note that time is on the X axis, from left to right.  Processes are represented as horizontal bars representing their duration.  At the rightmost side of the chart, the system has stopped.

In the image I've highlighted a couple of points of interest:

  • The pppd shutdown script seems to sleep 1, always, even if you aren't using ppp; since pppd isn't converted to SMF (bug 6310547), we will try to stop it on all systems on every shutdown.
  • The wbem service seems to sleep 1 while shutting down, and the webconsole service takes a while to shut down.  However, these services are present only on Nevada, and not on OpenSolaris, so I chose not to pursue trying to fix them.
  • The deferred patching script, installupdates, is really slow.  And needlessly so-- it can run in a few milliseconds with a simple fix; I filed a bug.
  • There are some long calls to /usr/bin/sleep.  In the chart linked above, you can see svc-autofs, rpc-bind, and svc-syseventd each taking five seconds to stop.  Five seconds is a really long time!
  • There's a call to something called killall near the end of shutdown.  Then, 5 seconds later, another.  Then, 10 seconds later, things proceed again.  I wondered what the killall was all about.  Did it really need 15 seconds to do its work?

After a bit of effort (ok, a lot of effort), I've cleaned up these, and some other problems I spotted along the way.  It turns out that the five second delays are from some poor quality shell code in /lib/share/

smf_kill_contract() {
        # Kill contract.
        /usr/bin/pkill -$2 -c $1
        # If contract does not empty, keep killing the contract to catch
        # any child processes missed because they were forking
        /usr/bin/sleep 5
        /usr/bin/pgrep -c $1 > /dev/null 2>&1
        while [ $? -eq 0 ] ; do

Ugh. So this shell function rather bizarrely always waits five seconds, even if the contract empties out in 1/10th of a second!  I fixed this to have a smarter algorithm, and to keep checking at more frequent intervals (every 0.2 seconds).
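
To make the difference concrete, here is a small sketch of the "check often, give up after a limit" idea.  It is written in Python rather than shell and is not the code that was committed; wait_until_empty and its parameters are invented for illustration, and in a real stop method the is_empty callback would wrap something like a pgrep -c check on the contract ID.

```python
import time


def wait_until_empty(is_empty, interval=0.2, timeout=5.0, sleep=time.sleep):
    """Poll is_empty() every `interval` seconds; return True as soon as it holds.

    Returns False if `timeout` seconds of polling pass without success.
    """
    waited = 0.0
    while True:
        if is_empty():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

With this shape, a contract that empties in a tenth of a second costs one short sleep instead of a fixed five seconds, while a stubborn one is still bounded by the timeout.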

I discovered that the calls to killall(1m) were really ancient history, and probably did not need to occupy 15 seconds worth of system shutdown.  I have shortened the interval substantially.

Another problem we faced was that, in the last moments before shutdown, startd runs some commands using system(3c).  This can be a problem if one of those commands, for some reason, wedges up.  So, I've replaced the calls to system with calls which time out after a set number of seconds.  This is some nice insurance that the system continues to make progress as it shuts down.  Since I wound up with so much extra time available at shutdown, I've taken the chance to add a call to lockfs(1m) in an effort to get as much on-disk UFS consistency as possible.
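
The idea of "run a command, but don't wait forever" can be sketched as follows.  This is a Python stand-in for illustration only; the real change is in startd's C code, and run_with_timeout is a made-up name.

```python
import subprocess


def run_with_timeout(argv, seconds):
    """Run argv; return its exit status, or None if it had to be killed."""
    try:
        return subprocess.run(argv, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising TimeoutExpired.
        return None
```

A wedged command then costs at most the chosen number of seconds, after which shutdown carries on.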

So, here is the intermediate picture.  I've slightly revised the shutdown messaging, as well, to include a datestamp and a measurement of how long the shutdown took:

    svc.startd: The system is coming down.  Please wait.
    svc.startd: 83 system services are now being stopped.
    Mar  5 19:43:34 soe-x4100-3 syslogd: going down on signal 15
    svc.startd: Killing user processes.
    Mar  5 19:43:40 The system is down.  Shutdown took 17 seconds.

But wait, there's more!  On OpenSolaris, we don't have the time consuming wbem or webconsole services.  So, we can disable those and try again.  And, we use ZFS, for which the time consuming lockfs call at the end of shutdown is a no-op (on UFS, it takes at least two seconds).  This slimmed down stack results in an impressive shutdown time:

    Mar  6 02:51:51 The system is down.  Shutdown took 7 seconds.

And here is what it looks like.  If you want to see the complete set of changes, the codereview is also available.  As you can see, revising the way we kill off processes at the end of the life of the system is the big unrealized win.  And doing so would likely shave about 3 more seconds off, for a gentle shutdown of 4-5 seconds.  I ran out of time to do that this time around.

Some caveats:

  • Your mileage may vary: you might run a different mix of services on your system, and perhaps one of those has a slow shutdown method which will gum up the works.  If you want to test how long a service takes to stop, try ptime svcadm disable -s <servicename>.
  • Your performance improvement is likely to be less dramatic on systems with less available parallelism.  Most of my test systems have two or four CPUs.

I should add a coda here: this work is greatly improved by recent bootadm performance work by Enrico.  While building the boot archive is still sometimes triggered on shutdown, it takes a lot less time than it did previously.

I had a good time working on this project; I hope you'll enjoy using it.

Wednesday Jun 25, 2008

BigAdmin Updates Zones Page

The zones team recently identified that our web presence has been lacking.  A quick survey surprised us: the relatively unloved BigAdmin: Solaris Containers page generates a lot of daily visits.  We realized that this is because it is the top search result if you google for "Solaris Zones".  So Penny and Robert (one of the BigAdmin maintainers) set out to make improvements.  The new page is now posted.  Take a look!  Also, if you have suggestions for further improvements, or pointers to materials we should add, please leave them in my blog's comments section.

Wednesday Jun 18, 2008

Blastwave on Solaris 8 Containers

I just got word of a great new Solaris 8 Containers success story.  I think my favorite quote from Dennis Clarke (Founder of Blastwave) is:

"I virtualized critical Solaris 8 production servers and nobody noticed.  I literally shut the server down, backed it up, created a Solaris 8 Container, restored the environment, and brought the server back up. The process was simple, transparent, and completely flawless."

Our team did a lot of challenging work to ensure this sort of customer experience.  It's gratifying to see it pay off for an important contributor to the Solaris ecosystem.

So if you've been trying to persuade your PHB that you should give Solaris 8 Containers a try, here is a great reference!

Thursday Jun 12, 2008

Zones (Containers) Hosting Providers

I've been keeping track of a number of companies who provide virtual server hosting based on Solaris Zones.  Since we occasionally get asked about this, I thought I'd share my personal list.  This is not an endorsement of these businesses.  In the interest of full disclosure, I was once taken to lunch by David and Jason from Joyent.  Sorted alphabetically:

I'm pretty sure this list is incomplete.  I'll try to keep it updated as I learn more.  Feel free to post (or send me) corrections and additions, and I'll add them.

Finally, if you are a hosting provider, and want to consider adding zones to your suite of offerings, we're more than willing to talk, anytime.

Update #1: Added Gangus and Layered Technologies.
Update #2: Removed Layered Technologies (we're not sure); added Stability Hosting
Update #3 (6/30/2008): Added Beacon VPS (confirmed by email with Beacon)

Wednesday Jun 04, 2008

ii_bitmap=1; ii_bitmap=1; ii_bitmap=1; ii_bitmap=1; ...

Sometimes when you go hunting for bugs, it's pretty mundane.  Other times, you strike gold... 

On Monday night, Jan, Liane and I stayed late at work to help with some maintenance on our building's file and mail server, which we affectionately know as Jurassic.  The point of Jurassic is to run the latest Solaris Nevada builds in a production environment.  The system's regular admin is on vacation, and Jurassic was experiencing some unusual problems, and so a group of kernel engineers volunteered to help out.  It's our data, and our code, after all...

Jurassic had experienced two failed disks, and we really wanted to replace those.  For some complicated reasons, we needed to reboot the system, which was fine with us anyway, because we wanted to see firsthand a problem which had been reported but not diagnosed: why was the svc:/system/filesystem/root:default service experiencing timeouts on boot? This service, it turns out, doesn't do much (despite its rather grand name): it takes care of finding and mounting a performance-optimized copy of libc, then runs a devfsadm invocation which ensures that the kernel has the latest copies of various driver.conf files.  This made the timeout we saw all the more puzzling: why would a five minute timeout expire for this service? To make matters worse, SMF tries three times to start this service, and so the aggregate time was 3 × 5 = 15 minutes.

Once we waited, what we found was pretty surprising: the "hang" we were seeing was due to seemingly stuck devfsadm processes-- three in fact:

# pgrep -lf devfsadm
100015 /usr/sbin/devfsadm -I -P
100050 /usr/sbin/devfsadm -I -P
100054 /usr/sbin/devfsadm -I -P

The next step I usually take in a case like this is to use pstack to see what the processes are doing.  However, in this case, that wasn't working:

# pstack 100050
pstack: cannot examine 100050: unanticipated system error

Hmm. Next we tried mdb -k, to look at the kernel:

# mdb -k
mdb: failed to open /dev/ksyms: No such file or directory

This was one of those "oh shit..." moments.  This is not a failure mode I've seen before.  Jan and I speculated that because devfsadm was running, perhaps it was blocking device drivers from loading, or being looked up. Similarly:

# mpstat 1
mpstat: kstat_open failed: No such file or directory

This left us with one recourse: kmdb, the in situ kernel debugger.  We dropped in and looked up the first devfsadm process, walked its threads (it had only one) and printed the kernel stack trace for that thread:

# mdb -K
kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
> 0t100015::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff0934cbce20: ffffff003c57dc80
[ ffffff003c57dc80 _resume_from_idle+0xf1() ]
  ffffff003c57dcc0 swtch+0x221()
  ffffff003c57dd00 sema_p+0x26f(ffffff0935930648)
  ffffff003c57dd60 hwc_parse+0x88(ffffff09357aeb80, ffffff003c57dd90,
  ffffff003c57ddb0 impl_make_parlist+0xab(e4)
  ffffff003c57de00 i_ddi_load_drvconf+0x72(ffffffff)
  ffffff003c57de30 modctl_load_drvconf+0x37(ffffffff)
  ffffff003c57deb0 modctl+0x1ba(12, ffffffff, 3, 8079000, 8047e14, 0)
  ffffff003c57df00 sys_syscall32+0x1fc()

The i_ddi_load_drvconf frame in the stack above was certainly worth checking out-- you can tell just from the name that we're in a function which loads a driver.conf file.  We looked at the source code, here: i_ddi_load_drvconf().  The argument to this function, which you can see from the stack trace, is ffffffff, or -1.  You can see from the code that this indicates "load driver.conf files for all drivers" to the routine.  This ultimately results in a call to impl_make_parlist(m), where 'm' is the major number of the device. So what's the argument to impl_make_parlist()?  You can see it above, it's 'e4' (in hexadecimal).  Back in kmdb:

[12]> e4::major2name

This whole situation was odd-- why would the kernel be (seemingly) stuck, parsing ii.conf?  Normally, driver.conf files are a few lines of text, and should parse in a fraction of a second.  We thought that perhaps the parser had a bug, and was in an infinite loop.  We figured that the driver's .conf file might have gotten corrupted, perhaps with garbage data.  Then the payoff:

$ ls -l /usr/kernel/drv/ii.conf
-rw-r--r--   1 root    sys     3410823  May 12 17:39  /usr/kernel/drv/ii.conf

$ wc -l ii.conf
  262237 ii.conf

Wait, what?  ii.conf is 262,237 lines long?  What's in it?

# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.
# 2 indicates that if FWC is present strategy 1 is used, otherwise strategy 0.

This pattern repeats, over and over, with the number of ii_bitmap=1; lines doubling, for 18 doublings!  We quickly concluded that some script or program had badly mangled this file.  We don't use this driver on Jurassic, so we simply moved the .conf file aside.  After that, we were able to re-run the devfsadm command without problems.

Dan Mick later tracked down the offending script, the i.preserve packaging script for ii.conf, and filed a bug.  Excerpting from Dan's analysis:

 Investigating, it appears that SUNWiiu's i.preserve script, used as a class-action
 script for the editable file ii.conf, will:
 1) copy the entire file to the new version of the file each time it's run (grep -v -w                  
 with the pattern "ii_bitmap=" essentially copies the whole file, because lines                         
 consisting of "ii_bitmap=1;" are not matched in 'word' mode; the '=' is a word                         
 separator, not part of the word pattern itself)                                                        

 2) add two blank comment lines and one substantive comment each time it's run,                         
 despite presence of that line in the file (it should be tested before being                            

 3) add in every line containing an instance of "ii_bitmap=" to the new file                            
 (which grows as a power of two).  (this also results from the grep -v -w                               

 Because jurassic is consistently live-upgraded rather than fresh-installed,                            
 the error in the class-action script has multiplied immensely over time. 
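
The doubling described in the analysis is easy to reproduce in miniature.  The sketch below is a Python model of the three steps, not the actual i.preserve script: the regular expression only approximates grep's -w rule, the comment text is a placeholder, and the function names are invented.

```python
import re

# Approximation of `grep -w "ii_bitmap="`: the match must not touch a word
# character (letter, digit, underscore) on either side.
WORD_MATCH = re.compile(r"(?<!\w)ii_bitmap=(?!\w)")


def preserve_once(lines):
    """One hypothetical run of the class-action script, per the analysis above."""
    # Step 1: "grep -v -w" is meant to drop ii_bitmap lines but drops none,
    # because the "1" after "=" is a word character.
    kept = [ln for ln in lines if not WORD_MATCH.search(ln)]
    # Step 2: comment lines are appended unconditionally.
    kept += ["#", "# (substantive comment)", "#"]
    # Step 3: every line containing "ii_bitmap=" is appended again.
    kept += [ln for ln in lines if "ii_bitmap=" in ln]
    return kept


def bitmap_lines_after(runs):
    """Count ii_bitmap lines after `runs` upgrades, starting from a single one."""
    lines = ["ii_bitmap=1;"]
    for _ in range(runs):
        lines = preserve_once(lines)
    return sum(1 for ln in lines if "ii_bitmap=" in ln)
```

Under this model, bitmap_lines_after(18) gives 262144, the same order as the line count of the ii.conf we found.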

This was all compounded by the fact that driver properties (like ii_bitmap=1;) are stored at the end of a linked list, which means that the entire list must be traversed prior to insertion. This essentially turns this into a (n+1)*n/2 pathology, where n is something like 2^(LU+1) - 1 (LU here is the number of times the system has been live-upgraded).  Plugging in the numbers we see:

(2^18 * 2^18) / 2 = 2^35, or about 34 billion

I wrote a quick simulation of this algorithm, and ran it on Jurassic.  It is just an approximation, but it's amazing to watch this become an explosive problem, especially as the workload gets large enough to fall out of the cpu's caches.

LU Generation   List Items       List Operations        Time (simulated)
13              2^13 = 8192      2^25 = 33554432        .23s
14              2^14 = 16384     2^27 = 134217728
15              2^15 = 32768     2^29 = 536870912
16              2^16 = 65536     2^31 = 2147483648
17              2^17 = 131072    2^33 = 8589934592
18              2^18 = 262144    2^35 = 34359738368
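
For anyone who wants to see where the (n+1)*n/2 figure comes from, here is a toy version of the experiment.  It is not the simulation that produced the numbers above; it is a Python sketch with invented names that appends to a singly linked list with no tail pointer and counts every node touched.

```python
class Node:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


def append_all(count):
    """Append `count` items to a list with no tail pointer; return node visits."""
    head = None
    visits = 0
    for value in range(count):
        node = Node(value)
        visits += 1  # touching the new node itself
        if head is None:
            head = node
            continue
        cursor = head
        visits += 1  # visiting the head
        while cursor.next is not None:
            cursor = cursor.next
            visits += 1
        cursor.next = node
    return visits


def predicted_visits(count):
    """Closed form for the total: (n + 1) * n / 2."""
    return (count + 1) * count // 2
```

For small counts, append_all(n) and predicted_visits(n) agree exactly; for n = 2^18 the closed form gives a little over 2^35, which is why the last row of the table hurts so much.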

Rest assured that we'll get this bug squashed.  As you can see, you're safe unless you've done 16 or more live upgrades of your Nevada system!

Sunday May 11, 2008

A field guide to Zones in OpenSolaris 2008.05

I have had a busy couple of months. After wrapping up work on Solaris 8 Containers (my teammate Steve ran the Solaris 9 Containers effort), I turned my attention to helping the Image Packaging team (rogue's gallery) with their efforts to get OpenSolaris 2008.05 out the door.

Among other things, I have been working hard to provide a basic level of zones functionality for OpenSolaris 2008.05. I wish I could have gotten more done, but today I want to cover what does and does not work. I want to be clear that Zones support in OpenSolaris 2008.05 and beyond will evolve substantially. To start, here's an example of configuring a zone on 2008.05:

# zonecfg -z donutshop
donutshop: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:donutshop> create
zonecfg:donutshop> set zonepath=/zones/donutshop
zonecfg:donutshop> add net
zonecfg:donutshop:net> set physical=e1000g0
zonecfg:donutshop:net> set address=
zonecfg:donutshop:net> end
zonecfg:donutshop> add capped-cpu
zonecfg:donutshop:capped-cpu> set ncpus=1.5
zonecfg:donutshop:capped-cpu> end
zonecfg:donutshop> commit
zonecfg:donutshop> exit

# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   - donutshop        configured /zones/donutshop               ipkg     shared

If you're familiar with deploying zones, you can see that there is a lot which is familiar here.  But you can also see that donutshop isn't, as you would normally expect, using the native brand. Here we're using the ipkg brand. The reason is that commands like zoneadm and zonecfg have some special behaviors for native zones which presume that you're using a SystemV Packaging based OS. In the future, we'll make native less magical, and the zones you install will be branded native as you would expect. Jerry is actually working on that right now. Note also that I used the relatively new CPU Caps resource management feature to put some resource limits on the zone-- it's easy to do! Now let's install the zone:

# zoneadm -z donutshop install
A ZFS file system has been created for this zone.

      Image: Preparing at /zones/donutshop/root ... done.
    Catalog: Retrieving from ... done.
 Installing: (output follows)
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  49/49   7634/7634 206.85/206.85 

PHASE                                        ACTIONS
Install Phase                            12602/12602 

       Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around
Postinstall: Working around
       Done: Installation completed in 208.535 seconds.

 Next Steps: Boot the zone, then log into the zone console
             (zlogin -C) to complete the configuration process

There are a couple of things to notice, both in the configuration and in the install:

Non-global zones are not sparse, for now

Zones are said to be sparse if /usr, /lib, /platform, /sbin and optionally /opt are looped back, read-only, from the global zone. This allows a substantial disk space savings in the traditional zones model (which is that the zones have the same software installed as the global zone).

Whether we will ultimately choose to implement sparse zones, or not, is an open question. I plan to bring this question to the Zones community, and to some key customers, in the near future.

Zones are installed from a network repository

Unlike with traditional zones, which are sourced by copying bits from the global zone, here we simply spool the contents from the network repository. The upside is that this was easy to implement; the downside is that you must be connected to the network to deploy a zone. Getting the bits from the global zone is still desirable, but we don't have that implemented yet.

By default, zones are installed using the system's preferred authority (use pkg authority to see what that is set to). The preferred authority is then propagated into the zone. If you want to override that, you can specify a different repository using the new -a argument to zoneadm install:

# zoneadm -z donutshop install -a ipkg=http://ipkg.eng:80

Non-global zones are small

Traditionally, zones are installed with all of the same software that the global zone contains. In the case of "whole root" zones (the opposite of sparse), this means that non-global zones are about the same size as global zones-- easily at least a gigabyte in size.

Since we're not supporting sparse zones, I decided to pare down the install as much as I could, within reason: the default zone installation is just 206MB, and has a decent set of basic tools. But you have to add other stuff you might need. And we can even do more: some package refactoring should yield another 30-40MB of savings, as packages like Tcl and Tk should not be needed by default. For example, Tk (5MB) gets dragged in as a dependency of python (the packaging system is written in python); Tcl (another 5MB) is dragged in by Tk. Tk then pulls in parts of X11. Smallness yields speed: when connected to a fast package repository server, I can install a zone in just 24 seconds!

I'm really curious to know what reaction people will have to such minimalist environments. What do you think?

Once you start thinking about such small environments, some new concerns surface: vim (which in 2008.05 we're using as our vi implementation) is 17MB, or almost 9% of the disk space used by the zone!

Non-global zones are independent of the global zone

Because ipkg zones are branded, they exist independently of the global zone. This means that if you do an image-update of the global zone, you'll also need to update each of your zones, and ensure that they are kept in sync. For now this is a manual process-- in the future we'll make it less so.

ZFS support notes

OpenSolaris 2008.05 makes extensive use of ZFS, and enforces ZFS as the root filesystem. Additional filesystems are created for /export, /export/home and /opt. Non-global zones don't yet follow this convention. Additionally, I have sometimes seen our auto-zfs file system creation fail to work (you can see it working properly in the example above). We haven't yet tracked down that problem-- my suspicion is that there is a bad interaction with the 2008.05 filesystem layout's use of ZFS legacy mounts.

As a result of this (and for other reasons too, probably), zones don't participate in the boot-environment subsystem. This means that you won't get an automatic snapshot when you image-update your zone or install packages. That means no automatic rollback for zones. Again, this is something we will endeavor to fix.

Beware of bug 6684810

You may see a message like the following when you boot your zone:

zoneadm: zone 'donutshop': Unable to set route for interface lo0 to éÞùÞ$
zoneadm: zone 'donutshop': 

This is a known bug (6684810); fortunately the message is harmless.

In the next month, I hope to take a vacation, launch a discussion with our community about sparse root zones, and make a solid plan for the overall support of zones on OpenSolaris. I've got a lot to do, but that's easily balanced by the fact that I've been having a blast working on this project...

Songbird for Solaris

Looks like Alfred's hard work has paid off.  You can pull down a package of Songbird for OpenSolaris (see Alfred's blog entry for the links).  Songbird is a next-gen media player built atop the Mozilla platform.   Although I've had it crash once, on the whole it has worked quite well.  SteveL's mashtape extension is really neat, and you can see it in action in the screenshot below (it's the thing offering pictures, youtube videos, etc. at the bottom of the window).

Next steps would be to get this into the OpenSolaris package repository-- I hope that someday soon you will be able to pkg install songbird.

Nice work guys!

Wednesday Apr 09, 2008

Solaris 8 Containers, Solaris 9 Containers

In the flurry of today's launch event, we've launched Solaris 8 Containers (which was previously called Solaris 8 Migration Assistant, or Project Etude).  Here is the datasheet about the product.  Even better: We've also announced that Solaris 9 Containers will be available soon!  Jerry and Steve on the containers team have been toiling away like mad to make this possible.

Why the rename?  Well, for one thing, it's easier to say :)  It also signals a shift in the way Sun will offer this technology to customers:

  • Professional Services Engagement: No longer required, now recommended.  It's also simpler to order a SunPS engagement for this product.
  • Partners: (Some of) Sun's partners are now ready to deliver this solution to customers.  Talk to your partner for more information.
  • Right to Use: Previously, we provided a 90 day evaluation RTU.  Now, the RTU is unlimited.  However, you must still pay for support.

I invite you to download Solaris 8 Containers, and give it a try! And as always, talk to your local SE or Sales Rep if you're interested in obtaining support licenses for (or any kind of help with) your Solaris 8 (or 9) containers.

Here's Joost, our fearless marketing leader, with an informative talk about the why and how of Solaris 8 Containers. 

Friday Mar 28, 2008

OpenSolaris Elections Disappointment

The results of the OpenSolaris Elections are in.  Congratulations to the new board members!

So why did I entitle this post Elections Disappointment?  My complaint is with the "community priorities" question which was asked; this was a chance for core contributors to indicate what they felt were pressing priorities.  I was not happy with the OGB's choice (which I didn't know about until I voted) to reuse the same set of priorities questions from last year.  One of which was:

  • Deploy a public code review facility on

Which was subsequently identified by voters in this election as the #3 priority.  Did the OGB believe that we did not accomplish this goal in the past year?  I know that I took the results of the last poll (it was voted #3 in that poll, too) to heart, and worked hard to make that goal a reality, publicized it using the official channels, and have been enhancing it and taking care of it since that time, mostly on my own time.  I felt that my contribution was really undermined by the question.

Second: Who voted for this item as a priority?  If you voted for this as a priority, I would like to hear why you did (anonymously is fine).  Are you unaware that exists?  Are you unsatisfied with the service it provides?

I hope the new OGB will find a way to reformulate a more cogent poll question about priorities.

Monday Mar 24, 2008

gets an ATOM feed

For the past couple of weeks, I have been working late at night and on weekends to add an ATOM feed (i.e. blog feed) to, so that as people post new code reviews, they are automatically discovered and published.  Stephen has been heckling me to do this work for more than a year.  This weekend I managed to finish it, despite the incredibly nice weather in the Bay Area: I was stuck inside with a nasty cold.

As an aside, I'm looking for help with  This is a great opportunity for someone to step up and help out with an important part of our community infrastructure.  Send me mail.

You can check out the results of my hacking on  Or you can subscribe to the feed.  If you want to opt-out of publishing reviews, you can create a file called "opt-out" in the same directory as your webrev's index.html file.  Or you can create a file called "opt-out" in your home directory, if you'd like to opt out of all reviews.

Implementation Notes 

This was an interesting learning experience for me, since I had to learn a lot about ATOM in the process.  I also learned the XSLT language along the way as well, and how to process HTML using python.  All in all, I'd say this project took about 20 hours of effort, and resulted in about 500 lines of python code.  The most difficult problems to solve were:

  • I wanted the feed to include some meaningful information about the codereview.  If you subscribe to the feed using your favorite reader, you'll see that a portion of the "index.html" file from each webrev is included.  This is done using a somewhat tricky piece of python code.  In retrospect, using XSL for this might have been a better choice, although I've found that people have a tendency to introduce non-standard HTML artifacts into their webrev index.html files, and I don't know how well XSL would cope with that.

  • ATOM has some rules about generating unique and lasting IDs for things-- this is the contents of the <id> tag in the ATOM specification.  I found a lot of valuable information on dive-into-mark.  For, this was complicated by the fact that the user might log in and move their codereview around, or might copy one review over another.  In the end, I solved this by remembering the <id> tag in a dot-file which rides along with the codereview.  A cronjob roves around the filesystem looking for new reviews, and adds the special tag-file.  By storing the original <id> tag value, and looking at the modtime of the index.html file, I can correctly compute both the value of the <id> and <updated> fields for each entry.  If a user deletes a codereview, the dot-file will go away with it.

  • Once I had an ATOM feed I needed to transform it back into HTML for display on the home page.  The only problem was that there aren't a lot of good examples of this on the web-- many of the ATOM-to-HTML conversions only work with ATOM 0.3, not the 1.0 specification, and I didn't know the first thing about XPATH or XSL.  In the end, I only needed 25 lines or so of XSLT code.
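To make the second point concrete, here is a minimal sketch of the dot-file approach. It is not the code that runs on the site: the dot-file name ".atom-id" and the use of a random UUID URN as the <id> value are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path

ID_FILE = ".atom-id"  # hypothetical name for the dot-file


def entry_id(review_dir: Path) -> str:
    """Return the review's permanent <id>, creating the dot-file if needed."""
    id_path = review_dir / ID_FILE
    if id_path.exists():
        # Seen before: reuse the stored value so the <id> never changes,
        # even if the review directory has been moved.
        return id_path.read_text().strip()
    new_id = uuid.uuid4().urn  # assumption: a "urn:uuid:..." identifier
    id_path.write_text(new_id + "\n")
    return new_id


def entry_updated(review_dir: Path) -> str:
    """Return the <updated> value, taken from the modtime of index.html."""
    mtime = (review_dir / "index.html").stat().st_mtime
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")  # RFC 3339, UTC
```

Because the dot-file lives inside the review directory, it moves with the review and disappears when the review is deleted, which is the behavior described above.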

Future Work 

I think of the current implementation as a "1.0"-- it'll probably last us pretty well for a while.  One thing I'd like to research for a future revision is actually placing the entries into a lightweight blog engine, and letting it do the rest of the work.  Using an excellent list from Social Desire, I took a quick look at Blosxom, Flatpress, Nanoblogger, and some others.

Tuesday Jan 29, 2008

The joy of 'zpool scrub'

Some days, when it's cold and you're not feeling very motivated (like me, today), it's nice to do a zpool scrub on the machines you manage, and then once it's done:

$ zpool status
  pool: aux
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Jan 29 15:52:38 2008

        NAME          STATE     READ WRITE CKSUM
        aux           ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s7  ONLINE       0     0     0
            c1t1d0s7  ONLINE       0     0     0

errors: No known data errors

And then relax, knowing that your data is safe.
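If you manage several machines, the check can be scripted. The sketch below only understands the "scrub: scrub completed with N errors" line exactly as it appears in the output above; other ZFS releases may word that line differently, so treat the pattern as an assumption. The function name is made up for this example.

```python
import re
from typing import List

# Matches the scrub summary line as shown in the zpool status output above.
SCRUB_RE = re.compile(r"^\s*scrub: scrub completed with (\d+) errors", re.MULTILINE)


def scrub_error_counts(status_text: str) -> List[int]:
    """Return the error count from each completed-scrub line in the text."""
    return [int(count) for count in SCRUB_RE.findall(status_text)]
```

Feed it the captured output of `zpool status`; a result made up entirely of zeros is the signal to relax.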

Wednesday Nov 14, 2007

A big mess...

Recently I've been thinking about, learning about, and contributing to IPS, the Image Packaging System, a next-generation package management project under way in the OpenSolaris community.  IPS also happens to be the packaging system that Project Indiana has elected to use.  Stephen has written extensively about his thoughts on his blog.  And Bart has too.  IPS subsumes a lot of existing functionality which appears in various parts of the system today.  But a lot of people seem to be willing to look at it only as a package manager in the sense that "pkgadd" is a package manager.

My problem with this is that "pkgadd" is only a small part of a larger problem.  So, to explain that, I want to distill a series of email posts I made to pkg-discuss last month into a coherent blog entry, since I've referred back to them a couple of times.

My feelings on this topic are pretty nicely summarized by an article by Peter J. Denning which recently appeared in Communications of the ACM, entitled "Mastering the Mess."  The whole article is instructive, but see in particular: "Signs of Mess."

If one accepts Dr. Denning's "mess" framework, then the next question is whether we are in what he dubs, "a mess."  I personally think the answer is "yes."   In no particular order (apologies to anything I left out), as a community, we have:


  • SVR4 package creation tools
  • SVR4 package deployment tools
  • Sun's patch creation tools
  • Sun's patch application and inventory tools (patchadd, showrev -p)
  • PCA (Patch Check Advanced, a nice open source tool I use)
  • Solaris Patch Manager (smpatch)
  • pfinstall
  • Live Upgrade
  • flash archive creation and deployment tools
  • graphical install (old and dwarf caiman, etc)
  • ttinstall
  • Jumpstart
  • virt-install (from xVM)
  • zones install
  • zones attach/detach logic (which knows how to parse various patch and packaging databases)
  • So-called "toxic" patching for zones
  • Zones support for live upgrade (zulu)
  • BFU/ACR (update part of the system, but violates package metadata)
  • IDR (patches the system, but renders system subsequently unpatchable until IDR is removed and a "real patch" is applied)
  • Solaris Product Registry (I've never really understood what this was for, but you can try it via prodreg(1))
  • Service Tags -- a layer which adds "software RFID tags" in a sense: UUIDs backed by a Sun-maintained ontology; helps to inventory what is on your system.
  • pkg-get
  • Network Repositories (like Blastwave)
  • DVD media & CD media construction tools (several of these, I think)
  • Various other unbundled products which promise to ease "patching pain"
  • Various system minimization tools
  • Layered inventory management tools
  • Numerous hand-rolled and home-grown solutions built on some or all of the above.

Some parts of the mess represent great (from the perspective of those caught up in the mess) technologies which people have spent a lot of time and effort building.  But a lot of the above represent accreted layers with duplicated functionality.  In some cases, the various layers interact in complex, subtle, and perhaps interface-violating ways.  To people outside of the mess (i.e. new users we would like to entice) the mess looks bizarre, and terrible.  Another sign of a "big mess":  In several cases, huge engineering efforts have resulted in only modest improvements.  In some cases, huge engineering efforts have been total failures: Sun attempted a rewrite of the SVR4 packaging system in the early part of this decade, and the project basically failed.

It's easy to look at the above list and feel a sense of hopelessness-- how will we *ever* improve upon this situation?  Will people keep creating new and different tools which add more layers?

I'll cite a second source which has helped guide my thinking on this topic: Jeff Bonwick.  Jeff spent years relentlessly seeking out and blowing up duplicated and broken kernel functionality, and then took on the storage stack.  The result was ZFS, which was recently labelled "a rampant layering violation" by a detractor.  Jeff responded this way.  In particular, Jeff said:

"We found that by refactoring the problem a bit -- that is, changing           
where the boundaries are between layers -- we could make the whole thing
much simpler."

To me, that summarizes what the *opportunity* is here: to rethink the layers and to merge and unmerge them to come to a more complete, efficient, modern

IPS is heading in this direction: Packaging, patching, upgrade, live upgrade, the mechanisms for Software Delivery, the toolset for delivering packages/patches, and the software-enforced policy decisions seem to be condensing here into a coherent entity-- which means we'll have many fewer layers.  And because the system will be fast, lightweight, redistributable and shared, we should also be able to discard artifacts such as BFU and ACR (in other words, OpenSolaris developers will use the same tools our customers use to update systems).  The huge amount of code which handles zones patch and packaging should be greatly reduced.  Package dependencies will be far more accurate and minimization will be easier, and diskless client support should be far more robust.

What I see with Caiman, IPS and Distro Constructor is the opportunity to do for software delivery and update to OpenSolaris systems what ZFS did for storage management.  I do not think we have all the answers just yet, but I think we can get there.

Monday Oct 22, 2007

Solaris 8 Migration Assistant 1.0 (Project Etude) Ships!

I'm very happy to announce that the Solaris 8 Migration Assistant 1.0 (also known as Project Etude) has shipped!  The product is now officially available from Sun.  Some key links:

In a nutshell, the product provides a migration solution from Solaris 8 to Solaris 10 by creating a bridge between the two operating systems.  You can perform P2V (physical-to-virtual) conversions of existing Solaris 8 systems, and drop those into Solaris 8 containers running on your Solaris 10 host.

Above all, I want to take another chance to thank the many people who worked extremely hard for the past eight months to make this project a reality.  It was a sprint from start to finish, which is certainly tough on everyone involved.  But I was amazed and pleased that almost universally, people helped us out with dedication and a good sense of humor.  Thank you very much.

Saturday Oct 13, 2007

now live, exits beta

On Thursday I posted this announcement to opensolaris-announce, which notes that I have taken out of beta status. Please take a look, use, and enjoy...

I am pleased to announce that, the OpenSolaris
Code Review site, is now officially part of the
infrastructure. I encourage everyone in the community to use it.

In the March 2007 community priorities poll, the community declared:

#3 Deploy a public code review facility on
... was designed to meet that goal. It has been in Beta
test for some time, and after some final tweaking[1] it is now ready for

You will need an SSH key linked to your account in order to use it.
For more information on SSH keys, see

Once you have a registered key, see for
information about how to gain access to this system. We request
that users review the terms of service before using it. Problems can be
reported to website-discuss at opensolaris dot org.

Finally, if you would like to help out with the coding/maintenance of
this site, please visit the site and read the "Call for Help" section.


Dan Price

[1]: Specifically, concerns about who may access the system have been

Kernel Gardening.

