Sunday Aug 16, 2009

Moving to XWiki

I'm afraid I haven't made any progress on my OSCON trip report this week. We've started the beta testing for migrating from the current portal application to XWiki. I've been reviewing the ON Developer Reference, and it's taken more of my time than I had expected. (In fact, I'm still not done.)

If you're a community or project leader, please do take the time to review the pages that you're responsible for. Some issues will be easier to fix if they are identified before the migration. And the migration team needs user feedback to help identify which issues cause the most trouble.

If you just have a question, you can ask on the website-discuss list. If you're sure you've found a bug, either in the migration code or in the new XWiki-based site, file a bug under Development: product=website, component=site-wiki.

Friday Aug 07, 2009

OSCON 2009: Lying and Geniuses

Alas, it's taking me much longer to write up my OSCON trip report than I had planned. Part of that is that I've been busy with other things, like resurrecting a testbed for the SCM framework, so that I can start on the workaround for the "unwanted mounts" issue. And part of it is that it's just taking me longer to do the writing, despite already having an outline. So since I have finished my comments on a couple talks, I'll go ahead and post those now.

My favorite session at OSCON was "How to Lie Like a Geek" by Michael Schwern, which was a lively talk about things we can do to help or hinder communication. Most of us are familiar with technical lying, particularly bad benchmarking and abuse of statistics. Michael recommended as a place to get good benchmarks.

Michael also suggested several ways to "lie" that are more generic. Many of these had to do with getting so caught up in details that the big picture is lost. For example, there's lying by information overload: providing too much detail. There's also lying by pedantry or by being excessively literal: focusing on the wrong details.

A fascinating way to mislead is to state the obvious. For example, consider this exchange (which I've adapted from the example Michael used):

Alice: We should change this code.
Bob: But that could introduce bugs.
Alice: Of course. Why are you arguing against making the change?
Bob (annoyed): I'm not, I'm just saying we might introduce bugs. Don't put words in my mouth.

I've seen exchanges like this in the past. At best, there's this unnecessary hiccup in the conversation. At worst, the conversation goes off into the weeds, with endless rounds of claims and counter-claims about who said what and what was really meant.

When I mentioned this example to my wife, she pointed out that people often make implicit requests in their statements. So if Bob says something obvious, he's clearly not offering new information, so it's not surprising for Alice to interpret his comment as a request. If Bob had a different request in mind, like "let's defer that until after the code freeze", it'd be more helpful to say that in the first place.

Another talk that I liked a lot was "Programmer Insecurity and the Genius Myth", by Ben Collins-Sussman and Brian Fitzpatrick. I had to chuckle at this quote, which Ben and Brian said came from the 2008 Google I/O developer conference:

Can you guys make it possible to create open source projects that start out hidden to the world, then get "revealed" when they're ready?

The site supports hidden projects. And for a while we created new projects as hidden, so that the project team could get the project page set up before making the project visible. But hiding the project doesn't provide that much benefit. After all, the project space is a work area; it's okay for things to be rough. And we've had a few problems with projects lingering in hidden mode for a long time. So we've moved towards making projects visible from the start.

Of course, it's natural for people to want to get things right before sharing them with the outside world. But there are potential advantages to sharing early. One is that if you're going down the wrong path, early sharing improves the odds that you'll discover the problem quickly. And if you have to explore a couple dead-ends before you come up with something that works, the information from those explorations is available for others to learn from.

And then there's the "bus factor", which refers to the number of people who would have to all be run over by a bus to stop the project. Working in the open--with archived discussions, publicly visible code, any plans and documentation that you've written down--makes it easier to recover if a team member is unavailable for whatever reason.

Ben and Brian did point out that it is possible to share too early. The project needs to be far enough along that it won't get stalled when outsiders show up and start asking questions and suggesting changes.

But since working in the open like this can be challenging, what are things we can do to encourage it? Ben and Brian made several suggestions, both social and technical.

On the social side, the first thing to remember is not to let your ego get tied up in your design or code. You've probably heard that before, but I think it bears repeating. When someone points out a problem in my design or code, I try to think of it as an opportunity to improve something that I care about, and as an opportunity to learn. But that's often not my initial, automatic response; it takes some effort.

Even if you don't expect your project to benefit from outsiders' comments, engaging in a conversation can have benefits. As Ben and Brian put it, if you want to influence people, you need to be open to influence.

On the technical side, consider what behavior your tools are encouraging. For example, the current portal doesn't keep page histories, which discourages its use for collaborative writing and editing. That's one of the things that will be fixed by the move to XWiki in September.

Ben and Brian also recommended responding to questions and arguments on the project web site, rather than by email. With that approach, they said, the discussion is less likely to degenerate into pointless argumentation. I don't recall them saying why they think this works. I suppose one reason is that it helps keep arguments from being repeated.

Wednesday Jul 29, 2009

OSCON 2009

The O'Reilly Open Source Convention (OSCON) was at the San Jose Convention Center this year. It's been in Portland in the past, and while I like Portland, the additional expenses for air fare and hotel probably would have meant staying home this year. So I'm glad it was in San Jose. We'll see where O'Reilly decides to hold it next year. At the feedback session after the closing keynote, there seemed to be quite a few people who would like the conference to return to Portland.

Lunches on Wednesday and Thursday were catered. At my last OSCON (2006) the conference provided basic box lunches; nothing special. The lunches this year were impressive, with a variety of well-prepared dishes.

I attended several talks, which I've organized into two categories: people talks and technology talks. The people talks covered things like communication skills, community-building, etc. The technology talks that I went to covered topics that I wanted to learn more about, either for use at work or for personal projects. I also went to a couple of BOFs and the closing keynote session. In order to make the writeups more digestible, I'll cover the BOFs and keynote here, with separate postings over the next few days for the "people" and "technology" sessions.


I went to Brian Nitz's SourceJuicer BOF session Wednesday night. Alas, only a few people attended, mostly from Sun.

Brian gave a short presentation and demo. This helped fill in some of the holes in my knowledge of SourceJuicer, like how Pkgfactory and Robo-Porter fit into the picture. (Pkgfactory is an automated mechanism that feeds into SourceJuicer. Robo-Porter is a component of Pkgfactory.)

Individuals can contribute spec files to SourceJuicer, though apparently the tags aren't quite the same as they are for RPM spec files. Builds are done in a freshly-created zone which has a minimal build environment. Thus the spec file must list build dependencies (as well as runtime dependencies).

Thursday night I went to the Silicon Valley OpenSolaris Users Group (SVOSUG) meeting, which was relocated to OSCON for this month. John Weeks demoed a couple toys that he has built and talked about what went into making them. John Plocher also demoed his programmable xylophone. I always find this sort of presentation fascinating, even though I've never built anything similar myself.

Closing Keynote

A few things struck me from the closing keynote address, which was given by Jim Zemlin of the Linux Foundation.

The first thing that grabbed my attention was how Microsoft's attitude towards open source and the GPL has changed over the years. Jim had a Microsoft quote from a few years ago about how open source and the GPL are just horrible (a threat to business, if I remember correctly). But this year, Microsoft is releasing some code under the GPL because that's what customers want.

The second thing was Jim's discussion about the introduction of netbooks and the changes that he predicts for the PC business ecosystem. In particular, Jim predicts that wireless service providers will be making discounted PCs available, just as they make discounted cell phones available today. If I understand his argument correctly, end users will focus more on the applications and services that they get from the wireless provider, and less on the underlying operating system. And the platform - hardware plus operating system - providers will be under pressure to make their platforms attractive to the wireless service providers (cheap, good functionality). The resulting competitive pressure should improve the opportunities for operating systems other than Windows (Jim was focusing on Linux, of course).

I suppose this could happen; I don't know the PC business well enough to have an opinion. It'll certainly be interesting to watch. And it'll be interesting to see whether these subsidized netbooks are treated more like computers or like phones. Linux is already bundled on some netbooks, and from what I've read, users can run into problems if they upgrade from someone other than the netbook provider. And Jim mentioned that the Linux Foundation has been getting a lot more phone calls in the last 6 months. Some of them are from people offering kudos; others are like the one he played for us, which was from a guy who needed technical support. Compare that with my cell phone: I know who made the hardware, but I have no idea what OS it's running.

The third item was a discussion about software patents and the danger they present to the open source movement. This was mostly old news to me, but Jim mentioned something I hadn't heard about before: One of the problems with the current patent system, at least in the USA, is getting information recorded so that it is counted as prior art. Filing defensive patents is a pain, and the US patent office doesn't keep up on the zillions of articles that are published at academic conferences and in trade magazines. You can challenge a patent after it's been granted, but that's a pain, too. But now there's an organization dedicated to collecting technical disclosures and publishing them in a prior art database that the patent office will check. Very cool.

Tuesday Jul 21, 2009

Unwanted Mounts

As described in the design document, source code access on is done via ssh. The user doesn't invoke ssh directly. Rather, the user runs Mercurial (or Subversion), which invokes ssh using its standard processing for ssh URLs. Once connected to the server, a custom restricted shell invokes the server-side program. This is all done in a chroot environment, with loopback mounts providing access to only those repositories that the user has write access to.

The loopback mounts are created when the user logs in, and they are torn down when the source code management (SCM) operation completes. This is done by way of a custom PAM module. As part of the session's "open" processing, the module determines what repositories to grant access to, and it establishes those mount points. As part of the session's "close" processing, it removes those mount points.

We recently noticed that the loopback mounts were not getting unmounted. This causes a couple problems. One is that thousands of unused loopback mounts accumulate on the server. If nothing else, this makes life more difficult for administrators.

The lingering mounts can also lead to a denial of service problem, which we've witnessed a few times. The problem occurs if a repository is deleted and recreated while there is still a loopback mount for it. Future references to the loopback mount will fail with an error. This can interfere with the setup of a user's loopback mounts in a subsequent login, resulting in a situation where users are unable to access recently created repositories. Worse, attempts to unmount the broken loopback mount fail, and lofs doesn't support forced unmount. So the only way to recover is to reboot the server.

After the third or so instance of this, we decided to figure out why the loopback mounts were not getting unmounted. Arguments can be passed to a PAM module by putting them after the module name in /etc/pam.conf, and there's a convention to enable debugging output with the argument "debug", e.g.,

other	session requisite	debug

For this to be useful, syslogd needs to be configured to display the debug output. For example, put

auth.debug	/var/adm/auth.log

in syslog.conf and utter

# svcadm restart system/system-log

Once we made these two changes, we could see that the session-open routine was running normally, but it didn't look like the session-close routine was getting invoked.

This seemed awfully strange, so we enabled PAM framework debugging with

# touch /etc/pam_debug

(This, too, requires that syslogd be configured to put auth.debug output somewhere accessible.)

This showed that our session-close routine was, in fact, being invoked.

Looking more closely at the session-close routine, we noticed that it checks what user it is invoked as. If it's not invoked as uid 0, it bails out before doing any debug logging. Moving the debug logging to come before the uid check confirmed that it was running as the user whose session was ending.
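To make the ordering problem concrete, here's a minimal Python sketch. It is not the PAM module's code (which isn't shown here); the function names and the uid 1001 are made up for illustration. It only shows why a privilege check that comes before the logging makes the routine look as if it never ran.

```python
# Toy model of a session-close handler; not the actual PAM module.
def close_session_log_after_check(uid, log):
    """Original ordering: bail out before any debug logging."""
    if uid != 0:
        return False          # silent exit: nothing reaches the log
    log.append(f"close: running as uid {uid}")
    return True


def close_session_log_before_check(uid, log):
    """Revised ordering: log first, then check privileges."""
    log.append(f"close: running as uid {uid}")
    if uid != 0:
        log.append("close: not uid 0, skipping unmount")
        return False
    return True


if __name__ == "__main__":
    for handler in (close_session_log_after_check,
                    close_session_log_before_check):
        log = []
        handler(1001, log)    # 1001 is a made-up unprivileged uid
        print(handler.__name__, log)
```

With the first ordering the log stays empty for an unprivileged caller, which matches what we saw: no sign that session-close ran at all.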

Some Googling revealed a known issue in OpenSSH (from which the Solaris SSH is derived) in which the session-close routine is called as the session's user, not uid 0.

From the comments in the OpenSSH Bugzilla, it looks like a fix is available from upstream, so we're hopeful that we just need to talk to the Sun SSH team about getting the fix into OpenSolaris. We're also looking into possible workarounds, in case the fix can't be pulled in promptly.

Update 2009-09-16

I filed a bug for this: 6869790.

The current status is that the Solaris SSH team is discussing possible fixes, but they haven't come up with a good approach yet. Just reverting the code isn't an option because it would break support for hardware acceleration. And the upstream privilege separation code is different from the code in Solaris, so they can't just use the upstream patch.

Friday Feb 13, 2009

OpenSolaris and gnuserv

I installed OpenSolaris 2008.11 on my notebook (a VAIO TX) several weeks ago. I've been tweaking the environment, in preparation for the day when I move to OpenSolaris on my desktop system.

One of the issues that came up was that gnuserv would exit immediately after being started. This meant that every time I wanted an editor (e.g., for a Mercurial commit), I had to wait for a new XEmacs process to start.

I looked around for some sort of error message but couldn't find anything. I finally started XEmacs using truss -f. Looking at the truss output, I saw that gnuserv was looking in /etc/hosts and not finding an entry with the notebook's hostname ("loiosh").

I added "loiosh" to the localhost line in /etc/hosts, and that fixed the problem.
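A quick way to check for this kind of problem is to see whether the hostname shows up in the hosts file at all. The sketch below is a rough Python illustration, not what gnuserv does internally; the sample file contents (including the 127.0.0.1 address) are assumptions for the example.

```python
def hostnames_in(hosts_text):
    """Return the set of names that appear in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]      # drop comments
        fields = line.split()
        names.update(fields[1:])          # fields[0] is the address
    return names


# Assumed sample contents, before and after the fix.
BEFORE = "127.0.0.1 localhost\n"
AFTER = "127.0.0.1 localhost loiosh\n"

if __name__ == "__main__":
    print("loiosh" in hostnames_in(BEFORE))
    print("loiosh" in hostnames_in(AFTER))
```

To run it against a real system, pass in the text of /etc/hosts instead of the sample strings.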

Wednesday Feb 11, 2009

Mercurial pretxn Hook Race

Currently the ON gate (or at least the open source part) is mirrored on We were having a discussion the other day about what needs to be done so that we can actually host it on

One of the issues that came up is the race in the Mercurial pre-transaction hooks, such as the pretxnchangegroup hook. These hooks let a repository reject pushes that don't meet whatever criteria that the hook has set. For the ON gate, we use it for things like making sure there is an approved RTI for the changegroup.

The problem is that the implementation of these hooks opens up a race condition. The metadata for the changegroups gets written to the repository, then the pre-transaction hook gets run. The advantage of this approach is that the pre-transaction hooks can use existing APIs and code paths when examining the incoming changegroup. But Mercurial repositories are structured so that readers don't need a lock; instead they depend on an atomic update of the top-level metadata. So the disadvantage of this approach is that there's a window during which someone pulling from the gate could get the pending changegroup, even if the hook later rejects it.
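The window is easier to see in a toy model. The Python sketch below does not use Mercurial's code or storage format; ToyRepo, reject_all, and the during_hook callback are all invented for illustration. It just mimics the ordering described above: write first, run the hook second, roll back on rejection.

```python
class ToyRepo:
    """Toy stand-in for a repository; not Mercurial's storage format."""

    def __init__(self):
        self.changesets = []              # what lock-free readers see

    def pull(self):
        return list(self.changesets)

    def push(self, incoming, hook, during_hook=None):
        start = len(self.changesets)
        self.changesets.extend(incoming)  # written before the hook runs
        if during_hook:
            during_hook()                 # a concurrent reader arrives here
        if not hook(incoming):
            del self.changesets[start:]   # roll back on rejection
            return False
        return True


def reject_all(incoming):
    """A hook that rejects every push."""
    return False


if __name__ == "__main__":
    repo = ToyRepo()
    leaked = []
    repo.push(["bad-change"], reject_all,
              lambda: leaked.extend(repo.pull()))
    print("reader saw:", leaked, "repo now has:", repo.pull())
```

The reader that pulls while the hook is running ends up with a change that the repository itself no longer contains.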

This issue is described in Section 10.3 of Bryan O'Sullivan's Mercurial book; it is also issue 1321 in the Mercurial bug tracker. The workaround that the Mercurial book describes is the one that we used for the ON gate: the repository that people push to is write-only. After the pre-transaction hooks have cleared the changegroup, another hook pushes the changegroup to a second clone repository, which developers pull from.

While this approach is functional, it's not esthetically pleasing. And there's a practical problem: the SCM infrastructure on doesn't support having two repositories tied together like that. I'm sure it could be done, but administration (e.g., updating the access list) would be clumsy, and it might require giving the ON gatekeepers shell access to the servers (which would not please them or the server administrators).

Fortunately, Matt Mackall has devised a fix for the race condition. The new changegroup will not be visible for pulls until it has passed the pre-transaction hooks. And if I understand correctly, the fix will not require changes to existing hooks, except for the case of Python hooks that spawn subprocesses.
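I haven't studied how the fix is implemented, so the following Python sketch only illustrates the visible behavior: the hook gets to examine the incoming changes, but readers see nothing until the hook has accepted them. PendingRepo and its methods are hypothetical names, not Mercurial APIs.

```python
class PendingRepo:
    """Toy model: incoming changes stay out of readers' view until accepted."""

    def __init__(self):
        self.visible = []

    def pull(self):
        return list(self.visible)

    def push(self, incoming, hook, during_hook=None):
        pending = self.visible + list(incoming)   # only the hook sees this
        if during_hook:
            during_hook()                         # a concurrent reader
        if not hook(pending):
            return False                          # nothing was published
        self.visible = pending                    # publish in one step
        return True


if __name__ == "__main__":
    repo = PendingRepo()
    seen = []
    repo.push(["bad-change"], lambda cg: False,
              lambda: seen.extend(repo.pull()))
    print("reader saw:", seen)
```

In this model a reader that pulls during the hook sees only what was already published, whether or not the hook later rejects the push.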

There are other changes that we will probably make before hosting the gate on For example, we'll probably change the SCM console (the web interface for managing repositories) so that it scales better for large numbers of committers. But getting a fix for this race condition means we'll have one less issue to deal with.

Sunday Oct 05, 2008

Printing Pages with Recent Builds

Back in August I upgraded my desktop to snv_95. I sometimes print pages from to read during my commute, but with snv_95 the pages came out pretty much unreadable. They looked like they had been through several fax transmissions, with blotchy, almost indecipherable characters. At the time I chalked it up to known issues with fonts and went back to running an earlier build (thank you, Live Upgrade).

I revisited the issue last week, after noticing that the headers and footers from Firefox looked okay. It was just the main text that was messed up. I checked my preferences (Content>Fonts&Colors>Advanced)--the checkbox "Allow pages to choose their own fonts" was enabled. I disabled it and tried again, and now the printed pages are legible.

Wednesday Oct 01, 2008

Using the MH date2local function

Bill Janssen recently suggested on the MH-E developers list a different format for showing the contents of a folder. Normally MH-E just shows the month and day, e.g., 09/17, for the date. One of Bill's suggestions was that if the message is less than a day old, MH-E could show the time instead, e.g., 13:45.

I've been happy just seeing the date, but I could see how showing the time might be useful. I did have one concern, which was whether MH (which is what MH-E runs on top of) would use the message's timezone or my local timezone. I routinely get email from all over the USA, plus the United Kingdom, China, India, Australia, and Japan. To get an accurate sense of when the emails were sent--for the timestamps to be useful--I'd want to see them all in my timezone.
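Here's the behavior I was after, sketched in Python rather than in MH format syntax. The function name folder_date and the sample Date headers are made up for the example, and a fixed UTC-7 offset stands in for my local timezone so that the result doesn't depend on the machine it runs on.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def folder_date(date_header, now, local_tz):
    """Show HH:MM for messages under a day old, else MM/DD, in local_tz."""
    sent = parsedate_to_datetime(date_header).astimezone(local_tz)
    if now - sent < timedelta(days=1):
        return sent.strftime("%H:%M")
    return sent.strftime("%m/%d")


if __name__ == "__main__":
    pacific = timezone(timedelta(hours=-7))   # fixed offset, for the example
    now = datetime(2008, 10, 1, 12, 0, tzinfo=pacific)
    # Sent from the UK earlier the same day.
    print(folder_date("Wed, 01 Oct 2008 13:45:00 +0100", now, pacific))
    # Sent from Japan two weeks earlier.
    print(folder_date("Wed, 17 Sep 2008 09:00:00 +0900", now, pacific))
```

The point is the astimezone step: without it, the 13:45 message from the UK would be displayed as 13:45 even though it arrived early in my morning.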

Unfortunately, some experimentation showed me that Bill's patch used the sender's timezone.

But poking around in the mh-format man page showed a date2local function that would convert the displayed time to my local timezone. Ah, just what I wanted.

I first tried using it in something like


but that produced error messages. Looking more closely at the man page showed me that date2local works by side-effect; it doesn't return an updated date string. Okay, so how do you use it? I couldn't figure it out from the man page, and Googling for date2local didn't show any usage examples.

But I did get some Google hits that reminded me of the sample format files that typically ship with MH. After staring at them for a bit, I tried


and that worked.

Tuesday Aug 26, 2008

Handsome Heron

Over the weekend I upgraded an Ubuntu box to 8.04 LTS (Hardy Heron). The default background image is a way cool rendering of a heron. If you go walking along the bayshore here you can sometimes see a heron, so I was delighted to see this new background. It's gorgeous.

Wednesday Apr 09, 2008

Converting Projects to Mercurial

One of the things that we consider when deprecating components of (Open)Solaris is how users move from the old software to the new software. We've applied that principle to the SCM Migration project, so we've been working on documentation (e.g., a Mercurial cheat sheet for TeamWare users), and the updated tools work with both TeamWare and Mercurial. Also, we don't want to tie the schedules of large projects to the SCM Migration schedule or vice versa. So we need to support projects that are begun under TeamWare, but which are still under development when we're ready to move the gate from TeamWare to Mercurial. That support is provided by a new script called wx2hg.

In general, it's hard to convert a TeamWare workspace to Mercurial, at least if you want to maintain history. But ON already has a policy that putbacks should (usually) add a single delta. That is, any project-specific history will be lost anyway. That makes the job of wx2hg a lot easier.

Suppose you have a project gate--call it my-proj--that is a child of onnv-gate, the ON master gate. We already maintain a Mercurial mirror of onnv-gate, which I will call onnv-hg for now. So when you're ready to move to Mercurial, what you want is a child of onnv-hg. That child should have the same changes relative to onnv-hg that my-proj has relative to onnv-gate.

It turns out that it is pretty easy for wx2hg to do this. The wx front-end keeps track of renames and files with contents changes. So wx2hg just needs to get that information from wx and apply it to a child of onnv-hg. The rest of the script is error detection and handling.
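In outline, the idea looks something like the Python sketch below. This is not the wx2hg source; it models the child repository as a plain dictionary from path to contents, and apply_changes and the sample paths are invented for illustration.

```python
def apply_changes(tree, renames, new_contents):
    """Return a copy of tree with renames applied, then content changes."""
    result = dict(tree)
    for old, new in renames:
        if old not in result:
            raise KeyError(f"can't rename: {old} doesn't exist")
        result[new] = result.pop(old)
    for path, text in new_contents.items():
        if path not in result:
            raise KeyError(f"can't patch: {path} doesn't exist")
        result[path] = text
    return result


if __name__ == "__main__":
    # Hypothetical two-file tree standing in for a child of onnv-hg.
    child = {"tools/a.sh": "old a", "tools/b.sh": "old b"}
    renames = [("tools/a.sh", "deleted_files/tools/a.sh")]
    edits = {"tools/b.sh": "new b"}
    print(apply_changes(child, renames, edits))
```

The two error cases correspond to the "rest of the script": if the child doesn't match what the workspace expects, the conversion has to stop rather than guess.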

Let's walk through an example.

Suppose I have a workspace that deletes all the SCCS helper scripts in usr/src/tools. And to demonstrate renames, it renames the scripts directory makefile to

$ pwd
$ putback -n

Would put back name changes: 10

rename from: usr/src/tools/scripts/Makefile
         to: usr/src/tools/scripts/
rename from: usr/src/tools/scripts/sccscheck.1
         to: deleted_files/usr/src/tools/scripts/sccscheck.1
rename from: usr/src/tools/scripts/
         to: deleted_files/usr/src/tools/scripts/
rename from: usr/src/tools/scripts/sccscp.1
         to: deleted_files/usr/src/tools/scripts/sccscp.1
rename from: usr/src/tools/scripts/
         to: deleted_files/usr/src/tools/scripts/
rename from: usr/src/tools/scripts/
         to: deleted_files/usr/src/tools/scripts/
rename from: usr/src/tools/scripts/sccsmv.1
         to: deleted_files/usr/src/tools/scripts/sccsmv.1
rename from: usr/src/tools/scripts/
         to: deleted_files/usr/src/tools/scripts/
rename from: usr/src/tools/scripts/sccsrm.1
         to: deleted_files/usr/src/tools/scripts/sccsrm.1
rename from: usr/src/tools/scripts/
         to: deleted_files/usr/src/tools/scripts/

The following files are currently checked out and have been edited in workspace
No changes were put back

Note that although is checked out, it need not be.

Converting this to Mercurial is simple. If your TeamWare workspace is in a directory that you have write access to, just point wx2hg at it.

$ pwd
$ /opt/onbld/bin/wx2hg

wx2hg first creates a Mercurial child (this step can take a few minutes). The child is created in the same directory as the TeamWare workspace, with the same name plus "-hg".

requesting all changes
adding changesets
adding manifests
adding file changes
added 6349 changesets with 91335 changes to 49774 files
44994 files updated, 0 files merged, 0 files removed, 0 files unresolved

wx2hg then initializes wx if you haven't already done so. If the workspace is already under wx control, it does a "wx update" to ensure it will get up-to-date information about the workspace.

Initializing wx...
New renamed file list:
New active file list:
Will backup wx and active files if necessary
wx initialization complete

wx2hg then checks out all the files with contents changes. We want to put the files into Mercurial with unexpanded SCCS keywords, and checking them out is a quick hack to help us do so.

usr/src/tools/scripts/ already checked out

wx2hg then processes the rename list.

rename usr/src/tools/scripts/Makefile -> usr/src/tools/scripts/
rename usr/src/tools/scripts/sccscheck.1 -> deleted_files/usr/src/tools/scripts/sccscheck.1
rename usr/src/tools/scripts/ -> deleted_files/usr/src/tools/scripts/
rename usr/src/tools/scripts/sccscp.1 -> deleted_files/usr/src/tools/scripts/sccscp.1
rename usr/src/tools/scripts/ -> deleted_files/usr/src/tools/scripts/
rename usr/src/tools/scripts/ -> deleted_files/usr/src/tools/scripts/
rename usr/src/tools/scripts/sccsmv.1 -> deleted_files/usr/src/tools/scripts/sccsmv.1
rename usr/src/tools/scripts/ -> deleted_files/usr/src/tools/scripts/
rename usr/src/tools/scripts/sccsrm.1 -> deleted_files/usr/src/tools/scripts/sccsrm.1
rename usr/src/tools/scripts/ -> deleted_files/usr/src/tools/scripts/

After the renames, it applies a patch for each modified file...

patching file usr/src/tools/scripts/

...and then you're done.

$ ls -dF *demo*

You can verify that wx2hg transferred all your changes:

$ cd
$ hg diff -g
diff --git a/usr/src/tools/scripts/sccscheck.1 b/deleted_files/usr/src/tools/scripts/sccscheck.1
rename from usr/src/tools/scripts/sccscheck.1
rename to deleted_files/usr/src/tools/scripts/sccscheck.1
diff --git a/usr/src/tools/scripts/ b/deleted_files/usr/src/tools/scripts/
rename from usr/src/tools/scripts/
rename to deleted_files/usr/src/tools/scripts/
diff --git a/usr/src/tools/scripts/sccscp.1 b/deleted_files/usr/src/tools/scripts/sccscp.1
rename from usr/src/tools/scripts/sccscp.1
rename to deleted_files/usr/src/tools/scripts/sccscp.1
diff --git a/usr/src/tools/scripts/ b/deleted_files/usr/src/tools/scripts/
rename from usr/src/tools/scripts/
rename to deleted_files/usr/src/tools/scripts/
diff --git a/usr/src/tools/scripts/ b/deleted_files/usr/src/tools/scripts/
rename from usr/src/tools/scripts/
rename to deleted_files/usr/src/tools/scripts/
diff --git a/usr/src/tools/scripts/sccsmv.1 b/deleted_files/usr/src/tools/scripts/sccsmv.1
rename from usr/src/tools/scripts/sccsmv.1
rename to deleted_files/usr/src/tools/scripts/sccsmv.1
diff --git a/usr/src/tools/scripts/ b/deleted_files/usr/src/tools/scripts/
rename from usr/src/tools/scripts/
rename to deleted_files/usr/src/tools/scripts/
diff --git a/usr/src/tools/scripts/sccsrm.1 b/deleted_files/usr/src/tools/scripts/sccsrm.1
rename from usr/src/tools/scripts/sccsrm.1
rename to deleted_files/usr/src/tools/scripts/sccsrm.1
diff --git a/usr/src/tools/scripts/ b/deleted_files/usr/src/tools/scripts/
rename from usr/src/tools/scripts/
rename to deleted_files/usr/src/tools/scripts/
diff --git a/usr/src/tools/scripts/Makefile b/usr/src/tools/scripts/
rename from usr/src/tools/scripts/Makefile
rename to usr/src/tools/scripts/
--- a/usr/src/tools/scripts/
+++ b/usr/src/tools/scripts/
@@ -50,11 +50,6 @@ SHFILES= \
 	nightly \
 	onblddrop \
 	protocmp.terse \
-	sccscheck \
-	sccscp \
-	sccshist \
-	sccsmv \
-	sccsrm \
 	sdrop \
 	webrev \
 	ws \

Note that you still need to do "hg commit" to check in your new version.

All this assumes that your workspace is in sync with /ws/onnv-clone. If it isn't, you may get messages like

wx2hg: can't rename: usr/src/tools/scripts/sccscheck.1 doesn't exist.


wx2hg: usr/src/tools/scripts/ parent mismatch; 
  resync with /ws/onnv-clone or specify branch point with -r hg_rev.

Doing a bringover from /ws/onnv-clone, and resolving any conflicts, should fix things up.

You may also see a message like

Please run
  hg --cwd /export/kupfer/tonic/wx2hg-tests/ update -C
before retrying.

This is telling you that you can reuse the Mercurial child, but you need to reset it first. Once you've resynched with /ws/onnv-clone and run the "hg ... update..." command, you use the -t option to tell wx2hg to reuse the Mercurial child. For example,

/opt/onbld/bin/wx2hg -t

There's more that wx2hg can do, but those features won't be needed until ON moves to Mercurial. If you get stuck using wx2hg, you can ask for help on the SCM migration team list (scm-migration-dev at opensolaris dot org).

Friday Feb 15, 2008

SCM Migration: The Big Picture

When Steve Lau left Sun at the end of last September, I became the go-to guy inside Sun for the migration to Mercurial. I had thought that I had a good high-level grasp of the project. But after getting blindsided a couple times by dependencies I hadn't considered, I drew up a diagram to help me get oriented, identify stakeholders, and maybe anticipate future issues.

Here's a slightly simplified version of the original diagram from the whiteboard in my office:

Blue parallelograms indicate repositories, tan boxes are software modules, solid lines indicate data flow, and dashed lines tie users with the modules that they're using. The three red-rimmed boxes (gk tools, gate hooks, and onbld tools) are where most of the development effort is going.

The primary simplifications in this diagram are

  • The data flow from the project gate actually goes through the SCM front-end before going through the gate hooks.
  • I've omitted the consolidation's clone workspace (a nightly snapshot of the gate).
  • I've omitted the bridge between the current ON workspace in TeamWare and the Mercurial repository that is shadowing it.

Even so, this is a moderately busy diagram. There are several components to keep track of and make sure they all fit together.

Most of the work so far has been in the area of the ON build (onbld) tools, pieces of which are used by other consolidations and by the Solaris Companion project. Many of the changes are related to making the tools work with Mercurial as well as with TeamWare/SCCS. We've also had to consider the implications of moving everything outside the Sun firewall, which has meant rethinking interfaces to things like the bug database and our RTI (Request To Integrate) system.

We haven't done as much work on the gatekeeper (gk) tools, although we've started to think about design issues. Many of the design decisions boil down to this question: do we make the minimal set of changes needed to work with Mercurial, or do we make more extensive changes so that the tools can make better use of the features provided by Mercurial? In some cases we are staying with the current approach. For example, we are using separate repositories for build snapshots, rather than using branches and tags in the main gate repository. In other cases we will be changing the tools to use Mercurial features. For example, any automated post-putback processing will be driven directly by Mercurial hooks, rather than the email-based hook system that is needed with TeamWare.

Another set of interesting design decisions has centered around the use of gate hooks to enforce various style and bookkeeping rules. With the current TeamWare setup, we enforce these rules after a putback (at least for ON). The putback triggers various checks, and if your putback violates a rule, you are notified of the problem and given a short window to fix it; otherwise, your putback is reverted. The gate is normally configured so that anyone (inside Sun) can do a putback.

While this approach worked when Solaris was closed source, we expect it not to scale for OpenSolaris, where the repository is accessible from anywhere on the Internet and both Sun employees and non-employees can have commit rights. Certain Mercurial hooks can abort a putback ("push" in Mercurial terms), so we could move all the post-putback checks to pre-transaction checks. But moving more checks means more work (e.g., testing), which means a longer time before we can move to Mercurial. So the question becomes which checks really need to happen before putback, and which ones can happen after putback. The check to ensure that a putback has an approved RTI probably needs to happen prior to the putback. The check for adherence to the C style rules can happen after the putback, at least for now.
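To make the pre-putback RTI check a little more concrete, here is a minimal Python sketch. Mercurial's in-process hooks report failure with a true return value, and that is what aborts the push. Everything else is an assumption made for illustration: the convention that each changeset comment starts with a 7-digit bug ID, the names BUG_ID, missing_rti, and rti_check_fails, and the idea that the set of approved bug IDs has already been fetched from the RTI system.

```python
import re

# Assumed comment convention: "<7-digit bug ID> <synopsis>".
BUG_ID = re.compile(r"^(\d{7}) \S")


def missing_rti(comments, approved_ids):
    """Return the changeset comments that lack an approved RTI."""
    rejected = []
    for comment in comments:
        match = BUG_ID.match(comment)
        # Reject comments with no bug ID, or with an unapproved one.
        if match is None or match.group(1) not in approved_ids:
            rejected.append(comment)
    return rejected


def rti_check_fails(comments, approved_ids):
    """Hook-style result: True means "abort the push"."""
    return bool(missing_rti(comments, approved_ids))
```

A real hook would collect the comments from the incoming changesets and print the rejected ones before returning. A check like the C style check could keep running after the putback and reuse the same reporting.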

The webapp has various bits of functionality for source code management. A project leader or gatekeeper can use the webapp to create, destroy, and lock repositories, as well as to manage commit rights for the project's repositories. Unfortunately, the current set of operations is limited. For example, a gatekeeper might want to lock a repository for most users, but allow access for a specific large project. Alas, this lock granularity is not currently supported. Furthermore, all the controls are currently through a web-based interface, with no scripting hooks. Although there is currently work to improve the webapp and make it easier to change, this work is unlikely to be finished in time for us to make any changes that we expect gatekeepers to want. So we will need to think about other ways to provide the needed functionality, such as giving gatekeepers shell access to the server that hosts the repositories.
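The finer lock granularity that a gatekeeper might want can be sketched in a few lines of Python. This is only an illustration of the idea, not how the webapp works: the RepoLock class, its exempt set, and the may_push function are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class RepoLock:
    """Hypothetical lock state for one repository."""
    locked: bool = False
    # Users who may still push while the repository is locked.
    exempt: set = field(default_factory=set)


def may_push(user, committers, lock):
    """Return True if `user` may push under the given lock state."""
    if user not in committers:
        return False            # no commit rights at all
    if not lock.locked:
        return True             # repository is open to all committers
    return user in lock.exempt  # locked: only exempted users get through
```

With this shape, locking the gate for everyone except a large project's integrators is just RepoLock(locked=True, exempt={...}), and a script could flip the state without going through a web form.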

The SCM front-end gives a user access to repositories by creating a chroot environment which contains only the repositories that the user has commit privileges for. (Access to other repositories is done via the "anon" user.) If the user reports being unable to pull from, or push to, a repository, the problem could be with the SCM program itself, the SCM front-end, or some other general system service. This diagnosis typically requires shell access to the servers.
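The access rule described above can be written down as a small Python sketch. The data layout (a dictionary from repository path to the set of users with commit rights) and the function names are assumptions for illustration; only the "anon" fallback comes from the actual setup.

```python
def chroot_contents(user, commit_rights):
    """Return the repositories that belong in `user`'s chroot.

    `commit_rights` maps a repository path to the set of users
    who have commit privileges for it (hypothetical layout).
    """
    return sorted(repo for repo, users in commit_rights.items() if user in users)


def account_for(user, repo, commit_rights):
    """Return the account `user` should use to reach `repo`."""
    if user in commit_rights.get(repo, set()):
        return user   # repository is visible inside the user's chroot
    return "anon"     # everything else goes through the "anon" user
```

When a user reports a failed push, comparing the repository in question against chroot_contents for that user is a quick first check before looking at the SCM program or other system services.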

We are using Nagios to monitor the health of the servers and services on We have written a couple simple Nagios plugins to monitor the Mercurial and Subversion services. As we gain experience with the system, we could update the probes to check for specific failure scenarios.
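For readers who haven't written a Nagios plugin: a plugin prints a one-line status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The Python sketch below shows that shape. It is not one of our plugins; the run_check function, the "SCM" label, and the thresholds are hypothetical, and the probe is left as a callable so that it could be a connection attempt, a test pull, or anything else.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(probe, warn_secs, crit_secs, clock=time.monotonic):
    """Time `probe()` and return (exit_code, status_line).

    `probe` is any callable that raises an exception when the
    service is unreachable.
    """
    start = clock()
    try:
        probe()
    except Exception as exc:
        return CRITICAL, f"SCM CRITICAL - probe failed: {exc}"
    elapsed = clock() - start
    # Slow responses degrade to WARNING, then CRITICAL.
    if elapsed >= crit_secs:
        code = CRITICAL
    elif elapsed >= warn_secs:
        code = WARNING
    else:
        code = OK
    return code, f"SCM {LABELS[code]} - responded in {elapsed:.2f}s"
```

A complete plugin would print the status line and pass the code to sys.exit(). Checking for a specific failure scenario then means writing a new probe, not a new plugin.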

OpenGrok makes it into this diagram because it makes a private snapshot of each repository that it indexes, so as to provide a consistent view of the tree. We once managed to break the OpenGrok indexing of ON by trying to undo (rollback) a particular putback, so that it would vanish completely from the repository. We didn't know to roll back OpenGrok's snapshot repository as well. So the next time OpenGrok tried to pull from the Mercurial onnv-gate, it created a branch that had to be merged. This was not something OpenGrok was prepared for, so the snapshot tree was not updated. After several days, we started getting complaints from ON teams who couldn't find their recent putbacks in OpenGrok. We figured out the problem, replaced OpenGrok's snapshot repositories, and vowed not to undo/rollback any future putbacks.
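The failure is easier to see in a toy model. The Python sketch below represents a repository as a child-to-parent dictionary with made-up changeset names (real Mercurial changesets are identified by hashes), and the heads and pull functions are simplified stand-ins for illustration, not Mercurial's implementation.

```python
def heads(parents):
    """Return the changesets that are nobody's parent."""
    all_parents = set(parents.values())
    return sorted(cs for cs in parents if cs not in all_parents)


def pull(dest, source):
    """Copy into `dest` any changesets from `source` that it lacks."""
    merged = dict(dest)
    merged.update(source)
    return merged


gate = {"A": None, "B": "A", "C": "B"}
snapshot = pull({}, gate)          # the snapshot repository now has A, B, C

del gate["C"]                      # the putback is rolled back in the gate...
gate["D"] = "B"                    # ...and the next putback lands on B

snapshot = pull(snapshot, gate)    # the snapshot still has C, and now D too
```

After the second pull, the gate has a single head (D), but the snapshot has two (C and D), which is the branch that needed a merge. Rolling back the snapshot at the same time as the gate would have kept the two histories identical.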

So that's the "big picture" of what the SCM Migration project is working on. If you've been frustrated by how long things are taking, well, we're not happy about it, either. Our hope is that by keeping the entire picture in mind, we will not have any serious problems when we finally do move.

Sunday Nov 25, 2007

A Different Type of Temporary E-Mail Address

I was reading an update from Bay Area Consumers' Checkbook, and they mentioned a new spin on temporary email addresses. Make up a name at When email arrives there, it is available to read for a few hours and then deleted. No sign-up, no password. Anyone who guesses the account name that you used can read it.

I don't know how much I'd actually use this, but it's a neat idea. And their FAQ is a hoot.

Friday Aug 17, 2007

ksh93 Putback

April Chin put back ksh93 into the ON gate this morning. Woohoo! I'm delighted to have a modern, open-source Korn shell in OpenSolaris, and I'm looking forward to when we can (someday) retire the old Solaris ksh. Many thanks to April, Roland Mainz, and Don Cragun for all their work, as well as to everyone who participated in the project reviews and discussions.

Sunday Jun 24, 2007

GNOME Disk Analyzer

I upgraded my desktop to snv_66 (build 66 of Solaris Nevada) earlier in the week and played around some with the new GNOME bits (2.18). The new Disk Analyzer GUI has a much-improved format for showing where you're using disk space. In the example below, about 25% of my home directory is email, and about 23 MB is email for NFS.

Mike's home directory (GNOME)

I still prefer the equivalent view in Konqueror, because it can identify individual files, whereas the GNOME tool only tells you about directories. But the radial format in the GNOME tool is pretty cool.

    home directory (KDE)

(In case anyone is wondering, the KDE screenshot is from back in March, so this picture is not directly comparable with the one above.)

Wednesday May 30, 2007

What I Learned From Ubuntu

Mark Shuttleworth and a few Ubuntu developers stopped by the Sun Menlo Park campus on Friday May 4th. I'm not working with Ubuntu, but since I'm involved with the Solaris Companion and with general OpenSolaris issues, I wanted to see what they had to say about third-party packages and about how they do their releases.

You can organize Ubuntu packages along two dimensions. The first dimension is whether the package is free (libre). The second dimension is whether Canonical (Ubuntu's corporate sponsor) provides support (e.g., security fixes). This gives us the following table:

            supported by Canonical      not supported by Canonical
free        main (2,000 packages)       (18,000 packages)
not free    restricted (5 packages)     (200 packages)

Notice that Canonical only supports 10% of the packages in the distro.

There are two levels of access to the third-party packages. The first level is an engineering repository which bypasses Canonical. That is, people can update the repository at any time, without regard to the Ubuntu release schedule. The second level is the actual distro, which has tighter controls.

Some of the packages are available on the Ubuntu CD, but many are only available via network download. Canonical does not track the downloads. This would be heresy inside Sun, where there's a big emphasis on measuring things. But Mark said that Canonical doesn't really care about the download numbers, and it would be difficult to get accurate numbers anyway (e.g., because of mirroring).

Someone asked Mark how they deal with packages that potentially infringe on a patent. Mark said that there's no such thing as a global patent, so those packages are allowed in the distro, but they're only available via network download. The user self-certifies that it's okay for him or her to use the package.

Another issue that comes up with third-party packages is how to track bugs. Mark talked about this a bit, and it's something we're facing with OpenSolaris, too. The basic problem is that for a given package, there may be two bug databases: one deployed by the upstream project and one deployed by the distro. So far, the industry best practice seems to be to push distro-independent information to the upstream database, leaving distro-specific details in the distro's database. This approach is less than ideal, because it requires a fair amount of manual effort to track the bug status and to keep the right information in the right database. Canonical developed a tracking application called Launchpad to help deal with this, but Mark mentioned that it's still not quite what they want, and that Canonical might be revisiting the issue in a couple years. It'd be nice if the Ubuntu and OpenSolaris communities could somehow work together on that.

Mark spent a little time describing Launchpad, and it does have some nice bug-tracking features. For example, you can create hyperlinks to the upstream database entry, and Launchpad can automatically query the upstream database to get the bug's status.

Launchpad also has more general collaboration support, such as mailing lists, project web space, and a code repository. Launchpad includes features that would be useful on, like a translation tracker and an application for proposing and tracking project ideas.

The other major topic that I was interested in was how Ubuntu releases are done. Ubuntu releases follow a train model, with releases appearing every 6 months. There is support for 18 months, except for Long Term Support (LTS) releases, where servers are supported for 5 years. For those who are not familiar with the train model, the basic idea is that if your code is not ready in time, it is bumped to the next release, rather than delaying the current release.
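The train model boils down to a simple rule, sketched here in Python. The function names and the use of month numbers (counting from the start of development) are assumptions for illustration; the 6-month interval and the 18-month and 5-year support periods are the figures given above.

```python
def first_train(ready_month, interval=6):
    """Return the departure month of the first train a feature can catch."""
    # Ceiling division: a feature ready in month 7 waits for month 12.
    trains_needed = max(1, -(-ready_month // interval))
    return trains_needed * interval


def support_ends(release_month, lts=False):
    """Return the month when support ends for a release."""
    # 18 months normally; 5 years for LTS releases on servers.
    return release_month + (60 if lts else 18)
```

Note what is missing: nothing a feature does can change a departure month. A feature that is ready in month 7 ships in month 12, and the month 6 release leaves without it.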

Sun tried a train model for Solaris in the 1990s, with releases every 6 months[1]. It didn't work for us, and we eventually gave it up. I wasn't involved with Solaris release management, so I probably have a limited perspective on what all the issues were. But as a developer I could see a couple things that contributed to abandoning 6-month trains.

The first problem that I saw was that we didn't stick to the cutoff dates. There was often some new feature that just couldn't wait for the next train, so we would bend the rules and let changes integrate after the nominal cutoff[2]. I suppose that having a late binding mechanism makes sense for exceptional circumstances, but I think it got overused. These days, it seems like late binding isn't just a safety net to keep the release from falling apart, it's a regular phase in the release cycle. I suppose the net effect isn't too horrible--it's effectively a gradual freezing of the code, rather than a hard freeze. But it does push back the real, final freeze date, which then reduces the time that is available for later parts of the release cycle.

This ties in to the other problem that I saw, which was that the Beta Test period was too short. I forget how long the Beta periods were, but they were short enough that by the time customers had actually deployed the code, identified and reported issues, and we had worked out a fix, it was too late to get the fix into that release.

Of course, this raises the question of why Canonical doesn't have the same problems with Ubuntu.

One explanation is that much of what goes into Ubuntu comes from an upstream source and is already (more or less) stable. There is some original work done for Ubuntu, but it's not the "deep R&D"[3] of things like SMF, DTrace, or ZFS. It's hard to predict the schedule for cutting-edge projects, particularly ones that affect large parts of the system. That's not an entirely satisfactory answer, though, because according to the train model, if a project is late, you just bump it to the next release. So there must be more going on than that.

One thing that could mess up a train model is technical dependencies. Suppose Project A depends on Project B. If you integrate parts of A under the assumption that B will integrate later in the release, there will be a strong temptation to delay the release if B is late. The Ubuntu folks try to avoid this problem by avoiding dependencies on upstream code that's scheduled to be released near the feature freeze. How strict they are about this depends in part on how much they trust the upstream provider to meet its schedule. And in a pinch, they might take beta code if it's deemed to be stable enough. I don't know if technical dependencies were a factor in moving away from the train model for Solaris releases. It shouldn't have been an issue for the OS/Net consolidation ("FCS Quality All the Time"), but I don't know about Solaris as a whole.

I suppose there could have also been a sort of "marketing and PR" dependency problem, where we feared a loss of face if Feature X didn't make its target release. I don't know if this was actually an issue, but Sun does seem to like big, flashy announcements, and there are quite a few analyst briefings that happen under embargo[4] prior to these events.

Another explanation for why Canonical can make 6-month trains work is that the 6-month releases serve a different target market than the one Solaris has been in. A noticeable chunk of the Solaris user base would go nuts with a 6-month release cycle and 18-month support tail. As soon as they got one release qualified and deployed, they'd have to do it all over again.

So one thing we might want to look at for Solaris is to have two release vehicles, similar to the 6-month and LTS releases that Canonical is doing with Ubuntu. But there are still some issues with that model that we'd want to figure out. For example, the Ubuntu folks said that most of the Ubuntu LTS customers just want security fixes, whereas Solaris customers often demand patches for non-security bugs.

Another thing that distinguishes Ubuntu releases from the 6-month Solaris trains is when customers actually get the bits to play with. There are only 3 weeks between the Beta release and final release for Gutsy, but there will be six snapshots that are available sooner, with the first (fairly unstable) one appearing 16 weeks before the Beta release. This gives users a larger window than we had with the 6-month Solaris trains in which to try out the release and give feedback.

So, to sum it all up: I learned that distros can successfully deal with issues that OpenSolaris and Sun are facing, like how to provide the many third-party packages that users want, and how to keep them current. What we need to do now is figure out how to make it work for OpenSolaris, without sacrificing the stability that attracted many Solaris users in the first place.

[1] The internal code names for SunOS 5.2, 5.3, and 5.4 were on493, on1093, and on494, respectively.

[2] At some point we came up with a formalized "late binding" process, but I don't remember just when that was introduced.

[3] That's the term Mark used.

[4] That is, the analyst isn't allowed to publish anything about it before a certain date and time.



