Tuesday Oct 12, 2010

Watch This Space

With the latest XWiki update, opensolaris.org now has working watchpoints. The UI for watching a page is pretty obvious--there's a watch/unwatch link in the upper right of the page. But I couldn't figure out how to watch an entire space (e.g., the Tools community space). Today I stumbled across it. The 2nd menu in the upper left shows the space that the current page belongs to. One of the operations you can invoke via that menu is "watch space".

Wednesday Nov 25, 2009

Signed Crypto Gets Its Own Tarball

The OS/Net (ON) component of OpenSolaris has some closed-source code. The binaries for this code (well, the binaries that are redistributable) are made available to non-Sun developers in the form of a compressed tar file, which the build tools incorporate into BFU archives or packages. These closed-bins tarballs also contain binaries for open-source cryptographic code. To satisfy US government regulations, the OpenSolaris cryptography framework requires that certain crypto binaries be signed. Most external developers don't have the necessary key and certificate to sign their binaries, so we provide a working set for them.

This setup has worked okay since the launch of OpenSolaris in 2005, but it's got a couple problems. First, bindrop, the script that puts the crypto binaries into the closed-bins tarball, works off a hard-coded list. As with any manually-maintained list, this introduces a risk that it will not be updated when a new crypto module is added to the system. Second, bindrop gets the crypto binaries by extracting them from the SVR4-format packages that are generated from the ON gate. With the upcoming move to IPS, those packages will go away, and it will be much harder to extract the binaries from the IPS packages that will replace them.

So I'm working on changes to the way we deliver the signed crypto binaries. First, we'll be splitting the crypto out into its own tarball. This gives us more flexibility about when we deliver the crypto binaries. Second, instead of using a hard-coded list, we'll scan the proto area, which is the staging area before files are packaged. Any properly signed binaries will be included in the crypto tarball. If you're interested, CR 6855998 has more details.

The code for this has been written, though it still needs a little more polishing, like making sure that error messages are handled correctly. I'm hoping to get this into build 130, but it might slip into 131.

Saturday Oct 10, 2009

SCM Mounts: Done (Almost)

I've finished the workaround for the sshd privileges issue. I ended up writing a simple setuid C program so that our PAM module could unmount the loopback filesystems. I had been using an RBAC-based approach, but that requires that the user own the mount point for each loopback mount. The more I worked on it, the more failure scenarios I ran into because of that requirement. The setuid approach had none of those issues, and it turned out to be much simpler to code than I had been expecting.

So the changes have been committed to the repository for the SCM infrastructure, and the new bits have been deployed on the backup SCM server. The only thing left is to deploy on the primary SCM server.

Unfortunately, this doesn't mean I'll now have time to finish off the OSCON trip report. Instead, I'll be focusing on a change to the way we deliver crypto binaries to ON developers.

Saturday Aug 22, 2009

Still Reviewing Web Pages

I'm still reviewing web pages in preparation for the migration of opensolaris.org to XWiki. So far I've finished reviewing the ON Developer Reference and the SCM-related pages in the Tools web space.

The only thing left for (my) review is the SCM Migration project web space. Since that project is no longer active, I don't plan to look too carefully. But a sanity check does seem in order, and maybe there are some obsolete pages or attachments that we can delete.

The ON Developer Reference (DevRef) is a particularly tough case for the migration software because of its extensive use of anchored links. I had been planning to retire the XML (Docbook) source that the DevRef currently uses, and keep everything in XWiki markup, but I'm not looking forward to fixing all the cross-references. So I'm having second thoughts about that strategy.

Sunday Aug 16, 2009

OpenSolaris.org Moving to XWiki

I'm afraid I haven't made any progress on my OSCON trip report this week. We've started the beta testing for migrating opensolaris.org from the current portal application to XWiki. I've been reviewing the ON Developer Reference, and it's taken more of my time than I had expected. (In fact, I'm still not done.)

If you're a community or project leader, please do take the time to review the pages that you're responsible for. Some issues will be easier to fix if they are identified before the migration. And the migration team needs user feedback to help identify which issues cause the most trouble.

If you just have a question, you can ask on the website-discuss list (at opensolaris.org). If you're sure you've found a bug, either in the migration code or in the new XWiki-based site, go to defect.opensolaris.org and file a bug under Development: product=website, component=site-wiki.

Tuesday Jul 21, 2009

Unwanted Mounts

As described in the design document, source code access on opensolaris.org is done via ssh. The user doesn't invoke ssh directly. Rather, the user runs Mercurial (or Subversion), which invokes ssh using its standard processing for ssh URLs. Once connected to the server, a custom restricted shell invokes the server-side program. This is all done in a chroot environment, with loopback mounts providing access to only those repositories that the user has write access to.

The loopback mounts are created when the user logs in, and they are torn down when the source code management (SCM) operation completes. This is done by way of a custom PAM module. As part of the session's "open" processing, the module determines what repositories to grant access to, and it establishes those mount points. As part of the session's "close" processing, it removes those mount points.

We recently noticed that the loopback mounts were not getting unmounted. This causes a couple problems. One is that thousands of unused loopback mounts accumulate on the server. If nothing else, this makes life more difficult for administrators.

The lingering mounts can also lead to a denial of service problem, which we've witnessed a few times. The problem occurs if a repository is deleted and recreated while there is still a loopback mount for it. Future references to the loopback mount will fail with an error. This can interfere with the setup of a user's loopback mounts in a subsequent login, resulting in a situation where users are unable to access recently created repositories. Worse, attempts to unmount the broken loopback mount fail, and lofs doesn't support forced unmount. So the only way to recover is to reboot the server.

After the third or so instance of this, we decided to figure out why the loopback mounts were not getting unmounted. Arguments can be passed to a PAM module by putting them after the module name in /etc/pam.conf, and there's a convention to enable debugging output with the argument "debug", e.g.,

other	session requisite	pam_foo.so.1	debug
    

For this to be useful, syslogd needs to be configured to display the debug output. For example, put

auth.debug	/var/adm/auth.log
    

in syslog.conf and utter

# svcadm restart system/system-log

Once we made these two changes, we could see that the session-open routine was running normally, but it didn't look like the session-close routine was getting invoked.

This seemed awfully strange, so we enabled PAM framework debugging with

# touch /etc/pam_debug

(This, too, requires that syslogd be configured to put auth.debug output somewhere accessible.)

This showed that our session-close routine was, in fact, being invoked.

Looking more closely at the session-closed routine, we noticed that it checks what user it is invoked as. If it's not invoked as uid 0, it bails out, before doing any debug logging. Moving the debug logging to come before the uid check confirmed that it was running as the user whose session was ending.

Some Googling revealed a known issue in OpenSSH (from which the Solaris SSH is derived) in which the session-close routine is called as the session's user, not uid 0.

From the comments in the OpenSSH Bugzilla, it looks like a fix is available from upstream, so we're hopeful that we just need to talk to the Sun SSH team about getting the fix into OpenSolaris. We're also looking into possible workarounds, in case the fix can't be pulled in promptly.

Update 2009-09-16

I filed a bug for this: 6869790.

The current status is that the Solaris SSH team is discussing possible fixes, but they haven't come up with a good approach yet. Just reverting the code isn't an option because it would break support for hardware acceleration. And the upstream privilege separation code is different from the code in Solaris, so they can't just use the upstream patch.

Friday Feb 13, 2009

OpenSolaris and gnuserv

I installed OpenSolaris 2008.11 on my notebook (a VAIO TX) several weeks ago. I've been tweaking the environment, in preparation for the day when I move to OpenSolaris on my desktop system.

One of the issues that came up was that gnuserv would exit immediately after being started. This meant that every time I wanted an editor (e.g., for a Mercurial commit), I had to wait for a new XEmacs process to start.

I looked around for some sort of error message but couldn't find anything. I finally started XEmacs using truss -f. Looking at the truss output, I saw that gnuserv was looking in /etc/hosts and not finding an entry with the notebook's hostname ("loiosh").

I added "loiosh" to the localhost (127.0.0.1) line, and that fixed the problem.

Wednesday Feb 11, 2009

Mercurial pretxn Hook Race

Currently the ON gate (or at least the open source part) is mirrored on opensolaris.org. We were having a discussion the other day about what needs to be done so that we can actually host it on opensolaris.org.

One of the issues that came up is the race in the Mercurial pre-transaction hooks, such as the pretxnchangegroup hook. These hooks let a repository reject pushes that don't meet whatever criteria that the hook has set. For the ON gate, we use it for things like making sure there is an approved RTI for the changegroup.

The problem is that the implementation of these hooks opens up a race condition. The metadata for the changegroups gets written to the repository, then the pre-transaction hook gets run. The advantage of this approach is that the pre-transaction hooks can use existing APIs and code paths when examining the incoming changegroup. But Mercurial repositories are structured so that readers don't need a lock; instead they depend on an atomic update of the top-level metadata. So the disadvantage of this approach is that there's a window during which someone pulling from the gate could get the pending changegroup, even if the hook later rejects it.

This issue is described in Section 10.3 of Bryan O'Sullivan's Mercurial book; it is also issue 1321 in the Mercurial bug tracker. The workaround that the Mercurial book describes is the one that we used for the ON gate: the repository that people push to is write-only. After the pre-transaction hooks have cleared the changegroup, another hook pushes the changegroup to a second clone repository, which developers pull from.

While this approach is functional, it's not esthetically pleasing. And there's a practical problem: the SCM infrastructure on opensolaris.org doesn't support having two repositories tied together like that. I'm sure it could be done, but administration (e.g., updating the access list) would be clumsy, and it might require giving the ON gatekeepers shell access to the opensolaris.org servers (which would not please them or the server administrators).

Fortunately, Matt Mackall has devised a fix for the race condition. The new changegroup will not be visible for pulls until it has passed the pre-transaction hooks. And if I understand correctly, the fix will not require changes to existing hooks, except for the case of Python hooks that spawn subprocesses..

There are other changes that we will probably make before hosting the gate on opensolaris.org. For example, we'll probably change the SCM console (the web interface for managing repositories) so that it scales better for large numbers of committers. But getting a fix for this race condition means we'll have one less issue to deal with.

Sunday Oct 05, 2008

Printing opensolaris.org Pages with Recent Builds

Back in August I upgraded my desktop to snv_95. I sometimes print pages from opensolaris.org to read during my commute, but with snv_95 the pages came out pretty much unreadable. They looked like they had been through several fax transmissions, with blotchy, almost indecipherable characters. At the time I chalked it up to known issues with fonts and went back to running an earlier build (thank you, Live Upgrade).

I revisited the issue last week, after noticing that the headers and footers from Firefox looked okay. It was just the main text that was messed up. I checked my preferences (Content>Fonts&Colors>Advanced)--the checkbox "Allow pages to choose their own fonts" was enabled. I disabled it and tried again, and now the printed pages are legible.

Wednesday Apr 09, 2008

Converting Projects to Mercurial

One of the things that we consider when deprecating components of (Open)Solaris is how users move from the old software to the new software. We've applied that principle to the SCM Migration project, so we've been working on documentation (e.g., a Mercurial cheat sheet for TeamWare users), and the updated tools work with both TeamWare and Mercurial. Also, we don't want to tie the schedules of large projects to the SCM Migration schedule or vice versa. So we need to support projects that are begun under TeamWare, but which are still under development when we're ready to move the gate from TeamWare to Mercurial. That support is provided by a new script called wx2hg.

In general, it's hard to convert a TeamWare workspace to Mercurial, at least if you want to maintain history. But ON already has a policy that putbacks should (usually) add a single delta. That is, any project-specific history will be lost anyway. That makes the job of wx2hg a lot easier.

Suppose you have a project gate--call it my-proj--that is a child of onnv-gate, the ON master gate. We already maintain a Mercurial mirror of onnv-gate, which I will call onnv-hg for now. So when you're ready to move to Mercurial, what you want is a child of onnv-hg. That child should have the same changes relative to onnv-hg that my-proj has relative to onnv-gate.

It turns out that it is pretty easy for wx2hg to do this. The wx front-end keeps track of renames and files with contents changes. So wx2hg just needs to get that information from wx and apply it to a child of onnv-hg. The rest of the script is error detection and handling.

Let's walk through an example.

Suppose I have a workspace that deletes all the SCCS helper scripts in usr/src/tools. And to demonstrate renames, it renames the scripts directory makefile to Makefile.new.

$ pwd
/export/kupfer/tonic/wx2hg-tests/tw.no-sccs-tools.demo
$ putback -n
...

Would put back name changes: 10

rename from: usr/src/tools/scripts/Makefile
         to: usr/src/tools/scripts/Makefile.new
rename from: usr/src/tools/scripts/sccscheck.1
         to: deleted_files/usr/src/tools/scripts/sccscheck.1
rename from: usr/src/tools/scripts/sccscheck.sh
         to: deleted_files/usr/src/tools/scripts/sccscheck.sh
rename from: usr/src/tools/scripts/sccscp.1
         to: deleted_files/usr/src/tools/scripts/sccscp.1
rename from: usr/src/tools/scripts/sccscp.sh
         to: deleted_files/usr/src/tools/scripts/sccscp.sh
rename from: usr/src/tools/scripts/sccshist.sh
         to: deleted_files/usr/src/tools/scripts/sccshist.sh
rename from: usr/src/tools/scripts/sccsmv.1
         to: deleted_files/usr/src/tools/scripts/sccsmv.1
rename from: usr/src/tools/scripts/sccsmv.sh
         to: deleted_files/usr/src/tools/scripts/sccsmv.sh
rename from: usr/src/tools/scripts/sccsrm.1
         to: deleted_files/usr/src/tools/scripts/sccsrm.1
rename from: usr/src/tools/scripts/sccsrm.sh
         to: deleted_files/usr/src/tools/scripts/sccsrm.sh

The following files are currently checked out and have been edited in workspace
"/export/kupfer/tonic/wx2hg-tests/tw.no-sccs-tools.demo":
	usr/src/tools/scripts/Makefile.new
...
No changes were put back
$ 
    

Note that although Makefile.new is checked out, it need not be.

Converting this to Mercurial is simple. If your TeamWare workspace is in a directory that you have write access to, just point wx2hg at it.

$ pwd
/export/kupfer/tonic/wx2hg-tests
$ /opt/onbld/bin/wx2hg tw.no-sccs-tools.demo
    

wx2hg first creates a Mercurial child (this step can take a few minutes). The child is created in the same directory as the TeamWare workspace, with the same name plus "-hg".

requesting all changes
adding changesets
adding manifests
adding file changes
added 6349 changesets with 91335 changes to 49774 files
44994 files updated, 0 files merged, 0 files removed, 0 files unresolved
    

wx2hg then initializes wx if you haven't already done so. If the workspace is already under wx control, it does a "wx update" to ensure it will get up-to-date information about the workspace.

Initializing wx...
...
New renamed file list:
...
New active file list:
...
Will backup wx and active files if necessary
...
wx initialization complete
    

wx2hg then checks out all the files with contents changes. We want to put the files into Mercurial with unexpanded SCCS keywords, and checking them out is a quick hack to help us do so.

usr/src/tools/scripts/Makefile.new already checked out
    

wx2hg then processes the rename list.

rename usr/src/tools/scripts/Makefile -> usr/src/tools/scripts/Makefile.new
rename usr/src/tools/scripts/sccscheck.1 -> deleted_files/usr/src/tools/scripts/sccscheck.1
rename usr/src/tools/scripts/sccscheck.sh -> deleted_files/usr/src/tools/scripts/sccscheck.sh
rename usr/src/tools/scripts/sccscp.1 -> deleted_files/usr/src/tools/scripts/sccscp.1
rename usr/src/tools/scripts/sccscp.sh -> deleted_files/usr/src/tools/scripts/sccscp.sh
rename usr/src/tools/scripts/sccshist.sh -> deleted_files/usr/src/tools/scripts/sccshist.sh
rename usr/src/tools/scripts/sccsmv.1 -> deleted_files/usr/src/tools/scripts/sccsmv.1
rename usr/src/tools/scripts/sccsmv.sh -> deleted_files/usr/src/tools/scripts/sccsmv.sh
rename usr/src/tools/scripts/sccsrm.1 -> deleted_files/usr/src/tools/scripts/sccsrm.1
rename usr/src/tools/scripts/sccsrm.sh -> deleted_files/usr/src/tools/scripts/sccsrm.sh
    

After the renames, it applies a patch for each modified file...

patching file usr/src/tools/scripts/Makefile.new
    

...and then you're done.

$ ls -dF \*demo\*
tw.no-sccs-tools.demo/		tw.no-sccs-tools.demo-hg/
    

You can verify that wx2hg transferred all your changes:

$ cd tw.no-sccs-tools.demo-hg
$ hg diff -g
diff --git a/usr/src/tools/scripts/sccscheck.1 b/deleted_files/usr/src/tools/scripts/sccscheck.1
rename from usr/src/tools/scripts/sccscheck.1
rename to deleted_files/usr/src/tools/scripts/sccscheck.1
diff --git a/usr/src/tools/scripts/sccscheck.sh b/deleted_files/usr/src/tools/scripts/sccscheck.sh
rename from usr/src/tools/scripts/sccscheck.sh
rename to deleted_files/usr/src/tools/scripts/sccscheck.sh
diff --git a/usr/src/tools/scripts/sccscp.1 b/deleted_files/usr/src/tools/scripts/sccscp.1
rename from usr/src/tools/scripts/sccscp.1
rename to deleted_files/usr/src/tools/scripts/sccscp.1
diff --git a/usr/src/tools/scripts/sccscp.sh b/deleted_files/usr/src/tools/scripts/sccscp.sh
rename from usr/src/tools/scripts/sccscp.sh
rename to deleted_files/usr/src/tools/scripts/sccscp.sh
diff --git a/usr/src/tools/scripts/sccshist.sh b/deleted_files/usr/src/tools/scripts/sccshist.sh
rename from usr/src/tools/scripts/sccshist.sh
rename to deleted_files/usr/src/tools/scripts/sccshist.sh
diff --git a/usr/src/tools/scripts/sccsmv.1 b/deleted_files/usr/src/tools/scripts/sccsmv.1
rename from usr/src/tools/scripts/sccsmv.1
rename to deleted_files/usr/src/tools/scripts/sccsmv.1
diff --git a/usr/src/tools/scripts/sccsmv.sh b/deleted_files/usr/src/tools/scripts/sccsmv.sh
rename from usr/src/tools/scripts/sccsmv.sh
rename to deleted_files/usr/src/tools/scripts/sccsmv.sh
diff --git a/usr/src/tools/scripts/sccsrm.1 b/deleted_files/usr/src/tools/scripts/sccsrm.1
rename from usr/src/tools/scripts/sccsrm.1
rename to deleted_files/usr/src/tools/scripts/sccsrm.1
diff --git a/usr/src/tools/scripts/sccsrm.sh b/deleted_files/usr/src/tools/scripts/sccsrm.sh
rename from usr/src/tools/scripts/sccsrm.sh
rename to deleted_files/usr/src/tools/scripts/sccsrm.sh
diff --git a/usr/src/tools/scripts/Makefile b/usr/src/tools/scripts/Makefile.new
rename from usr/src/tools/scripts/Makefile
rename to usr/src/tools/scripts/Makefile.new
--- a/usr/src/tools/scripts/Makefile.new
+++ b/usr/src/tools/scripts/Makefile.new
@@ -50,11 +50,6 @@ SHFILES= \\
 	nightly \\
 	onblddrop \\
 	protocmp.terse \\
-	sccscheck \\
-	sccscp \\
-	sccshist \\
-	sccsmv \\
-	sccsrm \\
 	sdrop \\
 	webrev \\
 	ws \\
$ 
    

Note that you still need to do "hg commit" to check in your new version.

All this assumes that your workspace is in sync with /ws/onnv-clone. If it isn't you may get messages like

wx2hg: can't rename: usr/src/tools/scripts/sccscheck.1 doesn't exist.
    

or

wx2hg: usr/src/tools/scripts/Makefile.new: parent mismatch; 
  resync with /ws/onnv-clone or specify branch point with -r hg_rev.
    

Doing a bringover from /ws/onnv-clone, and resolving any conflicts, should fix things up.

You may also see a message like

Please run
  hg --cwd /export/kupfer/tonic/wx2hg-tests/tw.no-sccs-tools.demo-hg update -C
before retrying.
    

This is telling you you can reuse the Mercurial child, but you need to reset it first. Once you've resynched with /ws/onnv-clone and run the "hg ... update..." command, you use the -t option to tell wx2hg to reuse the Mercurial child. For example,

/opt/onbld/bin/wx2hg -t tw.no-sccs-tools.demo-hg tw.no-sccs-tools.demo
    

There's more that wx2hg can do, but those features won't be needed until ON moves to Mercurial. If you get stuck using wx2hg, you can ask for help on the SCM migration team list (scm-migration-dev at opensolaris dot org).

Friday Feb 15, 2008

SCM Migration: The Big Picture

When Steve Lau left Sun at the end of last September, I became the go-to guy inside Sun for the migration to Mercurial. I had thought that I had a good high-level grasp of the project. But after getting blindsided a couple times by dependencies I hadn't considered, I drew up a diagram to help me get oriented, identify stakeholders, and maybe anticipate future issues.

Here's a slightly simplified version of the original diagram from the whiteboard in my office:

Blue parallelograms indicate repositories, tan boxes are software modules, solid lines indicate data flow, and dashed lines tie users with the modules that they're using. The three red-rimmed boxes (gk tools, gate hooks, and onbld tools) are where most of the development effort is going.

The primary simplifications in this diagram are

  • the data flow from the project gate actually goes through the SCM front-end before going through the gate hooks.
  • I've omitted the consolidation's clone workspace (a nightly snapshot of the gate)
  • I've omitted the bridge between the current ON workspace in TeamWare and the Mercurial repository that is shadowing it

Even so, this is a moderately busy diagram. There are several components to keep track of and make sure they all fit together.

Most of the work so far has been in the area of the ON build (onbld) tools, pieces of which are used by other consolidations and by the Solaris Companion project. Many of the changes are related to making the tools work with Mercurial as well as with TeamWare/SCCS. We've also had to consider the implications of moving everything outside the Sun firewall, which has meant rethinking interfaces to things like the bug database and our RTI (Request To Integrate) system.

We haven't done as much work on the gatekeeper (gk) tools, although we've started to think about design issues. Many of the design decisions boil down to this question: do we make the minimal set of changes needed to work with Mercurial, or do we make more extensive changes so that the tools can make better use of the features provided by Mercurial? In some cases we are staying with the current approach. For example, we are using separate repositories for build snapshots, rather than using branches and tags in the main gate repository. In other cases we will be changing the tools to use Mercurial features. For example, any automated post-putback processing will be driven directly by Mercurial hooks, rather than the email-based hook system that is needed with TeamWare.

Another set of interesting design decisions has centered around the use of gate hooks to enforce various style and bookkeeping rules. With the current TeamWare setup, we enforce these rules after a putback (at least for ON). The putback triggers various checks, and if your putback violates a rule, you get notified of the problem and given a short window to fix it or your putback is reverted. The gate is normally configured so that anyone (inside Sun) can putback.

While this approach worked when Solaris was closed source, we expect it not to scale for OpenSolaris, where the repository is accessible from anywhere on the Internet and both Sun employees and non-employees can have commit rights. Certain Mercurial hooks can abort a putback ("push" in Mercurial terms), so we could move all the post-putback checks to pre-transaction checks. But moving more checks means more work (e.g., testing), which means a longer time before we can move to Mercurial. So the question becomes which checks really need to happen before putback, and which ones can happen after putback. The check to ensure that a putback has an approved RTI probably needs to happen prior to the putback. The check for adherence to the C style rules can happen after the putback, at least for now.

The opensolaris.org webapp has various bits of functionality for source code management. A project leader or gatekeeper can use the webapp to create, destroy, and lock repositories, as well as to manage commit rights for the project's repositories. Unfortunately, the current set of operations is limited. For example, a gatekeeper might want to lock a repository for most users, but allow access for a specific large project. Alas, this lock granularity is not currently supported. Furthermore, all the controls are currently through a web-based interface, with no scripting hooks. Although there is currently work to improve the webapp and make it easier to change, this work is unlikely to be finished in time for us to make any changes that we expect gatekeepers to want. So we will need to think about other ways to provide the needed functionality, such as giving gatekeepers shell access to the server that hosts the repositories.

The SCM front-end gives a user access to repositories by creating a chroot environment which contains only the repositories that the user has commit privileges for. (Access to other repositories is done via the "anon" user.) If the user reports being unable to pull from, or push to, a repository, the problem could be with the SCM program itself, the SCM front-end, or some other general system service. This diagnosis typically requires shell access to the servers.

We are using Nagios to monitor the health of the servers and services on opensolaris.org. We have written a couple simple Nagios plugins to monitor the Mercurial and Subversion services. As we gain experience with the system, we could update the probes to check for specific failure scenarios.

OpenGrok makes it into this diagram because it makes a private snapshot of each repository that it indexes, so as to provide a consistent view of the tree. We once managed to break the OpenGrok indexing of ON by trying to undo (rollback) a particular putback, so that it would vanish completely from the repository. We didn't know to roll back OpenGrok's snapshot repository as well. So the next time OpenGrok tried to pull from the Mercurial onnv-gate, it created a branch that had to be merged. This was not something OpenGrok was prepared for, so the snapshot tree was not updated. After several days, we started getting complaints from ON teams who couldn't find their recent putbacks in OpenGrok. We figured out the problem, replaced OpenGrok's snapshot repositories, and vowed not to undo/rollback any future putbacks.

So that's the "big picture" of what the SCM Migration project is working on. If you've been frustrated by how long things are taking, well, we're not happy about it, either. Our hope is that by keeping the entire picture in mind, we will not have any serious problems when we finally do move.

Friday Aug 17, 2007

ksh93 Putback

April Chin put back ksh93 into the ON gate this morning. Woohoo! I'm delighted to have a modern, open-source Korn shell in OpenSolaris, and I'm looking forward to when we can (someday) retire the old Solaris ksh. Many thanks to April, Roland Mainz, and Don Cragun for all their work, as well as to everyone who participated in the project reviews and discussions.

Wednesday May 30, 2007

What I Learned From Ubuntu

Mark Shuttleworth and a few Ubuntu developers stopped by the Sun Menlo Park campus on Friday May 4th. I'm not working with Ubuntu, but since I'm involved with the Solaris Companion and with general OpenSolaris issues, I wanted to see what they had to say about third-party packages and about how they do their releases.

You can organize Ubuntu packages along two dimensions. The first dimension is whether the package is free (libre). The second dimension is whether Canonical (Ubuntu's corporate sponsor) provides support (e.g., security fixes). This gives us the following table:

supported by Canonical not supported by Canonical
free main
(2,000 packages)
universe
(18,000 packages)
not free restricted
(5 packages)
multiverse
(200 packages)

Notice that Canonical only supports 10% of the packages in the distro.

There are two levels of access to the third-party packages. The first level is an engineering repository which bypasses Canonical. That is, people can update the repository at any time, without regard to the Ubuntu release schedule. The second level is the actual distro, which has tighter controls.

Some of the packages are available on the Ubuntu CD, but many are only available via network download. Canonical does not track the downloads. This would be heresy inside Sun, where there's a big emphasis on measuring things. But Mark said that Canonical doesn't really care about the download numbers, and it would be difficult to get accurate numbers anyway (e.g., because of mirroring).

Someone asked Mark how they deal with packages that potentially infringe on a patent. Mark said that there's no such thing as a global patent, so those packages are allowed in the distro, but they're only available via network download. The user self-certifies that it's okay for him or her to use the package.

Another issue that comes up with third-party packages is how to track bugs. Mark talked about this a bit, and it's is something we're facing with OpenSolaris, too. The basic problem is that for a given package, there may be two bug databases: one deployed by the upstream project and one deployed by the distro. So far, the industry best practice seems to be to push distro-independent information to the upstream database, leaving distro-specific details in the distro's database. This approach is less than ideal, because it requires a fair amount of manual effort to track the bug status and to keep the right information in the right database. Canonical developed a tracking application called Launchpad to help deal with this, but Mark mentioned that it's still not quite what they want, and that Canonical might be revisiting the issue in a couple years. It'd be nice if the Ubuntu and OpenSolaris communities could somehow work together on that.

Mark spent a little time describing Launchpad, and it does have some nice bug-tracking features. For example, you can create hyperlinks to the upstream database entry, and Launchpad can automatically query the upstream database to get the bug's status.

Launchpad also has more general collaboration support, such as mailing lists, project web space, and a code repository. Launchpad includes features that would be useful on opensolaris.org, like a translation tracker and an application for proposing and tracking project ideas.

The other major topic that I was interested in was how Ubuntu releases are done. Ubuntu releases follow a train model, with releases appearing every 6 months. There is support for 18 months, except for Long Term Support (LTS) releases, where servers are supported for 5 years. For those who are not familiar with the train model, the basic idea is that if your code is not ready in time, it is bumped to the next release, rather than delaying the current release.

Sun tried a train model for Solaris in the 1990s, with releases every 6 months[1]. It didn't work for us, and we eventually gave it up. I wasn't involved with Solaris release management, so I probably have a limited perspective on what all the issues were. But as a developer I could see a couple things that contributed to abandoning 6-month trains.

The first problem that I saw was that we didn't stick to the cutoff dates. There was often some new feature that just couldn't wait for the next train, so we would bend the rules and let changes integrate after the nominal cutoff[2]. I suppose that having a late binding mechanism makes sense for exceptional circumstances, but I think it got overused. These days, it seems like late binding isn't just a safety net to keep the release from falling apart, it's a regular phase in the release cycle. I suppose the net effect isn't too horrible--it's effectively a gradual freezing of the code, rather than a hard freeze. But it does push back the real, final freeze date, which then reduces the time that is available for later parts of the release cycle.

This ties in to the other problem that I saw, which was that the Beta Test period was too short. I forget how long the Beta periods were, but they were short enough that by the time customers had actually deployed the code, identified and reported issues, and we had worked out a fix, it was too late to get the fix into that release.

Of course, this begs the question of why Canonical doesn't have the same problems with Ubuntu.

One explanation is that much of what goes into Ubuntu comes from an upstream source and is already (more or less) stable. There is some original work done for Ubuntu, but it's not the "deep R&D"[3] of things like SMF, DTrace, or ZFS. It's hard to predict the schedule for cutting-edge projects, particularly ones that affect large parts of the system. That's not an entirely satisfactory answer, though, because according to the train model, if a project is late, you just bump it to the next release. So there must be more going on than that.

One thing that could mess up a train model is technical dependencies. Suppose Project A depends on Project B. If you integrate parts of A under the assumption that B will integrate later in the release, there will be a strong temptation to delay the release if B is late. The Ubuntu folks try to avoid this problem by avoiding dependencies on upstream cde that's scheduled to be released near the feature freeze. How strict they are about this depends in part on how much they trust the upstream provider to meet its schedule. And in a pinch, they might take beta code if it's deemed to be stable enough. I don't know if technical dependencies were a factor in moving a way from the train model for Solaris releases. It shouldn't have been an issue for the OS/Net consolidation ("FCS Quality All the Time"), but I don't know about Solaris as a whole.

I suppose there could have also been a sort of "marketing and PR" dependency problem, where we feared a loss of face if Feature X didn't make its target release. I don't know if this was actually an issue, but Sun does seem to like big, flashy announcements, and there are quite a few analyst briefings that happen under embargo[4] prior to these events.

Another explanation for why Canonical can make 6-month trains work is that the 6-month releases serve a different target market than the one Solaris has been in. A noticeable chunk of the Solaris user base would go nuts with a 6-month release cycle and 18-month support tail. As soon as they got one release qualified and deployed, they'd have to do it all over again.

So one thing we might want to look at for Solaris is to have two release vehicles, similar to the 6-month and LTS releases that Canonical is doing with Ubuntu. But there are still some issues with that model that we'd want to figure out. For example, the Ubuntu folks said that most of the Ubuntu LTS customers just want security fixes, whereas Solaris customers often demand patches for non-security bugs.

Another thing that distinguishes Ubuntu releases from the 6-month Solaris trains is when customers actually get the bits to play with. There are only 3 weeks between the Beta release and final release for Gutsy, but there will be six snapshots that are available sooner, with the first (fairly unstable) one appearing 16 weeks before the Beta release. This gives users a larger window than we had with the 6-month Solaris trains in which to try out the release and give feedback.

So, to sum it all up: I learned that distros can successfully deal with issues that OpenSolaris and Sun are facing, like how to provide the many third-party packages that users want, and how to keep them current. What we need to do now is figure out how to make it work for OpenSolaris, without sacrificing the stability that attracted many Solaris users in the first place.


[1] The internal code names for SunOS 5.2, 5.3, and 5.4 were on493, on1093, and on494, respectively.

[2] At some point we came up with a formalized "late binding" process, but I don't remember just when that was introduced.

[3] That's the term Mark used.

[4] That is, the analyst isn't allowed to publish anything about it before a certain date and time.

Friday May 18, 2007

Defeating the OpenSolaris Address Mangler

The opensolaris.org webapp includes an automatic email address mangler to make it harder for spammers to harvest email addresses. But it's not very smart, and it mangles things that aren't email addresses, like device paths and repository URLs. If you're editing an HTML page on the web site and you want to bypass the email mangler, replace "@" with "@", as in

ssh://anon@hg.opensolaris.org/hg/onnv/onnv-gate

Friday Nov 03, 2006

Testing nightly(1)

I've been making a lot of changes to nightly(1), the main ON build script. With most of the build tools, you can test your changes by adding the -t flag to your nightly options, but that doesn't work for nightly itself.

For awhile, I was making $HOME/bin/nightly.new be a symbolic link to the new version, in whatever workspace it lived in. That got a bit awkward if I had more than one workspace with changes to nightly. Worse, because I had set things up so that I didn't really have to think about what I was doing, well, I wouldn't think about what I was doing--I would invoke nightly.new before invoking make to update it.

So now I manually do

$ cp usr/src/tools/proto/opt/onbld/bin/nightly ~/bin/nightly.new

It's more typing, but I've had to rerun fewer tests than I used to.


Technorati tags: OpenSolaris

About

Random information that I hope will be interesting to Oracle's technical community. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today