Teamware/SCCS history conversion to Mercurial


This was originally posted back in December 2007. I've since added some new references and some possible strategies at the end.

Silver Falls, Oregon. No, it doesn't use Mercurial, yet.

Just a few notes on converting source file change history from Teamware/SCCS to Mercurial. These are just notes because in the JDK area, and in any Teamware JDK workspaces being converted, we don't plan on converting the old source change history into Mercurial. The major reason is a legal issue, and you can imagine what the legal issues are with regard to non-open sources that become open. I won't get into that. But there are some technical issues too, which I will try to cover in case someone decides to attempt such a conversion.

Why convert the revision history?

The complete source history is an extremely valuable asset. Being able to know when a change was made years ago, and who made it, is often essential to understanding a problem in a product. Initially we wanted to preserve this source change history and assumed it wasn't a difficult job. Most engineers have been upset that our current plans don't include this history conversion, but read on if you are curious about the problems we encountered.

The Basic Idea

The basic idea in doing an 'ideal' source history conversion would be to create a Mercurial changeset for each Teamware putback. That means you need to identify the putback event, the specific SCCS revisions of the files, and any file renames or deletes. And each changeset is built upon the previous changeset, so the ordering of the changesets is critical here.

Sounds simple, right? Well, read on, it's not so simple.

The Problems

History Files: You need to understand how the Teamware history file works. The Codemgr_wsdata/history file in a workspace does not propagate, so the specifics of a putback won't percolate around your tree of workspaces. This means that each workspace has a history of the Teamware events that happened to it, but not the details of anything that happened to the other workspaces. So to get accurate Teamware history you need the entire tree of integration workspaces (any workspace that might be the target of a putback), including every one that ever existed, and then you'd need to fold all the events in these history files together in the proper time order. The more complicated the Teamware workspace integration tree, the more difficult this task becomes. The JDK workspaces (there are many different workspaces) each have 6-20 different integration workspace instances, and some of these workspaces go back quite a few years, so we are talking about some major source change history here.
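As a rough illustration of the "fold all the events together" step, here is a minimal Python sketch. It assumes each workspace's history file has already been parsed into time-sorted records. The HistoryEvent layout and the merge_histories name are hypothetical, made up for this example, and are not the actual Codemgr_wsdata/history format.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Tuple


class HistoryEvent(NamedTuple):
    """One already-parsed history event (hypothetical layout, for illustration)."""
    when: datetime           # event time, assumed normalized to a single timezone
    workspace: str           # workspace whose history file recorded the event
    command: str             # e.g. "putback" or "bringover"
    files: Tuple[str, ...]   # file names listed for the event


def merge_histories(
    per_workspace: Iterable[Iterable[HistoryEvent]],
) -> Iterator[HistoryEvent]:
    """Merge several per-workspace event streams into one time-ordered stream.

    Each input stream is assumed to be sorted by time already; heapq.merge
    then interleaves them lazily without loading everything at once.
    """
    return heapq.merge(*per_workspace, key=lambda event: event.when)
```

The hard part in practice is everything this sketch assumes away: finding every workspace that ever existed, parsing its history, and getting the timestamps onto a common clock.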

SCCS file revision numbers: The details in the Teamware history just list the files involved in a putback or bringover, not the specific SCCS revision numbers for the files. So matching up the specific SCCS revisions of files to the putback event that delivered them is not trivial. (I think there may have been an option in Teamware to record the SCCS revision numbers in the workspace history file, but it is off by default, which is a shame.) So creating a nice neat Mercurial changeset means you need to somehow match up the file lists and timestamps of the putbacks with the individual SCCS revision numbers of source files. Unfortunately, the SCCS files record a time but no timezone, so anyone who decides to do this kind of history conversion will need lots of fuzzy timestamp logic to match up the right SCCS revisions with the putbacks. The username is included in both the Teamware history file and the SCCS revisions, so that may also help, except that the integrator of a change often isn't the same person who made the SCCS revision.
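To give a feel for what that fuzzy timestamp logic might look like, here is a small Python sketch. It assumes the SCCS deltas have been parsed into records with a timezone-less local time, and that the putback time is known in UTC. The Delta record, the revisions_at_putback name, and the candidate offsets are all hypothetical. The function tries each plausible UTC offset and reports which revision would have been the newest at putback time under that offset.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Sequence, Set


class Delta(NamedTuple):
    """One SCCS delta (hypothetical parsed form, for illustration)."""
    path: str
    rev: str
    when: datetime  # local time as recorded by SCCS, with no timezone


def revisions_at_putback(
    deltas: Sequence[Delta],
    path: str,
    putback_utc: datetime,
    offsets_hours: Sequence[int] = (-8, -7, 0),  # assumed candidate UTC offsets
) -> Set[str]:
    """Return the revisions of `path` that could be the one that was put back.

    For each candidate offset, local time is converted to UTC (local - offset)
    and the newest delta at or before the putback is picked. A single-element
    result means every offset agrees; more than one means the match is
    ambiguous and needs another hint (username, comment, bug id).
    """
    picks: Set[str] = set()
    for hours in offsets_hours:
        offset = timedelta(hours=hours)
        earlier = [
            d for d in deltas
            if d.path == path and d.when - offset <= putback_utc
        ]
        if earlier:
            picks.add(max(earlier, key=lambda d: d.when).rev)
    return picks
```

For example, with deltas 1.1 at 09:00 and 1.2 at 12:00 local time and a putback at 18:00 UTC, the offsets -8 and -7 both pick 1.1, while offset 0 picks 1.2, so the default offsets return both revisions and flag the match as ambiguous.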

SCCS Revision Tree: The SCCS revision tree for each file can be a fairly complex graph, depending on how many merges happened to the file. You might be able to just use the top level SCCS revision number, but the SCCS comments of the other revisions will contain important information to preserve.

Deleted files: Teamware deletes files by moving them from where they are to a top level deleted_files directory. So they don't really get deleted, just renamed. However, a common practice with many teams is to purge the deleted_files directory once a product reaches a major milestone. So some of the files may actually be permanently gone, and this needs to be taken into consideration. At some point, you can't recreate the complete source tree if this has happened.

Empty comments: Empty SCCS revision comments, and empty putback comments would also create problems if you planned on using these comments or cookies of information in these comments to connect up the files to the putback events (e.g. bug id numbers or change request numbers). So more specific SCCS revision comments and more specific putback comments might make this job easier.


We considered multiple approaches to doing a source revision history conversion. You could come at it from the putback events, using the history files to identify 'real' changesets, and hope any deleted files are still around. What you'd use as Revision 1 of the files might be a little tricky. Or you could try to just look at the SCCS revisions, and figure out via timestamps, usernames, and perhaps SCCS comments, which files were changed as a group. Or a combination of both. Or you could try to come at it from a time perspective, e.g. all the changes to get you from April 1, 2004 to May 1, 2004.

The simple approach of one changeset per SCCS revision isn't really that simple because Mercurial changesets have an order to them. To do it right you'd need to view the Teamware workspace as a large graph of file nodes, with small sub-graphs of SCCS revisions. Then pick a time T to start Revision 1 of the Mercurial sources, find all the file instances at time T, add these files as a changeset to Mercurial, then repeat that for T+1. Or perhaps T+N where N is selected based on sampling timestamps after T for a quiet time (to avoid picking a time that might split up file changes that happened in a group). Just some wild ideas.
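The "quiet time" idea can be sketched in a few lines of Python. This is only an illustration, assuming you have already collected the timestamps of all SCCS revisions. The function name and the choice of a fixed quiet threshold are made up for this example. It splits the timestamps into bursts wherever the gap between neighbors is at least the quiet period, and each burst would then be a candidate changeset boundary.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def group_by_quiet_gaps(
    times: Sequence[datetime], quiet: timedelta
) -> List[List[datetime]]:
    """Split timestamps into bursts separated by gaps of at least `quiet`."""
    groups: List[List[datetime]] = []
    for t in sorted(times):
        if groups and t - groups[-1][-1] < quiet:
            # Still inside the current burst of activity.
            groups[-1].append(t)
        else:
            # Gap was long enough (or this is the first timestamp): new burst.
            groups.append([t])
    return groups
```

Picking the threshold is the judgment call: too short and one logical change gets split, too long and unrelated changes get lumped into one changeset.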

But it just feels wrong: there's no putback data, the files won't be bunched right, and the resulting repository would contain inaccurate source state in any of these converted changesets.

We never fully explored all the approaches because once the legal constraint came in, there seemed no need to pursue it. It's an interesting and complicated problem, but ultimately one we decided we didn't have to solve.


So the bottom line is that whatever can be created would likely have questionable data if someone asked to have the sources per a particular date or if they wanted to know the state of the entire source tree when a given change was made... Hard to ever be perfect here, and not being perfect could send a few engineers down some deep rabbit holes. :^(

The old history isn't being destroyed, it's just being left in the old Teamware workspaces. So we will still have access to it, just not via Mercurial repositories. As time passes, we'll build up new and better history in our Mercurial repositories, and maybe by the time I retire, it won't matter much. ;^)

Update: Some Ideas

Jesse's conversion script turns out to be a possibility. He documents the problems with it, but it's certainly a step toward something.

With the OpenJDK6 repositories, which were originally in TeamWare, we had two ways to gain some history. With each build promotion while in TeamWare, we saved a source bundle, so we had a raw snapshot of the source for each build. Using these as working set files allowed us to start rev0 with the Build 0 source bundle and then, for each build promoted after that, repeat these steps:

  1. Delete the working set files
  2. Copy in new working set files from the source bundle for Build promotion N.
  3. Run: hg addremove ; hg commit --message "Build Promotion N" ; hg tag BuildN

This provides a coarse-grained history, not great, but it could be very valuable for narrowing down when a change came in. Adding more specific history required patch files that you would apply in between, but you needed the patches, you needed to know which build a change went into, and most critically, you needed to know the order of the patches or changes in case two fixes modified the same files. Ultimately, it worked for OpenJDK6, to a degree. The Build Promotion revisions were accurate, but sometimes getting the others accurate was hard to do. And unfortunately, all the changesets in OpenJDK6 look like they were created by me, which is right in a way, but I really wasn't responsible for many of the patches. So the authors, dates, and SCCS comments were not included, but the bugids were.
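The three-step loop per build promotion could be scripted roughly like this. This is a Python sketch and not the script that was actually used. The function names are hypothetical, and it assumes each source bundle has already been unpacked into its own directory. It also keeps .hgtags along with .hg when clearing the working set, on the assumption that deleting the tag file would drop the earlier build tags at the next commit.

```python
import shutil
import subprocess
from pathlib import Path

# Repository metadata and the tag file are not treated as working set files.
KEEP = {".hg", ".hgtags"}


def hg_commands(build: int):
    """Step 3: the hg commands for one build promotion."""
    return [
        ["hg", "addremove"],
        ["hg", "commit", "--message", f"Build Promotion {build}"],
        ["hg", "tag", f"Build{build}"],
    ]


def replace_working_set(repo: Path, bundle: Path) -> None:
    """Steps 1 and 2: delete the working set, then copy in the bundle."""
    for entry in list(repo.iterdir()):
        if entry.name in KEEP:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    shutil.copytree(bundle, repo, dirs_exist_ok=True)  # needs Python 3.8+


def import_build(repo: Path, bundle: Path, build: int, run=subprocess.run) -> None:
    """Import one unpacked source bundle as a tagged changeset."""
    replace_working_set(repo, bundle)
    for argv in hg_commands(build):
        run(argv, cwd=repo, check=True)
```

Calling import_build once per build, in promotion order, reproduces the loop. Note that with check=True the sketch stops if any hg command exits non-zero, which would include a commit with nothing to commit if two consecutive bundles were identical.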

Anyway, just thought I would update this rather old posting.


