Source Code Archaeology

As should be obvious from my recurring blog topic of coding style, I care a lot about code readability and style.

One criticism I've heard frequently is that you shouldn't reformat code to be style-compliant, since that introduces lots of diffs into the version history of a file. Oh no, diffs! What horror! Think of the children! Somebody, The children!

I have two responses to that line of thinking:

  • First, you clearly don't agree with me on the importance of consistent formatting and readability. Perhaps I should just grow up, but when I see code like this I cringe:
       void myfunc(int a_foo, int a_bar)
         foo = a_foo;
         myfunc (foo);
    Source formatting typically doesn't rename variables (to wipe out the underscores ( _ ) from variable names) but it certainly could put the opening brace back up where it belongs, as well as remove the space between the method name and the opening parenthesis, etc.

    Reading the above code is the common operation. Tracking down line history is the occasional operation.

  • Second, and this is the more important point: You're probably not familiar with, or at least well versed in, source code archaeology.

Source code archaeology is very similar to the archaeology you're familiar with: you dig down, layer after layer, uncovering artifacts from a particular time period. You essentially get to see the state of affairs at one particular point in time. Then historical events occur and layers upon layers are deposited on top.

With source code archaeology, you can dig back into the history of a file. Let's pretend we agreed on a particular coding style for the NetBeans source code base, and I went and ran a filter over the entire code base, reformatted it, and checked the code in with the comment "Source code reformatted to spec using codestyle ide/formatsettings.xml". Reformatting the entire source base (yes, I know, blasphemy!) is usually criticized as bad because you can't find out why a particular line was added or changed, since now the version information for the line points to the reformatting delta. But but but - remember the layers! (Insert mandatory Shrek reference here...) It's easy to go to the layer below the reformat, and see what the version of the file looked like prior to the reformatting delta. So, once you've done a giant reformat of the source base, looking for the origin of a particular source file line may involve multiple operations. First you determine what delta corresponds to the current version of the line. If you see that it was for a reformat, you peek at the version below it, locate the same line, and look at its original comment.

Now, you might argue that that's an extra step. You would be right - but think about it, what's more common, reading code, or reading the checkin comment for a particular source line? As long as you can get the team to use the same source formatting conventions (perhaps even performed or enforced automatically at checkin) you'll only need to do this global operation once.

Tools should make this job easy. It's not all that hard on the command line either. You may have heard of the "cvs blame" command.

tor:1% cvs blame
Unknown command: `blame'

CVS commands are:
        add          Add a new file/directory to the repository
Ok, so that's not really the command, but it's the common name for the cvs annotate method! It lets you figure out who to blame for the ridiculously stupid bug you just found in the code. The real name is cvs annotate.

With cvs annotate you can get annotations for the current version. Let's say that you discovered that a reformat happened in revision 1.43 of file To see what the file looked like before that, simply do

tor:2% cvs annotate -r 1.42

1.1          (tor      24-Mar-03): /\*
1.1          (tor      24-Mar-03):  \* Copyright 2003 Sun Microsystems, Inc. All rights reserved.
1.1          (tor      24-Mar-03):  \* SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
1.1          (tor      24-Mar-03):  \*/
1.1          (tor      24-Mar-03):
1.1          (tor      24-Mar-03): package;
1.1          (tor      24-Mar-03):
1.1          (tor      24-Mar-03): import java.awt.geom.Rectangle2D;

As I mentioned, tools should make this easy. I'm not sure what the current state of this is in NetBeans' new CVS support, or in other IDEs. I had a strong need for this back when I worked with SCCS and Teamware on Sun's C++ development environment. I built a prototype for doing this kind of exploration; I think it's still relevant today, seven years later, so I'll include a couple of screenshots of it. Consider this a wishlist for $favorite_ide, in my case NetBeans.

The idea is that you get to open your source file in a version-annotated view, where all the source lines are colored according to some source code metrics. For example, really old lines have a dark background, and recent lines have a bright red background. (To show which lines belong to the same delta, look for the < and > signs in the left margin showing contiguous blocks, since code about the same age can have roughly the same color but should be logically distinguishable.

The point is that you want to be able to click on a line to be able to see its checkin comment - and to quickly cycle through previous versions of the file (and diff between versions).

Another possible coloration assigns a different color to each user who has touched the file. The overview gives you an indication of who's most responsible for this source file.

There are some other minor things here too, but I think I've made my point: Source Code Archaeology - it's important. It needs better tools support. But it's possible today, and should help you make the transition to a properly formatted source code base.


Move that curly up, and I'll reformat the code to put it again down where it Belongs..=) Seriously speaking, what about a coder having differentn coding styles?

Posted by Lorenzo on November 16, 2005 at 07:05 PM PST #

Lorenzo, on the different coding styles I recommend to read Style is Substance by Ken Arnold. To quote: "freedom for formatting style has proven extremely expensive, and does not deliver much value for cost".

Tor, you should look at the new CVS support in NetBeans. It does make it easier to dig into the code, although it does not have the view that you have in those screen shots. This should definitely be added in the next version! Maybe you can contribute it, at least in the form of RFE and design document?

As for <code>cvs blame</code>, I'm sure you know <cvs>cvs whyTheHackIsThisHere</code> :-) How about running the format automatically before cvs add to avoid the ugly code being in cvs in the first place?

Posted by Pavel Buzek on November 16, 2005 at 11:10 PM PST #

I have to agree with Ken Arnold. The basic idea he puts forward is that we need to consider whitespace and style as part of the language design rather than leaving that up for each person to decide. This would insure that all code by all team members is similar ( at least in regards to style ). A modern IDE should have little trouble setting and helping to format any code that requires such, we are already almost there as it is now....

Posted by VHF on November 17, 2005 at 01:11 AM PST #

> ... think about it, what's more common, reading code, or reading the checkin comment for a particular source line? This really doesn’t have much to do with what's more common, or with CVS in particular. It has to do with the software lifecycle; Software quality assurance, configuration management and in the end risk management. Presumably if you are fixing someone else’s code, it’s baselined code. There are more reasons not to reformat than to reformat it. It is often the case when tracking defects and compiling SQA metrics that several metrics are tied to the number of lines of code changed between updates/releases/baselines. Software that counts the changes to compile the metric doesn’t know that you changed an entire file just to reformat it, and counts every changed line against the software team’s metrics. A change control board wanting to review code changes prior to accepting a change to a baseline can’t be expected to do the archeology to figure out what really changed. Automated tools document what’s changed between the baseline and the proposed change, but not if there are intermediate steps where the entire file has been reformatted. Unless you can prove that the reformatting module/program has zero defects and therefore no chance whatsoever of breaking code that works, you run the risk of introducing bugs along with the intended fix. CVS in particular has a very nasty problem with bulk code reformatting in a distributed team environment, namely conflicts. You can never be certain that someone else isn’t fixing something else in the same file unless you are going to force locking ala SCCS. If you commit a reformat, you will force the other person to waste time resolving white space conflicts. I am very pedantic about coding style, but I have also been coding long enough to have gained quite a bit of flexibility in reading other people’s code.

Posted by Me on November 22, 2005 at 06:29 PM PST #

Ug. Paragraph breaks were there when I previewed, but gone when I posted. Sorry, hope it's still readable.

Posted by Me on November 22, 2005 at 06:31 PM PST #

If your organization depends on metrics like that, where there's an incentive to not touch any lines (e.g. avoid reformatting the javadoc comment if you can help it) then certainly this approach is not for you. I've just never had those kinds of metrics applied in any groups I've worked. Our QA metrics tend to include bug counts per product area, fix versus find rates, regression rates, etc.

Posted by Tor Norbye on November 23, 2005 at 01:49 PM PST #

One criticism I've heard frequently is that you shouldn't reformat code to be style-compliant, since that introduces lots of diffs into the version history of a file.
I'm also a friend of reformating the code often. In Eclipse there's a feature called "ignore whitespace" in the compare view. Since the code formating basically inserts/deletes spaces/linebreaks, the too-many-diffs argument is pretty much disproved by this feature. I don't know about other IDEs though...

Posted by Paul on March 30, 2006 at 09:16 PM PST #

Post a Comment:
Comments are closed for this entry.

Tor Norbye


« June 2016