Wednesday Feb 18, 2009

Continuous Integration

I am giving a presentation at the Free Test 2009 conference in Trondheim in March, so I am looking into ways to visualize, or even prove, why continuous integration is such a good idea.

These visualizations are based on three dimensions: line, time and the person making the change.  The (horizontal) x-dimension is time. The (vertical) y-dimension is line number. Imagine all lines in an SCM concatenated together, e.g. with "find . -type f -exec cat {} + | nl". With this transformation, every changed line gets a single line number. Each time a person saves what she has changed in her editor, there is a vertical line (possibly with holes) that shows the fingerprint of which lines in the complete SCM she changed in that save. The last dimension, the person, I have tried to visualize with blue and red color. I apologise to the color-blind, but adding a z-dimension made it difficult to see overlaps, and so did using different line strokes.
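The line-number transformation can be sketched in a few lines of Python. This is only an illustration: the function names, the example file names and the choice to concatenate files in sorted name order are my assumptions ("find" itself does not guarantee any particular order).

```python
def global_offsets(line_counts):
    """Map each file name to the global number of its first line (1-based).

    Assumption: files are concatenated in sorted name order.
    """
    offsets = {}
    next_line = 1
    for name in sorted(line_counts):
        offsets[name] = next_line
        next_line += line_counts[name]
    return offsets


def fingerprint(changes, line_counts):
    """Turn (file name, local line number) pairs from one save into
    sorted global line numbers, i.e. the y-values of one vertical line."""
    offsets = global_offsets(line_counts)
    return sorted(offsets[name] + local - 1 for name, local in changes)


# Hypothetical repository with two files of 3 and 5 lines.
line_counts = {"a.py": 3, "b.py": 5}
print(fingerprint([("b.py", 2), ("a.py", 1)], line_counts))  # [1, 5]
```

One save then becomes one column of points in the figure: the x-value is the time of the save and the y-values are the numbers returned by fingerprint.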

I have two visualizations so far. The first came after a discussion with Terje Røsten, where we agreed that the rate of change per unit of time is constant, and that therefore the possible overlap between two people's changes gets bigger the less often you integrate. This translates to "there is more work involved in fixing the merge".

In this figure, the blue and the red person have both integrated a patch that touches some of the same lines.  When the patches are big, the average overlap can be many lines (a), and when the patches are small, the overlap can only be a few lines (b).
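To put rough numbers on (a) and (b), here is a small sketch under a strong simplifying assumption that is mine, not something the figure shows: each patch is a uniformly random set of lines. Real patches are clustered in a few files, so treat the numbers as an illustration only. Under that assumption the mean overlap of two patches is (size of blue patch) * (size of red patch) / (total lines), and the simulation is just a sanity check of that value.

```python
import random
from fractions import Fraction


def expected_overlap(size_blue, size_red, total_lines):
    """Exact mean overlap, ASSUMING both patches are uniformly random
    sets of lines drawn from the same repository."""
    return Fraction(size_blue * size_red, total_lines)


def simulated_overlap(size_blue, size_red, total_lines, trials=2000, seed=1):
    """Estimate the same mean by drawing random patches."""
    rng = random.Random(seed)
    lines = range(total_lines)
    shared = 0
    for _ in range(trials):
        blue = set(rng.sample(lines, size_blue))
        red = rng.sample(lines, size_red)
        shared += sum(1 for line in red if line in blue)
    return shared / trials
```

With 1000 lines in total, two patches of 100 lines each overlap in 10 lines on average, while two patches of 10 lines each overlap in 0.1 lines on average. In this model, making the patches ten times smaller makes the overlap a hundred times smaller.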

In the next figure I try to visualize the probability of a collision occurring. Here I have drawn a single integration point, i1 (think commit, push and pull). At this point we have two conflicts, c1 and c2.

In the next figure I have the exact same sequence of changes from both programmers, but much more frequent integrations. Here there are no conflicts: each integration delivers the other person's changes before an overlapping edit is made, so the conflicts are resolved before they can happen.
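The difference between one integration point and many can be sketched in code. The model below is a simplification of my own: a conflict is a line that both people changed between the same two integration points, and a line changed in an earlier interval counts as already integrated. The save times and line numbers are made up for the example.

```python
from bisect import bisect_left


def find_conflicts(blue_saves, red_saves, integration_times):
    """Return (interval, line) pairs where both people changed the same
    line between two consecutive integration points.

    Each save is a (time, line) pair; integration_times must be sorted.
    """
    def by_interval(saves):
        touched = {}
        for time, line in saves:
            # Interval 0 ends at the first integration point, and so on.
            interval = bisect_left(integration_times, time)
            touched.setdefault(interval, set()).add(line)
        return touched

    blue = by_interval(blue_saves)
    red = by_interval(red_saves)
    conflicts = []
    for interval in sorted(blue.keys() & red.keys()):
        for line in sorted(blue[interval] & red[interval]):
            conflicts.append((interval, line))
    return conflicts


# Hypothetical saves: both people touch lines 10 and 20, at different times.
blue_saves = [(1, 10), (5, 20)]
red_saves = [(3, 20), (7, 10)]

print(find_conflicts(blue_saves, red_saves, [8]))           # [(0, 10), (0, 20)]
print(find_conflicts(blue_saves, red_saves, [2, 4, 6, 8]))  # []
```

The same four saves give two conflicts with a single integration at time 8, and none when there is an integration after every save.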

OK, so now I have some visualizations showing a couple of reasons why continuous integration is good for you, but I am still looking for a mathematical formula. I think it has to include something like this:

P(collision) = ((avg size of patches) / (total size of code in SCM)) * (number of committers) * (number of commits per integration)

The "problem" with this formula is that when you increase the number of integrations, you decrease the average size of each patch, so P(collision) for each integration point decreases. But the number of integration points increases by the same factor, so the net result is the same.
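The cancellation is easy to check by plugging numbers into the formula. In this sketch the function names and the example figures (400 changed lines per period, 100000 lines in the SCM, 2 committers) are made up, and I assume one commit per committer per integration so that the patch size is simply the lines per period divided by the number of integrations.

```python
from fractions import Fraction


def p_collision(avg_patch_size, total_size, committers, commits_per_integration):
    """The formula above, evaluated exactly with fractions."""
    return (Fraction(avg_patch_size) / total_size
            * committers * commits_per_integration)


def summed_over_period(lines_per_period, total_size, committers, integrations):
    """Return (value per integration point, sum over all points).

    ASSUMPTION: the same number of lines is changed per period no matter
    how often you integrate, with one commit per integration.
    """
    avg_patch = Fraction(lines_per_period, integrations)
    per_point = p_collision(avg_patch, total_size, committers, 1)
    return per_point, per_point * integrations


for integrations in (1, 4, 40):
    print(integrations, summed_over_period(400, 100000, 2, integrations))
```

The value per integration point shrinks as the number of integrations grows, but the sum over the period stays the same for 1, 4 and 40 integrations. So the formula as written, being linear in patch size, cannot show any benefit from integrating more often.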

Are there any mathematicians or statisticians out there who could shed some light on this?