Be Careful What You Measure, It Might Improve

There's a saying: That which is measured improves; that which is ignored degrades. But be careful what you measure, because it just might improve.

Collecting and analyzing metrics is not something to be undertaken without the proper preparation. When you apply metrics to any activity, the metric may "improve" at the expense of other activities; the net effect might be negative.

A long time ago, I was on a large software team developing a new telecommunications product. The project was supposed to be a straightforward port of an existing code base to a new hardware platform and OS. The differences between the old and new hardware and OS were greatly underestimated, and the result was a huge number of bugs.

Management was concerned that (A) product quality was very low, with a high bug count, (B) testing was finding new bugs too slowly, and (C) development was fixing bugs too slowly. So they decided to employ metrics. They decided to measure:

  • The number of bug reports filed per tester.
  • The number of bugs fixed per developer.

The metrics were tracked weekly, and a sorted list showing how each employee was doing against the metrics was distributed to everyone. When the product shipped, bonuses would be tied to individual performance against the metrics.

At first glance, this seems like a reasonable approach -- it encouraged testers to find lots of bugs and developers to fix lots of bugs, and it encouraged both to work quickly so the product would ship and bonuses would be paid. The result, however, was a disaster.

Testers began filing bug reports by the truckload. Every combination and permutation of conditions resulted in a new bug report: The wrong ring tone is used when connected to a BRI interface. The wrong ring tone is used when connected to a Tri-BRI interface. The wrong ring tone is used when connected to a PRI interface. The ring tone had nothing to do with the network interface; it was one software bug, but it was reported as three separate bug reports, thus boosting the individual tester's metrics.

Developers were no less shameless. They sought out the bugs that were quick and easy to fix in order to drive up their metrics, while critical, high-priority bugs that would take a long time to diagnose and debug were avoided like the plague. The quality of code changes also dropped, and not just because of haste: if a developer fixed one bug and introduced two others, the tester got to file two more bug reports, and the developer got to fix two more bugs, driving up everyone's metrics. I honestly believe no one was intentionally introducing new bugs, but the metrics discouraged developers from regression testing.

In the end, the metrics improved -- the rate of bug reports being filed went up, and the rate of bug fixes increased. But the goal was not achieved; overall, the quality of the product actually decreased. After a few weeks, management realized what was happening and decided to measure the total number of open bugs instead of maintaining metrics on a per-person basis.

When designing a system of metrics (and it is a design process), one has to consider the goal, and identify the measurable criteria directly related to that goal. In this example, the goal was to improve product quality quickly. Unfortunately, the selected metrics were only second-order criteria and were not directly related to the goal. The metrics improved while the goal slipped away. In other words, what you measure will improve, so be careful what you measure.

Copyright 2007, Robert J. Hueston. All rights reserved.

Devising good metrics is indeed trickier than it appears at first glance. Bug metrics seem to be a particularly thorny area, and your example is a particularly egregious case. I've typically seen release criteria expressed as so many bugs (i.e., "defects") of such-and-such priority levels.

The problem with all of these metrics is that they don't really measure the experience of the person who's actually *using* the product -- they measure specific defects found by someone who's actively trying to find defects. This is of some value; it's a good model of how someone who's trying to break into a system operates. However, a basically positively-disposed individual who's just trying to use the system isn't normally going to operate in this way unless they're looking for specific issues. Any scheme that's based on "all defects are equal" or "all defects in a certain broad equivalence class are equal" is going to ill serve the user base.

I've thought about a different kind of quality measurement that's based on impact to functionality (which is what a user sees) rather than specific defects (which is what developers see). In this model, the total functionality of a product is expressed as a tree, where each branch has a certain weight and all the weights add up to 1 (for simplicity's sake).

Each particular function, of course, is composed of sub-functions, which is where the tree comes in. A defect cuts off or occludes some part of the functionality tree. The quality of the product can then be expressed as the sum of the branches that can be reached from the root. This is found through positive testing, where each test case corresponds to a branch on the tree. As tests pass, more branches are found and their weights added to the functionality metric.

Ideally, there would be two metrics: a positive functionality metric representing functionality that has been validated and a negative functionality metric representing functionality that has failed. The remainder of the functionality simply hasn't been tested.
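The model above could be sketched in code along these lines. This is only an illustration of the idea, not an implementation from any real project: the node names, the weights, and the rule that a failed node occludes its entire subtree are my own assumptions.

```python
# Hypothetical sketch of the functionality-tree quality model:
# each node carries a weight (all weights across the tree sum to 1),
# and a failed node occludes everything below it.

class FunctionNode:
    def __init__(self, name, weight, children=None):
        self.name = name
        self.weight = weight            # fraction of total product value
        self.children = children or []
        self.status = "untested"        # "passed", "failed", or "untested"

def metrics(node, reachable=True):
    """Return (validated, failed) weight sums for the subtree.

    A failed or untested node leaves its children unreachable, so their
    weights land in neither metric -- they remain "not yet tested".
    """
    if not reachable:
        return (0.0, 0.0)
    passed = node.weight if node.status == "passed" else 0.0
    failed = node.weight if node.status == "failed" else 0.0
    child_reachable = node.status == "passed"
    total_p, total_f = passed, failed
    for child in node.children:
        p, f = metrics(child, child_reachable)
        total_p += p
        total_f += f
    return (total_p, total_f)

# Invented example tree; weights sum to 1.
root = FunctionNode("product", 0.1, [
    FunctionNode("system management", 0.2, [
        FunctionNode("network screen", 0.3),
    ]),
    FunctionNode("call handling", 0.4),
])
root.status = "passed"
root.children[0].status = "failed"    # occludes the network screen below it
root.children[1].status = "passed"

validated, failed = metrics(root)
print(validated, failed)  # 0.5 validated, 0.2 failed; 0.3 unreachable/untested
```

Note how the one failing node counts 0.2 against the product but also hides the 0.3-weight branch beneath it, which stays out of both metrics until the blocking defect is fixed -- exactly the "occlusion" behavior described above.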

I've frequently been on projects where, partway through, the rate of bugs filed increases dramatically. At this point, the common response is to panic -- EVERYTHING IS FALLING APART!!! HELP!!! There might be several waves of this through the life of the project; typically each internal release results in a new wave of bugs being filed and a fresh wave of management panic. The reality is not as dire.

What's actually happening, of course, is that early on in a project there's so little functionality available (and even less that actually works) that there simply isn't very much to file bugs against, so the bug count is low. The few bugs that are present are cutting off a lot of branches with very high weights, in my model. The first big wave of bugs, in fact, corresponds to the first release that's actually usable enough to test -- the branches close to the root have been repaired, and we're starting to see branches farther away that previously could not be reached.

Personally, I'd be more worried if I didn't start to see a big influx of bugs at a reasonable point in the project's lifetime. When the wave hits, it helps to maintain the discipline of fixing the bugs correctly in order of importance and refrain from desperate measures -- this is a normal part of the project lifecycle. There might be several waves before things settle down, and it's important to maintain this discipline despite scary numbers of defects. With a functionality tree model, the goal would be to achieve steady progress on the functionality metric and avoid regressions (previously working functionality that breaks). In a traditional model, it means that the project manager has to take a hands-on approach, personally assessing the individual bugs, and shielding the development team from management.

In this kind of functionality model, developers would have incentives for fixing bugs that block a lot of functionality rather than a lot of very easy bugs that have very little effect on functionality, since the latter simply wouldn't improve the functionality metric by very much. Testers would prefer to find problems that have the greatest impact on functionality, for much the same reason. The likely weakness of this scheme is that it wouldn't effectively model minor rough edges that individually or even in sum have little effect on the functionality of the product but taken as a whole give an impression of poor fit and finish.

This kind of metric is probably easier to explain than it is to devise in practice. With good functional and test specs, it could be done by assigning a weight to each individual specification. It might not have to be expressed in the form of a tree, either; each function could simply have a weight. However, the tree model better reflects the way a product is going to be used -- for example, a network management screen may have a lot of subcomponents, and the network management screen is itself part of an overall system management function that in turn is part of the product as a whole.

Posted by Robert Krawitz on January 18, 2007 at 03:52 AM EST #
