### Confidence and Deviation

#### By eoin on Sep 04, 2009

The seventh circle of benchmarking hell is reserved for the analyst who quotes a single number when reporting the performance of a system, and supplies no other information. In the outer circles of this particular hell the analyst might provide details such as how many times the benchmark was run, what units the result is in, complete details on the configuration of the system and benchmark etc.

For marketing purposes it may be desirable to have a pithy result like "15% faster than the competition!" However, if we genuinely wish to understand one or more sets of performance data, there are lots of statistical tools that can help.

In this entry I'm going to look at two related concepts, and explain the important differences between them: standard deviation and confidence intervals.

Although in theory computers are deterministic - they return the same output for a given input - in practice the inputs are so numerous and varied that we cannot completely control them all. For example, consider a disk I/O benchmark - a sudden noise or an increase in vibration due to a fan starting can cause a change in performance.

In order to account for the range of possible results we typically perform a benchmark multiple times, while controlling external disturbances as far as possible. At the end we will have a set of numbers. It is not feasible to quote the entire set of numbers to describe the performance of the system. Usually we use the average value to describe the performance. However, this doesn't take any account of the range of individual results. This is where standard deviation makes its appearance. It is estimated using a simple formula:
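
$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

(where $x_i$ are the $n$ individual results and $\bar{x}$ is their mean), but don't worry about the exact details for now.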

It is essentially a measure of dispersion or of how spread out the numbers are, while the mean is a measure of the central point of the range.
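
As a rough illustration, here is a minimal Python sketch that estimates both quantities from a set of benchmark runs. The throughput figures are made up for the example, and `statistics.stdev` is the sample standard deviation (the version with the n - 1 divisor), which is the usual choice when the runs are only a sample of the system's behaviour.

```python
import statistics

# Hypothetical throughput results (MB/s) from ten runs of the same benchmark.
runs = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.0, 99.3, 101.8]

mean = statistics.mean(runs)    # estimate of the central point
stdev = statistics.stdev(runs)  # estimate of the spread (n - 1 divisor)

print(f"mean  = {mean:.2f} MB/s")
print(f"stdev = {stdev:.2f} MB/s")
```

Reporting both numbers, along with the number of runs, already says far more than a single figure.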

The mean and standard deviation are intrinsic properties of a system. It is usually not possible to measure them directly, but we can estimate them by making a number of measurements. For example: do people prefer the colour blue to the colour red? We can't ask everyone in the world, so we estimate the answer by asking a sample of people.
It is important to note that our estimate of the standard deviation does not, by itself, tell us how accurate our estimate of the mean is. A large (estimated) standard deviation does not imply that our estimate of the mean is bad, and conversely a small (estimated) standard deviation does not imply that our estimate of the mean is highly accurate. (A large standard deviation is bad only in so far as it makes the performance of a system less predictable.) These two measures are unchanging properties of the system that we are merely estimating. But we don't know how good our estimates are.

This is where confidence intervals come in. We typically calculate the confidence interval for the mean, but it can be done for the estimate of the standard deviation as well. It provides a description of the data in the form "we are 95% certain that the mean of the population is in the range X to Y." This is a statement about the accuracy of our estimates. The smaller the range of X to Y, the more accurate the estimate. We can generally make this range smaller if required by running more benchmarks. This is quite different from the standard deviation (or mean), which has a fixed value that we can only estimate.
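
To make the "X to Y" range concrete, here is a small Python sketch of a confidence interval for the mean. It rests on assumptions that are mine rather than anything prescribed above: the function name `mean_confidence_interval` is hypothetical, the data is made up, and the interval uses a normal approximation. With only a handful of runs, a Student's t critical value (which a stats package will supply) gives a somewhat wider and more honest interval.

```python
import math
import statistics

def mean_confidence_interval(samples, confidence=0.95):
    """Return (low, high) for the mean, using a normal approximation."""
    n = len(samples)
    mean = statistics.mean(samples)
    # Standard error of the mean: the spread divided by sqrt(number of runs).
    std_err = statistics.stdev(samples) / math.sqrt(n)
    # Two-sided critical value from the standard normal (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical throughput results (MB/s) from ten benchmark runs.
runs = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.0, 99.3, 101.8]

low, high = mean_confidence_interval(runs)
print(f"95% confidence interval for the mean: {low:.2f} to {high:.2f} MB/s")
```

Asking for more confidence (say 99%) widens the range, while more runs narrow it.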

To summarize:

- Standard deviation is a measure of the intrinsic variability of a system.
- The 95% confidence interval is a measure of the quality of our estimate of the mean (or standard deviation).
- We can reduce the standard deviation by improving the system itself.
- We can reduce the size of the confidence interval by running the benchmark lots of times.
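
The last two points can be seen in a quick simulation. This is only a sketch under stated assumptions: the "system" is a made-up normal distribution with a true mean of 100 and a true standard deviation of 5, the helper `summarize` is hypothetical, and the interval uses the normal-approximation factor 1.96.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
TRUE_MEAN, TRUE_STDEV = 100.0, 5.0  # hypothetical intrinsic properties

def summarize(n):
    """Simulate n benchmark runs; return (stdev, 95% CI half-width)."""
    runs = [rng.gauss(TRUE_MEAN, TRUE_STDEV) for _ in range(n)]
    stdev = statistics.stdev(runs)
    half_width = 1.96 * stdev / math.sqrt(n)  # normal approximation
    return stdev, half_width

results = {n: summarize(n) for n in (10, 100, 1000)}
for n, (stdev, half_width) in results.items():
    print(f"n={n:5d}  stdev={stdev:5.2f}  CI half-width={half_width:5.2f}")
```

The estimated standard deviation should hover around the true value however many runs we make, while the confidence interval keeps shrinking, roughly in proportion to one over the square root of the number of runs.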

Finally, I am not going to describe the mechanics of calculating the 95% confidence interval. Wikipedia describes it quite well, and most stats packages will do the heavy lifting for you; I like to use R.

I hope this helps you understand where and when standard deviation and confidence intervals can be used, and keeps you out of benchmark analysis hell. There are some caveats - for instance, the assumption of normality - but I will leave those for another day.