<p>Eoin Lawless: Flotsam and Jetsam (https://blogs.oracle.com/Eoin/feed/entries/atom)</p>
<p>eoin, 2009-09-04 (https://blogs.oracle.com/Eoin/entry/confidence_and_deviation)</p>
<h1>Confidence and Deviation</h1>
<p>
The seventh circle of benchmarking hell is reserved for the analyst
who quotes a single number when reporting the performance of a system,
and supplies no other information. In the outer circles of this
particular hell the analyst might provide details such as how many
times the benchmark was run, what units the result is in, complete
details on the configuration of the system and benchmark etc.
<p>
For marketing purposes it may be desirable to have a pithy
result like "15% faster than the competition!" However, if we
have a genuine wish to understand one or more sets of performance
data, then there are lots of statistical tools that can help.
<p>
In this entry I'm going to look at two related concepts, and
explain the important differences between them: standard deviation
and confidence intervals.
<p>
Although in theory computers are deterministic - return the same
output for a given input - in practice the inputs are so numerous
and varied that we cannot completely control them all. For example,
consider a disk IO benchmark - a sudden noise or an increase in vibration
due to a fan starting can cause a change in performance.
<p>
In order to account for the range of possible results we typically
perform a benchmark multiple times, while controlling as far as possible
external disturbances. At the end we will have a set of numbers. It is
not feasible to quote the entire set of numbers to describe the performance
of the system. Usually we use the average value to describe the
performance. However this doesn't take any account of the range of
individual results. This is where standard deviation
makes its appearance. It is estimated using a <a href="http://en.wikipedia.org/wiki/Standard_deviation">simple formula:</a>
<quote>
<br><br>
<img src="http://upload.wikimedia.org/math/8/5/3/853c79575bd7e5a9fdbc480844b76337.png">
</quote>
but don't worry about the exact details for now.
<p>
It is essentially a measure of dispersion or of how spread out the numbers are, while
the mean is a measure of the central point of the range.
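<p>
For reference, in case the formula image above does not load: a commonly used estimator, and the one that R's sd() function computes, is the sample standard deviation. This block assumes that is the form the image shows. Here x_1 to x_n are the individual benchmark results and x-bar is their average.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}
```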
<p>
The mean and standard deviation are intrinsic properties of a system. It
is usually not possible to directly measure them, but we can estimate
them by making a number of measurements. For example: do people prefer
the colour blue to the colour red? We can't ask everyone in the world,
so we estimate the answer by asking a <emph>sample</emph> of the population.
In the same way we estimate the mean and standard deviation of a
system's performance
by running a benchmark a number of times, and using the average value
to estimate the mean, and the <a href="http://en.wikipedia.org/wiki/Standard_deviation">formula above</a>
to estimate the standard deviation.
<p>
It is important to note that our estimate of the standard deviation
does not tell us anything about how accurate our estimate of the mean is.
A large (estimated) standard deviation does not imply that our estimate of the
mean is bad, and conversely a small (estimated) standard deviation does not imply
that our estimate of the mean is highly accurate.
(A large standard deviation is bad only in so far as it makes the performance
of a system less predictable.)
These two measures are unchanging properties of the system that we are merely
estimating. But we don't know how good our estimates are.
<p>
This is where confidence intervals come in. We typically calculate
the confidence interval for the mean but it can be done for
the estimate of the standard deviation as well. It provides a description of
the data in the form
"we are 95% certain that the mean of the population is in the range X to Y."
This is a statement about the accuracy of our estimates. The smaller the range
of X to Y, the more accurate. We can generally make this range smaller if required
by running more benchmarks. This is quite different to the standard deviation
(or mean)
which has a fixed value that we <emph>estimate</emph>, but the real (unknown)
value of which will not change by rerunning the benchmark.
<p>
To summarize:
<ul>
<li>Standard deviation is a measure of the intrinsic variability of a system</li>
<li>The 95% confidence interval is a measure of the quality of our estimate of
the mean (or standard deviation).</li>
<li>We can reduce the standard deviation by improving the system itself</li>
<li>We can reduce the size of the confidence interval by running the benchmark lots of times.</li>
</ul>
<p>
Finally, I am not going to describe the mechanics of calculating the
95% confidence interval. Wikipedia <a href="http://en.wikipedia.org/wiki/Confidence_interval">describes</a> it quite well, and most
stats packages will do the heavy lifting for you; I like to use
<a href="http://www.r-project.org">R</a>.
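<p>
As a rough illustration of the difference between the two measures, here is a minimal R sketch. The benchmark numbers are made up, and the helper names ci95 and summarise_runs are hypothetical. It uses t.test() to get a 95% confidence interval for the mean, which assumes roughly normal data.

```r
# Hypothetical benchmark results (made-up numbers, not from a real system).
few_runs  <- c(98, 102, 100, 101, 99)
# Pretend the benchmark was run 20 times and kept producing the same spread.
many_runs <- rep(few_runs, times = 4)

# 95% confidence interval for the mean, via the one-sample t-test.
# This assumes the results are roughly normally distributed.
ci95 <- function(x) as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# Collect the estimates discussed above into one named vector.
summarise_runs <- function(x) {
  ci <- ci95(x)
  c(n = length(x), mean = mean(x), sd = sd(x),
    ci_low = ci[1], ci_high = ci[2], ci_width = ci[2] - ci[1])
}

few  <- summarise_runs(few_runs)
many <- summarise_runs(many_runs)
print(round(rbind(few, many), 3))
```

With more runs the estimated standard deviation barely moves, because it describes the system, while the confidence interval for the mean becomes much narrower, because it describes the quality of our estimate.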
<p>
I hope this helps you understand where and when standard deviation and confidence
intervals can be used, and keeps you out of benchmark analysis hell.
There are some caveats - for instance, the assumption of normality - but
I will leave those for another day.
<p>eoin, 2009-09-02 (https://blogs.oracle.com/Eoin/entry/erratic_network_performance_spin_mutexes)</p>
<h1>Erratic network performance: Spin mutexes vs. Interrupts</h1>
I was recently investigating the cause of high variance in network performance
between Logical Domains on a SunFire T2000. I was running the iperf benchmark
from one LDom guest to two other LDom guests.
The rig was configured like this:
<pre>
# ldm ls
NAME STATE FLAGS CONS VCPU MEMORY UTIL UPTIME
primary active -n-cv- SP 8 4G 2.1% 1d 1h
oaf381-ld-1 active -n---- 5000 8 6G 13% 1m
oaf381-ld-2 active sn---- 5001 8 6G 0.0% 1m
oaf381-ld-3 active sn---- 5002 8 6G 0.0% 2m
</pre>
Sometimes I would see throughput of up to 1360 Mb/s, but on other
runs it would drop to as low as 870 Mb/s. Here's a graph of the
benchmark results; as you can see, they are very erratic. (You may need to
open it in a separate window if your browser scales it.)
<br>
<img src="https://blogs.oracle.com/Eoin/resource/smtx/twoclient-data.png">
<br>
Looking at mpstat output
there seemed to be some sort of connection between a high spin mutex
count and performance, but it's hard to get a grasp of tens of mpstat
outputs at once.
<p>
For example, here is mpstat output for a run with a result of 1318 Mb/s:
<pre>
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1490 313 1 66 11 2 4250 0 4 0 100 0 0
1 0 0 1467 314 0 71 9 4 4678 0 4 0 100 0 0
2 0 0 486 2207 4 3687 2 1277 187 0 34 0 24 0 76
3 0 0 192 1048 2 1574 2 526 106 0 21 0 12 0 87
4 0 0 627 3302 5 6008 9 657 163 0 36 0 30 0 70
5 0 0 608 3134 6 5597 11 695 159 0 45 0 31 0 68
6 0 0 3911 6130 4094 4590 31 663 222 0 62 0 44 0 56
7 0 0 4462 6279 4205 4625 32 666 238 0 50 0 45 0 55
</pre>
and here is mpstat output for a run with a result of 882 Mb/s:
<pre>
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 666 5338 4 9695 8 523 318 0 29 0 34 0 66
1 0 0 540 4593 6 8272 9 506 277 0 34 0 31 0 69
2 0 0 405 3382 6 5795 9 448 202 0 43 0 28 0 72
3 0 0 1644 2037 112 3208 5 338 124 0 48 0 25 0 75
4 0 0 84 928 4 1283 1 402 82 0 30 0 14 0 86
5 0 0 36 503 2 496 0 152 31 0 15 0 5 0 95
6 0 0 6490 6540 6102 87 23 1 2197 0 5 0 100 0 0
7 0 0 6485 6547 6107 92 23 2 2336 0 5 0 100 0 0
</pre>
The best way I found to see the pattern was to graph it.
For each benchmark run I found which CPU had the highest smtx count,
and plotted that smtx value against the iperf result, using a different colour
for each CPU. The graph is below and reveals an unusual pattern:
<img src="https://blogs.oracle.com/Eoin/resource/smtx/smtx-cpu.png">
<p>
A few notes:
<ul>
<li>There appear to be four groupings of behaviour</li>
<li>If the highest smtx count is on CPU 6 or 7, the iperf result is low</li>
<li>If the highest smtx count is on CPU 1 or 2 the iperf result is high</li>
<li>The highest smtx count is never on CPU 3.</li>
<li>There is a range of results with very low smtx values, so there may be another
variable in play as well.</li>
</ul>
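<p>
The per-run selection described above (find the CPU with the highest smtx count, then pair that count with the iperf result) can be sketched in R. This is only an illustration: it uses just the two mpstat samples shown earlier, and the data frame layout and variable names are assumptions, not the script used for the graph.

```r
# smtx values copied from the two mpstat samples shown above (CPUs 0 to 7).
samples <- data.frame(
  mbps = rep(c(1318, 882), each = 8),   # iperf result of the run
  cpu  = rep(0:7, times = 2),
  smtx = c(4250, 4678, 187, 106, 163, 159, 222, 238,     # 1318 Mb/s run
           318,  277, 202, 124,  82,  31, 2197, 2336)    # 882 Mb/s run
)

# For each run, keep only the row of the CPU with the highest smtx count.
busiest <- do.call(rbind, lapply(split(samples, samples$mbps),
                                 function(run) run[which.max(run$smtx), ]))
print(busiest)
```

For these two samples the busiest CPU is 1 for the fast run and 7 for the slow run. With one row per run, a call such as plot(busiest$smtx, busiest$mbps, col = busiest$cpu + 1) gives the kind of scatter plot described above.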
<p>
Another data point is that in every run, the same two CPUs (6 and 7) handled the interrupts
for the vnet device. Here is the <strong>intrstat</strong> output, and it is confirmed
by the mpstat output above:
<pre>
device | cpu0 %tim cpu1 %tim cpu2 %tim cpu3 %tim
-------------+------------------------------------------------------------
vdc#0 | 0 0.0 0 0.0 0 0.0 4 0.0
vnet#0 | 0 0.0 0 0.0 0 0.0 0 0.0
vnet#1 | 0 0.0 0 0.0 0 0.0 0 0.0
device | cpu4 %tim cpu5 %tim cpu6 %tim cpu7 %tim
-------------+------------------------------------------------------------
vdc#0 | 0 0.0 0 0.0 0 0.0 0 0.0
vnet#0 | 0 0.0 0 0.0 0 0.0 0 0.0
vnet#1 | 0 0.0 0 0.0 3973 3.6 3980 3.6
</pre>
So the first conclusion I could draw was that if the interrupt handling
and whatever generates the spin mutexes are on the same two CPUs, then
iperf performance is badly affected.
<p>
I will follow up this blog entry with more analysis and some workarounds.
<h3>Notes</h3>
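I was running iperf 2.0.4. oaf381-ld-1 is the server, and oaf381-ld-2 and oaf381-ld-3
are the clients. (These labels describe the rig: in iperf's own terms, the -c commands
below run iperf in client mode and the -s command runs it in server mode.)
It is invoked on the server as: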
<pre>
iperf204 -c 192.1.44.2 -f m -t 120 -N -l 1M -P 100 &
iperf204 -c 192.1.44.3 -f m -t 120 -N -l 1M -P 100 &
</pre>
and on the clients as:
<pre>
iperf204 -s -N -f m -l 1M
</pre>
<p>eoin, 2009-09-01 (https://blogs.oracle.com/Eoin/entry/bad_practices_exposed)</p>
<h1>Bad practices exposed</h1>
There is an interesting (and frightening!)
study
recently published in the ACM Transactions on Storage:
<a href="http://portal.acm.org/citation.cfm?id=1367829.1367831&coll=portal&dl=ACM&idx=J960&part=transaction&WantType=Transactions&title=ACM%20Transactions%20on%20Storage%20%28TOS%29&CFID=://tos.acm.org/&CFTOKEN=tos.acm.org/">
A nine year study of file system and storage benchmarking</a>
together with a
<a href="http://www.byteandswitch.com/storage/storage-management/notes-on-a-nine-year-study-of-file-system-and-storage-benchmarking.php">short summary</a>
of the results and a set of recommendations.
<p>
It surveys nine years' worth of papers on file system and storage
benchmarks and makes for sobering reading:
<blockquote>
We found that most popular benchmarks are flawed, and many research papers used
poor benchmarking practices and did not provide a clear indication of the system's true performance.
</blockquote>
and:
<blockquote>
Finally, only about 45% of the surveyed papers included any mention of statistical dispersion.
</blockquote>
We can only hope that the paper will raise awareness of the importance of rigor and thoroughness in performance benchmarking.
<p>eoin, 2009-02-17, updated 2009-09-08 (https://blogs.oracle.com/Eoin/entry/first_past_the_line_analyzing)</p>
<h1><a name="stats">First Past the Line</a></h1>
<h2>Analyzing performance benchmarking data.</h2>
<h3>Eoin Lawless<br>Eoin.Lawless@Sun.Com<br>Version 1.1</h3>
<p>
Imagine you're a developer. You've just thought up a clever little
algorithm for, let's say, improving latency of small packets on a
10Gig network interface. Now you want to see exactly how much better
it is. You run a standard benchmark, like iperf, on both the original
code and your new improved code. But - horror of horrors! - the old code
appears to be faster. You run the test again; this time the new
code is better. To be sure you run the comparison a third time - and it's
a dead heat. What's going on? How can you say definitively which is better?
</p>
<p>
Unfortunately benchmark results of complex computer systems
rarely generate a completely reproducible result. Almost invariably
there is some noise - in fact as Brendan Gregg
<a href="http://blogs.sun.com/brendan/entry/unusual_disk_latency">recently demonstrated,</a>
actual noise, his shouting, can cause a spike in disk drive latency.
</p>
<p>
In this article I outline a procedure for benchmark results analysis. I'll be using
the R statistics package, but the ideas are not tied to R and many
other stats packages or spreadsheets can be used.
</p>
<p>
R is a variant of the S statistics programming language. It is
open source and freely available from
<a href="http://www.r-project.org">http://www.r-project.org.</a>
Binaries
are available there for Windows and Mac OS X, and it is easily compiled
on any recent Solaris.
</p>
<p>
I'm going to introduce just enough R to enable people
to compare two sets of performance benchmark results. I'm not going to explain
R syntax in any great detail - it is done better at
<a href="http://cran.r-project.org/doc/manuals/R-intro.html">http://cran.r-project.org/doc/manuals/R-intro.html</a>.
Although I use the R software, the overall approach is the important aspect, not
R specific details. Finally, if you are impatient, and want the quick version, jump to the
<a href="#summary">checklist.</a>
</p>
<h2>Understand the benchmark</h2>
<p>
Before using benchmark results to analyze or compare performance, it's
essential to understand what the benchmark is measuring. What is a sensible
result? For example, if we're measuring latency of small packet
transfers, we'd expect to see millisecond or microsecond numbers, not
picoseconds or minutes. Is a larger result better, or a smaller result
better? Often when looking at raw numbers, we forget what they
represent. You don't want to boast to your boss that you've improved
small packet latency by making it larger!
</p>
<p>
What size of change is important: 1%, 5%? When making many successive small changes, you
may need to ensure that small regressions don't accumulate; conversely, when comparing
major changes, a very minor regression may be acceptable.
</p>
<h2>Comparing Data Sets</h2>
<p>
The exact criteria that our performance testing needs to meet will vary
from situation to situation. A reasonable set of requirements is:
<ul>
<li>We want to detect changes of 1% or greater</li>
<li>We want more than 80% probability of finding a real change</li>
<li>We don't want more than 5% false positives (this is called the significance level)</li>
<li>We want to do as few benchmark runs as possible</li>
</ul>
<h2>Look at the data!</h2>
Raw numbers are hard to absorb. Plotting the data, preferably in more than
one format, will make obvious features jump out. Let's start our tour of
R by doing just that. Here are two sample sets of data, comparing openssl results from
successive builds of Solaris Nevada:
<p align=center>
<table border=1>
<thead>
<tr><td>Build 97</td><td>Build 98</td></tr></thead>
<tr><td>600870.36</td><td>596775.85</td></tr>
<tr><td>606059.30</td><td>597114.60</td></tr>
<tr><td>600799.39</td><td>592075.09</td></tr>
<tr><td>611484.07</td><td>605196.20</td></tr>
<tr><td>605925.89</td><td>592342.02</td></tr>
<tr><td>606033.32</td><td>605277.01</td></tr>
<tr><td>611066.67</td><td>592401.92</td></tr>
<tr><td>600784.94</td><td>597161.81</td></tr>
<tr><td>600697.75</td><td>604817.72</td></tr>
<tr><td>611689.59</td><td>597408.31</td></tr>
<tr><td>600719.45</td><td>592312.69</td></tr>
<tr><td>606255.10</td><td>597660.42</td></tr>
<tr><td>610618.37</td><td>604943.36</td></tr>
<tr><td>611113.47</td><td>604761.17</td></tr>
<tr><td>605951.83</td><td>592601.77</td></tr>
</table>
</p>
Now look at the same data plotted side-by-side as bar charts:
<table align="center"><tr>
<td><img src="https://blogs.oracle.com/Eoin/resource/analysis/base-first.png"></td>
<td><img src="https://blogs.oracle.com/Eoin/resource/analysis/test-first.png"></td>
</tr>
</table>
and as a strip chart:
<img align="center" src="http://blogs.sun.com/Eoin/resource/analysis/stripchart-first.png">
<p>
<!--
http://perfwww.ireland/cgi-bin/detail.cgi?rig=oaf441_64&benchmark=openssl-md5-speed-evp-multi-cpu-256b&build=snv_98-rerun&base=snv_97-rerun&clean=TRUE
-->
The data files are called <a href="http://blogs.sun.com/Eoin/resource/analysis/snv_97.data">snv_97.data</a> and <a href="http://blogs.sun.com/Eoin/resource/analysis/snv_98.data">snv_98.data</a>. Start R and load the data:
<pre>
> basefile <- read.table("snv_97.data", header = FALSE, col.names= c("basedata"))
> testfile <- read.table("snv_98.data", header = FALSE, col.names= c("testdata"))
> attach(basefile)
> attach(testfile)
> basedata
[1] 600870.4 606059.3 600799.4 611484.1 605925.9 606033.3 611066.7 600784.9
[9] 600697.8 611689.6 600719.4 606255.1 610618.4 611113.5 605951.8
> testdata
[1] 596775.8 597114.6 592075.1 605196.2 592342.0 605277.0 592401.9 597161.8
[9] 604817.7 597408.3 592312.7 597660.4 604943.4 604761.2 592601.8
> datamax <- max(basedata, testdata)
> datamax
[1] 611689.6
> datamin <- min(basedata, testdata)
> datamin
[1] 592075.1
</pre>
Let's have a look at a bar plot:
<pre>
> par(mfcol=c(1,2)) # plot the two charts side by side
> barplot(basedata, xlab="Iteration", ylab="Result", ylim=c(0,datamax*1.1))
> barplot(testdata, xlab="Iteration", ylab="Result", ylim=c(0,datamax*1.1))
> dev.off(2)
</pre>
Now we want to display a strip chart containing both sets of results:
<pre>
> plot.new()
> stripchart(basedata, method="stack", xlim=c( 0.95 * datamin, 1.05 * datamax), col="red", at=1.2, offset=1)
> stripchart(testdata, method="stack", add=TRUE, at=0.8, col="blue", offset=1)
> dev.off(2)
</pre>
We can save the output by first doing:
<pre>
> bitmap(file="stripchart.png", type="png256", height=3)
</pre>
<p>
At this point I should note that R comes with comprehensive help. So, for example,
<pre>
> help(stripchart)
</pre>
displays help for the stripchart command. There is also an online <a href="http://cran.r-project.org/doc/manuals/R-intro.html">introduction</a>, which
is more general than this guide.
</p>
<p>
By a visual inspection we can see there is trouble here. Both sets of data show that
the results are clumped into groups. Before trying to compare the performance of these
two builds it's necessary to find out what is happening and why the data is clumping like this.
Time to use Solaris's famed diagnostic tools: dtrace, vmstat, mpstat, intrstat and friends.
</p>
<h2>Testing for (mis)behaviour</h2>
As we've seen, data can sometimes have odd behaviours that make a straightforward
comparison difficult, useless or even misleading. R has many statistical tests
to check data for (mis)behaviours. As we've seen already, some of these issues are
clearly visible when we plot or graph the data. However other behaviour is either
not as obvious, or else we may wish to have a programmatic way of identifying
anomalies. In particular there are a few common issues with data sets that make
a naive comparison of results either misleading or plain wrong:
<ul>
<li>Multimodal data</li>
<li>Outliers</li>
<li>Non-normal data (when used with tests that assume normality)</li>
<li>Insufficient data</li>
</ul>
<h4>Is the data multimodal?</h4>
We say a dataset is
<a href="http://en.wikipedia.org/wiki/Multimodal_distribution">multimodal</a>
if the frequency of results has two or more distinct peaks. Tests for multimodality
are not included in the base R package, but can be optionally installed
from the diptest package.
<pre>
> install.packages("diptest")
</pre>
Unfortunately this package is not easy to use, and I will skip a detailed example.
Generally the stripchart makes multiple modes visually obvious, though the diptest
is very useful for automatic analysis of large numbers of data sets.
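<p>
Where installing diptest is not an option, a much cruder programmatic check is to look for large gaps in the sorted results. The sketch below is not a statistical test and is no substitute for the dip test. The function name count_clumps and the min_gap threshold are made up for this illustration, and the data are the Build 97 results from the table earlier in this article.

```r
# Build 97 results from the table earlier in this article.
build97 <- c(600870.36, 606059.30, 600799.39, 611484.07, 605925.89,
             606033.32, 611066.67, 600784.94, 600697.75, 611689.59,
             600719.45, 606255.10, 610618.37, 611113.47, 605951.83)

# Count clumps: sort the results and start a new clump wherever two
# neighbouring values are further apart than `min_gap`.
count_clumps <- function(x, min_gap) {
  gaps <- diff(sort(x))
  sum(gaps > min_gap) + 1
}

# min_gap = 2000 is an arbitrary threshold picked by eye for this data set.
clumps <- count_clumps(build97, min_gap = 2000)
print(clumps)
```

For this data set the check reports three clumps, which matches what the strip chart shows. The threshold has to be chosen per benchmark, which is exactly the weakness a proper test avoids.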
<h4>Are there outliers?</h4>
An <a href="http://en.wikipedia.org/wiki/Outlier">outlier</a> is a single point that is distant from the bulk of the data set. There is
an unfortunate tendency to blindly discard inconvenient outliers, especially when
they make a comparison look bad. In our experience, outliers are often due to
a benchmark misconfiguration or error (e.g. a damaged cable or a faulty disk), but they
can also be symptomatic of real problems.
There is an optional add on package for R called outliers that can be used to
test data sets for their presence.
<pre>
> install.packages("outliers")
# package installs from the web
> library(outliers)
> grubbs.test(basedata)
Grubbs test for one outlier
data: basedata
G = 1.2892, U = 0.8728, p-value = 1
alternative hypothesis: highest value 611689.59 is an outlier
</pre>
The p-value in this case is 1, so we accept that there are no outliers.
<h4>Is the data 'normal'</h4>
If the data is <a href="http://en.wikipedia.org/wiki/Normal_distribution">normal</a> it
is much easier to use it in comparisons. Normality effectively means that results
are clustered around a central mean value, with 95% of results within two standard deviations of
the mean. In addition, the frequency of a result falls off very rapidly the further from the
mean it is.
R comes with a test called the Shapiro-Wilk normality test, used as follows:
<pre>
> shapiro.test(basedata)
Shapiro-Wilk normality test
data: basedata
W = 0.8351, p-value = 0.01076
</pre>
If the p-value is less than 0.05 then the data is probably not normal, as we see in this case.
This means that we cannot use the Student T-Test to compare this data set with another. This
is not a problem though - we can always use the Wilcoxon test, which does not require
normal data.
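<p>
The rule of thumb above (fall back to the Wilcoxon test when the Shapiro-Wilk check fails) can be sketched as a small helper. The function name compare_runs and its alpha argument are made up for this illustration. The data are the Build 97 and Build 98 results from the table earlier in this article, used here only to show the mechanics, since those data sets are clumped and need investigating before any real comparison.

```r
# Build 97 (baseline) and Build 98 (test) results from the table above.
build97 <- c(600870.36, 606059.30, 600799.39, 611484.07, 605925.89,
             606033.32, 611066.67, 600784.94, 600697.75, 611689.59,
             600719.45, 606255.10, 610618.37, 611113.47, 605951.83)
build98 <- c(596775.85, 597114.60, 592075.09, 605196.20, 592342.02,
             605277.01, 592401.92, 597161.81, 604817.72, 597408.31,
             592312.69, 597660.42, 604943.36, 604761.17, 592601.77)

# Use the t-test only if neither sample fails the Shapiro-Wilk check;
# otherwise use the Wilcoxon test, which does not assume normality.
compare_runs <- function(test, base, alpha = 0.05) {
  looks_normal <- shapiro.test(test)$p.value >= alpha &&
                  shapiro.test(base)$p.value >= alpha
  if (looks_normal) {
    t.test(test, base)
  } else {
    wilcox.test(test, base, conf.int = TRUE)
  }
}

result <- compare_runs(build98, build97)
print(result$method)   # name of the test that was chosen
```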
<h4>Do we have enough data?</h4>
There is an abundance of theory to calculate the number of results required
to detect a change of a given size (the theory is called
<a href="http://en.wikipedia.org/wiki/Statistical_power">Power Analysis</a>).
R includes a suitable test, but unfortunately we need to know
the mean and standard deviation of the result data distribution before
we can apply the test.
<!--
/results9/s10u5_10.oaf408_64/spec_14_64b_jvm98Servergm
/results9/s10u6_07b.oaf408_64/spec_14_64b_jvm98Servergm
/regression/detaildir/s10u6_07b-s10u5_10-oaf408_64-spec_14_64b_jvm98Servergm-TRUE-0.05
-->
</p>
<p>
We can estimate these quantities from the data set if it contains three or more points.
With a small number of points its accuracy will be low, however it does serve
as a starting point. We'll use two new sample data sets as the first two are not normal:
<a href="http://blogs.sun.com/Eoin/resource/analysis/s10u5_10.data">s10u5_10.data</a>
<a href="http://blogs.sun.com/Eoin/resource/analysis/s10u6_07b.data">s10u6_07b.data</a>.
<pre>
> basefile <- read.table("s10u5_10.data", header = FALSE, col.names= c("basedata"))
> testfile <- read.table("s10u6_07b.data", header = FALSE, col.names= c("testdata"))
> attach(basefile)
> attach(testfile)
> basedata
[1] 325.9843 321.4696 326.9845
> testdata
[1] 319.1681 318.2914 317.6821 319.5697 320.9252 322.6830 321.3920 316.2342
[9] 318.3186 323.7438 317.3700
>
</pre>
The strip chart gives a clear picture of the distribution of results. (Base data, basedata, is in red).
<img src="https://blogs.oracle.com/Eoin/resource/analysis/stripchart-jvm.png"/>
<br>
We will use the power.t.test to see if we have enough results to detect a 1% change.
<pre>
> sd <- max( sd(basedata), sd(testdata))
> d <- mean(basedata) * 0.01
> power.t.test( sd=sd, sig.level=0.05, power=0.8, delta=d, type="two.sample")
Two-sample t test power calculation
n = 13.87397
delta = 3.248128
sd = 2.938169
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
</pre>
So, using our estimated values of the mean and standard deviation, we need to get 14 datapoints
to be able to detect a 1% change in performance. Note however that a bad estimate of the standard
deviation, sd, will result in a very large estimate of the number of data points required. In this case
I would concentrate on getting extra numbers for basedata.
<p>
See <a href="http://www.stat.uiowa.edu/~rlenth/Power">http://www.stat.uiowa.edu/~rlenth/Power</a>
for a more sophisticated applet for calculating required sample sizes.
Note however that power analysis is
<a href="http://web.vims.edu/fish/faculty/pdfs/hoenig2.pdf"> easy to get wrong </a>.
As a general rule, if it is easy to get extra benchmark runs, then do so!
<h2>Comparing the data sets</h2>
The final step, once we have confirmed both sets of data are usable, is
to compare the datasets. There are three possible outcomes:
<ul>
<li>The performance has changed</li>
<li>The performance is the same</li>
<li>The results are inconclusive</li>
</ul>
Hopefully we have excluded the third possibility
by gathering enough data and investigating bimodal datasets and outlier datapoints.
<h4>Has the performance changed?</h4>
The statistics procedure for comparing two sets of data for a difference is called
<emph>hypothesis testing</emph>. We have two alternative hypotheses, conventionally
called the <emph>null hypothesis</emph> and the <emph>alternative hypothesis</emph>.
The null hypothesis represents no change, nothing happening. The alternative hypothesis
represents a new finding or an important result. The choice of null hypothesis is very
important, the idea being that less harm is caused by accepting the null hypothesis than
by rejecting it. Some examples:
<p>
<table border=1>
<thead>
<tr>
<th>
Scenario
</th>
<th>
Null hypothesis
</th>
<th>
Alternative
</th>
<th>
Reason
</th>
</tr>
</thead>
<tr>
<td>Doctor testing a new cancer drug</td>
<td>The drug has no effect</td>
<td>The drug improves survival rates by more than 4%</td>
<td>A new drug should not be sold unless it is certain it does good.</td>
</tr>
<tr>
<td>A food company wants to use a new food colouring</td>
<td>The additive may cause harm to some people</td>
<td>The additive causes no harm</td>
<td>Need very strong proof that additive causes no harm</td>
</tr>
<tr>
<td>Performance Regression Testing (ie Performance QE)</td>
<td>No change</td>
<td>Change greater than 1%</td>
<td>Crying wolf with insufficient evidence wastes developers time</td>
</tr>
</table>
<p>
We are looking for a change (either an improvement or a regression).
There are many methods for hypothesis testing,
each with multiple subvariants. We typically use just two, the Student T-Test and
the Wilcoxon rank sum test (the two-sample test that R's wilcox.test runs in the examples below). Of these the Wilcoxon is more general.
<p>
Both tests calculate the probability (p-value) that a difference at least as large as the one
observed could occur if the null hypothesis is true. We will only reject the null hypothesis
if this probability is very low, typically below 0.05, or 5%. This threshold is referred to as
the significance level of the test. If the p-value is less than the significance level, then we must
act on the basis that there is a change.
<p>
Bear in mind, that even if we don't reject the null hypothesis (ie don't find a change),
this does not imply that all is good, it could be that there is a change but that the
data we have collected is inconclusive. Hence the importance of checking the data
and making sure we have enough datapoints.
<h4>Is performance the same?</h4>
Just because our data does not show a performance change doesn't necessarily
mean that performance is unchanged. Our data might simply be inadequate
for the job of detecting a change of the size we're interested in. We have
failed to reject the null hypothesis, but that doesn't mean we can accept
it.
<p>
Is there a way to test whether we can accept the null hypothesis (ie performance
has stayed the same)? As it turns out there is a relatively simple way. If
the 90% confidence interval reported by the Wilcoxon (or T)
test lies entirely
within 1% of the baseline average, then we can safely assume the test build
performance has remained within 1% of the base build performance. (See
<a href="http://web.vims.edu/fish/faculty/pdfs/hoenig2.pdf">http://web.vims.edu/fish/faculty/pdfs/hoenig2.pdf</a> for details.)
<h4>Example 1</h4>
We will use two sets of results from an openssl benchmark, the baseline running on
<a href="http://blogs.sun.com/Eoin/resource/analysis/ssl.snv_106.data">snv_106</a>
and the test on
<a href="http://blogs.sun.com/Eoin/resource/analysis/ssl.snv_107.data">snv_107</a> .
Neither data set is multimodal, or has outliers.
<img align="left" src="http://blogs.sun.com/Eoin/resource/analysis/stripchart-ssl.png">
<br>
<!--
http://perfwww.ireland/cgi-bin/detail.cgi?benchmark=openssl-md5-speed-evp-256b&rig=oaf566_64&build=snv_107&base=snv_106&clean=TRUE&refresh=TRUE
/results9/snv_106.oaf566_64/openssl-md5-speed-evp-256b
/results9/snv_107.oaf566_64/openssl-md5-speed-evp-256b
-->
<pre>
> basefile <- read.table("ssl.snv_106.data", header = FALSE, col.names= c("basedata"))
> testfile <- read.table("ssl.snv_107.data", header = FALSE, col.names= c("testdata"))
> attach(basefile)
> attach(testfile)
> power.t.test( sd=sd(basedata), sig.level=0.05, power=0.75, delta=mean(basedata)*0.01, type="two.sample")
Two-sample t test power calculation
n = 5.296159
delta = 285.0265
sd = 155.7760
sig.level = 0.05
power = 0.75
alternative = two.sided
</pre>
For testdata, power.t.test cannot compute n at the power of 0.75 (75%) used above, so we use a higher (better) power level:
<pre>
> power.t.test( sd=sd(testdata), sig.level=0.05, power=0.95, delta=(mean(testdata)*0.01), type="two.sample")
Two-sample t test power calculation
n = 2.135454
delta = 281.677
sd = 42.21613
sig.level = 0.05
power = 0.95
alternative = two.sided
</pre>
We have more than enough data. We can now safely compare the two data sets. We will use the Wilcoxon
test, even though the T-Test is applicable here, since the Wilcoxon can be applied more widely.
Be careful with the ordering of basedata and testdata in the function call. With the ordering below
(test data, then baseline data)
a regression for a "bigger is better" benchmark will show as negative and an improvement as positive.
<pre>
> wilcox.test(testdata,basedata, conf.int=TRUE)
Wilcoxon rank sum test
data: testdata and basedata
W = 0, p-value = 2.896e-07
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-432.64 -234.41
sample estimates:
difference in location
-265.75
</pre>
Here is how the result is interpreted:
<table border="1">
<tr><td>Null hypothesis</td><td>No change in performance</td></tr>
<tr><td>Alternative hypothesis</td><td>Improvement or regression</td></tr>
<tr><td>
p-value </td><td> A p-value less than 0.05 indicates a change in performance</td></tr>
<tr><td>95 percent confidence interval</td><td>The change in performance is 95% sure to be within these limits</td></tr>
<tr><td>difference in location</td><td>negative is a regression, positive an improvement
</td></tr></table>
In our case the p-value is very low, so we definitely have a change. Our benchmark is
"bigger is better" and the change is negative, so we have a regression.
<h4>Example 2</h4>
<img align="left" src="http://blogs.sun.com/Eoin/resource/analysis/stripchart-jes.png">
<br>
<!--
No change
http://perfwww.ireland/cgi-bin/detail.cgi?benchmark=jes_ds_phlkp&rig=oaf300_64&build=dsee-6-3-1-20081031-1-snv_104&base=dsee-6-3-1-20081031-1-snv_103&clean=TRUE&ref
resh=TRUE
/results9/dsee-6-3-1-20081031-1-snv_103.oaf300_64/jes_ds_phlkp
/results9/dsee-6-3-1-20081031-1-snv_104.oaf300_64/jes_ds_phlkp
-->
<pre>
> basefile <- read.table("jes.snv_103.data", header = FALSE, col.names= c("basedata"))
> testfile <- read.table("jes.snv_104.data", header = FALSE, col.names= c("testdata"))
> attach(basefile)
> attach(testfile)
> sd <- max( sd(basedata), sd(testdata))
> d <- mean(basedata) * 0.01
> power.t.test( sd=sd, sig.level=0.05, power=0.8, delta=d, type="two.sample")
> mean(basedata) * 0.01
> wilcox.test(testdata, basedata, conf.int=TRUE, conf.level=0.9)
Wilcoxon rank sum test
data: testdata and basedata
W = 111, p-value = 0.917
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
-1.918 2.146
sample estimates:
difference in location
0.3145
</pre>
As before we interpret the output of the Wilcoxon test as follows:
<br>
<table border="1">
<tr><td>Null hypothesis</td><td>No change in performance</td></tr>
<tr><td>Alternative hypothesis</td><td>Improvement or regression</td></tr>
<tr><td>
p-value </td><td> A p-value less than 0.05 indicates a change in performance</td></tr>
<tr><td>90 percent confidence interval</td><td>We can be 90% confident that the true change in performance lies within these limits</td></tr>
<tr><td>difference in location</td><td>negative is a regression, positive an improvement
</td></tr></table>
<br>
Our p-value is 0.917, which is far higher than 0.05, so this data does not
indicate any change in performance between the two builds. However, does
that imply the converse? Not by itself: failing to detect a change may only mean
that we have too little data to see one. We therefore apply a second criterion.
If the 90% confidence interval lies within ±1% of the
base mean value, then we can accept that the performance is unchanged.
<pre>
> mean(basedata) * 0.01
[1] 9.729103
</pre>
Our 90% confidence interval runs from -1.918 to 2.146, which is well within
± 9.7291, so for this benchmark we can accept that performance is unchanged.
<br>
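<p>
The acceptance rule used in Example 2 can be written as a small R helper. This is only a sketch: the function name <code>is_unchanged</code> and its <code>tol</code> argument are hypothetical, and the base mean of 972.9103 is simply back-calculated from the 1% value of 9.729103 printed above.
</p>

```r
# Hypothetical helper: TRUE when the confidence interval for the location
# shift lies entirely inside +/- tol (default 1%) of the base mean.
is_unchanged <- function(ci, base_mean, tol = 0.01) {
  margin <- abs(base_mean) * tol
  ci[1] >= -margin && ci[2] <= margin
}

# Values quoted in Example 2: interval (-1.918, 2.146), base mean 972.9103
is_unchanged(c(-1.918, 2.146), base_mean = 972.9103)

# With the data files loaded, the interval can be taken from the test object:
#   w <- wilcox.test(testdata, basedata, conf.int = TRUE, conf.level = 0.9)
#   is_unchanged(w$conf.int, mean(basedata))
```

<p>
Taking the interval from the test object rather than retyping it avoids transcription slips between the R output and the write-up.
</p>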
<h4>Example 3</h4>
<img align="left" src="http://blogs.sun.com/Eoin/resource/analysis/stripchart-web.png">
<br>
<pre>
> basefile <- read.table("web.s10u7_01.data", header = FALSE, col.names= c("basedata"))
> testfile <- read.table("web.s10u7_02.data", header = FALSE, col.names= c("testdata"))
> attach(basefile)
> attach(testfile)
> sd <- max( sd(basedata), sd(testdata))
> d <- mean(basedata) * 0.01
> power.t.test( sd=sd, sig.level=0.05, power=0.8, delta=d, type="two.sample")
> mean(basedata) *0.01
</pre>
In this final example we look at a case where the data is inconclusive.
The datasets are from two builds of Solaris 10, and the benchmark is
a standard web server benchmark (
<a href="http://blogs.sun.com/Eoin/resource/analysis/web.s10u7_01.data">web.s10u7_01.data</a>
and
<a href="http://blogs.sun.com/Eoin/resource/analysis/web.s10u7_02.data">web.s10u7_02.data</a>).
The Wilcoxon test does not indicate any change in
performance. However, the 90% confidence interval (-18 to 460)
is far wider than the ± 1% of the base mean (± 32.6), so
we cannot assume that the performance has stayed constant. Our data is
not adequate and we need to investigate further. For instance, several
results look like outliers: is there an intermittent performance problem?
Note that here the arguments to wilcox.test are given as (basedata, testdata),
the reverse of the earlier examples, so the sign of the difference in location is reversed too.
<pre>
> wilcox.test(basedata,testdata, conf.int=TRUE, conf.level=0.9)

        Wilcoxon rank sum test

data:  basedata and testdata
W = 25, p-value = 0.3301
alternative hypothesis: true location shift is not equal to 0
90 percent confidence interval:
 -18 460
sample estimates:
difference in location
                    19
> mean(basedata) * 0.01
[1] 32.6425
</pre>
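<p>
As a starting point for chasing suspected outliers, the sketch below flags points using the 1.5 × IQR rule that R's box plots apply. The choice of rule and the helper name <code>flag_outliers</code> are assumptions made for illustration, and the numbers are made up rather than taken from the web server data files.
</p>

```r
# Hypothetical helper: return the values lying more than 1.5 * IQR beyond
# the hinges, i.e. the points boxplot() would draw as outliers.
flag_outliers <- function(x) {
  boxplot.stats(x)$out
}

# Made-up throughput numbers, for illustration only (not the blog's data).
runs <- c(3250, 3262, 3271, 3258, 3266, 2790, 3260)
flag_outliers(runs)
```

<p>
A flagged value is a prompt to look for a cause such as a configuration error or a failed run, not a licence to discard the result.
</p>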
<a name="summary"></a>
<h2>Summary and Checklist</h2>
You may have noted that I make very little mention of the average value of our results. This is
deliberate. Unless we first check for outliers, multiple modes and normality, it makes
little sense to directly compare average results.
<p>
Finally, please let me know of any mistakes.
<p>
This has been a very rough guide to statistical analysis of performance benchmark results.
Here is a checklist to help:
<ul>
<li> Make sure you understand the benchmark you are running</li>
<ul>
<li> Are the results sensible?</li>
<li> Is larger better or is smaller better?</li>
<li> What size change is significant to you?</li>
</ul>
<li> Look at the data - strip chart and histogram</li>
<ul>
<li> Are there any obvious anomalies?</li>
<li> Do the numbers make rough sense in the context?</li>
</ul>
<li> Is the data 'normal'? If so we can use the Student t-test.</li>
<li> Is the data multimodal?</li>
<ul>
<li> If the data is multimodal, the average is no longer a useful measure.</li>
<li> Investigate the reasons for the multimodal results.</li>
</ul>
<li> Is there an outlier?</li>
<ul>
<li> Is there a reason (i.e. an obvious configuration error or a runtime failure in the benchmark)?</li>
</ul>
<li> If data is normal </li>
<ul>
<li> Do we have enough results to detect the size change we're interested in?</li>
<li> Does the Student t-test confirm a difference between the results?</li>
<li> Does the Wilcoxon test confirm a difference between the results?</li>
</ul>
<li> If the data is unimodal but not normal</li>
<ul>
<li> Do we have enough results to detect the size change we're interested in [use
t-test power analysis as a rough estimate]</li>
<li> Does the Wilcoxon test confirm a difference between the results?</li>
</ul>
<li> If we don't detect a change in performance, can we say that
performance is unchanged (is the 90% confidence interval within a ±1% range of the base mean)?</li>
</ul>
</p>
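<p>
The numerical steps of the checklist can be strung together in R roughly as follows. This is a sketch under stated assumptions, not a definitive procedure: the function name <code>compare_builds</code> and its arguments are hypothetical, the Shapiro-Wilk test is an assumed choice for the normality check, and the visual steps (strip chart, histogram, multimodality) still have to be done by eye. The order of arguments to wilcox.test follows Examples 1 and 2, so a negative shift means the test build is lower than the base.
</p>

```r
# Sketch of the checklist's numerical steps; all names are hypothetical.
compare_builds <- function(base, test, tol = 0.01, alpha = 0.05) {
  # Normality check on each sample (Shapiro-Wilk is an assumed choice).
  normal <- shapiro.test(base)$p.value > alpha &&
            shapiro.test(test)$p.value > alpha

  # Rough number of runs per build needed to detect a shift of
  # tol * base mean with 80% power (t-test power analysis).
  needed <- power.t.test(sd = max(sd(base), sd(test)),
                         delta = mean(base) * tol,
                         sig.level = alpha, power = 0.8,
                         type = "two.sample")$n

  # Student t-test only when the data looks normal; Wilcoxon in all cases.
  t_p <- if (normal) t.test(test, base)$p.value else NA
  w <- wilcox.test(test, base, conf.int = TRUE, conf.level = 0.9)

  margin <- mean(base) * tol
  list(normal      = normal,
       runs_needed = ceiling(needed),
       t_test_p    = t_p,
       wilcoxon_p  = w$p.value,
       changed     = w$p.value < alpha,
       unchanged   = w$conf.int[1] >= -margin && w$conf.int[2] <= margin)
}

# Simulated data, for illustration only.
set.seed(42)
base <- rnorm(20, mean = 1000, sd = 10)
test <- rnorm(20, mean = 1000, sd = 10)
str(compare_builds(base, test))
```

<p>
If both <code>changed</code> and <code>unchanged</code> come back FALSE, the data is inconclusive, as in Example 3, and more runs or further investigation are needed.
</p>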
<h2>Bibliography</h2>
<ul>
<li>Statistics Explained, 2nd Edition, Perry R. Hinton</li>
<li><a href="http://cran.r-project.org/doc/manuals/R-intro.html">http://cran.r-project.org/doc/manuals/R-intro.html</a></li>
<li><a href="http://www.r-project.org">http://www.r-project.org</a></li>
<li><a href="http://www.saem.org/download/lewis4.pdf">http://www.saem.org/download/lewis4.pdf</a></li>
<li><a href="http://en.wikipedia.org/wiki/Student_t_test">http://en.wikipedia.org/wiki/Student_t_test</a></li>
<li><a href="http://en.wikipedia.org/wiki/Normal_distribution">http://en.wikipedia.org/wiki/Normal_distribution</a></li>
<li><a href="http://en.wikipedia.org/wiki/Wilcoxon_rank_sum">http://en.wikipedia.org/wiki/Wilcoxon_rank_sum</a></li>
<li><a href="http://en.wikipedia.org/wiki/Wilcoxon_signed_rank_test">http://en.wikipedia.org/wiki/Wilcoxon_signed_rank_test</a></li>
<li><a href="http://en.wikipedia.org/wiki/Hypothesis_testing">http://en.wikipedia.org/wiki/Hypothesis_testing</a></li>
<li><a href="http://en.wikipedia.org/wiki/Statistical_power">http://en.wikipedia.org/wiki/Statistical_power</a></li>
<li><a href="http://www.stat.uiowa.edu/~rlenth/Power/">http://www.stat.uiowa.edu/~rlenth/Power/</a></li>
<li><a href="http://web.vims.edu/fish/faculty/pdfs/hoenig2.pdf">http://web.vims.edu/fish/faculty/pdfs/hoenig2.pdf</a></li>
<li><a href="http://en.wikipedia.org/wiki/Outlier">http://en.wikipedia.org/wiki/Outlier</a></li>
<li><a href="http://www.west.asu.edu/rlberge1/papers/statsci96.pdf">http://www.west.asu.edu/rlberge1/papers/statsci96.pdf</a></li>
</ul>