Real Value in Benchmarks
By tomdaly on Aug 11, 2008
Benchmarks, Workloads, Micro Benchmarks and In-House Performance Testing
As a contributor to the SPEC performance organisation on behalf of Sun, I tend to notice and read comments, both negative and positive, on the benchmarks SPEC creates and administers, and I read with particular interest articles on the SPECjAppServer benchmarks that I am involved in. A few days ago I was forwarded a post in which the author offers the opinion that the SPECjAppServer2004 results provide no value, along with a pretty negative view of industry standard benchmarks in general. I certainly don't believe that the SPEC benchmarks, or any benchmarks, are perfect, nor do I think that they are the only valuable source of performance information, but to claim that the results have no value seems... well, absurd. So I thought it might be useful to offer some background and observations regarding performance measurement. In the discussion below I try to categorise the main sources of performance information and to highlight the main benefits and shortcomings of each source.
A benchmark comprises a performance testing workload (application) or workload definition, plus a set of run rules and procedures that define how the workload will be run, plus a process for ensuring that published results conform to the run rules and to prescribed "fair use" rules about how comparisons may be made between results.
The workload is generally a (very) complex application and includes the user simulations, the data models (either in code or specification form) and all the information necessary to run repeatable performance tests over a (potentially) wide variety of computing environments. The run rules and procedures define how the workload will be run, what constitutes fair and reasonable tuning techniques, what the requirements are for the products being tested, the format and length of test runs, and the reporting requirements. The benchmark usage rules (fair use) outline how one benchmark result can be compared to others and effectively put constraints on the claims that can be made about any particular benchmark result, hence (ideally) increasing the value of the published results to end user consumers.
Industry standard benchmark organisations such as SPEC or TPC are made up of (IT) companies and interested individuals who contribute time and/or money to the organisation to develop (complex) benchmarks and to help manage them. These benchmark organisations exist to create benchmarks and performance data that are credible, relevant and useful to end user consumers of this data.
There are many benefits to the contributing vendors in creating and running benchmarks. Having a forum to prove performance or price/performance gains in their products is certainly a big motivation, but not the only one: many of the benchmarks defined and created by SPEC, for example, are used by hardware and software vendors to improve their products long before a result is ever published on the competitive public site. So there are very sound engineering as well as marketing reasons for vendors to contribute to the goal of creating credible, relevant and useful performance benchmarks.
Another valuable source of benchmark and performance data comes from vendor benchmarks. The Oracle applications benchmarks and SAP benchmarks are good and well-known examples: the workload, run rules and usage rules are defined by the vendor company and then made available to third parties, or perhaps hardware partners, who want to run and tune these workloads in their environments. These benchmarks have much in common with the industry standard benchmarks, but their scope is generally limited to just the products offered by the vendor. They are very useful for potential customers of these systems, who can gain the performance information required to size implementations of these products and hence build confidence in the performance capacities of the system prior to purchase or implementation.
These benchmarks are extremely cheap for end users: normally the large IT vendors have done most of the work and published results, so end users need only look at these results and decide if and how they might be applied to their business, and what comparisons they can make based on the published data. End users can use the numbers with a degree of confidence, knowing that the results have been audited or peer reviewed to ensure compliance with the benchmark rules.
There is a lot of tuning information and real value in the benchmark results themselves; for instance, consider the SPECjAppServer2004 benchmark results. In each result you will find the .html result page, which is the full disclosure report (FDR). The FDR contains not just the benchmark results and the final and repeat run scores but also a wealth of tuning information, covering the database, the application server, the hardware, the Java virtual machine, the JDBC driver and the operating system: everything another user might need to reproduce the result. The FDR also includes the full disclosure archive (FDA). The FDA contains the scripts, database schema, deployment information and instructions on how the environment was established. The SPECjAppServer2004 FDR and FDA are valuable resources I use all the time on customer sites as a reference on how to tune and configure their production and test systems.
Again in reference to the FDR and FDA, much of the raw data and data rate information is useful; examples include the number of concurrent web tier transactions, the network traffic, or the size of the database supported by the database hardware. These data rates, speeds and feeds can be used to assess the capabilities of certain parts of the system being tested and can be useful in sizing some aspects of similar applications.
Hardware and software vendors use the benchmarks as tools to improve their products, which in turn flows through to end users. A good example of this at SPEC is the decision to use BigDecimal in the web tier of SPECjAppServer2004. Even before the SPECjAppServer2004 workload was released, it was obvious to the Java virtual machine vendors participating at SPEC that this was an opportunity to optimise BigDecimal processing in their JVMs. So before the first SPECjAppServer2004 results were published, the JVMs were already providing optimisations for BigDecimal, and SPECjAppServer2004 was helping quantify the performance gains from these optimisations. The benefits of these optimisations flowed to all users of Java BigDecimal who could move to the later JVMs.
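To make the BigDecimal point concrete, here is a minimal, hypothetical sketch of the kind of exact-decimal currency arithmetic a web tier like SPECjAppServer2004's would exercise. This is not code from the actual benchmark; the class name, method and tax rate are invented for illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class OrderTotal {
    // Hypothetical web-tier currency calculation: BigDecimal keeps the
    // arithmetic exact where floating point would accumulate error.
    static BigDecimal lineTotal(BigDecimal unitPrice, int quantity, BigDecimal taxRate) {
        BigDecimal subtotal = unitPrice.multiply(BigDecimal.valueOf(quantity));
        BigDecimal tax = subtotal.multiply(taxRate);
        // Round to cents, as currency code typically must.
        return subtotal.add(tax).setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // 3 units at 19.99 with 8% tax: 59.97 + 4.7976, rounded to 64.77
        System.out.println(lineTotal(new BigDecimal("19.99"), 3, new BigDecimal("0.08")));
    }
}
```

Heavy use of operations like `multiply` and `setScale` on every request is what made BigDecimal throughput worth optimising inside the JVMs.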
Competition. Industry standard benchmarks are one way a vendor can show performance improvements in their products and performance leadership over their competition, and perhaps gain a marketing advantage, so there is a fiercely competitive aspect to industry standard and vendor benchmarking. This competition is generally good for end users, as it commonly produces tunings and optimisations in the vendors' products that benefit a wide range of applications using their technology; indeed, this is the situation that the run rules and fair use rules generally strive to promote.
Inappropriate comparisons, or extrapolation of results. Care must be taken to make selective and reasonable judgements based on the information provided in the results or benchmark reports. It makes no sense, for instance, to use results from SPECjAppServer2004, a transactional benchmark, to size a system for data warehousing or business intelligence; a TPC-H benchmark would be the place to go for that information. Also, taking the transaction rate disclosed in a benchmark report and extrapolating it upwards is risky, as performance is not a continuous function but instead can have many discrete jumps, and tested configurations may have hard ceilings such as memory capacity or bus bandwidth. For example, it might not be accurate to predict the performance of a single instance of the Glassfish application server on a 64 core machine based on the JOPS-per-Glassfish-instance result measured on a 4 core machine.
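As a sketch of why naive upward extrapolation is risky, the hypothetical numbers below compare a straight linear projection with an Amdahl's-law style projection that assumes only part of the work scales with core count. All figures here (the 1000 JOPS baseline, the 95% parallel fraction) are invented for illustration, not taken from any published result.

```java
public class ScalingProjection {
    // Naive linear extrapolation from a measured result.
    static double naive(double measured, int baseCores, int targetCores) {
        return measured * targetCores / baseCores;
    }

    // Amdahl's-law style projection: only parallelFrac of the work
    // speeds up with more cores; the rest stays serialised.
    static double amdahl(double measured, int baseCores, int targetCores, double parallelFrac) {
        double speedupBase = 1.0 / ((1.0 - parallelFrac) + parallelFrac / baseCores);
        double speedupTarget = 1.0 / ((1.0 - parallelFrac) + parallelFrac / targetCores);
        return measured * speedupTarget / speedupBase;
    }

    public static void main(String[] args) {
        double measured = 1000.0; // hypothetical JOPS measured on 4 cores
        System.out.printf("naive 64-core projection:  %.0f JOPS%n", naive(measured, 4, 64));
        System.out.printf("Amdahl 64-core projection: %.0f JOPS%n", amdahl(measured, 4, 64, 0.95));
    }
}
```

Even the second model is optimistic: it ignores the discrete jumps and hard ceilings (memory capacity, bus bandwidth) just mentioned, which is exactly why measured results beat projections.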
There will be limitations on how closely industry standard benchmarks model your chosen or developed application. In the case of SPECjAppServer2004, the developers and participants, companies like IBM, Intel, Sun and Oracle, have looked at our customer base and tried to model the web applications we have seen our customers developing, or perhaps based our modelling decisions on what they told us they were going to develop.
Industry standard benchmarks trail the technology curve and hence will often use an older version of infrastructure or technology than the market would like. This is because the benchmark can't come out until there is an established set of products to run it on, and because it takes time for companies to run and scale the benchmarks and to build a set of results that is useful for end users. For example, development of the SPECjAppServer2004 workload was well underway before there were many (if any) J2EE 1.4 products available, but it wasn't released until after most of the major J2EE application server companies had released their products. Similarly, work is underway on the new version of SPECjAppServer, but it is trailing the availability of the application servers that have implemented the Java Enterprise Edition 5.0 specification.
A performance workload is similar to the benchmark described above, but it lacks the run rules, process and oversight. This means that end users can't (in general) have high levels of confidence in the performance claims made by vendors publishing results based on these workloads. End users reading benchmark reports and performance claims made from workloads that lack the process of a formal benchmark will have much more work to do to decide which comparisons make sense and which may in fact be misleading. For example, a vendor could use a workload like DBT2 to publish test results comparing, say, the largest server hardware running database "A" against a tiny single-CPU server running database "B", and then, without disclosing the different hardware platforms, offer this as data suggesting that database "A" performs better than database "B". Sure, this is an exaggeration, but it serves to demonstrate the value of the process and disclosure rules of the formal benchmarks.
Workloads are often easy to run, easily understood and readily available. This makes them very useful to run in-house and therefore end users can make their own comparisons without having to rely on external vendors.
Workloads don't have the restrictions of the process imposed by the industry standard benchmark bodies, and as such it is much easier to just run and report results from workloads. For example, in the open source database world the SysBench workload is a very valuable tool and is commonly used for performance testing of code changes to the MySQL database. Results of these tests are widely and openly reported and used as the basis for even more performance improvement. One key here is that in this situation the workload is being used collaboratively for investigation, not primarily competitively to sell something.
Workloads are potentially designed by individuals, and so development cycles may be shorter than those of the industry standard benchmarks.
Risk of error, especially in tuning. Even though running performance workloads in house can be relatively straightforward, there is still the risk of getting the wrong answer. Consider trying to determine which is the better performing database, "A" or "B", by doing workload based performance testing of both. The user running the tests and trying to accurately compare the results has to have the expertise to tune both "A" and "B" to the point where they make optimal use of the hardware and operating system resources; otherwise the results may be misleading.
There is a cost to running performance investigations in-house, and though running performance workloads may be relatively cheap, it is potentially more expensive than using published industry standard or vendor benchmark figures where they are available.
A micro benchmark is usually a small, generally simple workload that tests only a limited number of system or user functions.
In fact, most often a micro benchmark will not have any process, such as reporting rules, or any basis for comparison of results, so I believe a better term would really be micro workloads.
Generally free or cheap to download or develop.
Very easy to run and report results on.
Potentially a very powerful tool for diagnosing low-level performance problems.
Because micro benchmarks are generally fairly simple and measure only a very small set of performance attributes, comparisons may in fact be valid across platforms.
It is generally not possible, and rarely a good idea, to predict larger system or application performance based on micro benchmarking. Again, by their nature micro benchmarks test and consider only a small subset of the performance of the system being tested, so it is quite likely that other factors beyond the scope of the micro benchmark will affect total system response and throughput.
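For illustration, a minimal micro workload might look like the sketch below: it times one narrow operation (summing an array) in isolation. The warm-up and iteration counts are arbitrary choices of mine, and a real measurement would need many repetitions and statistical care.

```java
public class MicroWorkload {
    // The single narrow operation under test.
    static long sumArray(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        // Warm up so the JIT has a chance to compile the hot loop.
        for (int i = 0; i < 20; i++) {
            sumArray(data);
        }
        long start = System.nanoTime();
        long result = sumArray(data);
        long elapsedNanos = System.nanoTime() - start;
        System.out.println("sum=" + result + " elapsed=" + elapsedNanos + " ns");
    }
}
```

Nothing here touches the database, the network or the application server, which is precisely why such a result says little about total system throughput.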
In House Application Performance Testing (customer benchmarking)
This is where an end user, customer or software developer creates a purpose-built stress test (workload) for their application or hardware and tests what they intend to run in production.
This is arguably the most reliable way for an end user or application developer to understand their code and environment, as it involves running the actual code planned for production. There is no requirement to try to apply the results and resource utilisation from some other performance test (benchmark or workload), as it is your own code that is being run and directly observed.
This is by far the most accurate way to determine real application and/or system performance: actually test in-house what you intend to run in production. This way no guesses have to be made as to how applicable the test environment is to the intended production environment.
By far this is the most expensive of all the options, but for a company or individual potentially spending a lot of money on hardware or software it may well be the best one, and the potentially substantial costs of developing a test harness and simulation, determining the test parameters and running the performance tests may be well worth it. One caveat is that as software costs fall with the increasing enterprise use of open source software, the cost of running "in house" performance tests may start to look large versus the software purchase cost.
Running in-house application performance benchmarks is still not without risk. A wide variety of skills is required to create the simulation and determine that it covers the expected usage pattern for the application. Different skills are needed to deploy and tune the application and any middleware it requires, and DBA skills will be needed as well as general performance tuning skills... the variables really start to add up.
I hope to have provided a very high level overview and a useful categorisation of the main sources of performance data available today. In my opinion each of these sources or approaches has great value; however, because performance analysis is always contentious and often a more subjective topic than it should be, I am not sure I expect to settle too many debates. Hopefully offering a broader perspective on the value of benchmarking than I have seen in (some) other forums is useful to those who might need or rely on this data.