When Good Benchmarks Go Bad
By drapeau on Oct 05, 2004
First things first: today was a fantastic day for the Siebel-Sun alliance. We announced the results of the benchmark, which puts Sun at the top of the UNIX heap; perhaps even more important than the benchmark results, though, was Siebel's announcement of its intention to port to Solaris 10 on x86 (as well as Solaris 10 on SPARC, of course). Life is good today. But enough of that; back to Benchmarking 101.
So just how difficult could it be to install a piece of software, run it with a simulated workload of 10,000 users logged in and doing stuff at the same time, collect the numbers, and report victory? Let's count, shall we?
To get an idea of the kind of animal we're working with, click on the picture of the topology at the beginning of this entry.
First of all, we don't just have all of this equipment lying around. Sun has a customer benchmark center that gives us access to the best stuff we make, given a good business case for needing it, but that access is only temporary. So when we do a benchmark that tries to scale to 10,000 concurrent users, we can't test with 10,000 users right away; we start with a smaller hardware setup than what you see in the diagram above, work out the kinks, then try to scale up once we get the big iron.
I've already named at least three problems in that paragraph:
- Getting access to the right equipment (lead times are measured in months, and any number of problems can lead to additional delays of several months, especially when we try to do benchmarks with bleeding edge equipment)
- Starting small and scaling up are two different animals. Running a benchmark successfully with 2,000 users is not the same as running it with 10,000. You doubt that? Think about this: is running 2 miles the same as running 10 miles? Want a different, more programmer-friendly analogy? Is a development team of 2 engineers the same as a development team of 10 engineers? I think not.
- All kinds of problems with the benchmark kit itself, or the software being tested, or the operating system + patch levels, or the hardware itself.
What else can go wrong with the kit? Well, some of the data in the kit may be time-sensitive (e.g., a trouble ticket may have a "reply-by" date that is hard coded to, say, six months after the kit was created, the assumption being that benchmarks will be run fairly soon after the kit is created). So if you run the benchmark 9 months after the kit was created, some of the transactions you run may hit those expiration dates, which creates failed transactions, which means you fail your benchmark. That's not good.
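To make that failure mode concrete, here's a minimal sketch in Python. The dates and the run_ticket_transaction helper are entirely made up for illustration; they're not from the real kit, but they show how a hard-coded reply-by date quietly turns into failed transactions once the run date slips:

```python
# Illustrative sketch only -- hypothetical trouble-ticket record whose
# reply-by date was hard coded when the kit was built.
from datetime import date, timedelta

KIT_CREATED = date(2004, 1, 15)                  # hypothetical kit build date
REPLY_BY = KIT_CREATED + timedelta(days=180)     # hard-coded "six months later"

def run_ticket_transaction(run_date: date) -> bool:
    """Return True if the simulated ticket update succeeds."""
    # The application (reasonably) rejects updates to tickets whose
    # reply-by window has already closed.
    return run_date <= REPLY_BY

# Run the kit three months after it was built: fine.
print(run_ticket_transaction(KIT_CREATED + timedelta(days=90)))   # True
# Run the same kit nine months after it was built: failed transactions.
print(run_ticket_transaction(KIT_CREATED + timedelta(days=270)))  # False
```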
Another problem: since the kit drives the application through its UI, it is extremely sensitive to the slightest changes in UI layout (remember, we're simulating mouse clicks on buttons and links and stuff on a web page, which seems crazy to me, but that's how it works for tons of enterprise application benchmarks). If the application changes its UI at all from one minor release to the next, the benchmark kit's scripts may have to be re-recorded, meaning somebody has to manually go into the application and record him/herself logging in with the browser, running through the task(s) in question, then logging out, and save that script back to the benchmark kit. Crazy.
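For flavor, here's a toy sketch of that fragility. Python standing in for the real recording tool, and the element names are invented; the point is just that a recorded script is keyed to specific UI elements:

```python
# Toy illustration only: the real kit uses recorded LoadRunner scripts, not
# Python, and the element names below are made up.
def find_and_click(page_elements, element_id):
    """Simulate a recorded script clicking one specific element by id."""
    if element_id not in page_elements:
        raise RuntimeError(f"script broken: no element with id '{element_id}'")
    return f"clicked '{page_elements[element_id]}'"

# The script was recorded against release N of the application...
release_n = {"btnSubmitSR": "Submit Service Request"}
print(find_and_click(release_n, "btnSubmitSR"))          # works

# ...but a minor UI change renamed the button in release N+1, so every
# virtual user fails at this step until somebody re-records the script.
release_n_plus_1 = {"btnCreateSR": "Create Service Request"}
try:
    find_and_click(release_n_plus_1, "btnSubmitSR")
except RuntimeError as err:
    print(err)
```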
I could go on and on about benchmark kit problems, but you get the idea. Turns out that working out kinks in any benchmark kit can easily take months, and often does. I mean it: months.
And all of this assumes that the application itself works fine under stress, scaling up to a load it has never seen during normal testing. Often it doesn't, and scaling up surfaces a number of additional failures. In a sense, benchmarks are a great mechanism for increasing stability...if it weren't for the tremendous cost of getting all that expensive hardware and the group of specialists it takes to tune everything (DB tuners, web server tuners, OS tuners, Siebel specialists, network specialists, storage experts, lab administration engineers, benchmark kit tuners, LoadRunner experts, etc.). It's not easy to get all of these people together, and it's not cheap. So maybe I should say it this way: full-scale benchmarks are a great way to find and fix stability problems in your software...if you're made of money. Which Sun definitely is not.
Next: Why laziness in benchmark engineering is actually a good thing, and how PG&E can be a benchmark show-stopper.