When Good Benchmarks Go Bad

The topology used by Sun to win back UNIX leadership on Siebel's benchmark performance

First things first: today was a fantastic day for the Siebel-Sun alliance.  We announced the results of the benchmark, which puts Sun at the top of the UNIX heap; perhaps more important than the benchmark results, though, was that Siebel announced its intention to port to Solaris 10 on x86 (as well as Solaris 10 on SPARC, of course).  Life is good today.  But enough of that; back to Benchmarking 101.

So just how difficult could it be to install a piece of software, run it with a simulated workload of 10,000 users logged in and doing stuff at the same time, collect the numbers, and report victory?  Let's count, shall we?

To get an idea of the kind of animal we're working with, click on the picture of the topology at the beginning of this entry.

First of all, we don't just have all of this equipment lying around.  Sun has a customer benchmark center that gives us temporary access to the best stuff we make, given a good business case for needing it.  So when we do a benchmark that tries to scale to 10,000 concurrent users, we have to start with a smaller hardware setup than what you see in the diagram above.  That means we can't test with 10,000 users right away; we start smaller, work out the kinks, then try to scale up when we get the big iron.

I've already named at least three problems in that paragraph:

  1. Getting access to the right equipment (lead times are measured in months, and any number of problems can lead to additional delays of several months, especially when we try to do benchmarks with bleeding edge equipment)
  2. Starting small and scaling up are two different animals.  Running a benchmark with 2,000 users successfully is not the same as running 10,000 users.  You doubt that?  Think about this: is running 2 miles the same as running 10 miles?  Want a different analogy, more programmer-friendly?  Is a development team of 2 engineers the same as a development team of 10 engineers?  I think not.
  3. All kinds of problems with the benchmark kit itself, or the software being tested, or the operating system + patch levels, or the hardware itself.

Let's talk about the benchmark kit.  Suppose you're creating a benchmark that simulates 10,000 users all using a call center application: logging in, clicking on stuff in the UI (yes, we're simulating users actually using a UI), getting results, doing stuff with those results, then logging out.  Well, the people who create that benchmark kit build a sample database with customer data in there, a set of users that represent the call center agents, a simulated set of tasks that each call center agent is going to run, etc.  This can get to be pretty involved; the benchmark kit might only have a sample DB of 150 users' worth of stuff, not the 10,000 you want.  It's your job to figure out how to make the sample data "bigger" in a way that doesn't break the kit when you run it.
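To make that concrete, here's a minimal sketch of the "make the sample data bigger" chore.  Everything here is invented for illustration (table, columns, key format) and bears no resemblance to the real Siebel schema; the point is just that cloned rows need unique, internally consistent keys or the kit breaks mid-run.

```python
# Hypothetical sketch: expanding a 150-account seed dataset toward 10,000
# accounts by cloning rows with unique keys.  Table and column names are
# invented; a real benchmark kit has a far richer, interlinked schema.
import sqlite3

TARGET_ACCOUNTS = 10_000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, name TEXT)")

# Stand-in for the 150 accounts' worth of data that ships with the kit.
seed = [(f"ACCT-{i:03d}", f"Seed Customer {i}") for i in range(150)]
conn.executemany("INSERT INTO accounts VALUES (?, ?)", seed)

# Clone each seed row under a new primary key until we reach the target.
# The clones must stay internally consistent (foreign keys, login names,
# and so on) or the benchmark scripts fail partway through a run.
rows = conn.execute("SELECT account_id, name FROM accounts").fetchall()
copy = 0
while conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0] < TARGET_ACCOUNTS:
    copy += 1
    for acct_id, name in rows:
        conn.execute(
            "INSERT INTO accounts VALUES (?, ?)",
            (f"{acct_id}-C{copy}", f"{name} (copy {copy})"),
        )
```

In real life the hard part isn't the copying, it's keeping every cross-reference in the copied data valid, which is exactly where a naive expansion makes the kit fall over.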

What else can go wrong with the kit?  Well, some of the data in the kit may be time-sensitive (e.g., a trouble ticket may have a "reply-by" date that is hard coded to, say, six months after the kit was created, the assumption being that benchmarks will be run fairly soon after the kit is created).  So if you run the benchmark 9 months after the kit was created, some of the transactions you run might hit expiration dates, which creates failed transactions, which means you fail your benchmark.  That's not good.
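One common workaround is to sweep the dataset before a run and push every stale deadline forward.  Here's a toy sketch of that idea; the table, columns, and dates are all made up for illustration, not taken from any real kit.

```python
# Hypothetical sketch: shifting hard-coded "reply-by" dates forward before
# a run, so a kit built nine months ago doesn't generate expired-ticket
# failures.  Table and column names are invented for illustration.
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (ticket_id TEXT PRIMARY KEY, reply_by TEXT)")

# A ticket whose reply-by date (kit creation + 6 months) is now in the past.
conn.execute("INSERT INTO tickets VALUES ('SR-001', '2004-04-01')")

# Pretend "today" is fixed so the example is reproducible.
today = date(2004, 10, 5)

# Shift every past-due deadline forward by a fixed offset so all of them
# land safely after the benchmark run finishes.
for ticket_id, reply_by in conn.execute(
        "SELECT ticket_id, reply_by FROM tickets").fetchall():
    due = date.fromisoformat(reply_by)
    if due < today:
        shifted = due + timedelta(days=365)
        conn.execute("UPDATE tickets SET reply_by = ? WHERE ticket_id = ?",
                     (shifted.isoformat(), ticket_id))
```

The catch, of course, is finding every place the kit hides a date; miss one and you get mysterious failed transactions hours into a run.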

Another problem: since the kit has UI in it, it is extremely sensitive to the slightest changes in UI layout (remember, we're simulating mouse clicks on buttons and links and stuff on a web page, which seems crazy to me but that's how it works for tons of enterprise application benchmarks).  If the application makes any changes in UI from one minor release to another, the benchmark kit's scripts may have to be re-recorded, meaning somebody has to manually go into the application and record him/herself logging in with the browser, running through the task(s) in question, then logging out and saving that script back to the benchmark kit.  Crazy.
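To see why a recorded script is so fragile, here's a toy matcher that checks whether a recorded link still exists on a page.  This is not how LoadRunner actually works; it's just a made-up illustration of the failure mode, with invented names throughout.

```python
# Hypothetical sketch of why recorded UI scripts are brittle: the replay
# matches elements by their exact recorded text, so a one-word label change
# in a minor release breaks the script.  Names are invented; real kits use
# tools like LoadRunner, not this toy matcher.
import re

recorded_step = 'click link "My Service Requests"'  # captured at record time

def replay(step: str, page_html: str) -> bool:
    """Return True if the recorded link text still appears on the page."""
    target = re.search(r'"(.+)"', step).group(1)
    return f">{target}<" in page_html

old_page = '<a href="/srs">My Service Requests</a>'
new_page = '<a href="/srs">My Open Service Requests</a>'  # minor UI tweak

assert replay(recorded_step, old_page) is True
assert replay(recorded_step, new_page) is False  # script must be re-recorded
```

A single renamed button, and a human has to sit down, click through the flow in a browser, and re-record the script.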

I could go on and on about benchmark kit problems, but you get the idea.  Turns out that working out kinks in any benchmark kit can easily take months, and often does.  I mean it: months.

And this assumes that the application itself works fine when running under stress or scaling up to a load it has never seen during normal testing.  Often it doesn't, which causes a number of additional failures when trying to scale up.  In a sense, benchmarks are a great mechanism to increase stability...if it weren't for the tremendous cost of getting all that expensive hardware and the group of specialists it takes to tune everything (DB tuners, web server tuners, OS tuners, Siebel specialists, network specialists, storage experts, lab administration engineers, benchmark kit tuners, LoadRunner experts, etc.).  It's not easy to get all of these people together, and it's not cheap.  So maybe I should say it this way: full-scale benchmarks are a great way to find and fix stability problems in your software...if you're made of money.  Which Sun definitely is not.

Next: Why laziness in benchmark engineering is actually a good thing, and how PG&E can be a benchmark show-stopper.

Good insight about benchmarks. But you know what, customers ask us consultants to reproduce benchmark numbers in their environment :(

Posted by bbr on October 05, 2004 at 07:08 PM PDT #

Ouch, don't I know this pain :(. Can't think of the number of times where that first run with a small dataset goes fine, and you start off a bunch of experiments, and well, it goes pear shaped. The first 80% is always nice and handy, it's that final 20% where you spend 80% of the work. - F.

Posted by fintanr on October 05, 2004 at 11:01 PM PDT #

reading your blog from 'Siebel User week' in LA. Good job George, bloggin about the siebel benchmark complexities, creates an awareness among people who think achieving a world record benchmark is a piece of cake. Also it shows how difficult my job is !! :-). I was asked by several siebel customers at SUW, if the benchmark numbers can be re-produced at customer production sites. My answer is 'yes' why not, if you are running the same workload and have the same kinda hardware/network and apply all tunables used in benchmark env. This is easier said than done - but it is not 'impossible'.

Posted by Khader Mohiuddin on October 06, 2004 at 09:19 AM PDT #



