Thursday May 31, 2007

Using DTrace to Capture Statement Execution Times in MySQL


I have recently been engaged with a customer that is evaluating MySQL, in particular with its Memory (formerly known as Heap) engine, which stores all database data and metadata in RAM.

I was asked to assist with diagnosing if/whether/where statements were taking longer than 5 milliseconds to complete. Now, this was being observed from the viewpoint of the "client" - the client was a synthetic benchmark run as a Java program. It could be run either on a separate system or on the same system as the MySQL database, and a small number of transactions would be measured as taking longer than 5ms.

Now, there is more than one way to skin this cat, and it turns out that Jenny Chen has had a go at putting static probes into MySQL. For my (and the customer's) purposes however, we wanted to skip the step of re-compiling with our own probes, and just use what we can observe via use of the PID provider.

How Do We Do This?

Well, it is not trivial. However as it turns out, I have seen a bit of the MySQL code. I also had someone pretty senior from MySQL next to me, who helped confirm what was happening, as I used some "fishing" bits of DTrace to watch a mysqld thread call functions as we ran the "mysql" client and entered simple statements.

This allowed me to narrow down on a couple of vio_\*() routines, and to build some pseudo-code to describe the call flow around reception of requests from a client, processing of the statement, then return of a result to the client.

It is not as simple as looking for the entry and return of a single function, because I wanted to capture the full time from when the mysqld thread returns from a read(2) indicating a request has arrived from a client through to the point where the same thread has completed a write(2) to send the response back. This is the broadest definition of "response time measure at the server".

The Result

The result of all of our measurements showed that there were no statements from the synthetic benchmark that took longer than 5 ms to complete. Here is an example of the output of my DTrace capture (a histogram of microseconds):

bash-3.00# ./request.d -p `pgrep mysqld`

           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@                    10691
             500 |@@@@@@@@@@@@@@@@@                        8677
            1000 |@                                        680
            1500 |                                         31
            2000 |                                         0

The Script

Feel free to use the DTrace script for your own purposes. It should work on MySQL 5.0 and 5.1.

The Upshot - Observability in the JVM

There is a nagging question remaining - why was the Java client seeing some requests run longer than 5 ms?.

There are three possible answers I can think of:

  1. There is latency involved in transmitting the requests and responses between the client and server (e.g. network packet processing).
  2. The thread inside the JVM was being pre-empted (thrown off the CPU) between taking its time measurements.
  3. The measurements (taken using System.nanoTime()) are not reliable.

Friday Mar 16, 2007

Simplifying "lockstat -I" Output (or Ruby Is Better Than Perl)

There, I said it. I have been a Perl scripter for nearly 20 years (since early version 3). I think Ruby has pretty much everything Perl has, and more, like:

  • It is really object-oriented (rather than Perl's bolt-on objects). I am much more likely to end up using this functionality.
  • It has operator overloading
  • It has a "case" statement
  • There is no Obfuscated Ruby Contest (though IMHO there could be one)

Anyway, enough religous argument. I want to thank my colleague Neel for putting me onto it, and now I will provide a simple example.

Post-Processing "lockstat -I"

The Ruby script below post-processes the output of "lockstat -I". Why you ask? - well, because the output of "lockstat -I" can be tens of thousands of lines, and it does not coalesce by CPU.

The script below will coalesce by CPU, and PC within a function (useful if you forgot the "-k" flag to lockstat). A very simple change will also make it coalesce by PIL, but I will leave that as an exercise for the reader.

Ruby Post-Processor Script

One thing about this is that I would write this script almost exactly the same way if I used Perl. That is another plus of Ruby - it is easy to pick-up for a Perl programmer.

#!/bin/env ruby -w
# lockstatI.rb -	Simplify "lockstat -I" output
#, 16 Mar 2007
#	lockstat -I [-i 971 -n <nnnnnn> ] sleep <nnn> | lockstatI.rb

PROG = "lockstatI"

#-- Once we have printed values that cover this proportion, then quit
CUTOFF = 95.0

DASHES = ('-' \* 79) + "\\n"


print "#{PROG} - will display top #{CUTOFF}% of events\\n"

events = 0
period = 0
state = 0
counts = {}

#-- The classic state machine
ARGF.each_line do |line|
    next if line == "\\n"
    case state
    when 0
	if line =~ /\^Profiling interrupt: (\\d+) events in (\\d.\*\\d) seconds/
	    puts line
	    state = 1
	    events = $1
	    period = $2
    when 1
	if line == DASHES then
	    state = 2
    when 2
	if line == DASHES then
	f = line.split
	count = f[0]
	cpu, pil = f[5].split('+')

	#-- Coalesce PCs within functions; i.e. do not differentiate by
	#-- offset within a function.  Useful if "-k" was not specified
	#-- on the lockstat command.
	caller = f[6].split('+')[0]

	if pil then
	    caller = caller + "[" + pil + "]"
	counts[caller] = counts[caller].to_i + count.to_i

#-- Give me an array of keys sorted by descending value
caller_keys = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

#-- Dump it out
printf "%12s  %5s  %5s  %s\\n", "Count", "%", "cuml%", "Caller[PIL]"
cuml = 0.0
caller_keys.each do |key|
    percent = counts[key].to_f \* 100.0 / events.to_f
    cuml += percent
    printf "%12d  %5.2f  %5.2f  %s\\n", counts[key], percent, cuml, key
    if cuml >= CUTOFF then

Example Output

Free beer to the first person to tell me what this is showing. It was not easy to comprehend the 90,000 line "lockstat -I" output prior to post-processing it though. You get this problem when you have 72 CPUs...

lockstatI - will display top 95.0% of events
Profiling interrupt: 4217985 events in 60.639 seconds (69559 events/sec)
       Count      %  cuml%  Caller[PIL]
     1766747  41.89  41.89  disp_getwork[11]
     1015005  24.06  65.95  (usermode)
      502560  11.91  77.86  lock_set_spl_spin
       83066   1.97  79.83  lock_spin_try[11]
       62670   1.49  81.32  mutex_enter
       53883   1.28  82.60  mutex_vector_enter
       40847   0.97  83.57  idle
       40024   0.95  84.51  lock_set_spl[11]
       27004   0.64  85.15  splx
       17432   0.41  85.57  send_mondo_set[13]
       15876   0.38  85.94  atomic_add_int
       14841   0.35  86.30  disp_getwork

Thursday Feb 22, 2007

nicstat - now for Linux, too

Just a quick one to flag that I have released a version of nicstat for Linux (see latest blog on nicstat).

I do not have a myriad of Linux systems to test it on, so if anyone finds any issues, please let me know.

Wednesday Feb 14, 2007

nicstat - the Solaris Network Monitoring Tool You Did Not Know You Needed


This is a placeholder entry - see the latest blog on nicstat, for the current source and binaries.

Tuesday Jan 23, 2007

The Compiler Detective - What Compiler and Options Were Used to Build This Application?

("cc CSI")


Performance engineers often look at improving application performance by getting the compiler to produce more efficient binaries from the same source. This is done by changing what compiler options are used. In this modern era of Open Source Software, you can often get your hands on a number of binary distributions of an application, but if you really want to roll your sleeves up, the source is there, just waiting to be compiled with the latest compiler and optimisations.

Now, it might be useful to have as a reference the compiler version and flags that were originally used on the binary distribution you tried out, or you just might be interested to know. Read on for details on the forensic tools.

What architecture was the executable compiled for?

Solaris supports 64 and 32-bit programming models on SPARC and x86. You may need to know which one an application is using - it's easy enough to find out.

$ file \*.o
xxx.o:          ELF 64-bit LSB relocatable AMD64 Version 1
yyy.o:          ELF 32-bit LSB relocatable 80386 Version 1 [SSE2]

Note: yyy.o was built using "-native", informing the compiler that it can use the SSE2 instruction set extensions supported by the CPUs on the build system.

Some more examples, including a binary built on a SunOS 4.x system:

$ file gobble.\*
gobble.sunos4:  Sun demand paged SPARC executable dynamically linked
gobble.sparcv9: ELF 64-bit MSB executable SPARCV9 Version 1,
dynamically linked, not stripped
gobble.i386:    ELF 32-bit LSB executable 80386 Version 1 [FPU],
dynamically linked, not stripped
gobble.amd64:   ELF 64-bit LSB executable AMD64 Version 1 [SSE FXSR FPU],
dynamically linked, not stripped

Can I detect what version of whose compiler was used to build a binary, and on what version of Solaris?

The command to use is:

$ /usr/ccs/bin/mcs -p executable | object

Here are some examples. The "ld:" entries should indicate what release of Solaris the executable was built on, which is a good indicator of the oldest version of Solaris it will run on.

  • Sun Studio 11, on Solaris 10:
        acomp: Sun C 5.8 2005/10/13
        iropt: Sun Compiler Common 11 2005/10/13
        as: Sun Compiler Common 11 2005/10/13
        ld: Software Generation Utilities - Solaris Link Editors: 5.10-1.477
  • An old version of GCC (this was built in 1994, looks like Solaris 2.3):
        @(#)SunOS 5.3 Generic September 1993
        GCC: (GNU) 2.5.8
        as: SC3.0 early access 01 Sep 1993
        ld: (SGU) SunOS/ELF (LK-1.3)
  • Sun C 5.0, Solaris???
        @(#)stdio.h     1.78    99/12/08 SMI
        acomp: WorkShop Compilers 5.0 98/12/15 C 5.0
        ld: Software Generation Utilities - Solaris-ELF (4.0)
  • GCC 2.8.1, Solaris 7:
        @(#)SunOS 5.7 Generic October 1998
        as: WorkShop Compilers 5.0 Alpha 03/27/98 Build
        GCC: (GNU) 2.8.1
        ld: Software Generation Utilities - Solaris/ELF (3.0)
  • GCC 3.4.3 (object). The GCC compiler was built on/for "sol210" [sic]
        GCC: (GNU) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)

There is a non-linear relationship between versions of the compiler ("Sun C") and the development suite ("Sun Studio", "Forte Developer", "Sun WorkShop", etc.). You can figure it out using Sun Studio Support Matrix.

Can I figure out what compiler flags were used?

It depends on what compiler was used, and whether you have objects.

  • For Sun Studio 11, use "dwarfdump -i", and look for "DW_AT_SUN_command_line" in the output.
  • For Sun Studio 10 and earlier, use "dumpstabs -s", and look for "CMDLINE" in the output.
  • For GCC, you can only figure out what version of GCC was used (see above).

"dwarfdump" and "dumpstabs" are available with the Sun Studio suite.

Some examples:

$ dwarfdump -i gobble.sparcv9
                DW_AT_name                  gobble.c
                DW_AT_language              DW_LANG_C99
                DW_AT_comp_dir              /home/tc35445/tools/c/gobble
                DW_AT_SUN_command_line       /usr/dist/share/sunstudio_sparc,
v11.0/SUNWspro/prod/bin/cc -xO3 -xarch=v9 -c  gobble.c
                DW_AT_SUN_compile_options   Xa;O;R=Sun C 5.8 2005/10/13;

$ dumpstabs -s test
   5:  .stabs "Xa ; g ; V=3.1 ; R=Sun WorkShop 6 2000/04/07 C 5.1
 ; G=$XAB0qnBB6
   6:  .stabs "/net/vic/export/home/timc/tools/c/interposition;
 /opt/Forte6/SUNWspro/bin/../WS6/bin/cc -g -c  test.c

$ dumpstabs -s getwd
   1:  .stabs "Xa ; V=3.1 ; R=WorkShop Compilers 4.2 30 Oct 1996 C 4.2",
   2:  .stabs "/home/staff/timc/tools/c/getwd;
 /opt/SUNWspro/bin/../SC4.2/bin/cc -c  getwd.c -W0,-xp",N_CMDLINE,0x0,0x0,0x0

Can I figure out where the binary will look for shared objects?

If an RPATH has been specified, dumpstabs can find it:
$ dumpstabs mysqladmin | grep RPATH
13: Tag = 15 (RPATH)    /usr/sfw/lib/mysql:/usr/sfw/lib

Can I figure out what shared objects the binary needs?

Yes, just use "ldd":

$ ldd mysqladmin =>  /usr/sfw/lib/ =>     /usr/lib/ =>    /usr/lib/ =>       /usr/lib/ =>   /usr/lib/ =>        /usr/lib/ =>   /usr/lib/ =>     /usr/lib/ =>        /usr/lib/ =>     /usr/lib/ =>   /usr/lib/ =>   /usr/lib/ =>    /usr/lib/ =>   /usr/lib/ =>  /usr/lib/ =>         /usr/lib/


Wednesday Sep 13, 2006

How the T2000 handles 20,000 Database Connections

I have recently changed careers at Sun, moving from a pre/post-sales technical role to an engineer role. I now work in Sun's PAE (Performance Availability & Architecture Engineering) group.

You could say I am definitely enjoying the change, getting to play with real geeky stuff :)

Both jobs had the opportunity to play with new hardware, but the new job should see me get much more play time with pre-release systems, which I find very interesting.

Anyway, I had the chance to do some testing on one of our new systems that many customers have also had a great chance to test on - the T2000.

The workload I am using is designed to replicate a database query load that is common for some parts of modern on-line retail operations - almost exclusive relatively simple queries, running pretty much fully cached in the RAM of the database server. The queries come from a very large population of clients (5 - 20 thousand).

As a highly analytical fact-oriented engineer, you need to realise that up until my first opportunity to experience the T2000 first-hand, I remained just a little sceptical that it could deliver something like the promise of what I had heard. It has been covered that the T2000 suits some workloads better than others - I was keen to get a quantitative understanding of this to support the qualitative things I had heard.

My base system I had been doing tests on was an 8-CPU (16 core) E2900. I expected this would provide a pretty fair comparison to a system that has 8 cores in its single CPU. In fact I knew of some preliminary numbers for the T2000 which made be think the E2900 would come out slightly ahead.

I need to update my thinking. The advantage of four threads per core is proven by reality, with the 8-core T2000 showing it certainly had greater throughput for my workload than the 16-core E2900.

As you can see in the graph, the T2000's CPU consumption is well ahead of the E2900 (lower is better). Also, as I scale up the number of connections (keeping the overall throughput the same), the T2000 does not see as much of a penalty for the increase.

Chart: CPU Utilization - E2900 v T2000



Tim Cook's Weblog The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.


« July 2016