Thursday May 31, 2007

Using DTrace to Capture Statement Execution Times in MySQL


I have recently been engaged with a customer that is evaluating MySQL, in particular with its Memory (formerly known as Heap) engine, which stores all database data and metadata in RAM.

I was asked to help diagnose whether, and if so where, statements were taking longer than 5 milliseconds to complete. This was being observed from the viewpoint of the "client" - a synthetic benchmark run as a Java program. It could be run either on a separate system or on the same system as the MySQL database, and a small number of transactions would be measured as taking longer than 5 ms.

Now, there is more than one way to skin this cat, and it turns out that Jenny Chen has had a go at putting static probes into MySQL. For my (and the customer's) purposes however, we wanted to skip the step of re-compiling with our own probes, and just use what we can observe via use of the PID provider.

How Do We Do This?

Well, it is not trivial. However as it turns out, I have seen a bit of the MySQL code. I also had someone pretty senior from MySQL next to me, who helped confirm what was happening, as I used some "fishing" bits of DTrace to watch a mysqld thread call functions as we ran the "mysql" client and entered simple statements.

This allowed me to narrow down on a couple of vio_*() routines, and to build some pseudo-code to describe the call flow around reception of requests from a client, processing of the statement, then return of a result to the client.

It is not as simple as looking for the entry and return of a single function, because I wanted to capture the full time from when the mysqld thread returns from a read(2) indicating a request has arrived from a client through to the point where the same thread has completed a write(2) to send the response back. This is the broadest definition of "response time measure at the server".
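For a flavour of the approach, here is a minimal sketch (an illustrative reconstruction, not the actual request.d script; the probe points and filtering are simplified, and the real script also followed the vio_*() call path via the PID provider to avoid timing reads that are not client requests):

```d
#!/usr/sbin/dtrace -qs
/*
 * Sketch only: time from a mysqld thread returning from read(2)
 * (request received) to the same thread returning from write(2)
 * (response sent).
 * Usage: ./sketch.d -p `pgrep mysqld`
 */
syscall::read:return
/pid == $target && arg0 > 0/
{
	self->start = timestamp;
}

syscall::write:return
/pid == $target && self->start/
{
	@["request time (us)"] =
	    lquantize((timestamp - self->start) / 1000, 0, 5000, 500);
	self->start = 0;
}
```

The lquantize() aggregation is what produces a histogram in 500-microsecond buckets, like the output shown below.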

The Result

The result of all of our measurements showed that there were no statements from the synthetic benchmark that took longer than 5 ms to complete. Here is an example of the output of my DTrace capture (a histogram of microseconds):

bash-3.00# ./request.d -p `pgrep mysqld`

           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@                    10691
             500 |@@@@@@@@@@@@@@@@@                        8677
            1000 |@                                        680
            1500 |                                         31
            2000 |                                         0

The Script

Feel free to use the DTrace script for your own purposes. It should work on MySQL 5.0 and 5.1.

The Upshot - Observability in the JVM

There is a nagging question remaining - why was the Java client seeing some requests run longer than 5 ms?

There are three possible answers I can think of:

  1. There is latency involved in transmitting the requests and responses between the client and server (e.g. network packet processing).
  2. The thread inside the JVM was being pre-empted (thrown off the CPU) between taking its time measurements.
  3. The measurements (taken using System.nanoTime()) are not reliable.

Wednesday May 30, 2007

Case Study - AT&T Customer Relationship Management

Time-line of Events

April 20th, 2006 - I apply for SBC Yahoo Internet, which includes a sweetheart deal that will last for 12 months. The terms & conditions mention an early termination fee of $99.

April 21st, 2006 - The service is connected.

April 13th, 2007 (approx) - I contact AT&T (which had since re-acquired SBC) to disconnect phone & internet service, due to my imminent move to a new apartment. Disconnection is booked for April 14th. There is no mention from AT&T call center staff of an early termination fee.

April 18th, 2007 - AT&T issue me a bill. It is for $17.22 and is clearly marked "FINAL".

May 16th, 2007 - AT&T issue me another bill. This one is marked "REVISED FINAL BILL", and contains a $99 charge for "EARLY TERMINATION-HSI BASIC".

May 25th, 2007 - I contact AT&T and ask for an explanation. They explain the early termination fee. I ask if I can get my service re-connected. I am told that no, I can not get it re-connected, as it has been more than 30 days since it was disconnected. I ask if there is any way this situation can be rectified, and suggest that I could reconnect if a credit were made to cover the early termination fee. There is no flexibility available to the person I am speaking to, though they do empathize with my situation.


June 26th, 2007 - I send a check to AT&T for the $99, but enclose a copy of this blog entry, with a request that it be forwarded to their customer service department.

July 27th, 2007 - AT&T send a check for $99 back to me, refunding the early termination fee.


So, initially I was disappointed, but ultimately I am happy with AT&T. I could never say that they were in the wrong, but I would still suggest that they give more latitude to the people who work in their call centers.

It would also be wise to advise customers of the pending early termination fee at the earliest opportunity.

Friday Mar 16, 2007

Simplifying "lockstat -I" Output (or Ruby Is Better Than Perl)

There, I said it. I have been a Perl scripter for nearly 20 years (since early version 3). I think Ruby has pretty much everything Perl has, and more, like:

  • It is really object-oriented (rather than Perl's bolt-on objects). I am much more likely to end up using this functionality.
  • It has operator overloading
  • It has a "case" statement
  • There is no Obfuscated Ruby Contest (though IMHO there could be one)
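To give a flavour of the first three points, here is a tiny stand-alone example (my own sketch, not from the lockstat script below):

```ruby
# Operator overloading: the class itself defines what "+" means.
class Vec
  attr_reader :x, :y
  def initialize(x, y)
    @x, @y = x, y
  end
  def +(other)
    Vec.new(@x + other.x, @y + other.y)
  end
end

# A real "case" statement, here matching against Ranges.
def bucket(usec)
  case usec
  when 0...500    then "fast"
  when 500...1000 then "ok"
  else                 "slow"
  end
end

v = Vec.new(1, 2) + Vec.new(3, 4)
puts "#{v.x},#{v.y}"    # -> 4,6
puts bucket(680)        # -> ok
```

Ranges work in "case" because it dispatches via each candidate's === operator - itself an example of operator overloading.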

Anyway, enough religious argument. I want to thank my colleague Neel for putting me onto it, and now I will provide a simple example.

Post-Processing "lockstat -I"

The Ruby script below post-processes the output of "lockstat -I". Why, you ask? Well, because the output of "lockstat -I" can be tens of thousands of lines, and it does not coalesce by CPU.

The script below will coalesce by CPU, and PC within a function (useful if you forgot the "-k" flag to lockstat). A very simple change will also make it coalesce by PIL, but I will leave that as an exercise for the reader.

Ruby Post-Processor Script

One thing about this is that I would write this script almost exactly the same way if I used Perl. That is another plus of Ruby - it is easy to pick-up for a Perl programmer.

#!/bin/env ruby -w
# lockstatI.rb -	Simplify "lockstat -I" output
#, 16 Mar 2007
#	lockstat -I [-i 971 -n <nnnnnn> ] sleep <nnn> | lockstatI.rb

PROG = "lockstatI"

#-- Once we have printed values that cover this proportion, then quit
CUTOFF = 95.0

DASHES = ('-' * 79) + "\n"


print "#{PROG} - will display top #{CUTOFF}% of events\n"

events = 0
period = 0
state = 0
counts = {}

#-- The classic state machine
ARGF.each_line do |line|
    next if line == "\n"
    case state
    when 0
	#-- Wait for the summary line; remember the totals from it
	if line =~ /^Profiling interrupt: (\d+) events in (\d.*\d) seconds/
	    puts line
	    state = 1
	    events = $1
	    period = $2
	end
    when 1
	#-- Skip the column headers; data starts after the dashed line
	state = 2 if line == DASHES
    when 2
	#-- A second dashed line marks the end of the data
	break if line == DASHES
	f = line.split
	count = f[0]
	cpu, pil = f[5].split('+')

	#-- Coalesce PCs within functions; i.e. do not differentiate by
	#-- offset within a function.  Useful if "-k" was not specified
	#-- on the lockstat command.
	caller = f[6].split('+')[0]

	caller = caller + "[" + pil + "]" if pil
	counts[caller] = counts[caller].to_i + count.to_i
    end
end

#-- Give me an array of keys sorted by descending value
caller_keys = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

#-- Dump it out
printf "%12s  %5s  %5s  %s\n", "Count", "%", "cuml%", "Caller[PIL]"
cuml = 0.0
caller_keys.each do |key|
    percent = counts[key].to_f * 100.0 / events.to_f
    cuml += percent
    printf "%12d  %5.2f  %5.2f  %s\n", counts[key], percent, cuml, key
    break if cuml >= CUTOFF
end

Example Output

Free beer to the first person to tell me what this is showing. It was not easy to comprehend the 90,000 line "lockstat -I" output prior to post-processing it though. You get this problem when you have 72 CPUs...

lockstatI - will display top 95.0% of events
Profiling interrupt: 4217985 events in 60.639 seconds (69559 events/sec)
       Count      %  cuml%  Caller[PIL]
     1766747  41.89  41.89  disp_getwork[11]
     1015005  24.06  65.95  (usermode)
      502560  11.91  77.86  lock_set_spl_spin
       83066   1.97  79.83  lock_spin_try[11]
       62670   1.49  81.32  mutex_enter
       53883   1.28  82.60  mutex_vector_enter
       40847   0.97  83.57  idle
       40024   0.95  84.51  lock_set_spl[11]
       27004   0.64  85.15  splx
       17432   0.41  85.57  send_mondo_set[13]
       15876   0.38  85.94  atomic_add_int
       14841   0.35  86.30  disp_getwork

Wednesday Feb 28, 2007

SSH Cheat Sheet

This is offered for those who want to kick their telnet habit. I also offer a simple text version, which you can keep in ~/.ssh.

To create an SSH key for an account

srchost$ ssh-keygen -t rsa

This will create id_rsa and id_rsa.pub in ~/.ssh. "-t dsa" can be used instead. You will need an SSH key if you want to log in to a system without supplying a password.

To be able to log in to desthost from srchost without a password (as below)

srchost$ ssh desthost

Simply add the contents of srchost:~/.ssh/id_rsa.pub to desthost:~/.ssh/authorized_keys in the form "ssh-rsa AAAkeystringxxx= myusername@srchost".

To enable forwarding of an X-windows session back to your $DISPLAY on srchost

Just use "-X":

srchost$ ssh -X desthost
desthost$ xterm

If I use a different account on desthost (and I want to use a short name for desthost)

Add something like this to srchost:~/.ssh/config:

Host	paedata
    Hostname	paedata.sfbay
    User	tc35445

Still Getting Prompted For a Password

If I find that my key is not being recognised on desthost (I still get prompted for a password), I probably have a permission problem. Try this as the user on desthost:

chmod g-w,o-w .
chmod g=,o= .ssh .ssh/authorized_keys

To allow root logins on a host (a password or authorized key must still be supplied)

  1. Edit /etc/ssh/sshd_config, changing the PermitRootLogin line to
    PermitRootLogin yes
  2. Solaris 9 & earlier:
    # /etc/init.d/sshd restart
  3. Solaris 10 & later:

    # svcadm restart ssh

Here is a patch (it will save the original config file in sshd_config.orig):

/usr/bin/patch -b /etc/ssh/sshd_config << 'EOT'
--- sshd_config.orig       Fri Feb  2 11:27:12 2007
+++ sshd_config Fri Feb 23 14:12:24 2007
@@ -129,7 +129,8 @@
 # Note that sshd uses pam_authenticate(3PAM) so the root (or any other) user
 # maybe denied access by a PAM module regardless of this setting.
 # Valid options are yes, without-password, no.
-PermitRootLogin no
+#PermitRootLogin no
+PermitRootLogin yes

 # sftp subsystem
 Subsystem      sftp    /usr/lib/ssh/sftp-server

Host Key Has Changed

This happens when you reconfigure or (re-)install Solaris on desthost. You can avoid it by restoring /etc/ssh/ssh_host_*_key*. Otherwise:

bash$ ssh katie
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has been changed.
The fingerprint for the RSA key sent by the remote host is
md5 8e:c4:53:93:64:5b:2d:b4:f8:e9:a8:9c:d9:95:4a:70.
Please contact your system administrator.
Add correct host key in /home/tc35445/.ssh/known_hosts
Offending key is entry 3 in /home/tc35445/.ssh/known_hosts
RSA host key for katie has changed and you have requested strict checking.

Solution - remove the "katie" entry in ~/.ssh/known_hosts and log-in again - ssh will put a new host key in for you.
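On OpenSSH versions that have the -R option, ssh-keygen can remove the stale entry for you (on older releases, edit the file by hand as above):

```shell
# Remove any cached host key for "katie" from ~/.ssh/known_hosts
# (a backup is saved as known_hosts.old), then reconnect to accept
# the new key.
ssh-keygen -R katie
ssh katie
```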

Thursday Feb 22, 2007

nicstat - now for Linux, too

Just a quick one to flag that I have released a version of nicstat for Linux (see latest blog on nicstat).

I do not have a myriad of Linux systems to test it on, so if anyone finds any issues, please let me know.

Wednesday Feb 14, 2007

nicstat - the Solaris Network Monitoring Tool You Did Not Know You Needed


This is a placeholder entry - see the latest blog on nicstat, for the current source and binaries.

Tuesday Jan 23, 2007

The Compiler Detective - What Compiler and Options Were Used to Build This Application?

("cc CSI")


Performance engineers often look at improving application performance by getting the compiler to produce more efficient binaries from the same source. This is done by changing what compiler options are used. In this modern era of Open Source Software, you can often get your hands on a number of binary distributions of an application, but if you really want to roll your sleeves up, the source is there, just waiting to be compiled with the latest compiler and optimisations.

Now, it might be useful to have as a reference the compiler version and flags that were originally used on the binary distribution you tried out, or you just might be interested to know. Read on for details on the forensic tools.

What architecture was the executable compiled for?

Solaris supports 64 and 32-bit programming models on SPARC and x86. You may need to know which one an application is using - it's easy enough to find out.

$ file *.o
xxx.o:          ELF 64-bit LSB relocatable AMD64 Version 1
yyy.o:          ELF 32-bit LSB relocatable 80386 Version 1 [SSE2]

Note: yyy.o was built using "-native", informing the compiler that it can use the SSE2 instruction set extensions supported by the CPUs on the build system.

Some more examples, including a binary built on a SunOS 4.x system:

$ file gobble.*
gobble.sunos4:  Sun demand paged SPARC executable dynamically linked
gobble.sparcv9: ELF 64-bit MSB executable SPARCV9 Version 1,
dynamically linked, not stripped
gobble.i386:    ELF 32-bit LSB executable 80386 Version 1 [FPU],
dynamically linked, not stripped
gobble.amd64:   ELF 64-bit LSB executable AMD64 Version 1 [SSE FXSR FPU],
dynamically linked, not stripped

Can I detect what version of whose compiler was used to build a binary, and on what version of Solaris?

The command to use is:

$ /usr/ccs/bin/mcs -p <executable or object>

Here are some examples. The "ld:" entries should indicate what release of Solaris the executable was built on, which is a good indicator of the oldest version of Solaris it will run on.

  • Sun Studio 11, on Solaris 10:
        acomp: Sun C 5.8 2005/10/13
        iropt: Sun Compiler Common 11 2005/10/13
        as: Sun Compiler Common 11 2005/10/13
        ld: Software Generation Utilities - Solaris Link Editors: 5.10-1.477
  • An old version of GCC (this was built in 1994, looks like Solaris 2.3):
        @(#)SunOS 5.3 Generic September 1993
        GCC: (GNU) 2.5.8
        as: SC3.0 early access 01 Sep 1993
        ld: (SGU) SunOS/ELF (LK-1.3)
  • Sun C 5.0, Solaris???
        @(#)stdio.h     1.78    99/12/08 SMI
        acomp: WorkShop Compilers 5.0 98/12/15 C 5.0
        ld: Software Generation Utilities - Solaris-ELF (4.0)
  • GCC 2.8.1, Solaris 7:
        @(#)SunOS 5.7 Generic October 1998
        as: WorkShop Compilers 5.0 Alpha 03/27/98 Build
        GCC: (GNU) 2.8.1
        ld: Software Generation Utilities - Solaris/ELF (3.0)
  • GCC 3.4.3 (object). The GCC compiler was built on/for "sol210" [sic]
        GCC: (GNU) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)

There is a non-linear relationship between versions of the compiler ("Sun C") and the development suite ("Sun Studio", "Forte Developer", "Sun WorkShop", etc.). You can figure it out using the Sun Studio Support Matrix.

Can I figure out what compiler flags were used?

It depends on what compiler was used, and whether you have objects.

  • For Sun Studio 11, use "dwarfdump -i", and look for "DW_AT_SUN_command_line" in the output.
  • For Sun Studio 10 and earlier, use "dumpstabs -s", and look for "CMDLINE" in the output.
  • For GCC, you can only figure out what version of GCC was used (see above).

"dwarfdump" and "dumpstabs" are available with the Sun Studio suite.

Some examples:

$ dwarfdump -i gobble.sparcv9
                DW_AT_name                  gobble.c
                DW_AT_language              DW_LANG_C99
                DW_AT_comp_dir              /home/tc35445/tools/c/gobble
                DW_AT_SUN_command_line       /usr/dist/share/sunstudio_sparc,
v11.0/SUNWspro/prod/bin/cc -xO3 -xarch=v9 -c  gobble.c
                DW_AT_SUN_compile_options   Xa;O;R=Sun C 5.8 2005/10/13;

$ dumpstabs -s test
   5:  .stabs "Xa ; g ; V=3.1 ; R=Sun WorkShop 6 2000/04/07 C 5.1
 ; G=$XAB0qnBB6
   6:  .stabs "/net/vic/export/home/timc/tools/c/interposition;
 /opt/Forte6/SUNWspro/bin/../WS6/bin/cc -g -c  test.c

$ dumpstabs -s getwd
   1:  .stabs "Xa ; V=3.1 ; R=WorkShop Compilers 4.2 30 Oct 1996 C 4.2",
   2:  .stabs "/home/staff/timc/tools/c/getwd;
 /opt/SUNWspro/bin/../SC4.2/bin/cc -c  getwd.c -W0,-xp",N_CMDLINE,0x0,0x0,0x0

Can I figure out where the binary will look for shared objects?

If an RPATH has been specified, dumpstabs can find it:
$ dumpstabs mysqladmin | grep RPATH
13: Tag = 15 (RPATH)    /usr/sfw/lib/mysql:/usr/sfw/lib

Can I figure out what shared objects the binary needs?

Yes, just use "ldd":

$ ldd mysqladmin
        ...


Saturday Dec 16, 2006

Why CONNECT / AS SYSDBA did not work - Trap For Young Players


I have been doing some installing of Oracle databases onto a system where a database and oracle account are already installed. I wanted to effect changes that could be easily cleaned up. So, I created my own oracle account/uid and group/gid.

The Problem

The problem I encountered was that when I tried to connect to the database to do DBA-type things, my experience went like this:

tora ) sqlplus / as sysdba

ORA-01031: insufficient privileges

The Solution

This mode of connection, where a bare slash (/) is used instead of oracleaccount/oraclepassword, is where you ask Oracle to authorize you via your OS credentials. What I was missing is that the Unix group specified during creation of the database (or installation of the software, I forget which) is separate from the Unix group used to automatically grant a user the SYSDBA and SYSOPER privileges - which would have allowed the "sqlplus / as sysdba" to succeed.

So the Unix group that does grant these privileges (I think they are Oracle roles, just as Solaris has roles) is "dba". I am not sure whether this can be changed, so I took the shortest route and added my "tora" user to the "dba" group, and all was well.
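On Solaris, that fix amounts to something like the following (my paraphrase, run as root; note that "usermod -G" replaces the whole secondary-group list, so include any groups the user already belongs to):

```shell
# Check existing secondary group memberships first...
groups tora
# ...then add user "tora" to the "dba" group, so OS authentication
# grants the SYSDBA/SYSOPER privileges.
usermod -G dba tora
```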

File this one under just-hoping-anyone-in-the-same-pain-finds-this-via-Google.

I also would like to plug Installing and Configuring Oracle Database 10g on the Solaris Platform by Roger Schrag as a useful Cheat Sheet.

Friday Oct 06, 2006

Lsync - Keeping Your Sanity by Keeping Your Home Directory Synchronised

I have battled for a number of years with how to keep a laptop and a home directory reasonably synchronised. About a year ago I decided to solve this problem once and for all.

Below I describe Lsync, a script I wrote to do the work for me. Here is the Lsync script for download.


WARNING: Command-Line Content

Lsync is a tool to keep your home directory synchronised across multiple systems. While there are a few solutions out there for doing this automagically, such solutions are either still in a research phase, have a price tag, or are designed on a different scale.

This solution is intended to be cheap, and work with most variants of Unix. It is also intended to be used by the user who is doing updates on one or both copies of their own home directory. It should be used regularly, to minimise the work required at each operation, and to reduce the risk of data loss, or insanity.

Lsync is implemented on top of the excellent OSS program rsync.


As configured, Lsync depends on bash, ssh, and a recent version of rsync (exactly how recent I do not know). It can probably be modified to use the (far less secure) rsh/rexec protocol, but I am not going to do this work. So far, I have found it to work on Solaris 9, 10 and Express, as well as Mac OS X 10.3.9. I would fully expect it to work on any variant of Linux.

The user will need to edit the Lsync script to set the values for their "laptop" and "master" hosts.

What It Does

Lsync offers 5 basic functions:

  • Check on what needs to be synced from your "laptop" to your "master"
  • Check on vice-versa
  • Sync from your "laptop" to your "master" (make it so...)
  • Sync vice-versa
  • Edit your "rsync includes" file

The sync/check operations by default are done between the user's home directories on the "laptop" and "master" hosts, but they can instead perform the operation just on a sub-directory of the home directory.

Any operation that modifies data requires further input from the user - either editing of the "rsync includes" file, or entering a "y" to confirm that we really want to sync.

An advantage of using ssh as the remote shell protocol is that the user can leverage ~/.ssh/config to specify a different username to be used on a remote host. For example, I have the username "timc" on my laptop, but have to use the corporate standard "tc35445" on any SWAN host (yuck). The magic for this to happen without work on my behalf is to put this in ~/.ssh/config on my laptop:

	Host	*.sfbay
	  User     tc35445


Some terminology used below:

  • laptop host - A host you designate. This can have a dynamic name, established at run-time as the host you are running Lsync on, or it can be static. If the laptop host is static, you can run Lsync on either the master or the laptop host.
  • master host - A host you designate. This is static.
  • sender - The host deemed to have the current authoritative version of your home directory.
  • receiver - The host being synchronised to the sender's version of your home directory.
  • "rsync includes" file - A file containing rules for including & excluding files and/or directories to be synchronised. The format is documented in the rsync(1) manual page. The default location for this is ~/.rsync-include.


Unless you have seen rsync before, you will be surprised how fast it can do its work. I am currently syncing about 50,000 files, but if the metadata for these is cached on both systems, a full home directory check takes around a minute. If it is the first check of the day, it might take 5 minutes.

In either case, it is fast enough to use at least daily, which means you can easily have your full, up-to-date "working set" with you on your laptop when you are out of the office, but go back to a SunRay when you get in.

Alternatively, if you know you have been working in a sub-directory, just specify that subdirectory, and Lsync limits its update to that directory tree. For example:

	$ Lsync L ~/tools/sh/Lsync

Caveats, Warnings, Disclaimers and Other Fine Print

It is important to understand that I am using the "--delete" option to rsync, which means that rsync will delete files on the receiver that do not exist on the sender. This means if you delete something on one host, it won't come back to haunt you, but it also means you must get sync operations in the correct order with your work activity. For easy identification, rsync tells you whenever it would have deleted or is deleting something by prepending it with "deleting".

Also, all sync operations will act on files and directories at the same level under your home directory on both systems. In other words, you can not use Lsync to copy a directory to a different location on the receiver, leaving the receiver's version of the directory in place.

This is a deliberate decision - Lsync is intended to synchronise, not to replicate.


When I want to sync my home directory, if I am not sure what I have modified in which directory, I first check:

d-mpk12-65-186 ) Lsync l
-- Lsync: listing what to sync under ~
     from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
building file list ... done
deleting tools/sh/Lsync/tmpfile

wrote 1016147 bytes  read 28 bytes  16796.28 bytes/sec
total size is 4882980897  speedup is 4805.26
-- Use "Lsync L" to perform sync
d-mpk12-65-186 ) 

Then, noticing that the only thing to delete is something I have deleted on the sender and genuinely do not want any more, I go ahead and "make it so":

d-mpk12-65-186 ) Lsync L
-- Lsync: SYNCING everything under ~
     from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
Enter 'y' to confirm: y
building file list ... 
49676 files to consider
deleting tools/sh/Lsync/tmpfile
       47248 100%  780.49kB/s    0:00:00  (1, 70.8% of 49676)
       12288 100%  255.32kB/s    0:00:00  (2, 99.4% of 49676)

wrote 1075767 bytes  read 60 bytes  16942.16 bytes/sec
total size is 4882980897  speedup is 4538.82

If I want to double-check, I can now see what might be out of date with my "master":

d-mpk12-65-186 ) Lsync m
-- Lsync: listing what to sync under ~
     from paedata.sfbay to d-mpk12-65-186.SFBay.Sun.COM
receiving file list ... done

wrote 337 bytes  read 1049835 bytes  21653.03 bytes/sec
total size is 4882980897  speedup is 4649.70
-- Use "Lsync M" to perform sync

If I want to just sync a particular directory tree, I can specify this as an argument after the operation letter. This will be a lot faster than examining my whole home directory tree on both hosts. It also ignores my "rsync includes" file.

Any absolute or relative path can be specified, but it must resolve to something below my $HOME:

d-mpk12-65-186 ) pwd
d-mpk12-65-186 ) Lsync l .
-- Lsync: listing what to sync under ~/tools/sh/Lsync
     from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
building file list ... done

wrote 165 bytes  read 28 bytes  55.14 bytes/sec
total size is 32083  speedup is 166.23
-- Use "Lsync L" to perform sync
d-mpk12-65-186 ) Lsync l ~/tools/
-- Lsync: listing what to sync under ~/tools
     from d-mpk12-65-186.SFBay.Sun.COM to paedata.sfbay
building file list ... done

wrote 54756 bytes  read 28 bytes  9960.73 bytes/sec
total size is 84506552  speedup is 1542.54
-- Use "Lsync L" to perform sync
d-mpk12-65-186 ) Lsync l /var/tmp
Lsync: can not sync "/var/tmp", as it is not under $HOME

Wednesday Sep 13, 2006

How the T2000 handles 20,000 Database Connections

I have recently changed careers at Sun, moving from a pre/post-sales technical role to an engineer role. I now work in Sun's PAE (Performance Availability & Architecture Engineering) group.

You could say I am definitely enjoying the change, getting to play with real geeky stuff :)

Both jobs offered the opportunity to play with new hardware, but the new job should see me get much more play time with pre-release systems, which I find very interesting.

Anyway, I had the chance to do some testing on one of our new systems that many customers have also had a great chance to test on - the T2000.

The workload I am using is designed to replicate a database query load that is common for some parts of modern on-line retail operations - almost exclusively relatively simple queries, running pretty much fully cached in the RAM of the database server. The queries come from a very large population of clients (5 to 20 thousand).

Being a highly analytical, fact-oriented engineer, I remained just a little sceptical - up until my first opportunity to experience the T2000 first-hand - that it could deliver on the promise of what I had heard. It has been covered elsewhere that the T2000 suits some workloads better than others; I was keen to get a quantitative understanding of this to support the qualitative things I had heard.

The base system I had been doing tests on was an 8-CPU (16 core) E2900. I expected this would provide a pretty fair comparison to a system that has 8 cores in its single CPU. In fact I knew of some preliminary numbers for the T2000 which made me think the E2900 would come out slightly ahead.

I need to update my thinking. The advantage of four threads per core is proven by reality, with the 8-core T2000 showing it certainly had greater throughput for my workload than the 16-core E2900.

As you can see in the graph, the T2000's CPU consumption is well below the E2900's (lower is better). Also, as I scale up the number of connections (keeping the overall throughput the same), the T2000 does not see as much of a penalty for the increase.

Chart: CPU Utilization - E2900 v T2000


Wednesday Nov 16, 2005

Demonstrating ZFS Self-Healing

I'm the kind of guy who likes to tinker. To see under the bonnet. I used to have a go at "fixing" TVs by taking the back off and seeing what could be adjusted (which is kind-of anathema to one of the philosophies of ZFS).

So, when I have been presenting and demonstrating ZFS to customers, the thing I really like to show is what ZFS does when I inject "silent data corruption" into one device in a mirrored storage pool.

This is cool, because ZFS does a couple of things that are not done by any comparable product:

  • It detects the corruption by using checksums on all data and metadata.
  • It automatically repairs the damage, using data from the other mirror, assuming checksum(s) on that mirror are OK.

This all happens before the data is passed off to the process that asked for it. This is how it looks in slideware:

Self-Healing ZFS

The key to demonstrating this live is how to inject corruption, without having to apply a magnet or lightning bolt to my disk. Here is my version of such a demonstration:

  1. Create a mirrored storage pool, and filesystem

    cleek[bash]# zpool create demo mirror /export/zfs/zd0 /export/zfs/zd1
    cleek[bash]# zfs create demo/ccs

  2. Load up some data into that filesystem, see how we are doing

    cleek[bash]# cp -pr /usr/ccs/bin /demo/ccs
    cleek[bash]# zfs list
    NAME                   USED  AVAIL  REFER  MOUNTPOINT
    demo                  2.57M   231M  9.00K  /demo
    demo/ccs              2.51M   231M  2.51M  /demo/ccs

  3. Get a personal checksum of all the data in the files - the "find/cat" will output the contents of all files, then I pipe all that data into "cksum"

    cleek[bash]# cd /demo/ccs
    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928      2416605

  4. Now for the fun part. I will inject some corruption by writing some zeroes onto the start of one of the mirrors.

    cleek[bash]# dd bs=1024k count=32 conv=notrunc if=/dev/zero of=/export/zfs/zd0
    32+0 records in
    32+0 records out

  5. Now if I re-read the data now, ZFS will not find any problems, and I can verify this at any time using "zpool status"

    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928      2416605
    cleek[bash]# zpool status demo
      pool: demo
     state: ONLINE
     scrub: none requested
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0     0
                /export/zfs/zd1  ONLINE       0     0     0

    The reason for this is that ZFS still has all the data for this filesystem cached, so it does not need to read anything from the storage pool's devices.

  6. To force ZFS' cached data to be flushed, I export and re-import my storage pool

    cleek[bash]# cd /
    cleek[bash]# zpool export -f demo
    cleek[bash]# zpool import -d /export/zfs demo
    cleek[bash]# cd -

  7. At this point, I should find that ZFS has found some corrupt metadata

    cleek[bash]# zpool status demo
      pool: demo
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool online' or replace the device with 'zpool replace'.
     scrub: none requested
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0     7
                /export/zfs/zd1  ONLINE       0     0     0

  8. Cool - Solaris Fault Manager at work. I'll bring that mirror back online, so ZFS will try using it for what I plan to do next...

    cleek[bash]# zpool online demo /export/zfs/zd0
    Bringing device /export/zfs/zd0 online

  9. Now, I can repeat my read of data to generate my checksum, and check what happens

    cleek[bash]# find . -type f -exec cat {} + | cksum
    1891695928      2416605         (note that my checksum is the same)
    cleek[bash]# zpool status
            NAME                 STATE     READ WRITE CKSUM
            demo                 ONLINE       0     0     0
              mirror             ONLINE       0     0     0
                /export/zfs/zd0  ONLINE       0     0    63
                /export/zfs/zd1  ONLINE       0     0     0
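As an aside, one way to confirm that a re-read like the one in step 5 is being served from cache rather than from the mirror devices is to watch pool I/O in a second window while the "find/cat" runs (a quick sketch, not part of the original demo; the one-second interval is arbitrary):

```shell
# Sample I/O statistics for the "demo" pool once per second.
# If the read columns stay at (or near) zero while the find/cat
# pipeline runs, the data is coming from ZFS' in-memory cache,
# not from /export/zfs/zd0 or /export/zfs/zd1.
zpool iostat demo 1
```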

Of course, if I wanted to know the instant things happened, I could also use DTrace (in another window):

cleek[bash]# dtrace -n :zfs:zio_checksum_error:entry
dtrace: description ':zfs:zio_checksum_error:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  40650         zio_checksum_error:entry
  0  40650         zio_checksum_error:entry
  0  40650         zio_checksum_error:entry
  0  40650         zio_checksum_error:entry


Tuesday Nov 08, 2005

FAQ du Jour - Daylight Savings in Australia, Autumn 2006

Well, I have been asked this three times now, so it sounds worthy of a blog entry...

The beginning and end of Daylight Savings in Australia is controlled by state government regulation, which means it can be changed at relatively short notice (except in Queensland, where it cannot be changed, as the extra daylight will fade curtains and confuse dairy cows).

Has or will Sun release Solaris patches to take account of changes to daylight savings due to happen for the Commonwealth Games in 2006?

There is an RFE (Request For Enhancement) for this, but we have not yet developed & released patches.

If you are a customer and want to be notified when the patches are released, please log a call to this effect. You can reference BugID 6282969 to identify the issue.

I got a bit wistful when researching this, because back when I was a Sun customer in 1996 (I think), I logged a call regarding the same type of issue - South Australia was extending Daylight Savings for the WOMAdelaide music festival.
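If you want to check which transition dates your system currently believes in, zdump(1) will print them for a given zone (a quick sketch; the zone name is just an example - pick whichever state applies to you):

```shell
# Print every DST transition in the installed zoneinfo data for
# Victoria, then pick out the 2006 entries to see whether the
# Commonwealth Games extension has been applied yet.
zdump -v Australia/Melbourne | grep ' 2006 '
```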

Customising man(1) for more readable manual pages

This entry can be considered a test run, or an entry that will stop my next entry from being the first and only one when it is released. Both things are true, but I hope this entry will be of interest anyway.

I am still a habitual user of the "man" command in Solaris (and other Unixes). It has pretty good response time, and the format is familiar. I did notice a number of years ago that the format could be better. All sorts of terminals are able to display bold and underlined text, and printed man pages show some elements as bold, but man(1) on Solaris only shows underlined elements (see below).

Boring man(1)

My interest was piqued when I re-acquainted myself with Linux, and noticed that man(1) was showing bold elements. I had to investigate.

I did a few things like "truss -f" on the man program, "strings" on the binary, until I discovered that Solaris' man(1) was different in that it was using a "-u0" option to nroff(1) when formatting the text. This flag was undocumented at the time, but I discovered that if I hand-built a man page and used "-u1" instead, I got bold text.
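That sort of spelunking went roughly like this (a re-creation from memory, assuming Solaris paths and truss options):

```shell
# Watch what man(1) execs when formatting a page - nroff and its
# arguments (including the -u0 flag) show up in the exec trace.
#   -f  follow child processes
#   -t execve  trace only the execve system call
#   -a  show the argument strings passed to each exec
truss -f -t execve -a man ls 2>&1 | grep nroff

# Then look for the flag hard-coded inside the binary itself.
strings /usr/bin/man | grep -- '-u'
```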

I was obviously not firing on all cylinders that day, as I chose to customise man(1) by putting a copy of /usr/bin/man (the binary) in my ~/bin directory, then editing the binary, changing any occurrence of "-u0" to "-u1".

When I next upgraded my workstation, I then had to replace the binary with a script that called a release-specific binary, as I was then using multiple releases of Solaris on different systems.

Eventually, the injector on cylinder 3 cleared, and I figured out how to do it with a script that effectively interposed on /usr/bin/man and /usr/bin/nroff. This would also work on many releases of Solaris (currently working fine on S10 and the most recent build of OpenSolaris). At the same time, I figured I could use a custom PAGER program for displaying man pages, and implement this in the same script.

If you feel envious and want this for yourself, download the script from here. Put this in a directory in your PATH that comes before /usr/bin and /bin, then link it to "nroff" in the same directory (e.g. "ln ~/bin/man ~/bin/nroff"). You will then get what I see:

Custom man(1)


Tim Cook's Weblog The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.