Wednesday Mar 05, 2014

Random or Sequential I/O? How to find out with DTrace

One of the perennial performance or sizing questions for a workload is how much I/O it generates. This can be broken down several ways:

  • Network versus disk
  • Read versus write
  • Random versus sequential (for the disk component)

Many of the metrics have always been easy to quantify. There are standard operating system tools to measure disk I/O, network packets and bytes, etc. The question of how much random versus sequential I/O is much harder to answer, but it can be an important one if your storage is a potential limiter of performance and that storage has a significant amount of "conventional" disk - i.e. spindles of rotating rust.

Sequential I/O on conventional disks can generally be served at a higher throughput because conventional disks can avoid almost all rotational delay and seek time penalties. Random I/O will always incur these penalties, at a varying rate.

So if you need to decompose your workload enough to understand how much Random versus Sequential I/O there is, what do you do? You may be able to discuss it with the application developer to get their take on how the application behaves, what it stores, and how and when it needs to fetch data into memory or write or update it on disk. This information (or access to the developer) is not always available though.

What if I could get inside the application and measure the I/O as it happens? That is a task for DTrace. I have developed a script that tallies up all I/O for all filesystems and can tell you how much of it is sequential (a read or write at one location in a file is followed by a read or write at the next location in that file) or random.
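
The core of the classification can be sketched as a dtrace(1M) one-liner. This is only a sketch of the idea, not the script itself: it assumes the Solaris fsinfo provider, classifies reads only, and assumes arg1 carries the byte count of each read (a file's first read, at offset 0, counts as sequential):

# dtrace -qn '
    fsinfo:::read
    {
        /* sequential if this read starts where the previous one on this file ended */
        this->seq = (args[0]->fi_offset == expect[args[0]->fi_pathname]);
        @io[args[0]->fi_mount, this->seq ? "sequential" : "random"] = count();
        expect[args[0]->fi_pathname] = args[0]->fi_offset + arg1;
    }
    tick-10s
    {
        printa("  %-24s %-12s %@10d\n", @io);
        exit(0);
    }'

The full script extends the same idea to writes and page operations, and reports per-second rates and KB/s rather than raw counts.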

Here is some example output from the script, where I first ran a Sequential filebench test:

Sample interval: 120 seconds
End time: 2011 Mar 23 10:42:20

FYI: Page I/O on VREG vnodes after read/write:
  fop_read      fop_getpage      2

/u                         rd/s    wr/s   rdKB/s   wrKB/s   acc/look:       0
             sequential 2616.20 0.62501  2676408 0.718235    readdir:       0
                 random 2.65836 0.20834 2713.623 0.641998  geta/seta:       0
                   page 0.00833       0 0.033334        0      cr/rm:       0
                  TOTAL 2618.86 0.83334  2679122 1.360233      other: 2.86669

then I ran a random filebench test:

Sample interval: 60 seconds
End time: 2011 Mar 22 12:09:28

FYI: Page I/O on VREG vnodes after read/write:
  fop_write      fop_putpage      1
  fop_read       fop_getpage      5
  fop_write      fop_getpage     18

/u                         rd/s    wr/s   rdKB/s   wrKB/s   acc/look: 2.28871
             sequential 157.036 161.446 1255.552 1118.578    readdir:       0
                 random 20113.5 17119.0 160860.2 136919.0  geta/seta:       0
                   page       0 4.27672        0 17.10687      cr/rm: 0.05012
                  TOTAL 20270.6 17284.7 162115.7 138054.7      other: 5.26237

As you can see, the script outputs a table breaking down reads and writes, by operations per second and by KB/s, into "sequential", "random" and "page" categories. There are also some totals, plus some statistics on a few other VFS operations.

What does "page" mean? Well, this script uses Solaris' virtual filesystem interface to measure filesystem I/O. Through this interface we can see conventional read and write operations; however, there are also "page" operations which do not necessarily map to a read or write issued by the application. Because of this uncertainty I decided to list the I/O from these page operations separately. They will generally be small for an application that uses conventional I/O calls, but may be large if an application is using the mmap() interface, for example.

For the purposes of this blog entry I have simplified the output of the script. It normally outputs a table for every filesystem that has I/O during its run. The script could easily be modified to select a desired filesystem. It could also be modified to select I/O related to a specific user, process, etc.
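
For example, restricting the tally to one mount point and one process needs only a predicate on the same fsinfo probes; the mount point and process name below are just placeholders:

# dtrace -n 'fsinfo:::read, fsinfo:::write
    /args[0]->fi_mount == "/u" && execname == "mysqld"/
    { @[probename] = count(); }'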

References

Monday Feb 24, 2014

nicstat update - version 1.95

Another minor nicstat release is now available.

Changes for Version 1.95, January 2014

Common

  • Added "-U" option, to display separate read and write utilization.
  • Simplified display code regarding "-M" option.

Solaris

  • Fixed fetch64() to check type of kstats
  • Fixed memory leak in update_nicdata_list()

Availability

nicstat source and binaries are available from sourceforge.

History

For more history on nicstat and examples of use, see my earlier entry.

Monday Nov 12, 2012

How to Get Vim to do Filename Completion Even When You are Root

From the Obscure Unix Admin Tip of the Day section...

If you occasionally edit files as root (I never do, I always use pfexec, wink wink), then you may have noticed that vim (Vi Improved), which normally does filename completion via the <Tab> key, now gives you something like:

:e /etc/mo^I

when you try to open up /etc/motd with a little less typing.

So, there are at least five solutions to this:

  • Use <Ctrl>-E instead of <Tab>
  • Use the "-N" flag when you start Vim
  • :set wildchar=<Tab> (Enter those 5 characters, not an actual Tab)
  • Use pfexec vim filename
  • In Solaris 11.1 or later, use pfedit filename

The reason for this? It seems that when you are root, Vim sets its "compatible" flag, which makes it behave more like its ancestor, vi. In turn this makes Vim set 'wildchar' to <Ctrl>-E.

For more info, read the section you get when you enter :help cmdline-completion

Thanks to Darren for the tip on pfexec.

Monday Oct 22, 2012

nicstat update - version 1.92

Another minor nicstat release is now available.

Changes for Version 1.92, October 2012

Common

  • Added "-M" option to change throughput statistics to Mbps (Megabits per second). Suggestion from Darren Todd.
  • Fixed bugs with printing extended parseable format (-xp)
  • Fixed man page's description of extended parseable output.

Solaris

  • Fixed memory leak associated with g_getif_list
  • Add 2nd argument to dladm_open() for Solaris 11.1
  • Modify nicstat.sh to handle Solaris 11.1

Linux

  • Modify nicstat.sh to see "x86_64" cputype as "i386". All Linux binaries are built as 32-bit, so we do not need to differentiate these two cpu types.

Availability

nicstat source and binaries are available from sourceforge.

History

For more history on nicstat, see my earlier entry.

Friday Aug 24, 2012

Mercurial says "nothing changed", but it did. Sometimes my software is too clever.

It seems I have found a "bug" in Mercurial. It takes a shortcut when checking for differences in tracked files. If the file's size and modification time are unchanged, it assumes its contents are unchanged:

$ hg init .
$ cp -p .sccs2hg/2005-06-05_00\:00\:00\,nicstat.c nicstat.c
$ ls -ogE nicstat.c
-rw-r--r--   1   14722 2012-08-24 11:22:48.819451726 -0700 nicstat.c
$ hg add nicstat.c
$ hg commit -m "added nicstat.c"
$ cp -p .sccs2hg/2005-07-02_00\:00\:00\,nicstat.c nicstat.c
$ ls -ogE nicstat.c
-rw-r--r--   1   14722 2012-08-24 11:22:48.819451726 -0700 nicstat.c
$ hg diff
$ hg commit
nothing changed
$ touch nicstat.c
$ hg diff
diff -r b49cf59d431d nicstat.c
--- a/nicstat.c	Fri Aug 24 11:21:27 2012 -0700
+++ b/nicstat.c	Fri Aug 24 11:22:50 2012 -0700
@@ -2,7 +2,7 @@
  * nicstat - print network traffic, Kb/s read and written. Solaris 8+.
  *	"netstat -i" only gives a packet count, this program gives Kbytes.
  *
- * 05-Jun-2005, ver 0.81  (check for new versions, http://www.brendangregg.com)
+ * 02-Jul-2005, ver 0.90  (check for new versions, http://www.brendangregg.com)
  *

[...]

Now, before you agree or disagree with me on whether this is a bug, I will also say that I believe it is a feature. Yes, I feel it is an acceptable shortcut because in "real" situations an edit to a file will change the modification time by at least one second (the resolution that hg diff or hg commit is looking for). The benefit of the shortcut is greatly improved performance of operations like "hg diff" and "hg status", particularly where your repository contains a lot of files.

Why did I have no change in modification time? Well, my source file was generated by a script that I have written to convert SCCS change history to Mercurial commits. If my script can generate two revisions of a file within a second, and the files are the same size, then I run afoul of this shortcut.

Solution - I will just change my script to apply the modification time from the SCCS history to the file prior to each commit. A "touch -t" with the timestamp of the SCCS delta takes care of it.
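
A sketch of the fix, using the 2005-07-02 revision from the transcript above (the argument to touch -t is its [[CC]YY]MMDDhhmm[.SS] format):

$ cp -p .sccs2hg/2005-07-02_00\:00\:00\,nicstat.c nicstat.c
$ touch -t 200507020000 nicstat.c    # mtime from the SCCS delta, not "now"
$ hg commit -m "nicstat.c as of 02-Jul-2005"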

Tuesday Jan 03, 2012

Analyzing Interrupt Activity with DTrace

This article is about interrupt analysis using DTrace. It is also available on the Solaris Internals and Performance FAQ Wiki, as part of the DTrace Topics collection.

Interrupt Analysis

Interrupts are events delivered to CPUs, usually by external devices (e.g. FC, SCSI, Ethernet and Infiniband adapters). Interrupts can cause performance and observability problems for applications.

Performance problems are caused when an interrupt "steals" a CPU from an application thread, halting its progress while the interrupt is serviced. This is called pinning - the interrupt pins an application thread if it was delivered to a CPU on which that thread was executing at the time.

This can affect other threads or processes in the application if, for example, the pinned thread was holding one or more synchronization objects (locks, semaphores, etc.).

Observability problems can arise if we are trying to account for work the application is completing versus the CPU it is consuming. During the time an interrupt has an application thread pinned, the CPU it consumes is charged to the application.

Strategy

The SDT provider offers the following probes that indicate when an interrupt is being serviced:

 interrupt-start
 interrupt-complete

The first argument (arg0) to both probes is the address of a struct dev_info (AKA dev_info_t *), which can be used to identify the driver and instance for the interrupt.

Pinning

If the interrupt has indeed pinned a user thread, the following will be true:

 curthread->t_intr != 0;
 curthread->t_intr->t_procp->p_pidp->pid_id != 0

The pid_id field will correspond to the PID of the process that has been pinned. The thread will remain pinned until either the sdt:::interrupt-complete or the fbt::thread_unpin:return probe fires.
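
Before getting to the attached scripts, here is a rough sketch of the technique: it times each interrupt that arrives while a user thread is on the CPU, and sums the pinned time by driver and pinned PID. It assumes struct dev_info and its devi_binding_name member are visible via the kernel's CTF data, and for simplicity it ignores the thread_unpin case:

# dtrace -qn '
    sdt:::interrupt-start
    /curthread->t_intr != 0 && curthread->t_intr->t_procp->p_pidp->pid_id != 0/
    {
        self->devi = (struct dev_info *)arg0;
        self->ppid = curthread->t_intr->t_procp->p_pidp->pid_id;
        self->ts = timestamp;
    }
    sdt:::interrupt-complete
    /self->ts/
    {
        @pinned[stringof(self->devi->devi_binding_name), self->ppid] =
            sum(timestamp - self->ts);
        self->devi = 0;
        self->ppid = 0;
        self->ts = 0;
    }
    tick-10s
    {
        printf("%-16s %8s %16s\n", "DRIVER", "PID", "PINNED(ns)");
        printa("%-16s %8d %@16d\n", @pinned);
        exit(0);
    }'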

DTrace Scripts

Attached are some scripts that can be used to assess the effect of pinning. These have been tested with Solaris 10 and Solaris 11.

Probe effect will vary. De-referencing four pointers and then hashing against a character-string device name each time an interrupt fires, as some of the scripts do, can be expensive. The last two scripts are designed to have a lower probe effect if your application or system is sensitive to this.

The scripts and their outputs are:

  • pin_by_drivers.d - how much each driver is pinning processes; does not identify the PID(s) affected.
  • pids_by_drivers.d - how much each driver is pinning each process.
  • pid_cpu_pin.d - CPU consumption for a process, including pinning per driver, and time waiting on run queues.
  • intr_flow.d - identifies the interrupt routine name for a specified driver.

The following scripts are designed to have a lower probe effect:

  • pid_pin_devi.d - pinning on a specific process; shows drivers as raw "struct dev_info *" values.
  • pid_pin_any.d - lowest probe effect; shows pinning on a specific process without identifying the driver(s) responsible.

Resolving Pinning Issues

The primary technique used to improve the performance of an application experiencing pinning is to "fence" the interrupts from the application. This involves using either processor binding or processor sets (sets are usually preferable) to either dedicate CPUs to the application that are known not to have the high-impact interrupts targeted at them, or to dedicate CPUs to the driver(s) delivering those interrupts.
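
As a rough illustration of the processor-set form of fencing (the CPU IDs, set ID and PID below are placeholders; see psrset(1M) and psradm(1M) before trying this on a real system):

# psrset -c 4 5 6 7     # create a processor set from CPUs 4-7; prints the new set id
# psradm -i 4 5 6 7     # stop directing interrupts at those CPUs
# psrset -b 1 1234      # bind the application, here PID 1234, to processor set 1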

This is not the optimal solution for all situations. Testing is recommended.

Another technique is to investigate whether the interrupt handling for the driver(s) in question can be modified. Some drivers allow for more or less work to be performed by worker threads, reducing the time during which an interrupt will pin a user thread. Other drivers can direct interrupts at more than a single CPU, usually depending on the interface on which the I/O event has occurred. Some network drivers can wait for more or fewer incoming packets before sending an interrupt.

Most importantly, only attempt to resolve these issues yourself if you have a good understanding of the implications, preferably backed up by testing. An alternative is to open a service call with Oracle asking for assistance to resolve a suspected pinning issue. You can reference this article and include data obtained by using the DTrace scripts.

Exercise For The Reader

If you have identified that your multi-threaded or multi-process application is being pinned, but the stolen CPU time does not seem to account for the drop in performance, the next step in DTrace would be to identify whether any critical kernel or user locks are being held during any of the pinning events. This would require marrying information gained about how long application threads are pinned with information gained from the lockstat and plockstat providers.

References

Thursday Jul 28, 2011

nicstat update - version 1.90

Yes! A new version is now available with some long-awaited features. Many thanks to those who suggested improvements and helped with testing.

Changes for Version 1.90, April 2011

Common

  • nicstat.sh script, to provide for automated multi-platform deployment. See the Makefiles for details.
  • Added "-x" flag, to display extended statistics for each interface.
  • Added "-t" and "-u" flags, to include TCP and UDP (respectively) statistics. These come from tcp:0:tcpstat and udp:0:udpstat on Solaris, or from /proc/net/snmp and /proc/net/netstat on Linux.
  • Added "-a" flag, which equates to "-tux".
  • Added "-l" flag, which lists interfaces and their configuration.
  • Added "-v" flag, which displays nicstat version.

Solaris

  • Added use of libdladm.so:dladm_walk_datalink_id() to get list of interfaces. This is better than SIOCGLIFCONF, as it includes interfaces given exclusively to a zone.
    NOTE: this library/routine can be (and by default is) linked in to nicstat in "lazy" mode, meaning that a Solaris 11 binary built with knowledge of the routine will also run on Solaris 10 without failing when the routine or library is not found - in this case nicstat will fall back to the SIOCGLIFCONF method.
  • Added search of kstat "link_state" statistics as a third method for finding active network interfaces. See the man page for details.

Linux

  • Added support for the SIOCETHTOOL ioctl, so that nicstat can look up interface speed/duplex (i.e. the "-S" flag is no longer necessarily needed).
  • Removed need for LLONG_MAX, improving Linux portability.

Availability

nicstat source and binaries are available from sourceforge.

History

For more history on nicstat, see my earlier entry.

Wednesday Dec 08, 2010

Which package delivers the file/directory I can not find?

Back in the day (ahem, Solaris 10 and earlier...) you could use a command like the following to figure out which package was responsible for, e.g., "/usr/include/math.h" - or any other "include/math.h", just in case it is somewhere less obvious:

mashie$ fgrep include/math.h /var/sadm/install/contents
/usr/include/math.h f none 0644 root bin 10514 7356 1249334889 SUNWlibm

In other words, /var/sadm/install/contents was a flat-file index of every file, directory, link, etc. installed on the system. The last field is the package name - "SUNWlibm" in this case.

Things are a little more tricky with the new packaging system in Solaris 11 Express. Here is the equivalent "pkg contents" command:

mashie$ pkg contents -o pkg.name,path -a path='*/include/math.h'
PKG.NAME                        PATH
system/library/math/header-math usr/include/math.h

Thursday Sep 17, 2009

querystat - DTrace script to monitor your queries, query cache and server thread pre-emption

I was recently helping some colleagues check what was happening with their MySQL queries, and wrote a DTrace script to do it. Time to share that script.

First of all, a look at some output from the script:

mashie[bash]# ./querystat.d -p `pgrep mysqld`
Tracing started at 2009 Sep 17 16:28:35
2009 Sep 17 16:28:38   throughput 3 queries/sec
2009 Sep 17 16:28:41   throughput 4 queries/sec
2009 Sep 17 16:28:44   throughput 528 queries/sec
2009 Sep 17 16:28:47   throughput 1603 queries/sec
2009 Sep 17 16:28:50   throughput 1676 queries/sec
^C
Tracing ended   at 2009 Sep 17 16:28:51
Average latency, all queries: 107 us
Latency distribution, all queries (us): 
           value  ------------- Distribution ------------- count    
              16 |                                         0        
              32 |@@                                       170      
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        3728     
             128 |@@@@@                                    533      
             256 |                                         26       
             512 |                                         18       
            1024 |                                         2        
            2048 |                                         1        
            4096 |                                         0        
            8192 |                                         1        
           16384 |                                         1        
           32768 |                                         0        
Query cache statistics:
    count             hit: 6
    count            miss: 4474
    avg latency      miss: 107 (us)
    avg latency       hit: 407 (us)
Latency distribution, for query cache hit (us): 
           value  ------------- Distribution ------------- count    
              64 |                                         0        
             128 |@@@@@@@@@@@@@                            2        
             256 |@@@@@@@                                  1        
             512 |@@@@@@@@@@@@@@@@@@@@                     3        
            1024 |                                         0        
Latency distribution, for query cache miss (us): 
           value  ------------- Distribution ------------- count    
              16 |                                         0        
              32 |@@                                       170      
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        3728     
             128 |@@@@@                                    531      
             256 |                                         25       
             512 |                                         15       
            1024 |                                         2        
            2048 |                                         1        
            4096 |                                         0        
            8192 |                                         1        
           16384 |                                         1        
           32768 |                                         0        
Average latency when query WAS NOT pre-empted: 73 us
Average latency when query     WAS pre-empted: 127 us
Pre-emptors:
[...]
   mysql                                     6
   Xorg                                     18
   sched                                    25
   firefox-bin                              44
   sysbench                               3095

You can see that while the script is running (prior to pressing <Ctrl>-C), we get a throughput count every 3 seconds.

Then we get some totals, some averages, and even some distribution histograms, covering all queries, then with breakdowns on whether we used the query cache, and whether the thread executing the query was pre-empted.

This may be useful for determining things like:

  • Do I have some queries in my workload that consume a lot more CPU than others?
  • Is the query cache helping or hurting?
  • Are my database server threads being pre-empted (kicked off the CPU) by (an)other process(es)?

Things have become easier since I first tried this, when I had to use the PID provider to trace functions in the database server.

If you want to try my DTrace script, get it from here. NOTE: You will need a version of MySQL with DTrace probes for it to work.
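
For a flavour of how it works, here is a heavily cut-down sketch (not the script itself). It assumes a mysqld built with the MySQL DTrace probes and uses only the query-start and query-done probes; the full script adds the query-cache and pre-emption breakdowns shown above:

# dtrace -qn '
    mysql$target:::query-start { self->ts = timestamp; }
    mysql$target:::query-done
    /self->ts/
    {
        @n = count();
        @lat["query latency (us)"] = quantize((timestamp - self->ts) / 1000);
        self->ts = 0;
    }
    tick-3s
    {
        printf("%Y ", walltimestamp);
        printa("throughput %@d queries/3s\n", @n);
        trunc(@n);
    }' -p `pgrep mysqld`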

Friday Sep 04, 2009

nicstat - the Solaris and Linux Network Monitoring Tool You Did Not Know You Needed

Update - Version 1.95, January 2014

Added "-U" option, to display separate read and write utilization. Simplified display code regarding "-M" option. For Solaris, fixed fetch64() to check the type of kstats and fixed a memory leak in update_nicdata_list(). Full details are at the entry for version 1.95.

Update - Version 1.92, October 2012

Added "-M" option to display throughput in Mbps (Megabits per second). Fixed some bugs. Full details are at the entry for version 1.92.

Update - Version 1.90, July 2011

Many new features available, including extended NIC, TCP and UDP statistics. Full details are at the entry for version 1.90.

Update - February 2010

Nicstat now can produce parseable output if you add a "-p" flag. This is compatible with System Data Recorder (SDR). Links below are for the new version - 1.22.

Update - October 2009

Just a little one - nicstat now works on shared-ip Solaris zones.

Update - September 2009

OK, this is heading toward overkill...

The more I publish updates, the more I get requests for enhancement of nicstat. I have also decided to complete a few things that needed doing.

The improvements for this month are:

  • Added support for a "fd" or "hd" (in reality anything starting with an upper or lower-case F or H) suffix to the speed settings supplied via the "-S" option. This advises nicstat that the interface is full-duplex or half-duplex. The Linux version now calculates %Util the same way as the Solaris version.
  • Added a script, enicstat, which uses ethtool to get speeds and duplex modes for all interfaces, then calls nicstat with an appropriate -S value.
  • Made the Linux version more efficient.
  • Combined the Solaris and Linux source into one nicstat.c. This is a little ugly due to #ifdef's, but that's the price you pay.
  • Wrote a man page.
  • Wrote better Makefiles for both platforms.
  • Wrote a short README.
  • Licensed nicstat under the Artistic License 2.0.

All source and binaries will from now on be distributed in a tarball. This blog entry will remain the home of nicstat for the time being.

Lastly, I have heard the requests for easier availability in OpenSolaris. Stay tuned.

Update - August 2009

That's more like it - we should get plenty of coverage now :)

A colleague pointed out to me that nicstat's method of calculating utilization for a full-duplex interface is not correct.

Now nicstat will look for the kstat "link_duplex" value, and if it is 2 (which means full-duplex), it will use the greater of rbytes or wbytes to calculate utilization.

No change to the Linux version. Use the links in my previous post for downloading.

Update - July 2009

I should probably do this at least once a year, as nicstat needs more publicity...

A number of people have commented to me that nicstat always reports "0.00" for %Util on Linux. The reason for this is that there is no simple way an unprivileged user can get the speed of an interface on Linux (I am quite happy for someone to prove me wrong on that, however).

Recently I got an offer of a patch from David Stone, to add an option to nicstat that tells it what the speed of an interface is. Pretty reasonable idea, so I have added it to the Linux version. You will see this new "-S" option explained if you use nicstat's "-h" (help) option.

I have made another change which makes nicstat more portable, hence easier to build on Linux.

History

A few years ago, a bloke I know by the name of Brendan Gregg wrote a Solaris kstat-based utility called nicstat. In 2006 I decided I needed to use this utility to capture network statistics in testing I do. Then I got a request from a colleague in PAE to do something about nicstat not being aware of "e1000g" interfaces.

I have spent a bit of time adding to nicstat since then, so I thought I would make the improved version available.

Why Should I Still Be Interested?

nicstat is to network interfaces as "iostat" is to disks, or "prstat" is to processes. It is designed as a much better version of "netstat -i". Its differences include:

  • Reports bytes in & out as well as packets.
  • Normalizes these values to per-second rates.
  • Reports on all interfaces (while iterating)
  • Reports Utilization (rough calculation as of now)
  • Reports Saturation (also rough)
  • Prefixes statistics with the current time

How about an example?

eac-t2000-3[bash]# nicstat 5
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:17      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:17  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:05:17  e1000g1   225.7   176.2   905.0   922.5   255.4   195.6  0.33   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:22      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:22  e1000g0    0.06    0.15    1.00    0.80   64.00   186.0  0.00   0.00
17:05:22  e1000g1    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
eac-t2000-3[bash]# nicstat -i e1000g0 5 4
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:08:49  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:08:54  e1000g0    0.06    0.04    1.00    0.20   64.00   186.0  0.00   0.00
17:08:59  e1000g0   239.2    2.33   174.4   33.60  1404.4   71.11  1.98   0.00
17:09:04  e1000g0    0.01    0.04    0.20    0.20   64.00   186.0  0.00   0.00

For more examples, see the man page.

References & Resources

Friday Aug 14, 2009

nicstat - Update for Solaris only

Update - August 2009

That's more like it - we should get plenty of coverage now :)

A colleague pointed out to me that nicstat's method of calculating utilization for a full-duplex interface is not correct.

Now nicstat will look for the kstat "link_duplex" value, and if it is 2 (which means full-duplex), it will use the greater of rbytes or wbytes to calculate utilization.

No change to the Linux version. Use the links in my previous post for downloading.

Monday Apr 27, 2009

pstime - a mash-up of ps(1) and ptime(1)

I have done some testing in the past where I needed to know the amount of CPU consumed by a process more accurately than I can get from the standard set of operating system utilities.

Recently I hit the same issue - I wanted to collect CPU consumption of mysqld.

To capture process CPU utilization over an interval on Solaris, about the best I can get is the output from a plain "prstat" command, which might look like:

mashie ) prstat -c -p `pgrep mysqld` 5 2
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18

I am after data from the second sample only (still not sure exactly how prstat gets data for the first sample, which comes out almost instantaneously), so you can guess I will need some sed/perl that is a little more complicated than I would prefer.

pstime reads PROCFS (i.e., the virtualized file system mounted on /proc) and captures CPU utilization figures for processes. It will report the %USR and %SYS either for a specific list of processes, or for every process running on the system (i.e., every process running at both sample points). The sample time is recorded in high resolution at the moment each process' data is captured, and then again after N seconds, where N is the first parameter supplied to pstime.

The default output of pstime is expressed as either a percentage of whole system CPU, or CPU seconds, with four significant digits. Solaris itself records the original figures in nanosecond resolution, although we do not expect today's hardware to be that accurate.

Here is an example:

mashie ) pstime 10 `pgrep sysbench\\|mysqld`
  UID    PID  %USR  %SYS COMMAND
mysql   7141 44.17 3.391 /u/dist/mysql60-debug/bin/mysqld --defaults-file=/et
mysql  19870 2.517 2.490 sysbench --test=oltp --oltp-read-only=on --max-time=
mysql  19869 0.000 0.000 /bin/sh -p ./run-sysbench

Downloads

Tuesday Apr 21, 2009

Expanding Google's InnoDB Synchronization Improvements to Solaris

There is much excitement today at the launch of MySQL 5.4, so I will relate the story of a project of mine that contributed to this new version.

When we started looking at performance improvements for MySQL, we were interested in "low hanging fruit", or fixes and changes that could reap measurable benefits for users in the short term.

An obvious candidate at that time was the now well-known Google SMP patch. I had seen Mark Callaghan present on this at the MySQL User Conference in 2008, and was interested to investigate.

I was pretty new to InnoDB at that time, and soon discovered that InnoDB was possibly experiencing poor scalability around its mutexes and read-write locks: InnoDB had a private implementation of adaptive mutexes and read-write locks, and this was probably not the best implementation on all, or even most, of the platforms MySQL is available on.

Now, InnoDB's "private" mutexes and rw-locks were a good way to get spin-locks on all platforms, which may be a win in many cases, but as the Google team had demonstrated, they could be improved on. Indeed, I knew that adaptive spin-locks are available on Solaris, and they offer an extra advantage - if the holder of a lock is found to be off CPU, we don't bother spinning, but instead put the thread wanting the lock straight to sleep.

So, I decided to undertake a couple of performance studies of InnoDB's locking, being:

  1. Apply the Google SMP patch to MySQL 5.1 and test
  2. Modify InnoDB in 5.1 to use POSIX mutexes and RW-locks and test

The second step turned out to be quite complicated. I could not even change all of InnoDB's RW-locks to POSIX ones, as the InnoDB synchronization objects offer functionality not available via POSIX. It also meant we would be diverging more significantly from the InnoDB in 5.1, so this option - although looking promising - was shelved.

This left the Google SMP patch. It also looked promising. It was a less dramatic change, and offered scaling benefits in all the testing I did.

There was one last snag though - the mutex and RW-lock improvements in the Google SMP patch would only be applied if you were building on x86/x64 with GCC 4.1 or later, as they relied on GCC's atomic built-ins.

You can think of the platforms MySQL supports as a two-dimensional matrix: a compiler by an operating system. To make a feature portable across this matrix, you need to find a portable API, write code that is portable, or write code that chooses between different portable APIs depending on what is available.

Now we definitely wanted to get a similar benefit for InnoDB on SPARC, and not necessarily just with GCC. In any case, GCC did not offer all of the built-in atomics for SPARC at the time. Happily, there are atomic functions available in Solaris that fit the job fine. MySQL 5.4 uses these functions if you build on Solaris without a version of GCC that supports built-in atomics.

Just so you understand though, here is (a simplified version of) what happens when you build MySQL 5.4 on your chosen platform with your chosen compiler:

  • IF (compiler has GCC built-in atomics)
    use GCC built-in atomics
  • ELSE IF (OS has atomic functions)
    use atomic functions
  • ELSE
    use traditional InnoDB synchronization objects, based on pthread_mutex*.

Summary

As Neel points out in his blog, it was an exercise we learnt something from, even if we did develop functionality that will not be used. The important thing is that we know we have improved the performance of MySQL, by extending the Google SMP improvements to all Solaris users, regardless of their chosen compiler.

Thursday Apr 09, 2009

Testing the New Pool-of-Threads Scheduler in MySQL 6.0, Part 2

In my last blog, I introduced my investigation of the "Pool-of-Threads" scheduler in MySQL 6.0. Read on to see where I went next.

I now want to take a different approach to comparing the two schedulers. It is one thing to compare how the schedulers work "flat out" - with a transaction request rate that is limited only by the maximum throughput of the system under test. I would like to instead look at how the two schedulers compare when I drive mysqld at a consistent transaction rate, then vary only the number of connections over which the transaction requests are arriving. I will aim to come up with a transaction rate that sees CPU utilization somewhere in the 40-60% range.

This is more like how real businesses use MySQL every day, as opposed to the type of benchmarking that computer companies usually engage in. This will also allow me to look at how the schedulers run at much higher connection counts - which is where the pool-of-threads scheduler is supposed to shine.

Now, I will let you all know that I first conducted my experiments with mysqld and the load generator (sysbench) on the same system. I was again not sure this would be the best methodology, primarily because one operating system instance would end up scheduling, in some cases, a very large number of sysbench threads along with the mysqld threads.

It turned out the results from this mode threw up some issues (like not being able to get my desired throughput with 2048 connections in pool-of-threads mode), so I repeated my experiments - the second set of results have the load generation coming from two remote systems, each with a dedicated 1 Gbit ethernet link to the DB server.

The CPU utilization I have captured was just the %USR plus %SYS for the mysqld process. This makes the two sets of metrics comparable.

Here are my results. First for experiments where sysbench ran on the same host as mysqld:

Then for experiments where sysbench ran on two remote hosts, each with a dedicated Gigabit Ethernet link to the database server:

As you can see, the pool-of-threads model does incur an overhead, both in terms of CPU consumption and response time, at low connection counts. As hoped though, the advantage swings in pool-of-threads' favour as the connection count grows. This is particularly noticeable in the case where our clients are remote. It is arguable that an architecture involving many hundreds or thousands of client connections is more likely to have those clients located remotely from the DB server.

Now, the first issue I have is that while pool-of-threads starts to win on response time, the response time is still increasing in a similar fashion to thread-per-connection's response time (note - the scale is logarithmic). This is not what I expected, so we have a scalability problem in there somewhere.

The second issue is where I have to confess - I only got one "lucky" run where my target transaction rate was achieved for pool-of-threads at 2048 connections. For many other runs, the target rate could not be achieved, as these raw numbers show:

connections      tps   mysqld   mysqld   mysqld   avg-resp   95%-resp
                        %usr     %sys     %cpu
       2048   962.22    25.23    14.93    40.16    1943.78    2368.78
       2048  1197.00    30.59    11.20    41.79     317.98     435.19
       2048   836.50    21.98    11.09    33.07    2259.36    2287.03
       2048   963.00    26.49    12.07    38.56    1333.67    1128.93
       2048   992.25    25.81    15.08    40.89    1851.17    2280.50
       2048   915.71    24.16    15.05    39.21    2220.45    2342.06
       2048   919.54    24.25    15.05    39.30    2210.95    2331.45
       2048   917.09    24.15    15.05    39.20    2217.86    2321.40
       2048   875.09    23.20    13.29    36.49    2188.69    2344.91
       2048  1180.62    31.35    14.57    45.92    1439.96    1772.86
       2048  1185.80    30.74    14.24    44.98    1185.71    1814.24
       2048  1146.90    30.34    15.23    45.57    1602.85    1842.14
       2048  1141.47    30.20    15.22    45.42    1612.34    1873.95
       2048  1158.74    30.47    12.99    43.46     999.76    1870.35
       2048  1177.59    30.67    14.97    45.64    1403.22    1838.84

This indicates we have some sort of bottleneck right at or around the 2048 thread point. This is not what we want with pool-of-threads, so I will continue my investigation.

Wednesday Apr 08, 2009

Testing the New Pool-of-Threads Scheduler in MySQL 6.0

I have recently been investigating a new feature of MySQL 6.0 - the "Pool-of-Threads" scheduler. This feature is a fairly significant change to the way MySQL completes tasks given to it by database clients.

To begin with, be advised that the MySQL database is implemented as a single multi-threaded process. The conventional threading model is that there are a number of "internal" threads doing administrative work (including accepting connections from clients wanting to connect to the database), then one thread for each database connection. That thread is responsible for all communication with that database client connection, and performs the bulk of database operations on behalf of the client.

This architecture exists in other RDBMS implementations. Another common implementation is a collection of processes all cooperating via a region of shared memory, usually with semaphores or other synchronization objects located in that shared memory.

The creation and management of threads can be said to be cheap, in a relative sense - it is usually significantly cheaper to create or destroy a thread than a process. However, these overheads do not come for free. Also, the operations involved in scheduling a thread as opposed to a process are not significantly different. A single operating system instance scheduling several thousand threads on and off the CPUs is not much less work than one scheduling several thousand processes doing the same work.

Pool-of-Threads

The theory behind the Pool-of-Threads scheduler is to provide an operating mode which supports a large number of clients that will be maintaining their connections to the database, but will not be sending a constant stream of requests to the database. To support this, the database will maintain a (relatively) small pool of worker threads that take a single request from a client, complete the request, return the results, then return to the pool and wait for another request, which can come from any client. The database's internal threads still exist and operate in the same manner.

In theory, this should mean less work for the operating system to schedule threads that want CPU. On the other hand, it should mean some more overhead for the database, as each worker thread needs to restore the context of a database connection prior to working on each client request.

A smaller pool of threads should also consume less memory, as each thread requires a minimum amount of memory for a thread stack, before we add what is needed to store things like a connection context, or working space to process a request.

You can read more about the different threading models in the MySQL 6.0 Reference Manual.
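
For reference, the scheduler is selected at server startup with the thread_handling option, shown here in command-line form; the option name and values should be confirmed against the 6.0 manual for your build:

$ mysqld --thread-handling=one-thread-per-connection   # the conventional scheduler
$ mysqld --thread-handling=pool-of-threads             # the pool-of-threads scheduler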

Testing the Theory

Mark Callaghan of Google has recently had a look at whether this theory holds true. He has published his results under "No new global mutexes! (and how to make the thread/connection pool work)". Mark has identified (via this bug he logged) that the overhead for using Pool-of-Threads seems quite large - up to 63 percent.

So, my first task is to see if I get the same results. I will note here that I am using Solaris, whereas Mark was no doubt using a Linux distro. We probably have different hardware as well (although both are Intel x86).

Here is what I found when running sysbench read-only (with the sysbench clients on the same host). The "conventional" scheduler inside MySQL is known as the "Thread-per-Connection" scheduler, by the way.

This is in contrast to Mark's results - I am only seeing a loss in throughput of up to 30%.

What about the bigger picture?

These results do show there is a definite reduction in maximum throughput if you use the pool-of-threads scheduler.

I believe it is worth looking at the bigger picture however. To do this, I am going to add in two more test cases:

  • sysbench read-only, with the sysbench client and MySQL database on separate hosts, via a 1 Gb network
  • sysbench read-write, via a 1 Gb network

What I want to see is what sort of impact the pool-of-threads scheduler has for a workload that I expect is still the more common one - where our database server is on a dedicated host, accessed via a network.

As you can see, the impact on throughput is far less significant when the client and server are separated by a network. This is because we have introduced network latency as a component of each transaction and increased the amount of work the server and client need to do - they now need to perform ethernet driver, IP and TCP tasks.

This reduces the relative overhead - in CPU consumed and latency - introduced by pool-of-threads.

This is a reminder that if you are conducting performance tests on a system prior to implementing or modifying your architecture, you would do well to choose a test architecture and workload that is as close as possible to the one you intend to deploy. The same is true if you are trying to extrapolate performance testing someone else has done to your own architecture.

The Converse is Also True

On the other hand, if you are a developer or performance engineer conducting testing in order to test a specific feature or code change, a micro-benchmark or simplified test is more likely to be what you need. Indeed, Mark's use of the "blackhole" storage engine is a good way to eliminate storage-engine processing from each transaction.

In this scenario, if you fail to make the portion of the software you have modified a significant part of the work being done, you run the risk of seeing performance results that are not significantly different, which may lead you to assume your change has negligible impact.

In my next posting, I will compare the two schedulers using a different perspective.

About

Tim Cook's Weblog. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
