Thursday May 28, 2009

Inniostat - InnoDB IO Statistics

I wrote a small DTrace script to understand InnoDB IO statistics. This script shows statistics about different kinds of Innodb IO requests and how many of them result in actual IO. Sample output is shown below
#./inniostat -h
Usage: inniostat [-h] [-d] [-p pid] [interval]
                 -h : Print this message
                 -p : MySQL PID
                 -d : Dump dtrace script being used

# ./inniostat
 __physical__  ___Innodb___ ____read____     ______write______
   r/s    w/s    r/s    w/s   data    pre    log dblbuf dflush     Time
    24    121     24     50     24      0     50      0      0 16:00:57
    26    130     26     51     26      0     51      0      0 16:00:58
    18    134     18     54     18      0     54      0      0 16:00:59
    25    129     25     51     25      0     51      0      0 16:01:00
    29    116     46     47     17     29     47      0      0 16:01:01
    10    140     10    132     10      0     52      0     80 16:01:02
    29    129     35     53     14     21     53      0      0 16:01:03

r/s Physical (Actual) writes per sec
w/s Physical (Actual) writes per sec
r/s Reads issued by Innodb per sec
w/s Writes issued by Innodb per sec
data Regular Reads issued by Innodb to the datafiles per sec
pre Prefetch Reads issued by Innodb to the datafiles per sec
log Log writes issued by Innodb per sec
dblbuf Double buffer writes per sec
dflush Writes due to flushing of Innodb buffers to disk.

For the above example you can see that writes to the Innodb double buffer (dblbuf) is zero. This is because I had turned it off explicitly. You will also notice that there are more physical writes than what Innodb is requesting. This can be due to a variety of factors (other engines are doing writes, other applications are doing writes, recordsize mismatch between filesystem and mysql, filesystem logging, etc.). You will also notice a few prefetch requests are being issued.

Another example

Consider the sample output below (contributed by Dimitri)

 __physical__  ___Innodb___ ____read____    ______write______
   r/s    w/s    r/s    w/s   data    pre    log dblbuf dflush     Time
     0   1681      0   5259      0      0   2780      0   2479 14:58:13
     0    224      0   6111      0      0   6111      0      0 14:58:18
     0    761      0   4300      0      0   3368      0    932 14:58:23
     0    503      0   4232      0      0   3546      0    686 14:58:28
     0    904      0   4024      0      0   2923      0   1101 14:58:33
     0   1231      0   4046      0      0   2470      0   1575 14:58:38
     0    502      0   4192      0      0   3640      0    552 14:58:43
     0    928      0   4669      0      0   3331      0   1338 14:58:48
You can quickly notice that this is a cached workload (r/s is zero). There is a significant difference between log writes issued by Innodb and physical writes. This is because this particular setup had innodb_flush_log_at_trx_commit set to 2 (i.e flush log writes every second). If want to flush the log after every commit, you need a disk[s] that can handle around 3000 writes per second. You will also notice a lot of Innodb buffers being flushed (dflush). This maybe because modified buffers in the Innodb buffer pool are being flushed, and/or due to caching in the filesystem instead of in the InnoDB buffer pool.

But I can get some these statistics via "show status"

Some (but not all) of these statistics are available via "show status" command. Be sure to understand what they are measuring before using them. Few of them can be little puzzling. For ex, Innodb_buffer_pool_read_ahead_seq variable counts the number of times a sequential read ahead was triggered. It does not tell you, how many pages were fetched as a result of the read ahead. Similarly, Innodb_log_write_requests tells you how many times a write happened to the log, but its is not really useful as this is incremented when someone writes to the in-memory log block, it does not tell you when the log was written to disk.


Innodb issues reads to either the data files or log files (during recovery). This script does not differentiate between them. Writes are either to the data files, log files, or double write buffer. If you view the source of the script (or use the -d option), you will see that I am interposing on the fil_io function and interpreting its arguments to differentiate IO types. I am using the io:::start probe to figure out the physical IO. Note that this script tracks Innodb IO requests. The actual IO may be carried out by a Innodb background thread some time later. This script also makes assumptions regarding several Innodb constants and hence may not be accurate for future versions.


You can download it here. Since it uses DTrace, it only works on Solaris, MacOS and FreeBSD. Please feel free to use it and let me know of any feedback/comments.

Tuesday May 26, 2009

MySQL Innodb ZFS Best Practices

One of the cool things about talking about MySQL performance with ZFS is that there is not much tuning to be done :-) Tuning with ZFS is considered evil, but a necessity at times. In this blog I will describe some of the tunings that you can apply to get better performance with ZFS as well as point out performance bugs which when fixed will nullify the need for some of these tunings.

For the impatient, here is the summary. See below for the reasoning behind these recommendations and some gotchas.

  1. Match ZFS recordsize with Innodb page size (16KB for Innodb Datafiles, and 128KB for Innodb log files).
  2. If you have a write heavy workload, use a Seperate ZFS Intent Log.
  3. If your database working set size does not fit in memory, you can get a big boost by using a SSD as L2ARC.
  4. While using storage devices with battery backed caches or while comparing ZFS with other filesystems, turn off the cache flush.
  5. Prefer to cache within MySQL/Innodb over the ZFS Adaptive replacement cache (ARC).
  6. Disable ZFS prefetch.
  7. Disable Innodb double write buffer.

Lets look at all of them in detail.

What Match ZFS recordsize with Innodb page size (16KB for Datafiles, and 128KB for Innodb log files).
How zfs set recordsize=16k tank/db

The biggest boost in performance can be obtained by matching the ZFS record size with the size of the IO. Since a Innodb Page is 16KB in size, most read IO is of size 16KB (except for some prefetch IO which can get coalesced). The default recordsize for ZFS is 128KB. The mismatch between the read size and the ZFS recordsize can result in severely inflated IO. If you issue a 16KB read and the data is not already there in the ARC, you have to read 128KB of data to get it. ZFS cannot do a small read because the checksum is calculated for the whole block and you have to read it all to verify data integrity. The other reason to match the IO size and the ZFS recordsize is the read-modify-write penalty. With a ZFS recordsize of 128KB, When Innodb modifies a page, if the zfs record is not already in memory, it needs to be read in from the disk and modified before writing to disk. This increases the IO latency significantly. Luckily matching the ZFS recordsize with the IO size removes all the problems mentioned above.

For Innodb log file, the writes are usually sequential and varying in size. By using ZFS recordsize of 128KB you amortize the cost of read-modify-write.


You need to set the recordsize before creating the database files. If you have already created the files, you need to copy the files to get the new recordsize. You can use the stat(2) command to check the recordsize (look for IO Block:)

What If you have a write heavy workload, use a seperate intent log (slog).
How zpool add log c4t0d0 c4t1d0

Write latency is extremely critical for many MySQL workloads. Typically, a query will read some data, do some calculations, update some data and then commit the transaction. To commit, the Innodb log has to be updated. Many transactions can be committing at the same time. It is very important that this "wait" for commit be fast. Luckily in ZFS, synchronous writes can be accelerated up by using the Seperate Intent Log. In our tests with Sysbench read-write, we have seen around 10-20% improvement with the slog.


What L2ARC (or Level 2 ARC)
How zpool add cache c4t0d0

If your database does not fit in memory, every time you miss the database cache, you have to read a block from disk. This cost is quite high with regular disks. You can minimize the database cache miss latency by using a (or multiple) SSDs as a level-2 cache or L2ARC. Depending on your database working set size, memory and L2ARC size you may see several orders of magnitude improvement in performance.


What When it is safe, turn off ZFS cache flush

The ZFS Evil tuning guide has more information about setting this tunable. Refer to it for the best way to achieve this.


ZFS is designed to work reliably with disks with caches. Everytime it needs data to be stored persistantly on disk, it issues a cache flush command to the disk. Disks with a battery backed caches need not do anything (i.e the cache flush command is a nop). Many storage devices interpret this correctly and do the right thing when they receive a cache flush command. However, there are still a few storage systems which do not interpret the cache flush command correctly. For such storage systems, preventing ZFS from sending the cache flush command results in a big reduction in IO latency. In our tests with Sysbench read-write test we saw a 30% improvement in performance.

  • Setting this tunable on a system without a battery backed cache can cause inconsistencies in case of a crash.
  • When comparing ZFS with filesystems that blindly enable the write cache, be sure to set this to get a fair comparison.

What Prefer to cache within MySQL/Innodb over the ARC.
How Via my.cnf and by limiting the ARC size

You have multiple levels of caching when you are using MySQL/Innodb with ZFS. Innodb has its own buffer pool and ZFS has the ARC. Both of them make independent decisions on what to cache and what to flush. It is possible for both of them to cache the same data. By caching inside Innodb, you get a much shorter (and faster) code path to the data. Moreover, when the Innodb buffer cache is full, a miss in the Innodb buffer cache can lead to flushing of a dirty buffer, even if the data was cached in the ARC. This leads to unnecessary writes. Even though the ARC dynamically shrinks and expands relative to memory pressure, it is more efficient to just limit it.In our tests, we have found that it is better (7-200%) to cache inside Innodb rather than ZFS.


The ARC can be tuned to cache everything, just metadata or nothing on a per filesystem basis. See below for tuning advise about this.

What Disable ZFS Prefetch.
How In /etc/system: set zfs:zfs_prefetch_disable = 1

Most filesystems implement some kind of prefetch. ZFS prefetch detects linear (increasing and decreasing), strided, multiblock strided IO streams and issues prefetch IO when it will help performance. These prefetch IO have a lower priority than regular reads and are generally very beneficial. ZFS also has a lower level prefetch (commonly called vdev prefetch) to help with spatial locality of data.

In Innodb, rows are stored in order of primary index. Innodb issues two kinds of prefetch requests; one is triggered while accessing sequential pages and other is triggered via random access in an extent. While issuing prefetch IO, Innodb assumes that file is laid out in the order of the primary key. This is not true for ZFS. We are yet to investigate the impact of Innodb prefetch.

It is well known that OLTP workloads access data in a random order and hence do not benefit from prefetch. Thus we recommend that you turn off ZFS prefetch.

  • If you have changed the primary cache caching strategy to just cache metadata, you will not trigger file level prefetch.
  • If you have set recordsize to 16k, you will not trigger the lower level prefetch.

What Disable Innodb Double write buffer.
How skip-innodb_doublewrite in my.cnf

Innodb uses a double write buffer for safely updating pages in a tablespace. Innodb first writes the changes to the double write buffer before updating the data page. This is to prevent partial writes. Since ZFS does not allow partial writes, you can safely turn off the double write buffer. In our tests with Sysbench read-write, we say a 5% improvement in performance.


Tuesday Apr 21, 2009

Reducing Innodb mutex contention

Today Sun announces MySQL 5.4. This is a great day for customers as they can use systems with many cores much more efficiently. Its a great day for the MySQL community and the MySQL performance team because we made it happen. MySQL 5.4 includes a lot of community contributed fixes as well as many fixes from our team. Mikael and Allan are blogging about all the cool new features and the great scalability of MySQL 5.4. I thought I will take this opportunity to blog about some of the things we tried, and rejected. Sometimes there are a lot of things to be learnt from things that do not work :-)

Early on during our performance investigation, we were trying to see if we can reduce some of the contention in Innodb locks. If you are not familiar with Innodb locks, I suggest you read Tim Cook's excellent presentation to MySQL University on this very topic.


In a nutshell, Innodb has 4 kinds of lock modes (shared, exclusive, intention shared, and intention exclusive). Since POSIX synchronization methods support maximum of 2 modes (reader or writer), Innodb implements its own set of locking primitives using a condition variable and a regular mutex. Innodb also implements its own spin followed by a block. When it gets a mutex in an not contended case, it is very fast. However, when spinning for a lock fails, it gets interesting.

The problem

A failed spin for a lock means it has to block for the lock to be available. Currently Innodb uses the wait array interface to keep track of who is waiting on what mutex. Unfortunately there is a global mutex protecting the wait array. This global mutex (sync_primary_wait_array) has shown to be hot in some high thread count experiments.

The following callstack illustrates this perfectly.
    nsec ---- Time Distribution --- count Stack
   16384 |                        |     8 mysqld`os_mutex_enter+0x4
   32768 |                        |    11 mysqld`sync_array_free_cell+0x28
   65536 |                        |     7 mysqld`mutex_spin_wait+0x194
  131072 |                        |     9 mysqld`trx_undo_assign_undo+0x30
  262144 |@                       |    34 mysqld`trx_undo_report_row_operation+0x168
  524288 |@@@@                    |   122 mysqld`btr_cur_update_in_place+0x160
 1048576 |@@@@@                   |   157 mysqld`btr_cur_optimistic_update+0xbc
 2097152 |@@@@@@                  |   186 mysqld`row_upd_clust_rec+0x50
 4194304 |@@                      |    86 mysqld`row_upd_clust_step+0x5f0

As you can see, mutex_spin fails, it has to wait, it looks for a free cell in the sync_array to block. Before it can find a free cell, it has to lock the sync_array and search for a free cell. It blocks (using OS primitives, not Innodb locking) on the sync_primary_wait_array.

Proposed Solution

Locking the whole sync_array to pick a free cell does not sound too scalable. I thought of following simple ways to fix this
  1. Use a mutex per cell instead of a global mutex.
  2. Use atomic ops to mark cells free/busy instead of grabbing the global mutex and checking.
  3. The search for a free cell always starts from 0. This is suboptimal as busy cells will tend to accumulate at the beginning. I propose starting at the previously found free cell and circling back after hitting the end.

I implemented a quick prototype to measure the improvement in performance before asking the Innodb folks to take a closer look. My performance tests showed very minimal improvement in performance.

Why did it not give a big boost in performance

An important thing to note is that this contention happened when it failed to get a mutex. A thread fails to get a mutex if another thread has already acquired the mutex. So it does not really matter how long (within reasonable limits), the thread that failed to acquire the mutex took to sleep on the lock.

Richard Smith had a great idea of totally bypassing the sync_array interface. He prototyped it and got some good gains, but I will let him talk about it.

Did we add this to 5.4?

No, since it did not improve performance, it is not there in 5.4. I doubt we will put it in unless we see a decent improvement in performance, or a drop in CPU utilization as a result of fixing it. There are other implementation issues (there is some deadlock detection code that depends on entire sync_array being locked, and others) that make it risky to fix without good justification.

This just illustrates how hard performance work can be at times. Sometimes the number of ideas rejected is more than what got accepted :-)

Tuesday Nov 04, 2008

Introduction to the Innodb IO subsystem

Introduction to the Innodb IO subsystem


When a client connects to MySQL, it creates a thread to handle it. This thread executes SQL queries and interacts with the storage engine (for simplicity, lets call them user threads). Innodb uses a four (4) additional threads to implement asynchronous io1. Although Innodb has an option innodb_file_io_threads to control the number of IO handler threads, it has no effect on how many IO handler threads are actually created. These IO handler threads wait and process events in a loop. Each IO handler thread processes different kinds of events. (Insert buffer writes, log writes, datafile writes, and read-ahead or prefetch) Let us now see how different types of IO are handled by Innodb.


Using a simple sysbench read-only test, we notice that all user threads issue reads (pread(2)) to the data files. We can verify this using a simple dtrace script
shell>dtrace -qn io:::start'/args[0]->b_flags & B_READ/{@[tid]=count()}tick-5s{exit(0)}'
      709               39
      711               42
      710               43
      712               51
As you can see above, all 4 threads are issuing reads.

Since innodb_thread_concurrency limits how many threads are executing inside Innodb, it acts as a throttle on the number of reads that can be issued in parallel.

Optionally you can use dtrace to look at the callstacks for reads and verify that the reads originate while reading pages of the Innodb datafiles.


Looking at writes, we find that bulk of the IO (pwrite(2)) is initiated by the Innodb IO handler threads. We also noticed a weird (atleast for me) behavior. We found that some reads were triggering fdsync() and thus causing writes to disk!. As noted above, since reads are done by all user threads, fdsync() was being issued by many threads.

So why are reads triggering fdsync() ?

In case the Innodb buffer gets full, it needs to evict data from the buffer pool to make space for the new record that is being read. The eviction function (buf_flush_free_margin()) also flushes possible buffered writes from the double write memory buffer. This flush is done using fdsync(). You can try increasing the buffer pool size incase the fdsync() bothers you. It is also possible to turn off the Innodb double write buffer, but it is generally not recommended (except for ZFS) as you might lose data in case of a power outage.

The 100 ios per second rule

The Innodb master thread has the responsibility (among many others) of flushing out modified buffer pool pages. Once a second, the innodb master thread will flush up to a maximum of 100 pages. This is a hard coded limit and is too low! (search for buf_flush_batch(BUF_FLUSH_LIST, 100,..) in srv/srv0srv.c). Reads do not suffer from this limitation.

So what can we do to improve Innodb's IO behavior?

Having only one thread for writes is not great for performance. The 100 ios per second rule is seriously flawed. Luckily the MySQL community (Google and Percona in particular) have fixes that change this behavior. Check out Mark Callaghan's blog for more details

So in conclusion,

  1. Reads are issued by all user threads
  2. Writes are only issued by innodb threads (one per kind of write)
  3. Buffer pool writes are currently limited to a max of 100 writes per second.
  4. It is possible for reads to cause writes
  5. Google (Mark Callaghan) has some really cool patches to fix most of the issues mentioned above. I have not tried the patch yet, but will soon.

1Rant: Most modern operating systems support AIO natively, I cannot understand why Innodb wants to implement its own)

Tuesday Sep 16, 2008

Innodb just got better!

I just got back from vacation and noticed that two patchsets have been released that greatly improve Innodb performance! Maybe I need to take more breaks :-)

Yasufumi Kinoshita's patches

Percona recently released a patch which includes performance fixes developed by Yasufumi Kinoshita from NTT Comware. This helps diskbound applications quite significantly. Details at Bug #29413 Maximum performance of OLTP benchmark is not so scalable on multi-cpu. It looks like the bulk of performance improvements come from breaking up the lock guarding the buffer pool structures, and improvements in the IO code path. The "buf_pool->mutex" also gets quite hot when concurrency is not limited (via innodb_thread_concurrency) and you have multiple threads executing on multi-core systems.

Updated Google patch

Google recently released an updated patch for improving Innodb scalability. My experience with this patch shows a significant scalability improvement for many workloads. Great work Mark, Ben and rest of the gang!

This is excellent news! This clearly demonstrates that we (as a community) are making great progress in making MySQL scale. This also underlines the fact that the choice of the storage engine will influence your scalability experience quite a bit. This also sets the bar higher for Falcon and Maria :-) (A good thing in my opinion)

Wednesday Jul 23, 2008

Peeling the MySQL Scalability Onion

In this blog I will talk about how we (the Sun/MySQL Performance Team) eliminated the need for a lock to get better scalability with MySQL 5.1.24+.

While comparing sysbench runs using two different versions of MySQL 5.1, I noticed a big difference in the system utilization. One version had much more idle time than the other. This difference was much more apparent with high thread counts on machines with lots of CPU. A look at the system calls showed a large number of lwp_park system calls. That is, threads were being put to sleep.

bash # $ dtrace -qn 'syscall:::entry{@c[probefunc]=count()}' -n tick-5s'{trunc(@c, 10);exit(0)}'

  p_online                                                       1535
  ioctl                                                          2255
  fcntl                                                          6134
  priocntlsys                                                    6462
  write                                                          6492
  read                                                          12775
  yield                                                         19065
  gtime                                                        313927
  pollsys                                                      321074
  lwp_park                                                     951749
Investigating why threads are being parked showed a very interesting stacktrace
              mysqld`int ha_innobase::info(unsigned)+0x164
              mysqld`bool make_join_statistics(JOIN\*,TABLE.. +0x180
              mysqld`int JOIN::optimize()+0x670
              mysqld`bool mysql_select(THD\*,Item\*\*\*,TABLE_.. +0x224
              mysqld`bool handle_select(THD\*,st_lex\*,selec.. +0xc4
              mysqld`bool execute_sqlcom_select(THD\*,TABLE.. +0x1c8
              mysqld`int mysql_execute_command(THD\*)+0x380
              mysqld`bool Prepared_statement::execute(Stri.. +0x29c
              mysqld`bool Prepared_statement::execute_loop.. +0xbc
Looking at make_join_statistics() we see that it calls ha_innobase::info() with HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK. A quick search for the definition of HA_STATUS_NO_LOCK shows
  assuming the table keeps shared actual copy of the 'info' and
  local, possibly outdated copy, the following flag means that
  it should not try to get the actual data (locking the shared structure)
  slightly outdated version will suffice
#define HA_STATUS_NO_LOCK        2

So the mysql server is requesting ha_innobase::info() to not hold a lock, and it is being ignored by ha_innobase::info()!.

I compared against MySQL 5.0 and saw that this particular lock was not held when ha_innobase::info() was called. Searching through the commit logs I found that this was introduced by Bug#32440.

Quickly hacking the code to revert to the old behaviour gave a big boost in performance. Hence I filed Bug #38185 ha_innobase::info can hold locks even when called with HA_STATUS_NO_LOCK. Luckily its a very small change and a fix is already in progress.

Moral of the story? A simple contended lock can bring down your performance by quite a lot.




« July 2016