Tuesday May 26, 2009

MySQL Innodb ZFS Best Practices

One of the cool things about MySQL performance on ZFS is that there is not much tuning to be done :-) Tuning ZFS is considered evil, but it is a necessity at times. In this blog I will describe some of the tunings that you can apply to get better performance with ZFS, and point out performance bugs which, when fixed, will remove the need for some of these tunings.

For the impatient, here is the summary. See below for the reasoning behind these recommendations and some gotchas.

  1. Match ZFS recordsize with Innodb page size (16KB for Innodb Datafiles, and 128KB for Innodb log files).
  2. If you have a write heavy workload, use a separate ZFS intent log (slog).
  3. If your database working set does not fit in memory, you can get a big boost by using an SSD as L2ARC.
  4. When using storage devices with battery-backed caches, or when comparing ZFS with other filesystems, turn off the cache flush.
  5. Prefer to cache within MySQL/Innodb over the ZFS Adaptive replacement cache (ARC).
  6. Disable ZFS prefetch.
  7. Disable Innodb double write buffer.

Let's look at all of them in detail.

What Match ZFS recordsize with Innodb page size (16KB for Datafiles, and 128KB for Innodb log files).
How zfs set recordsize=16k tank/db

The biggest boost in performance can be obtained by matching the ZFS recordsize with the size of the IO. Since an Innodb page is 16KB in size, most read IO is 16KB (except for some prefetch IO, which can get coalesced). The default recordsize for ZFS is 128KB. The mismatch between the read size and the ZFS recordsize can result in severely inflated IO: if you issue a 16KB read and the data is not already in the ARC, you have to read 128KB of data to get it. ZFS cannot do a smaller read because the checksum is calculated over the whole block, and you have to read it all to verify data integrity. The other reason to match the IO size and the ZFS recordsize is the read-modify-write penalty. With a ZFS recordsize of 128KB, when Innodb modifies a page, the containing ZFS record needs to be read in from disk and modified before being written back, unless it is already in memory. This increases the IO latency significantly. Luckily, matching the ZFS recordsize with the IO size removes all of the problems mentioned above.
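The inflation described above is easy to quantify with back-of-the-envelope arithmetic. The sketch below is illustrative only; the read counts and sizes are assumed, not measurements from our tests:

```python
# Hedged sketch: bytes actually read from disk when a workload issues
# random 16KB reads against different ZFS recordsizes. ZFS must read
# (and checksum-verify) a whole record to return any part of it.
def bytes_read_from_disk(num_reads, io_size, recordsize):
    # Each ARC-missed read pulls in at least one full record.
    records_per_io = max(1, -(-io_size // recordsize))  # ceiling division
    return num_reads * records_per_io * recordsize

innodb_page = 16 * 1024
reads = 1000  # 1000 random page reads that miss the ARC

inflated = bytes_read_from_disk(reads, innodb_page, 128 * 1024)
matched = bytes_read_from_disk(reads, innodb_page, 16 * 1024)
print(inflated // matched)  # 8: the default 128KB recordsize reads 8x the data
```

The same factor of 8 applies in the write path when a read-modify-write is needed, which is why matching the recordsize helps both directions.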

For Innodb log files, the writes are usually sequential and of varying size. By using a ZFS recordsize of 128KB you amortize the cost of read-modify-write.


You need to set the recordsize before creating the database files. If you have already created the files, you need to copy them to pick up the new recordsize. You can use the stat(1) command to check the recordsize (look for "IO Block:" in the output).

What If you have a write heavy workload, use a separate intent log (slog).
How zpool add log c4t0d0 c4t1d0

Write latency is extremely critical for many MySQL workloads. Typically, a query will read some data, do some calculations, update some data, and then commit the transaction. To commit, the Innodb log has to be updated. Many transactions can be committing at the same time, and it is very important that this "wait" for commit be fast. Luckily, in ZFS, synchronous writes can be accelerated by using a separate intent log (slog). In our tests with Sysbench read-write, we have seen around a 10-20% improvement with the slog.
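A rough model of why the slog helps: a COMMIT cannot return until the synchronous Innodb log write completes, so per-thread commit rate is bounded by sync-write latency. The latencies below are assumed round numbers for illustration, not measured values:

```python
# Hedged sketch: upper bound on commits/sec for a single thread that
# must wait for one synchronous log write per COMMIT.
def max_commits_per_sec(sync_write_latency_s):
    return 1.0 / sync_write_latency_s

disk_latency = 0.005    # ~5 ms sync write to a spinning disk (assumed)
slog_latency = 0.0002   # ~0.2 ms to an NVRAM/SSD slog device (assumed)

print(round(max_commits_per_sec(disk_latency)))  # 200 commits/sec per thread
print(round(max_commits_per_sec(slog_latency)))  # 5000 commits/sec per thread
```

In practice group commit batches many transactions into one log write, so the whole-system gain is far smaller than this per-thread bound, which is consistent with the 10-20% we measured.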


What L2ARC (or Level 2 ARC)
How zpool add cache c4t0d0

If your database does not fit in memory, every time you miss the database cache you have to read a block from disk. This cost is quite high with regular disks. You can minimize the database cache miss latency by using one (or more) SSDs as a level-2 cache, or L2ARC. Depending on your database working set size, memory, and L2ARC size, you may see several orders of magnitude improvement in performance.
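The effect on average miss latency can be sketched with assumed device timings; the hit ratio and latencies below are hypothetical, not from our rigs:

```python
# Hedged sketch: expected latency of a database cache miss, with and
# without an SSD L2ARC absorbing a fraction of those misses.
def avg_miss_latency(l2arc_hit_ratio, ssd_latency_s, disk_latency_s):
    return l2arc_hit_ratio * ssd_latency_s + (1 - l2arc_hit_ratio) * disk_latency_s

disk = 0.008   # ~8 ms random read from a spinning disk (assumed)
ssd = 0.0002   # ~0.2 ms read from an SSD (assumed)

print(round(avg_miss_latency(0.0, ssd, disk) * 1000, 2))  # 8.0 ms: no L2ARC
print(round(avg_miss_latency(0.9, ssd, disk) * 1000, 2))  # 0.98 ms: 90% SSD hits
```

With a high enough L2ARC hit ratio the average miss cost approaches SSD latency, which is where the large speedups come from.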


What When it is safe, turn off ZFS cache flush

The ZFS Evil tuning guide has more information about setting this tunable. Refer to it for the best way to achieve this.


ZFS is designed to work reliably with disks that have write caches. Every time it needs data to be stored persistently on disk, it issues a cache flush command to the disk. Devices with battery-backed caches need not do anything (i.e. the cache flush command is a no-op). Many storage devices interpret this correctly and do the right thing when they receive a cache flush command. However, there are still a few storage systems which do not interpret the cache flush command correctly. For such storage systems, preventing ZFS from sending the cache flush command results in a big reduction in IO latency. In our tests with the Sysbench read-write test, we saw a 30% improvement in performance.

  • Setting this tunable on a system without a battery-backed cache can cause inconsistencies in case of a crash.
  • When comparing ZFS with filesystems that blindly enable the write cache, be sure to set this to get a fair comparison.

What Prefer to cache within MySQL/Innodb over the ARC.
How Via my.cnf and by limiting the ARC size

You have multiple levels of caching when you are using MySQL/Innodb with ZFS: Innodb has its own buffer pool and ZFS has the ARC. Both of them make independent decisions on what to cache and what to flush, so it is possible for both of them to cache the same data. By caching inside Innodb, you get a much shorter (and faster) code path to the data. Moreover, when the Innodb buffer cache is full, a miss in the Innodb buffer cache can lead to flushing of a dirty buffer even if the data was cached in the ARC; this leads to unnecessary writes. Even though the ARC dynamically shrinks and expands in response to memory pressure, it is more efficient to just limit it. In our tests, we have found that it is better (by 7-200%) to cache inside Innodb rather than in ZFS.
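One hedged way to apportion memory on a dedicated MySQL server is sketched below. The fractions are assumptions for illustration, not recommendations derived from the tests above; tune them for your workload:

```python
# Hedged sketch: give most of RAM to the Innodb buffer pool and cap the
# ARC so the two caches do not fight over the same data.
GIB = 1024 ** 3

def split_memory(ram_bytes, buffer_pool_fraction=0.7, arc_fraction=0.1):
    # Remaining memory is left for the OS, connections, sort buffers, etc.
    return (int(ram_bytes * buffer_pool_fraction),  # innodb_buffer_pool_size
            int(ram_bytes * arc_fraction))          # ARC cap (e.g. zfs_arc_max)

bp, arc = split_memory(32 * GIB)
print(bp // GIB, arc // GIB)  # 22 3 for a hypothetical 32GB machine
```

The buffer pool size goes in my.cnf; the ARC cap is set system-wide (on Solaris, via the zfs_arc_max tunable in /etc/system).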


The ARC can be tuned to cache everything, just metadata, or nothing on a per-filesystem basis. See below for tuning advice about this.

What Disable ZFS Prefetch.
How In /etc/system: set zfs:zfs_prefetch_disable = 1

Most filesystems implement some kind of prefetch. ZFS prefetch detects linear (increasing and decreasing), strided, and multiblock strided IO streams, and issues prefetch IO when it will help performance. These prefetch IOs have a lower priority than regular reads and are generally very beneficial. ZFS also has a lower level prefetch (commonly called vdev prefetch) to help with spatial locality of data.

In Innodb, rows are stored in the order of the primary index. Innodb issues two kinds of prefetch requests: one is triggered while accessing sequential pages, and the other is triggered by random access within an extent. While issuing prefetch IO, Innodb assumes that the file is laid out in the order of the primary key. This is not true for ZFS. We have yet to investigate the impact of Innodb prefetch.

It is well known that OLTP workloads access data in a random order and hence do not benefit from prefetch. Thus we recommend that you turn off ZFS prefetch.

  • If you have changed the primary cache caching strategy to just cache metadata, you will not trigger file level prefetch.
  • If you have set recordsize to 16k, you will not trigger the lower level prefetch.

What Disable Innodb Double write buffer.
How skip-innodb_doublewrite in my.cnf

Innodb uses a double write buffer for safely updating pages in a tablespace. Innodb first writes the changes to the double write buffer before updating the data page, to prevent partial page writes. Since ZFS does not allow partial writes (a block is written completely or not at all), you can safely turn off the double write buffer. In our tests with Sysbench read-write, we saw a 5% improvement in performance.


Thursday Nov 01, 2007

Visualizing callgraphs via dtrace and ruby

The allfrom.d dscript can be used to display all function entry/exits caused by a function call. When call graphs are deep, or long, a visual representation is very helpful to understand the flow as well as how much time each function consumed. The output displayed is inspired by Roch's CallStackAnalyzer which was in turn inspired by the work on vftrace by Jan Boerhout.

I wrote a simple ruby script to post-process the allfrom.d output and generate an SVG image. The advantage of using SVG is that you can use javascript to provide added functionality. For example, you can hover your mouse over any block to see the name of the function and its elapsed time. Similarly, you could add support for zoom and pan.

Unfortunately, I am having problems with blogs.sun.com serving svg files with the right mime type. So I have included a png image below. You can save the svg files somewhere on your computer and view them using Firefox.

A sample for connect(3SOCKET) is shown below. The input file used to generate it is available here. The width of each box indicates how long the function took; the Y axis indicates call depth.

If you are interested, you can also check out

To use the scripts to generate your own call stacks, download the following two [ 1, 2] ruby source files. For example, to generate the above callgraph, you can use

dtrace -s allfromkernel.d accept > accept.log
ruby function_call_graph.rb accept.log > accept.svg

Note that the script does not work reliably for asynchronous function calls, or cases where the thread changes CPU. If there is sufficient interest, I might be tempted to add those in :-)
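The core of the post-processing is simple. Here is a hedged Python sketch of the transformation the ruby script performs; the real script reads allfrom.d output, whose exact format is not reproduced here, and the event tuples below are a hypothetical simplification:

```python
# Hedged sketch: turn a stream of function entry/exit events (with
# timestamps) into boxes of (name, depth, start, duration) -- essentially
# what an SVG generator needs to draw the call graph.
def events_to_boxes(events):
    boxes, stack = [], []
    for kind, name, ts in events:
        if kind == "entry":
            stack.append((name, ts, len(stack)))
        else:  # "exit": close the most recent open frame
            name_, start, depth = stack.pop()
            boxes.append((name_, depth, start, ts - start))
    return boxes

events = [
    ("entry", "connect", 0),
    ("entry", "lookup", 10),
    ("exit", "lookup", 40),
    ("exit", "connect", 100),
]
print(events_to_boxes(events))
# [('lookup', 1, 10, 30), ('connect', 0, 0, 100)]
```

Box width maps to duration and the stack depth at entry time gives the Y position, which matches the rendering described above.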

Tuesday Sep 18, 2007

ZFS ARC Statistics

ARC Statistics are exported via kstat(1M) (CR 6510807) in Solaris 10 U4 and Solaris nevada (build 57+). A simple way to see them is to use kstat -m zfs.

I wrote a simple perl script to print out the ARC statistics. Sample output is shown below. You can either print the set of default fields, or specify the fields you want printed (see arcstat -v for details).

Download arcstat.pl

In later blogs, I will describe how to make sense of these numbers. Sample output below.

Cached random read

thumper1[bash]$ ~/arcstat.pl
    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
18:15:21  985M  100M     10   47M   14   53M    8    1M    3    11G   11G
18:15:22  137K     0      0     0    0     0    0     0    0    11G   11G
18:15:23  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:24  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:25  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:26  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:27  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:28  140K     0      0     0    0     0    0     0    0    11G   11G
18:15:29  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:30  140K     0      0     0    0     0    0     0    0    11G   11G
18:15:31  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:32   33K     0      0     0    0     0    0     0    0    11G   11G

Uncached sequential write

    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
18:17:48     0     0      0     0    0     0    0     0    0    11G   11G
18:17:49     0     0      0     0    0     0    0     0    0    11G   11G
18:17:50   664   372     56     7    2   365  100     5    2    11G   11G
18:17:51    1K   534     36     5    0   529  100     5    0    11G   11G
18:17:52    2K   774     33     6    0   768  100     6    0    10G   10G
18:17:53    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:54    1K   645     35     5    0   640  100     5    0    10G   10G
18:17:55    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:56    1K   646     35     6    0   640  100     6    0    10G   10G
18:17:57    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:58    2K   774     33     6    0   768  100     6    0    10G   10G
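The percentage columns in the output above are simple ratios of kstat counters. A hedged sketch of the derivation follows; the demand-read total is an assumed value chosen so the ratio matches the first sample row, since the raw counter is not shown above:

```python
# Hedged sketch: miss% is misses/reads; dm%, pm%, mm% are the miss ratio
# within demand, prefetch, and metadata accesses respectively.
def pct(misses, total):
    # Integer percentage, guarding against a zero denominator.
    return 100 * misses // total if total else 0

reads, misses = 985_000_000, 100_000_000          # "read" and "miss" columns
demand_reads, demand_misses = 335_000_000, 47_000_000  # demand_reads assumed

print(pct(misses, reads))                # 10, matching miss% in the first row
print(pct(demand_misses, demand_reads))  # 14, matching dm% in the first row
```

The other ratio columns follow the same pattern over their respective counter pairs.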

Monday Jun 25, 2007

ZFS Intent Log (ZIL) on a separate device

Neil Perrin, the lumberjack, has quietly added support for using separate devices for the ZFS intent log! This allows ZFS to use NVRAM when it is available, enabling it to perform even better for certain kinds of applications. His putback fixes CR 6339640: Make ZIL use NVRAM when available.

Thursday Feb 08, 2007

ZFS and OLTP workloads: Time for some numbers

My last entry provided some recommendations regarding the use of ZFS with databases. Time now to share some updated numbers.

Before we go to the numbers, it is important to note that these results are for the OLTP/Net workload, which may or may not represent your workload. These results are also specific to our system configuration, and may not be true for all system configurations. Please test your own workload before drawing any conclusions. That said, OLTP/Net is based on well known standard benchmarks, and we use it quite extensively to study performance on our rigs.

[Results table not recoverable from this copy. It compared OLTP/Net throughput (bigger is better) for UFS, UFS Directio (for which filesystem checksumming is N/A), and ZFS with and without double checksums, i.e. both block checksumming and block checking enabled.]

Databases usually checksum their blocks to maintain data integrity. Oracle, for example, uses a per-block checksum, and checksum checking is on by default. This is typically recommended because most filesystems do not have a checksumming feature. With ZFS, checksums are enabled by default. Since databases are not tightly integrated with the filesystem/volume manager, a checksum error detected by the database must be handled by the database. Since ZFS includes volume manager functionality, a checksum error will be transparently handled by ZFS (if you have some kind of redundancy like mirroring or raidz) and corrected before a read error is returned to the database. Moreover, ZFS will repair corrupted blocks via self-healing. While RAS experts will note that an end-to-end checksum at the database level is slightly better than an end-to-end checksum at the ZFS level, ZFS checksums give you unique advantages while providing almost the same level of RAS.

If you do not penalize ZFS with double checksums, you can see that we are within 6% of our best UFS number. So 6% gives you provable data integrity, unlimited snapshots, no fsck, and all the other good features. Quite good in my book :-) Of course, this number is only going to get better as more performance enhancements make it into the ZFS code.

More about the workload.
The tests were done with OLTP/Net with a 72 CPU Sun Fire E25K connected to 288 15k rpm spindles. We ran the test with around 50% idle time to simulate real customers. The test was done on Solaris Nevada build 46. Watch this space for numbers with the latest build of Nevada.


