Thursday Jun 18, 2009

MySQL Performance on Sun Storage 7000

If you saw Don MacAskill's keynote (The Smugmug Tale) at the recent MySQL Conference and Expo, you know that he had lots of positive things to say about his experience running MySQL on the Sun Storage 7410 at Smugmug. The 7410 marks the current high end of the Sun Storage 7000 Unified Storage line of network attached storage devices. For the past few months, I have been investigating MySQL database performance on the entry level and mid-range models, to see whether they might provide similar benefits to smaller MySQL sites. I have to admit that I was more than a little surprised at just how well things performed, even on the entry level 7110. For the whole story, read on...

The Sun Storage 7000 Unified Storage product line currently ranges from the entry level Sun Storage 7110, which provides up to 4TB of storage in a 2U form factor, to the high end Sun Storage 7410, which scales up to half a petabyte, and can be configured with Flash accelerated Hybrid Storage Pools and/or a High-availability Cluster option. The entire line, which also includes the mid-range Sun Storage 7210 and 7310, features a common, easy to use management interface and, of course, killer-app Dtrace Analytics. My investigation has focused primarily on the Sun Storage 7110, with some additional work on the Sun Storage 7210, which has more CPU, memory, and spindle capacity, and supports optional Write Flash Accelerator components.

Key Benefits for MySQL

There are a couple of aspects of the Sun Storage 7000 line that are really interesting for MySQL. As Don pointed out in his keynote, the underlying use of ZFS in the appliance means that you can get many of the benefits of ZFS even without migrating your OS to Solaris. The snapshot and clone features, for example, make automated backups and replication slave creation a snap, and the way that cloning is implemented means that a slave created by this method may require less physical storage than a slave created by traditional methods. That is because clones share unmodified disk blocks, so anything that remains unchanged after slave creation is never actually duplicated. You can also enable compression in the appliance, if you choose, which could mean even more space savings.

Analytics, Analytics, Analytics!

As a performance engineer, I can tell you that I have been frustrated many, many times by the lack of observability in storage devices. Even the best interfaces I have seen are like looking through frosted glass compared to the Dtrace Analytics interface. As Don MacAskill says "Drill down on everything. Correlate anything. God-like power." But don't take my word for it (or Don's), go to the Sun Storage 7000 Software page and download a simulator that you can run on your laptop or desktop, so you can see first hand.

Performance Highlights

Ok, nobody would care about Dtrace Analytics if all it did was tell you why your system was running so darn slow. The good news is that MySQL performance on the Sun Storage 7000 line is excellent, even on the entry level Sun Storage 7110. There are a couple of issues that I will point out later, but for MySQL on Linux over NFS, performance is great, right out of the box.

Huge read cache vs traditional HW Raid

The current Sun Storage 7000 lineup is based on standard Sun x64 systems and 64-bit OpenSolaris. That means lots of addressable memory, most of which is used as cache. On a standard Sun Storage 7110 with 8GB of memory, that can translate to over 6GB of usable cache. This can grow all the way to 128GB of memory on the 7410. Compare that to typical entry level hardware RAID devices that often come with as little as 256MB of cache, or even relatively high end hardware RAID devices that often come with 2GB of cache or less.

Blazingly fast reads from cache

With a database working set (the subset of data that clients are actively using) that fits in the cache of the Unified Storage appliance, I found that MySQL read-only performance on Linux and Solaris NFS was primarily bound by network bandwidth. For example, I observed over 6400 reads per second on the 7110, with a single MySQL/InnoDB/sysbench server accessing a 6GB working set (20GB total table space) via NFS over a 1Gb NIC. The sustained data throughput at this rate was a little over 80MB/sec, and with protocol overhead we had a sustained rate of a little over 90MB/sec on the wire, with peaks near 100MB/sec. That means we were running very near network line rate for the 1Gb NIC.

This resulted in 2x to 3x the Sysbench throughput for the MySQL/InnoDB/sysbench server using the Sun Storage 7110, compared to the same server when it was attached via fiber channel to a traditional HW RAID array with 2GB memory (providing enough cache for less than 1/3 of our working set). The read-only results were strong across the board for Linux and Solaris over both NFS and iSCSI.

On the 7210, we saw ~90% linear scaling for 1 to 6 MySQL/sysbench servers over NFS, when configured with a 10Gb NIC on the Unified Storage appliance and 1Gb nics in each DB server. Analytics showed that CPU and network in the 7210 were both at approximately 50% utilization during this test, and our working set used 36GB out of the available ~60GB of cache, so it is likely that we could have seen similar scaling to 8 or 10 DB servers.

The only tuning we applied at the Unified Storage appliance end was to set the record size of the NFS share to 16KB for the InnoDB tablespace storage. We left the record size at its default of 128KB for the NFS share that contained the InnoDB logs. There was no special network or NFS tuning applied to either the DB server or the appliance.

Strong IOPS for large working sets

If you have an active working set that exceeds the cache capacity in the Unified Storage appliance, your random read performance will eventually be bound by the IOPS rate of the underlying storage. For the Sun Storage 7110 and 7210, which do not have Read Flash Accelerator options like the 7310 and 7410, that means the IOPS rate of the disks. In a MySQL/InnoDB/sysbench test on the 7110, with an aggregate working set of 36GB (180GB aggregate table space), resulting in a 40% cache hit rate in the appliance, I observed roughly 4000 NFS reads per second and roughly 2700 resulting disk IOPS. That translates to 230 IOPS for each of the 12 data drives. In this test we were using only about 15% of the drive capacity, so we were "short stroking" the disks and getting better IOPS rates than we would if we had longer average seeks. For example, at 80% capacity we might see something on the order of 150 to 180 IOPS per drive.

For additional background on IOPS and other metrics for Sun Storage 7000 storage devices, check out Roch Bourbonnais' blogs on Sun Storage 7000 Performance invariants and Compared Performance of Sun 7000 Unified Storage Array Line and Amitabha Banerjee's blog on Analyzing the Sun Storage 7000.

Excellent MySQL read-write performance on Linux over NFS

I was very pleasantly surprised by MySQL read-write performance on Linux NFS to the entry level Sun Storage 7110. I went into the study expecting good read-write performance to the 7210, but I thought that the lack of Write Flash Accelerators in the 7110 would be an issue for MySQL. Not so! In my testing, I observed over 2300 reads plus 400 writes per second on the 7110, with a single MySQL/InnoDB/sysbench server accessing a 6GB working set (20GB total table space) via NFS over a 1Gb NIC.

This resulted in 1.5x the Sysbench throughput for the MySQL/InnoDB/sysbench server using the Sun Storage 7110, compared to the same server when it was attached via fiber channel to a traditional HW RAID array with 2GB memory (providing enough cache for less than 1/3 of our working set).

As with the read-only test, the large read cache in the 7110 compared to the HW RAID array probably played a big role here. The 85% cache hit rate in the appliance provided an advantage on read performance, which offset the potential advantage of the battery backed write cache in the HW RAID array. In addition, the multi-threaded MySQL/InnoDB/sysbench workload appears to have benefited from group commit on writes, since the 7110 started with lower throughput with 1 or 2 active threads, but began to outperform the HW RAID array at 4 threads and higher.

I was surprised enough by the MySQL read-write performance over Linux NFS that I felt compelled to confirm that fsync() calls in Linux were actually resulting in correct NFS data commits. Based on a combination of Dtrace Analytics and Wireshark analysis of NFS traffic, I can confirm that the data was correctly going to disk.

Good MySQL read-write performance over iSCSI w/out write flash accelerators

The testing I have done so far with MySQL over iSCSI has been exclusively on the entry level Sun Storage 7110. Unlike the 7210, 7310, and 7410, this model does not currently include a write flash accelerator option, which would probably help MySQL read-write performance over iSCSI. Even so, we observed good performance over iSCSI, especially at higher thread counts. For example, using the same database and network configurations described earlier, MySQL/InnoDB/Sysbench throughput over iSCSI, at 32 to 64 active threads, was comparable to what was seen on the same server when it was attached via fiber channel to a traditional HW RAID array with 2GB memory for both Solaris and Linux.

Current Issues

MySQL read-write over Solaris NFS

MySQL read-only performance on Solaris NFS is currently excellent, but read-write performance is impacted by the lack of VMODSORT support in the Solaris NFS client (CR 6213799). This affects fsync() performance for large files using a default, buffered NFS mount. The normal workaround for this is to eliminate client side buffering of file data by mounting the file system with the forcedirectio option, or enabling directio on a per-file basis. For example, the MySQL/InnoDB option "innodb_flush_method = O_DIRECT" enables dirctio on the InnoDB tablespace files. That is likely to work fine on an appliance that includes write flash accelerators like the Sun Storage 7210 and higher, but write flash accelerators are not currently available on the Sun Storage 7110. For the 7110 without write flash accelerators, we did not see a MySQL read-write performance gain by using directio instead of default, buffered NFS file access.

For Solaris, MySQL read-write performance on ZFS over iSCSI currently exceeds its performance over buffered or directio enabled NFS on the Sun Storage 7110, provided that fast-write-ack is enabled on the iSCSI luns.

iSCSI fast-write-ack persistence

A Solaris system running ZFS over iSCSI can realize performance gains by enabling fast-write-ack on the iSCSI luns in the storage appliance, because ZFS is known to correctly issue SCSI cache flush commands when fsync() is called (that is not currently known to be true for any Linux file system). The fast-write-ack option can be activated by enabling the "write cache enabled" option in the appliance for a given iSCSI lun. However, due to CR 6843533 the write cache enabled setting on an iSCSI lun will be silently disabled any time an iscsi login occurs for a target, although the Unified Storage BUI and CLI will still indicate that it is enabled. Examples of iscsi target login triggers include a reboot of the Unified Storage appliance, a reboot of the client, a client "iscsiadm modify target-param -p ..." command to modify negotiated parameters, or an "iscsiadm remove ..." followed by an "iscsiadm add ..." for the affected target. The workaround for CR 6843533 is to manually disable and then reenable write cache following an iscsi target login.



For an NFS share that will be used to store MySQL tablespace files, match the record size of the share to the block size of the storage engine. For example, this should be 16k for InnoDB tablespace files. This can be configured on a per-share basis by setting the "Database record size" (in the BUI) or "recordsize" (in the CLI) for the share. This must be done before creating the tablespace files in the share.

For an NFS share that will store transaction logs or other non-tablespace files, the record size of the share can be left at its default of 128k.

For more information on Sun Storage 7000 Unified Storage network attached storage devices, including lots more performance data, check out these other blogs:

You will also find Sun Storage 7000 information on the new Sun System and Application Performance, Monitoring, and Tuning wiki, which includes performance information for a broad range of technologies.


We saw our best MySQL read-only and read-write performance on Linux using NFS. Read-only performance over iSCSI approached that of NFS, but read-write over NFS substantially outperformed iSCSI. Based on the read-write result, we recommend using NFS for MySQL on Linux. All of our testing on Linux used NFSv3, which was the default in the version of Linux we tested.


For the Sun Storage 7110 without write flash accelerators, we saw our best MySQL read-write performance on Solaris by running ZFS over iSCSI. This avoided the performance impact we would have seen with MySQL over NFS, due to the lack of VMODSORT support in the Solaris NFS client (CR 6213799). The data integrity guarantees of ZFS also allowed us to disable the InnoDB double write buffer, and ZFS cache management semantics allowed us to enable fast-write-ack on the iSCSI luns in the storage appliance. Until CR 6213799 has been addressed, we recommend using ZFS over iSCSI for MySQL on Solaris if you do not have write flash accelerators in your appliance. However, be sure to review the iSCSI fast-write-ack persistence discussion in the "Current Issues" section above if you use this feature.

For the Sun Storage 7210 and higher, write flash accelerators should benefit MySQL read-write performance over NFS using either the forcedirectio mount option or the "innodb_flush_method = O_DIRECT" MySQL/InnoDB option, and will also benefit MySQL read-write performance over iSCSI. We did not have an appliance with write flash accelerators available for this test, so at this time we can not recommend one configuration over the other.

Wednesday Apr 29, 2009

Performance Modeling Slides

Last week I gave a presentation titled "Predicting Performance with Queuing Models" at the MySQL Camp section of the MySQL User Conference. I have uploaded the slides for a longer version of the presentation than could have been covered in the 45 minute time slot, for anyone who would like a copy. I've also uploaded a version that includes speaker notes.

Presentation slides:

Tuesday Apr 21, 2009

Concurrent Commit Revisited

Today Sun announced the availability of MySQL 5.4, which contains a number of performance and scalability enhancements to the base MySQL 5.1 code. For the full run down on all of the enhancements, and the resulting scalability improvements, check out these blogs from Mikael Ronstrom and Allan Packer. You can also read Neelakanth Nadgir's blog, where he describes some performance fixes that were tried but rejected. As he says, sometimes you can learn a lot from the things that don't work out!

This release contains a range of improvements, both from the community (especially Google) and from work that has taken place in the MySQL performance team. As you can guess, we aren't stopping here, and there are a number of additional enhancements in the pipeline for future releases. Among those is a bug fix that is near and dear to my heart, and you can read some of the background in my blog post from last November. In essence, if you enable binlog logging, you currently lose concurrent commits in InnoDB, and this results in a loss of scalability. After languishing for far too long, an enhanced version of the fix I developed for this, and submitted to InnoDB in August 2008, is being tested for inclusion in a future release. The fix relates to a very small change in the use of the prepare_commit_mutex lock, which controls the order of commits to the binary log and the InnoDB logs, for the benefit of InnoDB hot backups (ibbackup). This small change results in a big improvement in scalability, as I demonstrated in my previous blog.

Based on a review by some of the top MySQL development engineers, including Mikael Ronstrom, Sergei Golubchik, and Mats Kindahl, the current plan is to introduce a new MySQL configuration option to enable variations on my fix. The new option, innodb_support_ibbackup, will accept values of 0, 1, or 2. These values will mean:

  • innodb_support_ibbackup=0 - no prepare_commit_mutex is ever used, which is unsafe for ibbackup but otherwise is ok and gives the best scalability.
  • innodb_support_ibbackup=1 - use my optimization of prepare_commit_mutex, which is safe for ibbackup, and dramatically reduces the performance impact.
  • innodb_support_ibbackup=2 - use the old behavior, which is safe for ibbackup, but breaks group commit and causes a huge scalability impact.
  • The default value is 1, which is the optimized use of the lock and will give good out of the box performance while still supporting ibbackup. If you know you don't use ibbackup, and don't have some other custom, in house requirement that unrelated commits in the binlog appear in the same order as commits in the InnoDB logs, you can improve scalability further by setting innodb_support_ibbackup to 0.

There are plenty of other performance and scalability enhancements that are already available in the new MySQL 5.4 release, so I urge you to check out the blogs by Mikael Ronstrom and Allan Packer for the full story. This is just one example of the additional improvements that are still in the pipeline!

Thursday Apr 09, 2009

Elements of Response Time

In a couple of weeks, I will be presenting a session on simple Queuing Models at the free MySQL Camp section of the MySQL User Conference. As luck would have it, I happened to come across a great example today of how a very simple back-of-the-envelope model can help us understand why a system performs the way it does.

My friend and co-worker, Tim Cook, has recently been doing some performance testing with the new Pool-of-Threads scheduler in MySQL 6.0. You can read about the results here and here on his blog.

As Tim showed in the first blog, with sysbench (the load driver) running on the same system as MySQL, throughput with the pool-of-threads scheduler was lower than with the default, thread-per-connection scheduler. However, with sysbench running on a remote system, accessing MySQL over the network (arguably a more realistic case), throughput with the two schedulers was quite similar. While reviewing response time data, it was noted that the ramping of response time (ie the response time curve) for pool-of-threads was very similar to that for thread-per-connection when communicating over the network. The question was, why?

As Sherlock Holmes would say, it's elementary! (pun intended)

In a closed queuing network, the lower bound on average response time, once you reach Nsat (the number of consumers at which some queuing is guaranteed to happen), is (N\*Dmax)-Z. Where N is the number of consumers, Dmax is the service demand at the bottleneck device, and Z is think time (the delay between requests from the same consumer).

If the response time curve is the same for both implementations, it suggests that Dmax is also the same and Nsat is comparable.

If you look at just CPU, network, and disk, you will have the following elements of response time:

  • The service demand at each device:
    • Dcpu = (%usr+%sys)/TPS
    • Ddisk = %busy/TPS
    • Dnet = (network utilization)/TPS
    • Note: TPS is Transactions Per Second reported by the application.
  • The total service demand:
    • D = Dcpu + Ddisk + Dnet
    The service demand at the bottleneck device:
    • Dmax = max(Dcpu, Ddisk, Dnet)
  • The number of consumers at which some queuing is guaranteed to occur:
    • Nsat = (D + Z)/Dmax
    • Note: Nsat marks the approximate knee in the throughput and response time curves. It is often written as N\*, but this can be confusing when written in plain text equations where "\*" indicates multiplication. Nsat may also be referred to as Nopt, meaning the optimal number of consumers in the system, because it marks the approximate point where throughput levels out and response time starts to climb.
  • The lower bound on average response time is then:
    • for N < Nsat: D
    • for N >= Nsat: (N\*Dmax)-Z

When sysbench was running locally with MySQL, Dmax was probably the CPU service demand, which differed between implementations. With sysbench remote, Dmax is probably now either Dnet or Ddisk. Some simple hand calculations will tell which, and that component will need to be addressed to reduce the average response time and increase the throughput.

I hope to see you at my MySQL Camp Session, at 2pm on Thursday, April 23rd, where we will discuss other uses of simple queuing models to answer questions about performance. MySQL Camp is free and you do not need to be registered for the main conference to attend, so drop by!

Friday Mar 06, 2009

MySQL Query Cache Sizing

I recently helped to analyze a performance issue with one of our MySQL customers, which highlighted the need to be cautious when sizing the query cache, and to validate whether it works well with your workload. The customer was evaluating a new 2-socket, 8 core, X64 system running Solaris, as a replication slave in an environment consisting of older systems running Linux. When they introduced the new slave, they found that it was unable to keep up with its replication load during their peak periods, and it took hours to catch up once the traffic dropped. The older systems running Linux had no trouble keeping up, so the obvious assumption was that there was something wrong with the new system, with Solaris, or with ZFS. In reality, the issue turned out to be an over sized MySQL query cache, and once that was addressed on the new system, it performed as expected.

The query cache is an interesting component of MySQL. It caches result sets of queries, which has the potential to improve performance by eliminating the entire parse/optimize/execute cycle for a repeated query. This comes at a cost though, and that cost tends to increase with the size of the cache. Due to the current implementation, the cost may also increase with the number of hardware threads and with the number of active connections, due to the cache's current coarse grained locking implementation. When the query cache helps, it can help a lot. When it hurts, it can hurt a lot. If you are currently using or considering using the query cache, you should definitely test your assumptions about how it interacts with your particular workload. You should also plan to revisit the topic from time to time, since changes to your application or workload may change the interactions with the query cache.

This particular case also highlights an issue related to migrations. If you are doing a "major" migration, for example between systems with different architectures, significantly different configurations, or, as in this case, with different operating systems, be sure to keep track of any other changes that are made along with the migration. It is natural to assume that any negative impacts are due to the big changes, but many times it will turn out to be a seemingly small change that causes grief. In this case, we knew that a major performance drop due to a Linux to Solaris migration was completely unexpected, which kept us from being led too far astray from finding the real root cause.

Now back to the query cache... In most cases, the query cache should be sized in the tens of megabytes and not larger. In the particular case we investigated, the customer had been running with a 20MB query cache on the older systems, and configured the new system with a 500MB query cache. This was presumably done because the new system had substantially more memory than the older systems, and it seemed to make sense to use some of that memory to boost the size of the query cache. However, due to a number of factors related to the implementation of the query cache, this actually hurt performance by a large margin. This was due to the level of overhead that was required to maintain the large cache, which was much higher than any benefit that it could provide.

To identify the problem, the customer generated a dtrace profile of mysqld with the DTraceToolkit "/opt/DTT/hotuser" script, or one similar to it. When piped through c++filt to demangle C++ names, the output looked like this:

Truncated to top 4`clear_lockbyte 361 1.2%`mutex_lock_impl 511 1.7%`mutex_trylock_adaptive 1127 3.8%
mysqld`Query_cache::insert_into_free_memory_sorted_list(Query_cache_block\*, Query_cache_block\*\*) 21702 73.1%

This shows that mysqld was spending 73% of its time in the Query_cache::insert_into_free_memory_sorted_list() method. As soon as this was seen, we knew that the issue was related to updating the query cache, and a quick comparison of configurations confirmed that this system had a dramatically larger query cache than the systems that were performing well. It is likely that plockstat would have shown substantial contention on the query cache structure_guard_mutex as well, but that wasn't checked. Decreasing the query cache size, by setting the query_cache_size tunable, resolved the problem instantly.

The issue here was that the customer had a moderate level of write traffic, and the current query cache implementation invalidates all result sets for a given table whenever that table is updated. As the query cache grows in size, the number of entries that must be invalidated for a given table may grow as well. In addition, the coarse locking on the cache can lead to lock contention that can kill performance, particularly on multi-core hardware.

This leads to some general recommendations:

  1. If you plan to use the query cache, start small, perhaps 10MB to 20MB, and monitor the impact on your query/transaction throughput and response time. If performance drops, the query cache may not be a good match for your workload.
  2. Consider the read/write mix of your workload, as well as the rate of data updates. Query cache invalidations due to write traffic can be quite expensive, and even moderate levels can counteract the benefit of the cache.
  3. Verify whether the query cache is enabled, and if so, verify its currently configured size. You can do this by running "SHOW VARIABLES LIKE '%query_cache%';".  If query_cache_size (which defines the size of the cache, in bytes) is non-zero, and have_query_cache is TRUE, the cache is active. If query_cache_size is set to zero or have_query_cache is FALSE, the query cache is disabled.
    • query_cache_size defaults to zero (ie disabled) on UNIX/Linux, but is non-zero in some of the example my.cnf files included with the MySQL distribution and on Windows.
    • You can change query_cache_size dynamically via "SET GLOBAL query_cache_size = XXX;" or permanently by adding a "query_cache_size = XXX" statement to your my.cnf file.
  4. Verify your query cache statistics via "SHOW GLOBAL STATUS;" for example
    • Hit rate = Qcache_hits / (Qcache_hits + Com_select)
    • Insert rate = Qcache_inserts / (Qcache_hits + Com_select)
    • Prune rate = Qcache_lowmem_prunes / Qcache_inserts
      • This is an approximation of the number of inserts that were too large for available free memory in the cache.
      • Updates to house keeping data may also result in a prune, so this is not an exact correlation.

You should try to allow all of the caches on your system to warm up before checking your cache statistics. This means allowing your application to run for at least 10 to 30 minutes, which is usually enough time for the system to reach a steady state.

When looking at hit rates, keep in mind that relatively low query cache hit rates may provide substantial benefit due to the potentially high cost of generating a result set for a query. For example, we usually try to maintain InnoDB buffer pool hit rates in the high 90% range, while hit rates below 50% in the query cache may be fine. On the other hand, if you can't get hit rates out of the single digits, you should really question the benefit of enabling the cache, since looking in the cache adds overhead to every select. In the end, query/transaction throughput and response time should be your ultimate indicator.

When looking at insert rates, keep in mind that relatively low insert rates may have a substantial negative impact on performance due to the high cost of updating the cache. I would be cautious of anything higher than single digit insert rates, and test the application with the query cache disabled (by setting query_cache_size to zero) to see if the cache is helping or hurting performance. However, keep in mind that a high prune rate might result in the need to repeatedly insert the same result set in the cache, so you need to consider that as well.

A non-zero prune rate may be an indication that you should increase the size of your query cache. However, keep in mind that the overhead of maintaining the cache is likely to increase with its size, so do this in small increments and monitor the result. If you need to dramatically increase the size of the cache to eliminate prunes, there is a good chance that your workload is not a good match for the query cache.

As you consider how your workload interacts with the query cache, keep in mind that you can use the SQL_NO_CACHE hint in a select statement if you find that a specific query has a tendency to negatively interact with the query cache.

As I said earlier, when the query cache helps, it can help a lot, but when it hurts, it can hurt a lot. It all depends on your particular workload, so it requires testing and is not something you should just blindly enable. In the end, the goal is to increase throughout and/or reduce response time, so those are the metrics you should ultimately consider. For more information about configuring and tuning the query cache, see the Query Cache section of the MySQL documentation. There is also an excellent write up in the book "High Performance MySQL" by Baron Schwartz et al, that touches on a number of issues I haven't mentioned.

I'd like to acknowledge my colleague Simon Vobes, who did a great job working with the customer to resolve this performance issue!

Tuesday Nov 04, 2008

Toward a More Scalable MySQL Replication Master

If you are a MySQL 5.x/6.0 InnoDB replication user, right now you take a significant performance hit on the replication master simply by turning on the binlog. The good news is that we've taken a big step toward eliminating that performance gap. I'll describe the problem, how I was able to track down the root cause, and point to a patch that fixes the problem. Since the changes are in the InnoDB code, right now we're waiting on Oracle/Innobase to review the fix and formally commit it. Once that happens, you should see it show up in binary releases. In the meantime, if you build your own binaries you can test the prototype patch yourself.

One of the things I have been working on quite a bit over the past several months is scalability of the nodes within a MySQL scale-out replication environment.  The reason being that there has been a rapid increase in CPU cores and hardware thread counts per server in recent years, across all system types.  Multi-core processors are now the norm, and regardless of the processor family you choose, each node in your environment will soon have dozens or hundreds of hardware threads available.  This trend is causing the notion of horizontal scaling (scale out) to be replaced by that of diagonal scaling (scale out and up), where a horizontally scaled architecture is made up of vertically scaled nodes.  This provides for increased performance while maintaining or even reducing the number of required nodes.  The key to this trend is that we need software that can scale both vertically and horizontally.  We also need to avoid negative scaling on multi-core systems, which refers to a drop in performance as thread counts increase, and is often caused by coarse grained locking in the code.

With this in mind, I set out to assess and improve the vertical scalability of nodes in a MySQL scale-out replication environment.  The first step was to analyze scalability of the master node, which accepts update requests from clients, applies them to the master copy of the database, then makes the updates available to slave nodes by way of the binlog file. For the purposes of this exercise, I will claim that enabling binlog logging is the primary differentiator between a replication master and a "regular" MySQL database server (yes, I know that binlog logging can also be used to support backups).  The first step in my analysis was to analyze the scalability impact of enabling binlog logging.

The basic methodology that I used to assess the impact of binlog logging was to drive a multi-threaded load against MySQL both with and without binlog logging enabled, gradually increasing the number of threads driving the load in each case.  Throughput was measured in transactions per second, and the chosen workload was a sysbench OLTP test.  This is a fairly simplistic benchmark, but has proven to be effective in this case.  The results of this initial test are shown below.

What the graph above shows is the difference in throughput between a baseline version of MySQL and the same version with binlog logging enabled. The baseline shown here is from a patched version of MySQL 6.0.5-alpha, which is roughly equivalent to 6.0.7-alpha. The results through 32 threads are similar to those I have obtained using 5.1.28-rc. The system used was a Sun SPARC Enterprise T5240, and psradm/psrset commands were used to allocate 2 cores (16 threads) to sysbench and 6 cores (48 threads) to mysqld. The remaining cores were taken offline. I want to mention that I did not see continued scaling beyond 48 sysbench threads, and have truncated the data as shown in order to focus on a specific scalability issue.

The striking result of this test was that, with the binlog enabled, both throughput and scalability were much worse than without binlog logging enabled.  Increasing the number of load generator threads did not result in any increase in throughput beyond 12 threads.  In order to find out why, I needed to determine what resource became a bottleneck as we approached and exceeded 12 load generator threads.  Since lock contention is a common reason for lack of scalability in a multi-threaded application, I turned next to the Solaris plockstat utility.  This dtrace consumer often makes it extremely easy, and sometimes quite obvious, to identify the root cause of a scalability issue in a multi-threaded program.  A sample plockstat command and the output from a run against MySQL 6.0.5-alpha with 32 sysbench threads is shown below.  The output has been truncated to save space.

plockstat -C -n 10 -e 5 -x bufsize=10k -x aggsize=2m -p 3491 | c++filt

Mutex block
Count     nsec Lock                         Caller
 2038 54747571 mysqld`$XAJzoBKpX4GIETk.prepare_commit_mutex mysqld`int innobase_xa_prepare(handlerton\*,THD\*,bool)+0x4c
  282   447655 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event\*)+0xa4
   83   471277 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event\*)+0xa4

The plockstat options used above include:

-Clook for contention events
-n 10top 10 locks
-e 5sample for 5 seconds
-x bufsize=10k
-x aggsize=2m
increase size of buffers in dtrace due to the large number of events
-p 3491monitor process 3491, which was mysqld

Note that the output from plockstat can include C++ mangled names, so c++filt, which is part of SunStudio, was used to convert the mangled names back to original code.

The output above shows that the hottest lock is prepare_commit_mutex and the hottest caller is innobase_xa_prepare().  Searching through the code for this mutex and function call showed that the mutex was acquired in innobase_xa_prepare() and released in innobase_commit(). This code is located in storage/innobase/handler/ in 5.1 and 6.0 source distributions.  A Google search showed that the code was added during the 4.1 to 5.0 upgrade, as part of XA transaction support.  Incidentally, this specific code section is a very tiny part of a major code upgrade.

Figuring out what happens in MySQL from the time the prepare_commit_mutex lock is acquired until it is released was an interesting exercise, and I may talk about some of the techniques I used in future blogs.  As you can guess, our old friend dtrace came in pretty handy, but I also did a lot of good old fashioned code reading and code diffing between 4.1 and 5.0 code.  My main goal was to determine how things worked in 4.1, what had changed in 5.0, whether we could restore the 4.1 behavior, and whether that would improve scalability.

The first question to ask was, what is the purpose of the prepare_commit_mutex lock?  In essence, what is this lock protecting? Comments in the code indicate that the lock is there to ensure that commits to the binlog file occur in the same order as commits to the InnoDB logs, and that this is required for InnoDB hot backup. Subsequent discussions with the MySQL Replication team also indicate that this ordering is required for replication, so that commits on the slave will occur in the same order as commits on the master.

One of the first things I noticed about the prepare_commit_mutex lock is that it is acquired before the InnoDB prepare phase takes place. With innodb_flush_log_at_trx_commit set to 1 in my.cnf, this includes a synchronous flush of the InnoDB logs, which means that this will be a very long time to hold a lock.  The lock is then held while the binlog file is committed, which will also include a synchronous flush if sync_binlog is set to 1.  It is finally released after the InnoDB commit phase, which will include another synchronous flush of the InnoDB logs if innodb_flush_log_at_trx_commit is set to 1.  This means that we will have two or possibly three synchronous flushes to disk while holding this lock.  This results in an extremely long hold time on the lock, and is the main reason that the lock becomes so hot and puts a ceiling on scalability.

So what could we do about it?  The goal was to reduce the hold time on the lock by acquiring it later, releasing it earlier, or both.  To track this effort, I opened MySQL bug#38501 (hold times on prepare_commit_mutex limit read/write scalability).  This was later determined to be a duplicate of bug#13669 (Group commit is broken in 5.0).

The first improvement I tried was to acquire the lock at the end of the innobase_xa_prepare() method instead of at the beginning, so the lock was not held during the prepare phase.  This resulted in a 20%+ throughput improvement at 32 threads.  There has been a lot of discussion about whether this is a valid change, and at this point I believe that it will be found to be ok.  The replication team has indicated that replication has no ordering requirement between the prepare phase and the binlog commit, and I am waiting for feedback from Oracle/Innobase about the constraints for InnoDB hot backups.

The second improvement was to restore the 4.1 behavior for synchronous flushes of the InnoDB logs at commit time.  In the 5.0 code, the lock is held while the logs are both written and flushed.  In the 4.1 code, the logs were written but not flushed, the protective lock was released, then the logs were flushed to disk.  This meant the lock was not held during the flush to disk, allowing for increased scalability. This was the basis for "group commit" behavior in 4.1.  As it turns out, this behavior can be returned to MySQL with relatively few changes to the code, utilizing methods that are already present.  This change, combined with the first, resulted in more than double the throughput at high thread counts, as shown below.

The graph above is the same as the one shown earlier, but with a new set of data generated using a prototype patch for bug#38501 and bug#13669.  This shows that with the prototype patch, the impact of enabling binlog logging is dramatically reduced.  At high thread counts the throughput is more than double the Baseline+Binlog number, and is only about 9% below the throughput without binlog enabled.  This is exactly the kind of improvement I was hoping to achieve!

So what's next?  The prototype patch for this issue still needs to go through formal code review, but I am hopeful that it will require only minor changes.  Based on conversations with the MySQL Replication team, I believe that the basic changes retain the functionality that is required for replication.  Oracle/Innobase also needs to review the functionality related to InnoDB hot backups, and hopefully that will happen in the near future.

If you would like to try out the patch yourself, you can download the current prototype here and apply it to a 5.1 or 6.0 source distribution. Follow the regular process to install MySQL from a source distribution, for example these instructions for a 5.1 install, but apply the source code patch before the configure/make steps. For example, if you saved the patch to a file named "patchfile" you would apply it like this:

shell> gunzip < mysql-VERSION.tar.gz | tar -xvf -
shell> cd mysql-VERSION
shell> patch -p0 < patchfile

Keep in mind that this is a prototype patch, and should be considered pre-alpha code.

Incidentally, the work I have described here primarily helps the case where innodb_flush_log_at_trx_commit is set to 1 and we synchronously flush the InnoDB logs at the end of the prepare and commit phases of each transaction.  It does not fully address the case where sync_binlog is set to 1 and we also synchronously flush the binlog file for each transaction, although you will get some scalability improvement even then.  The sync_binlog case is still under investigation, and is certainly solvable.  We are also looking at a number of other areas for increased scalability of both the master and slave nodes.




« August 2016