Tuesday Nov 04, 2008

Toward a More Scalable MySQL Replication Master

If you are a MySQL 5.x/6.0 InnoDB replication user, right now you take a significant performance hit on the replication master simply by turning on the binlog. The good news is that we've taken a big step toward eliminating that performance gap. I'll describe the problem, how I was able to track down the root cause, and point to a patch that fixes the problem. Since the changes are in the InnoDB code, right now we're waiting on Oracle/Innobase to review the fix and formally commit it. Once that happens, you should see it show up in binary releases. In the meantime, if you build your own binaries you can test the prototype patch yourself.

One of the things I have been working on quite a bit over the past several months is scalability of the nodes within a MySQL scale-out replication environment.  The reason being that there has been a rapid increase in CPU cores and hardware thread counts per server in recent years, across all system types.  Multi-core processors are now the norm, and regardless of the processor family you choose, each node in your environment will soon have dozens or hundreds of hardware threads available.  This trend is causing the notion of horizontal scaling (scale out) to be replaced by that of diagonal scaling (scale out and up), where a horizontally scaled architecture is made up of vertically scaled nodes.  This provides for increased performance while maintaining or even reducing the number of required nodes.  The key to this trend is that we need software that can scale both vertically and horizontally.  We also need to avoid negative scaling on multi-core systems, which refers to a drop in performance as thread counts increase, and is often caused by coarse grained locking in the code.

With this in mind, I set out to assess and improve the vertical scalability of nodes in a MySQL scale-out replication environment.  The first step was to analyze scalability of the master node, which accepts update requests from clients, applies them to the master copy of the database, then makes the updates available to slave nodes by way of the binlog file. For the purposes of this exercise, I will claim that enabling binlog logging is the primary differentiator between a replication master and a "regular" MySQL database server (yes, I know that binlog logging can also be used to support backups).  The first step in my analysis was to analyze the scalability impact of enabling binlog logging.

The basic methodology that I used to assess the impact of binlog logging was to drive a multi-threaded load against MySQL both with and without binlog logging enabled, gradually increasing the number of threads driving the load in each case.  Throughput was measured in transactions per second, and the chosen workload was a sysbench OLTP test.  This is a fairly simplistic benchmark, but has proven to be effective in this case.  The results of this initial test are shown below.

What the graph above shows is the difference in throughput between a baseline version of MySQL and the same version with binlog logging enabled. The baseline shown here is from a patched version of MySQL 6.0.5-alpha, which is roughly equivalent to 6.0.7-alpha. The results through 32 threads are similar to those I have obtained using 5.1.28-rc. The system used was a Sun SPARC Enterprise T5240, and psradm/psrset commands were used to allocate 2 cores (16 threads) to sysbench and 6 cores (48 threads) to mysqld. The remaining cores were taken offline. I want to mention that I did not see continued scaling beyond 48 sysbench threads, and have truncated the data as shown in order to focus on a specific scalability issue.

The striking result of this test was that, with the binlog enabled, both throughput and scalability were much worse than without binlog logging enabled.  Increasing the number of load generator threads did not result in any increase in throughput beyond 12 threads.  In order to find out why, I needed to determine what resource became a bottleneck as we approached and exceeded 12 load generator threads.  Since lock contention is a common reason for lack of scalability in a multi-threaded application, I turned next to the Solaris plockstat utility.  This dtrace consumer often makes it extremely easy, and sometimes quite obvious, to identify the root cause of a scalability issue in a multi-threaded program.  A sample plockstat command and the output from a run against MySQL 6.0.5-alpha with 32 sysbench threads is shown below.  The output has been truncated to save space.

plockstat -C -n 10 -e 5 -x bufsize=10k -x aggsize=2m -p 3491 | c++filt

Mutex block
Count     nsec Lock                         Caller
 2038 54747571 mysqld`$XAJzoBKpX4GIETk.prepare_commit_mutex mysqld`int innobase_xa_prepare(handlerton\*,THD\*,bool)+0x4c
  282   447655 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event\*)+0xa4
   83   471277 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event\*)+0xa4

The plockstat options used above include:

-Clook for contention events
-n 10top 10 locks
-e 5sample for 5 seconds
-x bufsize=10k
-x aggsize=2m
increase size of buffers in dtrace due to the large number of events
-p 3491monitor process 3491, which was mysqld

Note that the output from plockstat can include C++ mangled names, so c++filt, which is part of SunStudio, was used to convert the mangled names back to original code.

The output above shows that the hottest lock is prepare_commit_mutex and the hottest caller is innobase_xa_prepare().  Searching through the code for this mutex and function call showed that the mutex was acquired in innobase_xa_prepare() and released in innobase_commit(). This code is located in storage/innobase/handler/ha_innodb.cc in 5.1 and 6.0 source distributions.  A Google search showed that the code was added during the 4.1 to 5.0 upgrade, as part of XA transaction support.  Incidentally, this specific code section is a very tiny part of a major code upgrade.

Figuring out what happens in MySQL from the time the prepare_commit_mutex lock is acquired until it is released was an interesting exercise, and I may talk about some of the techniques I used in future blogs.  As you can guess, our old friend dtrace came in pretty handy, but I also did a lot of good old fashioned code reading and code diffing between 4.1 and 5.0 code.  My main goal was to determine how things worked in 4.1, what had changed in 5.0, whether we could restore the 4.1 behavior, and whether that would improve scalability.

The first question to ask was, what is the purpose of the prepare_commit_mutex lock?  In essence, what is this lock protecting? Comments in the code indicate that the lock is there to ensure that commits to the binlog file occur in the same order as commits to the InnoDB logs, and that this is required for InnoDB hot backup. Subsequent discussions with the MySQL Replication team also indicate that this ordering is required for replication, so that commits on the slave will occur in the same order as commits on the master.

One of the first things I noticed about the prepare_commit_mutex lock is that it is acquired before the InnoDB prepare phase takes place. With innodb_flush_log_at_trx_commit set to 1 in my.cnf, this includes a synchronous flush of the InnoDB logs, which means that this will be a very long time to hold a lock.  The lock is then held while the binlog file is committed, which will also include a synchronous flush if sync_binlog is set to 1.  It is finally released after the InnoDB commit phase, which will include another synchronous flush of the InnoDB logs if innodb_flush_log_at_trx_commit is set to 1.  This means that we will have two or possibly three synchronous flushes to disk while holding this lock.  This results in an extremely long hold time on the lock, and is the main reason that the lock becomes so hot and puts a ceiling on scalability.

So what could we do about it?  The goal was to reduce the hold time on the lock by acquiring it later, releasing it earlier, or both.  To track this effort, I opened MySQL bug#38501 (hold times on prepare_commit_mutex limit read/write scalability).  This was later determined to be a duplicate of bug#13669 (Group commit is broken in 5.0).

The first improvement I tried was to acquire the lock at the end of the innobase_xa_prepare() method instead of at the beginning, so the lock was not held during the prepare phase.  This resulted in a 20%+ throughput improvement at 32 threads.  There has been a lot of discussion about whether this is a valid change, and at this point I believe that it will be found to be ok.  The replication team has indicated that replication has no ordering requirement between the prepare phase and the binlog commit, and I am waiting for feedback from Oracle/Innobase about the constraints for InnoDB hot backups.

The second improvement was to restore the 4.1 behavior for synchronous flushes of the InnoDB logs at commit time.  In the 5.0 code, the lock is held while the logs are both written and flushed.  In the 4.1 code, the logs were written but not flushed, the protective lock was released, then the logs were flushed to disk.  This meant the lock was not held during the flush to disk, allowing for increased scalability. This was the basis for "group commit" behavior in 4.1.  As it turns out, this behavior can be returned to MySQL with relatively few changes to the code, utilizing methods that are already present.  This change, combined with the first, resulted in more than double the throughput at high thread counts, as shown below.

The graph above is the same as the one shown earlier, but with a new set of data generated using a prototype patch for bug#38501 and bug#13669.  This shows that with the prototype patch, the impact of enabling binlog logging is dramatically reduced.  At high thread counts the throughput is more than double the Baseline+Binlog number, and is only about 9% below the throughput without binlog enabled.  This is exactly the kind of improvement I was hoping to achieve!

So what's next?  The prototype patch for this issue still needs to go through formal code review, but I am hopeful that it will require only minor changes.  Based on conversations with the MySQL Replication team, I believe that the basic changes retain the functionality that is required for replication.  Oracle/Innobase also needs to review the functionality related to InnoDB hot backups, and hopefully that will happen in the near future.

If you would like to try out the patch yourself, you can download the current prototype here and apply it to a 5.1 or 6.0 source distribution. Follow the regular process to install MySQL from a source distribution, for example these instructions for a 5.1 install, but apply the source code patch before the configure/make steps. For example, if you saved the patch to a file named "patchfile" you would apply it like this:

shell> gunzip < mysql-VERSION.tar.gz | tar -xvf -
shell> cd mysql-VERSION
shell> patch -p0 < patchfile

Keep in mind that this is a prototype patch, and should be considered pre-alpha code.

Incidentally, the work I have described here primarily helps the case where innodb_flush_log_at_trx_commit is set to 1 and we synchronously flush the InnoDB logs at the end of the prepare and commit phases of each transaction.  It does not fully address the case where sync_binlog is set to 1 and we also synchronously flush the binlog file for each transaction, although you will get some scalability improvement even then.  The sync_binlog case is still under investigation, and is certainly solvable.  We are also looking at a number of other areas for increased scalability of both the master and slave nodes.




« April 2014