Toward a More Scalable MySQL Replication Master

If you are a MySQL 5.x/6.0 InnoDB replication user, right now you take a significant performance hit on the replication master simply by turning on the binlog. The good news is that we've taken a big step toward eliminating that performance gap. I'll describe the problem, how I was able to track down the root cause, and point to a patch that fixes the problem. Since the changes are in the InnoDB code, right now we're waiting on Oracle/Innobase to review the fix and formally commit it. Once that happens, you should see it show up in binary releases. In the meantime, if you build your own binaries you can test the prototype patch yourself.

One of the things I have been working on quite a bit over the past several months is scalability of the nodes within a MySQL scale-out replication environment.  The reason is that there has been a rapid increase in CPU cores and hardware thread counts per server in recent years, across all system types.  Multi-core processors are now the norm, and regardless of the processor family you choose, each node in your environment will soon have dozens or hundreds of hardware threads available.  This trend is causing the notion of horizontal scaling (scale out) to be replaced by that of diagonal scaling (scale out and up), where a horizontally scaled architecture is made up of vertically scaled nodes.  This provides for increased performance while maintaining or even reducing the number of required nodes.  The key to this trend is that we need software that can scale both vertically and horizontally.  We also need to avoid negative scaling on multi-core systems, which refers to a drop in performance as thread counts increase, and is often caused by coarse-grained locking in the code.

With this in mind, I set out to assess and improve the vertical scalability of nodes in a MySQL scale-out replication environment.  I began with the master node, which accepts update requests from clients, applies them to the master copy of the database, and then makes the updates available to slave nodes by way of the binlog file. For the purposes of this exercise, I will claim that enabling binlog logging is the primary differentiator between a replication master and a "regular" MySQL database server (yes, I know that binlog logging can also be used to support backups).  The first step in my analysis was therefore to measure the scalability impact of enabling binlog logging.

The basic methodology that I used to assess the impact of binlog logging was to drive a multi-threaded load against MySQL both with and without binlog logging enabled, gradually increasing the number of threads driving the load in each case.  Throughput was measured in transactions per second, and the chosen workload was a sysbench OLTP test.  This is a fairly simplistic benchmark, but has proven to be effective in this case.  The results of this initial test are shown below.

What the graph above shows is the difference in throughput between a baseline version of MySQL and the same version with binlog logging enabled. The baseline shown here is from a patched version of MySQL 6.0.5-alpha, which is roughly equivalent to 6.0.7-alpha. The results through 32 threads are similar to those I have obtained using 5.1.28-rc. The system used was a Sun SPARC Enterprise T5240, and psradm/psrset commands were used to allocate 2 cores (16 threads) to sysbench and 6 cores (48 threads) to mysqld. The remaining cores were taken offline. I want to mention that I did not see continued scaling beyond 48 sysbench threads, and have truncated the data as shown in order to focus on a specific scalability issue.

The striking result of this test was that, with the binlog enabled, both throughput and scalability were much worse than without binlog logging enabled.  Increasing the number of load generator threads did not result in any increase in throughput beyond 12 threads.  In order to find out why, I needed to determine what resource became a bottleneck as we approached and exceeded 12 load generator threads.  Since lock contention is a common reason for lack of scalability in a multi-threaded application, I turned next to the Solaris plockstat utility.  This DTrace consumer often makes it extremely easy, and sometimes quite obvious, to identify the root cause of a scalability issue in a multi-threaded program.  A sample plockstat command and the output from a run against MySQL 6.0.5-alpha with 32 sysbench threads is shown below.  The output has been truncated to save space.

plockstat -C -n 10 -e 5 -x bufsize=10k -x aggsize=2m -p 3491 | c++filt

Mutex block
Count     nsec Lock                         Caller
 2038 54747571 mysqld`$XAJzoBKpX4GIETk.prepare_commit_mutex mysqld`int innobase_xa_prepare(handlerton*,THD*,bool)+0x4c
  282   447655 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event*)+0xa4
   83   471277 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event*)+0xa4

The plockstat options used above include:

-C              look for contention events
-n 10           report the top 10 locks
-e 5            sample for 5 seconds
-x bufsize=10k
-x aggsize=2m   increase the size of the DTrace buffers due to the large number of events
-p 3491         monitor process 3491, which was mysqld

Note that the output from plockstat can include mangled C++ names, so c++filt, which is part of Sun Studio, was used to convert the mangled names back to the original identifiers.

The output above shows that the hottest lock is prepare_commit_mutex and the hottest caller is innobase_xa_prepare().  Searching through the code for this mutex and function call showed that the mutex was acquired in innobase_xa_prepare() and released in innobase_commit(). This code is located in storage/innobase/handler/ in 5.1 and 6.0 source distributions.  A Google search showed that the code was added during the 4.1 to 5.0 upgrade, as part of XA transaction support.  Incidentally, this specific code section is a very tiny part of a major code upgrade.

Figuring out what happens in MySQL from the time the prepare_commit_mutex lock is acquired until it is released was an interesting exercise, and I may talk about some of the techniques I used in future blogs.  As you can guess, our old friend dtrace came in pretty handy, but I also did a lot of good old fashioned code reading and code diffing between 4.1 and 5.0 code.  My main goal was to determine how things worked in 4.1, what had changed in 5.0, whether we could restore the 4.1 behavior, and whether that would improve scalability.

The first question to ask was, what is the purpose of the prepare_commit_mutex lock?  In essence, what is this lock protecting? Comments in the code indicate that the lock is there to ensure that commits to the binlog file occur in the same order as commits to the InnoDB logs, and that this is required for InnoDB hot backup. Subsequent discussions with the MySQL Replication team also indicate that this ordering is required for replication, so that commits on the slave will occur in the same order as commits on the master.

One of the first things I noticed about the prepare_commit_mutex lock is that it is acquired before the InnoDB prepare phase takes place. With innodb_flush_log_at_trx_commit set to 1 in my.cnf, the prepare phase includes a synchronous flush of the InnoDB logs, which means the lock can be held for a very long time.  The lock is then held while the binlog file is committed, which also includes a synchronous flush if sync_binlog is set to 1.  It is finally released after the InnoDB commit phase, which includes another synchronous flush of the InnoDB logs if innodb_flush_log_at_trx_commit is set to 1.  This means we can have two or possibly three synchronous flushes to disk while holding the lock.  The resulting extremely long hold time is the main reason that the lock becomes so hot and puts a ceiling on scalability.

So what could we do about it?  The goal was to reduce the hold time on the lock by acquiring it later, releasing it earlier, or both.  To track this effort, I opened MySQL bug#38501 (hold times on prepare_commit_mutex limit read/write scalability).  This was later determined to be a duplicate of bug#13669 (Group commit is broken in 5.0).

The first improvement I tried was to acquire the lock at the end of the innobase_xa_prepare() method instead of at the beginning, so the lock was not held during the prepare phase.  This resulted in a 20%+ throughput improvement at 32 threads.  There has been a lot of discussion about whether this is a valid change, and at this point I believe that it will be found to be ok.  The replication team has indicated that replication has no ordering requirement between the prepare phase and the binlog commit, and I am waiting for feedback from Oracle/Innobase about the constraints for InnoDB hot backups.

The second improvement was to restore the 4.1 behavior for synchronous flushes of the InnoDB logs at commit time.  In the 5.0 code, the lock is held while the logs are both written and flushed.  In the 4.1 code, the logs were written but not flushed, the protective lock was released, then the logs were flushed to disk.  This meant the lock was not held during the flush to disk, allowing for increased scalability. This was the basis for "group commit" behavior in 4.1.  As it turns out, this behavior can be returned to MySQL with relatively few changes to the code, utilizing methods that are already present.  This change, combined with the first, resulted in more than double the throughput at high thread counts, as shown below.

The graph above is the same as the one shown earlier, but with a new set of data generated using a prototype patch for bug#38501 and bug#13669.  This shows that with the prototype patch, the impact of enabling binlog logging is dramatically reduced.  At high thread counts the throughput is more than double the Baseline+Binlog number, and is only about 9% below the throughput without binlog enabled.  This is exactly the kind of improvement I was hoping to achieve!

So what's next?  The prototype patch for this issue still needs to go through formal code review, but I am hopeful that it will require only minor changes.  Based on conversations with the MySQL Replication team, I believe that the basic changes retain the functionality that is required for replication.  Oracle/Innobase also needs to review the functionality related to InnoDB hot backups, and hopefully that will happen in the near future.

If you would like to try out the patch yourself, you can download the current prototype here and apply it to a 5.1 or 6.0 source distribution. Follow the regular process to install MySQL from a source distribution, for example these instructions for a 5.1 install, but apply the source code patch before the configure/make steps. For example, if you saved the patch to a file named "patchfile" you would apply it like this:

shell> gunzip < mysql-VERSION.tar.gz | tar -xvf -
shell> cd mysql-VERSION
shell> patch -p0 < patchfile

Keep in mind that this is a prototype patch, and should be considered pre-alpha code.

Incidentally, the work I have described here primarily helps the case where innodb_flush_log_at_trx_commit is set to 1 and we synchronously flush the InnoDB logs at the end of the prepare and commit phases of each transaction.  It does not fully address the case where sync_binlog is set to 1 and we also synchronously flush the binlog file for each transaction, although you will get some scalability improvement even then.  The sync_binlog case is still under investigation, and is certainly solvable.  We are also looking at a number of other areas for increased scalability of both the master and slave nodes.


Had you read High Performance MySQL, 2nd Edition, you would have been able to skip all this analysis. Peter, Vadim and others at Percona (who are the authors) did this analysis more than THREE YEARS ago, as proven by the bug report yours was a duplicate of. And we have been blogging on it for YEARS. This is not news.

I both want to chastise MySQL/InnoDB for their utter lack of attention to this, and praise Sun for finally working on it!

Posted by Baron Schwartz on November 10, 2008 at 10:03 PM PST #

Let me redact my comment a little -- "skip all this analysis" is not meant to trivialize your research. What I really meant was that if anyone had ever wanted to know where to start to improve the scalability of InnoDB with binlogging enabled in 5.0, all they had to do was ask "what should we work on?"

Posted by Baron Schwartz on November 10, 2008 at 10:12 PM PST #


Thanks for the feedback, and especially the redaction. I think there was a backhanded compliment in there somewhere... ;-)

You are right of course that the issue resulting in the first graph turned out not to be new, and Peter and others have written about it in the past. Peter filed the original bug, and has also provided some interesting feedback on the proposed fix. Comments by Heikki in the older bug report also suggest that he had an idea, in general, how to fix the problem quite a long time ago. However, I think it makes an interesting case study, especially for someone like me who came at it from a cold start. Also, since I was the first one to take the time and effort to be able to generate the second graph, I think it was worth a new blog! ;-)

Posted by David Lutz on November 11, 2008 at 12:51 PM PST #

Something is not clear to me on this. We are using 5.0.60sp1, and it is not clear to me whether this fix is included at that level. Also, from reading the bugs you hyperlinked, I see comments to the effect that it is not clear if the order of actions within a transaction needs to be maintained in the binlog. Did I misunderstand here? Here is my concern: within a single InnoDB transaction there are several updates to many rows in several tables, the last of which is the indicator that all previous composite changes comprising the whole transaction were completed. This occurs, properly, on the active master. The client code runs against the slaves, and it appears the client is seeing that final table change before all of the individual changes have been processed. This tells me two things: the temporal order of row changes within a transaction across a replicated link is not honoring the order of row changes within the original transaction, and the notion of a multi-statement transaction (a normal transaction vs. a stmt transaction?) is not honored across replication. As this seems to relate to this blog, I wonder if you could comment on this for me.



Posted by Walt Corey on January 26, 2010 at 10:58 PM PST #


The changes related to bug#13669 and bug#38501 went into the MySQL 5.4 development release and the InnoDB Plugin 1.0.4. If you are running 5.0.60sp1, you do not have these changes unless you have explicitly switched to InnoDB Plugin 1.0.4 or later.

To answer your question about transaction commit, it looks like you misunderstood the comments in the bug reports. There was never any question about the ordering of changes within a transaction, or whether a slave could see some parts of a transaction before others. A transaction is an all or nothing event by definition, and no part of it is visible until all of it is visible. There was a question about the ordering of prepares for two different commits, but it was determined that their order doesn't matter. The prepares are invisible to each other until the commit is complete, so their order is irrelevant.

Posted by David Lutz on February 01, 2010 at 05:46 AM PST #
