MySQL 5.6: InnoDB scalability fix – Kernel mutex removed
By Calvin Sun on Apr 11, 2011
Note: this article was originally published on http://blogs.innodb.com on April 11, 2011 by Sunny Bains.
For those interested in InnoDB internals, this post explains why the global kernel mutex was required, describes the new mutexes and rw-locks that replace it, and outlines the long-term benefits of this change.
InnoDB’s core sub-systems up to v5.5 are protected by a single global mutex called the kernel mutex. This makes it difficult to apply even common-sense optimisations. In the past we tried optimising the code, but doing so would invariably upset the delicate balance achieved by tuning the code around the global kernel mutex, leading to unexpected performance regressions. The kernel mutex is also abused in several places to cover operations unrelated to the core sub-systems, e.g., some counters in the server thread main loop.
The InnoDB core sub-systems are:
- The Locking sub-system
- The Transaction sub-system
- MVCC views
Any state change in the above sub-systems required acquiring the kernel mutex, which reduced concurrency and made the kernel mutex very highly contended. A transaction that was creating a lock would end up blocking read view creation (for MVCC) as well as transaction start and commit/rollback. With the finer-granularity mutexes and rw-locks, a transaction that is creating a lock will not block transaction start or commit/rollback. MVCC read view creation will still block transaction start and commit/rollback because of the shared trx_sys_t::trx_list, but MVCC read view creations will not block each other, because each acquires the rw-lock in shared (S) mode.
In 5.6 the global kernel mutex has been further split into several localised mutexes. The important ones are:
- Transactions and MVCC views: trx_sys_t::lock (rw_lock) and trx_t::mutex
- Locking: lock_sys_t::mutex and lock_sys_t::wait_mutex
This change is significant from an architectural perspective and most of the effort has gone into proving that the new design is correct. Splitting a global mutex into several independent mutexes should improve performance by increasing concurrency. The downside of course is that there is probably going to be a little more context switching with finer grained mutexes. However, now we have far greater freedom in making localised changes to further speed up the sub-systems independently without worrying about the global state and a global mutex.
InnoDB was originally designed to support multiple query threads per transaction. However, this functionality has never been tested, and since its first release the InnoDB engine has only ever worked in single-query-thread-per-transaction mode. This design decision caused a lot of tight coupling between the core sub-systems, most of which related to the state of the (potentially multiple) query threads. The state changes and their detection were protected by the kernel mutex. The first step was simplifying the rules around the handling of multiple query threads; for the curious, it is the code in the files prefixed with que0. The most complex part of the change was the transition from the waiting-for-lock state to the deadlock or timeout rollback state. The control flow was rather convoluted because of the handling for potentially multiple query threads bound to a transaction.
Difference between query state and transaction state
InnoDB distinguishes between transaction state and query thread state. When a transaction waits for a lock, it is not the transaction state that is changed to LOCK_WAIT but the query thread state. In a similar fashion, when we roll back a transaction it is the query thread state that changes to ROLLING BACK, not the transaction state. However, because there has always been a one-to-one mapping between a query thread and a transaction, the query state can be regarded as a transaction sub-state.
A transaction has a small set of top-level states; while a transaction is in the state TRX_STATE_ACTIVE, it can additionally be in one of several sub-states (the query thread states), such as running, waiting for a lock, or rolling back.
Below is a somewhat simplified description of the roles that the (more important) new rw-locks and mutexes play in the new design.
trx_sys_t::lock is a rw-lock that protects the global list of transactions, which is ordered by transaction id. This transaction list is read every time an MVCC view is created and also when a user runs something like “SHOW ENGINE INNODB STATUS”. A transaction is added to this list when it is started and removed once it completes (commit or rollback).
All locking data structures are protected by lock_sys_t::mutex, while changes to a transaction’s query thread state are protected by that transaction’s trx_t::mutex. A transaction’s query thread state can be changed asynchronously by other threads in the system; therefore, a thread changing a query thread state has to acquire both the lock mutex and the transaction mutex.
Ongoing optimisations and future work
Getting rid of the global kernel mutex is only the start of fixing the mutex contention issues. Several optimisations that were impossible with a single global mutex can now be considered, and we can take this splitting process further down to the sub-system level. I’ve been experimenting with eliminating trx_sys_t::lock completely: by selectively commenting out parts of the code, I’ve been able to run InnoDB without this rw-lock and see where the other hot spots are. Those findings probably belong in a separate blog post. We should finally be able to eliminate (or mitigate) the overhead of MVCC read view creation, and there is now more room to move in fixing the lock release overhead, which is especially important in read-write workloads. A whole slew of optimisations that we couldn’t do earlier are now possible.