InnoDB 5.7 performance improvements
By Sunny Bains on Sep 21, 2013
InnoDB Performance OptimizationsFirst, some background. Optimizing InnoDB performance has been an ongoing process since MySQL 5.6. Before MySQL 5.6 there was a single global mutex called the kernel_mutex, which was used for managing transactions, MVCC (Multiversion Concurrency Control), locking and several other things. A single mutex limited scalability. Removing the kernel_mutex was the first step. This allowed us to reorganize the code and improve code modularization. In particular, separating the locking code from transaction life cycle management. At Oracle we try and do things incrementally to mitigate risk and make it easier to track the evolution of our code base. This strategy has served us well in improving test coverage, too. I would like to thank the System and Server QA team for their invaluable help. Also, where would we be without our Performance Architect Dimitri Kravtchuk.
MySQL 5.6 ChangesUp to MySQL 5.5, transactions were kept on a single list ordered in descending order by transaction ID. For MVCC, and when converting an implicit record lock to an explicit record lock, we had to acquire the kernel mutex and traverse the active transaction list. As the list grew it created a severe bottleneck.
In MySQL 5.6 we split the single transaction list into 2 separate lists.
- RW transactions list
RO transactions list - but not auto-commit non-locking selects
Note: Auto-commit non-locking selects (not on any list)
The split reduced the size of the read-write transaction list, which reduced contention for trx_sys_t::mutex. As a result, scalability was improved.
To put the transactions on the RO transactions list in MySQL 5.6 you have to use the following syntax:
START TRANSACTION READ ONLY;
There are two issues with this. One is that user code has to be changed to take advantage of the scalability changes. The second issue is the parsing and network overhead of sending additional text across the wire. Auto-commit non-locking selects, however, can be detected using heuristics and can benefit users out of the box.
This allowed for some equally spectacular performance gains in MySQL 5.6, but that was just a start :-)
MySQL 5.7 Changes
These days NUMA (Non Uniform Memory Access) hardware is the norm
for high end machines. This imposes an additional challenge.
Splitting mutexes alone is not sufficient anymore to improve scalability. One
issue we noticed was the cost of
traversing the trx_t instances on the read-write transaction list.
The trx_t objects were randomly located in memory which slowed
down scans during the MVCC read view create phase. To mitigate
this issue we created a pool of trx_t instances and put free
transactions on a priority queue that is ordered by the object
Another insight is that we do not require the transaction ID for read-only transactions. Transaction ID assignment results in some disk access (read/write) because it is a persistent system counter. There was an existing optimization in InnoDB to perform allocation of the ID in blocks with the aim of reducing this IO overhead. The size of the ID block is 256. This means we sync/flush the trx sys header to disk after every 256 transactions. For read-only transactions this creates an unnecessary overhead. Only read-write transactions require a transaction ID because a transaction ID is only used by MVCC to determine visibility of a transaction's changes.
In MySQL 5.7.2, there is another optimization where all transactions are treated as a read-only transactions by default, and a transaction is only assigned an ID and a rollback segment when the transaction attempts to do its first update. This optimization of treating all transactions as read-only by default has the additional benefit that users do not need to use special syntax or change their applications to take advantage of it.
For read-write loads, there has always been another bottleneck around implicit record to explicit record lock conversion and the related MVCC code for checking the visibility of a transaction's changes. The implicit to explicit record lock conversion code has to determine whether the transaction is currently active or not. If it is still active then the implicit record lock is converted to an explicit record lock. In 5.6 terms this means that we first acquire the lock_sys_t::mutex and then acquire the trx_sys_t::mutex to ensure that the transaction does not commit while we check its state. To get around this overhead, we have made two changes in InnoDB. The first change is to use reference counting for transactions. This reduces the overhead on the lock_sys_t::mutex. The second change is to use a data structure that is better suited to key value lookup, instead of scanning the list.
Another bottleneck was read view creation (required for MVCC). The problem is that we need to scan the read-write transaction list. This can be expensive because of its size, and also due to NUMA issues mentioned earlier. Another issue is that prior to MySQL 5.7.2 we allocate memory for the read view while holding the trx_sys_t::mutex. The 5.7.2 optimizations remove the memory allocation and use a pool of read view objects instead. An additional optimization reuses the read view attached to a transaction if no read-write transactions have been active since the last view was created. This removes a mutex acquire/release during the transaction commit and read view create phase, which improves scalability of read-only loads. This is an area of active development and we are investigating other potential optimizations that will allow us to implement even more improvements. Also worth mentioning is that MySQL 5.7.2 code around MVCC has been rewritten, which will make further optimizations that much easier.
- Faster create/drop
- No redo logging
- Separate tablespace for all uncompressed temporary tables
Temporary tables don't require recovery therefore there is no point in saving their definitions in the persistent data dictionary. By putting the all the temporary tables in a special tablespace that is not redo logged we can reduce a lot of the unnecessary overhead. We do UNDO log the changes to temporary tables, this is required for rollback to savepoint. However, the UNDO logs for temporary tables are not redo logged and the UNDO logs for temporary tables also reside in the special temporary tablespace.
Reducing the dict_index_t::lock contention
This deserves a dedicated blog of its own. InnoDB has two "modes" when updating the Btrees. An optimistic mode and a pessimistic mode. They roughly correspond to in-place changes and tree modification changes. If InnoDB thinks that an insert/delete will result in a tree modification change it will lock the index in X mode. This blocks readers too and limits concurrency. The fix was to introduce a new lock mode SX. This reduces the need to lock the entire Btree during tree modification changes.
InnoDB has had some sub-optimal code around traversing the dirty page and LRU list flushing. The problem was that when we release the buffer pool and associated mutexes during the IO phase we can no longer trust the iterator that is used to scan the aforementioned lists. So, we restart the scan from the start. The fix was to track whether the iterator was invalidated while the covering mutexes where released. If the iterator was not invalidated then we can carry on from where we left off. If on the other hand we detect that it was invalidated we go back to the start and do it the old way. The iterator invalidation should be very rare.
Dimitri has done some benchmarks to demonstrate some of these improvements. In my internal testing of read-only loads I see that InnoDB now maintains close to peak QPS right up to 8K connections. This is a substantial improvement over previous versions. We are very pleased with the result of our efforts and hope that users can benefit from these improvements. Give MySQL 5.7.2 a spin and check these great improvements :-)