Better scaling of read-only workloads
By Calvin Sun on Dec 20, 2011
Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Sunny Bains.
The problem and its cause
There have been several complaints over the years about InnoDB’s inability to scale beyond 256 connections. One of the main issues behind this scalability bottleneck was the read view creation that is required for MVCC (Multi Version Concurrency Control) to work. When the user starts a transaction this is what InnoDB does under the hood:
- Create or reuse a transaction instance – usually one is reused from a pool (trx_sys_t::mysql_trx_list).
- Initialize the transaction start time and assign a rollback segment
- Append the transaction to an active transaction list ordered on trx_t::id in descending order
The append to trx_sys_t::trx_list and the corresponding removal during commit are covered by trx_sys_t::mutex. After the transaction is “started”, and if it has an isolation level greater than or equal to REPEATABLE-READ, InnoDB creates a view (snapshot) of the running system state before the transaction accesses its first record/row. It does this by examining the transactions that are active at the time of the MVCC snapshot, so that their changes can be excluded from the creating transaction’s read view. This read view creation is also covered by trx_sys_t::mutex. As the number of active transactions in the system increases, read view creation takes longer and longer. This increases the wait times on trx_sys_t::mutex (during transaction start and commit), and once threads are forced to wait on a condition variable (in contrast to simply spinning while waiting for the mutex) the system throughput drops dramatically.
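To make the cost concrete, here is a minimal Python sketch of the behaviour described above. It is a toy model, not InnoDB's code: the class TrxSys and its methods are hypothetical names, and a plain Python list stands in for trx_sys_t::trx_list. The point is only that creating a read view walks every active transaction while holding the same lock that transaction start and commit need.

```python
import threading


class TrxSys:
    """Toy model: one active-transaction list guarded by one mutex."""

    def __init__(self):
        self.mutex = threading.Lock()  # stands in for trx_sys_t::mutex
        self.trx_list = []             # active transaction ids, newest first
        self.next_id = 1

    def start_trx(self):
        with self.mutex:
            trx_id = self.next_id
            self.next_id += 1
            # Newest id goes to the front, so the list stays in
            # descending id order.
            self.trx_list.insert(0, trx_id)
            return trx_id

    def commit_trx(self, trx_id):
        with self.mutex:
            self.trx_list.remove(trx_id)

    def create_read_view(self, own_id):
        # Work done here grows with the number of active transactions,
        # and all of it happens while the mutex is held.
        with self.mutex:
            return [t for t in self.trx_list if t != own_id]
```

In this model, every additional active transaction lengthens the time create_read_view holds the mutex, which is the time every other thread calling start_trx or commit_trx has to wait.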
While investigating this problem there were two observations that I made:
- Read only transactions should not be considered in the MVCC snapshot
- Auto commit non-locking read-only selects should not be in the trx_sys_t::trx_list at all
For the first to work we need to tag the transactions as READ ONLY when the transaction is started e.g.,
- START TRANSACTION READ ONLY;
I will not be discussing this functionality in this blog because the syntax for this doesn’t exist in MySQL (yet). However, the functionality exists in InnoDB to handle this case and is in the 5.6.4 release. Once the above syntax exists, InnoDB can take advantage of the new syntax trivially. What I want to talk about is the second case. This special case can be detected by InnoDB using existing state information and handled transparently without any syntax change in user applications and is fully functional in the 5.6.4 release.
InnoDB transaction life cycle redesign
The fix splits trx_sys_t::trx_list (the active transactions list) into two lists, trx_sys_t::ro_trx_list and trx_sys_t::rw_trx_list. Only transactions that are in trx_sys_t::rw_trx_list are taken into consideration when creating the MVCC snapshot. For a read-only heavy workload the benefit is obvious: the smaller RW active transaction list makes read view creation for MVCC (and purge) very fast. For auto-commit read-only non-locking selects the additional benefit is that we don’t need to acquire trx_sys_t::mutex at all, because we don’t put them on the active list. This removes the bottleneck around trx_sys_t::mutex and improves concurrency and scalability.
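The split can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual implementation: SplitTrxSys, Trx and the mutex_acquisitions counter are hypothetical names added for illustration, and giving auto-commit non-locking selects no id at all is a simplification of the model rather than a claim about InnoDB.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trx:
    trx_id: Optional[int]               # None: never placed on a list (model only)
    read_only: bool
    autocommit_non_locking: bool = False


class SplitTrxSys:
    """Toy model of the active list split into RO and RW lists."""

    def __init__(self):
        self.mutex = threading.Lock()   # stands in for trx_sys_t::mutex
        self.rw_trx_list = []
        self.ro_trx_list = []
        self.next_id = 1
        self.mutex_acquisitions = 0     # illustration only

    def start_trx(self, read_only=False, autocommit_non_locking=False):
        if autocommit_non_locking:
            # Not put on any list, so the mutex is not needed.
            return Trx(None, True, True)
        with self.mutex:
            self.mutex_acquisitions += 1
            trx = Trx(self.next_id, read_only)
            self.next_id += 1
            target = self.ro_trx_list if read_only else self.rw_trx_list
            target.insert(0, trx)
            return trx

    def commit_trx(self, trx):
        if trx.autocommit_non_locking:
            return                      # nothing to remove, no mutex
        with self.mutex:
            self.mutex_acquisitions += 1
            target = self.ro_trx_list if trx.read_only else self.rw_trx_list
            target.remove(trx)

    def create_read_view(self):
        # Only read-write transactions are examined for the snapshot.
        with self.mutex:
            self.mutex_acquisitions += 1
            return [t.trx_id for t in self.rw_trx_list]
```

With one writer and a hundred tagged read-only transactions active, create_read_view in this model looks at a single entry, and starting and committing an auto-commit non-locking select leaves mutex_acquisitions unchanged.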
Auto-commit read-only non-locking transactions go from state NOT STARTED -> ACTIVE -> NOT STARTED, in contrast to locking read-only (and read-write) transactions which go from state NOT STARTED -> ACTIVE -> COMMIT TO MEMORY -> NOT STARTED. The additional advantage in skipping the COMMIT TO MEMORY state is that we know that they cannot acquire any locks and therefore it is pointless to acquire the lock_sys_t::mutex and attempt lock release. Also, during COMMIT because they are not on any active transaction list we don’t need to acquire the trx_sys_t::mutex to remove them from the list, improving concurrency and performance further.
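The two life cycles above can be written down as a small Python sketch. The enum and the function name life_cycle are hypothetical; only the state names and their order come from the description above.

```python
from enum import Enum


class TrxState(Enum):
    NOT_STARTED = "NOT STARTED"
    ACTIVE = "ACTIVE"
    COMMIT_TO_MEMORY = "COMMIT TO MEMORY"


def life_cycle(autocommit_non_locking):
    """Return the sequence of states a transaction passes through."""
    states = [TrxState.NOT_STARTED, TrxState.ACTIVE]
    if not autocommit_non_locking:
        # Locking read-only and read-write transactions pass through
        # COMMIT TO MEMORY, where lock release and list removal happen.
        states.append(TrxState.COMMIT_TO_MEMORY)
    states.append(TrxState.NOT_STARTED)
    return states
```

Calling life_cycle(True) gives the three-state path of an auto-commit non-locking select, and life_cycle(False) gives the four-state path of every other transaction.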
Changes to transaction state visibility
Currently (5.6.4), InnoDB doesn’t display the auto-commit read-only non-locking selects in “SHOW ENGINE INNODB STATUS”; however, they are visible in the INFORMATION_SCHEMA.innodb_trx table. The innodb_trx table has two new fields that can be queried to determine whether a transaction is tagged as a read-only transaction and, additionally, whether it qualifies for the special handling of auto-commit read-only non-locking selects. The new fields are:
- trx_read_only – 0 or 1 (INT)
- trx_autocommit_non_locking – 0 or 1 (INT)
Some additional minor tweaks: read-only transactions aren’t assigned a rollback segment, and if they are flagged as autocommit non-locking selects then we only set the start time once every 32 times. This reduces the overhead of a system call.
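The start-time tweak amounts to sampling the clock rather than reading it on every transaction start. A minimal Python sketch of that idea follows; the class StartTimeSampler and its fields are hypothetical, and whether InnoDB keeps its counter per transaction object or elsewhere is not stated here, so the sketch simply uses one counter per sampler.

```python
import time


class StartTimeSampler:
    """Refresh a cached start time only once every `interval` calls."""

    def __init__(self, interval=32, clock=time.time):
        self.interval = interval
        self.clock = clock       # injectable so the sketch can be tested
        self.count = 0
        self.cached = None
        self.clock_calls = 0     # illustration only

    def start_time(self):
        if self.count % self.interval == 0:
            # Only this branch pays for the clock read.
            self.cached = self.clock()
            self.clock_calls += 1
        self.count += 1
        return self.cached
```

The trade-off is that up to 31 consecutive starts report a slightly stale start time in exchange for skipping the clock read.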
These tests were run by Alexey Stroganov (a.k.a. Ranger) using Sysbench (with --oltp-skip-trx=on); wl6046 refers to the internal worklog number of this performance fix. Note how InnoDB scales easily up to 4K threads on the 24 core host and 1.5K threads on the 8 core host, with very little (negligible) drop in TPS as the number of threads is cranked up. In particular, note how the peak TPS on the 24 core host is nearly double :-)
- QUERY – SELECT c FROM sbtest WHERE id=N
- QUERY – SELECT c FROM sbtest WHERE id BETWEEN N AND M
The result is the same for SIMPLE RANGES. At lower thread levels 5.6 has lower TPS than 5.5 on the 8 core host; however, that is due to another known and unrelated issue that is currently being actively worked on.
This is a significant step forward in InnoDB read-only performance and will allow InnoDB to scale up to a very high number of concurrent queries and take advantage of a greater number of processors with the improved parallelism. Note: mixed workloads with a read-only heavy component will also benefit from this improvement, though currently only if the read-only transactions are auto-commit non-locking selects. The better news is that with this split of the active transaction list into two, we can now optimize deadlock detection, lock release and transaction scheduling around lock grants and waits. Stay tuned!