Reducing Innodb mutex contention
By realneel on Apr 21, 2009
Today Sun announces MySQL 5.4. This is a great day for customers as they can use systems with many cores much more efficiently. Its a great day for the MySQL community and the MySQL performance team because we made it happen. MySQL 5.4 includes a lot of community contributed fixes as well as many fixes from our team. Mikael and Allan are blogging about all the cool new features and the great scalability of MySQL 5.4. I thought I will take this opportunity to blog about some of the things we tried, and rejected. Sometimes there are a lot of things to be learnt from things that do not work
Early on during our performance investigation, we were trying to see if we can reduce some of the contention in Innodb locks. If you are not familiar with Innodb locks, I suggest you read Tim Cook's excellent presentation to MySQL University on this very topic.
In a nutshell, Innodb has 4 kinds of lock modes (shared, exclusive, intention shared, and intention exclusive). Since POSIX synchronization methods support maximum of 2 modes (reader or writer), Innodb implements its own set of locking primitives using a condition variable and a regular mutex. Innodb also implements its own spin followed by a block. When it gets a mutex in an not contended case, it is very fast. However, when spinning for a lock fails, it gets interesting.
A failed spin for a lock means it has to block for the lock to be available. Currently Innodb uses the wait array interface to keep track of who is waiting on what mutex. Unfortunately there is a global mutex protecting the wait array. This global mutex (sync_primary_wait_array) has shown to be hot in some high thread count experiments.The following callstack illustrates this perfectly.
nsec ---- Time Distribution --- count Stack 16384 | | 8 mysqld`os_mutex_enter+0x4 32768 | | 11 mysqld`sync_array_free_cell+0x28 65536 | | 7 mysqld`mutex_spin_wait+0x194 131072 | | 9 mysqld`trx_undo_assign_undo+0x30 262144 |@ | 34 mysqld`trx_undo_report_row_operation+0x168 524288 |@@@@ | 122 mysqld`btr_cur_update_in_place+0x160 1048576 |@@@@@ | 157 mysqld`btr_cur_optimistic_update+0xbc 2097152 |@@@@@@ | 186 mysqld`row_upd_clust_rec+0x50 4194304 |@@ | 86 mysqld`row_upd_clust_step+0x5f0
As you can see, mutex_spin fails, it has to wait, it looks for a free cell in the sync_array to block. Before it can find a free cell, it has to lock the sync_array and search for a free cell. It blocks (using OS primitives, not Innodb locking) on the sync_primary_wait_array.
Proposed SolutionLocking the whole sync_array to pick a free cell does not sound too scalable. I thought of following simple ways to fix this
- Use a mutex per cell instead of a global mutex.
- Use atomic ops to mark cells free/busy instead of grabbing the global mutex and checking.
- The search for a free cell always starts from 0. This is suboptimal as busy cells will tend to accumulate at the beginning. I propose starting at the previously found free cell and circling back after hitting the end.
I implemented a quick prototype to measure the improvement in performance before asking the Innodb folks to take a closer look. My performance tests showed very minimal improvement in performance.
Why did it not give a big boost in performance
An important thing to note is that this contention happened when it failed to get a mutex. A thread fails to get a mutex if another thread has already acquired the mutex. So it does not really matter how long (within reasonable limits), the thread that failed to acquire the mutex took to sleep on the lock.
Richard Smith had a great idea of totally bypassing the sync_array interface. He prototyped it and got some good gains, but I will let him talk about it.
Did we add this to 5.4?No, since it did not improve performance, it is not there in 5.4. I doubt we will put it in unless we see a decent improvement in performance, or a drop in CPU utilization as a result of fixing it. There are other implementation issues (there is some deadlock detection code that depends on entire sync_array being locked, and others) that make it risky to fix without good justification.
This just illustrates how hard performance work can be at times. Sometimes the number of ideas rejected is more than what got accepted