InnoDB recovery is now faster…much faster!
By Calvin Sun on Apr 13, 2010
Note: this article was originally published on http://blogs.innodb.com on April 13, 2010 by Inaam Rana.
One of the well known and much written about
complaint regarding InnoDB recovery is that it does not scale well on
high-end systems. Well, not any more. In InnoDB plugin 1.0.7 (which is
GA) and plugin 1.1 (which is part of MySQL 5.5.4) this issue has been
addressed. Two major improvements, apart from some other minor tweaks,
have been made to the recovery code. In this post I’ll explain these
issues and the our solution for these.
First issue reported here
is about available memory check eating up too much CPU. During
recovery, the first phase, called redo scan phase, is where we read the
redo logs from the disk and store them in a hash table. In the second
phase, the redo application phase, these redo log entries are applied to
the data pages. The hash table that stores the redo log entries grows
in the buffer pool i.e.: memory for the entries is allocated in 16K
blocks from the buffer pool. We have to ensure that the hash table does
not end up allocating all the memory in the buffer pool leaving us with
no room to read in pages during the redo log application phase. For this
we have to keep checking the size of the heap that we are using for
allocating the memory for the hash table entries. So why would it kill
the performance? Because we do not have the total size of the heap
available to us. We calculate it by traversing the list of blocks so far
allocated. Imagine if we have gigabytes or redo log to apply (it can be
up to 4G). That would mean hundreds of thousands of blocks in the heap!
And we have to make a check roughly whenever we are reading in a new
redo page during scan. An O(n * m) algorithm where ‘n’ is number of
blocks in the heap and ‘m’ is number of redo pages that have to be
Lets talk about the second issue reported here. During the redo log application phase, data pages to which redo log entries are applied are to be inserted in a list called flush_list which is ordered by the LSN of the earliest modification to a page i.e.: oldest_modification. During the normal working the LSN increases monotonically therefore the insertion to the flush_list always happens at the head. But during recovery we have to linearly search the flush_list to find the appropriate spot for insertion. The length to which the flush_list can grow is the number of modified pages (called dirty pages in db parlance) we had at the time of crash. On high-end system with multi-gigabyte buffer pools this number can be very high i.e.: million or more dirty pages. A linear search for insertion won’t scale.
There has been a talk in the community about how to fix this and various solutions have been suggested and some were implemented by the community as well. What we, at InnoDB, have finally implemented is to have an auxiliary data structure (a red-black tree in this case) which is active only during the recovery phase and which is used to speed up sorted insertions in the the flush_list. The flush list remains a list and after the recovery is over the red black is tree is discarded as during the normal operations we only ever append to the flush_list.
The latter two are used to throttle flushing in order to maximize the number of dirty pages.
Note that all this comes to you transparently. You don’t have to set any parameter to take advantage of this feature. My only suggestion would be to use as large log files (there is a limit of 4G on total log file size) as you can. I know users have been using smaller log files to avoid recovery running into hours. They have taken a hit on throughput during normal running to avoid longer recovery time. You don’t have to do that any more. InnoDB recovery will never run into hours. It’s a guarantee!