InnoDB recovery is now faster…much faster!

Note: this article was originally published on on April 13, 2010 by Inaam Rana.

One of the well known and much written about complaint regarding InnoDB recovery is that it does not scale well on high-end systems. Well, not any more. In InnoDB plugin 1.0.7 (which is GA) and plugin 1.1 (which is part of MySQL 5.5.4) this issue has been addressed. Two major improvements, apart from some other minor tweaks, have been made to the recovery code. In this post I’ll explain these issues and the our solution for these.

First issue reported here is about available memory check eating up too much CPU. During recovery, the first phase, called redo scan phase, is where we read the redo logs from the disk and store them in a hash table. In the second phase, the redo application phase, these redo log entries are applied to the data pages. The hash table that stores the redo log entries grows in the buffer pool i.e.: memory for the entries is allocated in 16K blocks from the buffer pool. We have to ensure that the hash table does not end up allocating all the memory in the buffer pool leaving us with no room to read in pages during the redo log application phase. For this we have to keep checking the size of the heap that we are using for allocating the memory for the hash table entries. So why would it kill the performance? Because we do not have the total size of the heap available to us. We calculate it by traversing the list of blocks so far allocated. Imagine if we have gigabytes or redo log to apply (it can be up to 4G). That would mean hundreds of thousands of blocks in the heap! And we have to make a check roughly whenever we are reading in a new redo page during scan. An O(n * m) algorithm where ‘n’ is number of blocks in the heap and ‘m’ is number of redo pages that have to be scanned.

What is the solution we came up with? Store the total size of a heap in its header. Simple and effective. Our algorithm now becomes O(m).

Lets talk about the second issue reported here. During the redo log application phase, data pages to which redo log entries are applied are to be inserted in a list called flush_list which is ordered by the LSN of the earliest modification to a page i.e.: oldest_modification. During the normal working the LSN increases monotonically therefore the insertion to the flush_list always happens at the head. But during recovery we have to linearly search the flush_list to find the appropriate spot for insertion. The length to which the flush_list can grow is the number of modified pages (called dirty pages in db parlance) we had at the time of crash. On high-end system with multi-gigabyte buffer pools this number can be very high i.e.: million or more dirty pages. A linear search for insertion won’t scale.

There has been a talk in the community about how to fix this and various solutions have been suggested and some were implemented by the community as well. What we, at InnoDB, have finally implemented is to have an auxiliary data structure (a red-black tree in this case) which is active only during the recovery phase and which is used to speed up sorted insertions in the the flush_list. The flush list remains a list and after the recovery is over the red black is tree is discarded as during the normal operations we only ever append to the flush_list.

So much for the theory. Now let us see if we can walk the talk. To evaluate the effectiveness of this fix we’d need a crash when there are a lot of redo logs to apply and there are a lot of dirty pages. I requested Michael Izioumtchenko (InnoDB’s QA supremo) to come up with something. And he did the following:
The dataset was obtained by running a 60m sysbench readwrite uniform distribution in memory workload with prewarmed cache using the following configuration parameters:

The latter two are used to throttle flushing in order to maximize the number of dirty pages.

It took only about 20 min of running a workload to arrive to the test dataset, including cache prewarming.
So at time of crash we had:
Modified db pages  1007907
Redo bytes: 3050455773
And the recovery times were:

Plugin 1.0.7 (also Plugin 1.1): 1m52s scan, 12m04s apply, total 13m56s
Plugin 1.0.6: 31m39s scan, 7h06m21s apply, total 7h38m
1.0.7 (and Plugin 1.1) is better 16.95x on scan, 35.33x on apply, 32.87x overall

Note that all this comes to you transparently. You don’t have to set any parameter to take advantage of this feature. My only suggestion would be to use as large log files (there is a limit of 4G on total log file size) as you can. I know users have been using smaller log files to avoid recovery running into hours. They have taken a hit on throughput during normal running to avoid longer recovery time. You don’t have to do that any more. InnoDB recovery will never run into hours. It’s a guarantee!


Post a Comment:
Comments are closed for this entry.

This is the InnoDB team blog.


« June 2016