Monday Apr 11, 2011

NoSQL to InnoDB with Memcached

Note: this article was originally published on April 11, 2011 by Calvin Sun.

MySQL is the most popular open source SQL database. The ever-increasing performance demands of web-based services have generated significant interest in providing NoSQL access methods to MySQL. Today, MySQL is announcing a preview of NoSQL access to InnoDB via memcached. This offering provides users with the best of both worlds – all of the advantages of a rich SQL query language, combined with better performance for simple queries through direct access to the data.

In this preview release, memcached is implemented as a MySQL plugin daemon, accessing InnoDB directly via the native InnoDB API.

Features provided in the current release:

  • Memcached as a daemon plugin of mysqld: both mysqld and memcached are running in the same process space, with very low latency access to data
  • Direct access to InnoDB: bypassing SQL parser and optimizer
  • Support for standard protocols (memcapable): both the memcached text-based protocol and the binary protocol are supported; all 55 memcapable tests pass
  • Support for multiple columns: users can map multiple columns into the “value”. The column values are delimited by a pre-defined, configurable “separator”.
  • Optional local caching: three options – “cache-only”, “innodb-only”, and “caching” (both cache and InnoDB store). These options can be applied to each of the four memcached operations (set, get, delete and flush).
  • Batch operations: users can specify the batch commit size for InnoDB memcached operations via “daemon_memcached_r_batch_size” and “daemon_memcached_w_batch_size” (default 32)
  • Support for all memcached configuration options through the MySQL configuration variable “daemon_memcached_option”
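Since the plugin speaks the standard memcached text protocol, the wire framing and the multi-column value convention can be sketched as follows (a Python sketch; the key name and the "|" separator are illustrative assumptions – the actual separator is configurable, as noted above):

```python
# Sketch of the memcached text protocol framing used by clients of the
# plugin. The "|" separator for multi-column values is an assumption;
# the real separator is configurable.

def encode_set(key, value, flags=0, exptime=0):
    """Build a memcached text-protocol 'set' command."""
    data = value.encode()
    return (f"set {key} {flags} {exptime} {len(data)}\r\n".encode()
            + data + b"\r\n")

def split_columns(value, separator="|"):
    """Split a multi-column 'value' on the configured separator."""
    return value.split(separator)

cmd = encode_set("user:42", "Calvin|Sun|Austin")
assert cmd == b"set user:42 0 0 17\r\nCalvin|Sun|Austin\r\n"
assert split_columns("Calvin|Sun|Austin") == ["Calvin", "Sun", "Austin"]
```

Any stock memcached client library that implements the text or binary protocol should be able to produce the same framing against the plugin's port.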

Sound interesting? You can download the source or binary from MySQL Labs (only tested on Linux) – select “mysql-5.6-labs-innodb-memcached”. After unpacking the files, please read the readme file “README-innodb_memcached”. Also, please read the upcoming blog “Get started with InnoDB Memcached Daemon plugin” by Jimmy.

This is a technology preview, with some limitations. We will gradually address those limitations. If you’d like to see additional new features or improvements, please let us know.

MySQL 5.6: InnoDB scalability fix – Kernel mutex removed

Note: this article was originally published on April 11, 2011 by Sunny Bains.

For those interested in InnoDB internals, this post tries to explain why the global kernel mutex was required and describes the new mutexes and rw-locks that now replace it, along with the long-term benefits of this change.

InnoDB’s core sub-systems up to version 5.5 are protected by a global mutex called the kernel mutex. This makes it difficult to do even some common-sense optimisations. In the past we tried optimising the code, but doing so would invariably upset the delicate balance achieved by tuning the code around the global kernel mutex, leading to unexpected performance regressions. The kernel mutex is also abused in several places to cover operations unrelated to the core, e.g. some counters in the server thread main loop.

The InnoDB core sub-systems are:

  1. The Locking sub-system
  2. The Transaction sub-system
  3. MVCC views

For any state change in the above sub-systems we had to acquire the kernel mutex, which reduced concurrency and made the kernel mutex very highly contended. A transaction that was creating a lock would end up blocking read view creation (for MVCC) and transaction start or commit/rollback. With the finer-granularity mutexes and rw-locks, a transaction that is creating a lock will not block transaction start or commit/rollback. MVCC read view creation will however block transaction create and commit/rollback because of the shared trx_sys_t::trx_list. But MVCC read view creations will not block each other, because they acquire an S lock.

In 5.6 the global kernel mutex has been further split into several localised mutexes. The important ones are:

  1. Transactions and MVCC views: trx_sys_t::lock (rw_lock) and trx_t::mutex
  2. Locking : lock_sys_t::mutex and lock_sys_t::wait_mutex

This change is significant from an architectural perspective and most of the effort has gone into proving that the new design is correct. Splitting a global mutex into several independent mutexes should improve performance by increasing concurrency. The downside of course is that there is probably going to be a little more context switching with finer grained mutexes. However, now we have far greater freedom in making localised changes to further speed up the sub-systems independently without worrying about the global state and a global mutex.
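The S/X semantics described above – read view creations sharing the lock, transaction start/commit taking it exclusively – can be sketched with a minimal reader-writer lock (an illustrative Python model, not InnoDB's actual rw-lock implementation):

```python
# Minimal reader-writer lock sketch illustrating why MVCC read view
# creation (S lock on trx_sys_t::lock) does not block other read view
# creations, while transaction start/commit (X lock) blocks everyone.
import threading

class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_s(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_s(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_x(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_x(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = RWLock()
lock.acquire_s()
lock.acquire_s()          # a second "read view creation" proceeds at once
assert lock._readers == 2
lock.release_s()
lock.release_s()
lock.acquire_x()          # "transaction commit" now gets exclusive access
assert lock._writer
lock.release_x()
```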


InnoDB was originally designed to support multiple query threads per transaction. However, this functionality was never tested, and since its first release the InnoDB engine has only ever worked in single-query-thread-per-transaction mode. This design decision caused a lot of tight coupling between the core sub-systems, most of it related to the state of the (potentially multiple) query threads. The state changes and their detection were protected by the kernel mutex. The first step was simplifying the rules around the handling of multiple query threads; for the curious, it is the code in the files prefixed with que0. The most complex part of the change was the transition from the “waiting for lock” state to the “deadlock or timeout rollback” state. The control flow was rather convoluted because of the handling of potentially multiple query threads bound to a transaction.

Difference between query state and transaction state

InnoDB distinguishes between the transaction state and the query thread state. When a transaction waits for a lock it is not the transaction state that is changed to LOCK_WAIT but the query thread state. Similarly, when we roll back a transaction it is the query thread state that changes to ROLLING BACK, not the transaction state. However, because there has always been a one-to-one mapping between a query thread and a transaction, the query state can be regarded as a transaction sub-state, i.e.:

Transaction states are:


When a transaction is in state TRX_STATE_ACTIVE it can have the following sub-states (or query thread states)
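The two state sets above can be sketched as follows (illustrative Python; the constant names echo the InnoDB source, but the lists here are not exhaustive):

```python
# Hedged sketch of the state split described above: a query thread
# state acts as a sub-state of an ACTIVE transaction. Names mirror the
# InnoDB source (trx0trx.h / que0que.h) but are illustrative only.
from enum import Enum

class TrxState(Enum):
    NOT_STARTED = 0
    ACTIVE = 1
    COMMITTED_IN_MEMORY = 2

class QueThrState(Enum):        # sub-states while the trx is ACTIVE
    RUNNING = 0
    LOCK_WAIT = 1

class Trx:
    def __init__(self):
        self.state = TrxState.ACTIVE
        self.que_state = QueThrState.RUNNING

# Waiting for a row lock changes only the query thread state:
t = Trx()
t.que_state = QueThrState.LOCK_WAIT
assert t.state == TrxState.ACTIVE    # transaction state is untouched
```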


Below is a somewhat simplified version of the roles that the (more important) new rw-locks and mutexes play in the new design.


trx_sys_t::lock

This is a rw-lock that protects the global list of transactions, which is ordered on transaction id. This transaction list is used every time a view for MVCC is created and also when a user runs something like “SHOW ENGINE INNODB STATUS”. We add a transaction to this list when it is started and remove it from the list once it completes (commit or rollback).


lock_sys_t::mutex and trx_t::mutex

All locking data structures and changes to transaction query thread states are protected by these mutexes. The transaction’s query thread state can be changed asynchronously by other threads in the system. Therefore the thread changing a query thread state has to acquire both the lock mutex and the transaction mutex.

Ongoing optimisations and future work

Getting rid of the global kernel mutex is only the start of fixing some of the mutex contention issues. There are several optimisations that can now be considered which were impossible with a single global mutex. We can now take this splitting process down to the sub-system level. I’ve been experimenting with eliminating the trx_sys_t::lock completely. By selectively commenting out parts of the code, I’ve been able to test InnoDB without this rw-lock and see where the other hot spots are. Those findings are probably better left to a separate blog post :-). We should finally be able to eliminate (or mitigate) the overhead of MVCC read view creation. There is also more room to move in fixing the lock release overhead, which is especially important in read-write workloads. There is a whole slew of optimisations that we couldn’t do earlier that are now possible :-).

InnoDB Persistent Statistics at last

Note: this article was originally published on April 11, 2011 by Vasil Dimov.


InnoDB gathers statistics for the data in user tables, which are used by the MySQL optimizer to choose the best query plan. For a long time the imprecision and instability of these statistics have been creating problems for users.

The problem is that these statistics are recalculated at any of the following events:

* When the table is opened

* When the table has changed a lot (1/16th of the table has been updated/deleted or inserted)

* When ANALYZE TABLE is run


* When InnoDB Monitor is turned ON

* others

and so their recalculation must be quick and unnoticeable. Thus the quick algorithm just picks 8 random pages, and it can give wildly varying results.

Quick glance

To solve this problem MySQL 5.6.2 introduces the InnoDB Persistent Statistics feature, which:

* Uses a more precise sampling algorithm (which is also slower) in an attempt to better inform the MySQL optimizer so it can choose the best possible query plan.

* Does not recalculate the statistics automatically, only when ANALYZE TABLE is run. This means that the same query plan will always be chosen even when MySQL is restarted (see below). A query plan using the global statistics can only change after ANALYZE TABLE has been run (manually). This also means that the user and/or DBA is responsible for running ANALYZE TABLE regularly if the data in the table has been changed a lot.

* Saves the statistics on disk in normal user-visible and updatable SQL tables.
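The effect of sampling on the statistics can be illustrated with a toy estimator (a sketch, not InnoDB's actual algorithm): the distinct-key counts of the sampled leaf pages are extrapolated to the whole index, so the more pages sampled, the more stable the estimate.

```python
# Illustrative sketch (not InnoDB's actual estimator) of leaf-page
# sampling: the index's distinct-key count is extrapolated from the
# per-page counts of the sampled pages.

def estimate_cardinality(sampled_pages, total_leaf_pages):
    """sampled_pages: list of per-page distinct-key counts."""
    avg_distinct = sum(sampled_pages) / len(sampled_pages)
    return round(avg_distinct * total_leaf_pages)

# 8 sampled pages out of 1000 leaf pages, ~40 distinct keys per page:
assert estimate_cardinality([40, 38, 41, 42, 39, 40, 41, 39], 1000) == 40000
```

With only 8 sampled pages an unlucky sample skews the whole estimate, which is why a larger innodb_stats_persistent_sample_pages gives more precise (and more expensive) statistics.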

How to turn it ON?

To turn ON persistent statistics for a given table you must first create the persistent statistics storage by executing the SQL script storage/innobase/scripts/persistent_storage.sql, which is shipped in the source distribution (in future versions it may be executed automatically by mysql_install_db). This script creates a new database named “innodb” with two tables in it – “table_stats” and “index_stats”. InnoDB will read/write the statistics from/to those tables.

Second, you must set the new parameter innodb_analyze_is_persistent to ON (it is OFF by default). This instructs InnoDB to use the better (and slower!) sampling algorithm during ANALYZE TABLE and to save the results to the stats tables. To change the number of leaf pages that are sampled during such a “persistent-results” ANALYZE TABLE, set the parameter innodb_stats_persistent_sample_pages to a higher value.

Once the stats “meta” tables are present and stats for a given table exist in them, those stats will be used whenever stats are needed – including opening the table, SHOW TABLE STATUS, SHOW INDEX FROM, and others. To stop using them you must DELETE the corresponding rows from innodb.table_stats and innodb.index_stats.

Beware of bugs

Some bugs in the persistent stats code which may cause InnoDB to crash have been fixed, but the fixes are not included in 5.6.2.

Information Schema for InnoDB System Tables

Note: this article was originally published on April 11, 2011 by Jimmy Yang.

One of the most important things a database user or DBA wants to know is what columns, indexes, etc. a table has. There are a few ways to find these things out, such as SHOW TABLES. However, to really reveal all of the detailed metadata in an InnoDB database, many have tried the “innodb_table_monitor” to peek into the internal data dictionary. By creating an “innodb_table_monitor” table, InnoDB will print out the contents of its metadata periodically. Unfortunately, the output is unstructured text per table, and to find what you need you would have to either carefully scan the output or write your own parser to do additional analysis if you want to display the results systematically.

Well, in the MySQL 5.6 release, “innodb_table_monitor” can become history, and you will no longer need to search through text output to find system metadata. Instead, you can query the InnoDB internal system tables through the Information Schema system tables we implemented in this release.

To begin with, let’s take a look at six of the system tables in InnoDB:


The names of these tables explain what each one contains: SYS_TABLES contains basic information about each table, each table’s index information is stored in SYS_INDEXES, and its column information in SYS_COLUMNS. Each index’s field information is stored in SYS_FIELDS. And if a table contains foreign keys, that information is stored in SYS_FOREIGN and SYS_FOREIGN_COLS. You can get a good idea of what a table or index looks like after digging into the detailed information stored in these system tables.

When a table is being used, the content of each system table is loaded into an in-memory structure (its in-memory representation). For example, SYS_TABLES info is loaded into dict_table_t, SYS_INDEXES info into the dict_index_t structure, etc. There can be additional information stored in each system table’s in-memory representation beyond its on-disk content. For example, some runtime statistics (such as rows inserted/updated) are stored in dict_table_t. To accommodate that we also created a seventh Information Schema table called SYS_TABLESTATS to display such information.

So in summary we have added the following seven System Tables:

mysql> show tables like "INNODB_SYS%";
| Tables_in_information_schema (INNODB_SYS%) |
| INNODB_SYS_FIELDS                          |
| INNODB_SYS_INDEXES                         |
| INNODB_SYS_TABLESTATS                      |
| INNODB_SYS_COLUMNS                         |
| INNODB_SYS_FOREIGN_COLS                    |
| INNODB_SYS_FOREIGN                         |
| INNODB_SYS_TABLES                          |
7 rows in set (0.00 sec)

Now let’s go through these tables using an example. Suppose we have a table named “test_table”, which has one index “idx”:

mysql> create table test_table(col1 int, col2 char(10), col3 varchar(10)) engine = innodb;

mysql> create index idx on test_table(col1);

Let’s see what’s in each system table:


mysql> select * from INNODB_SYS_TABLES \G
*************************** 1. row ***************************
*************************** 2. row ***************************
*************************** 3. row ***************************
TABLE_ID: 13
NAME: test/test_table
FLAG: 1
N_COLS: 6
SPACE: 1
3 rows in set (0.00 sec)


We can see that table “test_table” is listed there with TABLE_ID 13. It has 6 columns, because InnoDB adds three hidden columns (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) to the table. The table flag of 1 shows it is of the “DICT_TF_COMPACT” type, and its space ID is 1.

Then we can query INNODB_SYS_COLUMNS to find more information about its columns:


mysql> select * from INNODB_SYS_COLUMNS where TABLE_ID = 13 \G
*************************** 1. row ***************************
NAME: col1
POS: 0
PRTYPE: 1027
LEN: 4
*************************** 2. row ***************************
NAME: col2
POS: 1
PRTYPE: 524542
LEN: 10
*************************** 3. row ***************************
NAME: col3
POS: 2
PRTYPE: 524303
LEN: 10
3 rows in set (0.00 sec)


So it shows all three of its columns, with each column’s position, type and length.

Let’s query INNODB_SYS_INDEXES to find out about its indexes:


mysql> select * from INNODB_SYS_INDEXES where TABLE_ID = 13 \G
*************************** 1. row ***************************
NAME: GEN_CLUST_INDEX
*************************** 2. row ***************************
NAME: idx
2 rows in set (0.00 sec)


It has two indexes instead of one! Well, InnoDB actually creates an internal clustered index called “GEN_CLUST_INDEX” if there is no user-defined clustered index for the table. INNODB_SYS_INDEXES also reveals the root page of each index in its PAGE_NO field.

Note the user index “idx” has INDEX_ID 16. We can use it to query INNODB_SYS_FIELDS to find the indexed columns:


mysql> select * from INNODB_SYS_FIELDS where INDEX_ID = 16 \G
*************************** 1. row ***************************
NAME: col1
POS: 0
1 row in set (0.00 sec)


Lastly, let’s look at INNODB_SYS_TABLESTATS after inserting a row into “test_table”:


mysql> insert into test_table values(9, 'me', 'here');
Query OK, 1 row affected (0.06 sec)

mysql> select * from INNODB_SYS_TABLESTATS where TABLE_ID = 13 \G
*************************** 1. row ***************************
NAME: test/test_table
NUM_ROWS: 1
MODIFIED_COUNTER: 1
1 row in set (0.00 sec)


So it shows the number of rows inserted in the NUM_ROWS field, and the number of modifications in the MODIFIED_COUNTER field.

If tables have a foreign key relationship, that information can be displayed via the INNODB_SYS_FOREIGN and INNODB_SYS_FOREIGN_COLS tables:


CREATE TABLE child (id INT, parent_id INT,
INDEX par_ind (parent_id),
CONSTRAINT constraint_test
FOREIGN KEY (parent_id) REFERENCES parent(id)
) ENGINE=InnoDB;

mysql> select * from INNODB_SYS_FOREIGN \G
*************************** 1. row ***************************
ID: test/constraint_test
FOR_NAME: test/child
REF_NAME: test/parent
1 row in set (0.00 sec)

mysql> select * from INNODB_SYS_FOREIGN_COLS \G
*************************** 1. row ***************************
ID: test/constraint_test
FOR_COL_NAME: parent_id
POS: 0
1 row in set (0.00 sec)


So INNODB_SYS_FOREIGN and INNODB_SYS_FOREIGN_COLS clearly describe the foreign key relationship between the “child” and “parent” tables and the related foreign key column.

All seven of these tables can be joined through either TABLE_ID or INDEX_ID, so you can fetch all the information for exactly the table you want to study. And because these system tables are exposed as relational tables themselves, users can query them and get exactly the information they are interested in. This makes a great interface for any external tool that displays database content.

Another benefit of the Information Schema system tables is that, by design, the data is read directly from the system tables, rather than fetched from their in-memory representations (as innodb_table_monitor does). So it is a true representation of what we have on disk. If there is a mismatch (unlikely) between the in-memory information and the on-disk data, these system tables will display the real on-disk image, helping DBAs and even developers debug or better understand the system metadata behavior.

Introducing page_cleaner thread in InnoDB

Note: this article was originally published on April 11, 2011 by Inaam Rana.

In MySQL 5.6.2 we have introduced a new background thread named page_cleaner in InnoDB. Adaptive flushing of modified buffer pool pages in the main background thread, async flushing by a foreground query thread when the log files are close to wrapping around, idle flushing and shutdown flushing have all been moved to this thread, leaving only the last-resort sync flushing in foreground query threads. We’ve also added counters for these activities.

As the page_cleaner is all about flushing dirty pages to disk, it will do you a world of good to go through this post, where I have explained the different types of flushing that happen inside InnoDB and the conditions that trigger them. The page_cleaner thread is only concerned with flush_list flushing (this may change in future releases). So let us dig a bit deeper into why flush_list flushing happens and why it makes sense to do it in a separate thread. As is usually the case, I have to skip some details to keep this note simple.

On a busy server flush_list flushing (which I’ll simply call flushing from this point onwards) happens under four conditions. To understand these conditions, let us first familiarize ourselves with the concept of checkpoint_age. The checkpoint_age is the difference between the current_lsn (the latest change to the database) and the last checkpoint_lsn (the lsn when the last checkpoint happened). We obviously don’t want to let this difference grow beyond the log file size, because if that happens we end up overwriting redo log entries before the corresponding dirty pages are flushed to disk, losing the ability to recover them.

To avoid this situation we maintain two high water marks that indicate whether we are nearing the end of reusable redo log space. Let’s call these water marks async_water_mark and sync_water_mark, where the latter represents a more urgent situation than the former. Now that we have clarified the checkpoint_age concept, let us get back to the four conditions under which flushing of dirty pages happens:

  1. checkpoint_age < async_water_mark
    • This condition means that we have enough reusable redo space. As such there is no hurry to flush dirty pages to the disk. This is the condition where we’d like our server to be most of the time.
    • Based on adaptive_flushing heuristics we flush some dirty pages in this state. This flushing happens in the background master thread.
    • During the flushing no other threads are blocked, so queries continue normally.

  2. async_water_mark < checkpoint_age < sync_water_mark
    • As we move past the first water mark we try to bring some more urgency to our flushing. The query thread noticing this condition will trigger a flush and wait for that flushing to end. This type of flushing does not happen in the background.
    • Other query threads are allowed to proceed. We call it async flushing because only the query thread that is doing the flushing is blocked.
  3. checkpoint_age > sync_water_mark
    • This is like a panic button. We have very little reusable redo log space available. The query thread detecting this condition immediately starts flushing.
    • Other query threads are blocked. The idea is to stop the advance of checkpoint_age.
    • This type of flushing not only happens in the foreground, it actually tends to bring the whole system to a stall.
  4. %n_dirty_pages > innodb_max_dirty_page_pct
    • The percentage of dirty pages in the buffer pool exceeds the user-settable value of innodb_max_dirty_page_pct.
    • The flushing happens in the background master thread and is non-blocking for query threads.
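The four conditions above can be sketched as a classification function (illustrative Python; the water-mark fractions of the log capacity are assumptions, not InnoDB's exact constants):

```python
# Sketch of the four flushing conditions. The water-mark fractions
# (7/8 and 15/16 of log capacity) are illustrative assumptions, not
# InnoDB's actual constants.

def flush_condition(current_lsn, last_checkpoint_lsn, log_capacity,
                    dirty_pct, max_dirty_pct=75):
    age = current_lsn - last_checkpoint_lsn          # checkpoint_age
    async_water_mark = log_capacity * 7 // 8
    sync_water_mark = log_capacity * 15 // 16
    if age > sync_water_mark:
        return "sync"        # foreground, blocks all query threads
    if age > async_water_mark:
        return "async"       # blocks only the detecting query thread
    if dirty_pct > max_dirty_pct:
        return "max_dirty"   # background, non-blocking
    return "adaptive"        # background heuristic flushing

assert flush_condition(1000, 0, 1024, dirty_pct=10) == "sync"
assert flush_condition(900, 0, 1024, dirty_pct=10) == "async"
assert flush_condition(100, 0, 1024, dirty_pct=80) == "max_dirty"
assert flush_condition(100, 0, 1024, dirty_pct=10) == "adaptive"
```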

The page_cleaner thread:

As explained above, flushing can happen in a query thread, e.g. the async and sync flushing. It can also happen in the background master thread, e.g. the adaptive flushing and the max_dirty_page_pct flushing. There are two issues with this scheme. First, the master thread also has other background tasks: it has to do change buffer merges and possibly purge (though starting with 5.5 there is an option to use a dedicated purge thread, and in 5.6.2 purge can even be multi-threaded). Under a very heavy workload it is possible that the master thread cannot find enough time to flush dirty pages. The second issue is with async flushing: it should be a background task, but it is executed in a query thread. While other threads are allowed to proceed, the unfortunate thread that detects the condition is blocked on a huge dirty page flush.

To address these two issues we came up with the idea of a dedicated background thread, and named it the page_cleaner thread. All background flushing activity previously done in the master thread is offloaded to the page_cleaner. Also, async flushing is now a background task performed in the page_cleaner thread. Query threads are only ever blocked for flushing if we cross the sync flushing water mark. The page_cleaner thread wakes up every second, checks the state of the system and performs flushing activity if required. The flushing that happens when the server is idle or at shutdown is now also done by the page_cleaner thread.
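The division of labor can be sketched as a toy model of one page_cleaner wake-up (names and structure are illustrative, not InnoDB's actual code):

```python
# Toy model of the division of labor: the page_cleaner "wakes up" each
# tick and handles everything except sync flushing, which remains the
# query threads' last resort. Keys and labels are illustrative.

def page_cleaner_tick(state):
    """Return the flushing work the page_cleaner performs this tick."""
    done = []
    if state.get("idle") or state.get("shutdown"):
        done.append("idle/shutdown flush")
    if state.get("past_async_water_mark"):
        done.append("async flush")          # now a background task
    if state.get("dirty_pct_exceeded"):
        done.append("max dirty pages flush")
    if not done:
        done.append("adaptive flush")
    return done

assert page_cleaner_tick({"past_async_water_mark": True}) == ["async flush"]
assert page_cleaner_tick({}) == ["adaptive flush"]
# Note what is absent: past the sync water mark the query threads
# themselves still flush, and are blocked while doing so.
```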

Finally, we have added some counters related to the above four types of flushing to the innodb_metrics table, to give you a picture of how your system is behaving.

mysql> select name, comment from information_schema.innodb_metrics where name like 'buffer_flush_%';
| name                           | comment                                               |
| buffer_flush_adaptive_flushes  | Occurrences of adaptive flush                         |
| buffer_flush_adaptive_pages    | Number of pages flushed as part of adaptive flushing  |
| buffer_flush_async_flushes     | Occurrences of async flush                            |
| buffer_flush_async_pages       | Number of pages flushed as part of async flushing     |
| buffer_flush_sync_flushes      | Number of sync flushes                                |
| buffer_flush_sync_pages        | Number of pages flushed as part of sync flushing      |
| buffer_flush_max_dirty_flushes | Number of flushes as part of max dirty page flush     |
| buffer_flush_max_dirty_pages   | Number of pages flushed as part of max dirty flushing |

Friday Apr 08, 2011

MySQL 5.6: Data dictionary LRU

Note: this article was originally published on April 8, 2011 by Sunny Bains.

In MySQL 5.6 we’ve added a new feature that closes and unloads table instances from the InnoDB internal data dictionary once a user-configurable threshold is reached. This ends the situation where hundreds of megabytes of memory could be spent caching rarely used entries until the server was restarted, and it will be particularly appreciated by hosting and software-as-a-service providers.

For this we’ve used an existing MySQL config variable, table-definition-cache. This cache limit is a soft limit: if the user has more than table-definition-cache tables open, InnoDB will not forcibly evict tables that are still in use from the InnoDB data dictionary cache.

Users don’t really have to do anything to make use of this feature. It should pick up the existing setting (default or explicitly set) and just work. This feature should benefit users that have thousands of tables and open a subset of them only rarely. The LRU mechanism will eventually mark rarely used tables for eviction and remove them from the data dictionary cache.
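The soft-limit LRU behavior described above can be sketched as follows (an illustrative Python model, not InnoDB's actual implementation):

```python
# Sketch of a soft-limit LRU for cached table definitions: entries
# beyond the limit are evicted oldest-first, but never while in use.
from collections import OrderedDict

class DictCache:
    def __init__(self, limit):
        self.limit = limit
        self.tables = OrderedDict()      # name -> ref_count

    def open_table(self, name):
        # Re-inserting moves the entry to the MRU end of the order.
        self.tables[name] = self.tables.pop(name, 0) + 1

    def close_table(self, name):
        self.tables[name] -= 1
        self._evict()

    def _evict(self):
        for name, refs in list(self.tables.items()):   # oldest first
            if len(self.tables) <= self.limit:
                break
            if refs == 0:              # soft limit: skip in-use tables
                del self.tables[name]

cache = DictCache(limit=2)
for t in ("t1", "t2", "t3"):
    cache.open_table(t)
cache.close_table("t1")
assert "t1" not in cache.tables        # LRU victim, no longer in use
assert "t2" in cache.tables and "t3" in cache.tables   # still open: kept
```

Note how the cache may temporarily exceed the limit while all entries are in use; that is the "soft" part of the soft limit.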

Other benefits

Some of the additional benefits of this change are internal to InnoDB and improve the design. We now have well-defined interfaces for opening and closing tables, and all internal sub-systems use the same interface as the MySQL table handler. This has loosened the tight coupling between the various InnoDB sub-systems and the data dictionary, which will make it easier to add features to the InnoDB data dictionary in the future.

Future work

This feature is by no means 100% complete. Currently we don’t evict tables that are in a foreign key relationship, either reference or those that refer to another table. Handling of such tables is work in progress, so stay tuned.

Thursday Apr 07, 2011

Tips and Tricks for Faster DDL

Note: this article was originally published on April 6, 2011 by Calvin Sun.

Data Definition Language (DDL) operations have traditionally been slow in MySQL. Any change to the table definition would be implemented by creating a copy of the table and index layout with the requested changes, copying the table contents row by row, and finally renaming tables and dropping the original table.

The InnoDB Plugin for MySQL 5.1 implements a more efficient interface for creating and dropping indexes. Indexes can be created or dropped without rebuilding the entire table.

Speeding up Bulk INSERT, UPDATE and DELETE Operations

Normally, InnoDB updates all indexes of a table when rows are inserted or deleted. If you update an indexed column, InnoDB has to delete the old value and insert the new value in the corresponding index. If you update a primary key column, the row is deleted and re-inserted in every index of the table. This can be slow even with the change buffering in MySQL 5.5.

Bulk data imports, bulk deletes and bulk updates could be executed faster if they did not have to update too many secondary indexes. Provided that there are no operations running that would benefit from the existence of the indexes while the bulk operation is in progress, it could be useful to drop the affected indexes before starting a bulk operation and to create the indexes after the operation.

Consider this artificial example:

CREATE TABLE friends(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name CHAR(50) NOT NULL,
age INT,
hair_color ENUM('red','black','blond','grey','artificial','other'),
INDEX(name), INDEX(age)
) ENGINE=InnoDB;

Yes, it is a bad idea to have age in the table, because it depends on the current time. A birthday would be more accurate and would not need to be updated. When you update age, you will be updating every non-NULL row. It probably pays off to drop the affected index for the duration of the operation. Other indexes can be left alone, because age is not part of them or of the PRIMARY KEY:

UPDATE friends SET age=age+1;

Bulk INSERT operations can be sped up by creating the secondary indexes afterwards. That way, the records are merge-sorted and then inserted into the InnoDB index B-trees in order, using sequentially allocated pages. This creates a more optimal layout, especially for indexes whose columns are rarely updated. To continue our example:

LOAD DATA INFILE 'friends.txt' INTO TABLE friends;
ALTER TABLE friends ADD INDEX (name), ADD INDEX (age);

At some point, we may get suspicious about grey-haired people who are younger than 30 years:

DELETE FROM friends WHERE age<30 AND hair_color='grey';

Depending on the number of affected rows, it might pay off to drop and re-create all indexes, or to optimize the whole table.


The OPTIMIZE TABLE statement in MySQL makes a copy of the table, copying rows one by one to every index. Because the secondary indexes are not likely to be in the order of the PRIMARY KEY (for example, the (age,id) tuples in the secondary index B-tree on age might not be (42,1),(45,2),(50,4),(65,7) but (50,1),(42,2),(65,5),(45,8)), this will lead to the secondary indexes being populated out of order.

If you want to create every index in order, you might want to mimic OPTIMIZE TABLE like this:

CREATE TABLE optimized_friends LIKE friends;
ALTER TABLE optimized_friends DROP INDEX age, DROP INDEX name;
INSERT INTO optimized_friends SELECT * FROM friends;
ALTER TABLE optimized_friends ADD INDEX(age), ADD INDEX(name);
RENAME TABLE friends TO unoptimized_friends;
RENAME TABLE optimized_friends TO friends;
DROP TABLE unoptimized_friends;

The INSERT…SELECT will copy the data in the PRIMARY KEY order, which should allocate and populate the clustered index B-tree in sequential order. Then, the ALTER TABLE will create the secondary indexes. As the index entries will be merge sorted by InnoDB before inserting, the secondary indexes will be in order as well.

Issuing one ALTER TABLE … ADD INDEX statement instead of multiple CREATE INDEX statements has the benefit that the data rows are scanned only once to create the temporary merge sort files for all the new indexes. If you are short on temporary file space, you might not want to create all the indexes of a large table in one go.
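The single-scan property can be sketched like this (illustrative Python: one pass over the rows feeds one sort buffer per new index, and each buffer is then sorted for in-order B-tree insertion):

```python
# Sketch of why one ALTER TABLE ... ADD INDEX for several indexes
# scans the rows once: a single pass fills one sort buffer per new
# index; each buffer is then merge-sorted for sequential insertion.

def build_indexes(rows, key_funcs):
    """rows: list of dicts; key_funcs: {index_name: key extractor}."""
    buffers = {name: [] for name in key_funcs}
    for row in rows:                        # ONE scan of the table
        for name, key in key_funcs.items():
            buffers[name].append((key(row), row["id"]))
    return {name: sorted(buf) for name, buf in buffers.items()}

rows = [{"id": 1, "name": "zoe", "age": 50},
        {"id": 2, "name": "al", "age": 42}]
ix = build_indexes(rows, {"name": lambda r: r["name"],
                          "age": lambda r: r["age"]})
assert ix["age"] == [(42, 2), (50, 1)]     # sorted: insert in order
assert ix["name"] == [("al", 2), ("zoe", 1)]
```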

If you suspect that only some secondary indexes are fragmented, you can DROP and ADD only those indexes.


While the fill factor of the index B-trees cannot be controlled directly in InnoDB, it may happen that sequentially inserted records result in a more tightly packed tree than records inserted in random order. This can be either a good or a bad thing, depending on the number and nature of future modifications.

If you want the old table-copying behavior back, you can SET old_alter_table=1.

Wednesday Apr 06, 2011

MySQL 5.6: Multi threaded purge

Note: this article was originally published on on April 6, 2011 by Sunny Bains.


What exactly does purge do, and why is it needed? If you have ever wondered, read on. It is really a type of garbage collector. When a user issues a DML statement like “DELETE FROM t WHERE c = 1;”, InnoDB doesn’t immediately remove the matching record. This is what happens under the hood:

  1. It marks the record as deleted by setting a bit in the control bits of the record.
  2. Stores the before image of the modified columns in the UNDO log
  3. Updates the system columns DB_TRX_ID and DB_ROLL_PTR in the clustered index record. DB_TRX_ID identifies the transaction that made the last change, and DB_ROLL_PTR points to the new UNDO log record. This UNDO log record contains the old values of DB_TRX_ID and DB_ROLL_PTR, possibly pointing to an older transaction and undo log entry.

From this you should be able to visualise that the UNDO log records related to the modified clustered index record form a disk-based linked list, with the head anchored in the clustered index record. For the sake of simplicity I’ve ignored the UNDO log pages and the case where DML updates a record. This information is required by rollback and MVCC (multi-version concurrency control). For MVCC we need the entry to exist in the clustered index so that we can follow a pointer back to where the “before” changes were written in the UNDO log and use that information to construct a previous version of the record. For rollback we need the before image of the record (the UNDO entry) so that we can restore it when a transaction is rolled back.

Another benefit of purging separately in the background is that expensive B+Tree block merge operations (if the removal of a record were to lead to underflow) can be done asynchronously by purge rather than by user transactions.

Garbage collection

Once a transaction has been committed, the UNDO log entries and delete-marked records it wrote may still be needed by other transactions for building old versions of the records. When no transaction is left that needs that data, the delete-marked records and the related UNDO log records can be purged. For this reason we have a dedicated (and aptly named) purge function that runs asynchronously within InnoDB. Another problem is that InnoDB clustered and secondary indexes can have multiple entries for the same key. These need to be removed (garbage collected) by purge too (see InnoDB Change Buffering for details regarding secondary indexes).

How does purge work?

Purge clones the oldest view in the system. This view holds the control information that limits which changes, made by other transactions, are visible. By cloning the oldest view, purge can determine that there can’t be any active transaction in the system older than the low limit of that view. Purge then reads the UNDO log entries starting from the oldest toward the newest. It parses these entries and removes the delete-marked entries from the clustered and secondary indexes, if they are not referenced by any currently running transaction.

Problem with the old design

In versions prior to 5.5 this purge function was part of the responsibility of the InnoDB master thread. At fixed intervals it would run purge asynchronously. The problem with this approach was that the master thread was also responsible for flushing (writing to disk) dirty pages, among other tasks. A high load on the server that dirtied a lot of pages would force the master thread to spend most of its time flushing, and therefore no purging would get done, and vice versa.

Changes in 5.5

In 5.5 there is an option innodb-purge-threads=[0,1] to create a dedicated thread that purges asynchronously if there are UNDO logs that need to be removed. We also introduced another option, innodb-purge-batch-size, that can be used to fine-tune purge operations. The batch size determines how many UNDO log pages purge will parse and process in one pass. The default setting is 20, the same as the hard-coded value in previous InnoDB releases. An interesting side effect of this value is that it also determines when purge will free the UNDO log pages after processing them: always after 128 passes. This magic value of 128 is the same as the number of UNDO logs in the system tablespace, now that 5.5 has 128 rollback segments. Increasing innodb-purge-batch-size changes this freeing behaviour: it increases the number of UNDO log pages removed in a batch when the limit of 128 passes is reached. This change was seen as necessary so that we could reduce the cost of removing the UNDO log pages for the extra 127 rollback segments introduced in 5.5. Prior to this change, iterating over the 128 rollback segments to find the segment to truncate had become expensive.
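A minimal my.cnf sketch of the 5.5 settings discussed above (the values shown are the defaults described in the text):

```ini
[mysqld]
innodb_purge_threads=1      # dedicated purge thread instead of the master thread
innodb_purge_batch_size=20  # UNDO log pages parsed and processed per pass
```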

Changes in 5.6

In 5.6 we have the same parameters as 5.5 except that innodb-purge-threads can now be between 0 and 32. This introduces true multi threaded purging. If the value is greater than 1 then InnoDB will create that many purge worker threads and a dedicated purge coordinator thread. The responsibility of the purge coordinator thread is to parse the UNDO log records and parcel out the work to the worker threads. The coordinator thread also purges records, instead of just sitting around and waiting for the worker threads to complete. The coordinator thread will divide the innodb-purge-batch-size by innodb-purge-threads and hand that out as the unit of work for each worker thread.
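A hypothetical 5.6 configuration illustrating the division of work described above (the values are illustrative, not recommendations):

```ini
[mysqld]
innodb_purge_threads=4       # >1 creates worker threads plus a coordinator thread
innodb_purge_batch_size=300  # each thread receives batch_size / purge_threads = 75 pages
```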

Some caveats

For single table tests like Sysbench one obvious problem is that all the worker threads tend to block/serialise on the dict_index_t::lock. This reduces their effectiveness and for such loads it is usually better to have a single dedicated purge thread unless you are using partitions. With multiple tables the purge threads come into their own and can purge records in parallel with minimal contention.

Sunday Sep 19, 2010

MySQL 5.5: InnoDB as Default Storage Engine

Note: this article was originally published on Sept 19, 2010 by Calvin Sun.

MySQL has a well-earned reputation for being easy to use and delivering performance and scalability. In previous versions, MyISAM was the default storage engine. In our experience, most users never changed the default settings. With MySQL 5.5, InnoDB becomes the default storage engine. Again, we expect most users will not change the default settings. But, because of InnoDB, the default settings deliver the benefits users expect from their RDBMS: ACID transactions, referential integrity, and crash recovery. Let’s explore how using InnoDB tables improves your life as a MySQL user, DBA, or developer.


In the first years of MySQL growth, early web-based applications didn’t push the limits of concurrency and availability. In 2010, hard drive and memory capacity and the performance/price ratio have all gone through the roof. Users pushing the performance boundaries of MySQL care a lot about reliability and crash recovery. MySQL databases are big, busy, robust, distributed, and important.

InnoDB hits the sweet spot of these top user priorities. The trend of storage engine usage has shifted in favor of the more efficient InnoDB. With MySQL 5.5, the time is right to make InnoDB the default storage engine.

The Big Change

Starting from MySQL 5.5.5, the default storage engine for new tables is InnoDB. This change applies to newly created tables that don’t specify a storage engine with a clause such as ENGINE=MyISAM. (Given this change of default behavior, MySQL 5.5 might be a logical point to evaluate whether your tables that do use MyISAM could benefit from switching to InnoDB.)

The mysql and information_schema databases, which implement some of the MySQL internals, still use MyISAM. In particular, you cannot switch the grant tables to use InnoDB.

Benefits of InnoDB Tables

If you use MyISAM tables but aren’t tied to them for technical reasons, you’ll find many things more convenient when you use InnoDB tables in MySQL 5.5:

  • If your server crashes because of a hardware or software issue, regardless of what was happening in the database at the time, you don’t need to do anything special after restarting the database. InnoDB automatically finalizes any changes that were committed before the time of the crash, and undoes any changes that were in process but not committed. Just restart and continue where you left off.
  • The InnoDB buffer pool caches table and index data as the data is accessed. Frequently used data is processed directly from memory. This cache applies to so many types of information and speeds up processing so much that dedicated database servers assign most of their physical memory to it.
  • If you split up related data into different tables, you can set up foreign keys that enforce referential integrity. Update or delete data, and the related data in other tables is updated or deleted automatically. Try to insert data into a secondary table without corresponding data in the primary table, and it gets kicked out automatically.
  • If data becomes corrupted on disk or in memory, a checksum mechanism alerts you to the bogus data before you use it.
  • When you do a little advance planning to decide on primary key columns for each table, lots of operations involving those columns are automatically optimized. It is very fast to reference the primary key columns in WHERE clauses, ORDER BY clauses, GROUP BY clauses, and join operations.
  • Inserts, updates, deletes are optimized by an automatic mechanism called change buffering. InnoDB not only allows concurrent read and write access to the same table, it caches changed data to streamline disk I/O.
  • Performance benefits are not limited to giant tables with long-running queries. When the same rows are accessed over and over from a table, a feature called the Adaptive Hash Index takes over to make these lookups even faster, as if they came out of a hash table.

Best Practices for InnoDB Tables

If you have been using InnoDB for a long time, you already know about features like transactions and foreign keys. If not, read about them in the MySQL Reference Manual. To make a long story short:

  • Specify a primary key for every table using the most frequently queried column or columns, or an auto-increment value if there isn’t an obvious primary key.
  • Embrace the idea of joins, where data is pulled from multiple tables based on identical ID values from those tables. For fast join performance, define foreign keys on the join columns, and declare those columns with the same datatype in each table. The foreign keys also propagate deletes or updates to all affected tables, and prevent insertion of data in a child table if the corresponding IDs are not present in the parent table.
  • Turn off autocommit. Committing hundreds of times a second puts a cap on performance (limited by the write speed of your storage device).
  • Bracket sets of related changes, “logical units of work”, with START TRANSACTION and COMMIT statements. While you don’t want to commit too often, you also don’t want to issue huge batches of INSERT, UPDATE, or DELETE statements that run for hours without committing.
  • Stop using LOCK TABLE statements. InnoDB can handle multiple sessions all reading and writing to the same table at once, without sacrificing reliability or high performance.
  • Enable the innodb_file_per_table option to put the data and indexes for individual tables into separate files, instead of in a single giant system tablespace. (This setting is required to use some of the other features, such as table compression and fast truncation.)
  • Evaluate whether your data and access patterns benefit from the new InnoDB table compression feature (ROW_FORMAT=COMPRESSED on the CREATE TABLE statement). You can compress InnoDB tables without sacrificing read/write capability.
  • Run your server with the option --sql_mode=NO_ENGINE_SUBSTITUTION to prevent tables being created with a different storage engine if there is an issue with the one specified in the ENGINE= clause of CREATE TABLE.
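The advice above about autocommit and bracketing logical units of work can be sketched as follows; the schema and values are illustrative:

```sql
-- Group related changes into one transaction instead of
-- paying a commit (log flush) for every single statement:
SET autocommit = 0;
START TRANSACTION;
INSERT INTO orders (customer_id, total) VALUES (42, 99.95);
UPDATE customers SET last_order_at = NOW() WHERE id = 42;
COMMIT;
```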

Recent Improvements for InnoDB Tables (from the Plugin Era)

If you have experience with InnoDB, but not the recent incarnation known as the InnoDB Plugin, read about those new features in the MySQL Reference Manual too. To make a long story short:

  • You can compress tables and associated indexes.
  • You can create and drop indexes with much less performance or availability impact than before.
  • Truncating a table is very fast, and can free up disk space for the operating system to reuse, rather than freeing up space within the system tablespace that only InnoDB could reuse.
  • The storage layout for table data is more efficient for BLOBs and long text fields.
  • You can monitor the internal workings of the storage engine by querying INFORMATION_SCHEMA tables.
  • You can monitor the performance details of the storage engine by querying PERFORMANCE_SCHEMA tables.
  • There are many many performance improvements.

Performance Improvements for InnoDB Tables

  • Crash recovery, the automatic process that makes all data consistent when the database is restarted, is fast and reliable. (It is now much, much faster than long-time InnoDB users are used to.) The bigger the database, the more dramatic the speedup.
  • Most new performance features are automatic, or at most require setting a value for a configuration option. Read details here. Advanced users can review the InnoDB configuration options.
  • Review the InnoDB-specific tuning techniques.

Testing and Benchmarking

It’s never too early to prepare for InnoDB as the default storage engine. You can preview whether your database server or application works correctly with MySQL 5.1 or even earlier. To set up InnoDB as the default storage engine, either specify on the command line --default-storage-engine=InnoDB, or add to your my.cnf file default-storage-engine=innodb in the [mysqld] section, then restart the server.

Since changing the default storage engine only affects new tables as they are created, run all your application installation and setup steps to confirm that everything installs properly. Then exercise all the application features to make sure all the data loading, editing, and querying features work. If a table relies on some MyISAM-specific feature, you’ll receive an error; add the ENGINE=MyISAM clause to the CREATE TABLE statement to avoid the error. (For example, tables that rely on full-text search must be MyISAM tables rather than InnoDB ones.)

If you didn’t make a deliberate decision about the storage engine, and you just want to preview how certain tables work when they’re created under InnoDB, issue the command ALTER TABLE table_name ENGINE=InnoDB; for each table. Or, to run test queries and other statements without disturbing the original table, make a copy like so:
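One common way to make such a copy looks like this; the table names are illustrative:

```sql
-- Clone the table definition, switch the clone to InnoDB,
-- then copy the rows, leaving the original table untouched:
CREATE TABLE innodb_copy LIKE original_table;
ALTER TABLE innodb_copy ENGINE=InnoDB;
INSERT INTO innodb_copy SELECT * FROM original_table;
```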


Since there are so many performance enhancements in the InnoDB that’s part of MySQL 5.5, to get a true idea of the performance with a full application under a realistic workload, you’ll want to install the real MySQL 5.5 and run benchmarks. (Those who use the InnoDB Plugin with MySQL 5.1 have already seen some of these improvements, but to keep things simple, this article focuses on 5.5, where the latest and greatest InnoDB is fully integrated.)

Test the full application lifecycle, from installation, through heavy usage, and server restart. (For bonus points, kill the server process while the database is busy to simulate a power failure, and verify that the data is recovered successfully when you restart the server.)

Test any replication configurations, especially if you use different MySQL versions and options on the master and the slaves, because some of the InnoDB options have different defaults than in previous releases.


To know what the status of InnoDB is, whether you’re doing what-if testing with an older MySQL or comprehensive testing with the real MySQL 5.5:

  • Issue the command SHOW VARIABLES LIKE 'have_innodb'; to confirm that InnoDB is available at all. If the result is NO, you have a mysqld binary that was compiled without InnoDB support and you need to get a different one. If the result is DISABLED, go back through your startup options and configuration file and get rid of any skip-innodb option.
  • Issue the command SHOW ENGINES; to see all the different MySQL storage engines. You want to see DEFAULT in the InnoDB line.

MySQL 5.5: InnoDB Performance Improvements on Windows

Note: this article was originally published on Sept 19, 2010 by Calvin Sun.

At MySQL, we know our users want Performance, Scalability, Reliability, and Availability, regardless of the platform they choose to deploy on. We have always had excellent benchmarks on Linux, and with MySQL 5.5, we are also working hard on improving performance on Windows.

The original patch improving Windows performance was developed by MySQL senior developer Vladislav Vaintroub, with benchmarks by QA engineer Jonathan Miller. We integrated the patch into the MySQL 5.5 release.

The following two charts show the comparison of MySQL 5.5 vs. MySQL 5.1 (plugin) vs. MySQL 5.1 (builtin) using sysbench:

MySQL 5.5 includes a Windows performance patch, plus other performance improvements. This Windows performance patch has two parts: part one fixes the inefficiency of the InnoDB slow mutex implementation on Windows by implementing slow mutexes as CriticalSections; part two takes advantage of condition variables on Windows Vista or newer Windows operating systems. Condition variables are not supported on Windows Server 2003 and Windows XP/2000.

What if I have an old Windows on which condition variables are not supported? Well, you can still benefit from the new slow mutex implementation, but you will not be able to take advantage of condition variables. The same binary runs on old Windows too: InnoDB checks dynamically during start-up whether condition variables are supported by the operating system it runs on.

Another benefit from this patch is the reduced consumption of Windows kernel objects (or handles), which deserves another blog on its own.

Download MySQL 5.5 RC release to give it a try for yourself! Thanks to Vladislav and Jonathan for their hard work!

Hardware configuration and operating system

Intel x86_64, 4 CPUs x 2 cores, 3.166 GHz, 8GB RAM, and Windows Server 2008.

MySQL / InnoDB configuration

Relevant my.cnf parameters:


MySQL 5.5: InnoDB adaptive_flushing – How it works?

Note: this article was originally published on Sept 19, 2010 by Inaam Rana.

Write-heavy workloads can reach a situation where InnoDB runs out of usable space in its redo log files. When that happens, InnoDB does a lot of disk writes to create space, and you can see a drop in server throughput for a few seconds. From InnoDB Plugin 1.0.4 we have introduced the ‘innodb_adaptive_flushing’ method, which uses a heuristic to try to flush enough pages in the background that such bursts of very active writing rarely happen. In this note I’ll try to explain how the heuristic works, i.e., what factors are taken into account when deciding how many dirty pages to flush from the buffer pool in the background. I’ll skip some details for the sake of clarity.

You may find the InnoDB glossary useful for understanding the terminology used in this note. adaptive_flushing is of consequence to you if your workload involves significant write activity. For example, if you are running MySQL 5.0 or MySQL 5.1 with built-in InnoDB and you see periodic drops in performance with corresponding spikes in write activity, then you should consider upgrading to the InnoDB plugin.

Let us first familiarize ourselves with why and how flushing takes place in the buffer pool. Flushing is the activity of writing dirty pages to disk. We flush for two main reasons. First, we may need to read a page into the buffer pool when there are no free buffers. In this case we select a page as victim (based on the LRU), and if the victim is dirty we write it to disk before reusing it. This I’ll call LRU_list flush. Second, we need to write dirty pages to disk when we want to reuse the redo logs. Redo logs in InnoDB (as in most other database engines) are used in a circular fashion. We can overwrite redo log entries if, and only if, the dirty pages corresponding to those entries have been flushed to disk. This type of flushing is called flush_list flush.

adaptive_flushing is related to the flush_list flush (though it is also influenced by the LRU_list flush, as we’ll see shortly). If we allow the redo logs to fill up completely before triggering flushing of dirty pages, the resulting spike in I/O will cause a serious drop in performance. To avoid this we’d like to keep flushing dirty pages in the background at a rate that is just enough to avoid the I/O bursts. Note that we don’t want to flush too aggressively either, because that will reduce the chances of write combining, resulting in more I/O than is required. So how do we determine the rate at which we do flush_list flush in the background? That is where the adaptive_flushing heuristic comes into play.

We know the capacity of the redo logs, which is constant and equal to the combined size of the log files. If we know the rate at which the redo log is being generated, then we can figure out how many seconds we have before the whole redo log is filled up:

n = log_capacity / redo_gen_rate

Assume there are n_dirty dirty pages in the buffer pool. If we are to avoid the I/O burst, we must be able to flush n_dirty pages in n seconds. In other words, our required flushing rate is given by the following formula:

required_flush_rate = n_dirty / n = (n_dirty * redo_gen_rate) / log_capacity

There is one bit missing in the above formula. What if there is an LRU_list flush happening as well? That will be flushing dirty pages too. We don’t want to overdo the flushing effort, so it makes sense to deduct the LRU_list flush rate from the above:

desired_flush_rate = (n_dirty * redo_gen_rate) / log_capacity - LRU_flush_rate
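Plugging some hypothetical numbers into the formula makes it concrete (all values are illustrative, with redo rate and log capacity in MB):

```sql
-- 200,000 dirty pages, redo generated at 8 MB/s, 2048 MB of combined
-- log capacity, LRU flushing already writing 100 pages/s.
-- n = 2048 / 8 = 256 seconds until the redo logs fill up, so:
SELECT 200000 * 8 / 2048 - 100 AS desired_flush_rate_pages_per_sec;  -- about 681 pages/s
```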

That is all there is to it. In order to avoid sudden spikes we keep the history of the LRU_flush_rate and the redo_gen_rate and use weighted averages in our calculations.

As is the case with most heuristics, the above may not work under some workloads. That is why we introduced the innodb_adaptive_flushing parameter, which can be used to disable this feature (it is enabled by default). In MySQL 5.5.4 we introduced multiple buffer pools, meaning there are multiple LRU lists and multiple flush lists; obviously adaptive flushing has had to be adapted for this. In MySQL 5.5.6 we have worked on this further by limiting the excess flushing caused by rounding off and by flushing neighbors of a victim. But this is definitely not the end of the story. Consider this an area under focus: we continue to work towards finding a better solution for the overall flushing inside InnoDB, which is one more reason why your feedback will be more than welcome. All I can say at this stage is give us your feedback and stay tuned!

MySQL 5.5: InnoDB Change Buffering

Note: this article was originally published on Sept 19, 2009 by Marko Mäkelä.

To speed up bulk loading of data, InnoDB implements an insert buffer, a special index in the InnoDB system tablespace that buffers modifications to secondary indexes when the leaf pages are not in the buffer pool. Batched merges from the insert buffer to the index pages result in less random access patterns than when updating the pages directly. This speeds up the operation on hard disks.

In MySQL 5.5, the insert buffer has been extended to a change buffer, which covers all modifications of secondary index leaf pages. This will improve the performance of bulk deletes and updates, transaction rollback and the purging of deleted records (reducing the “purge lag”).

To assess the benefits of the extended buffering, you may want to run benchmarks with the settings innodb_change_buffering=all, innodb_change_buffering=inserts, and innodb_change_buffering=none. Users of solid-state storage, where random reads are about as fast as sequential reads, might benefit from disabling the buffering altogether.
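The setting is a global server variable, so such a benchmark comparison can be sketched as follows (assuming, as documented for 5.5, that the variable is dynamic):

```sql
SET GLOBAL innodb_change_buffering = 'all';     -- buffer inserts, delete-marks and purges
SET GLOBAL innodb_change_buffering = 'inserts'; -- pre-5.5 insert-buffer behaviour
SET GLOBAL innodb_change_buffering = 'none';    -- e.g. for fast solid-state storage
```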

Read on to learn how the change buffering works.

Operations on Secondary Indexes

InnoDB can perform three kinds of modifications on secondary index records. If the affected index page is not in the buffer pool, the modifications can be buffered in the change buffer. When an index lookup or scan needs a page that is not in the buffer pool, the page will be read from the tablespace and any buffered changes will be merged to it.

The following operations can modify secondary index pages:

  • Inserting a record (supported in all versions of InnoDB)
  • Marking a record for deletion
  • Removing a delete-marked record that is no longer accessible by active transactions

Before MySQL 5.5, UPDATE, DELETE and purge operations were performed directly on the index pages, resulting in random-access I/O. In MySQL 5.5, all of these operations can be buffered.

Implications of InnoDB Multiversioning

In InnoDB, there are two types of indexes: the clustered index B-tree, where the records are stored in the PRIMARY KEY order, and secondary index B-trees, which identify rows by primary key. InnoDB multiversion concurrency control (MVCC) treats these indexes differently.

Records in the clustered index can be updated in place, and their hidden system columns DB_TRX_ID, DB_ROLL_PTR point to undo log entries from which earlier versions can be reconstructed. InnoDB secondary index records do not contain any system columns, and their data is never updated in place. An UPDATE of an indexed column requires the operations Delete-mark(old), Insert(new) and eventually Purge(old) in the secondary index. An UPDATE of a PRIMARY KEY results in Delete-mark, Insert and eventually Purge in all indexes.
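The decomposition described above can be made concrete with a sketch; the table and column names are illustrative:

```sql
-- Assume a secondary index on t(k). This statement:
UPDATE t SET k = 5 WHERE pk = 1;
-- becomes, in the secondary index:
--   1. Delete-mark the old entry (old k value, pk = 1)
--   2. Insert a new entry (k = 5, pk = 1)
--   3. Eventually, purge the delete-marked entry once no
--      active transaction can still see it
```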

When a secondary index record has been marked for deletion or when the page has been updated by a newer transaction, InnoDB will look up the clustered index record. In the clustered index, it suffices to check the DB_TRX_ID and only retrieve the correct version from the undo log when the record was modified after the reading transaction started.

To Buffer or not to Buffer

When a page is in the buffer pool, it will always be updated directly. When a page is loaded to the buffer pool, any buffered changes will be merged to it, so that users never see unmerged changes.

Because change buffering works on individual leaf pages, we cannot buffer changes that would result in page splits or merges, but must perform such changes on the B-tree pages directly.

The insert buffer bitmap keeps track of the available space on pages and prevents overflows when buffering inserts. Delete-marking records can always be buffered, because the flag will be updated in place. Purging a delete-marked record could result in an empty page, something that we do not allow. We determine the non-emptiness of a page from previously buffered operations on the same page. If there are no previously buffered operations, the purge will have to load the index page into the buffer pool.

InnoDB refuses to buffer an operation when the on-disk change buffer tree would grow bigger than ⅓ of the in-memory buffer pool (innodb_buffer_pool_size). This might be a good rule-of-thumb, but some setups could benefit from the ability of setting the change buffer size independently of the buffer pool size.


The InnoDB change buffer is a persistent data structure and a complex mechanism that comes into play when the workload does not fit in the buffer pool. Because it trades random I/O with a larger amount of sequential I/O, it speeds up operation on hard disks, where random access is much slower than sequential access.

On solid-state storage, there is not much difference between sequential and random access times. Change buffering may still be useful if writes to solid-state storage are expensive, either in terms of speed or the consumption of limited program/erase cycles. Change buffering could reduce the write load on user tablespaces and cause more writes to the system tablespace (which contains the insert buffer) and the redo log. These should be placed on a hard disk.

InnoDB Revision History

Note: this article was originally published on Sept 19, 2009 by Vasil Dimov.

This is a brief overview of the history of InnoDB with respect to the Version Control Systems (VCS) that were used for developing. It could be useful to people who want to trace back in time to find the origins of bugs or features.

Early days
InnoDB was born in the mid 1990s when Heikki Tuuri started developing this masterpiece. Heikki was a lone developer at that time and he did not use any VCS. There is no InnoDB revision history before 2001.

2001 – 2005
Then in 2001 InnoDB was imported into MySQL’s BitKeeper repository and development continued, recording the history there.

2005 – 2010
Later on in Oct 2005 when Oracle acquired Innobase, InnoDB developers had to move away from MySQL’s BitKeeper repository and Subversion was chosen for InnoDB development. The latest sources from BitKeeper were imported into SVN without preserving the history and Innobase started sending snapshots of InnoDB to MySQL. In other words – InnoDB devs committed changesets into SVN and then batches of those changesets were sent to MySQL. But those batches were applied as a single changeset in BitKeeper with a message like “Apply InnoDB snapshot XYZ” which made the revision history of InnoDB in BitKeeper after Oct 2005 meaningless.

In the meantime MySQL switched from BitKeeper to Bazaar, preserving all the history that BitKeeper had. This did not affect InnoDB development and revision history, except that Bazaar was easier to use than the free and limited BitKeeper client. The snapshots thing continued.

To summarize: until early 2010 InnoDB history was split in two – the before 2005 part was in Bazaar and after 2005 part was in Subversion.

2010 – present
In Mar 2010, after Oracle acquired Sun and MySQL, InnoDB development moved into Bazaar. We wanted to keep the precious history from SVN, but it was not possible to retrospectively swap the 2005-2010 history in Bazaar which contained changesets like “Apply InnoDB snapshot XYZ” with the meaningful history from Subversion. Instead, the InnoDB directory in the MySQL source tree was deleted and all revisions from SVN were applied one by one on top of the empty directory.

This way commands like “bzr blame” and “bzr log” show the relevant bits from the SVN history, back to 2005. If you want to dig into history before 2005, you must branch the MySQL source as it was before the SVN import (e.g. bzr branch -rdate:2010-03-01). Branching the pre-SVN-import tree will give access to the 2001-2005 real history and the 2005-2010 “apply snapshot” history.

Monday May 17, 2010

InnoDB recovery gets even faster in Plugin 1.1, thanks to native AIO

Note: this article was originally published on April 16, 2010 by Michael Izioumtchenko.

InnoDB Plugin 1.1 doesn’t add any recovery specific improvements on top of what we already have in Plugin 1.0.7. The details on the latter are available in this blog. Yet, when I tried to recover another big recovery dataset I created, I got the following results for total recovery time:

  • Plugin 1.0.7: 46min 21s
  • Plugin 1.1: 32min 41s

Plugin 1.1 recovery is about 1.4 times faster. Why would that happen? The numerous concurrency improvements in Plugin 1.1 and MySQL 5.5 can’t really affect the recovery. The honor goes to Native Asynchronous IO on Linux. Let’s try without it:

  • Plugin 1.1 with –innodb-use-native-aio=0: 49min 07s

which is about the same as the 1.0.7 time. My numerous other recovery runs showed that random fluctuations account for 2-3 min of a 30-45 min test.

Why is native AIO good for you? Why is it better than the simulated AIO we already have? Here’s what Inaam Rana, our IO expert and the author of the AIO patch, says:

  • During recovery, redo log application is typically performed by the IO helper threads in the completion routine.
  • With simulated AIO, the thread waits for the IO to complete and then calls the completion routine.
  • With native AIO, the thread doesn’t have to wait for the IO to complete; instead it picks a completed request and applies redo to it.

Read more about native AIO here.

You don’t have to do anything to take advantage of this feature. It is enabled by default and is used where available as determined by configure.

Here are some details about the test environment:

Hardware: HP DL480, 32G RAM, 2 × 4-core Intel(R) Xeon(R) CPU E5450 @ 3.00GHz, RAID5, about 1T total storage

Dataset: 1757549 dirty pages, 2808364565 bytes of redo. For the curious, it was a sysbench table with 400 million rows, and the workload I used was random row updates by a simple perl script. Note that this is over 28G worth of dirty pages, which means I had to use very abusive settings of innodb_buffer_pool=29G and innodb_max_dirty_pages_pct=99, given only 32G of RAM. The recovery was done using the same settings, and in the first few attempts the recovery would fail because of what was eventually diagnosed as bug 53122. As it happens, InnoDB recovery uses some memory outside of the buffer pool, and it wanted more of it than was really necessary.

InnoDB configuration parameters:


This is a highly artificial setup that targets maximizing the generation of dirty pages and redo, using as much memory as possible for those dirty pages.

Friday Apr 16, 2010

Post-Conference Roundup of InnoDB-related Info

Note: this article was originally published on April 16, 2010 by John Russell.

What a busy week! Lots of MySQL 5.5 announcements that just happened to coincide with the MySQL Conference and Expo in Silicon Valley. Here are some highlights of the performance and scalability work that the InnoDB team was involved with.

A good prep for the week of news is the article Introduction to MySQL 5.5, which includes information about the major performance and scalability features. That article will lead you into the MySQL 5.5 manual for general features and the InnoDB 1.1 manual for performance & scalability info.

Then there were the conference presentations from InnoDB team members, which continued the twin themes of performance and scalability:

  • InnoDB: Status, Architecture, and New Features: Slides
  • InnoDB Plugin: Performance Features and Benchmarks: Slides
  • What’s New in MySQL 5.5? Performance/Scale Unleashed!: Slides
  • What’s New in MySQL 5.5?: Performance and Scalability Benchmarks: Slides
  • Introduction to InnoDB Monitoring System and Resource & Performance Tuning: Slides
  • Backup Strategies with InnoDB Hot Backup: Slides

We hope that a good and useful time was had by all. Best regards to our European friends and colleagues whose return plans were disrupted by the Icelandic volcano!


This is the InnoDB team blog.
