Monday Apr 11, 2011

InnoDB Persistent Statistics at last

Note: this article was originally published on April 11, 2011 by Vasil Dimov.


InnoDB gathers statistics for the data in user tables, which are used by the MySQL optimizer to choose the best query plan. For a long time the imprecision and instability of these statistics have been creating problems for users.

The problem is that these statistics are recalculated at any of the following events:

* When the table is opened

* When the table has changed a lot (1/16th of the table has been updated/deleted or inserted)

* When ANALYZE TABLE is run


* When InnoDB Monitor is turned ON

* others

and so their recalculation must be quick and unnoticeable. Thus the quick algorithm just picks 8 random pages, and it can give wildly varying results.

Quick glance

To solve this problem MySQL 5.6.2 introduces the InnoDB Persistent Statistics feature, which:

* Uses a more precise sampling algorithm (which is also slower) in an attempt to better inform the MySQL optimizer so it can choose the best possible query plan.

* Does not recalculate the statistics automatically, only when ANALYZE TABLE is run. This means that the same query plan will always be chosen even when MySQL is restarted (see below). A query plan using the global statistics can only change after ANALYZE TABLE has been run (manually). This also means that the user and/or DBA is responsible for running ANALYZE TABLE regularly if the data in the table has been changed a lot.

* Saves the statistics on disk in normal user-visible and updatable SQL tables.

How to turn it ON?

To turn ON the persistent statistics for a given table you must first create the persistent statistics storage by executing the SQL script storage/innobase/scripts/persistent_storage.sql which is shipped in the source distribution (in future versions it may be executed automatically by mysql_install_db). This script will create a new database named “innodb” with two tables in it – “table_stats” and “index_stats”. InnoDB will read/write the statistics from/to those tables. Second, you must set the new parameter innodb_analyze_is_persistent to ON (it is OFF by default). This will instruct InnoDB to use the better (and slower!) sampling algorithm during ANALYZE TABLE and to save the results to the stats tables. To change the number of leaf pages that are sampled during such a “persistent-results” ANALYZE TABLE, set the parameter innodb_stats_persistent_sample_pages to a higher value.

Once the stats “meta” tables are present and stats for a given table exist in them, those stats will be used whenever stats are needed – including table open, SHOW TABLE STATUS, SHOW INDEX FROM, and others. To stop using them you must DELETE the corresponding rows from innodb.table_stats and innodb.index_stats.
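The workflow above can be sketched as SQL. The parameter and table names come from the text, but the WHERE clauses assume the stats tables identify rows by database_name/table_name columns, and “mydb.mytable” is a hypothetical table – treat this as a hedged sketch, not exact 5.6.2 syntax:

```sql
-- Hedged sketch for MySQL 5.6.2 as described above.
SET GLOBAL innodb_analyze_is_persistent = ON;
SET GLOBAL innodb_stats_persistent_sample_pages = 100;  -- sample more leaf pages

ANALYZE TABLE mydb.mytable;  -- slower sampling; results saved to the innodb.* tables

-- Inspect the persisted statistics (column names assumed):
SELECT * FROM innodb.table_stats
 WHERE database_name = 'mydb' AND table_name = 'mytable';

-- Stop using persistent statistics for this table:
DELETE FROM innodb.table_stats
 WHERE database_name = 'mydb' AND table_name = 'mytable';
DELETE FROM innodb.index_stats
 WHERE database_name = 'mydb' AND table_name = 'mytable';
```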

Beware of bugs

Some bugs in the persistent statistics code that may cause InnoDB to crash have been fixed, but the fixes are not included in 5.6.2.

Information Schema for InnoDB System Tables

Note: this article was originally published on April 11, 2011 by Jimmy Yang.

One of the most important things a database user or DBA wants to know is what columns, indexes, etc. a table has. There are a few ways to find these things out, such as SHOW TABLES. However, to really reveal all of the detailed metadata in an InnoDB database, many have tried “innodb_table_monitor” to peek into the internal data dictionary. By creating an “innodb_table_monitor” table, you make InnoDB print out the contents of its metadata periodically. Unfortunately, the output is unstructured text for each table, and you would have to either carefully scan it or write your own parser to do additional analysis if you want to display the results systematically.

Well, in the MySQL 5.6 release, “innodb_table_monitor” can become history, and you will no longer need to search the text output to find out system metadata information. Instead, you can query the InnoDB internal system tables through Information Schema System Tables we implemented in this release.

To begin with, let’s take a look at six of the system tables in InnoDB: SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, SYS_FIELDS, SYS_FOREIGN and SYS_FOREIGN_COLS.

The names of these tables explain what each one contains: SYS_TABLES holds basic information about each table, each table’s index information is stored in SYS_INDEXES, and its column information is stored in SYS_COLUMNS. Each index’s field information is stored in SYS_FIELDS. And if a table contains foreign keys, that information is held in SYS_FOREIGN and SYS_FOREIGN_COLS. You can get a good idea of what a table or index looks like after digging into all the detailed information stored in these system tables.

When a table is being used, the content of each system table is loaded into an in-memory structure (its in-memory representation). For example, SYS_TABLES info is loaded into dict_table_t, SYS_INDEXES info is loaded into the dict_index_t structure, etc. There can be additional information stored in each system table’s in-memory representation beyond its on-disk content. For example, some runtime statistics (such as rows inserted/updated) are stored in dict_table_t. To accommodate that, we also created a seventh Information Schema table called SYS_TABLESTATS to display such information.

So in summary we have added the following seven System Tables:

mysql> show tables like 'INNODB_SYS%';
+--------------------------------------------+
| Tables_in_information_schema (INNODB_SYS%) |
+--------------------------------------------+
| INNODB_SYS_FIELDS                          |
| INNODB_SYS_INDEXES                         |
| INNODB_SYS_TABLESTATS                      |
| INNODB_SYS_COLUMNS                         |
| INNODB_SYS_FOREIGN_COLS                    |
| INNODB_SYS_FOREIGN                         |
| INNODB_SYS_TABLES                          |
+--------------------------------------------+
7 rows in set (0.00 sec)

Now let’s go through these tables using an example. Suppose we have a table named “test_table”, which has one index “idx”:

mysql> create table test_table(col1 int, col2 char(10), col3 varchar(10)) engine = innodb;

mysql> create index idx on test_table(col1);

Let’s see what’s in each system table:


mysql> select * from INNODB_SYS_TABLES \G
*************************** 1. row ***************************
...
*************************** 2. row ***************************
...
*************************** 3. row ***************************
TABLE_ID: 13
    NAME: test/test_table
    FLAG: 1
  N_COLS: 6
   SPACE: 1
3 rows in set (0.00 sec)


We can see that table “test_table” is listed there with TABLE_ID 13. It has 6 columns (N_COLS) because InnoDB adds three hidden columns (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) to the table. The table flag of 1 shows it is of the “DICT_TF_COMPACT” type, and its space ID is 1.

Then we can select from INNODB_SYS_COLUMNS to find more information about its columns:


mysql> select * from INNODB_SYS_COLUMNS where TABLE_ID = 13 \G
*************************** 1. row ***************************
NAME: col1
POS: 0
PRTYPE: 1027
LEN: 4
*************************** 2. row ***************************
NAME: col2
POS: 1
PRTYPE: 524542
LEN: 10
*************************** 3. row ***************************
NAME: col3
POS: 2
PRTYPE: 524303
LEN: 10
3 rows in set (0.00 sec)


So it shows all three of its columns, with each column’s position, type and length.

Let’s select from INNODB_SYS_INDEXES to find out its indexes:


mysql> select * from INNODB_SYS_INDEXES where TABLE_ID = 13 \G
*************************** 1. row ***************************
    NAME: GEN_CLUST_INDEX
    ...
*************************** 2. row ***************************
INDEX_ID: 16
    NAME: idx
    ...
2 rows in set (0.00 sec)


It has two indexes instead of one! Well, InnoDB actually creates an internal clustered index called “GEN_CLUST_INDEX” if there is no user-defined clustered index for the table. INNODB_SYS_INDEXES also reveals the root page of each index in the PAGE_NO field.

Note that the user index “idx” has INDEX_ID 16. We can use it to select from INNODB_SYS_FIELDS to find out the indexed columns:


mysql> select * from INNODB_SYS_FIELDS where INDEX_ID = 16 \G
*************************** 1. row ***************************
NAME: col1
POS: 0
1 row in set (0.00 sec)


Lastly, let’s look into INNODB_SYS_TABLESTATS by inserting a row into “test_table”:


mysql> insert into test_table values(9, 'me', 'here');
Query OK, 1 row affected (0.06 sec)

mysql> select * from INNODB_SYS_TABLESTATS where TABLE_ID = 13 \G
*************************** 1. row ***************************
        TABLE_ID: 13
            NAME: test/test_table
        NUM_ROWS: 1
MODIFIED_COUNTER: 1
1 row in set (0.00 sec)


So it shows the number of rows inserted through the “NUM_ROWS” field and the number of modifications through the “MODIFIED_COUNTER” field.

If tables have a foreign key relationship, that information can be displayed via the INNODB_SYS_FOREIGN and INNODB_SYS_FOREIGN_COLS tables:


CREATE TABLE child (id INT, parent_id INT,
INDEX par_ind (parent_id),
CONSTRAINT constraint_test
FOREIGN KEY (parent_id) REFERENCES parent(id)
) ENGINE=INNODB;

mysql> select * from INNODB_SYS_FOREIGN \G
*************************** 1. row ***************************
ID: test/constraint_test
FOR_NAME: test/child
REF_NAME: test/parent
1 row in set (0.00 sec)

mysql> select * from INNODB_SYS_FOREIGN_COLS \G
*************************** 1. row ***************************
ID: test/constraint_test
FOR_COL_NAME: parent_id
POS: 0
1 row in set (0.00 sec)


So INNODB_SYS_FOREIGN and INNODB_SYS_FOREIGN_COLS clearly describe the foreign key relationship between the “child” and “parent” tables and the related foreign key column.

All seven of these tables can be joined through either TABLE_ID or INDEX_ID, so you can fetch all the information for exactly the table you want to study. And by exposing these system tables as relational tables themselves, users can query them and get exactly the information they are interested in. This makes a great interface for any external tool that displays database content.
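As a sketch of such a join (the table names and the TABLE_ID/INDEX_ID/NAME/POS columns all appear in the outputs above; the query itself is a hypothetical illustration):

```sql
-- List every index of test_table together with its indexed columns,
-- joining the system tables on TABLE_ID and INDEX_ID.
SELECT t.NAME AS table_name,
       i.NAME AS index_name,
       f.NAME AS column_name,
       f.POS  AS pos
  FROM information_schema.INNODB_SYS_TABLES  t
  JOIN information_schema.INNODB_SYS_INDEXES i ON i.TABLE_ID = t.TABLE_ID
  JOIN information_schema.INNODB_SYS_FIELDS  f ON f.INDEX_ID = i.INDEX_ID
 WHERE t.NAME = 'test/test_table'
 ORDER BY index_name, pos;
```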

There is another benefit of the Information Schema system tables: by design, the data is read from the system tables directly, rather than fetched from their in-memory representations (as innodb_table_monitor does). So it is a true representation of what is on disk. If there is a mismatch (unlikely) between the in-memory information and the on-disk data, these system tables will display the real image on disk, helping DBAs and even developers debug or better understand the system metadata behavior.

Introducing page_cleaner thread in InnoDB

Note: this article was originally published on April 11, 2011 by Inaam Rana.

In MySQL 5.6.2 we have introduced a new background thread named the page_cleaner in InnoDB. Adaptive flushing of modified buffer pool pages in the main background thread, async flushing by a foreground query thread if the log files are near to wrapping around, idle flushing and shutdown flushing are now moved to this thread, leaving only the last resort sync flushing in foreground query threads. We’ve also added counters for these activities.

As page_cleaner is all about the flushing of dirty pages to disk, it’ll do you a world of good to go through my earlier post where I explained the different types of flushing that happen inside InnoDB and the conditions that trigger flushing. The page_cleaner thread is only concerned with flush_list flushing (this may change in future releases). So let us dig a bit deeper into why flush_list flushing happens and why it would make sense to do this flushing in a separate thread. As is usually the case, I have to skip some details to keep this note simple.

On a busy server flush_list flushing (which I’ll simply call flushing from this point onwards) happens under four conditions. In order to understand these conditions let us familiarize ourselves with the concept of checkpoint_age. The checkpoint_age is the difference between the current_lsn (the latest change to the database) and the last checkpoint_lsn (the lsn when the last checkpoint happened). We obviously don’t want to let this difference grow beyond the log file size, because if that happens then we end up overwriting redo log entries before the corresponding dirty pages are flushed to disk, losing the ability to recover them. In order to avoid this situation we maintain two high water marks to indicate if we are nearing the end of reusable redo log space. Let’s call these water marks async_water_mark and sync_water_mark, where the latter represents a more urgent situation than the former. Now that we have clarified the checkpoint_age concept, let us get back to the four conditions under which flushing of dirty pages happens:

  1. checkpoint_age < async_water_mark
    • This condition means that we have enough reusable redo space. As such there is no hurry to flush dirty pages to the disk. This is the condition where we’d like our server to be most of the time.
    • Based on adaptive_flushing heuristics we flush some dirty pages in this state. This flushing happens in the background master thread.
    • During the flushing no other threads are blocked, so queries continue normally.

  2. async_water_mark < checkpoint_age < sync_water_mark
    • As we move past the first water mark we try to bring some more urgency to our flushing. The query thread noticing this condition will trigger a flush and will wait for that flushing to end. This type of flushing does not happen in the background.
    • Other query threads are allowed to proceed. Therefore we call it async flushing because only the query thread that is doing the flushing is blocked.
  3. checkpoint_age > sync_water_mark
    • This is like a panic button. We have very little reusable redo log space available. The query thread detecting this condition immediately starts flushing.
    • Other query threads are blocked. The idea is to stop the advance of checkpoint_age.
    • This type of flushing not only happens in the foreground, it actually tends to bring the whole system to a stall.
  4. %n_dirty_pages > innodb_max_dirty_page_pct
    • The percentage of dirty pages in the buffer pool exceeds the user-settable value of innodb_max_dirty_page_pct.
    • The flushing happens in the background master thread and is non-blocking for query threads.
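To see where your server sits relative to these water marks, the checkpoint age can be read off the LOG section of SHOW ENGINE INNODB STATUS (a long-standing command, not specific to 5.6):

```sql
SHOW ENGINE INNODB STATUS\G
-- In the LOG section, look for:
--   Log sequence number ...   (the current_lsn)
--   Last checkpoint at  ...   (the last checkpoint_lsn)
-- checkpoint_age = current_lsn - last_checkpoint_lsn
```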

The page_cleaner thread:

As explained above flushing can happen in the query thread e.g.: the async and sync flushing. It can also happen in the background master thread e.g.: the adaptive flushing and the max_dirty_page_pct flushing. There are two issues with this scheme. First, the master thread is also tasked to do other background activities. It has to do change buffer merges and possibly purge (though starting with 5.5 we have an option to use a dedicated thread for purge and in 5.6.2 we can even have multi-threaded purge). Under very heavy workload it is possible that the master thread is unable to find enough time to flush dirty pages. The second issue is with async flushing. It should be a background task but it is executed in the query thread. While other threads are allowed to proceed, the unfortunate thread which detects the condition is blocked on a huge dirty page flush.

To address these two issues we came up with the idea of having a dedicated background thread and named it the page_cleaner thread. All background flushing activity previously done in the master thread is offloaded to the page_cleaner. Also, the async flushing is now a background task performed in the page_cleaner thread. Query threads are only ever blocked for flushing if we cross the sync flushing water mark. The page_cleaner thread wakes up every second, checks the state of the system and performs the flushing activity if required. The flushing that happens when the server is idle or at shutdown is now also done by the page_cleaner thread.

Finally we have added some counters to the innodb_metrics table related to the above four types of flushing to give you a picture of how your system is behaving.

mysql> select name, comment from information_schema.innodb_metrics where name like 'buffer_flush_%';
+--------------------------------+-------------------------------------------------------+
| name                           | comment                                               |
+--------------------------------+-------------------------------------------------------+
| buffer_flush_adaptive_flushes  | Occurrences of adaptive flush                         |
| buffer_flush_adaptive_pages    | Number of pages flushed as part of adaptive flushing  |
| buffer_flush_async_flushes     | Occurrences of async flush                            |
| buffer_flush_async_pages       | Number of pages flushed as part of async flushing     |
| buffer_flush_sync_flushes      | Number of sync flushes                                |
| buffer_flush_sync_pages        | Number of pages flushed as part of sync flushing      |
| buffer_flush_max_dirty_flushes | Number of flushes as part of max dirty page flush     |
| buffer_flush_max_dirty_pages   | Number of pages flushed as part of max dirty flushing |
+--------------------------------+-------------------------------------------------------+
8 rows in set (0.00 sec)

Friday Apr 08, 2011

MySQL 5.6: Data dictionary LRU

Note: this article was originally published on April 8, 2011 by Sunny Bains.

In MySQL 5.6 we’ve added a new feature that closes and unloads table instances from the InnoDB internal data dictionary once a user-configurable threshold is reached. This ends the situation where you could have hundreds of megabytes of memory caching rarely used entries until the server was restarted, and it will be particularly appreciated by hosting and software-as-a-service providers.

For this we’ve used an existing MySQL config variable, table-definition-cache. This cache limit is a soft limit: if the user has more than table-definition-cache tables open, InnoDB will not force eviction of the in-use tables from the InnoDB data dictionary cache.

Users don’t really have to do anything to make use of this feature. It should pick up the existing setting (default or explicitly set) and just work. This feature should benefit users that have 1000s of tables and open a subset of tables only rarely. The LRU mechanism will eventually mark them for eviction and remove them from the data dictionary cache.
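For example, you can inspect the limit that the eviction mechanism picks up (the value shown is MySQL’s historical default, not a recommendation):

```sql
SHOW GLOBAL VARIABLES LIKE 'table_definition_cache';
-- Or set it at server startup in my.cnf:
--   [mysqld]
--   table_definition_cache = 400
```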

Other benefits

Some of the additional benefits of this change are internal to InnoDB and help improve the design. We now have well-defined interfaces for opening and closing tables; all internal sub-systems need to use the same interface as the MySQL table handler. This loosens the tight coupling between the various InnoDB sub-systems and the data dictionary, which will make it easier to add features to the InnoDB data dictionary.

Future work

This feature is by no means 100% complete. Currently we don’t evict tables that are in a foreign key relationship, either referenced tables or tables that refer to another table. Handling of such tables is work in progress, so stay tuned.

Thursday Apr 07, 2011

Tips and Tricks for Faster DDL

Note: this article was originally published on April 6, 2011 by Calvin Sun.

Data Definition Language (DDL) operations have traditionally been slow in MySQL. Any change to the table definition would be implemented by creating a copy of the table and index layout with the requested changes, copying the table contents row by row, and finally renaming tables and dropping the original table.

The InnoDB Plugin for MySQL 5.1 implements a more efficient interface for creating and dropping indexes. Indexes can be created or dropped without rebuilding the entire table.

Speeding up Bulk INSERT, UPDATE and DELETE Operations

Normally, InnoDB would update all indexes of a table when rows are inserted or deleted. If you update an indexed column, InnoDB would have to delete the old value and insert the new value in the corresponding index. If you update a primary key column, the row would be deleted and inserted in every index of the table. This can be slow even with the change buffering in MySQL 5.5.

Bulk data imports, bulk deletes and bulk updates could be executed faster if they did not have to update too many secondary indexes. Provided that there are no operations running that would benefit from the existence of the indexes while the bulk operation is in progress, it could be useful to drop the affected indexes before starting a bulk operation and to create the indexes after the operation.

Consider this artificial example:

CREATE TABLE friends (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name CHAR(50) NOT NULL,
age INT,
hair_color ENUM('red','black','blond','grey','artificial','other'),
INDEX(name), INDEX(age)
) ENGINE=InnoDB;

Yes, it is a bad idea to have age in the table, because it depends on the current time. A birthday would be more accurate and would not need to be updated. When you update age, you will be updating every non-NULL row. It probably pays off to drop the affected index for the duration of the operation. Other indexes can be left alone, because age is not part of them or the PRIMARY KEY:

ALTER TABLE friends DROP INDEX age;
UPDATE friends SET age=age+1;
ALTER TABLE friends ADD INDEX(age);

Bulk INSERT operations can be sped up by creating the secondary indexes afterwards. In this way, the records will be merge sorted and then inserted into the InnoDB index B-trees in order, using sequentially allocated pages. This will create a more optimal layout especially for indexes whose columns are updated rarely. To continue our example:

LOAD DATA INFILE 'friends.txt' INTO TABLE friends;
ALTER TABLE friends ADD INDEX (name), ADD INDEX (age);

At some point, we may get suspicious about grey-haired people who are younger than 30 years:

DELETE FROM friends WHERE age<30 AND hair_color='grey';

Depending on the number of affected rows, it might pay off to drop and re-create all indexes, or to optimize the whole table.
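For example, assuming many rows were affected, either of these rebuilds the touched structures (the index names match the friends example above):

```sql
-- Rebuild the whole table (copies all rows and all indexes):
OPTIMIZE TABLE friends;

-- Or drop and re-create only the secondary indexes:
ALTER TABLE friends DROP INDEX name, DROP INDEX age;
ALTER TABLE friends ADD INDEX(name), ADD INDEX(age);
```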


The OPTIMIZE TABLE statement in MySQL makes a copy of the table, copying rows one by one to every index. Because the secondary indexes are not likely to be in the order of the PRIMARY KEY (for example, the (age,id) tuples in the secondary index B-tree age might not be (42,1),(45,2),(50,4),(65,7) but (50,1),(42,2),(65,5),(45,8)) this will lead to the secondary indexes being populated out of order.

If you want to create every index in order, you might want to mimic OPTIMIZE TABLE like this:

CREATE TABLE optimized_friends LIKE friends;
ALTER TABLE optimized_friends DROP INDEX age, DROP INDEX name;
INSERT INTO optimized_friends SELECT * FROM friends;
ALTER TABLE optimized_friends ADD INDEX(age), ADD INDEX(name);
RENAME TABLE friends TO unoptimized_friends;
RENAME TABLE optimized_friends TO friends;
DROP TABLE unoptimized_friends;

The INSERT…SELECT will copy the data in the PRIMARY KEY order, which should allocate and populate the clustered index B-tree in sequential order. Then, the ALTER TABLE will create the secondary indexes. As the index entries will be merge sorted by InnoDB before inserting, the secondary indexes will be in order as well.

Issuing one ALTER TABLE … ADD INDEX statement instead of multiple CREATE INDEX statements has the benefit that the data rows will be scanned only once to create the temporary merge-sort files for all the new indexes. If you are short on temporary file space, you might not want to create all the indexes of a large table in one go.

If you suspect that only some secondary indexes are fragmented, you can DROP and ADD only those indexes.


While the fill factor of the index B-trees cannot be controlled directly in InnoDB, it may happen that sequentially inserted records result in a more tightly packed tree than records inserted in random order. This can be either a good or a bad thing, depending on the number and nature of future modifications.

If you want the old table-copying behavior back, you can SET old_alter_table=1.

Wednesday Apr 06, 2011

MySQL 5.6: Multi threaded purge

Note: this article was originally published on April 6, 2011 by Sunny Bains.


What does purge exactly do and why is it needed? If you have ever wondered then read on. It is really a type of garbage collector. When a user issues a DML like “DELETE FROM t WHERE c = 1;”, InnoDB doesn’t remove the matching record. This is what happens under the hood:

  1. It marks the record as deleted by setting a bit in the control bits of the record.
  2. Stores the before image of the modified columns in the UNDO log.
  3. Updates the system columns DB_TRX_ID and DB_ROLL_PTR in the clustered index record. DB_TRX_ID identifies the transaction that made the last change, and DB_ROLL_PTR points to the new UNDO log record. This UNDO log record contains the old values of DB_TRX_ID and DB_ROLL_PTR, possibly pointing to an older transaction and undo log entry.

From this you should be able to visualise that the UNDO log records related to the modified clustered index record form a disk-based linked list, with the head anchored in the clustered index record. For the sake of simplicity I’ve ignored the UNDO log pages and the case where a DML updates a record. This information is required by rollback and MVCC (multi-version concurrency control). For MVCC we need the entry to exist in the clustered index so that we can follow a pointer back to where the “before” changes were written in the UNDO log and use that information to construct a previous version of the record. For rollback we need the before image of the record (the UNDO entry) so that we can restore it when a transaction is rolled back.

Another benefit of purging separately in the background is that expensive B+Tree block merge operations, if the removal of the record were to lead to underflow, can be done asynchronously by purge and not by user transactions.

Garbage collection

Once a transaction has been committed, the UNDO log entries and delete-marked records written by it may be needed by other transactions for building old versions of the record. When there is no transaction left that would need the data, the delete-marked records and the related UNDO log records can be purged. For this reason we have a dedicated (and aptly named) purge function that runs asynchronously within InnoDB. Another problem is that InnoDB clustered and secondary indexes can have multiple entries for the same key. These need to be removed (garbage collected) by purge too (see InnoDB Change Buffering for details regarding secondary indexes).

How does purge work?

Purge clones the oldest view in the system. This view has the control information that limits what changes are visible, namely the ones made by other transactions. By cloning the oldest view, purge can determine that there can’t be any active transaction older than the oldest view’s low limit in the system. Purge then reads the UNDO log entries in reverse, starting from the oldest to the latest. It parses these entries and removes the delete-marked entries from the clustered and secondary indexes, if they are not referenced by any currently running transaction.

Problem with the old design

In versions prior to 5.5 this purge function was part of the responsibility of the InnoDB master thread. At fixed intervals it would run purge asynchronously. The problem with this approach was that the master thread was also responsible for flushing (writing to disk) dirty pages, among other tasks. A high load on the server that dirtied a lot of pages would force the master thread to spend most of its time flushing, and therefore no purging would get done, and vice versa.

Changes in 5.5

In 5.5 there is an option innodb-purge-threads=[0,1] to create a dedicated thread that purges asynchronously if there are UNDO logs that need to be removed. We also introduced another option, innodb-purge-batch-size, that can be used to fine-tune purge operations. The batch size determines how many UNDO log pages purge will parse and process in one pass. The default setting is 20, the same as the hard-coded value in previous InnoDB releases. An interesting side effect of this value is that it also determines when purge will free the UNDO log pages after processing them: always after 128 passes. This magic value of 128 is the same as the number of UNDO logs in the system tablespace, now that 5.5 has 128 rollback segments. Increasing innodb-purge-batch-size changes the freeing behaviour: it increases the number of UNDO log pages that are removed in a batch when the limit of 128 passes is reached. This change was seen as necessary so that we could reduce the cost of removing the UNDO log pages for the extra 127 rollback segments that were introduced in 5.5. Prior to this change, iterating over the 128 rollback segments to find the segment to truncate had become expensive.

Changes in 5.6

In 5.6 we have the same parameters as 5.5, except that innodb-purge-threads can now be between 0 and 32. This introduces true multi-threaded purging. If the value is greater than 1 then InnoDB will create that many purge worker threads and a dedicated purge coordinator thread. The responsibility of the purge coordinator thread is to parse the UNDO log records and parcel out the work to the worker threads. The coordinator thread also purges records, instead of just sitting around and waiting for the worker threads to complete. The coordinator thread will divide the innodb-purge-batch-size by innodb-purge-threads and hand that out as the unit of work for each worker thread.
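Putting these options together in a my.cnf sketch (the option names come from the text; the values are illustrative, not recommendations):

```ini
[mysqld]
# >1 creates that many worker threads plus a dedicated coordinator (5.6)
innodb-purge-threads=4
# UNDO log pages parsed and processed per purge pass;
# the batch is divided by innodb-purge-threads per worker
innodb-purge-batch-size=300
```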

Some caveats

For single table tests like Sysbench one obvious problem is that all the worker threads tend to block/serialise on the dict_index_t::lock. This reduces their effectiveness and for such loads it is usually better to have a single dedicated purge thread unless you are using partitions. With multiple tables the purge threads come into their own and can purge records in parallel with minimal contention.

Sunday Sep 19, 2010

MySQL 5.5: InnoDB adaptive_flushing – How it works?

Note: this article was originally published on Sept 19, 2010 by Inaam Rana.

Write-heavy workloads can reach a situation where InnoDB runs out of usable space in its redo log files. When that happens, InnoDB does a lot of disk writes to create space and you can see a drop in server throughput for a few seconds. From InnoDB plugin 1.0.4 we have introduced the ‘innodb_adaptive_flushing’ method, which uses a heuristic to try to flush enough pages in the background so that such bursts of very active writing rarely happen. In this note I’ll try to explain how the heuristic works, i.e. what factors are taken into account when deciding how many dirty pages to flush from the buffer pool in the background. I’ll skip some details for the sake of clarity.

You may find the InnoDB glossary useful for understanding the terminology used in this note. adaptive_flushing is of consequence to you if your workload involves significant write activity. For example, if you are running MySQL 5.0 or MySQL 5.1 with built-in InnoDB and you see periodic drops in performance with corresponding spikes in write activity, then you should consider upgrading to the InnoDB plugin.

Let us first familiarize ourselves with why and how flushing takes place in the buffer pool. Flushing is the activity of writing dirty pages to disk. We need to flush for two main reasons. First, when we need to read a page into the buffer pool and there are no free buffers: in this case we select a page as victim (based on the LRU) and, if the victim is dirty, we write it to disk before reusing it. This I’ll call the LRU_list flush. Secondly, we need to write dirty pages to disk when we want to reuse the redo logs. Redo logs in InnoDB (as in most other database engines) are used in a circular fashion: we can overwrite redo log entries if, and only if, the dirty pages corresponding to those entries have been flushed to disk. This type of flushing is called the flush_list flush.

adaptive_flushing is related to the flush_list flush (though it is also influenced by the LRU_list flush, as we’ll see shortly). If we allow the redo logs to fill up completely before we trigger flushing of the dirty pages then it will cause a serious drop in performance during this spike in I/O. To avoid this we’d like to keep flushing dirty pages in the background at a rate that is just enough to avoid the I/O bursts. Note that we don’t want to flush too aggressively, because that reduces the chances of write combining, resulting in more I/O than is required. So how do we determine the rate at which we do the flush_list flush in the background? That is where the adaptive_flushing heuristic comes into play.

We do know the capacity of the redo logs which is constant and is equal to the combined size of the log files. If we know the rate at which redo log is being generated then we can figure out how many seconds we have before the whole redo log is filled up.

n = log_capacity / redo_gen_rate

Assume there are n_dirty dirty pages in the buffer pool. If we have to avoid the I/O burst we must be able to flush n_dirty pages in n seconds. In other words, our required flushing rate can be given by the following formula:

required_flush_rate = n_dirty / n = (n_dirty * redo_gen_rate) / log_capacity

There is one bit missing in the above formula. What if an LRU_list flush is also happening? That will be flushing dirty pages as well. We don't want to overdo the flushing effort, so it makes sense to deduct the LRU_list flush rate from the above:

desired_flush_rate = (n_dirty * redo_gen_rate) / log_capacity - LRU_flush_rate

That is all there is to it. In order to avoid sudden spikes we keep the history of the LRU_flush_rate and the redo_gen_rate and use weighted averages in our calculations.
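The heuristic can be sketched in a few lines of Python (an illustration only; the exponential weighting below is an assumed stand-in for InnoDB's internal history averaging, which the post does not specify):

```python
def desired_flush_rate(n_dirty, redo_gen_rate, log_capacity, lru_flush_rate):
    """Background flush_list flush rate (pages/sec) needed to avoid an I/O burst."""
    # Seconds until the redo log fills up at the current generation rate.
    n = log_capacity / redo_gen_rate
    # All n_dirty pages must be flushed within those n seconds; deduct
    # whatever the LRU_list flush is already contributing.
    return max(n_dirty / n - lru_flush_rate, 0.0)

def weighted_avg(history, weight=0.9):
    # An exponentially weighted average as a stand-in for the smoothing of
    # redo_gen_rate and LRU_flush_rate over their recorded history.
    avg = history[0]
    for sample in history[1:]:
        avg = weight * avg + (1 - weight) * sample
    return avg
```

For example, with a 1 GB redo log filling at 10 MB/s (n ≈ 102.4 s) and 50,000 dirty pages, roughly 488 pages/s of background flushing is called for, before deducting the LRU flush rate.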

As is the case with most heuristics, the above may not work under some workloads. That is why we introduced the innodb_adaptive_flushing parameter, which can be used to disable this feature (it is enabled by default). In MySQL 5.5.4 we introduced multiple buffer pools, meaning there are multiple LRU lists and multiple flush lists; obviously adaptive flushing had to be adapted for this. In MySQL 5.5.6 we worked on this further by limiting the excess flushing caused by rounding off and by flushing the neighbors of a victim. But this is definitely not the end of the story. Consider this an area under active focus: we continue to work towards a better solution for overall flushing inside InnoDB, which is one more reason why your feedback is more than welcome. All I can say at this stage is: give us your feedback and stay tuned!

MySQL 5.5: InnoDB Change Buffering

Note: this article was originally published on Sept 19, 2009 by Marko Mäkelä.

To speed up bulk loading of data, InnoDB implements an insert buffer, a special index in the InnoDB system tablespace that buffers modifications to secondary indexes when the leaf pages are not in the buffer pool. Batched merges from the insert buffer to the index pages result in less random access patterns than when updating the pages directly. This speeds up the operation on hard disks.

In MySQL 5.5, the insert buffer has been extended to a change buffer, which covers all modifications of secondary index leaf pages. This will improve the performance of bulk deletes and updates, transaction rollback and the purging of deleted records (reducing the “purge lag”).

To assess the benefits of the extended buffering, you may want to run benchmarks with the settings innodb_change_buffering=all, innodb_change_buffering=inserts, and innodb_change_buffering=none. Users of solid-state storage, where random reads are about as fast as sequential reads, might benefit from disabling the buffering altogether.

Read on to learn how the change buffering works.

Operations on Secondary Indexes

InnoDB can perform three kinds of modifications on secondary index records. If the affected index page is not in the buffer pool, the modifications can be buffered in the change buffer. When an index lookup or scan needs a page that is not in the buffer pool, the page will be read from the tablespace and any buffered changes will be merged to it.

The following operations can modify secondary index pages:

* Inserting a record (supported in all versions of InnoDB)
* Marking a record for deletion
* Removing a delete-marked record that is no longer accessible to active transactions

Before MySQL 5.5, UPDATE, DELETE and purge operations were performed directly on the index pages, resulting in random-access I/O. In MySQL 5.5, all of these operations can be buffered.

Implications of InnoDB Multiversioning

In InnoDB, there are two types of indexes: the clustered index B-tree, where the records are stored in the PRIMARY KEY order, and secondary index B-trees, which identify rows by primary key. InnoDB multiversion concurrency control (MVCC) treats these indexes differently.

Records in the clustered index can be updated in place, and their hidden system columns DB_TRX_ID, DB_ROLL_PTR point to undo log entries from which earlier versions can be reconstructed. InnoDB secondary index records do not contain any system columns, and their data is never updated in place. An UPDATE of an indexed column requires the operations Delete-mark(old), Insert(new) and eventually Purge(old) in the secondary index. An UPDATE of a PRIMARY KEY results in Delete-mark, Insert and eventually Purge in all indexes.
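The sequence above can be illustrated with a tiny sketch (hypothetical names, not InnoDB code):

```python
# Hypothetical illustration of how an UPDATE of an indexed column decomposes
# into secondary-index operations: the old entry is delete-marked, the new
# one inserted, and the old one purged later once no active transaction can
# still see it.
def secondary_index_ops(old_key, new_key):
    if old_key == new_key:
        return []                      # this index is unaffected
    return [("delete-mark", old_key),  # old version stays visible to readers
            ("insert", new_key),       # new version coexists with the old
            ("purge", old_key)]        # deferred removal by the purge thread
```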

When a secondary index record has been marked for deletion or when the page has been updated by a newer transaction, InnoDB will look up the clustered index record. In the clustered index, it suffices to check the DB_TRX_ID and only retrieve the correct version from the undo log when the record was modified after the reading transaction started.

To Buffer or not to Buffer

When a page is in the buffer pool, it will always be updated directly. When a page is loaded to the buffer pool, any buffered changes will be merged to it, so that users never see unmerged changes.

Because change buffering works on individual leaf pages, we cannot buffer changes that would result into page splits or merges, but must perform such changes on the B-tree pages directly.

The insert buffer bitmap keeps track of the available space on pages and prevents overflows when buffering inserts. Delete-marking records can always be buffered, because the flag is updated in place. Purging a delete-marked record could result in an empty page, something that we do not allow; we determine the non-emptiness of a page from previously buffered operations on the same page. If there are no previously buffered operations, the purge has to load the index page into the buffer pool.

InnoDB refuses to buffer an operation when the on-disk change buffer tree would grow bigger than ⅓ of the in-memory buffer pool (innodb_buffer_pool_size). This is a reasonable rule of thumb, but some setups could benefit from the ability to set the change buffer size independently of the buffer pool size.
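The buffering rules just described can be condensed into a small decision function (an illustrative sketch, not InnoDB's actual code; all names and parameters here are assumptions for the example):

```python
# Sketch of the change-buffering decision: buffer only when the page is not
# cached, the change buffer is under its size cap, and the specific operation
# is safe to defer.
def can_buffer(op, page_in_buffer_pool, bitmap_free_space, record_size,
               known_nonempty_after_purge, change_buffer_pages, buffer_pool_pages):
    if page_in_buffer_pool:
        return False   # cached pages are always updated directly
    if change_buffer_pages >= buffer_pool_pages // 3:
        return False   # change buffer capped at 1/3 of the buffer pool
    if op == "insert":
        # The insert buffer bitmap must show enough room, so the buffered
        # insert cannot overflow the page (which would force a page split).
        return bitmap_free_space >= record_size
    if op == "delete-mark":
        return True    # only a flag is set: always safe to buffer
    if op == "purge":
        # Buffering a purge must not empty the page; non-emptiness is only
        # known from previously buffered operations on the same page.
        return known_nonempty_after_purge
    return False
```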


The InnoDB change buffer is a persistent data structure and a complex mechanism that comes into play when the workload does not fit in the buffer pool. Because it trades random I/O for a larger amount of sequential I/O, it speeds up operation on hard disks, where random access is much slower than sequential access.

On solid-state storage, there is not much difference between sequential and random access times. Change buffering may still be useful if writes to solid-state storage are expensive, either in terms of speed or the consumption of limited program/erase cycles. Change buffering could reduce the write load on user tablespaces and cause more writes to the system tablespace (which contains the insert buffer) and the redo log. These should be placed on a hard disk.

Wednesday Apr 14, 2010

InnoDB now supports native AIO on Linux

Note: this article was originally published on April 14, 2010 by Inaam Rana.

With the exception of Windows, InnoDB has used ‘simulated AIO’ on all other platforms to perform certain IO operations. The IO requests that have been performed in a ‘simulated AIO’ way are the write requests and the readahead requests for datafile pages. Let us first look at what ‘simulated AIO’ means in this context.

We call it ‘simulated AIO’ because it appears asynchronous from the context of a query thread, but from the OS perspective the IO calls are still synchronous. The query thread simply queues the request in an array and then returns to its normal work. One of the IO helper threads, which are background threads, then takes the request from the queue and issues a synchronous IO call (pread/pwrite), meaning it blocks on the IO call. Once it returns from the pread/pwrite call, the helper thread calls the IO completion routine on the block in question, which in the case of a read includes merging buffered operations, if any. In the case of a write, the block is marked ‘clean’ and removed from the flush_list. Some other bookkeeping also happens in the IO completion routine.
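A toy model of this arrangement (an illustration only, not InnoDB's code) fits in a few lines: the query thread only enqueues a request and returns, while a background helper thread performs the blocking pread() and runs the completion routine. Note that `os.pread` is POSIX-only.

```python
import os
import queue
import threading

requests = queue.Queue()

def io_helper():
    # Background IO helper: drains the queue one synchronous call at a time.
    while True:
        req = requests.get()
        if req is None:
            return                           # shutdown marker
        fd, offset, length, on_complete = req
        data = os.pread(fd, length, offset)  # synchronous from the OS's view
        on_complete(data)                    # e.g. merge buffered changes (read)
                                             # or mark the block clean (write)

helper = threading.Thread(target=io_helper, daemon=True)
helper.start()

def async_read(fd, offset, length, on_complete):
    """Asynchronous only from the query thread's point of view."""
    requests.put((fd, offset, length, on_complete))
```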

What we have changed in InnoDB Plugin 1.1 is to use the native AIO interface on Linux. Note that this feature requires libaio to be installed on your system. libaio is a thin wrapper around the kernelized AIO on Linux; it is different from POSIX AIO, which requires user-level threads to service AIO requests. There is a new boolean switch, innodb_use_native_aio, to choose between simulated and native AIO, the default being native AIO.

How does this change the design of the InnoDB IO subsystem? Now the query thread, instead of enqueueing the IO request, actually dispatches the request to the kernel and returns to its normal work. The IO helper thread, instead of picking up enqueued requests, waits on IO wait events for any completed IO requests. As soon as it is notified by the kernel that a certain request has completed, it calls the IO completion routine on that request and then returns to wait on the IO wait events. In this new design the IO-requesting thread becomes a kind of dispatcher while the background IO thread takes on the role of a collector.

What will this buy us? The answer is simple: scalability. For example, consider a system that is heavily IO bound. In InnoDB, one IO helper thread works on a maximum of 256 IO requests at one time. Assume that the heavy workload fills up the queue. With simulated AIO, the IO helper thread will go through these requests one by one, making a synchronous call for each. This serialisation forces the request that is serviced last to wait for the other 255 requests before it gets a chance. What this implies is that with simulated AIO there can be at most ‘n’ IO requests in parallel inside the kernel, where ‘n’ is the total number of IO helper threads (this is not entirely true because query threads are also allowed to issue synchronous requests, but I’ll gloss over that detail for now). With native AIO, all 256 requests are dispatched to the kernel, and if the underlying OS can service more requests in parallel then we’ll take advantage of that.

The work of coalescing contiguous requests is now offloaded to the kernel/IO scheduler. This means that which IO scheduler you are using, and the properties of your RAID/disk controller, may now have more effect on overall IO performance. This is also true because many more IO requests will be inside the kernel than before. Though we have not run tests to specifically certify any particular IO scheduler, the conventional wisdom has been that for database engine workloads the noop or deadline scheduler gives optimal performance. I have heard that lately a lot of improvements have gone into cfq as well. It is for you to try, and as always YMMV. We look forward to hearing your story.

NOTE: InnoDB has always used native AIO on Windows and continues to do so in Plugin 1.1. innodb_use_native_aio has no effect on Windows.

Tuesday Apr 13, 2010

InnoDB Performance Schema

Note: this article was originally published on April 13, 2010 by Jimmy Yang.

Performance Schema Support in InnoDB

With the Plugin 1.1 release, InnoDB has full support for Performance Schema, a new feature of the MySQL 5.5 release. This allows a user to peek into some critical server synchronization events and obtain their usage statistics. On the other hand, to make sense of the instrumented results, you might need some understanding of InnoDB internals, especially in the area of synchronization with mutexes and rwlocks.

With this effort, the following four modules have been instrumented for performance schema:

1. Mutex
2. Rwlock
3. File I/O
4. Thread

Almost all mutexes (42) and rwlocks (10), and 6 types of threads, are instrumented. Most mutex/rwlock instrumentations are turned on by default; a few of them are under a special define. For File I/O, the statistics are categorized into Data, Log and Temp file I/O.

This blog gives you a quick overview of this new machinery.

Start the MySQL Server with Performance Schema

To start with, you probably want to take a quick look at MySQL’s Performance Schema manual, which gives you a quick overview of the general performance schema features.

The performance schema is built into MySQL 5.5 by default. However, you do need to add “--performance_schema” to your server boot command line, or have the performance_schema system variable enabled in your server configuration file, in order to enable it; otherwise it will be disabled. Please note that you can specify the “performance_schema” variable with no value or a value of 1 to enable it, or with a value of 0 to disable it.

When the server starts, pay attention to the following lines in the server error log:

“100407 16:13:02 [Note] Buffered information: Performance schema enabled.”

This means the server started with the performance schema running fine.

It could also display message such as:

“100407 16:13:02 [Note] Buffered information:  Performance schema disabled (reason: start parameters)”

This shows the performance schema is disabled, due to the lack of either the “performance_schema” boot option or the appropriate variable in the configuration file.

The third type of message would be “Performance schema disabled (reason: init failed)”, due to a performance schema initialization failure (which could be caused by, for example, a memory allocation failure). This message is relatively rare; I have not encountered it. If you do hit it, please check the other performance schema related system variables to see if they are out of a reasonable range.

Performance Schema Database and its Tables

Assuming the server starts fine with Performance Schema enabled, the first stop you want to visit is probably the new database called “performance_schema”. All performance schema related tables are in this database:

mysql> use performance_schema

mysql> show tables;



These 18 tables can be categorized into a few big groups, such as “Instance” tables, “Wait” tables with “History”, “Wait” tables with “Summary”, and “Setup” tables.

In the next few sections, I will go through a few tables in this list that I think are important.

Find Instrumented Events with INSTANCE TABLES

To view which InnoDB events are active and being instrumented, check the following “Instance” tables for the corresponding modules:


mysql> SELECT DISTINCT(NAME) FROM MUTEX_INSTANCES
    -> WHERE NAME LIKE “%innodb%”;




35 rows in set (0.00 sec)

Please note that there can be multiple instances of a mutex in the server:

mysql> SELECT COUNT(*) FROM MUTEX_INSTANCES
    -> WHERE NAME LIKE “%rseg_mutex%”;


1 row in set (0.92 sec)

This is why we use a “SELECT DISTINCT(NAME)” clause in the initial query: to get only the distinct mutex names from the MUTEX_INSTANCES table. Without the DISTINCT, hundreds of instances of a mutex could be displayed. This also applies to the other instance tables.

Also please note that if a mutex has not yet been created, it will not be listed in the instance table, so you might see fewer events/instances than you expected.

One last point for this section: the buffer block mutex and rwlock are instrumented but excluded by default from performance schema instrumentation. The reason is that there is one mutex/rwlock pair per 16K buffer block, so a server with a large buffer pool configuration can easily create thousands of instances of this type of mutex/rwlock. This easily exceeds the default limit on mutex/rwlock instances (1000), and the user would have to raise the limit by setting the system variables performance_schema_max_mutex_instances and/or performance_schema_max_rwlock_instances. However, as mentioned, the block mutex/rwlock are instrumented; to enable them, you need to change the code and un-define “PFS_SKIP_BUFFER_MUTEX_RWLOCK”.
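The arithmetic behind that limit is easy to check (a sketch; the 16K block size and the default limit of 1000 instances are taken from the text above):

```python
def buffer_block_sync_instances(buffer_pool_bytes, block_size=16 * 1024):
    # One mutex/rwlock pair per buffer block, i.e. per 16K block.
    return buffer_pool_bytes // block_size

# A 2 GB buffer pool yields 131,072 pairs -- far beyond the default
# performance_schema_max_mutex_instances limit of 1000.
```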

Find out what is going on with EVENTS_WAITS_CURRENT table

The next table you might be interested in is the EVENTS_WAITS_CURRENT table:

mysql> SELECT THREAD_ID, EVENT_NAME, SOURCE FROM EVENTS_WAITS_CURRENT
    -> WHERE EVENT_NAME LIKE “%innodb%”;

2 wait/synch/mutex/innodb/ios_mutex srv0start.c:495
8 wait/synch/mutex/innodb/log_sys_mutex log0log.ic:405
9 wait/synch/mutex/innodb/kernel_mutex srv0srv.c:2182
10 wait/synch/mutex/innodb/thr_local_mutex thr0loc.c:127

4 rows in set (0.00 sec)

This table shows the latest instrumented activity for each thread. The nice part is that it has the exact file name and line number of each event, so in case of a hang/blocking situation (due to a mutex/rwlock), you can tell which mutex or rwlock is actually involved.

Check into “Limited history” with HISTORY tables

There are a couple of “HISTORY” tables that record the instrumented events. The EVENTS_WAITS_HISTORY table contains the most recent 10 events per thread, and EVENTS_WAITS_HISTORY_LONG contains the most recent 10,000 events by default. They also come with the “SOURCE” field with file name and line number, and you can do some aggregation on them to find interesting behavior.

For example, the following query gives you the exact mutex instances that have been at the top of the list in the history table:


Or you can obtain the instances with the highest average wait time:

mysql> SELECT EVENT_NAME, SUM(TIMER_WAIT)/COUNT(*), SOURCE
    -> FROM EVENTS_WAITS_HISTORY_LONG
    -> GROUP BY SOURCE;

As mentioned, the history tables have limited size: 10 events per thread for EVENTS_WAITS_HISTORY and 10,000 for EVENTS_WAITS_HISTORY_LONG. However, you can extend the history length of these two tables through the “performance_schema_events_waits_history_size” and “performance_schema_events_waits_history_long_size” system variables. performance_schema_events_waits_history_long_size can be extended to a maximum of one million rows. However, do not expect this to be enough: even with 1 million events configured, on a busy system it probably covers only a few seconds of server operation.

Find out aggregated information from SUMMARY Tables

To get the overall aggregated values for these instances, you need the “SUMMARY” tables. There are 5 summary tables:


As their names suggest, these are just event statistics aggregated by different criteria. Digging into these tables gives you some idea of where the contention could be.

For example, the following query shows the hottest mutexes (the timer values are in picoseconds):


buf_pool_mutex 1925253 264662026992 137468
buffer_block_mutex 720640 80696897622 111979
kernel_mutex 243870 44872951662 184003
purge_sys_mutex 162085 12238011720 75503
…

This experiment shows that buf_pool_mutex has been the hottest mutex. However, aggregating on AVG_TIMER_WAIT shows that ibuf_mutex is the one waited on longest, even though it is accessed much less frequently.

Also note that these tables can be truncated, so you can reset the wait values and start counting and aggregating afresh.

Again, really understanding and interpreting the information in these tables requires some understanding of the internals in which these mutexes/rwlocks reside; it is targeted at advanced users and developers who want to analyze performance bottlenecks. However, ordinary users might still be able to infer useful information from it and find creative uses for these statistics. For example, on some I/O-bound servers, you might find the doublewrite buffer mutex near the top of the mutex list in terms of total time waited; you might then want to consider turning off the doublewrite buffer option.

Performance Impact:
The last item to discuss is that the performance schema comes with a cost: it has a visible performance impact. A simple DBT2 test with 50 warehouses and 32 connections, on a server with a 2G buffer pool, shows about an 8% performance impact with all performance schema events turned on. This is also confirmed by some sysbench tests.

In fact, to minimize the performance impact, the performance schema allows you to turn off counting on individual events with the SETUP tables: you can use SETUP_CONSUMERS to turn on/off logging into the history tables, and SETUP_INSTRUMENTS to turn on/off counting on a particular mutex/rwlock. However, turning off event counting cannot completely eliminate the performance impact of the performance schema. This is an area where the performance schema can still be improved.

In summary, we are providing a rich set of mutex, rwlock, I/O and thread usage information through Performance Schema instrumentation. It can be used to diagnose server performance bottlenecks, find possible hot spots in the server, and gain a better understanding of system behavior and access patterns in the modules where these mutexes/rwlocks reside. However, it does come with a cost to server performance itself, so it is more suitable for development server tuning and study; you might want to leave it out on a production server.

Wednesday Mar 11, 2009

Plug In for Performance and Scalability

Note: this article was originally published on March 10, 2009 by Calvin Sun.

Why should you care about the latest “early adopter” release of the InnoDB Plugin, version 1.0.3? One word: performance! The release introduces these features:

  • Enhanced concurrency & scalability: the “Google SMP patch” using atomic instructions for mutexing
  • More efficient memory allocation: ability to use more scalable platform memory allocator
  • Improved out-of-the-box scalability: unlimited concurrent thread execution by default
  • Dynamic tuning: at run-time, enable or disable insert buffering and adaptive hash indexing

These new performance features can yield up to twice the throughput or more, depending on your workload, platform and other tuning considerations. In another post, we explore some details about these changes, but first, what do these enhancements mean for performance and scalability?

In brief, we’ve tested three different workloads (joins, DBT2 OLTP and a modified sysbench) using a memory-resident database. In all cases, the InnoDB Plugin scales significantly better than the built-in InnoDB in MySQL 5.1. And in some cases, the absolute level of performance is dramatically higher too! The charts below illustrate the kinds of performance gains we’ve measured with release 1.0.3 of the InnoDB Plugin. Your mileage may vary, of course. 

This release of the InnoDB Plugin incorporates a patch made by Ben Handy and Mark Callaghan at Google to improve multi-core scalability by using more efficient synchronization methods (mutexing and rw-locks) to reduce cpu utilization and contention. We’re grateful for this contribution, and you will be too!

Now to our test results …

Joins: The following chart shows the performance gains in performing joins, comparing the built-in InnoDB in MySQL (in blue) with the InnoDB Plugin 1.0.3 (in red).

As you can see from the blue bars in the above chart, with MySQL 5.1 using the built-in InnoDB, the total number of joins the system can execute declines as the number of concurrent users increases. In contrast, the InnoDB Plugin slightly improves performance even with one user, and maintains performance as the number of users rises. This performance improvement is due in large part to the use of atomics for mutexing in the InnoDB Plugin.

Transaction Processing (DBT2): The following chart illustrates a scalability improvement using the OLTP read/write DBT2 benchmark, again comparing the performance of the built-in InnoDB in MySQL with the performance of InnoDB Plugin 1.0.3.

Here, the InnoDB Plugin scales better than the built-in InnoDB from 16 to 32 users and produces about 12% more throughput with 64 concurrent users, as other bottlenecks are encountered or system capacity is reached. This improvement is likewise due primarily to the changes in mutexing.

Modified Sysbench: This test uses a version of the well-known sysbench workload, modified to include queries based on a secondary index, as suggested by Mark Callaghan of Google.

This time, the InnoDB Plugin shows significantly better scalability from 8 to 64 users than the built-in InnoDB in MySQL, yielding as much as 60% more throughput at 64 users. Like the previous examples, this improvement is largely due to the use of atomics for mutexing.

Modified Sysbench with tcmalloc: This test uses the same modified sysbench workload, but shows the difference between the built-in InnoDB (which uses the internal InnoDB memory allocator) and the InnoDB Plugin when using a more scalable memory allocator, in this case tcmalloc.

When the new configuration parameter innodb_use_sys_malloc is set to enable use of the memory allocator tcmalloc, the InnoDB Plugin really shines! Transaction throughput continues to scale, and the actual throughput with 64 users has nearly doubled!


This is the InnoDB team blog.

