Friday Jan 10, 2014

Transaction life cycle improvements in 5.7.3

This is part of the ongoing work on improving the transaction life cycle management. In 5.7.2 we split the transaction list into two. The read-only transaction list and the read-write transaction list. There was another "virtual" list, the auto-commit non-locking  read-only (AC-NL-RO) transaction list. The change in 5.7.2 was that by default a transaction was treated as read only and added to the read-only transaction list. Only when it was determined that the transaction was going to do an update we removed the transaction from the read-only list and moved it to the read-write transaction list. This initial add to the the read-only list forced the acquisition of the trx_sys_t::mutex. Acquiring the mutex during transaction start/begin has a cost. Promoting a transaction from read-only to read-write we had to acquire the trx_sys_t::mutex to add to the read-write transaction list and so that is not too expensive and unavoidable. There is another transaction list for caching user transactions that we will ignore in this discussion (the mysql-trx-list) it is per connection. All user transactions both AC-NL-RO, plain read-only and read-write user transactions are on this list.

The optimization

The optimization in 5.7.3 is to eliminate the explicit read-only transaction list altogether. By default transactions are not put on any list unless they are explicitly tagged as read-write then they are added to the read-write transaction list. This makes the transaction begin mutex free. Additionally if the transaction commits as a read-only transaction then we don't need to acquire the trx_sys_t::mutex at commit time to remove from the read-only list either. This improves performance for read-only transactions, making them more or less equivalent to AC-NL-RO transactions. They will however incur some cost compared to AC-NL-RO transactions if they acquire shared locks.

Additionally,  the MVCC view open/close handling of read-only transactions is now equivalent to that of AC-NL-RO transactions, very low life cycle overhead. This will also help in improving performance significantly.

User impact

There is of course the positive performance impact but there is also a change in visibility semantics. With the elimination of the read-only list the read-only transactions are now no longer visible via SHOW ENGINE INNODB STATUS; they are however visible via the INFORMATION SCHEMA.

Thursday Dec 12, 2013

MySQL 5.7.3: Deep dive into 1mil QPS with InnoDB & Memcached

As you probably already know, in MySQL 5.7.3 release, InnoDB Memcached reached a record of over 1 million QPS on a read only load. The overview of the benchmark and testing results can be seen in an earlier blog by Dimitri. In this blog, I will spend sometime on the detail changes we have made to achieve this number.

First thanks to Facebook's Yoshinori with his bug#70172 that brought our attention to this single commit read only load test. We have been focussing on operation with large batch size. This bug prompted us to do a series of optimization on single commit read only queries and these optimizations eliminate almost all major bottlenecks from the InnoDB Memcached plugin itself.

If you are just getting familiar with InnoDB Memcached, there are some earlier blog on the topics to get you started. In a short word, InnoDB Memcached allows a fast path to retrieve key value data stored in the InnoDB table, with Memcached protocol.

The Benchmark:

Now, Let's discuss the testing scenario. The InnoDB Memcached plugin configurations are all by default in this benchmark, which means, the daemon_memcached_r_batch_size was also set to be 1, and the read operation would do a begin and commit transaction for each query. It is equivalent to auto-commit single selects through SQL interface. The innodb_api_trx_level is by default set to 0 (read uncommitted), however, changing it to 2 (repeatable read) gave the same benchmark result.

Another good news in 5.7.3 is that we start to support integer key column mapping, as it is common to use integer as primary key for a table. And the table used in this benchmark comes with integer as the key column. The mapping table contains only key and value columns. So we set the corresponding `flags`, `cas_column` and `expire_time_column` column in the config containers table all to NULL, this avoids overhead columns to support Memcached "extra" options. The table itself containers 1 million rows, each with a short integer key and a text value.

Here is the detail table definition

mysql> desc test.memc_test;
















To make InnoDB Memcached recognize this InnoDB table, insert following row into innodb_memcache/containers table


("memc1", "test", "memc_test", "id", "value3", null, null, null, "PRIMARY");

The memcached client for inserting rows and querying is a simple libmemcached program provided by Yoshinori.Each query would do a key lookup and fetches corresponding value.

We made some adjustment so that there are multiple client processes, each with multiple sessions. This was used to alleviate bottlenecks in the client itself.

As a note, there are many memcached clients out there, and Memcached clients can play  important roles in the performance result itself. For example, we observed at least 3 times difference on result with Perl client Cache::Memcached::Fast when comparing to its slower version Cache::Memcached. And as far as we can see, libmemcached is one of the most efficient clients available, even though eventually it becomes bottleneck itself as the test progresses, especially requests through the network.

The test result can be seen in Dimitri's blog, so I will not repeat them here. The summary is that we got close to 1.2 million QPS at the best. The next bottleneck now seems lying at the adaptive hash index's global latch - "btr_search_latch". The libmemcached client overhead is also significant.

Read Only QPS through InnoDB Memcached

The improvement:

There are several changes in both InnoDB Memcached code and Memcached Native code to achieve the record benchmarks.

1. The first is of course to address the issue brought by bug #70712. With daemon_memcached_r_batch_size set to 1, the transaction is being repeatedly started and committed for each query. It is better to cache the trx object itself, to avoid repeated create and destroy the trx object. Otherwise, the "trx_sys mutex" will kill the concurrency.

After the change, the trx object is cached with private memcached connection data. Each connection gets its own trx object, and it is used to handle transactions through this particular connection.

2. The next thing we did is to take advantage of the read only optimization recently made in the InnoDB code. This scenario (single read trx) is perfect to use the optimization. Whenever the read batch size is set to 1, InnoDB Memcached will treat incoming queries as auto-commit read only query. It will automatically hook up to the "fast path" of read-only operation in InnoDB.

3. After these two transaction related changes, we found the bottleneck comes from Memcached native Code itself. As a note, we embedded the Memcached code itself in our InnoDB Memcached plugin, so any bottleneck in Memcached will affect us.

The original Memcached memory allocation is protected by a global Cache Lock (engine->cache_lock), and it quickly rises in prominence in the profiling result.

Even though the data is stored in InnoDB, we happened to still use some of Memcached's own memory allocation to store and deliver the result back to the front end. To fix this bottleneck, we stopped using Memcached Memory altogether. Instead a connection private memory buffer is used to store and deliver the result. This also saves a memcpy as we move the data to memcached memory as before.

This change makes InnoDB Memcached plugin as thin as possible, and only relies on the InnoDB buffer pool and Adaptive Hash Index (AHI) as the backing store for the data. This provides better scaling and memory handling than Memcached itself.

4. Another bottleneck in Memcached is its statistics mutex ("thread_stats->mutex"). This also becomes significant as testing goes. So to remove it, we switched to using atomic operations whenever the platform supports (most modern platforms do). With these changes, we can now well scale the plugin to over 100 connections without degradation as the number of connections are ramped up.

5. In addition to removing those major bottlenecks, we also streamline the code to remove some overhead work. For example, we start to cached the "search tuple", so that there is no need to allocate the search tuple for each query. This is to keep the InnoDB Memcached as lean as possible.

With these changes, we have eliminated all the major InnoDB Memcached Plugin bottlenecks. The bottlenecks now comes from clients themselves and to a lesser degree from the Adaptive Hash Index search latch.

Future work:

Now the Memcached read goes more than twice as fast as those from SQL end. By using the InnoDB buffer pool as the in-memory store, and with InnoDB AHI, InnoDB Memcached can probably provide an efficient and more scalable store than Memcached itself.

There is still more to be done.

1. We will continue to remove some bottlenecks in InnoDB (such as btr_search_latch), as well as make InnoDB memcached leaner/faster.

2. We will add support to "mgets" command, which allows Memcached to fetch multiple results (corresponding to multiple keys) in one query attempts. This would again give us another big jump in terms of QPS.

3. We will start to focus more on insertion/updates operations.

4. We are considering extending the functionality of the memcached interface to support range queries etc. So to make it a more versatile key value store.

In summary, with these enhancements, the InnoDB Memcached becomes more and more attractive as as quick  key value store through the MySQL server.

Your feedback and comments are important to us as we evolve and improve this plugin. 

Wednesday Sep 25, 2013

InnoDB Temporary Tables just got faster

It all started with a goal to make InnoDB temporary tables more effective. Temporary table semantics are blessed with some important characteristics that can help us simplify lot of operations.

  • Temporary tables are not visible across connections
  • Temporary tables lifetime is limited to connection lifetime (unless user explicitly drops it).

What does this means in to InnoDB ?

  • REDO logging can be avoided for temporary tables and related objects since temporary tables do not survive a shutdown or crash.
  • Temporary table definitions can be maintained in-memory without persisting to the disk.
  • Locking constraints can be relaxed since only one client can see these tables.
  • Change buffering can be avoided since the majority of temporary tables are short-lived.

In order to implement these changes in InnoDB we took a bit different approach:

  • We introduced a dedicated tablespace for housing temporary tables. In 5.7, all non-compressed temporary tables (irrespective of innodb-file-per-table setting) are housed in this new temporary tablespace (ibtmp1). Compressed temporary tables continue to get housed in their independent tablespace. This new temporary tablespace is re-created on each server restart and can be pre-extended before starting server. (For more about temporary tablespace check MySQL documentation).
  • Dedicated rollback-segments (for housing undo logs) for temporary table objects.
  • All related objects including rollback-segments have been moved to this new temporary tablespace. This helps in maintaining locality and importantly eliminating REDO logging associated with all such objects.

What have we achieved ? (Performance Gain w.r.t to 5.6)

  • Create/Drop workload: (around 11x performance gain)
    create n tables, drop n tables, create+drop n tables, create + drop n tables with significant number of keys.
  • Insert/Update/Delete workload: (2-4x performance gain)
    • insert workload: insert n entries: sorted, n entries: reverse sorted, n entries: random (total 3n entries)
    • delete workload: delete n entries: first initiate insert workload to load table, delete n entries using primary key, delete n entries using secondary index, delete n entries using primary key such that complete table is empty.
    • update workload: update n entries: first initiate insert workload to load table, update n entries using primary key, update n entries using secondary index, update all 3n entries 2 times. (no explicit key specified).

All testing done using sql-bench (sql-bench modified to accommodate temporary table use-case).

Summing it up....

The benchmarks above show that InnoDB temporary tables are faster in 5.7 (than in 5.6). We plan to further enhance temporary table performance.

Saturday Sep 21, 2013

InnoDB 5.7 performance improvements

A quick overview of the InnoDB performance improvements for both read-only and read-write loads.
[Read More]

Friday Sep 13, 2013

InnoDB Redundant Row Format


This article describes the InnoDB redundant row format. If you are new to InnoDB code base (a new developer starting to work with InnoDB), then this article is for you. I'll explain the row format by making use of a gdb session. An overview of the article is given below:

  • Create a simple table and populate few rows.
  • Access the page that contains the rows inserted.
  • Access a couple of rows and explain its format.
  • Give summary of redundant row format.
  • Useful gdb commands to analyse the InnoDB rows.
  • Look at a GNU Emacs Lisp function to traverse rows in an InnoDB index page.

To get the most out of this article, the reader is expected to repeat the gdb session as described here.

The Schema

Consider the following SQL statements to produce the schema:

CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;
INSERT INTO t1 VALUES (1), (2), (3), (4), (5);

When the above CREATE TABLE is executed, InnoDB internally creates a clustered B-tree index (or just clustered index) representing the table. If no primary key is explicitly given, internally a primary key is added. Each node of B-tree is represented by one page (which is by default is of size 16KB). In the given scenario, we insert only 5 records. So the B-tree will consist of only 1 node, the root node.

Access InnoDB Index Page in gdb

All select queries will go through the row_search_for_mysql() function of InnoDB. When the first time it is called for a query, a persistent cursor (btr_pcur_t) will be opened by this function. When this B-tree persistent cursor (pcur in the following gdb) points to the InnoDB index page containing the rows of our interest, we can save the page frame to a file as follows:

(gdb) set $frame = pcur->btr_cur->page_cur->block->frame
(gdb) dump binary memory /tmp/gdb.anna $frame $frame + srv_page_size
(gdb) p srv_page_size
$10 = 16384

If you do not understand the commands above, refer to the gdb manual for more information. Saving a page in binary format in a file on disk helps us to analyse the contents of the page by loading it in a text editor like GNU Emacs or ViM. This is sometimes a very useful approach for debugging. The above commands were executed when I was on the sync point row_search_rec_loop, in the function row_search_for_mysql(). The line numbers are from mysql 5.7.3.

4155         } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
4156                 btr_pcur_open_at_index_side(
4157                         mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
4158                         pcur, false, 0, &mtr);
4159         }
4161 rec_loop:
4162         DEBUG_SYNC_C("row_search_rec_loop");
4163         if (trx_is_interrupted(trx)) {
4164                 btr_pcur_store_position(pcur, &mtr);
4165                 err = DB_INTERRUPTED;
4166                 goto normal_return;
4167         }

An important point to remember when analysing the page is that all the data in the page are stored in big-endian format.

Open Page in GNU Emacs

The following image shows the InnoDB page opened in GNU Emacs in hexl-mode major mode. I have placed the cursor on the origin of the infimum record (at offset 0x65 from page origin). In the right side, you can look at the text "infimum" and "supremum" as stored in the page. These are special records.

All rows in an InnoDB page are linked to one another in ascending order via the next record pointer. In the given example this ordering is based on the internally generated DB_ROW_ID. In each page, the first record (or the minimum record) is the infimum record, and the last record (or the maximum record) is the supremum record. The records between the infimum and supremum records are called the user records.

The next record pointer of each record points to the "origin" of the next record. The record contains data to the right of the origin and meta data to the left of the origin. The next record pointer is available in the 2 bytes before the record origin.

The next record pointer of infimum as given in the picture above is 0x0087 (which is 135 in decimal). This is the offset of the next record from the beginning of the page.

Printing the infimum record

As mentioned previously, infimum is a special minimum record in an index page. It exists at a fixed offset of 101 bytes from the page beginning. The function page_get_infimum_rec() can be used to get the infimum record given a page frame pointer. All records in InnoDB are pointed to at their record origin. So to access the meta data in the record, we need to subtract offset from the record origin.

Printing the infimum record given a page frame pointer (in char):

(gdb) p  pcur->btr_cur->page_cur->block->frame+101
$41 = (unsigned char *) 0x7fffd1d3c065 "infimum"

Printing the infimum record given a page frame pointer (in hex):

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101)@7
$45 = {0x69, 0x6e, 0x66, 0x69, 0x6d, 0x75, 0x6d}

Printing the 6 byte Informational Bit Fields

Given the origin of the infimum record, we can print the 6 bytes of informational bits as follows (in hex).

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101-6)@6
$50 = {0x1, 0x0, 0x0, 0x3, 0x0, 0x87}

Printing the 6 bytes of informational bits of the infimum record (in binary). In this output, the right most byte displayed is the first byte before the record origin.

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+101-6
0x7fffd1d3c05f: 00000001  00000000   00000000  00000011  00000000  10000111

The above 6 bytes bit fields of the record is formatted as follows: (it also contains the actual values as seen above)

4 bits
info bits
byte: 6
4 bits
byte: 6
13 bits
heap number
byte: 4,5
00000000 00000
10 bits
field count
byte: 3,4
000 0000001
1 bit
byte: 3
2 bytes
next record offset
byte: 1, 2
00000000 10000111

Each of the box above represents one bit field. It contains the following information:

The size of bit field in bits or bytes
A short description of the bit field.
The byte where this bit field exists.
The actual value of this bit field as seen in above gdb output

The 1 bit flag above is used to tell whether the offset array element is 1 byte (if set) or 2 bytes (if not set). In our example case, the 1 bit flag is set and hence the offset array elements are each 1 byte long.

From the field count bit field, we can see that the infimum record has only one field, whose value is fixed to "infimum". The next record pointer of the infimum record points to the first user record in the page. This is available at the offset 0x0087, which is 135 in decimal. So the first user record is available at page frame pointer + 135 bytes.

The Bit Fields

The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of bit field data is listed here. These six bytes are available immediately before the record origin. These bit fields must be accessed by making use of the interface provided by the record manager (refer to rem0rec.h).

The Bit Field Size of Bit Field Description of the Bit Field
info bits 4 bits Used to delete mark a record
n_owned 4 bits Number of records owned by this record. This is related to page directory. To understand this field, we need to understand the page structure used by InnoDB. This is not explained further in this article.
heap number 13 bits Gives the order number of this record in the page. Each row has a unique heap number in the page. The lock manager uses this heap number.
field count 10 bits Number of fields in this record.
flag 1 bit Tells if the offset array element is 1 or 2 bytes.
next record offset 2 bytes Offset of the next record from page origin.

Printing the First User Record

To first parse the record we need to know the number of records in the row. This can be obtained from the 6 bytes of bit-field information available before the record origin. Let us first see this for the user record:

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+135-6
0x7fffd1d3c081: 00000000  00000000  00010000  00001001  00000000  10101000

4 bits
info bits
byte: 6
4 bits
byte: 6
13 bits
heap number
byte: 4,5
00000000 00010
10 bits
field count
byte: 3,4
000 0000100
1 bit
byte: 3
2 bytes
next record offset
byte: 1, 2
00000000 10101000

The above information shows that this record has 4 fields (000 0000100 in binary). But our table definition was

CREATE TABLE t1 (f1 unsigned int) row_format=redundant engine=innodb;

If you are wondering why a row of this table has 4 field count, it is because InnoDB added 3 internal fields to the table t1 - a row identifier (DB_ROW_ID), transaction identifier (DB_TRX_ID) and a undo roll pointer (DB_ROLL_PTR). A row identifier column is internally added to a table if no primary key is explicitly given by the user. This row identifier becomes the internal primary key. The transaction identifier specifies the transaction that has last updated the row. The undo roll pointer points to the previous versions of the record. The 4th field is the actual column f1 added by the user.

The list of fields in the table t1 can be printed by accessing its associated dict_table_t object (or its clustered index dict_index_t object). In the following, the name GEN_CLUST_INDEX is the name of the clustered index of the table t1. This is the standard name of the clustered index if it was internally generated by InnoDB.

(gdb) p prebuilt->index->name       
$75 = 0x7fffa00232e8 "GEN_CLUST_INDEX"
(gdb) p prebuilt->index->table      
$73 = (dict_table_t *) 0x7fffa001fe68
(gdb) p prebuilt->index->table->name
$74 = 0x7fffa000cef0 "test/t1"
(gdb) set print array on
(gdb) p *prebuilt->index->fields@4
$67 =   {{col = 0x7fffa0020154, name = 0x7fffa00201c3 "DB_ROW_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa0020160, name = 0x7fffa00201cd "DB_TRX_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa002016c, name = 0x7fffa00201d7 "DB_ROLL_PTR", prefix_len = 0, 
    fixed_len = 7},
  {col = 0x7fffa0020148, name = 0x7fffa00201c0 "f1", prefix_len = 0, fixed_len = 4}}

Printing the Record Offsets (or Offset Array)

The redundant record contains an array of offsets of each field of the row. This array is available after the 6-byte bit fields (down from the record origin). The size of each element of the array is either 1 byte or 2 byte depending on the 1 bit flag in the 6 byte bit fields. In our current case each offset element is 1 byte long.

(gdb) p /d  *(pcur->btr_cur->page_cur->block->frame+135-6-4)@4
$79 = {23, 19, 12, 6}

The first field (DB_ROW_ID) starts at offset 0 from record origin. The second field (DB_TRX_ID) starts at offset 6 from record origin. The third field (DB_ROLL_PTR) starts at offset 12 bytes from record origin. The fourth field (f1) starts at offset 19 bytes from record origin. The end of the record is 23 bytes from the record origin. So the total size of this record can be calculated as follows:

S - size of offset array element (1 byte) 
N - number of fields in the row (4 fields)
I - size of bit fields (6 bytes)
D - Length of row after record origin (23 as given in offset array)
R - Total Size of the row 

R = (S*N) + I + D = 4 + 6 + 23 = 33 bytes

The total size of the row is 33 bytes. Since this table contains only fixed length fields, the size of each row in this table will be the same.

Printing the DB_ROW_ID

The DB_ROW_ID is 6 bytes long. It starts at the record origin itself. It can be printed as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135)@6
$82 = {0x0, 0x0, 0x0, 0x0, 0x2, 0x0}

Printing the DB_TRX_ID

The second field DB_TRX_ID is also 6 bytes long. It starts at an offset of 6 bytes from the record origin. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+6)@6
$84 = {0x0, 0x0, 0x0, 0x0, 0x9, 0x8}

Printing the DB_ROLL_PTR

The third field DB_ROLL_PTR is 7 bytes long. It points to the previous versions of a record available in the undo segments. It can be examined as follows:

(gdb)  p /x *(pcur->btr_cur->page_cur->block->frame+135+12)@7
$3 = {0xa8, 0x0, 0x0, 0x1, 0x1a, 0x1, 0x10}

Printing the User Data Field

The fourth field "f1" is the user defined field. It is 4 bytes long. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+19)@4
$5 = {0x0, 0x0, 0x0, 0x1}
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$8 = 1

Thus, we have completely accessed the first user record. Given the first user record, we can find out the location of the next user record (the first 2 bytes before the record origin). If we keep accessing the next user record, we will eventually reach the supremum record, which is the last (and considered largest or maximum) record in a page.

Printing All Rows

Just to improve our confidence in accessing the rows, here I have printed all the rows that we have inserted into the table. The first record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which we have inserted) and then the last record supremum.

(gdb) p pcur->btr_cur->page_cur->block->frame+101
$13 = (unsigned char *) 0x7fffd1e48065 "infimum"
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$14 = 1
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+168+19))
$15 = 2
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+201+19))
$16 = 3
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+234+19))
$17 = 4
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+267+19))
$18 = 5
(gdb) p pcur->btr_cur->page_cur->block->frame+116
$19 = (unsigned char *) 0x7fffd1e48074 "supremum"

The Complete Redundant Row Format

One redundant record can be logically separated into 5 parts - the offsets array, the 6-byte bit field, record origin, hidden fields (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified fields of the table. They can be viewed as follows:

offsets array
fn ... f1, x2, x1, pn ... p1
6-byte bit field
primary key
or user given
p1, p2 ... pn
hidden fields
x1, x2
user fields
f1, f2, f3 ... fn

All record pointers point to the record origin. The record origin is the beginning of the primary key for the clustered primary index. It is not a separate entity that occupies some space. I am showing it separately to make it easy to visualize. All information prior to the record origin can be considered as the record header or the record meta-data.

Each data field has a corresponding offsets array element. The offsets array is arranged in reverse order. The first element in the array provides the offset of the end of the last data field and the last element in the array provides the offset of the end of the first data element.

The offsets array also tells whether a data field is NULL. If a field is NULL, the highest bit in the associated offset array element will be set to 1.

The biggest advantage of the redundant row format is that a record in this format can be parsed without needing to access the data dictionary. But this format is not space efficient. Even for fixed length fields, we store information in the offsets array (this is avoided in COMPACT, DYNAMIC and COMPRESSED row formats).

One more limitation of the redundant row format is that the blob data cannot be stored completely off page (as in DYNAMIC and COMPRESSED row formats). A 768 byte prefix of the blob will always be stored inline with the row and cannot be avoided.

A small note on row formats. InnoDB really supports only two row formats - the redundant row format and the compact row format. The compressed and dynamic row formats are a variation available only on top of the compact row format. Technically these variations can be implemented for the redundant row format as well, but is currently not available.

GNU Emacs Function to Traverse Rows in Page

If we dump a single InnoDB index page into a file, we can open it in GNU Emacs hexl-mode and use the following elisp function to traverse from infimum record to the supremum record. First place the cursor on the 'i' of the infimum record and then call this function to reach the next record origin. I am not an expert in writing elisp functions. If you can improve on this and write a better one, kindly share it with me.

(defun ib-next-red-row ()
  "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.
The cursor must be positioned at the origin of a record. 
This is for the GNU Emacs hexl-mode. "
  (setq rec-origin (point))
  (setq rec-origin-offset (hexl-current-address))

  ;; The next-rec pointer is stored in 2 bytes before the
  ;; record origin
  (hexl-backward-char 2)
  (setq next-rec (string (char-after)))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (hexl-forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))

  ;; Next record is next-rec-off bytes away from page origin
  (setq next-rec-off (hexl-hex-string-to-integer next-rec))

  ;; Assuming that there is only one page in dump
  (hexl-goto-address 0)
  (hexl-forward-char next-rec-off))


In this article, the format of the redundant row was explained. A gdb session was used to demonstrate how to access a page and the rows within that page. The infimum record was accessed, printed and analysed. Using its next record pointer, the first user record was accessed. The reader can use the next record pointer of the first user record to access the next user record and so on till the supremum record is reached. I leave that as an exercise for the readers.

Thanks to Marko Makela for his useful tips on debugging at our yearly InnoDB Team Meeting 2013 held at Shanghai, China. This article derives from his ideas.

Send in your feedback to me.

Monday Jul 15, 2013

Redo Logging in InnoDB


InnoDB is a general-purpose storage engine that balances high reliability and high performance. It is a transactional storage engine and is fully ACID compliant, as would be expected from any relational database. The durability guarantee provided by InnoDB is made possible by the redo logs.

This article will provide an overview of the redo log subsystem or log subsystem of InnoDB. We will look at the following details:

  • The global log system object, which provides access to important data structures and information.
  • The mini-transaction (mtr), using which all redo log records are created.
  • The global in-memory log buffer (or just log buffer), into which the redo logs are written to from the mini transaction buffer. This log buffer will be periodically flushed to the log file on disk.
  • The redo log files on the disk and their high level internal structure.
  • We will discuss about the concept of LSN, and how the various values of LSN are used to implement the write-ahead logging (WAL).

Redo Log Generation

In the article, Data Organization in InnoDB, we already saw that the user data files of InnoDB (.ibd files) are considered to be a sequence of equal sized pages. These pages are uniquely identified within the InnoDB system using the space_id, page_no combination. If we want to read or modify any page, it needs to be loaded into memory. So there are two copies of the pages - on disk and in memory. Here are the high level steps in redo log generation.

  1. Any changes to a page is first done to the in-memory copy of the page. The page that is modified in memory and not yet flushed to disk is marked as the dirty page in memory.
  2. An associated redo log is generated in memory, in the local mini transaction (mtr) buffer. This will then be transferred to the global in-memory redo log buffer.
  3. The redo log record is written from the redo log buffer in memory to the redo log file on disk. This is subsequently flushed. These two steps are considered separate - writing to the redo log file and flushing the redo log file to the disk. This is to account for the file buffering done by the operating system.
  4. The dirty page is flushed from memory to disk at some later point of time as part of the checkpointing operation.

The order of these steps are important. The redo log record of a change must be flushed to disk before flushing the corresponding dirty page to the disk. This is the concept of write-ahead logging (WAL).

The generated redo log record will contain information necessary to repeat the same operation later during a database recovery. So the redo log record will contain information about the original set of pages and what has been changed in them. Using the redo log record the set of dirty pages will be re-created during database recovery.

Redo Log Files

By default, InnoDB creates two redo log files (or just log files) ib_logfile0 and ib_logfile1 within the data directory of MySQL. In MySQL versions 5.6.8 and above, the default size of each redo log file is 48MB each. This can be configured by the user by making use of innodb_log_file_size server option. The number of log files is controlled by innodb_log_files_in_group server option.

A log group consists of a number of log files, each of same size. As of MySQL 5.6, InnoDB supports only one log group. So I'll not discuss this further.

The redo log files are used in a circular fashion. This means that the redo logs are written from the beginning to end of first redo log file, then it is continued to be written into the next log file, and so on till it reaches the last redo log file. Once the last redo log file has been written, then redo logs are again written from the first redo log file.

The log files are viewed as a sequence of blocks called "log blocks" whose size is given by OS_FILE_LOG_BLOCK_SIZE which is equal to 512 bytes. Each log file has a header whose size is given by LOG_FILE_HDR_SIZE, which is defined as 4*OS_FILE_LOG_BLOCK_SIZE.

Redo Log File Header

Each redo log file contains a header occupying four log blocks with the following information:

  • The first 4 bytes contain the log group number to which the log file belongs.
  • The next 8 bytes contain the lsn of the start of data in this log file.
  • First checkpoint field located in the beginning of the second log block.
  • Second checkpoint field located in the beginning of the fourth log block.

The checkpoint field mentioned above contains information necessary to identify a checkpoint - a checkpoint number, the LSN of the checkpoint, checksum information and more.

Log Blocks

A redo log file can be viewed as a sequence of log blocks. All log blocks, except the ones belonging to the log file header, contain a header and a footer. The header is of size LOG_BLOCK_HDR_SIZE bytes (which is defined as 12 bytes). The log block trailer is of size LOG_BLOCK_TRL_SIZE (4 bytes). The log block header contains the following information:

  • The log block number. This field is of 4 bytes.
  • Number of bytes of log written to this block. This field is of 2 bytes.
  • Offset of the first start of an mtr log record group in this log block or 0 if none
  • The checkpoint number to which this log block belongs

The log block trailer contains checksum of the log block contents.

Log Sequence Number (LSN)

The log sequence number (LSN) is an important concept. The LSN is an offset into the redo log file. Within InnoDB the log sequence number is represented by the type lsn_t, which is an 8-byte unsigned integer. There are different LSN values that are of interest. The following table lists some of the LSN values discussed in this article. (Note: The log_sys is the global log system object which is explained in next section.)

LSN Description
log_sys->lsn The redo log record that will be generated next will make use of this lsn
log_sys->flushed_to_disk_lsn The redo log file is flushed upto this LSN. All redo log records whose LSN is < flushed_to_disk_lsn is safely on the disk.
log_sys->write_lsn There is a currently running write operation which will write upto this LSN.
log_sys->current_flush_lsn There is a currently running write + flush operation which will write upto this LSN.

The LSN is what that links the dirty page, the redo log record, and the redo log file. Each redo log record when it is copied to the in-memory log buffer, it gets an associated LSN. When each database page is modified, redo log records are generated. So each database page is also associated to an LSN. The page lsn is stored in a header for each page. The page lsn gives the lsn upto which the redo log file must be flushed before flushing the page.

Global Log System Object

The global log system object log_sys (of type log_t) holds important information related to log subsystem of InnoDB. Here only those information related to redo log buffer and redo log files are discussed. The global log_sys identifies the "active area" of the log buffer, which contains redo logs that are yet to be safely written on the disk. It also identifies the area in the redo log file, into which the active area of the log buffer will be written/flushed.

/** Redo log buffer */
struct log_t{
	lsn_t		lsn;		/*!< log sequence number */
	ulint		buf_free;	/*!< first free offset within the log
					buffer */
	byte*		buf_ptr;	/* unaligned log buffer */
	byte*		buf;		/*!< log buffer */
	ulint		buf_size;	/*!< log buffer size in bytes */
	ulint		max_buf_free;	/*!< recommended maximum value of
					buf_free, after which the buffer is
					flushed */

	ulint		buf_next_to_write;/*!< first offset in the log buffer
					where the byte content may not exist
					written to file, e.g., the start
					offset of a log record catenated
					later; this is advanced when a flush
					operation is completed to all the log
					groups */
	lsn_t		write_lsn;	/*!< end lsn for the current running
					write */
	ulint		write_end_offset;/*!< the data in buffer has
					been written up to this offset
					when the current write ends:
					this field will then be copied
					to buf_next_to_write */
	lsn_t		current_flush_lsn;/*!< end lsn for the current running
					write + flush operation */
	lsn_t		flushed_to_disk_lsn;
					/*!< how far we have written the log
					AND flushed to disk */

  • The member log_sys->buf points to the in-memory redo log buffer. This is the buffer into which an mtr_commit() writes the redo log records. The size of this buffer is given by log_sys->buf_size in bytes.
  • The member log_sys->buf_free is the offset within the in-memory redo log buffer, where the next redo log record will be written. This is the end offset for the next redo log flush to disk.
  • The member log_sys->buf_next_to_write is the offset from where the redo log records are not yet written to the redo log file. When the redo log buffer is flushed to disk next time, it will flush from this location. This is the start offset for the next redo log flush to disk.
  • The member log_sys->flushed_to_disk_lsn marks the lsn upto which we have written to the log file on the disk and flushed it. So upto this lsn, the redo logs are safely on the disk. This will always be less than or equal to log_sys->write_lsn and log_sys->lsn.
  • The member log_sys->lsn represents the current lsn. This member will be updated whenever we do mtr_commit(). The function mtr_commit() writes the bunch of redo log records generated in the mini transaction to the global or system wide redo log buffer in memory. This will always be greater than or equal to log_sys->flushed_to_disk_lsn and log_sys->write_lsn. This will be the LSN of the redo log record written at log_sys->buf_free.
  • The member log_sys->write_lsn represents the end lsn of a currently running redo log buffer write operation. This will be greater than or equal to log_sys->flushed_to_disk_lsn and less than or equal to log_sys->lsn.
  • The member log_sys->current_flush_lsn represents the end lsn of a currently running redo log buffer write + flush operation. This will be mostly equal to log_sys->write_lsn.

The global log_sys object points to various positions in the in-memory redo log buffer and on-disk redo log files. The following picture shows the locations pointed to by the global log_sys object. The picture clearly shows that the redo log buffer maps to a specific portion of the redo log file.

Global In-memory Redo Log Buffer

The in-memory redo log buffer is global and all redo logs generated by user transactions will be written into this buffer. The size of this buffer is configurable and is given by the innodb_log_buffer_size. The default size of this redo log buffer is 8MB.

When a transaction is running and if it is modifying the database, then it will generate redo logs and populate this log buffer. This log buffer will be written or flushed to the log file, either when the transaction commits, or when the log buffer gets full.

When the redo log buffer is full, and there is not enough space for an mtr_commit(), which will transfer a group of redo log records to log buffer, then a synchronous flush of the log buffer is done to the redo log file by using the function log_buffer_flush_to_disk(). This function will write the log buffer from log_sys->buf_next_to_write to log_sys->buf_free. In terms of LSN, the function log_buffer_flush_to_disk() flushes the redo logs from log_sys->flushed_to_disk_lsn to log_sys->lsn.

The mini transaction (mtr)

A mini transaction (mtr) must be used to generate all the redo log records. A mini transaction contains a local buffer (called the mini transaction buffer) into which the generated redo log records will be stored. If we need to generate a group of redo log records such that either all make it to the redo log file or none makes it, then we need to put them in a single mini transaction. Apart from the redo log records, the mini transaction also maintains a list of pages that has been modified (dirty pages).

The normal usage of a mini transaction is as follows:

  • Create a mini transaction object of type mtr_t
  • Start the mini transaction with mtr_start() function. This will initialize the mini transaction buffer.
  • Generate the redo log records by making use of mlog_write_ulint() family of functions.
  • Commit the mini transaction with mtr_commit() function. This will transfer the redo log records from mini transaction buffer to the global redo log buffer. The list of dirty pages is added to the buffer pool flush list.

The definition of a mini transaction is given here for your reference. The member mtr_t::log contains the mini transaction buffer which will hold the redo log records, and the member mtr_t::memo contains a list of pages dirtied by this mini transaction.

/* Mini-transaction handle and buffer */
struct mtr_t{
	dyn_array_t	memo;	/*!< memo stack for locks etc. */
	dyn_array_t	log;	/*!< mini-transaction log */
	unsigned	inside_ibuf:1;
				/*!< TRUE if inside ibuf changes */
	unsigned	modifications:1;
				/*!< TRUE if the mini-transaction
				modified buffer pool pages */
	unsigned	made_dirty:1;
				/*!< TRUE if mtr has made at least
				one buffer pool page dirty */
	ulint		n_log_recs;
				/* count of how many page initial log records
				have been written to the mtr log */
	ulint		n_freed_pages;
				/* number of pages that have been freed in
				this mini-transaction */
	ulint		log_mode; /* specifies which operations should be
				logged; default value MTR_LOG_ALL */
	lsn_t		start_lsn;/* start lsn of the possible log entry for
				this mtr */
	lsn_t		end_lsn;/* end lsn of the possible log entry for
				this mtr */

Redo Log Record Types

When we modify a database page, a redo log record is generated. This redo log record either contains what information has been changed in the page (physical redo log) or how to perform the change again (logical redo log). InnoDB uses a combination of physical redo logs and logical redo logs.

To understand this consider an operation like page re-organization. If we generated a physical redo log record for this operation, it is likely that the redo log record generated will be equal to the page size. But instead InnoDB uses logical redo log record for this operation, which will reduce the redo log record size significantly. In the case of logical redo log record to represent a page reorganize, all we need is the information to uniquely identify a page, and "type" of operation which is page reorganize.

So each redo log record has a type. A redo log record type helps to identify the function that will be used to apply or execute the redo log during recovery. The contents of the redo log record must then contain all the arguments or parameters needed by the function.

Life cycle of a redo log record

The life cycle of a redo log record is as follows:
  • The redo log record is first created by a mini transaction and stored in the mini transaction buffer. It has information necessary to redo the same operation again in the time of database recovery.
  • When mtr_commit() is done, the redo log record is transferred to the global in-memory redo log buffer. If necessary, the redo log buffer will be flushed to the redo log file, to make space for the new redo log record.
  • The redo log record has a specific lsn associated with it. This association is established during mtr_commit(), when the redo log record is transferred from mtr buffer to log buffer. Once the lsn is established for a redo log record, then its location in the redo log file is also established.
  • The redo log record is then transferred from the log buffer to redo log file when it is written + flushed. This means that the redo log record is now durably stored on the disk.
  • Each redo log record has an associated list of dirty pages. This relationship is established via LSN. A redo log record must be flushed to disk, before its associated dirty pages. A redo log record can be discarded only after all the associated dirty pages are flushed to the disk.
  • A redo log record can be used to re-create its associated list of dirty pages. This happens during database recovery.


This article provided an overview of the redo log subsystem of InnoDB storage engine. The major data structures used in the redo log subsystem are the mini transaction (mtr), the in-memory redo log buffer and the on-disk redo log files. The InnoDB storage engine tracks many LSN values to ensure that the write-ahead logging (WAL) happens correctly.

Since data loss is unacceptable for an RDBMS, the redo log subsystem is very critical for a successful RDBMS. And since redo logs are generated synchronously with DML operations, it is important to do it efficiently. The size of redo logs must be kept as minimal as possible.

Send in your feedback to me.

Friday May 31, 2013

Introduction to Transaction Locks in InnoDB Storage Engine


Transaction locks are an important feature of any transactional storage engine. There are two types of transaction locks – table locks and row locks. Table locks are used to avoid a table being altered or dropped by one transaction when another transaction is using the table. It is also used to prohibit a transaction from accessing a table, when it is being altered. InnoDB supports multiple granularity locking (MGL). So to access rows in a table, intention locks must be taken on the tables.

Row locks are at finer granularity than table level locks, different threads can work on different parts of the table without interfering with each other. This is in contrast with MyISAM where the entire table has to be locked when updating even unrelated rows. Having row locks means that multiple transactions can read and write into a single table. This increases the concurrency level of the storage engine. InnoDB being an advanced transactional storage engine, provides both table and row level transaction locks.

This article will provide information about how transaction locks are implemented in InnoDB storage engine. The lock subsystem of InnoDB provides many services to the overall system, like:

  • Creating, acquiring, releasing and destroying row locks.
  • Creating, acquiring, releasing and destroying table locks.
  • Providing multi-thread safe access to row and table locks.
  • Data structures useful for locating a table lock or a row lock.
  • Maintaining a list of user threads suspended while waiting for transaction locks.
  • Notification of suspended threads when a lock is released.
  • Deadlock detection

The lock subsystem helps to isolate one transaction from another transaction. This article will provide information about how transaction locks are created, maintained and used in the InnoDB storage engine. All reference to locks means transaction locks, unless specified otherwise.

Internal Data Structures of InnoDB

Before we proceed with scenarios and algorithms, I would like to present the following data structures. We need to be familiar with these data structures to understand how transaction locks work in InnoDB. The data structures of interest are:

  1. The enum lock_mode – provides the list of modes in which the transaction locks can be obtained.
  2. The lock struct lock_t. This represents either a table lock or a row lock.
  3. The struct trx_t which represents one transaction within InnoDB.
  4. The struct trx_lock_t which associates one transaction with all its transaction locks.
  5. The global object of type lock_sys_t holds the hash table of row locks.
  6. The table descriptor dict_table_t, which uniquely identifies a table in InnoDB. Each table descriptor contains a list of locks on the table.
  7. The global object trx_sys of type trx_sys_t, which holds the active transaction table (ATT).

The Lock Modes

The locks can be obtained in the following modes. I'll not discuss about the LOCK_AUTO_INC in this article.

/* Basic lock modes */ 
enum lock_mode { 
        LOCK_IS = 0,    /* intention shared */ 
        LOCK_IX,        /* intention exclusive */ 
        LOCK_S,         /* shared */ 
        LOCK_X,         /* exclusive */ 
        LOCK_AUTO_INC,  /* locks the auto-inc counter of a table 
                        in an exclusive mode */ 
        LOCK_NONE,      /* this is used elsewhere to note consistent read */ 
        LOCK_NUM = LOCK_NONE, /* number of lock modes */ 
        LOCK_NONE_UNSET = 255 

The lock struct or lock object

The structure lock_t represents either a table lock (lock_table_t) or a group of row locks (lock_rec_t) for all the rows belonging to the same page. For different lock modes, different lock structs will be used. For example, if a row in a page is locked in LOCK_X mode, and if another row in the same page is locked in LOCK_S mode, then these two row locks will be held in different lock structs.

This structure is defined as follows:

struct lock_t {
        trx_t*          trx;
        ulint           type_mode;
        hash_node_t     hash;
        dict_index_t*   index;
        union {
                lock_table_t    tab_lock;/*!< table lock */
                lock_rec_t      rec_lock;/*!< record lock */
        } un_member;                    /*!< lock        details */

/** A table lock */
struct lock_table_t {
        dict_table_t*   table;          /*!< database table in dictionary
                                        cache */
                        locks;          /*!< list of locks on the same
                                        table */

/** Record lock for a page */
struct lock_rec_t {
        ulint   space;                  /*!< space id */
        ulint   page_no;                /*!< page number */
        ulint   n_bits;                 /*!< number of bits in the lock
                                        bitmap; NOTE: the lock bitmap is
                                        placed immediately after the
                                        lock struct */

The important point here is the lock bitmap. The lock bitmap is a space efficient way to represent the row locks. This space efficient way of representing the row locks avoids the need for lock escalation and lock data persistence. (Note: For prepared transactions, it would be useful to have lock data persistence, but InnoDB currently do not support lock data persistence.)

The lock bitmap is placed immediately after the lock struct object. If a page can contain a maximum of 100 records, then the lock bitmap would be of size 100 (or more). Each bit in this bitmap will represent a row in the page. The heap_no of the row is used to index into the bitmap. If the 5th bit in the bitmap is enabled, then the row with heap_no 5 is locked.

The Transaction And Lock Relationship

The struct trx_t is used to represent the transaction within InnoDB. The struct trx_lock_t is used to represent all locks associated to a given transaction. Here I have listed down only those members relevant to this article.

struct trx_t {

	trx_id_t        id;             /*!< transaction id */ 

      trx_lock_t      lock;           /*!< Information about the transaction 
                                        locks and state. Protected by 
                                        trx->mutex or lock_sys->mutex 
                                        or both */ 

struct trx_lock_t { 

        ib_vector_t*    table_locks;    /*!< All table locks requested by this 
                                        transaction, including AUTOINC locks */ 

                        trx_locks;      /*!< locks requested 
                                        by the transaction; 
                                        insertions are protected by trx->mutex 
                                        and lock_sys->mutex; removals are 
                                        protected by lock_sys->mutex */ 


Global hash table of row locks

Before we look at how the row locks internally works, we need to be aware of this global data structure. The lock subsystem of the InnoDB storage engine has a global object lock_sys of type lock_sys_t. The class lock_sys_t is defined as follows:

struct lock_sys_t { 
        ib_mutex_t      mutex;   
        hash_table_t*   rec_hash; 
        ulint           n_lock_max_wait_time; 
        // more ...

The lock_sys_t::rec_hash member is the hash table of the record locks. Every page within InnoDB is uniquely identified by the (space_id, page_no) combination, called the page address. The hashing is done based on the page address. So given the page address, we can locate the list of lock_t objects of that page. All lock structs of the given page will be in the same hash bucket. So using this mechanism we can locate the row lock of any row.

The Transaction Subsystem Global Object

The transaction subsystem of InnoDB has one global object trx_sys of type trx_sys_t, which is an important internal data structure. It is defined as follows:

/** The transaction system central memory data structure. */ 
struct trx_sys_t{ 

	  // more ...

        trx_list_t      rw_trx_list;    /*!< List of active and committed in 
                                        memory read-write transactions, sorted 
                                        on trx id, biggest first. Recovered 
                                        transactions are always on this list. */ 
        trx_list_t      ro_trx_list;    /*!< List of active and committed in 
                                        memory read-only transactions, sorted 
                                        on trx id, biggest first. NOTE: 
                                        The order for read-only transactions 
                                        is not necessary. We should exploit 
                                        this and increase concurrency during 
                                        add/remove. */ 
        // more ...

This global data structure contains the active transaction table (ATT) of InnoDB storage engine. In the following sections, I'll use this global object to access my transaction object through the debugger to demonstrate the transaction locks.

The table descriptor (dict_table_t)

Each table in InnoDB is uniquely identified by its name in the form of databasename/tablename. For each table, the data dictionary of InnoDB will contain exactly one table descriptor object of type dict_table_t. Given the table name, the table descriptor can be obtained. This table descriptor contains a list of locks on the table. This list can be used to check if the table has already been locked by a transaction.

struct dict_table_t{ 

        // moretable_id_t      id;     /*!< id of the table */ 
        char*           name;   /*!< table name */ 

                        locks;  /*!< list of locks on the table; protected 
                                by lock_sys->mutex */ 

The heap_no of a row

Every row in InnoDB is uniquely identified by space_id, page_no and heap_no of the row. I assume that you know about space_id and page_no. I'll explain here only about the heap_no of the row. Each row in the page has a heap_no. The heap_no of the infimum record is 0, the heap_no of the supremum record is 1, and the heap_no of the first user record in page is 2.

If we have inserted 10 records in the page, and if there has been no updates on the page, then the heap_no of the user records would be 2, 3, 4, 5 … 10, 11. The heap_no will be in the same order in which the records will be accessed in ascending order.

The heap_no of the row is used to locate the bit in the lock bitmap corresponding to the row. If the row has a heap_no of 10, then the 10th bit in the lock bitmap corresponds to the row. This means that if the heap_no of the row is changed, then the associated lock structs must be adjusted accordingly.

The Schema

To demonstrate the transaction locks, I use the following schema in this article. There is one table t1 which has 3 rows (1, 'அ'), (2, 'ஆ') and (3, 'இ'). In this article, we will see when and how the transaction locks (table and row) are created and used. We will also cover the internal steps involved in creating table and row locks.

set names utf8; 
drop table if exists t1; 
create table t1 (f1 int primary key, f2 char(10)) charset='utf8'; 
insert into t1 values (1, 'அ'), (2, 'ஆ'), (3, 'இ'); 

Table Level Transaction Locks

The purpose of table level transaction locks or simply table locks is to ensure that no transactions modify the structure of the table when another transaction is accessing or modifying the table or the rows in a table. There are two types of table locks in MySQL – one is the meta-data locking (MDL) provided by the SQL engine and the other is the table level locks within InnoDB storage engine. Here the discussion is only about the table level locks within InnoDB.

On tables, InnoDB normally acquires only intentional shared (LOCK_IS) or intentional exclusive (LOCK_IX) modes. It does not lock the tables in shared mode (LOCK_S) or exclusive mode (LOCK_X) unless explicitly requested via LOCK TABLES command. One exception to this is in the prepare phase of the online alter table command, where the table is locked in shared mode. Please refer to Multiple Granularity Locking to know about intention shared and intention exclusive locks.

Scenario (௧) for LOCK_IS table lock

Here is an example scenario that will take intention shared (LOCK_IS) lock on the table.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 

When the above 3 statements are executed, one table lock would be taken in LOCK_IS mode. In mysql-5.6, there is no way to verify this other than using the debugger. I verified it as follows:

(gdb) set $trx_locklist = trx_sys->rw_trx_list->start->lock->trx_locks 
(gdb) p $trx_locklist.start->un_member->tab_lock->table->name
$21 = 0x7fffb0014f70 "test/t1" 
(gdb) p lock_get_mode(trx_sys->rw_trx_list->start->lock->trx_locks->start) 
$25 = LOCK_IS 

You need to access the correct transaction object and the lock object.

Scenario (௨) for LOCK_IX table lock

Here is an extension to the previous scenario. Adding an INSERT statement to the scenario will take an intention exclusive (LOCK_IX) lock on the table.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 
mysql> insert into t1 values (4, 'ஃ'); 

When the above 4 statements are executed, there would be two locks on the table – LOCK_IS and LOCK_IX. The select statement would have taken the table lock in LOCK_IS mode and the INSERT statement would take the table lock in LOCK_IX mode.

This can also be verified using the debugger. I'll leave it as an exercise for the reader.

What Happens Internally When Acquiring Table Locks

Each table is uniquely identified within the InnoDB storage engine using the table descriptor object of type dict_table_t. In this section we will see the steps taken internally by InnoDB to obtain a table lock. The function to refer to is lock_table() in the source code. Necessary mutexes are taken during these steps to ensure that everything works correctly in a multi-thread environment. This aspect is not discussed here.

  1. The request for a table lock comes with the following information – the table to lock (dict_table_t), the mode in which to lock (enum lock_mode), and the transaction which wants the lock (trx_t).
  2. Check whether the transaction is already holding an equal or stronger lock on the given table. Each transaction maintains a list of table locks obtained by itself (trx_t::trx_lock_t::table_locks). Searching this list for the given table and mode is sufficient to answer this question. If the current transaction is already holding an equal or stronger lock on the table, then the lock request is reported as success. If not, then go to next step.
  3. Check if any other transaction has an incompatible lock request in the lock queue of the table. Each table descriptor has a lock queue (dict_table_t::locks). Searching this queue is sufficient to answer this question. If some other transaction holds an incompatible lock on the table, then the lock request needs to wait. Waiting for a lock can lead to time out or deadlock error. If there is no contention from other transactions for this table, then proceed further.
  4. Allocate a lock struct object (lock_t). Initialize with table, trx and lock mode information. Add this object to the queue in dict_table_t::locks object of the table as well as the trx_t::trx_lock_t::table_locks.
  5. Complete.

You can re-visit the above scenario (௨) and then follow the above steps and verify that the number of lock structs created is 2.

Row Level Transaction Locks

Each row in the InnoDB storage engine needs to be uniquely identified in order to be able to lock it. A row is uniquely identified by the following pieces of information:

  • The space identifier
  • The page number within the space.
  • The heap number of the record within the page.
  • The descriptor of the index to which the row belongs (dict_index_t). Stricly speaking, this is not necessary to uniquely identify the row. But associating the row to its index will help to provide user friendly diagnosis and also help the developers to debug a problem.

There are two types of row locks in InnoDB – the implicit and the explicit. The explicit row locks are the locks that make use of the global row lock hash table and the lock_t structures. The implicit row locks are logically arrived at based on the transaction information in the clustered index or secondary index record. These are explained in the following sections.

Implicit Row Locks

Implicit row locks do not have an associated lock_t object allocated. This is purely calculated based on the ID of the requesting transaction and the transaction ID available in each record. First, let use see how implicit locks are “acquired” (here is comment from

“If a transaction has modified or inserted an index record, then it owns an implicit x-lock on the record. On a secondary index record, a transaction has an implicit x-lock also if it has modified the clustered index record, the max trx id of the page where the secondary index record resides is >= trx id of the transaction (or database recovery is running), and there are no explicit non-gap lock requests on the secondary index record. ”

As we can see, the implicit locks are a logical entity and whether a transaction has implicit locks or not is calculated using certain procedures. This procedure is explained here briefly.

If a transaction wants to acquire a row lock (implicit or explicit), then it needs to determine whether any other transaction has an implicit lock on the row before checking on the explicit lock. As explained above this procedure is different for clustered index and secondary index.

For the clustered index, get the transaction id from the given record. If it is a valid transaction id, then that is the transaction which is holding the implicit exclusive lock on the row. Refer to the function lock_clust_rec_some_has_impl() for details.

For the secondary index, it is more cumbersome to calculate. Here are the steps:

  1. Let R1 be the secondary index row for which we want to check if another transaction is holding an implicit exclusive lock.
  2. Obtain the maximum transaction id (T1) of the page that contains the secondary index row R1.
  3. Let T2 be the minimum transaction id of the InnoDB system. If T1 is less than T2, then there can be no transaction holding implicit lock on the row. Otherwise, go to next step.
  4. Obtain the corresponding clustered index record for the given secondary index record R1.
  5. Get the transaction id (T3) from the clustered index record. By looking at the clustered index record, including its older versions, find out if T3 could have modified or inserted the secondary index row R1. If yes, then T3 has an implicit exclusive lock on row R1. Otherwise it does not have.

In the case of secondary indexes, we need to make use of the undo logs to determine if any transactions have an implicit exclusive row lock on record. Refer to the function lock_sec_rec_some_has_impl() for details.

Also note that the implicit row locks do not affect the gaps.

Explicit Row Locks

Explicit row locks are represented by the lock_t structures. This section provides some scenarios in which explicit row locks are obtained in different modes. It also briefly discusses the internal procedure to acquire an explicit row lock.

Scenario (௩) for LOCK_S row lock

For transaction isolation level REPEATABLE READ and less stronger levels, InnoDB does not take shared locks on rows. So in the following scenario, we use SERIALIZABLE transaction isolation level for demonstrating shared row locks:

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 

This scenario is the same as the Scenario (௧). The verification via debugger alone is different. In the case of row locks, each lock object represents a row lock for all rows of a page (of the same lock mode). So a lock bitmap is used for this purpose which exists at the end of the lock struct object. In the above scenario, we can verify the locks as follows:

The lock_t object allocated for the row locks is

(gdb) set $trx_locklist = trx_sys->rw_trx_list->start->lock->trx_locks 
(gdb) set $rowlock = $trx_locklist.start->trx_locks->next 
(gdb) p $rowlock 
$47 = (ib_lock_t *) 0x7fffb00103f0 

Verify the lock mode of the lock object.

(gdb) p lock_get_mode($rowlock) 
$48 = LOCK_S 

Verify that it is actually a lock struct object allocated for rows (and not a table lock).

(gdb) p *$rowlock 
$43 = {trx = 0x7fffb0013f78, trx_locks = {prev = 0x7fffb00103a8, next = 0x0}, 
  type_mode = 34, hash = 0x0, index = 0x7fffb001a6b8, un_member = {tab_lock = { 
      table = 0xc, locks = {prev = 0x3, next = 0x48}}, rec_lock = {space = 12, 
      page_no = 3, n_bits = 72}}} 

The row lock bit map is at the end of the lock object. Even though the number of bits in the lock bitmap is given as n_bits = 72 (9 bytes) above, I am examining only 4 bytes here.

(gdb) x /4t $rowlock + 1 
0x7fffb0010438:	00011110	00000000	00000000	00000000 

Note that there are 4 bits enabled. This means that there are 4 row locks obtained. The row in the page, and the bit in the lock bit map are related via the heap_no of the row, as described previously.

Scenario (௪) for LOCK_X row lock

Changing the above scenario slightly as follows, will obtain LOCK_X locks on the rows.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1 for update; 

Verification of this through the debugger is left as an exercise for the reader.

What Happens Internally When Acquiring Explicit Row Locks

To lock a row, all information necessary to uniquely identify the row – the trx, lock mode, table space id, page_no and heap_no – must be supplied. The entry point for row locking is the lock_rec_lock() function in the source code.

  1. Check if the given transaction has a granted explicit lock on the row with equal or stronger lock mode. To do this search, the global hash table of row locks is used. If the transaction already has a strong enough lock on the row, then nothing more to do. Otherwise go to next step.
  2. Check if other transactions have a conflicting lock on the row. For this search also, the global hash table of row locks is used. If yes, then the current transaction must wait. Waiting can lead to timeout or deadlock error. If there is no contention for this row from other transactions then proceed further.
  3. Check if there is already a lock struct object for the given row. If yes, reuse the same lock struct object. Otherwise allocate a lock struct object.
  4. Initialize the trx, lock mode, page_no, space_id and the size of the lock bit map. Set the bit of the lock bitmap based on the heap_no of the row.
  5. Insert this lock struct object in the global hash table of row locks.
  6. Add this to the list of transaction locks in the transaction object (trx_t::trx_lock_t::trx_locks).

Releasing transaction locks

All the transaction locks acquired by a transaction is released by InnoDB at transaction end (commit or rollback). In the case of REPEATABLE READ isolation level or SERIALIZABLE, it is not possible for the SQL layer to release locks before transaction end. In the case of READ COMMITTED isolation level, it is possible for the SQL layer to release locks before transaction end by making use of ha_innobase::unlock_row() handler interface API call.

To obtain the list of all transaction locks of a transaction, the following member can be used – trx_t::trx_lock_t::trx_locks.


This article provided an overview of the transaction locks in InnoDB. We looked at the data structures used to represent transaction locks. We analysed some scenarios in which the various locks are created. We also looked at the internal steps that happen when a table lock or a row lock is created.

You can download this article in pdf.

Friday Apr 19, 2013

Data Organization in InnoDB


This article will explain how the data is organized in InnoDB storage engine. First we will look at the various files that are created by InnoDB, then we look at the logical data organization like tablespaces, pages, segments and extents. We will explore each of them in some detail and discuss about their relationship with each other. At the end of this article, the reader will have a high level view of the data layout within the InnoDB storage engine.

The Files

MySQL will store all data within the data directory. The data directory can be specified using the command line option –data-dir or in the configuration file as datadir. Refer to the Server Command Options for complete details.

By default, when InnoDB is initialized, it creates 3 important files in the data directory – ibdata1, ib_logfile0 and ib_logfile1. The ibdata1 is the data file in which system and user data will be stored. The ib_logfile0 and ib_logfile1 are the redo log files. The location and size of these files are configurable. Refer to Configuring InnoDB for more details.

The data file ibdata1 belongs to the system tablespace with tablespace id (space_id) of 0. The system tablespace can contain more than 1 data file. As of MySQL 5.6, only the system tablespace can contain more than 1 data file. All other tablespaces can contain only one data file. Also, only the system tablespace can contain more than one table, while all other tablespaces can contain only one table.

The data files and the redo log files are represented in the memory by the C structure fil_node_t.


By default, InnoDB contains only one tablespace called the system tablespace whose identifier is 0. More tablespaces can be created indirectly using the innodb_file_per_table configuration parameter. In MySQL 5.6, this configuration parameter is ON by default. When it is ON, each table will be created in its own tablespace in a separate data file.

The relationship between the tablespace and data files is explained in the InnoDB source code comment (storage/innobase/fil/ which is quoted here for reference:

A tablespace consists of a chain of files. The size of the files does not have to be divisible by the database block size, because we may just leave the last incomplete block unused. When a new file is appended to the tablespace, the maximum size of the file is also specified. At the moment, we think that it is best to extend the file to its maximum size already at the creation of the file, because then we can avoid dynamically extending the file when more space is needed for the tablespace.”

The last statement about avoiding dynamic extension is applicable only to the redo log files and not the data files. Data files are dynamically extended, but redo log files are pre-allocated. Also, as already mentioned earlier, only the system tablespace can have more than one data file.

It is also clearly mentioned that even though the tablespace can have multiple files, they are thought of as one single large file concatenated together. So the order of files within the tablespace is important.


A data file is logically divided into equal sized pages. The first page of the first data file is identified with page number of 0, and the next page would be 1 and so on. A page within a tablespace is uniquely identified by the page identifier or page number (page_no). And each tablespace is uniquely identified by the tablespace identifier (space_id). So a page is uniquely identified throughout InnoDB by using the (space_id, page_no) combination. And any location within InnoDB can be uniquely identified by the (space_id, page_no, page_offset) combination, where page_offset is the number of bytes within the given page.

How the pages from different data files relate to one another is explained in another source code comment: “A block's position in the tablespace is specified with a 32-bit unsigned integer. The files in the chain are thought to be catenated, and the block corresponding to an address n is the nth block in the catenated file (where the first block is named the 0th block, and the incomplete block fragments at the end of files are not taken into account). A tablespace can be extended by appending a new file at the end of the chain.” This means that the first page of all the data files will not have page_no of 0 (zero). Only the first page of the first data file in a tablespace will have the page_no as 0 (zero).

Also in the above comment it is mentioned that the page_no is a 32-bit unsigned integer. This is the size of the page_no when stored on the disk.

Every page has a page header (page_header_t). For more details on this please refer to the Jeremy Cole's blog “The basics of InnoDB space file layout.


An extent is 1MB of consecutive pages. The size of one extent is defined as follows (1048576 bytes = 1MB):


The macro UNIV_PAGE_SIZE used to be a compile time constant. From mysql-5.6 onwards it is a global variable. The number of pages in an extent depends on the page size used. If the page size is 16K (the default), then an extent would contain 64 pages.

Page Types

One page can be used for many purposes. The page type will identify the purpose for which the page is being used. The page type of each page will be stored in the page header. The page types are available in the header file storage/innobase/include/fil0fil.h. The following table provides a brief description of the page types.

Page Type



The page is a B-tree node


The page stores undo logs


contains an array of fseg_inode_t objects.


The page is in the free list of insert buffer or change buffer.


Freshly allocated page.


Insert buffer or change buffer bitmap


System page


Transaction system data


File space header


Extent Descriptor Page


Uncompressed BLOB page


First compressed BLOB page


Subsequent compressed BLOB page

Each page type is used for different purposes. It is beyond the scope of this article, to explore each page type. For now, it is sufficient to know that all pages have a page header (page_header_t) and they store the page type in it, and based on the page type the contents and the layout of the page would be decided.

Tablespace Header

Each tablespace will have a header of type fsp_header_t. This data structure is stored in the first page of a tablespace.

  • The table space identifier (space_id)

  • Current size of the table space in pages.

  • List of free extents

  • List of full extents not belonging to any segment.

  • List of partially full/free extents not belonging to any segment.

  • List of pages containing segment headers, where all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE)

  • List of pages containing segment headers, where not all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE).

InnoDB Tablespace Header Structure

From the tablespace header, we can access the list of segments available in the tablespace. The total space occupied by the tablespace header is given by the macro FSP_HEADER_SIZE, which is equal to 16*7 = 112 bytes.

Reserved Pages of Tablespace

As mentioned earlier, InnoDB will always contain one tablespace called the system tablespace with identifier 0. This is a special tablespace and is always kept open as long as the MySQL server is running. The first few pages of the system tablespace is reserved for internal usage. This information can be obtained from the header storage/innobase/include/fsp0types.h. They are listed below with a short description.

Page Number

The Page Name




The extent descriptor page.



The insert buffer bitmap page.



The first inode page number.



Insert buffer header page in system tablespace.



Insert buffer B-tree root page in system tablespace.



Transaction system header in system tablespace.



First rollback segment page, in system tablespace.



Data dictionary header page in system tablespace.

As can be noted from above, the first 3 pages will be there in any tablespace. But the last 5 pages are reserved only in the case of system tablespace. In the case of other tablespaces only 3 pages are reserved.

When the option innodb_file_per_table is enabled, then for each table a separate tablespace with one data file would be created. The source code comment in the function dict_build_table_def_step() states the following:

                /* We create a new single-table tablespace for the table. 
                We initially let it be 4 pages: 
                - page 0 is the fsp header and an extent descriptor page, 
                - page 1 is an ibuf bitmap page, 
                - page 2 is the first inode page, 
                - page 3 will contain the root of the clustered index of the 
                table we create here. */ 

File Segments

A tablespace can contain many file segments. File segments (or just segments) is a logical entity. Each segment has a segment header (fseg_header_t), which points to the inode (fseg_inode_t) describing the file segment. The file segment header contains the following information:

  • The space to which the inode belongs

  • The page_no of the inode

  • The byte offset of the inode

  • The length of the file segment header (in bytes).

Note: It would have been really more readable (at source code level) if fseg_header_t and fseg_inode_t had proper C-style structures defined for them.

The fseg_inode_t object contains the following information:

  • The segment id to which it belongs.

  • List of full extents.

  • List of free extents of this segment.

  • List of partially full/free extents

  • Array of individual pages belonging to this segment. The size of this array is half an extent.

When a segment wants to grow, it will get free extents or pages from the tablespace to which it belongs.


In InnoDB, when a table is created, a clustered index (B-tree) is created internally. This B-tree contains two file segments, one for the non-leaf pages and the other for the leaf pages. From the source code documentation:

In the root node of a B-tree there are two file segment headers. The leaf pages of a tree are allocated from one file segment, to make them consecutive on disk if possible. From the other file segment we allocate pages for the non-leaf levels of the tree.”

For a given table, the root page of a B-tree will be obtained from the data dictionary. So in InnoDB, each table exists within a tablespace, and contains one B-tree (the clustered index), which contains 2 file segments. Each file segment can contain many extents, and each extent contains 1MB of consecutive pages.


This article discussed the details about the data organization within InnoDB. We first looked at the files created by InnoDB, and then discussed about the various logical entities like tablespaces, pages, page types, extents, segments and tables. We also looked at the relationship between each one of them.

Send in your comments and feedback to

Tuesday Jan 15, 2013

Repeatable Read Isolation Level in InnoDB - How Consistent Read View Works

This article discusses about the approach taken by InnoDB Storage Engine of MySQL to provide the repeatable read isolation level.  First, an example is presented to demonstrate the two different designs that are possible.  Then the design used in InnoDB is presented followed by a short discussion about the advantages and disadvantages of this design choice.  As part of this discussion, we also present a performance optimization done in MySQL 5.6. 

An Example Scenario

I used MySQL 5.5 for this purpose.  Let us create the following tables t1 and t2 in the test database that is available by default. Even though the default storage engine in MySQL 5.5 is InnoDB, I explicitly specify it for clarity.  

mysql> use test;
mysql> create table t1 (f1 int) engine=innodb;
mysql> create table t2 (f1 int) engine=innodb;
mysql> insert into t1 values (1);
mysql> insert into t2 values (1);

Consider the following scenario (transactions are started with REPEATABLE READ isolation level):

S. No.

Session 1

Session 2


start transaction;


select f1 from t1;


start transaction;


update t2 set f1 = f1+1;




select f1 from t2;

Please go through the above scenario and find out what would be the result set in Session 1, line 6, query “select f1 from t2”?  Proceed once you know what you would expect.

By default, the isolation level in InnoDB is repeatable read.  So in line 1 and 3 the transactions would be started with repeatable read isolation level.  The output of the query in line 6 above will depend on the design approach taken to implement this isolation level.  The two different design approaches possible are the traditional locking approach and the snapshot isolation technique.

In traditional locking approach, a transaction running in REPEATABLE READ isolation level would acquire a row lock on all the rows that has been read.  No gap locks or range locks will be taken.  These transactional row locks will be used to provide the semantics of the REPEATABLE READ isolation level.   There are two drawbacks in this approach.  One, the number of row locks taken can become very high and hence will be a performance problem.  Two, this approach will not prevent phantom reads.   

In InnoDB, a REPEATABLE READ isolation level is provided by making use of the snapshot isolation technique.  This technique will be explained in the following sections.  

If the traditional locking approach is taken then the output would be 2 (the new value).  But in InnoDB the output of query “select f1 from t2” in line 6 would be 1 (the old value).   

Creating a Consistent Read View

InnoDB creates a consistent read view or a consistent snapshot either when the statement


is executed or when the first select query is executed in the transaction.  In the example given above, the consistent read view is created in Session 1 when the first select query in line 2 is executed.  This consistent read view is at the database level, so further select queries in Session 1 will not see the modifications done by other transactions in any of the tables in the database.  This is the reason why the “select f1 from t2” in Session 1 sees the old tuple instead of the new tuple as updated by Session 2.  But Session 1 can see the modifications done by itself.

When a transaction creates a read view, it creates a read view (read_view_t) object and copies the list of all the active transactions into it.  It uses this information later to provide a consistent view of the database as it existed when the read view was created.   How this is done is detailed in the sections below.  But before we proceed further, there is a small detail that we need to be aware of to understand how the read view works.

Hidden Fields of a InnoDB Table

For all InnoDB tables, 3 fields are internally added to the table – DB_ROW_ID, DB_TRX_ID and DB_ROLL_PTR (refer to InnoDB Multi-Versioning).  They are explained below:

S. No.

System Column


Length (bytes)



The row identifier




The identifier of the transaction identifier that created the tuple




The rollback data pointer, pointing to previous versions of the tuple in the undo logs.


Update: For tables, with explicit PRIMARY KEY or UNIQUE NOT NULL key, the DB_ROW_ID will not be stored in the row, even though it will be listed as one of the columns in the data dictionary object for the table (of type dict_table_t). 

For our current discussion the focus is on DB_TRX_ID and the DB_ROLL_PTR.  Each tuple or row contains the identifier of the transaction that created the tuple, and a list of previous versions of the tuple.  

In InnoDB, the transaction identifier is an integer value and it is a monotonically increasing value.    So using the transaction identifier one can determine which transaction started earlier and which started later.  

How Consistent Read View Works

With the help of read view object and the hidden fields in the tuple, a transaction can construct the database as it existed at the time the consistent read view was created.  For this purpose, the transaction using the read view has to determine whether it can "see a particular transaction".  This is done using the following 2 rules:

Rule 1: When the read view object is created it notes down the smallest transaction identifier that is not yet used as a transaction identifier (trx_sys_t::max_trx_id).   The read view calls it the low limit. So the transaction using the read view must not see any transaction with identifier greater than or equal to this low limit.

Rule 2: The transaction using the read view must not see a transaction that was active when the read view was created.

Whenever a tuple is accessed, the transaction using the read view will check if it can see the transaction that created the tuple (using the two rules mentioned above).  If not, then check if the tuple has a previous version.  If there are no previous versions then ignore the tuple and proceed to the next tuple.  If there are previous versions, then repeat the procedure to find and build the correct version of the tuple to access from the undo log by following the DB_ROLL_PTR.

This avoids the phantom read problem in the REPEATABLE READ isolation in InnoDB.

Purpose of Read View

Why create the read view in the first place?  To provide the repeatable read isolation level, creating a read view and later accessing the various versions of the row without locking has a performance benefit.  The alternative would be to take a read lock on all the rows that was read in the transaction with repeatable read isolation level.  Accessing a large table with millions of records would generate that many shared row locks.  For performance gain a read view is created.  

Performance Problem in MySQL 5.5

A transaction using the read view must decide whether to see the rows created by a particular transaction.  This is done with the help of the two rules given above in the section “How Consistent Read View Works.”  The first rule is simple and doesn't involve much copying.  But the 2nd rule is expensive.  Because of the 2nd rule, when the read view object is created, the complete list of active transaction identifiers are copied.  The number of bytes copied depends on the number of active transactions in the database system when the read view is created.  

The performance problem is not only because of the number of bytes copied during the read view creation, but also because of the actual traversal of the list of all the active transactions.  This traversal is done under the protection of a mutex.  So when there are many transactions trying to create a read view concurrently, the contention for this mutex becomes high.

The Read Only Optimization in MySQL 5.6

A recent discussion in InnoDB developers mailing list suggested that the read view creation can be a very expensive operation if the number of active concurrent transactions are very high (say in thousands).  One of the optimizations that was discussed involved identifying the read-only transactions and ignoring them during read view creation.

Since read-only transactions do not make any changes that change the state of the tuple they are not required to be part of the read view. If the read-only transactions are ignored, then the number of transaction identifiers that needs to be copied during the read view creation can be reduced significantly.   This will also reduce the mutex contention discussed earlier.  This idea is based on the reasonable assumption that at any point of time, there will be a mix of read-only and read-write active transactions in a database system and that the read-only transactions will dominate the work load.

To facilitate this the Transaction Subsystem of InnoDB in MySQL 5.6 contains two lists of transactions, one for the read-write transactions (trx_sys_t::rw_trx_list) and the other for read-only transactions (trx_sys_t::ro_trx_list).  For creating the read view only the list of read-write transactions are used.  The length of transaction list to be traversed to create a read view is now smaller and the number of bytes copied is also less.  This makes the read view creation faster and more scalable.

Refer to Optimizations for Read-Only Transactions in MySQL 5.6 Reference Manual for further details.


In this article, we saw how InnoDB provides the repeatable read isolation level using a consistent read view or consistent snapshot.  Because of this approach, InnoDB is able to avoid the creation of many shared row locks and hence is able to achieve better performance.  The phantom read problem is also avoided.  We also saw how the consistent read view works.  

We discussed a particular performance problem in MySQL 5.5 and presented an optimization done in MySQL 5.6, which makes the read view creation faster and more scalable.  In certain highly concurrent situations, the current design of read view does have some performance challenges and MySQL team at Oracle is actively working to find solutions.  

For further reading refer to the Section “Consistent Nonlocking Reads” of the MySQL Reference Manual.  Send your feedback to

Saturday Sep 29, 2012

Online ALTER TABLE in MySQL 5.6

This is the low-level view of data dictionary language (DDL) operations in the InnoDB storage engine in MySQL 5.6. John Russell gave a more high-level view in his blog post April 2012 Labs Release – Online DDL Improvements.

MySQL before the InnoDB Plugin

Traditionally, the MySQL storage engine interface has taken a minimalistic approach to data definition language. The only natively supported operations were CREATE TABLE, DROP TABLE and RENAME TABLE. Consider the following example:

INSERT INTO t VALUES (1),(2),(3);

The CREATE INDEX statement would be executed roughly as follows:


You could imagine that the database could crash when copying all rows from the original table to the new one. For example, it could run out of file space. Then, on restart, InnoDB would roll back the huge INSERT transaction. To fix things a little, a hack was added to ha_innobase::write_row for committing the transaction every 10,000 rows.

Still, it was frustrating that even a simple DROP INDEX would make the table unavailable for modifications for a long time.

Fast Index Creation in the InnoDB Plugin of MySQL 5.1

MySQL 5.1 introduced a new interface for CREATE INDEX and DROP INDEX. The old table-copying approach can still be forced by SET old_alter_table=0.

This interface is used in MySQL 5.5 and in the InnoDB Plugin for MySQL 5.1. Apart from the ability to do a quick DROP INDEX, the main advantage is that InnoDB will execute a merge-sort algorithm before inserting the index records into each index that is being created. This should speed up the insert into the secondary index B-trees and potentially result in a better B-tree fill factor.

The 5.1 ALTER TABLE interface was not perfect. For example, DROP FOREIGN KEY still invoked the table copy. Renaming columns could conflict with InnoDB foreign key constraints. Combining ADD KEY and DROP KEY in ALTER TABLE was problematic and not atomic inside the storage engine.

The ALTER TABLE interface in MySQL 5.6

The ALTER TABLE storage engine interface was completely rewritten in MySQL 5.6. Instead of introducing a method call for every conceivable operation, MySQL 5.6 introduced a handful of methods, and data structures that keep track of the requested changes.

In MySQL 5.6, online ALTER TABLE operation can be requested by specifying LOCK=NONE. Also LOCK=SHARED and LOCK=EXCLUSIVE are available. The old-style table copying can be requested by ALGORITHM=COPY. That one will require at least LOCK=SHARED. From the InnoDB point of view, anything that is possible with LOCK=EXCLUSIVE is also possible with LOCK=SHARED.

Most ALGORITHM=INPLACE operations inside InnoDB can be executed online (LOCK=NONE). InnoDB will always require an exclusive table lock in two phases of the operation. The execution phases are tied to a number of methods:

Checks if the storage engine can perform all requested operations, and if so, what kind of locking is needed.
InnoDB uses this method to set up the data dictionary cache for upcoming CREATE INDEX operation. We need stubs for the new indexes, so that we can keep track of changes to the table during online index creation. Also, crash recovery would drop any indexes that were incomplete at the time of the crash.
In InnoDB, this method is used for creating secondary indexes or for rebuilding the table. This is the ‘main’ phase that can be executed online (with concurrent writes to the table).
This is where the operation is committed or rolled back. Here, InnoDB would drop any indexes, rename any columns, drop or add foreign keys, and finalize a table rebuild or index creation. It would also discard any logs that were set up for online index creation or table rebuild.

The prepare and commit phases require an exclusive lock, blocking all access to the table. If MySQL times out while upgrading the table meta-data lock for the commit phase, it will roll back the ALTER TABLE operation.

In MySQL 5.6, data definition language operations are still not fully atomic, because the data dictionary is split. Part of it is inside InnoDB data dictionary tables. Part of the information is only available in the *.frm file, which is not covered by any crash recovery log. But, there is a single commit phase inside the storage engine.

Online Secondary Index Creation

It may occur that an index needs to be created on a new column to speed up queries. But, it may be unacceptable to block modifications on the table while creating the index.

It turns out that it is conceptually not so hard to support online index creation. All we need is some more execution phases:

  1. Set up a stub for the index, for logging changes.
  2. Scan the table for index records.
  3. Sort the index records.
  4. Bulk load the index records.
  5. Apply the logged changes.
  6. Replace the stub with the actual index.

Threads that modify the table will log the operations to the logs of each index that is being created. Errors, such as log overflow or uniqueness violations, will only be flagged by the ALTER TABLE thread. The log is conceptually similar to the InnoDB change buffer.

The bulk load of index records will bypass record locking. We still generate redo log for writing the index pages. It would suffice to log page allocations only, and to flush the index pages from the buffer pool to the file system upon completion.


Starting with MySQL 5.6, InnoDB supports most ALTER TABLE operations natively. The notable exceptions are changes to the column type, ADD FOREIGN KEY except when foreign_key_checks=0, and changes to tables that contain FULLTEXT indexes.

The keyword ALGORITHM=INPLACE is somewhat misleading, because certain operations cannot be performed in-place. For example, changing the ROW_FORMAT of a table requires a rebuild.

Online operation (LOCK=NONE) is not allowed in the following cases:

  • when adding an AUTO_INCREMENT column,
  • when the table contains FULLTEXT indexes or a hidden FTS_DOC_ID column, or
  • when there are FOREIGN KEY constraints referring to the table, with ON…CASCADE or ON…SET NULL option.

The FOREIGN KEY limitations are needed, because MySQL does not acquire meta-data locks on the child or parent tables when executing SQL statements.

Theoretically, InnoDB could support operations like ADD COLUMN and DROP COLUMN in-place, by lazily converting the table to a newer format. This would require that the data dictionary keep multiple versions of the table definition. For simplicity, we will copy the entire table, even for DROP COLUMN.

The bulk copying of the table will bypass record locking and undo logging. For facilitating online operation, a temporary log will be associated with the clustered index of table. Threads that modify the table will also write the changes to the log.

When altering the table, we skip all records that have been marked for deletion. In this way, we can simply discard any undo log records that were not yet purged from the original table.

Off-page columns, or BLOBs, are an important consideration. We suspend the purge of delete-marked records if it would free any off-page columns from the old table. This is because the BLOBs can be needed when applying changes from the log. We have special logging for handling the ROLLBACK of an INSERT that inserted new off-page columns. This is because the columns will be freed at rollback.

New Enhancements for InnoDB Memcached

In MySQL 5.6, we continued our development on InnoDB Memcached and completed a few widely desirable features that make InnoDB Memcached a competitive feature in more scenario. Notablely, they are

1) Support multiple table mapping

2) Added background thread to auto-commit long running transactions

3) Enhancement in binlog performance

 Let’s go over each of these features one by one. And in the last section, we will go over a couple of internally performed performance tests.

Support multiple table mapping

In our earlier release, all InnoDB Memcached operations are mapped to a single InnoDB table. In the real life, user might want to use this InnoDB Memcached features on different tables. Thus being able to support access to different table at run time, and having different mapping for different connections becomes a very desirable feature. And in this GA release, we allow user just be able to do both. We will discuss the key concepts and key steps in using this feature.

1) "mapping name" in the "get" and "set" command

In order to allow InnoDB Memcached map to a new table, the user (DBA) would still require to "pre-register" table(s) in InnoDB Memcached “containers” table (there is security consideration for this requirement). If you would like to know about “containers” table, please refer to my earlier blogs in Once registered, the InnoDB Memcached will then be able to look for such table when they are referred.

Each of such registered table will have a unique "registration name" (or mapping_name) corresponding to the “name” field in the “containers” table.. To access these tables, user will include such "registration name" in their get or set commands, in the form of "get @@new_mapping_name.key", prefix "@@" is required for signaling a mapped table change. The key and the "mapping name" are separated by a configurable delimiter, by default, it is ".". So the syntax is:

get [@@mapping_name.]key_name

set [@@mapping_name.]key_name


 get @@mapping_name

set @@mapping_name

Here is an example:

Let's set up three tables in the "containers" table:

The first is a map to InnoDB table "test/demo_test" table with mapping name "setup_1"

INSERT INTO containers VALUES ("setup_1", "test", "demo_test",

"c1", "c2", "c3", "c4", "c5", "PRIMARY");

 Similarly, we set up table mappings for table "test/new_demo" with name "setup_2" and that to table "mydatabase/my_demo" with name "setup_3":

INSERT INTO containers VALUES ("setup_2", "test", "new_demo", "c1", "c2",

"c3", "c4", "c5", "secondary_index_x");

INSERT INTO containers VALUES ("setup_3", "my_database", "my_demo",

"c1", "c2", "c3", "c4", "c5", "idx");

To switch to table "my_database/my_demo", and get the value corresponding to “key_a”, user will do:

get @@setup_3.key_a

(this will also output the value that corresponding to key "key_a"

or simply

get @@setup_3

Once this is done, this connection will switch to "my_database/my_demo" table until another table mapping switch is requested. so it can continue issue regular command like:

get key_b

 set key_c 0 0 7

These DMLs will all be directed to "my_database/my_demo" table.

And this also implies that different connections can have different bindings (to different table).

2) Delimiter:

For the delimiter "." that separates the "mapping name" and key value, we also added a configure option in the "config_options" system table with name of "table_map_delimiter":

INSERT INTO config_options VALUES("table_map_delimiter", ".");

So if user wants to change to a different delimiter, they can change it in the config_option table.

3) Default mapping:

Once we have multiple table mapping, there should be always a "default" map setting. For this, we decided if there exists a mapping name of "default", then this will be chosen as default mapping. Otherwise, the first row of the containers table will chosen as default setting.

Please note, user tables can be repeated in the "containers" table (for example, user wants to access different columns of the table in different settings), as long as they are using different mapping/configure names in the first column, which is enforced by a unique index.

4) bind command

In addition, we also extend the protocol and added a bind command, its usage is fairly straightforward. To switch to "setup_3" mapping above, you simply issue:

bind setup_3

This will switch this connection's InnoDB table to "my_database/my_demo"

In summary, with this feature, you now can direct access to difference tables with difference session. And even a single connection, you can query into difference tables.

Background thread to auto-commit long running transactions

This is a feature related to the “batch” concept we discussed in earlier blogs. This “batch” feature allows us batch the read and write operations, and commit them only after certain calls. The “batch” size is controlled by the configure parameter “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size”. This could significantly boost performance.

However, it also comes with some disadvantages, for example, you will not be able to view “uncommitted” operations from SQL end unless you set transaction isolation level to read_uncommitted, and in addition, this will held certain row locks for extend period of time that might reduce the concurrency.

To deal with this, we introduce a background thread that “auto-commits” the transaction if they are idle for certain amount of time (default is 5 seconds). The background thread will wake up every second and loop through every “connections” opened by Memcached, and check for idle transactions. And if such transaction is idle longer than certain limit and not being used, it will commit such transactions. This limit is configurable by change “innodb_api_bk_commit_interval”. Its default value is 5 seconds, and minimum is 1 second, and maximum is 1073741824 seconds.

With the help of such background thread, you will not need to worry about long running uncommitted transactions when set daemon_memcached_w_batch_size and daemon_memcached_r_batch_size to a large number. This also reduces the number of locks that could be held due to long running transactions, and thus further increase the concurrency.

Enhancement in binlog performance

As you might all know, binlog operation is not done by InnoDB storage engine, rather it is handled in the MySQL layer. In order to support binlog operation through InnoDB Memcached, we would have to artificially create some MySQL constructs in order to access binlog handler APIs. In previous lab release, for simplicity consideration, we open and destroy these MySQL constructs (such as THD) for each operations. This required us to set the “batch” size always to 1 when binlog is on, no matter what “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size” are configured to. This put a big restriction on our capability to scale, and also there are quite a bit overhead in creating destroying such constructs that bogs the performance down.

With this release, we made necessary change that would keep MySQL constructs as long as they are valid for a particular connection. So there will not be repeated and redundant open and close (table) calls. And now even with binlog option is enabled (with innodb_api_enable_binlog,), we still can batch the transactions with daemon_memcached_w_batch_size and daemon_memcached_r_batch_size, thus scale the write/read performance.

Although there are still overheads that makes InnoDB Memcached cannot perform as fast as when binlog is turned off. It is much better off comparing to previous release. And we are continuing optimize the solution is this area to improve the performance as much as possible.

Performance Study:

Amerandra of our System QA team have conducted some performance studies on queries through our InnoDB Memcached connection and plain SQL end. And it shows some interesting results.

The test is conducted on a “Linux 2.6.32-300.7.1.el6uek.x86_64 ix86 (64)” machine with 16 GB Memory, Intel Xeon 2.0 GHz CPU X86_64 2 CPUs- 4 Core Each, 2 RAID DISKS (1027 GB,733.9GB). Results are described in following tables:

Table 1: Performance comparison on Set operations


5.6.7-RC-Memcached-plugin ( TPS / Qps) with memcached-threads=8***


X faster

Set (QPS)


















* mysql-5.6.7-rc-linux2.6-x86_64

** The “set” operation when implemented in InnoDB Memcached involves a couple of DMLs: it first query the table to see whether the “key” exists, if it does not, the new key/value pair will be inserted. If it does exist, the “value” field of matching row (by key) will be updated. So when used in above query, it is a precompiled store procedure, and query will just execute such procedures.

*** added “–daemon_memcached_option=-t8” (default is 4 threads)

So we can see with this “set” query, InnoDB Memcached can run 4.5 to 9 time faster than MySQL server.

Table 2: Performance comparison on Get operations


5.6.7-RC-Memcached-plugin ( TPS / Qps) with memcached-threads=8


X faster

Get (QPS)


















With the “get” query (or the select query), memcached performs 1.5 to 2 times faster than normal SQL.


In summary, we added several much-desired features to InnoDB Memcached in this release, allowing user to operate on different tables with this Memcached interface. We also now provide a background commit thread to commit long running idle transactions, thus allow user to configure large batch write/read without worrying about large number of rows held or not being able to see (uncommit) data. We also greatly enhanced the performance when Binlog is enabled. We will continue making efforts in both performance enhancement and functionality areas to make InnoDB Memcached a good demo case for our InnoDB APIs.

Jimmy Yang, September 29, 2012

Helping to Reduce Page Compression Failures Rate

When InnoDB compresses a page it needs the result to fit into its predetermined compressed page size (specified with KEY_BLOCK_SIZE). When the result does not fit we call that a compression failure. In this case InnoDB needs to split up the page and try to compress again. That said, compression failures are bad for performance and should be minimized.

Whether the result of the compression will fit largely depends on the data being compressed and some tables and/or indexes may contain more compressible data than others. And so it would be nice if the compression failure rate, along with other compression stats, could be monitored on a per table or even on a per index basis, wouldn't it?

This is where the new INFORMATION_SCHEMA table in MySQL 5.6 kicks in. INFORMATION_SCHEMA.INNODB_CMP_PER_INDEX provides exactly this helpful information. It contains the following fields:

| Field           | Type         | Null |
| database_name   | varchar(192) | NO   |
| table_name      | varchar(192) | NO   |
| index_name      | varchar(192) | NO   |
| compress_ops    | int(11)      | NO   |
| compress_ops_ok | int(11)      | NO   |
| compress_time   | int(11)      | NO   |
| uncompress_ops  | int(11)      | NO   |
| uncompress_time | int(11)      | NO   |

similarly to INFORMATION_SCHEMA.INNODB_CMP, but this time the data is grouped by "database_name,table_name,index_name" instead of by "page_size".

So a query like

compress_ops - compress_ops_ok AS failures
FROM information_schema.innodb_cmp_per_index
ORDER BY failures DESC;
would reveal the most problematic tables and indexes that have the highest compression failure rate.

From there on the way to improving performance would be to try to increase the compressed page size or change the structure of the table/indexes or the data being stored and see if it will have a positive impact on performance.

InnoDB Compression Improvements in MySQL 5.6

MySQL 5.6 comes with significant improvements for the compression support inside InnoDB. The enhancements that we'll talk about in this piece are also a good example of community contributions. The work on these was conceived, implemented and contributed by the engineers at Facebook. Before we plunge into the details let us familiarize ourselves with some of the key concepts surrounding InnoDB compression.

  • In InnoDB compressed pages are fixed size. Supported sizes are 1, 2, 4, 8 and 16K. The compressed page size is specified at table creation time.
  • InnoDB uses zlib for compression.
  • InnoDB buffer pool will attempt to cache compressed pages like normal pages. However, whenever a page is actively used by a transaction, we'll always have the uncompressed version of the page as well i.e.: we can have a page in the buffer pool in compressed only form or in a state where we have both the compressed page and uncompressed version but we'll never have a page in uncompressed only form. On-disk we'll always only have the compressed page.
  • When both compressed and uncompressed images are present in the buffer pool they are always kept in sync i.e.: changes are applied to both atomically.
  • Recompression happens when changes are made to the compressed data. In order to minimize recompressions InnoDB maintains a modification log within a compressed page. This is the extra space available in the page after compression and it is used to log modifications to the compressed data thus avoiding recompressions.
  • DELETE (and ROLLBACK of DELETE) and purge can be performed without recompressing the page. This is because the delete-mark bit and the system fields DB_TRX_ID and DB_ROLL_PTR are stored in uncompressed format on the compressed page. A record can be purged by shuffling entries in the compressed page directory. This can also be useful for updates of indexed columns, because UPDATE of a key is mapped to INSERT+DELETE+purge.
  • A compression failure happens when we attempt to recompress a page and it does not fit in the fixed size. In such case, we first try to reorganize the page and attempt to recompress and if that fails as well then we split the page into two and recompress both pages.

Now lets talk about the three major improvements that we made in MySQL 5.6.

Logging of Compressed Page Images:
InnoDB used to log entire compressed data on the page to the redo logs when recompression happens. This was an extra safety measure to guard against the rare case where an attempt is made to do recovery using a different zlib version from the one that was used before the crash. Because recovery is a page level operation in InnoDB we have to be sure that all recompress attempts must succeed without causing a btree page split. However, writing entire compressed data images to the redo log files not only makes the operation heavy duty but can also adversely affect flushing activity. This happens because redo space is used in a circular fashion and when we generate much more than normal redo we fill up the space much more quickly and in order to reuse the redo space we have to flush the corresponding dirty pages from the buffer pool.
Starting with MySQL 5.6 a new global configuration parameter innodb_log_compressed_pages. The default value is true which is same as the current behavior. If you are sure that you are not going to attempt to recover from a crash using a different version of zlib then you should set this parameter to false. This is a dynamic parameter.

Compression Level:
You can now set the compression level that zlib should choose to compress the data. The global parameter is innodb_compression_level - the default value is 6 (the zlib default) and allowed values are 1 to 9. Again the parameter is dynamic i.e.: you can change it on the fly.

Dynamic Padding to Reduce Compression Failures:
Compression failures are expensive in terms of CPU. We go through the hoops of recompress, failure, reorganize, recompress, failure and finally page split. At the same time, how often we encounter compression failure depends largely on the compressibility of the data. In MySQL 5.6, courtesy of Facebook engineers, we have an adaptive algorithm based on per-index statistics that we gather about compression operations. The idea is that if a certain index/table is experiencing too many compression failures then we should try to pack the 16K uncompressed version of the page less densely i.e.: we let some space in the 16K page go unused in an attempt that the recompression won't end up in a failure. In other words, we dynamically keep adding 'pad' to the 16K page till we get compression failures within an agreeable range. It works the other way as well, that is we'll keep removing the pad if failure rate is fairly low. To tune the padding effort two configuration variables are exposed.

  • innodb_compression_failure_threshold_pct: default 5, range 0 - 100,dynamic, implies the percentage of compress ops to fail before we start using to padding. Value 0 has a special meaning of disabling the padding.
  • innodb_compression_pad_pct_max: default 50, range 0 - 75, dynamic, the  maximum percentage of uncompressed data page that can be reserved as pad.

Performance Enhancement in Full-Text Search Query

Ever since its first release, we are continuing consolidating and developing InnoDB Full-Text Search feature. There is one recent improvement that worth blogging about. It is an effort with MySQL Optimizer team that simplifies some common queries Query Plans and dramatically shorted the query time. I will describe the issue, our solution and the end result by some performance numbers to demonstrate our efforts in continuing enhancement the Full-Text Search capability.

The Issue:

As we had discussed in previous Blogs, InnoDB implements Full-Text index as reversed auxiliary tables. The query once parsed will be reinterpreted into several queries into related auxiliary tables and then results are merged and consolidated to come up with the final result. So at the end of the query, well have all matching records on hand, sorted by their ranking or by their Doc IDs.

Unfortunately, MySQLs optimizer and query processing had been initially designed for MyISAM Full-Text index, and sometimes did not fully utilize the complete result package from InnoDB.

Here are a couple examples:

Case 1: Query result ordered by Rank with only top N results:

mysql> SELECT FTS_DOC_ID, MATCH (title, body) AGAINST ('database')


In this query, user tries to retrieve a single record with highest ranking. It should have a quick answer once we have all the matching documents on hand, especially if there are ranked. However, before this change, MySQL would almost retrieve rankings for almost every row in the table, sort them and them come with the top rank result. This whole retrieve and sort is quite unnecessary given the InnoDB already have the answer.

In a real life case, user could have millions of rows, so in the old scheme, it would retrieve millions of rows' ranking and sort them, even if our FTS already found there are two 3 matched rows. Apparently, the million ranking retrieve is done in vain. In above case, it should just ask for 3 matched rows' ranking, all other rows' ranking are 0. If it want the top ranking, then it can just get the first record from our already sorted result.

Case 2: Select Count(*) on matching records:

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body)


In this case, InnoDB search can find matching rows quickly and will have all matching rows. However, before our change, in the old scheme, every row in the table was requested by MySQL one by one, just to check whether its ranking is larger than 0, and later comes up a count.

In fact, there is no need for MySQL to fetch all rows, instead InnoDB already had all the matching records. The only thing need is to call an InnoDB API to retrieve the count

The difference can be huge. Following query output shows how big the difference can be:

mysql> select count(*) from searchindex_inno where match(si_title, si_text) against ('people') 

| count(*) |
| 666877 |
1 row in set (16 min 17.37 sec)

So the query took almost 16 minutes.

Lets see how long the InnoDB can come up the result. In InnoDB, you can obtain extra diagnostic printout by turning on innodb_ft_enable_diag_print, this will print out extra query info:

Error log:

keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 2 secs: row(s) 666877: error: 10
keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 3 secs: row(s) 666877: error: 10

Output shows it only took InnoDB only 3 seconds to get the result, while the whole query took 16 minutes to finish. So large amount of time has been wasted on the un-needed row fetching.

The Solution:

The solution is obvious. MySQL can skip some of its steps, optimize its plan and obtain useful information directly from InnoDB. Some of savings from doing this include:

1) Avoid redundant sorting. Since InnoDB already sorted the result according to ranking. MySQL Query Processing layer does not need to sort to get top matching results.

2) Avoid row by row fetching to get the matching count. InnoDB provides all the matching records. All those not in the result list should all have ranking of 0, and no need to be retrieved. And InnoDB has a count of total matching records on hand. No need to recount.

3) Covered index scan. InnoDB results always contains the matching records' Document ID and their ranking. So if only the Document ID and ranking is needed, there is no need to go to user table to fetch the record itself.

4) Narrow the search result early, reduce the user table access. If the user wants to get top N matching records, we do not need to fetch all matching records from user table. We should be able to first select TOP N matching DOC IDs, and then only fetch corresponding records with these Doc IDs.

Performance Results and comparison with MyISAM

The result by this change is very obvious. I includes six testing result performed by Alexander Rubin just to demonstrate how fast the InnoDB query now becomes when comparing MyISAM Full-Text Search.

These tests are base on the English Wikipedia data of 5.4 Million rows and approximately 16G table. The test was performed on a machine with 1 CPU Dual Core, SSD drive, 8G of RAM and InnoDB_buffer_pool is set to 8 GB.


mysql> SELECT si_title, match(si_title, si_text) against('family') as rel FROM si WHERE match(si_title, si_text) against('family') ORDER BY rel desc LIMIT 10;



Times Faster

Time for the query

1.63 sec

3 min 26.31 sec


You can see for this particular query (retrieve top 10 records), InnoDB Full-Text Search is now approximately 127 times faster than MyISAM.


mysql>select count(*) from si where match(si_title, si_text) against('family‘);

| count(*) |
| 293955 |



Times Faster

Time for the query

1.35 sec

28 min 59.59 sec


In this particular case, where there are 293k matching results, InnoDB took only 1.35 second to get all of them, while take MyISAM almost half an hour, that is about 1289 times faster!.

Table 3: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;


InnoDB (time to execute)

MyISAM(time to execute)

Times Faster


0.5 sec

5.05 sec


family film

0.95 sec

25.39 sec


Pizza restaurant orange county California

0.93 sec

32.03 sec


President united states of America

2.5 sec

36.98 sec


Table 4: SELECT title and text with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, si_title, si_text, ... as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;


InnoDB (time to execute)

MyISAM(time to execute)

Times Faster


0.61 sec

41.65 sec


family film

1.15 sec

47.17 sec


Pizza restaurant orange county california

1.03 sec

48.2 sec


President united states of america

2.49 sec

44.61 sec


Table 5: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel  FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;


InnoDB (time to execute)

MyISAM(time to execute)

Times Faster


0.5 sec

5.05 sec


family film

0.95 sec

25.39 sec


Pizza restaurant orange county califormia

0.93 sec

32.03 sec


President united states of america

2.5 sec

36.98 sec


Table 6: SELECT COUNT(*)

mysql> SELECT count(*) FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) LIMIT 10;


InnoDB (time to execute)

MyISAM(time to execute)

Times Faster


0.47 sec

82 sec


family film

0.83 sec

131 sec


Pizza restaurant orange county califormia

0.74 sec

106 sec


President united states of america

1.96 sec

220 sec


 Again, table 3 to table 6 all showing InnoDB consistently outperform MyISAM in these queries by a large margin. It becomes obvious the InnoDB has great advantage over MyISAM in handling large data search.


These results demonstrate the great performance we could achieve by making MySQL optimizer and InnoDB Full-Text Search more tightly coupled. I think there are still many cases that InnoDB’s result info have not been fully taken advantage of, which means we still have great room to improve. And we will continuously explore the area, and get more dramatic results for InnoDB full-text searches.

Jimmy Yang, September 29, 2012

Saturday Sep 15, 2012

The New "Transactions on InnoDB"

New home of the InnoDB team blog.
[Read More]

This is the InnoDB team blog.


« November 2016