Friday Jan 10, 2014

Transaction life cycle improvements in 5.7.3

This is part of the ongoing work on improving the transaction life cycle management. In 5.7.2 we split the transaction list into two. The read-only transaction list and the read-write transaction list. There was another "virtual" list, the auto-commit non-locking  read-only (AC-NL-RO) transaction list. The change in 5.7.2 was that by default a transaction was treated as read only and added to the read-only transaction list. Only when it was determined that the transaction was going to do an update we removed the transaction from the read-only list and moved it to the read-write transaction list. This initial add to the the read-only list forced the acquisition of the trx_sys_t::mutex. Acquiring the mutex during transaction start/begin has a cost. Promoting a transaction from read-only to read-write we had to acquire the trx_sys_t::mutex to add to the read-write transaction list and so that is not too expensive and unavoidable. There is another transaction list for caching user transactions that we will ignore in this discussion (the mysql-trx-list) it is per connection. All user transactions both AC-NL-RO, plain read-only and read-write user transactions are on this list.

The optimization

The optimization in 5.7.3 is to eliminate the explicit read-only transaction list altogether. By default transactions are not put on any list unless they are explicitly tagged as read-write then they are added to the read-write transaction list. This makes the transaction begin mutex free. Additionally if the transaction commits as a read-only transaction then we don't need to acquire the trx_sys_t::mutex at commit time to remove from the read-only list either. This improves performance for read-only transactions, making them more or less equivalent to AC-NL-RO transactions. They will however incur some cost compared to AC-NL-RO transactions if they acquire shared locks.

Additionally,  the MVCC view open/close handling of read-only transactions is now equivalent to that of AC-NL-RO transactions, very low life cycle overhead. This will also help in improving performance significantly.

User impact

There is of course the positive performance impact but there is also a change in visibility semantics. With the elimination of the read-only list the read-only transactions are now no longer visible via SHOW ENGINE INNODB STATUS; they are however visible via the INFORMATION SCHEMA.

Thursday Dec 12, 2013

MySQL 5.7.3: Deep dive into 1mil QPS with InnoDB & Memcached

As you probably already know, in MySQL 5.7.3 release, InnoDB Memcached reached a record of over 1 million QPS on a read only load. The overview of the benchmark and testing results can be seen in an earlier blog by Dimitri. In this blog, I will spend sometime on the detail changes we have made to achieve this number.

First thanks to Facebook's Yoshinori with his bug#70172 that brought our attention to this single commit read only load test. We have been focussing on operation with large batch size. This bug prompted us to do a series of optimization on single commit read only queries and these optimizations eliminate almost all major bottlenecks from the InnoDB Memcached plugin itself.

If you are just getting familiar with InnoDB Memcached, there are some earlier blog on the topics to get you started. In a short word, InnoDB Memcached allows a fast path to retrieve key value data stored in the InnoDB table, with Memcached protocol.

The Benchmark:

Now, Let's discuss the testing scenario. The InnoDB Memcached plugin configurations are all by default in this benchmark, which means, the daemon_memcached_r_batch_size was also set to be 1, and the read operation would do a begin and commit transaction for each query. It is equivalent to auto-commit single selects through SQL interface. The innodb_api_trx_level is by default set to 0 (read uncommitted), however, changing it to 2 (repeatable read) gave the same benchmark result.

Another good news in 5.7.3 is that we start to support integer key column mapping, as it is common to use integer as primary key for a table. And the table used in this benchmark comes with integer as the key column. The mapping table contains only key and value columns. So we set the corresponding `flags`, `cas_column` and `expire_time_column` column in the config containers table all to NULL, this avoids overhead columns to support Memcached "extra" options. The table itself containers 1 million rows, each with a short integer key and a text value.

Here is the detail table definition

mysql> desc test.memc_test;
















To make InnoDB Memcached recognize this InnoDB table, insert following row into innodb_memcache/containers table


("memc1", "test", "memc_test", "id", "value3", null, null, null, "PRIMARY");

The memcached client for inserting rows and querying is a simple libmemcached program provided by Yoshinori.Each query would do a key lookup and fetches corresponding value.

We made some adjustment so that there are multiple client processes, each with multiple sessions. This was used to alleviate bottlenecks in the client itself.

As a note, there are many memcached clients out there, and Memcached clients can play  important roles in the performance result itself. For example, we observed at least 3 times difference on result with Perl client Cache::Memcached::Fast when comparing to its slower version Cache::Memcached. And as far as we can see, libmemcached is one of the most efficient clients available, even though eventually it becomes bottleneck itself as the test progresses, especially requests through the network.

The test result can be seen in Dimitri's blog, so I will not repeat them here. The summary is that we got close to 1.2 million QPS at the best. The next bottleneck now seems lying at the adaptive hash index's global latch - "btr_search_latch". The libmemcached client overhead is also significant.

Read Only QPS through InnoDB Memcached

The improvement:

There are several changes in both InnoDB Memcached code and Memcached Native code to achieve the record benchmarks.

1. The first is of course to address the issue brought by bug #70712. With daemon_memcached_r_batch_size set to 1, the transaction is being repeatedly started and committed for each query. It is better to cache the trx object itself, to avoid repeated create and destroy the trx object. Otherwise, the "trx_sys mutex" will kill the concurrency.

After the change, the trx object is cached with private memcached connection data. Each connection gets its own trx object, and it is used to handle transactions through this particular connection.

2. The next thing we did is to take advantage of the read only optimization recently made in the InnoDB code. This scenario (single read trx) is perfect to use the optimization. Whenever the read batch size is set to 1, InnoDB Memcached will treat incoming queries as auto-commit read only query. It will automatically hook up to the "fast path" of read-only operation in InnoDB.

3. After these two transaction related changes, we found the bottleneck comes from Memcached native Code itself. As a note, we embedded the Memcached code itself in our InnoDB Memcached plugin, so any bottleneck in Memcached will affect us.

The original Memcached memory allocation is protected by a global Cache Lock (engine->cache_lock), and it quickly rises in prominence in the profiling result.

Even though the data is stored in InnoDB, we happened to still use some of Memcached's own memory allocation to store and deliver the result back to the front end. To fix this bottleneck, we stopped using Memcached Memory altogether. Instead a connection private memory buffer is used to store and deliver the result. This also saves a memcpy as we move the data to memcached memory as before.

This change makes InnoDB Memcached plugin as thin as possible, and only relies on the InnoDB buffer pool and Adaptive Hash Index (AHI) as the backing store for the data. This provides better scaling and memory handling than Memcached itself.

4. Another bottleneck in Memcached is its statistics mutex ("thread_stats->mutex"). This also becomes significant as testing goes. So to remove it, we switched to using atomic operations whenever the platform supports (most modern platforms do). With these changes, we can now well scale the plugin to over 100 connections without degradation as the number of connections are ramped up.

5. In addition to removing those major bottlenecks, we also streamline the code to remove some overhead work. For example, we start to cached the "search tuple", so that there is no need to allocate the search tuple for each query. This is to keep the InnoDB Memcached as lean as possible.

With these changes, we have eliminated all the major InnoDB Memcached Plugin bottlenecks. The bottlenecks now comes from clients themselves and to a lesser degree from the Adaptive Hash Index search latch.

Future work:

Now the Memcached read goes more than twice as fast as those from SQL end. By using the InnoDB buffer pool as the in-memory store, and with InnoDB AHI, InnoDB Memcached can probably provide an efficient and more scalable store than Memcached itself.

There is still more to be done.

1. We will continue to remove some bottlenecks in InnoDB (such as btr_search_latch), as well as make InnoDB memcached leaner/faster.

2. We will add support to "mgets" command, which allows Memcached to fetch multiple results (corresponding to multiple keys) in one query attempts. This would again give us another big jump in terms of QPS.

3. We will start to focus more on insertion/updates operations.

4. We are considering extending the functionality of the memcached interface to support range queries etc. So to make it a more versatile key value store.

In summary, with these enhancements, the InnoDB Memcached becomes more and more attractive as as quick  key value store through the MySQL server.

Your feedback and comments are important to us as we evolve and improve this plugin. 

Wednesday Sep 25, 2013

InnoDB Temporary Tables just got faster

It all started with a goal to make InnoDB temporary tables more effective. Temporary table semantics are blessed with some important characteristics that can help us simplify lot of operations.

  • Temporary tables are not visible across connections
  • Temporary tables lifetime is limited to connection lifetime (unless user explicitly drops it).

What does this means in to InnoDB ?

  • REDO logging can be avoided for temporary tables and related objects since temporary tables do not survive a shutdown or crash.
  • Temporary table definitions can be maintained in-memory without persisting to the disk.
  • Locking constraints can be relaxed since only one client can see these tables.
  • Change buffering can be avoided since the majority of temporary tables are short-lived.

In order to implement these changes in InnoDB we took a bit different approach:

  • We introduced a dedicated tablespace for housing temporary tables. In 5.7, all non-compressed temporary tables (irrespective of innodb-file-per-table setting) are housed in this new temporary tablespace (ibtmp1). Compressed temporary tables continue to get housed in their independent tablespace. This new temporary tablespace is re-created on each server restart and can be pre-extended before starting server. (For more about temporary tablespace check MySQL documentation).
  • Dedicated rollback-segments (for housing undo logs) for temporary table objects.
  • All related objects including rollback-segments have been moved to this new temporary tablespace. This helps in maintaining locality and importantly eliminating REDO logging associated with all such objects.

What have we achieved ? (Performance Gain w.r.t to 5.6)

  • Create/Drop workload: (around 11x performance gain)
    create n tables, drop n tables, create+drop n tables, create + drop n tables with significant number of keys.
  • Insert/Update/Delete workload: (2-4x performance gain)
    • insert workload: insert n entries: sorted, n entries: reverse sorted, n entries: random (total 3n entries)
    • delete workload: delete n entries: first initiate insert workload to load table, delete n entries using primary key, delete n entries using secondary index, delete n entries using primary key such that complete table is empty.
    • update workload: update n entries: first initiate insert workload to load table, update n entries using primary key, update n entries using secondary index, update all 3n entries 2 times. (no explicit key specified).

All testing done using sql-bench (sql-bench modified to accommodate temporary table use-case).

Summing it up....

The benchmarks above show that InnoDB temporary tables are faster in 5.7 (than in 5.6). We plan to further enhance temporary table performance.

Saturday Sep 21, 2013

InnoDB 5.7 performance improvements

A quick overview of the InnoDB performance improvements for both read-only and read-write loads.
[Read More]

Friday Sep 13, 2013

InnoDB Redundant Row Format


This article describes the InnoDB redundant row format. If you are new to InnoDB code base (a new developer starting to work with InnoDB), then this article is for you. I'll explain the row format by making use of a gdb session. An overview of the article is given below:

  • Create a simple table and populate few rows.
  • Access the page that contains the rows inserted.
  • Access a couple of rows and explain its format.
  • Give summary of redundant row format.
  • Useful gdb commands to analyse the InnoDB rows.
  • Look at a GNU Emacs Lisp function to traverse rows in an InnoDB index page.

To get the most out of this article, the reader is expected to repeat the gdb session as described here.

The Schema

Consider the following SQL statements to produce the schema:

CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;
INSERT INTO t1 VALUES (1), (2), (3), (4), (5);

When the above CREATE TABLE is executed, InnoDB internally creates a clustered B-tree index (or just clustered index) representing the table. If no primary key is explicitly given, internally a primary key is added. Each node of B-tree is represented by one page (which is by default is of size 16KB). In the given scenario, we insert only 5 records. So the B-tree will consist of only 1 node, the root node.

Access InnoDB Index Page in gdb

All select queries will go through the row_search_for_mysql() function of InnoDB. When the first time it is called for a query, a persistent cursor (btr_pcur_t) will be opened by this function. When this B-tree persistent cursor (pcur in the following gdb) points to the InnoDB index page containing the rows of our interest, we can save the page frame to a file as follows:

(gdb) set $frame = pcur->btr_cur->page_cur->block->frame
(gdb) dump binary memory /tmp/gdb.anna $frame $frame + srv_page_size
(gdb) p srv_page_size
$10 = 16384

If you do not understand the commands above, refer to the gdb manual for more information. Saving a page in binary format in a file on disk helps us to analyse the contents of the page by loading it in a text editor like GNU Emacs or ViM. This is sometimes a very useful approach for debugging. The above commands were executed when I was on the sync point row_search_rec_loop, in the function row_search_for_mysql(). The line numbers are from mysql 5.7.3.

4155         } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
4156                 btr_pcur_open_at_index_side(
4157                         mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
4158                         pcur, false, 0, &mtr);
4159         }
4161 rec_loop:
4162         DEBUG_SYNC_C("row_search_rec_loop");
4163         if (trx_is_interrupted(trx)) {
4164                 btr_pcur_store_position(pcur, &mtr);
4165                 err = DB_INTERRUPTED;
4166                 goto normal_return;
4167         }

An important point to remember when analysing the page is that all the data in the page are stored in big-endian format.

Open Page in GNU Emacs

The following image shows the InnoDB page opened in GNU Emacs in hexl-mode major mode. I have placed the cursor on the origin of the infimum record (at offset 0x65 from page origin). In the right side, you can look at the text "infimum" and "supremum" as stored in the page. These are special records.

All rows in an InnoDB page are linked to one another in ascending order via the next record pointer. In the given example this ordering is based on the internally generated DB_ROW_ID. In each page, the first record (or the minimum record) is the infimum record, and the last record (or the maximum record) is the supremum record. The records between the infimum and supremum records are called the user records.

The next record pointer of each record points to the "origin" of the next record. The record contains data to the right of the origin and meta data to the left of the origin. The next record pointer is available in the 2 bytes before the record origin.

The next record pointer of infimum as given in the picture above is 0x0087 (which is 135 in decimal). This is the offset of the next record from the beginning of the page.

Printing the infimum record

As mentioned previously, infimum is a special minimum record in an index page. It exists at a fixed offset of 101 bytes from the page beginning. The function page_get_infimum_rec() can be used to get the infimum record given a page frame pointer. All records in InnoDB are pointed to at their record origin. So to access the meta data in the record, we need to subtract offset from the record origin.

Printing the infimum record given a page frame pointer (in char):

(gdb) p  pcur->btr_cur->page_cur->block->frame+101
$41 = (unsigned char *) 0x7fffd1d3c065 "infimum"

Printing the infimum record given a page frame pointer (in hex):

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101)@7
$45 = {0x69, 0x6e, 0x66, 0x69, 0x6d, 0x75, 0x6d}

Printing the 6 byte Informational Bit Fields

Given the origin of the infimum record, we can print the 6 bytes of informational bits as follows (in hex).

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101-6)@6
$50 = {0x1, 0x0, 0x0, 0x3, 0x0, 0x87}

Printing the 6 bytes of informational bits of the infimum record (in binary). In this output, the right most byte displayed is the first byte before the record origin.

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+101-6
0x7fffd1d3c05f: 00000001  00000000   00000000  00000011  00000000  10000111

The above 6 bytes bit fields of the record is formatted as follows: (it also contains the actual values as seen above)

4 bits
info bits
byte: 6
4 bits
byte: 6
13 bits
heap number
byte: 4,5
00000000 00000
10 bits
field count
byte: 3,4
000 0000001
1 bit
byte: 3
2 bytes
next record offset
byte: 1, 2
00000000 10000111

Each of the box above represents one bit field. It contains the following information:

The size of bit field in bits or bytes
A short description of the bit field.
The byte where this bit field exists.
The actual value of this bit field as seen in above gdb output

The 1 bit flag above is used to tell whether the offset array element is 1 byte (if set) or 2 bytes (if not set). In our example case, the 1 bit flag is set and hence the offset array elements are each 1 byte long.

From the field count bit field, we can see that the infimum record has only one field, whose value is fixed to "infimum". The next record pointer of the infimum record points to the first user record in the page. This is available at the offset 0x0087, which is 135 in decimal. So the first user record is available at page frame pointer + 135 bytes.

The Bit Fields

The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of bit field data is listed here. These six bytes are available immediately before the record origin. These bit fields must be accessed by making use of the interface provided by the record manager (refer to rem0rec.h).

The Bit Field Size of Bit Field Description of the Bit Field
info bits 4 bits Used to delete mark a record
n_owned 4 bits Number of records owned by this record. This is related to page directory. To understand this field, we need to understand the page structure used by InnoDB. This is not explained further in this article.
heap number 13 bits Gives the order number of this record in the page. Each row has a unique heap number in the page. The lock manager uses this heap number.
field count 10 bits Number of fields in this record.
flag 1 bit Tells if the offset array element is 1 or 2 bytes.
next record offset 2 bytes Offset of the next record from page origin.

Printing the First User Record

To first parse the record we need to know the number of records in the row. This can be obtained from the 6 bytes of bit-field information available before the record origin. Let us first see this for the user record:

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+135-6
0x7fffd1d3c081: 00000000  00000000  00010000  00001001  00000000  10101000

4 bits
info bits
byte: 6
4 bits
byte: 6
13 bits
heap number
byte: 4,5
00000000 00010
10 bits
field count
byte: 3,4
000 0000100
1 bit
byte: 3
2 bytes
next record offset
byte: 1, 2
00000000 10101000

The above information shows that this record has 4 fields (000 0000100 in binary). But our table definition was

CREATE TABLE t1 (f1 unsigned int) row_format=redundant engine=innodb;

If you are wondering why a row of this table has 4 field count, it is because InnoDB added 3 internal fields to the table t1 - a row identifier (DB_ROW_ID), transaction identifier (DB_TRX_ID) and a undo roll pointer (DB_ROLL_PTR). A row identifier column is internally added to a table if no primary key is explicitly given by the user. This row identifier becomes the internal primary key. The transaction identifier specifies the transaction that has last updated the row. The undo roll pointer points to the previous versions of the record. The 4th field is the actual column f1 added by the user.

The list of fields in the table t1 can be printed by accessing its associated dict_table_t object (or its clustered index dict_index_t object). In the following, the name GEN_CLUST_INDEX is the name of the clustered index of the table t1. This is the standard name of the clustered index if it was internally generated by InnoDB.

(gdb) p prebuilt->index->name       
$75 = 0x7fffa00232e8 "GEN_CLUST_INDEX"
(gdb) p prebuilt->index->table      
$73 = (dict_table_t *) 0x7fffa001fe68
(gdb) p prebuilt->index->table->name
$74 = 0x7fffa000cef0 "test/t1"
(gdb) set print array on
(gdb) p *prebuilt->index->fields@4
$67 =   {{col = 0x7fffa0020154, name = 0x7fffa00201c3 "DB_ROW_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa0020160, name = 0x7fffa00201cd "DB_TRX_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa002016c, name = 0x7fffa00201d7 "DB_ROLL_PTR", prefix_len = 0, 
    fixed_len = 7},
  {col = 0x7fffa0020148, name = 0x7fffa00201c0 "f1", prefix_len = 0, fixed_len = 4}}

Printing the Record Offsets (or Offset Array)

The redundant record contains an array of offsets of each field of the row. This array is available after the 6-byte bit fields (down from the record origin). The size of each element of the array is either 1 byte or 2 byte depending on the 1 bit flag in the 6 byte bit fields. In our current case each offset element is 1 byte long.

(gdb) p /d  *(pcur->btr_cur->page_cur->block->frame+135-6-4)@4
$79 = {23, 19, 12, 6}

The first field (DB_ROW_ID) starts at offset 0 from record origin. The second field (DB_TRX_ID) starts at offset 6 from record origin. The third field (DB_ROLL_PTR) starts at offset 12 bytes from record origin. The fourth field (f1) starts at offset 19 bytes from record origin. The end of the record is 23 bytes from the record origin. So the total size of this record can be calculated as follows:

S - size of offset array element (1 byte) 
N - number of fields in the row (4 fields)
I - size of bit fields (6 bytes)
D - Length of row after record origin (23 as given in offset array)
R - Total Size of the row 

R = (S*N) + I + D = 4 + 6 + 23 = 33 bytes

The total size of the row is 33 bytes. Since this table contains only fixed length fields, the size of each row in this table will be the same.

Printing the DB_ROW_ID

The DB_ROW_ID is 6 bytes long. It starts at the record origin itself. It can be printed as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135)@6
$82 = {0x0, 0x0, 0x0, 0x0, 0x2, 0x0}

Printing the DB_TRX_ID

The second field DB_TRX_ID is also 6 bytes long. It starts at an offset of 6 bytes from the record origin. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+6)@6
$84 = {0x0, 0x0, 0x0, 0x0, 0x9, 0x8}

Printing the DB_ROLL_PTR

The third field DB_ROLL_PTR is 7 bytes long. It points to the previous versions of a record available in the undo segments. It can be examined as follows:

(gdb)  p /x *(pcur->btr_cur->page_cur->block->frame+135+12)@7
$3 = {0xa8, 0x0, 0x0, 0x1, 0x1a, 0x1, 0x10}

Printing the User Data Field

The fourth field "f1" is the user defined field. It is 4 bytes long. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+19)@4
$5 = {0x0, 0x0, 0x0, 0x1}
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$8 = 1

Thus, we have completely accessed the first user record. Given the first user record, we can find out the location of the next user record (the first 2 bytes before the record origin). If we keep accessing the next user record, we will eventually reach the supremum record, which is the last (and considered largest or maximum) record in a page.

Printing All Rows

Just to improve our confidence in accessing the rows, here I have printed all the rows that we have inserted into the table. The first record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which we have inserted) and then the last record supremum.

(gdb) p pcur->btr_cur->page_cur->block->frame+101
$13 = (unsigned char *) 0x7fffd1e48065 "infimum"
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$14 = 1
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+168+19))
$15 = 2
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+201+19))
$16 = 3
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+234+19))
$17 = 4
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+267+19))
$18 = 5
(gdb) p pcur->btr_cur->page_cur->block->frame+116
$19 = (unsigned char *) 0x7fffd1e48074 "supremum"

The Complete Redundant Row Format

One redundant record can be logically separated into 5 parts - the offsets array, the 6-byte bit field, record origin, hidden fields (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified fields of the table. They can be viewed as follows:

offsets array
fn ... f1, x2, x1, pn ... p1
6-byte bit field
primary key
or user given
p1, p2 ... pn
hidden fields
x1, x2
user fields
f1, f2, f3 ... fn

All record pointers point to the record origin. The record origin is the beginning of the primary key for the clustered primary index. It is not a separate entity that occupies some space. I am showing it separately to make it easy to visualize. All information prior to the record origin can be considered as the record header or the record meta-data.

Each data field has a corresponding offsets array element. The offsets array is arranged in reverse order. The first element in the array provides the offset of the end of the last data field and the last element in the array provides the offset of the end of the first data element.

The offsets array also tells whether a data field is NULL. If a field is NULL, the highest bit in the associated offset array element will be set to 1.

The biggest advantage of the redundant row format is that a record in this format can be parsed without needing to access the data dictionary. But this format is not space efficient. Even for fixed length fields, we store information in the offsets array (this is avoided in COMPACT, DYNAMIC and COMPRESSED row formats).

One more limitation of the redundant row format is that the blob data cannot be stored completely off page (as in DYNAMIC and COMPRESSED row formats). A 768 byte prefix of the blob will always be stored inline with the row and cannot be avoided.

A small note on row formats. InnoDB really supports only two row formats - the redundant row format and the compact row format. The compressed and dynamic row formats are a variation available only on top of the compact row format. Technically these variations can be implemented for the redundant row format as well, but is currently not available.

GNU Emacs Function to Traverse Rows in Page

If we dump a single InnoDB index page into a file, we can open it in GNU Emacs hexl-mode and use the following elisp function to traverse from infimum record to the supremum record. First place the cursor on the 'i' of the infimum record and then call this function to reach the next record origin. I am not an expert in writing elisp functions. If you can improve on this and write a better one, kindly share it with me.

(defun ib-next-red-row ()
  "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.
The cursor must be positioned at the origin of a record. 
This is for the GNU Emacs hexl-mode. "
  (setq rec-origin (point))
  (setq rec-origin-offset (hexl-current-address))

  ;; The next-rec pointer is stored in 2 bytes before the
  ;; record origin
  (hexl-backward-char 2)
  (setq next-rec (string (char-after)))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (hexl-forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))

  ;; Next record is next-rec-off bytes away from page origin
  (setq next-rec-off (hexl-hex-string-to-integer next-rec))

  ;; Assuming that there is only one page in dump
  (hexl-goto-address 0)
  (hexl-forward-char next-rec-off))


In this article, the format of the redundant row was explained. A gdb session was used to demonstrate how to access a page and the rows within that page. The infimum record was accessed, printed and analysed. Using its next record pointer, the first user record was accessed. The reader can use the next record pointer of the first user record to access the next user record and so on till the supremum record is reached. I leave that as an exercise for the readers.

Thanks to Marko Makela for his useful tips on debugging at our yearly InnoDB Team Meeting 2013 held at Shanghai, China. This article derives from his ideas.

Send in your feedback to me.

Monday Jul 15, 2013

Redo Logging in InnoDB


InnoDB is a general-purpose storage engine that balances high reliability and high performance. It is a transactional storage engine and is fully ACID compliant, as would be expected from any relational database. The durability guarantee provided by InnoDB is made possible by the redo logs.

This article will provide an overview of the redo log subsystem or log subsystem of InnoDB. We will look at the following details:

  • The global log system object, which provides access to important data structures and information.
  • The mini-transaction (mtr), using which all redo log records are created.
  • The global in-memory log buffer (or just log buffer), into which the redo logs are written to from the mini transaction buffer. This log buffer will be periodically flushed to the log file on disk.
  • The redo log files on the disk and their high level internal structure.
  • We will discuss about the concept of LSN, and how the various values of LSN are used to implement the write-ahead logging (WAL).

Redo Log Generation

In the article, Data Organization in InnoDB, we already saw that the user data files of InnoDB (.ibd files) are considered to be a sequence of equal sized pages. These pages are uniquely identified within the InnoDB system using the space_id, page_no combination. If we want to read or modify any page, it needs to be loaded into memory. So there are two copies of the pages - on disk and in memory. Here are the high level steps in redo log generation.

  1. Any changes to a page is first done to the in-memory copy of the page. The page that is modified in memory and not yet flushed to disk is marked as the dirty page in memory.
  2. An associated redo log is generated in memory, in the local mini transaction (mtr) buffer. This will then be transferred to the global in-memory redo log buffer.
  3. The redo log record is written from the redo log buffer in memory to the redo log file on disk. This is subsequently flushed. These two steps are considered separate - writing to the redo log file and flushing the redo log file to the disk. This is to account for the file buffering done by the operating system.
  4. The dirty page is flushed from memory to disk at some later point of time as part of the checkpointing operation.

The order of these steps are important. The redo log record of a change must be flushed to disk before flushing the corresponding dirty page to the disk. This is the concept of write-ahead logging (WAL).

The generated redo log record will contain information necessary to repeat the same operation later during a database recovery. So the redo log record will contain information about the original set of pages and what has been changed in them. Using the redo log record the set of dirty pages will be re-created during database recovery.

Redo Log Files

By default, InnoDB creates two redo log files (or just log files) ib_logfile0 and ib_logfile1 within the data directory of MySQL. In MySQL versions 5.6.8 and above, the default size of each redo log file is 48MB each. This can be configured by the user by making use of innodb_log_file_size server option. The number of log files is controlled by innodb_log_files_in_group server option.

A log group consists of a number of log files, each of same size. As of MySQL 5.6, InnoDB supports only one log group. So I'll not discuss this further.

The redo log files are used in a circular fashion. This means that the redo logs are written from the beginning to end of first redo log file, then it is continued to be written into the next log file, and so on till it reaches the last redo log file. Once the last redo log file has been written, then redo logs are again written from the first redo log file.

The log files are viewed as a sequence of blocks called "log blocks" whose size is given by OS_FILE_LOG_BLOCK_SIZE which is equal to 512 bytes. Each log file has a header whose size is given by LOG_FILE_HDR_SIZE, which is defined as 4*OS_FILE_LOG_BLOCK_SIZE.

Redo Log File Header

Each redo log file contains a header occupying four log blocks with the following information:

  • The first 4 bytes contain the log group number to which the log file belongs.
  • The next 8 bytes contain the lsn of the start of data in this log file.
  • First checkpoint field located in the beginning of the second log block.
  • Second checkpoint field located in the beginning of the fourth log block.

The checkpoint field mentioned above contains information necessary to identify a checkpoint - a checkpoint number, the LSN of the checkpoint, checksum information and more.

Log Blocks

A redo log file can be viewed as a sequence of log blocks. All log blocks, except the ones belonging to the log file header, contain a header and a footer. The header is of size LOG_BLOCK_HDR_SIZE bytes (which is defined as 12 bytes). The log block trailer is of size LOG_BLOCK_TRL_SIZE (4 bytes). The log block header contains the following information:

  • The log block number. This field is of 4 bytes.
  • Number of bytes of log written to this block. This field is of 2 bytes.
  • Offset of the first start of an mtr log record group in this log block or 0 if none
  • The checkpoint number to which this log block belongs

The log block trailer contains checksum of the log block contents.

Log Sequence Number (LSN)

The log sequence number (LSN) is an important concept. The LSN is an offset into the redo log file. Within InnoDB the log sequence number is represented by the type lsn_t, which is an 8-byte unsigned integer. There are different LSN values that are of interest. The following table lists some of the LSN values discussed in this article. (Note: The log_sys is the global log system object which is explained in next section.)

LSN Description
log_sys->lsn The redo log record that will be generated next will make use of this lsn
log_sys->flushed_to_disk_lsn The redo log file is flushed upto this LSN. All redo log records whose LSN is < flushed_to_disk_lsn is safely on the disk.
log_sys->write_lsn There is a currently running write operation which will write upto this LSN.
log_sys->current_flush_lsn There is a currently running write + flush operation which will write upto this LSN.

The LSN is what that links the dirty page, the redo log record, and the redo log file. Each redo log record when it is copied to the in-memory log buffer, it gets an associated LSN. When each database page is modified, redo log records are generated. So each database page is also associated to an LSN. The page lsn is stored in a header for each page. The page lsn gives the lsn upto which the redo log file must be flushed before flushing the page.

Global Log System Object

The global log system object log_sys (of type log_t) holds important information related to log subsystem of InnoDB. Here only those information related to redo log buffer and redo log files are discussed. The global log_sys identifies the "active area" of the log buffer, which contains redo logs that are yet to be safely written on the disk. It also identifies the area in the redo log file, into which the active area of the log buffer will be written/flushed.

/** Redo log buffer */
struct log_t{
	lsn_t		lsn;		/*!< log sequence number */
	ulint		buf_free;	/*!< first free offset within the log
					buffer */
	byte*		buf_ptr;	/* unaligned log buffer */
	byte*		buf;		/*!< log buffer */
	ulint		buf_size;	/*!< log buffer size in bytes */
	ulint		max_buf_free;	/*!< recommended maximum value of
					buf_free, after which the buffer is
					flushed */

	ulint		buf_next_to_write;/*!< first offset in the log buffer
					where the byte content may not exist
					written to file, e.g., the start
					offset of a log record catenated
					later; this is advanced when a flush
					operation is completed to all the log
					groups */
	lsn_t		write_lsn;	/*!< end lsn for the current running
					write */
	ulint		write_end_offset;/*!< the data in buffer has
					been written up to this offset
					when the current write ends:
					this field will then be copied
					to buf_next_to_write */
	lsn_t		current_flush_lsn;/*!< end lsn for the current running
					write + flush operation */
	lsn_t		flushed_to_disk_lsn;
					/*!< how far we have written the log
					AND flushed to disk */

  • The member log_sys->buf points to the in-memory redo log buffer. This is the buffer into which an mtr_commit() writes the redo log records. The size of this buffer is given by log_sys->buf_size in bytes.
  • The member log_sys->buf_free is the offset within the in-memory redo log buffer, where the next redo log record will be written. This is the end offset for the next redo log flush to disk.
  • The member log_sys->buf_next_to_write is the offset from where the redo log records are not yet written to the redo log file. When the redo log buffer is flushed to disk next time, it will flush from this location. This is the start offset for the next redo log flush to disk.
  • The member log_sys->flushed_to_disk_lsn marks the lsn upto which we have written to the log file on the disk and flushed it. So upto this lsn, the redo logs are safely on the disk. This will always be less than or equal to log_sys->write_lsn and log_sys->lsn.
  • The member log_sys->lsn represents the current lsn. This member will be updated whenever we do mtr_commit(). The function mtr_commit() writes the bunch of redo log records generated in the mini transaction to the global or system wide redo log buffer in memory. This will always be greater than or equal to log_sys->flushed_to_disk_lsn and log_sys->write_lsn. This will be the LSN of the redo log record written at log_sys->buf_free.
  • The member log_sys->write_lsn represents the end lsn of a currently running redo log buffer write operation. This will be greater than or equal to log_sys->flushed_to_disk_lsn and less than or equal to log_sys->lsn.
  • The member log_sys->current_flush_lsn represents the end lsn of a currently running redo log buffer write + flush operation. This will be mostly equal to log_sys->write_lsn.

The global log_sys object points to various positions in the in-memory redo log buffer and on-disk redo log files. The following picture shows the locations pointed to by the global log_sys object. The picture clearly shows that the redo log buffer maps to a specific portion of the redo log file.

Global In-memory Redo Log Buffer

The in-memory redo log buffer is global and all redo logs generated by user transactions will be written into this buffer. The size of this buffer is configurable and is given by the innodb_log_buffer_size. The default size of this redo log buffer is 8MB.

When a transaction is running and if it is modifying the database, then it will generate redo logs and populate this log buffer. This log buffer will be written or flushed to the log file, either when the transaction commits, or when the log buffer gets full.

When the redo log buffer is full, and there is not enough space for an mtr_commit(), which will transfer a group of redo log records to log buffer, then a synchronous flush of the log buffer is done to the redo log file by using the function log_buffer_flush_to_disk(). This function will write the log buffer from log_sys->buf_next_to_write to log_sys->buf_free. In terms of LSN, the function log_buffer_flush_to_disk() flushes the redo logs from log_sys->flushed_to_disk_lsn to log_sys->lsn.

The mini transaction (mtr)

A mini transaction (mtr) must be used to generate all the redo log records. A mini transaction contains a local buffer (called the mini transaction buffer) into which the generated redo log records will be stored. If we need to generate a group of redo log records such that either all make it to the redo log file or none makes it, then we need to put them in a single mini transaction. Apart from the redo log records, the mini transaction also maintains a list of pages that has been modified (dirty pages).

The normal usage of a mini transaction is as follows:

  • Create a mini transaction object of type mtr_t
  • Start the mini transaction with mtr_start() function. This will initialize the mini transaction buffer.
  • Generate the redo log records by making use of mlog_write_ulint() family of functions.
  • Commit the mini transaction with mtr_commit() function. This will transfer the redo log records from mini transaction buffer to the global redo log buffer. The list of dirty pages is added to the buffer pool flush list.

The definition of a mini transaction is given here for your reference. The member mtr_t::log contains the mini transaction buffer which will hold the redo log records, and the member mtr_t::memo contains a list of pages dirtied by this mini transaction.

/* Mini-transaction handle and buffer */
struct mtr_t{
	dyn_array_t	memo;	/*!< memo stack for locks etc. */
	dyn_array_t	log;	/*!< mini-transaction log */
	unsigned	inside_ibuf:1;
				/*!< TRUE if inside ibuf changes */
	unsigned	modifications:1;
				/*!< TRUE if the mini-transaction
				modified buffer pool pages */
	unsigned	made_dirty:1;
				/*!< TRUE if mtr has made at least
				one buffer pool page dirty */
	ulint		n_log_recs;
				/* count of how many page initial log records
				have been written to the mtr log */
	ulint		n_freed_pages;
				/* number of pages that have been freed in
				this mini-transaction */
	ulint		log_mode; /* specifies which operations should be
				logged; default value MTR_LOG_ALL */
	lsn_t		start_lsn;/* start lsn of the possible log entry for
				this mtr */
	lsn_t		end_lsn;/* end lsn of the possible log entry for
				this mtr */

Redo Log Record Types

When we modify a database page, a redo log record is generated. This redo log record either contains what information has been changed in the page (physical redo log) or how to perform the change again (logical redo log). InnoDB uses a combination of physical redo logs and logical redo logs.

To understand this consider an operation like page re-organization. If we generated a physical redo log record for this operation, it is likely that the redo log record generated will be equal to the page size. But instead InnoDB uses logical redo log record for this operation, which will reduce the redo log record size significantly. In the case of logical redo log record to represent a page reorganize, all we need is the information to uniquely identify a page, and "type" of operation which is page reorganize.

So each redo log record has a type. A redo log record type helps to identify the function that will be used to apply or execute the redo log during recovery. The contents of the redo log record must then contain all the arguments or parameters needed by the function.

Life cycle of a redo log record

The life cycle of a redo log record is as follows:
  • The redo log record is first created by a mini transaction and stored in the mini transaction buffer. It has information necessary to redo the same operation again in the time of database recovery.
  • When mtr_commit() is done, the redo log record is transferred to the global in-memory redo log buffer. If necessary, the redo log buffer will be flushed to the redo log file, to make space for the new redo log record.
  • The redo log record has a specific lsn associated with it. This association is established during mtr_commit(), when the redo log record is transferred from mtr buffer to log buffer. Once the lsn is established for a redo log record, then its location in the redo log file is also established.
  • The redo log record is then transferred from the log buffer to redo log file when it is written + flushed. This means that the redo log record is now durably stored on the disk.
  • Each redo log record has an associated list of dirty pages. This relationship is established via LSN. A redo log record must be flushed to disk, before its associated dirty pages. A redo log record can be discarded only after all the associated dirty pages are flushed to the disk.
  • A redo log record can be used to re-create its associated list of dirty pages. This happens during database recovery.


This article provided an overview of the redo log subsystem of InnoDB storage engine. The major data structures used in the redo log subsystem are the mini transaction (mtr), the in-memory redo log buffer and the on-disk redo log files. The InnoDB storage engine tracks many LSN values to ensure that the write-ahead logging (WAL) happens correctly.

Since data loss is unacceptable for an RDBMS, the redo log subsystem is very critical for a successful RDBMS. And since redo logs are generated synchronously with DML operations, it is important to do it efficiently. The size of redo logs must be kept as minimal as possible.

Send in your feedback to me.

Friday May 31, 2013

Introduction to Transaction Locks in InnoDB Storage Engine


Transaction locks are an important feature of any transactional storage engine. There are two types of transaction locks – table locks and row locks. Table locks are used to avoid a table being altered or dropped by one transaction when another transaction is using the table. It is also used to prohibit a transaction from accessing a table, when it is being altered. InnoDB supports multiple granularity locking (MGL). So to access rows in a table, intention locks must be taken on the tables.

Row locks are at finer granularity than table level locks, different threads can work on different parts of the table without interfering with each other. This is in contrast with MyISAM where the entire table has to be locked when updating even unrelated rows. Having row locks means that multiple transactions can read and write into a single table. This increases the concurrency level of the storage engine. InnoDB being an advanced transactional storage engine, provides both table and row level transaction locks.

This article will provide information about how transaction locks are implemented in InnoDB storage engine. The lock subsystem of InnoDB provides many services to the overall system, like:

  • Creating, acquiring, releasing and destroying row locks.
  • Creating, acquiring, releasing and destroying table locks.
  • Providing multi-thread safe access to row and table locks.
  • Data structures useful for locating a table lock or a row lock.
  • Maintaining a list of user threads suspended while waiting for transaction locks.
  • Notification of suspended threads when a lock is released.
  • Deadlock detection

The lock subsystem helps to isolate one transaction from another transaction. This article will provide information about how transaction locks are created, maintained and used in the InnoDB storage engine. All reference to locks means transaction locks, unless specified otherwise.

Internal Data Structures of InnoDB

Before we proceed with scenarios and algorithms, I would like to present the following data structures. We need to be familiar with these data structures to understand how transaction locks work in InnoDB. The data structures of interest are:

  1. The enum lock_mode – provides the list of modes in which the transaction locks can be obtained.
  2. The lock struct lock_t. This represents either a table lock or a row lock.
  3. The struct trx_t which represents one transaction within InnoDB.
  4. The struct trx_lock_t which associates one transaction with all its transaction locks.
  5. The global object of type lock_sys_t holds the hash table of row locks.
  6. The table descriptor dict_table_t, which uniquely identifies a table in InnoDB. Each table descriptor contains a list of locks on the table.
  7. The global object trx_sys of type trx_sys_t, which holds the active transaction table (ATT).

The Lock Modes

The locks can be obtained in the following modes. I'll not discuss about the LOCK_AUTO_INC in this article.

/* Basic lock modes */ 
enum lock_mode { 
        LOCK_IS = 0,    /* intention shared */ 
        LOCK_IX,        /* intention exclusive */ 
        LOCK_S,         /* shared */ 
        LOCK_X,         /* exclusive */ 
        LOCK_AUTO_INC,  /* locks the auto-inc counter of a table 
                        in an exclusive mode */ 
        LOCK_NONE,      /* this is used elsewhere to note consistent read */ 
        LOCK_NUM = LOCK_NONE, /* number of lock modes */ 
        LOCK_NONE_UNSET = 255 

The lock struct or lock object

The structure lock_t represents either a table lock (lock_table_t) or a group of row locks (lock_rec_t) for all the rows belonging to the same page. For different lock modes, different lock structs will be used. For example, if a row in a page is locked in LOCK_X mode, and if another row in the same page is locked in LOCK_S mode, then these two row locks will be held in different lock structs.

This structure is defined as follows:

struct lock_t {
        trx_t*          trx;
        ulint           type_mode;
        hash_node_t     hash;
        dict_index_t*   index;
        union {
                lock_table_t    tab_lock;/*!< table lock */
                lock_rec_t      rec_lock;/*!< record lock */
        } un_member;                    /*!< lock        details */

/** A table lock */
struct lock_table_t {
        dict_table_t*   table;          /*!< database table in dictionary
                                        cache */
                        locks;          /*!< list of locks on the same
                                        table */

/** Record lock for a page */
struct lock_rec_t {
        ulint   space;                  /*!< space id */
        ulint   page_no;                /*!< page number */
        ulint   n_bits;                 /*!< number of bits in the lock
                                        bitmap; NOTE: the lock bitmap is
                                        placed immediately after the
                                        lock struct */

The important point here is the lock bitmap. The lock bitmap is a space efficient way to represent the row locks. This space efficient way of representing the row locks avoids the need for lock escalation and lock data persistence. (Note: For prepared transactions, it would be useful to have lock data persistence, but InnoDB currently do not support lock data persistence.)

The lock bitmap is placed immediately after the lock struct object. If a page can contain a maximum of 100 records, then the lock bitmap would be of size 100 (or more). Each bit in this bitmap will represent a row in the page. The heap_no of the row is used to index into the bitmap. If the 5th bit in the bitmap is enabled, then the row with heap_no 5 is locked.

The Transaction And Lock Relationship

The struct trx_t is used to represent the transaction within InnoDB. The struct trx_lock_t is used to represent all locks associated to a given transaction. Here I have listed down only those members relevant to this article.

struct trx_t {

	trx_id_t        id;             /*!< transaction id */ 

      trx_lock_t      lock;           /*!< Information about the transaction 
                                        locks and state. Protected by 
                                        trx->mutex or lock_sys->mutex 
                                        or both */ 

struct trx_lock_t { 

        ib_vector_t*    table_locks;    /*!< All table locks requested by this 
                                        transaction, including AUTOINC locks */ 

                        trx_locks;      /*!< locks requested 
                                        by the transaction; 
                                        insertions are protected by trx->mutex 
                                        and lock_sys->mutex; removals are 
                                        protected by lock_sys->mutex */ 


Global hash table of row locks

Before we look at how the row locks internally works, we need to be aware of this global data structure. The lock subsystem of the InnoDB storage engine has a global object lock_sys of type lock_sys_t. The class lock_sys_t is defined as follows:

struct lock_sys_t { 
        ib_mutex_t      mutex;   
        hash_table_t*   rec_hash; 
        ulint           n_lock_max_wait_time; 
        // more ...

The lock_sys_t::rec_hash member is the hash table of the record locks. Every page within InnoDB is uniquely identified by the (space_id, page_no) combination, called the page address. The hashing is done based on the page address. So given the page address, we can locate the list of lock_t objects of that page. All lock structs of the given page will be in the same hash bucket. So using this mechanism we can locate the row lock of any row.

The Transaction Subsystem Global Object

The transaction subsystem of InnoDB has one global object trx_sys of type trx_sys_t, which is an important internal data structure. It is defined as follows:

/** The transaction system central memory data structure. */ 
struct trx_sys_t{ 

	  // more ...

        trx_list_t      rw_trx_list;    /*!< List of active and committed in 
                                        memory read-write transactions, sorted 
                                        on trx id, biggest first. Recovered 
                                        transactions are always on this list. */ 
        trx_list_t      ro_trx_list;    /*!< List of active and committed in 
                                        memory read-only transactions, sorted 
                                        on trx id, biggest first. NOTE: 
                                        The order for read-only transactions 
                                        is not necessary. We should exploit 
                                        this and increase concurrency during 
                                        add/remove. */ 
        // more ...

This global data structure contains the active transaction table (ATT) of InnoDB storage engine. In the following sections, I'll use this global object to access my transaction object through the debugger to demonstrate the transaction locks.

The table descriptor (dict_table_t)

Each table in InnoDB is uniquely identified by its name in the form of databasename/tablename. For each table, the data dictionary of InnoDB will contain exactly one table descriptor object of type dict_table_t. Given the table name, the table descriptor can be obtained. This table descriptor contains a list of locks on the table. This list can be used to check if the table has already been locked by a transaction.

struct dict_table_t{ 

        // moretable_id_t      id;     /*!< id of the table */ 
        char*           name;   /*!< table name */ 

                        locks;  /*!< list of locks on the table; protected 
                                by lock_sys->mutex */ 

The heap_no of a row

Every row in InnoDB is uniquely identified by space_id, page_no and heap_no of the row. I assume that you know about space_id and page_no. I'll explain here only about the heap_no of the row. Each row in the page has a heap_no. The heap_no of the infimum record is 0, the heap_no of the supremum record is 1, and the heap_no of the first user record in page is 2.

If we have inserted 10 records in the page, and if there has been no updates on the page, then the heap_no of the user records would be 2, 3, 4, 5 … 10, 11. The heap_no will be in the same order in which the records will be accessed in ascending order.

The heap_no of the row is used to locate the bit in the lock bitmap corresponding to the row. If the row has a heap_no of 10, then the 10th bit in the lock bitmap corresponds to the row. This means that if the heap_no of the row is changed, then the associated lock structs must be adjusted accordingly.

The Schema

To demonstrate the transaction locks, I use the following schema in this article. There is one table t1 which has 3 rows (1, 'அ'), (2, 'ஆ') and (3, 'இ'). In this article, we will see when and how the transaction locks (table and row) are created and used. We will also cover the internal steps involved in creating table and row locks.

set names utf8; 
drop table if exists t1; 
create table t1 (f1 int primary key, f2 char(10)) charset='utf8'; 
insert into t1 values (1, 'அ'), (2, 'ஆ'), (3, 'இ'); 

Table Level Transaction Locks

The purpose of table level transaction locks or simply table locks is to ensure that no transactions modify the structure of the table when another transaction is accessing or modifying the table or the rows in a table. There are two types of table locks in MySQL – one is the meta-data locking (MDL) provided by the SQL engine and the other is the table level locks within InnoDB storage engine. Here the discussion is only about the table level locks within InnoDB.

On tables, InnoDB normally acquires only intentional shared (LOCK_IS) or intentional exclusive (LOCK_IX) modes. It does not lock the tables in shared mode (LOCK_S) or exclusive mode (LOCK_X) unless explicitly requested via LOCK TABLES command. One exception to this is in the prepare phase of the online alter table command, where the table is locked in shared mode. Please refer to Multiple Granularity Locking to know about intention shared and intention exclusive locks.

Scenario (௧) for LOCK_IS table lock

Here is an example scenario that will take intention shared (LOCK_IS) lock on the table.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 

When the above 3 statements are executed, one table lock would be taken in LOCK_IS mode. In mysql-5.6, there is no way to verify this other than using the debugger. I verified it as follows:

(gdb) set $trx_locklist = trx_sys->rw_trx_list->start->lock->trx_locks 
(gdb) p $trx_locklist.start->un_member->tab_lock->table->name
$21 = 0x7fffb0014f70 "test/t1" 
(gdb) p lock_get_mode(trx_sys->rw_trx_list->start->lock->trx_locks->start) 
$25 = LOCK_IS 

You need to access the correct transaction object and the lock object.

Scenario (௨) for LOCK_IX table lock

Here is an extension to the previous scenario. Adding an INSERT statement to the scenario will take an intention exclusive (LOCK_IX) lock on the table.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 
mysql> insert into t1 values (4, 'ஃ'); 

When the above 4 statements are executed, there would be two locks on the table – LOCK_IS and LOCK_IX. The select statement would have taken the table lock in LOCK_IS mode and the INSERT statement would take the table lock in LOCK_IX mode.

This can also be verified using the debugger. I'll leave it as an exercise for the reader.

What Happens Internally When Acquiring Table Locks

Each table is uniquely identified within the InnoDB storage engine using the table descriptor object of type dict_table_t. In this section we will see the steps taken internally by InnoDB to obtain a table lock. The function to refer to is lock_table() in the source code. Necessary mutexes are taken during these steps to ensure that everything works correctly in a multi-thread environment. This aspect is not discussed here.

  1. The request for a table lock comes with the following information – the table to lock (dict_table_t), the mode in which to lock (enum lock_mode), and the transaction which wants the lock (trx_t).
  2. Check whether the transaction is already holding an equal or stronger lock on the given table. Each transaction maintains a list of table locks obtained by itself (trx_t::trx_lock_t::table_locks). Searching this list for the given table and mode is sufficient to answer this question. If the current transaction is already holding an equal or stronger lock on the table, then the lock request is reported as success. If not, then go to next step.
  3. Check if any other transaction has an incompatible lock request in the lock queue of the table. Each table descriptor has a lock queue (dict_table_t::locks). Searching this queue is sufficient to answer this question. If some other transaction holds an incompatible lock on the table, then the lock request needs to wait. Waiting for a lock can lead to time out or deadlock error. If there is no contention from other transactions for this table, then proceed further.
  4. Allocate a lock struct object (lock_t). Initialize with table, trx and lock mode information. Add this object to the queue in dict_table_t::locks object of the table as well as the trx_t::trx_lock_t::table_locks.
  5. Complete.

You can re-visit the above scenario (௨) and then follow the above steps and verify that the number of lock structs created is 2.

Row Level Transaction Locks

Each row in the InnoDB storage engine needs to be uniquely identified in order to be able to lock it. A row is uniquely identified by the following pieces of information:

  • The space identifier
  • The page number within the space.
  • The heap number of the record within the page.
  • The descriptor of the index to which the row belongs (dict_index_t). Stricly speaking, this is not necessary to uniquely identify the row. But associating the row to its index will help to provide user friendly diagnosis and also help the developers to debug a problem.

There are two types of row locks in InnoDB – the implicit and the explicit. The explicit row locks are the locks that make use of the global row lock hash table and the lock_t structures. The implicit row locks are logically arrived at based on the transaction information in the clustered index or secondary index record. These are explained in the following sections.

Implicit Row Locks

Implicit row locks do not have an associated lock_t object allocated. This is purely calculated based on the ID of the requesting transaction and the transaction ID available in each record. First, let use see how implicit locks are “acquired” (here is comment from

“If a transaction has modified or inserted an index record, then it owns an implicit x-lock on the record. On a secondary index record, a transaction has an implicit x-lock also if it has modified the clustered index record, the max trx id of the page where the secondary index record resides is >= trx id of the transaction (or database recovery is running), and there are no explicit non-gap lock requests on the secondary index record. ”

As we can see, the implicit locks are a logical entity and whether a transaction has implicit locks or not is calculated using certain procedures. This procedure is explained here briefly.

If a transaction wants to acquire a row lock (implicit or explicit), then it needs to determine whether any other transaction has an implicit lock on the row before checking on the explicit lock. As explained above this procedure is different for clustered index and secondary index.

For the clustered index, get the transaction id from the given record. If it is a valid transaction id, then that is the transaction which is holding the implicit exclusive lock on the row. Refer to the function lock_clust_rec_some_has_impl() for details.

For the secondary index, it is more cumbersome to calculate. Here are the steps:

  1. Let R1 be the secondary index row for which we want to check if another transaction is holding an implicit exclusive lock.
  2. Obtain the maximum transaction id (T1) of the page that contains the secondary index row R1.
  3. Let T2 be the minimum transaction id of the InnoDB system. If T1 is less than T2, then there can be no transaction holding implicit lock on the row. Otherwise, go to next step.
  4. Obtain the corresponding clustered index record for the given secondary index record R1.
  5. Get the transaction id (T3) from the clustered index record. By looking at the clustered index record, including its older versions, find out if T3 could have modified or inserted the secondary index row R1. If yes, then T3 has an implicit exclusive lock on row R1. Otherwise it does not have.

In the case of secondary indexes, we need to make use of the undo logs to determine if any transactions have an implicit exclusive row lock on record. Refer to the function lock_sec_rec_some_has_impl() for details.

Also note that the implicit row locks do not affect the gaps.

Explicit Row Locks

Explicit row locks are represented by the lock_t structures. This section provides some scenarios in which explicit row locks are obtained in different modes. It also briefly discusses the internal procedure to acquire an explicit row lock.

Scenario (௩) for LOCK_S row lock

For transaction isolation level REPEATABLE READ and less stronger levels, InnoDB does not take shared locks on rows. So in the following scenario, we use SERIALIZABLE transaction isolation level for demonstrating shared row locks:

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 

This scenario is the same as the Scenario (௧). The verification via debugger alone is different. In the case of row locks, each lock object represents a row lock for all rows of a page (of the same lock mode). So a lock bitmap is used for this purpose which exists at the end of the lock struct object. In the above scenario, we can verify the locks as follows:

The lock_t object allocated for the row locks is

(gdb) set $trx_locklist = trx_sys->rw_trx_list->start->lock->trx_locks 
(gdb) set $rowlock = $trx_locklist.start->trx_locks->next 
(gdb) p $rowlock 
$47 = (ib_lock_t *) 0x7fffb00103f0 

Verify the lock mode of the lock object.

(gdb) p lock_get_mode($rowlock) 
$48 = LOCK_S 

Verify that it is actually a lock struct object allocated for rows (and not a table lock).

(gdb) p *$rowlock 
$43 = {trx = 0x7fffb0013f78, trx_locks = {prev = 0x7fffb00103a8, next = 0x0}, 
  type_mode = 34, hash = 0x0, index = 0x7fffb001a6b8, un_member = {tab_lock = { 
      table = 0xc, locks = {prev = 0x3, next = 0x48}}, rec_lock = {space = 12, 
      page_no = 3, n_bits = 72}}} 

The row lock bit map is at the end of the lock object. Even though the number of bits in the lock bitmap is given as n_bits = 72 (9 bytes) above, I am examining only 4 bytes here.

(gdb) x /4t $rowlock + 1 
0x7fffb0010438:	00011110	00000000	00000000	00000000 

Note that there are 4 bits enabled. This means that there are 4 row locks obtained. The row in the page, and the bit in the lock bit map are related via the heap_no of the row, as described previously.

Scenario (௪) for LOCK_X row lock

Changing the above scenario slightly as follows, will obtain LOCK_X locks on the rows.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1 for update; 

Verification of this through the debugger is left as an exercise for the reader.

What Happens Internally When Acquiring Explicit Row Locks

To lock a row, all information necessary to uniquely identify the row – the trx, lock mode, table space id, page_no and heap_no – must be supplied. The entry point for row locking is the lock_rec_lock() function in the source code.

  1. Check if the given transaction has a granted explicit lock on the row with equal or stronger lock mode. To do this search, the global hash table of row locks is used. If the transaction already has a strong enough lock on the row, then nothing more to do. Otherwise go to next step.
  2. Check if other transactions have a conflicting lock on the row. For this search also, the global hash table of row locks is used. If yes, then the current transaction must wait. Waiting can lead to timeout or deadlock error. If there is no contention for this row from other transactions then proceed further.
  3. Check if there is already a lock struct object for the given row. If yes, reuse the same lock struct object. Otherwise allocate a lock struct object.
  4. Initialize the trx, lock mode, page_no, space_id and the size of the lock bit map. Set the bit of the lock bitmap based on the heap_no of the row.
  5. Insert this lock struct object in the global hash table of row locks.
  6. Add this to the list of transaction locks in the transaction object (trx_t::trx_lock_t::trx_locks).

Releasing transaction locks

All the transaction locks acquired by a transaction is released by InnoDB at transaction end (commit or rollback). In the case of REPEATABLE READ isolation level or SERIALIZABLE, it is not possible for the SQL layer to release locks before transaction end. In the case of READ COMMITTED isolation level, it is possible for the SQL layer to release locks before transaction end by making use of ha_innobase::unlock_row() handler interface API call.

To obtain the list of all transaction locks of a transaction, the following member can be used – trx_t::trx_lock_t::trx_locks.


This article provided an overview of the transaction locks in InnoDB. We looked at the data structures used to represent transaction locks. We analysed some scenarios in which the various locks are created. We also looked at the internal steps that happen when a table lock or a row lock is created.

You can download this article in pdf.

Saturday Sep 29, 2012

Online ALTER TABLE in MySQL 5.6

This is the low-level view of data dictionary language (DDL) operations in the InnoDB storage engine in MySQL 5.6. John Russell gave a more high-level view in his blog post April 2012 Labs Release – Online DDL Improvements.

MySQL before the InnoDB Plugin

Traditionally, the MySQL storage engine interface has taken a minimalistic approach to data definition language. The only natively supported operations were CREATE TABLE, DROP TABLE and RENAME TABLE. Consider the following example:

INSERT INTO t VALUES (1),(2),(3);

The CREATE INDEX statement would be executed roughly as follows:


You could imagine that the database could crash when copying all rows from the original table to the new one. For example, it could run out of file space. Then, on restart, InnoDB would roll back the huge INSERT transaction. To fix things a little, a hack was added to ha_innobase::write_row for committing the transaction every 10,000 rows.

Still, it was frustrating that even a simple DROP INDEX would make the table unavailable for modifications for a long time.

Fast Index Creation in the InnoDB Plugin of MySQL 5.1

MySQL 5.1 introduced a new interface for CREATE INDEX and DROP INDEX. The old table-copying approach can still be forced by SET old_alter_table=0.

This interface is used in MySQL 5.5 and in the InnoDB Plugin for MySQL 5.1. Apart from the ability to do a quick DROP INDEX, the main advantage is that InnoDB will execute a merge-sort algorithm before inserting the index records into each index that is being created. This should speed up the insert into the secondary index B-trees and potentially result in a better B-tree fill factor.

The 5.1 ALTER TABLE interface was not perfect. For example, DROP FOREIGN KEY still invoked the table copy. Renaming columns could conflict with InnoDB foreign key constraints. Combining ADD KEY and DROP KEY in ALTER TABLE was problematic and not atomic inside the storage engine.

The ALTER TABLE interface in MySQL 5.6

The ALTER TABLE storage engine interface was completely rewritten in MySQL 5.6. Instead of introducing a method call for every conceivable operation, MySQL 5.6 introduced a handful of methods, and data structures that keep track of the requested changes.

In MySQL 5.6, online ALTER TABLE operation can be requested by specifying LOCK=NONE. Also LOCK=SHARED and LOCK=EXCLUSIVE are available. The old-style table copying can be requested by ALGORITHM=COPY. That one will require at least LOCK=SHARED. From the InnoDB point of view, anything that is possible with LOCK=EXCLUSIVE is also possible with LOCK=SHARED.

Most ALGORITHM=INPLACE operations inside InnoDB can be executed online (LOCK=NONE). InnoDB will always require an exclusive table lock in two phases of the operation. The execution phases are tied to a number of methods:

Checks if the storage engine can perform all requested operations, and if so, what kind of locking is needed.
InnoDB uses this method to set up the data dictionary cache for upcoming CREATE INDEX operation. We need stubs for the new indexes, so that we can keep track of changes to the table during online index creation. Also, crash recovery would drop any indexes that were incomplete at the time of the crash.
In InnoDB, this method is used for creating secondary indexes or for rebuilding the table. This is the ‘main’ phase that can be executed online (with concurrent writes to the table).
This is where the operation is committed or rolled back. Here, InnoDB would drop any indexes, rename any columns, drop or add foreign keys, and finalize a table rebuild or index creation. It would also discard any logs that were set up for online index creation or table rebuild.

The prepare and commit phases require an exclusive lock, blocking all access to the table. If MySQL times out while upgrading the table meta-data lock for the commit phase, it will roll back the ALTER TABLE operation.

In MySQL 5.6, data definition language operations are still not fully atomic, because the data dictionary is split. Part of it is inside InnoDB data dictionary tables. Part of the information is only available in the *.frm file, which is not covered by any crash recovery log. But, there is a single commit phase inside the storage engine.

Online Secondary Index Creation

It may occur that an index needs to be created on a new column to speed up queries. But, it may be unacceptable to block modifications on the table while creating the index.

It turns out that it is conceptually not so hard to support online index creation. All we need is some more execution phases:

  1. Set up a stub for the index, for logging changes.
  2. Scan the table for index records.
  3. Sort the index records.
  4. Bulk load the index records.
  5. Apply the logged changes.
  6. Replace the stub with the actual index.

Threads that modify the table will log the operations to the logs of each index that is being created. Errors, such as log overflow or uniqueness violations, will only be flagged by the ALTER TABLE thread. The log is conceptually similar to the InnoDB change buffer.

The bulk load of index records will bypass record locking. We still generate redo log for writing the index pages. It would suffice to log page allocations only, and to flush the index pages from the buffer pool to the file system upon completion.


Starting with MySQL 5.6, InnoDB supports most ALTER TABLE operations natively. The notable exceptions are changes to the column type, ADD FOREIGN KEY except when foreign_key_checks=0, and changes to tables that contain FULLTEXT indexes.

The keyword ALGORITHM=INPLACE is somewhat misleading, because certain operations cannot be performed in-place. For example, changing the ROW_FORMAT of a table requires a rebuild.

Online operation (LOCK=NONE) is not allowed in the following cases:

  • when adding an AUTO_INCREMENT column,
  • when the table contains FULLTEXT indexes or a hidden FTS_DOC_ID column, or
  • when there are FOREIGN KEY constraints referring to the table, with ON…CASCADE or ON…SET NULL option.

The FOREIGN KEY limitations are needed, because MySQL does not acquire meta-data locks on the child or parent tables when executing SQL statements.

Theoretically, InnoDB could support operations like ADD COLUMN and DROP COLUMN in-place, by lazily converting the table to a newer format. This would require that the data dictionary keep multiple versions of the table definition. For simplicity, we will copy the entire table, even for DROP COLUMN.

The bulk copying of the table will bypass record locking and undo logging. For facilitating online operation, a temporary log will be associated with the clustered index of table. Threads that modify the table will also write the changes to the log.

When altering the table, we skip all records that have been marked for deletion. In this way, we can simply discard any undo log records that were not yet purged from the original table.

Off-page columns, or BLOBs, are an important consideration. We suspend the purge of delete-marked records if it would free any off-page columns from the old table. This is because the BLOBs can be needed when applying changes from the log. We have special logging for handling the ROLLBACK of an INSERT that inserted new off-page columns. This is because the columns will be freed at rollback.

Helping to Reduce Page Compression Failures Rate

When InnoDB compresses a page it needs the result to fit into its predetermined compressed page size (specified with KEY_BLOCK_SIZE). When the result does not fit we call that a compression failure. In this case InnoDB needs to split up the page and try to compress again. That said, compression failures are bad for performance and should be minimized.

Whether the result of the compression will fit largely depends on the data being compressed and some tables and/or indexes may contain more compressible data than others. And so it would be nice if the compression failure rate, along with other compression stats, could be monitored on a per table or even on a per index basis, wouldn't it?

This is where the new INFORMATION_SCHEMA table in MySQL 5.6 kicks in. INFORMATION_SCHEMA.INNODB_CMP_PER_INDEX provides exactly this helpful information. It contains the following fields:

| Field           | Type         | Null |
| database_name   | varchar(192) | NO   |
| table_name      | varchar(192) | NO   |
| index_name      | varchar(192) | NO   |
| compress_ops    | int(11)      | NO   |
| compress_ops_ok | int(11)      | NO   |
| compress_time   | int(11)      | NO   |
| uncompress_ops  | int(11)      | NO   |
| uncompress_time | int(11)      | NO   |

similarly to INFORMATION_SCHEMA.INNODB_CMP, but this time the data is grouped by "database_name,table_name,index_name" instead of by "page_size".

So a query like

compress_ops - compress_ops_ok AS failures
FROM information_schema.innodb_cmp_per_index
ORDER BY failures DESC;
would reveal the most problematic tables and indexes that have the highest compression failure rate.

From there on the way to improving performance would be to try to increase the compressed page size or change the structure of the table/indexes or the data being stored and see if it will have a positive impact on performance.

Wednesday Mar 11, 2009

Plug In for Performance and Scalability

Note: this article was originally published on  on March 10, 2009 by Calvin Sun.

Why should you care about the latest “early adopter” release of the InnoDB Plugin, version 1.0.3?   One word: performance! The release introduces these features:

  • Enhanced concurrency & scalability: the “Google SMP patch” using atomic instructions for mutexing
  • More efficient memory allocation: ability to use more scalable platform memory allocator
  • Improved out-of-the-box scalability: unlimited concurrent thread execution by default
  • Dynamic tuning: at run-time, enable or disable insert buffering and adaptive hash indexing

These new performance features can yield up to twice the throughput or more, depending on your workload, platform and other tuning considerations. In another post, we explore some details about these changes, but first, what do these enhancements mean for performance and scalability?

In brief, we’ve tested three different workloads (joins, DBT2 OLTP and a modified sysbench) using a memory-resident database. In all cases, the InnoDB Plugin scales significantly better than the built-in InnoDB in MySQL 5.1. And in some cases, the absolute level of performance is dramatically higher too! The charts below illustrate the kinds of performance gains we’ve measured with release 1.0.3 of the InnoDB Plugin. Your mileage may vary, of course. 

This release of the InnoDB Plugin incorporates a patch made by Ben Handy and Mark Callaghan at Google to improve multi-core scalability by using more efficient synchronization methods (mutexing and rw-locks) to reduce cpu utilization and contention. We’re grateful for this contribution, and you will be too!

Now to our test results …

Joins: The following chart shows the performance gains in performing joins, comparing the built-in InnoDB in MySQL (in ">blue) with the InnoDB Plugin 1.0.3 (in red).

As you can see from the blue bars in the above chart, with MySQL 5.1 using the built-in InnoDB, the total number of joins the system can execute declines as the number of concurrent users increases. In contrast, the InnoDB Plugin slightly improves performance even with one user, and maintains performance as the number of users rises. This performance improvement is due in large part to the use of atomics for mutexing in the InnoDB Plugin.

Transaction Processing (DBT2): The following chart illustrates a scalability improvement using the OLTP read/write DBT2 benchmark, again comparing the performance of the built-in InnoDB in MySQL with the performance of InnoDB Plugin 1.0.3.

Here, the InnoDB Plugin scales better than the built-in InnoDB from 16 to 32 users and produces about 12% more throughput with 64 concurrent users, as other bottlenecks are encountered or system capacity is reached. This improvement is likewise due primarily to the changes in mutexing.

Modified Sysbench: This test uses a version of the well-known sysbench workload, modified to include queries based on a secondary index, as suggested by Mark Callaghan of Google.

This time, the InnoDB Plugin shows significantly better scalability from 8 to 64 users than the built-in InnoDB in MySQL, yielding as much as 60% more throughput at 64 users. Like the previous examples, this improvement is largely due to the use of atomics for mutexing.

Modified Sysbench with tcmalloc: This test uses the same modified sysbench workload, but shows the difference between the built-in InnoDB (which uses the internal InnoDB memory allocator) and the InnoDB Plugin when using a more scalable memory allocator, in this case tcmalloc.

When the new configuration parameter innodb_use_sys_malloc is set to enable use of the memory allocator tcmalloc, the InnoDB Plugin really shines! Transaction throughput continues to scale, and the actual throughput with 64 users has nearly doubled!


This is the InnoDB team blog.


« July 2016