Friday Jan 10, 2014

Transaction life cycle improvements in 5.7.3

This is part of our ongoing work on improving transaction life cycle management. In 5.7.2 we split the transaction list in two: a read-only transaction list and a read-write transaction list. There was also a third, "virtual" list: the auto-commit non-locking read-only (AC-NL-RO) transaction list. The change in 5.7.2 was that, by default, a transaction was treated as read-only and added to the read-only transaction list; only when we determined that the transaction was going to do an update did we remove it from the read-only list and move it to the read-write transaction list. This initial add to the read-only list forced the acquisition of the trx_sys_t::mutex, and acquiring that mutex during transaction start/begin has a cost. When promoting a transaction from read-only to read-write we must acquire the trx_sys_t::mutex anyway, to add it to the read-write transaction list, so that cost is unavoidable and acceptable. There is one more transaction list, used for caching user transactions, that we will ignore in this discussion: the per-connection mysql-trx-list. All user transactions - AC-NL-RO, plain read-only and read-write - are on this list.

The optimization

The optimization in 5.7.3 is to eliminate the explicit read-only transaction list altogether. By default, transactions are not put on any list; only when a transaction is explicitly tagged as read-write is it added to the read-write transaction list. This makes transaction begin mutex-free. Additionally, if the transaction commits as a read-only transaction, we do not need to acquire the trx_sys_t::mutex at commit time to remove it from a read-only list either. This improves performance for read-only transactions, making them more or less equivalent to AC-NL-RO transactions. They will, however, incur some cost compared to AC-NL-RO transactions if they acquire shared locks.

Additionally, the MVCC view open/close handling of read-only transactions is now equivalent to that of AC-NL-RO transactions, with very low life cycle overhead. This also improves performance significantly.

User impact

There is of course the positive performance impact, but there is also a change in visibility semantics. With the elimination of the read-only list, read-only transactions are no longer visible via SHOW ENGINE INNODB STATUS; they are, however, visible via the INFORMATION_SCHEMA.

Wednesday Sep 25, 2013

InnoDB Temporary Tables just got faster

It all started with a goal to make InnoDB temporary tables more effective. Temporary table semantics have some important characteristics that can help us simplify a lot of operations.

  • Temporary tables are not visible across connections.
  • A temporary table's lifetime is limited to the connection's lifetime (unless the user explicitly drops it).

What does this mean for InnoDB?

  • REDO logging can be avoided for temporary tables and related objects since temporary tables do not survive a shutdown or crash.
  • Temporary table definitions can be maintained in-memory without persisting to the disk.
  • Locking constraints can be relaxed since only one client can see these tables.
  • Change buffering can be avoided since the majority of temporary tables are short-lived.

In order to implement these changes in InnoDB we took a slightly different approach:

  • We introduced a dedicated tablespace for housing temporary tables. In 5.7, all non-compressed temporary tables (irrespective of the innodb-file-per-table setting) are housed in this new temporary tablespace (ibtmp1). Compressed temporary tables continue to be housed in their own independent tablespaces. This new temporary tablespace is re-created on each server restart and can be pre-extended before starting the server. (For more about the temporary tablespace, check the MySQL documentation.)
  • Dedicated rollback-segments (for housing undo logs) for temporary table objects.
  • All related objects including rollback-segments have been moved to this new temporary tablespace. This helps in maintaining locality and importantly eliminating REDO logging associated with all such objects.

What have we achieved? (Performance gain relative to 5.6)

  • Create/Drop workload: (around 11x performance gain)
    create n tables, drop n tables, create+drop n tables, and create+drop n tables with a significant number of keys.
  • Insert/Update/Delete workload: (2-4x performance gain)
    • insert workload: insert n entries sorted, n entries reverse sorted, and n entries random (total 3n entries)
    • delete workload: first run the insert workload to load the table, then delete n entries using the primary key, delete n entries using a secondary index, and delete the remaining n entries using the primary key so that the table ends up empty.
    • update workload: first run the insert workload to load the table, then update n entries using the primary key, update n entries using a secondary index, and update all 3n entries twice (no explicit key specified).

All testing was done using sql-bench (modified to accommodate the temporary table use-case).

Summing it up....

The benchmarks above show that InnoDB temporary tables are faster in 5.7 (than in 5.6). We plan to further enhance temporary table performance.

Friday Sep 13, 2013

InnoDB Redundant Row Format


This article describes the InnoDB redundant row format. If you are new to the InnoDB code base (a new developer starting to work with InnoDB), then this article is for you. I'll explain the row format with the help of a gdb session. An overview of the article is given below:

  • Create a simple table and populate a few rows.
  • Access the page that contains the inserted rows.
  • Access a couple of rows and explain their format.
  • Give a summary of the redundant row format.
  • Useful gdb commands to analyse InnoDB rows.
  • Look at a GNU Emacs Lisp function to traverse rows in an InnoDB index page.

To get the most out of this article, the reader is expected to repeat the gdb session as described here.

The Schema

Consider the following SQL statements to produce the schema:

CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;
INSERT INTO t1 VALUES (1), (2), (3), (4), (5);

When the above CREATE TABLE is executed, InnoDB internally creates a clustered B-tree index (or just clustered index) representing the table. If no primary key is explicitly given, a primary key is added internally. Each node of the B-tree is represented by one page (which is by default 16KB in size). In the given scenario we insert only 5 records, so the B-tree will consist of only 1 node, the root node.

Access InnoDB Index Page in gdb

All select queries go through the row_search_for_mysql() function of InnoDB. The first time it is called for a query, a persistent cursor (btr_pcur_t) is opened by this function. When this B-tree persistent cursor (pcur in the following gdb) points to the InnoDB index page containing the rows of our interest, we can save the page frame to a file as follows:

(gdb) set $frame = pcur->btr_cur->page_cur->block->frame
(gdb) dump binary memory /tmp/gdb.anna $frame $frame + srv_page_size
(gdb) p srv_page_size
$10 = 16384

If you do not understand the commands above, refer to the gdb manual for more information. Saving a page in binary format in a file on disk helps us to analyse the contents of the page by loading it in a text editor like GNU Emacs or ViM. This is sometimes a very useful approach for debugging. The above commands were executed when I was on the sync point row_search_rec_loop, in the function row_search_for_mysql(). The line numbers are from mysql 5.7.3.

4155         } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
4156                 btr_pcur_open_at_index_side(
4157                         mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
4158                         pcur, false, 0, &mtr);
4159         }
4161 rec_loop:
4162         DEBUG_SYNC_C("row_search_rec_loop");
4163         if (trx_is_interrupted(trx)) {
4164                 btr_pcur_store_position(pcur, &mtr);
4165                 err = DB_INTERRUPTED;
4166                 goto normal_return;
4167         }

An important point to remember when analysing the page is that all the data in the page are stored in big-endian format.

Open Page in GNU Emacs

The following image shows the InnoDB page opened in GNU Emacs in the hexl-mode major mode. I have placed the cursor on the origin of the infimum record (at offset 0x65 from the page origin). On the right side, you can see the text "infimum" and "supremum" as stored in the page. These are special records.

All rows in an InnoDB page are linked to one another in ascending order via the next record pointer. In the given example this ordering is based on the internally generated DB_ROW_ID. In each page, the first record (or the minimum record) is the infimum record, and the last record (or the maximum record) is the supremum record. The records between the infimum and supremum records are called the user records.

The next record pointer of each record points to the "origin" of the next record. The record contains data to the right of the origin and meta data to the left of the origin. The next record pointer is available in the 2 bytes before the record origin.

The next record pointer of infimum as given in the picture above is 0x0087 (which is 135 in decimal). This is the offset of the next record from the beginning of the page.

Printing the infimum record

As mentioned previously, infimum is a special minimum record in an index page. It exists at a fixed offset of 101 bytes from the page beginning. The function page_get_infimum_rec() can be used to get the infimum record given a page frame pointer. All records in InnoDB are pointed to at their record origin. So to access the meta data in the record, we need to subtract an offset from the record origin.

Printing the infimum record given a page frame pointer (in char):

(gdb) p  pcur->btr_cur->page_cur->block->frame+101
$41 = (unsigned char *) 0x7fffd1d3c065 "infimum"

Printing the infimum record given a page frame pointer (in hex):

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101)@7
$45 = {0x69, 0x6e, 0x66, 0x69, 0x6d, 0x75, 0x6d}

Printing the 6 byte Informational Bit Fields

Given the origin of the infimum record, we can print the 6 bytes of informational bits as follows (in hex).

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101-6)@6
$50 = {0x1, 0x0, 0x0, 0x3, 0x0, 0x87}

Printing the 6 bytes of informational bits of the infimum record (in binary). In this output, the right most byte displayed is the first byte before the record origin.

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+101-6
0x7fffd1d3c05f: 00000001  00000000   00000000  00000011  00000000  10000111

The above 6 bytes of bit fields are formatted as follows (the table also shows the actual values from the gdb output above; byte 1 is the byte closest to the record origin):

Size      Bit Field            Byte(s)  Value
4 bits    info bits            6        0000
4 bits    n_owned              6        0001
13 bits   heap number          4, 5     00000000 00000
10 bits   field count          3, 4     000 0000001
1 bit     flag                 3        1
2 bytes   next record offset   1, 2     00000000 10000111

Each row of the table above describes one bit field: its size in bits or bytes, a short description, the byte(s) in which it is stored, and its actual value as seen in the gdb output above.

The 1 bit flag above is used to tell whether the offset array element is 1 byte (if set) or 2 bytes (if not set). In our example case, the 1 bit flag is set and hence the offset array elements are each 1 byte long.

From the field count bit field, we can see that the infimum record has only one field, whose value is fixed to "infimum". The next record pointer of the infimum record points to the first user record in the page. This is available at the offset 0x0087, which is 135 in decimal. So the first user record is available at page frame pointer + 135 bytes.

The Bit Fields

The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of bit field data is listed here. These six bytes are available immediately before the record origin. These bit fields must be accessed by making use of the interface provided by the record manager (refer to rem0rec.h).

The Bit Field        Size     Description of the Bit Field
info bits            4 bits   Used to delete-mark a record.
n_owned              4 bits   Number of records owned by this record. This is related to the page directory; to understand this field we would need to understand the page structure used by InnoDB, which is not explained further in this article.
heap number          13 bits  The order number of this record in the page. Each row has a unique heap number within the page. The lock manager uses this heap number.
field count          10 bits  Number of fields in this record.
flag                 1 bit    Tells if the offset array elements are 1 or 2 bytes.
next record offset   2 bytes  Offset of the next record from the page origin.

Printing the First User Record

To parse the record, we first need to know the number of fields in the record. This can be obtained from the 6 bytes of bit-field information available before the record origin. Let us first look at this for the first user record:

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+135-6
0x7fffd1d3c081: 00000000  00000000  00010000  00001001  00000000  10101000

Size      Bit Field            Byte(s)  Value
4 bits    info bits            6        0000
4 bits    n_owned              6        0000
13 bits   heap number          4, 5     00000000 00010
10 bits   field count          3, 4     000 0000100
1 bit     flag                 3        1
2 bytes   next record offset   1, 2     00000000 10101000

The above information shows that this record has 4 fields (000 0000100 in binary). But our table definition was

CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;

If you are wondering why a row of this table has 4 fields, it is because InnoDB added 3 internal fields to the table t1 - a row identifier (DB_ROW_ID), a transaction identifier (DB_TRX_ID) and an undo roll pointer (DB_ROLL_PTR). A row identifier column is internally added to a table if no primary key is explicitly given by the user; this row identifier becomes the internal primary key. The transaction identifier specifies the transaction that last updated the row. The undo roll pointer points to the previous versions of the record. The 4th field is the actual column f1 added by the user.

The list of fields in the table t1 can be printed by accessing its associated dict_table_t object (or its clustered index dict_index_t object). In the following, the name GEN_CLUST_INDEX is the name of the clustered index of the table t1. This is the standard name of the clustered index if it was internally generated by InnoDB.

(gdb) p prebuilt->index->name       
$75 = 0x7fffa00232e8 "GEN_CLUST_INDEX"
(gdb) p prebuilt->index->table      
$73 = (dict_table_t *) 0x7fffa001fe68
(gdb) p prebuilt->index->table->name
$74 = 0x7fffa000cef0 "test/t1"
(gdb) set print array on
(gdb) p *prebuilt->index->fields@4
$67 =   {{col = 0x7fffa0020154, name = 0x7fffa00201c3 "DB_ROW_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa0020160, name = 0x7fffa00201cd "DB_TRX_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa002016c, name = 0x7fffa00201d7 "DB_ROLL_PTR", prefix_len = 0, 
    fixed_len = 7},
  {col = 0x7fffa0020148, name = 0x7fffa00201c0 "f1", prefix_len = 0, fixed_len = 4}}

Printing the Record Offsets (or Offset Array)

The redundant record contains an array of offsets, one for each field of the row. This array is available after the 6-byte bit fields (going down from the record origin). The size of each element of the array is either 1 byte or 2 bytes, depending on the 1-bit flag in the 6-byte bit fields. In our current case each offset element is 1 byte long.

(gdb) p /d  *(pcur->btr_cur->page_cur->block->frame+135-6-4)@4
$79 = {23, 19, 12, 6}

The first field (DB_ROW_ID) starts at offset 0 from record origin. The second field (DB_TRX_ID) starts at offset 6 from record origin. The third field (DB_ROLL_PTR) starts at offset 12 bytes from record origin. The fourth field (f1) starts at offset 19 bytes from record origin. The end of the record is 23 bytes from the record origin. So the total size of this record can be calculated as follows:

S - size of offset array element (1 byte) 
N - number of fields in the row (4 fields)
I - size of bit fields (6 bytes)
D - Length of row after record origin (23 as given in offset array)
R - Total Size of the row 

R = (S*N) + I + D = 4 + 6 + 23 = 33 bytes

The total size of the row is 33 bytes. Since this table contains only fixed length fields, the size of each row in this table will be the same.

Printing the DB_ROW_ID

The DB_ROW_ID is 6 bytes long. It starts at the record origin itself. It can be printed as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135)@6
$82 = {0x0, 0x0, 0x0, 0x0, 0x2, 0x0}

Printing the DB_TRX_ID

The second field DB_TRX_ID is also 6 bytes long. It starts at an offset of 6 bytes from the record origin. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+6)@6
$84 = {0x0, 0x0, 0x0, 0x0, 0x9, 0x8}

Printing the DB_ROLL_PTR

The third field DB_ROLL_PTR is 7 bytes long. It points to the previous versions of a record available in the undo segments. It can be examined as follows:

(gdb)  p /x *(pcur->btr_cur->page_cur->block->frame+135+12)@7
$3 = {0xa8, 0x0, 0x0, 0x1, 0x1a, 0x1, 0x10}

Printing the User Data Field

The fourth field "f1" is the user defined field. It is 4 bytes long. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+19)@4
$5 = {0x0, 0x0, 0x0, 0x1}
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$8 = 1

Thus, we have completely accessed the first user record. Given the first user record, we can find out the location of the next user record (the first 2 bytes before the record origin). If we keep accessing the next user record, we will eventually reach the supremum record, which is the last (and considered largest or maximum) record in a page.

Printing All Rows

Just to improve our confidence in accessing the rows, here I have printed all the rows that we have inserted into the table. The first record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which we have inserted) and then the last record supremum.

(gdb) p pcur->btr_cur->page_cur->block->frame+101
$13 = (unsigned char *) 0x7fffd1e48065 "infimum"
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$14 = 1
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+168+19))
$15 = 2
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+201+19))
$16 = 3
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+234+19))
$17 = 4
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+267+19))
$18 = 5
(gdb) p pcur->btr_cur->page_cur->block->frame+116
$19 = (unsigned char *) 0x7fffd1e48074 "supremum"

The Complete Redundant Row Format

One redundant record can be logically separated into 5 parts - the offsets array, the 6-byte bit field, record origin, hidden fields (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified fields of the table. They can be viewed as follows:

+------------------------------+-------------------+---------------------------+---------------+-------------------+
|        offsets array         | 6-byte bit fields | primary key or user given | hidden fields |    user fields    |
| fn ... f1, x2, x1, pn ... p1 |                   |       p1, p2 ... pn       |    x1, x2     | f1, f2, f3 ... fn |
+------------------------------+-------------------+---------------------------+---------------+-------------------+
                                                   ^
                                                   record origin

All record pointers point to the record origin. The record origin is the beginning of the primary key for the clustered primary index. It is not a separate entity that occupies some space. I am showing it separately to make it easy to visualize. All information prior to the record origin can be considered as the record header or the record meta-data.

Each data field has a corresponding offsets array element. The offsets array is arranged in reverse order. The first element in the array provides the offset of the end of the last data field and the last element in the array provides the offset of the end of the first data element.

The offsets array also tells whether a data field is NULL. If a field is NULL, the highest bit in the associated offset array element will be set to 1.

The biggest advantage of the redundant row format is that a record in this format can be parsed without needing to access the data dictionary. But this format is not space efficient. Even for fixed length fields, we store information in the offsets array (this is avoided in COMPACT, DYNAMIC and COMPRESSED row formats).

One more limitation of the redundant row format is that the blob data cannot be stored completely off page (as in DYNAMIC and COMPRESSED row formats). A 768 byte prefix of the blob will always be stored inline with the row and cannot be avoided.

A small note on row formats. InnoDB really supports only two row formats - the redundant row format and the compact row format. The compressed and dynamic row formats are variations available only on top of the compact row format. Technically these variations could be implemented for the redundant row format as well, but this is currently not done.

GNU Emacs Function to Traverse Rows in Page

If we dump a single InnoDB index page into a file, we can open it in GNU Emacs hexl-mode and use the following elisp function to traverse from the infimum record to the supremum record. First place the cursor on the 'i' of the infimum record and then call this function to reach the next record origin. I am not an expert in writing elisp functions; if you can improve on this and write a better one, kindly share it with me.

(defun ib-next-red-row ()
  "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.
The cursor must be positioned at the origin of a record. 
This is for the GNU Emacs hexl-mode. "
  (setq rec-origin (point))
  (setq rec-origin-offset (hexl-current-address))

  ;; The next-rec pointer is stored in 2 bytes before the
  ;; record origin
  (hexl-backward-char 2)
  (setq next-rec (string (char-after)))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (hexl-forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))

  ;; Next record is next-rec-off bytes away from page origin
  (setq next-rec-off (hexl-hex-string-to-integer next-rec))

  ;; Assuming that there is only one page in dump
  (hexl-goto-address 0)
  (hexl-forward-char next-rec-off))


In this article, the format of the redundant row was explained. A gdb session was used to demonstrate how to access a page and the rows within that page. The infimum record was accessed, printed and analysed. Using its next record pointer, the first user record was accessed. The reader can use the next record pointer of the first user record to access the next user record and so on till the supremum record is reached. I leave that as an exercise for the readers.

Thanks to Marko Makela for his useful tips on debugging at our yearly InnoDB Team Meeting 2013 held at Shanghai, China. This article derives from his ideas.

Send in your feedback to me.

Friday May 31, 2013

Introduction to Transaction Locks in InnoDB Storage Engine


Transaction locks are an important feature of any transactional storage engine. There are two types of transaction locks - table locks and row locks. Table locks are used to prevent a table from being altered or dropped by one transaction while another transaction is using it, and to prohibit a transaction from accessing a table while it is being altered. InnoDB supports multiple granularity locking (MGL), so to access rows in a table, intention locks must first be taken on the table.

Row locks are at a finer granularity than table-level locks: different threads can work on different parts of the table without interfering with each other. This is in contrast with MyISAM, where the entire table has to be locked even when updating unrelated rows. Row locks allow multiple transactions to read and write a single table concurrently, which increases the concurrency level of the storage engine. InnoDB, being an advanced transactional storage engine, provides both table-level and row-level transaction locks.

This article will provide information about how transaction locks are implemented in InnoDB storage engine. The lock subsystem of InnoDB provides many services to the overall system, like:

  • Creating, acquiring, releasing and destroying row locks.
  • Creating, acquiring, releasing and destroying table locks.
  • Providing multi-thread safe access to row and table locks.
  • Data structures useful for locating a table lock or a row lock.
  • Maintaining a list of user threads suspended while waiting for transaction locks.
  • Notification of suspended threads when a lock is released.
  • Deadlock detection

The lock subsystem helps to isolate one transaction from another transaction. This article will provide information about how transaction locks are created, maintained and used in the InnoDB storage engine. All reference to locks means transaction locks, unless specified otherwise.

Internal Data Structures of InnoDB

Before we proceed with scenarios and algorithms, I would like to present the following data structures. We need to be familiar with these data structures to understand how transaction locks work in InnoDB. The data structures of interest are:

  1. The enum lock_mode – provides the list of modes in which the transaction locks can be obtained.
  2. The lock struct lock_t. This represents either a table lock or a row lock.
  3. The struct trx_t which represents one transaction within InnoDB.
  4. The struct trx_lock_t which associates one transaction with all its transaction locks.
  5. The global object of type lock_sys_t holds the hash table of row locks.
  6. The table descriptor dict_table_t, which uniquely identifies a table in InnoDB. Each table descriptor contains a list of locks on the table.
  7. The global object trx_sys of type trx_sys_t, which holds the active transaction table (ATT).

The Lock Modes

The locks can be obtained in the following modes. I'll not discuss LOCK_AUTO_INC in this article.

/* Basic lock modes */ 
enum lock_mode { 
        LOCK_IS = 0,    /* intention shared */ 
        LOCK_IX,        /* intention exclusive */ 
        LOCK_S,         /* shared */ 
        LOCK_X,         /* exclusive */ 
        LOCK_AUTO_INC,  /* locks the auto-inc counter of a table 
                        in an exclusive mode */ 
        LOCK_NONE,      /* this is used elsewhere to note consistent read */ 
        LOCK_NUM = LOCK_NONE, /* number of lock modes */ 
        LOCK_NONE_UNSET = 255 
};

The lock struct or lock object

The structure lock_t represents either a table lock (lock_table_t) or a group of row locks (lock_rec_t) for all the rows belonging to the same page. For different lock modes, different lock structs will be used. For example, if a row in a page is locked in LOCK_X mode, and if another row in the same page is locked in LOCK_S mode, then these two row locks will be held in different lock structs.

This structure is defined as follows:

struct lock_t {
        trx_t*          trx;            /*!< transaction owning the lock */
        ulint           type_mode;      /*!< lock type, mode and flags */
        hash_node_t     hash;           /*!< hash chain node for a record
                                        lock */
        dict_index_t*   index;          /*!< index for a record lock */
        union {
                lock_table_t    tab_lock;/*!< table lock */
                lock_rec_t      rec_lock;/*!< record lock */
        } un_member;                    /*!< lock details */
};

/** A table lock */
struct lock_table_t {
        dict_table_t*   table;          /*!< database table in dictionary
                                        cache */
        UT_LIST_NODE_T(lock_t)
                        locks;          /*!< list of locks on the same
                                        table */
};

/** Record lock for a page */
struct lock_rec_t {
        ulint   space;                  /*!< space id */
        ulint   page_no;                /*!< page number */
        ulint   n_bits;                 /*!< number of bits in the lock
                                        bitmap; NOTE: the lock bitmap is
                                        placed immediately after the
                                        lock struct */
};

The important point here is the lock bitmap, a space-efficient way to represent row locks. It avoids the need for lock escalation and lock data persistence. (Note: for prepared transactions it would be useful to have lock data persistence, but InnoDB currently does not support it.)

The lock bitmap is placed immediately after the lock struct object. If a page can contain a maximum of 100 records, then the lock bitmap would be of size 100 (or more). Each bit in this bitmap will represent a row in the page. The heap_no of the row is used to index into the bitmap. If the 5th bit in the bitmap is enabled, then the row with heap_no 5 is locked.

The Transaction And Lock Relationship

The struct trx_t is used to represent the transaction within InnoDB. The struct trx_lock_t is used to represent all locks associated to a given transaction. Here I have listed down only those members relevant to this article.

struct trx_t {

        trx_id_t        id;             /*!< transaction id */ 

        trx_lock_t      lock;           /*!< Information about the transaction 
                                        locks and state. Protected by 
                                        trx->mutex or lock_sys->mutex 
                                        or both */ 
        // more ...
};

struct trx_lock_t { 

        ib_vector_t*    table_locks;    /*!< All table locks requested by this 
                                        transaction, including AUTOINC locks */ 

        UT_LIST_BASE_NODE_T(lock_t)
                        trx_locks;      /*!< locks requested 
                                        by the transaction; 
                                        insertions are protected by trx->mutex 
                                        and lock_sys->mutex; removals are 
                                        protected by lock_sys->mutex */ 
        // more ...
};


Global hash table of row locks

Before we look at how row locks work internally, we need to be aware of this global data structure. The lock subsystem of the InnoDB storage engine has a global object lock_sys of type lock_sys_t. The struct lock_sys_t is defined as follows:

struct lock_sys_t { 
        ib_mutex_t      mutex;          /*!< mutex protecting the lock system */
        hash_table_t*   rec_hash;       /*!< hash table of record locks */
        ulint           n_lock_max_wait_time; 
        // more ...
};

The lock_sys_t::rec_hash member is the hash table of the record locks. Every page within InnoDB is uniquely identified by the (space_id, page_no) combination, called the page address. The hashing is done on the page address, so given a page address we can locate the list of lock_t objects for that page. All lock structs of a given page will be in the same hash bucket. Using this mechanism we can locate the row lock of any row.

The Transaction Subsystem Global Object

The transaction subsystem of InnoDB has one global object trx_sys of type trx_sys_t, which is an important internal data structure. It is defined as follows:

/** The transaction system central memory data structure. */ 
struct trx_sys_t{ 

        // more ...

        trx_list_t      rw_trx_list;    /*!< List of active and committed in 
                                        memory read-write transactions, sorted 
                                        on trx id, biggest first. Recovered 
                                        transactions are always on this list. */ 
        trx_list_t      ro_trx_list;    /*!< List of active and committed in 
                                        memory read-only transactions, sorted 
                                        on trx id, biggest first. NOTE: 
                                        The order for read-only transactions 
                                        is not necessary. We should exploit 
                                        this and increase concurrency during 
                                        add/remove. */ 
        // more ...
};

This global data structure contains the active transaction table (ATT) of the InnoDB storage engine. In the following sections, I'll use this global object to access my transaction object through the debugger to demonstrate the transaction locks.

The table descriptor (dict_table_t)

Each table in InnoDB is uniquely identified by its name in the form of databasename/tablename. For each table, the data dictionary of InnoDB will contain exactly one table descriptor object of type dict_table_t. Given the table name, the table descriptor can be obtained. This table descriptor contains a list of locks on the table. This list can be used to check if the table has already been locked by a transaction.

struct dict_table_t{ 

        // more ...
        table_id_t      id;     /*!< id of the table */ 
        char*           name;   /*!< table name */ 

        UT_LIST_BASE_NODE_T(lock_t)
                        locks;  /*!< list of locks on the table; protected 
                                by lock_sys->mutex */ 

The heap_no of a row

Every row in InnoDB is uniquely identified by space_id, page_no and heap_no of the row. I assume that you know about space_id and page_no. I'll explain here only about the heap_no of the row. Each row in the page has a heap_no. The heap_no of the infimum record is 0, the heap_no of the supremum record is 1, and the heap_no of the first user record in page is 2.

If we have inserted 10 records into a page, and there have been no updates on the page, then the heap_no of the user records would be 2, 3, 4, 5 … 10, 11. The heap_no values follow the order in which the records are accessed in ascending order.

The heap_no of the row is used to locate the bit in the lock bitmap corresponding to the row. If the row has a heap_no of 10, then the 10th bit in the lock bitmap corresponds to the row. This means that if the heap_no of the row is changed, then the associated lock structs must be adjusted accordingly.
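
The mapping between a heap_no and its bit in the lock bitmap can be sketched as follows. The helper names are illustrative, not InnoDB's actual functions:

```c
#include <assert.h>

/* Bit heap_no of the lock bitmap corresponds to the row with that
   heap_no: bit 0 is the infimum, bit 1 the supremum, bit 2 the first
   user record, and so on.  Bit n lives in byte n / 8, position n % 8. */
static void
lock_bitmap_set(unsigned char *bitmap, unsigned heap_no)
{
        bitmap[heap_no / 8] |= (unsigned char) (1u << (heap_no % 8));
}

static int
lock_bitmap_is_set(const unsigned char *bitmap, unsigned heap_no)
{
        return (bitmap[heap_no / 8] >> (heap_no % 8)) & 1;
}

/* Locking the row with heap_no 10 sets the 10th bit: byte 1, bit 2.
   Returns the first two bitmap bytes packed as (byte1 << 8) | byte0. */
static unsigned
demo_lock_heap_no_10(void)
{
        unsigned char bitmap[9] = {0};  /* n_bits = 72 -> 9 bytes */

        lock_bitmap_set(bitmap, 10);
        return ((unsigned) bitmap[1] << 8) | bitmap[0];
}
```

This is also why a change to a row's heap_no (for example during page reorganization) forces the corresponding lock bitmaps to be adjusted.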

The Schema

To demonstrate the transaction locks, I use the following schema in this article. There is one table t1 which has 3 rows (1, 'அ'), (2, 'ஆ') and (3, 'இ'). In this article, we will see when and how the transaction locks (table and row) are created and used. We will also cover the internal steps involved in creating table and row locks.

set names utf8; 
drop table if exists t1; 
create table t1 (f1 int primary key, f2 char(10)) charset='utf8'; 
insert into t1 values (1, 'அ'), (2, 'ஆ'), (3, 'இ'); 

Table Level Transaction Locks

The purpose of table level transaction locks or simply table locks is to ensure that no transactions modify the structure of the table when another transaction is accessing or modifying the table or the rows in a table. There are two types of table locks in MySQL – one is the meta-data locking (MDL) provided by the SQL engine and the other is the table level locks within InnoDB storage engine. Here the discussion is only about the table level locks within InnoDB.

On tables, InnoDB normally acquires only intention shared (LOCK_IS) or intention exclusive (LOCK_IX) modes. It does not lock the tables in shared mode (LOCK_S) or exclusive mode (LOCK_X) unless explicitly requested via the LOCK TABLES command. One exception to this is in the prepare phase of the online ALTER TABLE command, where the table is locked in shared mode. Please refer to Multiple Granularity Locking to know about intention shared and intention exclusive locks.

Scenario (௧) for LOCK_IS table lock

Here is an example scenario that will take intention shared (LOCK_IS) lock on the table.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 

When the above 3 statements are executed, one table lock would be taken in LOCK_IS mode. In mysql-5.6, there is no way to verify this other than using the debugger. I verified it as follows:

(gdb) set $trx_locklist = trx_sys->rw_trx_list->start->lock->trx_locks 
(gdb) p $trx_locklist.start->un_member->tab_lock->table->name
$21 = 0x7fffb0014f70 "test/t1" 
(gdb) p lock_get_mode(trx_sys->rw_trx_list->start->lock->trx_locks->start) 
$25 = LOCK_IS 

You need to access the correct transaction object and the lock object.

Scenario (௨) for LOCK_IX table lock

Here is an extension to the previous scenario. Adding an INSERT statement to the scenario will take an intention exclusive (LOCK_IX) lock on the table.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 
mysql> insert into t1 values (4, 'ஃ'); 

When the above 4 statements are executed, there would be two locks on the table – LOCK_IS and LOCK_IX. The select statement would have taken the table lock in LOCK_IS mode and the INSERT statement would take the table lock in LOCK_IX mode.

This can also be verified using the debugger. I'll leave it as an exercise for the reader.

What Happens Internally When Acquiring Table Locks

Each table is uniquely identified within the InnoDB storage engine using the table descriptor object of type dict_table_t. In this section we will see the steps taken internally by InnoDB to obtain a table lock. The function to refer to is lock_table() in the source code. The necessary mutexes are taken during these steps to ensure that everything works correctly in a multi-threaded environment; that aspect is not discussed here.

  1. The request for a table lock comes with the following information – the table to lock (dict_table_t), the mode in which to lock (enum lock_mode), and the transaction which wants the lock (trx_t).
  2. Check whether the transaction is already holding an equal or stronger lock on the given table. Each transaction maintains a list of table locks obtained by itself (trx_t::trx_lock_t::table_locks). Searching this list for the given table and mode is sufficient to answer this question. If the current transaction is already holding an equal or stronger lock on the table, then the lock request is reported as success. If not, then go to next step.
  3. Check if any other transaction has an incompatible lock request in the lock queue of the table. Each table descriptor has a lock queue (dict_table_t::locks). Searching this queue is sufficient to answer this question. If some other transaction holds an incompatible lock on the table, then the lock request needs to wait. Waiting for a lock can lead to time out or deadlock error. If there is no contention from other transactions for this table, then proceed further.
  4. Allocate a lock struct object (lock_t). Initialize with table, trx and lock mode information. Add this object to the queue in dict_table_t::locks object of the table as well as the trx_t::trx_lock_t::table_locks.
  5. Complete.
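
The steps above can be sketched in simplified C. The types, flat arrays and function names here are hypothetical stand-ins for dict_table_t::locks and the per-transaction lists, and the matrices mirror standard multiple-granularity locking:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum lock_mode { LOCK_IS, LOCK_IX, LOCK_S, LOCK_X };
enum lock_result { LOCK_CACHED, LOCK_WAIT, LOCK_NEW };

struct table_lock { int table_id; enum lock_mode mode; };

/* Standard multiple-granularity compatibility matrix. */
static bool
lock_mode_compatible(enum lock_mode a, enum lock_mode b)
{
        static const bool compat[4][4] = {
                /*        IS     IX     S      X   */
                /* IS */ {true,  true,  true,  false},
                /* IX */ {true,  true,  false, false},
                /* S  */ {true,  false, true,  false},
                /* X  */ {false, false, false, false},
        };
        return compat[a][b];
}

/* held is at least as strong as req (X covers all; IX and S cover IS). */
static bool
lock_mode_stronger_or_eq(enum lock_mode held, enum lock_mode req)
{
        return held == req
                || held == LOCK_X
                || (req == LOCK_IS && (held == LOCK_IX || held == LOCK_S));
}

/* Steps 2-4 of lock_table(): check our own locks, then other
   transactions' locks, then allocate a new lock struct. */
static enum lock_result
lock_table_sketch(const struct table_lock *own, size_t n_own,
                  const struct table_lock *others, size_t n_others,
                  int table_id, enum lock_mode mode)
{
        size_t i;

        for (i = 0; i < n_own; i++)        /* step 2 */
                if (own[i].table_id == table_id
                    && lock_mode_stronger_or_eq(own[i].mode, mode))
                        return LOCK_CACHED;

        for (i = 0; i < n_others; i++)     /* step 3 */
                if (others[i].table_id == table_id
                    && !lock_mode_compatible(others[i].mode, mode))
                        return LOCK_WAIT;

        return LOCK_NEW;                   /* step 4 */
}
```

Note that in this sketch a held LOCK_IS is not stronger than or equal to a requested LOCK_IX, so a SELECT followed by an INSERT ends up allocating a second lock struct, matching the two locks observed in Scenario (௨).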

You can re-visit the above scenario (௨) and then follow the above steps and verify that the number of lock structs created is 2.

Row Level Transaction Locks

Each row in the InnoDB storage engine needs to be uniquely identified in order to be able to lock it. A row is uniquely identified by the following pieces of information:

  • The space identifier
  • The page number within the space.
  • The heap number of the record within the page.
  • The descriptor of the index to which the row belongs (dict_index_t). Strictly speaking, this is not necessary to uniquely identify the row, but associating the row with its index helps to provide user-friendly diagnostics and also helps developers to debug problems.

There are two types of row locks in InnoDB – the implicit and the explicit. The explicit row locks are the locks that make use of the global row lock hash table and the lock_t structures. The implicit row locks are logically arrived at based on the transaction information in the clustered index or secondary index record. These are explained in the following sections.

Implicit Row Locks

Implicit row locks do not have an associated lock_t object allocated. They are purely calculated based on the ID of the requesting transaction and the transaction ID available in each record. First, let us see how implicit locks are “acquired” (here is a comment from the source code):

“If a transaction has modified or inserted an index record, then it owns an implicit x-lock on the record. On a secondary index record, a transaction has an implicit x-lock also if it has modified the clustered index record, the max trx id of the page where the secondary index record resides is >= trx id of the transaction (or database recovery is running), and there are no explicit non-gap lock requests on the secondary index record. ”

As we can see, the implicit locks are a logical entity and whether a transaction has implicit locks or not is calculated using certain procedures. This procedure is explained here briefly.

If a transaction wants to acquire a row lock (implicit or explicit), then it needs to determine whether any other transaction has an implicit lock on the row before checking on the explicit lock. As explained above this procedure is different for clustered index and secondary index.

For the clustered index, get the transaction id from the given record. If it is a valid transaction id, then that is the transaction which is holding the implicit exclusive lock on the row. Refer to the function lock_clust_rec_some_has_impl() for details.

For the secondary index, it is more cumbersome to calculate. Here are the steps:

  1. Let R1 be the secondary index row for which we want to check if another transaction is holding an implicit exclusive lock.
  2. Obtain the maximum transaction id (T1) of the page that contains the secondary index row R1.
  3. Let T2 be the minimum transaction id of the InnoDB system. If T1 is less than T2, then there can be no transaction holding implicit lock on the row. Otherwise, go to next step.
  4. Obtain the corresponding clustered index record for the given secondary index record R1.
  5. Get the transaction id (T3) from the clustered index record. By looking at the clustered index record, including its older versions, find out if T3 could have modified or inserted the secondary index row R1. If yes, then T3 has an implicit exclusive lock on row R1. Otherwise it does not have.

In the case of secondary indexes, we need to make use of the undo logs to determine if any transactions have an implicit exclusive row lock on record. Refer to the function lock_sec_rec_some_has_impl() for details.
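
The cheap early-out in steps 2 and 3 can be sketched as follows. This is an illustrative helper, not InnoDB's lock_sec_rec_some_has_impl():

```c
#include <assert.h>
#include <stdbool.h>

/* If the max trx id of the page is below the minimum transaction id of
   the system, every transaction that could have modified the page has
   already committed, so no implicit lock can exist on any row of the
   page and the expensive undo-log walk is skipped entirely. */
static bool
sec_rec_may_have_impl_lock(unsigned long page_max_trx_id,
                           unsigned long min_trx_id)
{
        return page_max_trx_id >= min_trx_id;
}
```

Only when this check passes does the implementation fetch the clustered index record and consult the undo logs, which is why secondary-index implicit lock checks are described as cumbersome.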

Also note that the implicit row locks do not affect the gaps.

Explicit Row Locks

Explicit row locks are represented by the lock_t structures. This section provides some scenarios in which explicit row locks are obtained in different modes. It also briefly discusses the internal procedure to acquire an explicit row lock.

Scenario (௩) for LOCK_S row lock

At the REPEATABLE READ transaction isolation level and weaker levels, InnoDB does not take shared row locks for plain SELECT statements. So in the following scenario, we use the SERIALIZABLE transaction isolation level to demonstrate shared row locks:

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1; 

This scenario is the same as Scenario (௧); only the verification via the debugger differs. In the case of row locks, each lock object represents row locks for all rows of a page (in the same lock mode). A lock bitmap, which exists at the end of the lock struct object, is used for this purpose. In the above scenario, we can verify the locks as follows.

The lock_t object allocated for the row locks can be obtained as follows:

(gdb) set $trx_locklist = trx_sys->rw_trx_list->start->lock->trx_locks 
(gdb) set $rowlock = $trx_locklist.start->trx_locks->next 
(gdb) p $rowlock 
$47 = (ib_lock_t *) 0x7fffb00103f0 

Verify the lock mode of the lock object.

(gdb) p lock_get_mode($rowlock) 
$48 = LOCK_S 

Verify that it is actually a lock struct object allocated for rows (and not a table lock).

(gdb) p *$rowlock 
$43 = {trx = 0x7fffb0013f78, trx_locks = {prev = 0x7fffb00103a8, next = 0x0}, 
  type_mode = 34, hash = 0x0, index = 0x7fffb001a6b8, un_member = {tab_lock = { 
      table = 0xc, locks = {prev = 0x3, next = 0x48}}, rec_lock = {space = 12, 
      page_no = 3, n_bits = 72}}} 

The row lock bit map is at the end of the lock object. Even though the number of bits in the lock bitmap is given as n_bits = 72 (9 bytes) above, I am examining only 4 bytes here.

(gdb) x /4t $rowlock + 1 
0x7fffb0010438:	00011110	00000000	00000000	00000000 

Note that there are 4 bits enabled, which means that 4 row locks have been obtained – on the supremum record (heap_no 1) and the three user rows (heap_nos 2, 3 and 4). The row in the page and the bit in the lock bitmap are related via the heap_no of the row, as described previously.

Scenario (௪) for LOCK_X row lock

Changing the above scenario slightly as follows, will obtain LOCK_X locks on the rows.

mysql> set transaction isolation level serializable;
mysql> start transaction; 
mysql> select * from t1 for update; 

Verification of this through the debugger is left as an exercise for the reader.

What Happens Internally When Acquiring Explicit Row Locks

To lock a row, all information necessary to uniquely identify the row – the trx, lock mode, table space id, page_no and heap_no – must be supplied. The entry point for row locking is the lock_rec_lock() function in the source code.

  1. Check if the given transaction has a granted explicit lock on the row with equal or stronger lock mode. To do this search, the global hash table of row locks is used. If the transaction already has a strong enough lock on the row, then nothing more to do. Otherwise go to next step.
  2. Check if other transactions have a conflicting lock on the row. For this search also, the global hash table of row locks is used. If yes, then the current transaction must wait. Waiting can lead to timeout or deadlock error. If there is no contention for this row from other transactions then proceed further.
  3. Check if there is already a lock struct object for the given row. If yes, reuse the same lock struct object. Otherwise allocate a lock struct object.
  4. Initialize the trx, lock mode, page_no, space_id and the size of the lock bit map. Set the bit of the lock bitmap based on the heap_no of the row.
  5. Insert this lock struct object in the global hash table of row locks.
  6. Add this to the list of transaction locks in the transaction object (trx_t::trx_lock_t::trx_locks).
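
Steps 3 and 4 are worth illustrating: because one lock_t covers all rows of a page for a given transaction and mode, locking a second row of the same page only sets another bit. The structures and names below are hypothetical simplifications of lock_rec_lock():

```c
#include <assert.h>
#include <string.h>

#define N_BITS 72  /* matches the n_bits = 72 seen in the gdb output above */

/* Hypothetical simplification of a record lock: one struct covers all
   rows of one page for one (transaction, mode) pair. */
struct rec_lock_sketch {
        unsigned long   space_id;
        unsigned long   page_no;
        int             mode;
        unsigned char   bitmap[(N_BITS + 7) / 8];
};

/* Steps 3 and 4: reuse an existing struct for the same page and mode by
   setting one more bit; otherwise report that a new lock_t is needed.
   Returns 1 if a new struct must be allocated, 0 on reuse. */
static int
rec_lock_row(struct rec_lock_sketch *locks, int n_locks,
             unsigned long space_id, unsigned long page_no,
             int mode, unsigned heap_no)
{
        int i;

        for (i = 0; i < n_locks; i++) {
                if (locks[i].space_id == space_id
                    && locks[i].page_no == page_no
                    && locks[i].mode == mode) {
                        locks[i].bitmap[heap_no / 8] |=
                                (unsigned char) (1u << (heap_no % 8));
                        return 0;
                }
        }
        return 1;
}

/* Locking two rows of the same page in the same mode ends up with a
   single lock struct; only the bitmap grows another bit. */
static int
demo_two_rows_one_struct(void)
{
        struct rec_lock_sketch locks[4];
        int n_locks = 0;

        memset(locks, 0, sizeof(locks));

        if (rec_lock_row(locks, n_locks, 12, 3, 0, 2)) {
                locks[n_locks].space_id = 12;
                locks[n_locks].page_no = 3;
                locks[n_locks].mode = 0;
                locks[n_locks].bitmap[2 / 8] = (unsigned char) (1u << (2 % 8));
                n_locks++;
        }
        if (rec_lock_row(locks, n_locks, 12, 3, 0, 3)) {
                n_locks++;
        }
        return n_locks;
}
```

This explains the single lock_t with n_bits = 72 covering four locked rows in Scenario (௩).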

Releasing transaction locks

All the transaction locks acquired by a transaction are released by InnoDB at transaction end (commit or rollback). With the REPEATABLE READ or SERIALIZABLE isolation levels, it is not possible for the SQL layer to release locks before transaction end. With the READ COMMITTED isolation level, the SQL layer can release locks before transaction end by making use of the ha_innobase::unlock_row() handler interface API call.

To obtain the list of all transaction locks of a transaction, the following member can be used – trx_t::trx_lock_t::trx_locks.


This article provided an overview of the transaction locks in InnoDB. We looked at the data structures used to represent transaction locks. We analysed some scenarios in which the various locks are created. We also looked at the internal steps that happen when a table lock or a row lock is created.


Friday Apr 19, 2013

Data Organization in InnoDB


This article explains how data is organized in the InnoDB storage engine. First we will look at the various files created by InnoDB, then we will look at the logical data organization: tablespaces, pages, segments and extents. We will explore each of them in some detail and discuss their relationships with each other. At the end of this article, the reader will have a high-level view of the data layout within the InnoDB storage engine.

The Files

MySQL stores all data within the data directory. The data directory can be specified using the command-line option --datadir or in the configuration file as datadir. Refer to the Server Command Options for complete details.

By default, when InnoDB is initialized, it creates 3 important files in the data directory – ibdata1, ib_logfile0 and ib_logfile1. The ibdata1 is the data file in which system and user data will be stored. The ib_logfile0 and ib_logfile1 are the redo log files. The location and size of these files are configurable. Refer to Configuring InnoDB for more details.

The data file ibdata1 belongs to the system tablespace with tablespace id (space_id) of 0. The system tablespace can contain more than 1 data file. As of MySQL 5.6, only the system tablespace can contain more than 1 data file. All other tablespaces can contain only one data file. Also, only the system tablespace can contain more than one table, while all other tablespaces can contain only one table.

The data files and the redo log files are represented in the memory by the C structure fil_node_t.


Tablespaces

By default, InnoDB contains only one tablespace, called the system tablespace, whose identifier is 0. More tablespaces can be created indirectly using the innodb_file_per_table configuration parameter. In MySQL 5.6, this configuration parameter is ON by default. When it is ON, each table will be created in its own tablespace in a separate data file.

The relationship between the tablespace and data files is explained in an InnoDB source code comment (storage/innobase/fil/), which is quoted here for reference:

“A tablespace consists of a chain of files. The size of the files does not have to be divisible by the database block size, because we may just leave the last incomplete block unused. When a new file is appended to the tablespace, the maximum size of the file is also specified. At the moment, we think that it is best to extend the file to its maximum size already at the creation of the file, because then we can avoid dynamically extending the file when more space is needed for the tablespace.”

The last statement about avoiding dynamic extension is applicable only to the redo log files and not the data files. Data files are dynamically extended, but redo log files are pre-allocated. Also, as already mentioned earlier, only the system tablespace can have more than one data file.

It is also clearly mentioned that even though the tablespace can have multiple files, they are thought of as one single large file concatenated together. So the order of files within the tablespace is important.


Pages

A data file is logically divided into equal-sized pages. The first page of the first data file is identified with page number 0, the next page is 1, and so on. A page within a tablespace is uniquely identified by the page identifier or page number (page_no), and each tablespace is uniquely identified by the tablespace identifier (space_id). So a page is uniquely identified throughout InnoDB by the (space_id, page_no) combination, and any location within InnoDB can be uniquely identified by the (space_id, page_no, page_offset) combination, where page_offset is the number of bytes within the given page.
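
As a quick illustration of page addressing, with the default 16KB page size the byte position of a (page_no, page_offset) pair within the tablespace's (logically concatenated) data is simple arithmetic. This is plain C for illustration, not an InnoDB function:

```c
#include <assert.h>

#define PAGE_SIZE 16384UL  /* the default InnoDB page size */

/* Byte position of (page_no, page_offset) within the tablespace data,
   treating the chained data files as one large concatenated file. */
static unsigned long
byte_position(unsigned long page_no, unsigned long page_offset)
{
        return page_no * PAGE_SIZE + page_offset;
}
```

For example, byte 100 of page 3 sits at offset 3 * 16384 + 100 = 49252 from the start of the tablespace.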

How the pages from different data files relate to one another is explained in another source code comment: “A block's position in the tablespace is specified with a 32-bit unsigned integer. The files in the chain are thought to be catenated, and the block corresponding to an address n is the nth block in the catenated file (where the first block is named the 0th block, and the incomplete block fragments at the end of files are not taken into account). A tablespace can be extended by appending a new file at the end of the chain.” This means that the first page of all the data files will not have page_no of 0 (zero). Only the first page of the first data file in a tablespace will have the page_no as 0 (zero).

Also in the above comment it is mentioned that the page_no is a 32-bit unsigned integer. This is the size of the page_no when stored on the disk.

Every page has a page header (page_header_t). For more details on this please refer to Jeremy Cole's blog “The basics of InnoDB space file layout”.


Extents

An extent is 1MB of consecutive pages. The size of one extent in pages is defined as follows (1048576 bytes = 1MB):

/** File space extent size (one megabyte) in pages */ 
#define FSP_EXTENT_SIZE         (1048576 / UNIV_PAGE_SIZE) 

The macro UNIV_PAGE_SIZE used to be a compile-time constant. From mysql-5.6 onwards it is a global variable. The number of pages in an extent depends on the page size used. If the page size is 16K (the default), then an extent contains 64 pages.
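
The arithmetic in the paragraph above can be checked with a trivial sketch (plain C, not InnoDB code):

```c
#include <assert.h>

/* An extent is 1MB (1048576 bytes) of consecutive pages, so the page
   count per extent is simply 1MB divided by the page size. */
static unsigned long
pages_per_extent(unsigned long page_size)
{
        return 1048576UL / page_size;
}
```

With the default 16K page the result is 64 pages per extent; smaller pages yield proportionally more pages per extent.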

Page Types

One page can be used for many purposes. The page type will identify the purpose for which the page is being used. The page type of each page will be stored in the page header. The page types are available in the header file storage/innobase/include/fil0fil.h. The following table provides a brief description of the page types.

Page Type                       Description

FIL_PAGE_INDEX                  The page is a B-tree node.
FIL_PAGE_UNDO_LOG               The page stores undo logs.
FIL_PAGE_INODE                  The page contains an array of fseg_inode_t objects.
FIL_PAGE_IBUF_FREE_LIST         The page is in the free list of the insert buffer or change buffer.
FIL_PAGE_TYPE_ALLOCATED         Freshly allocated page.
FIL_PAGE_IBUF_BITMAP            Insert buffer or change buffer bitmap.
FIL_PAGE_TYPE_SYS               System page.
FIL_PAGE_TYPE_TRX_SYS           Transaction system data.
FIL_PAGE_TYPE_FSP_HDR           File space header.
FIL_PAGE_TYPE_XDES              Extent descriptor page.
FIL_PAGE_TYPE_BLOB              Uncompressed BLOB page.
FIL_PAGE_TYPE_ZBLOB             First compressed BLOB page.
FIL_PAGE_TYPE_ZBLOB2            Subsequent compressed BLOB page.

Each page type is used for different purposes. It is beyond the scope of this article, to explore each page type. For now, it is sufficient to know that all pages have a page header (page_header_t) and they store the page type in it, and based on the page type the contents and the layout of the page would be decided.

Tablespace Header

Each tablespace has a header of type fsp_header_t. This data structure is stored in the first page of a tablespace and contains the following information:

  • The table space identifier (space_id)

  • Current size of the table space in pages.

  • List of free extents

  • List of full extents not belonging to any segment.

  • List of partially full/free extents not belonging to any segment.

  • List of pages containing segment headers, where all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE)

  • List of pages containing segment headers, where not all the segment inode slots are reserved. (pages of type FIL_PAGE_INODE).

InnoDB Tablespace Header Structure

From the tablespace header, we can access the list of segments available in the tablespace. The total space occupied by the tablespace header is given by the macro FSP_HEADER_SIZE, which is equal to 16*7 = 112 bytes.

Reserved Pages of Tablespace

As mentioned earlier, InnoDB will always contain one tablespace called the system tablespace with identifier 0. This is a special tablespace and is always kept open as long as the MySQL server is running. The first few pages of the system tablespace are reserved for internal usage. This information can be obtained from the header storage/innobase/include/fsp0types.h. They are listed below with a short description.

Page Number     The Page Name                   Description

0               FSP_XDES_OFFSET                 The extent descriptor page.
1               FSP_IBUF_BITMAP_OFFSET          The insert buffer bitmap page.
2               FSP_FIRST_INODE_PAGE_NO         The first inode page.
3               FSP_IBUF_HEADER_PAGE_NO         Insert buffer header page, in system tablespace.
4               FSP_IBUF_TREE_ROOT_PAGE_NO      Insert buffer B-tree root page, in system tablespace.
5               FSP_TRX_SYS_PAGE_NO             Transaction system header, in system tablespace.
6               FSP_FIRST_RSEG_PAGE_NO          First rollback segment page, in system tablespace.
7               FSP_DICT_HDR_PAGE_NO            Data dictionary header page, in system tablespace.

As can be noted from above, the first 3 pages are reserved in every tablespace, but the last 5 pages are reserved only in the case of the system tablespace. In other tablespaces, only 3 pages are reserved.

When the option innodb_file_per_table is enabled, then for each table a separate tablespace with one data file would be created. The source code comment in the function dict_build_table_def_step() states the following:

                /* We create a new single-table tablespace for the table. 
                We initially let it be 4 pages: 
                - page 0 is the fsp header and an extent descriptor page, 
                - page 1 is an ibuf bitmap page, 
                - page 2 is the first inode page, 
                - page 3 will contain the root of the clustered index of the 
                table we create here. */ 

File Segments

A tablespace can contain many file segments. File segments (or simply segments) are logical entities. Each segment has a segment header (fseg_header_t), which points to the inode (fseg_inode_t) describing the file segment. The file segment header contains the following information:

  • The space to which the inode belongs

  • The page_no of the inode

  • The byte offset of the inode

  • The length of the file segment header (in bytes).

Note: It would have been more readable (at the source code level) if fseg_header_t and fseg_inode_t had proper C-style structures defined for them.

The fseg_inode_t object contains the following information:

  • The segment id to which it belongs.

  • List of full extents.

  • List of free extents of this segment.

  • List of partially full/free extents

  • Array of individual pages belonging to this segment. The size of this array is half an extent.

When a segment wants to grow, it will get free extents or pages from the tablespace to which it belongs.


Tables

In InnoDB, when a table is created, a clustered index (B-tree) is created internally. This B-tree contains two file segments, one for the non-leaf pages and the other for the leaf pages. From the source code documentation:

“In the root node of a B-tree there are two file segment headers. The leaf pages of a tree are allocated from one file segment, to make them consecutive on disk if possible. From the other file segment we allocate pages for the non-leaf levels of the tree.”

For a given table, the root page of a B-tree will be obtained from the data dictionary. So in InnoDB, each table exists within a tablespace, and contains one B-tree (the clustered index), which contains 2 file segments. Each file segment can contain many extents, and each extent contains 1MB of consecutive pages.


This article discussed the details about the data organization within InnoDB. We first looked at the files created by InnoDB, and then discussed about the various logical entities like tablespaces, pages, page types, extents, segments and tables. We also looked at the relationship between each one of them.


Tuesday Apr 10, 2012

InnoDB 2012 Spring Labs Release

Note: this article was originally published on April 10, 2012 by Calvin Sun.

The InnoDB team is pleased to announce the 2012 Spring labs release, with several much-anticipated new features and performance enhancements. Please download mysql-5.6-labs-april-2012 from MySQL Labs and give it a try. Do not forget to provide your feedback.

The 2012 Spring labs release on MySQL Labs consists of the following InnoDB new features, which are not in the newly released MySQL 5.6.5 DMR yet:

  • Online DDL: some of the DDLs are now truly online, including ADD INDEX, SET DEFAULT, and DROP FOREIGN KEY.
  • Memcached plugin: with additional features, such as SASL support.
  • Transportable tablespace: allow user to export data files and import them into another MySQL instance.
  • Persistent statistics ON/OFF switch: the ability of controlling persistent statistics on table level.
  • Option for specifying locations of InnoDB tables: allows user to choose the location of specific tables.

This labs release also includes several performance and scalability improvements, especially on modern CPUs:

  • Reduced false sharing
  • Configurable fast mutexes
  • my_hash_sort_simple() loop unrolling
  • Improved adaptive flushing
  • Improved neighbor flushing

With those improvements, InnoDB read-only performance reaches a new high. Please see Mikael’s blog for some of the improvements. You can find the benchmark results on DimitriK’s blog. The InnoDB team will also continue publishing technical details in the coming days on this site.

We intend to include those new features and improvements in future development milestone releases and GA releases. Thanks for being interested in InnoDB!

Friday Dec 23, 2011

Improving InnoDB memory usage continued

Note: this article was originally published on Dec 23, 2011 by Vasil Dimov.

Continues from Improving InnoDB memory usage.

Here are some numbers from the fixups described in the above article:

The workload consists of 10 partitioned tables, each one containing 1000 partitions. This means 10’000 InnoDB tables. We truncate the tables, then restart mysqld and run:

1. INSERT a single row into each of the 10 tables
2. SELECT * from each table
3. FLUSH TABLES (this causes the tables to be closed and reopened on the next run)
4. wait for 10 seconds

We repeat the above steps 10 times. Here is the total memory consumption of mysqld with a 1GB InnoDB buffer pool during the workload:

In the fixed case (green line) you can clearly see the peaks when the commands 1. – 4. are run and the valley bottoms during the 10 seconds waits.

Notice that no memory leaks were fixed in these improvements. This is all about the interaction between InnoDB and the system allocator. InnoDB always frees the memory it has allocated, but the problem was that the system allocator could not reuse the freed memory optimally due to memory fragmentation. We mainly reduced the number and the sizes of some of the hottest allocations, mitigating the malloc/free storm.

Notice also that this is a somewhat unusual workload – 10’000 InnoDB tables, a row size of about 64KB and frequent FLUSH TABLES.

Tuesday Dec 20, 2011

Improving InnoDB memory usage

Note: this article was originally published on Dec 20, 2011 by Vasil Dimov.

Last month we did a few improvements in InnoDB memory usage. We solved a challenging issue about how InnoDB uses memory in certain places of the code.

The symptom of the issue was that under certain workloads the memory used by InnoDB kept growing infinitely, until the OOM killer kicked in. It looked like a memory leak, but Valgrind wasn’t reporting any leaks and the issue was not reproducible on FreeBSD – it only happened on Linux (see Bug#57480). The latter fact in particular led us to think that there is something in the InnoDB memory usage pattern that reveals a nasty side of the otherwise good-natured Linux memory manager.

It turned out to be an interesting memory fragmentation caused by a storm of malloc/free calls of various sizes. We had to track and analyze each call to malloc during the workload, including the code path that led to it. We collected a huge set of analysis data – some code paths were executed many 10’000s of times! A hurricane of allocations and deallocations! We looked at the hottest ones hoping that some of them were not necessary and could be eliminated, avoided, minimized or grouped together. Luckily there were plenty of them!

After extensive testing we made numerous improvements: allocating the smallest chunks of memory from the stack instead of from the heap, grouping allocations together where possible, removing unnecessary allocations altogether, and estimating exactly how much memory a given operation will consume so it can be allocated in advance, among others.

This not only fixed Bug#57480 but improved InnoDB memory usage in general.

Note: the fix is not in the 5.6.4 release.

Continues with some numbers here.

Wednesday Sep 28, 2011

InnoDB at Oracle OpenWorld

Note: this article was originally published on Sept 28, 2011 by Calvin Sun.

Sunny and I will be presenting at Oracle OpenWorld next week:

  • Introduction to InnoDB, MySQL’s Default Storage Engine,  10/04/11 Tuesday 01:15 PM,   Marriott Marquis – Golden Gate C3,     Calvin Sun
  • InnoDB Performance Tuning,  10/04/11 Tuesday  03:30 PM,   Marriott Marquis – Golden Gate C2,   Sunny Bains

The first session is for beginners, who are new to InnoDB and MySQL. The second session will cover many new performance features in MySQL 5.5 and 5.6, and share some tuning tips to maximize MySQL performance.

Want to learn more about MySQL? There will be something for everyone. Come join us!

Wednesday Jul 27, 2011

InnoDB 2011 Summer Labs Releases

Note: this article was originally published on July 27, 2011 by Calvin Sun.

In April of 2011, the InnoDB team published early access to NoSQL to InnoDB with memcached, plus several new features, as part of the MySQL 5.6.2 milestone release. This week, we announced additional early access to new InnoDB features for the community to test and provide feedback on.

There are two release packages from the InnoDB team on MySQL Labs: InnoDB full-text search, and InnoDB new features.

InnoDB Full-Text Search

MySQL 5.5 makes InnoDB the default storage engine, so everyone can benefit from ACID-compliant transactions, referential integrity, and crash recovery. However, some users need InnoDB to have built-in full-text search, similar to MyISAM’s full-text search.

InnoDB full-text search provides users with the ability to build full text indices and search for specific text-based content stored in InnoDB tables.  This new functionality supports fast and accurate search on document content using natural language, boolean, wildcard, and proximity search.

The design and implementation of InnoDB full-text search trace back to 2005, when Osku Salerma detailed the design in his master’s thesis “Design of a Full Text Search index for a database management system”. Later, Sunny Bains and Jimmy Yang from the InnoDB team took over the development and made major contributions to this important feature.

Jimmy gave an overview of InnoDB full-text search and the main differences in design between InnoDB full-text search and MyISAM full-text search. John provided a set of examples in the tutorial. As for the performance of InnoDB full-text search, you can find out in Vinay and Jimmy’s article.

Please download mysql-5.6-labs-innodb-fts from MySQL Labs and give it a try.

InnoDB New Features

The package mysql-5.6-labs-innodb-features on MySQL Labs consists of a set of new InnoDB features since the MySQL 5.6.2 milestone release, except InnoDB full-text search. Some of the new features are already in the MySQL server main development tree, and the rest are intended to move into the main development tree in future development milestone releases and GA releases.

The new InnoDB features included in this package are:

  • Increase the max size of redo log files from 4GB to 2TB
  • Reduce contention during file extension
  • Make deadlock detection non-recursive
  • Improve thread scheduling
  • Change rw-lock to mutex for trx_sys_t
  • Option to preload InnoDB buffer pool
  • Allow UNDO logs to reside in their own tablespace
  • Reintroduce random readahead
  • Support smaller page sizes (4K & 8K)
  • Increase the max length of prefix index from 767 bytes to 3072 bytes

In addition to continuing to improve InnoDB performance and scalability, we are also focusing on optimizing InnoDB for flash drives. On flash drives, InnoDB can benefit from new features such as larger REDO log files, separate UNDO logs, smaller page sizes, and a preloaded buffer pool.

Group commit with binlog is released separately by MySQL replication team, also on MySQL Labs.

Want to learn the details of InnoDB new features? Download mysql-5.6-labs-innodb-features from MySQL Labs, play with it, and read the blogs from InnoDB engineers:

BTW, do not forget to send us feedback. Thanks for being interested in InnoDB!

Friday May 06, 2011

InnoDB blogs, slides, etc.

Note: this article was originally published on May 6, 2011 by Calvin Sun.

Last month was one of the busiest months for the InnoDB team:

The slides of the three talks are now available:

In addition to the blogs from the InnoDB team, Dimitri published a five-part series on MySQL 5.6 Performance, all on DimitriK’s (dim) Weblog:

  1. MySQL Performance: 5.6 Notes, part 5 – Fixing Adaptive Flushing..
  2. MySQL Performance: 5.6 Notes, part 4 – Fixing Purge issue..
  3. MySQL Performance: 5.6 Notes, part 3 – More in depth..
  4. MySQL Performance: 5.6 Notes, part 2 – Under full dbSTRESS workload…
  5. MySQL Performance: 5.6 Notes, part 1 – Discovery…

As usual, Dimitri’s blogs are full of nice charts, detailed analysis, and sometimes critiques. In part 3, he talked about his experience with the InnoDB METRICS table, a new feature in the MySQL 5.6.2 release.

Øystein Grøvlen, a member of the MySQL optimizer team, published an article on InnoDB Persistent Statistics: InnoDB Persistent Statistics Save the Day, also a new feature in the MySQL 5.6.2 release.

The blogs written by the InnoDB team:

  1. MySQL 5.6: Multi threaded purge, by Sunny Bains
  2. Tips and Tricks for Faster DDL, by Marko Mäkelä
  3. MySQL 5.6: Data dictionary LRU, by Sunny Bains
  4. Introducing page_cleaner thread in InnoDB, by Inaam Rana
  5. Information Schema for InnoDB System Tables, by Jimmy Yang
  6. InnoDB Persistent Statistics at last, by Vasil Dimov
  7. MySQL 5.6: InnoDB scalability fix – Kernel mutex removed, by Sunny Bains
  8. NoSQL to InnoDB with Memcached, by Calvin Sun
  9. Get started with InnoDB Memcached Daemon plugin, by Jimmy Yang
  10. Get started with InnoDB Metrics Table, by Jimmy Yang

Sunday Sep 19, 2010

MySQL 5.5: InnoDB as Default Storage Engine

Note: this article was originally published on Sept 19, 2010 by Calvin Sun.

MySQL has a well-earned reputation for being easy to use and delivering performance and scalability. In previous versions, MyISAM was the default storage engine. In our experience, most users never changed the default settings. With MySQL 5.5, InnoDB becomes the default storage engine. Again, we expect most users will not change the default settings. But, because of InnoDB, the default settings deliver the benefits users expect from their RDBMS: ACID Transactions, Referential Integrity, and Crash Recovery. Let’s explore how using InnoDB tables improves your life as a MySQL user, DBA, or developer.


In the first years of MySQL growth, early web-based applications didn’t push the limits of concurrency and availability. In 2010, hard drive and memory capacity and the performance/price ratio have all gone through the roof. Users pushing the performance boundaries of MySQL care a lot about reliability and crash recovery. MySQL databases are big, busy, robust, distributed, and important.

InnoDB hits the sweet spot of these top user priorities. The trend of storage engine usage has shifted in favor of the more efficient InnoDB. With MySQL 5.5, the time is right to make InnoDB the default storage engine.

The Big Change

Starting from MySQL 5.5.5, the default storage engine for new tables is InnoDB. This change applies to newly created tables that don’t specify a storage engine with a clause such as ENGINE=MyISAM. (Given this change of default behavior, MySQL 5.5 might be a logical point to evaluate whether your tables that do use MyISAM could benefit from switching to InnoDB.)

The mysql and information_schema databases, which implement some of the MySQL internals, still use MyISAM. In particular, you cannot switch the grant tables to use InnoDB.

Benefits of InnoDB Tables

If you use MyISAM tables but aren’t tied to them for technical reasons, you’ll find many things more convenient when you use InnoDB tables in MySQL 5.5:

  • If your server crashes because of a hardware or software issue, regardless of what was happening in the database at the time, you don’t need to do anything special after restarting the database. InnoDB automatically finalizes any changes that were committed before the time of the crash, and undoes any changes that were in process but not committed. Just restart and continue where you left off.
  • The InnoDB buffer pool caches table and index data as the data is accessed. Frequently used data is processed directly from memory. This cache applies to so many types of information and speeds up processing so much that dedicated database servers assign most of their physical memory to it.
  • If you split up related data into different tables, you can set up foreign keys that enforce referential integrity. Update or delete data, and the related data in other tables is updated or deleted automatically. Try to insert data into a secondary table without corresponding data in the primary table, and it gets kicked out automatically.
  • If data becomes corrupted on disk or in memory, a checksum mechanism alerts you to the bogus data before you use it.
  • When you do a little advance planning to decide on primary key columns for each table, lots of operations involving those columns are automatically optimized. It is very fast to reference the primary key columns in WHERE clauses, ORDER BY clauses, GROUP BY clauses, and join operations.
  • Inserts, updates, deletes are optimized by an automatic mechanism called change buffering. InnoDB not only allows concurrent read and write access to the same table, it caches changed data to streamline disk I/O.
  • Performance benefits are not limited to giant tables with long-running queries. When the same rows are accessed over and over from a table, a feature called the Adaptive Hash Index takes over to make these lookups even faster, as if they came out of a hash table.

Best Practices for InnoDB Tables

If you have been using InnoDB for a long time, you already know about features like transactions and foreign keys. If not, read about them in the MySQL Reference Manual. To make a long story short:

  • Specify a primary key for every table using the most frequently queried column or columns, or an auto-increment value if there isn’t an obvious primary key.
  • Embrace the idea of joins, where data is pulled from multiple tables based on identical ID values from those tables. For fast join performance, define foreign keys on the join columns, and declare those columns with the same datatype in each table. The foreign keys also propagate deletes or updates to all affected tables, and prevent insertion of data in a child table if the corresponding IDs are not present in the parent table.
  • Turn off autocommit. Committing hundreds of times a second puts a cap on performance (limited by the write speed of your storage device).
  • Bracket sets of related changes, “logical units of work”, with START TRANSACTION and COMMIT statements. While you don’t want to commit too often, you also don’t want to issue huge batches of INSERT, UPDATE, or DELETE statements that run for hours without committing.
  • Stop using LOCK TABLE statements. InnoDB can handle multiple sessions all reading and writing to the same table at once, without sacrificing reliability or high performance.
  • Enable the innodb_file_per_table option to put the data and indexes for individual tables into separate files, instead of in a single giant system tablespace. (This setting is required to use some of the other features, such as table compression and fast truncation.)
  • Evaluate whether your data and access patterns benefit from the new InnoDB table compression feature (ROW_FORMAT=COMPRESSED on the CREATE TABLE statement). You can compress InnoDB tables without sacrificing read/write capability.
  • Run your server with the option --sql_mode=NO_ENGINE_SUBSTITUTION to prevent tables being created with a different storage engine if there is an issue with the one specified in the ENGINE= clause of CREATE TABLE.
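Several of the practices above (a primary key on every table, foreign keys on join columns with matching datatypes, and explicit transactions instead of autocommit) can be illustrated with a short sketch. The table and column names here are hypothetical, not from any particular application:

```sql
-- Parent and child tables; the join column has the same datatype in both.
CREATE TABLE parent (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100)
) ENGINE=InnoDB;

CREATE TABLE child (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    parent_id INT NOT NULL,
    note      VARCHAR(100),
    FOREIGN KEY (parent_id) REFERENCES parent (id)
        ON DELETE CASCADE   -- deletes in parent propagate to child rows
) ENGINE=InnoDB;

-- Bracket a logical unit of work instead of committing each statement.
START TRANSACTION;
INSERT INTO parent (name) VALUES ('example');
INSERT INTO child (parent_id, note) VALUES (LAST_INSERT_ID(), 'detail');
COMMIT;
```

Inserting a child row whose parent_id has no matching parent row would fail with a foreign key error, which is exactly the referential integrity guarantee described above.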

Recent Improvements for InnoDB Tables (from the Plugin Era)

If you have experience with InnoDB, but not the recent incarnation known as the InnoDB Plugin, read about those new features in the MySQL Reference Manual too. To make a long story short:

  • You can compress tables and associated indexes.
  • You can create and drop indexes with much less performance or availability impact than before.
  • Truncating a table is very fast, and can free up disk space for the operating system to reuse, rather than freeing up space within the system tablespace that only InnoDB could reuse.
  • The storage layout for table data is more efficient for BLOBs and long text fields.
  • You can monitor the internal workings of the storage engine by querying INFORMATION_SCHEMA tables.
  • You can monitor the performance details of the storage engine by querying PERFORMANCE_SCHEMA tables.
  • There are many, many performance improvements.

Performance Improvements for InnoDB Tables

  • Crash recovery, the automatic process that makes all data consistent when the database is restarted, is fast and reliable. (Now much, much faster than long-time InnoDB users are used to.) The bigger the database, the more dramatic the speedup.
  • Most new performance features are automatic, or at most require setting a value for a configuration option. Read details here. Advanced users can review the InnoDB configuration options.
  • Review the InnoDB-specific tuning techniques.

Testing and Benchmarking

It’s never too early to prepare for InnoDB as the default storage engine. You can preview whether your database server or application works correctly with MySQL 5.1 or even earlier. To set up InnoDB as the default storage engine, either specify --default-storage-engine=InnoDB on the command line, or add default-storage-engine=innodb to the [mysqld] section of your my.cnf file, then restart the server.
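For reference, the option-file form is a one-line addition to the server section of my.cnf (any other settings you already have there are unaffected):

```
[mysqld]
default-storage-engine=innodb
```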

Since changing the default storage engine only affects new tables as they are created, run all your application installation and setup steps to confirm that everything installs properly. Then exercise all the application features to make sure all the data loading, editing, and querying features work. If a table relies on some MyISAM-specific feature, you’ll receive an error; add the ENGINE=MyISAM clause to the CREATE TABLE statement to avoid the error. (For example, tables that rely on full-text search must be MyISAM tables rather than InnoDB ones.)

If you didn’t make a deliberate decision about the storage engine, and you just want to preview how certain tables work when they’re created under InnoDB, issue the command ALTER TABLE table_name ENGINE=InnoDB; for each table. Or, to run test queries and other statements without disturbing the original table, make a copy like so:
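A minimal sketch of such a copy (the table names are placeholders for your own) could be:

```sql
-- Build an InnoDB copy of the data, leaving the original table untouched.
-- Note: CREATE TABLE ... AS SELECT copies the rows but not indexes or
-- foreign keys, which is usually fine for what-if testing.
CREATE TABLE test_table ENGINE=InnoDB AS SELECT * FROM original_table;
```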


Since there are so many performance enhancements in the InnoDB that’s part of MySQL 5.5, to get a true idea of the performance with a full application under a realistic workload, you’ll want to install the real MySQL 5.5 and run benchmarks. (Those who use the InnoDB Plugin with MySQL 5.1 have already seen some of these improvements, but to keep things simple, this article focuses on 5.5, where the latest and greatest InnoDB is fully integrated.)

Test the full application lifecycle, from installation, through heavy usage, and server restart. (For bonus points, kill the server process while the database is busy to simulate a power failure, and verify that the data is recovered successfully when you restart the server.)

Test any replication configurations, especially if you use different MySQL versions and options on the master and the slaves, because some of the InnoDB options have different defaults than in previous releases.


To know what the status of InnoDB is, whether you’re doing what-if testing with an older MySQL or comprehensive testing with the real MySQL 5.5:

  • Issue the command SHOW VARIABLES LIKE 'have_innodb'; to confirm that InnoDB is available at all. If the result is NO, you have a mysqld binary that was compiled without InnoDB support and you need to get a different one. If the result is DISABLED, go back through your startup options and configuration file and get rid of any skip-innodb option.
  • Issue the command SHOW ENGINES; to see all the different MySQL storage engines. You want to see DEFAULT in the InnoDB line.

InnoDB Revision History

Note: this article was originally published on Sept 19, 2009 by Vasil Dimov.

This is a brief overview of the history of InnoDB with respect to the Version Control Systems (VCS) used for its development. It could be useful to people who want to trace back in time to find the origins of bugs or features.

Early days
InnoDB was born in the mid 1990s when Heikki Tuuri started developing this masterpiece. Heikki was a lone developer at that time and he did not use any VCS. There is no InnoDB revision history before 2001.

2001 – 2005
Then in 2001 InnoDB was imported into MySQL’s BitKeeper repository and development continued, recording the history there.

2005 – 2010
Later on in Oct 2005 when Oracle acquired Innobase, InnoDB developers had to move away from MySQL’s BitKeeper repository and Subversion was chosen for InnoDB development. The latest sources from BitKeeper were imported into SVN without preserving the history and Innobase started sending snapshots of InnoDB to MySQL. In other words – InnoDB devs committed changesets into SVN and then batches of those changesets were sent to MySQL. But those batches were applied as a single changeset in BitKeeper with a message like “Apply InnoDB snapshot XYZ” which made the revision history of InnoDB in BitKeeper after Oct 2005 meaningless.

In the meantime MySQL switched from BitKeeper to Bazaar, preserving all the history that BitKeeper had. This did not affect InnoDB development and revision history, except that Bazaar was easier to use than the free and limited BitKeeper client. The snapshot process continued.

To summarize: until early 2010 InnoDB history was split in two – the pre-2005 part was in Bazaar and the post-2005 part was in Subversion.

2010 – present
In Mar 2010, after Oracle acquired Sun and MySQL, InnoDB development moved into Bazaar. We wanted to keep the precious history from SVN, but it was not possible to retrospectively swap the 2005-2010 history in Bazaar which contained changesets like “Apply InnoDB snapshot XYZ” with the meaningful history from Subversion. Instead, the InnoDB directory in the MySQL source tree was deleted and all revisions from SVN were applied one by one on top of the empty directory.

This way commands like “bzr blame” and “bzr log” show the relevant bits from the SVN history, back to 2005. To dive before 2005, one must branch the MySQL source as it was before the SVN import (e.g. bzr branch -rdate:2010-03-01). Branching the pre-SVN-import tree gives access to the 2001-2005 real history and the 2005-2010 “apply snapshot” history.

Friday Apr 16, 2010

Post-Conference Roundup of InnoDB-related Info

Note: this article was originally published on April 16, 2010 by John Russell.

What a busy week! Lots of MySQL 5.5 announcements that just happened to coincide with the MySQL Conference and Expo in Silicon Valley. Here are some highlights of the performance and scalability work that the InnoDB team was involved with.

A good prep for the week of news is the article Introduction to MySQL 5.5, which includes information about the major performance and scalability features. That article will lead you into the MySQL 5.5 manual for general features and the InnoDB 1.1 manual for performance & scalability info.

Then there were the conference presentations from InnoDB team members, which continued the twin themes of performance and scalability:

  • InnoDB: Status, Architecture, and New Features: Slides
  • InnoDB Plugin: Performance Features and Benchmarks: Slides
  • What’s New in MySQL 5.5? Performance/Scale Unleashed!: Slides
  • What’s New in MySQL 5.5?: Performance and Scalability Benchmarks: Slides
  • Introduction to InnoDB Monitoring System and Resource & Performance Tuning: Slides
  • Backup Strategies with InnoDB Hot Backup: Slides

We hope that a good and useful time was had by all. Best regards to our European friends and colleagues whose return plans were disrupted by the Icelandic volcano!

Thursday Apr 01, 2010

InnoDB Plugin Doc now on

Note: this article was originally published on April 1, 2010 by John Russell.

The InnoDB Plugin manual is now available on the MySQL web site.

This is the InnoDB team blog.

