Friday Sep 13, 2013

InnoDB Redundant Row Format


This article describes the InnoDB redundant row format. If you are new to InnoDB code base (a new developer starting to work with InnoDB), then this article is for you. I'll explain the row format by making use of a gdb session. An overview of the article is given below:

  • Create a simple table and populate few rows.
  • Access the page that contains the rows inserted.
  • Access a couple of rows and explain its format.
  • Give summary of redundant row format.
  • Useful gdb commands to analyse the InnoDB rows.
  • Look at a GNU Emacs Lisp function to traverse rows in an InnoDB index page.

To get the most out of this article, the reader is expected to repeat the gdb session as described here.

The Schema

Consider the following SQL statements to produce the schema:

CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;
INSERT INTO t1 VALUES (1), (2), (3), (4), (5);

When the above CREATE TABLE is executed, InnoDB internally creates a clustered B-tree index (or just clustered index) representing the table. If no primary key is explicitly given, internally a primary key is added. Each node of B-tree is represented by one page (which is by default is of size 16KB). In the given scenario, we insert only 5 records. So the B-tree will consist of only 1 node, the root node.

Access InnoDB Index Page in gdb

All select queries will go through the row_search_for_mysql() function of InnoDB. When the first time it is called for a query, a persistent cursor (btr_pcur_t) will be opened by this function. When this B-tree persistent cursor (pcur in the following gdb) points to the InnoDB index page containing the rows of our interest, we can save the page frame to a file as follows:

(gdb) set $frame = pcur->btr_cur->page_cur->block->frame
(gdb) dump binary memory /tmp/gdb.anna $frame $frame + srv_page_size
(gdb) p srv_page_size
$10 = 16384

If you do not understand the commands above, refer to the gdb manual for more information. Saving a page in binary format in a file on disk helps us to analyse the contents of the page by loading it in a text editor like GNU Emacs or ViM. This is sometimes a very useful approach for debugging. The above commands were executed when I was on the sync point row_search_rec_loop, in the function row_search_for_mysql(). The line numbers are from mysql 5.7.3.

4155         } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
4156                 btr_pcur_open_at_index_side(
4157                         mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
4158                         pcur, false, 0, &mtr);
4159         }
4161 rec_loop:
4162         DEBUG_SYNC_C("row_search_rec_loop");
4163         if (trx_is_interrupted(trx)) {
4164                 btr_pcur_store_position(pcur, &mtr);
4165                 err = DB_INTERRUPTED;
4166                 goto normal_return;
4167         }

An important point to remember when analysing the page is that all the data in the page are stored in big-endian format.

Open Page in GNU Emacs

The following image shows the InnoDB page opened in GNU Emacs in hexl-mode major mode. I have placed the cursor on the origin of the infimum record (at offset 0x65 from page origin). In the right side, you can look at the text "infimum" and "supremum" as stored in the page. These are special records.

All rows in an InnoDB page are linked to one another in ascending order via the next record pointer. In the given example this ordering is based on the internally generated DB_ROW_ID. In each page, the first record (or the minimum record) is the infimum record, and the last record (or the maximum record) is the supremum record. The records between the infimum and supremum records are called the user records.

The next record pointer of each record points to the "origin" of the next record. The record contains data to the right of the origin and meta data to the left of the origin. The next record pointer is available in the 2 bytes before the record origin.

The next record pointer of infimum as given in the picture above is 0x0087 (which is 135 in decimal). This is the offset of the next record from the beginning of the page.

Printing the infimum record

As mentioned previously, infimum is a special minimum record in an index page. It exists at a fixed offset of 101 bytes from the page beginning. The function page_get_infimum_rec() can be used to get the infimum record given a page frame pointer. All records in InnoDB are pointed to at their record origin. So to access the meta data in the record, we need to subtract offset from the record origin.

Printing the infimum record given a page frame pointer (in char):

(gdb) p  pcur->btr_cur->page_cur->block->frame+101
$41 = (unsigned char *) 0x7fffd1d3c065 "infimum"

Printing the infimum record given a page frame pointer (in hex):

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101)@7
$45 = {0x69, 0x6e, 0x66, 0x69, 0x6d, 0x75, 0x6d}

Printing the 6 byte Informational Bit Fields

Given the origin of the infimum record, we can print the 6 bytes of informational bits as follows (in hex).

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101-6)@6
$50 = {0x1, 0x0, 0x0, 0x3, 0x0, 0x87}

Printing the 6 bytes of informational bits of the infimum record (in binary). In this output, the right most byte displayed is the first byte before the record origin.

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+101-6
0x7fffd1d3c05f: 00000001  00000000   00000000  00000011  00000000  10000111

The above 6 bytes bit fields of the record is formatted as follows: (it also contains the actual values as seen above)

4 bits
info bits
byte: 6
4 bits
byte: 6
13 bits
heap number
byte: 4,5
00000000 00000
10 bits
field count
byte: 3,4
000 0000001
1 bit
byte: 3
2 bytes
next record offset
byte: 1, 2
00000000 10000111

Each of the box above represents one bit field. It contains the following information:

The size of bit field in bits or bytes
A short description of the bit field.
The byte where this bit field exists.
The actual value of this bit field as seen in above gdb output

The 1 bit flag above is used to tell whether the offset array element is 1 byte (if set) or 2 bytes (if not set). In our example case, the 1 bit flag is set and hence the offset array elements are each 1 byte long.

From the field count bit field, we can see that the infimum record has only one field, whose value is fixed to "infimum". The next record pointer of the infimum record points to the first user record in the page. This is available at the offset 0x0087, which is 135 in decimal. So the first user record is available at page frame pointer + 135 bytes.

The Bit Fields

The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of bit field data is listed here. These six bytes are available immediately before the record origin. These bit fields must be accessed by making use of the interface provided by the record manager (refer to rem0rec.h).

The Bit Field Size of Bit Field Description of the Bit Field
info bits 4 bits Used to delete mark a record
n_owned 4 bits Number of records owned by this record. This is related to page directory. To understand this field, we need to understand the page structure used by InnoDB. This is not explained further in this article.
heap number 13 bits Gives the order number of this record in the page. Each row has a unique heap number in the page. The lock manager uses this heap number.
field count 10 bits Number of fields in this record.
flag 1 bit Tells if the offset array element is 1 or 2 bytes.
next record offset 2 bytes Offset of the next record from page origin.

Printing the First User Record

To first parse the record we need to know the number of records in the row. This can be obtained from the 6 bytes of bit-field information available before the record origin. Let us first see this for the user record:

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+135-6
0x7fffd1d3c081: 00000000  00000000  00010000  00001001  00000000  10101000

4 bits
info bits
byte: 6
4 bits
byte: 6
13 bits
heap number
byte: 4,5
00000000 00010
10 bits
field count
byte: 3,4
000 0000100
1 bit
byte: 3
2 bytes
next record offset
byte: 1, 2
00000000 10101000

The above information shows that this record has 4 fields (000 0000100 in binary). But our table definition was

CREATE TABLE t1 (f1 unsigned int) row_format=redundant engine=innodb;

If you are wondering why a row of this table has 4 field count, it is because InnoDB added 3 internal fields to the table t1 - a row identifier (DB_ROW_ID), transaction identifier (DB_TRX_ID) and a undo roll pointer (DB_ROLL_PTR). A row identifier column is internally added to a table if no primary key is explicitly given by the user. This row identifier becomes the internal primary key. The transaction identifier specifies the transaction that has last updated the row. The undo roll pointer points to the previous versions of the record. The 4th field is the actual column f1 added by the user.

The list of fields in the table t1 can be printed by accessing its associated dict_table_t object (or its clustered index dict_index_t object). In the following, the name GEN_CLUST_INDEX is the name of the clustered index of the table t1. This is the standard name of the clustered index if it was internally generated by InnoDB.

(gdb) p prebuilt->index->name       
$75 = 0x7fffa00232e8 "GEN_CLUST_INDEX"
(gdb) p prebuilt->index->table      
$73 = (dict_table_t *) 0x7fffa001fe68
(gdb) p prebuilt->index->table->name
$74 = 0x7fffa000cef0 "test/t1"
(gdb) set print array on
(gdb) p *prebuilt->index->fields@4
$67 =   {{col = 0x7fffa0020154, name = 0x7fffa00201c3 "DB_ROW_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa0020160, name = 0x7fffa00201cd "DB_TRX_ID", prefix_len = 0, 
    fixed_len = 6},
  {col = 0x7fffa002016c, name = 0x7fffa00201d7 "DB_ROLL_PTR", prefix_len = 0, 
    fixed_len = 7},
  {col = 0x7fffa0020148, name = 0x7fffa00201c0 "f1", prefix_len = 0, fixed_len = 4}}

Printing the Record Offsets (or Offset Array)

The redundant record contains an array of offsets of each field of the row. This array is available after the 6-byte bit fields (down from the record origin). The size of each element of the array is either 1 byte or 2 byte depending on the 1 bit flag in the 6 byte bit fields. In our current case each offset element is 1 byte long.

(gdb) p /d  *(pcur->btr_cur->page_cur->block->frame+135-6-4)@4
$79 = {23, 19, 12, 6}

The first field (DB_ROW_ID) starts at offset 0 from record origin. The second field (DB_TRX_ID) starts at offset 6 from record origin. The third field (DB_ROLL_PTR) starts at offset 12 bytes from record origin. The fourth field (f1) starts at offset 19 bytes from record origin. The end of the record is 23 bytes from the record origin. So the total size of this record can be calculated as follows:

S - size of offset array element (1 byte) 
N - number of fields in the row (4 fields)
I - size of bit fields (6 bytes)
D - Length of row after record origin (23 as given in offset array)
R - Total Size of the row 

R = (S*N) + I + D = 4 + 6 + 23 = 33 bytes

The total size of the row is 33 bytes. Since this table contains only fixed length fields, the size of each row in this table will be the same.

Printing the DB_ROW_ID

The DB_ROW_ID is 6 bytes long. It starts at the record origin itself. It can be printed as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135)@6
$82 = {0x0, 0x0, 0x0, 0x0, 0x2, 0x0}

Printing the DB_TRX_ID

The second field DB_TRX_ID is also 6 bytes long. It starts at an offset of 6 bytes from the record origin. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+6)@6
$84 = {0x0, 0x0, 0x0, 0x0, 0x9, 0x8}

Printing the DB_ROLL_PTR

The third field DB_ROLL_PTR is 7 bytes long. It points to the previous versions of a record available in the undo segments. It can be examined as follows:

(gdb)  p /x *(pcur->btr_cur->page_cur->block->frame+135+12)@7
$3 = {0xa8, 0x0, 0x0, 0x1, 0x1a, 0x1, 0x10}

Printing the User Data Field

The fourth field "f1" is the user defined field. It is 4 bytes long. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+19)@4
$5 = {0x0, 0x0, 0x0, 0x1}
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$8 = 1

Thus, we have completely accessed the first user record. Given the first user record, we can find out the location of the next user record (the first 2 bytes before the record origin). If we keep accessing the next user record, we will eventually reach the supremum record, which is the last (and considered largest or maximum) record in a page.

Printing All Rows

Just to improve our confidence in accessing the rows, here I have printed all the rows that we have inserted into the table. The first record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which we have inserted) and then the last record supremum.

(gdb) p pcur->btr_cur->page_cur->block->frame+101
$13 = (unsigned char *) 0x7fffd1e48065 "infimum"
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))
$14 = 1
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+168+19))
$15 = 2
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+201+19))
$16 = 3
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+234+19))
$17 = 4
(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+267+19))
$18 = 5
(gdb) p pcur->btr_cur->page_cur->block->frame+116
$19 = (unsigned char *) 0x7fffd1e48074 "supremum"

The Complete Redundant Row Format

One redundant record can be logically separated into 5 parts - the offsets array, the 6-byte bit field, record origin, hidden fields (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified fields of the table. They can be viewed as follows:

offsets array
fn ... f1, x2, x1, pn ... p1
6-byte bit field
primary key
or user given
p1, p2 ... pn
hidden fields
x1, x2
user fields
f1, f2, f3 ... fn

All record pointers point to the record origin. The record origin is the beginning of the primary key for the clustered primary index. It is not a separate entity that occupies some space. I am showing it separately to make it easy to visualize. All information prior to the record origin can be considered as the record header or the record meta-data.

Each data field has a corresponding offsets array element. The offsets array is arranged in reverse order. The first element in the array provides the offset of the end of the last data field and the last element in the array provides the offset of the end of the first data element.

The offsets array also tells whether a data field is NULL. If a field is NULL, the highest bit in the associated offset array element will be set to 1.

The biggest advantage of the redundant row format is that a record in this format can be parsed without needing to access the data dictionary. But this format is not space efficient. Even for fixed length fields, we store information in the offsets array (this is avoided in COMPACT, DYNAMIC and COMPRESSED row formats).

One more limitation of the redundant row format is that the blob data cannot be stored completely off page (as in DYNAMIC and COMPRESSED row formats). A 768 byte prefix of the blob will always be stored inline with the row and cannot be avoided.

A small note on row formats. InnoDB really supports only two row formats - the redundant row format and the compact row format. The compressed and dynamic row formats are a variation available only on top of the compact row format. Technically these variations can be implemented for the redundant row format as well, but is currently not available.

GNU Emacs Function to Traverse Rows in Page

If we dump a single InnoDB index page into a file, we can open it in GNU Emacs hexl-mode and use the following elisp function to traverse from infimum record to the supremum record. First place the cursor on the 'i' of the infimum record and then call this function to reach the next record origin. I am not an expert in writing elisp functions. If you can improve on this and write a better one, kindly share it with me.

(defun ib-next-red-row ()
  "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.
The cursor must be positioned at the origin of a record. 
This is for the GNU Emacs hexl-mode. "
  (setq rec-origin (point))
  (setq rec-origin-offset (hexl-current-address))

  ;; The next-rec pointer is stored in 2 bytes before the
  ;; record origin
  (hexl-backward-char 2)
  (setq next-rec (string (char-after)))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (hexl-forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))
  (forward-char 1)
  (setq next-rec 
	(concat next-rec (string (char-after))))

  ;; Next record is next-rec-off bytes away from page origin
  (setq next-rec-off (hexl-hex-string-to-integer next-rec))

  ;; Assuming that there is only one page in dump
  (hexl-goto-address 0)
  (hexl-forward-char next-rec-off))


In this article, the format of the redundant row was explained. A gdb session was used to demonstrate how to access a page and the rows within that page. The infimum record was accessed, printed and analysed. Using its next record pointer, the first user record was accessed. The reader can use the next record pointer of the first user record to access the next user record and so on till the supremum record is reached. I leave that as an exercise for the readers.

Thanks to Marko Makela for his useful tips on debugging at our yearly InnoDB Team Meeting 2013 held at Shanghai, China. This article derives from his ideas.

Send in your feedback to me.


This is the InnoDB team blog.


« September 2013 »