Monday Jul 15, 2013

Redo Logging in InnoDB

Introduction

InnoDB is a general-purpose storage engine that balances high reliability and high performance. It is a transactional storage engine and is fully ACID compliant, as would be expected from any relational database. The durability guarantee provided by InnoDB is made possible by the redo logs.

This article will provide an overview of the redo log subsystem or log subsystem of InnoDB. We will look at the following details:

  • The global log system object, which provides access to important data structures and information.
  • The mini-transaction (mtr), using which all redo log records are created.
  • The global in-memory log buffer (or just log buffer), into which the redo logs are written to from the mini transaction buffer. This log buffer will be periodically flushed to the log file on disk.
  • The redo log files on the disk and their high level internal structure.
  • We will discuss about the concept of LSN, and how the various values of LSN are used to implement the write-ahead logging (WAL).

Redo Log Generation

In the article, Data Organization in InnoDB, we already saw that the user data files of InnoDB (.ibd files) are considered to be a sequence of equal sized pages. These pages are uniquely identified within the InnoDB system using the space_id, page_no combination. If we want to read or modify any page, it needs to be loaded into memory. So there are two copies of the pages - on disk and in memory. Here are the high level steps in redo log generation.

  1. Any changes to a page is first done to the in-memory copy of the page. The page that is modified in memory and not yet flushed to disk is marked as the dirty page in memory.
  2. An associated redo log is generated in memory, in the local mini transaction (mtr) buffer. This will then be transferred to the global in-memory redo log buffer.
  3. The redo log record is written from the redo log buffer in memory to the redo log file on disk. This is subsequently flushed. These two steps are considered separate - writing to the redo log file and flushing the redo log file to the disk. This is to account for the file buffering done by the operating system.
  4. The dirty page is flushed from memory to disk at some later point of time as part of the checkpointing operation.

The order of these steps are important. The redo log record of a change must be flushed to disk before flushing the corresponding dirty page to the disk. This is the concept of write-ahead logging (WAL).

The generated redo log record will contain information necessary to repeat the same operation later during a database recovery. So the redo log record will contain information about the original set of pages and what has been changed in them. Using the redo log record the set of dirty pages will be re-created during database recovery.

Redo Log Files

By default, InnoDB creates two redo log files (or just log files) ib_logfile0 and ib_logfile1 within the data directory of MySQL. In MySQL versions 5.6.8 and above, the default size of each redo log file is 48MB each. This can be configured by the user by making use of innodb_log_file_size server option. The number of log files is controlled by innodb_log_files_in_group server option.

A log group consists of a number of log files, each of same size. As of MySQL 5.6, InnoDB supports only one log group. So I'll not discuss this further.

The redo log files are used in a circular fashion. This means that the redo logs are written from the beginning to end of first redo log file, then it is continued to be written into the next log file, and so on till it reaches the last redo log file. Once the last redo log file has been written, then redo logs are again written from the first redo log file.

The log files are viewed as a sequence of blocks called "log blocks" whose size is given by OS_FILE_LOG_BLOCK_SIZE which is equal to 512 bytes. Each log file has a header whose size is given by LOG_FILE_HDR_SIZE, which is defined as 4*OS_FILE_LOG_BLOCK_SIZE.

Redo Log File Header

Each redo log file contains a header occupying four log blocks with the following information:

  • The first 4 bytes contain the log group number to which the log file belongs.
  • The next 8 bytes contain the lsn of the start of data in this log file.
  • First checkpoint field located in the beginning of the second log block.
  • Second checkpoint field located in the beginning of the fourth log block.

The checkpoint field mentioned above contains information necessary to identify a checkpoint - a checkpoint number, the LSN of the checkpoint, checksum information and more.

Log Blocks

A redo log file can be viewed as a sequence of log blocks. All log blocks, except the ones belonging to the log file header, contain a header and a footer. The header is of size LOG_BLOCK_HDR_SIZE bytes (which is defined as 12 bytes). The log block trailer is of size LOG_BLOCK_TRL_SIZE (4 bytes). The log block header contains the following information:

  • The log block number. This field is of 4 bytes.
  • Number of bytes of log written to this block. This field is of 2 bytes.
  • Offset of the first start of an mtr log record group in this log block or 0 if none
  • The checkpoint number to which this log block belongs

The log block trailer contains checksum of the log block contents.

Log Sequence Number (LSN)

The log sequence number (LSN) is an important concept. The LSN is an offset into the redo log file. Within InnoDB the log sequence number is represented by the type lsn_t, which is an 8-byte unsigned integer. There are different LSN values that are of interest. The following table lists some of the LSN values discussed in this article. (Note: The log_sys is the global log system object which is explained in next section.)

LSN Description
log_sys->lsn The redo log record that will be generated next will make use of this lsn
log_sys->flushed_to_disk_lsn The redo log file is flushed upto this LSN. All redo log records whose LSN is < flushed_to_disk_lsn is safely on the disk.
log_sys->write_lsn There is a currently running write operation which will write upto this LSN.
log_sys->current_flush_lsn There is a currently running write + flush operation which will write upto this LSN.

The LSN is what that links the dirty page, the redo log record, and the redo log file. Each redo log record when it is copied to the in-memory log buffer, it gets an associated LSN. When each database page is modified, redo log records are generated. So each database page is also associated to an LSN. The page lsn is stored in a header for each page. The page lsn gives the lsn upto which the redo log file must be flushed before flushing the page.

Global Log System Object

The global log system object log_sys (of type log_t) holds important information related to log subsystem of InnoDB. Here only those information related to redo log buffer and redo log files are discussed. The global log_sys identifies the "active area" of the log buffer, which contains redo logs that are yet to be safely written on the disk. It also identifies the area in the redo log file, into which the active area of the log buffer will be written/flushed.

/** Redo log buffer */
struct log_t{
	lsn_t		lsn;		/*!< log sequence number */
	ulint		buf_free;	/*!< first free offset within the log
					buffer */
	byte*		buf_ptr;	/* unaligned log buffer */
	byte*		buf;		/*!< log buffer */
	ulint		buf_size;	/*!< log buffer size in bytes */
	ulint		max_buf_free;	/*!< recommended maximum value of
					buf_free, after which the buffer is
					flushed */

	ulint		buf_next_to_write;/*!< first offset in the log buffer
					where the byte content may not exist
					written to file, e.g., the start
					offset of a log record catenated
					later; this is advanced when a flush
					operation is completed to all the log
					groups */
	lsn_t		write_lsn;	/*!< end lsn for the current running
					write */
	ulint		write_end_offset;/*!< the data in buffer has
					been written up to this offset
					when the current write ends:
					this field will then be copied
					to buf_next_to_write */
	lsn_t		current_flush_lsn;/*!< end lsn for the current running
					write + flush operation */
	lsn_t		flushed_to_disk_lsn;
					/*!< how far we have written the log
					AND flushed to disk */
};

  • The member log_sys->buf points to the in-memory redo log buffer. This is the buffer into which an mtr_commit() writes the redo log records. The size of this buffer is given by log_sys->buf_size in bytes.
  • The member log_sys->buf_free is the offset within the in-memory redo log buffer, where the next redo log record will be written. This is the end offset for the next redo log flush to disk.
  • The member log_sys->buf_next_to_write is the offset from where the redo log records are not yet written to the redo log file. When the redo log buffer is flushed to disk next time, it will flush from this location. This is the start offset for the next redo log flush to disk.
  • The member log_sys->flushed_to_disk_lsn marks the lsn upto which we have written to the log file on the disk and flushed it. So upto this lsn, the redo logs are safely on the disk. This will always be less than or equal to log_sys->write_lsn and log_sys->lsn.
  • The member log_sys->lsn represents the current lsn. This member will be updated whenever we do mtr_commit(). The function mtr_commit() writes the bunch of redo log records generated in the mini transaction to the global or system wide redo log buffer in memory. This will always be greater than or equal to log_sys->flushed_to_disk_lsn and log_sys->write_lsn. This will be the LSN of the redo log record written at log_sys->buf_free.
  • The member log_sys->write_lsn represents the end lsn of a currently running redo log buffer write operation. This will be greater than or equal to log_sys->flushed_to_disk_lsn and less than or equal to log_sys->lsn.
  • The member log_sys->current_flush_lsn represents the end lsn of a currently running redo log buffer write + flush operation. This will be mostly equal to log_sys->write_lsn.

The global log_sys object points to various positions in the in-memory redo log buffer and on-disk redo log files. The following picture shows the locations pointed to by the global log_sys object. The picture clearly shows that the redo log buffer maps to a specific portion of the redo log file.

Global In-memory Redo Log Buffer

The in-memory redo log buffer is global and all redo logs generated by user transactions will be written into this buffer. The size of this buffer is configurable and is given by the innodb_log_buffer_size. The default size of this redo log buffer is 8MB.

When a transaction is running and if it is modifying the database, then it will generate redo logs and populate this log buffer. This log buffer will be written or flushed to the log file, either when the transaction commits, or when the log buffer gets full.

When the redo log buffer is full, and there is not enough space for an mtr_commit(), which will transfer a group of redo log records to log buffer, then a synchronous flush of the log buffer is done to the redo log file by using the function log_buffer_flush_to_disk(). This function will write the log buffer from log_sys->buf_next_to_write to log_sys->buf_free. In terms of LSN, the function log_buffer_flush_to_disk() flushes the redo logs from log_sys->flushed_to_disk_lsn to log_sys->lsn.

The mini transaction (mtr)

A mini transaction (mtr) must be used to generate all the redo log records. A mini transaction contains a local buffer (called the mini transaction buffer) into which the generated redo log records will be stored. If we need to generate a group of redo log records such that either all make it to the redo log file or none makes it, then we need to put them in a single mini transaction. Apart from the redo log records, the mini transaction also maintains a list of pages that has been modified (dirty pages).

The normal usage of a mini transaction is as follows:

  • Create a mini transaction object of type mtr_t
  • Start the mini transaction with mtr_start() function. This will initialize the mini transaction buffer.
  • Generate the redo log records by making use of mlog_write_ulint() family of functions.
  • Commit the mini transaction with mtr_commit() function. This will transfer the redo log records from mini transaction buffer to the global redo log buffer. The list of dirty pages is added to the buffer pool flush list.

The definition of a mini transaction is given here for your reference. The member mtr_t::log contains the mini transaction buffer which will hold the redo log records, and the member mtr_t::memo contains a list of pages dirtied by this mini transaction.

/* Mini-transaction handle and buffer */
struct mtr_t{
	dyn_array_t	memo;	/*!< memo stack for locks etc. */
	dyn_array_t	log;	/*!< mini-transaction log */
	unsigned	inside_ibuf:1;
				/*!< TRUE if inside ibuf changes */
	unsigned	modifications:1;
				/*!< TRUE if the mini-transaction
				modified buffer pool pages */
	unsigned	made_dirty:1;
				/*!< TRUE if mtr has made at least
				one buffer pool page dirty */
	ulint		n_log_recs;
				/* count of how many page initial log records
				have been written to the mtr log */
	ulint		n_freed_pages;
				/* number of pages that have been freed in
				this mini-transaction */
	ulint		log_mode; /* specifies which operations should be
				logged; default value MTR_LOG_ALL */
	lsn_t		start_lsn;/* start lsn of the possible log entry for
				this mtr */
	lsn_t		end_lsn;/* end lsn of the possible log entry for
				this mtr */
};

Redo Log Record Types

When we modify a database page, a redo log record is generated. This redo log record either contains what information has been changed in the page (physical redo log) or how to perform the change again (logical redo log). InnoDB uses a combination of physical redo logs and logical redo logs.

To understand this consider an operation like page re-organization. If we generated a physical redo log record for this operation, it is likely that the redo log record generated will be equal to the page size. But instead InnoDB uses logical redo log record for this operation, which will reduce the redo log record size significantly. In the case of logical redo log record to represent a page reorganize, all we need is the information to uniquely identify a page, and "type" of operation which is page reorganize.

So each redo log record has a type. A redo log record type helps to identify the function that will be used to apply or execute the redo log during recovery. The contents of the redo log record must then contain all the arguments or parameters needed by the function.

Life cycle of a redo log record

The life cycle of a redo log record is as follows:
  • The redo log record is first created by a mini transaction and stored in the mini transaction buffer. It has information necessary to redo the same operation again in the time of database recovery.
  • When mtr_commit() is done, the redo log record is transferred to the global in-memory redo log buffer. If necessary, the redo log buffer will be flushed to the redo log file, to make space for the new redo log record.
  • The redo log record has a specific lsn associated with it. This association is established during mtr_commit(), when the redo log record is transferred from mtr buffer to log buffer. Once the lsn is established for a redo log record, then its location in the redo log file is also established.
  • The redo log record is then transferred from the log buffer to redo log file when it is written + flushed. This means that the redo log record is now durably stored on the disk.
  • Each redo log record has an associated list of dirty pages. This relationship is established via LSN. A redo log record must be flushed to disk, before its associated dirty pages. A redo log record can be discarded only after all the associated dirty pages are flushed to the disk.
  • A redo log record can be used to re-create its associated list of dirty pages. This happens during database recovery.

Conclusion

This article provided an overview of the redo log subsystem of InnoDB storage engine. The major data structures used in the redo log subsystem are the mini transaction (mtr), the in-memory redo log buffer and the on-disk redo log files. The InnoDB storage engine tracks many LSN values to ensure that the write-ahead logging (WAL) happens correctly.

Since data loss is unacceptable for an RDBMS, the redo log subsystem is very critical for a successful RDBMS. And since redo logs are generated synchronously with DML operations, it is important to do it efficiently. The size of redo logs must be kept as minimal as possible.

Send in your feedback to me.

About

This is the InnoDB team blog.

Search

Archives
« July 2013 »
SunMonTueWedThuFriSat
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
   
       
Today