Thursday Dec 12, 2013

MySQL 5.7.3: Deep dive into 1mil QPS with InnoDB & Memcached

As you probably already know, with the MySQL 5.7.3 release, InnoDB Memcached reached a record of over 1 million QPS on a read-only load. An overview of the benchmark and the test results can be seen in an earlier blog by Dimitri. In this blog, I will spend some time on the detailed changes we have made to achieve this number.

First, thanks to Facebook's Yoshinori, whose bug#70172 brought our attention to this single-commit read-only load test. We had been focusing on operations with a large batch size. This bug prompted us to do a series of optimizations on single-commit read-only queries, and these optimizations eliminated almost all major bottlenecks in the InnoDB Memcached plugin itself.


If you are just getting familiar with InnoDB Memcached, there are some earlier blogs on the topic to get you started. In short, InnoDB Memcached provides a fast path to retrieve key-value data stored in an InnoDB table using the Memcached protocol.


The Benchmark:

Now, let's discuss the testing scenario. The InnoDB Memcached plugin configuration was left at its defaults in this benchmark, which means daemon_memcached_r_batch_size was set to 1, so each read operation begins and commits a transaction for each query. This is equivalent to auto-commit single SELECTs through the SQL interface. The innodb_api_trx_level is by default set to 0 (READ UNCOMMITTED); however, changing it to 2 (REPEATABLE READ) gave the same benchmark result.
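
If you want to confirm these defaults on your own instance, both settings can be checked from the SQL side (a quick check, assuming the InnoDB Memcached plugin is installed):

mysql> SHOW GLOBAL VARIABLES LIKE 'daemon_memcached_r_batch_size';
mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_api_trx_level';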


More good news in 5.7.3 is that we started to support integer key column mapping, as it is common to use an integer as the primary key of a table. The table used in this benchmark has an integer key column. The mapping table contains only key and value columns, so we set the corresponding `flags`, `cas_column` and `expire_time_column` columns in the config containers table all to NULL; this avoids the overhead of the columns that support the Memcached "extra" options. The table itself contains 1 million rows, each with a short integer key and a text value.

Here is the detailed table definition:

mysql> desc test.memc_test;
+--------+---------+------+-----+---------+-------+
| Field  | Type    | Null | Key | Default | Extra |
+--------+---------+------+-----+---------+-------+
| id     | int(11) | NO   | PRI | NULL    |       |
| value3 | text    | YES  |     | NULL    |       |
+--------+---------+------+-----+---------+-------+
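
For reference, a table matching this definition can be created with a statement along the following lines (a minimal sketch based on the desc output above):

CREATE TABLE test.memc_test (
    id INT NOT NULL PRIMARY KEY,
    value3 TEXT
) ENGINE=InnoDB;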


To make InnoDB Memcached recognize this InnoDB table, insert the following row into the innodb_memcache/containers table:

INSERT INTO containers VALUES

("memc1", "test", "memc_test", "id", "value3", null, null, null, "PRIMARY");


The Memcached client used for inserting rows and for querying is a simple libmemcached program provided by Yoshinori. Each query does a key lookup and fetches the corresponding value.

We made some adjustments so that there are multiple client processes, each with multiple sessions. This was done to alleviate bottlenecks in the client itself.

As a note, there are many Memcached clients out there, and the Memcached client can play an important role in the performance result itself. For example, we observed at least a 3x difference in results between the Perl client Cache::Memcached::Fast and its slower counterpart Cache::Memcached. As far as we can see, libmemcached is one of the most efficient clients available, even though it eventually becomes a bottleneck itself as the test scales up, especially for requests over the network.

The test results can be seen in Dimitri's blog, so I will not repeat them here. The summary is that we got close to 1.2 million QPS at best. The next bottleneck now seems to lie in the adaptive hash index's global latch - "btr_search_latch". The libmemcached client overhead is also significant.


Read Only QPS through InnoDB Memcached


The improvement:


There are several changes in both the InnoDB Memcached code and the native Memcached code that went into achieving this record benchmark.


1. The first is of course to address the issue reported in bug #70172. With daemon_memcached_r_batch_size set to 1, a transaction is repeatedly started and committed for each query. It is better to cache the trx object itself, to avoid repeatedly creating and destroying the trx object. Otherwise, the "trx_sys mutex" will kill the concurrency.

After the change, the trx object is cached with private memcached connection data. Each connection gets its own trx object, and it is used to handle transactions through this particular connection.

2. The next thing we did was to take advantage of the read-only optimization recently made in the InnoDB code. This scenario (a single read transaction) is a perfect fit for the optimization. Whenever the read batch size is set to 1, InnoDB Memcached will treat incoming queries as auto-commit read-only queries and will automatically hook into the "fast path" for read-only operations in InnoDB.

3. After these two transaction-related changes, we found that the bottleneck came from the native Memcached code itself. As a note, we embed the Memcached code in our InnoDB Memcached plugin, so any bottleneck in Memcached affects us.

The original Memcached memory allocation is protected by a global Cache Lock (engine->cache_lock), and it quickly rises in prominence in the profiling result.

Even though the data is stored in InnoDB, we still happened to use some of Memcached's own memory allocation to store and deliver the results back to the front end. To fix this bottleneck, we stopped using Memcached memory altogether. Instead, a connection-private memory buffer is used to store and deliver the results. This also saves a memcpy, since we no longer move the data into Memcached memory as before.

This change makes InnoDB Memcached plugin as thin as possible, and only relies on the InnoDB buffer pool and Adaptive Hash Index (AHI) as the backing store for the data. This provides better scaling and memory handling than Memcached itself.

4. Another bottleneck in Memcached is its statistics mutex ("thread_stats->mutex"). This also becomes significant as testing progresses. To remove it, we switched to using atomic operations whenever the platform supports them (most modern platforms do). With these changes, the plugin now scales well to over 100 connections without degradation as the number of connections is ramped up.

5. In addition to removing those major bottlenecks, we also streamlined the code to remove some overhead. For example, we now cache the "search tuple", so that there is no need to allocate a new search tuple for each query. This keeps InnoDB Memcached as lean as possible.

With these changes, we have eliminated all the major InnoDB Memcached plugin bottlenecks. The remaining bottlenecks come from the clients themselves and, to a lesser degree, from the Adaptive Hash Index search latch.

Future work:

Memcached reads now go more than twice as fast as equivalent reads through the SQL interface. By using the InnoDB buffer pool as the in-memory store, and with the InnoDB AHI, InnoDB Memcached can probably provide a more efficient and more scalable store than Memcached itself.

There is still more to be done.

1. We will continue to remove some bottlenecks in InnoDB (such as btr_search_latch), as well as make InnoDB memcached leaner/faster.

2. We will add support for the "mgets" command, which allows Memcached to fetch multiple results (corresponding to multiple keys) in one query attempt. This should give us another big jump in QPS.

3. We will start to focus more on insert/update operations.

4. We are considering extending the functionality of the Memcached interface to support range queries and similar operations, so as to make it a more versatile key-value store.

In summary, with these enhancements, InnoDB Memcached becomes more and more attractive as a quick key-value store accessed through the MySQL server.

Your feedback and comments are important to us as we evolve and improve this plugin. 

Monday Jul 15, 2013

Redo Logging in InnoDB

Introduction

InnoDB is a general-purpose storage engine that balances high reliability and high performance. It is a transactional storage engine and is fully ACID compliant, as would be expected from any relational database. The durability guarantee provided by InnoDB is made possible by the redo logs.

This article will provide an overview of the redo log subsystem or log subsystem of InnoDB. We will look at the following details:

  • The global log system object, which provides access to important data structures and information.
  • The mini-transaction (mtr), through which all redo log records are created.
  • The global in-memory log buffer (or just log buffer), into which redo log records are written from the mini-transaction buffer. This log buffer is periodically flushed to the log file on disk.
  • The redo log files on disk and their high-level internal structure.
  • The concept of the LSN, and how the various LSN values are used to implement write-ahead logging (WAL).

Redo Log Generation

In the article, Data Organization in InnoDB, we already saw that the user data files of InnoDB (.ibd files) are considered to be a sequence of equal sized pages. These pages are uniquely identified within the InnoDB system using the space_id, page_no combination. If we want to read or modify any page, it needs to be loaded into memory. So there are two copies of the pages - on disk and in memory. Here are the high level steps in redo log generation.

  1. Any change to a page is first made to the in-memory copy of the page. A page that is modified in memory but not yet flushed to disk is marked as a dirty page in memory.
  2. An associated redo log record is generated in memory, in the local mini-transaction (mtr) buffer. It is then transferred to the global in-memory redo log buffer.
  3. The redo log record is written from the redo log buffer in memory to the redo log file on disk, and subsequently flushed. Writing to the redo log file and flushing it to disk are considered two separate steps, to account for the file buffering done by the operating system.
  4. The dirty page is flushed from memory to disk at some later point of time as part of the checkpointing operation.

The order of these steps is important. The redo log record of a change must be flushed to disk before the corresponding dirty page is flushed to disk. This is the concept of write-ahead logging (WAL).

The generated redo log record will contain information necessary to repeat the same operation later during a database recovery. So the redo log record will contain information about the original set of pages and what has been changed in them. Using the redo log record the set of dirty pages will be re-created during database recovery.

Redo Log Files

By default, InnoDB creates two redo log files (or just log files), ib_logfile0 and ib_logfile1, within the data directory of MySQL. In MySQL versions 5.6.8 and above, the default size of each redo log file is 48MB. This can be configured by the user using the innodb_log_file_size server option. The number of log files is controlled by the innodb_log_files_in_group server option.
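
As a quick check, both settings can be inspected from the SQL prompt; with the defaults, the total redo log capacity is innodb_log_file_size x innodb_log_files_in_group = 48MB x 2 = 96MB:

mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_log_file%';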

A log group consists of a number of log files, each of the same size. As of MySQL 5.6, InnoDB supports only one log group, so I'll not discuss this further.

The redo log files are used in a circular fashion. This means that redo logs are written from the beginning to the end of the first redo log file, then writing continues into the next log file, and so on until the last redo log file is reached. Once the last redo log file has been filled, redo logs are again written from the first redo log file.

The log files are viewed as a sequence of blocks called "log blocks", whose size is given by OS_FILE_LOG_BLOCK_SIZE, which is equal to 512 bytes. Each log file has a header whose size is given by LOG_FILE_HDR_SIZE, which is defined as 4 * OS_FILE_LOG_BLOCK_SIZE (2048 bytes).

Redo Log File Header

Each redo log file contains a header occupying four log blocks with the following information:

  • The first 4 bytes contain the log group number to which the log file belongs.
  • The next 8 bytes contain the LSN of the start of data in this log file.
  • The first checkpoint field, located at the beginning of the second log block.
  • The second checkpoint field, located at the beginning of the fourth log block.

The checkpoint fields mentioned above contain the information necessary to identify a checkpoint - a checkpoint number, the LSN of the checkpoint, checksum information and more.

Log Blocks

A redo log file can be viewed as a sequence of log blocks. All log blocks, except the ones belonging to the log file header, contain a header and a footer. The header is of size LOG_BLOCK_HDR_SIZE bytes (which is defined as 12 bytes). The log block trailer is of size LOG_BLOCK_TRL_SIZE (4 bytes). The log block header contains the following information:

  • The log block number. This field is of 4 bytes.
  • Number of bytes of log written to this block. This field is of 2 bytes.
  • Offset of the first start of an mtr log record group in this log block, or 0 if none.
  • The checkpoint number to which this log block belongs.

The log block trailer contains checksum of the log block contents.

Log Sequence Number (LSN)

The log sequence number (LSN) is an important concept. The LSN is an offset into the redo log. Within InnoDB, the log sequence number is represented by the type lsn_t, which is an 8-byte unsigned integer. There are different LSN values that are of interest. The following table lists some of the LSN values discussed in this article. (Note: log_sys is the global log system object, which is explained in the next section.)

LSN | Description
log_sys->lsn | The redo log record that will be generated next will use this LSN.
log_sys->flushed_to_disk_lsn | The redo log file is flushed up to this LSN. All redo log records whose LSN is < flushed_to_disk_lsn are safely on disk.
log_sys->write_lsn | There is a currently running write operation which will write up to this LSN.
log_sys->current_flush_lsn | There is a currently running write + flush operation which will write up to this LSN.

The LSN is what links the dirty page, the redo log record, and the redo log file. When a redo log record is copied to the in-memory log buffer, it gets an associated LSN. When a database page is modified, redo log records are generated, so each database page is also associated with an LSN. The page LSN is stored in the header of each page. The page LSN gives the LSN up to which the redo log file must be flushed before the page itself can be flushed.

Global Log System Object

The global log system object log_sys (of type log_t) holds important information related to the log subsystem of InnoDB. Here, only the information related to the redo log buffer and the redo log files is discussed. The global log_sys identifies the "active area" of the log buffer, which contains redo logs that are not yet safely written to disk. It also identifies the area in the redo log file into which the active area of the log buffer will be written/flushed.

/** Redo log buffer */
struct log_t{
	lsn_t		lsn;		/*!< log sequence number */
	ulint		buf_free;	/*!< first free offset within the log
					buffer */
	byte*		buf_ptr;	/* unaligned log buffer */
	byte*		buf;		/*!< log buffer */
	ulint		buf_size;	/*!< log buffer size in bytes */
	ulint		max_buf_free;	/*!< recommended maximum value of
					buf_free, after which the buffer is
					flushed */

	ulint		buf_next_to_write;/*!< first offset in the log buffer
					where the byte content may not exist
					written to file, e.g., the start
					offset of a log record catenated
					later; this is advanced when a flush
					operation is completed to all the log
					groups */
	lsn_t		write_lsn;	/*!< end lsn for the current running
					write */
	ulint		write_end_offset;/*!< the data in buffer has
					been written up to this offset
					when the current write ends:
					this field will then be copied
					to buf_next_to_write */
	lsn_t		current_flush_lsn;/*!< end lsn for the current running
					write + flush operation */
	lsn_t		flushed_to_disk_lsn;
					/*!< how far we have written the log
					AND flushed to disk */
};

  • The member log_sys->buf points to the in-memory redo log buffer. This is the buffer into which an mtr_commit() writes the redo log records. The size of this buffer is given by log_sys->buf_size in bytes.
  • The member log_sys->buf_free is the offset within the in-memory redo log buffer, where the next redo log record will be written. This is the end offset for the next redo log flush to disk.
  • The member log_sys->buf_next_to_write is the offset from where the redo log records are not yet written to the redo log file. When the redo log buffer is flushed to disk next time, it will flush from this location. This is the start offset for the next redo log flush to disk.
  • The member log_sys->flushed_to_disk_lsn marks the lsn up to which we have written to the log file on disk and flushed it. So up to this lsn, the redo logs are safely on the disk. This will always be less than or equal to log_sys->write_lsn and log_sys->lsn.
  • The member log_sys->lsn represents the current lsn. This member is updated whenever we do mtr_commit(). The function mtr_commit() writes the group of redo log records generated in the mini-transaction to the global, system-wide redo log buffer in memory. This will always be greater than or equal to log_sys->flushed_to_disk_lsn and log_sys->write_lsn. This will be the LSN of the redo log record written at log_sys->buf_free.
  • The member log_sys->write_lsn represents the end lsn of a currently running redo log buffer write operation. This will be greater than or equal to log_sys->flushed_to_disk_lsn and less than or equal to log_sys->lsn.
  • The member log_sys->current_flush_lsn represents the end lsn of a currently running redo log buffer write + flush operation. This will be mostly equal to log_sys->write_lsn.

The global log_sys object points to various positions in the in-memory redo log buffer and on-disk redo log files. The following picture shows the locations pointed to by the global log_sys object. The picture clearly shows that the redo log buffer maps to a specific portion of the redo log file.

Global In-memory Redo Log Buffer

The in-memory redo log buffer is global, and all redo logs generated by user transactions are written into this buffer. The size of this buffer is configurable and is given by innodb_log_buffer_size. The default size of this redo log buffer is 8MB.

When a transaction is running and if it is modifying the database, then it will generate redo logs and populate this log buffer. This log buffer will be written or flushed to the log file, either when the transaction commits, or when the log buffer gets full.

When the redo log buffer is full and there is not enough space for an mtr_commit(), which transfers a group of redo log records to the log buffer, a synchronous flush of the log buffer to the redo log file is done using the function log_buffer_flush_to_disk(). This function writes the log buffer contents from log_sys->buf_next_to_write to log_sys->buf_free. In terms of LSN, log_buffer_flush_to_disk() flushes the redo logs from log_sys->flushed_to_disk_lsn to log_sys->lsn.

The mini transaction (mtr)

A mini transaction (mtr) must be used to generate all redo log records. A mini transaction contains a local buffer (called the mini transaction buffer) into which the generated redo log records are stored. If we need to generate a group of redo log records such that either all of them make it to the redo log file or none of them does, then we need to put them in a single mini transaction. Apart from the redo log records, the mini transaction also maintains a list of pages that have been modified (dirty pages).

The normal usage of a mini transaction is as follows:

  • Create a mini transaction object of type mtr_t
  • Start the mini transaction with mtr_start() function. This will initialize the mini transaction buffer.
  • Generate the redo log records by making use of the mlog_write_ulint() family of functions.
  • Commit the mini transaction with mtr_commit() function. This will transfer the redo log records from mini transaction buffer to the global redo log buffer. The list of dirty pages is added to the buffer pool flush list.

The definition of a mini transaction is given here for your reference. The member mtr_t::log contains the mini transaction buffer which will hold the redo log records, and the member mtr_t::memo contains a list of pages dirtied by this mini transaction.

/* Mini-transaction handle and buffer */
struct mtr_t{
	dyn_array_t	memo;	/*!< memo stack for locks etc. */
	dyn_array_t	log;	/*!< mini-transaction log */
	unsigned	inside_ibuf:1;
				/*!< TRUE if inside ibuf changes */
	unsigned	modifications:1;
				/*!< TRUE if the mini-transaction
				modified buffer pool pages */
	unsigned	made_dirty:1;
				/*!< TRUE if mtr has made at least
				one buffer pool page dirty */
	ulint		n_log_recs;
				/* count of how many page initial log records
				have been written to the mtr log */
	ulint		n_freed_pages;
				/* number of pages that have been freed in
				this mini-transaction */
	ulint		log_mode; /* specifies which operations should be
				logged; default value MTR_LOG_ALL */
	lsn_t		start_lsn;/* start lsn of the possible log entry for
				this mtr */
	lsn_t		end_lsn;/* end lsn of the possible log entry for
				this mtr */
};

Redo Log Record Types

When we modify a database page, a redo log record is generated. This redo log record either contains what information has been changed in the page (physical redo log) or how to perform the change again (logical redo log). InnoDB uses a combination of physical redo logs and logical redo logs.

To understand this, consider an operation like a page reorganization. If we generated a physical redo log record for this operation, the redo log record would likely be comparable to the page size. Instead, InnoDB uses a logical redo log record for this operation, which reduces the redo log record size significantly. In the case of a logical redo log record representing a page reorganize, all we need is the information to uniquely identify the page and the "type" of the operation, which is page reorganize.

So each redo log record has a type. A redo log record type helps to identify the function that will be used to apply or execute the redo log during recovery. The contents of the redo log record must then contain all the arguments or parameters needed by the function.

Life cycle of a redo log record

The life cycle of a redo log record is as follows:
  • The redo log record is first created by a mini transaction and stored in the mini transaction buffer. It has the information necessary to redo the same operation at the time of database recovery.
  • When mtr_commit() is done, the redo log record is transferred to the global in-memory redo log buffer. If necessary, the redo log buffer will be flushed to the redo log file, to make space for the new redo log record.
  • The redo log record has a specific lsn associated with it. This association is established during mtr_commit(), when the redo log record is transferred from mtr buffer to log buffer. Once the lsn is established for a redo log record, then its location in the redo log file is also established.
  • The redo log record is then transferred from the log buffer to redo log file when it is written + flushed. This means that the redo log record is now durably stored on the disk.
  • Each redo log record has an associated list of dirty pages. This relationship is established via LSN. A redo log record must be flushed to disk, before its associated dirty pages. A redo log record can be discarded only after all the associated dirty pages are flushed to the disk.
  • A redo log record can be used to re-create its associated list of dirty pages. This happens during database recovery.

Conclusion

This article provided an overview of the redo log subsystem of InnoDB storage engine. The major data structures used in the redo log subsystem are the mini transaction (mtr), the in-memory redo log buffer and the on-disk redo log files. The InnoDB storage engine tracks many LSN values to ensure that the write-ahead logging (WAL) happens correctly.

Since data loss is unacceptable for an RDBMS, the redo log subsystem is critical for a successful RDBMS. And since redo logs are generated synchronously with DML operations, it is important to generate them efficiently. The size of the redo logs must be kept as small as possible.

Send in your feedback to me.

Tuesday Jan 15, 2013

Repeatable Read Isolation Level in InnoDB - How Consistent Read View Works

This article discusses the approach taken by the InnoDB storage engine of MySQL to provide the REPEATABLE READ isolation level. First, an example is presented to demonstrate the two different designs that are possible. Then the design used in InnoDB is presented, followed by a short discussion of the advantages and disadvantages of this design choice. As part of this discussion, we also present a performance optimization done in MySQL 5.6.

An Example Scenario

I used MySQL 5.5 for this purpose.  Let us create the following tables t1 and t2 in the test database that is available by default. Even though the default storage engine in MySQL 5.5 is InnoDB, I explicitly specify it for clarity.  

mysql> use test;
mysql> create table t1 (f1 int) engine=innodb;
mysql> create table t2 (f1 int) engine=innodb;
mysql> insert into t1 values (1);
mysql> insert into t2 values (1);

Consider the following scenario (transactions are started with REPEATABLE READ isolation level):

S. No. | Session 1           | Session 2
-------+---------------------+--------------------------
1      | start transaction;  |
2      | select f1 from t1;  |
3      |                     | start transaction;
4      |                     | update t2 set f1 = f1+1;
5      |                     | commit;
6      | select f1 from t2;  |


Please go through the above scenario and determine what the result set would be in Session 1, line 6, for the query “select f1 from t2”. Proceed once you know what you would expect.

By default, the isolation level in InnoDB is repeatable read. So in lines 1 and 3 the transactions would be started with the repeatable read isolation level. The output of the query in line 6 above will depend on the design approach taken to implement this isolation level. The two possible design approaches are the traditional locking approach and the snapshot isolation technique.

In the traditional locking approach, a transaction running at the REPEATABLE READ isolation level would acquire a row lock on all the rows that have been read. No gap locks or range locks would be taken. These transactional row locks would be used to provide the semantics of the REPEATABLE READ isolation level. There are two drawbacks to this approach. One, the number of row locks taken can become very high and hence becomes a performance problem. Two, this approach does not prevent phantom reads.

In InnoDB, a REPEATABLE READ isolation level is provided by making use of the snapshot isolation technique.  This technique will be explained in the following sections.  

If the traditional locking approach is taken then the output would be 2 (the new value).  But in InnoDB the output of query “select f1 from t2” in line 6 would be 1 (the old value).   

Creating a Consistent Read View

InnoDB creates a consistent read view or a consistent snapshot either when the statement

mysql> START TRANSACTION WITH CONSISTENT SNAPSHOT;

is executed or when the first select query is executed in the transaction.  In the example given above, the consistent read view is created in Session 1 when the first select query in line 2 is executed.  This consistent read view is at the database level, so further select queries in Session 1 will not see the modifications done by other transactions in any of the tables in the database.  This is the reason why the “select f1 from t2” in Session 1 sees the old tuple instead of the new tuple as updated by Session 2.  But Session 1 can see the modifications done by itself.

When a transaction creates a read view, it creates a read view (read_view_t) object and copies the list of all the active transactions into it.  It uses this information later to provide a consistent view of the database as it existed when the read view was created.   How this is done is detailed in the sections below.  But before we proceed further, there is a small detail that we need to be aware of to understand how the read view works.

Hidden Fields of a InnoDB Table

For all InnoDB tables, 3 fields are internally added to the table – DB_ROW_ID, DB_TRX_ID and DB_ROLL_PTR (refer to InnoDB Multi-Versioning).  They are explained below:

S. No. | System Column | Description                                                                             | Length (bytes)
1      | DB_ROW_ID     | The row identifier                                                                      | 6
2      | DB_TRX_ID     | The identifier of the transaction that created the tuple                               | 6
3      | DB_ROLL_PTR   | The rollback data pointer, pointing to previous versions of the tuple in the undo logs | 7

Update: For tables with an explicit PRIMARY KEY or a UNIQUE NOT NULL key, the DB_ROW_ID will not be stored in the row, even though it will be listed as one of the columns in the data dictionary object for the table (of type dict_table_t).

For our current discussion the focus is on DB_TRX_ID and the DB_ROLL_PTR.  Each tuple or row contains the identifier of the transaction that created the tuple, and a list of previous versions of the tuple.  

In InnoDB, the transaction identifier is an integer value and it is a monotonically increasing value.    So using the transaction identifier one can determine which transaction started earlier and which started later.  

How Consistent Read View Works

With the help of the read view object and the hidden fields in the tuple, a transaction can reconstruct the database as it existed at the time the consistent read view was created. For this purpose, the transaction using the read view has to determine whether it can "see a particular transaction". This is done using the following 2 rules:

Rule 1: When the read view object is created it notes down the smallest transaction identifier that is not yet used as a transaction identifier (trx_sys_t::max_trx_id).   The read view calls it the low limit. So the transaction using the read view must not see any transaction with identifier greater than or equal to this low limit.

Rule 2: The transaction using the read view must not see a transaction that was active when the read view was created.

Whenever a tuple is accessed, the transaction using the read view will check if it can see the transaction that created the tuple (using the two rules mentioned above).  If not, then check if the tuple has a previous version.  If there are no previous versions then ignore the tuple and proceed to the next tuple.  If there are previous versions, then repeat the procedure to find and build the correct version of the tuple to access from the undo log by following the DB_ROLL_PTR.

This avoids the phantom read problem in the REPEATABLE READ isolation level in InnoDB.
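
To see this with the tables created earlier, you can continue the example as follows (an illustrative extension, not part of the original scenario): while Session 1 still holds its read view, let Session 2 insert and commit a new row, and then repeat the select in Session 1.

-- Session 2:
mysql> insert into t2 values (100);
mysql> commit;

-- Session 1 (same transaction as before):
mysql> select f1 from t2;   -- the newly committed row is not returned; no phantom appears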

Purpose of Read View

Why create the read view in the first place? To provide the repeatable read isolation level, creating a read view and later accessing the various versions of the row without locking has a performance benefit. The alternative would be to take a read lock on all the rows that were read in a transaction with the repeatable read isolation level. Accessing a large table with millions of records would generate that many shared row locks. For this performance gain, a read view is created.

Performance Problem in MySQL 5.5

A transaction using the read view must decide whether to see the rows created by a particular transaction. This is done with the help of the two rules given above in the section “How Consistent Read View Works.” The first rule is simple and doesn't involve much copying. But the 2nd rule is expensive. Because of the 2nd rule, when the read view object is created, the complete list of active transaction identifiers is copied. The number of bytes copied depends on the number of active transactions in the database system when the read view is created.

The performance problem is not only because of the number of bytes copied during the read view creation, but also because of the actual traversal of the list of all the active transactions.  This traversal is done under the protection of a mutex.  So when there are many transactions trying to create a read view concurrently, the contention for this mutex becomes high.

The Read Only Optimization in MySQL 5.6

A recent discussion on the InnoDB developers mailing list suggested that read view creation can be a very expensive operation if the number of active concurrent transactions is very high (say in the thousands). One of the optimizations that was discussed involved identifying the read-only transactions and ignoring them during read view creation.

Since read-only transactions do not make any changes that change the state of the tuple they are not required to be part of the read view. If the read-only transactions are ignored, then the number of transaction identifiers that needs to be copied during the read view creation can be reduced significantly.   This will also reduce the mutex contention discussed earlier.  This idea is based on the reasonable assumption that at any point of time, there will be a mix of read-only and read-write active transactions in a database system and that the read-only transactions will dominate the work load.

To facilitate this, the Transaction Subsystem of InnoDB in MySQL 5.6 contains two lists of transactions, one for the read-write transactions (trx_sys_t::rw_trx_list) and the other for read-only transactions (trx_sys_t::ro_trx_list). For creating the read view, only the list of read-write transactions is used. The length of the transaction list to be traversed to create a read view is now smaller, and the number of bytes copied is also less. This makes the read view creation faster and more scalable.

Refer to Optimizations for Read-Only Transactions in MySQL 5.6 Reference Manual for further details.

Conclusion

In this article, we saw how InnoDB provides the repeatable read isolation level using a consistent read view or consistent snapshot.  Because of this approach, InnoDB is able to avoid the creation of many shared row locks and hence is able to achieve better performance.  The phantom read problem is also avoided.  We also saw how the consistent read view works.  

We discussed a particular performance problem in MySQL 5.5 and presented an optimization done in MySQL 5.6, which makes the read view creation faster and more scalable.  In certain highly concurrent situations, the current design of read view does have some performance challenges and MySQL team at Oracle is actively working to find solutions.  

For further reading refer to the Section “Consistent Nonlocking Reads” of the MySQL Reference Manual.  Send your feedback to annamalai.gurusami@oracle.com.

Saturday Sep 29, 2012

Online ALTER TABLE in MySQL 5.6

This is the low-level view of data definition language (DDL) operations in the InnoDB storage engine in MySQL 5.6. John Russell gave a more high-level view in his blog post April 2012 Labs Release – Online DDL Improvements.

MySQL before the InnoDB Plugin

Traditionally, the MySQL storage engine interface has taken a minimalistic approach to data definition language. The only natively supported operations were CREATE TABLE, DROP TABLE and RENAME TABLE. Consider the following example:

CREATE TABLE t(a INT);
INSERT INTO t VALUES (1),(2),(3);
CREATE INDEX a ON t(a);
DROP TABLE t;

The CREATE INDEX statement would be executed roughly as follows:

CREATE TABLE temp(a INT, INDEX(a));
INSERT INTO temp SELECT * FROM t;
RENAME TABLE t TO temp2;
RENAME TABLE temp TO t;
DROP TABLE temp2;

You could imagine that the database could crash when copying all rows from the original table to the new one. For example, it could run out of file space. Then, on restart, InnoDB would roll back the huge INSERT transaction. To fix things a little, a hack was added to ha_innobase::write_row for committing the transaction every 10,000 rows.

Still, it was frustrating that even a simple DROP INDEX would make the table unavailable for modifications for a long time.

Fast Index Creation in the InnoDB Plugin of MySQL 5.1

MySQL 5.1 introduced a new interface for CREATE INDEX and DROP INDEX. The old table-copying approach can still be forced by SET old_alter_table=1.

This interface is used in MySQL 5.5 and in the InnoDB Plugin for MySQL 5.1. Apart from the ability to do a quick DROP INDEX, the main advantage is that InnoDB will execute a merge-sort algorithm before inserting the index records into each index that is being created. This should speed up the insert into the secondary index B-trees and potentially result in a better B-tree fill factor.

The 5.1 ALTER TABLE interface was not perfect. For example, DROP FOREIGN KEY still invoked the table copy. Renaming columns could conflict with InnoDB foreign key constraints. Combining ADD KEY and DROP KEY in ALTER TABLE was problematic and not atomic inside the storage engine.

The ALTER TABLE interface in MySQL 5.6

The ALTER TABLE storage engine interface was completely rewritten in MySQL 5.6. Instead of introducing a method call for every conceivable operation, MySQL 5.6 introduced a handful of methods, and data structures that keep track of the requested changes.

In MySQL 5.6, online ALTER TABLE operation can be requested by specifying LOCK=NONE. Also LOCK=SHARED and LOCK=EXCLUSIVE are available. The old-style table copying can be requested by ALGORITHM=COPY. That one will require at least LOCK=SHARED. From the InnoDB point of view, anything that is possible with LOCK=EXCLUSIVE is also possible with LOCK=SHARED.
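
For example, using the table t from the earlier example, an online in-place index creation could be requested roughly as follows (a sketch; the index name is arbitrary), with the old copying behavior shown as a commented-out alternative:

ALTER TABLE t ADD INDEX a (a), ALGORITHM=INPLACE, LOCK=NONE;
-- or, to force the old table-copying approach instead:
-- ALTER TABLE t ADD INDEX a (a), ALGORITHM=COPY, LOCK=SHARED;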

Most ALGORITHM=INPLACE operations inside InnoDB can be executed online (LOCK=NONE). InnoDB will always require an exclusive table lock in two phases of the operation. The execution phases are tied to a number of methods:

handler::check_if_supported_inplace_alter
Checks if the storage engine can perform all requested operations, and if so, what kind of locking is needed.
handler::prepare_inplace_alter_table
InnoDB uses this method to set up the data dictionary cache for the upcoming CREATE INDEX operation. We need stubs for the new indexes, so that we can keep track of changes to the table during online index creation. Also, crash recovery would drop any indexes that were incomplete at the time of the crash.
handler::inplace_alter_table
In InnoDB, this method is used for creating secondary indexes or for rebuilding the table. This is the ‘main’ phase that can be executed online (with concurrent writes to the table).
handler::commit_inplace_alter_table
This is where the operation is committed or rolled back. Here, InnoDB would drop any indexes, rename any columns, drop or add foreign keys, and finalize a table rebuild or index creation. It would also discard any logs that were set up for online index creation or table rebuild.

The prepare and commit phases require an exclusive lock, blocking all access to the table. If MySQL times out while upgrading the table meta-data lock for the commit phase, it will roll back the ALTER TABLE operation.

In MySQL 5.6, data definition language operations are still not fully atomic, because the data dictionary is split. Part of it is inside InnoDB data dictionary tables. Part of the information is only available in the *.frm file, which is not covered by any crash recovery log. But, there is a single commit phase inside the storage engine.

Online Secondary Index Creation

It may occur that an index needs to be created on a new column to speed up queries. But, it may be unacceptable to block modifications on the table while creating the index.

It turns out that it is conceptually not so hard to support online index creation. All we need is some more execution phases:

  1. Set up a stub for the index, for logging changes.
  2. Scan the table for index records.
  3. Sort the index records.
  4. Bulk load the index records.
  5. Apply the logged changes.
  6. Replace the stub with the actual index.

Threads that modify the table will log the operations to the logs of each index that is being created. Errors, such as log overflow or uniqueness violations, will only be flagged by the ALTER TABLE thread. The log is conceptually similar to the InnoDB change buffer.

The bulk load of index records will bypass record locking. We still generate redo log for writing the index pages. It would suffice to log page allocations only, and to flush the index pages from the buffer pool to the file system upon completion.

Native ALTER TABLE

Starting with MySQL 5.6, InnoDB supports most ALTER TABLE operations natively. The notable exceptions are changes to the column type, ADD FOREIGN KEY except when foreign_key_checks=0, and changes to tables that contain FULLTEXT indexes.

The keyword ALGORITHM=INPLACE is somewhat misleading, because certain operations cannot be performed in-place. For example, changing the ROW_FORMAT of a table requires a rebuild.

Online operation (LOCK=NONE) is not allowed in the following cases:

  • when adding an AUTO_INCREMENT column,
  • when the table contains FULLTEXT indexes or a hidden FTS_DOC_ID column, or
  • when there are FOREIGN KEY constraints referring to the table, with ON…CASCADE or ON…SET NULL option.

The FOREIGN KEY limitations are needed, because MySQL does not acquire meta-data locks on the child or parent tables when executing SQL statements.

Theoretically, InnoDB could support operations like ADD COLUMN and DROP COLUMN in-place, by lazily converting the table to a newer format. This would require that the data dictionary keep multiple versions of the table definition. For simplicity, we will copy the entire table, even for DROP COLUMN.

The bulk copying of the table will bypass record locking and undo logging. For facilitating online operation, a temporary log will be associated with the clustered index of the table. Threads that modify the table will also write the changes to the log.

When altering the table, we skip all records that have been marked for deletion. In this way, we can simply discard any undo log records that were not yet purged from the original table.

Off-page columns, or BLOBs, are an important consideration. We suspend the purge of delete-marked records if it would free any off-page columns from the old table. This is because the BLOBs can be needed when applying changes from the log. We have special logging for handling the ROLLBACK of an INSERT that inserted new off-page columns. This is because the columns will be freed at rollback.

InnoDB Compression Improvements in MySQL 5.6

MySQL 5.6 comes with significant improvements for the compression support inside InnoDB. The enhancements that we'll talk about in this piece are also a good example of community contributions. The work on these was conceived, implemented and contributed by the engineers at Facebook. Before we plunge into the details let us familiarize ourselves with some of the key concepts surrounding InnoDB compression.

  • In InnoDB, compressed pages are of a fixed size. Supported sizes are 1, 2, 4, 8 and 16K. The compressed page size is specified at table creation time (see the example after this list).
  • InnoDB uses zlib for compression.
  • The InnoDB buffer pool will attempt to cache compressed pages like normal pages. However, whenever a page is actively used by a transaction, we'll always have the uncompressed version of the page as well; i.e., a page can be in the buffer pool in compressed-only form, or in a state where we have both the compressed page and the uncompressed version, but never in uncompressed-only form. On disk, we'll always have only the compressed page.
  • When both compressed and uncompressed images are present in the buffer pool they are always kept in sync i.e.: changes are applied to both atomically.
  • Recompression happens when changes are made to the compressed data. In order to minimize recompressions InnoDB maintains a modification log within a compressed page. This is the extra space available in the page after compression and it is used to log modifications to the compressed data thus avoiding recompressions.
  • DELETE (and ROLLBACK of DELETE) and purge can be performed without recompressing the page. This is because the delete-mark bit and the system fields DB_TRX_ID and DB_ROLL_PTR are stored in uncompressed format on the compressed page. A record can be purged by shuffling entries in the compressed page directory. This can also be useful for updates of indexed columns, because UPDATE of a key is mapped to INSERT+DELETE+purge.
  • A compression failure happens when we attempt to recompress a page and it does not fit in the fixed size. In such case, we first try to reorganize the page and attempt to recompress and if that fails as well then we split the page into two and recompress both pages.
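
As a concrete illustration of the first point above, a compressed table with an 8K compressed page size could be created roughly as follows (a sketch; the table and column names are arbitrary, and in MySQL 5.6 this requires innodb_file_per_table to be enabled and the Barracuda file format):

SET GLOBAL innodb_file_format = 'Barracuda';
CREATE TABLE comp_t (
    id INT PRIMARY KEY,
    payload VARCHAR(200)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;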

Now let's talk about the three major improvements that we made in MySQL 5.6.

Logging of Compressed Page Images:
InnoDB used to log entire compressed data on the page to the redo logs when recompression happens. This was an extra safety measure to guard against the rare case where an attempt is made to do recovery using a different zlib version from the one that was used before the crash. Because recovery is a page level operation in InnoDB we have to be sure that all recompress attempts must succeed without causing a btree page split. However, writing entire compressed data images to the redo log files not only makes the operation heavy duty but can also adversely affect flushing activity. This happens because redo space is used in a circular fashion and when we generate much more than normal redo we fill up the space much more quickly and in order to reuse the redo space we have to flush the corresponding dirty pages from the buffer pool.
Starting with MySQL 5.6, a new global configuration parameter, innodb_log_compressed_pages, is introduced. The default value is true, which gives the same behavior as before. If you are sure that you are not going to attempt to recover from a crash using a different version of zlib, then you should set this parameter to false. This is a dynamic parameter.
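
Since this is a dynamic parameter, it can be turned off at runtime, for example:

SET GLOBAL innodb_log_compressed_pages = OFF;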

Compression Level:
You can now set the compression level that zlib should use to compress the data. The global parameter is innodb_compression_level - the default value is 6 (the zlib default) and the allowed values are 1 to 9. Again, the parameter is dynamic, i.e., you can change it on the fly.
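
For example, to trade more CPU for better compression you could raise the level on the fly (9 is the maximum allowed value mentioned above):

SET GLOBAL innodb_compression_level = 9;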

Dynamic Padding to Reduce Compression Failures:
Compression failures are expensive in terms of CPU. We go through the hoops of recompress, failure, reorganize, recompress, failure and finally page split. At the same time, how often we encounter compression failures depends largely on the compressibility of the data. In MySQL 5.6, courtesy of Facebook engineers, we have an adaptive algorithm based on per-index statistics that we gather about compression operations. The idea is that if a certain index/table is experiencing too many compression failures, then we should try to pack the 16K uncompressed version of the page less densely, i.e., we let some space in the 16K page go unused in the hope that the recompression won't end up in a failure. In other words, we dynamically keep adding 'pad' to the 16K page until compression failures fall within an agreeable range. It works the other way as well; that is, we'll keep removing the pad if the failure rate is fairly low. To tune the padding effort, two configuration variables are exposed (see the example after the list below).

  • innodb_compression_failure_threshold_pct: default 5, range 0 - 100, dynamic, the percentage of compression operations that must fail before we start padding. The value 0 has the special meaning of disabling padding.
  • innodb_compression_pad_pct_max: default 50, range 0 - 75, dynamic, the maximum percentage of the uncompressed data page that can be reserved as pad.
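
Both variables are dynamic, so the padding behavior can be tuned at runtime; the values below are only illustrative:

SET GLOBAL innodb_compression_failure_threshold_pct = 10;
SET GLOBAL innodb_compression_pad_pct_max = 40;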


New Enhancements for InnoDB Memcached

In MySQL 5.6, we continued our development on InnoDB Memcached and completed a few widely desired features that make InnoDB Memcached competitive in more scenarios. Notably, they are:

1) Support multiple table mapping

2) Added background thread to auto-commit long running transactions

3) Enhancement in binlog performance

Let’s go over each of these features one by one. In the last section, we will review a couple of internally performed performance tests.

Support multiple table mapping

In our earlier release, all InnoDB Memcached operations were mapped to a single InnoDB table. In real life, users might want to use the InnoDB Memcached feature on different tables. Thus, being able to access different tables at run time, and having different mappings for different connections, becomes a very desirable feature. In this GA release, we allow users to do both. We will discuss the key concepts and the key steps for using this feature.

1) "mapping name" in the "get" and "set" command

In order to allow InnoDB Memcached to map to a new table, the user (DBA) is still required to "pre-register" the table(s) in the InnoDB Memcached “containers” table (there are security considerations behind this requirement). If you would like to know more about the “containers” table, please refer to my earlier blogs on blogs.innodb.com. Once registered, InnoDB Memcached will be able to look up such tables when they are referred to.

Each such registered table will have a unique "registration name" (or mapping_name) corresponding to the “name” field in the “containers” table. To access these tables, the user includes the "registration name" in the get or set command, in the form "get @@new_mapping_name.key"; the "@@" prefix is required to signal a mapped-table change. The key and the "mapping name" are separated by a configurable delimiter, which by default is ".". So the syntax is:

get [@@mapping_name.]key_name

set [@@mapping_name.]key_name

 or

 get @@mapping_name

set @@mapping_name

Here is an example:

Let's set up three tables in the "containers" table:

The first maps to the InnoDB table "test/demo_test" with mapping name "setup_1":

INSERT INTO containers VALUES ("setup_1", "test", "demo_test",

"c1", "c2", "c3", "c4", "c5", "PRIMARY");

Similarly, we set up table mappings for table "test/new_demo" with name "setup_2" and for table "my_database/my_demo" with name "setup_3":

INSERT INTO containers VALUES ("setup_2", "test", "new_demo", "c1", "c2",

"c3", "c4", "c5", "secondary_index_x");

INSERT INTO containers VALUES ("setup_3", "my_database", "my_demo",

"c1", "c2", "c3", "c4", "c5", "idx");

To switch to table "my_database/my_demo" and get the value corresponding to “key_a”, the user will do:

get @@setup_3.key_a

(this will also output the value corresponding to key "key_a")

or simply

get @@setup_3

Once this is done, the connection will switch to the "my_database/my_demo" table until another table mapping switch is requested, so it can continue to issue regular commands like:

get key_b

 set key_c 0 0 7

These DMLs will all be directed to "my_database/my_demo" table.

And this also implies that different connections can have different bindings (to different tables).

2) Delimiter:

For the delimiter "." that separates the "mapping name" and the key value, we also added a configuration option in the "config_options" system table with the name "table_map_delimiter":

INSERT INTO config_options VALUES("table_map_delimiter", ".");

So if users want to use a different delimiter, they can change it in the config_options table.

3) Default mapping:

Once we have multiple table mappings, there should always be a "default" mapping. For this, we decided that if there exists a mapping named "default", then it will be chosen as the default mapping. Otherwise, the first row of the containers table will be chosen as the default setting.

Please note, user tables can be repeated in the "containers" table (for example, if a user wants to access different columns of the table in different settings), as long as they use different mapping/configuration names in the first column, which is enforced by a unique index.

4) bind command

In addition, we also extended the protocol and added a bind command; its usage is fairly straightforward. To switch to the "setup_3" mapping above, you simply issue:

bind setup_3

This will switch this connection's InnoDB table to "my_database/my_demo"

In summary, with this feature, you can now directly access different tables from different sessions. And even within a single connection, you can query different tables.

Background thread to auto-commit long running transactions

This is a feature related to the “batch” concept we discussed in earlier blogs. This “batch” feature allows us to batch read and write operations, and commit them only after a certain number of calls. The “batch” size is controlled by the configuration parameters “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size”. This can significantly boost performance.

However, it also comes with some disadvantages. For example, you will not be able to view “uncommitted” operations from the SQL end unless you set the transaction isolation level to READ UNCOMMITTED. In addition, certain row locks will be held for an extended period of time, which might reduce concurrency.

To deal with this, we introduced a background thread that “auto-commits” transactions if they have been idle for a certain amount of time (the default is 5 seconds). The background thread wakes up every second and loops through every “connection” opened by Memcached, checking for idle transactions. If such a transaction has been idle longer than the limit and is not being used, it will be committed. This limit is configurable by changing “innodb_api_bk_commit_interval”. Its default value is 5 seconds, the minimum is 1 second, and the maximum is 1073741824 seconds.
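
For example, the idle-commit interval can be lowered to 3 seconds (an illustrative value; this assumes the variable can be changed dynamically on your build):

SET GLOBAL innodb_api_bk_commit_interval = 3;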

With the help of this background thread, you will not need to worry about long-running uncommitted transactions when setting daemon_memcached_w_batch_size and daemon_memcached_r_batch_size to a large number. This also reduces the number of locks that could be held due to long-running transactions, and thus further increases concurrency.

Enhancement in binlog performance

As you might all know, binlog operations are not done by the InnoDB storage engine; rather, they are handled in the MySQL layer. In order to support binlog operations through InnoDB Memcached, we have to artificially create some MySQL constructs in order to access the binlog handler APIs. In the previous lab release, for simplicity, we opened and destroyed these MySQL constructs (such as THD) for each operation. This required us to always set the “batch” size to 1 when the binlog is on, no matter what “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size” are configured to. This put a big restriction on our capability to scale, and there was also quite a bit of overhead in creating and destroying such constructs that bogged the performance down.

With this release, we made the necessary changes to keep the MySQL constructs as long as they are valid for a particular connection, so there are no repeated and redundant open and close (table) calls. Now, even with the binlog option enabled (with innodb_api_enable_binlog), we can still batch the transactions with daemon_memcached_w_batch_size and daemon_memcached_r_batch_size, and thus scale the write/read performance.

Although there is still overhead that keeps InnoDB Memcached from performing as fast as when the binlog is turned off, it is much better compared to the previous release. And we are continuing to optimize the solution in this area to improve the performance as much as possible.

Performance Study:

Amerandra of our System QA team has conducted some performance studies on queries through our InnoDB Memcached connection and through the plain SQL end, and they show some interesting results.

The test is conducted on a “Linux 2.6.32-300.7.1.el6uek.x86_64 ix86 (64)” machine with 16 GB Memory, Intel Xeon 2.0 GHz CPU X86_64 2 CPUs- 4 Core Each, 2 RAID DISKS (1027 GB,733.9GB). Results are described in following tables:

Table 1: Performance comparison on Set operations

Connections | 5.6.7-RC-Memcached-plugin (memcached-threads=8***) Set (QPS) | 5.6.7-RC* Set (QPS)** | X faster
8           | 30,000                                                        | 5,600                 | 5.36
32          | 59,000                                                        | 13,000                | 4.54
128         | 68,000                                                        | 8,000                 | 8.50
512         | 63,000                                                        | 6,800                 | 9.23

* mysql-5.6.7-rc-linux2.6-x86_64

** The “set” operation, when implemented in InnoDB Memcached, involves a couple of DMLs: it first queries the table to see whether the “key” exists; if it does not, the new key/value pair is inserted; if it does exist, the “value” field of the matching row (by key) is updated. When used in the above comparison on the SQL side, this is a precompiled stored procedure, and the query just executes that procedure.

*** added “--daemon_memcached_option=-t8” (default is 4 threads)

So we can see that with this “set” query, InnoDB Memcached can run 4.5 to 9 times faster than the MySQL server.
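For reference, one memcached “set” through the plugin corresponds roughly to the following SQL against the mapped table (the table and column names below are only illustrative, not the actual benchmark schema):

INSERT INTO demo_db.demo_kv (id, value)
VALUES (42, 'some value')
ON DUPLICATE KEY UPDATE value = VALUES(value);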

Table 2: Performance comparison on Get operations

Connections | 5.6.7-RC Memcached plugin Get (QPS), memcached-threads=8 | 5.6.7-RC* SQL Get (QPS) | X faster
8           | 42,000                                                   | 27,000                  | 1.56
32          | 101,000                                                  | 55,000                  | 1.83
128         | 117,000                                                  | 52,000                  | 2.25
512         | 109,000                                                  | 52,000                  | 2.10

With the “get” query (that is, a single-row select), memcached performs 1.5 to 2 times faster than normal SQL.
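Similarly, a memcached “get” is roughly a single-row primary key lookup on the mapped table (again, the names below are only illustrative):

SELECT value FROM demo_db.demo_kv WHERE id = 42;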

Summary:

In summary, we added several much-desired features to InnoDB Memcached in this release, allowing users to operate on different tables through the Memcached interface. We now also provide a background commit thread that commits long-running idle transactions, which allows users to configure large read/write batches without worrying about a large number of rows being locked or uncommitted data being invisible. We also greatly enhanced the performance when the binlog is enabled. We will continue making efforts in both performance and functionality to make InnoDB Memcached a good showcase for our InnoDB APIs.

Jimmy Yang, September 29, 2012

Monday Apr 16, 2012

Optimizing neighbor flush behavior

Note: this article was originally published on http://blogs.innodb.com on April 16, 2012 by Yasufumi Kinoshita.

The performance of InnoDB's flush_list flushing determines the baseline performance of workloads that modify data, so it is important to optimize the flush behavior. In this post we’ll consider how to optimize the neighbor-flushing behavior.

Factor 1: Characteristics of storage

Depending on the characteristics of your storage’s write IO throughput, you can classify your storage as either “write amount bound” or “write times bound”. The minimum IO unit of an InnoDB datafile is the page size (16KB or less), and InnoDB attempts to combine contiguous pages into a single IO of up to 1 extent (1MB).

<one HDD>: Almost always “write times bound”, because head-seek time is the dominant factor in HDD access time, and around 1MB can be written with a single head seek.

<RAID-HDD>: It depends on the striping size of the RAID. In many cases the striping size is set to 256KB ~ 1MB (much larger than the page size of the datafile), with the intention that one IO maps to one HDD (both to keep the sequential access advantage of the HDD and to keep the ability to serve IO requests in parallel across the several HDDs in the RAID). For such typical striping sizes, RAID-HDD is “write times bound”. (For a small striping size around the same size as the page size, it should be “write amount bound”, but I don’t recommend such a small striping size from the viewpoint of this post, because it just loses the sequential access advantage.)

<SSD>: It depends on the internal write unit of the SSD. For newer high-end SSDs the unit is 4KB or more, which is not larger than the InnoDB page size; such high-end SSDs are “write amount bound”. However, the unit size differs greatly between SSD internal implementations. Low-end or older SSDs might have a unit size over 1MB (and their throughput might be slow) and might be “write times bound”. You can estimate the write unit size of your SSD with a random write benchmark at several block sizes (4KB, 8KB, 16KB, …, 1MB); the largest block size that still behaves as “write times bound” can be taken as the unit size.

Factor 2: Oldest modified age

The redo log in InnoDB is used in a circular fashion. The reusable redo space is limited by the oldest modification in the oldest modified block; that is, the maximum oldest modified age, which is equal to the current LSN (Log Sequence Number) minus the LSN of the oldest modification, cannot be higher than the log capacity of the redo log files. When there is no reusable redo space available, other modification operations cannot proceed until the oldest modified age is decreased by flushing the oldest dirty pages.

The throughput of flushing the oldest dirty pages therefore decides the workload throughput, so it is important to use the limited write IO budget effectively for flushing the “oldest” dirty pages.

Making flush_list flushing effective

The first priority of flushing is to reduce the oldest modified age, assuming there is no shortage of free blocks. So “flushing the oldest blocks only” is the basic strategy.

For “write amount bound” storage (e.g. a high-end SSD), this is already the best strategy. It is equivalent to “innodb_flush_neighbors = false”.

On the other hand, for “write times bound” storage (e.g. HDD-based), the contiguous dirty neighbors of the oldest dirty pages can be flushed without wasting the write IO budget, thanks to the sequential access advantage, so flushing the contiguous pages as well is worthwhile. But non-contiguous and non-oldest blocks should not be flushed at the same time, because a non-contiguous flush becomes another IO request and is very likely to be treated as another raw block write by the storage (wasting the write IO budget).

The traditional flush_neighbors implementation in InnoDB also flushes non-contiguous dirty blocks, which is not the best behavior for either type of storage, “write times bound” or “write amount bound”. In the MySQL labs release 2012 we have fixed this behavior to flush contiguous pages only, for “write times bound” storage.

Conclusion

In the end, the conclusions are the following:

  • For HDD or HDD-RAID (stripe size about 256KB ~ 1MB): use the new flush_neighbors (flushing contiguous dirty blocks only)
  • For SSD (internal write unit size <= InnoDB data page size): disable flush_neighbors (see the example below)
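As a quick reference, the setting can be changed at runtime; the sketch below assumes the innodb_flush_neighbors variable as it appears in MySQL 5.6 (a dynamic global variable):

-- Disable neighbor flushing on "write amount bound" storage such as a high-end SSD
SET GLOBAL innodb_flush_neighbors = 0;

-- Keep (or restore) contiguous-neighbor flushing on "write times bound" storage such as HDD or RAID-HDD
SET GLOBAL innodb_flush_neighbors = 1;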

Friday Apr 13, 2012

New flushing algorithm in InnoDB

Note: this article was originally published on http://blogs.innodb.com on April 13, 2012 by Inaam Rana.

In MySQL labs release April 2012 we have reworked the flushing heuristics in InnoDB. In this post I’ll give an overview of what we have changed and the various configuration variables we have introduced to fine tune the flushing algorithm. If you are interested in finding out how our new flushing algorithm fares in benchmarks you can get these details in Dimitri’s well-explained blog here.

Flushing means writing dirty pages to disk. I have explained in some detail about adaptive_flushing and types of flushing in my previous notes. Please go through these notes if you want to make sense of what follows.

The page_cleaner thread checks the state of the system every second, taking into account the number of dirty pages, the amount of reusable redo space, the rate at which redo is generated, and the IO capacity for which the server is configured; based on these factors it decides how many pages need to be flushed.

In the new scheme of things the page_cleaner thread uses a single non-linear formula to calculate how many pages need to be flushed to keep sufficient reusable redo space. This is different from the current flushing heuristic, where async_water_mark is taken as the point where we change our flushing algorithm. Similarly, instead of using innodb_max_dirty_pages_pct as a switch that triggers flushing, we have introduced the concept of a range: flushing to control the dirty-page percentage starts once we cross the low water mark and gets more and more aggressive as we near the high water mark.

There are four new configuration variables. Note that if your system is not experiencing any IO spikes due to checkpoints then you can probably leave all of the following as is. All the variables are global in scope and can be set dynamically.

  • innodb_adaptive_flushing_lwm: Low water mark, measured in %age of total redo log size, at which adaptive flushing kicks in. If the currently unusable redo space is less than this value, no background flushing will happen. Default value is 10 and permissible values are 0 – 70.
  • innodb_max_dirty_pages_pct_lwm: Low water mark of dirty pages in %age where preflushing to control the dirty page ratio kicks in. Default value is 0, which has the special meaning that this setting has no effect. Allowable values are 0 – 99.
  • innodb_max_io_capacity: InnoDB generally attempts to work within the limits of innodb_io_capacity. However, if it needs to do aggressive flushing then innodb_max_io_capacity defines the limit to which the write IOPs can be stretched. Default value is 2000, which is ten times the default value for innodb_io_capacity.
  • innodb_flushing_avg_loops: Number of iterations for which we keep the previously calculated snapshot of the flushing state. This variable is roughly a measure of how smooth you want the transition in the flushing activity to be. The higher the value, the smoother the transition in flushing in the face of a rapidly changing workload. A lower value implies that the flushing algorithm is more responsive, but it also means that flushing activity can become spiky when confronted with a quickly changing workload. Default value is 30 and the permissible range is 1 – 1000. (A usage example follows the list.)
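For example, since all four variables are global and dynamic, they can be adjusted at runtime with SET GLOBAL; the values below are purely illustrative and stay within the documented ranges, using the variable names as given in this labs release:

SET GLOBAL innodb_adaptive_flushing_lwm = 15;
SET GLOBAL innodb_max_dirty_pages_pct_lwm = 60;
SET GLOBAL innodb_max_io_capacity = 4000;
SET GLOBAL innodb_flushing_avg_loops = 60;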

Finally, there is a wealth of information available to you to monitor the impact of changing these variables. You can get a good inside view of how the flushing activity is working by querying the innodb_metrics table.

mysql> select name, comment from information_schema.innodb_metrics where name like 'Buffer_flush%';
+-------------------------------------+----------------------------------------------------------------+
| name                                | comment                                                        |
+-------------------------------------+----------------------------------------------------------------+
| buffer_flush_batch_scanned          | Total pages scanned as part of flush batch                     |
| buffer_flush_batch_num_scan         | Number of times buffer flush list flush is called              |
| buffer_flush_batch_scanned_per_call | Pages scanned per flush batch scan                             |
| buffer_flush_batch_total_pages      | Total pages flushed as part of flush batch                     |
| buffer_flush_batches                | Number of flush batches                                        |
| buffer_flush_batch_pages            | Pages queued as a flush batch                                  |
| buffer_flush_neighbor_total_pages   | Total neighbors flushed as part of neighbor flush              |
| buffer_flush_neighbor               | Number of times neighbors flushing is invoked                  |
| buffer_flush_neighbor_pages         | Pages queued as a neighbor batch                               |
| buffer_flush_n_to_flush_requested   | Number of pages requested for flushing.                        |
| buffer_flush_avg_page_rate          | Average number of pages at which flushing is happening         |
| buffer_flush_lsn_avg_rate           | Average redo generation rate                                   |
| buffer_flush_pct_for_dirty          | Percent of IO capacity used to avoid max dirty page limit      |
| buffer_flush_pct_for_lsn            | Percent of IO capacity used to avoid reusable redo space limit |
| buffer_flush_adaptive_total_pages   | Total pages flushed as part of adaptive flushing               |
| buffer_flush_adaptive               | Number of adaptive batches                                     |
| buffer_flush_adaptive_pages         | Pages queued as an adaptive batch                              |
| buffer_flush_sync_total_pages       | Total pages flushed as part of sync batches                    |
| buffer_flush_sync                   | Number of sync batches                                         |
| buffer_flush_sync_pages             | Pages queued as a sync batch                                   |
| buffer_flush_background_total_pages | Total pages flushed as part of background batches              |
| buffer_flush_background             | Number of background batches                                   |
| buffer_flush_background_pages       | Pages queued as a background batch                             |
+-------------------------------------+----------------------------------------------------------------+
23 rows in set (0.00 sec)

Wednesday Apr 11, 2012

InnoDB persistent stats got a friendly UI

Note: this article was originally published on http://blogs.innodb.com on April 11, 2012 by Vasil Dimov.

After introducing InnoDB persistent statistics in MySQL 5.6, in this April Labs release we have dressed it up in a nice UI and refactored the internals a bit to make the code more elegant and straight-forward.

The persistent stats are now controlled globally and can also be overridden at table level, should any table require a different behavior.

Global

The server global flag --innodb-stats-persistent (boolean) now controls whether all InnoDB tables use persistent statistics or not. Keep in mind that if a table is using persistent stats then its statistics will not be updated automatically, and you are responsible for running ANALYZE TABLE periodically, whenever you think the table contents have changed too much. Thus the default for --innodb-stats-persistent is currently set to OFF.

Per table

The persistent stats ON/OFF setting can be overridden at a table level with a table option like this:

CREATE TABLE t (a INT) ENGINE=INNODB PERSISTENT_STATS=0|1|default;

or

ALTER TABLE t PERSISTENT_STATS=0|1|default;

where '0' means no persistent stats for this table (regardless of the setting of the global --innodb-stats-persistent), '1' means to always use persistent stats for this table, and 'default' means to use the server global setting, whatever it is.

Another goodie is that now you do not need to restart the server if you have manually updated the stats values in mysql.innodb_table_stats and/or mysql.innodb_index_stats and want your changes to take effect. Just do

FLUSH TABLE t;

to force InnoDB to reread what’s in mysql.innodb_table_stats and mysql.innodb_index_stats.
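For example, a minimal sketch (the table name test.t and the values below are made up) of hand-editing the persisted row estimate and making InnoDB reload it:

USE test;
UPDATE mysql.innodb_table_stats
   SET n_rows = 1000000
 WHERE database_name = 'test' AND table_name = 't';
FLUSH TABLE t;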

What to expect next? Maybe a fully controlled automatic persistent stats update so you do not have to worry about having to run ANALYZE TABLE periodically but still your query plans are stable? Wouldn’t that be nice?

Memcached With SASL Support

Note: this article was originally published on http://blogs.innodb.com on April 11, 2012 by Jimmy Yang.

In this April MySQL Lab release, we’ll provide you with a more robust and release-ready InnoDB Memcached Engine with a few enhancements. The most notable addition is SASL support, which gives users the capability to protect their MySQL database from unauthenticated access through memcached clients. In this blog, I will walk you through the steps to get this option enabled.

Background Info:
SASL stands for “Simple Authentication and Security Layer”, which is a standard for adding authentication support to connection-based protocols. Memcached added SASL support starting with its 1.4.3 release. And here is a good article that gives you some background on why and how SASL is supported in Memcached.

For InnoDB Memcached, the “Memcached mapped” user table must be registered in the “containers” system table, and memcached clients can only access such “registered” tables. Even though the DBA can add access restrictions on such a table, he/she has no control over who can access it through the memcached clients. And this is exactly the reason we want to provide a means (in this case SASL) for the DBA to have some control over who can access the InnoDB tables.

In the following sections, we will go through the steps to build, enable, and test an SASL-enabled InnoDB Memcached plugin.

Steps to Build and Enable SASL in InnoDB Memcached Plugin:

By default, the SASL-enabled InnoDB Memcached is not built (and not included in the release package), since it relies on some SASL libraries to build SASL into the Memcached engine. So you will need to download the source, get the SASL libraries, and rebuild the InnoDB Memcached plugin. The details are described in the following sections:

1) First, you would need to get SASL development and utility libraries. For example, on Ubuntu, you can get these libraries through:

> sudo apt-get -f install libsasl2-2 sasl2-bin libsasl2-2 libsasl2-dev libsasl2-modules

2) Then build the InnoDB Memcached engine plugin (shared libraries) with SASL capability. This is done by adding ENABLE_MEMCACHED_SASL=1 to the cmake options. In addition, Memcached provides simple plaintext password support, which is easier to use for testing, so we support that too; it is enabled by setting the option ENABLE_MEMCACHED_SASL_PWDB=1.

So overall, we need to add the following three options to cmake:

> cmake ... -DWITH_INNODB_MEMCACHED=1
  -DENABLE_MEMCACHED_SASL=1 -DENABLE_MEMCACHED_SASL_PWDB=1

3) The third step is to install the InnoDB Memcached Plugin as before. Please refer to my earlier blog posts on the procedures.

4) As mentioned in step 2), Memcached provides simple plaintext password support through SASL, which will be used for this demo. There is a good blog from Trond Norbye that describes the steps, so you can follow the instructions there too. I will repeat the important steps here.

a) Create a user named “testname” with the password “testpasswd” in a file:

> echo "testname:testpasswd:::::::" >/home/jy/memcached-sasl-db

b) Let memcached know about it by setting environment variable MEMCACHED_SASL_PWDB:

> export MEMCACHED_SASL_PWDB=/home/jy/memcached-sasl-db

c) Also tell memcached that it is a plaintext password:

> echo "mech_list: plain" > /home/jy/work2/msasl/clients/memcached.conf
> export SASL_CONF_PATH=/home/jy/work2/msasl/clients/memcached.conf

5) Then we are ready to restart the server, adding the “daemon_memcached” option “-S” to enable SASL:

> mysqld .. --daemon_memcached_option="-S"

6) Now we are done with the setup. Let’s test it. To do so, you need a SASL-enabled client. I used the SASL-enabled libmemcached as described in Trond Norbye’s blog, and tested it accordingly:

> memcp --servers=localhost:11211 --binary  --username=testname --password=testpasswd myfile.txt
> memcat --servers=localhost:11211 --binary --username=testname --password=testpasswd myfile.txt

Without the appropriate user name or password, the above operations will be rejected with the error message “memcache error AUTHENTICATION FAILURE”. Otherwise, the operations will complete. You can also play with the plaintext password set in /home/jy/memcached-sasl-db to verify it.

There are other methods to test SASL with memcached, but the one described above is the most straightforward.

Other changes for InnoDB Memcached:

Besides the SASL support, there are a few changes in this release that are worth mentioning:
1) We added a configuration option innodb_api_trx_level, so that users can control the transaction isolation level of queries that go through the InnoDB APIs, or in this case, memcached.

In theory, for memcached there is no such concept as “transactions”, so this is an extra property we added on top of it, so that the user has some level of control when DMLs are also issued through the SQL interface. By default, it is set to “read uncommitted”.
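For example (a sketch; the value is illustrative), raising the isolation level used by transactions started through the InnoDB APIs:

-- 0 = READ UNCOMMITTED (default), 2 = REPEATABLE READ
SET GLOBAL innodb_api_trx_level = 2;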

2) Another option we added is innodb_api_enable_mdl; “mdl” stands for “metadata locking”. This basically “locks” the table at the MySQL level, so that the mapped table cannot be dropped or altered by DDL through the SQL interface. Without the lock, the table can be dropped from the MySQL layer, but will be kept in InnoDB storage until memcached or any other user stops using it.

3) A configuration option name change: to enable the binlog, the configuration variable has been renamed from innodb_direct_access_enable_binlog to innodb_api_enable_binlog.
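So enabling the binlog for the memcached plugin now looks something like the sketch below (shown on the server command line only for illustration):

> mysqld ... --log-bin --innodb_api_enable_binlog=1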

Summary:

In summary, this release provides a more robust InnoDB Memcached engine with SASL support. The steps to enable this support are fairly straightforward and almost identical to those you would use to enable SASL for a standalone Memcached server. So if you are familiar with using SASL for memcached, it takes little extra effort to build and enable it here; and if you are not familiar with the operation, the steps above also give you a quick start.

April 2012 Labs Release – Online DDL Improvements

Note: this article was originally published on http://blogs.innodb.com on April 11, 2012 by John Russell.

This feature is a continuation of the “Fast Index Creation” feature introduced in Fast Index Creation in the InnoDB Storage Engine. Now you can perform other kinds of DDL operations on InnoDB tables online: that is, with minimal delay for operations on that table, and without rebuilding the entire table. This enhancement improves responsiveness and availability in busy production environments, where making a table unavailable for seconds or minutes whenever its column definitions change is not practical.

The DDL operations enhanced by this feature are these variations on the ALTER TABLE statement:

  • Create secondary indexes: CREATE INDEX name ON table (col_list) or ALTER TABLE table ADD INDEX name (col_list)

    Drop secondary indexes: DROP INDEX name ON table; or ALTER TABLE table DROP INDEX name

    Creating and dropping secondary indexes on InnoDB tables has avoided the table-copying behavior since the days of MySQL 5.1 with the InnoDB Plugin. Now, the table remains available for read and write operations while the index is being created or dropped. The CREATE INDEX or DROP INDEX statement only finishes after all transactions that are modifying the table are completed, so that the initial state of the index reflects the most recent contents of the table.

    Previously, modifying the table while an index was being created or dropped typically resulted in a deadlock that cancelled the insert, update, or delete statement on the table.

  • Changing the auto-increment value for a column: ALTER TABLE table AUTO_INCREMENT=next_value;

    Especially in a distributed system using replication or sharding, you sometimes reset the auto-increment counter for a table to a specific value. The next row inserted into the table uses the specified value for its auto-increment column. You might also use this technique in a data warehousing environment where you periodically empty all the tables and reload them, and you can restart the auto-increment sequence from 1.

  • Drop a foreign key constraint: ALTER TABLE tbl DROP FOREIGN KEY fk_name

    Currently, online DDL only includes the DROP operation for foreign keys, not ADD to create foreign keys.

    If you do not know the names of the foreign key constraints on a particular table, issue the following statement and find the constraint name in the CONSTRAINT clause for each foreign key:

    show create table table\G
    

    Or, query the information_schema.table_constraints table and use the constraint_name and constraint_type columns to identify the foreign key names.

    As a consequence of this enhancement, you can now also drop a foreign key and its associated index in a single statement, which previously required separate statements in a strict order:

    ALTER TABLE  table DROP FOREIGN KEY  constraint, DROP INDEX index;
    
  • Rename a column: ALTER TABLE tbl CHANGE old_col_name new_col_name datatype

    When you keep the same data type and only change the column name, this operation can always be performed online. As part of this enhancement, you can now rename a column that is part of a foreign key constraint, which was not allowed before.

Note

As your database schema evolves with new columns, data types, constraints, indexes, and so on, keep your CREATE TABLE statements up to date with the latest table definitions. Even with the performance improvements of online DDL, it is more efficient to create stable database structures at the beginning, rather than creating part of the schema and then issuing ALTER TABLE statements afterward.

The main exception to this guideline is for secondary indexes on tables with large numbers of rows. It is typically most efficient to create the table with all details specified except the secondary indexes, load the data, then create the secondary indexes.

Whatever sequence of CREATE TABLE, CREATE INDEX, ALTER TABLE, and similar statements went into putting a table together, you can capture the SQL needed to reconstruct the current form of the table by issuing the statement SHOW CREATE TABLE table\G (uppercase \G required for tidy formatting). This output shows clauses such as numeric precision, NOT NULL, and CHARACTER SET that are sometimes added behind the scenes, and you might otherwise leave out when cloning the table on a new system or setting up foreign key columns with identical type.

Performance and Availability Considerations

Avoiding a table copy during DDL improves several aspects of MySQL operation, such as performance, concurrency, availability, and scalability:

  • By avoiding the disk I/O and CPU cycles to rebuild the table, you minimize the overall load on the database and maintain good performance and high throughput during the DDL operation.

  • Because the DDL operation completes in less time, there is a shorter period when queries and DML operations on that table are blocked, making your application more responsive.

  • Because less data is read into the buffer pool, you avoid purging frequently accessed data from the memory cache, which typically causes a temporary performance dip after a DDL operation.

  • Because there is a shorter period when queries and DML operations queue up waiting for the DDL to finish, there is less locking and waiting for other resources all throughout the MySQL server. Reducing this type of overhead leads to greater scalability, even for operations not involving the table being altered.

Benchmarking

To judge the relative performance of online DDL operations, you can run such operations on a big InnoDB table using current and earlier versions of MySQL. You can also run all the performance tests under the latest MySQL version, simulating the previous DDL behavior for the “before” results, by setting the old_alter_table system variable. Issue the statement set old_alter_table=1 in the session, and measure DDL performance to record the “before” figures. Then set old_alter_table=0 to re-enable the newer, faster behavior, and run the DDL operations again to record the “after” figures.

For a basic idea of whether a DDL operation does its changes in-place or performs a table copy, look at the “rows affected” value displayed after the command finishes. For example, here are lines you might see after doing different types of DDL operations:

  • Changing the default value of a column (super-fast, does not affect the table data at all):

    Query OK, 0 rows affected (0.07 sec)
  • Adding an index (takes time, but 0 rows affected shows that the table is not copied):

    Query OK, 0 rows affected (21.42 sec)
  • Changing the data type of a column (takes substantial time and does require rebuilding all the rows of the table):

    Query OK, 1671168 rows affected (1 min 35.54 sec)

For example, before running a DDL operation on a big table, you might check whether the operation will be fast or slow as follows (a short SQL sketch appears after the list):

  1. Clone the table structure.

  2. Populate the cloned table with a tiny amount of data.

  3. Run the DDL operation on the cloned table.

  4. Check whether the “rows affected” value is zero or not. A non-zero value means the operation will require rebuilding the entire table, which might require special planning. For example, you might do the DDL operation during a period of scheduled downtime, or on each replication slave server one at a time.
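Here is a minimal sketch of that probe procedure; all object names below are hypothetical:

-- 1. Clone the table structure.
CREATE TABLE t1_probe LIKE t1;
-- 2. Populate the clone with a tiny amount of data.
INSERT INTO t1_probe SELECT * FROM t1 LIMIT 100;
-- 3. Run the DDL operation you intend to run on the real table.
ALTER TABLE t1_probe MODIFY some_column BIGINT;
-- 4. Check the "rows affected" value reported for the ALTER TABLE,
--    then discard the probe table.
DROP TABLE t1_probe;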

For a deeper understanding of the reduction in MySQL processing, examine the PERFORMANCE_SCHEMA and INFORMATION_SCHEMA tables related to InnoDB before and after DDL operations, to see the number of physical reads, writes, memory allocations, and so on.

Background Information

Historically, the MySQL server and InnoDB have each kept their own metadata about table and index structures. The MySQL server stores this information in .frm files that are not protected by a transactional mechanism, while InnoDB has its own data dictionary as part of the system tablespace. If a DDL operation was interrupted by a crash or other unexpected event partway through, the metadata could be left inconsistent between these two locations, causing problems such as startup errors or inability to access the table that was being altered. Now that InnoDB is the default storage engine, addressing such issues is a high priority. These enhancements to DDL operations reduce the window of opportunity for such issues to occur.

Examples

Here are code examples showing some operations whose performance, concurrency, and scalability are improved by this new feature.

Example 1. Schema Setup Code for Online DDL Experiments

Here is the code that sets up the initial tables used in these demonstrations:

/*
Setup code for the online DDL demonstration:
- Set up some config variables.
- Create 2 tables that are clones of one of the INFORMATION_SCHEMA tables
  that always has some data. The "small" table has a couple of thousand rows.
  For the "big" table, keep doubling the data until it reaches over a million rows.
- Set up a primary key for the sample tables, since we are demonstrating InnoDB aspects.
*/ 

set autocommit = 0;
set foreign_key_checks = 1;
set global innodb_file_per_table = 1;
set old_alter_table=0;
prompt mysql: 

use test;

\! echo "Setting up 'small' table:"
drop table if exists small_table;
create table small_table as select * from information_schema.columns;
alter table small_table add id int unsigned not null primary key auto_increment;
select count(id) from small_table;

\! echo "Setting up 'big' table:"
drop table if exists big_table;
create table big_table as select * from information_schema.columns;
show create table big_table\G

insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
commit;

alter table big_table add id int unsigned not null primary key auto_increment;
select count(id) from big_table;

Running this code gives this output, condensed for brevity and with the most important points bolded:

Setting up 'small' table:

Query OK, 1678 rows affected (0.13 sec)
Query OK, 1678 rows affected (0.07 sec)

+-----------+
| count(id) |
+-----------+
|      1678 |
+-----------+
1 row in set (0.00 sec)

Setting up 'big' table:

Query OK, 1678 rows affected (0.17 sec)

*************************** 1. row ***************************
       Table: big_table
Create Table: CREATE TABLE `big_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `COLUMN_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `ORDINAL_POSITION` bigint(21) unsigned NOT NULL DEFAULT '0',
  `COLUMN_DEFAULT` longtext CHARACTER SET utf8,
  `IS_NULLABLE` varchar(3) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `DATA_TYPE` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `CHARACTER_MAXIMUM_LENGTH` bigint(21) unsigned DEFAULT NULL,
  `CHARACTER_OCTET_LENGTH` bigint(21) unsigned DEFAULT NULL,
  `NUMERIC_PRECISION` bigint(21) unsigned DEFAULT NULL,
  `NUMERIC_SCALE` bigint(21) unsigned DEFAULT NULL,
  `DATETIME_PRECISION` bigint(21) unsigned DEFAULT NULL,
  `CHARACTER_SET_NAME` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
  `COLLATION_NAME` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
  `COLUMN_TYPE` longtext CHARACTER SET utf8 NOT NULL,
  `COLUMN_KEY` varchar(3) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `EXTRA` varchar(30) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `PRIVILEGES` varchar(80) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Query OK, 1678 rows affected (0.09 sec)
Query OK, 3356 rows affected (0.07 sec)
Query OK, 6712 rows affected (0.17 sec)
Query OK, 13424 rows affected (0.44 sec)
Query OK, 26848 rows affected (0.63 sec)
Query OK, 53696 rows affected (1.72 sec)
Query OK, 107392 rows affected (3.02 sec)
Query OK, 214784 rows affected (6.28 sec)
Query OK, 429568 rows affected (13.25 sec)
Query OK, 859136 rows affected (28.16 sec)
Query OK, 1718272 rows affected (1 min 9.22 sec)

+-----------+
| count(id) |
+-----------+
|   1718272 |
+-----------+
1 row in set (1.75 sec)

Example 2. Speed and Efficiency of CREATE INDEX and DROP INDEX

Here is a sequence of statements demonstrating the relative speed of CREATE INDEX and DROP INDEX statements. For a small table, the elapsed time is less than a second whether we use the fast or slow technique, so we look at the “rows affected” output to verify which operations can avoid the table rebuild. For a large table, the difference in efficiency is obvious because skipping the table rebuild saves substantial time.

\! clear

-- Make sure we're using the new-style fast DDL.
-- Outside of benchmarking and testing, you would
-- never enable the old_alter_table setting.
set old_alter_table=0;

\! echo "=== Create and drop index (small table, new/fast technique) ==="
\! echo
\! echo "Data size (kilobytes) before index created: "
\! du -k data/test/small_table.ibd
create index i_dtyp_small on small_table (data_type);
\! echo "Data size after index created: "
\! du -k data/test/small_table.ibd
drop index i_dtyp_small on small_table;

-- Revert to the older slower DDL for comparison.
set old_alter_table=1;

\! echo "=== Create and drop index (small table, old/slow technique) ==="
\! echo
\! echo "Data size (kilobytes) before index created: "
\! du -k data/test/small_table.ibd
create index i_dtyp_small on small_table (data_type);
\! echo "Data size after index created: "
\! du -k data/test/small_table.ibd
drop index i_dtyp_small on small_table;

-- In the above example, we examined the "rows affected" number,
-- ideally looking for a zero figure. Let's try again with a larger
-- sample size, where we'll see that the actual time taken can
-- vary significantly.

-- Back to the new/fast behavior:
set old_alter_table=0;

\! echo "=== Create and drop index (big table, new/fast technique) ==="
\! echo
\! echo "Data size (kilobytes) before index created: "
\! du -k data/test/big_table.ibd
create index i_dtyp_big on big_table (data_type);
\! echo "Data size after index created: "
\! du -k data/test/big_table.ibd
drop index i_dtyp_big on big_table;

-- Let's see that again, in slow motion:
set old_alter_table=1;

\! echo "=== Create and drop index (big table, old/slow technique) ==="
\! echo
\! echo "Data size (kilobytes) before index created: "
\! du -k data/test/big_table.ibd
create index i_dtyp_big on big_table (data_type);
\! echo "Data size after index created: "
\! du -k data/test/big_table.ibd
drop index i_dtyp_big on big_table;

Running this code gives this output, condensed for brevity and with the most important points bolded:

=== Create and drop index (small table, new/fast technique) ===

Data size (kilobytes) before index created:
384	data/test/small_table.ibd
Query OK, 0 rows affected (0.04 sec)

Data size after index created:
432	data/test/small_table.ibd
Query OK, 0 rows affected (0.02 sec)

=== Create and drop index (small table, old/slow technique) ===

Data size (kilobytes) before index created:
432	data/test/small_table.ibd
Query OK, 1678 rows affected (0.12 sec)

Data size after index created:
448	data/test/small_table.ibd
Query OK, 1678 rows affected (0.10 sec)

=== Create and drop index (big table, new/fast technique) ===

Data size (kilobytes) before index created:
315392	data/test/big_table.ibd
Query OK, 0 rows affected (33.32 sec)

Data size after index created:
335872	data/test/big_table.ibd
Query OK, 0 rows affected (0.02 sec)

=== Create and drop index (big table, old/slow technique) ===

Data size (kilobytes) before index created:
335872	data/test/big_table.ibd
Query OK, 1718272 rows affected (1 min 5.01 sec)

Data size after index created:
348160	data/test/big_table.ibd
Query OK, 1718272 rows affected (46.59 sec)

Example 3. Concurrent DML During CREATE INDEX and DROP INDEX

Here are some snippets of code that I ran in separate mysql sessions connected to the same database, to illustrate DML statements (insert, update, or delete) running at the same time as CREATE INDEX and DROP INDEX.

CREATE INDEX statement (in session 1):

/*
CREATE INDEX statement to run against a table while
insert/update/delete statements are modifying the
column being indexed.
*/

-- We'll run this script in one session, while simultaneously creating and dropping
-- an index on test/big_table.table_name in another session.

use test;
create index i_concurrent on big_table(table_name);

DROP INDEX statement (in session 1):

/*
DROP INDEX statement to run against a table while
insert/update/delete statements are modifying the
column being indexed.
*/

-- We'll run this script in one session, while simultaneously creating and dropping
-- an index on test/big_table.table_name in another session.

use test;
drop index i_concurrent on big_table;

DML statements (in session 2). I ran these DELETE statements while the CREATE INDEX was running.
Because DROP INDEX in this case takes less than a second, I started the DELETE first and then ran the DROP INDEX while the DML was in progress. In each case, the DDL statement waited until the transaction containing the DELETE was finished. (That’s why the timing numbers are higher than in the other examples. I waited for a while before issuing the ROLLBACK statement at the mysql command line.)

/*
Some queries and insert/update/delete statements to run against a table
while an index is being created or dropped. Previously, these operations
would have stalled during the index create/drop period and possibly
timed out or deadlocked.
*/

-- We'll run this script in one session, while simultaneously creating and dropping
-- an index on test/big_table.table_name in another session.

-- In our test instance, that column has about 1.7M rows, with 136 different values.

set autocommit = 0;
use test;

select distinct character_set_name from big_table where table_name = 'FILES';
delete from big_table where table_name = 'FILES';
select distinct character_set_name from big_table where table_name = 'FILES';

-- I'll issue the final rollback interactively, not via script,
-- the better to control the timing.
-- rollback;

Running this code gives this output, condensed for brevity and with the most important points bolded:

mysql: source concurrent_ddl_create.sql
Database changed
Query OK, 0 rows affected (1 min 25.15 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql: source concurrent_ddl_drop.sql
Database changed
Query OK, 0 rows affected (24.98 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql: source concurrent_dml.sql
Query OK, 0 rows affected (0.00 sec)

Database changed
+--------------------+
| character_set_name |
+--------------------+
| NULL               |
| utf8               |
+--------------------+
2 rows in set (0.32 sec)

Query OK, 38912 rows affected (1.84 sec)

Empty set (0.01 sec)

mysql: rollback;
Query OK, 0 rows affected (1.05 sec)

Example 4. Renaming a Column

Here is a demonstration of using ALTER TABLE to rename a column. We use the new, fast DDL mechanism to change the name, then the old, slow DDL mechanism (with old_alter_table=1) to restore the original column name.

Note
  • Because the syntax for renaming a column also involves re-specifying the data type, be very careful to specify exactly the same data type to avoid a costly table rebuild. In this case, we checked the output of show create table table\G and copied any clauses such as CHARACTER SET and NOT NULL from the original column definition.

  • Again, renaming a column for a small table is fast enough that we need to examine the “rows affected” number to verify that the new DDL mechanism is more efficient than the old one. With a big table, the difference in elapsed time makes the improvement obvious.

/*
Run through a sequence of 'rename column' statements.
Because this operation involves only metadata, not table data,
it is fast for big and small tables, with new or old DDL mechanisms.
*/

\! clear

\! echo "Rename column (fast technique, small table):"
set old_alter_table=0;
alter table small_table change `IS_NULLABLE` `NULLABLE`
  varchar(3) character set utf8 not null;
\! echo "Rename back to original name (slow technique):"
set old_alter_table=1;
alter table small_table change `NULLABLE` `IS_NULLABLE`
  varchar(3) character set utf8 not null;

\! echo "Rename column (fast technique, big table):"
set old_alter_table=0;
alter table big_table change `IS_NULLABLE` `NULLABLE`
  varchar(3) character set utf8 not null;
\! echo "Rename back to original name (slow technique):"
set old_alter_table=1;
alter table big_table change `NULLABLE` `IS_NULLABLE`
  varchar(3) character set utf8 not null;
set old_alter_table=0;

Running this code gives this output, condensed for brevity and with the most important points bolded:

Rename column (fast technique, small table):
Query OK, 0 rows affected (0.13 sec)

Rename back to original name (slow technique):
Query OK, 1678 rows affected (0.35 sec)

Rename column (fast technique, big table):
Query OK, 0 rows affected (0.11 sec)

Rename back to original name (slow technique):
Query OK, 1718272 rows affected (1 min 0.00 sec)

Example 5. Dropping Foreign Keys

Here is a demonstration of foreign keys, including improvement to the speed of dropping a foreign key constraint.

/*
Demonstrate aspects of foreign keys that are or aren't affected by the DDL improvements.
- Create a new table with only a few values to serve as the parent table.
- Set up the 'small' and 'big' tables as child tables using a foreign key.
- Verify that the ON DELETE CASCADE clause makes changes ripple from parent to child tables.
- Drop the foreign key constraints, and optionally associated indexes.
  (This is the operation that is sped up.)
*/

\! clear

-- Make sure foreign keys are being enforced, and allow
-- rollback after doing some DELETEs that affect both
-- parent and child tables.
set foreign_key_checks = 1;
set autocommit = 0;

-- Create a parent table, containing values that we know are already present
-- in the child tables.
drop table if exists schema_names;
create table schema_names (id int unsigned not null primary key auto_increment,
  schema_name varchar(64) character set utf8 not null, index i_schema (schema_name))
  as select distinct table_schema schema_name from small_table;

show create table schema_names\G
show create table small_table\G
show create table big_table\G

-- Creating the foreign key constraint isn't any faster than before.
-- It still involves a table rebuild, as illustrated by the "rows affected" figure.
alter table small_table add constraint small_fk
  foreign key i_table_schema (table_schema) references schema_names(schema_name)
  on delete cascade;
alter table big_table add constraint big_fk
  foreign key i_table_schema (table_schema) references schema_names(schema_name)
  on delete cascade;

show create table small_table\G
show create table big_table\G

select schema_name from schema_names order by schema_name;
select count(table_schema) howmany, table_schema from small_table group by table_schema;
select count(table_schema) howmany, table_schema from big_table group by table_schema;

-- schema_names is the parent table.
-- small_table and big_table are the child tables.
-- (One row in the parent table can have many "children" in the child table.)
-- Changes to the parent table can ripple through to the child table.
-- For example, removing the value 'test' from schema_names.schema_name will
-- result in the removal of 20K or so rows from big_table.

delete from schema_names where schema_name = 'test';

select schema_name from schema_names order by schema_name;
select count(table_schema) howmany, table_schema from small_table group by table_schema;
select count(table_schema) howmany, table_schema from big_table group by table_schema;

-- Because we've turned off autocommit, we can still get back those deleted rows
-- if the DELETE was issued by mistake.
rollback;

select schema_name from schema_names order by schema_name;
select count(table_schema) howmany, table_schema from small_table group by table_schema;
select count(table_schema) howmany, table_schema from big_table group by table_schema;

-- All of the cross-checking between parent and child tables would be
-- deadly slow if there wasn't the requirement for the corresponding
-- columns to be indexed!

-- But we can get rid of the foreign key using a fast operation
-- that doesn't rebuild the table.
-- If we didn't specify a constraint name when setting up the foreign key, we would
-- have to find the auto-generated name such as 'big_table_ibfk_1' in the
-- output from 'show create table'.

-- For the small table, we'll drop the foreign key and the associated index.
-- Having an index on a small table is less critical.

\! echo "DROP FOREIGN KEY and INDEX from small_table:"
alter table small_table drop foreign key small_fk, drop index small_fk;

-- For the big table, we'll drop the foreign key and leave the associated index.
-- If we are still doing queries that reference the indexed column, the index is
-- very important to avoid a full table scan of the big table.
\! echo "DROP FOREIGN KEY from big_table:"
alter table big_table drop foreign key big_fk;

show create table small_table\G
show create table big_table\G

Running this code gives this output, condensed for brevity and with the most important points bolded:

Query OK, 4 rows affected (0.03 sec)

*************************** 1. row ***************************
       Table: schema_names
Create Table: CREATE TABLE `schema_names` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`id`),
  KEY `i_schema` (`schema_name`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

*************************** 1. row ***************************
       Table: small_table
Create Table: CREATE TABLE `small_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `COLUMN_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `ORDINAL_POSITION` bigint(21) unsigned NOT NULL DEFAULT '0',
  `COLUMN_DEFAULT` longtext CHARACTER SET utf8,
  `IS_NULLABLE` varchar(3) CHARACTER SET utf8 NOT NULL,
  `DATA_TYPE` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `CHARACTER_MAXIMUM_LENGTH` bigint(21) unsigned DEFAULT NULL,
  `CHARACTER_OCTET_LENGTH` bigint(21) unsigned DEFAULT NULL,
  `NUMERIC_PRECISION` bigint(21) unsigned DEFAULT NULL,
  `NUMERIC_SCALE` bigint(21) unsigned DEFAULT NULL,
  `DATETIME_PRECISION` bigint(21) unsigned DEFAULT NULL,
  `CHARACTER_SET_NAME` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
  `COLLATION_NAME` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
  `COLUMN_TYPE` longtext CHARACTER SET utf8 NOT NULL,
  `COLUMN_KEY` varchar(3) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `EXTRA` varchar(30) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `PRIVILEGES` varchar(80) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1679 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

*************************** 1. row ***************************
       Table: big_table
Create Table: CREATE TABLE `big_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `COLUMN_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `ORDINAL_POSITION` bigint(21) unsigned NOT NULL DEFAULT '0',
  `COLUMN_DEFAULT` longtext CHARACTER SET utf8,
  `IS_NULLABLE` varchar(3) CHARACTER SET utf8 NOT NULL,
  `DATA_TYPE` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `CHARACTER_MAXIMUM_LENGTH` bigint(21) unsigned DEFAULT NULL,
  `CHARACTER_OCTET_LENGTH` bigint(21) unsigned DEFAULT NULL,
  `NUMERIC_PRECISION` bigint(21) unsigned DEFAULT NULL,
  `NUMERIC_SCALE` bigint(21) unsigned DEFAULT NULL,
  `DATETIME_PRECISION` bigint(21) unsigned DEFAULT NULL,
  `CHARACTER_SET_NAME` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
  `COLLATION_NAME` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
  `COLUMN_TYPE` longtext CHARACTER SET utf8 NOT NULL,
  `COLUMN_KEY` varchar(3) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `EXTRA` varchar(30) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `PRIVILEGES` varchar(80) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  KEY `big_fk` (`TABLE_SCHEMA`)
) ENGINE=InnoDB AUTO_INCREMENT=1718273 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Query OK, 1678 rows affected (0.10 sec)
Query OK, 1718272 rows affected (1 min 14.54 sec)

*************************** 1. row ***************************
       Table: small_table
Create Table: CREATE TABLE `small_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  ...
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  KEY `small_fk` (`TABLE_SCHEMA`),
  CONSTRAINT `small_fk` FOREIGN KEY (`TABLE_SCHEMA`)
  REFERENCES `schema_names` (`schema_name`)
  ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1679 DEFAULT CHARSET=latin1
1 row in set (0.12 sec)

*************************** 1. row ***************************
       Table: big_table
Create Table: CREATE TABLE `big_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  ...
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  KEY `big_fk` (`TABLE_SCHEMA`),
  CONSTRAINT `big_fk` FOREIGN KEY (`TABLE_SCHEMA`)
  REFERENCES `schema_names` (`schema_name`)
  ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1718273 DEFAULT CHARSET=latin1
1 row in set (0.01 sec)

+--------------------+
| schema_name        |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.00 sec)

+---------+--------------------+
| howmany | table_schema       |
+---------+--------------------+
|     563 | information_schema |
|     286 | mysql              |
|     786 | performance_schema |
|      43 | test               |
+---------+--------------------+
4 rows in set (0.01 sec)

+---------+--------------------+
| howmany | table_schema       |
+---------+--------------------+
|  576512 | information_schema |
|  292864 | mysql              |
|  804864 | performance_schema |
|   44032 | test               |
+---------+--------------------+
4 rows in set (2.10 sec)

Query OK, 1 row affected (1.52 sec)

+--------------------+
| schema_name        |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.00 sec)

+---------+--------------------+
| howmany | table_schema       |
+---------+--------------------+
|     563 | information_schema |
|     286 | mysql              |
|     786 | performance_schema |
+---------+--------------------+
3 rows in set (0.00 sec)

+---------+--------------------+
| howmany | table_schema       |
+---------+--------------------+
|  576512 | information_schema |
|  292864 | mysql              |
|  804864 | performance_schema |
+---------+--------------------+
3 rows in set (1.74 sec)

Query OK, 0 rows affected (0.60 sec)

+--------------------+
| schema_name        |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| test               |
+--------------------+
4 rows in set (0.00 sec)

+---------+--------------------+
| howmany | table_schema       |
+---------+--------------------+
|     563 | information_schema |
|     286 | mysql              |
|     786 | performance_schema |
|      43 | test               |
+---------+--------------------+
4 rows in set (0.01 sec)

+---------+--------------------+
| howmany | table_schema       |
+---------+--------------------+
|  576512 | information_schema |
|  292864 | mysql              |
|  804864 | performance_schema |
|   44032 | test               |
+---------+--------------------+
4 rows in set (1.59 sec)

DROP FOREIGN KEY and INDEX from small_table:
Query OK, 0 rows affected (0.02 sec)

DROP FOREIGN KEY from big_table:
Query OK, 0 rows affected (0.02 sec)

*************************** 1. row ***************************
       Table: small_table
Create Table: CREATE TABLE `small_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  ...
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1679 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

*************************** 1. row ***************************
       Table: big_table
Create Table: CREATE TABLE `big_table` (
  `TABLE_CATALOG` varchar(512) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_SCHEMA` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `TABLE_NAME` varchar(64) CHARACTER SET utf8 NOT NULL DEFAULT '',
  ...
  `COLUMN_COMMENT` varchar(1024) CHARACTER SET utf8 NOT NULL DEFAULT '',
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  KEY `big_fk` (`TABLE_SCHEMA`)
) ENGINE=InnoDB AUTO_INCREMENT=1718273 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Example 6. Changing Auto-Increment Value

Here is an illustration of increasing the auto-increment lower limit for a table column, demonstrating how this operation now avoids a table rebuild, plus some other fun facts about InnoDB auto-increment columns.

/*
If this script is run after foreign_key.sql, the schema_names table is
already set up. But to allow this script to run multiple times without
running into duplicate ID errors, we set up the schema_names table
all over again.
*/

\! clear

\! echo "=== Adjusting the Auto-Increment Limit for a Table ==="
\! echo

drop table if exists schema_names;
create table schema_names (id int unsigned not null primary key auto_increment,
  schema_name varchar(64) character set utf8 not null,
  index i_schema (schema_name))
  as select distinct table_schema schema_name from small_table;

\! echo "Initial state of schema_names table."
\! echo "AUTO_INCREMENT is included in SHOW CREATE TABLE output."
\! echo "Note how MySQL reserved a block of IDs,"
\! echo "but only needed 4 of them in this transaction,"
\! echo "so the next inserted values would get IDs 8 and 9."
show create table schema_names\G
select * from schema_names order by id;

\! echo "Inserting even a tiny amount of data can produce gaps in the ID sequence."
insert into schema_names (schema_name) values ('eight'), ('nine');

set old_alter_table=0;
\! echo "Bumping auto-increment lower limit to 20 (fast mechanism):"
alter table schema_names auto_increment=20;

\! echo "Inserting 2 rows that should get IDs 20 and 21:"
insert into schema_names (schema_name) values ('foo'), ('bar');
commit;

set old_alter_table=1;
\! echo "Bumping auto-increment lower limit to 30 (slow mechanism):"
alter table schema_names auto_increment=30;

\! echo "Inserting 2 rows that should get IDs 30 and 31:"
insert into schema_names (schema_name) values ('bletch'),('baz');
commit;

select * from schema_names order by id;

set old_alter_table=0;

\! echo "Final state of schema_names table."
\! echo "AUTO_INCREMENT value shows the next inserted row would get ID=32."
show create table schema_names\G

Running this code gives this output, condensed for brevity and with the most important points bolded:

=== Adjusting the Auto-Increment Limit for a Table ===

Query OK, 0 rows affected (0.01 sec)

Query OK, 4 rows affected (0.02 sec)
Records: 4  Duplicates: 0  Warnings: 0

Initial state of schema_names table.
AUTO_INCREMENT is included in SHOW CREATE TABLE output.
Note how MySQL reserved a block of IDs,
but only needed 4 of them in this transaction,
so the next inserted values would get IDs 8 and 9.
*************************** 1. row ***************************
       Table: schema_names
Create Table: CREATE TABLE `schema_names` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`id`),
  KEY `i_schema` (`schema_name`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

+----+--------------------+
| id | schema_name        |
+----+--------------------+
|  1 | information_schema |
|  2 | mysql              |
|  3 | performance_schema |
|  4 | test               |
+----+--------------------+
4 rows in set (0.00 sec)

Inserting even a tiny amount of data can produce gaps in the ID sequence.
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

Bumping auto-increment lower limit to 20 (fast mechanism):
Query OK, 0 rows affected (0.01 sec)

Inserting 2 rows that should get IDs 20 and 21:
Query OK, 2 rows affected (0.00 sec)

Bumping auto-increment lower limit to 30 (slow mechanism):
Query OK, 8 rows affected (0.02 sec)

Inserting 2 rows that should get IDs 30 and 31:
Query OK, 2 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

+----+--------------------+
| id | schema_name        |
+----+--------------------+
|  1 | information_schema |
|  2 | mysql              |
|  3 | performance_schema |
|  4 | test               |
|  8 | eight              |
|  9 | nine               |
| 20 | foo                |
| 21 | bar                |
| 30 | bletch             |
| 31 | baz                |
+----+--------------------+
10 rows in set (0.00 sec)

Final state of schema_names table.
AUTO_INCREMENT value shows the next inserted row would get ID=32.
*************************** 1. row ***************************
       Table: schema_names
Create Table: CREATE TABLE `schema_names` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`id`),
  KEY `i_schema` (`schema_name`)
) ENGINE=InnoDB AUTO_INCREMENT=32 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Tuesday Apr 10, 2012

InnoDB transportable tablespaces

Note: this article was originally published on http://blogs.innodb.com on April 10, 2012 by Sunny Bains.

The Problem

In the past, users were unable to take full advantage of the FLUSH TABLES WITH READ LOCK statement: InnoDB simply ignored the flush-to-disk part. Only if, by sheer luck, the table had no dirty pages in the InnoDB buffer pool was it safe to copy the .ibd file to another location. The restore side had its own limitations and complications: the .ibd file could not simply be copied over to another server, because during import InnoDB did not fix up the metadata required for a trouble-free import. The main problems during import were:

  • If the tablespace ID of the IMPORTing instance had changed, for example if the table had been dropped and re-created, then the import would fail.
  • If the table and index IDs in the tablespace were different than on the importing server.
  • InnoDB uses a global row ID for tables without an explicit primary key. Any such row IDs in an imported tablespace had to be lower than the current maximum on the new server.
  • The maximum LSN of the tablespace had to be lower than the server’s current max LSN.
  • The maximum transaction (trx) ID of the tablespace had to be lower than the server current max trx id.
  • Purge and change buffer issues.

In short, if it worked you were lucky that the above constraints were satisfied, or your tables were probably read-only.

The Solution

Do the flush (export) properly, and do the import properly as well. During the flush we disable purge, merge all pending change buffer entries into the tablespace, and then flush all the dirty pages to disk. Note: purge will remain disabled until UNLOCK TABLES; is issued. We also write out a metadata file that contains the table schema, the hostname of the exporting server, the page size of the exporting instance, the highest autoinc value in memory, etc., so that when we do an import we can check whether the table schema matches (currently, foreign key relationships are ignored) and also restore the autoinc value. This metadata file is created in the same directory as the tablespace, with a suffix of .cfg. If the table is named customer, the metadata file will be named customer.cfg and the tablespace will be named customer.ibd. Note: FLUSH TABLES and IMPORT only work for tables that are stored outside the system tablespace, in their own tablespace; the InnoDB config parameter innodb_file_per_table must be set when creating the table.

Because InnoDB stores data in big-endian format, this allows copying data between different architectures. Copying between different operating systems will also work.

Example

If you want to backup a single table:

CREATE DATABASE ftwrl;

SET GLOBAL innodb_file_per_table=1;

CREATE TABLE ftwrl.t (C INT) ENGINE=InnoDB;

INSERT INTO ftwrl.t VALUES(1),(2),(3);

SELECT * FROM ftwrl.t;

FLUSH TABLES ftwrl.t WITH READ LOCK;

From the command line, copy the t.ibd and t.cfg to your backup directory. The files should be in the ftwrl/ sub-directory under datadir.

UNLOCK TABLES; — Note: This will remove the t.cfg file.

DROP TABLE ftwrl.t;

CREATE TABLE ftwrl.t (C INT) ENGINE=InnoDB;

ALTER TABLE ftwrl.t DISCARD TABLESPACE;

The DISCARD TABLESPACE will rename the t.ibd file to t.ibt in the same directory; you have to remove this file manually for now. Once you’ve removed this file, copy the backed up tablespace t.ibd and the metadata file t.cfg to the ftwrl/ sub-directory. Then issue the following:

ALTER TABLE ftwrl.t IMPORT TABLESPACE;
CHECK TABLE ftwrl.t;
SELECT * FROM ftwrl.t;

You should be good to go. If you want to backup multiple tables together, the syntax for that is:

FLUSH TABLES t1, t2, …, tn WITH READ LOCK;

Copy the files as before, but this time copy all of the tablespaces and metadata files.

UNLOCK TABLES;

Then discard each tablespace as above, followed by an import as above.
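As a rough sketch of that sequence (the table names t1 and t2 are just placeholders for two tables in the ftwrl database that were exported together, with matching empty tables already created on the destination):

ALTER TABLE ftwrl.t1 DISCARD TABLESPACE;
ALTER TABLE ftwrl.t2 DISCARD TABLESPACE;

-- copy t1.ibd, t1.cfg, t2.ibd and t2.cfg into the ftwrl/ sub-directory, then:

ALTER TABLE ftwrl.t1 IMPORT TABLESPACE;
ALTER TABLE ftwrl.t2 IMPORT TABLESPACE;
CHECK TABLE ftwrl.t1, ftwrl.t2;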

Other changes
Errors and warnings will be pushed to the client so that the user doesn’t need to poke around in the server error log files to check why an operation failed. The intention is to send enough information to the user so that the problem can be clearly identified and understood.

Conclusion

The syntax for “export” is not finalised yet and may change; for now, the above examples using FLUSH TABLES WITH READ LOCK should work. To export tables that are in a foreign key relationship, you must export and import all the tables together, otherwise you may end up with inconsistencies. The export/import doesn’t work on partitioned tables; this limitation will be addressed later. A matching table definition must exist in the importing server instance; currently we don’t automagically create the table using the saved metadata during import. The .cfg format is not written in stone yet; we may add more data to this file, for example for partition support.

Your feedback is important to us, so that we can improve this feature further.

Choose the Location of your InnoDB File-per-Table Tablespace

Note: this article was originally published on http://blogs.innodb.com on April 10, 2012 by Kevin Lewis.

The April 2012 InnoDB labs release introduces a new feature in InnoDB that allows you to choose the location of specific tables.  For example, you can place critical tables onto an SSD drive while leaving the system tablespace on a hard drive.  Conversely, you can store your primary database files on an SSD and put a seldom-used but very large archive or reference table on a larger, cheaper hard drive.

InnoDB now makes use of the following existing syntax in MySQL:

CREATE TABLE  . . .  DATA DIRECTORY = ‘absolute path of data directory’;

CREATE TABLE  . . .  PARTITION . . . DATA DIRECTORY = ‘absolute path of data directory’;

This syntax is used by the MyISAM and Archive engines to make use of symbolic links on those operating systems that support them.  But InnoDB can use this syntax on any OS, since it stores the path in a new system table called SYS_DATAFILES.  There is also a new system table called SYS_TABLESPACES.  Both of these can be viewed in the April labs release of 5.6 through the information schema. For example:

mysql> SET GLOBAL innodb_file_per_table=on;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE t1 (a int KEY, b text) DATA DIRECTORY='/ssd1/my_data';
 Query OK, 0 rows affected (0.12 sec)
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES;
 +-------+---------+------+-------------+----------------------+-----------+---------------+
 | SPACE | NAME    | FLAG | FILE_FORMAT | ROW_FORMAT           | PAGE_SIZE | ZIP_PAGE_SIZE |
 +-------+---------+------+-------------+----------------------+-----------+---------------+
 |     1 | test/t1 |    0 | Antelope    | Compact or Redundant |     16384 |             0 |
 +-------+---------+------+-------------+----------------------+-----------+---------------+
 1 row in set (0.00 sec)
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_DATAFILES;
 +-------+---------------------------+
 | SPACE | PATH                      |
 +-------+---------------------------+
 |     1 | /ssd1/my_data/test/t1.ibd |
 +-------+---------------------------+
 1 row in set (0.00 sec)

Notice that a directory named for the database is added after the DATA DIRECTORY provided.  This is the same as what is done in the normal data directory.  It allows you to choose the same remote location for all your tablespaces, even if tables in different databases have the same name.
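For instance, a small sketch (db1 and db2 are hypothetical database names, reusing the /ssd1/my_data path from above) showing that same-named tables in different databases can share one DATA DIRECTORY without colliding:

SET GLOBAL innodb_file_per_table=on;
CREATE DATABASE db1;
CREATE DATABASE db2;
CREATE TABLE db1.t1 (a INT PRIMARY KEY) ENGINE=InnoDB DATA DIRECTORY='/ssd1/my_data';
CREATE TABLE db2.t1 (a INT PRIMARY KEY) ENGINE=InnoDB DATA DIRECTORY='/ssd1/my_data';
-- The files end up as /ssd1/my_data/db1/t1.ibd and /ssd1/my_data/db2/t1.ibd.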

The last 4 columns in innodb_sys_tablespaces are interpretations of the tablespace flags field. Tablespace flags do not distinguish between the Compact and Redundant row formats, so you see what is displayed above.  If you want to see the row format of the table, innodb_sys_tables now also interprets the table flags:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE name like 'test%';
 +----------+---------+------+--------+-------+-------------+------------+---------------+
 | TABLE_ID | NAME    | FLAG | N_COLS | SPACE | FILE_FORMAT | ROW_FORMAT | ZIP_PAGE_SIZE |
 +----------+---------+------+--------+-------+-------------+------------+---------------+
 |       24 | test/t1 |   65 |      5 |     1 | Antelope    | Compact    |             0 |
 +----------+---------+------+--------+-------+-------------+------------+---------------+
 1 row in set (0.00 sec)

In addition to SYS_DATAFILES, the absolute file path to the tablespace is also written to a text file in the normal datafile location of the tablespace.  This is also the location of the FRM file written by the MySQL server.  It has the same name as the FRM except that it has the extension ‘.isl’, which stands for InnoDB Symbolic Link.  It is there to support recovery, which happens before the system tables are opened.  For the example above, the sole contents of t1.isl are:

/ssd1/my_data/test/t1.ibd

The ISL file also allows a DBA to move one of these remote tablespaces since ALTER TABLE … DATA DIRECTORY =  ‘. . .’ is currently not supported.  In order to move a tablespace to another location, or to port an entire database along with its remote tablespaces, you must edit this ISL file after moving the OS files.  The ISL file is a simple text file that contains the absolute path and filename of the tablespace.  Just change that to the new location and save your changes.  Do not leave a copy of the tablespace in the old location.

If you have a large existing database that uses file-per-table tablespaces (files with ‘.ibd’ extensions), it is possible to move these from the default location next to the FRM file to a remote location such as a new SSD drive.  But you must shut down the MySQL server first. After it is shut down, move the .ibd tablespace to the location you would like and create a text file where it was, with the same name except using an ‘.isl’ extension.  Make the contents of the text file the full file path of the .ibd file.  Then restart the server.  It will read the .isl file, find the tablespace, and update the internal system tables to show where the file resides.
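After the restart, one quick sanity check is to query the same information schema table shown earlier and confirm that InnoDB picked up the new path (a sketch, assuming the files were moved under /ssd1/my_data as in the example above):

SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_DATAFILES WHERE PATH LIKE '/ssd1/my_data/%';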

Conclusion

This new feature provides a valuable option to configure your database to make the best use of the hardware available.

Tuesday Dec 20, 2011

Better scaling of read-only workloads

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Sunny Bains.

The problem and its cause

There have been several complaints over the years about InnoDB’s inability to scale beyond 256 connections. One of the main issues behind this scalability bottleneck was the read view creation that is required for MVCC (Multi Version Concurrency Control) to work. When the user starts a transaction this is what InnoDB does under the hood:

  • Create or reuse a transaction instance – usually it is reused from a pool (trx_sys_t::mysql_trx_list).
  • Initialize the transaction start time and assign a rollback segment
  • Append the transaction to an active  transaction list ordered on trx_t::id in descending order

The append to  the trx_sys_t::trx_list and corresponding remove during commit is covered by trx_sys_t::mutex. After the transaction is “started” and if the transaction has an isolation greater than or equal to REPEATABLE-READ then before the first record/row is accessed by the transaction, InnoDB creates a view (snapshot) of the running system state. It does this by examining the transactions that are active at the time of the MVCC snapshot, so that their changes can be excluded from the creating transaction’s read view. This read view creation is also covered by the trx_sys_t::mutex. As the number of active transactions in the system increases this read view creation takes longer and longer. This increases the wait times on the trx_sys_t::mutex (during transaction start and commit) and once threads are forced to wait on a condition variable (in contrast to simply spinning while waiting for the mutex) the system throughput drops dramatically.

The solution

While investigating this problem there were two observations that I made:

  • Read only transactions should not be considered in the MVCC snapshot
  • Auto commit non-locking read-only selects should not be in the trx_sys_t::trx_list at all

For the first to work, we need to tag transactions as READ ONLY when the transaction is started, e.g.:

  •       START TRANSACTION READ ONLY;

I will not be discussing this functionality in this blog because the syntax for this doesn’t exist in MySQL (yet). However, the functionality exists in InnoDB to handle this case and is  in the 5.6.4 release. Once the above syntax exists, InnoDB can take advantage of the new syntax trivially. What I want to talk about is the second case. This special case can be detected by InnoDB using existing state information and handled transparently without any syntax change in user applications and is fully functional in the 5.6.4 release.

InnoDB transaction life cycle redesign

Split the trx_sys_t::trx_list (the active transaction list) into two lists: trx_sys_t::ro_trx_list and trx_sys_t::rw_trx_list. Only transactions that are in the trx_sys_t::rw_trx_list are taken into consideration when creating the MVCC snapshot. For a read-only heavy workload the benefits are obvious: the smaller size of the RW active transaction list makes read view creation for MVCC (and purge) very fast. For auto-commit read-only non-locking selects the additional benefit is that we don’t need to acquire the trx_sys_t::mutex at all, because we don’t put them on the active list. This removes the bottleneck around the trx_sys_t::mutex and improves concurrency and scalability.

Auto-commit read-only non-locking transactions go from state NOT STARTED -> ACTIVE -> NOT STARTED, in contrast to locking read-only (and read-write) transactions which go from state NOT STARTED -> ACTIVE -> COMMIT TO MEMORY -> NOT STARTED. The additional advantage in skipping the COMMIT TO MEMORY state is that we know that they cannot acquire any locks and therefore it is pointless to acquire the lock_sys_t::mutex and attempt lock release. Also, during COMMIT because they are not on any active transaction list we don’t need to acquire the trx_sys_t::mutex to remove them from the list, improving concurrency and performance further.

Changes to transaction state visibility

Currently (5.6.4), InnoDB doesn’t display the auto-commit read-only non-locking selects in “SHOW ENGINE INNODB STATUS“; however, they are visible in the INFORMATION_SCHEMA.innodb_trx table. The innodb_trx table has two new fields that can be queried to determine whether a transaction is tagged as a read-only transaction, and additionally whether it qualifies for the special handling of auto-commit read-only non-locking selects. The new fields are:

  • trx_read_only – 0 or 1 (INT)
  • trx_autocommit_non_locking – 0 or 1 (INT)
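For example, a simple query against the INFORMATION_SCHEMA.INNODB_TRX table shows how each running transaction is classified (a small sketch; the first two columns are standard innodb_trx columns, the last two are the new fields described above):

SELECT trx_id, trx_state, trx_read_only, trx_autocommit_non_locking
  FROM INFORMATION_SCHEMA.INNODB_TRX;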

Some additional minor tweaks: read-only transactions aren’t assigned a rollback segment, and if they are flagged as autocommit non-locking selects, we only set the start time once every 32 times; this reduces the overhead of a system call.

Test results

These tests were run by Alexey Stroganov (a.k.a. Ranger) using Sysbench (with --oltp-skip-trx=on); wl6046 refers to the internal worklog number of this performance fix. Note how InnoDB scales easily up to 4K threads on the 24 core host and 1.5K threads on the 8 core host; there is very little (negligible) drop in TPS as the number of threads is cranked up. In particular, note how the peak TPS on the 24 core host is nearly double :-)

Test POINT_SELECT
  • QUERY - SELECT c FROM sbtest WHERE id=N


Test SIMPLE_RANGES

  • QUERY – SELECT c FROM sbtest WHERE id BETWEEN N AND M

The same result holds for SIMPLE_RANGES. At lower thread levels, 5.6 has less TPS than 5.5 on the 8 core host; however, that is due to another known and unrelated issue that is currently being actively worked on.

Conclusion

This is a significant step forward in InnoDB read-only performance, and it will allow InnoDB to scale up to a very high number of concurrent queries and take advantage of a greater number of processors with the improved parallelism. Note: mixed workloads with a read-only heavy component will also benefit from this improvement, though currently only if the read-only transactions are auto-commit non-locking selects. The better news is that with this split of the active transaction list into two, we can now optimize deadlock detection, lock release and transaction scheduling around lock grants and waits, so stay tuned!

InnoDB Full-Text Search is in MySQL 5.6.4

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Jimmy Yang.

InnoDB full-text search (FTS) is finally available in the MySQL 5.6.4 release. The feature has been available on trial through MySQL’s summer labs release, so there are already several blogs covering it. In this blog, we will leave the “how to” part to those blogs and focus on some important characteristics of this new feature, so you will have a better understanding when trying it out.

The InnoDB Full-text Index as an Inverted Index

When it comes to the basic design, InnoDB takes a traditional approach to implementing the full-text index: a so-called “inverted index”. It is composed of a set of auxiliary “index tables” that store “word” and “Doc ID” pairs, as well as each word’s position in the original text.

Incoming text strings in inserted records are extracted, tokenized and decomposed into individual words, and these words are then inserted into the auxiliary “index tables” along with their position info and the Doc ID associated with the record. With the position info, we are able to support the proximity search that is lacking in MyISAM FTS.
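As a rough illustration of what that position information enables (assuming a hypothetical table articles with a FULLTEXT index on a body column, and that your build supports the boolean-mode @distance operator), a proximity query looks roughly like this; it matches rows where the quoted words appear close to each other:

SELECT * FROM articles
 WHERE MATCH(body) AGAINST('"buffer pool" @5' IN BOOLEAN MODE);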

Create the Full-text Index in Parallel

The full-text index is created in parallel. By default, the parallel degree is two, which means two threads are used to tokenize, sort and insert into the full-text “index tables”.

To support parallel index creation and future parallel processing, InnoDB partitions the “full-text index” into six auxiliary “index tables”. The words are divided among these tables based on their first character’s charset sort weight. Currently, the partition criteria are hard-coded and targeted at Latin characters. In the future, we will allow users to define their own partition criteria, to better support other character sets.
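As a note, in the 5.6 builds this parallel degree appears to be exposed as the innodb_ft_sort_pll_degree startup option (set in the server configuration, not dynamically); a quick way to check the value your server is using:

SHOW GLOBAL VARIABLES LIKE 'innodb_ft_sort_pll_degree';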

Batching inserted values with the InnoDB Full-text Index’s “index cache”

As just discussed, when inserting into an “inverted index”, each inserted string is tokenized and decomposed into individual words before being inserted into the auxiliary tables. A single insertion could thus result in many small insertions into the inverted index tables. This magnifies the insertion operations several or even dozens of times, making an otherwise normal insertion disruptive to concurrent access to the table.

Our solution is to build an “internal” FTS “index cache” to temporarily cache these insertions, and batch-flush them to disk once this index cache is full. This avoids frequent updates to the “Index Table” during busy inserts and updates, so the index table only needs to be synchronized when the index cache is full. The batching technique also avoids repetitively storing the same word and minimizes the number of entries for each word in the “Index Table”. So instead of flushing each word with a single “ilist” entry (with a single Doc ID), it batches results from many inserted documents to create an “ilist” with multiple Doc ID/position pairs before flushing this info to disk as a single entry. This reduces redundancy and makes the “Index Table” smaller.

The index cache is index-specific, and a configuration parameter (innodb_ft_cache_size) configures the size of the cache. It stores the same information as the auxiliary tables, using a red-black tree data structure. However, this cache is different from an index tree in that it is only used to cache the words of inserted documents; words on disk do not need to be brought back into this cache before they can be queried. In other words, the words in the “Index Table” are queried directly, just as we query relational tables, and the results from the “Index Table” are merged (unioned) with the query results from the index cache before being sent back to the user.
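If you want to peek at what the cache and the on-disk index tables currently hold, 5.6 exposes both through the information schema; a small sketch (it assumes the monitored table is test/articles and that the INNODB_FT_INDEX_CACHE and INNODB_FT_INDEX_TABLE tables are present in your build):

SET GLOBAL innodb_ft_aux_table = 'test/articles';
SELECT word, doc_id, position FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE LIMIT 10;
SELECT word, doc_id, position FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 10;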

How InnoDB Full-text Index handles Deletes

Similar to inserts, each delete of a record with a full-text index could result in many small “deletes” if we had implemented it as normal deletes. Again, this could be very disruptive.

To avoid such disruptive operations, the InnoDB full-text index uses a new scheme that only logs the Doc IDs of deleted documents in a separate “DELETED” auxiliary table. The actual “indexed” records still remain in the FTS index; however, these deleted Doc IDs are filtered out of the final results by consulting the “DELETED” table before returning the query result.

The benefit of this design is obvious: the delete operation becomes trivial and fast, with minimal impact on the index itself. The shortcoming is that the index does not shrink along with record deletion. However, the benefits far outweigh the drawbacks.
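The logged deletions can be observed directly (again a sketch, assuming innodb_ft_aux_table is still pointed at the table of interest as in the previous example):

SELECT doc_id FROM INFORMATION_SCHEMA.INNODB_FT_DELETED;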

To clean up the deleted records in the index, you can resort to the InnoDB optimize table utility, which rebuilds the full-text index online. We will discuss it in the next section.

Index Optimization

The optimize table on the Full-Text index essentially rebuilds the full-text index, and leaves those “deleted” Docs out of the new index. This operation does two things:

1) Remove the deleted Doc ID from the word’s DocID/Position pair list (ilist).

2) Consolidate multiple entries for the same word into one (or fewer) entries where possible by consolidating their Doc ID/position pair lists (ilists), so that each word has as few entries as possible.

Currently, we overload this optimization operation onto the “OPTIMIZE TABLE” command, and it runs only if you turn on the “innodb_optimize_fulltext_only” system configuration variable:

mysql> set global innodb_optimize_fulltext_only=1;

mysql> optimize table articles;

Since the table could be large and optimization (reorganizing and rebuilding the index) may take a tremendous amount of time, we allow the optimization to be done in stages. The system configuration variable “innodb_ft_num_word_optimize” specifies how many words to optimize for each “OPTIMIZE TABLE” command. Its default value is 2000, meaning each run optimizes 2000 words; when the next “OPTIMIZE TABLE” command is issued, the server continues the process from where it left off in the last run.
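A sketch of that staged approach (the 10000 value is purely illustrative, and articles is the example table from above):

SET GLOBAL innodb_optimize_fulltext_only = 1;
SET GLOBAL innodb_ft_num_word_optimize = 10000;
OPTIMIZE TABLE articles;   -- optimizes the first batch of words
OPTIMIZE TABLE articles;   -- continues where the previous run left off
SET GLOBAL innodb_optimize_fulltext_only = 0;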

InnoDB Full-text search Transaction Behavior

As you can see from the last two sections, the main idea behind InnoDB full-text index DML is to batch the operations as much as possible. This means there are some special characteristics when it comes to transaction handling.

In fact, all of the DML operations (insert/delete/update) involving full-text indexed columns are processed at transaction commit time. For example, for insertion, the tokenization of inserted strings is performed only at commit time; as a result, a full-text search only sees committed data. This “delayed” processing/synchronization behavior is present in the text handling of all transactional DBMSs, and InnoDB does this processing at transaction commit time.
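A small sketch of that visibility rule (again assuming a table articles with a FULLTEXT index on body): a row inserted inside an open transaction is not tokenized yet, so a search from another session will not find it until the insert is committed.

-- session 1
BEGIN;
INSERT INTO articles (body) VALUES ('transportable tablespaces in InnoDB');

-- session 2: no match yet, the new text is only tokenized at commit
SELECT * FROM articles WHERE MATCH(body) AGAINST('transportable');

-- session 1
COMMIT;

-- session 2: the row is now visible to full-text search
SELECT * FROM articles WHERE MATCH(body) AGAINST('transportable');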

InnoDB FTS Document ID

None of the above design would work without the support of a so-called FTS Doc ID identifier. The mapping from a word back to the original table record goes through this unique ID, so there must be a Doc ID column in any table with a full-text index.

In our case, it is represented by the “FTS_DOC_ID” (uppercase required) column. If this column is not defined, InnoDB will automatically add it to the user table when creating the full-text index. The column itself must be of type “BIGINT UNSIGNED NOT NULL”, with a unique index named FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the full-text index after loading data, since InnoDB does not need to copy the table to add the column (and also rebuild all indexes, including the primary index). In that case, the user is responsible for properly managing the Doc ID column so that documents are not inappropriately represented. If create-index performance is not a concern, you can leave Doc ID management to InnoDB by leaving out this column; InnoDB will add “FTS_DOC_ID” as a hidden column, and the “FTS_DOC_ID_INDEX” will also be created automatically.
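A minimal sketch of a table that supplies its own Doc ID column up front (the FTS_DOC_ID column name and FTS_DOC_ID_INDEX index name are required exactly as described above; the body column, index name ft_body and the use of AUTO_INCREMENT are just illustrative choices):

CREATE TABLE articles (
  FTS_DOC_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  body TEXT,
  UNIQUE KEY FTS_DOC_ID_INDEX (FTS_DOC_ID),
  FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB;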

Stopword Handling

Stopwords are common or trivial words that are omitted from the full-text index. The InnoDB FTS provides two sources of stopwords:

1) InnoDB predefined default stopwords – If no user stopword list is defined, this default stopwords list is used.

You can view this list by querying the table INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:

mysql> select * from INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

2) User-defined stopwords – You can define your own stopwords by creating a table with a single column named “value”, with datatype “varchar”, and pointing the global variable “innodb_ft_server_stopword_table” to this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. For example:

  • # Define a correctly formatted user stopword table

create table user_stopword(value varchar(30)) engine = innodb;

  • # Point to this stopword table with “db name/table name”

set global innodb_ft_server_stopword_table = 'test/user_stopword';

Ranking

Currently, we use a very simple ranking mechanism (term frequency, inverse document frequency) for document ranking. This means the more frequently a word appears in a document, and the less frequently it appears across all documents, the higher the selected document is ranked. We do have plans to add a more sophisticated ranking mechanism and to allow users to supply weights for the ranking calculation too.
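The relevance value itself can be selected, so a ranked result list looks roughly like this (a sketch, again assuming the articles table from the earlier examples):

SELECT FTS_DOC_ID,
       MATCH(body) AGAINST('innodb performance') AS score
  FROM articles
 WHERE MATCH(body) AGAINST('innodb performance')
 ORDER BY score DESC;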

Limitation

As the first release of InnoDB FTS, it provides limited and fairly basic full-text search functionality, and it has some limitations that are worth mentioning:

  1. It does not yet support stemming.
  2. For ideographic languages such as CJK (Chinese, Japanese and Korean), which do not have word delimiters, InnoDB FTS does not yet support N-GRAM parsing (similar to MyISAM FTS).
  3. Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation (similar to MyISAM FTS).
  4. It does not support external parser plugins
  5. Ranking mechanism is relatively simple

Summary

The InnoDB full-text search gives InnoDB an important capability for handling text documents in MySQL, and we will further develop it to continue increasing its performance and usability.
