Saturday Sep 29, 2012

Online ALTER TABLE in MySQL 5.6

This is the low-level view of data definition language (DDL) operations in the InnoDB storage engine in MySQL 5.6. John Russell gave a more high-level view in his blog post April 2012 Labs Release – Online DDL Improvements.

MySQL before the InnoDB Plugin

Traditionally, the MySQL storage engine interface has taken a minimalistic approach to data definition language. The only natively supported operations were CREATE TABLE, DROP TABLE and RENAME TABLE. Consider the following example:

CREATE TABLE t(a INT);
INSERT INTO t VALUES (1),(2),(3);
CREATE INDEX a ON t(a);
DROP TABLE t;

The CREATE INDEX statement would be executed roughly as follows:

CREATE TABLE temp(a INT, INDEX(a));
INSERT INTO temp SELECT * FROM t;
RENAME TABLE t TO temp2;
RENAME TABLE temp TO t;
DROP TABLE temp2;

You could imagine that the database could crash when copying all rows from the original table to the new one. For example, it could run out of file space. Then, on restart, InnoDB would roll back the huge INSERT transaction. To fix things a little, a hack was added to ha_innobase::write_row for committing the transaction every 10,000 rows.

Still, it was frustrating that even a simple DROP INDEX would make the table unavailable for modifications for a long time.

Fast Index Creation in the InnoDB Plugin of MySQL 5.1

MySQL 5.1 introduced a new interface for CREATE INDEX and DROP INDEX. The old table-copying approach can still be forced by SET old_alter_table=1.

This interface is used in MySQL 5.5 and in the InnoDB Plugin for MySQL 5.1. Apart from the ability to do a quick DROP INDEX, the main advantage is that InnoDB will execute a merge-sort algorithm before inserting the index records into each index that is being created. This should speed up the insert into the secondary index B-trees and potentially result in a better B-tree fill factor.

The 5.1 ALTER TABLE interface was not perfect. For example, DROP FOREIGN KEY still invoked the table copy. Renaming columns could conflict with InnoDB foreign key constraints. Combining ADD KEY and DROP KEY in ALTER TABLE was problematic and not atomic inside the storage engine.

The ALTER TABLE interface in MySQL 5.6

The ALTER TABLE storage engine interface was completely rewritten in MySQL 5.6. Instead of introducing a method call for every conceivable operation, MySQL 5.6 introduced a handful of methods, and data structures that keep track of the requested changes.

In MySQL 5.6, an online ALTER TABLE operation can be requested by specifying LOCK=NONE. LOCK=SHARED and LOCK=EXCLUSIVE are also available. The old-style table copying can be requested by ALGORITHM=COPY; that one will require at least LOCK=SHARED. From the InnoDB point of view, anything that is possible with LOCK=EXCLUSIVE is also possible with LOCK=SHARED.
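For example, an online index creation and an explicitly table-copying variant could be requested like this (a minimal sketch; the table and column names are made up for illustration):

ALTER TABLE t ADD INDEX idx_b (b), ALGORITHM=INPLACE, LOCK=NONE;
ALTER TABLE t ADD INDEX idx_c (c), ALGORITHM=COPY, LOCK=SHARED;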

Most ALGORITHM=INPLACE operations inside InnoDB can be executed online (LOCK=NONE). InnoDB will always require an exclusive table lock in two phases of the operation. The execution phases are tied to a number of methods:

handler::check_if_supported_inplace_alter
Checks if the storage engine can perform all requested operations, and if so, what kind of locking is needed.
handler::prepare_inplace_alter_table
InnoDB uses this method to set up the data dictionary cache for the upcoming CREATE INDEX operation. We need stubs for the new indexes, so that we can keep track of changes to the table during online index creation. Also, crash recovery would drop any indexes that were incomplete at the time of the crash.
handler::inplace_alter_table
In InnoDB, this method is used for creating secondary indexes or for rebuilding the table. This is the ‘main’ phase that can be executed online (with concurrent writes to the table).
handler::commit_inplace_alter_table
This is where the operation is committed or rolled back. Here, InnoDB would drop any indexes, rename any columns, drop or add foreign keys, and finalize a table rebuild or index creation. It would also discard any logs that were set up for online index creation or table rebuild.

The prepare and commit phases require an exclusive lock, blocking all access to the table. If MySQL times out while upgrading the table meta-data lock for the commit phase, it will roll back the ALTER TABLE operation.

In MySQL 5.6, data definition language operations are still not fully atomic, because the data dictionary is split. Part of it is inside InnoDB data dictionary tables. Part of the information is only available in the *.frm file, which is not covered by any crash recovery log. But, there is a single commit phase inside the storage engine.

Online Secondary Index Creation

It may occur that an index needs to be created on a new column to speed up queries. But, it may be unacceptable to block modifications on the table while creating the index.

It turns out that it is conceptually not so hard to support online index creation. All we need is some more execution phases:

  1. Set up a stub for the index, for logging changes.
  2. Scan the table for index records.
  3. Sort the index records.
  4. Bulk load the index records.
  5. Apply the logged changes.
  6. Replace the stub with the actual index.

Threads that modify the table will log the operations to the logs of each index that is being created. Errors, such as log overflow or uniqueness violations, will only be flagged by the ALTER TABLE thread. The log is conceptually similar to the InnoDB change buffer.

The bulk load of index records will bypass record locking. We still generate redo log for writing the index pages. It would suffice to log page allocations only, and to flush the index pages from the buffer pool to the file system upon completion.
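As a sketch of what the phases above mean in practice (the table is the one from the first example; the session labels are only illustrative), one connection can build the index while another keeps modifying the table:

-- session 1: create the index online
ALTER TABLE t ADD INDEX a (a), ALGORITHM=INPLACE, LOCK=NONE;

-- session 2, at the same time: these changes are written to the
-- log of the index being created and applied in phase 5
INSERT INTO t VALUES (4),(5);
UPDATE t SET a=6 WHERE a=4;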

Native ALTER TABLE

Starting with MySQL 5.6, InnoDB supports most ALTER TABLE operations natively. The notable exceptions are changes to the column type, ADD FOREIGN KEY except when foreign_key_checks=0, and changes to tables that contain FULLTEXT indexes.

The keyword ALGORITHM=INPLACE is somewhat misleading, because certain operations cannot be performed in-place. For example, changing the ROW_FORMAT of a table requires a rebuild.
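For instance, changing the row format is accepted as ALGORITHM=INPLACE even though the whole table gets rebuilt (a hedged sketch; subject to the limitations listed below):

ALTER TABLE t ROW_FORMAT=DYNAMIC, ALGORITHM=INPLACE, LOCK=NONE;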

Online operation (LOCK=NONE) is not allowed in the following cases:

  • when adding an AUTO_INCREMENT column,
  • when the table contains FULLTEXT indexes or a hidden FTS_DOC_ID column, or
  • when there are FOREIGN KEY constraints referring to the table, with ON…CASCADE or ON…SET NULL option.

The FOREIGN KEY limitations are needed, because MySQL does not acquire meta-data locks on the child or parent tables when executing SQL statements.

Theoretically, InnoDB could support operations like ADD COLUMN and DROP COLUMN in-place, by lazily converting the table to a newer format. This would require that the data dictionary keep multiple versions of the table definition. For simplicity, we will copy the entire table, even for DROP COLUMN.
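As a sketch (the column name is hypothetical), both of the following statements cause a full table rebuild, yet they can still be requested as online operations:

ALTER TABLE t ADD COLUMN b INT, ALGORITHM=INPLACE, LOCK=NONE;
ALTER TABLE t DROP COLUMN b, ALGORITHM=INPLACE, LOCK=NONE;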

The bulk copying of the table will bypass record locking and undo logging. For facilitating online operation, a temporary log will be associated with the clustered index of the table. Threads that modify the table will also write the changes to the log.

When altering the table, we skip all records that have been marked for deletion. In this way, we can simply discard any undo log records that were not yet purged from the original table.

Off-page columns, or BLOBs, are an important consideration. We suspend the purge of delete-marked records if it would free any off-page columns from the old table. This is because the BLOBs can be needed when applying changes from the log. We have special logging for handling the ROLLBACK of an INSERT that inserted new off-page columns. This is because the columns will be freed at rollback.

InnoDB Compression Improvements in MySQL 5.6

MySQL 5.6 comes with significant improvements for the compression support inside InnoDB. The enhancements that we'll talk about in this piece are also a good example of community contributions. The work on these was conceived, implemented and contributed by the engineers at Facebook. Before we plunge into the details let us familiarize ourselves with some of the key concepts surrounding InnoDB compression.

  • In InnoDB, compressed pages have a fixed size. Supported sizes are 1, 2, 4, 8 and 16K. The compressed page size is specified at table creation time (see the example after this list).
  • InnoDB uses zlib for compression.
  • The InnoDB buffer pool will attempt to cache compressed pages like normal pages. However, whenever a page is actively used by a transaction, we will also have the uncompressed version of the page: a page in the buffer pool can be in compressed-only form, or present as both the compressed page and its uncompressed version, but never in uncompressed-only form. On disk, only the compressed page is ever stored.
  • When both compressed and uncompressed images are present in the buffer pool, they are always kept in sync; changes are applied to both atomically.
  • Recompression happens when changes are made to the compressed data. In order to minimize recompressions InnoDB maintains a modification log within a compressed page. This is the extra space available in the page after compression and it is used to log modifications to the compressed data thus avoiding recompressions.
  • DELETE (and ROLLBACK of DELETE) and purge can be performed without recompressing the page. This is because the delete-mark bit and the system fields DB_TRX_ID and DB_ROLL_PTR are stored in uncompressed format on the compressed page. A record can be purged by shuffling entries in the compressed page directory. This can also be useful for updates of indexed columns, because UPDATE of a key is mapped to INSERT+DELETE+purge.
  • A compression failure happens when we attempt to recompress a page and it does not fit in the fixed size. In that case, we first reorganize the page and attempt to recompress; if that fails as well, we split the page into two and recompress both pages.
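For example (a sketch; it assumes innodb_file_per_table=1 and innodb_file_format=Barracuda, and the table name is made up), a table with 8K compressed pages can be created like this:

CREATE TABLE ct (id INT PRIMARY KEY, val BLOB)
ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;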

Now let's talk about the three major improvements that we made in MySQL 5.6.

Logging of Compressed Page Images:
InnoDB used to log the entire compressed page image to the redo log whenever recompression happened. This was an extra safety measure to guard against the rare case where an attempt is made to do recovery using a different zlib version from the one that was used before the crash. Because recovery is a page-level operation in InnoDB, we have to be sure that all recompression attempts succeed without causing a B-tree page split. However, writing entire compressed page images to the redo log files not only makes the operation heavy-duty but can also adversely affect flushing activity. This happens because redo space is used in a circular fashion: when we generate much more redo than normal, we fill up the space much more quickly, and in order to reuse the redo space we have to flush the corresponding dirty pages from the buffer pool.
Starting with MySQL 5.6, there is a new global configuration parameter, innodb_log_compressed_pages. The default value is true, which matches the previous behavior. If you are sure that you are not going to attempt to recover from a crash using a different version of zlib, then you should set this parameter to false. This is a dynamic parameter.
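For example, to stop logging the compressed page images (only if you are certain that recovery will never use a different zlib version):

SET GLOBAL innodb_log_compressed_pages = OFF;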

Compression Level:
You can now set the compression level that zlib should use to compress the data. The global parameter is innodb_compression_level; the default value is 6 (the zlib default) and the allowed values are 1 to 9. Again, the parameter is dynamic, i.e. you can change it on the fly.
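For example, to trade more CPU for better compression at run time:

SET GLOBAL innodb_compression_level = 9;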

Dynamic Padding to Reduce Compression Failures:
Compression failures are expensive in terms of CPU. We go through the hoops of recompression, failure, reorganize, recompression, failure and finally a page split. At the same time, how often we encounter compression failures depends largely on the compressibility of the data. In MySQL 5.6, courtesy of Facebook engineers, we have an adaptive algorithm based on per-index statistics that we gather about compression operations. The idea is that if a certain index/table is experiencing too many compression failures, then we should pack the 16K uncompressed version of the page less densely, i.e. we let some space in the 16K page go unused in the hope that the recompression won't end up in a failure. In other words, we dynamically keep adding 'pad' to the 16K page until compression failures fall within an acceptable range. It works the other way as well: we keep removing the pad if the failure rate is fairly low. To tune the padding effort, two configuration variables are exposed (see the example after the list).

  • innodb_compression_failure_threshold_pct: default 5, range 0 - 100, dynamic; the percentage of compression operations that may fail before we start padding. The value 0 has the special meaning of disabling padding.
  • innodb_compression_pad_pct_max: default 50, range 0 - 75, dynamic; the maximum percentage of the uncompressed data page that can be reserved as pad.
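A small example of adjusting both knobs at run time (the chosen values are arbitrary):

SET GLOBAL innodb_compression_failure_threshold_pct = 10;
SET GLOBAL innodb_compression_pad_pct_max = 60;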

New Enhancements for InnoDB Memcached

In MySQL 5.6, we continued our development on InnoDB Memcached and completed a few widely desired features that make InnoDB Memcached competitive in more scenarios. Notably, they are

1) Support multiple table mapping

2) Added background thread to auto-commit long running transactions

3) Enhancement in binlog performance

Let’s go over each of these features one by one. In the last section, we will also go over a couple of internally performed performance tests.

Support multiple table mapping

In our earlier release, all InnoDB Memcached operations were mapped to a single InnoDB table. In real life, users might want to use the InnoDB Memcached feature on different tables. Thus, being able to access different tables at run time, and having different mappings for different connections, becomes a very desirable feature. In this GA release, we allow the user to do both. We will discuss the key concepts and key steps in using this feature.

1) "mapping name" in the "get" and "set" command

In order to allow InnoDB Memcached to map to a new table, the user (DBA) is still required to "pre-register" the table(s) in the InnoDB Memcached “containers” table (there is a security consideration behind this requirement). If you would like to know more about the “containers” table, please refer to my earlier blogs on blogs.innodb.com. Once a table is registered, InnoDB Memcached will be able to look it up whenever it is referred to.

Each such registered table will have a unique "registration name" (or mapping_name) corresponding to the “name” field in the “containers” table. To access these tables, the user includes this "registration name" in their get or set commands, in the form "get @@new_mapping_name.key"; the prefix "@@" is required to signal a mapped table change. The key and the "mapping name" are separated by a configurable delimiter, by default ".". So the syntax is:

get [@@mapping_name.]key_name

set [@@mapping_name.]key_name

 or

 get @@mapping_name

set @@mapping_name

Here is an example:

Let's set up three tables in the "containers" table:

The first is a mapping to the InnoDB table "test/demo_test" with mapping name "setup_1":

INSERT INTO containers VALUES ("setup_1", "test", "demo_test",
"c1", "c2", "c3", "c4", "c5", "PRIMARY");

Similarly, we set up table mappings for table "test/new_demo" with name "setup_2", and for table "my_database/my_demo" with name "setup_3":

INSERT INTO containers VALUES ("setup_2", "test", "new_demo", "c1", "c2",
"c3", "c4", "c5", "secondary_index_x");

INSERT INTO containers VALUES ("setup_3", "my_database", "my_demo",
"c1", "c2", "c3", "c4", "c5", "idx");

To switch to the table "my_database/my_demo" and get the value corresponding to “key_a”, the user issues:

get @@setup_3.key_a

(this will also output the value corresponding to the key "key_a")

or simply

get @@setup_3

Once this is done, the connection switches to the "my_database/my_demo" table until another table mapping switch is requested, so it can continue to issue regular commands like:

get key_b

 set key_c 0 0 7

These DMLs will all be directed to the "my_database/my_demo" table.

This also implies that different connections can have different bindings (to different tables).

2) Delimiter:

For the delimiter "." that separates the "mapping name" and the key value, we also added a configuration option in the "config_options" system table, named "table_map_delimiter":

INSERT INTO config_options VALUES("table_map_delimiter", ".");

So if the user wants a different delimiter, they can change it in the config_options table.

3) Default mapping:

Once we have multiple table mappings, there should always be a "default" mapping. For this, we decided that if there exists a mapping named "default", it will be chosen as the default mapping. Otherwise, the first row of the containers table will be chosen as the default.
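For example, to make the "test/demo_test" mapping the default regardless of row order, one could register it under the name "default" (a sketch following the same column layout as the inserts above):

INSERT INTO containers VALUES ("default", "test", "demo_test",
"c1", "c2", "c3", "c4", "c5", "PRIMARY");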

Please note, user tables can be repeated in the "containers" table (for example, if the user wants to access different columns of the same table in different settings), as long as they use different mapping/configuration names in the first column, which is enforced by a unique index.

4) bind command

In addition, we also extended the protocol and added a bind command; its usage is fairly straightforward. To switch to the "setup_3" mapping above, you simply issue:

bind setup_3

This will switch this connection's InnoDB table to "my_database/my_demo".

In summary, with this feature you can now access different tables from different sessions, and even within a single connection you can query different tables.

Background thread to auto-commit long running transactions

This is a feature related to the “batch” concept we discussed in earlier blogs. This “batch” feature allows us to batch the read and write operations, and commit them only after a certain number of calls. The “batch” size is controlled by the configuration parameters “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size”. This can significantly boost performance.
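A quick way to check the configured batch sizes from the SQL side (assuming the memcached plugin is installed):

SHOW GLOBAL VARIABLES LIKE 'daemon_memcached_%_batch_size';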

However, it also comes with some disadvantages. For example, you will not be able to view the “uncommitted” operations from the SQL end unless you set the transaction isolation level to READ UNCOMMITTED. In addition, this will hold certain row locks for an extended period of time, which might reduce concurrency.
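For example, to peek at rows that memcached has written but not yet committed, a SQL session could do something like this (the table is the one from the mapping example above):

SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT c1, c2 FROM test.demo_test;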

To deal with this, we introduce a background thread that “auto-commits” transactions if they have been idle for a certain amount of time (by default 5 seconds). The background thread wakes up every second and loops through every “connection” opened by Memcached, checking for idle transactions. If such a transaction has been idle longer than the limit and is not being used, it will be committed. The limit is configurable by changing “innodb_api_bk_commit_interval”; its default value is 5 seconds, the minimum is 1 second, and the maximum is 1073741824 seconds.
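For example, assuming the variable can be changed at run time, idle memcached transactions could be allowed to stay open for up to 10 seconds before the background thread commits them:

SET GLOBAL innodb_api_bk_commit_interval = 10;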

With the help of this background thread, you no longer need to worry about long-running uncommitted transactions when setting daemon_memcached_w_batch_size and daemon_memcached_r_batch_size to a large number. This also reduces the number of locks that could be held due to long-running transactions, and thus further increases concurrency.

Enhancement in binlog performance

As you might all know, binlog operations are not done by the InnoDB storage engine; rather, they are handled in the MySQL layer. In order to support binlog operations through InnoDB Memcached, we have to artificially create some MySQL constructs in order to access the binlog handler APIs. In the previous lab release, for simplicity, we opened and destroyed these MySQL constructs (such as THD) for each operation. This required us to always set the “batch” size to 1 when the binlog is on, no matter what “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size” are configured to. This put a big restriction on our capability to scale, and there is also quite a bit of overhead in creating and destroying such constructs that bogs the performance down.

With this release, we made the necessary changes to keep the MySQL constructs for as long as they are valid for a particular connection, so there are no repeated and redundant open and close (table) calls. Now, even with the binlog option enabled (with innodb_api_enable_binlog), we can still batch the transactions with daemon_memcached_w_batch_size and daemon_memcached_r_batch_size, and thus scale the write/read performance.

Although there is still overhead that prevents InnoDB Memcached from performing as fast as when the binlog is turned off, it is much better compared to the previous release. We are continuing to optimize the solution in this area to improve the performance as much as possible.

Performance Study:

Amerandra of our System QA team has conducted some performance studies comparing queries through our InnoDB Memcached connection with the plain SQL end, and the results are interesting.

The test was conducted on a “Linux 2.6.32-300.7.1.el6uek.x86_64 ix86 (64)” machine with 16 GB of memory, an Intel Xeon 2.0 GHz X86_64 CPU (2 CPUs, 4 cores each), and 2 RAID disks (1027 GB, 733.9 GB). Results are described in the following tables:

Table 1: Performance comparison on Set operations

Connections | 5.6.7-RC-Memcached-plugin, Set (QPS), memcached-threads=8*** | 5.6.7-RC*, Set (QPS)** | X faster
8           | 30,000                                                        | 5,600                  | 5.36
32          | 59,000                                                        | 13,000                 | 4.54
128         | 68,000                                                        | 8,000                  | 8.50
512         | 63,000                                                        | 6,800                  | 9.23

* mysql-5.6.7-rc-linux2.6-x86_64

** The “set” operation, when implemented in InnoDB Memcached, involves a couple of DMLs: it first queries the table to see whether the “key” exists; if it does not, the new key/value pair is inserted, and if it does exist, the “value” field of the matching row (by key) is updated. On the plain SQL end in the test above, this logic was wrapped in a precompiled stored procedure, and the query just executes that procedure (see the SQL sketch after these notes).

*** added “--daemon_memcached_option=-t8” (default is 4 threads)
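Roughly, the "set" logic corresponds to the following SQL sketch against the mapped table (column roles as in the containers example above; the remaining columns are assumed to have defaults):

-- does the key already exist?
SELECT c2 FROM test.demo_test WHERE c1 = 'key_a';
-- if not, insert the new key/value pair
INSERT INTO test.demo_test (c1, c2) VALUES ('key_a', 'value_a');
-- otherwise, update the value of the matching row
UPDATE test.demo_test SET c2 = 'value_a' WHERE c1 = 'key_a';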

So we can see that with this “set” query, InnoDB Memcached can run 4.5 to 9 times faster than the MySQL server.

Table 2: Performance comparison on Get operations

Connections | 5.6.7-RC-Memcached-plugin, Get (QPS), memcached-threads=8 | 5.6.7-RC*, Get (QPS) | X faster
8           | 42,000                                                     | 27,000               | 1.56
32          | 101,000                                                    | 55,000               | 1.83
128         | 117,000                                                    | 52,000               | 2.25
512         | 109,000                                                    | 52,000               | 2.10

With the “get” query (or the select query), memcached performs 1.5 to 2 times faster than normal SQL.

Summary:

In summary, we added several much-desired features to InnoDB Memcached in this release, allowing users to operate on different tables through the Memcached interface. We also now provide a background commit thread for long-running idle transactions, thus allowing users to configure large batch sizes for writes/reads without worrying about a large number of row locks being held or not being able to see (uncommitted) data. We also greatly enhanced the performance when the binlog is enabled. We will continue making efforts in both performance enhancement and functionality to make InnoDB Memcached a good demo case for our InnoDB APIs.

Jimmy Yang, September 29, 2012
