Tuesday Dec 20, 2011

Improving InnoDB memory usage

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Vasil Dimov.

Last month we did a few improvements in InnoDB memory usage. We solved a challenging issue about how InnoDB uses memory in certain places of the code.

The symptom of the issue was that under a certain workloads the memory used by InnoDB kept growing infinitely, until OOM killer kicked in. It looked like a memory leak, but Valgrind wasn’t reporting any leaks and the issue was not reproducible on FreeBSD – it only happened on Linux (see Bug#57480). Especially the latest fact lead us to think that there is something in the InnoDB memory usage pattern that reveals a nasty side of the otherwise good-natured Linux’s memory manager.

It turned out to be an interesting memory fragmentation caused by a storm of malloc/free calls of various sizes. We had to track and analyze each call to malloc during the workload, including the code path that lead to it. We collected a huge set of analysis data – some code paths were executed many 10’000s of times! A hurricane of allocations and deallocations! We looked at the hottest ones hoping that some of them are not necessary, can be eliminated, avoided, minimized or stuck together. Luckily there were plenty of them!

After an extensive testing we did a numerous improvements, allocating the smallest chunks of the memory from the stack instead of from the heap, grouping allocations together where possible, removing unnecessary allocations altogether, estimating exactly how much memory will be consumed by a given operation and allocating it in advance and others and others and others.

This not only fixed Bug#57480 but improved InnoDB memory usage in general.

Note: the fix is not in the 5.6.4 release.

Continues with some numbers here.

Better scaling of read-only workloads

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Sunny Bains.

The problem and its cause

There have been several complaints over the years about InnoDB’s inability to scale beyond 256 connections. One of the main issues behind this scalability bottleneck was the read view creation that is required for MVCC (Multi Version Concurrency Control) to work. When the user starts a transaction this is what InnoDB does under the hood:

  • Create or reuse a transaction instance – usually it is reused, the transactions are reused from a pool (trx_sys_t::mysql_trx_list).
  • Initialize the transaction start time and assign a rollback segment
  • Append the transaction to an active  transaction list ordered on trx_t::id in descending order

The append to  the trx_sys_t::trx_list and corresponding remove during commit is covered by trx_sys_t::mutex. After the transaction is “started” and if the transaction has an isolation greater than or equal to REPEATABLE-READ then before the first record/row is accessed by the transaction, InnoDB creates a view (snapshot) of the running system state. It does this by examining the transactions that are active at the time of the MVCC snapshot, so that their changes can be excluded from the creating transaction’s read view. This read view creation is also covered by the trx_sys_t::mutex. As the number of active transactions in the system increases this read view creation takes longer and longer. This increases the wait times on the trx_sys_t::mutex (during transaction start and commit) and once threads are forced to wait on a condition variable (in contrast to simply spinning while waiting for the mutex) the system throughput drops dramatically.

The solution

While investigating this problem there were two observations that I made:

  • Read only transactions should not be considered in the MVCC snapshot
  • Auto commit non-locking read-only selects should not be in the trx_sys_t::trx_list at all

For the first to work we need to tag the transactions as READ ONLY when the transaction is started e.g.,


I will not be discussing this functionality in this blog because the syntax for this doesn’t exist in MySQL (yet). However, the functionality exists in InnoDB to handle this case and is  in the 5.6.4 release. Once the above syntax exists, InnoDB can take advantage of the new syntax trivially. What I want to talk about is the second case. This special case can be detected by InnoDB using existing state information and handled transparently without any syntax change in user applications and is fully functional in the 5.6.4 release.

InnoDB transaction life cycle redesign

Split the trx_sys_t::trx_list (the active transactions list) into two, trx_sys_t::ro_trx_list and trx_sys_t::rw_trx_list. Only transactions that are in the trx_sys_t::rw_trx_list are taken into consideration when creating the MVCC snapshot. For a read-only heavy work load the benefits are obvious,  the smaller size of the RW active transaction list makes the read view creation for MVCC (and purge) very fast. For auto-commit read-only non-locking selects the additional benefit is that we don’t need to acquire the trx_sys_t::mutex at all because we don’t put them on the active list. This removes the bottleneck around the trx_sys_t::mutex and improves concurrency and scalability.

Auto-commit read-only non-locking transactions go from state NOT STARTED -> ACTIVE -> NOT STARTED, in contrast to locking read-only (and read-write) transactions which go from state NOT STARTED -> ACTIVE -> COMMIT TO MEMORY -> NOT STARTED. The additional advantage in skipping the COMMIT TO MEMORY state is that we know that they cannot acquire any locks and therefore it is pointless to acquire the lock_sys_t::mutex and attempt lock release. Also, during COMMIT because they are not on any active transaction list we don’t need to acquire the trx_sys_t::mutex to remove them from the list, improving concurrency and performance further.

Changes to transaction state visibility

Currently (5.6.4) doesn’t display the auto-commit read-only non-locking selects in “SHOW ENGINE INNODB STATUS“, however they are visible in the INFORMATION_SCHEMA.innodb_trx table. The innodb_trx table has two new fields that can be queried to determine whether  a transaction is tagged as a read-only transaction and additionally whether it qualifies for the special handling of auto-commit read-only non-locking select. The new fields are:

  • trx_read_only – 0 or 1 (INT)
  • trx_autocommit_non_locking – 0 or 1 (INT)

Some additional minor tweaks, read-only transactions aren’t assigned a rollback segment and if they are flagged as autocommit non-locking selects then we only set the start time once every 32 times, his reduces the overhead of a system call.

Test results

These tests were run by Alexey Stroganov (a.k.a Ranger) using Sysbench (with –oltp-skip-trx=on), wl6046 refers to the internal worklog number of this performance fix. Note how InnoDB scales easily up to 4K threads on the 24 host and 1.5K threads on the 8 core host, there is very little (negligible) drop in the TPS as the number of threads is cranked up. In particular note how the peak TPS on the 24 core host is nearly double:-)

  • QUERY - SELECT c FROM sbtest WHERE id=N



The same result for SIMPLE RANGES, at lower thread levels 5.6 has less TPS than 5.5 on the 8 core host however that is due to another known and unrelated issue that is currently being worked on actively.


This is a significant step forward in InnoDB read-only performance and will allow InnoDB to scale up to a very high number of concurrent queries and take advantage of greater number of processors with the improved parallelism. Note: mixed workloads with a read-only heavy component will also benefit from this improvement. Currently, only if the read-only transactions are auto-commit non-locking selects. The better news is that with this split of the active transaction list into two, we can now optimize deadlock detection, lock release and transaction scheduling around lock grants and waits, stay tuned!

InnoDB Full-Text Search is in MySQL 5.6.4

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Jimmy Yang.

InnoDB full-text search (FTS) is finally available in MySQL 5.6.4 release. The feature has been on trial through MySQL’s summer lab release, thus we had several blogs covering the feature. In this blog, we will leave the “how to” part of the feature to those blogs, and focus on some important characteristics of this new feature, so you will have a better understanding when trying on the feature.

The InnoDB Full-text Index as an Inverted Index

When comes to the basic design, InnoDB takes a traditional way to implementation the full-text index, which is a so called “Inverted Index”. It composes of a set of auxiliary “index tables” that stores the “Word” “Doc ID” pair, as well as each word’s position in the original text.

Incoming text strings in the inserting records are extracted out, tokenized and decomposed into individual words, and such words are then inserted into the auxiliary “index tables” along with their position info and the Doc ID associated with the record. With the position info, we will be able to support proximity search that is lacking in MyISAM FTS.

Create the Full-text Index in Parallel

The Full-text Index is created in parallel. By default, the parallel degree is two, which means two threads are used to tokenize, sort and final insertion to the Full-text “index tables”.

To support parallel create index and future parallel processing, InnoDB partitions the “full-text index” into six auxiliary “index tables”. The words are divided among these tables based on their first character’s charset sort weight. Currently, the partition criteria is hard-coded, targeted to the Latin characters. In the future, we will allow user to define their own partition criteria, to better support other character sets.

Batching the insert value with InnoDB Full-text Index’s “index cache”

As just discussed, when inserting into an “Inverted Index”, each inserting string will be tokenized, and decomposed into individual words before inserting into the auxiliary table. A single insertion could result in many small insertions into the inverted index tables. This magnifies the insertion operations by several or even dozens of fold, thus becomes normal insertion a disruptive fact for concurrent access to the table.

Our solution is to build an “internal” FTS “index cache” to temporarily cache these insertions, and batch flush to the disk once this index cache is full. Thus, it avoids frequent updates to the “Index Table” during busy inserts and updates, so the index table only needs to be synchronized when the index cache is full. The batching technique also avoids repetitively storing the same word, minimizes the number of entries for each word in the “Index Table”. So instead of flushing each word with a single “ilist” entry (with a single Doc ID), it batches result from many inserted documents so to create an “ilist” with multiple DocID/position pairs, before we flush this info to disk and also make a single entry. This reduces the redundancy and make the “Index Table” smaller.

The index cache is index specific. And has a configuration parameter (innodb_ft_cache_size) to configure the size of the cache. It stores the same information as those auxiliary tables, using a red-black tree data structure. However, this cache is different from an index tree in that it is only used to cache the word of the inserted documents, words on disk do not need to be brought back to this cache before they can be queried. In another word, the words in the “Index Table” are queried directly as we query into relational tables. And such result from “Index Table” will merge (union) with the query results from the Index Cache before sending back to user.

How InnoDB Full-text Index handles Deletes

Similar to inserts, each delete of the record with full-text index could result in many small “deletes”, if we had implemented it as normal deletes. Again, this could be very disruptive.

To avoid such disruptive operation, InnoDB Full-text Index develops a new scheme that only logs the Doc ID of the deleted Documents in a separate “DELETED” auxiliary table. The actually “indexed” record would still remain in the FTS index. However, these deleted Doc Ids will be filtered out from the final results by consulting the “DELETED” table before returning the query result.

The benefit of this design is obvious. Delete operation then becomes a trivial and fast operation, with minimum impact on the index itself. The shortcoming is that the index would not shrink along with record deletion. However, the benefits far overweight the drawbacks.

To cleanup the deleted records in the index, you will resort to the InnoDB optimize table utility, which rebuilt the Full-text index online. And we will discuss it in the next section.

Index Optimization

The optimize table on the Full-Text index essentially rebuilds the full-text index, and leaves those “deleted” Docs out of the new index. This operation does two things:

1) Remove the deleted Doc ID from the word’s DocID/Position pair list (ilist).

2) Consolidate multiple entries for the same word to one (or less) entries if possible by consolidating their DocID/Position pair list (ilist), as we will have as little entry as possible.

Currently, we overload this optimization operation with the “Optimize Table” command, and runs only if you turn on “innodb_optimize_fulltext_only” system configure variable:

mysql> set global innodb_optimize_fulltext_only=1;

mysql> optimize table articles;

Since the table could be large, and optimization (reorg rebuild index) may take a tremendous amount of time, we allow the optimization to be done in stages. The system configuration variable “innodb_ft_num_word_optimize” specifies how many words to optimize for each “optimize table” command. Its default value is 2000, meaning each time it optimizes 2000 words. And when the next “optimize table” command is issued, the server continues the process from where it left off from last run.

InnoDB Full-text search Transaction Behavior

As you could see from last two sections, the main idea behind InnoDB Full-text index DML is to batch the operations as much as possible. This determines there will be some special characteristics when coming to transaction handling.

In fact, all of the DML operations (insert/delete/update) involving Full-text indexed columns are processed at the transaction commit time. For example, for insertion, the tokenization for inserted strings is performed only at commit time, as a result, a full-text search only sees the committed data. This “delayed” process/synchronization behavior presents in all transactional DBMS text handling. And we do so at the transaction commit time.

InnoDB FTS Document ID

All of above design can’t be done without the support of a so called FTS Doc ID identifier. The mapping from word to original table record goes through such a unique ID. So there must be a Doc ID column in the table with Full-text index.

In our case, it is represented by the “FTS_DOC_ID” (uppercase required) column. If this column is not defined, InnoDB will automatically add it to the user table when creating the full-text index. The column itself must be of “BIGINT UNSIGNED NOT NULL” type, with a unique index named FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the full-text index after loading data, since InnoDB does not need to copy the table for adding such column (and also rebuild all indexes include primary index). In such case, user will be responsible to properly manage the Doc ID column so that document will not be inappropriately represented. If create index performance is not a concern, you can leave Doc ID management to InnoDB, by leaving out this column. InnoDB will added “FTS_DOC_ID” as a hidden columns. The “ FTS_DOC_ID_INDEX” will also be created automatically.

Stopword Handling

Stopwords are common or trivial words that are omitted from the full-text index. The InnoDB FTS provides two sources of stopwords:

1) InnoDB predefined default stopwords – If no user stopword list is defined, this default stopwords list is used.

You can view this list by querying the table INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:

mysql> select * from INNODB_FT_DEFAULT_STOPWORD;

2) User-defined stopwords – You can define your own stopwords by creating a table with a single column named “value”, with datatype “varchar”, and pointing the global variable “innodb_ft_server_stopword_table” to this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. And an example will be here:

  • # Define a correctly formatted user stopword table

create table user_stopword(value varchar(30)) engine = innodb;

  • # Point to this stopword table with “db name/table name”

set global innodb_ft_server_stopword_table = “test/user_stopword”


Currently, we use a very simple ranking mechanism (term frequency, inverse document frequency) for document ranking. This means the more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked. We do have plans to add more sophisticated ranking mechanism and allow users to supplied the weight in the ranking calculation too.


As the first release of InnoDB FTS, it provides limited and very basic functionalities for Full-text searches. And it has some limitations that worth mentioning:

  1. It does not yet support stemming.
  2. For ideographic languages such as CJK (Chinese, Japanese and Korea), which do not have word delimiters, InnoDB FTS does not yet support N-GRAM parsing (similar to MyISAM FTS).
  3. Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation (similar to MyISAM FTS).
  4. It does not support external parser plugins
  5. Ranking mechanism is relatively simple


The InnoDB full-text search gives InnoDB an important capability in using MySQL to handle text documents, and we will further develop it to continue increase its performance and usability.

InnoDB 5.6.4 supports databases with 4k and 8k page sizes

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Kevin Lewis.

In the 5.6.4 release it is now possible to create an InnoDB database with 4k or 8k page sizes in addition to the original 16k page size. Previously, it could be done by recompiling the engine with a different value for UNIV_PAGE_SIZE_SHIFT and UNIV_PAGE_SIZE. With this release, you can set –innodb-page-size=n when starting mysqld, or put innodb_page_size=n in the configuration file in the [mysqld] section where n can be 4k, 8k, 16k, or 4096, 8192, 16384.

The support of smaller page sizes may be useful for certain storage media such as SSDs. Performance results can vary depending on your data schema, record size, and read/write ratio. But this provides you more options to optimize your performance.

When this new setting is used, the page size is set for all tablespaces used by that InnoDB instance. You can query the current value with;

SHOW VARIABLES LIKE ‘innodb_page_size’;
SELECT variable_value FROM information_schema.global_status  WHERE LOWER(variable_name) = ‘innodb_page_size’;

It is a read-only variable while the engine is running since it must be set before InnoDB starts up and creates a new system tablespace. That happens when InnoDB does not find ibdata1 in the data directory. If you start mysqld with a page size other than the standard 16k, the error log will contain something like this;

111214 15:55:05 InnoDB: innodb-page-size has been changed from the default value 16384 to 4096.

If your system tablespace already exists using one page size and innodb-page-size is something else, the engine will not start. There will be a message logged like this;

111214 16:06:51 InnoDB: Error: data file .\ibdata1 uses page size 4096,
111214 16:06:51 InnoDB: but the start-up parameter is innodb-page-size=16384

InnoDB knows the page size used in an existing tablespace created by version 5.6.4 because it stamps that page size in the header page. But this is not readable to older engines, of course. If an older engine opens a database with a page size other than 16k, it will report a corrupted page and quit.

All features in InnoDB work the same with smaller page sizes. But some limits are affected. The maximum record size is proportionately less with smaller pages. Record length limits are calculated within InnoDB based on a variety of factors including row type, column type and length, number of columns, secondary indexes, index prefix lengths, and of course, the page size. The main record is stored in a clustered index and a minimum of two records must fit into each page. So the maximum record length for 8k pages is about half that of 16k pages and the maximum record size with 4k pages is about half that of 8k pages.

In addition to record lengths, the maximum key lengths are proportionately smaller. For 16k pages, MySQL prevents key definitions from containing over 3072 bytes of data. This means that the primary key in the clustered index cannot use more than 3072 bytes of data. Secondary indexes can only contain up to 3072 bytes of data in the key also. But within a secondary index, InnoDB must store the primary key as the record pointer.  So if you have a 3072 byte primary key and a 3072 byte secondary key, each entry in the secondary index contains 6144 bytes of data plus internal record overhead. The amount of internal overhead is dependent upon the column types, but this gets the secondary record close to half the page size which is its natural limit. So this limit of 3072 bytes per key is less than but close to what InnoDB would have to impose on a table by table basis.  Based on that, InnoDB now reports to MySQL that any key defined for 8k page sizes must be less than 1536 bytes.  Likewise, the maximum key length when InnoDB uses 4k page sizes is 768 bytes.

If you have a database schema with any large records or keys defined, you may not be able to use smaller page sizes. Even if your records do barely fit in the clustered index page, it may not be advisable to use these smaller pages because the btree will be a lot deeper.  For example, if only 2 records fit on the page and there are 1,000,000 records, leaf pages are 20 levels deep, meaning InnoDB will need to read 20 pages to find the leaf page.  If that were on 4k pages, then using the same table on 16k pages would give 8 records per page and the leaf pages would only be 7 levels down.

There is a trick to reducing the size of records on the clustered index page;
1) Use Dynamic or Compressed row format.
2) Convert VARCHAR fields to TEXT fields.  (VARBINARY can be converted to BLOB)

There are 4 ROW FORMATS in InnoDB. The first two, Redundant and Compact, which are considered the Antelope file version, store at least 768 bytes of each field in the clustered record.  The Barracuda file version consists of Compact and Dynamic ROW FORMATS. These have the ability to put all of a VARCHAR, VARBINARY, TEXT or BLOB field onto a separate BLOB page for storage.  So one good way to prepare a table structure to use smaller page sizes is to use Dynamic or Compressed row format.

The decision of how big a record will be in the clustered record is made during the INSERT or UPDATE.  Each record is evaluated for its own length.  If the record is too long to fit in half the page, InnoDB will look for the longest actual field in that record and put as much of that field off-page as possible based on the row type.  Then if it is still to long, it will shorten the longest field remaining.  Since this is done when the record is added to the page, different records may have different columns stored off-page when there are multiple long fields in the record.

VARCHAR/VARBINARY fields are treated like TEXT/BLOB fields if they are over 255 bytes long.  If you are using Compressed or Dynamic row format and your record is too long because you have too many VARCHAR fields 255 bytes or less, you can reduce the record length by converting them to TEXT fields. Likewise, VARBINARY fields 255 bytes or less can be converted to BLOB fields to take advantage of this ability to store the whole TEXT or BLOB field on a separate page.

A file extent in InnoDB is 1 Mb independent of the page size. So an extent will hold 64 16k pages, 128 8k pages and 256 4k pages.  This means that the read ahead mechanisms will read more pages with smaller page sizes since they read a whole extent at a time.  The doublewrite buffer, which is based on the size of an extent, will also contain more pages.

If you want to use smaller page sizes with existing data, export the data first with a logical export utility such as mysqldump. Then create the new mysql instance with innodb-page-size=4k or 8k and import the data. Do not use a physical export method such as alter table … discard tablespace.


This feature makes it easier to try smaller page sizes in an InnoDB database. And with the 5.6.4 release, those smaller page sizes are fully supported by MySQL. Just export your data, move or delete the system database (ibdata1) and the log files (ib_logfile0 & ib_logfile1), set innodb-page-size to either 4k or 8k, and restart MySQL. A new InnoDB instance will be created with the smaller page size. Then you can import your data and run your tests, all without recompiling InnoDB.


This is the InnoDB team blog.


« December 2011 »