Thursday Dec 12, 2013

MySQL 5.7.3: Deep dive into 1mil QPS with InnoDB & Memcached

As you probably already know, in MySQL 5.7.3 release, InnoDB Memcached reached a record of over 1 million QPS on a read only load. The overview of the benchmark and testing results can be seen in an earlier blog by Dimitri. In this blog, I will spend sometime on the detail changes we have made to achieve this number.

First thanks to Facebook's Yoshinori with his bug#70172 that brought our attention to this single commit read only load test. We have been focussing on operation with large batch size. This bug prompted us to do a series of optimization on single commit read only queries and these optimizations eliminate almost all major bottlenecks from the InnoDB Memcached plugin itself.

If you are just getting familiar with InnoDB Memcached, there are some earlier blog on the topics to get you started. In a short word, InnoDB Memcached allows a fast path to retrieve key value data stored in the InnoDB table, with Memcached protocol.

The Benchmark:

Now, Let's discuss the testing scenario. The InnoDB Memcached plugin configurations are all by default in this benchmark, which means, the daemon_memcached_r_batch_size was also set to be 1, and the read operation would do a begin and commit transaction for each query. It is equivalent to auto-commit single selects through SQL interface. The innodb_api_trx_level is by default set to 0 (read uncommitted), however, changing it to 2 (repeatable read) gave the same benchmark result.

Another good news in 5.7.3 is that we start to support integer key column mapping, as it is common to use integer as primary key for a table. And the table used in this benchmark comes with integer as the key column. The mapping table contains only key and value columns. So we set the corresponding `flags`, `cas_column` and `expire_time_column` column in the config containers table all to NULL, this avoids overhead columns to support Memcached "extra" options. The table itself containers 1 million rows, each with a short integer key and a text value.

Here is the detail table definition

mysql> desc test.memc_test;
















To make InnoDB Memcached recognize this InnoDB table, insert following row into innodb_memcache/containers table


("memc1", "test", "memc_test", "id", "value3", null, null, null, "PRIMARY");

The memcached client for inserting rows and querying is a simple libmemcached program provided by Yoshinori.Each query would do a key lookup and fetches corresponding value.

We made some adjustment so that there are multiple client processes, each with multiple sessions. This was used to alleviate bottlenecks in the client itself.

As a note, there are many memcached clients out there, and Memcached clients can play  important roles in the performance result itself. For example, we observed at least 3 times difference on result with Perl client Cache::Memcached::Fast when comparing to its slower version Cache::Memcached. And as far as we can see, libmemcached is one of the most efficient clients available, even though eventually it becomes bottleneck itself as the test progresses, especially requests through the network.

The test result can be seen in Dimitri's blog, so I will not repeat them here. The summary is that we got close to 1.2 million QPS at the best. The next bottleneck now seems lying at the adaptive hash index's global latch - "btr_search_latch". The libmemcached client overhead is also significant.

Read Only QPS through InnoDB Memcached

The improvement:

There are several changes in both InnoDB Memcached code and Memcached Native code to achieve the record benchmarks.

1. The first is of course to address the issue brought by bug #70712. With daemon_memcached_r_batch_size set to 1, the transaction is being repeatedly started and committed for each query. It is better to cache the trx object itself, to avoid repeated create and destroy the trx object. Otherwise, the "trx_sys mutex" will kill the concurrency.

After the change, the trx object is cached with private memcached connection data. Each connection gets its own trx object, and it is used to handle transactions through this particular connection.

2. The next thing we did is to take advantage of the read only optimization recently made in the InnoDB code. This scenario (single read trx) is perfect to use the optimization. Whenever the read batch size is set to 1, InnoDB Memcached will treat incoming queries as auto-commit read only query. It will automatically hook up to the "fast path" of read-only operation in InnoDB.

3. After these two transaction related changes, we found the bottleneck comes from Memcached native Code itself. As a note, we embedded the Memcached code itself in our InnoDB Memcached plugin, so any bottleneck in Memcached will affect us.

The original Memcached memory allocation is protected by a global Cache Lock (engine->cache_lock), and it quickly rises in prominence in the profiling result.

Even though the data is stored in InnoDB, we happened to still use some of Memcached's own memory allocation to store and deliver the result back to the front end. To fix this bottleneck, we stopped using Memcached Memory altogether. Instead a connection private memory buffer is used to store and deliver the result. This also saves a memcpy as we move the data to memcached memory as before.

This change makes InnoDB Memcached plugin as thin as possible, and only relies on the InnoDB buffer pool and Adaptive Hash Index (AHI) as the backing store for the data. This provides better scaling and memory handling than Memcached itself.

4. Another bottleneck in Memcached is its statistics mutex ("thread_stats->mutex"). This also becomes significant as testing goes. So to remove it, we switched to using atomic operations whenever the platform supports (most modern platforms do). With these changes, we can now well scale the plugin to over 100 connections without degradation as the number of connections are ramped up.

5. In addition to removing those major bottlenecks, we also streamline the code to remove some overhead work. For example, we start to cached the "search tuple", so that there is no need to allocate the search tuple for each query. This is to keep the InnoDB Memcached as lean as possible.

With these changes, we have eliminated all the major InnoDB Memcached Plugin bottlenecks. The bottlenecks now comes from clients themselves and to a lesser degree from the Adaptive Hash Index search latch.

Future work:

Now the Memcached read goes more than twice as fast as those from SQL end. By using the InnoDB buffer pool as the in-memory store, and with InnoDB AHI, InnoDB Memcached can probably provide an efficient and more scalable store than Memcached itself.

There is still more to be done.

1. We will continue to remove some bottlenecks in InnoDB (such as btr_search_latch), as well as make InnoDB memcached leaner/faster.

2. We will add support to "mgets" command, which allows Memcached to fetch multiple results (corresponding to multiple keys) in one query attempts. This would again give us another big jump in terms of QPS.

3. We will start to focus more on insertion/updates operations.

4. We are considering extending the functionality of the memcached interface to support range queries etc. So to make it a more versatile key value store.

In summary, with these enhancements, the InnoDB Memcached becomes more and more attractive as as quick  key value store through the MySQL server.

Your feedback and comments are important to us as we evolve and improve this plugin. 

Saturday Sep 29, 2012

New Enhancements for InnoDB Memcached

In MySQL 5.6, we continued our development on InnoDB Memcached and completed a few widely desirable features that make InnoDB Memcached a competitive feature in more scenario. Notablely, they are

1) Support multiple table mapping

2) Added background thread to auto-commit long running transactions

3) Enhancement in binlog performance

 Let’s go over each of these features one by one. And in the last section, we will go over a couple of internally performed performance tests.

Support multiple table mapping

In our earlier release, all InnoDB Memcached operations are mapped to a single InnoDB table. In the real life, user might want to use this InnoDB Memcached features on different tables. Thus being able to support access to different table at run time, and having different mapping for different connections becomes a very desirable feature. And in this GA release, we allow user just be able to do both. We will discuss the key concepts and key steps in using this feature.

1) "mapping name" in the "get" and "set" command

In order to allow InnoDB Memcached map to a new table, the user (DBA) would still require to "pre-register" table(s) in InnoDB Memcached “containers” table (there is security consideration for this requirement). If you would like to know about “containers” table, please refer to my earlier blogs in Once registered, the InnoDB Memcached will then be able to look for such table when they are referred.

Each of such registered table will have a unique "registration name" (or mapping_name) corresponding to the “name” field in the “containers” table.. To access these tables, user will include such "registration name" in their get or set commands, in the form of "get @@new_mapping_name.key", prefix "@@" is required for signaling a mapped table change. The key and the "mapping name" are separated by a configurable delimiter, by default, it is ".". So the syntax is:

get [@@mapping_name.]key_name

set [@@mapping_name.]key_name


 get @@mapping_name

set @@mapping_name

Here is an example:

Let's set up three tables in the "containers" table:

The first is a map to InnoDB table "test/demo_test" table with mapping name "setup_1"

INSERT INTO containers VALUES ("setup_1", "test", "demo_test",

"c1", "c2", "c3", "c4", "c5", "PRIMARY");

 Similarly, we set up table mappings for table "test/new_demo" with name "setup_2" and that to table "mydatabase/my_demo" with name "setup_3":

INSERT INTO containers VALUES ("setup_2", "test", "new_demo", "c1", "c2",

"c3", "c4", "c5", "secondary_index_x");

INSERT INTO containers VALUES ("setup_3", "my_database", "my_demo",

"c1", "c2", "c3", "c4", "c5", "idx");

To switch to table "my_database/my_demo", and get the value corresponding to “key_a”, user will do:

get @@setup_3.key_a

(this will also output the value that corresponding to key "key_a"

or simply

get @@setup_3

Once this is done, this connection will switch to "my_database/my_demo" table until another table mapping switch is requested. so it can continue issue regular command like:

get key_b

 set key_c 0 0 7

These DMLs will all be directed to "my_database/my_demo" table.

And this also implies that different connections can have different bindings (to different table).

2) Delimiter:

For the delimiter "." that separates the "mapping name" and key value, we also added a configure option in the "config_options" system table with name of "table_map_delimiter":

INSERT INTO config_options VALUES("table_map_delimiter", ".");

So if user wants to change to a different delimiter, they can change it in the config_option table.

3) Default mapping:

Once we have multiple table mapping, there should be always a "default" map setting. For this, we decided if there exists a mapping name of "default", then this will be chosen as default mapping. Otherwise, the first row of the containers table will chosen as default setting.

Please note, user tables can be repeated in the "containers" table (for example, user wants to access different columns of the table in different settings), as long as they are using different mapping/configure names in the first column, which is enforced by a unique index.

4) bind command

In addition, we also extend the protocol and added a bind command, its usage is fairly straightforward. To switch to "setup_3" mapping above, you simply issue:

bind setup_3

This will switch this connection's InnoDB table to "my_database/my_demo"

In summary, with this feature, you now can direct access to difference tables with difference session. And even a single connection, you can query into difference tables.

Background thread to auto-commit long running transactions

This is a feature related to the “batch” concept we discussed in earlier blogs. This “batch” feature allows us batch the read and write operations, and commit them only after certain calls. The “batch” size is controlled by the configure parameter “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size”. This could significantly boost performance.

However, it also comes with some disadvantages, for example, you will not be able to view “uncommitted” operations from SQL end unless you set transaction isolation level to read_uncommitted, and in addition, this will held certain row locks for extend period of time that might reduce the concurrency.

To deal with this, we introduce a background thread that “auto-commits” the transaction if they are idle for certain amount of time (default is 5 seconds). The background thread will wake up every second and loop through every “connections” opened by Memcached, and check for idle transactions. And if such transaction is idle longer than certain limit and not being used, it will commit such transactions. This limit is configurable by change “innodb_api_bk_commit_interval”. Its default value is 5 seconds, and minimum is 1 second, and maximum is 1073741824 seconds.

With the help of such background thread, you will not need to worry about long running uncommitted transactions when set daemon_memcached_w_batch_size and daemon_memcached_r_batch_size to a large number. This also reduces the number of locks that could be held due to long running transactions, and thus further increase the concurrency.

Enhancement in binlog performance

As you might all know, binlog operation is not done by InnoDB storage engine, rather it is handled in the MySQL layer. In order to support binlog operation through InnoDB Memcached, we would have to artificially create some MySQL constructs in order to access binlog handler APIs. In previous lab release, for simplicity consideration, we open and destroy these MySQL constructs (such as THD) for each operations. This required us to set the “batch” size always to 1 when binlog is on, no matter what “daemon_memcached_w_batch_size” and “daemon_memcached_r_batch_size” are configured to. This put a big restriction on our capability to scale, and also there are quite a bit overhead in creating destroying such constructs that bogs the performance down.

With this release, we made necessary change that would keep MySQL constructs as long as they are valid for a particular connection. So there will not be repeated and redundant open and close (table) calls. And now even with binlog option is enabled (with innodb_api_enable_binlog,), we still can batch the transactions with daemon_memcached_w_batch_size and daemon_memcached_r_batch_size, thus scale the write/read performance.

Although there are still overheads that makes InnoDB Memcached cannot perform as fast as when binlog is turned off. It is much better off comparing to previous release. And we are continuing optimize the solution is this area to improve the performance as much as possible.

Performance Study:

Amerandra of our System QA team have conducted some performance studies on queries through our InnoDB Memcached connection and plain SQL end. And it shows some interesting results.

The test is conducted on a “Linux 2.6.32-300.7.1.el6uek.x86_64 ix86 (64)” machine with 16 GB Memory, Intel Xeon 2.0 GHz CPU X86_64 2 CPUs- 4 Core Each, 2 RAID DISKS (1027 GB,733.9GB). Results are described in following tables:

Table 1: Performance comparison on Set operations


5.6.7-RC-Memcached-plugin ( TPS / Qps) with memcached-threads=8***


X faster

Set (QPS)


















* mysql-5.6.7-rc-linux2.6-x86_64

** The “set” operation when implemented in InnoDB Memcached involves a couple of DMLs: it first query the table to see whether the “key” exists, if it does not, the new key/value pair will be inserted. If it does exist, the “value” field of matching row (by key) will be updated. So when used in above query, it is a precompiled store procedure, and query will just execute such procedures.

*** added “–daemon_memcached_option=-t8” (default is 4 threads)

So we can see with this “set” query, InnoDB Memcached can run 4.5 to 9 time faster than MySQL server.

Table 2: Performance comparison on Get operations


5.6.7-RC-Memcached-plugin ( TPS / Qps) with memcached-threads=8


X faster

Get (QPS)


















With the “get” query (or the select query), memcached performs 1.5 to 2 times faster than normal SQL.


In summary, we added several much-desired features to InnoDB Memcached in this release, allowing user to operate on different tables with this Memcached interface. We also now provide a background commit thread to commit long running idle transactions, thus allow user to configure large batch write/read without worrying about large number of rows held or not being able to see (uncommit) data. We also greatly enhanced the performance when Binlog is enabled. We will continue making efforts in both performance enhancement and functionality areas to make InnoDB Memcached a good demo case for our InnoDB APIs.

Jimmy Yang, September 29, 2012


This is the InnoDB team blog.


« July 2016