Tuesday Dec 20, 2011

InnoDB 5.6.4 supports databases with 4k and 8k page sizes

Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Kevin Lewis.

In the 5.6.4 release it is now possible to create an InnoDB database with 4k or 8k page sizes in addition to the original 16k page size. Previously, this could be done only by recompiling the engine with different values for UNIV_PAGE_SIZE_SHIFT and UNIV_PAGE_SIZE. With this release, you can set --innodb-page-size=n when starting mysqld, or put innodb_page_size=n in the [mysqld] section of the configuration file, where n can be 4k, 8k, 16k, or 4096, 8192, 16384.
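A minimal configuration-file sketch (all other settings omitted) might look like this:

```ini
# my.cnf -- must be in place before InnoDB creates the system tablespace
[mysqld]
innodb_page_size=4k
```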

The support of smaller page sizes may be useful for certain storage media such as SSDs. Performance results can vary depending on your data schema, record size, and read/write ratio. But this provides you more options to optimize your performance.

When this new setting is used, the page size is set for all tablespaces used by that InnoDB instance. You can query the current value with:

SHOW VARIABLES LIKE 'innodb_page_size';
or
SELECT variable_value FROM information_schema.global_status WHERE LOWER(variable_name) = 'innodb_page_size';

It is a read-only variable while the engine is running, since it must be set before InnoDB starts up and creates a new system tablespace. That happens when InnoDB does not find ibdata1 in the data directory. If you start mysqld with a page size other than the standard 16k, the error log will contain something like this:

111214 15:55:05 InnoDB: innodb-page-size has been changed from the default value 16384 to 4096.

If your system tablespace already exists using one page size and innodb-page-size is set to another, the engine will not start. A message like this will be logged:

111214 16:06:51 InnoDB: Error: data file .\ibdata1 uses page size 4096,
111214 16:06:51 InnoDB: but the start-up parameter is innodb-page-size=16384

InnoDB knows the page size used in an existing tablespace created by version 5.6.4 because it stamps that page size in the header page. But this is not readable to older engines, of course. If an older engine opens a database with a page size other than 16k, it will report a corrupted page and quit.

All features in InnoDB work the same with smaller page sizes. But some limits are affected. The maximum record size is proportionately less with smaller pages. Record length limits are calculated within InnoDB based on a variety of factors including row type, column type and length, number of columns, secondary indexes, index prefix lengths, and of course, the page size. The main record is stored in a clustered index and a minimum of two records must fit into each page. So the maximum record length for 8k pages is about half that of 16k pages and the maximum record size with 4k pages is about half that of 8k pages.

In addition to record lengths, the maximum key lengths are proportionately smaller. For 16k pages, MySQL prevents key definitions from containing over 3072 bytes of data. This means that the primary key in the clustered index cannot use more than 3072 bytes of data. Secondary indexes can only contain up to 3072 bytes of data in the key also. But within a secondary index, InnoDB must store the primary key as the record pointer.  So if you have a 3072 byte primary key and a 3072 byte secondary key, each entry in the secondary index contains 6144 bytes of data plus internal record overhead. The amount of internal overhead is dependent upon the column types, but this gets the secondary record close to half the page size which is its natural limit. So this limit of 3072 bytes per key is less than but close to what InnoDB would have to impose on a table by table basis.  Based on that, InnoDB now reports to MySQL that any key defined for 8k page sizes must be less than 1536 bytes.  Likewise, the maximum key length when InnoDB uses 4k page sizes is 768 bytes.
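To illustrate the limit (the table and column names here are hypothetical, not from the original post), a key definition that is legal with 16k pages would be rejected with 4k pages, where keys must fit in 768 bytes:

```sql
-- With innodb_page_size=4k, a 1000-byte index prefix exceeds the
-- 768-byte key limit, so this definition would be refused
-- ("Specified key was too long"); with 16k pages it is accepted.
CREATE TABLE t_long_key (
    a VARCHAR(2000) CHARACTER SET latin1,
    KEY (a(1000))
) ENGINE=InnoDB;
```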

If you have a database schema with any large records or keys defined, you may not be able to use smaller page sizes. Even if your records do barely fit in the clustered index page, it may not be advisable to use these smaller pages because the B-tree will be much deeper. For example, if only 2 records fit on a page and there are 1,000,000 records, the leaf pages are 20 levels deep, meaning InnoDB must read 20 pages to find a leaf page. If that were on 4k pages, the same table on 16k pages would hold 8 records per page and the leaf pages would be only 7 levels down.

There is a trick to reducing the size of records on the clustered index page:
1) Use Dynamic or Compressed row format.
2) Convert VARCHAR fields to TEXT fields. (VARBINARY can be converted to BLOB.)
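As a sketch on a hypothetical table (the table and column names are illustrative):

```sql
-- Step 1: switch to the Barracuda Dynamic row format so long
-- variable-length fields can be stored entirely off-page.
-- (Dynamic requires innodb_file_format=Barracuda.)
ALTER TABLE my_table ROW_FORMAT=DYNAMIC;

-- Step 2: convert a long VARCHAR column to TEXT.
ALTER TABLE my_table MODIFY description TEXT;
```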

There are 4 ROW FORMATS in InnoDB. The first two, Redundant and Compact, which make up the Antelope file version, store the first 768 bytes of each long field in the clustered record. The Barracuda file version consists of the Dynamic and Compressed ROW FORMATS. These can store all of a VARCHAR, VARBINARY, TEXT or BLOB field on a separate BLOB page. So one good way to prepare a table structure to use smaller page sizes is to use the Dynamic or Compressed row format.

The decision of how big a record will be in the clustered record is made during the INSERT or UPDATE. Each record is evaluated for its own length. If the record is too long to fit in half the page, InnoDB will look for the longest actual field in that record and put as much of that field off-page as possible, based on the row type. Then, if it is still too long, it will shorten the longest field remaining. Since this is done when the record is added to the page, different records may have different columns stored off-page when there are multiple long fields in the record.

VARCHAR/VARBINARY fields are treated like TEXT/BLOB fields if they are over 255 bytes long.  If you are using Compressed or Dynamic row format and your record is too long because you have too many VARCHAR fields 255 bytes or less, you can reduce the record length by converting them to TEXT fields. Likewise, VARBINARY fields 255 bytes or less can be converted to BLOB fields to take advantage of this ability to store the whole TEXT or BLOB field on a separate page.

A file extent in InnoDB is 1MB independent of the page size. So an extent holds 64 16k pages, 128 8k pages, or 256 4k pages. This means that the read-ahead mechanisms will read more pages with smaller page sizes, since they read a whole extent at a time. The doublewrite buffer, which is based on the size of an extent, will also contain more pages.

If you want to use smaller page sizes with existing data, export the data first with a logical export utility such as mysqldump. Then create the new MySQL instance with innodb-page-size=4k or 8k and import the data. Do not use a physical export method such as alter table … discard tablespace.
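A sketch of that procedure (paths, file locations, and the shutdown step are illustrative):

```shell
# 1) Take a logical dump of the existing data.
mysqldump --all-databases > dump.sql
# 2) Shut down mysqld, then move aside the old system tablespace
#    and log files so a fresh instance will be created.
mv ibdata1 ib_logfile0 ib_logfile1 /some/backup/dir/
# 3) Restart with the new page size; InnoDB builds a new instance.
mysqld --innodb-page-size=4k
# 4) Re-import the data.
mysql < dump.sql
```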

Summary:

This feature makes it easier to try smaller page sizes in an InnoDB database. And with the 5.6.4 release, those smaller page sizes are fully supported by MySQL. Just export your data, move or delete the system tablespace (ibdata1) and the log files (ib_logfile0 & ib_logfile1), set innodb-page-size to either 4k or 8k, and restart MySQL. A new InnoDB instance will be created with the smaller page size. Then you can import your data and run your tests, all without recompiling InnoDB.

Monday Oct 03, 2011

InnoDB Memcached with Binlog Capability

Note: this article was originally published on http://blogs.innodb.com on Oct 3, 2011 by Jimmy Yang.

In our earlier Labs release, we presented a MySQL NoSQL solution through the InnoDB Memcached Daemon Plugin (see Calvin's and my earlier posts). That release allowed the Memcached plugin to interact directly with InnoDB, completely bypassing the MySQL optimizer and QP layers. It differs from another popular MySQL NoSQL solution, HandlerSocket, by skipping even the Handler API layer and accessing InnoDB directly through the InnoDB APIs. Thus, in theory, it is simpler and more efficient.

However, there is one major piece of functionality we did not implement in the first release: replication (binlog) capability. That important feature is now done in the latest Labs release (mysql-5.6.4-labs-innodb-memcached), in which all memcached commands are supported for binlogging.

In the following sections, we will quickly walk you through the prototype and show how to use the binlog capability.

Set up the InnoDB Memcached:

Our earlier post (Get started with InnoDB Memcached Daemon plugin) details the steps to set up InnoDB Memcached. We will go very briefly over some important steps here:

  • Copy the plugins to the plugin directory

Find out your plugin directory:

        mysql> select @@plugin_dir;
        +----------------------------------------------------+
        | @@plugin_dir                                       |
        +----------------------------------------------------+
        | /home/jy/work2/mysql-5.6-memcached-labs/lib/plugin |
        +----------------------------------------------------+

After you download and unzip mysql-5.6.4-labs-innodb-memcached-linux2.6-i686.tar.gz, you will find two plugin shared libraries in the lib/plugin directory:

1) innodb_engine.so

2) libmemcached.so

Copy them to your plugin directory:

       shell > cp mysql-5.6.4-labs-innodb-memcached-linux2.6-i686/lib/plugin/innodb_engine.so /home/jy/work2/mysql-5.6-memcached-labs/lib/plugin

       shell > cp mysql-5.6.4-labs-innodb-memcached-linux2.6-i686/lib/plugin/libmemcached.so /home/jy/work2/mysql-5.6-memcached-labs/lib/plugin

  • Install the configuration tables:

In the scripts/ directory, find the configuration script, innodb_memcached_config.sql, and install it:

      mysql < scripts/innodb_memcached_config.sql

  • Install the Daemon Plugin

Now install the daemon plugin:

       mysql> install plugin daemon_memcached soname 'libmemcached.so';

      Query OK, 0 rows affected (0.05 sec)

Enable InnoDB Memcached binlog with "innodb_direct_access_enable_binlog":

  • A new system configuration option, "innodb_direct_access_enable_binlog", is added to enable the InnoDB Memcached binlog. It is a server boot-time option, so you must turn it on when the server starts. In addition, you must enable the MySQL binlog with the --log-bin option. You can add these options to your server configuration file (my.cnf) or on the command line.

mysqld ... --log-bin --innodb_direct_access_enable_binlog=1

  • Use mysqldump to create a master data snapshot, and sync it to the slave.

master shell > mysqldump --all-databases --lock-all-tables > dbdump.db

       slave shell > mysql < dbdump.db

  • On the master, call "show master status" to obtain the master binary log coordinates:

       mysql> show master status;
       +----------+----------+--------------+------------------+
       | File     | Position | Binlog_Do_DB | Binlog_Ignore_DB |
       +----------+----------+--------------+------------------+
       | 0.000001 |      114 |              |                  |
       +----------+----------+--------------+------------------+
       1 row in set (0.00 sec)

  • On the slave, use "change master to" to set up the slave with the above coordinates:

         mysql> CHANGE MASTER TO
                 MASTER_HOST='localhost',
                 MASTER_USER='root',
                 MASTER_PASSWORD='',
                 MASTER_PORT=13000,
                 MASTER_LOG_FILE='0.000001',
                 MASTER_LOG_POS=114;

  • Then start the slave:

mysql> start slave;


If the error log prints the following, then the slave is ready for replication:

111002 18:47:15 [Note] Slave I/O thread: connected to master 'root@localhost:13000', replication started in log '0.000001' at position 114

Test with memcached telnet interfaces:

Now let's test the server with the above replication setup. We use the memcached telnet interface for demonstration purposes, and we also connect to the master and slave servers through the SQL interface to query the target table and verify the results.

In our configuration setup SQL, one example table, "demo_test", is created in the "test" database to receive memcached commands by default. We will use this default table for the demonstrations:

  • Use set to insert a record with key "test1", value "t1", and flag "10":

       telnet 127.0.0.1 11211

       Trying 127.0.0.1…

       Connected to 127.0.0.1.

       Escape character is '^]'.

       set test1 10 0 2

       t1

       STORED

  In the master server, you can see the row is inserted (c1 maps to the key, c2 to the value, c3 is the flag, c4 is the cas, and c5 is the expiration):

       mysql> select * from demo_test;
       +-------+------+------+------+------+
       | c1    | c2   | c3   | c4   | c5   |
       +-------+------+------+------+------+
       | test1 | t1   |   10 |    2 |    0 |
       +-------+------+------+------+------+
       1 row in set (0.00 sec)

         In the slave server, you will see the same record is inserted by replication:

       mysql> select * from demo_test;
       +-------+------+------+------+------+
       | c1    | c2   | c3   | c4   | c5   |
       +-------+------+------+------+------+
       | test1 | t1   |   10 |    2 |    0 |
       +-------+------+------+------+------+
       1 row in set (0.00 sec)

  • Use set command to update the key “test1” to a new value “new”:

Connected to 127.0.0.1.

         Escape character is '^]'.

         set test1 10 0 3

         new

         STORED

From the slave server, the update is replicated (notice the cas value also updated):

       mysql> select * from demo_test;
       +-------+------+------+------+------+
       | c1    | c2   | c3   | c4   | c5   |
       +-------+------+------+------+------+
       | test1 | new  |   10 |    3 |    0 |
       +-------+------+------+------+------+
       1 row in set (0.00 sec)

  • Delete the record with delete command:

Connected to 127.0.0.1.

         Escape character is '^]'.

         delete test1

         DELETED

The delete is replicated to the slave, and the record on the slave is also deleted:

  mysql> select * from demo_test;

         Empty set (0.00 sec)

  • Truncate the table with the "flush_all" command

First insert two records:

Connected to 127.0.0.1.

         Escape character is '^]'.

         set test2 10 0 5

         again

         STORED

         set test3 10 0 6

again1

         STORED

In the slave server, the two records are replicated:

       mysql> select * from demo_test;
       +-------+--------+------+------+------+
       | c1    | c2     | c3   | c4   | c5   |
       +-------+--------+------+------+------+
       | test2 | again  |   10 |    5 |    0 |
       | test3 | again1 |   10 |    6 |    0 |
       +-------+--------+------+------+------+
       2 rows in set (0.00 sec)

Call "flush_all" in the telnet interface to truncate the table:

         Connected to 127.0.0.1.

         Escape character is '^]'.

         flush_all

         OK

Then check the slave server; the table truncation has been replicated:

   mysql> select * from demo_test;

         Empty set (0.00 sec)

In short, all memcached commands are supported for replication, so we will not enumerate each of them here.

Notes for the InnoDB Memcached Binlog:

Binlog Format:

  • Most Memcached operations map to DMLs (insert/delete/update), and since no actual "statement" comes through memcached, all memcached commands (except for flush_all) use row-based (RBR) logging. This is independent of any server binlog_format setting.
  • The Memcached "flush_all" command is mapped to the "truncate table" command. Since DDL-type commands can only use statement-based logging, the "flush_all" command is replicated with a hand-crafted "truncate table" statement.

Transactions:

  • The transaction concept is not well defined for Memcached. We use "daemon_memcached_r_batch_size" and "daemon_memcached_w_batch_size" to control the read and write "transaction" batch sizes for performance reasons. For replication, however, each command is replicated right after it completes successfully.
  • In this release, we changed the default value of "daemon_memcached_w_batch_size" from 32 to 1, so each Memcached "write" operation is committed directly (it used to commit every 32 operations). This avoids the confusion of records showing up in the slave server while they cannot be seen in the master server without "set transaction isolation level read uncommitted". The side effect of this reduced batch size is some degradation of performance.
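As a sketch, the write batch size can also be set explicitly in the server configuration (1 is the new default described above):

```ini
[mysqld]
daemon_memcached_w_batch_size=1
```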

Other minor updates:

  • The memcached used has been upgraded to the latest 1.6.0 (beta 1), and the libevent used by memcached has been updated to version 1.4.12.
  • InnoDB memcached is now supported on the Linux, Solaris and OS X platforms.

Summary:

In this release, we added binlog support for the InnoDB Memcached Daemon Plugin. This is an important feature that allows any updates made through the memcached interface to be replicated for backup and high availability. More importantly, it demonstrates the capability of the InnoDB APIs, which not only support direct access to InnoDB storage, but can also log and replicate any such operations.

Wednesday Jul 27, 2011

InnoDB Full-Text Search Tutorial

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by John Russell.

The InnoDB full-text search capability is an exciting feature. The full-text search itself is generally useful to have in an RDBMS. If an application is using all InnoDB tables except for one that is used for full-text searches, now that last table can be switched to InnoDB. If putting the full-text data in a MyISAM table led to scalability problems, duplication, or a less-than-ideal schema design, now those issues can be addressed.

In this post, I’ll take you through some of the basics of setting up and querying an InnoDB FULLTEXT search index. I’ll leave the scalability and performance aspects to Jimmy’s and Vinay’s blog posts, and just use some toy-sized data for demonstration purposes.

Creating a Table with a Full-Text Search Index

The key component of this feature is an index of type FULLTEXT, applied to one or more columns of an InnoDB table.

In Jimmy’s post, he mentions some scalability considerations where you might create the table (including a special FTS_DOC_ID column), load the data, then create the FULLTEXT index afterward. For simplicity (and since the data volume is so small), I’ll create the table with the index in place, then load the data afterward.

use test;
-- We will do some commits and rollbacks to demonstrate transactional features.
-- So turn off the default setting that commits immediately after each statement.
set autocommit=0;

drop table if exists quotes;
-- The InnoDB full-text search feature in the 5.6 Labs release
-- lets us define the FULLTEXT index on an InnoDB table.
-- The ENGINE=INNODB clause is just to illustrate the InnoDB aspect.
-- In 5.5 and above, new tables use InnoDB by default.

create table quotes
  (    id int unsigned auto_increment primary key
    , author varchar(64)    , quote varchar(4000)
    , source varchar(64)
    , fulltext(quote)
  ) engine=innodb;

-- Get some words and phrases to search for into the table.
insert into quotes (author, quote, source) values
  ('Abraham Lincoln', 'Fourscore and seven years ago...',
  'Gettysburg Address')
, ('George Harrison', 'All those years ago...',
  'Live In Japan')
, ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.',
  '2010: The Year We Make Contact')
, ('Benjamin Franklin',
  'Early to bed and early to rise, makes a man healthy, wealthy, and wise.',
  'Poor Richard''s Almanack')
, ('James Thurber',
  'Early to rise and early to bed makes a male healthy and wealthy and dead.',
  'The New Yorker')
, ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.',
  'Men in Black')
;

-- Since this is an InnoDB table, we are mindful of transactions.
commit;

Word and Phrase Search – Natural Language Mode

Once the data is loaded and committed, you can run queries using the MATCH(columns) AGAINST (search expression) operator to do the actual searches. You can combine this operator with all the usual WHERE and similar clauses in the SELECT statement.

The simplest kind of search is to find a single word, or a phrase with all words in exact order. For this type of search, use the IN NATURAL LANGUAGE MODE clause inside the AGAINST() call. This technique typically involves a user-entered string that you pass verbatim to the query (after escaping any quotation marks or other special characters to prevent SQL injection attacks).

-- Search for a single word.
select author as "Monolith" from quotes
  where match(quote) against ('monolith' in natural language mode);
+------------------+
| Monolith         |
+------------------+
| Arthur C. Clarke |
+------------------+
1 row in set (0.01 sec)

-- Unlike with MyISAM full-text search, the default minimum
-- length is 3 rather than 4, and the search returns words that
-- appear in a high proportion of the table rows.
select author as "Ago" from quotes
  where match(quote) against ('ago' in natural language mode);
+------------------+
| Ago              |
+------------------+
| Abraham Lincoln  |
| George Harrison  |
| Arthur C. Clarke |
| K                |
+------------------+
4 rows in set (0.00 sec)

AND / OR / NOT Operators – Boolean Mode

For more complicated searches, you can have multiple words and phrases and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.

-- Search for a combination of words, not in the same order as the original.
select author as "Ago and Years" from quotes
  where match(quote) against ('+ago +years' in boolean mode);
+------------------+
| Ago and Years    |
+------------------+
| Abraham Lincoln  |
| George Harrison  |
| Arthur C. Clarke |
| K                |
+------------------+
4 rows in set (0.00 sec)

-- Search for other Boolean combinations of words.
select author as "Fourscore or Monolith" from quotes
  where match(quote) against ('fourscore monolith' in boolean mode);
+-----------------------+
| Fourscore or Monolith |
+-----------------------+
| Abraham Lincoln       |
| Arthur C. Clarke      |
+-----------------------+
2 rows in set (0.00 sec)

select author as "Years and not Monolith" from quotes
  where match(quote) against ('+years -monolith' in boolean mode);
+------------------------+
| Years and not Monolith |
+------------------------+
| Abraham Lincoln        |
| George Harrison        |
| K                      |
+------------------------+
3 rows in set (0.00 sec)

Proximity Search

Proximity search is a new feature of InnoDB full-text search. It is a special case of Boolean search using the @ operator within the AGAINST() string. You supply 2 or more words, double-quoted, within the single-quoted AGAINST() string, followed by @distance to specify how far apart these words can be. The distance represents the maximum number of bytes (which might not equal the number of characters) between the starting points of all these words.

-- The starting points for these words are too far apart
-- (not within 20 bytes), so no results.
select quote as "Too Far Apart" from quotes
  where match(quote) against ('"early wise" @20' in boolean mode);
Empty set (0.00 sec)

-- But the starting points of all words are within 100 bytes,
-- so this query does give results.
select quote as "Early...Wise" from quotes
  where match(quote) against ('"early wise" @100' in boolean mode);
+-------------------------------------------------------------------------+
| Early...Wise                                                            |
+-------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)

-- In this case, the smallest distance that produces results is 49.
select quote as "Early...Wise" from quotes
  where match(quote) against ('"early wise" @49' in boolean mode);
+-------------------------------------------------------------------------+
| Early...Wise                                                            |
+-------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)

-- Here is an example showing 2 results, with the words close to each other.
select quote as "Early...Bed" from quotes
  where match(quote) against ('"early bed" @20' in boolean mode);
+---------------------------------------------------------------------------+
| Early...Bed                                                               |
+---------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise.   |
| Early to rise and early to bed makes a male healthy and wealthy and dead. |
+---------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Relevance Ranking

The relevance ranking is fairly basic, derived from word frequencies within each document and the search data overall. Typically, you would only ORDER BY this value for very simplistic searches of small documents; for any important search you would layer your own ranking logic on top, perhaps with the MySQL relevance value as one factor in the overall rank.

-- Get the relevance of a single word.
select substr(quote,1,20) as "And",
  match(quote) against ('and' in natural language mode) as Relevance
  from quotes order by Relevance desc;
+----------------------+--------------------+
| And                  | Relevance          |
+----------------------+--------------------+
| Early to rise and ea | 0.2718571722507477 |
| Early to bed and ear | 0.1812381148338318 |
| Fourscore and seven  | 0.0906190574169159 |
| All those years ago. |                  0 |
| Then 10 years ago th |                  0 |
| 1500 hundred years a |                  0 |
+----------------------+--------------------+

6 rows in set (0.00 sec)

Transactions

The key idea behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions, so that you can include full-text columns alongside other columns in tables in ways that make sense in terms of schema design, and multiple sessions can update the full-text column data (and/or other columns in the table) simultaneously. The full-text data doesn’t have to be treated as read-only or read-mostly.

As mentioned in Jimmy’s blog post, the table structures that manipulate the full-text data behind the scenes are only updated at COMMIT time. So make sure to insert or update full-text data in one transaction, commit, and then run any full-text queries in a subsequent transaction. (Actually, in the examples below, it looks like the data is taken out of the full-text results as soon as a DELETE is issued, then comes back if the deletion is rolled back. I think that is explained in Jimmy’s blog post by the discussion about the delete-marking optimization to avoid huge updates to the full-text index for deleted data.)

drop table if exists quotes_uncommitted;

create table quotes_uncommitted
  (
      author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , fulltext(quote)
    , primary key (author, quote(128))
  );

-- We insert but don't immediately commit.
insert into quotes_uncommitted select author, quote, source from quotes;
-- Within the same transaction, a full-text search does not see the uncommitted data.
select count(author), author as "Uncommitted Results" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+---------------------+
| count(author) | Uncommitted Results |
+---------------+---------------------+
|             0 | NULL                |
+---------------+---------------------+
1 row in set (0.00 sec)

-- If the newly inserted rows are rolled back...
rollback;
-- ...then the full-text search still doesn't see them.
select count(author), author as "Rolled-Back Results" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+---------------------+
| count(author) | Rolled-Back Results |
+---------------+---------------------+
|             0 | NULL                |
+---------------+---------------------+
1 row in set (0.00 sec)

-- OK, let's start with some committed data in the table, then empty the table,
-- and then try some FTS queries
-- both before and after the commit.
insert into quotes_uncommitted select author, quote, source from quotes;
commit;
delete from quotes_uncommitted;
select count(author), author as "Deleted but still not committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+---------------------------------+
| count(author) | Deleted but still not committed |
+---------------+---------------------------------+
|             0 | NULL                            |
+---------------+---------------------------------+
1 row in set (0.00 sec)

rollback;
select count(author), author as "Deleted and rolled back" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+-------------------------+
| count(author) | Deleted and rolled back |
+---------------+-------------------------+
|             4 | Abraham Lincoln         |
+---------------+-------------------------+
1 row in set (0.00 sec)

delete from quotes_uncommitted;
commit;
select count(author), author as "Deleted and committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+-----------------------+
| count(author) | Deleted and committed |
+---------------+-----------------------+
|             0 | NULL                  |
+---------------+-----------------------+
1 row in set (0.00 sec)

insert into quotes_uncommitted select author, quote, source from quotes;
commit;
truncate table quotes_uncommitted;
select count(author), author as "Truncated" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+-----------+
| count(author) | Truncated |
+---------------+-----------+
|             0 | NULL      |
+---------------+-----------+
1 row in set (0.00 sec)

Multi-Column Searches

Although you can only have one FULLTEXT index in an InnoDB table, that index can apply to multiple columns, allowing you to search when you aren’t sure which column contains the term. With a multi-column index, we can MATCH() against all the columns to find words that appear in any of those columns. Always reference all the same columns in the MATCH() clause as in the FULLTEXT index definition, because the information about which column the words appear in is not included in the full-text search data.

drop table if exists quotes_multi_col;

create table quotes_multi_col
  (
    id int unsigned auto_increment primary key
    , author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , fulltext(author, quote, source)
  );

insert into quotes_multi_col select * from quotes;
commit;

select author as "Poor 1 (NL)", substr(quote,1,15) as "Poor 2 (NL)", source as "Poor 3 (NL)" from
  quotes_multi_col where match(author, quote, source)
  against ('poor' in natural language mode);
+-------------------+-----------------+-------------------------+
| Poor 1 (NL)       | Poor 2 (NL)     | Poor 3 (NL)             |
+-------------------+-----------------+-------------------------+
| Benjamin Franklin | Early to bed an | Poor Richard's Almanack |
+-------------------+-----------------+-------------------------+
1 row in set (0.00 sec)

select author as "Poor 1 (BOOL)", substr(quote,1,15) as "Poor 2 (BOOL)", source as "Poor 3 (BOOL)"
  from quotes_multi_col where match(author, quote, source)
  against ('poor' in boolean mode);
+-------------------+-----------------+-------------------------+
| Poor 1 (BOOL)     | Poor 2 (BOOL)   | Poor 3 (BOOL)           |
+-------------------+-----------------+-------------------------+
| Benjamin Franklin | Early to bed an | Poor Richard's Almanack |
+-------------------+-----------------+-------------------------+
1 row in set (0.00 sec)

select author as "Clarke 1 (NL)", substr(quote,1,15) as "Clarke 2 (NL)", source as "Clarke 3 (NL)"
  from quotes_multi_col where match(author, quote, source)
  against ('clarke' in natural language mode);
+------------------+-----------------+--------------------------------+
| Clarke 1 (NL)    | Clarke 2 (NL)   | Clarke 3 (NL)                  |
+------------------+-----------------+--------------------------------+
| Arthur C. Clarke | Then 10 years a | 2010: The Year We Make Contact |
+------------------+-----------------+--------------------------------+
1 row in set (0.00 sec)

Interaction with Other Indexes

Remember that the design of your primary key index and secondary indexes is a big factor in query performance for InnoDB tables.

  • You can include parts (prefixes) of the full-text column(s) within the primary key.
  • However, that might not be a good idea if (a) the associated columns will ever be updated — which causes an expensive reorganization within the InnoDB table, or (b) if the table will have any other secondary indexes — the primary key values for a row are duplicated in the entry for that row in every secondary index, making index operations require more I/O and memory.
  • As mentioned in Jimmy’s blog post, adding the FULLTEXT index to the table is going to create a new column and associated index in the original table, so you could set up the column and index ahead of time, to avoid table reorganization later.
  • You can use the unique constraint of the primary key or a UNIQUE index to prevent duplicate values or combinations of values from being entered.
  • You can use the not-null constraint of the primary key to prevent missing (NULL) values or combinations of values from being entered.
  • For the Labs release, the InnoDB FULLTEXT processing isn’t integrated with the MySQL optimizer and its estimates for which index is best to use, so don’t draw conclusions about performance characteristics from this early preview.
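
As a sketch of the FTS_DOC_ID point above (the table and index names here are illustrative, not from the original posts), declaring the column and using it as the primary key up front lets a later FULLTEXT index build avoid a table rebuild:

```sql
-- Hypothetical schema: FTS_DOC_ID is the name and type InnoDB FTS requires,
-- declared ahead of time so the FULLTEXT index can be added without
-- rebuilding the clustered index.
create table quotes_prepared
  (
    FTS_DOC_ID bigint unsigned auto_increment not null
    , author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , primary key (FTS_DOC_ID)
  ) engine = innodb;

-- load the data, then add the FULLTEXT index:
create fulltext index ft_quotes on quotes_prepared (author, quote, source);
```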

Stopwords

Stopwords are typically short, commonly used words that you designate as not significant for a search. They are left out of the FULLTEXT index and ignored when entered in FULLTEXT queries. For example, a search for ‘the’ is unsuccessful because it’s in the default stopword list. For your own customized search, you might create a bigger list (say, with common words from several languages) or a smaller one (for example, a music or movie site where words such as “The” in names and titles are significant). The details about customizing the stopword list are in Jimmy’s blog post.
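For instance, assuming the default stopword list and the quotes_multi_col table from above, a search for a stopword comes back empty:

```sql
select author from quotes_multi_col
  where match(author, quote, source)
  against ('the' in natural language mode);
-- empty result: 'the' is left out of the index and ignored in the query
```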

Monday Jul 25, 2011

Improve InnoDB thread scheduling

Introduction

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Sunny Bains.

InnoDB has had the thread concurrency management code for some years now. Most will be familiar with the three configuration variables associated with this feature:

  • innodb_thread_concurrency
  • innodb_thread_sleep_delay
  • innodb_concurrency_tickets

The problem with the existing code is that the queueing overhead becomes too high and negatively impacts performance, especially as the number of user threads goes up. The queueing code uses the os_event_t library to manage the queuing and dequeuing of user threads in an explicit wait queue. This wait queue is implemented using strict FIFO; the FIFO constraint ensures that there is no thread starvation.

To overcome this overhead, one experimental feature that we are trying is busy polling for free slots to enter InnoDB, using sleep delays. Unlike the existing concurrency management scheme, the new scheme is not starvation free. Before people start complaining that event driven is better than polling, rest assured I know the theory. If there is a way to reduce the overhead and still keep it event driven, I'm very interested to see your code and experimental data :-) . I've experimented with futexes too, but no joy there either. It is reasonable to assume that I could have overlooked some better technique, so feedback is important.

In theory, if the sleep delay could be tuned to exactly the amount of time required for a slot to become free, there would be perfect scheduling. As we know, this is never going to be the case. However, we can try to get a good enough approximation that varies with the load. This reduces the optimal (peak) TPS because of the overhead of sleeping while there is potentially an empty slot. However, for applications that have lots of threads, say more than 256, it will be able to maintain a higher TPS at the higher thread counts because it doesn't suffer from the queuing bottleneck. The results have been very encouraging in our internal experiments on high-end hardware, hosts with >= 16 cores. There are some issues that need ironing out, in particular with Sysbench OLTP read-only tests on 8-core hardware. This feature uses atomics to manage the thread concurrency slots.

New configuration variables
  1. innodb_adaptive_sleep_delay - A boolean variable used to enable/disable the adaptive sleep delay feature.
  2. innodb_adaptive_max_sleep_delay - Sets the upper bound, in microseconds, on the sleep delay when adaptive sleep delay is enabled.
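
As a concrete sketch, a [mysqld] fragment enabling the feature might look like the following; the values are illustrative, not recommendations:

```ini
[mysqld]
# enable the experimental adaptive sleep delay (variable names as in this post)
innodb_thread_concurrency = 16           # e.g. the number of cores
innodb_adaptive_sleep_delay = 1
innodb_adaptive_max_sleep_delay = 150000 # cap on the sleep, in microseconds
```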
How adaptive sleep delay works

It is a very simple algorithm, in fact deliberately too simple. I tried various feedback mechanisms, but none matched the RW results of this simple algorithm. At one stage I even toyed with the idea of going down the subsumption architecture path :-) .
The basic idea is that we reduce the sleep delay as fast as we can and increase it as slowly as we can; the reasoning is that this makes it adapt more quickly to a dynamic load. The code is rather simple and is all contained in the function srv_conc_enter_innodb_with_atomics(), see srv0conc.c.
  • If a thread has waited for more than one iteration, increment the sleep delay by one.
  • If a free slot is found and no waiting threads are sleeping, halve the sleep delay.
This works surprisingly well on hardware with >= 16 cores, with innodb_thread_concurrency set to the number of cores. For read-write loads it works well on 8-core hardware too, but it has issues with Sysbench read-only tests on 8-core hardware. This last issue is under investigation.

Below you can find charts for results obtained on two boxes, with 8 and 24 cores. These results highlight the difference between two cases: innodb_thread_concurrency=0, and innodb_thread_concurrency=<N_CPU> together with innodb_adaptive_sleep_delay=1.

box1: 8 cores (2.33GHz) / 16GB / RAID10
box2: 24 cores (2.66GHz) / 32GB / RAID10
MySQL 5.6.3 lab release
sysbench-0.4 (https://launchpad.net/sysbench/0.4)
(You can find settings/test run details at the end of the post.)


Conclusion

To use this new feature, simply set innodb_thread_concurrency to the available number of cores, enable innodb_adaptive_sleep_delay, and run your favourite benchmark software against it. As mentioned earlier, there is an issue with RO tests on 8-core hardware; this issue is being looked at. If there is an easy (and fast) way to make it starvation free, that will be added. However, it performs quite well in RW tests. Your feedback is important to fine-tune this feature, so please do share it with us.

MySQL/sysbench settings
[mysqld]
innodb_status_file=0
innodb_data_file_path=ibdata1:100M:autoextend
innodb_buffer_pool_size=4G
innodb_log_file_size=100M
innodb_log_files_in_group=2
innodb_log_buffer_size=64M
innodb_flush_log_at_trx_commit=2
innodb_thread_concurrency=0
innodb_flush_method=O_DIRECT
user=root
port=3306
max_connections=2000
table_cache=2048
max_heap_table_size=64M
sort_buffer_size=64K
join_buffer_size=1M
tmp_table_size=64M
thread_cache=16
query_cache_type=0
query_cache_size=0
max_prepared_stmt_count=100000

cmd line for prepare:
sysbench --test=oltp --oltp-table-size=1000000 --oltp-dist-type=uniform \
--oltp-table-name=sbtest --mysql-user=root --mysql-db=sbtest \
--mysql-host=127.0.0.1 --mysql-port=3306 --mysql-table-engine=INNODB prepare

cmd line for warmup:
mysql -uroot -h127.0.0.1 -P3306 -e 'check table sbtest'

cmd line for run:
sysbench --num-threads=<16..1536> --test=oltp --oltp-table-size=1000000 \
--oltp-dist-type=uniform --oltp-table-name=sbtest --report-interval=1 --forced-shutdown=1 \
--max-requests=0 --max-time=120 --mysql-host=127.0.0.1 \
--mysql-user=root --mysql-port=3306 --mysql-db=sbtest \
--mysql-table-engine=INNODB <test options (see below)> run

parameters for OLTP_RO test:
--oltp-read-only=on
parameters for OLTP_RW test:
--oltp-read-only=off
parameters for UPDATE_KEY test:
--oltp-test-mode=nontrx --oltp-nontrx-mode=update_key run
parameters for POINT_SELECT test:
--oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0 \
--oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on --oltp-read-only=on
parameters for SELECT_SIMPLE_RANGES test:
--oltp-point-selects=0 --oltp-simple-ranges=1 --oltp-sum-ranges=0 \
--oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on --oltp-read-only=on
parameters for SELECT_SUM_RANGES test:
--oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=1 \
--oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on --oltp-read-only=on

This post was co-authored with Alexey Stroganov (a.k.a Ranger).

Allow UNDO logs to reside in their own tablespace

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Sunny Bains.

Introduction

The InnoDB UNDO entries reside in a special system table called the UNDO log. This log is made up of several segments, called rollback segments. A segment in InnoDB is similar to what a file would be in a file system; e.g., user tables and indexes are also stored as separate segments within the same tablespace, only their format is different. In that sense there is nothing special about InnoDB UNDO logs. This feature allows the UNDO log to be stored across several tablespaces.

Purpose

UNDO logs contain the before image of modified records. There are two types of UNDO records, one for inserts and another for updates. The insert UNDO records can be discarded on transaction rollback. The update records are used for rollback, for MVCC, and by purge. It is because of purge that we can't simply remove UNDO log records as soon as they are no longer referenced by any running transaction's snapshot view (for MVCC). When a transaction is started, it is assigned a rollback segment in a round-robin scheme. Multiple transactions can be assigned the same rollback segment to write their changes; up to a maximum of 1024 transactions can use the same rollback segment. The more rollback segments you have, the less contention there is around the rollback segment mutex.

Purge

The purge thread(s) run in the background, reading UNDO log records from oldest to newest, up to but not including the oldest active snapshot view in the system. They parse the UNDO log records, and for entries that represent delete operations they use the stored index key to search for the records in both the secondary and primary indexes and purge the entries, modifying the index tree structure if required. Normal DML operations simply delete-mark the records; they don't physically purge them, unless an insert is being rolled back. This avoids expensive tree-modifying operations in DML code. Once purge is finished with the UNDO entries, it truncates the UNDO log up to the point where it has processed them. For MVCC, user transactions follow the DATA_ROLL_PTR pointer into the UNDO log to build a previous version of the row.

Why did we make this change

All this activity results in lots of random IO. To reduce that random IO, this feature allows the user to place the UNDO logs anywhere in the file system. This flexibility can be used to place them on an SSD, for example.

New configuration variables

  1. innodb_undo_dir – The directory where all the configured UNDO tablespaces will be created. It can be an absolute path. Default is ".".
  2. innodb_undo_tablespaces – The number of UNDO tablespaces that you want created. Note: this must be set at database creation time; once created, it cannot be changed.
  3. innodb_undo_logs (renamed from innodb_rollback_segments) – The minimum number of rollback segments that you want. More can be added later, but they cannot be removed. However, setting it to a smaller value at runtime ensures that only that number will be used when assigning rollback segments to transactions.

There are a few gotchas here that you should be aware of. innodb_undo_tablespaces can be set only at database creation time, and the value must remain the same for the lifetime of the database. More UNDO tablespaces cannot be added and existing tablespaces cannot be dropped. We may add better UNDO tablespace management in the future.
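As a sketch, a configuration using these variables might look like the following; the directory path and the counts are purely illustrative:

```ini
[mysqld]
# place the UNDO tablespaces on a hypothetical SSD-backed directory
innodb_undo_dir = /ssd/undo
innodb_undo_tablespaces = 4   # fixed at database creation time
innodb_undo_logs = 128
```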

Case 1. 128 rollback segments spread across 128 tablespaces.

  • innodb_undo_logs = 128
  • innodb_undo_tablespaces = 127 (we always create one log in tablespace 0, this is to avoid breaking the system tablespace layout)

Case 2. Start with one rollback segment, but possibly create more later and spread them across, say, 4 UNDO tablespaces.

  • innodb_undo_tablespaces = 4
  • innodb_undo_logs = 1

This will create the UNDO tablespaces, but they will be empty; the single rollback segment created in tablespace 0 will be used. Later, if you want to add more rollback segments, you will need to shut down the server and set the innodb_undo_logs variable to the number that you want. Let's say we add two more segments. We would restart the server with the following parameters:

  • innodb_undo_tablespaces = 4
  • innodb_undo_logs = 3 – This will create two new rollback segments, in UNDO tablespace 1 and UNDO tablespace 2. It will also disable the use of the rollback segment in tablespace 0.

If there were UNDO records in rollback segment 1, they will be processed as normal. The only change is that no new ones will be written to rollback segment 1; it will be skipped when assigning rollback segments to new transactions.

Future work

Mapping of tablespace id to tablespace name is something that we would like to add, so that users can change the path of individual tablespaces instead of all of them being lumped together under innodb_undo_dir. Managing these extra tablespaces is also something we think is important, especially adding and dropping UNDO tablespaces, along with better management of the UNDO segments.

Conclusion

As we have seen, writing to the UNDO logs and accessing them can result in a lot of random IO. One way to reduce that overhead is to place the UNDO tablespaces on SSD. This feature allows the placement of the UNDO logs anywhere on the file system, in a directory controlled by the configuration variable innodb_undo_dir. Note: this change is not backward compatible; databases created with this feature cannot be opened by older versions of InnoDB. The main problems are that the old code doesn't open these new system tablespaces before applying REDO, and that there are some checks that assume the space id can only be 0. The file format as such is not the problem.

Shortened warm-up times with a preloaded InnoDB buffer pool

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Vasil Dimov.

Are you running an InnoDB installation with buffer pool(s) many gigabytes in size? Does it take too long before it gets back up to speed after a restart? If yes, then the following will be interesting to you.

In the latest MySQL 5.6 Labs release we have implemented an InnoDB buffer pool(s) dump and load to solve this problem.

The contents of the InnoDB buffer pool(s) can be saved to disk before MySQL is shut down and then read back in after a restart, so that the warm-up time is drastically shortened: the buffer pool(s) return to the state they were in before the server restart. The time needed for that is roughly the time needed to read from disk an amount of data about the size of the buffer pool(s).

Let's dive straight into the commands to perform the various dump/load operations:

The buffer pool(s) dump can be done at any time when MySQL is running by doing:

  mysql> SET innodb_buffer_pool_dump_now=ON;

This operation is very quick and creates a file named ib_buffer_pool in the InnoDB data directory, by analogy with the default InnoDB data file ibdata1. The file contains only the space and page IDs and is about 500k per gigabyte of buffer pool. The same operation can also be done automatically during MySQL shutdown:

  mysql> SET innodb_buffer_pool_dump_at_shutdown=ON;

By analogy the buffer pool(s) can be loaded manually by:

  mysql> SET innodb_buffer_pool_load_now=ON;

and automatically during startup from the file ib_buffer_pool by setting

  innodb_buffer_pool_load_at_startup=ON

in my.cnf or my.ini.

The filename where the buffer pool(s) are dumped and later loaded from can be specified by changing the value of the global configuration variable innodb_buffer_pool_filename which defaults to ib_buffer_pool.
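
Putting the automatic settings together, a my.cnf sketch might look like this (the filename shown is just the default):

```ini
[mysqld]
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON
innodb_buffer_pool_filename = ib_buffer_pool
```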

Additionally the progress of the buffer pool(s) dump (although very quick) and load (not so quick) can be monitored by:

  mysql> SHOW STATUS LIKE 'innodb_buffer_pool_dump_status';
  mysql> SHOW STATUS LIKE 'innodb_buffer_pool_load_status';

And finally the load operation can be interrupted by:

  mysql> SET innodb_buffer_pool_load_abort=ON;

Before loading, the list is sorted so that pages are read in as close to sequential order as possible. The load is asynchronous, so it should not interfere too much with normal operations. The load is clever enough to handle different buffer pool sizes or pages that are missing during the load. A page will not be loaded if it is already in the buffer pool, so there is no need to worry about problems with changed pages.

The dump file is plain text so it is possible to edit it, though we do not recommend this.

Enjoy!

Overview and Getting Started with InnoDB FTS

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Jimmy Yang.

This Summer's lab release includes an important InnoDB feature – InnoDB Fulltext Search. This feature greatly enhances InnoDB's capability in text search and retrieval. Since the feature is designed for our transactional storage engine, its underlying architecture and implementation are completely different from those of MyISAM. So it is worth giving a brief technology review of this feature and familiarizing users with some important concepts in InnoDB FTS, so that they can better utilize it.

There are a few other posts on the subject. John Russell will give a brief tutorial on the InnoDB fulltext search commands and syntax. I will also discuss some comparisons with MyISAM fulltext search in another post. And Vinay from our server testing team will give some performance numbers from his experiments.

The Design:

To begin with, I will briefly go over some key design concepts, which will help you better understand the feature.

  • Inverted Index – Like many DBMS fulltext search engines, InnoDB Fulltext Search (FTS) is designed as an "inverted index", in which the incoming text is tokenized into individual words and these words are stored in one or more auxiliary tables. For each word, a list of Document ID and word-position pairs is stored; we call this Document ID/position pair list the "ilist". So in our inverted index table, the two important columns are the "word" and "ilist" columns, and there is an index on the word column. As you might note, since we store the word position (as a byte offset) for each word, we also support "Proximity Search", which has been lacking in MyISAM FTS.
  • The FTS "Index Tables" (auxiliary tables) – As mentioned, words in the inverted index are stored in a set of auxiliary tables, which we call FTS "Index Tables". Our fulltext indexes are largely made up of this set of tables, which stores the word-to-document mapping. In our design, the FTS "Index Table" is "partitioned" into multiple tables: instead of a single auxiliary table, there are now six of them, each containing words partitioned according to their first character. The partition criteria are currently hard coded and targeted at the Latin encoding, but we will eventually make the partitioning user definable, so users can define different partition schemes for different charsets.

The partitioned design makes it easier for us to parallelize operations. Currently we only parallelize the create index operation, but we can obviously extend this to queries and fulltext search later on.

  • The FTS "Index Cache" – Although our fulltext index is represented by a set of "Index Auxiliary Tables" on disk, there is also an FTS "Index Cache" that batches results before flushing/syncing them to the "Index Table". The "Index Cache" is actually a red-black tree that stores the word/Doc ID/position ("ilist") information. However, this cache differs from an index tree in that it only caches the words of inserted documents; words on disk do not need to be brought back into this cache before they can be queried. In other words, the words in the "Index Table" are queried directly, just as we query relational tables, and the result from the Index Table is merged (unioned) with the result from the Index Cache before being sent back to the user. The batching done by the FTS "Index Cache" saves InnoDB from frequent updates to the "Index Table" under busy inserts and updates, so interaction with the index table happens only at sync time, when the index cache is full. Another benefit of batching is that it minimizes the number of entries for each word in the "Index Table": instead of flushing each word with a single "ilist" entry after tokenizing one document, we can build a longer "ilist" with multiple DocID/position pairs from different documents before flushing it to disk as a single entry. This reduces redundancy and makes the "Index Table" smaller.
  • The FTS "Document ID" – Another important concept that affects usability is Document ID management. As in most FTS engines, the mapping from word to document goes through a unique (Document) ID. In our case it is represented by an "FTS_DOC_ID" column that we automatically add to the user table. The column itself must be of "BIGINT UNSIGNED NOT NULL" type, and a unique index named FTS_DOC_ID_INDEX is created on top of the column. On the other hand, we do not prohibit users from adding such a column themselves beforehand; in that case, the user is responsible for properly managing the Doc ID column so that documents are not inappropriately represented.

Those are some basic concepts to get us to the next step. Now we can start to use fulltext search.

Use the InnoDB Fulltext Index

1. The FTS DDL and Query Syntax

We continue to use the MySQL/MyISAM FTS query syntax, so if you are familiar with MySQL FTS syntax, you can directly get your hands on InnoDB FTS. All types of MyISAM search are supported, except that we have added proximity search to "Boolean Search". John Russell will cover the syntax and usage in his FTS tutorial blog, so I will leave that to him.

Instead, I will talk more about some special points to know when using InnoDB FTS.

2. Create index:

In general, we recommend loading the table with data first, then creating the FTS index on top of it. This is much faster than creating the FTS index first and then inserting documents into it. We have added Fast Index Creation (FIC) parallel sort support for creating the fulltext search index, which allows you to tokenize, sort, and create the FTS index in parallel.

To control the parallel sort degree, use the new "innodb_ft_sort_pll_degree" system configuration variable (default 2, max 32). It specifies how many ways the tokenization and sort operations are parallelized. Experiments show that on a system that is not I/O bound, create index performance scales with the number of CPUs and "innodb_ft_sort_pll_degree".

The following table shows a quick run on 2.7 GB of Wikipedia data on an 8-core machine:

Server                    Time
MyISAM                    11 min 47.90 sec
InnoDB (default)          7 min 25.21 sec
InnoDB pll_degree         5 min 34.98 sec
InnoDB pll_degree = 8     4 min 9 sec
InnoDB pll_degree = 16    3 min 39.51 sec
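
The sort degree is set at server startup; a minimal configuration sketch (the value 8 is illustrative, not a recommendation):

```ini
[mysqld]
innodb_ft_sort_pll_degree = 8
```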

Another trick in creating the index concerns Doc ID handling. If the user does not supply an "FTS_DOC_ID" column, InnoDB adds a new hidden column to the table, which results in a rebuild of the clustered index along with all secondary indexes, and this can be costly. If the user includes "FTS_DOC_ID" in the original table definition, a clustered index (table) rebuild can be avoided. To do so, include a column named "FTS_DOC_ID" (all upper case) in the table. The column must be of "BIGINT UNSIGNED NOT NULL" datatype. It does not need to be an auto_increment column, but that can make life easier.

An example would be:

1) CREATE TABLE fts_test (
       FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
       title VARCHAR(200),
       body TEXT
   ) ENGINE=InnoDB;

The unique "FTS_DOC_ID_INDEX" index on "FTS_DOC_ID" is also optional; without it, InnoDB will create it automatically:

2) CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);

3) Load data

4) CREATE FULLTEXT INDEX idx on fts_test (title, body);

Please note, the column name "FTS_DOC_ID" has to be in upper case, as the column name in InnoDB is case sensitive. In the next release, we'll relax the naming requirement for the "FTS_DOC_ID" column, so that the user can specify any column name for this ID column.

3. DMLs

- FTS Index Updated on Commit:

Like fulltext search indexes in all transactional DBMSs, updating the fulltext search index has always been an issue. A typical example is Oracle Text, which has synchronized its data to the FTS index periodically or manually; starting in 11g, it allows the user to specify when a data update is reflected in its fulltext index – manually, on commit, or at a regular interval. For InnoDB fulltext search, the FTS index is updated at transaction commit time, so document tokenization and insertion into the FTS index (cache) happen only after you commit the transaction. However, let's not confuse this delayed update of the FTS index with data handling: all data manipulations still follow their original rules; only the update of the FTS index is done at transaction commit time.

- How insert is handled

The inserted document is tokenized at commit time and inserted into the "Index Cache". This cache has a configurable size ("innodb_ft_cache_size"), with a default of 32 MB. Once the cache is full, it is synced to the on-disk "Index Tables".
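A sketch of raising the cache size at server startup (64M is an arbitrary example, not a recommendation):

```ini
[mysqld]
innodb_ft_cache_size = 64M
```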

During a normal server shutdown, the content of the "Index Cache" is synced to the "Index Tables" on disk. However, if the server crashes before the content of the "Index Cache" is reflected in the "Index Table", then after the server reboots, the first time you use the FTS index on the table (for search or insert), the "missing" documents are read from the original table, re-tokenized, and added to the "Index Cache".

- How delete is handled

While inserted documents are tokenized and inserted into the FTS "Index Cache" at commit time, we do not delete entries for deleted documents if those entries are already in the FTS index table. Instead, we record the deleted Doc ID in a "DELETED" auxiliary table, and each query also consults this "DELETED" table to filter out deleted documents. With this design, the delete operation is simple and fast, without the need to update thousands of word entries for the deleted document. The downside is that word entries are not removed from the FTS index, so the index can grow; this is resolved by "Index Optimization".

4. Index Optimization:

As just discussed, if there are active DMLs on the fulltext index, the index can become "bloated" over time. We do not delete index entries in the index table at run time; instead, we just log the deleted Doc IDs in a "DELETED" auxiliary table, so over time the index table itself can grow even as you remove documents. To resolve this, we can "optimize" the index. This operation does two things:

1) Remove the deleted Doc ID from the word’s DocID/Position pair list (ilist)

2) Consolidate multiple entries for the same word to one (or less) entries if possible by consolidating their DocID/Position pair list (ilist)

Currently, we overload the "OPTIMIZE TABLE" command with this optimization operation, and it runs only if you turn on the "innodb_optimize_fulltext_only" system configuration variable:

mysql> set global innodb_optimize_fulltext_only=1;

mysql> optimize table articles;

Since the table could be large and optimization could take a tremendous amount of time, we allow the optimization to be done in stages. The system configuration variable "innodb_ft_num_word_optimize" specifies how many words to optimize for each "OPTIMIZE TABLE" command; its default value is 2000, meaning each run optimizes 2000 words. When the next "OPTIMIZE TABLE" command is issued, the server continues from the word it left off at last time.
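
For example, a hypothetical session that optimizes in larger batches per statement:

```sql
-- route OPTIMIZE TABLE to the FTS optimizer only
SET GLOBAL innodb_optimize_fulltext_only = 1;
-- optimize 10000 words per run instead of the default 2000 (illustrative value)
SET GLOBAL innodb_ft_num_word_optimize = 10000;
OPTIMIZE TABLE articles;
```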

In fact, this optimization process could run as an online background process, but that is not enabled in this release. We will continue to evaluate these options to see which way is more appropriate and has the minimum user impact.

5. Stopword Handling:

For FTS stopwords, we provide two sources of stopwords:

1) Default stopwords provided by the server – this is a static stopword list that comes with the server. If no user stopword list is defined, this default list is used.

You can view this list by querying the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

2) User-defined stopwords – Users can define their own stopwords by creating a table with a single column named "value" of the "varchar" datatype, and using the global variable "innodb_ft_server_stopword_table" to point to this stopword list. The server then loads stopwords from this user table, rather than from the default stopword list, when the FTS index is created. An example:

# Define a correctly formatted user stopword table
create table user_stopword(value varchar(30)) engine = innodb;

# Point to this stopword table with "db name/table name"
set global innodb_ft_server_stopword_table = 'test/user_stopword';

6. Ranking:

In this initial release, we use a very simple ranking mechanism (term frequency / inverse document frequency) for ranking document results: the more often a word appears in a document, and the less frequently it appears across all documents, the higher the document is ranked. We plan to add more sophisticated ranking mechanisms and to let users supply weights for the ranking calculation.

7. Some Restrictions:

Lastly, a couple of restrictions in our initial release:

  1. InnoDB fulltext search currently supports only one FTS index per table.
  2. As with MyISAM, although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation.
  3. Ideographic languages such as CJK (Chinese, Japanese and Korean) do not have word delimiters; as with MyISAM, we do not yet support N-GRAM parsing for them, but this is on our feature agenda.

Summary:

In summary, InnoDB fulltext search gives InnoDB an important capability in text handling, and this newly added platform is already showing its scalability in indexing and its flexibility in DML handling. We are also considering bringing parallelism to query and DML operations to further enhance performance. That said, it is still at an early stage, and we will continue the development effort: adding features and options, tuning performance, and making it a competitive feature.

Difference between InnoDB FTS and MyISAM FTS

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Jimmy Yang.

With this lab release, InnoDB now has its own fulltext search (FTS) support. People might wonder how different this FTS is from MyISAM's. From a usage standpoint, they seem very similar, with no apparent difference. However, the underlying FTS engines differ hugely, since the two are built on different storage engines with totally different designs. In some aspects, such as DML, it might be difficult to make an apples-to-apples comparison. Nevertheless, we try to list and explain the differences, to make it easier for users to understand our design and to make the transition to InnoDB FTS easier.

The Design:

The most significant difference is the design. Like most transactional storage engines, InnoDB stores the inverted index in auxiliary relational tables. All queries and DMLs eventually route to these tables. MyISAM, on the other hand, stores the inverted list in a two-level B-tree. The first level contains records of each word and its count, and the second level contains records of the word's weight and the rowid. So it is more of a special B-tree “index”.
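The table-based inverted index can be pictured as rows of (word, doc_id, position), grouped here by word. A minimal Python sketch of the idea, illustrative only and not InnoDB's actual auxiliary table layout:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: each posting is conceptually one row
    (word, doc_id, position) in an auxiliary table."""
    def __init__(self):
        self.postings = defaultdict(list)  # word -> [(doc_id, position)]

    def add_document(self, doc_id, text):
        # tokenize on whitespace and record each word's position
        for pos, word in enumerate(text.lower().split()):
            self.postings[word].append((doc_id, pos))

    def search(self, word):
        # doc ids containing the word, like a query against the tables
        return sorted({doc for doc, _ in self.postings[word.lower()]})

idx = InvertedIndex()
idx.add_document(1, "InnoDB stores the inverted index")
idx.add_document(2, "MyISAM stores the inverted list in a B-tree")
```

Because positions are kept per posting, this representation can also serve proximity queries, which the MyISAM two-level structure cannot.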

However, the MyISAM design would not be suitable for a highly concurrent transactional storage engine like InnoDB. And its special structure limits its extensibility; for example, it still cannot include position information in its index to support proximity search.

For InnoDB, storing the inverted index in auxiliary tables brings all the benefits of relational table manipulation to the fulltext index. It already applies the partitioning concept, so multiple tables are used to hold the inverted index. This not only allows index creation to be parallelized, it also allows queries to be parallelized in the future. So the InnoDB design clearly wins in terms of future extensibility and scalability.

The Query Syntax:

Even though the underlying FTS design is different, InnoDB continues to use the same syntax defined by the MySQL server. The CREATE INDEX clause and almost all query syntax remain the same. Of course, InnoDB now supports proximity search, but it does so by overloading the existing boolean search with the “@” symbol:

# Proximity search: find the two words "following" and "comparison" within a 19-byte distance
SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('"following comparison"@19' IN BOOLEAN MODE);
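The proximity check itself boils down to comparing byte offsets of the matched words. A small Python sketch of that semantics, assuming the distance is measured between word start offsets (the exact measuring points in InnoDB may differ):

```python
def byte_positions(text, word):
    """Byte offsets of each occurrence of word in text (ASCII assumed)."""
    positions, start = [], 0
    while True:
        i = text.find(word, start)
        if i < 0:
            return positions
        positions.append(i)
        start = i + 1

def within_distance(text, w1, w2, max_bytes):
    """True if some occurrence of w1 and some occurrence of w2
    start within max_bytes of each other."""
    return any(abs(a - b) <= max_bytes
               for a in byte_positions(text, w1)
               for b in byte_positions(text, w2))

doc = "the following comparison shows the difference"
```

Here "following" starts at byte 4 and "comparison" at byte 14, so they match at distance 19 but not at distance 5.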

Creating FTS index:

Even though we have the same CREATE INDEX clause as MyISAM, and the index can be created on an already-loaded table or as part of table creation, we recommend creating the index after the data has been loaded into the table. That way, it can utilize our parallel create-index machinery and fully exploit the scalability of our parallel tokenization and parallel sort.

Users can tune this parallel index creation by adjusting the value of “innodb_ft_sort_pll_degree”. Currently it is only a server boot-time variable, but we could easily make it dynamically tunable.

As mentioned in the earlier blog “Overview and Getting Started with InnoDB FTS”, another point worth mentioning is the InnoDB Doc ID column (FTS_DOC_ID). This column stores the Document ID used to uniquely identify each document. If the user does not supply such a column, InnoDB creates a hidden column with this name, which normally means a clustered index rebuild. Users can avoid this by supplying their own “FTS_DOC_ID” column. A typical example would be:

1) CREATE TABLE fts_test (
       FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
       title VARCHAR(200),
       body TEXT
   ) ENGINE=InnoDB;

The unique “FTS_DOC_ID_INDEX” index on “FTS_DOC_ID” is also optional; without it, InnoDB will create it automatically:

2) CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);

3) Load data

4) CREATE FULLTEXT INDEX idx on fts_test (title, body);

DMLs

MyISAM supports real-time update, which is one of its good points. However, this comes with MyISAM's limited DML concurrency. InnoDB's behavior is more consistent with other transactional storage engines such as Oracle, whose FTS update can be done at commit time, periodically, or manually.

For InnoDB FTS, the update of the FTS index is done at commit time. This delayed update means an additional step to fetch the data again for tokenization at commit time, which increases processing time. However, the design is necessary to let concurrent multi-session DMLs proceed as efficiently as usual, and the importance of preserving this concurrency far outweighs the restrictions MyISAM imposes to ensure its real-time update.
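The commit-time behavior can be modeled as buffering DML until commit and tokenizing only then. A toy Python sketch; the names and structure are illustrative, not InnoDB internals:

```python
class FTSTransaction:
    """Defers fulltext index maintenance to commit time."""
    def __init__(self, index):
        self.index = index    # shared index: word -> set of doc ids
        self.pending = {}     # doc_id -> text inserted in this trx

    def insert(self, doc_id, text):
        # the DML itself does no tokenization, so concurrent
        # sessions are not slowed down by FTS maintenance
        self.pending[doc_id] = text

    def commit(self):
        # the extra step: fetch and tokenize the documents now
        for doc_id, text in self.pending.items():
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(doc_id)
        self.pending.clear()

index = {}
trx = FTSTransaction(index)
trx.insert(1, "deferred tokenization at commit")
visible_before = "commit" in index   # nothing indexed yet
trx.commit()
```

Before commit() the shared index sees nothing; afterwards the new document's words are searchable.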

So this is a typical example of the difference between supporting FTS in a transactional, high-concurrency database engine and doing so in a non-transactional engine like MyISAM. In some respects, we cannot compare them in an apples-to-apples fashion; it would be more reasonable to compare InnoDB FTS with Oracle Text in these areas.

InnoDB “Index Cache” and crash recovery

InnoDB FTS also has an “index cache” structure that caches insertion results. Once this cache is full, its content is flushed to the auxiliary index tables on disk. This is a structure unique to InnoDB FTS, and InnoDB provides a configuration variable, “innodb_ft_cache_size”, to control its size. The content of the index cache is also flushed to disk on normal shutdown.

If the server crashes before some of the index cache content has been synced to the on-disk index tables, then after the server reboots, when the FTS index is first used or there is an insertion or update on the table, the content missing since the crash is reinstated into the index cache.

InnoDB FTS Index “Optimization”

Another special aspect of InnoDB is how it handles document deletion. When a document is deleted, its words are not immediately removed from the fulltext index. Instead, its Doc ID is registered in a “DELETED” table. Because of this, the FTS index can become “bloated” over time. To resolve this, we can “optimize” the index using the OPTIMIZE TABLE command when the system configuration variable “innodb_optimize_fulltext_only” is turned on:

mysql> set global innodb_optimize_fulltext_only=1;

The optimize process can take a long time. To avoid blocking index access for a long period, another configuration variable, “innodb_ft_num_word_optimize”, lets you specify the number of words to optimize in each pass. Each subsequent call to OPTIMIZE TABLE continues from where the previous one left off.
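The delete-and-optimize scheme can be sketched as a postings map plus a DELETED set, with optimize() purging a bounded number of words per call and resuming where it stopped. This is a simplified model of the behavior described above, not InnoDB's implementation:

```python
class FTSIndexWithDeletes:
    def __init__(self, postings):
        self.postings = postings   # word -> set of doc ids
        self.deleted = set()       # plays the "DELETED" table
        self.cursor = 0            # where the last optimize pass stopped

    def delete_doc(self, doc_id):
        self.deleted.add(doc_id)   # the index itself stays "bloated"

    def search(self, word):
        # queries filter results against the DELETED set
        return self.postings.get(word, set()) - self.deleted

    def optimize(self, num_words=1):
        # purge only `num_words` words per call (cf.
        # innodb_ft_num_word_optimize), resuming at the cursor
        words = sorted(self.postings)
        for word in words[self.cursor:self.cursor + num_words]:
            self.postings[word] -= self.deleted
        self.cursor = (self.cursor + num_words) % len(words)

idx = FTSIndexWithDeletes({"alpha": {1, 2}, "beta": {2, 3}})
idx.delete_doc(2)
```

Searches are correct immediately, but the stale postings are only physically removed word by word as optimize() is called repeatedly.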

Stopword handling

MyISAM has a large default stopword list, so it can create a smaller FTS index. In addition, it excludes any word that appears in more than 50% of the documents; such exclusion can actually create unnecessary confusion. InnoDB FTS does not have these restrictions, and our default stopword list is much smaller. You can view the default stopword list by selecting from an information schema table:

INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD

InnoDB FTS also supports user-supplied stopwords. Users can provide their stopwords through a table with a single VARCHAR column, then tell InnoDB about the stopword table through the system configuration variable “innodb_ft_server_stopword_table”:

# Define a correctly formatted user stopword table
create table user_stopword(value varchar(30)) engine = innodb;

# The set operation should be successful
set global innodb_ft_server_stopword_table = "test/user_stopword";

Other Miscellaneous items:

1) New InnoDB Information Schema Table

InnoDB FTS adds a few Information Schema tables for diagnostic purposes; they might be of interest to users who want to explore the content of some internal auxiliary tables:

  1. INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE displays the in-memory “Index Cache” content.
  2. INFORMATION_SCHEMA.INNODB_FT_DELETED displays the “DELETED” table content.
  3. INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE displays the on-disk “Index Table” content.

Before selecting from these tables, users need to tell InnoDB which table's FTS index they are interested in. This is done by setting the “innodb_ft_aux_table” configuration variable to the name of the table:

mysql> set global innodb_ft_aux_table="test/articles";

mysql> select * from INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;

System configure variables:

Besides the various system variables already mentioned, there are a couple of other configuration variables:

innodb_ft_min_token_size – minimum word size in bytes we would tokenize, default value is 3

innodb_ft_max_token_size – maximum word size in bytes we would tokenize, default value is 84

innodb_ft_num_word_optimize – number of words we optimize for each optimize operation

Please note that we do not use MyISAM's configuration variables such as “ft_min_word_len”, “ft_max_word_len”, “ft_stopword_file”, etc.

Summary:

In summary, InnoDB FTS keeps most of its query syntax consistent with that used by MySQL/MyISAM, so users do not need to learn any additional commands. However, the internal design of InnoDB FTS is markedly different from MyISAM's, so users should pay attention to these design differences and expect the possible behavior differences mentioned in this article.

Create InnoDB databases with 4k and 8k page sizes without recompiling

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Kevin Lewis.

One of the features found in the summer 2011 labs release is the ability to select the InnoDB page size without recompiling.  Smaller page sizes may be useful for certain storage media such as SSDs where there is no need to minimize seek time between reads.

A new global setting called innodb-page-size can be set to 4k, 8k or 16k before creating a new MySQL instance. This sets the page size for all tablespaces used by that InnoDB instance.   This can be done in my.cnf or on the mysqld command line.  It is a read-only variable while the engine is running since it must be set before InnoDB starts up and creates a new system tablespace.  That happens when InnoDB does not find ibdata1 in the data directory.  If your system tablespace already exists using one page size and innodb-page-size is something else, the engine will not start.

A few bugs were found and fixed related to smaller page sizes in InnoDB tablespaces. Those bug fixes can also be found in the 2011 summer labs release.

The mysql-test-run suite can be run like this to test a smaller page size:

perl mysql-test-run.pl --mysqld=--innodb-page-size=4k --suite=innodb --force

Other features of InnoDB work the same with smaller page sizes, but some limits are affected.  While each of the four row formats (Redundant, Compact, Compressed, and Dynamic) is supported, the maximum record size is smaller.  Record length limits are calculated within InnoDB based on a number of factors, including row format, column type and length, number of columns, secondary indexes, index prefix lengths, and of course, the page size.  Since the main record is stored in a clustered index, a minimum of two records must fit in each page.  So the maximum record length for 8k pages is about half that of 16k pages, and the maximum record size with 4k pages is half that of 8k pages.

If you have a database schema with any large records, you may not be able to use smaller page sizes.  Even if your records do barely fit in the clustered index page, it may not be advisable to use these smaller pages because the btree will be a lot deeper.  For example, if only 2 records fit on the page and there are  1,000,000 records, leaf pages are 20 levels down meaning InnoDB will need to read and search 20 pages to find the leaf page.  If that were on 4k pages, then using the same table on 16k pages would give 8 records per page and the leaf pages would only be 7 levels down.
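The depth arithmetic in the example can be checked with a rough model that assumes every page, internal and leaf, holds the same number of entries. Real InnoDB internal pages have far higher fanout than wide leaf records, so actual trees are much shallower; this just reproduces the article's own back-of-envelope numbers:

```python
import math

def btree_levels(num_records, records_per_page):
    """Levels a reader descends to reach a leaf, assuming a uniform
    fanout of records_per_page at every level (a simplification)."""
    return max(1, math.ceil(math.log(num_records, records_per_page)))

# 1,000,000 records: 2 records per 4k page vs 8 per 16k page
deep = btree_levels(1_000_000, 2)
shallow = btree_levels(1_000_000, 8)
```

With only 2 records per page the model gives 20 levels; with 8 records per page it gives 7, matching the example above.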

There is a trick to reducing the size of records on the clustered index page.  VARCHAR fields can be converted to TEXT fields. (And VARBINARY can be converted to BLOB.)  InnoDB tries to put as much of VARCHAR fields in the clustered record as it can.  Normally, it will put as much as 767 bytes which is the index prefix limit on long fields.  But this labs release also contains a feature to increase that limit to 3072 bytes.  So if you have VARCHAR fields that use hundreds or thousands of bytes, they can make the clustered record very long.  TEXT columns, like BLOB columns, are stored in BLOB pages.  Only 20 bytes is needed in the record to link to that data.  Long VARCHAR fields are split at 767 bytes (or 3072 bytes if innodb_large_prefix=ON) and that 20 bytes is added to the clustered record to link to the rest of those fields.  If your VARCHAR(5000) field rarely ever uses more than 767 bytes, it will rarely be split up.  Still, you might get better performance by reducing the depth of the btree if you define long columns as TEXT instead of VARCHAR.

A file extent in InnoDB is 1 Mb independent of the page size.  So an extent will hold 64  16k pages, 128 8k pages and 256 4k pages.  This means that the read ahead mechanisms will read more pages with smaller page sizes since they read a whole extent at a time.  The doublewrite buffer, which is based on the size of an extent, will also contain more pages.
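The extent arithmetic is simple to verify:

```python
EXTENT_BYTES = 1024 * 1024  # one extent is 1 MB regardless of page size

def pages_per_extent(page_size_bytes):
    """Pages per 1 MB extent; also the number of pages a whole-extent
    readahead request pulls in at once."""
    return EXTENT_BYTES // page_size_bytes
```

So a readahead of one extent reads 64 pages at 16k, 128 at 8k, and 256 at 4k.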

If you want to use smaller page sizes with existing data, export the data first with a logical export utility such as mysqldump.  Then create the new mysql instance with innodb-page-size=4k or 8k and import the data.  Do not use a physical export method such as alter table … discard tablespace.

Summary:

This feature makes it easier to try smaller page sizes in an InnoDB database. Just export your data, move or delete the system database (ibdata1) and the log files (ib_logfile0 & ib_logfile1), set innodb-page-size to either 4k or 8k, and restart MySQL.  A new InnoDB instance will be created with the smaller page size.  Then you can import your data and run your tests, all without recompiling InnoDB.

Reduced contention during datafile extension

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Inaam Rana.

InnoDB has an internal file system management module that primarily manages the space in the data files. One of the pain points was the coarse level of locking used when a data file has to be extended. More about this issue can be found here. In the latest labs release we have fixed this problem.

When we need to extend a data file inside InnoDB, we write zero-filled pages synchronously to the file. The user thread extending the data file holds fil_system::mutex during the whole operation. This mutex covers changes to all data structures related to file system management, which means that even a regular IO must acquire fil_system::mutex, though only for a short time. Because the thread doing the data file extension holds the mutex during the whole IO operation, any other thread (user or background) trying to access a data file for a regular read or write ends up waiting. This brings the whole system to a virtual standstill, as no read or write activity can happen. This is true even if a thread is trying to access a data file other than the one being extended.

We fixed this issue by introducing an internal flag in the data structure indicating that a file is being extended. Now when a user thread needs to extend a data file, it acquires fil_system::mutex but releases it after setting the flag. Once it is done with the extension IO, it resets the flag. This allows other threads to access data files while one of the files is being extended, and also allows multiple files to be extended in parallel. Our tests have shown that the stalls due to file extension are indeed fixed by this approach.
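The essence of the fix is shrinking the critical section to just the flag updates, so the slow zero-fill IO runs unlocked. A Python sketch of the pattern; the identifiers are illustrative, not InnoDB's actual code:

```python
import threading

class FilSystem:
    """Hold the global mutex only long enough to set/clear a
    per-file 'being extended' flag; the extension IO itself
    runs without the mutex, so other file IO can proceed."""
    def __init__(self):
        self.mutex = threading.Lock()   # plays fil_system::mutex
        self.being_extended = set()     # files currently being extended

    def extend_file(self, name, do_io):
        with self.mutex:                # short hold: set the flag
            self.being_extended.add(name)
        try:
            do_io()                     # slow zero-fill, mutex released
        finally:
            with self.mutex:            # short hold: clear the flag
                self.being_extended.discard(name)

fs = FilSystem()
events = []
fs.extend_file("ibdata1", lambda: events.append("io"))
```

Because the flag lives in a set keyed by file name, several files can be flagged (and so extended) concurrently.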

A related feature which can be considered as future work is to off load the file extension to a background thread.

Reintroducing Random Readahead in InnoDB

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Inaam Rana.

Yes, we are reintroducing random readahead. We took the random readahead code out after our internal benchmarking and community feedback showed that random readahead tends to adversely affect performance. However, some recent feedback from customers has shown that for some specific workloads random readahead can be beneficial. Based on this we have decided to reintroduce random readahead, albeit with a dynamic switch to turn it on or off. By default, random readahead is not enabled. You can turn it on by setting innodb_random_read_ahead = TRUE.

The term ‘random readahead’ is a bit of an oxymoron. Readahead seems more related to sequential access pattern. To clarify this I’ll try to explain how InnoDB internally triggers a readahead request. Data pages in InnoDB are laid out on the disk in chunks of 64 pages called extents. The decision about readahead is essentially about whether it makes sense to read in the entire extent instead of reading in the pages one by one as and when they are requested by the user. There are two types of readahead mechanisms inside InnoDB. There are some obvious and some subtle differences between the two.

Linear readahead is the one that decides whether or not to read in the ‘next’ extent and the decision is made based not only on the number of pages of the ‘current’ extent present in the buffer pool but also the access pattern. The ‘current’ extent means the extent which is being accessed currently by the user threads. Whenever a user thread tries to access a page in the buffer pool the linear readahead mechanism is triggered. We first see if it is a boundary page of an extent or not. If it is a boundary page we figure out how many pages of this extent are present in the buffer pool and what is their access pattern. If there are more than innodb_read_ahead_threshold pages in the buffer pool and the access pattern is sequential and in the right order, InnoDB will queue asynchronous read requests for the entire next extent.

Random readahead is concerned mainly with the ‘current’ extent. The evaluation is done on each page read (note the difference from linear readahead where conditions are evaluated on every page access). On every page read, if innodb_random_read_ahead is set we try to see how many pages of this extent are in the buffer pool. If we have a certain number of pages and all of them have been accessed recently i.e.: they are closer to the MRU end of the LRU list we deduce that the remaining pages of the extent will be accessed soon as well. Asynchronous read requests for the remaining pages in the current extent are queued.

To recap the differences between the two readahead mechanisms: for linear readahead, the decision is made about the next extent, the conditions are evaluated on each page access, the access pattern is evaluated, and the decision is made if and only if we are accessing a boundary page of an extent. For random readahead, the decision is made about the current extent, the conditions are evaluated only at page read, only the recentness of access (not the access pattern) is taken into consideration, and the decision is not tied to the page being read being a boundary page.
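The two decision rules can be summarized in code. The sketch below mirrors the description above; innodb_read_ahead_threshold defaults to 56, while the page-count threshold used for random readahead here is an assumed placeholder, not the engine's actual constant:

```python
def linear_readahead(is_boundary_page, pages_in_pool, access_sequential,
                     threshold=56):
    """Linear readahead: checked on every page ACCESS, but fires only
    at an extent boundary page; queues reads for the NEXT extent when
    enough pages of the current extent are cached (threshold is
    innodb_read_ahead_threshold) and were accessed sequentially."""
    return (is_boundary_page
            and pages_in_pool > threshold
            and access_sequential)

def random_readahead(enabled, recently_accessed_pages, threshold=13):
    """Random readahead: checked on every page READ; queues reads for
    the rest of the CURRENT extent when enough of its pages sit near
    the MRU end of the LRU list. Threshold value is an assumption."""
    return enabled and recently_accessed_pages >= threshold
```

Note how the linear variant needs the access pattern and boundary condition, while the random variant needs only recency and the on/off switch.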

Wednesday Apr 13, 2011

Get started with InnoDB Metrics Table

Note: this article was originally published on http://blogs.innodb.com on April 13, 2011 by Jimmy Yang.

In our 5.6 release, we introduced a new feature – “InnoDB Metrics Table” as part of our effort to extend InnoDB diagnostic and monitoring capability. The “Metrics Table” feature consolidates all InnoDB related Performance and Resource related counters (as of today, 176 counters) and makes them available through an information schema table. In addition, it gives the user the ability to control these counters for their monitoring need.

In the following sections, we will go over the feature in detail, focusing on its usage so you can get started with the metrics counters.

1) The user interface – InnoDB Metrics Table

Before this feature, InnoDB already had dozens of performance counters, exposed to the user through MySQL system variables. These are permanent counters with no external controls, but they have become important probes that let users peek into the server's running state and make appropriate decisions to tune the system.

In an effort to extend our system monitoring capability, we decided to consolidate the monitoring machinery and create a dedicated monitoring module. The interface to the monitor counters is a relational table, so users can perform SQL queries on the results. Thus we have the “information_schema.innodb_metrics” table as the interface of the “Metrics Table” feature.

The table has the following 17 columns:

1.  NAME              string               counter name
2.  SUBSYSTEM         string               the module or feature the counter pertains to
3.  COUNT             int64                value since the counter was enabled
4.  MAX_COUNT         int64 (nullable)     max value since the counter was enabled
5.  MIN_COUNT         int64 (nullable)     min value since the counter was enabled
6.  AVG_COUNT         float (nullable)     average since the counter was enabled
7.  COUNT_RESET       int64                counter value since the last reset operation
8.  MAX_COUNT_RESET   int64 (nullable)     max value since the last reset
9.  MIN_COUNT_RESET   int64 (nullable)     min value since the last reset
10. AVG_COUNT_RESET   float (nullable)     avg value since the last reset
11. TIME_ENABLED      timestamp (nullable) timestamp of last start
12. TIME_DISABLED     timestamp (nullable) timestamp of last stop
13. TIME_ELAPSED      int64 (nullable)     time elapsed since counter started (in seconds)
14. TIME_RESET        timestamp (nullable) timestamp of last reset
15. STATUS            string               whether the counter is running or stopped
16. TYPE              string               whether the counter is incremental
                                           or a resource counter
17. COMMENT           string               counter description

The columns are pretty much self-explanatory. The most basic are the “NAME” and “COUNT” columns, which give you the counter name and its value.

2) The Control Mechanism

In addition to letting users view counters through a relational table, we added four simple operations to control the counters:

  1. Enable Counting – starts the monitor counter.
  2. Disable Counting – stops the monitor counter.
  3. Reset – resets the monitor counter to zero while the counter is still in the “enabled” state. Notice we have both a “COUNT” column and a “COUNT_RESET” column; the “Reset” operation resets only the “COUNT_RESET” column.
  4. Reset All – this operation must be preceded by a “Disable” operation. It resets all values, including both the “COUNT” and “COUNT_RESET” columns.
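The interplay of COUNT, COUNT_RESET, and the four operations can be modeled in a few lines. This is a toy model of the semantics above, not server code:

```python
class MetricsCounter:
    def __init__(self):
        self.enabled = False
        self.count = 0         # value since the counter was enabled
        self.count_reset = 0   # value since the last reset

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def increment(self, n=1):
        if self.enabled:       # a disabled counter counts nothing
            self.count += n
            self.count_reset += n

    def reset(self):
        # allowed while enabled: clears only the *_RESET view
        self.count_reset = 0

    def reset_all(self):
        # must be preceded by disable(), like innodb_monitor_reset_all
        if self.enabled:
            raise RuntimeError("cannot reset all values while counter is on")
        self.count = self.count_reset = 0

c = MetricsCounter()
c.enable()
c.increment(3)
c.reset()
```

After the reset(), COUNT still shows 3 while COUNT_RESET is back to 0; reset_all() refuses to run until the counter is disabled.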

The above four operations are performed through the MySQL SET statement on four option variables:

  • Enable – To enable or turn on the counter, use “innodb_monitor_enable”:

mysql> set global innodb_monitor_enable = counter_name;

  • Disable  – To disable or turn off the counter, use “innodb_monitor_disable”:

mysql> set global innodb_monitor_disable = counter_name;

  • Reset – To reset the counter, use “innodb_monitor_reset”:

mysql> set global innodb_monitor_reset = counter_name;

  • Reset all – To reset all the values, use innodb_monitor_reset_all:

mysql> set global innodb_monitor_reset_all = counter_name;

Wild card match for counter name:

Note that you must specify the counter name for the above operations. To make things easy, we support wildcard matching on the counter name, so users do not need to remember its exact spelling; they just need to include at least one wildcard character “%” in the counter_name string.

For example, if you would like to turn on the monitor “dml_inserts”, you can use “dml_inserts” directly, or simply use “dml_i%” or “dml_%”; the latter will turn on other counters such as “dml_deletes” too:

1) mysql> set global innodb_monitor_enable = dml_inserts;

Query OK, 0 rows affected (0.00 sec)

2) mysql> set global innodb_monitor_enable = "dml_%";

Check the counter, notice its status is in the “enabled” state:

mysql> select * from information_schema.innodb_metrics where name="dml_inserts" \G
*************************** 1. row ***************************
           NAME: dml_inserts
      SUBSYSTEM: dml
          COUNT: 0
      MAX_COUNT: 0
      MIN_COUNT: NULL
      AVG_COUNT: 0
    COUNT_RESET: 0
MAX_COUNT_RESET: 0
MIN_COUNT_RESET: NULL
AVG_COUNT_RESET: NULL
   TIME_ENABLED: 2011-02-14 19:20:15
  TIME_DISABLED: NULL
   TIME_ELAPSED: 27
     TIME_RESET: NULL
         STATUS: enabled
           TYPE: status_counter
        COMMENT: Number of rows inserted
1 row in set (0.01 sec)

Or if you want to turn on all monitors, you can simply use “%” as the counter_name:

mysql> set global innodb_monitor_enable = "%";

Query OK, 0 rows affected (0.00 sec)

Finally, there are a few rules regarding Wildcard match with counter names.

  • Wildcard matching applies only to individual counters; it does not apply to modules (a concept we will discuss later).
  • At least one “%” must appear in the match string for the server to switch to wildcard matching. “%” matches multiple characters and “_” matches a single character. However, if the string contains only “_” and no “%”, wildcard matching is not turned on.
  • As shown above, one or more “%” with no other characters selects all counters.
  • Users can supply a wildcard match string in the server configuration file too: loose-innodb_monitor_enable="lock%" (turning on monitors in the configuration file is discussed in the next section).
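The stated matching rules can be expressed as a small predicate. This is an illustrative model of the rules above, not the server's actual parser:

```python
import re

def counter_matches(pattern, name):
    """Wildcard mode only kicks in when the pattern contains at
    least one '%'; then '%' matches any run of characters and '_'
    matches exactly one character. Without '%', it is a plain
    exact-name comparison (so '_' alone stays literal)."""
    if "%" not in pattern:
        return pattern == name
    regex = "".join(".*" if ch == "%" else
                    "." if ch == "_" else
                    re.escape(ch)
                    for ch in pattern)
    return re.fullmatch(regex, name) is not None
```

So "dml_%" matches both dml_inserts and dml_deletes, while a pattern with "_" but no "%" is compared literally.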

Turn on monitor counters with server configure files or startup time options:

We also support turning on monitor counters through the server configuration file. Use the “innodb_monitor_enable” option to specify the monitor counters or groups you want to turn on at server startup. The counter names can be separated by “;”, “,” or spaces. For example, the following option turns on the monitor counter “metadata_table_opened” and all counters starting with the “lock” prefix:

loose-innodb_monitor_enable="metadata_table_opened;lock%"

Monitor Counter Modules:

Monitor counters are grouped into modules, and each monitor counter is prefixed with its module name. For example, the counters “dml_inserts”, “dml_deletes” and “dml_updates” all belong to the “DML” module. So far, about 19 modules are defined:

1 module_metadata
2 module_lock
3 module_buffer
4 module_buf_page
5 module_os
6 module_trx
7 module_purge
8 module_log
9 module_compress
10 module_index
11 module_adaptive_hash
12 module_table_space
13 module_change_buffer
14 module_innodb
15 module_file
16 module_ibuf_system
17 module_dml
18 module_ddl
19 module_icp

We created these modules to make the counters more manageable. You can apply these module names directly to the 4 set options defined above, and the action will apply to all counters in the module. For example:

set global innodb_monitor_enable = "module_dml";

3) Examples

Let’s finish this quick tutorial with some examples. Let’s continue to look at the “dml_inserts” counter:

1) Enable the dml_inserts counter

mysql> set global innodb_monitor_enable = "dml_i%"

2) Insert three rows, and check the counters again:

mysql> insert into test values(9);
Query OK, 1 row affected (0.00 sec)
mysql> insert into test values(10);
Query OK, 1 row affected (0.00 sec)
mysql> insert into test values(11);
Query OK, 1 row affected (0.00 sec)
mysql>  select * from information_schema.innodb_metrics where
name="dml_inserts" \G
*************************** 1. row ***************************
           NAME: dml_inserts
      SUBSYSTEM: dml
          COUNT: 3
      MAX_COUNT: 3
      MIN_COUNT: NULL
      AVG_COUNT: 0.030303030303030304
    COUNT_RESET: 3
MAX_COUNT_RESET: 3
MIN_COUNT_RESET: NULL
AVG_COUNT_RESET: NULL
   TIME_ENABLED: 2011-02-14 19:20:15
  TIME_DISABLED: NULL
   TIME_ELAPSED: 99
     TIME_RESET: NULL
         STATUS: enabled
           TYPE: status_counter
        COMMENT: Number of rows inserted
1 row in set (0.00 sec)

3) Reset – Let's reset the counter. Here we specify “module_dml”, which resets all counters in the DML module:

mysql> set global innodb_monitor_reset = module_dml;
Query OK, 0 rows affected (0.00 sec)

Let's select only the related columns. Notice that only “count_reset” and “max_count_reset” are reset, not the “count” and “max_count” values:

mysql> select NAME, COUNT, MAX_COUNT, COUNT_RESET,
    -> MAX_COUNT_RESET, TIME_ENABLED, TIME_DISABLED, STATUS
    -> from information_schema.innodb_metrics
    -> where name="dml_inserts" \G
*************************** 1. row ***************************
           NAME: dml_inserts
          COUNT: 3
      MAX_COUNT: 3
    COUNT_RESET: 0
MAX_COUNT_RESET: 0
   TIME_ENABLED: 2011-02-14 19:20:15
  TIME_DISABLED: NULL
         STATUS: enabled
1 row in set (0.00 sec)

While the monitor counter is “enabled”, we cannot use “innodb_monitor_reset_all” to reset all values; we can only apply “innodb_monitor_reset”, which resets just the *_RESET columns to 0. If you do use “innodb_monitor_reset_all” while the counter is enabled, the following message is printed to the error log:

mysql>  set global innodb_monitor_reset_all = module_dml;
Query OK, 0 rows affected (0.00 sec)
InnoDB: Cannot reset all values for a monitor counter while it is on. Please
turn it off and retry.

4) Only after you disable the monitor counter, this reset_all command can be used:

mysql>  set global innodb_monitor_disable  = module_dml;
Query OK, 0 rows affected (0.00 sec)
mysql>  set global innodb_monitor_reset_all = module_dml;
Query OK, 0 rows affected (0.01 sec)
mysql> select * from information_schema.innodb_metrics  where
name="dml_inserts"\G
*************************** 1. row ***************************
           NAME: dml_inserts
      SUBSYSTEM: dml
          COUNT: 0
      MAX_COUNT: NULL
      MIN_COUNT: NULL
      AVG_COUNT: NULL
    COUNT_RESET: 0
MAX_COUNT_RESET: NULL
MIN_COUNT_RESET: NULL
AVG_COUNT_RESET: NULL
   TIME_ENABLED: NULL
  TIME_DISABLED: NULL
   TIME_ELAPSED: NULL
     TIME_RESET: NULL
         STATUS: disabled
           TYPE: status_counter
        COMMENT: Number of rows inserted
1 row in set (0.01 sec)
While the monitor is in the "disabled" state, its counting activity is
completely stopped:
mysql>  insert into test values(9);
Query OK, 1 row affected (0.00 sec)
mysql>  insert into test values(9);
Query OK, 1 row affected (0.01 sec)
mysql> select * from information_schema.innodb_metrics  where
name="dml_inserts"\G
*************************** 1. row ***************************
           NAME: dml_inserts
      SUBSYSTEM: dml
          COUNT: 0
      MAX_COUNT: NULL
      MIN_COUNT: NULL
      AVG_COUNT: NULL
    COUNT_RESET: 0
MAX_COUNT_RESET: NULL
MIN_COUNT_RESET: NULL
AVG_COUNT_RESET: NULL
   TIME_ENABLED: NULL
  TIME_DISABLED: NULL
   TIME_ELAPSED: NULL
     TIME_RESET: NULL
         STATUS: disabled
           TYPE: status_counter
        COMMENT: Number of rows inserted
1 row in set (0.01 sec)

One last point before we finish. Some counters do not really behave as “counters”; most of them are resource related and display the amount of a resource in use. The total number of buffer pool pages is a good example. For these counters, we only display the current value, and the “type” column shows “value” instead of “status_counter”:

mysql> set global innodb_monitor_enable = buffer_pool_pages_total;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from information_schema.innodb_metrics where name =
"buffer_pool_pages_total"\G
*************************** 1. row ***************************
           NAME: buffer_pool_pages_total
      SUBSYSTEM: buffer
          COUNT: 512
      MAX_COUNT: 512
      MIN_COUNT: 512
      AVG_COUNT: NULL
    COUNT_RESET: 512
MAX_COUNT_RESET: 512
MIN_COUNT_RESET: 512
AVG_COUNT_RESET: NULL
   TIME_ENABLED: 2011-02-14 19:20:15
  TIME_DISABLED: NULL
   TIME_ELAPSED: 702
     TIME_RESET: NULL
         STATUS: enabled
           TYPE: value
        COMMENT: Total buffer pool size in pages (innodb_buffer_pool_pages_total)

 

Monday Apr 11, 2011

Get started with InnoDB Memcached Daemon plugin

Note: this article was originally published on http://blogs.innodb.com on April 11, 2011 by Jimmy Yang.

As Calvin mentioned in “NoSQL to InnoDB with Memcached“, we just released a “technology preview” of the feature that makes memcached a MySQL daemon plugin. This “technology preview” release demonstrates how users can bypass the SQL optimizer and query processing layers and interact directly with the InnoDB storage engine through the InnoDB APIs. Here, we walk step by step through getting the memcached daemon plugin set up and running.

If you would just like a brief introduction to the setup steps, there is a "README-innodb_memcached" in the mysql-5.6.2-labs-innodb-memcached package. What follows is a more detailed description of those steps.

1) Prerequisite:

Currently, the memcached daemon plugin prototype is only supported on the Linux platform. As a prerequisite, you must have libevent installed, since memcached requires it.

If you have the source code release, libevent 1.4.3 is included in the package (plugin/innodb_memcached/libevent). You can go to that directory and run autoconf, ./configure, make, and make install to build and install libevent.

2) Build the server

Assuming you would like to build the server yourself (otherwise, you can skip this section): once libevent is installed, you can build the MySQL server as usual.

Our source code is in the "plugin/innodb_memcached" directory. As part of the server build, it generates two shared libraries:

1) libmemcached.so: this is the memcached daemon plugin to MySQL

2) innodb_engine.so: this is an InnoDB API plugin to memcached

Make sure the above two shared libraries are placed in the MySQL plugin directory. You can find the MySQL plugin directory with "select @@plugin_dir":

mysql> select @@plugin_dir;
+----------------------------------------------------+
| @@plugin_dir                                       |
+----------------------------------------------------+
| /home/jy/work2/mysql-5.6-memcached-labs/lib/plugin |
+----------------------------------------------------+
1 row in set (0.00 sec)

3) Install configuration tables:

Next, the memcached plugin needs a bit of configuration to know how to interact with InnoDB tables. We have a configuration script in "scripts/innodb_memcached_config.sql". You can install the necessary configuration tables by running "mysql < scripts/innodb_memcached_config.sql". If you are not interested in the details of these configuration tables, you can skip the rest of this section.

This configuration script installs the three tables needed by InnoDB memcached. These tables are created in a dedicated database, "innodb_memcache". We will go over these three tables in a bit more detail:

1) "containers": This is the most important table for the memcached-to-InnoDB mapping. It describes the table used to store the memcached values. Currently, memcached can be mapped to only one table at a time, so essentially there will be only one row in this table. In the future, we may make this configuration more flexible and dynamic, so that users can map memcached operations to multiple tables.

The mapping is done through specifying corresponding column values in the table:

  • "db_schema" and "db_table" describe the database and table name used for storing the memcached values.
  • "key_columns" names the single column used as the "key" for memcached operations.
  • "value_columns" names the columns (there can be several) used as the "value" for memcached operations. Users can specify multiple columns by separating them with commas (such as "col1,col2").
  • "unique_idx_name_on_key" is the name of the index on the "key" column. It must be a unique index; it can be primary or secondary.

The above five column values (database name, table name, key column, value columns, and index name) must all be supplied. Otherwise, the setup will fail.

The following columns are optional; however, to fully comply with the memcached protocol, you should supply them too.

  • "flags" names the column used as the memcached "flags" value. It is also used as a column specifier for some operations (such as incr and prepend) when the memcached "value" is mapped to multiple columns, so that the operation is performed on the specified column. For example, if you have mapped the value to 3 columns and only want the "increment" operation performed on one of them, you can use flags to specify which column those operations use.
  • "cas_column" and "exp_column" store the memcached "cas" and "exp" values, respectively.

2) Table "cache_policies" specifies whether to use InnoDB as the data store of memcached (innodb_only), use memcached's default engine as the backing store (cache-only), or both (caching). In the last case, an operation is forwarded to InnoDB only if the default engine operation fails (for example, if a key cannot be found in memory, InnoDB is searched).

3) Table "config_options": currently, we support only one configuration option through this table. It is the "separator" used to split a single long value string into smaller values for multiple value columns. For example, if you defined "col1,col2" as the value columns and "|" as the separator, you could issue the following command in memcached to insert values into col1 and col2 respectively:

set keyx 10 0 19

valuecolx|valuecoly

So "valuecolx" will be stored in col1 and "valuecoly" in col2.
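As a rough illustration (not the plugin's actual code), the split performed by the configured separator can be sketched in Python; the function name and column list here are ours, matching the example above:

```python
# Sketch: split one memcached value string into per-column values using
# a configured separator, as the "separator" entry in config_options does.
def split_value(value, columns, separator="|"):
    parts = value.split(separator)
    if len(parts) != len(columns):
        raise ValueError("value does not match the mapped column count")
    return dict(zip(columns, parts))

# "valuecolx" goes to col1, "valuecoly" goes to col2.
row = split_value("valuecolx|valuecoly", ["col1", "col2"])
print(row)  # {'col1': 'valuecolx', 'col2': 'valuecoly'}
```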

4) Example tables

Finally, as part of the configuration script, we create a "demo_test" table in the "test" database as an example. It also allows the memcached daemon to work out of the box, with no need for any additional configuration.

As you may notice, this "demo_test" table has more columns than needed, so the entries in the "containers" table are required to tell which column is used for what purpose, as described above.

4) Install the Daemon Plugin

The final step would be installing the daemon plugin. It is the same as installing any other MySQL plugin:

mysql> install plugin daemon_memcached soname "libmemcached.so";

If you have any memcached-specific configuration parameters, you need to specify them at MySQL server startup or put them in the MySQL configuration file, even though they only take effect when the plugin is installed.

For example, if you would like memcached to listen on port "11222" instead of the default port "11211", add "-p11222" to the MySQL system configuration variable "daemon_memcached_option":

mysqld ... --loose-daemon_memcached_option="-p11222"

Of course, you can add other memcached command line options to “daemon_memcached_option” string.

The other configuration options are:

1) daemon_memcached_engine_lib_name (default “innodb_engine.so”)

2) daemon_memcached_engine_lib_path (default NULL, the plugin directory).

By default, you do not need to set or change either of these two configuration options. They exist because they allow you to load any other storage engine for memcached (such as the NDB memcached engine). This opens the door for more interesting exploration.

3) daemon_memcached_r_batch_size: the batch commit size for read operations (get). It specifies after how many operations a commit is done. By default, this is set to a very large number, 1048576.

4) daemon_memcached_w_batch_size: the batch commit size for write operations (set, replace, append, prepend, incr, decr, etc.). By default, this is set to 32.
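The effect of the write batch size can be sketched as follows; the class and method names are illustrative, not the plugin's internals:

```python
# Sketch: commit once every `w_batch_size` write operations, mirroring
# how daemon_memcached_w_batch_size batches memcached writes into one
# InnoDB transaction commit.
class BatchingWriter:
    def __init__(self, w_batch_size=32):
        self.w_batch_size = w_batch_size
        self.pending = 0   # writes not yet committed
        self.commits = 0   # commits issued so far

    def write(self):
        self.pending += 1
        if self.pending >= self.w_batch_size:
            self.commit()

    def commit(self):
        self.commits += 1
        self.pending = 0

w = BatchingWriter(w_batch_size=32)
for _ in range(100):
    w.write()
print(w.commits, w.pending)  # 3 commits (at 32, 64, 96), 4 writes pending
```

This is why, with the default of 32, rows you just "set" through memcached may not be visible to an ordinary SQL select yet, as discussed below.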

Again, please note that you must put these configuration parameters in your MySQL configuration file or on the MySQL startup command line. They take effect when the memcached plugin is loaded.

5) Start to play:

Now that you have everything set up, you can start to play:

telnet 127.0.0.1 11211

set a11 10 0 9
123456789
STORED
get a11
VALUE a11 0 9
123456789
END
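The "set a11 10 0 9" line above follows the memcached text protocol: key, flags, expiry time, and byte count, followed by the data block. A small helper that builds such a request (a sketch; the function name is ours) makes the byte-count bookkeeping explicit:

```python
# Sketch: build a memcached text-protocol "set" request. The byte count
# on the command line must equal the length of the data block that follows.
def build_set(key, value, flags=0, exptime=0):
    data = value.encode("utf-8")
    header = "set %s %d %d %d\r\n" % (key, flags, exptime, len(data))
    return header.encode("ascii") + data + b"\r\n"

req = build_set("a11", "123456789", flags=10)
print(req)  # b'set a11 10 0 9\r\n123456789\r\n'
```

Sending these bytes over a socket to port 11211 would produce the same exchange as the telnet session above.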

You can access the InnoDB table (“demo_test”) through the standard SQL interfaces. However, there are some catches:

1) If you would like to take a look at what is in the "demo_test" table, remember that commits are batched (32 operations by default). So you will need a "read uncommitted" select to see the just-inserted rows:

mysql> set session TRANSACTION ISOLATION LEVEL
-> read uncommitted;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from demo_test;
+------+------+------+------+-----------+------+------+------+------+------+------+
| cx   | cy   | c1   | cz   | c2        | ca   | CB   | c3   | cu   | c4   | C5   |
+------+------+------+------+-----------+------+------+------+------+------+------+
| NULL | NULL | a11  | NULL | 123456789 | NULL | NULL |   10 | NULL |    3 | NULL |
+------+------+------+------+-----------+------+------+------+------+------+------+
1 row in set (0.00 sec)

2) The InnoDB table is IS (intention shared) or IX (intention exclusive) locked for all operations in a transaction. So unless you change "daemon_memcached_r_batch_size" and "daemon_memcached_w_batch_size" to a small number (such as 1), the table is most likely intention-locked between operations, and you cannot run any DDL on it.

Summary

Now you have everything set up, and you can interact directly with the InnoDB storage engine through memcached interfaces. In addition, as you may have noticed while going through this extended "README", we still have a lot of interesting options open for exploration and enhancement. This is the beginning of opening InnoDB to the outside world, and NoSQL is a perfect place for it to play.

NoSQL to InnoDB with Memcached

Note: this article was originally published on http://blogs.innodb.com on April 11, 2011 by Calvin Sun.

MySQL is the most popular open source SQL database. The ever-increasing performance demands of web-based services have generated significant interest in providing NoSQL access methods to MySQL. Today, MySQL is announcing a preview of NoSQL access to InnoDB via memcached. This offering gives users the best of both worlds: all the advantages of a rich SQL query language, plus better performance for simple queries via direct access to shared data.

In this preview release, memcached is implemented as a MySQL plugin daemon, accessing InnoDB directly via the native InnoDB API:

Features provided in the current release:

  • Memcached as a daemon plugin of mysqld: both mysqld and memcached run in the same process space, with very low-latency access to data
  • Direct access to InnoDB: bypasses the SQL parser and optimizer
  • Standard protocol support (memcapable): both the memcached text-based protocol and the binary protocol are supported; all 55 memcapable tests pass
  • Multiple-column support: users can map multiple columns into the "value"; the column values are joined with a pre-defined, configurable "separator"
  • Optional local caching: three options, "cache-only", "innodb-only", and "caching" (both cache and InnoDB store); these options can be applied to each of the four memcached operations (set, get, delete, and flush)
  • Batch operations: users can specify the batch commit size for InnoDB memcached operations via "daemon_memcached_r_batch_size" and "daemon_memcached_w_batch_size" (default 32)
  • All memcached configuration options are supported through the MySQL configuration variable "daemon_memcached_option"

Sounds interesting? You can download the source or binary from MySQL Labs (only tested on Linux): select "mysql-5.6-labs-innodb-memcached". After unpacking the files, please read the "README-innodb_memcached" file. Also, please read the upcoming blog "Get started with InnoDB Memcached Daemon plugin" by Jimmy.

This is a technology preview, with some limitations. We will gradually address those limitations. If you’d like to see additional new features or improvements, please let us know.

MySQL 5.6: InnoDB scalability fix – Kernel mutex removed

Note: this article was originally published on http://blogs.innodb.com on April 11, 2011 by Sunny Bains.

For those interested in InnoDB internals, this post explains why the global kernel mutex was required, describes the new mutexes and rw-locks that now replace it, and discusses the long-term benefits of this change.

InnoDB's core sub-systems up to 5.5 are protected by a global mutex called the kernel mutex. This makes it difficult to do even some common-sense optimisations. In the past we tried optimising the code, but it would invariably upset the delicate balance achieved by tuning the code around the global kernel mutex, leading to unexpected performance regressions. The kernel mutex is also abused in several places to cover operations unrelated to the core, e.g., some counters in the server thread main loop.

The InnoDB core sub-systems are:

  1. The Locking sub-system
  2. The Transaction sub-system
  3. MVCC  views

For any state change in the above sub-systems we had to acquire the kernel mutex, which reduced concurrency and made the kernel mutex very highly contended. A transaction that was creating a lock would end up blocking read view creation (for MVCC) and transaction start or commit/rollback. With the finer-granularity mutexes and rw-locks, a transaction that is creating a lock will not block transaction start or commit/rollback. MVCC read view creation will still block transaction create and commit/rollback because of the shared trx_sys_t::trx_list, but MVCC read view creations will not block each other because they acquire an S lock.

In 5.6 the global kernel mutex has been further split into several localised mutexes. The important ones are:

  1. Transactions and MVCC views: trx_sys_t::lock (rw_lock) and trx_t::mutex
  2. Locking : lock_sys_t::mutex and lock_sys_t::wait_mutex

This change is significant from an architectural perspective and most of the effort has gone into proving that the new design is correct. Splitting a global mutex into several independent mutexes should improve performance by increasing concurrency. The downside of course is that there is probably going to be a little more context switching with finer grained mutexes. However, now we have far greater freedom in making localised changes to further speed up the sub-systems independently without worrying about the global state and a global mutex.
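The shift can be illustrated conceptually: instead of one global mutex serializing every kernel operation, each sub-system takes only its own lock. The variable names below echo the InnoDB mutexes, but the code is only an illustration (and uses plain mutexes, since Python's standard library has no reader-writer lock to stand in for trx_sys_t::lock):

```python
import threading

# Sketch: per-subsystem locks replacing one global "kernel" mutex.
trx_sys_lock = threading.Lock()    # stands in for trx_sys_t::lock
lock_sys_mutex = threading.Lock()  # stands in for lock_sys_t::mutex

trx_list = []  # global transaction list, protected by trx_sys_lock

def start_transaction(trx_id):
    # Only the transaction list is locked; creating record locks
    # elsewhere is not blocked, unlike under a single kernel mutex.
    with trx_sys_lock:
        trx_list.append(trx_id)

def create_record_lock(trx_id, lock_queue):
    # Locking structures take their own mutex; transaction start and
    # commit in other threads can proceed concurrently.
    with lock_sys_mutex:
        lock_queue.append(trx_id)

lock_queue = []
start_transaction(1)
create_record_lock(1, lock_queue)
print(trx_list, lock_queue)  # [1] [1]
```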

Background

InnoDB was originally designed to support multiple query threads per transaction. However, this functionality was never tested, and since its first release the InnoDB engine has only ever worked in single-query-thread-per-transaction mode. The design decision to support multiple query threads per transaction caused a lot of tight coupling between the core sub-systems, most of it related to the state of the (potentially multiple) query threads. The state changes and their detection were protected by the kernel mutex. The first step was to simplify the rules around the handling of multiple query threads. For the curious, this is the code in the files prefixed with que0. The most complex part of the change was the transition from the waiting-for-lock state to the deadlock or timeout rollback state. The control flow was rather convoluted because of the handling of the potentially multiple query threads bound to a transaction.

Difference between query state and transaction state

InnoDB distinguishes between the transaction state and the query thread state. When a transaction waits for a lock, it is not the transaction state that changes to LOCK_WAIT but the query thread state. Similarly, when we roll back a transaction, it is the query thread state that changes to ROLLING BACK, not the transaction state. However, because there has always been a one-to-one mapping between a query thread and a transaction, the query state can be regarded as a transaction sub-state, i.e.:

Transaction states are:

  1. TRX_STATE_NOT_STARTED,
  2. TRX_STATE_ACTIVE,
  3. TRX_STATE_PREPARED,
  4. TRX_STATE_COMMITTED_IN_MEMORY

When a transaction is in state TRX_STATE_ACTIVE it can have the following sub-states (or query thread states)

  1. TRX_QUE_RUNNING,
  2. TRX_QUE_LOCK_WAIT,
  3. TRX_QUE_ROLLING_BACK,
  4. TRX_QUE_COMMITTING
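These states and sub-states can be modeled in a small sketch that makes the invariant explicit: a query-thread sub-state only applies to an active transaction. The class names are ours, not InnoDB's:

```python
from enum import Enum

class TrxState(Enum):
    NOT_STARTED = 1
    ACTIVE = 2
    PREPARED = 3
    COMMITTED_IN_MEMORY = 4

class QueState(Enum):
    # Query-thread state: a sub-state of an ACTIVE transaction.
    RUNNING = 1
    LOCK_WAIT = 2
    ROLLING_BACK = 3
    COMMITTING = 4

class Transaction:
    def __init__(self):
        self.state = TrxState.NOT_STARTED
        self.que_state = None

    def set_que_state(self, que_state):
        # A query-thread sub-state is meaningful only while ACTIVE.
        if self.state is not TrxState.ACTIVE:
            raise RuntimeError("query state requires an ACTIVE transaction")
        self.que_state = que_state

trx = Transaction()
trx.state = TrxState.ACTIVE
trx.set_que_state(QueState.LOCK_WAIT)  # e.g. waiting for a record lock
print(trx.state.name, trx.que_state.name)  # ACTIVE LOCK_WAIT
```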

Below is a somewhat simplified version of the roles that the (more important) new rw-locks and mutexes play in the new design.

trx_sys_t::lock

This is an rw-lock that protects the global list of transactions, ordered by transaction id. This transaction list is read every time an MVCC view is created and also when a user runs something like "SHOW ENGINE INNODB STATUS". We add a transaction to this list when it starts and remove it from the list once it completes (commit or rollback).

lock_sys_t::mutex

All locking data structures and changes to transaction query thread states are protected by this mutex. A transaction's query thread state can be changed asynchronously by other threads in the system; therefore, a thread changing a query thread state has to acquire both the lock mutex and the transaction mutex.

Ongoing optimisations and future work

Getting rid of the global kernel mutex is only the start of fixing some of the mutex contention issues. There are several optimisations that can now be considered which were impossible with a single global mutex; we can now take this splitting process down to the sub-system level. I've been experimenting with eliminating trx_sys_t::lock completely: by selectively commenting out parts of the code, I've been able to test InnoDB without this rw-lock and see where the other hot spots are. Those findings probably belong in a separate blog post :-) . We should finally be able to eliminate (or mitigate) the overhead of MVCC read view creation, and there is more room to move in fixing the lock release overhead, which is especially important in read-write workloads. There is a whole slew of optimisations that we couldn't do earlier that are now possible :-) .

About

This is the InnoDB team blog.
