Difference between InnoDB FTS and MyISAM FTS

Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Jimmy Yang.

With this lab release, InnoDB now has its own Fulltext Search (FTS) support. People might wonder how different is this FTS comparing with that of MyISAM. From the usage standpoint, they seem very similar with no apparent difference. However, in the underlying FTS engine, the difference is huge, since two are built on different storage engines with totally different design. So in some aspect such as DML, it might be difficult to make it an apple to apple comparison. However, we do try to list and explain the difference, so to make it easier for user to understand our design and make the transition to use InnoDB FTS easier.

The Design:

The most significant difference is the design. Like most transaction storage engine, InnoDB stores the inverted index in auxiliary relational tables. All queries and DMLs eventually route to these tables. On the other hand, MyISAM stores the inverted list in a two-level B-Tree. The first level contains records of word and its count, and the second level contains records of this word’s weight and the rowid. So it is more of a special B-tree “index”.

However, the MyISAM design would not be suitable for highly concurrent transaction storage engine like InnoDB. And its special structure limits its extensibility, as it still can’t include the position information in its indexing to support proximity search.

For InnoDB, by storing the inverted index in auxiliary tables, it renders all the benefit of relational table manipulations to the FullText index. It already applied the partition concept so that multiple tables are used to handle inverted index. This not allows the create index to be parallelized, it also allow its query can be parallelized in the future. So InnoDB design apparently wins in terms of future extensibility and scalability.

The Query Syntax:

Even though the underlying FTS design is different, InnoDB continue uses the same syntax defined MySQL Server. The create index clause and almost all Query Syntax are remain to be the same. Of course, InnoDB now supports the proximity search, but it does so by overloading the existing boolean search with “@” symbol:

# Proximity search, search two word “following” and “comparison” within 19 bytes distance

SELECT * FROM articles

WHERE MATCH (title,body)

AGAINST (‘”following comparison”@19′ IN BOOLEAN MODE);

Creating FTS index:

Even though we have the same create index clause as that of MyISAM, and the index can be created on top of loaded table, or created as part of table, we do recommend creating index after the data loaded to the table. In such way, it can utilize our parallel create index machinery and fully exploit the scalability of our parallel tokenization and parallel sort option.

User can tune such parallel create index by adjusting the value of “innodb_ft_sort_pll_degree”. Currently, it is only a server boot time variable, we could easily make it dynamic tunable.

As mentioned in an earlier blog for “Overview and Getting Started with InnoDB FTS”, another special point worth mentioning is the InnoDB Doc ID column (FTS_DOC_ID). This column stores the Document ID that is used to uniquely identify each document. If user do not supply such column, InnoDB will create a hidden column with this name. This normally means a cluster index rebuild. User can normally avoid so by suppling their own “FTS_DOC_ID” column. A typical example would be:

1) CREATE TABLE fts_test (

FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,

title VARCHAR(200),

body TEXT

) ENGINE=InnoDB;

The unique “FTS_DOC_ID_INDEX” index on “FTS_DOC_ID” is also optional, without it, InnoDB will automatically create it.:

2) CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);

3) Load data

4) CREATE FULLTEXT INDEX idx on fts_test (title, body);

DMLs

MyISAM supports real time update, this is one of its good points. However, this comes with MyISAM’s limited DML concurrency support. InnoDB’s behavior is more consistent with other transactional storage engines such as Oracle, whose FTS update can be done at commit time, periodically or manually.

For InnoDB FTS, the update of FTS index is done at the commit time. Such delayed update means additional step to fetch the data again for tokenization at commit time, thus increasing the processing time. However, such design is necessary to ensure the concurrent multi-session DMLs to proceed as efficiently as usual. And importance of continuing this concurrency far overweights the restrictions required by the MyISAM to ensure its real time update.

So this is a typical example showing the difference between supporting FTS in a transactional, high concurrency database engine and doing so on a non-transactional database like MyISAM. To some aspect, we cannot compare them in an apple to apple fashion. It would be more reasonable to compare InnoDB FTS with that of Oracle Text in these areas.

InnoDB “Index Cache” and crash recovery

InnoDB FTS also has an “index cache” structure to cache the inserting results. And once this cache is full, the cache content will be flushed to the disk onto auxiliary index tables. This is a unique structure for InnoDB FTS. And InnoDB also provide a configure variable “ innodb_ft_cache_size” to control the size of it. The content of this index cache will also flushed to disk on normal shutdown.

If a server crashed before any of index cache content is sync-ed to the on disk index table, next time after server reboot, and when the FTS index is used first time or there is an insertion/update into the table, the missing content from last time crash will be reinstated into the index cache.

InnoDB FTS Index “Optimization”

Another special aspect of InnoDB is how it handles document deletion. If there is delete, the deleted document word is not immediately removed from the Fulltext index. Instead, its Doc ID is registered in a “DELETED” table. Because of this, the FTS index could be “bloated” over time. To resolve this, we can “optimize” the index using the “optimize table” command, when system configure variable “innodb_optimize_fulltext_only” is turned on:

mysql> set global innodb_optimize_fulltext_only=1;

The optimize process can take a long time, to avoid the process blocking index access for a long period of time, there is another configure system variable “innodb_ft_num_word_optimize” letting you specify the number of words to optimize each time. And the subsequent calls to “optimize table” will continue from where it has been left, and continue the optimize process.

Stopword handling

MyISAM has a large default stopword list, so it can create a smaller FTS index. In addition, it excludes any words that appear in more than 50% of documents. Such word exclusion could actually create unnecessary confusion. InnoDB FTS don’t have such restrictions, and our default stopword list is much smaller. You can view the default stopword list by select from an information schema table:

INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD

InnoDB FTS also support user supplied stopword. User could provide their stopword through a table with a single “varchar” column. And then tells the InnoDB about the stopword table through a system configure variable “innodb_ft_server_stopword_table”:

# Define a correct formated user stopword table

create table user_stopword(value varchar(30)) engine = innodb;

# The set operation should be successful

set global innodb_ft_server_stopword_table = “test/user_stopword”;

Other Miscellaneous items:

1) New InnoDB Information Schema Table

InnoDB FTS adds a few Information Schema tables for diagnostic purpose, and they might of interest to user if they want to explore the content of some internal auxiliary tables:

  1. INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE” displays the in memory “Index Cache” content.
  2. INFORMATION_SCHEMA.INNODB_FT_DELETED” displays the “DELETED” table content.
  3. INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE” displays the on disk “Index table” content.

Before being able to select from these table, user would need to tell InnoDB which table’s FTS index he/she is interested. It is achieved by specifying the “innodb_ft_aux_table” configure variable with the name of the table:

mysql> set global innodb_ft_aux_table=”test/articles”;

mysql> select * from INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;

System configure variables:

Except for various system variables we already mentioned, there are a couple of other configure variables:

innodb_ft_min_token_size – minimum word size in bytes we would tokenize, default value is 3

innodb_ft_max_token_size – maximum word size in bytes we would tokenize, default value is 84

innodb_ft_num_word_optimize – number of words we optimize for each optimize operation

Please note, we do not take MyISAM’s configure variable such as “ft_min_word_len”, “ft_max_word_len”, “ft_stopword_file”  etc.

Summary:

In summary, InnoDB FTS keeps most of its query syntax in consistent with that used by MySQL/MyISAM, so user do not need to learn any additional command. However, the internal design of InnoDB FTS is marketed different from that of MyISAM, thus user should pay attention to such design difference and expect possible behavior difference as mentioned in this articles.

Comments:

Post a Comment:
Comments are closed for this entry.
About

This is the InnoDB team blog.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today