InnoDB Full-Text Search is in MySQL 5.6.4
By Calvin Sun on Dec 20, 2011
Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Jimmy Yang.
InnoDB full-text search (FTS) is finally available in MySQL 5.6.4 release. The feature has been on trial through MySQL’s summer lab release, thus we had several blogs covering the feature. In this blog, we will leave the “how to” part of the feature to those blogs, and focus on some important characteristics of this new feature, so you will have a better understanding when trying on the feature.
The InnoDB Full-text Index as an Inverted Index
When comes to the basic design, InnoDB takes a traditional way to implementation the full-text index, which is a so called “Inverted Index”. It composes of a set of auxiliary “index tables” that stores the “Word” “Doc ID” pair, as well as each word’s position in the original text.
Incoming text strings in the inserting records are extracted out, tokenized and decomposed into individual words, and such words are then inserted into the auxiliary “index tables” along with their position info and the Doc ID associated with the record. With the position info, we will be able to support proximity search that is lacking in MyISAM FTS.
Create the Full-text Index in Parallel
The Full-text Index is created in parallel. By default, the parallel degree is two, which means two threads are used to tokenize, sort and final insertion to the Full-text “index tables”.
To support parallel create index and future parallel processing, InnoDB partitions the “full-text index” into six auxiliary “index tables”. The words are divided among these tables based on their first character’s charset sort weight. Currently, the partition criteria is hard-coded, targeted to the Latin characters. In the future, we will allow user to define their own partition criteria, to better support other character sets.
Batching the insert value with InnoDB Full-text Index’s “index cache”
As just discussed, when inserting into an “Inverted Index”, each inserting string will be tokenized, and decomposed into individual words before inserting into the auxiliary table. A single insertion could result in many small insertions into the inverted index tables. This magnifies the insertion operations by several or even dozens of fold, thus becomes normal insertion a disruptive fact for concurrent access to the table.
Our solution is to build an “internal” FTS “index cache” to temporarily cache these insertions, and batch flush to the disk once this index cache is full. Thus, it avoids frequent updates to the “Index Table” during busy inserts and updates, so the index table only needs to be synchronized when the index cache is full. The batching technique also avoids repetitively storing the same word, minimizes the number of entries for each word in the “Index Table”. So instead of flushing each word with a single “ilist” entry (with a single Doc ID), it batches result from many inserted documents so to create an “ilist” with multiple DocID/position pairs, before we flush this info to disk and also make a single entry. This reduces the redundancy and make the “Index Table” smaller.
The index cache is index specific. And has a configuration parameter (innodb_ft_cache_size) to configure the size of the cache. It stores the same information as those auxiliary tables, using a red-black tree data structure. However, this cache is different from an index tree in that it is only used to cache the word of the inserted documents, words on disk do not need to be brought back to this cache before they can be queried. In another word, the words in the “Index Table” are queried directly as we query into relational tables. And such result from “Index Table” will merge (union) with the query results from the Index Cache before sending back to user.
How InnoDB Full-text Index handles Deletes
Similar to inserts, each delete of the record with full-text index could result in many small “deletes”, if we had implemented it as normal deletes. Again, this could be very disruptive.
To avoid such disruptive operation, InnoDB Full-text Index develops a new scheme that only logs the Doc ID of the deleted Documents in a separate “DELETED” auxiliary table. The actually “indexed” record would still remain in the FTS index. However, these deleted Doc Ids will be filtered out from the final results by consulting the “DELETED” table before returning the query result.
The benefit of this design is obvious. Delete operation then becomes a trivial and fast operation, with minimum impact on the index itself. The shortcoming is that the index would not shrink along with record deletion. However, the benefits far overweight the drawbacks.
To cleanup the deleted records in the index, you will resort to the InnoDB optimize table utility, which rebuilt the Full-text index online. And we will discuss it in the next section.
The optimize table on the Full-Text index essentially rebuilds the full-text index, and leaves those “deleted” Docs out of the new index. This operation does two things:
1) Remove the deleted Doc ID from the word’s DocID/Position pair list (ilist).
2) Consolidate multiple entries for the same word to one (or less) entries if possible by consolidating their DocID/Position pair list (ilist), as we will have as little entry as possible.
Currently, we overload this optimization operation with the “Optimize Table” command, and runs only if you turn on “innodb_optimize_fulltext_only” system configure variable:
mysql> set global innodb_optimize_fulltext_only=1;
mysql> optimize table articles;
Since the table could be large, and optimization (reorg rebuild index) may take a tremendous amount of time, we allow the optimization to be done in stages. The system configuration variable “innodb_ft_num_word_optimize” specifies how many words to optimize for each “optimize table” command. Its default value is 2000, meaning each time it optimizes 2000 words. And when the next “optimize table” command is issued, the server continues the process from where it left off from last run.
InnoDB Full-text search Transaction Behavior
As you could see from last two sections, the main idea behind InnoDB Full-text index DML is to batch the operations as much as possible. This determines there will be some special characteristics when coming to transaction handling.
In fact, all of the DML operations (insert/delete/update) involving Full-text indexed columns are processed at the transaction commit time. For example, for insertion, the tokenization for inserted strings is performed only at commit time, as a result, a full-text search only sees the committed data. This “delayed” process/synchronization behavior presents in all transactional DBMS text handling. And we do so at the transaction commit time.
InnoDB FTS Document ID
All of above design can’t be done without the support of a so called FTS Doc ID identifier. The mapping from word to original table record goes through such a unique ID. So there must be a Doc ID column in the table with Full-text index.
In our case, it is represented by the “FTS_DOC_ID” (uppercase required) column. If this column is not defined, InnoDB will automatically add it to the user table when creating the full-text index. The column itself must be of “BIGINT UNSIGNED NOT NULL” type, with a unique index named FTS_DOC_ID_INDEX. When you define this column during table creation, it saves considerable time in creating the full-text index after loading data, since InnoDB does not need to copy the table for adding such column (and also rebuild all indexes include primary index). In such case, user will be responsible to properly manage the Doc ID column so that document will not be inappropriately represented. If create index performance is not a concern, you can leave Doc ID management to InnoDB, by leaving out this column. InnoDB will added “FTS_DOC_ID” as a hidden columns. The “ FTS_DOC_ID_INDEX” will also be created automatically.
Stopwords are common or trivial words that are omitted from the full-text index. The InnoDB FTS provides two sources of stopwords:
1) InnoDB predefined default stopwords – If no user stopword list is defined, this default stopwords list is used.
You can view this list by querying the table INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:
mysql> select * from INNODB_FT_DEFAULT_STOPWORD;
2) User-defined stopwords – You can define your own stopwords by creating a table with a single column named “value”, with datatype “varchar”, and pointing the global variable “innodb_ft_server_stopword_table” to this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. And an example will be here:
- # Define a correctly formatted user stopword table
create table user_stopword(value varchar(30)) engine = innodb;
- # Point to this stopword table with “db name/table name”
set global innodb_ft_server_stopword_table = “test/user_stopword”
Currently, we use a very simple ranking mechanism (term frequency, inverse document frequency) for document ranking. This means the more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked. We do have plans to add more sophisticated ranking mechanism and allow users to supplied the weight in the ranking calculation too.
As the first release of InnoDB FTS, it provides limited and very basic functionalities for Full-text searches. And it has some limitations that worth mentioning:
- It does not yet support stemming.
- For ideographic languages such as CJK (Chinese, Japanese and Korea), which do not have word delimiters, InnoDB FTS does not yet support N-GRAM parsing (similar to MyISAM FTS).
- Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation (similar to MyISAM FTS).
- It does not support external parser plugins
- Ranking mechanism is relatively simple
The InnoDB full-text search gives InnoDB an important capability in using MySQL to handle text documents, and we will further develop it to continue increase its performance and usability.