Overview and Getting Started with InnoDB FTS
By Calvin Sun on Jul 25, 2011
Note: this article was originally published on http://blogs.innodb.com on July 25, 2011 by Jimmy Yang.This Summer’s lab release includes an important feature of InnoDB – InnoDB Fulltext Search. This feature would greatly enhance InnoDB’s capability in Text search and retrieving. Since the feature is designed for our transactional storage engine, its underlying architecture design and implementation are completely different with those of MyISAM. So it is worth to give a brief technology review of this feature, familiarize users with some important concepts in the InnoDB FTS so that they can better utilize this feature.
There are a few other posts on the subject. John Russell will give a brief tutorial on the InnoDB fulltext search command and syntax. I will also discuss some comparison with MyISAM fulltext search in another post. And Vinay in our server testing will give some performance number from his experiments.
To begin with, I will go over briefly on some key design concepts, which would help you better understand the feature.
- Inverted Index – Like many DBMS fulltext search engines, InnoDB Fulltext Search (FTS) is also designed as an “inverted index”, in which, the incoming text is tokenized into individual words, and these words would be stored in one or more auxiliary table(s). For each word, a list of Document ID and the word position pair are stored. We call such Document ID/Position pair list as “ilist”. So in our inverted index table, two important columns are “word” and “ilist” columns. There is an index on the word column. As you might note, since we store the word position (as in byte offset) for each word, so we do support “Proximity Search”, which has been lacking in the MyISAM FTS.
- The FTS “Index Table (Auxiliary tables) – As mentioned, words in the inverted index are stored in a set of (auxiliary) tables. And in our case, such tables are called FTS “Index Tables”. And our Fulltext Indexes are mostly made up of this set of tables that stores the Word to Document mapping. In our design, the FTS “Index Table” is “partitioned” into multiple tables. So instead of a single auxiliary table, there are now six of them, each contains words partitioned according to their first character. The partition criteria is currently hard coded, targeted to the Latin encoding. But we will eventually make the partition user definable, so user can define different partition scheme with different Charsets.
The partitioned design makes it easier for us to parallelize the operation. Currently, we only parallelize the create index operation, but apparently we can extend such to query and fulltext search later on.
- The FTS “Index Cache” – Although our Fulltext Index are represented by a set of “Index Auxiliary Tables” on disk, there does exist a FTS “Index Cache” to batch the result before flush/sync them to the “Index Table”. The “Index Cache” is actually a “red-black tree” that stores the Word-Doc ID/Position (ilist) information. However, this cache is different from a index tree in that it is only used to cache the word of the inserted documents, however, words on disk do not need to be brought back to this cache before they can be queried. In another word, the words in the “Index Table” are queried directly as we query into relational tables. And such result from Index Table will merge(union) with the result from the Index Cache before sending back to user.The “batching” by FTS “Index Cache” alleviates InnoDB from frequent updates to the “Index Table” if there are busy inserts and updates, so make the interaction to the index table only at the sync time when the index cache is full. Another benefit for the “batching” is that it can minimize the number of entries for each word in the “Index Table”. So instead of flushing each word with a single “ilist” entry after tokenizing one document, and making an entry for such word in the “Index Table”, we could create a longer “ilist” with multiple DocID/position pairs from different documents, before we flush this info to disk and also make a single entry. This reduces the redundancy and make the “Index Table” smaller.
- The FTS “Document ID” – Another important concept that might affect the usability is the Document ID management. As for most FTS engine, the mapping from word to Document has to go through a unique (Document) ID. In our case, it is represented by a “FTS_DOC_ID” column that we automatically added to the user table. The column itself must be of “BIGINT UNSIGNED NOT NULL” type. And a unique index with FTS_DOC_ID_INDEX will be created on top the column. On the other hand, we do not prohibit user from adding such column before hand by themselves, in this case, user will be responsible to properly manage the Doc ID column so that document will not be inappropriately represented.
So above are some simple concepts to get us into the next step. Now we can start to use the Fulltext Search.
Use the InnoDB Fulltext Index
1. The FTS DDL and Query Syntax
We continue to use MySQL/MyISAM FTS query syntax. So if you are familiar with MySQL FTS syntax, you can directly get your hands on InnoDB FTS. All types of MyISAM search are supported, except we added the proximity search to the “Boolean Search”. John Russell will have a blog on the syntax and usage in his “Tutorial on FTS” blog, so I would leave that to him.
Instead, I will talk about more on some special points that we need to know when using InnoDB FTS.
2. Create index:
In general, we recommend load the table with data first, then create FTS index on top of it. This would be much faster than you create the FTS and then try to insert document into it. We had added the Fast Index Creation (FIC) parallel sort support for creating Fulltext search index. And this actually allows you to tokenize, sort and create FTS index in parallel.
To control the parallel sort degree, you can use the new “innodb_ft_sort_pll_degree” system configure variable (default 2, and max 32). This is used to specify how many ways you want to parallelize the tokenization and sort operations. And experiment shows on a system that is not I/O bound, the create index performance scales with the number of CPUs and “innodb_ft_sort_pll_degree”.
Following table shows a quick run on 2.7 G wikipedia data on a 8 core machine:
|MyISAM||11 min 47.90|
|InnoDB (default)||7 min 25.21 sec|
|InnoDB pll_degree||5 min 34.98|
|InnoDB pll_degree = 8||4 min 9 sec|
|InnoDB pll_degree = 16||3 min 39.51 sec|
Another trick on creating index is regarding the Doc ID handling. If a user does not supply a “FTS_DOC_ID” column, InnoDB will add a new hidden column to your table, this results a rebuild of cluster index along with all secondary index, which could be costly. If user could include the “FTS_DOC_ID” in his/her original table define, then it could save a cluster index (table) rebuild. To do so, user would need to include a column with name “FTS_DOC_ID” (all upper case) in his/her table. The column must be of “BIGINT UNSIGNED NOT NULL” datatype. It does not need to be an auto_increment column, but this could make life easier.
An example would be:
1) CREATE TABLE fts_test (
FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
The unique “FTS_DOC_ID_INDEX” index on “FTS_DOC_ID” is also optional, without it, InnoDB will automatically create it.:
2) CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on fts_test(FTS_DOC_ID);
3) Load data
4) CREATE FULLTEXT INDEX idx on fts_test (title, body);
Please note, the column name “FTS_DOC_ID” has to be in upper case as the column name in InnoDB is case sensitive. In the next release, we’ll relax the column name requirement for the “FTS_DOC_ID” column, so that user can specify any column name for this ID column.
- FTS Index Updated on Commit:
Like Fulltext search index for all Transactional DBMS, the update to the Fulltext search index has always been an issue. A typical example is the Oracle Text, which has been synchronizing its data to FTS periodically or manually. Starting in 11g, it now allows user to specify when the data update is reflected to its Fulltext index – manually, on commit, or at regular interval. For InnoDB Fulltext search, the FTS index is updated on transaction commit time. So the document tokenization and inserting into FTS index (cache) happens only after you commit the transaction. However, let’s not confuse this delayed update on FTS index with data handling. All the data manipulations still follow their original rules. Just the update to FTS index is done at transaction commit time.
- How insert is handled
The inserted document will be tokenized at the commit time and inserted into the “Index Cache”. This cache has a configurable size (“innodb_ft_cache_size”) with default of 32 MB. Once this index is full, it will be “sync-ed” to on-disk “Index Tables”.
During server normal shutdown, the content in the “Index Cache” will be “sync-ed” to the “Index Tables” on disk. However, if there is a server crash, and the content of “Index Cache” did not reflected to the “Index Table”, after the server reboot, and when you first time use the FTS index on the table (for search or insert), the “missing” documents will be read from the original table and re-tokenized, add to the “Index Cache”.
- How delete is handled
While inserted documents are being tokenized and inserted into FTS “Index Cache” at commit time, we do not delete any entries for deleting documents if such entries are already in the FTS index table. Instead, we record the deleted Doc ID in a “DELETED” auxiliary table. And for each query, we will also consult this “DELETED” table to filter out any deleted Documents. With this design, the delete operation could be simple and fast, without the need to update thousands of word entries for the deleting document. The downside is that the word entries are not removed from FTS index, so the index could grow. And this is resolved by “Index Optimization”.
4. Index Optimization:
As just discussed, if there are active DMLs on the Fulltext index, the index could be “bloated” over time. We do not do run time deletion on index entries in the index table, instead, we just log the deleted doc ID in a “DELETED” auxiliary table. So over time the index table itself could grow bigger even if you actually removing documents. To resolve this, we can “optimize” the index. This operation does two things:
1) Remove the deleted Doc ID from the word’s DocID/Position pair list (ilist)
2) Consolidate multiple entries for the same word to one (or less) entries if possible by consolidating their DocID/Position pair list (ilist)
Currently, we overload the optimization operation with “Optimize Table” command, and runs only if you turn on “innodb_optimize_fulltext_only” system configure variable:
mysql> set global innodb_optimize_fulltext_only=1;
mysql> optimize table articles;
Since the table could be large, and optimization could take tremendous amount of time, we allow the optimization to be done in stages. There is a system configure variable “innodb_ft_num_word_optimize” specifying how many words to optimize for each “optimize table” command, its default value is 2000, which mean each time it optimize 2000 words. And when the next “optimize table” command is issued, the server will continue from the word it left last time, and continue the process.
In fact, this optimization process could be run as an online, background process, but it is not enabled in this release. We will continue evaluate these options to see if which way is more appropriate and gives the minimum user impact.
5. Stopword Handling:
For FTS stopwords, we provide two sources of stopwords:
1) Default stopwords provided by the server – this is a static list of stopword come with server. If no user stopword is defined, then this default stopword list will be used.
You can view this list by querying into an INFORMATION_SCHEMA table: INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:
mysql> select * from INNODB_FT_DEFAULT_STOPWORD;
- User defined stopwords – User can define his/her own stop word by creating an user table with a single column with a column name of “value” and of the “varchar” datatype, and use global variable “innodb_ft_server_stopword_table” to point to this stopword list. Server will load stopword from this user table, rather than the default stopword list when we create the FTS index. And an example will be here:
- # Define a correct formated user stopword table
create table user_stopword(value varchar(30)) engine = innodb;
- # Point to this stopword table with “db name/table name”
set global innodb_ft_server_stopword_table = “test/user_stopword”;
In this initial release, we use a very simple ranking mechanism (term frequency, inverse document frequency) for document result ranking. This means the more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked. We do have plans to add more sophisticated ranking mechanism and allow users to supplied the weight in the ranking calculation too.
7. Some Restrictions:
Lastly, we talk about a couple of restrictions for our initial release:
- InnoDB FullText Search now only support one FTS index per Table.
- Same as MyISAM, although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation.
- For ideographic languages such as CJK (Chinese, Japanese and Korea) do not have word delimiters, similar to MyISAM, we do not yet support N-GRAM parsing. But this is on our feature agenda.
In summary, InnoDB fulltext search gives InnoDB an important capability in Text File handling. And this newly added platform is already showing its scalability in indexing and flexibility in DML handling. We are also considering bring the parallelism to the Query and DML operations to further enhance its performance. However, by saying that, it is still in an early stage where we would continue the development effort, adding features and options, tuning the performance and to make it a competitive feature.