Note: this article was originally published on http://blogs.innodb.com on Dec 20, 2011 by Jimmy Yang.
InnoDB full-text search (FTS) is finally available in MySQL 5.6.4
release. The feature has been on trial through MySQL’s summer lab
release, thus we had several blogs covering the feature. In this blog,
we will leave the “how to” part of the feature to those blogs, and focus
on some important characteristics of this new feature, so you will have
a better understanding when trying on the feature.
The InnoDB Full-text Index as an Inverted Index
When comes to the basic design, InnoDB takes a traditional way to
implementation the full-text index, which is a so called “Inverted
Index”. It composes of a set of auxiliary “index tables” that stores the
“Word” “Doc ID” pair, as well as each word’s position in the original
Incoming text strings in the inserting records are extracted out,
tokenized and decomposed into individual words, and such words are then
inserted into the auxiliary “index tables” along with their position
info and the Doc ID associated with the record. With the position info,
we will be able to support proximity search that is lacking in MyISAM
Create the Full-text Index in Parallel
The Full-text Index is created in parallel. By default, the parallel
degree is two, which means two threads are used to tokenize, sort and
final insertion to the Full-text “index tables”.
To support parallel create index and future parallel processing,
InnoDB partitions the “full-text index” into six auxiliary “index
tables”. The words are divided among these tables based on their first
character’s charset sort weight. Currently, the partition criteria is
hard-coded, targeted to the Latin characters. In the future, we will
allow user to define their own partition criteria, to better support
other character sets.
Batching the insert value with InnoDB Full-text Index’s “index cache”
just discussed, when inserting into an “Inverted Index”, each inserting
string will be tokenized, and decomposed into individual words before
inserting into the auxiliary table. A single insertion could result in
many small insertions into the inverted index tables. This magnifies the
insertion operations by several or even dozens of fold, thus becomes
normal insertion a disruptive fact for concurrent access to the table.
solution is to build an “internal” FTS “index cache” to temporarily
cache these insertions, and batch flush to the disk once this index
cache is full. Thus, it avoids frequent updates to the “Index Table”
during busy inserts and updates, so the index table only needs to be
synchronized when the index cache is full. The batching technique also
avoids repetitively storing the same word, minimizes the number of
entries for each word in the “Index Table”. So instead of flushing each
word with a single “ilist” entry (with a single Doc ID), it batches
result from many inserted documents so to create an “ilist” with
multiple DocID/position pairs, before we flush this info to disk and
also make a single entry. This reduces the redundancy and make the
“Index Table” smaller.
index cache is index specific. And has a configuration parameter
(innodb_ft_cache_size) to configure the size of the cache. It stores the
same information as those auxiliary tables, using a red-black tree data
structure. However, this cache is different from an index tree in that
it is only used to cache the word of the inserted documents, words on
disk do not need to be brought back to this cache before they can be
queried. In another word, the words in the “Index Table” are queried
directly as we query into relational tables. And such result from “Index
Table” will merge (union) with the query results from the Index Cache
before sending back to user.
How InnoDB Full-text Index handles Deletes
to inserts, each delete of the record with full-text index could result
in many small “deletes”, if we had implemented it as normal deletes.
Again, this could be very disruptive.
avoid such disruptive operation, InnoDB Full-text Index develops a new
scheme that only logs the Doc ID of the deleted Documents in a separate
“DELETED” auxiliary table. The actually “indexed” record would still
remain in the FTS index. However, these deleted Doc Ids will be filtered
out from the final results by consulting the “DELETED” table before
returning the query result.
benefit of this design is obvious. Delete operation then becomes a
trivial and fast operation, with minimum impact on the index itself. The
shortcoming is that the index would not shrink along with record
deletion. However, the benefits far overweight the drawbacks.
cleanup the deleted records in the index, you will resort to the InnoDB
optimize table utility, which rebuilt the Full-text index online. And we
will discuss it in the next section.
The optimize table on the Full-Text index essentially rebuilds the
full-text index, and leaves those “deleted” Docs out of the new index.
This operation does two things:
1) Remove the deleted Doc ID from the word’s DocID/Position pair list (ilist).
2) Consolidate multiple entries for the same word to one (or less)
entries if possible by consolidating their DocID/Position pair list
(ilist), as we will have as little entry as possible.
Currently, we overload this optimization operation with the “Optimize
Table” command, and runs only if you turn on
“innodb_optimize_fulltext_only” system configure variable:
mysql> set global innodb_optimize_fulltext_only=1;
mysql> optimize table articles;
the table could be large, and optimization (reorg rebuild index) may
take a tremendous amount of time, we allow the optimization to be done
in stages. The system configuration variable
“innodb_ft_num_word_optimize” specifies how many words to optimize for
each “optimize table” command. Its default value is 2000, meaning each
time it optimizes 2000 words. And when the next “optimize table” command
is issued, the server continues the process from where it left off from
InnoDB Full-text search Transaction Behavior
As you could see from last two
sections, the main idea behind InnoDB Full-text index DML is to batch
the operations as much as possible. This determines there will be some
special characteristics when coming to transaction handling.
In fact, all of the DML operations
(insert/delete/update) involving Full-text indexed columns are processed
at the transaction commit time. For example, for insertion, the
tokenization for inserted strings is performed only at commit time, as a
result, a full-text search only sees the committed data. This “delayed”
process/synchronization behavior presents in all transactional DBMS
text handling. And we do so at the transaction commit time.
InnoDB FTS Document ID
All of above design can’t be done
without the support of a so called FTS Doc ID identifier. The mapping
from word to original table record goes through such a unique ID. So
there must be a Doc ID column in the table with Full-text index.
In our case, it is represented by the
“FTS_DOC_ID” (uppercase required) column. If this column is not
defined, InnoDB will automatically add it to the user table when
creating the full-text index. The column itself must be of “BIGINT
UNSIGNED NOT NULL” type, with a unique index named FTS_DOC_ID_INDEX.
When you define this column during table creation, it saves considerable
time in creating the full-text index after loading data, since InnoDB
does not need to copy the table for adding such column (and also rebuild
all indexes include primary index). In such case, user will be
responsible to properly manage the Doc ID column so that document will
not be inappropriately represented. If create index performance is not a
concern, you can leave Doc ID management to InnoDB, by leaving out this
column. InnoDB will added “FTS_DOC_ID” as a hidden columns. The “
FTS_DOC_ID_INDEX” will also be created automatically.
Stopwords are common or trivial words that are omitted from the
full-text index. The InnoDB FTS provides two sources of stopwords:
1) InnoDB predefined default stopwords – If no user stopword list is defined, this default stopwords list is used.
You can view this list by querying the table INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:
mysql> select * from INNODB_FT_DEFAULT_STOPWORD;
2) User-defined stopwords – You can define your own stopwords by
creating a table with a single column named “value”, with datatype
“varchar”, and pointing the global variable
“innodb_ft_server_stopword_table” to this table. MySQL loads stopwords
from this user table, rather than the default stopword list, when
creating the FTS index. And an example will be here:
- # Define a correctly formatted user stopword table
create table user_stopword(value varchar(30)) engine = innodb;
- # Point to this stopword table with “db name/table name”
set global innodb_ft_server_stopword_table = “test/user_stopword”
Currently, we use a very simple
ranking mechanism (term frequency, inverse document frequency) for
document ranking. This means the more the word appears in a document,
and the less frequent the word appears in overall documents, the higher
the selected document is ranked. We do have plans to add more
sophisticated ranking mechanism and allow users to supplied the weight
in the ranking calculation too.
As the first release of InnoDB FTS, it provides limited and very
basic functionalities for Full-text searches. And it has some
limitations that worth mentioning:
- It does not yet support stemming.
- For ideographic languages such as CJK (Chinese, Japanese and Korea),
which do not have word delimiters, InnoDB FTS does not yet support
N-GRAM parsing (similar to MyISAM FTS).
- Although the use of multiple character sets within a single table is
supported, all columns in a FULLTEXT index must use the same character
set and collation (similar to MyISAM FTS).
- It does not support external parser plugins
- Ranking mechanism is relatively simple
The InnoDB full-text search gives
InnoDB an important capability in using MySQL to handle text documents,
and we will further develop it to continue increase its performance and