InnoDB Full-Text Search Performance

Note: this article was originally published on http://blogs.innodb.com on July 27, 2011 by Vinay.

Long awaited InnoDB support for full-text index is available in InnoDB lab release. This post talks about the performance of InnoDB’s full-text support over MyISAM’s full-text support. (There are multiple aspect of full-text index performance, the most important ones are full-text index creation time and query execution time). We are not focusing on performance with “DML” (expect some simple insert on loaded data) and “other supported character sets” (numbers are based on latin1 char-set data).

Numbers are encouraging in terms of ‘create index’  time where InnoDB is comparable or faster than MyISAM. And query execution time are comparable with MyISAM. Over the time, we can expect optimization for query performance.

Please refer  “Overview and Getting Started with InnoDB FTS” , “InnoDB Full-Text Search Tutorial” , as this blog focuses on full-text search performance part.

a) Full-Text Search Create Index Time
Full-text index creation time depend(s) on (the) number of words to be indexed which is indirectly dependent on the length of word which you want to index. By default , all words with length >= 3 bytes get indexed in InnoDB. (MyISAM default word length is 4).

When you are comparing ‘create index’ time then just make sure MyISAM and InnoDB are indexing the same number of words from the data. InnoDB uses “InnoDB_ft_min_token_size” variable to set the minimum word length. It is also dependent on max word length to be indexed but it is generally not changed. Default stop-word list for InnoDB contain less word than that of MyISAM stop-word list so it will index more word(s).

Looking at InnoDB full-text search architecture details , it is clear that usage of (user defined “FTS_DOC_ID”) column and server parameter “InnoDB_ft_sort_pll_degree” (could affect performance)  help to increase the performance. Let us see what do these parameter mean so we will get an idea why they are used/tuned to get better performance.

1)
InnoDB full-text search requires a monotonically increasing “Doc ID” column for fulltext index. There is no requirement that user must add this column but InnoDB will add it automatically if the column is missing and rebuild of unique/primary index. From performance point of view , its good that user could pre-define this column saving rebuild of  index. As of now, it needs to be a 8 bytes unique non-null BIGINT column and it must have the reserved name “FTS_DOC_ID” (all upper case).

Example , User can  avoid a index rebuild by server , using “FTS_DOC_ID” column when creating the table
CREATE TABLE wp (
FTS_DOC_ID BIGINT unsigned NOT NULL AUTO_INCREMENT,
title varchar(255) NOT NULL DEFAULT ”,
text mediumtext NOT NULL,
PRIMARY KEY (`FTS_DOC_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on wp(FTS_DOC_ID);

Note: “FTS_DOC_ID” (reserved column name, all upper case) column and FTS_DOC_ID_INDEX (reserved index name) are added. These two are used by InnoDB internally, if either is missing, InnoDB will create them for you.

If you are not in a hurry, and don’t want to manage your own Doc ID, then you can just create table as following. Subsequent “create fulltext index” call will do a rebuild on the table and add FTS_DOC_ID column as hidden column, and FTS_DOC_ID_INDEX on FTS_DOC_ID.
CREATE TABLE wp (
title varchar(255) NOT NULL DEFAULT ”,
text mediumtext NOT NULL,
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

2)
InnoDB full-text search has tokenization and sort phases which are handled using threads. “InnoDB_ft_sort_pll_degree” variable determine how many parallel threads to use for tokenization and sorting. It has been observed that ‘create index’ is faster with large values of InnoDB_ft_sort_pll_degree on multi-core machines.

Default value for “InnoDB_ft_sort_pll_degree” is 2 . This is a static variable so it required to be set at server start-up time and it will/can be moved as dynamic variable at later enhancement of InnoDB full-text search.
* Use higher value for InnoDB_ft_sort_pll_degree ( it takes values 2,4,8,16,..)

Following Numbers are obtained with setup as

Schema: Table with number of words (as defined above)
Index : (title, text) columns
Data Size : Approx 2.7GB, 1 million row , 238 million word
Platform: Linux x86 64 bit , 3Ghz , 8 cores , 32 GB RAM

‘Create Index’ Performance


Default 4 8 16
InnoDB 7 min 48.12 5 min 12.06 4 min 10.95 3 min 33.4
MyISAM 11min 42.57 11 min 32.27 11min 32.79 11 min 25.17

Note: “innodb_ft_sort_pll_degree” do not affect myisam , hence ‘create index’ time is almost same around 11 min.

One thing to note is that InnoDB full-text search create index use extensive merge sort. So on some system we see memory usage larger than expected, due to OS file system cache.  You can turn off such OS file system cache on Linux or Solaris by setting “innodb_disable_sort_file_cache” to TRUE, which enables direct I/O. However, doing such might hurt performance to the merge sort to some extent.

Its been observed that CPU utilization is in proportion to “innodb_ft_sort_pll_degree” values. ‘Create index’ in Innodb goes through phases called tokenization, merge sort and insert . CPU utilization is listed for each phase receptively along with Memory utilization for default(2) value of innodb_ft_sort_pll_degree. ( Note: CPU utilization values of different phases are snapshot values of that phase).

Variable












core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
Default(2) CPU (%)










Tokenization 0 0 0 0 96.97 0 0 97.11


Merge Sort 20.23 0 0 1 0 46.56 0 1


Insert Data 26 24.14 36.14 28.14 37 36.68 39.7 31

Memory (%) 5.8%  across all core







(Variable = innodb_ft_sort_pll_degree )

Above scenario when repeated with utf8 character-set on table also showed improvement in ‘create index’ time with increased values of  innodb_ft_sort_pll_degree. Insertion of 100 records on top of 1M record was faster for Innodb than MyISAM. InnoDB inserted 100 records in 10 millisec where as MyISAM took 30 millisec.

b) Query Execution Time With Full-text Index
In this section, we briefly go over the execution time of some typical full-text search queries:

Queries used are :
1) select count(*) from wp where match(title,text) against (‘Creationist +Abrahamic -creationism’ in boolean mode);
2) select count(*) from wp where match(title,text) against (‘Abrahamic’);
3) select count(*) from wp where match(title,text) against (‘preassumtions +orangutan’ in boolean mode);
4) select count(*) from wp where match(title,text) against (‘orangutan +falsified ~naturalistically’ in boolean mode);
5) select count(*) from wp where match(title,text) against (‘+india* -leader +gandh*’ in boolean mode);
6) select count(*) from wp where match(title,text) against (‘”american culture”@09′ in boolean mode);

Query InnoDB (ms) * InnoDB FTS Search Processing Time MyISAM (ms)
1 10 2 10
2 130 1 10
3 1 1 10
4 2 2 10
5 320 116 1000
6 (New InnoDB Feature) 5250
NA


* FTS Search processing time is the actual time spend by InnoDB full-text index scan.

There are still activities going to to synchronize the MySQL Optimizer and QP to the InnoDB FTS, so some queries types are not yet fully tuned. So there are further rooms to improve for various queries.


Summary:

We see ‘create index’ performance scale well align with the number of parallel degree and overall comparable performance with MyISAM in terms of query execution time.

Comments:

Post a Comment:
Comments are closed for this entry.
About

This is the InnoDB team blog.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today