Saturday Sep 29, 2012

Performance Enhancement in Full-Text Search Query

Ever since its first release, we are continuing consolidating and developing InnoDB Full-Text Search feature. There is one recent improvement that worth blogging about. It is an effort with MySQL Optimizer team that simplifies some common queries Query Plans and dramatically shorted the query time. I will describe the issue, our solution and the end result by some performance numbers to demonstrate our efforts in continuing enhancement the Full-Text Search capability.

The Issue:

As we had discussed in previous Blogs, InnoDB implements Full-Text index as reversed auxiliary tables. The query once parsed will be reinterpreted into several queries into related auxiliary tables and then results are merged and consolidated to come up with the final result. So at the end of the query, well have all matching records on hand, sorted by their ranking or by their Doc IDs.

Unfortunately, MySQLs optimizer and query processing had been initially designed for MyISAM Full-Text index, and sometimes did not fully utilize the complete result package from InnoDB.

Here are a couple examples:

Case 1: Query result ordered by Rank with only top N results:

mysql> SELECT FTS_DOC_ID, MATCH (title, body) AGAINST ('database')

AS SCORE FROM articles ORDER BY score DESC LIMIT 1;

In this query, user tries to retrieve a single record with highest ranking. It should have a quick answer once we have all the matching documents on hand, especially if there are ranked. However, before this change, MySQL would almost retrieve rankings for almost every row in the table, sort them and them come with the top rank result. This whole retrieve and sort is quite unnecessary given the InnoDB already have the answer.

In a real life case, user could have millions of rows, so in the old scheme, it would retrieve millions of rows' ranking and sort them, even if our FTS already found there are two 3 matched rows. Apparently, the million ranking retrieve is done in vain. In above case, it should just ask for 3 matched rows' ranking, all other rows' ranking are 0. If it want the top ranking, then it can just get the first record from our already sorted result.

Case 2: Select Count(*) on matching records:

mysql> SELECT COUNT(*) FROM articles WHERE MATCH (title,body)

AGAINST ('database' IN NATURAL LANGUAGE MODE);

In this case, InnoDB search can find matching rows quickly and will have all matching rows. However, before our change, in the old scheme, every row in the table was requested by MySQL one by one, just to check whether its ranking is larger than 0, and later comes up a count.

In fact, there is no need for MySQL to fetch all rows, instead InnoDB already had all the matching records. The only thing need is to call an InnoDB API to retrieve the count

The difference can be huge. Following query output shows how big the difference can be:

mysql> select count(*) from searchindex_inno where match(si_title, si_text) against ('people') 

+----------+
| count(*) |
+----------+
| 666877 |
+----------+
1 row in set (16 min 17.37 sec)

So the query took almost 16 minutes.

Lets see how long the InnoDB can come up the result. In InnoDB, you can obtain extra diagnostic printout by turning on innodb_ft_enable_diag_print, this will print out extra query info:

Error log:

keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 2 secs: row(s) 666877: error: 10
ft_init()
ft_init_ext()
keynr=2, 'people'
NL search
Total docs: 10954826 Total words: 0
UNION: Searching: 'people'
Processing time: 3 secs: row(s) 666877: error: 10

Output shows it only took InnoDB only 3 seconds to get the result, while the whole query took 16 minutes to finish. So large amount of time has been wasted on the un-needed row fetching.

The Solution:

The solution is obvious. MySQL can skip some of its steps, optimize its plan and obtain useful information directly from InnoDB. Some of savings from doing this include:

1) Avoid redundant sorting. Since InnoDB already sorted the result according to ranking. MySQL Query Processing layer does not need to sort to get top matching results.

2) Avoid row by row fetching to get the matching count. InnoDB provides all the matching records. All those not in the result list should all have ranking of 0, and no need to be retrieved. And InnoDB has a count of total matching records on hand. No need to recount.

3) Covered index scan. InnoDB results always contains the matching records' Document ID and their ranking. So if only the Document ID and ranking is needed, there is no need to go to user table to fetch the record itself.

4) Narrow the search result early, reduce the user table access. If the user wants to get top N matching records, we do not need to fetch all matching records from user table. We should be able to first select TOP N matching DOC IDs, and then only fetch corresponding records with these Doc IDs.

Performance Results and comparison with MyISAM

The result by this change is very obvious. I includes six testing result performed by Alexander Rubin just to demonstrate how fast the InnoDB query now becomes when comparing MyISAM Full-Text Search.

These tests are base on the English Wikipedia data of 5.4 Million rows and approximately 16G table. The test was performed on a machine with 1 CPU Dual Core, SSD drive, 8G of RAM and InnoDB_buffer_pool is set to 8 GB.

Table 1: SELECT with LIMIT CLAUSE

mysql> SELECT si_title, match(si_title, si_text) against('family') as rel FROM si WHERE match(si_title, si_text) against('family') ORDER BY rel desc LIMIT 10;


InnoDB

MyISAM

Times Faster

Time for the query

1.63 sec

3 min 26.31 sec

127

You can see for this particular query (retrieve top 10 records), InnoDB Full-Text Search is now approximately 127 times faster than MyISAM.

Table 2: SELECT COUNT QUERY

mysql>select count(*) from si where match(si_title, si_text) against('family‘);

+----------+
| count(*) |
+----------+
| 293955 |
+----------+


InnoDB

MyISAM

Times Faster

Time for the query

1.35 sec

28 min 59.59 sec

1289

In this particular case, where there are 293k matching results, InnoDB took only 1.35 second to get all of them, while take MyISAM almost half an hour, that is about 1289 times faster!.

Table 3: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;

Term

InnoDB (time to execute)

MyISAM(time to execute)

Times Faster

family

0.5 sec

5.05 sec

10.1

family film

0.95 sec

25.39 sec

26.7

Pizza restaurant orange county California

0.93 sec

32.03 sec

34.4

President united states of America

2.5 sec

36.98 sec

14.8


Table 4: SELECT title and text with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, si_title, si_text, ... as rel FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;

Term

InnoDB (time to execute)

MyISAM(time to execute)

Times Faster

family

0.61 sec

41.65 sec

68.3

family film

1.15 sec

47.17 sec

41.0

Pizza restaurant orange county california

1.03 sec

48.2 sec

46.8

President united states of america

2.49 sec

44.61 sec

17.9

Table 5: SELECT ID with ORDER BY and LIMIT CLAUSE for selected terms

mysql> SELECT <ID>, match(si_title, si_text) against(<TERM>) as rel  FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) ORDER BY rel desc LIMIT 10;

Term

InnoDB (time to execute)

MyISAM(time to execute)

Times Faster

family

0.5 sec

5.05 sec

10.1

family film

0.95 sec

25.39 sec

26.7

Pizza restaurant orange county califormia

0.93 sec

32.03 sec

34.4

President united states of america

2.5 sec

36.98 sec

14.8


Table 6: SELECT COUNT(*)

mysql> SELECT count(*) FROM si_<TB> WHERE match(si_title, si_text) against (<TERM>) LIMIT 10;

Term

InnoDB (time to execute)

MyISAM(time to execute)

Times Faster

family

0.47 sec

82 sec

174.5

family film

0.83 sec

131 sec

157.8

Pizza restaurant orange county califormia

0.74 sec

106 sec

143.2

President united states of america

1.96 sec

220 sec

112.2

 Again, table 3 to table 6 all showing InnoDB consistently outperform MyISAM in these queries by a large margin. It becomes obvious the InnoDB has great advantage over MyISAM in handling large data search.

Summary:

These results demonstrate the great performance we could achieve by making MySQL optimizer and InnoDB Full-Text Search more tightly coupled. I think there are still many cases that InnoDB’s result info have not been fully taken advantage of, which means we still have great room to improve. And we will continuously explore the area, and get more dramatic results for InnoDB full-text searches.

Jimmy Yang, September 29, 2012

Helping to Reduce Page Compression Failures Rate

When InnoDB compresses a page it needs the result to fit into its predetermined compressed page size (specified with KEY_BLOCK_SIZE). When the result does not fit we call that a compression failure. In this case InnoDB needs to split up the page and try to compress again. That said, compression failures are bad for performance and should be minimized.

Whether the result of the compression will fit largely depends on the data being compressed and some tables and/or indexes may contain more compressible data than others. And so it would be nice if the compression failure rate, along with other compression stats, could be monitored on a per table or even on a per index basis, wouldn't it?

This is where the new INFORMATION_SCHEMA table in MySQL 5.6 kicks in. INFORMATION_SCHEMA.INNODB_CMP_PER_INDEX provides exactly this helpful information. It contains the following fields:

+-----------------+--------------+------+
| Field           | Type         | Null |
+-----------------+--------------+------+
| database_name   | varchar(192) | NO   |
| table_name      | varchar(192) | NO   |
| index_name      | varchar(192) | NO   |
| compress_ops    | int(11)      | NO   |
| compress_ops_ok | int(11)      | NO   |
| compress_time   | int(11)      | NO   |
| uncompress_ops  | int(11)      | NO   |
| uncompress_time | int(11)      | NO   |
+-----------------+--------------+------+

similarly to INFORMATION_SCHEMA.INNODB_CMP, but this time the data is grouped by "database_name,table_name,index_name" instead of by "page_size".

So a query like

SELECT
database_name,
table_name,
index_name,
compress_ops - compress_ops_ok AS failures
FROM information_schema.innodb_cmp_per_index
ORDER BY failures DESC;
would reveal the most problematic tables and indexes that have the highest compression failure rate.

From there on the way to improving performance would be to try to increase the compressed page size or change the structure of the table/indexes or the data being stored and see if it will have a positive impact on performance.

About

This is the InnoDB team blog.

Search

Categories
Archives
« September 2012 »
SunMonTueWedThuFriSat
      
1
2
3
4
5
6
7
8
9
10
11
12
13
14
16
17
18
19
20
21
22
23
24
25
26
27
28
30
      
Today