FULLTEXT and Asian Languages with MySQL 5.0

I have seen a few people ask about using MySQL's FULLTEXT indexing with Asian languages such as Chinese, Japanese and Korean (herein referred to as CJK); however, there doesn't seem to be a good, centralised article that covers it.

The information is out there; I just don't think it has been presented well yet.

As I have recently done a bunch of research on this topic for a customer, I figured it might be a good opportunity to make my debut in the MySQL blogosphere.

So here we go...

I'll open by saying that attempting to use FULLTEXT with CJK text in MySQL 5.0 out of the box will be unsuccessful.

From the CJK FAQ in the MySQL manual:

"For FULLTEXT searches, we need to know where words begin and end. With Western languages, this is rarely a problem because most (if not all) of these use an easy-to-identify word boundary — the space character. However, this is not usually the case with Asian writing. We could use arbitrary halfway measures, like assuming that all Han characters represent words, or (for Japanese) depending on changes from Katakana to Hiragana due to grammatical endings. However, the only sure solution requires a comprehensive word list, which means that we would have to include a dictionary in the server for each Asian language supported."

So how do we deal with this problem?

First, please ensure you are using an appropriate column type and character set for the text you are storing. This means you should not use a column type that is by definition binary (e.g. BLOB, VARBINARY, etc.), and you should use a character set such as ujis or utf8.

This will avoid any unexpected issues storing, retrieving and matching on your data.
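As a rough sketch (the table and column names here are just examples), a suitable definition, or a conversion of an existing table, might look like this:

# A non-binary text column with an explicit character set,
# so the server treats the contents as characters rather than bytes.
CREATE TABLE notes (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  body TEXT CHARACTER SET utf8,
  PRIMARY KEY (id)
) ENGINE=MyISAM;

# Or, convert an existing table that was created with the wrong character set:
ALTER TABLE notes CONVERT TO CHARACTER SET utf8;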

If you need to stick to official MySQL builds, the best thing you can do is pre-process the text to be stored by running it through a function that separates words with some arbitrary delimiter, such as the " character. Of course, you will also need to post-process the data to strip the delimiters out before displaying it again.

The problem of word separation for CJK is nothing new, so I'd advise you to check whether there is already a library or function you can use or adapt for this purpose in your application.

Once your words are separated, you're going to need to tune the minimum word length setting (ft_min_word_len) for the FULLTEXT parser. The default of 4 is inappropriate, since a single character can represent a whole word in CJK, so we need to reduce it to 1. This setting can only be changed at server startup, so a restart is required.

Note: Changing this only affects how the FULLTEXT parser behaves as data is being put into the field. If you want the change applied to data that is already stored, you will need to rebuild the index with "REPAIR TABLE tblName QUICK".
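As a sketch, the change and the rebuild look something like this (tblName stands in for your own table name, as above):

# In my.cnf, under the [mysqld] section (requires a server restart):
[mysqld]
ft_min_word_len = 1

# After the restart, confirm the new value took effect:
SHOW VARIABLES LIKE 'ft_min_word_len';

# Rebuild the FULLTEXT index on any table that already holds data:
REPAIR TABLE tblName QUICK;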

So let's just try an example:

# CREATE our test table
CREATE TABLE `t1` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `body` text,
  PRIMARY KEY  (`id`),
  FULLTEXT KEY `body` (`body`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

# Insert some data
# Note: We separate our words with the " character.
INSERT INTO t1 VALUES (NULL, '"おはよう"ございます" "ラクランさん"');
INSERT INTO t1 VALUES (NULL, '"ありがとう"ございます" "奥野さん"');
INSERT INTO t1 VALUES (NULL, '"元気"ですか" "木村さん"');

# Search for my name in Katakana, stripping out " characters.
SELECT id, REPLACE(body, '"','') FROM t1 WHERE MATCH (body) AGAINST ('ラクランさん');
+----+--------------------------------------------------+
| id | REPLACE(body, '"','')                            |
+----+--------------------------------------------------+
|  1 | おはようございます ラクランさん |
+----+--------------------------------------------------+
1 row in set (0.00 sec)

The technique works!
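Just to underline why the ft_min_word_len change matters: a two-character word like 元気 would never have been indexed under the default of 4, but with the setting at 1 a query along these lines should match the third row as well (this is a sketch against the same setup rather than captured output):

# A short word that the default minimum length of 4 would have skipped:
SELECT id, REPLACE(body, '"','') FROM t1 WHERE MATCH (body) AGAINST ('元気');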

For folks who are happy to stray from the official MySQL binaries, you may want to look into a project called "Senna", an embeddable FULLTEXT engine with support for Asian languages.

Senna requires changes to the MySQL server code in order to integrate; however, if you are not the sort to play around with that kind of thing, the "Tritonn" project provides a patched version of MySQL with Senna built in.

Links:

Senna - http://qwik.jp/senna/FrontPage.html
Tritonn - http://qwik.jp/tritonn/about_en.html
Tritonn Download - http://sourceforge.jp/projects/tritonn/


Note: The projects are by Japanese developers, thus the English versions of the sites can be tricky to navigate.

Well that's it from me, for now.

I hope this has been helpful to someone out there and welcome any comments or feedback.

Stay tuned, as I plan to write about FULLTEXT and CJK with MySQL 5.1 in the future!



Comments:

Rather than use MySQL full-text, I would advise my clients to use Sphinx instead.

Posted by Mark Atwood on December 16, 2008 at 10:57 AM EST #

Open Query generally goes with the Sphinx recommendation also, and the OurDelta builds of MySQL even have the Sphinx engine built in: for some uses, the application-level interface (such as from PHP) is best, in other parts of an app the ability to use the indexes from within a MySQL query is preferable.

It's very fast even with much bigger datasets, thus it scales... that's something FULLTEXT just does not do: FULLTEXT gets much slower on bigger datasets, and inserts in particular become a problem. That's serious business for a deployment.

Posted by Arjen Lentz on December 16, 2008 at 11:01 AM EST #
