Wednesday Apr 30, 2008

Minion and Lucene: Case sensitivity

One of the big differences between Minion and Lucene is how they treat case. By default, Lucene's indexing is case insensitive. That is, terms extracted from documents are converted to lower case before being added to the index. The same transformation is performed on query terms, so in its default configuration, you can't ask for documents that have a particular term in a particular case. For example, if you're searching for my last name, Green, you're going to get lots of occurrences of the color green.

In fact, you can make a case sensitive index in Lucene by removing this behavior, but then you can only query for the particular case. So, if you run a query for dog, you won't get the occurrences of Dog that come at the start of a sentence.

While Minion could be configured to have either of these behaviors, by default Minion builds a case sensitive index. Terms extracted from documents are stored in the case in which they appeared in the document as well as in all lower case.
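To make that concrete, here's a minimal sketch of which forms of a token end up in the index under this scheme. The class and method names (IndexedForms, formsFor) are made up for illustration and are not part of Minion's API.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; these names do not come from Minion.
final class IndexedForms {
    // Returns the forms under which a token is stored: the case in which it
    // appeared, plus the all-lower-case form. A token that is already lower
    // case yields a single form.
    static Set<String> formsFor(String token) {
        Set<String> forms = new LinkedHashSet<>();
        forms.add(token);
        forms.add(token.toLowerCase(Locale.ROOT));
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(formsFor("Green")); // [Green, green]
        System.out.println(formsFor("green")); // [green]
    }
}
```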

By default the query behavior is:

  • Given a term in all lower case or all upper case, search for the term in any case.

  • Given a term in mixed case, search for the term in that case.
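The two rules above can be sketched in a few lines of Java. This is only an illustration of the decision, assuming that "all lower case" and "all upper case" mean the term is unchanged by lower-casing or upper-casing; the names QueryCaseRule, Match, and defaultMatch are hypothetical and not Minion's API.

```java
import java.util.Locale;

// Hypothetical illustration of the default query rules; not Minion's API.
final class QueryCaseRule {
    enum Match { ANY_CASE, EXACT_CASE }

    // A term that is entirely lower case or entirely upper case matches in
    // any case; a mixed-case term matches only in the case given.
    static Match defaultMatch(String term) {
        boolean allLower = term.equals(term.toLowerCase(Locale.ROOT));
        boolean allUpper = term.equals(term.toUpperCase(Locale.ROOT));
        return (allLower || allUpper) ? Match.ANY_CASE : Match.EXACT_CASE;
    }

    public static void main(String[] args) {
        System.out.println(defaultMatch("green")); // ANY_CASE
        System.out.println(defaultMatch("GREEN")); // ANY_CASE
        System.out.println(defaultMatch("Green")); // EXACT_CASE
    }
}
```

So a query for green or GREEN finds both my last name and the color, while a query for Green finds only the capitalized form.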

This behavior can be modified by the use of the CASE operator, which indicates that a term should be looked up in the provided case. You can also configure Minion to do case insensitive lookups no matter what the case of the terms provided in the query.

The above rules also apply to relational operators applied against saved string fields.

All of this case sensitive behavior is supported by a couple of dictionary entry types that support postings for the case sensitive and case insensitive versions of a term. This allows us to do the case insensitive queries nearly as quickly as the case sensitive ones.
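Here's a rough sketch of the idea, not Minion's actual data structures. It assumes one dictionary entry per lower-cased term that carries a postings set for the term in any case plus a postings set per exact case variant. All names (CasedDictionary, Entry, add, lookup) are hypothetical. Because the any-case postings are built at indexing time, a case insensitive lookup is a single dictionary probe and does not have to merge the variants at query time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical layout for illustration; not Minion's dictionary format.
final class CasedDictionary {
    static final class Entry {
        // Documents containing the term in any case.
        final Set<Integer> anyCase = new TreeSet<>();
        // Documents containing each exact case variant of the term.
        final Map<String, Set<Integer>> byCase = new HashMap<>();
    }

    // Entries are keyed by the lower-cased term.
    private final Map<String, Entry> entries = new HashMap<>();

    void add(String token, int docId) {
        String key = token.toLowerCase(Locale.ROOT);
        Entry e = entries.computeIfAbsent(key, k -> new Entry());
        e.anyCase.add(docId);
        e.byCase.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
    }

    // exactCase = false returns the precomputed any-case postings;
    // exactCase = true returns only the postings for the case given.
    Set<Integer> lookup(String term, boolean exactCase) {
        Entry e = entries.get(term.toLowerCase(Locale.ROOT));
        if (e == null) {
            return Collections.emptySet();
        }
        Set<Integer> hits = exactCase
                ? e.byCase.getOrDefault(term, Collections.emptySet())
                : e.anyCase;
        return Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        CasedDictionary dict = new CasedDictionary();
        dict.add("Green", 1);
        dict.add("green", 2);
        dict.add("GREEN", 3);
        System.out.println(dict.lookup("green", false)); // [1, 2, 3]
        System.out.println(dict.lookup("Green", true));  // [1]
    }
}
```

The duplicated postings in this sketch are also where the extra index space comes from.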

Obviously, this requires extra index space to store the duplicated postings, but if you want to constrain the size of the index you can configure the engine to act like Lucene does. Minion provides a default configuration that has this behavior.

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
