Minion and Lucene: Case sensitivity

One of the big differences between Minion and Lucene is how they treat case. By default, Lucene's indexing is case insensitive. That is, terms extracted from documents are converted to lower case before being added to the index. The same transformation is performed on query terms, so in its default configuration, you can't ask for documents that have a particular term in a particular case. For example, if you're searching for my last name, Green, you're going to get lots of occurrences of the color green.

In fact, you can make a case sensitive index in Lucene by removing this behavior, but then you can only query for the particular case. So, if you run a query for dog, you won't get the occurrences of Dog that come at the start of a sentence.
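To make the trade-off concrete, here is a minimal sketch in plain Python. It uses no Lucene or Minion API. The `build_index` and `lookup` helpers and the sample documents are made up for illustration. One index lowercases every term; the other keeps terms exactly as they appeared.

```python
from collections import defaultdict

def build_index(docs, lowercase):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.strip(".,;:!?")  # crude tokenization, illustration only
            if lowercase:
                term = term.lower()
            if term:
                index[term].add(doc_id)
    return index

def lookup(index, term, lowercase):
    """Apply the same case transformation to the query term as to the index."""
    if lowercase:
        term = term.lower()
    return index.get(term, set())

# Hypothetical sample documents.
docs = {
    1: "Stephen Green writes about search.",
    2: "The grass is green.",
    3: "Dog owners walk every day.",
    4: "My dog sleeps.",
}

folded = build_index(docs, lowercase=True)   # everything lowercased
exact = build_index(docs, lowercase=False)   # terms kept as written
```

With the folded index, a query for Green returns both the name and the color. With the exact index, a query for dog misses the sentence-initial Dog.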

While Minion could be configured to have either of these behaviors, by default Minion builds a case sensitive index. Terms extracted from documents are stored both in the case in which they appeared in the document and in all lower case.

By default the query behavior is:

  • Given a term in all lower case or all upper case, search for the term in any case.

  • Given a term in mixed case, search for the term in that case.

This behavior can be modified by the use of the CASE operator, which indicates that a term should be looked up in the provided case. You can also configure Minion to do case insensitive lookups no matter what the case of the terms provided in the query.
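The default rule and the CASE override can be sketched as a small decision function. This is only an illustration of the rule as stated above, not Minion's actual query code. The name `resolve_case_mode` and the `case_operator` flag are hypothetical.

```python
def resolve_case_mode(term, case_operator=False):
    """Decide how a query term should be matched.

    Returns "exact" to match the term only in the case given,
    or "any" to match it regardless of case.
    """
    if case_operator:
        # The CASE operator asks for the term in the provided case.
        return "exact"
    if term == term.lower() or term == term.upper():
        # All lower case or all upper case: match any case.
        return "any"
    # Mixed case: match that case only.
    return "exact"
```

So `resolve_case_mode("dog")` and `resolve_case_mode("DOG")` both ask for any case, `resolve_case_mode("Green")` asks for that exact case, and passing `case_operator=True` forces an exact match even for an all lower case term.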

The above rules also apply to relational operators applied against saved string fields.

All of this case sensitive behavior is supported by a couple of dictionary entry types that support postings for the case sensitive and case insensitive versions of a term. This allows us to do the case insensitive queries nearly as quickly as the case sensitive ones.
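As a rough picture of why both kinds of query stay cheap, here is a sketch of a dictionary that keeps postings under both the original and the lowercased form of each term. The real Minion entry types are not described here, so the two-map layout and the `CasedDictionary` class are assumptions made purely for illustration.

```python
from collections import defaultdict

class CasedDictionary:
    """Toy dictionary holding case sensitive and case insensitive postings."""

    def __init__(self):
        self.exact = defaultdict(set)   # term as it appeared -> doc ids
        self.folded = defaultdict(set)  # lowercased term -> doc ids

    def add(self, doc_id, term):
        # Each occurrence is recorded twice: once as written, once lowercased.
        self.exact[term].add(doc_id)
        self.folded[term.lower()].add(doc_id)

    def search(self, term, mode):
        # mode is "exact" for a case sensitive lookup, anything else for
        # a case insensitive one. Either way it is a single map lookup.
        if mode == "exact":
            return set(self.exact.get(term, ()))
        return set(self.folded.get(term.lower(), ()))
```

Because the case insensitive postings are built at indexing time, a case insensitive query does not have to enumerate and merge every cased variant of the term at query time.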

Obviously, this requires extra index space to store the duplicated postings, but if you want to constrain the size of the index you can configure the engine to act like Lucene does. Minion provides a default configuration that has this behavior.


I don't think you can say "By default, Lucene's indexing is case insensitive." There is really no "default" with this aspect of Lucene. Case (in)sensitivity is a job of an analyzer (token filter, actually) in Lucene, and there is no default one. There are some that ship with Lucene core, but you can't really call them default. Plus, an analyzer is often the first thing one changes. This post paints a slightly incorrect and negative picture of Lucene, IMHO.... but please keep the Minion and Lucene comparison going, I'm enjoying it.

Posted by Otis Gospodnetic on May 01, 2008 at 07:58 PM EDT #

I guess I'm more of a Minion expert than a Lucene expert :-). Thanks for the clarification. So, how would I go about constructing an index with the Minion casedness behavior described above in Lucene?

Posted by Stephen Green on May 02, 2008 at 01:51 PM EDT #

You'd write a custom TokenFilter and use it in your custom Analyzer, along with any other suitable TokenFilters and a Tokenizer. Then you'd use that Analyzer when calling the IndexWriter ctor and when parsing queries with QueryParser (if you use it).

Posted by Otis Gospodnetic on May 02, 2008 at 02:05 PM EDT #

Interesting, sounds a bit more akin to Xapian's way of doing things.

Posted by Rob Young on May 06, 2008 at 03:06 AM EDT #

Rob, I hadn't encountered Xapian before, but I think I had encountered Muscat (and of course all IR people know about Martin Porter :-) Thanks for the pointer!

Posted by Stephen Green on May 08, 2008 at 07:47 AM EDT #


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

