Saturday Apr 26, 2008

Minion and Lucene: Fields

Both Minion and Lucene model a document as a number of field/value pairs. Both engines provide ways to index and query the contents of fields, but they differ quite a bit in how the fields may be queried. Because fields are so fundamental to both engines' document models, there's a lot to say about the engines' similarities and differences.

Indexed Fields

Both engines allow for field values to be tokenized and the resulting tokens to be placed into the index for later querying. For example, if the title of a document is Dog bites man, the tokens dog, bites, and man will be placed into the dictionary and searches for these terms will return this document.
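The idea can be sketched with a toy inverted index (this is illustrative Python, not Minion or Lucene code): tokenize the field value and record each resulting term in a dictionary that maps terms to the documents they occur in.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace -- a minimal stand-in analyzer."""
    return text.lower().split()

dictionary = defaultdict(set)   # term -> set of document ids

def index_document(doc_id, title):
    for token in tokenize(title):
        dictionary[token].add(doc_id)

index_document(1, "Dog bites man")

# Searches for any of the tokens now find document 1.
assert dictionary["dog"] == {1}
assert dictionary["bites"] == {1}
assert dictionary["man"] == {1}
```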

One main difference between the engines is that Lucene considers a field in a document to be a "sub-document" to a much greater degree than Minion. When a term occurs in a particular field, Lucene will index it only as part of that field. In our example above, an entry like title:dog will be placed into the dictionary for the term dog. Terms that are not in any defined field will be placed into a default body field.

When querying in Lucene, you can use a query term like title:dog to find documents that have dog in the title field. Terms that do not specify a field are searched for only in the body. In Minion, all of the information associated with the term dog is stored in the dictionary entry for dog. Thus a simple query for dog will find dog in any field of the document. The Minion query language provides a CONTAINS operator that can be used to restrict a query to a particular field.
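The two dictionary layouts can be contrasted with a simplified sketch (again illustrative, not either engine's actual data structures). In the Lucene-style index the field name is part of the dictionary key; in the Minion-style index there is one entry per term, and the field is recorded with each posting.

```python
from collections import defaultdict

lucene_style = defaultdict(set)   # (field, term) -> set of doc ids
minion_style = defaultdict(set)   # term -> set of (doc id, field)

def index(doc_id, field, text):
    for term in text.lower().split():
        lucene_style[(field, term)].add(doc_id)
        minion_style[term].add((doc_id, field))

index(1, "title", "Dog bites man")
index(2, "body", "the dog slept")

# Lucene-style: title:dog finds only document 1.
assert lucene_style[("title", "dog")] == {1}

# Minion-style: a plain query for dog finds both documents from a single
# dictionary entry; restricting to a field (as CONTAINS does) means
# filtering on the field information stored in the postings.
assert {d for d, f in minion_style["dog"]} == {1, 2}
assert {d for d, f in minion_style["dog"] if f == "title"} == {1}
```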

The advantage for Lucene here is that searches for terms in a particular field will most likely be faster than in Minion, because only a simple postings list needs to be traversed. In Minion, finding a term in a particular field requires decoding the field information in the term's postings list. The advantage for Minion is that a search for a term across all of the fields in a document will most likely be faster than in Lucene, because a single dictionary lookup and postings-list traversal is all that is necessary.

Lucene provides an extension class that searches across all fields for a particular term, but it has the (somewhat strange, IMHO) requirement that the term must occur in all fields in a document for it to satisfy the query. Clearly, the Minion behavior could be emulated in Lucene using a disjunction of terms in all of the known fields, but as far as we can tell, this is not done.
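Such an emulation amounts to expanding the bare term into a disjunction over every known field. A minimal sketch, assuming the application supplies its own list of field names:

```python
def all_fields_query(term, fields):
    """Expand a bare term into a disjunction over the given fields."""
    return " OR ".join(f"{field}:{term}" for field in fields)

# The field names here are assumptions for illustration.
query = all_fields_query("dog", ["title", "body", "keywords"])
assert query == "title:dog OR body:dog OR keywords:dog"
```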

Saved Fields

Both engines provide for saving field values as they appear in the documents so that they may be retrieved and displayed to the user at query time. Both offer three "standard" types for their saved fields: String, Date, and Long. Minion additionally offers a Double saved field type.

Minion stores saved field values in a per-field dictionary: the names of the entries in the dictionary are the saved values, and the postings associated with each entry list the documents that contain that particular field value. This has the advantage that common field values do not take up as much space in the index, and it allows for fast computation of common relational queries (such as classification = fast).
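A toy sketch of that layout (illustrative only): each distinct value is stored once as an entry name, and a relational query like classification = fast is a single dictionary lookup.

```python
from collections import defaultdict

saved = {"classification": defaultdict(set)}  # field -> value -> doc ids

def save(doc_id, field, value):
    saved[field][value].add(doc_id)

save(1, "classification", "fast")
save(2, "classification", "fast")
save(3, "classification", "slow")

# classification = fast is one lookup in the field's value dictionary.
assert saved["classification"]["fast"] == {1, 2}

# Each distinct value appears once, however many documents share it.
assert len(saved["classification"]) == 2
```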

As with all of the dictionaries in a Minion index, the entries in the dictionaries and the associated postings are compressed to save space. Longs, Dates, and Doubles are all stored as 64-bit quantities. This allows us to store any valid Java Long or Double. It also provides for full millisecond resolution for all saved dates, but this full resolution requires some finesse at querying time when a query uses a day-resolution date. Additionally, Minion sometimes encounters difficulty when indexing is done in a different time zone than retrieval.
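One way to supply that finesse (an assumed approach, not necessarily Minion's actual implementation) is to expand a day-resolution query date into the half-open millisecond range covering that day and run a range query against the saved values.

```python
from datetime import datetime, timedelta

def day_to_range(year, month, day):
    """Expand a day-resolution date into [start of day, start of next day)."""
    start = datetime(year, month, day)
    return start, start + timedelta(days=1)

start, end = day_to_range(2008, 4, 26)

# A millisecond-resolution saved value from that day falls in the range.
stamp = datetime(2008, 4, 26, 14, 30, 5, 123000)
assert start <= stamp < end
```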

Lucene stores saved fields in a compressed format, but it uses a string representation for all saved data. In the case of numbers, they are stored using a base 36 representation that allows them to be sorted lexicographically and come out in the correct numerical order. Dates are stored using a variable resolution that is selected at indexing time. As with numbers, they are stored using a representation that allows them to be sorted lexicographically. This resolution selection gets around Minion's day-resolution and time zone problems, but it does mean that dates in the source material may not be represented with the resolution that someone querying the index will require (e.g., if you choose a day-level resolution at indexing time, then you can't query for documents from a particular hour).
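The idea behind such an encoding can be sketched as follows (modeled on, but not copied from, Lucene's number handling): pad every value to a fixed width so that string order matches numeric order. Negative numbers, which need an extra offset trick, are omitted from this sketch.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"
WIDTH = 13  # enough base 36 digits for any non-negative 64-bit value

def encode(n):
    """Encode a non-negative integer as a fixed-width base 36 string."""
    assert n >= 0
    s = ""
    while n:
        s = DIGITS[n % 36] + s
        n //= 36
    return (s or "0").rjust(WIDTH, "0")

values = [5, 42, 100, 1000, 36 ** 6]
encoded = [encode(v) for v in values]

# Lexicographic order of the strings matches numeric order of the values.
assert sorted(encoded) == encoded
```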

Both Minion and Lucene provide relational operators for their saved fields. Lucene offers a range operator for saved field values, which can be used to select documents that have a saved field value in the given range (including or excluding the endpoints). Wildcards can be used in the range operator to get something like a substring operator. Lucene doesn't offer a single-ended range, but this can be emulated using terms that are known to be less than the smallest term in the index or greater than the largest term (of course, it's up to the application to provide for this capability!)
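That emulation can be sketched like so: substitute a sentinel endpoint for whichever bound is missing. The sentinel values below are assumptions the application would have to guarantee sit outside the range of terms actually in the index.

```python
def open_ended_range(lower=None, upper=None,
                     smallest="\u0000", largest="\uffff"):
    """Return the (lower, upper) pair to hand to a two-ended range query,
    filling a missing bound with a sentinel endpoint."""
    return (lower if lower is not None else smallest,
            upper if upper is not None else largest)

# "everything >= 'dog'" becomes a range from 'dog' up to the largest term.
assert open_ended_range(lower="dog") == ("dog", "\uffff")

# "everything <= 'dog'" runs from the smallest term up to 'dog'.
assert open_ended_range(upper="dog") == ("\u0000", "dog")
```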

I must admit to some confusion about how indexed and saved fields interact with tokenization for these range queries. It appears as though one can only use a range operator with a particular word indexed out of a saved field value and not on the saved field values directly. As far as I can tell, the distinction between the saved, tokenized, and indexed attributes is stronger in Minion than it is in Lucene. As with all Lucene searches, these range operators are case insensitive.

Minion offers standard relational operators (e.g., <=, >=, =, etc.) with some support in the query parser for combining operators for the same field into a range. Minion also supports a != operator, which Lucene does not. These operators are available for any saved field type. Minion also offers operators like STARTS, ENDS, SUBSTRING, and MATCHES for string fields. In Minion, relational queries for string fields can be case sensitive or insensitive.


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
