Minion and Lucene: Fields

Both Minion and Lucene model a document as a number of field/value pairs. Both engines provide ways to index and query the contents of fields, but they differ quite a bit in how the fields may be queried. Because fields are so fundamental to both engines' document models, there's a lot to say about the engines' similarities and differences.

Indexed Fields

Both engines allow for field values to be tokenized and the resulting tokens to be placed into the index for later querying. For example, if the title of a document is Dog bites man, the tokens dog, bites, and man will be placed into the dictionary and searches for these terms will return this document.

One main difference between the engines is that Lucene considers a field in a document to be a "sub-document" to a much greater degree than Minion. When a term occurs in a particular field, Lucene will index it only as part of that field. In our example above, an entry like title:dog will be placed into the dictionary for the term dog. Terms that are not in any defined field will be placed into a default body field.

When querying in Lucene, you can use a query term like title:dog to find documents that have dog in the title field. Terms that do not specify a field are searched for only in the body. In Minion, all of the information associated with the term dog is stored in the dictionary entry for dog. Thus a simple query for dog will find dog in any field of the document. The Minion query language provides a CONTAINS operator that can be used to restrict a query to a particular field.

The advantage for Lucene here is that searches for terms in a particular field will most likely be faster than in Minion, because only a simple postings list must be traversed. In Minion, finding a term in a particular field requires decoding field information in the postings list for the term. The advantage for Minion is that search for a term in all of the fields in a document will most likely be faster than in Lucene, because in that case a single dictionary lookup and postings list traversal is all that is necessary.

Lucene provides an extension class that searches across all fields for a particular term, but it has the (somewhat strange, IMHO) requirement that the term must occur in all fields in a document for it to satisfy the query. Clearly, the Minion behavior could be emulated in Lucene using a disjunction of terms in all of the known fields, but as far as we can tell, this is not done.

Saved Fields

Both engines provide for saving field values as they appear in the documents so that they may be retrieved and displayed to the user at query time. Both offer three "standard" types for their saved fields: String, Date, and Long. Minion additionally offers a Double saved field type.

Minion stores saved field values in a dictionary per field and the names of the entries in the dictionaries are the saved values, and the postings associated with the entries contain the documents that contain that particular field value. This has the advantage that common field values do not take up as much space in the index and it allows for fast computation of common relational queries (such as classification = fast.)

As with all of the dictionaries in a Minion index, the entries in the dictionaries and the associated postings are compressed to save space. Longs, Dates, and Doubles are all stored as 64-bit quantities. This allows us to store any valid Java Long or Double. It also provides for full millisecond resolution for all saved dates, but this full resolution requires some finesse at querying time when a query uses a day-resolution date. Additionally, Minion sometimes encounters difficulty when indexing is done in a different time zone than retrieval.

Lucene stores saved fields in a compressed format, but it uses a string representation for all saved data. In the case of numbers, they are stored using a base 36 representation that allows them to be sorted lexicographically and come out in the correct numerical order. Dates are stored using a variable resolution that is selected at indexing time. As with numbers, they are stored using a representation that allows them to be sorted lexicographically. This resolution selection gets around Minion's day-resolution and time zone problems, but it does mean that dates in the source material may not be represented with the resolution that someone querying the index will require (e.g., if you choose a day-level resolution at indexing time, then you can't query for documents from a particular hour).

Both Minion and Lucene provide relational operators for their saved fields. Lucene offers a range operator for saved field values, which can be used to select documents that have a saved field value in the given range (including or excluding endpoints). Wildcards can be used in the range operator to get something like a substring operator. Lucene doesn't offer a single ended range, but this can be emulated using terms that are known to be less than the smallest term in the index or greater than the largest term (of course, it's up to the application to provide for this capability!)

I must admit to some confusion about how indexed and saved fields interact with tokenization for these range queries. It appears as though one can only use a range operator with a particular word indexed out of a saved field value and not on the saved field values directly. As far as I can tell, the distinction between the saved, tokenized, and indexed attributes is stronger in Minion than it is in Lucene. As with all Lucene searches, these range operators are case insensitive.

Minion offers standard relational operators (e.g., <=, >=, =, etc.) with some support in the query parser for combining operators for the same field into a range. Minion also supports a != operator, which Lucene does not. These operators are available for any saved field type. Minion also offers operators like STARTS, ENDS, SUBSTRING, and MATCHES for string fields. In Minion, relational queries for string fields can be case sensitive or insensitive.

Comments:

Stephen, as for the indexed fields chapter I think that it is not that hard to emulate the same thing in Lucene just by creating a new arbitrary field (called 'all' for example) which will contain all terms in the document. This arbitraty field is then used for default search if field name is not defined in query string. I thing this is what Compass framework does for you automaticaly so you don't have to think abou this. On the other hand I think it may be useful to have an option not to have 'all' filed at all - this really depends on the business needs. Is it possible to turn off the 'all' files in Minion?

Posted by Lukas Vlcek on April 27, 2008 at 03:03 AM EDT #

Some clarifications about Lucene:

"Lucene provides an extension class that searches across all fields for a particular term, but...": if you refer to MultiFieldQuery parser, it only requires the term be be in at least of the given fields (there used to be a bug in an older version of Lucene which leads to the behaviour you describe).

"Lucene stores saved fields in a compressed format": this is actually optional (use Field.Store.COMPRESS instead of Field.Store.YES)

"Both Minion and Lucene provide relational operators for their saved fields": Lucene offers you to get the stored field to display it, but that's it. The "stored" property in Lucene has nothing to do with its search features.

"I must admit to some confusion about how indexed and saved fields interact with tokenization for these range queries": if you specify a field in Lucene as tokenized, it actually means that is analyzed (usually tokenized and then further normalized). But this never refers to the stored value which is always stored as-is.

Posted by Daniel Naber on April 27, 2008 at 04:22 AM EDT #

Lukas, that sounds like a good way to go about doing this, but it's up to the application developer to think of it and to implement it on the indexing and querying sides.

I haven't looked too deeply into some of the search systems built on top of Lucene, so I can't really speak to their capabilities, but I did notice that Solr modified some of Lucene's default behavior.

Come to think of it, turning off the "all fields" behavior in Minion would actually require some rewriting of the query logic.

We often kicked around the idea of having a user-configurable set of default fields to search, possibly with support in the indexer for creating postings for that set so that evaluation would be fast for the default fields.

Posted by Stephen Green on April 27, 2008 at 09:05 AM EDT #

Daniel,

I appreciate the clarifications. It's been a while since I looked at MultiFieldQuery in Lucene.

Just a followup: so if storing is only for display, then what do the range operators work on? Is it just the tokens extracted from the field?

Posted by Stephen Green on April 27, 2008 at 09:07 AM EDT #

Stephen,

the "range operator" is just a search (thus, it works on indexed i.e. analyzed/tokenized terms). For example, foo\* might be expanded to foobar and fooblah if those are the words in the index starting with "foo". Then a normal search is done. Actually the latest version of Lucene can turn foo\* also into a filter IIRC. A filter is a bitset that filters the rest of the query but it has no influence on the ranking. For range queries that are combined with a search for terms it's often useful to have no ranking influence from the range query anyway.

Posted by Daniel Naber on April 27, 2008 at 09:28 AM EDT #

OK, I think I've got it.

I agree that it's is useful to have a way to restrict queries without affecting scores.

Minion has an operator in the query language that applies to a sub-expression and indicates that the results should be interpreted as a strict boolean set of documents, which restricts the result set (i.e., the documents from the subexpression have to appear) without affecting the scores generated by other expressions.

We've also been implementing filters lately that are applied as documents are considered for addition into a results set. This is a way to provide filtering that does things that are not easy to express in the query language.

If you're the Daniel Naber that's working on the OpenThesaurus, have you ever looked at the lexical chaining work done by Graeme Hirst and his students (including me) at the University of Toronto?

Posted by Stephen Green on April 27, 2008 at 09:36 AM EDT #

Stephen, thanks for the pointer, I haven't looked at lexical chaining yet. Chances are good that OpenThesaurus will soon be relaunched with a more powerful software, BTW.

Posted by Daniel Naber on April 28, 2008 at 09:26 AM EDT #

The way lucene is sorting (afaik) based on date or other sortable values is by loading the map of all documents sorted by the sortable value into memory when searcher is opened. That way, sorting documents does not require it to read the value with a separate disk operation for each document considered in the result set.
How is sorting (by date for example) done in Minion?

Posted by Ron on April 29, 2008 at 08:54 AM EDT #

Minion fetches field values from the index as necessary during sorting.

The saved field values are stored in a dictionary along with a data structure that indicates what values are stored in each document. When you have to decide whether to include a document in a result set, you pull the field values by checking what values are in the document and then pulling the first of these.

This is supported by a field fetcher that's responsible for storing a copy of the data in the dictionary. This is done using a file-backed buffer of bytes. Because we consider hits in document ID order and we buffer a few KB of data in the fetcher, we avoid a lot of disk hits.

Don't get me wrong, sorting by stored field values is slower than sorting by score alone, but our approach uses a minimal amount of memory (basically, enough to store the field values for the top hits so far and the buffer(s) for the field store's data).

Still, it's fast enough that you can sort by a saved field value like a data in a collection with a few million documents quickly enough to present search results to users.

Does that help? Feels like I should do a whole post on this one :-)

Posted by Stephen Green on April 29, 2008 at 09:16 AM EDT #

Thanks for the reply Stephen.
The issue of date (or other field) ordered search becomes an issue when you have an index of lets say 100-200 million documents.
If you then have to read many times, the values of the sorted-by field, disk access kills your search and you get to >5sec (or many times >60sec) for such searches, especially ones that might contain a very large possible result set.
If you instead, loaded 200 million values associated with the unique document IDs, and used all in-memory access, you are actually getting a speed even better than relevancy search (as you don't have to compute BM25 or other weighing schemes.. by the way, which weighing scheme is Minion using?). Yes, you have to allocate enough memory for such data, but when search speed is a critical issue (which is mostly is), memory is certainly a commodity. Disk access is a VERY expensive and bad thing to hit.
This is a critical aspect for very large indexes which lucene managed to address.
I am not sure I accurately understood your explanation, but assuming I did, ordering by date in Minion will require accessing disk to read each segment/batch of documents in order to be able to sort them by date, a thing that will render date-ordering of very big indexes very slow. on the other hand, less memory will be consumed by such in-memory data structure, a thing that might be very handy for cases where memory is in short supply.
Am I correct with the above interpretation?

Posted by Ron on April 29, 2008 at 10:34 AM EDT #

Your interpretation is correct.

100-200 million documents is a pretty big collection (what the heck are you indexing?) I'm not saying that Minion couldn't index it, but I think at that collection size, you would want to start thinking about distributing it (we're working on distributing the search engine as part of Aura's data store.)

At that size, I wouldn't be at all surprised that the sort would kill your search times (in our case, the search would be fine, it would be fetching the results that would kill you, since that's when the sort happens.)

As it turns out, we've been discussing how to handle a saved field that is going to be used frequently for sorting or results filtering, and the answer is (we think) to add another attribute to fields called CACHED that would indicate that the values should just be loaded into memory when asked for (or perhaps at startup?)

Of course, 100-200 million dates would be 100-200M 64 bit values, which ends up being a gigabyte or more for that field. I hope you don't want to do this with too many fields :-)

A lot of the caching behavior of the engine is configurable (although, strangely, not the amount of stuff to cache when doing fetches like this. I guess we should fix that...), even up to things like mmapping postings files in (although there's a real aggravation aggravation in java.nio that you will probably run into if you're dealing with collections of this size.)

Our default scoring function is pretty much standard TF IDF, although all scoring is done via an interface whose implementation is (all together now!) configurable.

Posted by Stephen Green on April 29, 2008 at 01:59 PM EDT #

Regarding the 200M, this is already the distributed number, as the actual document collection is many billions. So 200M is per node. Indeed storing all date sort index in memory will cost you, but that's a price to pay for fast searches (I'm also assuming that the system is 64-bit ready?).
Loading data into memory at startup indeed looks like the way to go. The only issue here to consider is, whether or not it should be stored in a separate (or duplicate) area on the disk, where all the data is loaded at once, or stored separately for each document. in which case there will be 200M (or a different large number of) read operations to get all the data into memory.
The configurable part of the system sounds like a strong point.
Maybe also compare Minion and Lucene in that sense as well as in the practices of the code writing and modifying/interfacing with it (as Lucene may be easily managed but not the best written piece of code)?

Posted by Ron on April 29, 2008 at 09:00 PM EDT #

OK, wow. That's a lot of documents. I know a company with a good line of x64 servers if you need more hardware ;-)

The system is 64 bit ready: all of the numeric saved field values are 64 bits. Document IDs are 32 bit per partition, but you can have up to 2\^32 partitions (although things would become unusable long before then, of course!)

The actual saved values (i.e., the names of the terms in the dictionary for the saved field) and the list of saved fields per doc can be read in pretty quickly, since they're in contiguous buffers in the on-disk representation.

I'll be talking about the configurability and ease-of-use in future entries in the Minion and Lucene series, so stay tuned.

Posted by Stephen Green on April 30, 2008 at 02:06 AM EDT #

Post a Comment:
Comments are closed for this entry.
About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today