Dictionaries in Minion: Dictionary Entries
By searchguy on Feb 26, 2009
Dictionaries are a map from entry names to entries, so let's take a second to look at what the entries are before we dive into how the dictionaries use them. For those who want to follow along in the code, all of the code pertaining to the entry types is in the
The basic interface for the entry types is
Entry, which describes the capabilties that all entries need to have. There are methods to get and set the name or the integer ID of the entry, methods to get the partition or the dictionary that the entry belongs to, as well as methods to get occurrence statistics for the entry.
Note that the names for the entries are only specified as
Object. We expect names to be instances of
Date. These types are currently sufficient for all of our uses (including terms in documents, values in saved fields, bigrams, and so on.) The code used in the dictionaries is such that anything that implements
Comparable could be used as the name for an entry, since we need to be able to sort dictionaries by name before dumping them. To-do list item: this dependency should be explicit!
Entry interface has two sub-interfaces
QueryEntry. These names aren't the greatest, but classes implementing
IndexEntry are meant to be used during indexing (i.e., when generating dictionary entries and their associated postings) and classes implementing
QueryEntry are meant to be used during query operations (i.e., when looking up entries and processing their postings). Even though most of the entry types implement both these interfaces, we make this distinction because it makes it clear in the code how we're going to use a particular entry.
BaseEntry abstract class provides an implementation for some of the parts of
QueryEntry. This entry is, unsurprisingly, the basis for pretty much all of the rest of the entries.
In particular, there is the
SinglePostingsEntry class, which is another abstract base class for all the entry types that store a single postings list. There are concrete subclasses of
SinglePostingsEntry for entries that store postings with IDs only, with IDs and frequencies, and for entries that store everything (IDs, frequencies, field information, and word positions).
Another important abstract class is
CasedEntry that defines the basis for all of the entry types that store cased postings lists. For example, if an engine is configured to provide case-sensitive searching, then we will store the postings for each term twice: once in a postings list that ignores the case in which the term was encountered, and once in a postings list that preserves the case in which the term was encountered. Why do it this way? Because in the default case (I want to search for a term without regard to case) we only need to lookup one term and then read and process one postings list.
So who's responsible for generating these entries? The dictionaries are. At indexing time, the dictionary is used to generate an entry for a new term. At query time, the entries will be constructed from the dictionary data that was dumped during indexing.