Dictionaries in Minion: Dictionary Entries

Dictionaries are a map from entry names to entries, so let's take a second to look at what the entries are before we dive into how the dictionaries use them. For those who want to follow along in the code, all of the code pertaining to the entry types is in the com.sun.labs.minion.indexer.entry package.

The basic interface for the entry types is Entry, which describes the capabilties that all entries need to have. There are methods to get and set the name or the integer ID of the entry, methods to get the partition or the dictionary that the entry belongs to, as well as methods to get occurrence statistics for the entry.

Note that the names for the entries are only specified as Object. We expect names to be instances of String, Long, Double, or Date. These types are currently sufficient for all of our uses (including terms in documents, values in saved fields, bigrams, and so on.) The code used in the dictionaries is such that anything that implements Comparable could be used as the name for an entry, since we need to be able to sort dictionaries by name before dumping them. To-do list item: this dependency should be explicit!

The Entry interface has two sub-interfaces IndexEntry and QueryEntry. These names aren't the greatest, but classes implementing IndexEntry are meant to be used during indexing (i.e., when generating dictionary entries and their associated postings) and classes implementing QueryEntry are meant to be used during query operations (i.e., when looking up entries and processing their postings). Even though most of the entry types implement both these interfaces, we make this distinction because it makes it clear in the code how we're going to use a particular entry.

The BaseEntry abstract class provides an implementation for some of the parts of IndexEntry and QueryEntry. This entry is, unsurprisingly, the basis for pretty much all of the rest of the entries.

In particular, there is the SinglePostingsEntry class, which is another abstract base class for all the entry types that store a single postings list. There are concrete subclasses of SinglePostingsEntry for entries that store postings with IDs only, with IDs and frequencies, and for entries that store everything (IDs, frequencies, field information, and word positions).

Another important abstract class is CasedEntry that defines the basis for all of the entry types that store cased postings lists. For example, if an engine is configured to provide case-sensitive searching, then we will store the postings for each term twice: once in a postings list that ignores the case in which the term was encountered, and once in a postings list that preserves the case in which the term was encountered. Why do it this way? Because in the default case (I want to search for a term without regard to case) we only need to lookup one term and then read and process one postings list.

So who's responsible for generating these entries? The dictionaries are. At indexing time, the dictionary is used to generate an entry for a new term. At query time, the entries will be constructed from the dictionary data that was dumped during indexing.

Comments:

Post a Comment:
Comments are closed for this entry.
About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today