Monday Mar 02, 2009

Dictionaries in Minion: Indexing

Following on from our discussion of dictionaries, let's look at how
dictionaries get used during indexing. The MemoryDictionary
class is an implementation of the Dictionary
interface that is used during indexing.

The basic functionality for storing dictionary entries while indexing
is handled by a HashMap<Object,Entry>. Note that
the system is structured so that a MemoryDictionary is never
accessed by multiple threads, so we don't need to worry about
concurrency for this map.

We create a MemoryDictionary by calling the constructor,
passing in the Class for the entries that will be stored
in the dictionary. The astute reader will realize that this is an
opportunity to use generics, and genericizing the dictionaries is on
our list of things to do.
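
To give a feel for what a genericized dictionary might look like, here's a minimal sketch. It is not Minion's code: SketchEntry, TermEntry, and SketchDictionary are hypothetical names, and the sketch assumes that every entry class has a constructor taking the entry's name, which may not be how Minion builds its entries.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class for entries; this is not Minion's Entry interface.
abstract class SketchEntry {
    final Object name;

    SketchEntry(Object name) {
        this.name = name;
    }
}

// Hypothetical concrete entry type.
class TermEntry extends SketchEntry {
    TermEntry(Object name) {
        super(name);
    }
}

// A generic dictionary: the entry class is supplied once, so callers need no casts.
class SketchDictionary<E extends SketchEntry> {
    // A plain HashMap is enough because we assume single-threaded access.
    private final Map<Object, E> map = new HashMap<>();
    private final Constructor<E> ctor;

    SketchDictionary(Class<E> entryClass) {
        try {
            // Assumption: every entry class declares an (Object name) constructor.
            ctor = entryClass.getDeclaredConstructor(Object.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                    "No (Object) constructor on " + entryClass, e);
        }
    }

    // Creates an entry of the configured class; it is not stored until put is called.
    E newEntry(Object name) {
        try {
            return ctor.newInstance(name);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create entry for " + name, e);
        }
    }

    E get(Object name) {
        return map.get(name);
    }

    void put(Object name, E entry) {
        map.put(name, entry);
    }

    int size() {
        return map.size();
    }
}
```

With a type parameter on the dictionary, a call like dict.get(name) returns the entry type directly, so the cast at the call site goes away.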

Once we've got a dictionary, we want to add some entries to it. Code
to do this will typically look something like the following:


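    // Look for an existing entry with this name.
    IndexEntry mde = (IndexEntry) mainDict.get(name);
    if(mde == null) {
        // No such entry yet: have the dictionary create one, then store it.
        mde = mainDict.newEntry(name);
        mainDict.put(name, mde);
    }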

The code tries to retrieve an entry with the given name from the
mainDict dictionary. If there is no such entry, then one
is generated by the dictionary (which is why the constructor needs the
entry class). This entry is then put into the dictionary with the given
name.

The MemoryDictionary.newEntry
method is responsible for constructing an entry with the given name
and assigning that entry a numeric ID.

The newEntry method is actually kind of complicated,
because it's responsible for handling the difference between cased and
uncased entries. If the dictionary is using cased entries, the
newEntry method will make sure that the cased entry
points to the uncased entry, so that postings can be added to both
when necessary without having to do multiple dictionary lookups when
processing a token or field value.
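
Here's a minimal sketch of that cased-to-uncased link, to make the idea concrete. It is not Minion's implementation. The class names, the use of toLowerCase to derive the uncased form, the sequential IDs starting at 1, and the count field standing in for postings are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical entry: a name, a numeric ID, and an optional link to the uncased entry.
class CasedSketchEntry {
    final String name;
    final int id;
    CasedSketchEntry uncased; // null when this entry is itself the uncased form
    int count;                // stand-in for postings

    CasedSketchEntry(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Records an occurrence on this entry and, if linked, on the uncased entry,
    // without going back to the dictionary for a second lookup.
    void add() {
        count++;
        if (uncased != null) {
            uncased.count++;
        }
    }
}

class CasedSketchDictionary {
    private final Map<String, CasedSketchEntry> map = new HashMap<>();
    private int nextId = 1; // assumption: IDs are handed out sequentially

    CasedSketchEntry get(String name) {
        return map.get(name);
    }

    void put(String name, CasedSketchEntry entry) {
        map.put(name, entry);
    }

    // Builds an entry for the name and assigns it an ID. If the name differs from
    // its lower-cased form, the new entry is linked to the uncased entry, which is
    // created and stored here if it does not exist yet.
    CasedSketchEntry newEntry(String name) {
        CasedSketchEntry entry = new CasedSketchEntry(name, nextId++);
        String lower = name.toLowerCase(Locale.ROOT);
        if (!lower.equals(name)) {
            CasedSketchEntry u = map.get(lower);
            if (u == null) {
                u = new CasedSketchEntry(lower, nextId++);
                map.put(lower, u);
            }
            entry.uncased = u;
        }
        return entry;
    }
}
```

In this sketch, processing a token like "Dog" costs one dictionary lookup: the entry for "Dog" already holds a reference to the entry for "dog", so both can be updated from the single entry that was fetched.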

This is pretty much all of the action that a dictionary sees during
indexing: creating entries and fetching existing entries so that
postings can be added to them (I'll be covering the postings stuff
soon).

Up next: how a dictionary gets dumped (don't worry, it has a happy ending!)

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
