Thursday Mar 05, 2009

Dictionaries in Minion: Dumping a Dictionary

Once the indexer decides that it's time to dump a partition (because
memory's getting full or because it was told to by a call to SearchEngine.flush),
the partition dump basically reduces into dumping a bunch of
dictionaries and their associated postings data.

The MemoryDictionary.dump
method is responsible for dumping the dictionary. The first step in a
dictionary dump is to instantiate a DictionaryWriter.
This is a helper class that does most of the actual work of encoding
and writing the dictionary. We use a helper class because a lot of
the work done during the dump of a dictionary is also done when
merging dictionaries (more on that in an upcoming post.)

Once we're ready to start writing the dictionary, the entries in the
dictionary are sorted by their names. We sort the entries by name so
that we can use binary search to find entries at query time. We could
avoid the sorting step by keeping the entries in a
SortedMap at indexing time, but it turns out to be faster
to sort the entries when we're ready to dump them than to keep them
sorted while we're indexing them.

Although the dictionary assigned integer IDs to the entries during
indexing, at dump time we can take the opportunity to renumber the
entries so that the ID for an entry corresponds to its position in the
sorted list. If we do this, looking up a dictionary entry by ID can
be done very quickly at query time.

We do this renumbering for dictionaries like the main dictionary for a
partition (the one containing the terms from the documents), but we
skip it for the terms in the document dictionary. This renumbering
business is one of the places where the EIDAP philosophy gets a bit
burdensome. When you're sorting the entries in the document
dictionary, you don't want to renumber the entries, because the IDs
that were assigned to these entries have been encoded into the
postings lists for the terms in the main dictionary and we don't want
to have to re-encode all those postings.

In the cases where we don't renumber the entries, we need to keep a
map from the ID for an entry to the position in the dictionary, so
that we can continue to do fast lookups by ID at query time.

When we do renumber the entries, we can optionally keep a map from old
entry IDs to new entry IDs. The ID map from one dictionary can be
useful when dumping another dictionary's entries.

Once we've figured out the order that the entries will be dumped in
and what the entries' final IDs will be, each entry is written out to
the on-disk representation, via the DictionaryWriter.
The on-disk representation for the dictionary is a DictionaryHeader,
followed by four buffers.

The first buffer contains a standard 3-in-4 front coding of the names
of the entries. The basic idea is that we group the names of the
terms into groups of 4. The first name in a block is uncompressed,
the remaining names in the block have any initial substring shared with
the previous name removed and replaced with the length of the shared
initial substring.

Here's an example from an index I had lying around. A four word block
is composed of the words: evergreen, evergrey,
everlast, everlasting. These will get compressed
into a block of names that looks like:


where the numbers are actually bytes representing the initial
shared substring length. Thus 37 characters are compressed into 23. The
encoding of names is handled by classes that implement the NameEncoder
interface. The appropriate name encoder for a dictionary is passed
into the dump method for a dictionary.

The second buffer encodes the positions of the start of each of the
blocks of four terms in the dictionary. Because we need to be able to
quickly find the start of a block, these positions are encoded using
a fixed number of bytes.

The third buffer encodes the rest of the information contained in a
dictionary entry. This is stuff like the number of occurrences of the
term and the offset in the postings file where the postings associated
with the term can be found. Each entry knows how to encode this
information onto a buffer, and the encoding is done by the IndexEntry.encodePostingsInfo

The fourth buffer encodes the position in the third buffer of the
entry information for each entry. Again, we need to be able to find
the information for an entry quickly when we're fetching entries at
query time, so the offset is encoded using a fixed-length encoding.

As we're encoding the information for the entries, we also dump the
collected postings information to the appropriate postings files,
keeping track of the offsets where the postings are dumped and how
many bytes are in the postings.

Because we use the DictionaryWriter for merging
dictionaries too, we can't assume that the dictionary representation
can be stored completely in memory. Believe me, life would be easier
if we could, but this really does get to be a problem with large
indices. To work around this, when we're encoding the dictionary, we
use buffers that have an in-memory cache and are backed by a temporary
file. In the case of small dictionaries, this works out to be pretty
much a wash, because we only ever use the in-memory cache and never
have to write any of the data to the file.

The last stage of dumping the dictionaries is to write the dictionary
header and then transfer the encoded buffers into their final
destination file. We make sure this is done efficiently using NIO
channel transfer operations.


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« August 2016