Dictionaries in Minion: Merging

The last major function that we're going to investigate for dictionaries is merging them.

When indexing in Minion, we try to index as much data as possible in memory before dumping the data to the disk. Each chunk of data that we dump to the disk is called a partition. We usually call a partition that was dumped directly by the indexer a "first-level partition."

A partition is composed of a number of files, and the contents of these files are mostly dictionaries and their associated postings. As indexing proceeds and we dump more first-level partitions, the PartitionManager for the index will collect a number of first-level partitions of a given size (in terms of the number of documents that contain) and merge them together into a new partition. As you can imagine, we usually call the partition resulting from a merge of first-level partitions a "second-level partition". The PartitionManager will eventually merge second-level partitions into third-level ones, and so on.

Because of our EIDAP philosophy, all of these partition merges boil down to merging a number of dictionaries. So, for example, we build the main dictionary for a merged partition by merging together the main dictionaries from the partitions that we're merging.

The main dictionary merge method is pretty straightforward, given how central it is to how indexing works in Minion.

Let's walk through a merge for the main dictionaries for a number of partitions. The first thing that you will notice is that the method signature is pretty complicated. This is a bit of an annoyance, but the merge method has to deal with a number of different situations. Let's look at the parameters for the method:

IndexEntry entryFactory
This is an empty entry of the same type as is in the dictionaries we'll be merging (and, therefore, in the dictionary that we'll be creating. We'll use it to make new entries for the dictionaries.

NameEncoder encoder
An encoder for the names in the merged dictionary. In the case of the main dictionary for the partitions, this will be a StringNameHandler that will handle the 3-in-4 front coding of the terms in the dictionary.

PartitionStats partStats
A set of simple statistics about the partition that we may be collecting while merging the terms.

DiskDictionary[] dicts
The actual dictionaries to merge.

EntryMapper[] mappers
A set of mappers for the entries. These can be used (for example) to renumber entries in the dictionary during the merge.

int[] starts
The starting document IDs for the data from each of the partitions. During indexing, the document IDs are assigned per partition (i.e., the first document ID in a partition is always 1), so during merging we need to remap the document IDs from the merged partitions.

int[][] postIDMaps
A set of old-to-new mappings for the IDs encoded in the postings associated with this dictionary.

RandomAccessFile mDictFile
The file where the merged dictionary will be (eventually) written.

PostingsOutput[] postOut
The file(s) where the merged postings will be written

boolean appendPostings
In most cases, when merging dictionaries, the postings associated with the entries in the dictionary can simply be appended onto one another to generate the postings for the merged dictionaries. There are a few dictionaries where this is not the case.

The first thing that we do for the merge is to build a heap of iterators, one for each of the dictionaries that we're merging (note that DiskDictionary implements Iterable — we'll talk more about dictionary iterators later.) The heap is organized by the names of the entries currently at the head of the iterator. While there are still iterators on the heap, the merge loop proceeds as follows:

  1. peek at the top element of the heap, and make a new entry for the merged dictionary using the name of the top element. We set the ID of the new dictionary entry in sequence or using the provided mappers.

  2. While the element at the top of the heap has the same name as the current element:

    1. Remove the top element from the heap.

    2. Read the postings associated with the element, and append them onto the postings that we're building for the new entry.

    3. Keep track of the mapping for the ID of this entry in the dictionary from which it was drawn and in the new dictionary.

    4. Advance the iterator that was at the top of the heap. If there is another element on that iterator, put the iterator back on the heap.

    5. peek at the element at the top of the heap.

  3. Write the merged postings to the postings file

  4. Write the new, merged entry to the dictionary using a DictionaryWriter

See, simple, right? There's actually quite a bit of complexity hidden in some of those steps (most notably, appending or merging postings — we'll talk more about that later too), but the basics actually are pretty straightforward.


Post a Comment:
Comments are closed for this entry.

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« July 2016