Tuesday Mar 10, 2009

Dictionaries in Minion: Merging

The last major function that we're going to investigate for dictionaries is merging them.

When indexing in Minion, we try to index as much data as possible in memory before dumping the data to the disk. Each chunk of data that we dump to the disk is called a partition. We usually call a partition that was dumped directly by the indexer a "first-level partition."

A partition is composed of a number of files, and the contents of these files are mostly dictionaries and their associated postings. As indexing proceeds and we dump more first-level partitions, the PartitionManager for the index will collect a number of first-level partitions of a given size (in terms of the number of documents that contain) and merge them together into a new partition. As you can imagine, we usually call the partition resulting from a merge of first-level partitions a "second-level partition". The PartitionManager will eventually merge second-level partitions into third-level ones, and so on.

Because of our EIDAP philosophy, all of these partition merges boil down to merging a number of dictionaries. So, for example, we build the main dictionary for a merged partition by merging together the main dictionaries from the partitions that we're merging.

The main dictionary merge method is pretty straightforward, given how central it is to how indexing works in Minion.

Let's walk through a merge for the main dictionaries for a number of partitions. The first thing that you will notice is that the method signature is pretty complicated. This is a bit of an annoyance, but the merge method has to deal with a number of different situations. Let's look at the parameters for the method:

IndexEntry entryFactory
This is an empty entry of the same type as is in the dictionaries we'll be merging (and, therefore, in the dictionary that we'll be creating. We'll use it to make new entries for the dictionaries.

NameEncoder encoder
An encoder for the names in the merged dictionary. In the case of the main dictionary for the partitions, this will be a StringNameHandler that will handle the 3-in-4 front coding of the terms in the dictionary.

PartitionStats partStats
A set of simple statistics about the partition that we may be collecting while merging the terms.

DiskDictionary[] dicts
The actual dictionaries to merge.

EntryMapper[] mappers
A set of mappers for the entries. These can be used (for example) to renumber entries in the dictionary during the merge.

int[] starts
The starting document IDs for the data from each of the partitions. During indexing, the document IDs are assigned per partition (i.e., the first document ID in a partition is always 1), so during merging we need to remap the document IDs from the merged partitions.

int[][] postIDMaps
A set of old-to-new mappings for the IDs encoded in the postings associated with this dictionary.

RandomAccessFile mDictFile
The file where the merged dictionary will be (eventually) written.

PostingsOutput[] postOut
The file(s) where the merged postings will be written

boolean appendPostings
In most cases, when merging dictionaries, the postings associated with the entries in the dictionary can simply be appended onto one another to generate the postings for the merged dictionaries. There are a few dictionaries where this is not the case.

The first thing that we do for the merge is to build a heap of iterators, one for each of the dictionaries that we're merging (note that DiskDictionary implements Iterable — we'll talk more about dictionary iterators later.) The heap is organized by the names of the entries currently at the head of the iterator. While there are still iterators on the heap, the merge loop proceeds as follows:

  1. peek at the top element of the heap, and make a new entry for the merged dictionary using the name of the top element. We set the ID of the new dictionary entry in sequence or using the provided mappers.

  2. While the element at the top of the heap has the same name as the current element:

    1. Remove the top element from the heap.

    2. Read the postings associated with the element, and append them onto the postings that we're building for the new entry.

    3. Keep track of the mapping for the ID of this entry in the dictionary from which it was drawn and in the new dictionary.

    4. Advance the iterator that was at the top of the heap. If there is another element on that iterator, put the iterator back on the heap.

    5. peek at the element at the top of the heap.

  3. Write the merged postings to the postings file

  4. Write the new, merged entry to the dictionary using a DictionaryWriter

See, simple, right? There's actually quite a bit of complexity hidden in some of those steps (most notably, appending or merging postings — we'll talk more about that later too), but the basics actually are pretty straightforward.

Wednesday Feb 25, 2009

Dictionaries in Minion: An Introduction

As I've mentioned previously, our design philosophy for Minion is that everything is a dictionary and postings.

To recap, in Minion, a dictionary is a map from entry names to entries. The entry name may be a term from a document, a value from a saved field, or a bigram that will be used for doing wildcard matching. The entry pointed to by the name contains information about that name, including frequency of occurrence, the dictionary that the entry was drawn from, and (most importantly) a pointer to the postings associated with that entry.

There are actually two kinds of dictionaries in Minion. The first kind is a MemoryDictionary, which is used to hold a dictionary at indexing time. The second kind is a DiskDictionary, which is used to hold a dictionary at query time. Both of these dictionary types are implementations of the Dictionary interface that describes the capabilities that all dictionaries share.

Over the next few posts I'll be describing the entries that you'll find in dictionaries, and how the MemoryDictionary and DiskDictionary classes are used in Minion.

Posts in this series:

Tuesday Feb 24, 2009

My new favorite kind of music

I was working on computing tag-tag similarities for last.fm tags for Frank for a 50K artist crawl that we did last month using an Aura instance running on EC2.

I wrote a quick program to pull the document vectors for the tags (the tags here are "documents" and the artists to whom the tags have been applied are the "words" in those documents.)

Given the vectors, it was easy to compute the complete similarity table for the tags and output the similarities for each tag in decreasing similarity order.

For 1500 tags pulled from an index of 1.8 million documents, this takes about 55 seconds to run.

I wanted to make sure that it was doing something reasonable, so I had it dump the top 10 similar tags for each tag as it was running, and I found this:

19/1510 artist-tag:8bit computing 1510 similarities
Most similar: ["<artist-tag:8bit, 1.000>", "<artist-tag:chiptune, 0.795>", "<artist-tag:chiptunes, 0.726>", "<artist-tag:bitpop, 0.584>", "<artist-tag:chipmusic, 0.341>", "<artist-tag:blipblop, 0.258>", "<artist-tag:nintendocore, 0.194>", "<artist-tag:nintendo, 0.189>", "<artist-tag:c64, 0.174>", "<artist-tag:vgm, 0.166>"]

Can you tell what caught my eye? Nintendocore? Awesome!

Wednesday May 21, 2008

Open Source TREC: TRECmentum!

Great minds think alike1, I suppose. Grant Ingersoll, a core committer for Lucene posted last week about open source search engine relevance, proposing a TREC-like evaluation for open source engines. His much more comprehensive post explains a lot of what I mentioned in passing in a post a little while ago.

Getting the TREC data is definitely a barrier to entry to someone who wants to just try the TREC ad-hoc queries to see how their favorite engine holds up against the standard TREC evaluation measures. At Sun, we actually participated in TREC 1.5 times and we've intended to compete other times, so we have all of the data but that doesn't actually do anyone else any good.

It also does take a fair amount of hardware to run a TREC system through its paces. In 2000, we borrowed about 20 Ultrasparc machines to distribute indexing and retrieval for the question answering task. Of course, we messed up by deciding to compete the week before the evaluation data was going out!

Grant mentions a number of collections that we could use for the evaluation. I think we should collect up as many mail archives as we could get our hands on as well (I, for example, could see about getting the OpenSolaris mailing lists and associated queries) since that data tends to have an interesting structure and it could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that was helpful.

I guess what I'm saying is "sign me up". I think this could be a great benefit to all of the open source engine communities. TREC certainly was to the academic and commercial communities.

Update: Paul reminds me in an email that we have a blog crawler and can pretty easily generate several million documents. We wouldn't have queries, though.

1. But as my grandfather always used to say: "Great minds think alike, but fools seldom differ."


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« August 2016