Google's Postings Format
By searchguy on Mar 11, 2009
It's a pretty interesting look at how Google has grown over the years. Of particular interest to me as a search guy is the description of some of their index formats. On slide 47 there's a pretty detailed description of the postings lists format that they were using before they went to an all-in-memory index. They were using variable length encodings (like Gamma and Huffman encoding) for some of the data, in order to get the most compact on-disk representation that could still be decoded quickly.
This was what the initial versions of the Minion engine did as well.
On slide 54, he starts to talk about the formats that they considered when they went to an all-in-memory index. Given that they could hold the whole thing in memory, they needed to use a format that could be decoded as quickly as possible, which led them to use byte-aligned variable length encodings.
They started with what they call a Varint encoding, which is the same encoding that Minion and Lucene use. The following slides detail a number of variations of this encoding scheme, culminating in slide 63, where they describe encoding groups of four numbers together and pulling out two-bit prefixes into a leading byte, in order to reduce the amount of masking and shifting necessary to decode an integer.
He says the Varint encoding could be decoded at the rate of 180M integers per second, while their "group varint" approach could decode at the rate of 400M integers per second. He points out that this means that you can only encode 30 bits per integer, but that's not much of a restriction for typical postings data (document ID deltas and positions).
I think I might have to give this one a try myself, in my copious free time.