bloom filters in a nutshell

Bloom filters are a charming data structure. A predecessor idea dates from pre-electronic days, when decks of Hollerith punch cards would be queried in the 1940’s by inserting a rod into one of M holes along the top of the deck. Only cards with notches at the selected position would then drop out of the deck for further processing. Repeating the procedure K times would select only cards with notches at all K positions. A false drop occurs when a card happens to have the required K notches, but not for the reason it is being sought.

Similarly, a Bloom filter (dating from 1970) is an array of M bits which is queried at K quasi-randomly selected positions pk (k < K). If all of the bits are set, then the query returns positive, indicating that someone has already visited the array, setting the bits at all the positions pk. If this happened by chance, perhaps because of several independent visits, we get an error called a false positive or (with a nod to tradition) a false drop.

The filters are quite simple, but the math is a little slippery until you get the right grip on it. Here’s the way I like to grab it, presented in case it helps anyone else.

First of all, the bottom line: Size your Bloom filter to contain NK bits, plus an overhead of 44%. Put another way, for an error rate of ε, allocate lg(1/ε)·lg(e) bits for each key you intend your filter to hold.

Loading a Bloom filter is much like spraying bits randomly into a bit-array target. If the bits land randomly, the key measure of what happens is number of sprayed bits per bit of target. Call this density ρ (“rho”).

An easy analysis shows that any given bit in the array has dodged all the “bullets” with a probability of P0 = exp(-ρ).

Sanity tests on this formula work nicely: ρ=0 ⇒ P0=1 (no bullets), and ρ=large ⇒ P0=small.

Interesting cases arise when you are attempting to fill the array half full. (Why half full? That’s when the array is at maximum entropy, when the most information can be packed into it. As a programmer, I want each array bit to have close to a 50-50 probability of being 0 or 1. Otherwise, I think I’m wasting memory space.)

For our first try, let’s spray half as many bits as are in the array. But ρ=1/2 ⇒ P0=0.61. That is, I only hit 39% of the target. There are too many zeros, because some target bits got hit more than once!

In order to hit exactly half of the array targets, we have to choose a larger ρ such that P0=0.5. Solving gets us ρ50=ln(2)=1/lg(e)=0.69. Notice that we haven’t said anything about target size at all. It is the density that counts.

For Bloom filters, each bit is sprayed by one hash function for one key. (A hash function produces a quasi-random index less than M given the bits of a key. The key can be of any size. Typical hash functions select bunches of bits from the key and XOR them together, to produce each bit of hash.) So if you have K hash functions times N keys, and a bit-table of size M, then ρ=NK/M.

(The choice of K is proportional to the log of the error rate you are shooting for. If you ask K questions, the chance of the Bloom filter “guessing” all K answers correctly, and spoofing your key, is exponential in K. It is 1/2K if the filter is tuned to a 50-50 distribution of bits.)

Repeating the failed experiment above, if you were to try for a half-full target, you would size it as M=2NK, and spray in your NK bits. But instead of a 50-50 distribution, you’d get a bunch of extra zeroes.

That in turn would drive down your error rate (which would be nice) but it would waste memory to store all those extra zeroes. The math (not given here) says that you could get the same error rate with a smaller M by increasing K. (Actually, in real applications, K must be an integer, so K+1 might overshoot the sweet spot.)

It turns out that the optimum size for M, if you are allowed to vary K but preserve the error rate, is the size which fills the target array with nice clean 50-50 bits. Our programmer’s intuition about entropy (mentioned above) is vindicated.

So, solving for a 50-50 bit distribution in terms of NK gives M = NK/ρ50 = NK·lg(e) = NK·1.44.

There are many more tricks and variations on Bloom filters. One brief introduction is at cap-lore.com and there’s lots of stuff at Wikipedia. The nicest presentation of the math I have seen is Broder & Mitzenmacher. Enjoy!

Comments:

This reminds me of the spelling checker that Doug McIlroy wrote in 1978 for Unix. Its dictionary contained about 2\^15 English words; it was originally three times as large, but removing common prefixes and suffixes shrank it. Then the words were stored in a bitwise hash table (turn on the bit if the word is in the dictionary) of 2\^27 bits. That size was chosen to get about 1 false positive in every 2\^12 hits, or approximately 1% of all runs of the program, given that normal program runs reported about 2\^5 misspellings at most. Run-length encoding the zero bits allowed the table to be stuffed into the 2\^16 byte data space of the PDP-11.

Posted by John Cowan on February 23, 2009 at 11:33 AM PST #

John C.: Nice exposition! I remember puzzling over that code when I first saw it in the '80s. It inspired me to do a little dictionary hacking of my own. I wish I could say it was my first exposure to Bloom filters, except that it's not multi-hashed. It could have been...

Here's a modern reference: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/hashlook.c

Posted by John Rose on February 23, 2009 at 12:30 PM PST #

Hi John,

What would be some of the good hash functions (ie fast and random enough) you know of that would do a good job for use in a Bloom Filter ?

Thanks,
Hanson

Posted by Hanson Char on November 25, 2011 at 11:40 AM PST #

Hanson: If you want very random, use a crypto step like SHA. If you want sort-of-random but sort-of-fast, use a handful of shifts and xors (or adds). It's hard to generalize. Look at conditioning functions like java.util.HashMap.hash. Ideally, each bit of input should pseudo-randomly affect each bit of output.

There's lots of good stuff here: http://burtleburtle.net/bob/hash

The x86 ISA includes some useful building blocks for hash construction, such as CRC32, which mixes 32-to-32 reasonably well. Even a multiply is useful (if it's fast enough). Choose a random odd constant multiplier, and quality test it on sample inputs. Use the most significant half of the multiplication, if you can get it, since the high bits mix differently from the low bits.

Once you have a bunch of mixed bits, just cut them up into bit-fields (or use mod operations) to get the individual Bloom filter indexes. The important thing is that a small random change in the hash input should cause a random change in *each* Bloom filter index.

Posted by John Rose on December 01, 2011 at 01:16 PM PST #

Post a Comment:
  • HTML Syntax: NOT allowed
About

John R. Rose

Java maven, HotSpot developer, Mac user, Scheme refugee.

Once Sun and present Oracle engineer.

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today