bloom filters in a nutshell
By John.Rose-Oracle on Feb 23, 2009
Mholes along the top of the deck. Only cards with notches at the selected position would then drop out of the deck for further processing. Repeating the procedure
Ktimes would select only cards with notches at all
Kpositions. A false drop occurs when a card happens to have the required
Knotches, but not for the reason it is being sought.
Similarly, a Bloom filter (dating from 1970) is an array of
M bits which is queried at
K quasi-randomly selected positions
k < K). If all of the bits are set, then the query returns positive, indicating that someone has already visited the array, setting the bits at all the positions
pk. If this happened by chance, perhaps because of several independent visits, we get an error called a false positive or (with a nod to tradition) a false drop.
The filters are quite simple, but the math is a little slippery until you get the right grip on it. Here’s the way I like to grab it, presented in case it helps anyone else.
First of all, the bottom line: Size your Bloom filter to contain
NK bits, plus an overhead of 44%. Put another way, for an error rate of
lg(1/ε)·lg(e) bits for each key you intend your filter to hold.
Loading a Bloom filter is much like spraying bits randomly into a bit-array target. If the bits land randomly, the key measure of what happens is number of sprayed bits per bit of target. Call this density
An easy analysis shows that any given bit in the array has dodged all the “bullets” with a probability of
P0 = exp(-ρ).
Sanity tests on this formula work nicely:
ρ=0 ⇒ P0=1 (no bullets), and
ρ=large ⇒ P0=small.
Interesting cases arise when you are attempting to fill the array half full. (Why half full? That’s when the array is at maximum entropy, when the most information can be packed into it. As a programmer, I want each array bit to have close to a 50-50 probability of being 0 or 1. Otherwise, I think I’m wasting memory space.)
For our first try, let’s spray half as many bits as are in the array. But
ρ=1/2 ⇒ P0=0.61. That is, I only hit 39% of the target. There are too many zeros, because some target bits got hit more than once!
In order to hit exactly half of the array targets, we have to choose a larger
ρ such that
P0=0.5. Solving gets us
ρ50=ln(2)=1/lg(e)=0.69. Notice that we haven’t said anything about target size at all. It is the density that counts.
For Bloom filters, each bit is sprayed by one hash function for one key. (A hash function produces a quasi-random index less than
M given the bits of a key. The key can be of any size. Typical hash functions select bunches of bits from the key and XOR them together, to produce each bit of hash.) So if you have
K hash functions times
N keys, and a bit-table of size
M, then ρ=NK/M.
(The choice of
K is proportional to the log of the error rate you are shooting for. If you ask
K questions, the chance of the Bloom filter “guessing” all
K answers correctly, and spoofing your key, is exponential in
K. It is
1/2K if the filter is tuned to a 50-50 distribution of bits.)
Repeating the failed experiment above, if you were to try for a half-full target, you would size it as
M=2NK, and spray in your
NK bits. But instead of a 50-50 distribution, you’d get a bunch of extra zeroes.
That in turn would drive down your error rate (which would be nice) but it would waste memory to store all those extra zeroes. The math (not given here) says that you could get the same error rate with a smaller
M by increasing
K. (Actually, in real applications,
K must be an integer, so
K+1 might overshoot the sweet spot.)
It turns out that the optimum size for
M, if you are allowed to vary
K but preserve the error rate, is the size which fills the target array with nice clean 50-50 bits. Our programmer’s intuition about entropy (mentioned above) is vindicated.
So, solving for a 50-50 bit distribution in terms of
M = NK/ρ50 = NK·lg(e) = NK·1.44.
There are many more tricks and variations on Bloom filters. One brief introduction is at cap-lore.com and there’s lots of stuff at Wikipedia. The nicest presentation of the math I have seen is Broder & Mitzenmacher. Enjoy!