Friday Dec 21, 2007

The importance of good hash functions

I have always been told that it is important to design hash functions carefully in order to ensure that they return values distributed evenly across the entire range of possible values. Otherwise, you risk having many distinct keys map to the same bucket in the hash table, and hash lookups in the worst case degenerate into linked list traversals. Although this sounds like a very good piece of advice in theory, I had never seen a real-world example where a naive implementation of a hash function did any harm, until now.
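To make the failure mode concrete (this example is mine, not from the test), here is a key class with a deliberately terrible hashCode: every instance lands in the same bucket, so each HashMap lookup has to walk the whole bucket, comparing keys with equals one by one:

```java
import java.util.HashMap;
import java.util.Map;

// A key whose hashCode is constant: every instance collides.
class BadKey {
    private final int id;
    BadKey(int id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }

    @Override
    public int hashCode() {
        // All keys hash to the same bucket. On the JVMs of the time the
        // bucket was a plain linked list, so get() degrades to a linear scan.
        // (Newer JVMs treeify large bins, which softens but doesn't fix this.)
        return 42;
    }
}

public class BadHashDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), i);
        }
        // The result is still correct; it's only the lookup cost that explodes.
        System.out.println(map.get(new BadKey(1234))); // prints 1234
    }
}
```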

I was making some changes to a stress test that ran simple join queries against an Apache Derby database. The changes were quite innocent: I was just changing the order in which the tables and indices used in the joins were created, so I didn't expect them to influence the test noticeably. But they did! When I got the test back up and running, its performance had dropped to less than 50% of what it was before.

My first thought was that the changed table/index order had somehow tricked Derby's query optimizer into choosing a different query execution plan than before. If the optimizer makes a bad decision (or an uninformed decision, as in a case recently discussed on the derby-user mailing list), performance drops of this size, or even greater, may be experienced. However, when I made Derby print the query plans, I couldn't see any difference at all.

The next step was to profile the test in order to find out why it took so long. I have found collect/analyzer from Sun Studio to be a valuable tool for such tasks. You simply start your application from the command line with "collect -j on" followed by the command you normally start it with:

  collect -j on java MyApp arg1 arg2 ...

When I opened the collected performance data in analyzer, one method stood out in the profile.

Okay, more than 60% of the CPU time was spent in RecordId.equals(), whose only job is to check the equality of four integer fields in two RecordId objects. Since it's a cheap method, it must have been called very frequently to account for that large a share of the total CPU time. Further investigation showed that almost all the calls to this method came from a call to ConcurrentHashMap.get() in Derby's lock manager, which used RecordId objects as hash table keys. However, there was nothing in the profiling data indicating that ConcurrentHashMap.get() was invoked more frequently now than it was before. So there had to be more calls to RecordId.equals() per call to ConcurrentHashMap.get().
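For context, an equals method of that shape looks roughly like the sketch below (the field names are my guesses, not Derby's actual source): each call is just four int comparisons, cheap in isolation, but expensive when a long collision chain forces many calls per hash lookup.

```java
// Illustrative sketch only -- field names are guesses, not Derby's code.
// hashCode is deliberately omitted here; it is the subject of the rest
// of this post.
class RecordIdSketch {
    final int containerId, pageNumber, recordNumber, slotNumber;

    RecordIdSketch(int containerId, int pageNumber,
                   int recordNumber, int slotNumber) {
        this.containerId = containerId;
        this.pageNumber = pageNumber;
        this.recordNumber = recordNumber;
        this.slotNumber = slotNumber;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof RecordIdSketch)) return false;
        RecordIdSketch other = (RecordIdSketch) o;
        // Four int comparisons: cheap per call, but a hash table with many
        // collisions calls this once per entry in the colliding bucket.
        return containerId == other.containerId
                && pageNumber == other.pageNumber
                && recordNumber == other.recordNumber
                && slotNumber == other.slotNumber;
    }
}

public class EqualsDemo {
    public static void main(String[] args) {
        RecordIdSketch a = new RecordIdSketch(1, 42, 7, 0);
        RecordIdSketch b = new RecordIdSketch(1, 42, 7, 0);
        System.out.println(a.equals(b)); // prints true
    }
}
```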

Now, what could cause that? The only thing I could think of was collisions in the hash table caused by a bad hash function. I inserted some print statements in the Derby code and found out that the test visited about 350 000 distinct records in the database. In the original version of the test, 350 000 distinct RecordId values mapped to only 12 000 distinct hash values, which is bad to start with. In the new version of the test, the number of distinct hash values had dropped even further, down to 6000. When I looked at the calculation of the hash value in RecordId.hashCode(), it was pretty obvious how this could happen. The hash value was constructed by taking the bitwise XOR of these four integers:

- A record number, which is unique within the page on which the record exists
- The id of the page containing the record
- The id of the container (that is, table or index) containing the page
- A fourth variable, which is uninteresting in this discussion since it's always 0

Since all of these variables are essentially counters starting from 0, they tend to stay in the lower value range, and all the higher bits are unused. When you take the bitwise XOR of such integers, the higher bits will still be unused, and you'll end up with lots of values clustered in a relatively small range. This basically means that collisions are bound to happen. When I changed the table and index creation order in the test, I must have been unlucky and got less variation in the containerId variable for the tables involved in the join, hence the lower number of distinct hash values and the increased frequency of collisions.
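The clustering is easy to demonstrate outside Derby. In this standalone simulation of mine (the ranges are made up, not taken from the test), I enumerate all triples of counters below 100, with the fourth field fixed at 0 as in the post. Since the XOR of values below 128 is itself below 128, a million distinct keys can share at most 128 hash values:

```java
import java.util.HashSet;
import java.util.Set;

public class XorClusterDemo {
    // Count distinct XOR-based hash values over all triples in [0, range).
    static int distinctXorHashes(int range) {
        Set<Integer> hashes = new HashSet<>();
        for (int record = 0; record < range; record++)
            for (int page = 0; page < range; page++)
                for (int container = 0; container < range; container++)
                    hashes.add(record ^ page ^ container ^ 0); // 4th field always 0
        return hashes.size();
    }

    public static void main(String[] args) {
        // One million distinct (record, page, container) keys, but since all
        // three values fit in 7 bits, their XOR also fits in 7 bits: at most
        // 128 distinct hash values survive.
        System.out.println(distinctXorHashes(100) + " distinct hash values");
    }
}
```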

So, now the problem had been identified: the hash function is biased. But how could it be fixed? I mentioned that I've always heard that careful design is needed for hash functions. Unfortunately, I must have missed the lecture where we were taught how to write good hash functions, if there ever was one. Luckily, after a couple of unsuccessful attempts to fix it, I discovered that NetBeans had this nice Insert Code menu.

I only needed to pick Generate hashCode() and tell which fields I wanted to include in the hash code calculation. NetBeans then generated a RecordId.hashCode() method which looked like this:

    public int hashCode() {
        int hash = 7;
        hash = 89 * hash + pageId.hashCode();
        hash = 89 * hash + recordId;
        return hash;
    }
It doesn't look much more complex than the old method, which only XOR'ed the fields, but thanks to the repeated addition and multiplication by a prime number, the hash values will be distributed across a wider range, even if the individual fields don't differ very much. (Apparently, this is very similar to the approach suggested by Joshua Bloch in his Effective Java Programming Language Guide.) I repeated the procedure for PageKey.hashCode(), which is called from RecordId.hashCode().
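To see the difference, we can hash the same simulated key space both ways. This is my own standalone comparison, not Derby's code, and the order in which the fields are mixed in is an arbitrary choice of mine; the point is only that the XOR collapses the keys onto a handful of values while the prime recipe spreads them out:

```java
import java.util.HashSet;
import java.util.Set;

public class HashSpreadDemo {
    // The old style: plain bitwise XOR of the fields.
    static int xorHash(int record, int page, int container) {
        return record ^ page ^ container;
    }

    // The Bloch-style recipe: repeated multiply-by-prime and add.
    static int primeHash(int record, int page, int container) {
        int hash = 7;
        hash = 89 * hash + container;
        hash = 89 * hash + page;
        hash = 89 * hash + record;
        return hash;
    }

    public static void main(String[] args) {
        Set<Integer> xor = new HashSet<>();
        Set<Integer> prime = new HashSet<>();
        for (int record = 0; record < 100; record++)
            for (int page = 0; page < 100; page++)
                for (int container = 0; container < 100; container++) {
                    xor.add(xorHash(record, page, container));
                    prime.add(primeHash(record, page, container));
                }
        // Same one million keys; vastly different spread of hash values.
        System.out.println("XOR:   " + xor.size() + " distinct values");
        System.out.println("Prime: " + prime.size() + " distinct values");
    }
}
```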

Then, when I reran the test with the modified Derby code, the performance was back at its normal level, more than twice as high as without the fixed hash functions. And the 350 000 records that previously mapped to 12 000 or 6000 distinct hash values now mapped to 280 000 distinct hash values. Not bad to get such an improvement with just a couple of mouse clicks in NetBeans, and without writing a single line of code myself! :)



