Friday May 16, 2008

JavaOne 2008

I spent last week at JavaOne in San Francisco. Although there were many interesting sessions, I found it even more interesting to be at the Java DB booth and talk to users and fellow developers who popped by. The general impression was that the users were very happy with Java DB/Apache Derby, especially with how easy it was to get up and working. Not surprisingly, the most frequently requested feature was support for LIMIT and OFFSET in SQL. (Yes, I did tell them about the ROW_NUMBER function, but because of the current limitations of the implementation and the more verbose syntax, it didn't quite match what they wanted.)
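For the record, the ROW_NUMBER workaround looks roughly like this (a sketch against a hypothetical table T; the implementation at the time only accepted an empty OVER() clause, so the numbering follows whatever order the rows happen to be returned in):

```sql
-- Fetch "rows" 11 through 20 without LIMIT/OFFSET support.
-- T is a hypothetical table; the subquery must be given an alias.
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER () AS rownum, T.*
    FROM T
) AS numbered
WHERE rownum BETWEEN 11 AND 20
```

Compared to a plain LIMIT 10 OFFSET 10, it's easy to see why the booth visitors found this verbose.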

Also, many of our new colleagues from MySQL visited us at the Java DB booth. The picture below shows Geir Høydalsvik (left) welcoming Mårten Mickos to the Java DB team. Mårten, we're still waiting for your first patch! :)

Sunday Apr 27, 2008

Derby 10.4 is released

Apache Derby 10.4 has just been released. This is the first official release on the 10.4 branch, and it introduces many new features, improvements and bug fixes, including:

  • Asynchronous replication with manual failover
  • Table functions that allow you to query non-relational data using SQL
  • Monitoring with Java Management Extensions (JMX)
  • Performance improvements (better scalability on multi-core machines and caching of session state and compiled statements in the network client driver)
  • Fix for a data corruption bug that could make the database unrecoverable

For download details and the full list of improvements, see the announcement from Dyre Tjeldvoll, who was the release manager for this release.

Java DB 10.4.1, which is based on Apache Derby, has also been released and is available for download here.

Monday Feb 04, 2008

Derby documentation quick search in Firefox

Since Derby's documentation is spread across a number of different manuals, and I have never actually tried to learn which manual contains what, I usually end up using Google when I need to look something up.

So, say I wanted to look up the syntax for ALTER TABLE: I would type "derby alter table" in the search bar. The results from that search, however, only contained links to sections about ALTER TABLE in old Derby manuals archived on the Apache site, and a number of hits in forums and mailing list archives. To find it in the most recent manuals, I would therefore also append "" which would limit the search to the alpha version manuals. That's tedious to type, though, and hits in the Portuguese translation of the manual tended to be at the top of the list (and I don't read Portuguese that well... :) ). Using "Advanced Search" and limiting the search to English pages would be even more tedious.

Finally, I got sufficiently tired of typing and scrolling, and came up with a quicker way of doing it. Firefox has this quick search function, which allows you to type a keyword that tells which search engine to use, followed by the search terms, and then Firefox automatically jumps to a URL that performs the search. To create my own quick search for the Derby documentation, I selected Bookmarks->Organize Bookmarks and added a bookmark in the "Quick Searches" folder with the following properties:

Name: Derby documentation
Keyword: derby
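The third piece is the bookmark's Location, which must contain a URL with a %s placeholder; Firefox substitutes the placeholder with whatever follows the keyword in the address bar. As a sketch (the exact URL is my assumption: a Google search restricted to English pages in the in-development Derby manuals, matching the approach described above):

```
Location: http://www.google.com/search?q=%s+site:db.apache.org/derby/docs/dev&lr=lang_en
```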

Now, I only need to go to Firefox's address bar and type "derby alter table", and Firefox sends me to this page, which shows exactly what I want.

Friday Dec 21, 2007

The importance of good hash functions

I have always been told that it is important to design hash functions carefully to ensure that they return values distributed evenly across the entire range of possible values. Otherwise, you risk having many distinct keys map to the same bucket in the hash table, and hash lookups in the worst case degenerate into linked-list traversals. Although this sounds like a very good piece of advice in theory, I had never seen a real-world example where a naive implementation of a hash function did any harm, until now.

I was making some changes to a stress test that ran simple join queries against an Apache Derby database. The changes were quite innocent: I was just changing the order in which the tables and indices used in the joins were created, so I didn't expect them to influence the test noticeably. But they did! When I got the test back up and running, its performance had dropped to less than 50% of what it had been before.

My first thought was that the changed table/index order had somehow tricked Derby's query optimizer into choosing a different query execution plan than before. If the optimizer makes a bad decision (or an uninformed decision, like in this case recently discussed on the derby-user mailing list), performance drops of this magnitude, or even greater, can occur. However, when I made Derby print the query plans, I couldn't see any difference at all.

The next step was to profile the test in order to find out why it took so long. I have found collect/analyzer from Sun Studio to be a valuable tool for such tasks. You simply start your application from the command line with "collect -j on" followed by the command you normally start it with:

  collect -j on java MyApp arg1 arg2 ...

When I opened the collected performance data in analyzer, this is what I saw:

Okay, more than 60% of the CPU time was spent in RecordId.equals(), whose only job is to check the equality of four integer fields in two RecordId objects. Since it's a cheap method, it must have been called very frequently to account for that much of the total CPU time. Further investigation showed that almost all the calls to this method came from a call to ConcurrentHashMap.get() in Derby's lock manager, which used RecordId objects as hash table keys. However, there was nothing in the profiling data indicating that ConcurrentHashMap.get() was invoked more frequently than before. So there had to be more calls to RecordId.equals() per call to ConcurrentHashMap.get().
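The effect is easy to reproduce outside Derby. The sketch below (illustrative only, nothing to do with the real RecordId class) puts a handful of keys with a deliberately constant hashCode() into a ConcurrentHashMap; since every key lands in the same bucket, each get() has to walk the bucket and call equals() on entry after entry until it finds a match:

```java
import java.util.concurrent.ConcurrentHashMap;

public class CollisionDemo {
    // Illustrative key class with an intentionally terrible hash function.
    static class BadKey {
        static int equalsCalls = 0;
        final int id;

        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() { return 42; } // every key collides
    }

    public static void main(String[] args) {
        ConcurrentHashMap<BadKey, Integer> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 7; i++) {
            map.put(new BadKey(i), i);
        }
        BadKey.equalsCalls = 0;
        for (int i = 0; i < 7; i++) {
            map.get(new BadKey(i));
        }
        // With all keys in one bucket, these 7 lookups trigger far more
        // than 7 equals() calls; with 7 distinct hash values it would be
        // at most one equals() call per lookup.
        System.out.println("equals() calls: " + BadKey.equalsCalls);
    }
}
```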

Now, what could cause that? The only thing I could think of was collisions in the hash table caused by a bad hash function. I inserted some print statements in the Derby code and found out that the test visited about 350 000 distinct records in the database. In the original version of the test, 350 000 distinct RecordId values mapped to only 12 000 distinct hash values, which is bad to start with. In the new version of the test, the number of distinct hash values had dropped even further, to 6 000. When I looked at the calculation of the hash value in RecordId.hashCode(), it was pretty obvious how this could happen. The hash value was constructed by taking the bitwise XOR of these four integers:

  • A record number which is unique within the page on which the record exists
  • The id of the page containing the record
  • The id of the container (that is, table or index) containing the page
  • A fourth variable which is uninteresting in this discussion, since it's always 0

Since all of these variables are essentially counters starting from 0, they tend to be in the lower value range, and all the higher bits are unused. When you take the bitwise XOR of such integers, the higher bits will still be unused, and you'll end up with lots of values clustered in a relatively small range. This basically means that collisions are bound to happen. When I changed the table and index creation order in the test, I must have been unlucky and ended up with less variation in the containerId variable for the tables involved in the join, hence the lower number of distinct hash values and the increased frequency of collisions.
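To illustrate the difference (this is a standalone sketch with made-up method names and value ranges, not actual Derby code), compare the number of distinct hash values produced by a plain XOR of two small counters with the number produced by a multiply-by-prime combination of the same counters:

```java
import java.util.HashSet;
import java.util.Set;

public class HashSpread {
    // Plain XOR of small counters: the high bits stay zero,
    // so all results cluster in a small range.
    static int xorHash(int pageNumber, int recordNumber) {
        return pageNumber ^ recordNumber;
    }

    // Multiply-and-add with a prime: earlier fields get shifted
    // up into the higher bits, spreading the results out.
    static int primeHash(int pageNumber, int recordNumber) {
        int hash = 7;
        hash = 89 * hash + pageNumber;
        hash = 89 * hash + recordNumber;
        return hash;
    }

    public static void main(String[] args) {
        Set<Integer> xorValues = new HashSet<>();
        Set<Integer> primeValues = new HashSet<>();
        // 10 000 distinct (page, record) pairs, both counters small.
        for (int page = 0; page < 100; page++) {
            for (int record = 0; record < 100; record++) {
                xorValues.add(xorHash(page, record));
                primeValues.add(primeHash(page, record));
            }
        }
        // XOR can produce at most 128 distinct values here, while the
        // prime-based hash produces thousands.
        System.out.println("XOR:   " + xorValues.size() + " distinct values");
        System.out.println("Prime: " + primeValues.size() + " distinct values");
    }
}
```

Note how xorHash(1, 0) and xorHash(0, 1) collide, while the prime-based hash keeps them apart because the order of the fields matters.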

So, now the problem had been identified: the hash function is biased. But how could it be fixed? I mentioned that I've always heard that careful design is needed for hash functions. Unfortunately, I must have missed the lecture where we were taught how to write good hash functions, if there ever was one. Luckily, after a couple of unsuccessful attempts to fix it, I discovered that NetBeans had this nice Insert Code menu.

I only needed to pick Generate hashCode() and tell which fields I wanted to include in the hash code calculation. NetBeans then generated a RecordId.hashCode() method which looked like this:

    public int hashCode() {
        int hash = 7;
        hash = 89 * hash + pageId.hashCode();
        hash = 89 * hash + recordId;
        return hash;
    }
It doesn't look much more complex than the old method which only XOR'ed the fields, but thanks to the repeated addition and multiplication with a prime number, the hash values will be distributed across a wider range, even if the individual fields don't differ very much. (Apparently, this is very similar to the approach suggested by Joshua Bloch in his Effective Java Programming Language Guide.) I repeated the procedure for PageKey.hashCode(), which is called from RecordId.hashCode().

Then, when I reran the test with the modified Derby code, performance was back at its normal level, more than twice as high as without the fixed hash functions. And the 350 000 records that previously mapped to 12 000 or 6 000 distinct hash values now mapped to 280 000 distinct hash values. Not bad to get such an improvement with just a couple of mouse clicks in NetBeans, without writing a single line of code myself! :)

Tuesday Dec 18, 2007

SQL syntax highlighting in OpenGrok

Perhaps not the most frequently requested feature, but since much of the OpenGrok development happens at Sun's Database Technology Group these days (unofficially, that is, so please don't tell our boss that's what's keeping us so busy!), it's kind of embarrassing that OpenGrok doesn't understand SQL scripts. Now that's finally about to change. Yesterday, I checked in some basic support for SQL syntax highlighting, and we'll hopefully also be able to support search for symbols and definitions in SQL scripts in the next release. So then there should be one less embarrassment to worry about...


