This story started out with a user telling us that Berkeley DB Java Edition's group commit feature didn't work very well on ext3. The reason is that we only group fsync's and not writes & fsyncs. And ext3 takes an exclusive mutex on the inode for any IO operation, whereas Windows and Solaris don't (i.e. they don't block concurrent seek/read/write operations while an fsync is happening). So I set off to fix this problem for some upcoming release.
While debugging my new group-commit code in JE, I was seeing occasional DbChecksumExceptions. The odd thing was that when I examined the log file, the last set of records in the log file were fine except that the "prevOffset"pointers were all pointing later, rather than earlier. JE's log file records all have a prevOffset pointer which allows it to walk backwards in the log file. So it would be just plain wrong if one of those prevOffset pointers actually pointed forward instead of behind. It was if there were two writes going on, but one of the buffers overlaid the other.
The DbChecksumException was fairly reproducible in my test so I set out to put some debugging code in. For instance, I put in a check which made sure that just prior to the write() call to the log that the first record in the buffer being written had a prevOffset that was earlier rather than later. That check never fired and yet the results were there on disk: records with prevOffsets later rather than earlier.
So then I put in a check which read the buffer right back immediately after writing it to disk. This check would occasionally fire. But why did it fire when the check just prior to write() didn't?
I went so far as to repeat the read-after-write if it failed the first time. It would always pass on a subsequent read.
So the sequence was something like this:
seek(); write();
repeat until success { seek; read(); }
It appeared that something else was changing the underlying file pointer between the seek() and the read() so that the read was reading from the wrong place.
I sprinkled lightweight event tracing code throughout the relevant methods, catching all seek, read, write, and fsync events. But they all looked correct.
Finally I ran my code with "truss" so that I could see all the system calls. The output showed something like this:
Thread 1: llseek(13, 40728, SEEK_SET) = 40728 // seek before the write()
Thread 1: write(wr,
Thread 2: llseek(13, 0, SEEK_CUR) = 40755 // what's this?
Thread 1: llseek(13, 40728, SEEK_SET) // the seek before the read() call
Thread 2: llseek(13, 0, SEEK_END) // what's this?
Thread 1: read(13, ...) = 0 // the read() call
Thread 2: llseek(13, 40755, SEEK_SET) = 40755 // what's this?
So sure enough, there was another thread (Thread 2) messing with the underlying file descriptor (13) right between my seek() and my read(). But all of our seeks() had been covered with event tracing. All of our uses of the RandomAccessFile for seek, read, write, and fsync were synchronized so it seemed "impossible" that some other thread could actually be doing a seek call on the RAF.
I finally narrowed it down to a call to RandomAccessFile.length() in an assert statement (my test was running with -ea). Our assumption in writing the code had been that RAF.length() would not side-effect the file descriptor. But looking at the source code, it is clear that it is doing a SEEK_CUR, SEEK_END, SEEK_SET sequence which is not thread-safe. That's certainly not what one would expect.
A quick search for this reveals that others have been bitten by this. We'll accept responsibility for this one since we should have made all method calls on RandomAccessFile be synchronized, but I have to confess that this was subtle.