« Welcome to Charles Lamb's Blog | Main | Berkeley DB Java Edition Direct Persistence Layer at OOW »

OSCON Presentation Slides Now Available

Some of you have asked if my OSCON slides are available for my talk in JE Internals and War Stories.  They are now posted here.

Some descriptive text that I used for practice is shown below.

Hello, my name is Charles Lamb and Ii??m an engineer with
Oracle.  My talk today is about some of
the internals of Berkeley DB Java Edition. 
During this discussion, Ii??ll mix in some war stories that we as
developers have to tell.  What I mean by
a war story is the category of problem that stumped us for days or weeks.  Theyi??re the kind of problem that make
schedules slip and managers wince, but still have to be fixed. 



Ii??ll first spend a few minutes talking about Berkeley DB
Java Edition, also known as i??JEi??, its history and what it is.  From there Ii??ll talk about three specific
aspects of the architecture and some of the interesting situations we have had
to work through.  Finally, Ii??ll draw some
conclusions about our experiences.



Most of you probably already know Berkeley DB as a Sleepycat
Software product.  Sleepycat was formed
in 1996 with the charter to productize Berkeley DB.  It is licensed under a dual source license: a
i??free, as in beeri?? license if you redistribute Berkeley DB along with your source,
and a revenue based license if you use it without redistributing your
source.  BDB has been adopted by hundreds,
if not thousands, of projects including some of the ones listed here.  In 2002, the company decided to create a
pure-Java version of the software, Berkeley DB Java Edition, which was released
in 2004.  It is this Java Edition product
that I will be talking about.  The most
recent event in the Berkeley DB timeline is that about five months ago, Oracle
acquired Sleepycat Software to expand its portfolio of embedded databases.



So just what is Berkeley DB Java Edition?  JE can be characterized as a Main Memory
Database system, or a system that is optimized for having all data fit into
memory.  Specifically, iti??s an embeddable
transactional data store that is written completely in Java and is therefore quite
portable across many platforms.  It
provides the programmer with full transactional semantics.  Since iti??s an embedded system, it achieves
high performance by not requiring any interprocess communication between client
and server.  It is built on top of a
B+Tree access method and the concurrency techniques that we use are one of the
major topics that Ii??ll speak about.  The
system uses an append-only log-based storage system which optimizes write
performance.  Naturally, since the system
is Java based we felt it important to make the on-disk file formats be
architecture neutral so that they can be transported to different endian-ness
machines.  Finally, the architecture is
record based, not page-based, so it provides record level locking.



JE and the original Berkeley DB have similarities and
differences.  They share concepts like
Environments, Databases, and Cursors. 
They are based on the same byte-array oriented key/data map API.  And in both systems the API does not require
the user to specify a schema.  Both
systems provide transactional and non-transactional support.  In the differences category, the systems have
different on-disk file formats which are in no way compatible.  JE is architected for the single-process,
multi-threaded sweet-spot, while Berkeley DB is aimed at a multi-process
environment.  Consequently, since the
internal architectures are very different, the administration and tuning parameters
are also quite different.




Now Ii??d like to get into the meat of my talk: JEi??s internals
and war.  Ii??ll focus on three areas of
the system: the log-based storage system, concurrency in the B+Tree, and
Performance tuning.



To a programmer, JE is simple.  Iti??s a single jar file that is linked into
the JVM.  JE is about 750KB of code.  It shares the heap and accesses the file
system along with the application.  It
also starts up three daemon threads on its behalf.  Multi-process configurations are allowed, but
only in a single-write process, multiple-reader process configuration.




I want to give you some quick terminology.  First, an Environment is similar to a
relational DBMSi??s Database.  Iti??s a
concept that embodies all data and shared resources, and more specifically, it
provides a common infrastructure for caching, locking, and logging.  Iti??s also the unit of backup and restore at a
binary level.  Environments may be
transactional or non-transactional.  A second
JE term, i??Databasei?? is like a relational DBMS table, and simply stated, iti??s a
map of key/data pairs.  Like an
Environment, it may be transactional. 
Databases have no schema i?? the user defines the layout of data in the
records.  There are one or more databases
per Environment.




This slide illustrates the relationship between the in-memory
and on-disk representations of an Environment. 
On the left side is the on-disk representation and on the right hand
side is the in-memory representation.  An
Environment may have one or more databases and a directory of these databases
is stored in a tree structure known as the db mapping tree, shown on the right
side.  Each leaf in the db mapping tree
is the root of a database, which again, is a b-tree.  The leaves of a database are the records, or
key data pairs.  Databases may be
configured to allow multiple data values for a given key, and these are known
as dup sets, also represented as a B+Tree.  On the left side of the slide, the on-disk
representation of an environment is a set of sequentially numbered log
files.  As data is written to the environment,
it is buffered in a log buffer, which once filled, is appended to the latest
log file.  Records in the log files are
never overwritten i?? only appended.



And it is this premise, that writes only ever append to the
log file, upon which the log based storage system is built.  This means that the disk head generally stays
in one place, or advances to the next adjacent track.  Disk latency is reduced to an average latency
of a half disk rotation.   A side-effect
of this is that all data is stored in the log. 
There is not a separation between log files and data files.  While a traditional page-based system
operates in units of pages, JE operates on records and objects so the IO is
matched to the record size, not the page size.




Inserts, updates, and deletes are very fast since the disk
head doesni??t have to move more than one track at a time, if at all.  Further, data is clustered temporally rather
than spatially.  The downside is that
out-of-cache reads are slower because the disk head does have to move.  Naturally, the best case is when the working
set fits in memory and no disk accesses are necessary.  To mitigate this, JE is able to do some
on-the-fly clustering which Ii??ll talk about shortly.  One interesting side-effect of this storage
system is that backup and restore are quite simple.  All that is required to obtain a consistent
snapshot is a sequential sector and file-order copy of the blocks to the
archive.




Probably the quintessential database war-story has to do
with data corruption and JE is no different. 
In war story number 1, a user reported checksum errors when JE was
reading back data.  Scrutinizing the log
we found that there was a block of 0i??s in the middle of one of the log files,
and it just happened to be on a sector boundary.   Hmmm, this smells like a hardware or
operating system problem we thought, but how could we prove that.  After all, developers never want to believe
that there are bugs in their own code.  
Fortunately, the problem was reproducible and the user was willing to
run repeated tests.  So we added
assertions to the JE code at the point of all writes to the file system to
ensure that we were not writing blocks of 0i??s. 
These assertions vindicated us. 
But to be sure, we went even further and added a stub libc library to
the JVMi??s dynamic library path that would intercept all write calls and assert
that we wereni??t writing a block of 0i??s. 
This further acquitted JE as the cause of the problem.  The user became convinced that the cause of
the problem might be their custom kernel, and in fact, once that was
substituted for a stock kernel, the problem disappeared.



The second war-story also involves the storage system, but
indirectly.  A user reported that they
would consistently see OutOfMemoryErrors i?? the bain of every Java programmeri??s
existence i?? when inserting large records into the database.  We poked and prodded JE in several different
ways to try to narrow down the cause. 
Increasing the JVMi??s heap size with i??Xmx only delayed the inevitable out
of memory error.  Yet memory profiling
with various tools showed no memory leaks by JE or the application.  Interestingly enough, the previous release of
JE didni??t demonstrate the same problem. 
So the brute force way to solve the problem was to binary search through
the CVS source tree, running different intermediate versions against the test
program, to determine exactly which change in the source base caused the
problem to appear.  It turned out that
using the Java Adler32 checksum routine was the problem.  The only reason we were able to figure out exactly
why was through examination of the JVM source code.  We noticed that Adler32 first copies the
input byte array to a temporary buffer which is pinned and therefore causing
the GC to temporarily stop.  So the
solution in this case turned out to be to add a parameter to JE that caused a
home-grown Adler32 routine to be used instead of the java library.



Back to the log-based storage system.  Since the system is append-only, it naturally
needs a way of cleaning up old obsolete data in the log files.  This is achieved by a process called
log-cleaning which is the process of consolidating older log files containing
obsolete data.  Cleaning is simple.  A candidate log file is chosen and all
records in it are scanned.  Data that is
current, or live, is marked for migration to the latest, most current log file,
and obsolete data is just ignored.  Once
all live data has been migrated successfully, the old log file is deleted so
that the disk space can be reclaimed.



During cleaning we can achieve some dynamic reclustering by
grouping the live data to be migrated and writing them to the log
together.  Besides the actual file
processing, the other aspect of cleaning is determining which log file is the
best candidate for cleaning.  To do this,
JE keeps track of which records in a log file are obsolete, and therefore, what
percentage of a log file is obsolete. 
Files with log space utilization are the best candidates for cleaning.



As you might expect, cleaning is not quite as simple as it
sounds.  So war story #3 has to do with
the cleaner, which under certain circumstances was not able to keep up with
high write loads.  To figure out what was
going on, we made lots of test runs with lots of statistics, counters, knobs,
and dials.  The upshot was that the
cleaner was generating a lot of the obsolete data by writing the internal nodes
of the B+Tree multiple times.  The
solution to this was two-fold.  First, we
deferred the actual data migration until either it had to be evicted from the cache
or when a checkpoint occurred.  While
this meant that we also had to defer the actual file deletion, it also meant
that we could get more and better clustering. 
Secondly, we added the capability of providing multiple threads to do
the cleaning.



Next Ii??ll talk about the concurrency mechanisms in the JE
B+Tree access method.



This diagram shows a simple B+Tree.  Red blocks are internal nodes in the tree and
they only store keys, no data.  Blue
blocks are the leaf nodes in the tree and they hold both key and data.  Notice that the tree does not have to be
balanced.



Within JE we use two synchronization mechanisms, locks and
latches.  Latches are lightweight mutexes
used for maintaining consistency in internal data structures.  They are similar to the Java synchronized
keyword except that latches are not lexically scoped.  Latches are implemented on top of the
synchronized keyword and come in shared and exclusive varieties.  A latch is never held across user API
calls.  Nor do we perform deadlock detection
on them.  Instead, if multiple latches
need to be acquired, we latch them in a predefined order.



The other synchronization mechanism, locks, are familiar to
most database users.  These are more
heavyweight mutexes for locking records in the database and may be held across
API calls, for example, until the end of a transaction.  There are read and write varieties and they
generally follow standard two-phase locking semantics.  While lock deadlocks are detectable, we
currently rely on lock timeouts in JE.




JE uses a hybrid concurrency scheme based on locks and
latches.  Specifically, the records of
the database, which are the treei??s Leaf Nodes are only locked and never
latched.  The internal nodes of the tree
are never locked, but only latched allowing for greater concurrency.  During searches down the tree, we latch
couple the internal nodes.  In other
words, while holding a latch we determine which child node is next in the search
and latch it.  Once the child is latched,
we release the latch on its parent. 
Finally, we never latch upwards in the tree as this can lead to
deadlocks, and to enforce this we never maintain parent pointers in the tree.




As you might have guessed, this hybrid scheme has some
impact on transactions and recovery. 
Because the internal nodes are not locked and do not participate in
transaction rollback, they may contain data from transactions that have not yet
committed or aborted.  This is ok because
we let the leaf nodes, which are locked and do participate in rollback, be the
final arbiter of the actual data.



A corollary to this is that during record insertion, if an
internal node is full, JE will opportunistically split it during the tree
descent.  These are not undone in the
event of a transaction abort because therei??s a good chance that the
transaction, and hence the insert, will just be retried anyway.  A similar optimization allows a background
compressor thread to do cleanup after deletes, including tree rebalancing,
pruning empty subtrees, and compressing internal nodes that only have deleted
entries.




Naturally, when you introduce background threads that are concurrently
updating the tree, there may be bugs, and this brings us to war story #4 which
involves the compressor thread messing up the results of a user cursor
traversal.  During traversal, the leaf
nodes referenced by an internal node are returned to the user until there are
no more on that leaf.  At that point the
next internal node needs to be located and to achieve this, JE starts at the
root and searches downward for the node with the next consecutive key.  During this search for the next internal
node, an empty internal node may be encountered.





Herei??s a diagram. 
Logically, a cursor traversal causes the useri??s cursor to move from the
left hand green leaf node to the right hand green leaf node, regardless of any
empty internal nodes that may lie in between them.





Internally, JE iteratively seeks the next internal node for
the cursor by starting at the root and searching downward for the next key
following the last internal node.  In
effect, this takes the cursor advance operation on the path illustrated by the
green arrows.  Along the way, JE may
encounter an empty IN that has not yet been processed by the compressor thread.




In fact, the compressor may be working on that empty subtree
at the same time, which would effectively cut the link between the parent and
the empty internal node.  These two nodes
are shown in yellow here.  If the cursor
advance operation was down on the empty node when the compressor cut the link
from the parent, then the cursor operation would think that it was at the end
of the traversal and return a no-more-records status.  Of course, proper latching and internal
concurrency control should handle this case, but in this case it was a subtle programmer
bug waiting to be uncovered.




Naturally, uncovering this particular problem was dicey
because it was highly timing dependent. 
So we added a lightweight event tracing facility to JE and called it at
opportune spots in the code.  When the
situation where the empty node could no longer be located in the tree was
encountered, the event tracing facility dumped out the history.  We iteratively sent the customer these
enhanced JE jar files and eventually found the problem.  The solution was to add code that would
effectively restart the entire cursor advance operation when the situation was
detected.




Speaking of latches, this next story is probably a
programmeri??s worst nightmare come true. 
What if your internal latching code didni??t actually latch when it said
it had on rare occasions and it didni??t tell you when those rare occasions were?  Oh oh. 
Crash and burn, film at 11.  We
stumbled upon this when our own internal latch consistency code detected that a
latch was being released but it was not really held.  Once again, the lightweight even tracing
facility came to the rescue along as well as adding more internal checks in in
the code.  By googling and reading the
appropriate javadoc, we learned that it is possible for Javai??s wait method to legally
return without the target being held and that this had not been documented
before Java 5.  The programmer is
responsible for checking on spurious notifications.



Finally, Ii??ll talk about some performance tuning anecdotes
that wei??ve encountered.



A natural first response to any performance problem in a
database system is to tell the user to increase the amount of cache.  In this particular benchmark, a database load,
we did just this.  Unfortunately, it had
the opposite effect.  Increasing the cache
slowed performance, while decreasing the cache increased performance.  Hmm, thati??s interesting, I wonder why?  It turns out that database loads prefer MRU
eviction.   Once an object is created,
iti??s not likely to be needed again, so we might as well evict it and save room
for the internal nodes of the tree which are likely to be referenced again
soon.  A bigger cache means that there
are more objects on the heap and because theyi??re in the JE database tree, they
are referenced.  More referenced objects
on the heap means that the generational garbage collector has to copy more
objects, slowing performance.  So, a
smaller heap means fewer copies and improved performance.  Subtle.




Returning to the cleaner, we have another opportunity for a
war story.  In this case, as we iterated
through the records in a log file, determining whether a record was live or not
was too slow and took up too much CPU because a tree lookup was necessary.  After much profiling and annotation of the
code it was determined that the best way to improve this was to maintain a
database within JE of which records were obsolete, thereby eliminating the
looking in the tree.  While this may not
seem like much of a war story, I categorize it as one because it took us a long
time with a lot of metering runs to design and implement the final solution
which was somewhat of a major architecture change.




And finally, this story has to do what happens when you take
a system that has been architected to run in a highly concurrent environment
and actually stress test it in such an environment.  You may be familiar with the new Sun Niagara
processors which have 6 or 8 cores with 4 execution engines in each core.  This appears as 24 or 32 CPUs to the
user.  A customer wanted to use JE to
perform 25 concurrent queries on one of these systems so they devised a
multi-threaded benchmark and challenged us to make the Niagara T2000 run at
100% CPU utilization.  Several profiling
runs and code changes later, we found that the biggest bottleneck was the JE
lock table which has a single latch.  The
solution here was to break the lock table into multiple tables each of which
had its own latch.  Mission
accomplished, 100% cpu utilization for the benchmark.





What can we conclude from all of this?




First, high concurrency systems like JE can produce lots of
interesting issues that may not be easy to reproduce deterministically.  We used lots of traditional tools like
debuggers, profilers, printfi??s, event logging, and good old source code
examination, but also used some non-traditional tools like linking in stub
libraries and reading JVM source code, which fortunately was available to us.



We also have to credit our user community for our success in
these battles because they provided us with lots of different configurations to
beat on JE with.  Lastly, sometimes to
debug these difficult situations, you just get lucky.  If you liked these stories and want some
more, checkout blog.sleepycat.com.

Thank you.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About This Entry

This page contains a single entry from the blog posted on July 31, 2006 10:42 AM.

The previous post in this blog was Welcome to Charles Lamb's Blog.

The next post in this blog is Berkeley DB Java Edition Direct Persistence Layer at OOW.

Many more can be found on the main index page or by looking through the archives.

Top Tags

Powered by
Movable Type and Oracle