June 11, 2009

Berkeley DB Java Edition High Availability Beta

We're looking for a few good testers for BDB JE HA. Applications for the beta test are now being accepted. If you're interested in working with this new functionality in JE, please contact Dave Segleau (dave.segleau atsign o.com).

June 3, 2009

Berkeley DB Java Edition HA vs Gathering Writes

While working on performance measurement and improvements for JE High Availability (HA) we noticed that we could eliminate a buffer copy by using a gathering write. The changes were simple enough, but when I made them, weird things started happening with the stress test I was using. It seemed like the ByteBuffers which were supposedly being sent from master to replica were ending up corrupted by the time they got to the replica side. Eventually, we decided that bug 6812202 was responsible (even though we didn't always see an OOME) and in fact, moving to JDK 1.7 "fixes" the problem. Unfortunately, we are not in a position to pre-req Java 7 for HA, or for JE for that matter.
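For readers who have not used one, here is a minimal sketch of a gathering write with plain java.nio. It is not the HA code: HA sends its buffers over the network, while this example writes to a temporary file so it can run on its own, and the class and method names are made up for illustration. The point is only that two separate ByteBuffers (say, a header and a payload) go out in one call without first being copied into a combined buffer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: the names here are hypothetical and not part of JE. */
final class GatheringWriteDemo {

    /** Writes header and payload with a gathering write and returns what reached the file. */
    static byte[] writeGathered(byte[] header, byte[] payload) {
        try {
            Path tmp = Files.createTempFile("gather", ".bin");
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                // Two independent buffers; no copy into a single combined buffer.
                ByteBuffer[] parts = { ByteBuffer.wrap(header), ByteBuffer.wrap(payload) };
                long remaining = header.length + payload.length;
                // A channel may accept fewer bytes than offered, so loop until done.
                while (remaining > 0) {
                    remaining -= ch.write(parts);
                }
            }
            byte[] result = Files.readAllBytes(tmp);
            Files.delete(tmp);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] out = writeGathered(
            "HDR:".getBytes(StandardCharsets.US_ASCII),
            "payload".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(out, StandardCharsets.US_ASCII));
    }
}
```

The same write(ByteBuffer[]) call is available on SocketChannel, which is the kind of channel a master-to-replica feed would use.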

You're probably wondering if this means that HA is imminent. Yes, the beta is expected to start in a few weeks, which is a very exciting development for JE. And you may be wondering what we know about HA performance. I'll just say that we're quite comfortable with where HA performance is right now. We know of some areas where we can make improvements and we'll eventually make them, but for now, we like what we see.

May 27, 2009

Berkeley DB Java Edition Cleaner and Checkpointer Notes

My colleague Mark Hayes posted an excellent writeup of the JE Cleaner and how it relates to the checkpointer. It is cut-and-pasted below.

The JE cleaner daemon thread(s) are enabled by default. Normally this should not be changed. Possible reasons for disabling the JE cleaner threads are:

1) You may wish to disable the JE cleaner threads during heavy application usage periods and only run the log cleaner when application usage is light (e.g., at 2 am). This can increase throughput during heavy usage periods. However, caution is strongly advised. If the write rate is high during the heavy usage period, filling the disk is a possibility and must be avoided. You must also ensure that there is enough time during light usage periods for the log cleaner to catch up with the backlog created during the heavy usage periods. In addition, random reads may be negatively impacted during the heavy usage periods if the JE log grows very large, because there may be fewer hits in the file system cache.

2) You may wish to disable the JE cleaner and checkpointer threads when performing a "bulk load". A bulk load is a large set of writes, usually inserts but sometimes also updates and deletions, that is performed in a batch mode while all other application functions are disabled. It is used to initialize a large data set. The objective is to complete the load as quickly as possible and to use as little disk space as possible. Note that deferred write mode (see DatabaseConfig.setDeferredWrite) is often used for a bulk load to minimize writing.

Checkpointing can be disabled to avoid wasting disk space with multiple, redundant checkpoints during the load. Instead a single checkpoint is performed after the load is complete. This is acceptable because recovery time does not need to be bounded by checkpoints -- if a crash occurs during the load, the load can be restarted from scratch. Log cleaning can also be disabled to speed up the load. If only insertions are performed, then log cleaning will not be needed anyway. But even if updates and deletions are performed, log cleaning is not productive while the checkpointer is disabled since log files will not be deleted. Log cleaning may be performed efficiently by calling cleanLog at the end of the load, followed by a checkpoint.

3) You may wish to implement your own log cleaning threads for administrative reasons. Perhaps you have a special thread pool you wish to use, or you're sharing a thread pool with other components. In this case, your threads take on the same role as the JE daemon threads. Your threads should call Environment.cleanLog periodically. The number of threads calling cleanLog should be increased when the EnvironmentStats cleanerBacklog value grows. A checkpoint is not normally necessary, since checkpoints should occur independently on their own schedule. But if you also disable the JE checkpointer thread, then you should call Environment.checkpoint periodically from your own thread.

4) Using a NAS (e.g., NFS) for JE storage can be problematic for several reasons. For one, the EnvironmentConfig.LOG_USE_ODSYNC parameter must be set. In addition, if the NAS does not support the file locking needed by JE, then running multiple processes is problematic. JE cannot use file locking, and therefore cannot coordinate multiple processes accessing the same environment. It is then up to the application to ensure that only one process is writing to the environment, and that log cleaning is disabled when any read-only processes are open. The log cleaner threads may need to be disabled by the application in such situations.


Below are some example use cases where calling Environment.cleanLog is needed.

A) If you implement your own log cleaning threads (3) then you should call cleanLog periodically. The JE daemon threads effectively call cleanLog after each N bytes of log is written, where by default N is 0.25 times the maximum size of a log file, and may be configured using EnvironmentConfig.CLEANER_BYTES_INTERVAL. For simplicity your log cleaning threads may call cleanLog based on a configured time interval. As mentioned above (3), a checkpoint is not normally necessary after calling cleanLog.

B) The JE cleaner threads are triggered by write activity. You may wish to call cleanLog in order to force cleaning to occur when no other write activity is occurring. For example, you may wish to do this at the end of a bulk load (2), or as a utility function. After calling cleanLog, a checkpoint should be performed to cause cleaned log files to be deleted.
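Use case (B) can be sketched as a small loop. To keep the example runnable without JE on the classpath, it uses a hypothetical LogEnv interface that stands in for the two Environment operations involved; the interface, the FakeEnv class and the method names are assumptions made for illustration. The loop-until-zero pattern relies on cleanLog returning the number of files it cleaned, which is how I recall the JE Javadoc describing it.

```java
/** Hypothetical stand-in for the two com.sleepycat.je.Environment operations used here. */
interface LogEnv {
    int cleanLog();     // like Environment.cleanLog: number of files cleaned in this pass
    void checkpoint();  // like a forced Environment.checkpoint
}

final class PostLoadCleanup {

    /** Cleans until nothing is left to clean, then checkpoints; returns files cleaned. */
    static int cleanThenCheckpoint(LogEnv env) {
        int total = 0;
        int cleaned;
        while ((cleaned = env.cleanLog()) > 0) {
            total += cleaned;
        }
        // The checkpoint is what allows the cleaned files to be deleted.
        env.checkpoint();
        return total;
    }

    /** Toy environment for the demo: pretends a fixed number of files need cleaning. */
    static final class FakeEnv implements LogEnv {
        int filesToClean;
        int checkpoints = 0;

        FakeEnv(int filesToClean) {
            this.filesToClean = filesToClean;
        }

        public int cleanLog() {
            if (filesToClean == 0) {
                return 0;
            }
            filesToClean--;
            return 1;
        }

        public void checkpoint() {
            checkpoints++;
        }
    }

    public static void main(String[] args) {
        FakeEnv env = new FakeEnv(3);
        int cleaned = cleanThenCheckpoint(env);
        System.out.println(cleaned + " files cleaned, " + env.checkpoints + " checkpoint(s)");
    }
}
```

With a real Environment, the body of cleanThenCheckpoint would call env.cleanLog() and then env.checkpoint with a CheckpointConfig that forces the checkpoint.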


For completeness, I'd like to say a little more about checkpoints and log cleaning. As mentioned above, a checkpoint is necessary after the log cleaner has "processed" a log file, and before the file can be deleted. The log cleaner (the JE cleaner threads and the cleanLog method) processes a log file by migrating all active data from that file to the end of the log. The checkpoint is necessary before deleting the file, to ensure that no references to that log file remain active.

In addition, the checkpoint does a lot of the work -- the heavy lifting -- of log cleaning. When a log file is processed, the active data is placed in memory. But it is left to the checkpoint to write the active data to the end of the log. This has several advantages:

1) It offloads some of the work from the log cleaner threads, so they can make better progress and keep up with the application threads.

2) It reduces the total amount of writing by deferring it for as long as possible. Multiple updates to the active data are consolidated when writing is deferred until the next checkpoint.

3) Data is clustered naturally when writing is deferred. Data is written by the checkpointer in groups of records, where the records in a group have key values in close proximity to each other. For applications having locality of reference by key value, but where the records are initially written in a different order in the log, read performance may be improved.

In some applications, however, this approach can cause very long checkpoints, with negative repercussions. In particular, this can occur when the JE cache is very large (e.g., multiple GB) and the write rate is high. Because of the large cache, write activity and related log cleaner activity can queue up a large amount of work that must be done during each checkpoint. If the checkpoint takes too long (if it spans many log files) then the recovery interval may be very long also, and recovery after a crash may take a very long time. Long checkpoints also prevent cleaned log files from being deleted promptly.

For such applications, the EnvironmentConfig.CHECKPOINTER_HIGH_PRIORITY configuration parameter should be set to true. This causes two changes in behavior:

a) The log cleaner threads (and the cleanLog method) will write active data to the end of the log, rather than leaving this work to be done by the checkpointer.

b) The checkpointer will log multiple Btree nodes at a time, reducing contention with other threads.

Both of these changes cause the checkpoint to complete in much less time. This can have a significant positive impact on overall performance. If your application has long checkpoints (as usual, watch the EnvironmentStats), you should consider this option.

If you use this option, it is very likely that you'll also need to increase the number of log cleaner threads. The checkpointer will be doing less work, but the log cleaner thread(s) will be doing more work. Therefore more log cleaner threads will probably be needed to prevent the cleaner backlog from growing.
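As a rough illustration, the two settings discussed here can be collected in a java.util.Properties object. The property names "je.checkpointer.highPriority" and "je.cleaner.threads" are the string forms of EnvironmentConfig.CHECKPOINTER_HIGH_PRIORITY and EnvironmentConfig.CLEANER_THREADS as I understand them; please check them against the Javadoc for your JE version. The class name and the thread count of 2 are purely illustrative.

```java
import java.util.Properties;

/** Illustrative helper; not part of JE. */
final class HighPriorityCheckpointSettings {

    /** Builds the two settings; cleanerThreads is whatever keeps your backlog flat. */
    static Properties build(int cleanerThreads) {
        Properties p = new Properties();
        // Assumed string name of EnvironmentConfig.CHECKPOINTER_HIGH_PRIORITY.
        p.setProperty("je.checkpointer.highPriority", "true");
        // Assumed string name of EnvironmentConfig.CLEANER_THREADS.
        p.setProperty("je.cleaner.threads", Integer.toString(cleanerThreads));
        return p;
    }

    public static void main(String[] args) {
        Properties p = build(2); // 2 is an example value, not a recommendation
        p.stringPropertyNames().forEach(k -> System.out.println(k + "=" + p.getProperty(k)));
    }
}
```

The printed lines have the same name=value form used in a je.properties file. The right number of cleaner threads depends on the application, so watch the cleanerBacklog value in EnvironmentStats after making the change.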

May 15, 2009

Berkeley DB Java Edition 3.3.82 Available

Berkeley DB Java Edition 3.3.82 is a patch release consisting of fixes for a number of significant issues. We strongly recommend that users of the 3.3.x version upgrade to this release.

There are several issues that are critical for applications using deferred write (aka temporary databases), XA, the shared environment cache, a single transaction in multiple threads, or large sets of duplicates.

These critical fixes are below:

* Fix a bug that could cause a LogFileNotFoundException when using an XAEnvironment, if a prepared transaction is not ended prior to a crash and then the prepared transaction is aborted after recovering from a crash. [#17022] (3.3.79)

* Fix a bug that prevented deferred-write record deletions from being made durable by Database.sync, if a crash occurs after Database.sync but prior to the next checkpoint. Under rare circumstances this could also result in a LogFileNotFoundException later when accessing the deleted entry. [#16864] (3.3.78)

* Fix a bug that prevents log cleaning from functioning properly when a temporary DB (DatabaseConfig.setTemporary) is large enough to overflow the JE cache. Also fix a bug that could in rare circumstances cause an endless loop while performing log cleaning. [#16928] (3.3.78)

* Fix a bug that caused an infinite loop under certain timing dependent circumstances when using EnvironmentConfig.setSharedCache(true). This bug was reported in two different forum posts (http://forums.oracle.com/forums/thread.jspa?threadID=832695 & http://forums.oracle.com/forums/thread.jspa?messageID=3275138). [#16882] (3.3.78)

* Fix a bug that caused NullPointerException when opening an XAEnvironment under certain circumstances, when prepared but uncommitted transactions are present in the JE log and must be replayed during recovery. [#16774] (3.3.77)

* Fix a bug that caused LogFileNotFoundException in rare circumstances for an Environment having one or more Databases configured for duplicates (or one or more SecondaryIndexes with MANY_TO_XXX relationships). The bug only occurs when a single secondary/duplicate key value is associated with a large number of records/entities; specifically, the sub-Btree for a single duplicate key value must have at least 3 levels. [#16712] (3.3.76)

* Fix a bug that caused BufferOverflowException while writing transactional records. This could occur if multiple threads were writing to an environment while using the same Transaction. [#17204]


The complete list of changes can also be found in the change log page.

http://www.oracle.com/technology/documentation/berkeley-db/je/changeLog.html

Product documentation can be found at:

http://www.oracle.com/technology/documentation/berkeley-db/je/

Download the source code including the pre-compiled JAR, complete documentation, and the entire test suite as a single package.

http://download.oracle.com/berkeley-db/je-3.3.82.zip
http://download.oracle.com/berkeley-db/je-3.3.82.tar.gz

May 8, 2009

Berkeley DB Java Edition - A Benefit of Open Source

Berkeley DB is a well-known open source embedded database. We suspect that most users never really even look at the sources, but that they do like knowing that they can if they ever want to. Others may want to look over the sources just to see if the code looks clean -- kicking the tires so to speak. And then there are the users who actually find bugs for us. This recently happened. A user was getting a BufferOverflowException while using JE and since he had a reproducible case he went ahead and diagnosed it for us. His analysis took us straight to the actual problem and we'll be issuing a patch shortly (3.3.82).

April 26, 2009

Oracle Acquiring Sun

The FAQ talks a lot about this. All I can say is that from my personal point of view this is really exciting news. I'm looking forward to working even more closely with my Sun colleagues.

January 22, 2009

Oracle in the Embedded Database Space

IDC recently wrote a report saying that Oracle is first in Embedded Databases. What exactly is an embedded database? Technically, an embedded database is one that requires no DBA for administration and can be silently installed and configured along with the application, so that the customer need not even know it's there. From a business standpoint, an embedded database is used by product developers at software ISVs and hardware OEMs, as opposed to enterprises who operate databases in their data centers.

From an ISV's or OEM's standpoint, using an embedded database can have several advantages:

* Makes the ISV/OEM product technically more competitive - gives it better performance, scalability, availability, etc.

* Lowers the product's TCO for customers, since the end-user customer does not need to buy separate database licenses, spend time/money installing/integrating the database, or keep a DBA on the payroll.

* Allows the ISV/OEM to direct their engineering resources on building value-added pieces, rather than on the database infrastructure.

Lots of products embed databases, ranging from enterprise or Internet applications to small consumer devices and large-scale industrial equipment, each of which puts different requirements on the database. The IDC report calls this out: "in some cases a small footprint is required, in other cases speed, in others scalability, and in still others broad support of many data types."

Readers of this blog may think that the only Oracle embedded product is Berkeley DB, but Oracle Database, TimesTen In-Memory Database, and Oracle Database Lite are also on that list.

References:

Press release: http://www.oracle.com/features/hp/embedded-number-one.html

Technical White Paper describing how to embed Oracle Database: http://www.oracle.com/technologies/embedded/docs/11g-embedded-whitepaper.pdf

IDC Report: http://oracleimg.com/corporate/analyst/reports/infrastructure/dbms/idc-215446.pdf


December 30, 2008

Berkeley DB Java Edition and Amazon AWS/EC2, EBS

In a previous OTN thread titled BerkeleyDB and Amazon EC2/S3 questions were raised about using Berkeley DB Java Edition on AWS/EC2. Specifically,

(1) Does JE work on AWS/EC2, and
(2) Can S3 be used as a persistent store for JE.

To follow up on this, I have recently done some work validating JE on AWS and am happy to report that it works fine (there should be no surprise there). I have run it under 32-bit and 64-bit Ubuntu distros with Java 6, but I have no reason to think that it doesn't work on other platforms.

On the second question, I did no work with S3 as a persistent store. Rather, I ran JE with both the Instance Local Storage and with an EBS volume as Environment storage. In the Instance Local Storage case, AWS/EC2 makes no guarantees of durability if the instance fails. In the EBS case, the durability guarantees are much stronger. Both of these storage mechanisms worked fine with JE.

I call attention to the performance that I observed with EBS on an m1.large instance type. Raw write/fsync operations were on the order of 1.99 msec which is quite fast. A discussion of this can be found in this AWS Forum thread.

November 30, 2008

Berkeley DB Java Edition Group Commit vs ext3

In JE, when the durability setting dictates force-writing to disk (for example, a commitSync() call), it calls fsync(). On modern (non-SSD) disk drives an fsync() will take on the order of several milliseconds, a relatively long time compared to the time required to do a write() call, which only moves data to the operating system's file cache. Presently, JE's group commit code ensures that any concurrent application threads requiring fsync() calls will have their fsync()s executed in batches if the fsync() would otherwise block. For example, if a thread T1 requires an fsync(), and JE determines that no fsync() is in progress, it will execute that fsync() immediately. If, while T1 is executing the fsync(), other threads also require an fsync(), JE will block those threads until T1 finishes. When T1's fsync() completes, a single new fsync() executes on behalf of all the blocked threads. In this way, JE achieves group commit.
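The T1 example can be sketched in a few dozen lines of Java. This is not JE's implementation, just a toy model of the behavior described: the class name, the counters and the simulated 20 ms fsync are all assumptions made for illustration. A thread that arrives while an fsync is running waits; when that fsync ends, one waiter runs the next fsync and the other waiters return as soon as it completes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of group commit; names and structure are illustrative, not JE's. */
final class GroupCommitSketch {
    private final Runnable fsync;     // the expensive force-to-disk call (simulated below)
    private boolean inProgress = false;
    private long started = 0;         // fsyncs started so far
    private long completed = 0;       // number of the latest fsync that finished successfully

    GroupCommitSketch(Runnable fsync) {
        this.fsync = fsync;
    }

    /** Returns once an fsync that began after this call has completed. */
    void commitSync() {
        long mine;
        synchronized (this) {
            long target = started + 1;   // first fsync guaranteed to cover this caller
            while (completed < target && inProgress) {
                waitQuietly();
            }
            if (completed >= target) {
                return;                  // another thread's fsync covered us
            }
            inProgress = true;           // nobody is syncing: this thread does it
            mine = ++started;
        }
        boolean ok = false;
        try {
            fsync.run();
            ok = true;
        } finally {
            synchronized (this) {
                inProgress = false;
                if (ok) {
                    completed = mine;
                }
                notifyAll();             // wake the threads that queued up meanwhile
            }
        }
    }

    private void waitQuietly() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for fsync", e);
        }
    }

    /** Runs the given number of committing threads and returns how many fsyncs ran. */
    static int demo(int threads) {
        AtomicInteger fsyncs = new AtomicInteger();
        GroupCommitSketch gc = new GroupCommitSketch(() -> {
            fsyncs.incrementAndGet();
            pause(20);                   // pretend the disk takes 20 ms
        });
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(gc::commitSync);
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return fsyncs.get();
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("8 commits used " + demo(8) + " fsync(s)");
    }
}
```

The exact count printed depends on thread timing, but it can never exceed the number of committing threads, and it is usually much lower when the commits overlap.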

The existing JE group commit code assumes that the underlying platform (OS + file system combination) allows IO operations like seek(), read() and write() to execute concurrently with an fsync() call (on the same file, but using different file descriptors). On Solaris and Windows this is true. Hence, on these platforms, a thread which is performing an fsync() does not block another thread performing a concurrent write(). But on several Linux file systems, ext3 in particular, an exclusive mutex on the inode is grabbed during any IO operation. So a write() call on a file (inode) will be blocked by an fsync() operation on the same file (inode). This negates any performance improvement which might be achieved by group commit.

Internally, I have finished working on improving the group commit so that it batches write() and fsync() calls, rather than just fsync() calls. Just before a write() call is executed, JE checks if an fsync() is in progress and if so, the write call is queued in a (new) Write Queue. Once the fsync() completes, all pending writes in the Write Queue are executed.

I have enabled this behavior by default but it may be disabled by setting the je.log.useWriteQueue configuration parameter to false. The size of the Write Queue (i.e. the amount of data it can queue until any currently-executing IO operations complete) can be controlled with the je.log.writeQueueSize parameter. The default for je.log.writeQueueSize is 1MB with a minimum value of 4KB and a maximum value of 32MB. The Write Queue does not use cache space controlled by the je.maxMemory parameter. For future reference, this is internal SR [#16440].
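To make the Write Queue idea concrete, here is a toy model. It is not the JE code: the class and method names are invented, an in-memory stream stands in for the log file, and the fsync is reduced to a begin/end pair so the behavior is easy to follow. What JE does when the queue is full is not covered above, so in this sketch write simply returns false in that case and leaves the fallback to the caller.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;

/** Toy model of a write queue; names and structure are illustrative, not JE's. */
final class WriteQueueSketch {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for the log file
    private final ArrayDeque<byte[]> queue = new ArrayDeque<>();
    private final int capacityBytes;  // plays the role of je.log.writeQueueSize
    private int queuedBytes = 0;
    private boolean fsyncInProgress = false;

    WriteQueueSketch(int capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Returns true if the data was written or queued, false if the queue is full. */
    synchronized boolean write(byte[] data) {
        if (!fsyncInProgress) {
            file.write(data, 0, data.length);   // no fsync running: write straight through
            return true;
        }
        if (queuedBytes + data.length > capacityBytes) {
            return false;                       // assumption: the caller decides what to do
        }
        queue.add(data.clone());                // park the write until the fsync finishes
        queuedBytes += data.length;
        return true;
    }

    synchronized void beginFsync() {
        fsyncInProgress = true;
    }

    /** Marks the fsync as finished and executes every pending write in arrival order. */
    synchronized void endFsync() {
        fsyncInProgress = false;
        byte[] pending;
        while ((pending = queue.poll()) != null) {
            file.write(pending, 0, pending.length);
        }
        queuedBytes = 0;
    }

    synchronized byte[] fileContents() {
        return file.toByteArray();
    }

    public static void main(String[] args) {
        WriteQueueSketch log = new WriteQueueSketch(4);
        log.write(new byte[] {1});
        log.beginFsync();
        System.out.println("queued: " + log.write(new byte[] {2, 3}));
        System.out.println("bytes in file during fsync: " + log.fileContents().length);
        log.endFsync();
        System.out.println("bytes in file after fsync: " + log.fileContents().length);
    }
}
```

On a file system like ext3 the write would otherwise sit blocked behind the fsync, so parking it in memory lets the writing thread carry on with other work.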

October 2, 2008

Google's Energy Proposal

I don't normally make wholesale links from this blog to others, but I thought the Google blog entry on their clean energy proposal warranted some attention.