Over the past few months I've been working on comparing Berkeley DB Java Edition to Apache Derby. The result of this effort is a whitepaper that compares the two systems in "CRUD" tests. I embarked on this benchmark because I was curious about Derby and its performance characteristics. I certainly do not intend to launch into a benchmark war against Derby -- there are some things that Derby does better than JE (like ad hoc queries) and some things that JE does better than Derby. It all depends on your application's needs.
From the text:
This paper compares the performance of Oracle Berkeley
DB Java Edition to Apache's pure Java RDBMS, Derby. In some areas, this comparison is
apples-to-apples, and in others it is apples-to-oranges. For instance, both JE and Derby
are embeddable, transactional, pure Java storage engines
(apples-to-apples). On the other hand,
JE provides schema and a higher level data abstraction without SQL using a
faster direct access API, features Derby provides
with SQL (apples-to-oranges). The reader
should evaluate whether the additional features provided by a relational
database system (RDBMS) are required, at the expense of reduced
performance. In all cases comparisons
are functionally equivalent despite different underlying implementations.
In the tests described in this white paper, Berkeley
DB Java Edition performance exceeds Derby performance by a factor of 3 to 10.
Comments (4)
I'm curious why the paper states BerkeleyDB outperforms Derby by a factor of 3 to 10 in every test, but actually in six of the eight direct comparisons with Derby the difference is less than a factor of 3? The first test only shows a 7% improvement for BerkeleyDB, but the paper describes this as "significantly faster" and even uses the trick of a non-zero based graph to try and reinforce this fake impression.
Will the code for the benchmarks be published?
Posted by Dan Debrunner | November 29, 2006 9:32 PM
There is some further discussion of this over at David Van Couvering's blog. I put a reply to his post in his comments area, and I'll repost it below.
Hi David,
Thanks for your excellent detailed post. I appreciate the effort you put into it as well as the time you took to read the whitepaper. As one of the developers of Berkeley DB Java Edition, I thought it would be good to comment on some of your thoughts. I hope that you'll let people on the derby-dev list know about my comments here.
Let me say up front that the benchmark and whitepaper were not meant to start a shouting match between JE and Derby. You commented that Oracle must be getting peppered with questions about Derby vs JE. Actually, we haven't had many questions, except from our own internal people. Following the Sleepycat acquisition earlier this year we thought it would be good to be proactive about performance numbers since our Sales Consultants would eventually ask. Also, on a personal level, I had been curious about Derby for quite a while and running a benchmark was a good excuse to learn more about it. I stressed to our Product Manager (and anyone else who would listen) that any numbers we published should emphasize that largely it's an apples-to-oranges comparison.
Frankly, I am quite impressed with Derby. It was easy to install, set up, and run with a minimal amount of hassle. I can't recall encountering any bugs during my tests. Congratulations to all the developers on a great piece of work.
And again, being frank, JE is not for every application. In fact, I know of at least one user out there who would probably be better served by using Derby. Conversely, you may know one or two Derby users who would be better served by JE. But whatever the case, JE and Derby are really two different animals and the reader should keep that in mind. Unfortunately, I can see why you might take offense (my word, not yours) to some of our "marketing prose" which may sound a little aggressive.
Source code: yeah, we dwelt on that one for a while and there's no good answer except that the benchmark was run as part of an internal performance regression suite. Releasing the code is problematic because it depends on other pieces of code that can't be released. Frankly, I'd like to release the code because I'd like to hear comments from you (and others) about how to tune Derby to run faster. We considered Poleposition but it requires a JDBC interface, and we don't have one.
SQL: Nope, we don't provide SQL (or JDBC). If you need either of those, JE may not be the right answer. In fact, I'm 99.44% sure we'd lose an ad-hoc query benchmark against Derby.
API: You're right that JE does not have an API like anyone else's. But I should make reference to Carbonado which is discussed on TheServerSide. Carbonado allows using Berkeley DB C Edition, Berkeley DB Java Edition, or an SQL database as an underlying repository. This is extremely useful when an abstraction that supports an SQL-based backend is a requirement, or when there is a need to synchronize data between Berkeley DB and SQL databases.
License: Yup, it's a dual license. GPL-like and standard commercial. It may not be right for everybody.
Your suggestion for a new standard for transactional access to key/value pairs is a good one and something missing from the myriad of standards available to Java programmers today. Perhaps this could be something to consider submitting to the JCP. With our Direct Persistence Layer (DPL) we realize that we are not following the EJB3 standard, but clearly it does provide 90% of the same functionality with a very similar API. My point is that although Berkeley DB Java Edition's primary API is key/value, it is by no means limited to that type of data management.
It turns out that at least one JavaSpaces implementation uses Berkeley DB Java Edition for the underlying storage of data, but just calling it a standard isn't really going to address the need for this kind of storage.
Durability is a critical aspect of any database, and any sophisticated database will provide a variety of options for durability. In our tests, 'non-durable writes' are in fact rather durable, just not written onto the disk platters before being considered complete. As a JE transaction commits, the data can be considered 'durable' at any of the following points (depending on the parameters passed to the JE configuration): 1) when the data is in application memory, 2) when the data is in operating system filesystem buffers, 3) when the data is in the disk's write ahead cache (hopefully battery backed, but not necessarily), or 4) when the data is written to the disk's platters. During a benchmark it is easier to see differences in performance when you eliminate the last step (4) which is exactly what we did in our 'non-durable write' tests.
Stated another way, different applications have different requirements in terms of ACID. Consider a web crawler like archive.org's Heritrix. They need A, C, and I, in their multi-threaded environment, but they don't necessarily need D to the extent that a standard TPC often provides. Lots of other applications are happy to just stream data to the database and if it gets to the disk eventually, they're happy. If the system crashes (e.g. due to a power failure) and they lose some of the modifications, it's ok, as long as they still have A and C. I admit that it's a different mentality (relaxing durability) and frankly, it took me a while to get used to it when I first started working on JE. Derby is clearly aimed at a different design point, as evidenced by the relaxed-durability setting of the "derby.system.durability" property being named "test". Your design emphasis is on a strong definition of durability. That's not meant as a criticism, just a recognition that Derby may sometimes be aimed at a different application space than JE.
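To make the commit points listed above a little more concrete, here is a rough sketch using only the JDK's FileChannel. It is not JE code and does not use JE's configuration API; the class name, the CommitPoint enum, and commitAndMeasure are all made up for illustration. The idea is simply that a "commit" can return once the record is in application memory, once it has been handed to the operating system, or only after the OS has been asked to force it to the device.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurabilitySketch {
    /** Hypothetical names for the commit points; this is not JE's configuration API. */
    enum CommitPoint { APP_MEMORY, OS_BUFFERS, FORCED_TO_DEVICE }

    /**
     * Writes one record to a temporary log file, stops as soon as the requested
     * commit point is reached, and reports how many bytes the file holds then.
     */
    static long commitAndMeasure(String record, CommitPoint point) {
        try {
            Path file = Files.createTempFile("durability-sketch", ".log");
            try (FileChannel log = FileChannel.open(file, StandardOpenOption.WRITE)) {
                // Point 1: the record exists only in this process's memory.
                ByteBuffer pending = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
                if (point != CommitPoint.APP_MEMORY) {
                    // Point 2: hand the bytes to the operating system's buffers.
                    while (pending.hasRemaining()) {
                        log.write(pending);
                    }
                    if (point == CommitPoint.FORCED_TO_DEVICE) {
                        // Points 3 and 4: ask the OS to push data and metadata to the
                        // device. Whether the bytes stop in the drive's own cache or
                        // reach the platters is decided by the hardware, not by Java.
                        log.force(true);
                    }
                }
                return log.size();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (CommitPoint point : CommitPoint.values()) {
            System.out.println(point + ": " + commitAndMeasure("key=value", point) + " bytes in file");
        }
    }
}
```

The file size looks the same for the last two commit points. What differs is what survives a power failure and how long the commit takes, since the force call is the expensive step, and that is the step our 'non-durable write' tests leave out.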
Non-zero-based graph: You're right. We're going to fix this and republish the paper. Frankly, I should have caught that when I reviewed it and I'll take responsibility for it. The intent was to demonstrate that there is a detectable difference (albeit a small one) and by using a non-zero based graph it highlights the difference. No sleaze intended, but I understand why you might assume some sleaze-factor was involved. I apologize for this and again, we'll fix it.
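For anyone wondering how much a non-zero baseline can change the picture, here is a small arithmetic sketch. The throughput figures (100 and 107, i.e. a 7% difference) and the baseline of 90 are made-up illustrative values, not numbers from the paper, and the class and method names are hypothetical.

```java
public class AxisBaselineSketch {
    /** Ratio of the drawn bar heights when the vertical axis starts at the given baseline. */
    static double drawnRatio(double smaller, double larger, double baseline) {
        return (larger - baseline) / (smaller - baseline);
    }

    public static void main(String[] args) {
        double slower = 100.0; // hypothetical operations per second
        double faster = 107.0; // 7% higher, also hypothetical

        // With the axis starting at zero, the bars differ by the true 7%.
        System.out.println("baseline 0:  " + drawnRatio(slower, faster, 0.0));
        // With the axis starting at 90, the taller bar is drawn 70% taller.
        System.out.println("baseline 90: " + drawnRatio(slower, faster, 90.0));
    }
}
```

The underlying numbers are identical in both cases; only the drawn proportions change, which is why a zero-based axis is the fairer choice when the real difference is small.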
I'll also go over some of the other marketing-ese a bit more carefully.
Thanks again for your review.
Charles Lamb (charles.lamb -at-sign- the-obvious-domain.com)
Berkeley DB Java Edition Developer, Oracle
Posted by Charles Lamb | December 1, 2006 7:13 AM
Thanks, Charles. I put some quick responses on my blog, which I'll repeat here:
I appreciate your very well-thought-out and gentlemanly response. I suspect it's a habit of mine, coming from Sybase, to take on Oracle as a bit of an adversary. I need to get over that.
I agree that different apps have different durability requirements. But I think people need to be very aware when they look at performance comparisons that they recognize what they're getting. This is the same thing that comes up when people compare Java DB with MySQL, not recognizing that MySQL isn't logging to disk, and seeing amazing fantastic performance numbers.
One thing I note is we both agree that it would be very good to have a JCP for a basic object storage API that is not tied to SQL or J2EE and goes directly to some basic transactional storage. What I described sure sounds like JDO, but I think that's at a high level. Something to look at further...
Posted by David Van Couvering | December 1, 2006 4:25 PM
The changes to the paper have been made and the new pdf has been made "live" on the website.
Posted by Charles Lamb | December 4, 2006 9:10 AM