
September 22, 2006 Archives

September 22, 2006

JE and the Sun T1000/T2000 ("Niagara")

Over the past few months, I've had the opportunity to run some tests on a Sun T2000
(a/k/a "Niagara") box. Several people have asked me how JE does in a
CMT (Chip Multi-Threading) environment, so I wanted to share my notes. I
also ran tests on a T1000, which has the same processor speed as the T2000. I've noted relevant differences where appropriate.

Much
of this text is about the T2000 hardware and its configuration. If you
aren't interested in that aspect of the machine, please skip down below
where I talk about performance. I have also interspersed some comments
about the Sunfire V20Z (dual 1.8GHz Opteron x86 box) since it
represents a fairly well known baseline for many readers.

The
T2000 configuration, as tested, is an 8x4 (8 cores by 4 strands per
core) UltraSPARC processor (1GHz), with 8GB of memory and two Seagate
73GB Serial SCSI drives.

The T1000 configuration, as tested, is
a 24 "strand" (6x4) processor (1GHz), with 2GB of memory and a single
Seagate 500GB SATA drive. Since the T1000 only ships with 0 or 1 disk
drive (there's only room for one drive inside the box), and the only
option is an 80GB drive, I chose the no-drive option and put my own in
(*).

Configuring the Hardware

The T2000 comes on a wooden pallet. It is apparently assembled in China.

Both
machines are very slick looking. The T2000 is a 2U unit (T1000 is 1U).
Neither of them has a power switch. The T2000 has dual power supplies
(it ships with two power cords), triple fans, and four disk drive
slots (on the front panel). Power supplies, fans, and disk drives are
all hot swappable.

The green LEDs for the power and
disk drives look cool. While most systems indicate disk activity by
blinking the disk drive LEDs on, the Niagara series keeps the LED
on to indicate inactivity (blinking it off for activity). You can also
turn on a white blinking "find me" LED using the Advanced Lights Out
Manager (ALOM) (a/k/a SCP or Service Control Processor) so that if you
have multiple units you can remotely help someone in the data center
find the unit. The Sun engineers seem to have made this LED brighter
than the one on the V20Z (I suspect intentionally).

The ALOM is
quite useful and allows you to poweron, reset, poweroff, and do some
minimal configuration. The ALOM lets you talk to its own command line
processor or switch over (and back) to the actual system tty console.
There's no video card on the T2000. You can also ssh into the ALOM if
you don't want to mess with serial anymore.

The ALOM also allows
you to enable and disable memory cards and NICs. At one point I had
stuck in some incorrect memory in the T1000 in order to bring it above
2GB. This caused the ALOM to disable all the memory including the
original Sun memory that was still "good". After removing the incorrect
memory, the machine wouldn't boot, claiming an incorrect memory
configuration. Uh oh. A call to Sun cleared that up. I just needed to
use the 'enablecomponent' command to re-enable the Sun memory and all
was well. Nevertheless, it is nice that you can use the ALOM to disable
memory and other devices on the main unit.

When you first power
it on, the ALOM goes through many diags and POSTs. Once the ALOM has
booted, you can boot the machine, which takes quite a while.
Fortunately, Solaris 10 doesn't need to be rebooted as often as, say,
Windows. My V20Z has been up for 112 days.

The top-cover design
of the T2000 chassis is better than the T1000, which is better than the
V20Z. The V20Z cover is a pain to replace and remove so mostly I just
leave it "almost closed". The T1000 is better, but I still have to give
a little shove forward in order to get it to slide backwards. The T2000
cover seems to work pretty well. It comes off easily and goes back on
pretty easily. There are sensors on both machines so the SCP will tell
you when the cover is off (and refuse to boot it). It is a really nice
plus that Sun has printed diagrams and instructions right on the covers
for easy reference (e.g. how to replace memory, fans, disks, etc.).

The
fans are, like the V20Z, very noisy, but this is expected for a machine
that is going to live in the data center. When you plug the machine in,
some of the fans start. My first reaction was "Gee, this machine isn't
nearly as noisy as the V20Z". But after commanding the ALOM to power on
the actual Niagara processors, either more fans start or the fans cycle
to a higher speed (I didn't bother to determine which) and all hell
breaks loose (noise wise). I donned my noise-cancelling headphones,
configured the machine as quickly as possible, and relocated it from my
office to the data center.

The insides of the T2000 are neat and tidy. All Serial SCSI cables are neatly run and even tied down.

The
disk drives on the T2000 are the newer 2.5" form factor, rather than
the usual 3.5" size. My compliments to Sun for this forward thinking
(improved space utilization) as well as for using Serial SCSI (the
T1000 uses 3.5" SATA). They are serious about space and energy savings
(more on this below). But I do wish they had configurations with larger
capacity drives. The V20Z has the same problem. I was told by the T1000
Product Manager that the Niagara boxes are supposed to be web servers
and all the big-iron storage is on another machine somewhere
(presumably the Galaxy-based boxes that they're working on). The T2000
has PCI-E slots available so you can put more drives on it (e.g. a RAID
or SAN).

The memory on both units is DDR2, PC-4200 (533MHz),
registered, ECC. There are 16 slots in the T2000 and 8 slots in the
T1000. Both the T1000 and T2000 shipped with 512MB DIMMs (2GB total on
the former and 8GB on the latter). If you buy non-Sun memory for these
units, be sure to ask the vendor if they've tested it on a T1000 or
T2000.

The T2000 has a DVD drive on it, but the T1000 does not.
Solaris 10 was pre-installed so I haven't had to use the DVD drive yet.
See the footnote below regarding Solaris installation on the T1000 (it
has no DVD or CD drive). There is no video card on the machine (unlike
the Sunfire V20Z), so to get the T2000 up and running, you need a
serial terminal emulator. I used HyperTerminal on my
notebook. I set it to 9600-8-N-1, emulating a VT100, and it was ready
for action. The Solaris install then asks you a bunch of the usual
questions (approximately 5 minutes worth) about your configuration
(DHCP or fixed ipaddr, network ipaddrs, gateways, net masks, ipv6,
timezone, time, root password, etc.). After rebooting, you have to use
the serial terminal to create a non-root account, but then, since sshd
is running on the unit by default, getting to the box remotely is easy.
Once you're ssh'd in using your non-root account, you can su. For the
most part, I just use xterm to get to the box.

Note to Sun
engineers and Product Management: The Google Mini has a really clean
way of configuring the unit out of the box that doesn't require any
serial cables or magic. You plug your notebook into the Mini's
management network port. The Mini has a DHCP server on it so your
notebook gets assigned some non-routable class C ipaddr by the
Mini (the notebook and the Mini are the only things on this one-wire
management network). You then fire up your web browser to a fixed
non-routable class C ipaddr (i.e. the Mini's fixed addr) with a known
port (printed in the quickstart guide) and you can do the whole
configuration via your browser. Presto! No serial is necessary. Since
the T2000 has a network management port, it seems that this would be
pretty easy to do.

It is clear from all aspects of their design
that these are data-center centric machines so don't order one with the
intent of putting it next to your desk.

Loading up the software
I needed (cvs, emacs, top, etc.) was easy. I went to sunfreeware.com
for all the binaries. Be sure to pick up gnu tar, and then replace the
one in /usr/sbin/tar with the new one. The system comes with Java 1.5
pre-installed. It didn't take long to do a handful of pkgadd commands
and I had xterms and emacs's on my Windows X server.

Performance

My
benchmark comparisons are against a Sun V20Z with 1.8GHz Opterons,
running Solaris 10, and a Windows dual Xeon (2.4GHz, HyperThreading
disabled). Unfortunately, I don't have a V40Z to run against. The
results I talk about below are fairly general in nature and only meant
to give a general feel for what the box can do.

Overall, single
threaded performance is slower on the Niagara than on the Opteron and
the Xeon, but that is not a surprise (the comments I see coming out of
Sun seem to agree with this general assessment). What's interesting
about the Niagara is that you get a truckload of 'CPUs' (albeit virtual
CPUs since it's only 6 or 8 cores on a single chip) for a relatively
small number of dollars. From my point of view, a Berkeley DB Java
Edition ("JE") developer, this makes the T2000 a nice platform for
testing multi-threaded scalability and concurrency. True, each
of the strands is slower, but there sure are a lot of them.

I
have enabled the read and write disk caches on the T1000, T2000, and
V20Z for all of these benchmarks. I am less interested in disk speed
for these tests than CPU speed and multi-threaded performance.

The
primary tuning exercise for JE, and probably the fairest one for the
T2000, was to tune a customer benchmark that has 25 application
threads, all doing read-only queries against a database. The benchmark,
named JESearchRate,
simulates a multi-threaded database server application that uses JE.
The goal of the tuning exercise was to get mpstat to show 100% CPU
utilization on all processors.

The end result of the tuning was that I was able to get JESearchRate
running about twice as fast on the T2000 compared to the V20Z. All of
the changes that I made to JE for this tuning exercise were
incorporated into JE 3.0.11, with the shared latches option being
settable by a configuration parameter ("je.env.sharedLatches").

I realize that this is a pretty short summary of what was a couple of months of work on JESearchRate tuning
(I had access to a T2000 in Sun's lab before the Niagara boxes went
GA), but it really is pretty significant. Many of Sun's competitors in
the server space have made the argument that while CMT is a nice
concept, it requires tuning your applications before you can realize
the performance benefits. In fact, that is exactly what I did: tuned JE
to have better multi-threaded performance, just as the other vendors
predicted. But so what? The changes I made are beneficial to all JE
users, whether they're running on 1.8GHz Pentium-M notebooks or on the
biggest SMP iron available. The good news for me, a developer, is that
I didn't have to pay an arm and a leg (only about $4k) to get my hands
on 24 processors (in the case of a T1000) that I can use for JE
performance tuning.

During the recent release qualification of
JE 3.0.11, I used the T2000 to run our stress tests. Normally, I run
three different configurations of the stress test on a dual processor,
each configuration having four active threads performing transactions
of various types. Depending on the configuration, the tests are both
IO- and CPU-bound. The T2000 was able to run 6 of these pretty easily
and mpstat showed all processors in the 10 to 50% range (top showing
30-35% usage across all CPUs). Since the stress test uses about 30MB of
working set, they all easily fit into memory.

I also ran a JE
"Contentious Update" test against the 2-way 2.4 GHz Xeon Windows/XP
machine. This test is CPU bound and fits in memory on all the tested
machines so disk and memory configurations aren't significant. For
single threaded performance, the Xeon machine is nearly three times
faster than the T2000 (20K ops/sec vs 7k ops/sec). But with two
threads, the Xeon performance drops and the T2000 performance increases
(14K ops/sec vs 8k ops/sec). With four threads it's 10k ops/sec vs 13k
ops/sec. I found similar results with other tests: Xeon performance
drops as the number of concurrent threads increases, but the T2000
throughput increases as the number of threads increases.

Another
interesting test is a random read test with shared latches enabled. I
ran this test against the V20Z, the T2000, and the Xeon. The results
are shown in the graph below. Interestingly, the Xeon peaks with 2
threads, but then drops off at 3 and 4 threads and then levels off
after that. The T2000 scales up until about 3 threads and then levels
off (above the Xeon). The V20Z drops from 1 to 3 threads and then
levels off.




Power Consumption

Sun
has made a lot of smoke about the reduced power consumption with the
Niagara boxes so it seems fitting that I measure it. Our data center
electricity bills cross my desk every month so I'm glad to see Sun
pushing hard on this issue. The T2000 and T1000 that I tested both drew
no more than what Sun's data sheets say they consume.

All current measurements were made with a line voltage of 120.9 VAC (60Hz).

T2000:
The ALOM draws 0.3 amps with the main system powered off. With the main
system powered on, booted, and idle it draws 1.9 amps. I ran continuous
fsync(2) calls on one of the two disks and I couldn't see a noticeable
change in the current drawn. This isn't surprising since disk drives
are generally spec'd to use no more than 12 watts. Running 32
processes, all CPU bound, it drew an additional 0.2 amps, for a
total of 2.1 amps. The Sun T2000 data sheet says that the system draws
275W or less so my system met or beat that specification.

T1000:
The ALOM draws 0.1 amps with the main system powered off. With the main
system powered on, booted, and idle it draws 1.2 amps. Running 24
processes, all CPU bound, it drew another 0.15 amps, for a total of 1.35
amps. The Sun T1000 data sheet says that the system draws 220W or less
so my system met or beat that specification.
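
As a rough cross-check, the current readings above can be converted to apparent power by multiplying by the 120.9 VAC line voltage. The sketch below is illustrative only: the class and method names are made up, and it reports volt-amps, which equal watts only if the power factor is 1 (an assumption; the power factor was not measured). Since real power is never more than apparent power, the volt-amp figure serves as an upper bound.

```java
/** Illustrative conversion of the measured current draw to apparent power. */
class PowerCheck {
    static final double LINE_VOLTS = 120.9; // line voltage reported for the measurements

    /** Apparent power in volt-amps; equals watts only at a power factor of 1 (assumed). */
    static double voltAmps(double amps) {
        return LINE_VOLTS * amps;
    }

    public static void main(String[] args) {
        System.out.printf("T2000 under load: %.1f VA (data sheet: 275 W)%n", voltAmps(2.1));
        System.out.printf("T1000 under load: %.1f VA (data sheet: 220 W)%n", voltAmps(1.35));
    }
}
```

On these numbers the loaded T2000 works out to roughly 254 VA and the loaded T1000 to roughly 163 VA, both below the data sheet figures.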

I was a little
surprised by the difference in ALOM power between the T1000 and T2000.
The Sun T1000 Product Manager says that the T2000 ALOM is on a riser
card and the T1000 is on the system motherboard and that may account
for the difference.

Xeon: The Xeon draws 1.2 amps idle and 1.75 amps when it's running something CPU bound.

V20Z:
The SCP on this is really chintzy with power and only draws 0.1 amps.
When the V20Z is idle, it draws 1.4 amps. When both processors are
running JVMs at 100% each, it draws 1.6 amps.

Some people have
expressed concerns that the Niagara chips don't have binary compatibility
with, say, x86. This is true, but since I use Java, I don't care. All
the utils I want are available in binary on sunfreeware.com. But I
recognize that this may not be sufficient for all potential T2000
customers.

Conclusion

The
Niagara is an interesting box from a developer's point of view because
it offers a large number of strands (virtual CPUs) at a relatively low
price point for concurrency testing. From a data center point of view,
this box could be interesting for people who may be thinking of
upgrading older boxes (like the Xeon used in my tests) where there are
more cycles/watt in the Niagara than in an older Intel machine.

(*) Note
that if you choose this option, you must be prepared to do a network
install of Solaris. Also, Sun doesn't ship the disk drive bracket with
the unit and you can't order it as a separate part. This may have
changed, but I believe they expect most people to purchase the box with
their own 80GB drive (and Solaris pre-installed). Since I couldn't do a
network install (I didn't have a Solaris SPARC box available), the Sun
guys were really nice and imaged my 500GB drive with Solaris to get me
going. They also graciously gave me one of the prototype disk brackets.

Adler32 vs the GC






Recently a user opened a support request for what appeared to be a JE bug: their application was getting OutOfMemoryErrors
and all it was doing was updating and querying records in the database.
We were fortunate enough to have a test case that reproduced the
problem fairly quickly. The case started 16 threads. 15 of these
threads picked a key at random between 1 and 3000 and then either
created a record for that key or updated an existing record at that
key. The other thread would do a scan of the entire set of 3000 keys.
The inserters slept between 1 ms and 1 sec between inserts and the
full-scan thread woke up every 1 minute. Oh, did I say that the data
for each of the records was 512KB? The user reported that the program
ran just fine with JE 2.0.90, but would consistently OOME with 2.1.30.

So
I did a lot of the usual things that one does when trying to debug
OOMEs in Java: (1) run it in your favorite memory profiling
tool, (2) run it in the J2SE 1.6 JVM to try to get a better stack trace
from the OOME, (3) run with -verbose:gc and the handful of other -XX:
GC flags, etc.

When I tried (1), everything looked just fine.
The major consumer of memory (byte[]'s) held pretty firm and well
within the JE cache limit. When I tried (2), the program worked. Hmmm,
go figure. When I ran with (3), everything looked pretty normal -- the
GC was collecting memory quite nicely and then all of a sudden, bam! an
OOME appeared.

I tried increasing the -Xmx JVM parameter while
pinning down the JE cache size. While it ran longer, it still
eventually OOME'd.

I tried diddling the GC parameters, but nothing.

In
desperation, I started checking out snapshots of the source tree
between 2.0.90 and 2.1.30, doing binary searches to see which versions
worked and which didn't. The eventual culprit? The
java.util.zip.Adler32 class -- sort of.

Looking at the native source code for the java.util.zip.Adler32 class, we see this:


    JNIEXPORT jint JNICALL
    Java_java_util_zip_Adler32_updateBytes(JNIEnv *env, jclass cls, jint adler,
                                           jarray b, jint off, jint len)
    {
        Bytef *buf = (*env)->GetPrimitiveArrayCritical(env, b, 0);
        if (buf) {
            adler = adler32(adler, buf + off, len);
            (*env)->ReleasePrimitiveArrayCritical(env, b, buf, 0);
        }
        return adler;
    }


Checking the JNI documentation for GetPrimitiveArrayCritical, we see:


These restrictions make it more likely that the native code will obtain
an uncopied version of the array, even if the VM does not support
pinning. For example, a VM may temporarily disable garbage collection
when the native code is holding a pointer to an array obtained via
GetPrimitiveArrayCritical.


JE passes the entire array
(all 512KB of it in this case) to Adler32 and so it seems that this is
blocking the GC. In a multi-threaded environment, this can be a big
problem.

The workaround is pretty easy. We have a flag, -Dje.disable.java.adler32=true
that can be used to force JE to use our own Adler32. It's slower than
the java.util.zip.Adler32, but it doesn't suffer from GC-blockage. Why
do we have this flag? A while back there was a bug in the J2SE 1.4 JVM
where Adler32 would cause JVM crashes, so we wrote our own to work
around this. We noticed that the 1.5 JVM fixes the bug so we put in
conditional code to check whether we were running on a 1.5 JVM or not,
and if so, use the java.util.zip.Adler32. We added a flag to override
the conditional use of the class.
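
For readers curious what a pure-Java Adler32 involves, here is a minimal sketch. It is not JE's actual implementation (the post does not show it); the class name is hypothetical, and the code simply follows the standard Adler-32 definition: two running sums modulo 65521, combined into a 32-bit value. Because it never enters a JNI critical region, it cannot hold off the GC the way the native call may.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

/** Illustrative pure-Java Adler-32; not JE's actual implementation. */
class PureJavaAdler32 {
    private static final int MOD = 65521; // largest prime below 2^16

    private int a = 1; // running sum of bytes
    private int b = 0; // running sum of the successive values of a

    /** Folds len bytes of buf, starting at off, into the checksum. */
    void update(byte[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            a = (a + (buf[i] & 0xff)) % MOD;
            b = (b + a) % MOD;
        }
    }

    /** Returns the checksum in the same form as java.util.zip.Adler32.getValue(). */
    long getValue() {
        return ((long) b << 16) | a;
    }

    public static void main(String[] args) {
        byte[] data = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        PureJavaAdler32 mine = new PureJavaAdler32();
        mine.update(data, 0, data.length);
        Adler32 jdk = new Adler32();
        jdk.update(data, 0, data.length);
        // The two hex values printed should be identical.
        System.out.println(Long.toHexString(mine.getValue())
                + " vs " + Long.toHexString(jdk.getValue()));
    }
}
```

Taking the modulus on every byte keeps the sketch short; it is also one reason a straightforward Java version can be slower than the native one.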

Golden Penguin Bowl (repost from Sleepycat blog on 4/06/06)

[The following is a repost from the Sleepycat blog.  It was originally posted on 4/06/06 and has been migrated here.]

Greg, Mark, and I just returned from participating in the Golden
Penguin Bowl. This is the trivia contest where they ask you all sorts
of nerdy questions, many of them slams at high-profile execs (Larry
Ellison, Bill Gates, etc.). I've gotta say that it wasn't the kind of
thing that I'd normally volunteer for, but I volunteered for the good
of the whole. Greg, Mark, and I were really pretty scared of getting
slaughtered in this thing, especially when we heard that one of the
MySQL guys had won it the past two years, and that another one on their
team was Ted Tso (a ringer, for sure). We just wanted to escape from
the hour without having made complete fools of ourselves, and perhaps
even answer a few questions. It didn't turn out all that bad and in
fact I think we all had a pretty good time.

I did my team honor
when I answered the question about what "Object Oriented File System
did Microsoft recently remove from Vista, and what year was it first
announced?" [Cairo, 1992]. But it was an Oh Darn! moment when I
answered the question about what AJAX stood for and I answered
Asynchronous Java And XML, knowing full well that it's Asynchronous
JavaScript And XML. D'oh. I thought Jeremy might let it slide by, but
the astute judges caught it. Mark and Greg had their own great answers
(stuff that I never could have come up with) and we celebrated those
with lots of high-fives.

Ted and the MySQL guys did an awesome
job, especially with the question that showed two pictures of Tiananmen
Square, one with the famous picture of the student blocking the
military tank, and the second with a family posing for a picture with
the square behind them. He picked right up that the difference was that
one was from google.com and the other from google.cn. It seems obvious
in retrospect, but under fire, it's not always clear.

Hint to future participants: read /. for a year or so before, and follow that with an APOD chaser or two and you'll be fine.

Comments and a link to the story are at:
http://linux.slashdot.org/linux/06/04/06/1650223.shtml

/. says: Host
Jeremy Allison hit a few raw nerves with the opposing Oracle team,
introducing them as members of the Berkeley DB product line, 'which
Oracle will soon kill.'

But to be honest, we couldn't hear much of
what Jeremy was saying because of the acoustics in the hall. So I
didn't even know he said this until I read it on /. To know what was
being asked, we pretty much had to rely on reading the questions on the
monitor.

Congratulations to the MySQL guys who did a great job.
Because we lost (not by much, I'm pleased to say), Mike will have to
wear a MySQL shirt to work at Redwood Shores. Sorry Mike.

All in all I had a great time and would happily do it again.

Are Your Transactions Really Durable? (Repost from 1/31/06 on Sleepycat Blog)

[The following is a repost from 1/31/06 on the Sleepycat blog.]

A JE user recently asked how many durable (the "D" in ACID) put() calls per second might be possible using JE. Since most of the time in a durable transaction is spent in the fsync(2)
call, the answer is largely dependent on the underlying disk drive
speed. I gave the user a small Java program that repeatedly writes and
fsync's (forces to disk) the same block of data and measures the number
of calls for a given time. Their results were interesting.
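
The program itself is not reproduced in this post. A minimal sketch of such a write+fsync loop might look like the following; the class name, the 4096-byte block size, and the five-second run time are assumptions chosen for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Illustrative write+fsync rate test; not the program referred to in the post. */
class FsyncRate {

    /**
     * Repeatedly rewrites the same block at offset 0 and forces it to disk.
     * Returns the number of write+fsync calls completed in about durationMillis.
     */
    static long measure(File file, int blockSize, long durationMillis) throws IOException {
        byte[] block = new byte[blockSize];
        long durationNanos = durationMillis * 1_000_000L;
        long count = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long start = System.nanoTime();
            do {
                raf.seek(0);        // always overwrite the same block
                raf.write(block);
                raf.getFD().sync(); // ask the OS to force the write to the device
                count++;
            } while (System.nanoTime() - start < durationNanos);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("fsync-rate", ".dat");
        file.deleteOnExit();
        long seconds = 5;
        long count = measure(file, 4096, seconds * 1000);
        System.out.println((count / seconds) + " write+fsync's/second");
    }
}
```

The temporary file lands wherever the JVM's default temp directory is, so to test a particular drive you would point the File at a path on that drive instead.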

On
three different Win XP boxes with the same disk and CPU, the throughput
was 35, 37, and 42 write+fsync's/second. That's between 24 and 29 msec
per forced write. On another box running W2K Server (which for the sake
of this test is the same as W2K), the throughput was 637
write+fsync's/second, or about 1.6 msec per write. So either the disk drive
on the W2K box was significantly faster (the drive is a Seagate
Barracuda, so that theory doesn't hold), or the write cache was enabled
and the writes were not actually being forced to disk.

The Sleepycat DB Reference Guide has a short write-up on disk caches.
While disk write caches certainly can improve database performance, if
you really want the D in ACID, you have to disable them. On-board read
caches are, of course, still OK and should generally help performance.

The
user's results correspond to what we have observed in our lab. Win XP
never uses the disk cache and Win 2K always uses the disk cache.
Further, I don't know of a way of switching either of those settings.
Solaris allows the administrator to enable or disable read and write
caches with the format command. The corresponding Linux command is hdparm(8).

All
of this is to say that if you want Durability for your transactions,
you need to make sure your disk and your operating system are not
faking you out. The best way to do it is to measure performance of
multiple write+fsync's and see if the numbers jibe with the underlying
disk drives.

Database Method Performance in JE 2.1 (repost from 12/23/05 on Sleepycat blog)

[The following is a repost of the 12/23/05 Sleepycat blog entry.]

In JE, the various Database-based operations (e.g. Database.put(), Database.get(), etc.) are all implemented by creating a Cursor object and then running the equivalent Cursor method. The internal Cursor
method implementations all start by "cloning" the cursor and then
performing the requested operation. Upon successful completion of the
operation, the cloned cursor becomes the "real" cursor that is returned
to the user program. That way, if the operation fails somewhere in the
middle of the tree, restoring the original state of the cursor is easy
(we throw away the cloned cursor and revert back to the original cursor
that was passed to us in the first place).
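
The clone-then-swap idea can be illustrated with a toy example. Nothing below is JE code: ToyCursor, its single position field, and the searchRange method are hypothetical stand-ins meant only to show why working on a clone makes failure cheap to undo.

```java
/** Toy illustration of clone-then-swap; these are not JE's Cursor classes. */
class CloneThenSwap {

    /** Toy cursor whose entire state is one position in a sorted array. */
    static final class ToyCursor {
        int position;

        ToyCursor(int position) {
            this.position = position;
        }

        ToyCursor dup() {
            return new ToyCursor(position);
        }
    }

    /**
     * Moves to the first key >= target. All movement happens on a clone, so
     * if no such key exists the original cursor is returned untouched.
     */
    static ToyCursor searchRange(ToyCursor original, int[] sortedKeys, int target) {
        ToyCursor clone = original.dup();
        for (clone.position = 0; clone.position < sortedKeys.length; clone.position++) {
            if (sortedKeys[clone.position] >= target) {
                return clone; // success: the clone becomes the "real" cursor
            }
        }
        return original; // failure: throw away the half-moved clone
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 30};
        ToyCursor cursor = new ToyCursor(1);
        System.out.println(searchRange(cursor, keys, 25).position);
        System.out.println(searchRange(cursor, keys, 99).position);
    }
}
```

The price of this safety is one extra object per operation, which is the overhead in question when the cursor is purely internal.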

But the Database
operations don't need the benefit of the clone operation because the
internal cursor is never exposed to the user. Hence, if a Database operation ends up in a strange state (i.e. the operation is not going to succeed), we don't need to revert the cursor.

The next release of JE (2.1) will remove this performance penalty from the Database operations by not cloning the internal cursor.

Split Lock Table in JE 2.1 (repost from 12/15/05 on Sleepycat blog)

[The following is a repost of the 12/15/05 Sleepycat blog entry.]

JE 2.1 will have a lot of performance improvements. One of them is the Split Lock Table.
We noticed that the lock table was a concurrency bottleneck in some
multi-threaded benchmarks. To get around that, we split the lock table
into N lock tables, where N is a parameter (the default is 1). Then
locks are evenly distributed across those N tables (by taking nodeId
mod N, where nodeId is the lock table key). By choosing an appropriate
value for N (generally a prime, and generally something less than the
number of concurrent threads operating on an environment), performance
can be increased significantly.
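
To make the nodeId mod N idea concrete, here is a minimal sketch of a split table. The class and method names are hypothetical and this is not JE's lock manager; it only shows how spreading keys over N independently synchronized maps reduces contention on any single monitor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative split lock table; the names here are hypothetical, not JE's classes. */
class SplitLockTable<V> {
    private final List<Map<Long, V>> tables = new ArrayList<>();

    SplitLockTable(int nTables) {
        for (int i = 0; i < nTables; i++) {
            tables.add(new HashMap<>());
        }
    }

    /** Chooses a table by taking nodeId mod N. */
    int tableIndex(long nodeId) {
        return (int) Math.floorMod(nodeId, (long) tables.size());
    }

    /** Registers a lock for nodeId unless one exists; returns the lock now in the table. */
    V lockFor(long nodeId, V newLock) {
        Map<Long, V> table = tables.get(tableIndex(nodeId));
        synchronized (table) { // threads contend only when they hash to the same table
            V existing = table.putIfAbsent(nodeId, newLock);
            return existing != null ? existing : newLock;
        }
    }

    public static void main(String[] args) {
        SplitLockTable<String> locks = new SplitLockTable<>(7);
        System.out.println("nodeId 15 maps to table " + locks.tableIndex(15L));
        System.out.println("lock for 15: " + locks.lockFor(15L, "lock-A"));
    }
}
```

With N set to 1 this degenerates to a single synchronized map, which corresponds to the default behavior described above.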

JE and 64b JVMs (Repost from 11/30/05 Sleepycat blog)

[This is a repost of the 11/30/05 Sleepycat blog entry.]

A user recently noted that OutOfMemoryErrors happened while running JE in a 64b JVM,
whereas they didn't happen under a 32b JVM.
The reason was apparent. JE tries hard to only use as much memory as
the application program has given it in cache (e.g. using the je.maxMemory
parameter). How can a Java library like JE know just how much memory is
being used by the objects that it's responsible for? It has to keep a
running tally of each object in the JE cache.

But how does it
figure out the size of each object? One answer is to use the Java 5
reflection APIs to compute the size of an object at run time.
Unfortunately, that would be too slow and also would only be available
on Java 5 (JE is supported on 1.4 JVMs). Instead, for each object that
can use space in the cache (and some others that aren't in the cache,
like Transactions), we take a constant value as the base amount of
memory used -- we calculate this at "build" time -- and add to it the
size of variable components (e.g. arrays that reference other objects,
HashSets and their elements, etc.)

To determine the size of an element we use a "hack" Sizeof program, similar to the one here.
Not surprisingly, when we ran our Sizeof program in a 64b JVM, the
results were quite a bit different. Many of them were 50% different and
that means that the cache usage calculations for JE 2.0.x are quite
wrong under a 64b JVM.

JE 2.1 will have a fix that will
determine at runtime what JVM architecture JE is running on and then
use an appropriate set of constants. Until then, if you run on a 64b
JVM, give about 50% more heap than you would normally (but leave your
cache size the same).
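
The accounting scheme described above (a constant base size per object plus the size of its variable parts, with the constants chosen per JVM architecture) can be sketched roughly as follows. Every number and name here is a made-up placeholder, not a value from JE, and reading the sun.arch.data.model system property is just one way a runtime check might be done on Sun JVMs; the post does not say how JE detects the architecture.

```java
/** Illustrative cache accounting; all constants below are made-up placeholders. */
class MemoryBudgetSketch {
    // Hypothetical fixed overheads in bytes, of the kind a Sizeof-style run might produce.
    static final int NODE_OVERHEAD_32 = 64;
    static final int NODE_OVERHEAD_64 = 96;
    // Hypothetical cost of one reference slot in an array.
    static final int REF_SIZE_32 = 4;
    static final int REF_SIZE_64 = 8;

    /** Assumption: the JVM reports "32" or "64" in this Sun-specific property. */
    static boolean is64BitJvm() {
        return "64".equals(System.getProperty("sun.arch.data.model"));
    }

    /** Constant base size plus the variable part (here, one reference per child slot). */
    static long nodeSize(boolean is64Bit, int childSlots) {
        int overhead = is64Bit ? NODE_OVERHEAD_64 : NODE_OVERHEAD_32;
        int refSize = is64Bit ? REF_SIZE_64 : REF_SIZE_32;
        return overhead + (long) childSlots * refSize;
    }

    public static void main(String[] args) {
        System.out.println("node with 128 slots: " + nodeSize(is64BitJvm(), 128) + " bytes");
    }
}
```

A running tally would add nodeSize(...) when a node enters the cache and subtract it on eviction; using the 32b constants on a 64b JVM makes that tally undercount, which is the failure mode described above.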

Private vs Shared Database Handles in JE (repost from 11/17/05 Sleepycat blog entry)

[reposted from the 11/17/05 Sleepycat blog entry]

I was recently tuning a benchmark that has lots of threads doing retrievals against JE. The code used a single Database handle shared across all threads. This turned out to present a minor bottleneck because the Database handle maintains a set of Cursors open against it. This set is used to check if all Cursors are closed against the Database when close() is called, but to do that we have to synchronize against it before updating it. So if a bunch of threads are sharing the same Database handle it makes for a synchronization bottleneck.

The moral of the story is that in a multi-threaded case, unless there's a good reason to share a Database handle, it's probably better to use separate handles for each thread.

Unsorted Retrieval in JE (Repost from 11/10/05 Sleepycat blog)

[This is a repost of the 11/10/05 Sleepycat blog entry.]

We've been working a lot on performance improvements to JE and recently
we devised a nice way to speed up a couple of operations by an order of
magnitude or so. It's called Sorted LSN Tree Walking, and the basic
idea is that rather than walk the tree in key order, we walk it in Log
Sequence Number (LSN) order.

To do this, we start
at the root and descend as far down as we can until we find a
node that is not in memory. Then we take all the LSNs of the
out-of-cache children and add them to an array. We keep descending down
all the various paths in the tree until we've accumulated all of the
LSN's of out-of-cache children. Then, we sort those LSNs, and for each
LSN in the sorted set, read in the node and accumulate all of the child
LSNs. Lather, rinse, repeat. We see all the nodes in the tree that way,
and while we're fetching them from disk, we do it in a sorted order so
that the disk head doesn't dance around in random access patterns.
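
The walk can be sketched with a toy model. This is not JE's tree walker: the Node type, the map standing in for the log, and the simplifying assumption of a completely cold cache (every node must be fetched by LSN) are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sorted-LSN walk over a cold cache; not JE's actual tree walker. */
class SortedLsnWalk {

    /** Toy on-disk node: only the LSNs of its children matter here. */
    static final class Node {
        final long[] childLsns;

        Node(long... childLsns) {
            this.childLsns = childLsns;
        }
    }

    /**
     * Visits every node reachable from rootLsn. Each batch of pending LSNs is
     * sorted before fetching, so reads within a batch go in ascending order.
     * Returns the order in which LSNs were fetched.
     */
    static List<Long> walk(Map<Long, Node> log, long rootLsn) {
        List<Long> fetchOrder = new ArrayList<>();
        List<Long> pending = new ArrayList<>();
        pending.add(rootLsn);
        while (!pending.isEmpty()) {
            Collections.sort(pending);
            List<Long> next = new ArrayList<>();
            for (long lsn : pending) {
                Node node = log.get(lsn); // stands in for a disk read at this LSN
                fetchOrder.add(lsn);
                for (long child : node.childLsns) {
                    next.add(child); // accumulate children for the next batch
                }
            }
            pending = next; // lather, rinse, repeat
        }
        return fetchOrder;
    }

    public static void main(String[] args) {
        Map<Long, Node> log = new HashMap<>();
        log.put(100L, new Node(40, 90, 10));
        log.put(40L, new Node(70, 20));
        log.put(90L, new Node());
        log.put(10L, new Node());
        log.put(70L, new Node());
        log.put(20L, new Node());
        System.out.println(walk(log, 100L));
    }
}
```

Nodes are visited in an order unrelated to their keys, which is acceptable for any operation that only needs to see every node once.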

We devised this scheme when a customer needed improved performance for Database.truncate() and Database.remove().
You might wonder why these two methods would need to scan all the nodes
in the B+Tree for a database. The answer is twofold. First, depending
on the arguments passed to the truncate/remove methods, in some cases
we need to return the count of entries to the caller. While this is an
important reason for needing to walk the tree, it's nevertheless
relatively easy to tell people that if they want a fast execution of
truncate() or remove() that they shouldn't expect an accurate return
for the count. The second reason is that we need to update the
Utilization Profile to correctly reflect the newly freed space. The
Utilization Profile is used by the cleaner to determine both which
files are going to give us the most bang for the buck when we clean
them as well as which records in the file are obsolete. If we remove a
database, then all the entries for that database become obsolete (and
thus more interesting to the cleaner). In order to update the
Utilization Profile, we have to look at each node in the database's
B+Tree. But it doesn't matter what order we walk that tree in. So an
LSN sorted order turns out to be very fast if the database is not in
cache. Not to mention that we don't have to worry about concurrency and
locking because the database is closed.

Now that we have developed this technology, it has other uses. One that we're thinking about is improving the Database.preload()
method. After all, the method doesn't make any guarantees about what
actual items it will load -- just that it will load up as much as it
can. Another possible use is DbDump. Presently this utility dumps in
key-sorted order and this means that a restore of the database using
DbLoad will be optimal (key order insertion). However, if a user is
willing to trade off DbLoad speed for DbDump speed, it may make more
sense to use a SortedLSN based dump.

This technology will be built into the 2.1 release of JE, scheduled to be released in the first quarter of 2006.

Database.preload() Improvements in JE 2.1 (Repost from 12/14/05 on Sleepycat blog)

[The following is a repost of the 12/14/05 Sleepycat blog entry.]

In Unsorted Retrieval in JE I mentioned that JE 2.1 would use a new Sorted LSN Tree Walker technology to improve performance for Database.truncate() and Database.remove(). I also suggested that JE might eventually use this for improving Database.preload()
performance. The changes that make preload use this technique were just
checked in so it will in fact be released in JE 2.1. The performance
improvement for a cold-cache preload (i.e. cold OS file system cache)
is on the order of about 50% for a database that was created in random
order (YMMV). Database.preload() will also allow optionally loading LN's (the data portion of the B+Tree). Finally, Database.preload()
will return stats about the number and type of nodes that were loaded
as well as whether the operation completed successfully or was
terminated due to filling the cache or exceeding the user-designated
time limit.

JE and Java 5 Locks (Repost from 10/25/05 on Sleepycat blog)

[This is a repost of the 10/25/05 Sleepycat blog entry.]

Java 5 has a new lock substrate embodied in the java.util.concurrent.locks
package. These classes are similar to the internal latching mechanisms
that we use in Berkeley DB Java Edition (JE) so it seemed like a
natural choice to make use of this new library in the hope that it
would provide better performance for JE. After all, the Java 5 lock API
is similar to the com.sleepycat.je.latch
API. In JE, a latch is a short-lived, non-persistent lock on a data
structure (e.g. an internal tree node) that helps JE maintain
consistency. In fact, latches are very similar to synchronized blocks (we opt for synchronized whenever possible), but in some cases, the lexically scoped nature of synchronized is
not suitable. For instance, a technique that JE uses to descend the
B+tree is "latch coupling" (latch A; latch B; release A; latch C;
release B; ...) and this cannot be implemented easily with a lexically
scoped construct since the latch acquisitions and releases are
intertwined.
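
The latch coupling sequence can be sketched with java.util.concurrent.locks.ReentrantLock. This is only an illustration of the technique on a toy binary tree: the Node class and the contains() method are hypothetical and are not JE's tree or latch code.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LatchCoupling {
    /** Toy tree node (hypothetical) guarded by its own latch. */
    static final class Node {
        final ReentrantLock latch = new ReentrantLock();
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    /**
     * Descends toward key while holding at most two latches at a time:
     * the child is latched before the parent is released, so no other
     * thread can slip in between the two nodes during the descent.
     */
    static boolean contains(Node root, int key) {
        Node current = root;
        current.latch.lock();                  // latch A
        try {
            while (current.key != key) {
                Node child = key < current.key ? current.left : current.right;
                if (child == null) {
                    return false;
                }
                child.latch.lock();            // latch B while still holding A
                current.latch.unlock();        // release A
                current = child;               // B becomes the new "A"
            }
            return true;
        } finally {
            current.latch.unlock();            // release the one latch still held
        }
    }

    /** Small made-up tree used by main() and by tests. */
    static Node demoTree() {
        Node root = new Node(50);
        root.left = new Node(30);
        root.right = new Node(70);
        root.left.left = new Node(20);
        root.left.right = new Node(40);
        return root;
    }

    public static void main(String[] args) {
        Node root = demoTree();
        System.out.println(contains(root, 40)); // prints true
        System.out.println(contains(root, 60)); // prints false
    }
}
```

Note that the lock() and unlock() calls for different nodes interleave inside the loop, which is exactly what nested synchronized blocks cannot express.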

Internally, the JE latch classes use a synchronized block during both the "acquire" and "release" calls. So a JE latch costs at least two synchronized calls, which, even though relatively inexpensive, still add up.

Initial
tests of the new Java 5 locks showed that significant performance
improvements might be realized. The only possible downside was that
we'd have to change our latch API to be an interface so that we could
"switch hit" between our latch code and the Java 5 locks, depending on
whether JE was in a Java 5 or 1.4 JVM environment. Once the integration
of the Java 5 locks into JE was completed, we ran our performance
regression tests (a series of "micro benchmarks" that each focus on a
specific JE function) to see just how much difference the new locks
made.
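
A rough sketch of what "switch hitting" behind an interface could look like follows. The Latch interface, both implementations, and the newLatch() factory are hypothetical names invented for this example; they are not the com.sleepycat.je.latch API. One simplification to be aware of: the monitor-based latch here is not reentrant, while ReentrantLock is.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LatchDemo {
    /** Hypothetical latch interface that callers program against. */
    interface Latch {
        void acquire() throws InterruptedException;
        void release();
    }

    /** Monitor-based latch: one synchronized call each for acquire and release. */
    static final class MonitorLatch implements Latch {
        private Thread owner;
        public synchronized void acquire() throws InterruptedException {
            while (owner != null) {
                wait();                         // block until the latch is free
            }
            owner = Thread.currentThread();
        }
        public synchronized void release() {
            if (owner != Thread.currentThread()) {
                throw new IllegalMonitorStateException("not the latch owner");
            }
            owner = null;
            notifyAll();                        // wake any waiting acquirers
        }
    }

    /** Latch backed by a Java 5 lock. */
    static final class Java5Latch implements Latch {
        private final ReentrantLock lock = new ReentrantLock();
        public void acquire() { lock.lock(); }
        public void release() { lock.unlock(); }
    }

    /** Picks an implementation based on what the running JVM provides. */
    static Latch newLatch() {
        try {
            Class.forName("java.util.concurrent.locks.ReentrantLock");
            return new Java5Latch();
        } catch (ClassNotFoundException e) {
            return new MonitorLatch();
        }
    }

    /** Increments a shared counter from several threads under the given latch. */
    static int countWith(Latch latch, int threads, int perThread)
            throws InterruptedException {
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    try {
                        latch.acquire();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        counter[0]++;
                    } finally {
                        latch.release();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return counter[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWith(new MonitorLatch(), 4, 10000)); // prints 40000
        System.out.println(countWith(newLatch(), 4, 10000));         // prints 40000
    }
}
```

On any current JVM the factory always returns the Java 5 variant; the fallback branch only shows the shape of the decision described above.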

The results were gratifying. For single threaded tests,
any degradation due to calling through an interface (where we hadn't
called through one before) was negligible. For multi-threaded tests,
the more threads in the test, the better the results. Some tests showed
10% improvement, and one customer's write benchmark was even better
than that.

Traditionally, we have seen much better performance
using the Java 5 JVM (especially the "server" JVM), and this adds still
another improvement to JE performance under the "Tiger" JVM.

The Java 5 latch functionality will be available in a future release of JE after it undergoes more testing.

Experiences Tuning Berkeley DB Java Edition (Repost from 10/24/05 on Sleepycat blog)

[This is a repost of the 10/24/05 Sleepycat blog entry.]

Recently, I've been working on a JE user's performance test. They have
a need for high insert throughput while simultaneously reading records
from the same database. While tuning the insertion test I learned some
interesting things about the Java Garbage Collector and how it
interacts with the JE Cache.

The benchmark is simple: it writes
one million records with keys that are sequential "longs". The total
database size is about 380MB. In order to keep the database in-memory,
I set the cache size to 512MB and the log buffer size to 64MB. With
those parameters as a baseline, I began my tuning.

One of the first things I tried was to adjust the JVM heap size (-Xmx).
I saw some odd results. For instance, if I ran the benchmark with no
-Xmx parameter on the command line, it performed better than if I
specified -Xmx512M. I was running on a "server class" machine (a Sun
V20Z, 2GB, 2 x Opteron 244 CPU, Solaris 10, which, by the way, really
screams) so I knew I was getting the server JVM by default. When I
printed out Runtime.maxMemory(), the JVM returned the same value (512MB) whether I invoked it with or without -Xmx512M. Yet with that command line option (or lack thereof) being the only difference, the results were clearly slower with the -Xmx parameter than without. What I didn't know was that if I didn't specify -Xmx, then the JVM chose values as described in Ergonomics in the 5.0 Java[tm] Virtual Machine.
Specifically, on a server class machine, the initial heap size is 1/64
of the physical memory (1/64th of 2GB is 32MB) and the max heap size is
1/4 of the physical memory (512MB in my case). But if you do specify
-Xmx, then the initial heap size is not chosen based on the ergonomics
parameters. So, for example, a simple test program that prints
Runtime.maxMemory() and Runtime.totalMemory() shows the following:


> java Test # Java Ergonomics chooses values
maxMemory: 517013504 # 512MB
totalMemory: 33554432 # 32MB
freeMemory: 33269360
> java -Xmx512M Test # Java Ergonomics does not choose values
maxMemory: 517013504 # 512MB as specified by -Xmx512M
totalMemory: 5111808 # something smaller than 32MB
freeMemory: 4826736
> java -Xmx512M -Xms32M Test # User supplies both
maxMemory: 517013504 # 512MB as specified by -Xmx512M
totalMemory: 33554432 # 32MB as specified by -Xms32M
freeMemory: 33269360


So that explained why I was seeing different results when I specified -Xmx (but not -Xms).
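
For reference, a test program along these lines is only a few lines long. The sketch below is a reconstruction, not the original source: the class name HeapSizes and the two helper methods are made up for this example (the runs above invoke a class called Test). The helpers simply apply the 1/64 and 1/4 fractions quoted above and ignore any upper limits the JVM may apply.

```java
public class HeapSizes {
    static final long MB = 1024L * 1024L;

    /** Initial heap under the 1/64-of-physical-memory rule described above. */
    static long ergonomicInitialHeap(long physicalBytes) {
        return physicalBytes / 64;
    }

    /** Maximum heap under the 1/4-of-physical-memory rule described above. */
    static long ergonomicMaxHeap(long physicalBytes) {
        return physicalBytes / 4;
    }

    /** Formats the three Runtime values in the same layout as the runs above. */
    static String report() {
        Runtime rt = Runtime.getRuntime();
        return "maxMemory: " + rt.maxMemory() + "\n"
             + "totalMemory: " + rt.totalMemory() + "\n"
             + "freeMemory: " + rt.freeMemory() + "\n";
    }

    public static void main(String[] args) {
        System.out.print(report());

        // Worked arithmetic for a machine with 2GB of physical memory.
        long physical = 2048 * MB;
        System.out.println(ergonomicInitialHeap(physical)); // prints 33554432 (32MB)
        System.out.println(ergonomicMaxHeap(physical));     // prints 536870912 (512MB)
    }
}
```

The 33554432 figure matches the totalMemory value in the first and third runs. The maxMemory the JVM reported (517013504) is slightly below a full 512MB.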

The next (and more) interesting observation came when I adjusted the JE cache size (je.maxMemory).
The benchmark's baseline setting of 512MB was clearly enough to hold
all of the entries in the database. But just to confirm that, I
decreased the cache size. Surprisingly, the benchmark performance
improved! Additional decreases of the cache size all the way down to
8MB continued to produce improved performance. The reason behind this
counter-intuitive phenomenon lay with the GC. The document Tuning Garbage Collection with the 5.0 Java[tm] Virtual Machine
provided assistance in narrowing this down. By using the various GC
debugging command line options available in the J2SE Java 5 JVM ("-verbose:gc", "-XX:+PrintGCDetails", "-XX:+PrintGCTimeStamps") it was pretty easy to figure out what was going on with the GC.

GC's
of free objects in "Young Space" are cheap, but full GC's that migrate
objects are expensive. By increasing the cache size, JE was leaving
more objects in the cache. Even though they were never accessed again
(since this was just an insertion test), with a large cache those
objects had to be migrated to "Tenured Space". But with a smaller
cache, the objects could be evicted (i.e. "freed") thereby allowing the
GC to free them from the "Young Space" (cheap). The highest performance
was obtained when the heap was sized so that the "Young Space" was large
enough that no Full GC's were necessary.

About September 2006

This page contains all entries posted to Charles Lamb's Blog in September 2006. They are listed from oldest to newest.
