X

Jon Masamitsu's Weblog

  • Java
    August 30, 2006

More of a Good Thing

Guest Author
The throughput collector is getting more parallel.

In our J2SE 6 release there is an option in the throughput collector to collect
the tenured generation using multiple GC threads. Yes, the UseParallelGC collector wasn't
all that we wanted it to be in terms of parallelism. Without this recent work
only the young generation was being collected in parallel.

The option to turn on the parallel collection of the tenured generation is
-XX:+UseParallelOldGC.
That's the punch line for this blog. Oh, also it's off by default.

Below are
a few details on the new collector. Actually, first I'll describe the serial collector
used for the tenured generation collection and some of the things we learned from it. Then I'll describe the new collector and how we used what we had learned.

The J2SE 5 (and earlier) collector for the tenured generation was a serial mark-sweep-compact collector. There were four phases of that collector.

1. Mark all the live objects

2. Calculate the new locations of the objects

3. Adjust all object references to point to the new locations

4. Move the objects to their new locations

Phase 1 found and marked the live objects.

In phase 2 the collector kept a pointer to the first location
in the tenured generation to which we could move a live object during the collection.
Let's call that location T. T started at the bottom of the generation.
The collector scanned forward
from the bottom of the generation looking for marked (i.e., live) objects. When
a live object A was found, the collector saved T in A (i.e., saved the
new location of A in A). It then calculated
the next location T as T + sizeof(A) and restarted
the scan looking for the next live object to put at the new T.

Phase 3 started at the roots, references to objects known to the application (for the
most part), and changed all the references in the roots and
all the reference found from the roots
to point to the new location of the objects (to
the T's saved in the A's).

Phase 4 again started at the bottom of the tenured
generation and scanned forward to find live objects. When
a live object was found, it was moved to it's new location (the T's).

The end result was that all the live objects were moved to the lower end of the
tenured generation
and all the free space was in one contiguous region at the higher end of
the tenured generation. It
simplifies allocations from the tenured generation to have all the free space in a
single region. Also there are no issues of fragmentation as would be the case
if the collection resulted in
multiple regions of free space separated by live objects. This single
region of free space seemed an important
property to keep when thinking about how to do a collection of the tenured generation
using multiple GC threads.

Another feature of our serial mark-sweep-compact collector that we liked is the flexibility
to leave some deadwood at the lower end of the
tenured generation. We use the term "deadwood" to
refer to garbage at the lower end of the generation
that we're not going to collect. We don't
collect it to reduce the number of objects that we have to move.
If we have objects

A B C D

and if B is garbage and we collect it, then we have to move C and D.
If we treat B as deadwood, then C and D can stay where they are.
We make a judgment call to waste a certain amount of space (that occupied by
the deadwood) in order to move fewer objects.

You may have noticed that each of the four phases of our mark-sweep-compaction walked
over at least all the live objects. In some cases it actually walked over
all the objects (live and dead). When we're scanning the heap (such as in phases 2),
we look at a dead object just to find its size so that we can step over it.
Any part of the collection that walks over the objects is costly so
with the new parallel collector for the tenured generation we wanted to have fewer
walks over the objects.

Keeping all this in mind resulted in a parallel collector that

\* Marked the live objects using multiple GC threads and maintained some summary information (explained below).

\* Used the summary information to determine where each live object would go. Also determined the amount of deadwood to keep.

\* Moved live objects so there was one contiguous block of free space in the generation and updated references to live objects using multiple GC threads.

I'll refer to these three phases as the marking phase, summary phase and
move-and-update phase. In some of the other documentation on this new collector the
last phase is referred to as the compaction phase. I prefer move-and-update because
that's the type of name used in the code. Just in case you're ever looking at the
code.

Marking phase

Doing the marking in parallel was straight forward. The parallel young generation collector does this and a similar technique was used. The additional twist to the marking was creating the summary information. The heap is logically divided into chunks. Each object starts within a chunk. As we're doing the marking, when a live object was found, its size was added to the chunk containing the start of the object. At the end this gives us
the size the of data for objects that begin in each chunk.

Summary phase

The summary phase first walks over the chunks looking for the extent of the heap that contains a good amount of deadwood. We call that part of the heap the
dense prefix. Live objects that are moved will be placed in the heap
starting immediately after the
dense prefix. Given that we know the amount of live data
for each chunk, we can calculate the new location of any live object using the summary
data.
Basically adding up the sizes of the live data in all the chunks before chunk A tells you where the live data in chunk A is going to land.
The summary phase is currently done by one thread but can be done in parallel if it
turns out to be a scaling bottleneck.

Update-and-move phase

Given the summary information the move-and-update phase identifies chunks that are ready to be filled and then moves the appropriate objects into those chunks. A chunk is ready to be filled if it is empty or if its objects are going to be compacted into that same chunk. There is always at least one of those. The move-and-update phase is done with multiple GC threads. One thing to note is that it is possible that there is only 1 ready chunk
at any given time. If the chunks are A, B, C, D ... and the only dead object is in
A and fits entirely in A, then A is the only ready chunk. The move-and-update phase will
first fill any remaining space in A. Then B will be the only ready chunk. Obviously an extreme case but it makes the point. There is a technique to widen
this "gap" of ready chunks.
We expect to implement it, but it's not in this release.

That's basically it. We're working on some scaling issues with this new collector, but it working pretty well as is.

For more on this new collector see the slides for the GC presentation done at this
year's JavaOne. They can be download from

http://java.sun.com/javaone/

The last time I looked, there was a "Click here" link for the technical sessions
in the first paragraph.
Following that link, download the Java SE bundle and in there our GC presentation is
TS-1168.

Join the discussion

Comments ( 2 )
  • Bharath R Thursday, August 31, 2006
    Jon,
    In one of your write ups on GC in tiger, it was mentioned that the low pause GC (tenured gen) and throughput collector were mutually exclusive. But here, it looks like the parallelism is being extended to both generations. How different is this from having throughput and low pause GCs working together? Is it the difference in priorities (low pause goal, throughput goal etc)? And why isn't i-cms reused now (for UseParallelOldGC)?
  • Jon Muasamits Thursday, August 31, 2006
    With the parallel collections of the tenured generation with the throughput collector we are reducing the pause time for that collection but the collection is still a stop-the-world collection. With the low pause collector most of the collection of the tenured generation is done while the application continues to run. We do have two stop-the-world phases in the low pause collector but those are typically are very short when compared to the collection of the tenured generation with the throughput collector. ICMS is a version of the low pause collector where the concurrent phases of the collection are divide into slices and scheduled between the young generation collections. It is still mostly concurrent so again has distinctly different characteristics from the UseParallelOldGC collector.
    Please ask again if I misunderstood your questions.
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.