More of a Good Thing

The throughput collector is getting more parallel.

In our J2SE 6 release there is an option in the throughput collector to collect the tenured generation using multiple GC threads. Yes, the UseParallelGC collector wasn't all that we wanted it to be in terms of parallelism. Without this recent work only the young generation was being collected in parallel.

The option to turn on the parallel collection of the tenured generation is -XX:+UseParallelOldGC. That's the punch line for this blog. Oh, also it's off by default.

Below are a few details on the new collector. Actually, first I'll describe the serial collector used for the tenured generation collection and some of the things we learned from it. Then I'll describe the new collector and how we used what we had learned.

The J2SE 5 (and earlier) collector for the tenured generation was a serial mark-sweep-compact collector. There were four phases of that collector.

1. Mark all the live objects

2. Calculate the new locations of the objects

3. Adjust all object references to point to the new locations

4. Move the objects to their new locations

Phase 1 found and marked the live objects.

In phase 2 the collector kept a pointer to the first location in the tenured generation to which we could move a live object during the collection. Let's call that location T. T started at the bottom of the generation. The collector scanned forward from the bottom of the generation looking for marked (i.e., live) objects. When a live object A was found, the collector saved T in A (i.e., saved the new location of A in A). It then calculated the next location T as T + sizeof(A) and restarted the scan looking for the next live object to put at the new T.

Phase 3 started at the roots, references to objects known to the application (for the most part), and changed all the references in the roots and all the reference found from the roots to point to the new location of the objects (to the T's saved in the A's).

Phase 4 again started at the bottom of the tenured generation and scanned forward to find live objects. When a live object was found, it was moved to it's new location (the T's).

The end result was that all the live objects were moved to the lower end of the tenured generation and all the free space was in one contiguous region at the higher end of the tenured generation. It simplifies allocations from the tenured generation to have all the free space in a single region. Also there are no issues of fragmentation as would be the case if the collection resulted in multiple regions of free space separated by live objects. This single region of free space seemed an important property to keep when thinking about how to do a collection of the tenured generation using multiple GC threads.

Another feature of our serial mark-sweep-compact collector that we liked is the flexibility to leave some deadwood at the lower end of the tenured generation. We use the term "deadwood" to refer to garbage at the lower end of the generation that we're not going to collect. We don't collect it to reduce the number of objects that we have to move. If we have objects

A B C D

and if B is garbage and we collect it, then we have to move C and D. If we treat B as deadwood, then C and D can stay where they are. We make a judgment call to waste a certain amount of space (that occupied by the deadwood) in order to move fewer objects.

You may have noticed that each of the four phases of our mark-sweep-compaction walked over at least all the live objects. In some cases it actually walked over all the objects (live and dead). When we're scanning the heap (such as in phases 2), we look at a dead object just to find its size so that we can step over it. Any part of the collection that walks over the objects is costly so with the new parallel collector for the tenured generation we wanted to have fewer walks over the objects.

Keeping all this in mind resulted in a parallel collector that

\* Marked the live objects using multiple GC threads and maintained some summary information (explained below).

\* Used the summary information to determine where each live object would go. Also determined the amount of deadwood to keep.

\* Moved live objects so there was one contiguous block of free space in the generation and updated references to live objects using multiple GC threads.

I'll refer to these three phases as the marking phase, summary phase and move-and-update phase. In some of the other documentation on this new collector the last phase is referred to as the compaction phase. I prefer move-and-update because that's the type of name used in the code. Just in case you're ever looking at the code.

Marking phase

Doing the marking in parallel was straight forward. The parallel young generation collector does this and a similar technique was used. The additional twist to the marking was creating the summary information. The heap is logically divided into chunks. Each object starts within a chunk. As we're doing the marking, when a live object was found, its size was added to the chunk containing the start of the object. At the end this gives us the size the of data for objects that begin in each chunk.

Summary phase

The summary phase first walks over the chunks looking for the extent of the heap that contains a good amount of deadwood. We call that part of the heap the dense prefix. Live objects that are moved will be placed in the heap starting immediately after the dense prefix. Given that we know the amount of live data for each chunk, we can calculate the new location of any live object using the summary data. Basically adding up the sizes of the live data in all the chunks before chunk A tells you where the live data in chunk A is going to land. The summary phase is currently done by one thread but can be done in parallel if it turns out to be a scaling bottleneck.

Update-and-move phase

Given the summary information the move-and-update phase identifies chunks that are ready to be filled and then moves the appropriate objects into those chunks. A chunk is ready to be filled if it is empty or if its objects are going to be compacted into that same chunk. There is always at least one of those. The move-and-update phase is done with multiple GC threads. One thing to note is that it is possible that there is only 1 ready chunk at any given time. If the chunks are A, B, C, D ... and the only dead object is in A and fits entirely in A, then A is the only ready chunk. The move-and-update phase will first fill any remaining space in A. Then B will be the only ready chunk. Obviously an extreme case but it makes the point. There is a technique to widen this "gap" of ready chunks. We expect to implement it, but it's not in this release.

That's basically it. We're working on some scaling issues with this new collector, but it working pretty well as is.

For more on this new collector see the slides for the GC presentation done at this year's JavaOne. They can be download from

http://java.sun.com/javaone/

The last time I looked, there was a "Click here" link for the technical sessions in the first paragraph. Following that link, download the Java SE bundle and in there our GC presentation is TS-1168.

Comments:

Jon, In one of your write ups on GC in tiger, it was mentioned that the low pause GC (tenured gen) and throughput collector were mutually exclusive. But here, it looks like the parallelism is being extended to both generations. How different is this from having throughput and low pause GCs working together? Is it the difference in priorities (low pause goal, throughput goal etc)? And why isn't i-cms reused now (for UseParallelOldGC)?

Posted by Bharath R on August 30, 2006 at 09:50 PM PDT #

With the parallel collections of the tenured generation with the throughput collector we are reducing the pause time for that collection but the collection is still a stop-the-world collection. With the low pause collector most of the collection of the tenured generation is done while the application continues to run. We do have two stop-the-world phases in the low pause collector but those are typically are very short when compared to the collection of the tenured generation with the throughput collector. ICMS is a version of the low pause collector where the concurrent phases of the collection are divide into slices and scheduled between the young generation collections. It is still mostly concurrent so again has distinctly different characteristics from the UseParallelOldGC collector. Please ask again if I misunderstood your questions.

Posted by Jon Muasamits on August 31, 2006 at 05:24 AM PDT #

Post a Comment:
Comments are closed for this entry.
About

jonthecollector

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today