Tuesday Nov 28, 2006

Presenting the Permanent Generation

Have you ever wondered how the permanent generation fits into our generational system? Ever been curious about what's in the permanent generation? Are objects ever promoted into it? Ever promoted out? Well, you're not alone. Here are some of the answers.

Java objects are instantiations of Java classes. Our JVM has an internal representation of those Java objects and those internal representations are stored in the heap (in the young generation or the tenured generation). Our JVM also has an internal representation of the Java classes and those are stored in the permanent generation. That relationship is shown in the figure below.

The internal representation of a Java object and an internal representation of a Java class are very similar. From this point on let me just call them Java objects and Java classes and you'll understand that I'm referring to their internal representation. The Java objects and Java classes are similar to the extent that during a garbage collection both are viewed just as objects and are collected in exactly the same way. So why store the Java classes in a separate permanent generation? Why not just store the Java classes in the heap along with the Java objects?

Well, there is a philosophical reason and a technical reason. The philosophical reason is that the classes are part of our JVM implementation and we should not fill up the Java heap with our data structures. The application writer has a hard enough time understanding the amount of live data the application needs and we shouldn't confuse the issue with the JVM's needs.

The technical reason comes in parts. Firstly the origins of the permanent generation predate my joining the team so I had to do some code archaeology to get the story straight (thanks Steffen for the history lesson).

Originally there was no permanent generation. Objects and classes were just stored together.

Back in those days classes were mostly static. Custom class loaders were not widely used and so it was observed that not much class unloading occurred. As a performance optimization the permanent generation was created and classes were put into it. The performance improvement was significant back then. With the amount of class unloading that occurs with some applications, it's not clear that it's always a win today.

It might be a nice simplification to not have a permanent generation, but the recent implementation of the parallel collector for the tenured generation (aka parallel old collector) has made a separate permanent generation again desirable. The issue with the parallel old collector has to do with the order in which objects and classes are moved. If you're interested, I describe this at the end.

So the Java classes are stored in the permanent generation. What all does that entail? Besides the basic fields of a Java class there are

  • Methods of a class (including the bytecodes)
  • Names of the classes (in the form of an object that points to a string also in the permanent generation)
  • Constant pool information (data read from the class file, see chapter 4 of the JVM specification for all the details).
  • Object arrays and type arrays associated with a class (e.g., an object array containing references to methods).
  • Internal objects created by the JVM (java/lang/Object or java/lang/Exception for instance)
  • Information used for optimization by the compilers (JITs)

That's it for the most part. There are a few other bits of information that end up in the permanent generation but nothing of consequence in terms of size. All these are allocated in the permanent generation and stay in the permanent generation. So now you know.

This last part is really, really extra credit. During a collection the garbage collector needs to have a description of a Java object (i.e., how big is it and what does it contain). Say I have an object X and X has a class K. I get to X in the collection and I need K to tell me what X looks like. Where's K? Has it been moved already? With a permanent generation during a collection we move the permanent generation first so we know that all the K's are in their new location by the time we're looking at any X's.

How do the classes in the permanent generation get collected while the classes themselves are moving? Classes also have classes that describe their content. To distinguish these classes from those classes we spell the former klasses. The classes of klasses we spell klassKlasses. Yes, conversations around the office can be confusing. Klasses are instantiations of klassKlasses so the klassKlass KZ of klass Z has already been allocated before Z can be allocated. Garbage collections in the permanent generation visit objects in allocation order and that allocation order is always maintained during the collection. That is, if A is allocated before B then A always comes before B in the generation. Therefore if a Z is being moved it's always the case that KZ has already been moved.

And why not use the same knowledge about allocation order to eliminate the permanent generations even in the parallel old collector case? The parallel old collector does maintain allocation order of objects, but objects are moved in parallel. When the collection gets to X, we no longer know if K has been moved. It might be in its new location (which is known) or it might be in its old location (which is also known) or part of it might have been moved (but not all of it). It is possible to keep track of where K is exactly, but it would complicate the collector and the extra work of keeping track of K might make it a performance loser. So we take advantage of the fact that classes are kept in the permanent generation by collecting the permanent generation before collecting the tenured generation. And the permanent generation is currently collected serially.

Thursday Oct 26, 2006

Why Now?

Are you ever surprised by the occurrence of a full collection (aka major collection)? Generally a full collection only happens when an allocation is attempted and there is not enough space available in any of the generations. But there are exceptions to that rule and here are some of them. Actually, these are all of them that I could think of. I thought I would have a longer list so to fill out this blog I've included some bugs that relate to full collections that might interest you. And we're thinking about a policy change for the low pause collector that would affect full collections.

The simplest reason for an unexpected full collection is that you asked for it. System.gc() can be called from most anywhere in the program. Try using -XX:+DisableExplicitGC to have the garbage collector ignore such requests.
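As an illustration (a minimal sketch, nothing HotSpot-specific), the call can hide anywhere, even in a library you don't control:

```java
// A System.gc() call requests a full collection from anywhere in the
// program.  Under -XX:+DisableExplicitGC the request is ignored.
public class ExplicitGC {
    public static void main(String[] args) {
        byte[] garbage = new byte[1024 * 1024];
        garbage = null;  // make the array unreachable
        System.gc();     // requests a full collection (a no-op with -XX:+DisableExplicitGC)
        System.out.println("free bytes after request: "
            + Runtime.getRuntime().freeMemory());
    }
}
```

Running this with -verbose:gc should show the Full GC; adding -XX:+DisableExplicitGC should make it disappear.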

The permanent generation is collected only by a full collection. It's very seldom the cause for a full collection but don't overlook the possibility. If you turn on PrintGCDetails, you can see information about the permanent generation collection.

0.167: [Full GC [PSYoungGen: 504K->0K(10368K)] [ParOldGen: 27780K->28136K(41408K)] 28284K->28136K(51776K) [PSPermGen: 1615K->1615K(16384K)], 0.4558222 secs]

This output is from the throughput collector. The "PSPermGen" denotes the permanent generation. The size of the permanent generation currently is 16384K. Its occupancy before the collection is only 1615K and is probably the same after the collection. If the permanent generation needed collecting, the occupancy before the collection would have been closer to the 16384K.
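If you want to pull those numbers out of a log mechanically, a hypothetical helper (not part of any JDK tool; the class and method names are mine) might look like:

```java
// Hypothetical helper: extract the PSPermGen before/after occupancies and
// capacity (in KB) from a -XX:+PrintGCDetails line like the one above.
public class PermGenLine {
    static int[] permOccupancy(String gcLine) {
        // Match e.g. "[PSPermGen: 1615K->1615K(16384K)]"
        java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("\\[PSPermGen: (\\d+)K->(\\d+)K\\((\\d+)K\\)\\]")
            .matcher(gcLine);
        if (!m.find()) throw new IllegalArgumentException("no PSPermGen entry");
        return new int[] { Integer.parseInt(m.group(1)),   // occupancy before
                           Integer.parseInt(m.group(2)),   // occupancy after
                           Integer.parseInt(m.group(3)) }; // capacity
    }

    public static void main(String[] args) {
        String line = "0.167: [Full GC [PSYoungGen: 504K->0K(10368K)] "
            + "[ParOldGen: 27780K->28136K(41408K)] 28284K->28136K(51776K) "
            + "[PSPermGen: 1615K->1615K(16384K)], 0.4558222 secs]";
        int[] p = permOccupancy(line);
        System.out.println("perm gen before: " + p[0] + "K of " + p[2] + "K");
    }
}
```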

The garbage collector tracks the average of the rate of promotion of live objects from the young generation into the tenured generation. If the average exceeds the free space in the tenured generation, the next time the young generation fills up, a full collection is done instead of a minor collection. There is a pathological situation where, after the full collection, the free space in the tenured generation is still not large enough to take all the live objects expected to be promoted from the next minor collection. This will result in another full collection the next time the young generation fills up. The bad part about this is that the average value for the amount promoted only changes when a minor collection is done and if no minor collections are done, then the average does not change. Increasing the size of the heap usually avoids this problem.
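A sketch of that policy decision (illustrative only; the smoothing weight here is an assumption, not the actual ergonomics code):

```java
// Sketch of the policy described above: track an average of bytes promoted
// per minor collection, and choose a full collection when that average
// exceeds the free space left in the tenured generation.
public class PromotionPolicy {
    double avgPromoted;          // exponentially averaged bytes promoted
    final double weight = 0.25;  // assumed smoothing weight (illustrative)

    void recordMinorCollection(long promotedBytes) {
        // Only called when a minor collection actually happens.
        avgPromoted = weight * promotedBytes + (1 - weight) * avgPromoted;
    }

    boolean shouldDoFullCollection(long tenuredFreeBytes) {
        return avgPromoted > tenuredFreeBytes;
    }

    public static void main(String[] args) {
        PromotionPolicy p = new PromotionPolicy();
        p.recordMinorCollection(1000);
        System.out.println(p.shouldDoFullCollection(200)); // true: average exceeds free space
    }
}
```

Note how the pathology in the text falls out of the sketch: if every collection is a full collection, recordMinorCollection is never called, so the average never changes.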

Sometimes it's not really a full collection. There was a bug affecting JDK5 (6432427) which printed "Full" when the collection was not a full collection. When JNI critical sections are in use, GC can be locked out. When the JNI critical sections are exited, if a GC had been requested during the lockout, a GC is done. That GC most likely would be a minor collection but was mistakenly labeled a full collection. A second place where "Full" was also printed erroneously was when -XX:+CMSScavengeBeforeRemark is used. The same bug report explains and fixes that error.

An attempt to allocate an object A larger than the young generation can cause a minor collection. That minor collection will fail to free enough space in the young generation to satisfy the allocation of A. A full collection is then attempted and A is then allocated out of the tenured generation. This behavior was a bug (6369448). It has been fixed in JDK's 1.4.2_13, 5.0_10, and 6. The correct behavior is for the GC to recognize when an object is too large to be allocated out of the young generation and to attempt the allocation out of the tenured generation before doing a collection.

When an allocation cannot be satisfied by the tenured generation a full collection is done. That is the correct behavior but that behavior has an adverse effect on CMS (the low pause collector). Early in a run before the CMS (tenured) generation has expanded to a working size, concurrent mode failures (resulting in full collections) happen. These concurrent mode failures perhaps can be avoided. We are thinking about a change to the policy such that CMS expands the generation to satisfy the allocation and then starts a concurrent collection. We're cautious about this approach because similar policies in the past have led to inappropriate expansion of the CMS generation to its maximum size. To avoid these full collections try specifying a larger starting size for the CMS generation.

So just because these are all the unexpected "Full" collections that I could think of, that doesn't mean that these are all there are. If you have a mysterious "Full" collection happening, submit a description to


I'll post any additional cases of unexpected "Full" collections that I get.

Late addition (I'm allowed to edit my blog even after it's been posted). If you ask about unexpected "Full" collections and send in a gc log, please add the command line flags

-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps

The additional output (of which there will be plenty) may help me see what's happening.

Monday Sep 11, 2006

The Second Most Important GC Tuning Knob


Wednesday Aug 30, 2006

More of a Good Thing

The throughput collector is getting more parallel.

In our J2SE 6 release there is an option in the throughput collector to collect the tenured generation using multiple GC threads. Yes, the UseParallelGC collector wasn't all that we wanted it to be in terms of parallelism. Without this recent work only the young generation was being collected in parallel.

The option to turn on the parallel collection of the tenured generation is -XX:+UseParallelOldGC. That's the punch line for this blog. Oh, also it's off by default.

Below are a few details on the new collector. Actually, first I'll describe the serial collector used for the tenured generation collection and some of the things we learned from it. Then I'll describe the new collector and how we used what we had learned.

The J2SE 5 (and earlier) collector for the tenured generation was a serial mark-sweep-compact collector. There were four phases of that collector.

1. Mark all the live objects

2. Calculate the new locations of the objects

3. Adjust all object references to point to the new locations

4. Move the objects to their new locations

Phase 1 found and marked the live objects.

In phase 2 the collector kept a pointer to the first location in the tenured generation to which we could move a live object during the collection. Let's call that location T. T started at the bottom of the generation. The collector scanned forward from the bottom of the generation looking for marked (i.e., live) objects. When a live object A was found, the collector saved T in A (i.e., saved the new location of A in A). It then calculated the next location T as T + sizeof(A) and restarted the scan looking for the next live object to put at the new T.

Phase 3 started at the roots, references to objects known to the application (for the most part), and changed all the references in the roots and all the references found from the roots to point to the new locations of the objects (to the T's saved in the A's).

Phase 4 again started at the bottom of the tenured generation and scanned forward to find live objects. When a live object was found, it was moved to its new location (the T's).

The end result was that all the live objects were moved to the lower end of the tenured generation and all the free space was in one contiguous region at the higher end of the tenured generation. It simplifies allocations from the tenured generation to have all the free space in a single region. Also there are no issues of fragmentation as would be the case if the collection resulted in multiple regions of free space separated by live objects. This single region of free space seemed an important property to keep when thinking about how to do a collection of the tenured generation using multiple GC threads.
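The two scans over the live objects (phases 2 and 4) can be sketched with a toy model of the generation; the array-of-blocks representation below is purely illustrative, not the HotSpot code:

```java
public class SlidingCompaction {
    // Phase 2 of the serial mark-sweep-compact, over a toy heap: each
    // object is a (size, live) pair laid out contiguously from the bottom
    // of the generation.  Returns each object's new address, or -1 for
    // dead objects.
    static int[] newLocations(int[] sizes, boolean[] live) {
        int[] dest = new int[sizes.length];
        int t = 0;                  // "T": next free location at the bottom
        for (int i = 0; i < sizes.length; i++) {
            if (live[i]) {
                dest[i] = t;        // the new location is saved for object i
                t += sizes[i];      // T advances by the object's size
            } else {
                dest[i] = -1;       // dead objects get no destination
            }
        }
        return dest;
    }

    public static void main(String[] args) {
        int[] sizes    = { 16, 24, 8, 32 };
        boolean[] live = { true, false, true, true };
        // Live objects slide toward the bottom: prints [0, -1, 16, 24].
        System.out.println(java.util.Arrays.toString(newLocations(sizes, live)));
        // Phase 3 would rewrite references using these dest values, and
        // phase 4 would copy each live object to dest[i].
    }
}
```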

Another feature of our serial mark-sweep-compact collector that we liked is the flexibility to leave some deadwood at the lower end of the tenured generation. We use the term "deadwood" to refer to garbage at the lower end of the generation that we're not going to collect. We don't collect it to reduce the number of objects that we have to move. If we have objects

... A B C D ...

and if B is garbage and we collect it, then we have to move C and D. If we treat B as deadwood, then C and D can stay where they are. We make a judgment call to waste a certain amount of space (that occupied by the deadwood) in order to move fewer objects.

You may have noticed that each of the four phases of our mark-sweep-compaction walked over at least all the live objects. In some cases it actually walked over all the objects (live and dead). When we're scanning the heap (such as in phase 2), we look at a dead object just to find its size so that we can step over it. Any part of the collection that walks over the objects is costly so with the new parallel collector for the tenured generation we wanted to have fewer walks over the objects.

Keeping all this in mind resulted in a parallel collector that

  • Marked the live objects using multiple GC threads and maintained some summary information (explained below).
  • Used the summary information to determine where each live object would go. Also determined the amount of deadwood to keep.
  • Moved live objects so there was one contiguous block of free space in the generation and updated references to live objects using multiple GC threads.

I'll refer to these three phases as the marking phase, summary phase and move-and-update phase. In some of the other documentation on this new collector the last phase is referred to as the compaction phase. I prefer move-and-update because that's the type of name used in the code. Just in case you're ever looking at the code.

Marking phase

Doing the marking in parallel was straightforward. The parallel young generation collector does this and a similar technique was used. The additional twist to the marking was creating the summary information. The heap is logically divided into chunks. Each object starts within a chunk. As we do the marking, when a live object is found, its size is added to the chunk containing the start of the object. At the end this gives us the size of the data for the objects that begin in each chunk.

Summary phase

The summary phase first walks over the chunks looking for the extent of the heap that contains a good amount of deadwood. We call that part of the heap the dense prefix. Live objects that are moved will be placed in the heap starting immediately after the dense prefix. Given that we know the amount of live data for each chunk, we can calculate the new location of any live object using the summary data. Basically adding up the sizes of the live data in all the chunks before chunk A tells you where the live data in chunk A is going to land. The summary phase is currently done by one thread but can be done in parallel if it turns out to be a scaling bottleneck.
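That calculation is essentially a running sum over the per-chunk live sizes. A sketch, again over a toy model (chunk sizes, names, and the dense-prefix value are illustrative):

```java
public class ChunkSummary {
    // Toy sketch of the summary phase: given the live bytes that begin in
    // each chunk (gathered during marking), the destination of chunk i's
    // live data is the end of the dense prefix plus the sum of live bytes
    // in all earlier chunks.
    static long[] chunkDestinations(long[] liveBytesPerChunk, long densePrefixEnd) {
        long[] dest = new long[liveBytesPerChunk.length];
        long running = densePrefixEnd;  // live data is placed right after the dense prefix
        for (int i = 0; i < liveBytesPerChunk.length; i++) {
            dest[i] = running;          // where chunk i's live data lands
            running += liveBytesPerChunk[i];
        }
        return dest;
    }

    public static void main(String[] args) {
        long[] live = { 100, 0, 250, 40 };
        // prints [1000, 1100, 1100, 1350]
        System.out.println(java.util.Arrays.toString(chunkDestinations(live, 1000)));
    }
}
```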

Move-and-update phase

Given the summary information the move-and-update phase identifies chunks that are ready to be filled and then moves the appropriate objects into those chunks. A chunk is ready to be filled if it is empty or if its objects are going to be compacted into that same chunk. There is always at least one of those. The move-and-update phase is done with multiple GC threads. One thing to note is that it is possible that there is only 1 ready chunk at any given time. If the chunks are A, B, C, D ... and the only dead object is in A and fits entirely in A, then A is the only ready chunk. The move-and-update phase will first fill any remaining space in A. Then B will be the only ready chunk. Obviously an extreme case but it makes the point. There is a technique to widen this "gap" of ready chunks. We expect to implement it, but it's not in this release.

That's basically it. We're working on some scaling issues with this new collector, but it's working pretty well as is.

For more on this new collector see the slides for the GC presentation done at this year's JavaOne. They can be downloaded from


The last time I looked, there was a "Click here" link for the technical sessions in the first paragraph. Following that link, download the Java SE bundle and in there our GC presentation is TS-1168.

Thursday Aug 24, 2006

The GC whitepaper

There's a very nice whitepaper on our GC at


It talks about basic concepts, some specifics about our GC, suggestions on tuning, and some of the latest improvements such as the parallel old collector. If you've looked at the GC tuning guides, some of it will look familiar, but it nicely pulls together information from several sources.

Tuesday Aug 15, 2006

The Real Thing

Here's a real treat dished up by Ross K. who hangs out 3 offices down the hall. A while back I wrote a blog about thread local allocation buffers (TLAB's). The shortcomings of my blog have probably been annoying Ross so much that he's written this one for us. If there are any discrepancies between my earlier blog and this one, believe this one. Thanks, Ross.

A Thread Local Allocation Buffer (TLAB) is a region of Eden
that is used for allocation by a single thread.  It enables
a thread to do object allocation using thread local top and
limit pointers, which is faster than doing an atomic operation
on a top pointer that is shared across threads.
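A sketch of the idea in plain Java (array indices stand in for addresses; this is not the JVM's allocation path):

```java
public class TlabSketch {
    // Toy bump-pointer allocation from a TLAB: each thread owns private
    // top/limit pointers, so the common case needs no atomic operations.
    final byte[] eden;
    int top, limit;  // thread-local bump pointer and TLAB end

    TlabSketch(byte[] eden, int start, int size) {
        this.eden = eden;
        this.top = start;
        this.limit = start + size;
    }

    /** Returns the address of the new object, or -1 if the TLAB is exhausted. */
    int allocate(int size) {
        if (top + size > limit) return -1; // slow path: refill TLAB or allocate in shared eden
        int obj = top;
        top += size;                       // plain store: no CAS, the TLAB is private
        return obj;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(new byte[1024], 0, 64);
        System.out.println(tlab.allocate(16)); // 0
        System.out.println(tlab.allocate(32)); // 16
        System.out.println(tlab.allocate(32)); // -1: would overflow the 64-byte TLAB
    }
}
```

A shared-eden allocator would need a compare-and-swap on its top pointer for every object; here the CAS (or a lock) is paid only once per TLAB refill.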

A thread acquires a TLAB at its first object allocation
after a GC scavenge.  The size of the TLAB is computed via
a somewhat complex process described below.  The TLAB is
released when it is full (or nearly so), or the next GC scavenge
occurs. TLABs are allocated only in Eden, never from From-Space
or the OldGen.

Flags        	       default 	description

 UseTLAB     		true   	Use thread-local object allocation
 ResizeTLAB  		false  	Dynamically resize tlab size for threads
 TLABSize    	       (below) 	Default (or starting) size of a TLAB (in bytes)
 TLABWasteTargetPercent   1    	Percentage of Eden that can be wasted
 PrintTLAB              false  	Print various TLAB related information

AggressiveHeap settings:

  TLABSize   	       256Kb
  ResizeTLAB 	       true     (Corrected 2007/05/09)

Minor flags

 MinTLABSize             2K   	Minimum allowed TLAB size (in bytes)
 TLABAllocationWeight     35   	Weight for exponential averaging of allocation
 TLABRefillWasteFraction  64   	Max TLAB waste at a refill (internal fragmentation)
 TLABWasteIncrement        4   	Increment allowed waste at slow allocation
 ZeroTLAB              false   	Zero out the newly created TLAB

These flags are for tuning the current implementation of
TLABs and may disappear or change their initial values in a
future release of the JVM.

If it is not specified on the command-line (or specified as zero)
via the -XX:TLABSize flag, the initial size of a TLAB is computed as:

  init_size = size_of_eden / (allocating_thread_count * target_refills_per_epoch)


  a) Allocating_thread_count is the expected number of threads
    which will be actively allocating during the next epoch
    (an epoch is the mutator time between GC scavenges.)
    At jvm startup this is defined to be one.  It is then
    recomputed at each GC scavenge from the number of threads
    that did at least one allocation of a tlab during the
    latest epoch. It's then exponentially averaged over the
    past epochs.
  b) Target_refills_per_epoch is the desired number of tlab
    allocations per thread during an epoch.  It is computed from
    the value of TLABWasteTargetPercent which is the percentage of
    Eden allowed to be wasted due to TLAB fragmentation.
    From a mutator thread's perspective a GC scavenge can occur
    unexpectedly at any time.  So, on average, only half of a
    thread's current TLAB will be allocated when a GC scavenge
    occurs.  That is:
      TLABWasteTargetPercent = 0.5 * (1/target_refills_per_epoch) * 100
    Solving for target_refills_per_epoch:
      target_refills_per_epoch = (0.5 * 100) / TLABWasteTargetPercent
    With the default value of 1 for TLABWasteTargetPercent
      target_refills_per_epoch = 50
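The sizing arithmetic above can be sketched directly (the names mirror the text; this is a toy calculation, not the HotSpot source):

```java
public class TlabInitSize {
    // TLABWasteTargetPercent = 0.5 * (1/target_refills_per_epoch) * 100,
    // solved for target_refills_per_epoch.
    static long targetRefillsPerEpoch(double tlabWasteTargetPercent) {
        return Math.round((0.5 * 100) / tlabWasteTargetPercent);
    }

    // init_size = size_of_eden / (allocating_thread_count * target_refills_per_epoch)
    static long initSize(long edenSize, int allocatingThreads, long refillsPerEpoch) {
        return edenSize / ((long) allocatingThreads * refillsPerEpoch);
    }

    public static void main(String[] args) {
        long refills = targetRefillsPerEpoch(1.0);  // default TLABWasteTargetPercent is 1
        System.out.println(refills);                // 50
        // e.g. a 64MB eden with 8 allocating threads (illustrative numbers):
        System.out.println(initSize(64L * 1024 * 1024, 8, refills)); // 167772
    }
}
```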

When ResizeTLAB is true (which it is by default) the tlab size
  is recomputed for each thread that did an allocation in the latest
  epoch.  Threads that did not allocate in the latest epoch do not
  have their TLABs resized.  The resize goal is to get the number
  of refills closer to the ideal: target_refills_per_epoch (default
  value is 50).  For each thread, the number of refills in the latest
  epoch is exponentially averaged with values from previous
  epochs.  If this average refill number is greater than
  target_refills_per_epoch, then the tlab size is increased.  If
  the average is less, the tlab size is decreased.
  The computation is (approximately):
     new_size = (old_size * avg_refills_per_epoch) / target_refills_per_epoch
  It's actually computed from the fraction of the latest epoch's
  eden size used by this thread, because the next epoch may use a
  resized eden.
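A sketch of that resize rule (approximate, as noted above; the names are illustrative):

```java
public class TlabResize {
    // new_size = (old_size * avg_refills_per_epoch) / target_refills_per_epoch:
    // nudge the TLAB size so the averaged refill count moves toward the target.
    static long resize(long oldSize, double avgRefillsPerEpoch, long targetRefills) {
        return Math.max(1, Math.round(oldSize * avgRefillsPerEpoch / targetRefills));
    }

    public static void main(String[] args) {
        // A thread refilling more often than the target of 50 gets a bigger TLAB...
        System.out.println(resize(19684, 75, 50)); // 29526
        // ...and one refilling less often gets a smaller one.
        System.out.println(resize(19684, 25, 50)); // 9842
    }
}
```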

To experiment with a specific TLAB size, two -XX flags need
  to be set, one to define the initial size, and one to disable
  the resizing:

     -XX:TLABSize= -XX:-ResizeTLAB

  The minimum size of a tlab is set with -XX:MinTLABSize which
  defaults to 2K bytes.  The maximum size is the maximum size
  of an integer Java array, which is used to fill the unallocated
  portion of a TLAB when a GC scavenge occurs.

Diagnostic Printing Options

-XX:+PrintTLAB

  Prints at each scavenge one line for each thread (starting with
  "TLAB: gc thread:") and one summary line.

Thread example:

TLAB: gc thread: 0x0004ac00 [id:  2] size: 61KB slow allocs: 5  refill waste: 980B
         alloc: 0.99996     3072KB refills: 50 waste  0.1% gc: 0B slow: 4144B fast: 0B

  The tag "gc" indicates that this information was printed at a GC scavenge,
  after the tlabs have been filled. The "gc" tag doesn't mean the thread is
  a GC thread.

  thread:

     The address of the gc thread structure and its system thread id.

  size:

     The size of the tlab in kilobytes.
  slow allocs:
     The number of allocations too large for remaining space in the TLAB.
     The allocation was done directly in eden space.
  refill waste: (in HeapWord units)
     The name is truncated in the dump, and should be: refill_waste_limit
     and is used to limit the amount of wasted space from internal
     fragmentation.  If the remaining space in the TLAB is larger than
     this amount, and an allocation is requested that is too large to
     be allocated in the TLAB, then the allocation is done directly in
     Eden and the TLAB is not retired.  If the remaining space is less
     than refill_waste_limit then the TLAB is retired, a new TLAB is
     allocated, and the object allocation is attempted in the new TLAB.

     After each allocation outside of the TLAB, the refill_waste_limit
     is incremented by TLABWasteIncrement to prevent an allocation of
     a size slightly less than refill_waste_limit from continually
     being allocated outside of the TLAB.

  alloc: [fraction]  [sizeInBytes]

     Expected amount of eden allocated by this thread computed as
     a fraction of eden and number of heap words.


  refills:

     Number of tlab refills.

  waste [percent] gc: [bytes] slow: [bytes] fast: [bytes]

     Percentage of eden allocated to this thread that was wasted.
     Waste is the sum of three components:
        gc:   unused space in the current TLAB when stopped for a scavenge.
        slow: sum of unused space in TLABs when they're retired to allocate a new one.
        fast: the client system can allocate a TLAB with a fast allocator.
              This is the amount of waste via that method.

Summary example:

TLAB totals: thrds: 1  refills: 50 max: 50 slow allocs: 5 max 5 waste:  0.1%
             gc: 0B max: 0B slow: 4144B max: 4144B fast: 0B max: 0B


  thrds: [tt]

     Number of threads that did an allocation.
  refills: [tt] max: [mm]

     Total number of TLAB refills by all threads, and
     maximum number of TLAB refills by a single thread.

  slow allocs: [ss] max [mm]

     Total number of allocations done outside of a TLAB, and
     maximum number by a single thread.

  waste [percent] gc: [bytes] slow: [bytes] max: [mm] fast: [bytes] max: [mm]

     Percentage of eden that was wasted across all threads.
     Waste is the sum of three components:
        gc:   unused space in the current TLABs when scavenge starts.
        slow: sum of unused space in TLABs when they're retired to allocate a new ones.
        fast: the client system can allocate a TLAB with a fast allocator.
              This is the amount of waste via that method.

     For "slow" and "fast", the maximum value by a single thread is printed.

More detail with the addition of the Verbose flag.

-XX:+PrintTLAB -XX:+Verbose

  Using both: -XX:+PrintTLAB -XX:+Verbose will print the
  new tlab sizes for each thread when it is resized.  Resizing is
  only done at GC scavenges.


TLAB new size: thread: 0x001eac00 [id: 19] refills 50  alloc: 0.402570 size: 19684 -> 18996

  New size 18996.  Previous size was: 19684.


  refills:

     Number of tlab refills for this thread.


  alloc: [fraction]

     The expected fraction of eden this thread will use.

Monday Aug 14, 2006

Location, Location, Location

Allocating a bunch of objects together and then using them together seems to be a good thing for cache utilization. Not always true, of course, but, if you're willing to take that as a quasi-truth, then you might be interested in how garbage collections can stir up that good locality.

I'm going to talk first about our serial garbage collector. It is a generational garbage collector with a copying young generation collector and a mark-sweep-compact tenured generation collector. My comments generally apply to all our collectors but allow me to do the simplest one first. After that I'll comment on how doing the collection in parallel makes it even more interesting. By the way the techniques I'll be describing for the serial collector are pretty generic and are used by other garbage collectors.

When an application thread T allocates objects, the allocations are generally done out of the young generation. Those allocations done by T together (i.e., at about the same time) are allocated in close proximity in the heap. This is mostly true even though allocations are being done in parallel by different application threads (different T's). Each T allocates from its own thread local allocation buffer (TLAB). I described recent improvements to TLAB's in an earlier blog with the title "A Little Thread Privacy, Please". If you read the first part of that one, you'll get the idea. At this point after the allocations are done, we have the good locality we've come to expect from the allocate-together, use-together practice. Then a garbage collection happens.

When the young generation is collected, the surviving objects can be copied to either the survivor spaces or the tenured generation. The survivor spaces are areas in the young generation where young generation objects are kept if they survive a few collections. After surviving a few collections objects in the survivor spaces are copied (promoted) to the tenured generation where long lived objects are kept. Let's consider the very first collection and the case where all the objects that survive that first collection are copied to a survivor space.

The young generation collection starts by finding objects being used directly by the application. By "directly" I mean the application has a reference to those objects A. That's as opposed to having a reference to an object (A) that contains a reference to a second object (B). The garbage collector scans the stacks of each application thread and, when it finds a reference to an object A, it copies A to a survivor space. Incidentally, that's also when the reference to A in the thread's stack is updated to point to the new location of A. Only A is copied at this point. If A contains references to another object B, the object B is not touched. The garbage collector continues to scan the stacks of the threads until it's found and copied all the objects directly referenced by the application. At that point the garbage collector starts looking inside the objects A that it has copied. It's then that the garbage collector would find B in A and would copy B to a survivor space. And as before if B contains a reference to object C, nothing is yet done about C.

As an example, suppose an application thread T

  • allocates object C
  • allocates object B and puts a reference to C into B
  • allocates object A and puts a reference to B into A

repeatedly. The layout of objects in the heap would look something like

... C1 B1 A1 C2 B2 A2 C3 B3 A3 ...

If thread T only kept a reference to the objects A (i.e., there are no references directly to the B's or the C's on the thread stacks), then after the young generation collection the layout in the survivor space would look like

... A1 A2 A3 ... B1 B2 B3 ... C1 C2 C3 ...

where the (...) represent other objects X found directly from T's stack or objects found from those X's. Here I'm being just a little generous in grouping the A's, B's, and C's together. That grouping assumes that A1 is found first, then A2 is found, and then A3 is found.

Before the collection loading A1 into a cache line could reasonably also load B1 and C1 into the same cache line. After the collection B1 and C1 are in a galaxy far, far away. The A's, B's and C's have been separated so that their allocation order is no longer reflected in their location in the heap. The basic damage has been done.
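That reordering can be simulated with a toy breadth-first scavenge over a made-up object graph (illustrative names and structure, not the collector itself):

```java
import java.util.*;

public class CopyOrder {
    // Toy breadth-first "scavenge": roots are copied first, then the
    // objects they reference, and so on.  Each object holds at most one
    // reference here, to keep the model small.
    static List<String> scavenge(List<String> roots, Map<String, String> pointsTo) {
        List<String> toSpace = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(roots);
        Set<String> copied = new LinkedHashSet<>(roots);
        while (!queue.isEmpty()) {
            String obj = queue.removeFirst();
            toSpace.add(obj);                  // "copy" obj to its new location
            String ref = pointsTo.get(obj);    // scan obj for references
            if (ref != null && copied.add(ref)) queue.addLast(ref);
        }
        return toSpace;
    }

    public static void main(String[] args) {
        Map<String, String> refs = new HashMap<>();
        refs.put("A1", "B1"); refs.put("B1", "C1");
        refs.put("A2", "B2"); refs.put("B2", "C2");
        refs.put("A3", "B3"); refs.put("B3", "C3");
        // Allocation order was C1 B1 A1 C2 B2 A2 ...; the copy order is not:
        // prints [A1, A2, A3, B1, B2, B3, C1, C2, C3]
        System.out.println(scavenge(List.of("A1", "A2", "A3"), refs));
    }
}
```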

In the case where all the objects A, B, and C are promoted to the tenured generation, the layout in the tenured generation would be similar (just in the tenured generation instead of in a survivor space). In the case where some of the objects are copied to a survivor space and some are promoted, some of each of the A's, B's, and C's can end up in either a survivor space or the tenured generation. You might think that it would tend to look something like

(survivor space) ... A1 A2 A3 ...

(tenured generation) ... B1 B2 B3 ... C1 C2 C3 ...

It might. Or it might be a bit messier. The above example would result if there was room in the survivor spaces for the A's but not for the B's and C's. If the B's were particularly large (so that they did not fit into the space remaining in the survivor space) and the C's were particularly small, then you might see

(survivor space) ... A1 A2 A3 ... C1 C2 C3 ...

(tenured generation) ... B1 B2 B3 ...

If there are references in the application thread stack to the B's and C's as well as to the A's, and the collector found objects in the order

C2 B3 A1 A2 A3

then after the collection the object layout would look like

... C2 B3 A1 A2 A3 ... B1 B2 ... C1 C3 ...

The mixtures are as endless as the ways in which the application references its objects.

How does collecting in parallel affect things? Recall that the serial collector first looks at all the references found in the application thread stacks (in the simplest case the A's) before it looks at references inside the A's (i.e., the B's). The parallel collector starts by looking at some of the A's, but then may look at some of the B's before all the A's have been found. A parallel GC thread copies an A and puts its new location on a stack. I'll refer to this stack as its grey stack so as not to confuse it with the stacks of the application threads. Each GC thread has its own grey stack. When its grey stack starts to get full, a GC thread will start taking the locations of the A's off its grey stack and looking inside them for B's to copy. It starts looking inside the A's so that its grey stack does not grow too large. In general we prefer to do some copying instead of having to malloc more space for the grey stack. Very quickly there are A's, B's and C's on the grey stacks and the order in which they are copied looks largely random from outside the garbage collector.
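The grey-stack discipline described above can be sketched as follows. This is a deliberately simplified, single-threaded illustration of the idea; the names and the drain threshold are invented for the example and are not HotSpot's actual code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Copied objects are pushed onto a "grey stack"; when the stack gets full the
// collector drains some entries (scanning inside them) before pushing more.
public class GreyStackSketch {
    static class Obj {
        final String name;
        final List<Obj> refs;
        boolean copied;
        Obj(String name, Obj... refs) { this.name = name; this.refs = List.of(refs); }
    }

    static final int DRAIN_THRESHOLD = 4;  // start draining at this depth (invented value)

    // Copy the transitive closure of `roots`, recording the order objects are copied.
    static List<String> collect(List<Obj> roots) {
        Deque<Obj> grey = new ArrayDeque<>();
        List<String> copyOrder = new ArrayList<>();
        for (Obj root : roots) {
            copy(root, grey, copyOrder);
            while (grey.size() >= DRAIN_THRESHOLD) {  // stack getting full: drain a bit
                scanOne(grey, copyOrder);
            }
        }
        while (!grey.isEmpty()) scanOne(grey, copyOrder);  // finish remaining work
        return copyOrder;
    }

    static void copy(Obj o, Deque<Obj> grey, List<String> order) {
        if (o == null || o.copied) return;
        o.copied = true;        // stands in for copying to a survivor space
        order.add(o.name);
        grey.push(o);           // remember to scan inside it later
    }

    static void scanOne(Deque<Obj> grey, List<String> order) {
        Obj o = grey.pop();
        for (Obj ref : o.refs) copy(ref, grey, order);
    }
}
```

Run with chains A1→B1→C1 and A2→B2→C2 rooted at the A's and the copy order comes out A1, A2, B2, C2, B1, C1: the allocation-order locality is already gone, and with real work stealing it only gets more scrambled.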

When a parallel GC thread promotes an object, it promotes it into its own distinct area (let me call them PLAB's) in the tenured generation. A GC thread that has run out of objects to copy (no more application thread stacks to scan and no more objects on its own grey stack) can steal objects from another GC thread's grey stack. So A1 may be copied by one GC thread into its PLAB while B1 (if it were stolen) may be copied by another GC thread into a different PLAB.

So collections can disrupt good object locality in a variety of different ways. We don't like that. We're thinking about ways to improve it.

Hope this was understandable. Please ask if it was not.

Wednesday May 03, 2006

Top 10 GC Reasons

to upgrade to J2SE 5.0

10) We fixed some bugs.

9) Thread local allocation buffer dynamic tuning. (see "A Little Thread Privacy Please")

8) Promotion failure handling was implemented for the low pause collector.

Promotion failure handling is the ability to start a minor collection and then back out of it if there is not enough space in the tenured generation to promote all the objects that need to be promoted.

In 5.0 the "young generation guarantee" need not apply for the low pause collector because we implemented promotion failure handling for that collector. The net effect of having promotion failure handling is that more minor collections can be done before a major collection is needed. Since minor collections are typically more efficient than major collections, it's better. The "young generation guarantee" is discussed in the document below.


It was particularly harsh on the low pause collector because the "young generation guarantee" required one contiguous chunk of space in the tenured generation. For the low pause collector that requirement could be difficult to satisfy because the low pause collector does not compact the tenured generation and so is subject to fragmentation.

7) Parallel remark in the low pause collector.

For the low pause collector there are two stop-the-world pauses associated with each collection of the tenured generation: the initial marking pause and the remarking pause. The latter is typically much longer than the former. In 5.0 the remarking is done with multiple threads in order to shorten it.

6) Parallel reference processing in the low pause collector.

For an application that uses Reference objects (see)


extensively, the GC work to process the Reference objects can be noticeable. It's not necessarily worse in the low pause collector than in the other collectors, but it hurts more (because we're trying to keep the pauses low). Parallel reference processing is available for the low pause collector but is not on by default. Unless there are tons of Reference objects, doing the reference processing serially is usually faster. Turn it on with the flag -XX:+ParallelRefProcEnabled if you make extensive use of Reference objects (most applications don't).

5) Server class machine defaults.

On server class machines (machines with 2 or more cpus and 2G or more of physical memory) the default compiler, garbage collector and maximum heap size are different. Before 5.0 it was

* -client compiler
* serial garbage collector
* 64M maximum heap (on most 32 bit platforms)

With 5.0 the defaults for server class machines are

* -server compiler
* throughput garbage collector
* maximum heap size the lesser of 1/4 of physical memory or 1G

For the smaller machines the defaults did not change.

The intent of this change was to provide better default performance on large machines that are typically dedicated to running 1 or a few applications and for which throughput is important. Desktops for which interactivity was expected to be more important were excluded from being server class machines.



See the documentation on the server class machine defaults for more details.

4) Low pause collector schedules remarking between minor collections.

Recall that the pause due to object remarking is the longer of the two stop-the-world pauses. Before 5.0 it could happen that a remark pause would occur immediately after a minor pause. When this back-to-back minor pause and remark pause occurred it looked like one big fat pause. With 5.0 the remarking is scheduled so as to be about midway between two minor pauses.

3) Better scheduling of the start of a concurrent low pause collection.

Prior to 5.0 a low pause collector started a collection based on the rate of allocation and the amount of free space in the tenured generation. It did the calculation and started a concurrent collection so as to finish before the tenured generation ran dry. This was good, but there were some end cases that needed to be recognized. For example, if the tenured generation had to be expanded in order to support promotions for a minor collection that just finished, a concurrent collection was started right away. Or if the next minor collection might not succeed because of lack of free space in the tenured generation, then a concurrent collection was started. That latter example wouldn't happen if we could perfectly predict the rate at which the tenured generation is filling up. We're not perfect. Also a bump in the allocation rate might mess us up.

2) GC ergonomics (See "It's Not Magic" at the start of my blogs)

And the number 1 GC reason for upgrading to 5.0 is

1) ???

You can tell me what you've found to be a good reason to upgrade to 5.0 in a comment to this blog. Don't be shy. And thanks for any responses.

Thursday Apr 13, 2006

What the Heck's a Concurrent Mode?

and why does it fail?

If you use the low pause collector, have you ever seen a message that contained the phrase "concurrent mode failure" such as this?

174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.0000618 secs]174.446: [CMS (concurrent mode failure): 161928K->162118K(175104K), 4.0975124 secs] 228336K->162118K(241520K)

This is from a 6.0 (still under development) JDK but the same type of message can come out of a 5.0 JDK.

Recall that the low pause collector is a mostly concurrent collector: parts of the collection are done while the application is still running. The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full. Recall also that a concurrent collection attempts to start just-in-time to finish before the tenured generation becomes full. The low pause collector measures the rate at which the tenured generation is filling and the expected amount of time until the next collection and starts a concurrent collection so that it finishes just-in-time (JIT). Three things to note in that last sentence. The "rate" at which the tenured generation is filling is based on historical data as is the "expected amount of time" until the next collection. Either of those might incorrectly predict the future. Also the JIT is really JIT plus some amount of padding so as to get it right most of the time. When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW) collection. All the application threads are stopped, a different algorithm is used to collect the tenured generation (our particular flavor of a mark-sweep-compact), the application threads are started again, and life goes on. Except that the STW collection is not very low pause and there's the rub.
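The "just-in-time plus padding" start decision can be modeled roughly like this. The names and the simple linear model are assumptions for illustration, not the actual HotSpot heuristics:

```java
// A toy model of the JIT-plus-padding decision: start a concurrent collection
// if the padded expected collection time would not beat the tenured generation
// running dry. All inputs would come from historical averages in a real collector.
public class CmsStartHeuristic {
    /**
     * @param freeBytes        free space left in the tenured generation
     * @param fillRateBytesSec historical promotion rate into the tenured generation
     * @param expectedGcSecs   historical duration of a concurrent collection
     * @param safetyFactor     extra padding, e.g. 0.10 for 10%
     * @return true if a concurrent collection should start now
     */
    static boolean shouldStartConcurrentGc(double freeBytes, double fillRateBytesSec,
                                           double expectedGcSecs, double safetyFactor) {
        double secsUntilFull = freeBytes / fillRateBytesSec;
        double paddedGcSecs = expectedGcSecs * (1.0 + safetyFactor);
        // Start if the (padded) collection would not finish before the
        // tenured generation fills up.
        return paddedGcSecs >= secsUntilFull;
    }
}
```

Both inputs are predictions from history, which is exactly why a sudden change in allocation behavior can make the decision wrong and produce a concurrent mode failure.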

At this point if you're asking "why does this happen", you've come to the right blog. There are several possibilities.

The amount of live data that is in the tenured generation is too large. Specifically, there is not enough free space in the tenured generation to support the rate of allocation into the tenured generation. For an extreme example, if there are only 2 words of free space in the tenured generation after a collection of the tenured generation, chances are those 2 words will be exhausted before another concurrent collection of the tenured generation can be done. If you are seeing lots of concurrent mode failures, chances are your heap is too small.

Your application can change behaviors dramatically such that past behavior does not adequately predict future performance. If this is the problem you'll see the concurrent mode failures only near the change in behavior. After a few more collections the low pause collector adjusts its expectations to make better decisions. But to deal with the concurrent mode failures in the mean time, you'll usually be trading away some performance. You can tell the low pause collector to start a collection sooner. The flag -XX:CMSInitiatingOccupancyFraction=NN will cause a concurrent collection to start when NN percent of the tenured generation is full. If you use this option to deal with the concurrent mode failures that result from a change in the behavior of your application, much of the time (when the application's behavior is more steady state) you'll be starting collections too early and so doing more collections than necessary. If you set NN to 0, it will cause one concurrent collection to be followed as soon as possible by another. The next collection may not start immediately after the last because the check on when to start a collection is done only at particular points in the code, but the collection will start at the next opportunity.

Your application may not have dramatic changes in behavior, but if it has a large variance in allocation rates, that can cause the JIT GC to not be JIT. You can add some more padding to the time at which a concurrent collection kicks off by using the flag -XX:CMSIncrementalSafetyFactor=NN. The default value for NN is 10 (i.e., a 10% padding on the start of the concurrent collection). Increasing NN to 100 starts a concurrent collection at the next opportunity.

Tuesday Apr 04, 2006

The Fault with Defaults

During early work on the low pause collector some of the applications we used for benchmarking the collector had the characteristic that objects had very short lifetimes or relatively long lifetimes with not much in between. All our current collectors use a copying collection for the young generation. This copying collection will copy the younger objects back into the young generation and copy the older objects into the tenured generation. The expectation is that more of the younger objects will die in the young generation if they are given a little more time. A consequence of this is that long lived objects are copied repeatedly into the young generation before finally being promoted into the tenured generation. Depending on the mix of lifetimes of your objects, this can be good (more objects die in the young generation) or this can be bad (extra copying into the young generation). Additionally at about the same time we were discussing the policy of never doing the copying but rather always copying any objects that survived a young generation collection immediately into the tenured generation. There are some collectors that do that and there was some anecdotal evidence that it was a good strategy. Given some example applications where never copying back into the young generation seemed to be the right strategy and the general discussion about what was the right policy for copying, we decided to make the default behavior for the low pause collector to always promote objects surviving a young generation collection into the tenured generation.

This default policy still seems reasonable for some applications but the old adage about one size not fitting all certainly applies here. So here's what you should consider and what you can do.

The young generation is acting like a filter to dispose of the shorter lived objects. Any objects that get through the filter are moved to the tenured generation and are henceforth handled by the mostly concurrent collections (which I'll just call the concurrent collections from here on out). If the concurrent collections are starting too often to suit your tastes, one possibility is to filter out more of the short lived objects by doing some copying during the young generation collections. This may slow down the growth in the number of objects in the tenured generation and lessen the need for too frequent concurrent collections. To do this try the command line flags

-XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8

MaxTenuringThreshold is the number of collections that an object must survive before being promoted into the tenured generation. When objects are copied back into the young generation during a young generation collection, the objects are copied into a part of the young generation that is referred to as a survivor space. SurvivorRatio set to 8 will mean that the survivor spaces are about 10% of the size of the young generation. The survivor spaces and SurvivorRatio are described in the 5.0 tuning guide that can be found under
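The 10% figure comes from a little arithmetic: the young generation is eden plus two survivor spaces, and SurvivorRatio is the ratio of eden to a single survivor space. A tiny sketch:

```java
// Why SurvivorRatio=8 makes each survivor space about 10% of the young
// generation: youngGen = eden + 2 * survivor, and eden = survivorRatio * survivor,
// so youngGen = (survivorRatio + 2) * survivor.
public class SurvivorSizing {
    static long survivorBytes(long youngGenBytes, int survivorRatio) {
        return youngGenBytes / (survivorRatio + 2);
    }
}
```

With a 100M young generation and SurvivorRatio=8, each survivor space works out to 10M, i.e. 10% of the young generation.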


The FAQ attached to the 5.0 tuning guide has an entry about experimenting with MaxTenuringThreshold.

By the way having concurrent collections starting "too often" can also be a sign that the tenured generation is smaller than is needed for good performance. You can always try increasing the size of the heap to reduce the frequency of the concurrent collections. The suggestions above may be able to get you the less frequent concurrent collections without the longer concurrent collections that typically come when the heap is increased.

The default setting to always promote objects that survive a young generation collection is only the default for the low-pause collector. The other collectors by default will do some copying before promoting an object to the tenured generation.

Monday Mar 06, 2006

When the Sum of the Parts

doesn't equal a big enough hole.

Did I mention that the low pause collector maintains free lists for the space available in the tenured generation and that fragmentation can become a problem? If you're using the low pause collector and things are going just peachy for days and days and then there is a huge (relatively speaking) pause, the cause may be fragmentation in the tenured generation.

In 1.4.2 and older releases in order to do a young generation collection there was a requirement that there be a contiguous chunk of free space in the tenured generation that was big enough to hold all of the young generation. In the GC tuning documents at


this is referred to as the young generation guarantee. Basically during a young generation collection, any data that survives may have to be promoted into the tenured generation and we just don't know how much is going to survive. Being our usual conservative selves we assumed all of it would survive and so there needed to be room in the tenured generation for all of it. How does this cause a big pause? If the young generation is full and needs to be collected but there is not enough room in the tenured generation, then a full collection of both the young generation and the tenured generation is done. And this collection is a stop-the-world collection not a concurrent collection so you generally see a pause much longer than you want to. By the way this full collection is also a compacting collection so there is no fragmentation at the end of the full collection.

In 5.0 we added the ability in the low pause collector to start a young generation collection and then to back out of it if there was not enough space in the tenured generation. Being able to back out of a young generation collection allowed us to make a couple of changes. We now keep an average of the amount of space that is used for promotions and use that (with some appropriate padding to be on the safe side) as the requirement for the space needed in the tenured generation. Additionally we no longer need a single contiguous chunk of space for the promotions so we look at the total amount of free space in the tenured generation in deciding if we can do a young generation collection. Not having to have a single contiguous chunk of space to support promotions is where fragmentation comes in (or rather where it doesn't come in as often). Yes, sometimes using the averages for the amount promoted and the total amount of free space in the tenured generation tells us to go ahead and do a young generation collection and we get surprised (there really isn't enough space in the tenured generation). In that situation we have to back out of the young generation collection. It's expensive to back out of a collection, but it's doable. That's a very long way of saying that fragmentation is less of a problem in 5.0. It still occurs, but we have better ways of dealing with it.

What should you do if you run into a fragmentation problem?

Try 5.0.

Or you could try a larger total heap and/or smaller young generation. If your application is on the edge, it might give you just enough extra space to fit all your live data. But often it just delays the problem.

Or you can try to make your application do a full, compacting collection at a time which will not disturb your users. If your application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night. That will compact the heap and you can hopefully go another day without hitting the fragmentation problem. Clearly no help for an application that does not have a logical "middle of the night".
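One way to arrange the middle-of-the-night System.gc() is sketched below. The scheduling code and the 3 a.m. hour are illustrative assumptions; pick whatever hour your application is quietest. Note also that running with -XX:+DisableExplicitGC would turn the System.gc() call into a no-op:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Schedule a full, compacting collection once a day at a quiet hour.
public class NightlyCompaction {
    // Seconds from `now` until the next occurrence of `target` (possibly tomorrow).
    static long secondsUntil(LocalTime target, LocalTime now) {
        long delta = Duration.between(now, target).getSeconds();
        if (delta <= 0) delta += Duration.ofDays(1).getSeconds();
        return delta;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long initialDelay = secondsUntil(LocalTime.of(3, 0), LocalTime.now());
        scheduler.scheduleAtFixedRate(
                System::gc,                  // stop-the-world, compacting full collection
                initialDelay,
                TimeUnit.DAYS.toSeconds(1),  // repeat every 24 hours
                TimeUnit.SECONDS);
    }
}
```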

Or if by chance most of the data in the tenured generation is read in when your application first starts up and you can do a System.gc() after you complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Depending on the allocation pattern of the application, that might be adequate.

Or you might want to start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. If you are doing concurrent collections and freeing enough space, you can try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalesce adjacent free blocks so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction=NNN where NNN is the percentage of the tenured generation that is in use above which a concurrent collection is started. This will increase the overall time you spend doing GC but may avoid the fragmentation problem. And this will be more effective with 5.0 because a single contiguous chunk of space is not required for promotions.

By the way, I've increased the comment period for my blogs. I hadn't realized it was so short.

Wednesday Feb 15, 2006

So What's It Worth to You?

Have you ever gone to your favorite restaurant, had a great meal and sat back with your second cup of coffee and asked "Why can't those guys do a pauseless garbage collector? How hard can it be?"

Well, you're not the only one. That question came up around here recently.

There are techniques for doing pauseless garbage collection. The problem is that it costs in terms of application performance or additional Java heap size. Threads in an application read and write objects. The garbage collector reads and writes (i.e., frees or moves) objects. All those reads and writes have to have some coordination. "Pauseless" GC very often means that the application as a whole is not paused, but individual threads are paused one at a time. Imagine two application threads that are furiously playing hot potato with an object. As soon as thread 1 passes the object A to thread 2, thread 1 forgets about A and thread 2 realizes it has A. Then thread 2 immediately passes it back to thread 1 with the analogous result. A pauseless collector pauses thread 1 to look at what objects it is holding, then restarts thread 1 (so only one thread is paused at a time) and pauses thread 2 to examine it. Did object A get passed from thread 2 to thread 1 in that window during which both were executing? If it did and that is the only reference to object A, then the collector will not know that object A is live without doing more work.

I'm going to skip the details here. The point of this note is not to delve into the particular difficulties of pauseless GC, but rather to give you an idea of why it can be more costly.

Also I'm going to consider collectors that compact objects (i.e., move them). Mostly because it's an easier example to talk about. Some collectors don't move objects during a collection. Not moving objects simplifies some aspects of a collection but has its own complications (e.g., maintaining free lists of objects, splitting and coalescing blocks of free space, and fragmentation).

If the collector does compact the objects, it has to be sure that all references to live objects point to their new locations. A common way to catch object A when it has slipped through the window is to use what is fondly referred to as a read barrier. Any read of an object by the application goes through code (the read barrier) that looks at the object and decides whether something special has to be done. The "something special" depends on the particulars of the collector. Without getting into those particulars just having to do extra stuff on every read has got to hurt. Yes, I'm really over simplifying this example, but what can you expect from my really simple mind.
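As a very rough sketch, a forwarding-pointer style read barrier (one classic approach, not necessarily what any particular collector uses, and done in compiler-generated code rather than a library call in real life) might look like:

```java
// A conceptual read barrier: every reference load goes through a check that
// may redirect to the object's new location. The forwarding-pointer scheme
// here is a Brooks-style indirection, shown purely for illustration.
public class ReadBarrierSketch {
    static class ObjHeader {
        ObjHeader forwardee;   // non-null once the collector has moved this object
        Object payload;
        ObjHeader(Object payload) { this.payload = payload; }
    }

    // The "something special": follow the forwarding pointer if the collector
    // has relocated the object since the reference was stored.
    static ObjHeader readBarrier(ObjHeader ref) {
        return (ref.forwardee != null) ? ref.forwardee : ref;
    }
}
```

Even this trivial version adds a load and a branch to every object read, which is where the application-performance cost comes from.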

So what's pauseless GC worth to you (i.e., in terms of the extra space and performance costs, the extra complexity, maybe special hardware)? It's definitely worth a bunch to some. But for many it isn't really necessary to have truly pauseless GC. Shorter is good enough.

Why can't those guys do a pauseless garbage collector?

Would "pauseless garbage collection" give us the biggest bang for the buck (i.e., the most benefit to the most users)? When we were deciding what to do for the next release, we asked ourselves that question. And we decided there were things we should do before "pauseless GC". Did I mention that I recently worked on the parallel collection of the tenured generation for the throughput collector? I'll tell you about it some time.

Wednesday Feb 01, 2006

Tell Us Where It Hurts

Here's a URL for asking GC questions.


Engineers in the GC group answer the questions so that's probably as good as it gets.

Updated 2008/2/20

The link above no longer takes you to the form for asking GC related questions. I'm leaving this information above just to save a little bit of confusion (i.e., you won't find yourself saying "Didn't this page used to ..."). The new channel for asking questions is explained below.

Questions for the Hotspot[tm] JVM[tm] garbage collector group should be sent to the hotspot-gc-use mailing list. This is a mailing list that is open to the public. Please see the link below for information on subscribing to that mailing list.


If you have questions which you would prefer not be answered in public, please work through your Sun support contact.

Information on Sun support can be found at


Thursday Jan 26, 2006

Why not a Grand Unified Garbage Collector?

At last count we have three garbage collectors.

  • the parallel collector
  • the low pause collector
  • the serial collector

Why is there more than one? Actually, the usual answer applies. Specialization often results in better performance. If you're interested in more particulars about our garbage collectors, read on.

All three collectors are generational (young generation and tenured generation).

Let's do the easy comparison first, why a parallel collector and a serial collector. Parallelism has overhead. Nuf said? Yeah, I used to read comic books when I was a kid. If you don't understand that reference, ignore it. I'm just older.

As you might infer from the names, the serial collector uses 1 thread to do the GC work and the parallel collector uses multiple threads to do the same. As usual multiple threads doing the same tasks have to synchronize. That's pretty much it. On a single cpu machine the additional cost of the synchronization means that the parallel collector is slower than the serial collector. On a two cpu machine and a VM that has a small heap the parallel collector is about as fast as the serial collector. With two cpu's and large heaps the parallel collector will usually do better. We keep asking ourselves if we can get rid of the serial collector and use the parallel collector in its place. The answer so far keeps coming back no.

More interesting is the case of the low pause collector versus the parallel collector. Above I made the remark about specialization and better performance. This is actually a case of more complexity and lesser performance. These two collectors do the collection of the young generation using almost exactly the same techniques. The differences in the collectors have to do with the collections of the tenured generation. The low pause collector does parts of that collection while the application continues to run. One way to do that is to not move the live objects when collecting the dead objects. The application tends to get confused if the objects it is using move around while the application is running. The other two collectors compact the heap during a collection of the tenured generation (i.e., live objects are moved so as to occupy one contiguous region of the heap). The low pause collector collects the dead objects and coalesces their space into blocks that are kept in free lists. Maintaining free lists and doing allocations from them takes effort so it's slower than having a heap that is compacted. Having applications run while a collection is happening means that new objects can be allocated during a collection. That leads to some more complexity. Also the collection of the tenured generation can be interrupted for a collection of the young generation. More complexity still. The bottom line is that the low pause collector has shorter GC pauses but it costs performance. That performance difference is not huge but it's large enough to keep us from ditching the parallel collector and always using the low pause collector.

And last but not least, can we replace the serial collector with the low pause collector? Very tempting. The serial collector is used by default on desktop machines. We expect those to have 1 or 2 cpus and to be running applications that need 10's of megabytes of Java heap as opposed to 100's of megabytes. With small heaps the differences in collection times tend to make less difference. Even if the low pause collector was 10% slower than the serial collector, the difference between, for example, 70 ms and 77 ms often isn't large enough to matter. It would probably be a done deal except that the low pause collector has a larger memory footprint. It has additional data structures that it uses (for example to keep track of what references are being changed by the running application while a collection is ongoing). It also usually needs a larger Java heap to run an application. Recall that the low pause collector uses free lists to keep track of the available space in the heap. Fragmentation can become a problem. The best bet is that we'll replace the serial collector with the low pause collector some day but not just yet.

Friday Jan 13, 2006

A Little Thread Privacy Please

This is not really about garbage collection but hopefully you'll find it interesting. And there is no punch line (i.e., cool command line flag that I can suggest to make your application run better). It's just a story about letting the VM do it.

If you're using multiple threads (or one of the Java libraries is using multiple threads on your behalf) then threads in your application are doing allocations concurrently. All the threads are allocating from the same heap so some allowances have to be made for simultaneous allocations by 2 or more threads. In the simplest case (other than the case where no allowances are made and which is just plain wrong) each thread would grab a lock, do the allocation and release the lock. That gives the right answer but is just too slow. The slightly more complicated means would be to use some atomic operation (such as compare-and-swap) plus a little logic to safely do the allocation. Faster but still too slow. What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well. But how large should the TLAB's be?
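The TLAB idea can be sketched like this, reduced to address arithmetic. Only refilling a TLAB touches the shared heap pointer (with a compare-and-swap); allocations inside the TLAB are plain unsynchronized bumps. The names and sizes here are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

// A bump-pointer sketch of thread local allocation buffers.
public class TlabSketch {
    static final AtomicLong sharedHeapTop = new AtomicLong(0);  // shared: needs CAS
    static final long TLAB_SIZE = 64 * 1024;                    // illustrative size

    static class Tlab {
        long top;  // next free address in this thread's buffer
        long end;  // end of the buffer

        // Fast path: no synchronization, just bump the thread-private pointer.
        long allocate(long bytes) {
            if (top + bytes > end) {
                refill();          // slow path: grab a new buffer with a CAS
            }
            long addr = top;
            top += bytes;
            return addr;
        }

        void refill() {
            while (true) {
                long oldTop = sharedHeapTop.get();
                if (sharedHeapTop.compareAndSet(oldTop, oldTop + TLAB_SIZE)) {
                    top = oldTop;
                    end = oldTop + TLAB_SIZE;
                    return;
                }
                // Lost a race with another thread's refill; retry the CAS.
            }
        }
    }
}
```

One CAS per 64K buffer instead of one lock or CAS per allocation is the whole trick.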

Prior to 5.0 that was a difficult question to answer. TLAB's that were too large wasted space. If you had tons of threads and large TLAB's, you could conceivably fill up the heap with buffers that were mostly unused. Creating more threads might force a collection which would be unfortunate because the heap was mostly empty. TLAB's that were too small would fill quickly and would mean having to get a new TLAB which would require some form of synchronization. There was not a general recommendation that we could make on how large TLAB's should be. Yup, we were reduced to trial-and-error.

Starting with 5.0 living large with TLAB's got much simpler - except for the guy down the hall that did the implementation. Here's what the VM does for you.

Each thread starts with a small TLAB. Between the end of the last young generation collection and the start of the next (let me call that period an epoch), we keep track of the number of TLAB's a thread has used. We also know the size of the TLAB's for each thread. Averages for each thread are maintained for these two numbers (number and size of TLAB's). These averages are weighted toward the most recent epoch. Based on these averages the sizes of the TLAB's are adjusted so that a thread gets 50 TLAB's during an epoch. Why 50? All things being equal we figured that a thread would have used half its TLAB by the end of the epoch. Per thread that gave us 1/2 a TLAB not used (out of the magic 50) for a wastage of 1%. That seemed acceptable. Also if the young generation was not large enough to provide the desired TLAB's, the size of the young generation would be increased in order to make it so (within the usual heap size constraints, of course).
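A much-simplified model of that sizing logic follows. The weight given to the newest epoch is an assumed value; only the target of 50 TLAB's per epoch comes from the description above:

```java
// Keep a weighted average of how much a thread allocates per epoch, then size
// its TLAB so the thread gets about 50 of them next epoch. This is a sketch,
// not the VM's actual bookkeeping.
public class TlabSizer {
    static final double TARGET_TLABS_PER_EPOCH = 50.0;
    static final double WEIGHT = 0.7;  // weight on the newest epoch (assumed value)

    double avgBytesPerEpoch;  // weighted average of bytes this thread allocates per epoch

    // Called at the end of each epoch: fold in this epoch's allocation and
    // return the TLAB size that would yield ~50 TLAB's next epoch.
    long resize(long tlabSizeBytes, int tlabsUsedThisEpoch) {
        double allocated = (double) tlabSizeBytes * tlabsUsedThisEpoch;
        avgBytesPerEpoch = (avgBytesPerEpoch == 0)
                ? allocated
                : WEIGHT * allocated + (1 - WEIGHT) * avgBytesPerEpoch;
        return Math.max(1, (long) (avgBytesPerEpoch / TARGET_TLABS_PER_EPOCH));
    }
}
```

A thread that burned through 100 TLAB's of 1K each gets a 2K TLAB next epoch; once it settles at 50 TLAB's per epoch, the size stops moving.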

The initial TLAB size is calculated from the number of threads doing allocation and the size of the heap. More threads push down the initial size of the TLAB's and larger heaps push up the initial size.

An allocation that cannot be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.
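That decision can be modeled as a simple threshold test; the 1/8 fraction here is an assumed value for illustration, not the VM's actual cutoff:

```java
// If a request is a significant fraction of the TLAB size, allocate it straight
// from the shared heap rather than retiring the TLAB and wasting its remaining
// space on one outsized object.
public class LargeAllocPolicy {
    static final double LARGE_FRACTION = 1.0 / 8.0;  // assumed cutoff

    static boolean allocateDirectlyFromHeap(long requestBytes, long tlabSizeBytes) {
        return requestBytes > (long) (tlabSizeBytes * LARGE_FRACTION);
    }
}
```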



