Friday Jun 28, 2013

The Unspoken - The Why of GC Ergonomics

Do you use GC ergonomics, -XX:+UseAdaptiveSizePolicy, with the UseParallelGC collector? The jist of GC ergonomics for that collector is that it tries to grow or shrink the heap to meet a specified goal. The goals that you can choose are maximum pause time and/or throughput. Don't get too excited there. I'm speaking about UseParallelGC (the throughput collector) so there are definite limits to what pause goals can be achieved. When you say out loud "I don't care about pause times, give me the best throughput I can get" and then say to yourself "Well, maybe 10 seconds really is too long", then think about a pause time goal. By default there is no pause time goal and the throughput goal is high (98% of the time doing application work and 2% of the time doing GC work). You can get more details on this in my very first blog.

GC ergonomics

The UseG1GC has its own version of GC ergonomics, but I'll be talking only about the UseParallelGC version.

If you use this option and wanted to know what it (GC ergonomics) was thinking, try


This will print out information every i-th GC (above i is 1) about what the GC ergonomics to trying to do. For example,

    UseAdaptiveSizePolicy actions to meet  *** throughput goal ***
                       GC overhead (%)
    Young generation:       16.10         (attempted to grow)
    Tenured generation:      4.67         (attempted to grow)
    Tenuring threshold:    (attempted to decrease to balance GC costs) = 1

GC ergonomics tries to meet (in order)

  • Pause time goal
  • Throughput goal
  • Minimum footprint

    The first line says that it's trying to meet the throughput goal.

        UseAdaptiveSizePolicy actions to meet  *** throughput goal ***

    This run has the default pause time goal (i.e., no pause time goal) so it is trying to reach a 98% throughput. The lines

        Young generation:       16.10         (attempted to grow)
        Tenured generation:      4.67         (attempted to grow)

    say that we're currently spending about 16% of the time doing young GC's and about 5% of the time doing full GC's. These percentages are a decaying, weighted average (earlier contributions to the average are given less weight). The source code is available as part of the OpenJDK so you can take a look at it if you want the exact definition. GC ergonomics is trying to increase the throughput by growing the heap (so says the "attempted to grow").

    The last line

        Tenuring threshold:    (attempted to decrease to balance GC costs) = 1

    says that the ergonomics is trying to balance the GC times between young GC's and full GC's by decreasing the tenuring threshold. During a young collection the younger objects are copied to the survivor spaces while the older objects are copied to the tenured generation. Younger and older are defined by the tenuring threshold. If the tenuring threshold hold is 4, an object that has survived fewer than 4 young collections (and has remained in the young generation by being copied to the part of the young generation called a survivor space) it is younger and copied again to a survivor space. If it has survived 4 or more young collections, it is older and gets copied to the tenured generation. A lower tenuring threshold moves objects more eagerly to the tenured generation and, conversely a higher tenuring threshold keeps copying objects between survivor spaces longer. The tenuring threshold varies dynamically with the UseParallelGC collector. That is different than our other collectors which have a static tenuring threshold. GC ergonomics tries to balance the amount of work done by the young GC's and the full GC's by varying the tenuring threshold. Want more work done in the young GC's? Keep objects longer in the survivor spaces by increasing the tenuring threshold.

    This is an example of the output when GC ergonomics is trying to achieve a pause time goal

        UseAdaptiveSizePolicy actions to meet  *** pause time goal ***
                           GC overhead (%)
        Young generation:       20.74         (no change)
        Tenured generation:     31.70         (attempted to shrink)

    The pause goal was set at 50 millisecs and the last GC was

    0.415: [Full GC (Ergonomics) [PSYoungGen: 2048K->0K(26624K)] [ParOldGen: 26095K->9711K(28992K)] 28143K->9711K(55616K), [Metaspace: 1719K->1719K(2473K/6528K)], 0.0758940 secs] [Times: user=0.28 sys=0.00, real=0.08 secs]

    The full collection took about 76 millisecs so GC ergonomics wants to shrink the tenured generation to reduce that pause time. The previous young GC was

    0.346: [GC (Allocation Failure) [PSYoungGen: 26624K->2048K(26624K)] 40547K->22223K(56768K), 0.0136501 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]

    so the pause time there was about 14 millisecs so no changes are needed.

    If trying to meet a pause time goal, the generations are typically shrunk. With a pause time goal in play, watch the GC overhead numbers and you will usually see the cost of setting a pause time goal (i.e., throughput goes down). If the pause goal is too low, you won't achieve your pause time goal and you will spend all your time doing GC.

    GC ergonomics is meant to be simple because it is meant to be used by anyone. It was not meant to be mysterious and so this output was added. If you don't like what GC ergonomics is doing, you can turn it off with -XX:-UseAdaptiveSizePolicy, but be pre-warned that you have to manage the size of the generations explicitly. If UseAdaptiveSizePolicy is turned off, the heap does not grow. The size of the heap (and the generations) at the start of execution is always the size of the heap. I don't like that and tried to fix it once (with some help from an OpenJDK contributor) but it unfortunately never made it out the door. I still have hope though.

    Just a side note. With the default throughput goal of 98% the heap often grows to it's maximum value and stays there. Definitely reduce the throughput goal if footprint is important. Start with -XX:GCTimeRatio=4 for a more modest throughput goal (%20 of the time spent in GC). A higher value means a smaller amount of time in GC (as the throughput goal).

  • Thursday Apr 18, 2013

    The Unspoken - Application Times

    Sometimes it is interesting to know how long your application runs between garbage collections. That can be calculated from the GC logs but a convenient way to see that information is through the command line flags -XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCApplicationConcurrentTime. Adding these to your command line produces output such as this.

    Application time: 0.2863875 seconds
    Total time for which application threads were stopped: 0.0225087 seconds
    Application time: 0.1476791 seconds
    Total time for which application threads were stopped: 0.0255697 seconds

    The application ran (reported in the first line) for about 287 milliseconds and then was stopped for about 22 milliseconds (reported in the second line).

    The flags can be used separately or together.

    Add -XX:+PrintGCTimeStamps -XX:+PrintGCDetails and you get

    Application time: 0.1325032 seconds
    20.149: [GC (Allocation Failure) 20.149: [ParNew: 78656K->8704K(78656K), 0.02215
    98 secs] 225454K->158894K(253440K), 0.0222106 secs] [Times: user=0.13 sys=0.00,
    real=0.03 secs]
    Total time for which application threads were stopped: 0.0224188 seconds

    When the flags was first implemented the "stopped" time really was the GC stopped time but it was later changed to include the complete stopped time. There is other work that may be done during a "safepoint" (period during which the application is halted and the VM can do some work without the application changing). All that is now included in the "stopped" time. In the example above I would say that GC was the only thing happening during the safepoint (GC time being 0.0222106 secs and the complete time being nearly the same at 0.0224188 secs). I can't speak authoritatively about what else can happen because I just don't know. But to give one example I've heard about is compiled code deoptimization. When the code for a JIT'ed method needs to be thrown away (typically because some assumption made during the compilation has been violated), the VM has to switch from the compiled code to interpreting that method again. That switch is done at a safepoint (but don't quote me on that because its not my area).

    Tuesday Feb 12, 2013

    The Unspoken - CMS and PrintGCDetails

    What follows is an example of the GC logging output with CMS (UseConcMarkSweepGC) and PrintGCDetails plus some explanation of the output.

    The "CMS-initial-mark" indicates the start of a CMS concurrent collection.

    [GC [1 CMS-initial-mark: 463236K(515960K)] 464178K(522488K), 0.0018216 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

    463236K above is the space occupied by objects in the old (CMS) generation at the start of the collection. Not all those objects are necessarily alive.

    515960K is the total size of the old (CMS) generation. This value changes if the generation grows or shrinks.

    464178K is the sum of the space occupied by objects in the young generation and the old (CMS) generation.

    522488K is the total size of the heap (young generation plus old (CMS) generation)

    0.0018216 secs is the duration of the initial mark pause. The initial mark is a stop-the-world phase.

    After the initial mark completes the CMS concurrent mark starts. The concurrent mark phase is a concurrent phase and can be interrupted by young generation collections. In this case the ParNew (UseParNewGC) is being used to collect the young generation. When a ParNew collection is ready to start, it raises a flag and the CMS collector yields execution to ParNew and waits for ParNew to finish before resuming.

    [GC[ParNew: 6528K->702K(6528K), 0.0130227 secs] 469764K->465500K(522488K), 0.0130578 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]

    6528K is the space in the young generation occupied by objects at the start of the ParNew collection. Not all those objects are necessarily alive.

    702K is the space occupied by live objects at the end of the ParNew collection.

    6528K is the total space in the young generation.

    0.0130227 is the pause duration for the ParNew collection.

    469764K is the space occupied by objects in the young generation and the old (CMS) generation before the collection starts.

    465500K is the space occupied by live objects in the young generation and all objects in the old (CMS) generation. For a ParNew collection, only the liveness of the objects in the young generation is known so the objects in the old (CMS) generation may be live or dead.

    522488K is the total space in the heap.

    [Times: user=0.05 sys=0.00, real=0.01 secs] is like the output of time(1) command. The ratio user / real give you an approximation for the speed up you're getting from the parallel execution of the ParNew collection. The sys time can be an indicator of system activity that is slowing down the collection. For example if paging is occurring, sys will be high.

    [GC[ParNew: 6526K->702K(6528K), 0.0136447 secs] 471324K->467077K(522488K), 0.0136804 secs] [Times: user=0.04 sys=0.01, real=0.01 secs]

    [GC[ParNew: 6526K->702K(6528K), 0.0161873 secs] 472901K->468830K(522488K), 0.0162411 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]

    [GC[ParNew: 6526K->702K(6528K), 0.0152107 secs] 474654K->470569K(522488K), 0.0152543 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]


    [GC[ParNew: 6526K->702K(6528K), 0.0144212 secs] 481073K->476809K(522488K), 0.0144719 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]

    This is the completion of the concurrent marking phase. After this point the precleaning starts.

    [CMS-concurrent-mark: 1.039/1.154 secs] [Times: user=2.32 sys=0.02, real=1.15 secs]

    The 1.039 is the elapsed time for the concurrent marking. The 1.154 is the wall clock time.

    The "Times" output is less meaningful because it is measured from the start of the concurrent marking and includes more than just the work done for the concurrent marking.

    This is the end of the precleaning phase.

    [CMS-concurrent-preclean: 0.006/0.007 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

    The format of the precleaning phase output is analogous to that of the concurrent marking phase.

    [GC[ParNew: 6526K->702K(6528K), 0.0141896 secs] 482633K->478368K(522488K), 0.0142292 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]

    [GC[ParNew: 6526K->702K(6528K), 0.0162142 secs] 484192K->480082K(522488K), 0.0162509 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]

    The remark phase is scheduled so that it does not occur back-to-back with a ParNew so as not to appear to be a pause that is the sum of the ParNew and the remark pause. A second precleaning phase is started and is aborted when the remark phase is ready to start. Aborting this second precleaning phase is the expected behavior. That it was aborted is not an indication of an error. Since the remark phase is waiting, why not preclean but don't delay the remark for the sake of precleaning.

    [CMS-concurrent-abortable-preclean: 0.022/0.175 secs] [Times: user=0.36 sys=0.00, real=0.17 secs]

    This is the remark phase.

    [GC[YG occupancy: 820 K (6528 K)][Rescan (parallel) , 0.0024157 secs][weak refs processing, 0.0000143 secs][scrub string table, 0.0000258 secs] [1 CMS-remark: 479379K(515960K)] 480200K(522488K), 0.0025249 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

    [YG occupancy: 820 K (6528 K)] shows that at the start of the remark the occupancy (sum of the total of the allocated objects in the young generation is 820K out of a total of 6528K. The length of the remark pause depends to some degree on the occupancy of the young generation so we print it out.

    The "Rescan" completes the marking of live objects while the application is stopped. In this case the rescan was done in parallel and took 0.0024157 secs.

    "weak refs processing" and "scrub string table" are tasks done during the remark. Those tasks took 0.0000143 secs and 0.0000258 secs, respectively. If those numbers dominate the remark pause time, they can explain unexpectedly large pauses. Not that they cannot legitimately be large. Just that generally they are not and when they are, take note. If the weak refs processing is dominate, you might be able to cut that time down by using parallel reference processing (-XX:+ParallelRefProcEnabled). No comment on the case when scrub string table is dominant. I've never had to deal with it.

    The concurrent sweeping phase starts at the end of the remark.

    [GC[ParNew: 6526K->702K(6528K), 0.0133250 secs] 441217K->437145K(522488K), 0.0133739 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]

    [GC[ParNew: 6526K->702K(6528K), 0.0125530 secs] 407061K->402841K(522488K), 0.0125880 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]


    [GC[ParNew: 6526K->702K(6528K), 0.0121435 secs] 330503K->326239K(522488K), 0.0121996 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]

    The sweep phase ends here.

    [CMS-concurrent-sweep: 0.756/0.833 secs] [Times: user=1.68 sys=0.01, real=0.83 secs]

    The format above is analogous to that of the concurrent marking.

    And lastly the reset phase.

    [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

    It is not expected that another concurrent CMS collection start before several ParNew collections have been done. If another CMS collection starts immediately, check how full the old (CMS) generation is. If the old (CMS) generation is close to being full immediately after the end of a collection, the heap might be too small.

    I took this log with an early build of JDK8

    java version "1.8.0-ea" Java(TM) SE Runtime Environment (build 1.8.0-ea-b73) Java HotSpot(TM) Server VM (build 25.0-b14, mixed mode)

    and used the flags

    -server -XX:+UseConcMarkSweepGC -XX:NewRatio=8 -XX:-PrintGCCause -XX:ParallelGCThreads=4 -Xmx1g -XX:+PrintGCDetails

    I usually also use -XX+PrintTimeStampsps to get time stamps in the logs and use -XX:+PrintGCDateStamps if I want to correlate the GC output to application events.

    Monday Feb 11, 2013

    The Unspoken - Phases of CMS

    A CMS (UseConcMarkSweepGC) garbage collection has several phase. As explain here some of those phases are stop-the-world and some are concurrent.[Read More]

    Tuesday Nov 01, 2011

    Really? iCMS? Really?

    When I use the term iCMS, I'm referring to the incremental mode of CMS (-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode). This is a mode of CMS where the concurrent phases (concurrent marking and concurrent sweeping) are run in short increments (does some work, yields and sleeps for a while, then runs again). It was implemented in CMS primarily for use on platforms with a single hardware thread. The concurrent phases are typically long (think seconds and not milliseconds). If CMS hogged the single hardware thread for several seconds, the application would not execute during those several seconds and would in effect experience a stop-the-world pause. With iCMS the application makes progress during concurrent phases. That of course is good. The down side is that there is more overhead to iCMS (e.g., all that stopping and starting), more floating garbage (objects that die during a CMS collection that we can't tell have died until the next collection) and more flags to tune (just what you wanted, right). Also it's just more complex (more things to go wrong). We put quite a bit of effort into teaching iCMS how to do the work in pieces and still finish the collection before the heap fill up, but it's not perfect. Are you using iCMS? Does your platform only have one hardware thread? Ok, maybe two hardware threads is reason enough if you don't want half your cycles eaten by CMS during a long concurrent phases. But, otherwise, I'd suggest you try without the incremental mode. There are exceptions to every rule, but I think you'll be happier. Really.

    Monday May 19, 2008

    Help for the NUMA Weary

    When you think about running on a machine with a non-uniform memory architecture (NUMA), do you think, "Cool, some memory is really close"? Or do you think, "Why are my memory latency choices bad, worse and worst"? Or are you like me and try not to think about it at all?

    Well for all of the above, this option rocks.

    Executive summary: There is a option which attempts to improve the performance of an application on a NUMA machine by increasing the application's use of lower latency memory. This is done by creating per cpu pools of memory in the young generation. A thread AA that runs on a cpu XX will get objects allocated out of the pool for XX. When cpu XX first touches a new page of memory, Solaris tries to assign to XX memory that is closer to XX. Additionally, if a thread AA has run on XX, Solaris tries to keep AA running on XX. When this all works as we hope, AA is accessing the memory that is closer to it more of the time. This option does not improve GC performance but improves the performance of application threads. I'll say something about support for non-Solaris platforms at the end.

    So if you're asking yourself if you should care about NUMA, two examples of Sun servers with NUMA architectures are the Sun Sparc E6900 and the AMD Opteron X4600. Sun's early chip multithreading (CMT) boxes (T1000, T2000, T5120 and T5220) are not NUMA, but the later CMT T5140 and T5240 are.

    In the above summary I used the term "per cpu pools" for brevity when I should really have used the term "per node pools" to be more precise. Nodes in this context have 1 or more cpu's, local memory and interconnect as a minimum. I'll try to be more precise below, but if you see node and think cpu, it's close enough.

    On a NUMA system there is some memory that is closer (local) to the cpu's in a node and some memory that is farther away (remote). Local and remote is relative to a node. On Solaris 10 the OS tries to improve the performance of a thread AA executing on a node RR by increasing AA's use of memory local to RR. This is achieved by a "first touch" policy (my words, not a technical term). AA can make a call to get memory but physical memory is not committed to AA until the memory is used (first touched). When AA executing on RR first touches memory, Solaris tries to assign it memory that is local to RR. If there is no available local memory, AA will get memory remote from RR.

    The UseNUMA feature takes advantage of this policy to get better locality for an application thread that allocates objects and then uses them (as opposed to an architecture where one thread allocates objects and some other thread uses them).

    In JDK6 update 2 we added -XX:+UseNUMA in the throughput collector (UseParallelGC). When you turn this feature on the, JVM then divides the young generation into separate pools, 1 pool for each node. When AA allocates an object, the JVM looks to see what node AA is on and then allocates the object from the pool for that node. In the diagram AA is running on RR and RR has its pool in the young generation.

    Combine this with a first touch policy and the memory in the pool for RR is first touched by a thread running on RR and so is likely to be local to RR. And as I mentioned above if AA has run on RR. Solaris will try to keep AA executing on RR. So best case is that you have AA accessing local memory most of the time. It may sound a bit like wishful thinking, but we've seen very nice performance improvements on some applications.

    Contrast this with the allocation without per node pools. As a thread does allocations, it marches deeper and deeper into the young generation. A thread actually does allocations out of thread local buffers (TLAB's) but even so, the TLAB's for a thread are generally scattered throughout the young generation and it is even more wishful thinking to expect those TLAB's to all be mapped to local memory.

    This part is extra credit and you don't really need to know about it to use UseNUMA. Solaris has the concept of locality groups or lgroups. You can read more about lgroups at

    Locality Group APIs

    A node has an lgroup and within that lgroup are the resources that are closer to the node. There is actually a hierarchy of lgroups, but lets talk as if a node has an lgroup that has its closest resources (local resources) and the resources farther away are just someplace else (remote resources). Thread AA running on RR can ask what lgroup MM it is in and can ask if a page of memory is in MM. This type of information is used by the page scanner that I describe below.

    There are a couple of caveats.

    The young generation is divided into per node pools. If any of these pools are exhausted, a minor collection is done. That's potentially wasteful of memory so to ameliorate that, the sizes of the pools are adjusted dynamically so that threads that do more allocation get larger pools.

    In situations where memory is tight and there are several processes running on the system, the per node pools can be a mixture of local and remote memory. That simply comes about when RR first touches a page in its pool and there is not local memory available. It just gets remote memory. To try to increase the amount of local memory in the pool, there is a scanner that looks to see if a page in a pool for RR is in the lgroup MM for RR. If the page is not, the scanner releases that page back to the OS in the hopes that, when AA on RR again first touches that page, the OS will allocate memory in MM for that page. Recall that eden in the young generation is usually empty after a minor collection so these pages can be released. The scanner also looks for small pages in the pool. On Solaris you can have a mixture of pages of different sizes in a pool and performance can be improved by using more large pages and fewer small pages. So the scanner also releases small pages in the hope that it will be allocated a large page the next time it uses the memory. This scanning is done after a collection and only scans a certain number of pages (NUMAPageScanRate) per collection so as to bound the amount of scanning done per collection.

    To review, if you have

    • Thread AA running on node RR and the JVM allocating objects for AA in the pool for RR.
    • Solaris mapping memory for the pool for RR in the lgroup MM (i.e., local to RR) based on first touch.
    • Solaris keeping thread AA running on node RR.

    then your application will run faster. And all you have to do is turn on -XX:+UseNUMA.

    An implementation on linux is in the works and will be in an upcoming update of jdk6. The API's are different for binding the per node pools to local memory (e.g., the JVM requests that pages be bound rather than relying on first touch) but you really don't need to know about any differences. Just turn it on. We've looked at an implementation for windows platforms and have not figured out how to do it yet.

    If you would like to know a little more about dealing with NUMA machines, you might find this useful.

    Increasing Application Performance on NUMA Architectures

    Friday Feb 01, 2008

    Our Collectors

    I drew this diagram on a white board for some customers recently. They seemed to like it (or were just being very polite) so I thought I redraw it for your amusement.

    Each blue box represents a collector that is used to collect a generation. The young generation is collected by the blue boxes in the yellow region and the tenured generation is collected by the blue boxes in the gray region.

  • "Serial" is a stop-the-world, copying collector which uses a single GC thread.
  • "ParNew" is a stop-the-world, copying collector which uses multiple GC threads. It differs from "Parallel Scavenge" in that it has enhancements that make it usable with CMS. For example, "ParNew" does the synchronization needed so that it can run during the concurrent phases of CMS.
  • "Parallel Scavenge" is a stop-the-world, copying collector which uses multiple GC threads.
  • "Serial Old" is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.
  • "CMS" is a mostly concurrent, low-pause collector.
  • "Parallel Old" is a compacting collector that uses multiple GC threads.

    Using the -XX flags for our collectors for jdk6,

  • UseSerialGC is "Serial" + "Serial Old"
  • UseParNewGC is "ParNew" + "Serial Old"
  • UseConcMarkSweepGC is "ParNew" + "CMS" + "Serial Old". "CMS" is used most of the time to collect the tenured generation. "Serial Old" is used when a concurrent mode failure occurs.
  • UseParallelGC is "Parallel Scavenge" + "Serial Old"
  • UseParallelOldGC is "Parallel Scavenge" + "Parallel Old"


    1) UseParNew and UseParallelGC both collect the young generation using multiple GC threads. Which is faster?

    There's no one correct answer for this questions. Mostly they perform equally well, but I've seen one do better than the other in different situations. If you want to use GC ergonomics, it is only supported by UseParallelGC (and UseParallelOldGC) so that's what you'll have to use.

    2) Why doesn't "ParNew" and "Parallel Old" work together?

    "ParNew" is written in a style where each generation being collected offers certain interfaces for its collection. For example, "ParNew" (and "Serial") implements space_iterate() which will apply an operation to every object in the young generation. When collecting the tenured generation with either "CMS" or "Serial Old", the GC can use space_iterate() to do some work on the objects in the young generation. This makes the mix-and-match of collectors work but adds some burden to the maintenance of the collectors and to the addition of new collectors. And the burden seems to be quadratic in the number of collectors. Alternatively, "Parallel Scavenge" (at least with its initial implementation before "Parallel Old") always knew how the tenured generation was being collected and could call directly into the code in the "Serial Old" collector. "Parallel Old" is not written in the "ParNew" style so matching it with "ParNew" doesn't just happen without significant work. By the way, we would like to match "Parallel Scavenge" only with "Parallel Old" eventually and clean up any of the ad hoc code needed for "Parallel Scavenge" to work with both.

    Please don't think too much about the examples I used above. They are admittedly contrived and not worth your time.

    3) How do I use "CMS" with "Serial"?

    -XX:+UseConcMarkSweepGC -XX:-UseParNewGC. Don't use -XX:+UseConcMarkSweepGC and -XX:+UseSerialGC. Although that's seems like a logical combination, it will result in a message saying something about conflicting collector combinations and the JVM won't start. Sorry about that. Our bad.

    4) Is the blue box with the "?" a typo?

    That box represents the new garbage collector that we're currently developing called Garbage First or G1 for short. G1 will provide

  • More predictable GC pauses
  • Better GC ergonomics
  • Low pauses without fragmentation
  • Parallelism and concurrency in collections
  • Better heap utilization

    G1 straddles the young generation - tenured generation boundary because it is a generational collector only in the logical sense. G1 divides the heap into regions and during a GC can collect a subset of the regions. It is logically generational because it dynamically selects a set of regions to act as a young generation which will then be collected at the next GC (as the young generation would be).

    The user can specify a goal for the pauses and G1 will do an estimate (based on past collections) of how many regions can be collected in that time (the pause goal). That set of regions is called a collection set and G1 will collect it during the next GC.

    G1 can choose the regions with the most garbage to collect first (Garbage First, get it?) so gets the biggest bang for the collection buck.

    G1 compacts so fragmentation is much less a problem. Why is it a problem at all? There can be internal fragmentation due to partially filled regions.

    The heap is not statically divided into a young generation and a tenured generation so the problem of an imbalance in their sizes is not there.

    Along with a pause time goal the user can specify a goal on the fraction of time that can be spent on GC during some period (e.g., during the next 100 seconds don't spend more than 10 seconds collecting). For such goals (10 seconds of GC in a 100 second period) G1 can choose a collection set that it expects it can collect in 10 seconds and schedules the collection 90 seconds (or more) from the previous collection. You can see how an evil user could specify 0 collection time in the next century so again, this is just a goal, not a promise.

    If G1 works out as we expect, it will become our low-pause collector in place of "ParNew" + "CMS". And if you're about to ask when will it be ready, please don't be offended by my dead silence. It's the highest priority project for our team, but it is software development so there are the usual unknowns. It will be out by JDK7. The sooner the better as far as we're concerned.

    Updated February 4. Yes, I can edit an already posted blog. Here's a reference to the G1 paper if you have ACM portal access.

  • Thursday Jan 10, 2008

    GC Errata 2

    I know that none of you are still making the transition from jdk1.4.2 to jdk5 (aka jdk1.5.0, yah, we changed the numbering) but if you were, I'd want you to know that you might need a larger permanent generation. In jdk5 class data sharing was implemented (the -Xshare option, see the "java -X" output) when the client VM (-client) is being used. The implementation of class data sharing split some of the information about classes into two parts (the part that can be shared and the part that is specific to an execution of the JVM). The space in the permanent generation for classes that can be shared increased by 10-20%. That doesn't mean that you need a permanent generation that is 10-20% larger! Only some of the classes are affected and only some of the data structures for those classes are affected. But if you're running a tight ship in terms of the size of permanent generation, you might be affected by this. Of course, none of you are still making the transition to jdk5.

    Some users have reported occasional long pauses which occur at the time of a GC but cannot be accounted for by GC times. Such long pauses may be a symptom of a known latency in the use of (Linux's) mprotect() which is used by the JVM while steering application threads to so-called safepoints. A known workaround at the JVM level to step around the Linux mprotect() issue is the use of JVM option -XX:+UseMembar.

    The real fix, unfortunately, lies in the Linux kernel; see You might wish to pursue this with your Linux support engineer and see if a patch is available for the problem.

    To avoid running into this bug, make sure there is plenty of physical memory free in your system so that the Linux pager does not start evicting pages for any process from memory to swap.

    If you're seeing large variations in your minor GC pauses and are using UseParNewGC (which is on by default with the low pause collector), you might be running into 6433335. A large variation is a factor of a 100. This is a bug that has to do with large objects (or large free blocks) in the heap. If you turn off UseParNewGC and see larger pauses (by a multiplier that is on the order of the number of cpu's) but more regular pauses, then 6433335 is a likely candidate. It's been fixed in jdk 1.4.2 update 14, jdk 5 update 10 and jdk 6 update 1.

    I've heard of more than one report recently of crashes with the low-pause collector that were caused by bug

    6558100 - CMS crash following parallel work queue overflow

    This bug is fixed in 1.4.2u18, 5.0u14 and 6u4. You can workaround the bug with the flag -XX:-ParallelRemarkEnabled. You could also run into this bug if you explicitly enable ParallelRefProcEnabled, so if you include -XX:+ParallelRefProcEnabled, remove it. ParallelRefProcEnabled is off by default so if you don't explicitly turn it on, don't worry about it.

    Starting in jdk6 the biased locking optimization is "on" by default (command line option UseBiasedLocking). This optimization reduces the cost of uncontended locks by treating the thread that owns the lock preferentially. It's a nice optimization but could increase the memory usage of the JVM.

    If you move from jdk5 to jdk6 and start seeing messages such as

    java.lang.OutOfMemoryError: requested bytes for GrET in C

    you could try turning off biased locking and see if it helps. Improvements in biased locking were made in 6u1 to make this much less of a problem.

    Friday Oct 26, 2007

    Did You Know ...

    These are a few esoteric factoids that I never expected users to need, but which have actually come up recently. Most of the text is just background information. If you already recognize the command line flags that I've bold'ed, you probably already know more than is good for you.


    The low-pause collector (UseConcMarkSweepGC) does parts of the collection of the tenured generation concurrently with the execution of the application (i.e., not during a stop-the-world). There are principally two concurrent phases of the collection: the concurrent marking phase and the concurrent sweeping phase. In JDK 6 the concurrent marking phase can use more than 1 GC threads (uses parallelism as well as concurrency). This use of the parallelism is controlled by the command line flag CMSConcurrentMTEnabled. The number of threads used during a concurrent marking phase is ParallelCMSThreads. If it is not set on the command line it is calculated as

    (ParallelGCThreads + 3)/4)

    where ParallelGCThreads is the command line flag for setting the number of GC threads to be used in a stop-the-world parallel collection.

    Where did this number come from? We added parallelism to the concurrent marking phase because we observed that a single GC thread doing concurrent marking could be overwhelmed by the allocations of many applications threads (i.e., while the concurrent marking was happening, lots of applications threads doing allocations could exhaust the heap before the concurrent marking finished). We could see this with a fewer application threads allocating at a furious rate or many application threads allocating at a more modest rate, but whatever the application we would often seen the concurrent marking thread overwhelmed on platforms with 8 or more processors.

    The above policy provides a second concurrent marking threads at ParallelGCThreads=5 and approaches a fourth of ParallelGCThread at the higher processor numbers. Because we still do have the added overheard of parallelism 2 concurrent marking threads provide only a small boost in concurrent marking over a single concurrent marking thread. We expect that to still be adequate up to ParallelGCThreads=8. At ParallelGCThreads=9 we get a third concurrent marking thread and that's when we expect to need it.


    Our low-pause collector (UseConcMarkSweepGC) which we are usually careful to call our mostly concurrent collector has several phases, two of which are stop-the-world (STW) phases.

  • STW initial mark
  • Concurrent marking
  • Concurrent precleaning
  • STW remark
  • Concurrent sweeping
  • Concurrent reset

    The first STW pause is used to find all the references to objects in the application (i.e., object references on thread stacks and in registers). After this first STW pause is the concurrent marking phase during which the application threads runs while GC is doing additional marking to determine the liveness of objects. After the concurrent marking phase there is a concurrent preclean phase (described more below) and then the second STW pause which is called the remark phase. The remark phase is a catch-up phase in which the GC figures out all the changes that the application threads have made during the previous concurrent phases. The remark phase is the longer of these two pauses. It is also typically the longest of any of the STW pauses (including the minor collection pauses). Because it is typically the longest pause we like to use parallelism where ever we can in the remark phase.

    Part of the work in the remark phase involves rescanning objects that have been changed by an application thread (i.e., looking at the object A to see if A has been changed by the application thread so that A now references another object B and B was not previously marked as live). This includes objects in the young generation and here we come to the point of these ramblings. Rescanning the young generation in parallel requires that we divide the young generation into chunks so that we can give chunks out to the parallel GC threads doing the rescanning. A chunk needs to begin on the start of an object and in general we don't have a fast way to find the starts of objects in the young generation.

    Given an arbitrary location in the young generation we are likely in the middle of an object, don't know what kind of object it is, and don't know how far we are from the start of the object. We know that the first object in the young generation starts at the beginning of the young generation and so we could start at the beginning and walk from object to object to do the chunking but that would be expensive. Instead we piggy-back the chunking of the young generation on another concurrent phase, the precleaning phase.

    During the concurrent marking phase the applications threads are running and changing objects so that we don't have an exact picture of what's alive and what's not. We ultimately fix this up in the remark phase as described above (the object-A-gets-changed-to-point-to-object-B example). But we would like to do as much of the collection as we can concurrently so we have the concurrent precleaning phase. The precleaning phase does work similar to parts of the remark phase but does it concurrently. The details are not needed for this story so let me just say that there is a concurrent precleaning phase. During the latter part of the concurrent precleaning phase the the young generation "top" (the next location to be allocated in the young generation and so at an object start) is sampled at likely intervals and is saved as the start of a chunk. "Likely intervals" just means that we want to create chunks that are not too small and not too large so as to get good load balancing during the parallel remark.

    Ok, so here's the punch line for all this. When we're doing the precleaning we do the sampling of the young generation top for a fixed amount of time before starting the remark. That fixed amount of time is CMSMaxAbortablePrecleanTime and its default value is 5 seconds. The best situation is to have a minor collection happen during the sampling. When that happens the sampling is done over the entire region in the young generation from its start to its final top. If a minor collection is not done during that 5 seconds then the region below the first sample is 1 chunk and it might be the majority of the young generation. Such a chunking doesn't spread the work out evenly to the GC threads so reduces the effective parallelism.

    If the time between your minor collections is greater than 5 seconds and you're using parallel remark with the low-pause collector (which you are by default), you might not be getting parallel remarking after all. A symptom of this problem is significant variations in your remark pauses. This is not the only cause of variation in remark pauses but take a look at the times between your minor collections and if they are, say, greater than 3-4 seconds, you might need to up CMSMaxAbortablePrecleanTime so that you get a minor collection during the sampling.

    And finally, why not just have the remark phase wait for a minor collection so that we get effective chunking? Waiting is often a bad thing to do. While waiting the application is running and changing objects and allocating new objects. The former makes more work for the remark phase when it happens and the latter could cause an out-of-memory before the GC can finish the collection. There is an option CMSScavengeBeforeRemark which is off by default. If turned on, it will cause a minor collection to occur just before the remark. That's good because it will reduce the remark pause. That's bad because there is a minor collection pause followed immediately by the remark pause which looks like 1 big fat pause.l


    We got a complaint recently from a user who said that all his GC pauses were too long. I, of course, take such a statement with a grain of salt, but I still try to go forward with an open mind. And this time the user was right, his GC pauses were way too long. So we started asking the usual questions about anything unusual about the application's allocation pattern. Mostly that boils down to asking about very large objects or large arrays of objects. I'm talking GB size objects here. But, no, there were nothing like that. The user was very helpful in terms of trying experiments with his application, but we weren't getting anywhere until the user came back and said that he had commented out part of his code and the GC's got much smaller. Hmmm. Curiouser and curiouser. Not only that, but the code that was commented out was not being executed. At this point the strain on my brain began to be too much and I lost consciousness. Fortunately, another guy in the group persevered and with some further experiments determined that the code that was being commented out was in a method that was not always being JIT'ed.

    Methods larger than a certain size will not be JIT'ed in hotspot. Commenting out some code would bring the size of the method below the JIT size limit and the method would get compiled. How did that affect GC you might ask? When a method is compiled, the compilers generate and save information on where object references live (e.g., where on the stack or in which registers). We refer to these as oop maps and oop maps are generated to speed up GC. If the method has not been JIT'ed, the GC has to generate the oop maps itself during the GC. We do that by a very laborious means that we call abstract interpretation. Basically, we simulate the execution of the method with regard to where reference are stored. Large methods mean large abstract interpretation times to generate the oop maps. We do save the oop maps for the next GC, but oop maps are different at different locations in the method. If we generate an oop map for PC=200 this time but stop for a GC at PC=300 next time, we have to generate the oop map for PC=300. Anyway, the method in which code was being commented in and out, was too large to be JIT'ed with the code commented in and that led to the long GC's.

    If you have some huge methods and GC's are taking a very long time, you could try -XX:-DontCompileHugeMethods. This will tell the JIT to ignore its size limit on compilation. I'm told by the compiler guy in my carpool that it's not a good idea to use that flag in general. Refactor your methods down to a less than huge size instead. By the way, the huge method was something like 2500 lines so it was what I would call huge.

  • Friday Jun 22, 2007

    Size Matters

    I recently had an interesting exchange with a customer about concurrent mode failures with the low-pause collector. My initial response was with my litany of whys and wherefores of concurrent mode failures. But it got more interesting, because it turned out to be an example of where object sizes and the young generation size made an unexpected difference.

    Before I get to that let's talk a little bit about the JVM's behavior with regard to allocating large objects. Large, of course, is a relative term and what I mean here is large relative to the size of the young generation. My comments here apply to JDK 5 update 10 and later and JDK 1.4.2 update 13 and later. Earlier versions of those JVM's had a bug in them (6369448) that made their behavior different. All versions of JDK 6 are also free from this bug. By the way the customer was on JDK 5 update 6 so was subject to 6369448. That made it a little more interesting for me but may not be of interest to you if you are running on the later releases. And also the policy I describe below does not apply to the throughput collector (UseParallelGC) which has its own variation on this policy.

    Objects are normally allocated out of the young generation and get moved to the tenured generation as they age. But what happens when a object is larger than the young generation (actually larger than the eden space in the young generation)? If the application tries to allocate object AAA and there is not enough free space in eden for the allocation, a garbage collection usually occurs. The exception to this is if AAA is larger than eden. An outline of the policy for allocation and collections follow.

    1. Allocate out of the young generation
    2. Possibly allocate out of the tenured generation
    3. Collect the young generation
    4. Possibly collect the tenured generation
    5. Heroics

    If the allocation in 1. fails, before starting a GC the JVM considers allocating AAA out of the tenured generation at 2. The JVM compares the size of AAA to the current capacity of eden. By capacity I mean the total size available in an empty eden. If the capacity of eden is too small to hold AAA, then collecting the young generation will not help with the allocation of AAA. So the JVM tries to allocate AAA directly out of the tenured generation. If the allocation out of the tenured generation succeeds, no young generation GC is done and the JVM just continues. If AAA is smaller than eden, then the JVM proceeds to the collection of the young generation at 3. After the the young generation collection, a check is done at 4. to see if enough space has been freed to allocate AAA in eden. If AAA still cannot be allocated out of the young generation, then the tenured generation is collected at 5. and an allocation attempt is made out of the tenured generation. If AAA cannot be allocated in the tenured generation, some additional heroics are attempted (e.g., clearing all SoftReferences). Failing those heroics an out-of-memory is thrown.

    A few things to note. The initial allocation at 1. is what we refer to as fast-path allocation. Fast-path allocation does not call into the JVM to allocate an object. The JIT compilers know how to allocate out of the young generation and code for an allocation is generated in-line for object allocation. The interpreter also knows how to do the allocation without making a call to the VM. And yes, both know about thread local allocation buffers (TLAB's) and use them. The allocation out of the tenured generation at 2. could be attempted unconditionally for any sized object. That could avoid a collection of the young generation, but would also defeat the purpose of having the young generation. We want new objects allocated in the young generation because that's where they will probably die. We only want the longed-lived objects to make it into the tenured generation. Also an allocation at 2. is not a fast-path allocation so the execution of the application could be significantly affected by too much allocation at 2. At 4. a check is made to see if enough space has been freed in the young generation for AAA and if not the tenured generation is collected. Why not attempt the allocation of AAA out of the tenured generation before doing the collection? Attempting the allocation first would probably work lots of the time, but there are pathological cases where allocations would repeatedly be done at 4. If we called allocation at 2. slow-path allocation, we'd have to call allocation at 4. slow-slow-path allocation. At 4. we've already stopped the world and done a young generation collection. The pathological case would execute very slowly. We know. We've tried variations that did the slow-slow-path allocation first and the reports that we usually got was the the JVM was hung! The JVM was not hung. Going slow-slow-path just made it seem so. Doing the collection of the tenured generation actually also collects the young generation and doing that avoids the pathological case in practice.

    The final thing to note is that you don't want to be stuck using slow-path allocation (allocation at 2.) It's really slow compared to fast-path allocation. So if you have large objects (and only you would know), try to make your young generation large enough to hold plenty of them (10's would be good, 100's would be better). Or that you have very few of them.

    The policy for the throughput collector differs from the above at 2. and 5. At 2. the throughput collector will allocate an object out of the tenured generation if it is larger than half of eden. That variation tries to not fill up the young generation with large objects. Is this better? I just don't know. The heroics at 5. are also different.

    The interesting exchange with the customer that got me thinking about this blog started with this snippet of a GC log. Again this was with JDK 5 update 6 so the bug 6369448 was in play.

    963236.282: [GC 963236.282: [ParNew: 15343K->0K(21184K), 0.1025429 secs] 8289743K->8274557K(12287936K), 0.1029416 secs]
    963236.479: [GC 963236.479: [ParNew: 10666K->0K(21184K), 0.1072986 secs] 963236.587: [CMS (concurrent mode failure): 8285141K->7092744K(12266752K), 91.3603721 secs] 8285224K->7092744K(12287936K), 91.4763098 secs]
    963328.194: [GC 963328.194: [ParNew: 21120K->0K(21184K), 0.1455909 secs] 7135030K->7114033K(12287936K), 0.1459930 secs]
    963328.434: [GC 963328.435: [ParNew: 7442K->0K(21184K), 0.0745429 secs] 963328.509: [CMS (concurrent mode failure): 7114084K->7073781K(12266752K), 78.1535121 secs] 7121475K->7073781K(12287936K), 78.2286852 secs]
    963408.503: [GC 963408.503: [ParNew : 21120K->0K(21184K), 0.0651745 secs] 7116067K->7097487K(12287936K), 0.0656080 secs]

    When I first looked at this log I jumped right to the first concurrent mode failure. The first thing to notice was that the available free space in the tenured generation was large (about 4g). The total space in the tenured generation was 12266752K and the occupancy (amount of space currently being used) of the tenured generation was only 8285141K (see the part of the log immediately following the first "concurrent mode failure" message). That's about 4g of space. With that much free space the only thing I could think of was a really bad case of fragmentation. Now a concurrent mode failure is bad for the low-pause collector. If it is due to fragmentation at least the resulting collection is a compacting collection and you should not have to worry about fragmentation for a while. So why is there another concurrent mode failure after just one minor collection? And there is even more free space in the tenured generation when this second one occurred (about 5g). So now I'm sweating. I fortunately started noticing some other peculiarities in the log. For example the first minor collection starts when the young generation had only allocated 15343K out of a total of 21184K. That's not that unusual because the survivor space can be sizable and mostly empty. But the third collection is also a minor collection and the young generation has allocated 21120K by the time the minor collection starts (as does the fifth collection). Hmmm. Looking at the occupancies of the young generation at the point of the two concurrent mode failures shows the pattern: the young generation has a good chunk of free space but is still being collected. So I asked the customer if very large objects are being allocated, objects that are approaching the size of the young generation. To my relief the answer was yes. What follows now is my interpretation of the logs. I didn't have the application to run and really debug the behavior, but I think the pieces fit together.

    The first collection (a minor collection) starts seemingly early because a large object is being allocated. The object is larger than the avalable space in eden but is not larger than all of eden so the minor collection does free up enough space to allocate that large object. The next collection is started when an even larger object allocation is attempted. This is where bug 6369448 comes into play. If not for 6369448 the larger object would have been allocated out of the tenured generation at 2. (the corrected case that I discussed above). There was an inconsistency in the test at 2. and the check at 4. before the collection of the tenured generation. A young generation is divided into an eden and two survivor spaces. The test at 2. compared the object size to the total size of the young generation. The check at 4. only checked the size (as is correct) against eden. So the short circuit at 2. was not taken, the collection at 3. doesn't free up enough space, and a tenured generation collection (concurrent mode failure) is the result. The third collection is an uninteresting minor collection. The fourth collection again is a collection prompted by the allocation of an object too big to fit into eden so another concurrent mode failure happens. The fifth collection is also uninteresting as are the many minor collections that follow.

    My story for the two concurrent mode failures does depend on the large objects fitting in that gap in size between eden and the total young generation, but I'm willing to assume that there was a mix of object sizes and occasionally there was something in the gap. The two concurrent mode failure pattern above didn't happen often in the entire log and it was a long log. And the customer increased the size of the young generation and these types of concurrent mode failures did not reoccur.

    I've always expected the occurrence of slow-path allocation to be a rare event. I think it is rare, but that's not the same as never so I'm thinking about adding a counter to be printed into the GC log to indicate how many slow-path allocations have occurred since the last GC.

    Monday Apr 23, 2007

    Get What You Need

    As we all know by now GC ergonomics exists to automatically adjust the sizes of the generations to achieve a pause time goal and/or a throughput goal while using a minimum heap size. But sometimes that's not what you really want. I was talking to a user recently who was using java.lang.Runtime.totalMemory() and java.lang.Runtime.freeMemory() to monitor how full the heap was. The intent was that the user's application be able to anticipate when a GC was coming and to throttle back on the number of requests being serviced so as to avoid dropping requests due to the reduced throughput when the GC pause did occur. The problem was that the capacity of the heap (the amount returned by totalMemory()) was varying so much that anticipating when a GC was coming was not very reliable. First the pretty picture.

    Looking at the memory used plot first, I inferred from the graph that there was some processing during initialization that required a lesser amount of memory. After that there was activity that used and released memory until about 6k seconds. At that point the activity died down until about 7.5k seconds when there was an up tick in activity and then a quiet period again until about 11k seconds. After that there was pretty regular allocations being done.

    Looking at the total memory I see what I'm expecting to see. The initial total heap size at zero time is larger than is needed during the first phase and the total memory drops. Then as allocations picked up the size of the heap grew and, on average, stayed at a higher level until the first quiet period. During the first quiet period the size total heap size decayed to a lower value that corresponds to the lesser demand for allocations. Then there is the short burst of activity with an corresponding growth in the heap. The toal heap again drops during the second quiet period and then grows again as the activity picks up.

    I like the behavior I see, but it wasn't what the user wanted.

    The amount of variation in the total heap size during the active periods is high. This is varying too quickly for this user. The reason for the variations is that the GC ergonomics is trying to react to variations in the behavior of the applications and so is making changes to the heap as soon as a variation in the application's behavior is recognized. There are two ways to look at this:

    • GC ergonomics was designed for the case where a steady state behavior has been reached, and as exhibited by the variations in the amount of memory used, this application doesn't reach a steady state (on the time scale of interest) and GC ergonomics is doing the best that it can under the circumstances, and
    • the user's application is what it is and what should he do?

    The first bullet is the excuse I use for almost everything. Let's consider the second.

    First a word about what GC ergonomics does by default that provides some damping of variations.

    GC ergonomics calculates the desired size of a generation (in terms of meeting the goals specified by the user) and then steps toward that size. There is a miminum size for a step and if the distance to the desired size is less than that minimum step, then the step is not made. The minimum step size is the page size on the platform. This eliminates changes as we get close to the goal and removes some of the jitter.

    When deciding whether a goal is being met, GC ergonomics uses weighted averages for quantities. For example, a weighted average for each pause time is used. The weighted averages typically change more slowly than the instantaneous values of the quantities so their use also tends to damp out variations.

    But if you really want a more stable heap size, here's what you can try. These suggestions limit the range of changes that can be made by GC ergonomics. The more minor limitations are given first followed by the more serious ones, finally ending with turning off GC ergonomics. You can just try the suggestions (in bold) and skip over the explanations as you wish.

    Reduce the jitter caused by the different sizes of the survivor spaces (and the flip-flopping of the roles of the survivor spaces) by reducing the size of the survivor spaces. Try NNN = 8 as a starter.



    The total size of the young generation is the size of eden plus the size of from-space. Only from-space is counted because the space in the young generation that contain objects allocated by the application is only eden and from-space. The other survivor space (to-space) can be thought of as scratch space for the copying collection (the type of collection that is used for the young generation). Each survivor space alternately plays the role of from-space (i.e. during a collection survivor space A is from-space and in the next collection survivor space B is from-space). Since the two survivor spaces can be of different sizes, just the swapping can change the total size of the young generation. In a steady state situation the survivor spaces tend to be the same size but in situations where the application is changing behavior and GC ergonomics is trying to adjust the sizes of the survivor spaces to get better performance, the sizes of the survivor spaces are often different temporarily.

    The default value for MinSurvivorRatio is 3 and the default value for InitialSurvivorRatio is 8. Pick something in between. A smaller value puts more space into the survivor spaces. That space might go unused. A larger value limits the the size of the survivor spaces and could result in objects being promoted to the tenured generation prematurely. Note that the survivor space sizes are still adjusted by GC ergonomics. This change only puts a limit on how large they can get.

    Reduce the variation in the young generation size by setting a minimum and a maximum with -XX:NewSize=NNN and -XX:MaxNewSize=MMM.

    GC ergonomics will continue to adjust the size of the young generation within the range specified. If you want to make the minimum and maximum limits of the young generation the same you can use -XX:NewSize=NNN and -XX:MaxNewSize=NNN or the flag -XmnNNN will also do that.

    Reduce the range over which the generations can change by specifying the minimum and maximum heap size with -Xms and -Xmx, respectively.

    If you've already explicitly specified the size of the young generation, then specifying the limit on the entire heap will in effect specify the limits of the tenured generation. You don't have to make the minimum and maximum the same but making them the same will have the maximum effect in terms of reducing the variations.

    Turn off GC ergonomics with the flag -XX:-UseAdaptiveSizePolicy

    The size of the generations, the sizes of the survivor spaces in the young generation and the tenuring threshold stay at their starting value throughout the execution of the VM. You can exercise this level of control but the you're back to tuning the GC yourself. But sometimes that really is the best solution. The user I talked to liked the more predictable heap size better than the automatic tuning and so turned off ergonomics.

    Monday Mar 05, 2007

    GC Errata 1

    Just some tid bits of information to fill some of the holes in the documentation.

    GC ergonomics is used by default on server class machines. Please see

    for a description of GC ergonomics and server class machine. The flag "-server" causes the server VM to be used. "-server" does not cause the platform on which the VM is running to be considered a server class machine.

    Prior to JDK6 the UseParallelGC collector (with GC ergonomics turned on) used only the flags InitialSurvivorRatio and MinSurvivorRatio to set the initial survivor ratio and the minimum survivor ratio, respectively. GC ergonomics dynamically resizes the survivor spaces, but these flags allow the user to specify initial and minimum sizes for the survivor spaces. The flag SurvivorRatio is used by the other collectors (that don't dynamically resize the survivor spaces) to set the size of the survivor spaces. SurvivorRatio was ignored by the UseParallelGC collector. Beginning with JDK6 if the SurvivorRatio is set on the command line for the UseParallelGC collector and InitialSurvivorRatio and MinSurvivorRatio are not explicitly set on the command line, then InitialSurvivorRatio and MinSurvivorRatio are set to SurvivorRatio + 2.

    If UseAdaptiveSizePolicy is turned off when using the UseParallelGC collector, the tenured generation and the young generation stay at their initial sizes throughout the execution of the VM.

    Starting with JDK5 update 06 the maximum tenuring threshold is limited to 15. The GC logging output does not correctly reflect this change. This is a bug that will be fixed under 6521376.

    If a minimum heap size is set with -Xms and NewSize is not explicitly set, the minimum young generation size is calculated using NewRatio. The UseParallelGC collector is not correctly using the minimum value calculated from -Xms and NewRatio but is using the default value of NewSize. Explicitly setting NewSize does correctly set the minimum size for the generation. This is being fixed under bug 6524727.

    Monday Feb 12, 2007

    How Does Your Garden Grow?

    Actually, the title should be "How Does Your Generation Grow" but then the literary reference would have been totally lost.

    I've written about how the generations grow if you are using GC ergonomics.

    What if you're not using GC ergonomics? There is a different policy for growing and shrinking the heap that is used by the low pause collector and the serial collector. If you're interested in that policy, this is you're lucky blog.

    It's actually pretty straightforward. At the end of the collection of a generation, we check the amount of free space in the generation. If the amount of free space is below a certain percentage (specified by MinHeapFreeRatio) of the total size of the generation, then the generation is expanded. Basically, we ask if we have enough free space in the generation so that the VM can run a while before we have to collect again. At the other end we don't want to have an excessive amount of free space so if the amount of free space is above a certain percentage (specified by MaxHeapFreeRatio) the generation is shrunk. That's it.

    If you decide to play with the free ratio parameters, leave enough distance between MinHeapFreeRatio and MaxHeapFreeRatio so that the generations are not constantly adjusting by small amounts to get to the "perfect" ratio. Also our experience is that even if a generation does not need the extra free space right now, it will shortly, so don't be too aggressive with MaxHeapFreeRatio.

    Will this policy eventually be replaced by GC ergonomics? Actually, I think not. I recently talked to a customer who told me that he was more concerned with bounds on the heap footprint than with achieving some specific throughput. The way he put it was something like "I don't want to be doing GC's all the time so there should be a good amount of free space in the heap but not too much". This policy is one approximation of what he wants.

    Tuesday Feb 06, 2007

    JDK6 GC Release notes

    Here's a couple of pointers to the GC section of the release notes for JDK6.

    Parallel Compaction

    Concurrent Mark Sweep Collector Enhancements

    Edit done May 16, 2007.

    Ryan and Tobais, sorry but I just saw your comments. I could not figure out how to reply with a comment (comment period being closed and I didn't want to mess with the settings too much) so I'm editting the blog to respond.


    The line "CMS-concurrent-abortable-preclean" does not indicate a problem. The "abortable" only indicates that the current phase (precleaning) can be interrupted in order to start the next phase.


    Yes, the UseParallelOldGC does imply UseParallelGC.

    Thursday Jan 11, 2007

    When You're at Your Limit

    It's often the case that we advise users of large applications to use large young generations. The rational is that large applications (very often multithreaded) have large allocation rates and a large young generations is good because it allows more time between minor collections. One case where that is not a good idea is when you're using the largest heap you can (e.g. close to 4G on a 32 bit system) and the live data is bigger than will fit in the tenured generation.

    When the young generation fills up, we want to do a minor collection. Minor collections are in general more productive (the young generation tends to have more garbage) and faster (the time to collect depends largely on the amount of live data in the young generation and not on the about of garbage). If, however, when we're contemplating a minor collection and the tenured generation is full, the garbage collector will think that it's time to do a major collection and collect both the tenured generation and the young generation. If the tenured generation is actually full (or very nearly full) of live data, then the tenured generation may still be full at the end of the major collection. The collection worked, but the next time the young generation fills up, a major collection will again be done.

    If you find yourself in this situation, try reducing the size of the young generation and increasing the size of the tenured generation accordingly. The goal is to have enough free space available in the tenured generation after a major collection so that the next collection can be a minor collection. Even if you can only manage enough free space in the tenured generation to do 1 minor collection, it's a win. Sizing the generations so you can get a few minor collections between major collections would be sweeter, but that all depends on how close you are to your limit.




    « July 2016