Jon Masamitsu's Weblog

  • Java
    October 26, 2007

Did You Know ...

Guest Author
These are a few esoteric factoids that I never expected users to
need, but which have actually come up recently. Most of the
text is just background information. If you already recognize
the command line flags that I've bold'ed, you probably already know
more than is good for you.


The low-pause collector (UseConcMarkSweepGC) does parts of the collection
of the tenured generation concurrently with the execution of the application
(i.e., not during a stop-the-world). There are principally two
concurrent phases of the collection: the concurrent marking phase and
the concurrent sweeping phase. In JDK 6 the concurrent marking phase
can use more than one GC thread (it uses parallelism as well as concurrency).
This use of parallelism is controlled by the command line flag
CMSConcurrentMTEnabled. The number of threads used during a concurrent
marking phase is ParallelCMSThreads. If it is not set on the command
line it is calculated as

(ParallelGCThreads + 3)/4

where ParallelGCThreads is the command line flag for setting the
number of GC threads to be used in a stop-the-world parallel collection.

Where did this number come from? We added parallelism to the concurrent
marking phase because we observed that a single GC thread doing concurrent marking
could be overwhelmed by the allocations of many application threads
(i.e., while the concurrent marking was happening, lots of application
threads doing allocations could exhaust the heap before the concurrent
marking finished). We could see this with a few application
threads allocating at a furious
rate or many application threads allocating at a more modest rate, but
whatever the application we would often see the concurrent marking
thread overwhelmed on platforms with 8 or more processors.

The above
policy provides a second concurrent marking thread at ParallelGCThreads=5
and approaches a fourth of ParallelGCThreads at the higher processor
numbers. Because we still have the added overhead of parallelism,
2 concurrent marking threads provide only a small boost in concurrent
marking over a single concurrent marking thread. We expect that to still
be adequate up to ParallelGCThreads=8.
At ParallelGCThreads=9 we get a third concurrent marking
thread and that's when we expect to need it.
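
The thresholds above follow directly from integer division in the default formula. The snippet below is only a sketch of that arithmetic; the class and method names are made up for illustration and are not HotSpot code.

```java
// Illustrative only: mirrors the formula (ParallelGCThreads + 3) / 4
// from the text. The class and method names are hypothetical.
final class CmsThreadDefaults {

    // Integer division truncates, which is what produces the step
    // from 1 to 2 threads at 5 and from 2 to 3 threads at 9.
    static int defaultConcurrentMarkingThreads(int parallelGcThreads) {
        return (parallelGcThreads + 3) / 4;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 16; n++) {
            System.out.println("ParallelGCThreads=" + n
                    + " -> concurrent marking threads="
                    + defaultConcurrentMarkingThreads(n));
        }
    }
}
```

Running it shows the value staying at 1 through ParallelGCThreads=4, at 2 from 5 through 8, and moving to 3 at 9.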


Our low-pause collector (UseConcMarkSweepGC), which we are usually careful
to call our mostly concurrent collector, has several phases, two
of which are stop-the-world (STW) phases.

  • STW initial mark

  • Concurrent marking

  • Concurrent precleaning

  • STW remark

  • Concurrent sweeping

  • Concurrent reset

    The first STW pause is used to find all the references to objects
    in the application (i.e., object references on thread stacks
    and in registers).
    After this first STW pause is the concurrent marking phase
    during which the application threads run while GC is doing additional
    marking to determine the liveness of objects. After the
    concurrent marking phase there is a concurrent preclean phase
    (described more below) and then the second STW pause which is called the
    remark phase. The remark phase is a catch-up phase
    in which the GC figures out all the changes that
    the application threads have made during the previous concurrent phases.
    The remark phase is the longer of these two pauses.
    It is also typically the longest
    of any of the STW pauses (including the minor collection pauses). Because
    it is typically the longest pause we like to use parallelism wherever
    we can in the remark phase.
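
For readers who prefer to see the phase order as data, here is a small sketch that encodes the list above and marks which phases stop the world. The enum and its names are hypothetical and only restate the list; they do not correspond to anything in the HotSpot sources.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the phase list in the text; not HotSpot code.
enum CmsPhase {
    INITIAL_MARK(true),
    CONCURRENT_MARKING(false),
    CONCURRENT_PRECLEANING(false),
    REMARK(true),
    CONCURRENT_SWEEPING(false),
    CONCURRENT_RESET(false);

    // True when application threads are paused for the phase.
    final boolean stopTheWorld;

    CmsPhase(boolean stopTheWorld) {
        this.stopTheWorld = stopTheWorld;
    }

    // Returns the STW phases in the order they occur in a cycle.
    static List<CmsPhase> stopTheWorldPhases() {
        List<CmsPhase> result = new ArrayList<>();
        for (CmsPhase phase : values()) {
            if (phase.stopTheWorld) {
                result.add(phase);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("STW phases: " + stopTheWorldPhases());
    }
}
```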

    Part of the work in the remark phase
    involves rescanning objects that have been changed by
    an application thread (i.e., looking at the object A to see if A
    has been changed by the application thread so that A now
    references another object B and B was not previously marked as live).
    This includes objects in the young generation and here we come to
    the point of these ramblings. Rescanning the young generation in parallel
    requires that we divide the young generation into chunks so that we can
    give chunks out to the parallel GC threads doing the rescanning. A
    chunk needs to begin on the start of an object and in general we don't have
    a fast way to find the starts of objects in the young generation.

    Given an arbitrary location in the young generation
    we are likely
    in the middle of an object, don't know what kind of object it is, and
    don't know how far we are from the start of the object.
    We know that the first object
    in the young generation starts at the beginning of the young generation
    and so we could start at the beginning and walk from object to object
    to do the chunking but that would be expensive. Instead we piggy-back
    the chunking of the young generation on another concurrent phase, the
    precleaning phase.

    During the concurrent marking phase the application threads are
    running and changing objects so that we don't have an exact picture
    of what's alive and what's not. We ultimately fix this up in the
    remark phase as described above (the object-A-gets-changed-to-point-to-object-B example). But we would like to do as much of the collection as we
    can concurrently so we have the concurrent precleaning phase. The
    precleaning phase does work similar to parts of the remark phase but does it
    concurrently. The details
    are not needed for this story so let me just say that there is a
    concurrent precleaning phase. During the latter part of
    the concurrent precleaning phase
    the young generation "top"
    (the next location to be allocated in the young generation
    and so at an object start)
    is sampled at likely intervals and is saved as the start of
    a chunk.
    "Likely intervals" just means that we want to create chunks that are not too
    small and not too large so as to get good load balancing during the
    parallel remark.

    Ok, so here's the punch line for all this.
    When we're doing the precleaning we do the sampling
    of the young generation top for a fixed amount of time
    before starting the remark. That fixed amount of time is
    CMSMaxAbortablePrecleanTime and its default value is 5 seconds.
    The best situation is to have a minor collection happen during
    the sampling. When that happens the sampling is done over
    the entire region in the young generation from its start to its
    final top.
    If a minor collection is not done during that 5 seconds then
    the region below the first sample is one chunk and it might be
    the majority of the young generation. Such a chunking
    doesn't spread the work out evenly to the GC threads and so reduces the
    effective parallelism.
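
To make the load-balancing point concrete, here is a toy model of the chunking. It assumes rescanning cost is proportional to chunk size and that a chunk cannot be split between threads, which is a simplification; the class, the method names, and the sample offsets are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of young generation chunking; not HotSpot code.
// Offsets are in arbitrary units from the start of the young generation.
final class YoungGenChunking {

    // Splits the region [0, finalTop) at each sampled "top" value.
    // Assumes the samples are sorted in increasing order and finalTop > 0.
    static List<Long> chunkSizes(long[] sortedSamples, long finalTop) {
        List<Long> sizes = new ArrayList<>();
        long start = 0;
        for (long sample : sortedSamples) {
            if (sample > start && sample < finalTop) {
                sizes.add(sample - start);
                start = sample;
            }
        }
        sizes.add(finalTop - start);
        return sizes;
    }

    // Upper bound on speedup: the largest chunk is handled by one thread,
    // and the speedup can never exceed the number of GC threads.
    static double speedupBound(List<Long> sizes, int gcThreads) {
        long total = 0;
        long largest = 0;
        for (long size : sizes) {
            total += size;
            largest = Math.max(largest, size);
        }
        return Math.min(gcThreads, (double) total / largest);
    }

    public static void main(String[] args) {
        // Hypothetical case 1: a minor collection happened during sampling,
        // so samples cover the region from its start to its final top.
        long[] evenSamples = {100, 200, 300, 400, 500, 600, 700};
        // Hypothetical case 2: no minor collection, so the first sample
        // was taken when the young generation was already mostly full.
        long[] lateSamples = {700, 725, 750, 775};

        System.out.println(speedupBound(chunkSizes(evenSamples, 800), 4));
        System.out.println(speedupBound(chunkSizes(lateSamples, 800), 4));
    }
}
```

In the second case the first chunk is 700 of the 800 units, so with 4 GC threads the bound on speedup is only a little above 1, which is the "not getting parallel remarking after all" situation.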

    If the
    time between your minor collections is greater than 5 seconds and
    you're using parallel remark with the low-pause collector (which you
    are by default), you might not be getting parallel remarking after all.
    A symptom of this problem is significant variations in your remark
    pauses. This is not the only cause of variation in remark pauses but
    take a look at the times between your minor collections and if they
    are, say, greater than 3-4 seconds, you might need
    to up CMSMaxAbortablePrecleanTime so that you get a minor collection
    during the sampling.
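
One way to apply this advice is to take the timestamps of your minor collections from a GC log and look at the longest gap between them. The sketch below assumes you have already extracted those timestamps (in seconds) yourself; the class name, method names, and sample values are hypothetical, and the 5 second figure is simply the default quoted above.

```java
// Illustrative helper; not part of the JDK or of any GC log tool.
final class MinorGcIntervalCheck {

    // Default sampling window from the text, in seconds.
    static final double DEFAULT_ABORTABLE_PRECLEAN_SECONDS = 5.0;

    // Longest gap between consecutive minor collections.
    // Assumes the timestamps are in seconds and in increasing order.
    static double longestInterval(double[] minorGcTimesSeconds) {
        double longest = 0.0;
        for (int i = 1; i < minorGcTimesSeconds.length; i++) {
            double gap = minorGcTimesSeconds[i] - minorGcTimesSeconds[i - 1];
            longest = Math.max(longest, gap);
        }
        return longest;
    }

    // True when some gap is longer than the sampling window, so the
    // sampling may finish without seeing a minor collection.
    static boolean samplingMayMissMinorGc(double[] minorGcTimesSeconds,
                                          double precleanSeconds) {
        return longestInterval(minorGcTimesSeconds) > precleanSeconds;
    }

    public static void main(String[] args) {
        // Made-up timestamps, for illustration only.
        double[] times = {10.0, 12.5, 19.0, 21.0};
        System.out.println("Longest gap: " + longestInterval(times) + " s");
        System.out.println("May miss a minor GC: "
                + samplingMayMissMinorGc(times, DEFAULT_ABORTABLE_PRECLEAN_SECONDS));
    }
}
```

Following the 3-4 second rule of thumb above, you may want to pass a smaller threshold than the default when deciding whether to investigate.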

    And finally, why not just have the remark phase wait for a minor
    collection so that we get effective chunking? Waiting is often a
    bad thing to do. While waiting the application is running and
    changing objects and allocating new objects. The former makes more
    work for the remark phase when it happens and the latter could cause
    an out-of-memory before the GC can finish the collection. There is
    an option CMSScavengeBeforeRemark which is off by default. If turned
    on, it will cause a minor collection to occur just before the remark.
    That's good because it will reduce the remark pause. That's bad because
    there is a minor collection pause followed immediately by the remark
    pause which looks like 1 big fat pause.


    We got a complaint recently from a user who said that all his GC pauses
    were too long. I, of course, take such a statement with a grain of
    salt, but I still try to go forward with an open mind. And this time the
    user was right, his GC pauses were way too long. So we started asking
    the usual questions about anything unusual about the application's
    allocation pattern. Mostly that boils down to asking about very large
    objects or large arrays of objects. I'm talking GB size objects here.
    But, no, there was nothing like that. The user was very helpful
    in terms of trying experiments with his application, but we weren't
    getting anywhere until the user came back and said that he had
    commented out part of his code and the GCs got much shorter.
    Hmmm. Curiouser and curiouser. Not only that, but the code that
    was commented out was not being executed. At this point the strain
    on my brain began to be too much and I lost consciousness. Fortunately,
    another guy in the group persevered and with some further experiments
    determined that the code that was being commented out was in a
    method that was not always being JIT'ed.

    Methods larger than a
    certain size will not be JIT'ed in HotSpot.
    Commenting out some code would bring the size of the method below the
    JIT size limit and the method would get compiled. How did that
    affect GC you might ask? When a method is compiled, the compilers
    generate and save information on where object references live (e.g.,
    where on the stack or in which registers). We refer to these as oop maps and
    oop maps are generated to speed up GC. If the method has not
    been JIT'ed, the GC has to generate the oop maps itself during the
    GC. We do that by a very laborious means that we call abstract
    interpretation. Basically, we simulate the execution of the method
    with regard to where references are stored. Large methods mean large
    abstract interpretation times to generate the oop maps. We do save the
    oop maps for the next GC, but oop maps are different at different
    locations in the method. If we generate an oop map for PC=200 this
    time but stop for a GC at PC=300 next time, we have to generate
    the oop map for PC=300. Anyway, the method in which code was being
    commented in and out was too large to be JIT'ed with the code
    commented in, and that led to the long GC's.
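
The PC=200 versus PC=300 point can be pictured as a cache keyed by location in the method. The following is a toy model only: the class is hypothetical, a String stands in for a real oop map, and the "computation" is a placeholder for the abstract interpretation described above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of lazily computed, per-location oop maps; not HotSpot code.
final class LazyOopMapCache {

    // One saved map per location (PC) at which a GC has stopped the method.
    private final Map<Integer, String> mapsByPc = new HashMap<>();
    private int computations = 0;

    // Placeholder for abstract interpretation. In the real VM this is
    // the expensive step, and it is more expensive for larger methods.
    private String computeOopMap(int pc) {
        computations++;
        return "oop map for PC=" + pc;
    }

    // Returns the saved map for this PC, computing it on first use only.
    String oopMapAt(int pc) {
        return mapsByPc.computeIfAbsent(pc, this::computeOopMap);
    }

    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        LazyOopMapCache cache = new LazyOopMapCache();
        cache.oopMapAt(200); // first GC stops at PC=200: computed
        cache.oopMapAt(200); // same location again: reused
        cache.oopMapAt(300); // different location: computed again
        System.out.println("Computations: " + cache.computations());
    }
}
```

The saved map only helps when a later GC stops the method at the same location, which is why a huge interpreted method can keep paying the cost.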

    If you have some
    huge methods and GC's are taking a very long time,
    you could try -XX:-DontCompileHugeMethods. This will tell the
    JIT to ignore its size limit on compilation. I'm told by the compiler
    guy in my carpool that it's not a good idea to use that flag in
    general. Refactor your methods down to a less than huge size instead.
    By the way, the huge method was something like 2500 lines so it was
    what I would call huge.

  • Join the discussion

    Comments ( 11 )
    • studdugie Friday, October 26, 2007

      Great article. I always look forward to your posts.

      Quick question. What is the default method "do not compile" limit?

    • Jon Masamitsu Friday, October 26, 2007

      The limit is measured as the amount of bytecode for a method and is 8000 bytes of bytecode. That number is a compile time constant for a product build.

    • Damon Hart-Davis Sunday, October 28, 2007

      Hi Jon,

      Thanks for that!

      Do you happen to know whether typical/largeish JSP pages are expected to stay under the 'huge' HotSpot compilation limit or go over it? Because I'm pretty sure that those will be the biggest single chunks of bytecode in my GC-stressing app, and do eventually get heavily used, and have all sorts of interesting allocation going on inside them.

      (Of course, I could go and count lines of code spat out after JSP compilation, and I will do so right now!)



    • Damon Hart-Davis Sunday, October 28, 2007

      Just looked at some stats for my most-heavily hit page:

      -rw-r----- 1 root root 113773 2007-10-24 16:01 index_jsp.class

      -rw-r----- 1 root root 192816 2007-10-24 16:01 index_jsp.java

      wc -l *.java


      4607 index_jsp.java

      So that looks in danger of being 'huge' doesn't it (and indeed I've had trouble in the past and factored parts out from time to time)...



    • Jon Masamitsu Monday, October 29, 2007


      That's an interesting question. I've not run into this problem of methods exceeding the huge limit before so I'm guessing it's not that common.

    • Damon Hart-Davis Tuesday, October 30, 2007

      Hi again,

      Is there a flag that can be switched on to log when HotSpot has refused to compile what it considers to be a huge method? Then at least I could do something about it if it is happening!



    • Jon Masamitsu Wednesday, October 31, 2007


      No, there is no switch to get a warning message that a method has not been compiled due to the huge method limit.

    • Bharath R Sunday, November 4, 2007


      Is there a more detailed write-up on what an OOP map is, why it must be generated by GC and how it affects GC in general? Could you please provide a pointer to such a document?



    • Jon Masamitsu Tuesday, November 6, 2007


      Go to the page


      and search down for a paper

      "Finding References in Java™ Stacks" by Agesen and Detlefs.

      The paper describes the problem much better than I can, but basically we need to find all the references to Java objects on a Java thread's stack. Given a point of execution in a method, which slots on the stack of the method hold references to Java objects? Additionally, moving up the stack frames, which slots in each stack frame hold references? The garbage collector needs to know that information in order to know what objects are live. Within a method there are many places where a garbage collection can take place. At such a point A these oopmaps would be needed if a garbage collection was done when the method was executing at A. But if a garbage collection never happens at A, then the oopmaps there would never be needed. So oopmaps are calculated lazily. If a GC happens at A then the oopmaps are calculated at A by the garbage collector.

    • Bharath R Wednesday, November 7, 2007

      Thanks Jon. The reference was very useful.

    • Dmitry Serebrennikov Monday, November 12, 2007

      I'm guessing that JSPs only look big. In reality, most of their bulk turns into constant strings and is probably not counted towards the 8K bytecode limit.

      Can anyone confirm this?

      It really would be great to have a way of telling which methods are not being compiled, through a warning or through some code that loads a class file and checks the method sizes. This could even be a warning issued by javac.
