Jon Masamitsu's Weblog

  • Java |
    April 13, 2006

What the Heck's a Concurrent Mode?

and why does it fail?

If you use the low pause collector, have you ever seen a message that contained
the phrase "concurrent mode failure" such as this?

174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.0000618 secs]174.446: [CMS (concurrent mode failure): 161928K->162118K(175104K), 4.0975124 secs] 228336K->162118K(241520K)

This is from a 6.0 (still under development) JDK but the same type of message can come
out of a 5.0 JDK.
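If you want to pull the numbers out of a line like that, here is a minimal sketch in Java. The class name `ConcurrentModeFailureLine` and the method `parse` are made up for this example, and reading the fields as "used before -> used after (capacity), pause time" is an assumption about the log format, not something the JDK exposes as an API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the JDK): extracts the tenured-generation
// figures from a "concurrent mode failure" log line.
class ConcurrentModeFailureLine {
    // Matches e.g. "(concurrent mode failure): 161928K->162118K(175104K), 4.0975124 secs"
    private static final Pattern TENURED = Pattern.compile(
        "\\(concurrent mode failure\\): (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs");

    /** Returns {usedBeforeK, usedAfterK, capacityK, pauseSeconds}, or null if the line does not match. */
    static double[] parse(String line) {
        Matcher m = TENURED.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3)),
            Double.parseDouble(m.group(4))
        };
    }

    public static void main(String[] args) {
        String line = "174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.0000618 secs]"
            + "174.446: [CMS (concurrent mode failure): 161928K->162118K(175104K), 4.0975124 secs]"
            + " 228336K->162118K(241520K)";
        double[] r = parse(line);
        System.out.printf("tenured: %.0fK -> %.0fK of %.0fK, pause %.3f s%n", r[0], r[1], r[2], r[3]);
        System.out.printf("tenured occupancy after the collection: %.1f%%%n", 100.0 * r[1] / r[2]);
    }
}
```

The figure to watch is the last one in the match: the pause of roughly 4 seconds is the stop-the-world cost of the failure.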

Recall that the low pause collector is a mostly concurrent collector: parts of the
collection are done while the application is still running. The message "concurrent mode
failure" signifies that the concurrent collection of the tenured generation did not
finish before the tenured generation became full. Recall also that a concurrent
collection attempts to start just-in-time to finish before the tenured generation
becomes full. The low pause collector measures the rate at which the tenured
generation is filling and the expected amount of time until the next collection, and
starts a concurrent collection so that it finishes just-in-time (JIT). Three things to note
in that last sentence. The "rate" at which the tenured generation is filling is based
on historical data, as is the "expected amount of time" until the next collection. Either
of those might incorrectly predict the future. Also, the JIT is really
JIT plus some amount of padding so as to get it right most of the time.

When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW)
collection. All the application threads are stopped, a different algorithm is used
to collect the tenured generation (our particular flavor of a mark-sweep-compact),
the application threads are started again, and life goes on. Except that the
STW collection is not very low pause and there's the rub.
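To make the just-in-time idea concrete, here is a toy model in Java. It is only a sketch under stated assumptions: the class `JustInTimeStart`, its method names, and the way padding is added as extra seconds are all invented for illustration and are not how the collector is actually written. It shows a prediction from historical data starting a collection "on time" and then failing because the real fill rate turned out higher.

```java
// Toy model of the just-in-time start decision. All names and formulas here
// are illustrative assumptions, not HotSpot code.
class JustInTimeStart {
    /** Seconds until the tenured generation is full at the given fill rate. */
    static double secondsUntilFull(long freeBytes, double fillBytesPerSec) {
        if (fillBytesPerSec <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return freeBytes / fillBytesPerSec;
    }

    /** Start once the predicted time left no longer covers the predicted collection time plus padding. */
    static boolean shouldStart(long freeBytes, double predictedFillRate,
                               double predictedCollectionSecs, double paddingSecs) {
        return secondsUntilFull(freeBytes, predictedFillRate)
            <= predictedCollectionSecs + paddingSecs;
    }

    /** In this model, a concurrent mode failure means the generation fills before the collection finishes. */
    static boolean concurrentModeFailure(long freeBytesAtStart, double actualFillRate,
                                         double actualCollectionSecs) {
        return secondsUntilFull(freeBytesAtStart, actualFillRate) < actualCollectionSecs;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        double predictedRate = 4.0 * MB;   // from history: 4 MB/s into the tenured generation
        double collectionSecs = 8.0;       // from history: a concurrent collection takes 8 s
        double paddingSecs = 1.0;          // safety margin

        // 40 MB free: 10 s until full, which still exceeds 8 s + 1 s, so no start yet.
        System.out.println(shouldStart(40 * MB, predictedRate, collectionSecs, paddingSecs));
        // 36 MB free: 9 s until full, so the collection starts now.
        System.out.println(shouldStart(36 * MB, predictedRate, collectionSecs, paddingSecs));

        // If the application behaves as predicted, the collection finishes in time.
        System.out.println(concurrentModeFailure(36 * MB, predictedRate, collectionSecs));
        // If the fill rate doubles, the generation fills after 4.5 s: concurrent mode failure.
        System.out.println(concurrentModeFailure(36 * MB, 2 * predictedRate, collectionSecs));
    }
}
```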

At this point if you're asking "why does this happen", you've come to the right blog.
There are several possibilities.

The amount of live data that is in the tenured generation is too large. Specifically,
there is not enough free space in the tenured generation to support the rate of allocation into
the tenured generation. As an extreme example, if there are only 2 words of free space
in the tenured generation after a collection of the tenured generation, chances are those
2 words will be exhausted before another concurrent collection of the tenured generation
can be done. If you are seeing lots of concurrent mode failures, chances are your heap
is too small.

Your application can change behaviors dramatically such that past behavior does not
adequately predict future performance. If this is the problem you'll see the
concurrent mode failures only near the change in behavior. After a few more collections the
low pause collector adjusts its expectations to make better decisions.
But to deal with the concurrent mode failures in the meantime, you'll usually
be giving up some performance. You can tell the low pause
collector to start a collection sooner. The flag -XX:CMSInitiatingOccupancyFraction=NN
will cause a concurrent collection to start when NN percent of the tenured generation
is full. If you use this option to deal with the concurrent mode failures that result
from a change in the behavior of your application, much of the time (when the application's
behavior is more steady state) you'll be starting collections too early and so
doing more collections than necessary. If you set NN to 0, it will cause one concurrent
collection to be followed as soon as possible by another. The next collection may not
start immediately after the last because the check on when to start a collection is
done only at particular points in the code, but the collection will start at the next
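The occupancy threshold described above is easy to picture as a simple percentage check. The Java below is a sketch only: `OccupancyTrigger` and `shouldStart` are hypothetical names, and the real collector evaluates its condition only at particular points in its own code, which this model ignores.

```java
// Illustrative model of an occupancy-based trigger such as the one selected by
// -XX:CMSInitiatingOccupancyFraction=NN. Names are hypothetical, not HotSpot code.
class OccupancyTrigger {
    /** True when the used share of the tenured generation has reached NN percent. */
    static boolean shouldStart(long usedK, long capacityK, int initiatingOccupancyPercent) {
        double occupancyPercent = 100.0 * usedK / capacityK;
        return occupancyPercent >= initiatingOccupancyPercent;
    }

    public static void main(String[] args) {
        long capacityK = 175104; // tenured capacity taken from the log line at the top of the post

        // About 57% full with NN=70: not yet.
        System.out.println(shouldStart(100000, capacityK, 70));
        // About 74% full with NN=70: start a concurrent collection.
        System.out.println(shouldStart(130000, capacityK, 70));
        // NN=0: the condition always holds, so collections run back to back.
        System.out.println(shouldStart(0, capacityK, 0));
    }
}
```

A lower NN buys more headroom for the concurrent collection to finish, at the cost of collecting more often than the steady state needs.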

Your application may not have dramatic changes in behavior, but if it has a large variance in
allocation rates, that can cause the JIT GC to not be JIT. You can add some more padding to
the time at which a concurrent collection kicks off by using the flag
-XX:CMSIncrementalSafetyFactor=NN. The default value for NN is 10 (i.e., a 10% padding
on the start of the concurrent collection). Increasing NN to 100 starts a
concurrent collection at the next opportunity.
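One way to picture the padding is as a discount on the free space the predictor is allowed to count on. That interpretation is an assumption made for this sketch, and `SafetyFactorPadding` with its methods is a hypothetical name, not the collector's own code. It does reproduce the two behaviors described above: 10 gives a 10% margin, and 100 leaves no time at all, so the collection starts at the next opportunity.

```java
// Assumed model of the padding controlled by -XX:CMSIncrementalSafetyFactor=NN:
// NN percent of the free space is treated as if it were already gone.
class SafetyFactorPadding {
    /** Free space the predictor counts on after discounting NN percent. */
    static double paddedFreeBytes(long freeBytes, int safetyFactorPercent) {
        return freeBytes * (100.0 - safetyFactorPercent) / 100.0;
    }

    /** Predicted seconds until the tenured generation is full, with padding applied. */
    static double secondsUntilFull(long freeBytes, double fillBytesPerSec, int safetyFactorPercent) {
        return paddedFreeBytes(freeBytes, safetyFactorPercent) / fillBytesPerSec;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        long free = 40 * MB;      // free space in the tenured generation
        double rate = 4.0 * MB;   // historical fill rate, 4 MB/s

        // Without padding the prediction would be 10 s until full.
        System.out.println(secondsUntilFull(free, rate, 10));   // default: 10% padding
        System.out.println(secondsUntilFull(free, rate, 50));   // more padding, earlier start
        System.out.println(secondsUntilFull(free, rate, 100));  // no time left: start at once
    }
}
```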
