What the Heck's a Concurrent Mode?
By jonthecollector on Apr 13, 2006
If you use the low pause collector, have you ever seen a message that contained the phrase "concurrent mode failure" such as this?
This is from a 6.0 (still under development) JDK but the same type of message can come out of a 5.0 JDK.
Recall that the low pause collector is a mostly concurrent collector: parts of the collection are done while the application is still running. The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full. Recall also that a concurrent collection attempts to start just-in-time to finish before the tenured generation becomes full. The low pause collector measures the rate at which the tenured generation is filling and the expected amount of time until the next collection, and starts a concurrent collection so that it finishes just-in-time (JIT). Three things to note in that last sentence. The "rate" at which the tenured generation is filling is based on historical data, as is the "expected amount of time" until the next collection. Either of those might incorrectly predict the future. Also, the JIT is really JIT plus some amount of padding so as to get it right most of the time.
When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW) collection. All the application threads are stopped, a different algorithm is used to collect the tenured generation (our particular flavor of a mark-sweep-compact), the application threads are started again, and life goes on. Except that the STW collection is not very low pause, and there's the rub.
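To make the just-in-time idea concrete, here is a rough sketch of how such a start decision could be modeled. It is not the collector's actual code: the class name, method names, and the simple "free space divided by fill rate" prediction are all assumptions made for illustration, and the padding is left out. It also shows how a prediction built on a historical rate goes wrong when the real rate changes.

```java
/** Hypothetical model of a just-in-time start decision; not HotSpot's actual code. */
final class JitStartSketch {
    private JitStartSketch() {}

    /** Predicted seconds until the tenured generation is full, given a historical fill rate. */
    static double secondsUntilFull(long freeBytes, double fillRateBytesPerSec) {
        if (fillRateBytesPerSec <= 0.0) {
            return Double.POSITIVE_INFINITY; // not filling, so it never becomes full
        }
        return freeBytes / fillRateBytesPerSec;
    }

    /** Start once the predicted time until full no longer exceeds the expected collection time. */
    static boolean shouldStart(long freeBytes, double fillRateBytesPerSec,
                               double expectedCollectionSec) {
        return secondsUntilFull(freeBytes, fillRateBytesPerSec) <= expectedCollectionSec;
    }

    public static void main(String[] args) {
        long freeBytes = 100L * 1024 * 1024;          // 100 MB free (made-up number)
        double historicalRate = 10.0 * 1024 * 1024;   // 10 MB/s (made-up number)
        double expectedCollectionSec = 6.0;           // made-up collection duration

        // The history predicts 10 s until full, so there is no need to start yet.
        System.out.println("start now? "
                + shouldStart(freeBytes, historicalRate, expectedCollectionSec));

        // If the real rate doubles, the generation fills in 5 s, which is less than
        // the 6 s a collection needs: the situation reported as a concurrent mode failure.
        double actualSecondsUntilFull = secondsUntilFull(freeBytes, 2 * historicalRate);
        System.out.println("would fill before collection finishes? "
                + (actualSecondsUntilFull < expectedCollectionSec));
    }
}
```

In this toy model the collector is only as good as its two estimates, which is why the real thing adds padding on top.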
At this point if you're asking "why does this happen", you've come to the right blog. There are several possibilities.
The amount of live data that is in the tenured generation is too large. Specifically, there is not enough free space in the tenured generation to support the rate of allocation into the tenured generation. As an extreme example, if there are only 2 words of free space in the tenured generation after a collection of the tenured generation, chances are those 2 words will be exhausted before another concurrent collection of the tenured generation can be done. If you are seeing lots of concurrent mode failures, chances are your heap is too small.
Your application can change behaviors dramatically such that past behavior does not adequately predict future performance. If this is the problem you'll see the concurrent mode failures only near the change in behavior. After a few more collections the low pause collector adjusts its expectations to make better decisions. But to deal with the concurrent mode failures in the meantime, you'll usually be giving up some performance. You can tell the low pause collector to start a collection sooner. The flag -XX:CMSInitiatingOccupancyFraction=NN will cause a concurrent collection to start when NN percent of the tenured generation is full. If you use this option to deal with the concurrent mode failures that result from a change in the behavior of your application, much of the time (when the application's behavior is more steady state) you'll be starting collections too early and so doing more collections than necessary. If you set NN to 0, it will cause one concurrent collection to be followed as soon as possible by another. The next collection may not start immediately after the last because the check on when to start a collection is done only at particular points in the code, but the collection will start at the next opportunity.
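The occupancy check behind that flag can be pictured with a small sketch. The class and method names below are hypothetical and the logic is a simplification of what the flag is described as doing, not the collector's source. On a command line the flag would look something like `java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 MyApp`, where 70 and MyApp are placeholders.

```java
/** Hypothetical model of an occupancy-based trigger; not HotSpot's actual code. */
final class OccupancyTriggerSketch {
    private OccupancyTriggerSketch() {}

    /**
     * Returns true when the tenured generation is at least
     * initiatingOccupancyPercent (the NN in the flag) full.
     */
    static boolean shouldStart(long usedBytes, long capacityBytes,
                               int initiatingOccupancyPercent) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacityBytes must be positive");
        }
        double occupancyPercent = 100.0 * usedBytes / capacityBytes;
        return occupancyPercent >= initiatingOccupancyPercent;
    }

    public static void main(String[] args) {
        long capacity = 1000L; // made-up tenured generation size, in arbitrary units

        // With NN = 70, a generation that is 69% full does not trigger, 70% does.
        System.out.println(shouldStart(690, capacity, 70));
        System.out.println(shouldStart(700, capacity, 70));

        // With NN = 0 the condition always holds, so each check starts a collection.
        System.out.println(shouldStart(0, capacity, 0));
    }
}
```

Note that in this model a fixed NN ignores the fill rate entirely, which is why a value low enough to survive a burst starts collections too early the rest of the time.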
Your application may not have dramatic changes in behavior, but if it has a large variance in allocation rates, that can cause the JIT GC to not be JIT. You can add some more padding to the time at which a concurrent collection kicks off by using the flag -XX:CMSIncrementalSafetyFactor=NN. The default value for NN is 10 (i.e., a 10% padding on the start of the concurrent collection). Increasing NN to 100 starts a concurrent collection at the next opportunity.
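One way to picture the padding is as a discount on the free space used in the prediction. The sketch below is an assumption chosen because it matches the behavior described above (10 gives a 10% margin, 100 means "start at the next opportunity"); the names are hypothetical and it should not be read as the collector's actual formula.

```java
/** Hypothetical model of a safety-factor padding; not HotSpot's actual code. */
final class SafetyFactorSketch {
    private SafetyFactorSketch() {}

    /**
     * Predicted seconds until full after discounting free space by
     * safetyFactorPercent (the NN in the flag), assumed to be in 0..100.
     */
    static double paddedSecondsUntilFull(long freeBytes, double fillRateBytesPerSec,
                                         int safetyFactorPercent) {
        if (safetyFactorPercent < 0 || safetyFactorPercent > 100) {
            throw new IllegalArgumentException("safetyFactorPercent must be in 0..100");
        }
        if (fillRateBytesPerSec <= 0.0) {
            throw new IllegalArgumentException("fillRateBytesPerSec must be positive");
        }
        // Only count the part of the free space that the padding leaves over.
        double usableFree = freeBytes * ((100 - safetyFactorPercent) / 100.0);
        return usableFree / fillRateBytesPerSec;
    }

    public static void main(String[] args) {
        long freeBytes = 100L * 1024 * 1024;       // 100 MB free (made-up number)
        double fillRate = 10.0 * 1024 * 1024;      // 10 MB/s (made-up number)

        // Larger NN shrinks the predicted time until full, so a collection starts earlier.
        for (int nn : new int[] {10, 50, 100}) {
            System.out.println("NN=" + nn + " -> "
                    + paddedSecondsUntilFull(freeBytes, fillRate, nn) + " s until full");
        }
    }
}
```

With NN at 100 the model treats the generation as already full, so any start check fires right away.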