Low pause times during the application run are the most important goal for many enterprise applications, especially for transaction-based systems where long latencies can result in transaction time-outs. For systems running on the Java Virtual Machine, garbage collections can sometimes be the cause of such long pauses.
In this post I am going to describe different scenarios in which we can encounter long GC pauses, and how we can diagnose and troubleshoot them.
Following are the different situations that can cause long
GC pauses during the application run.
Fragmentation in the Java heap can cause GCs to occur more frequently and can also sometimes cause long GC pauses. This is most likely with the Concurrent Mark Sweep collector, also known as CMS, where the tenured generation space is not compacted with the concurrent collections.
In the case of CMS, fragmentation in the tenured generation space can cause young generation collections to face promotion failures, thus triggering 'concurrent mode failure' collections, which are stop-the-world Full GCs. Full GCs take a long time to finish compared to the concurrent collection pauses.
Due to fragmentation, direct allocations in the tenured generation may fail even when there is a lot of free space available, thus causing Full GCs. Fragmentation can also cause frequent allocation failures, triggering frequent Full GCs that increase the overall time for which the application is paused.
The following logs, collected with the CMS collector, show that fragmentation in the CMS generation space is very high. This leads to a promotion failure during a young generation ParNew collection and then a 'concurrent mode failure'. A Full GC is done in the event of a 'concurrent mode failure', and here it takes a very long time, 17.1365396 seconds, to finish.
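To see how fragmented the CMS generation is, free-list statistics can be printed along with the regular GC details. The following is a minimal sketch of the relevant options, where 'MyApp' stands in for the actual main class:

java -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -XX:PrintFLSStatistics=1 MyApp

In the resulting free-list statistics, a 'Max Chunk Size' that is small compared to the total free space is a sign that the free space is split into many small chunks, i.e. that the tenured generation is badly fragmented.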
Sometimes OS activities, such as swapping or network activity, happening while a GC is taking place can make the GC pauses last much longer. These pauses can be on the order of a few seconds and even up to a few minutes.
If your system is configured to use swap space, the operating system may move inactive pages of memory of the JVM process to the swap space in order to free up memory for the currently active process, which may be the same process or a different process on the system. Swapping is very expensive, as it requires disk accesses that are much slower than physical memory access. So, if the system needs to perform swapping during a garbage collection, the GC will seem to run for a very long time.
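A quick way to confirm this is to watch swap activity alongside the GC pauses, for example with vmstat (the 5-second sampling interval here is just an example):

vmstat 5

On Linux, non-zero values in the 'si' and 'so' (swap-in/swap-out) columns during a collection suggest that the pause is being stretched by swapping; on Solaris, vmstat -S reports the equivalent swap-in/swap-out counts.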
Following is the log of a young generation collection that
lasts for 29.47 seconds.
Corresponding 'vmstat' output at 03:58:
This minor GC takes around 29 seconds to complete. The corresponding vmstat output shows that the available swap space drops by ~600 MB during this period. That means that during this garbage collection some pages were moved from RAM out to the swap space, not necessarily by the JVM process itself; it could have been any process running on the system.
From the above, it is clear that the physical memory available on the system is not enough for all the processes running on it. The way to resolve this is to run fewer processes or, if possible, add more RAM to increase the physical memory of the system. In the case above, the specified maximum tenured generation size is 9 GB, of which only 1.8 GB is occupied. So it makes sense to reduce the heap size to lower the pressure on physical memory and thus avoid or minimize the swapping activity.
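For instance, if the old generation occupancy never grows much beyond the observed ~1.8 GB, the heap can be capped far below the original 9 GB. The values below are only illustrative and should be derived from the measured footprint:

java -Xms4g -Xmx4g MyApp

A smaller -Xmx leaves more physical memory for the rest of the system and makes it less likely that the JVM's pages are swapped out during a collection.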
Apart from swapping, we should monitor whether there is any I/O or network activity happening during the long GC pauses. These can be monitored using the iostat and netstat tools. It is also helpful to look at the CPU statistics with the mpstat tool to figure out whether enough CPU resources were available during the GC pauses.
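On Solaris and Linux, a sketch of the corresponding commands (the intervals are examples):

iostat -x 5   # extended per-device I/O statistics every 5 seconds
mpstat 5      # per-CPU utilization every 5 seconds
netstat -i    # per-interface network statistics

If the CPUs are saturated or a disk is at full utilization while a GC is running, the pause time reported in the GC log may reflect that contention rather than GC work alone.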
If the application footprint is larger than the maximum heap size that we have specified for the JVM, the result is frequent collections. Due to the insufficient heap space, allocation requests fail and the JVM needs to invoke garbage collections in an attempt to reclaim space for those allocations. But since not much space can be reclaimed with each collection, subsequent allocation failures result in more GC invocations.
These frequent Full GCs cause long pauses in the application run. For example, in the following case the permanent generation is almost full, and allocation attempts into the permanent generation are failing, triggering the Full GCs.
Similarly, frequent Full GCs can occur if there is insufficient space in the tenured generation for allocations or promotions.
The solution for these long pauses is to identify the average footprint of the application and then specify the heap size accordingly.
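As a sketch, once the steady-state footprint is known, the heap and (on JDK 7 and earlier) the permanent generation can be sized with some headroom above it. The values below are placeholders, not recommendations:

java -Xms2g -Xmx2g -XX:PermSize=256m -XX:MaxPermSize=256m MyApp

Here -Xmx should comfortably exceed the measured live data size, and -XX:MaxPermSize should exceed the space needed for the loaded classes.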
Sometimes these long pauses can be due to a bug in the JVM itself. For example, Java applications may face long GC pauses due to the following bugs in the JVM.
If you are running with a JVM version affected by these bugs, please upgrade to a version where they are fixed.
Check whether there are any explicit System GCs happening. Requests to invoke these System GCs, which are stop-the-world Full GCs, could be coming from System.gc() calls in some class of the application or from some third-party module. These explicit System GCs, too, can cause very long pauses.
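For illustration, a single stray call like the one below, whether in application code or in a third-party library, is enough to request a stop-the-world Full GC. This is just a minimal sketch of the mechanism:

public class ExplicitGcDemo {
    public static void main(String[] args) {
        // Requests a stop-the-world Full GC,
        // unless the JVM is run with -XX:+DisableExplicitGC.
        System.gc();
    }
}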
If you are using RMI and are observing explicit Full GCs at regular intervals, then these are coming from the RMI implementation, which triggers System.gc() on a regular interval. This interval can be configured using the following system properties:
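-Dsun.rmi.dgc.client.gcInterval=<interval in milliseconds>
-Dsun.rmi.dgc.server.gcInterval=<interval in milliseconds>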
The default value for these properties is 60000 milliseconds in JDK 1.4.2 and 5.0, and 3600000 milliseconds in JDK 6 and later releases.
If you want to disable the explicit Full GCs invoked using System.gc(), run the application with the -XX:+DisableExplicitGC JVM option.
1. Collect GC logs with -XX:+PrintGCDetails -XX:+PrintHeapAtGC
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime.
In the case of the CMS collector, add the option -XX:PrintFLSStatistics=2 as well
(a combined example command line is shown after this list).
The GC logs can give us details on the nature and the frequency of the GC pauses,
i.e. they can provide answers to questions like: are the long GC pauses occurring
during young collections or old collections, and how frequently are those
collections encountering long pauses?
2. Monitor the overall health of the system using OS tools such as vmstat, iostat,
netstat and mpstat on Solaris and Linux platforms, and tools such as Process Monitor
and Task Manager on the Windows operating system.
3. Use the GCHisto tool to visually analyze the GC logs and figure out which GCs
are taking a long time and whether there is a pattern in the occurrence of these
collections.
4. Try to see from the GC logs whether there are any signs of fragmentation in the
Java heap space.
5. Monitor whether the specified heap size is enough to contain the footprint of
the application.
6. Check whether you are running with a JVM that has a known bug related to long
GC pauses, and upgrade if that bug is fixed in a later release.
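Putting step 1 together, a typical data-collection command line could look like the following sketch; 'gc.log' and 'MyApp' are placeholders, and -XX:PrintFLSStatistics=2 applies only when running with CMS:

java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintHeapAtGC \
     -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -XX:PrintFLSStatistics=2 MyApp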