Tuesday Feb 03, 2015

Analysis of Strange Hotspot Crashes

In this blog post, I am going to talk about the core file analysis of some strange looking hotspot crashes using gdb and Serviceability Agent debugging tools. At the first glance, these crashes appeared to be caused by some Hotspot Server Compiler issue but in the end these actually turned out to be due to a Linux Kernel bug.

This first crash happened in the compiled code of method SafeHashMap.put()

# Problematic frame:
# J 3575 C2
com.tangosol.util.SafeHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (190 bytes) @ 0x00002aaaabda7ffd [0x00002aaaabda7da0+0x25d]
# Core dump written.

I opened the core with gdb and the stack trace of the crashing thread looked like this:

#15 0x00002afd41aa5e7f in JVM_handle_linux_signal (sig=11, info=0x40fd7c90, ucVoid=0x40fd7b60, abort_if_unrecognized=1)
#16 0x00002afd41a9ce13 in signalHandler (sig=11, info=0x40fd7c90, uc=0x40fd7b60)
#17 <signal handler called>
#18 0x00002aaaabda7ffd in ?? ()
#19 0x00000000c005e3c8 in ?? ()
#20 0x00000000c005e3c8 in ?? ()
#21 0x00000000c005e3c8 in ?? ()
#22 0xc005e3c82000cb42 in ?? ()
#23 0x0000000000000007 in ?? ()
#24 0x0000000000000000 in ?? ()

Opened the core file with Serviceability Agent tool too and inspected the addresses present in the above stack trace:

hsdb> whatis 0x00002aaaabda7ffd
Address 0x00002aaaabda7ffd: In code in NMethod for com/tangosol/util/SafeHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; content: [0x00002aaaabda7da0, 0x00002aaaabda9280),  code: [0x00002aaaabda7da0, 0x00002aaaabda9280),  data: [0x00002aaaabda9280, 0x00002aaaabda9f18),  oops: [0x00002aaaabda9280, 0x00002aaaabda9288),  frame size: 144

hsdb> whatis 0x00000000c005e3c8
Address 0x00000000c005e3c8: In unknown section of Java heap

So, all the above digging confirmed that the crash happened in the compiled code of com.tangosol.util.SafeHashMap.put(). I looked at the assembly of the generated code in gdb:

(gdb) x/610i 0x00002aaaabda7da0
   0x2aaaabda7f83:      mov    $0x2000cbc5,%r9d
   0x2aaaabda7f89:      movabs $0x0,%r11
   0x2aaaabda7f93:      lea    (%r11,%r9,8),%r11
   0x2aaaabda7f97:      mov    0xa8(%r11),%r11
   0x2aaaabda7f9e:      mov    %r11,0x0(%r13)
   0x2aaaabda7fa2:      movl   $0x2000cbc5,0x8(%r13)
   0x2aaaabda7faa:      mov    %r12d,0xc(%r13)
   0x2aaaabda7fae:      mov    %r12,0x10(%r13)
   0x2aaaabda7fb2:      mov    %r12,0x18(%r13)
   0x2aaaabda7fb6:      mov    %r13,%r9
   0x2aaaabda7fb9:      mov    %r14,%r11
   0x2aaaabda7fbc:      mov    %r11d,0x10(%r13)
   0x2aaaabda7fc0:      shr    $0x9,%r9
   0x2aaaabda7fc4:      movabs $0x2aaab9f5b000,%r11
   0x2aaaabda7fce:      mov    %r12b,(%r11,%r9,1)
   0x2aaaabda7fd2:      mov    (%rsp),%r9d
   0x2aaaabda7fd6:      mov    %r9d,0x14(%r13)
   0x2aaaabda7fda:      mov    %r13,%r11
   0x2aaaabda7fdd:      shr    $0x9,%r11
   0x2aaaabda7fe1:      movabs $0x2aaab9f5b000,%r9
   0x2aaaabda7feb:      mov    %r12b,(%r9,%r11,1)
   0x2aaaabda7fef:      lock addl $0x0,(%rsp)
   0x2aaaabda7ff4:      mov    %r10d,0xc(%r13)
   0x2aaaabda7ff8:      movq   %xmm0,%r11
   0x2aaaabda7ffd:      mov    (%r11),%rax  <---crash here, r11 is 0
   0x2aaaabda8000:      mov    %rax,%r11

The thread crashed while accessing the value at an address stored in r11 register. Looking at the value of r11 register:
r11 0x0

r11 contains 0 and accessing the value at null address caused the crash.

Now, if we look at the previous instruction "movq   %xmm0,%r11", value in r11 was copied over from xmm0 register. The value in xmm0 register is 0.

There is a bug in Linux kernel where the xmm registers may get corrupted under certain circumstances. Details on the kernel bug: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=547893

The machine where this crash was observed:
RELEASE=2.6.18-308.  VERSION=#1 SMP Tue Apr 17 16:41:30 EDT 2012  MACHINE=x86_64

So, the process was running on a machine which is vulnerable to that Linux kernel bug.

Let's take a look at another such Hotspot crash.

# A fatal error has been detected by the Java Runtime Environment:
#  Internal Error (sharedRuntime.cpp:833), pid=26946, tid=1082554688
#  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00002aaaab1f5340

Current thread (0x000000001ee9b000):  JavaThread "main" [_thread_in_Java, id=26949, stack(0x0000000040767000,0x0000000040868000)]

Stack: [0x0000000040767000,0x0000000040868000],  sp=0x00000000408631a0,  free space=1008k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xaaa6ea]  VMError::report_and_die()+0x2ba
V  [libjvm.so+0x4f2749]  report_fatal(char const*, int, char const*)+0x59
V  [libjvm.so+0x9a952a]  SharedRuntime::continuation_for_implicit_exception(JavaThread*, unsigned char*, SharedRuntime::ImplicitExceptionKind)+0x33a
V  [libjvm.so+0x91370a]  JVM_handle_linux_signal+0x48a
V  [libjvm.so+0x909c83]  signalHandler(int, siginfo*, void*)+0x43
C  [libpthread.so.0+0xebe0]
J 528 C2 com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.characters(Lcom/sun/org/apache/xerces/internal/xni/XMLString;Lcom/sun/org/apache/xerces/internal/xni/Augmentations;)V (110 bytes) @ 0x00002aaaab607423 [0x00002aaaab606ec0+0x563]

(gdb) where
#0  0x0000003968a30265 in raise ()
#1  0x0000003968a31d10 in abort ()
#2  0x00002b96ba0bbea5 in os::abort (dump_core=true)
#3  0x00002b96ba258af3 in VMError::report_and_die (this=0x40863320)
#4  0x00002b96b9ca0749 in report_vm_error (detail_msg=<optimized out>,
error_msg=<optimized out>, line=<optimized out>, file=<optimized out>)   
#5  report_fatal (file=0x2b96ba2e0330, line=833, message=0x408633b8)   
#6  0x00002b96ba15752a in SharedRuntime::continuation_for_implicit_exception(thread=0x1ee9b000,
    pc=0x2aaaab1f5340, exception_kind=<optimized out>)   
#7  0x00002b96ba0c170a in JVM_handle_linux_signal (sig=11, info=0x408637c0,
ucVoid=0x40863690, abort_if_unrecognized=1)   
#8  0x00002b96ba0b7c83 in signalHandler (sig=11, info=0x408637c0, uc=0x40863690)
#9  <signal handler called>
#10 0x00002aaaab1f5340 in ?? ()
#11 0x00000000fe8bace8 in ?? ()
#12 0x00002aaaab607423 in ?? ()
#13 0x00000000fe94d0a0 in ?? ()
#14 0x00000000fe8bacd0 in ?? ()
#15 0x00000000fe8bacd0 in ?? ()
#16 0x00002aaac0cdd850 in ?? ()
#17 0x0000000000000007 in ?? ()

Inspecting the crashing PC 0x00002aaaab1f5340 with SA:

hsdb> whatis 0x00002aaaab1f5340
Address 0x00002aaaab1f5340: In code in StubRoutines (2) content:
[0x00002aaaab1f4e20, 0x00002aaaab1fa410),  code: [0x00002aaaab1f4e20,
0x00002aaaab1fa410),  data: [0x00002aaaab1fa410, 0x00002aaaab1fa410),  frame
size: 0

Frame #12:

hsdb> whatis 0x00002aaaab607423
Address 0x00002aaaab607423: In code in NMethod for
rnal/xni/Augmentations;)V content: [0x00002aaaab606ec0, 0x00002aaaab6083d8),  code: [0x00002aaaab606ec0, 0x00002aaaab6083d8),  data: [0x00002aaaab6083d8,
0x00002aaaab608e98),  oops: [0x00002aaaab6083d8, 0x00002aaaab6083e8),  frame size: 144

Disassembly around the crashing PC:

0x2aaaab1f5280:      movdqu 0x30(%rdi,%rdx,8),%xmm0
   0x2aaaab1f5286:      movdqu %xmm0,0x30(%rsi,%rdx,8)
   0x2aaaab1f528c:      movdqu 0x20(%rdi,%rdx,8),%xmm1
   0x2aaaab1f5292:      movdqu %xmm1,0x20(%rsi,%rdx,8)
   0x2aaaab1f5298:      movdqu 0x10(%rdi,%rdx,8),%xmm2
   0x2aaaab1f529e:      movdqu %xmm2,0x10(%rsi,%rdx,8)
   0x2aaaab1f52a4:      movdqu (%rdi,%rdx,8),%xmm3
   0x2aaaab1f52a9:      movdqu %xmm3,(%rsi,%rdx,8)
   0x2aaaab1f52ae:      sub    $0x8,%rdx
   0x2aaaab1f52b2:      jge    0x2aaaab1f5280
   0x2aaaab1f52b4:      add    $0x4,%rdx
   0x2aaaab1f52b8:      jl     0x2aaaab1f52d4
   0x2aaaab1f52ba:      movdqu 0x10(%rdi,%rdx,8),%xmm0
   0x2aaaab1f52c0:      movdqu %xmm0,0x10(%rsi,%rdx,8)
   0x2aaaab1f52c6:      movdqu (%rdi,%rdx,8),%xmm1
   0x2aaaab1f52cb:      movdqu %xmm1,(%rsi,%rdx,8)
   0x2aaaab1f52d0:      sub    $0x4,%rdx
   0x2aaaab1f52d4:      add    $0x4,%rdx
   0x2aaaab1f52d8:      jg     0x2aaaab1f5265
   0x2aaaab1f52da:      xor    %rax,%rax
   0x2aaaab1f52dd:      leaveq
   0x2aaaab1f52de:      retq
   0x2aaaab1f52df:      nop
   0x2aaaab1f52e0:      push   %rbp
   0x2aaaab1f52e1:      mov    %rsp,%rbp
   0x2aaaab1f52e4:      mov    %rdx,%rcx
   0x2aaaab1f52e7:      shr    $0x2,%rdx
   0x2aaaab1f52eb:      lea    -0x8(%rdi,%rdx,8),%rdi
   0x2aaaab1f52f0:      lea    -0x8(%rsi,%rdx,8),%rsi
   0x2aaaab1f52f5:      neg    %rdx
   0x2aaaab1f52f8:      jmpq   0x2aaaab1f5370
   0x2aaaab1f52fd:      mov    0x8(%rdi,%rdx,8),%rax
   0x2aaaab1f5302:      mov    %rax,0x8(%rsi,%rdx,8)
   0x2aaaab1f5307:      inc    %rdx
   0x2aaaab1f530a:      jne    0x2aaaab1f52fd
   0x2aaaab1f530c:      test   $0x2,%ecx
   0x2aaaab1f5312:      je     0x2aaaab1f5322
   0x2aaaab1f5314:      mov    0x8(%rdi),%eax
   0x2aaaab1f5317:      mov    %eax,0x8(%rsi)
   0x2aaaab1f531a:      add    $0x4,%rdi
   0x2aaaab1f531e:      add    $0x4,%rsi
   0x2aaaab1f5322:      test   $0x1,%ecx
   0x2aaaab1f5328:      je     0x2aaaab1f5332
   0x2aaaab1f532a:      mov    0x8(%rdi),%ax
   0x2aaaab1f532e:      mov    %ax,0x8(%rsi)
   0x2aaaab1f5332:      xor    %rax,%rax
   0x2aaaab1f5335:      leaveq
   0x2aaaab1f5336:      retq
   0x2aaaab1f5337:      nopw   0x0(%rax,%rax,1)
=> 0x2aaaab1f5340:      movdqu -0x38(%rdi,%rdx,8),%xmm0
   0x2aaaab1f5346:      movdqu %xmm0,-0x38(%rsi,%rdx,8)
   0x2aaaab1f534c:      movdqu -0x28(%rdi,%rdx,8),%xmm1
   0x2aaaab1f5352:      movdqu %xmm1,-0x28(%rsi,%rdx,8)
   0x2aaaab1f5358:      movdqu -0x18(%rdi,%rdx,8),%xmm2
   0x2aaaab1f535e:      movdqu %xmm2,-0x18(%rsi,%rdx,8)
   0x2aaaab1f5364:      movdqu -0x8(%rdi,%rdx,8),%xmm3
   0x2aaaab1f536a:      movdqu %xmm3,-0x8(%rsi,%rdx,8)
   0x2aaaab1f5370:      add    $0x8,%rdx

In the assembly code of the StubRoutine, we can see that the values are being moved to and from xmm registers and the crash happens while accessing value at an address present in rdi register.

rdx 0xffffffffffffffc8
rdi 0x1f8

Machine details where this crash happened:
RELEASE=2.6.18-308.  VERSION=#1 SMP Tue Apr 17 16:41:30 EDT 2012  MACHINE=x86_64

If you encounter such weird looking Hotspot crashes where the cause of the register/memory corruption seems to be unreasonable, please make sure that you are running on a latest patched Linux kernel. Kernels of version older that 2.6.32 may be vulnerable to this kernel bug.

Wednesday Mar 19, 2014

About G1 Garbage Collector, Permanent Generation and Metaspace

We received some questions around the G1 garbage collector and the use of Permanent Generation with it. There seems to be some confusion that the Hotspot JVM does not use permanent generation when G1 is used as the garbage collector. Here’s some clarification:

JDK 7: PermGen

Permanent Generation still exists in JDK 7 and its updates, and is used by all the garbage collectors. In JDK7, the effort to remove the permanent generation was started and some parts of the data residing in the permanent generation were moved to either the Java Heap or to the native heap. Permanent generation was not completely removed and it still exists in JDK 7 and its updates. Here's the list of things that were moved out of the permanent generation in JDK7:

  • Symbols were moved to the native heap

  • Interned strings were moved to the Java Heap

  • Class statics were moved to the Java Heap

JDK7: G1 and PermGen

With G1 collector, PermGen is collected only at a Full GC which is a stop-the-world (STW) GC. If G1 is running optimally then it does not do Full GCs. G1 invokes the Full GCs only when the PermGen is full or when the application does allocations faster than G1 can concurrently collect garbage.

With CMS garbage collector, we can use option -XX:+CMSClassUnloadingEnabled to collect PermGen space in the CMS concurrent cycle. There is no equivalent option for G1. G1 only collects PermGen during the Full stop-the-world GCs.

We can use options PermSize and MaxPermSize to tune the PermGen space size according to the application needs.

JDK8: PermGen

Permanent generation has been completely removed in JDK 8. This work has been done under the bug https://bugs.openjdk.java.net/browse/JDK-6964458. Options PermSize and MaxPermSize have also been removed in JDK 8.

Email to openjdk alias regarding the PermGen elimination project: http://mail.openjdk.java.net/pipermail/hotspot-dev/2012-September/006679.html

JDK8: Metaspace

In JDK 8, classes metadata is now stored in the native heap and this space is called Metaspace. There are some new flags added for Metaspace in JDK 8:

  • -XX:MetaspaceSize=<NNN> where <NNN> is the initial amount of space(the initial high-water-mark) allocated for class metadata (in bytes) that may induce a garbage collection to unload classes. The amount is approximate. After the high-water-mark is first reached, the next high-water-mark is managed by the garbage collector
  • -XX:MaxMetaspaceSize=<NNN> where <NNN> is the maximum amount of space to be allocated for class metadata (in bytes). This flag can be used to limit the amount of space allocated for class metadata. This value is approximate. By default there is no limit set.
  • -XX:MinMetaspaceFreeRatio=<NNN> where <NNN> is the minimum percentage of class metadata capacity free after a GC to avoid an increase in the amount of space (high-water-mark) allocated for class metadata that will induce a garbage collection.
  • -XX:MaxMetaspaceFreeRatio=<NNN> where <NNN> is the maximum percentage of class metadata capacity free after a GC to avoid a reduction in the amount of space (high-water-mark) allocated for class metadata that will induce a garbage collection.

By default class metadata allocation is only limited by the amount of available native memory. We can use the new option MaxMetaspaceSize to limit the amount of native memory used for the class metadata. It is analogous to MaxPermSize. A garbage collection is induced to collect the dead classloaders and classes when the class metadata usage reaches MetaspaceSize (12Mbytes on the 32bit client VM and 16Mbytes on the 32bit server VM with larger sizes on the 64bit VMs). Set MetaspaceSize to a higher value to delay the induced garbage collections. After an induced garbage collection, the class metadata usage needed to induce the next garbage collection may be increased.

Saturday Jan 04, 2014

UseLargePages on Linux

There is a JVM option UseLargePages (introduced in JDK 5.0u5) that can be used to request large memory pages from the system if large pages memory is supported by the system. The goal of the large page support is to optimize processor Translation-Lookaside Buffers and hence increase performance.

Recently we saw few instances of HotSpot crashes with JDK7 on the Linux platform when using the large memory pages.

8013057: assert(_needs_gc || SafepointSynchronize::is_at_safepoint()) failed: only read at safepoint

8007074: SIGSEGV at ParMarkBitMap::verify_clear()

Cause: The cause of these crashes is the way mmap works on the Linux platform. If the large page support is enabled on the system, commit_memory() implementation of HotSpot on Linux platform tries to commit the earlier reserved memory with 'mmap' call using the large pages. If there are not enough number of large pages available, the mmap call fails releasing the reserved memory, allowing the same memory region to be used for other allocations. This causes the same memory region to be used for different purposes and leads to unexpected behaviors.

Symptoms: With the above mentioned issue, we may see crashes with stack traces something like this:
 V  [libjvm.so+0x759a1a]  ParMarkBitMap::mark_obj(HeapWord*, unsigned long)+0x7a
 V  [libjvm.so+0x7a116e]  PSParallelCompact::MarkAndPushClosure::do_oop(oopDesc**)+0xce
 V  [libjvm.so+0x485197]  frame::oops_interpreted_do(OopClosure*, RegisterMap const*, bool)+0xe7
 V  [libjvm.so+0x863a4a]  JavaThread::oops_do(OopClosure*, CodeBlobClosure*)+0x15a
 V  [libjvm.so+0x77c97e]  ThreadRootsMarkingTask::do_it(GCTaskManager*, unsigned int)+0xae
 V  [libjvm.so+0x4b7ec0]  GCTaskThread::run()+0x130
 V  [libjvm.so+0x748f90]  java_start(Thread*)+0x100

Here the crash happens while writting to an address 0x00007f2cf656eef0 in the mapped region of ParMarkBitMap. And that memory belongs to the rt.jar (from hs_err log file):
7f2cf6419000-7f2cf65d7000 r--s 039dd000 00:31 106601532                  /jdk/jdk1.7.0_21/jre/lib/rt.jar

Due to this bug, the same memory region got mapped for two different allocations and caused this crash.


8013057 strengthened the error handling of mmap failures on Linux platform and also added some diagnostic information for these failures. It is fixed in 7u40.

8007074 fixes the reserved memory mapping loss issue when using the large pages on the Linux platform. Details on this fix: http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-July/010117.html. It is fixed in JDK 8 and will also be included into 7u60, scheduled to be released in May 2014.


1. Disable the use of large pages with JVM option -XX:-UseLargePages.

2. Increase the number of large pages available on the system. By having the sufficient number of large pages on the system, we can reduce the risk of memory commit failures and thus reduce the chances of hitting the large pages issue. Please see the details on how to configure the number of large pages here:

Other related fixes:

8026887: Make issues due to failed large pages allocations easier to debug

With the fix of 8013057, diagnostic information for the memory commit failures was added. It printed the error messages something like this:
os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed;
error='Cannot allocate memory' (errno=12)

With this fix of 8026887, this error message has been modified to suggest that the memory commit failed due to the lack of large pages, and it now looks like the following:
os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed;
error='Cannot allocate large pages, falling back to small pages' (errno=12)

This change has been integrated into 7u51.

The fix for 8007074 will be available in 7u60 and could not be included into 7u51, so this change (JDK-8026887) makes the error messages printed for the large pages related commit memory failures more informative. If we see these messages in the JVM logs that indicates that we have the risk of hitting the unexpected behaviors due to the bug 8007074.

8024838: Significant slowdown due to transparent huge pages

With the fix of 8007074, significant performance degradation was detected. This regression has been fixed with JDK-8024838 in JDK 8 and will also be included in JDK 7u60.

Wednesday Jul 31, 2013

Troubleshooting Long GC Pauses

Low pause times during the application run is the most important goal for many enterprise applications, especially for the transaction-based systems where long latencies can result in the transaction time-outs. For systems running on the Java Virtual Machines, garbage collections can sometimes be the cause of the long pauses.

In this post I am going to describe different scenarios where we can encounter long GC pauses and how we can diagnose and troubleshoot these GC pauses.


Following are the different situations that can cause long GC pauses during the application run.

1. Fragmentation in the heap

Fragmentation in the Java Heap can cause GCs to occur more frequently and also sometimes causing long pauses in the GCs. This is more probable in the case of Concurrent Mark Sweep collector, also known as CMS, where the tenured generation space is not compacted with the concurrent collections.

In case of the CMS, due to fragmentation in the tenured generation space, the young generation collections can face promotion failures and thus triggering 'Concurrent Mode Failure' collections that are stop-the-world Full GCs, and Full GCs take a long time to finish as compared to the concurrent collection pauses.

Due to the fragmentation, the direct allocations in the tenured generation may fail even when there is lot of free space available and thus causing Full GCs. Fragmentation can also cause frequent allocation failures and thus triggering frequent Full GCs that increase the overall time the application is paused for.

The following logs collected with the CMS collector show that the fragmentation in the CMS generation space is very high, that leads to the promotion failure during a young generation ParNew collection and then a 'concurrent mode failure'. A Full GC is done in the event of 'concurrent mode failure' that takes a very long time, 17.1365396 seconds to finish.

{Heap before GC invocations=7430 (full 24):
par new generation total 134400K, used 121348K [0x53000000, 0x5c600000, 0x5c600000)
eden space 115200K, 99% used [0x53000000, 0x5a07e738, 0x5a080000)
from space 19200K, 32% used [0x5a080000, 0x5a682cc0, 0x5b340000)
to space 19200K, 0% used [0x5b340000, 0x5b340000, 0x5c600000)
concurrent mark-sweep generation total 2099200K, used 1694466K [0x5c600000, 0xdc800000, 0xdc800000)
concurrent-mark-sweep perm gen total 409600K, used 186942K [0xdc800000, 0xf5800000, 0xfbc00000)
10628.167: [GC Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 103224160
Max Chunk Size: 5486
Number of Blocks: 57345
Av. Block Size: 1800
Tree Height: 36
Statistics for IndexedFreeLists:
Total Free Space: 371324
Max Chunk Size: 254
Number of Blocks: 8591 <---- High fragmentation
Av. Block Size: 43
free=103595484 frag=1.0000 <---- High fragmentation
Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
Statistics for IndexedFreeLists:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=0 frag=0.0000
10628.168: [ParNew (promotion failed)
Desired survivor size 9830400 bytes, new threshold 1 (max 1)
- age 1: 4770440 bytes, 4770440 total
: 121348K->122157K(134400K), 0.4263254 secs]10628,594: [CMS10630.887: [CMS-concurrent-mark: 7.286/8.682 secs] [Times: user=14.81 sys=0.34, real=8.68 secs] (concurrent mode failure): 1698044K->625427K(2099200K), 17.1365396 secs] 1815815K->625427K(2233600K), [CMS Perm : 186942K->180711K(409600K)]After GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 377269492
Max Chunk Size: 377269492
Number of Blocks: 1
Av. Block Size: 377269492
Tree Height: 1
Statistics for IndexedFreeLists:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=377269492 frag=0.0000
After GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
Statistics for IndexedFreeLists:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=0 frag=0.0000
, 17.5645589 secs] [Times: user=17.82 sys=0.06, real=17.57 secs]
Heap after GC invocations=7431 (full 25):
par new generation total 134400K, used 0K [0x53000000, 0x5c600000, 0x5c600000)
eden space 115200K, 0% used [0x53000000, 0x53000000, 0x5a080000)
from space 19200K, 0% used [0x5b340000, 0x5b340000, 0x5c600000)
to space 19200K, 0% used [0x5a080000, 0x5a080000, 0x5b340000)
concurrent mark-sweep generation total 2099200K, used 625427K [0x5c600000, 0xdc800000, 0xdc800000)
concurrent-mark-sweep perm gen total 409600K, used 180711K [0xdc800000, 0xf5800000, 0xfbc00000)
Total time for which application threads were stopped: 17.5730653 seconds

2. Other OS activities happening at the time of GC

Sometimes the OS activities such as the swap space or networking activity happening at the time when GC is taking place can make the GC pauses last much longer. These pauses can be of the order of few seconds to some minutes.

If your system is configured to use swap space, Operating System may move inactive pages of memory of the JVM process to the swap space, to free up memory for the currently active process which may be the same process or a different process on the system. Swapping is very expensive as it requires disk accesses which are much slower as compared to the physical memory access. So, if during a garbage collection the system needs to perform swapping, the GC would seem to run for a very long time.

Following is the log of a young generation collection that lasts for 29.47 seconds.

{Heap before GC invocations=132 (full 0):
par new generation total 2696384K, used 2696384K [0xfffffffc20010000, 0xfffffffce0010000, 0xfffffffce0010000)
eden space 2247040K, 100% used [0xfffffffc20010000, 0xfffffffca9270000, 0xfffffffca9270000)
from space 449344K, 100% used [0xfffffffca9270000, 0xfffffffcc4940000, 0xfffffffcc4940000)
to space 449344K, 0% used [0xfffffffcc4940000, 0xfffffffcc4940000, 0xfffffffce0010000)
concurrent mark-sweep generation total 9437184K, used 1860619K [0xfffffffce0010000, 0xffffffff20010000, 0xffffffff20010000)
concurrent-mark-sweep perm gen total 1310720K, used 511451K [0xffffffff20010000, 0xffffffff70010000, 0xffffffff70010000)
2013-07-17T03:58:06.601-0700: 51522.120: [GC Before GC: : 2696384K->449344K(2696384K), 29.4779282 secs] 4557003K->2326821K(12133568K) ,29.4795222 secs] [Times: user=915.56 sys=6.35, real=29.48 secs]

Corresponding 'vmstat' output at 03:58:

kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id
20130717_035806 0 0 0 77611960 94847600 55 266 0 0 0 0 0 0 0 0 0 3041 2644 2431 44 8 48
20130717_035815 0 0 0 76968296 94828816 79 324 0 18 18 0 0 0 0 1 0 3009 3642 2519 59 13 28
20130717_035831 1 0 0 77316456 94816000 389 2848 0 7 7 0 0 0 0 2 0 40062 78231 61451 42 6 53
20130717_035841 2 0 0 77577552 94798520 115 591 0 13 13 0 0 13 12 1 0 4991 8104 5413 2 0 98

This minor GC takes around 29 secs to complete. The corresponding vmstat output shows that the available swap space drops down by ~600mb during this period. That means during this garbage collection some pages from the RAM were moved out to the swap space, not necessarily by the same process running on the system.

From the above, it is clear that the physical memory available on the system is not enough for all the processes running on the system. The way to resolve this is to run fewer processes or if possible, add more RAM to increase the physical memory of the system. In the case above, the specified maximum tenured generation size is set as 9G and out of that only 1.8G is occupied. So it makes sense to reduce the heap size to lower the memory pressure on the physical memory so as to avoid or minimize the swapping activity.

Apart from swapping, we should monitor if there is any i/o or network activity happening during the long GC pauses. These can be monitored using iostat and netstat tools. It is also helpful to see the CPU statistics with the mpstat tool to figure out if enough CPU resources were available during the GC pauses.

3. Insufficient heap size

If the application footprint is larger than the maximum heap space that we have specified for the JVM, it results in frequent collections. Due to the insufficient heap space, the allocation requests fail and the JVM needs to invoke garbage collections in an attempt to reclaim space for the allocations. But since it cannot claim much space with each collection, subsequent allocation failures result in more GC invocations.

These frequent Full GCs cause long pauses in the application run. For example, in the following case, the permanent generation is almost full and the allocation attempts into the permanent generation are failing, triggering the Full GCs.

166687.013: [Full GC [PSYoungGen: 126501K->0K(922048K)] [PSOldGen: 2063794K->1598637K(2097152K)] 2190295K->1598637K(3019200K) [PSPermGen: 165840K->164249K(166016K)], 6.8204928 secs] [Times: user=6.80 sys=0.02, real=6.81 secs]
166699.015: [Full GC [PSYoungGen: 125518K->0K(922048K)] [PSOldGen: 1763798K->1583621K(2097152K)] 1889316K->1583621K(3019200K) [PSPermGen: 165868K->164849K(166016K)], 4.8204928 secs] [Times: user=4.80 sys=0.02, real=4.81 secs]

Similarly, the frequent Full GCs can occur if there is insufficient space in the tenured generation for the allocations or promotions.

The solution for these long pauses is to identify the average footprint of the application and then specify the heap size accordingly.

4. Bug in the JVM

Sometimes these long pauses could be due to a bug in the JVM. For example, due to the following bugs in the JVM, Java applications may face long GC pauses.

  • 6459113: CMS+ParNew: wildly different ParNew pause times depending on heap shape caused by allocation spread
    • fixed in JDK 6u1 and 7
  • 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses
    • fixed in JDK 6u4 and 7
  • 6631166: CMS: better heuristics when combatting fragmentation
    • fixed in JDK 6u21 and 7
  • 6999988: CMS: Increased fragmentation leading to promotion failure after CR#6631166 got implemented
    • fixed in JDK 6u25 and 7
  • 6683623: G1: use logarithmic BOT code such as used by other collectors
    • fixed in JDK 6u14 and 7
  • 6976350: G1: deal with fragmentation while copying objects during GC
    • fixed in JDK 8

If you are running with a JVM version affected with these bugs, please upgrade to the version where these bugs are fixed.

5. Explicit System GCs

Check if there are any explicit System GCs happening. Requests to invoke these System GCs which are stop-the-world Full GCs could be coming from the System.gc() calls from some class in the application or from a some third party module. These explicit System GCs too can cause very long pauses.

164638.058: [Full GC (System) [PSYoungGen: 22789K->0K(992448K)] [PSOldGen: 1645508K->1666990K(2097152K)] 1668298K->1666990K(3089600K) [PSPermGen: 164914K->164914K(166720K)], 5.7499132 secs] [Times: user=5.69 sys=0.06, real=5.75 secs]

If you are using RMI and are observing explicit Full GCs on a regular interval, then these are coming from the RMI implementation that triggers System.gc() on a regular interval. This interval can be configured using the following system properties:


The default value for these properties in JDK 1.4.2 and 5.0 is 60000 milliseconds, and 3600000 milliseconds in JDK 6 and later releases.

If you want to disable the explicit Full GCs invoked using System.gc(), run the application with -XX:+DisableExplicitGC JVM option.

How to approach the problem

1. Collect GC logs with -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime. In case of the CMS collector, add option -XX:PrintFLSStatistics=2 as well.

The GC logs can give us details on the nature and the frequency of the GC pauses i.e. they can provide answers to the questions like - are the long GC pauses occurring during young collections or old collections, and how frequently those collections are encountering long pauses.

2. Monitor the overall health of the system using OS tools like vmstat, iostat, netstat and mpstat etc. on Solaris and Linux platforms, and tools like Process Monitor and Task Manager on the Windows operating system.

3. Use GCHisto tool to visually analyze the GC Logs and figure out which GCs are taking long time and if there is a pattern in the occurrence of these collections.

4. Try to see from the GC logs if there are any signs of fragmentation in the Java Heap space.

5. Monitor if the specified Heap size is enough to contain the footprint of the application.

6. Check if you are running with a JVM that has a known bug related to the long GC pauses and then upgrade if that bug is fixed in a later version.

Sunday Mar 09, 2008

How to create VC++ project for HotSpot workspace

I am sure very few people would know that on Windows platform, we can create a VC++ project for the HotSpot workspace in JDK sources. It is very useful as it gives me the ability to view and browse HotSpot source code and also build it at a simple click from Visual Studio IDE. It is also very helpful in debugging hotspot issues which can be reproduced on Windows with a simple testcase(and not just at customer site :) )

So let's see how we can create this project for OpenJDK hotspot workspace on 32bit platform and open it in Visual Studio.

There is a batch file 'create.bat' in \\build\\windows\\ folder which creates the VC++ project for hotspot workspace on Windows. Before we run this batch, we need to set two environment variables.

- The 32-bit OpenJDK Windows build requires Microsoft Visual Studio .NET 2003 (VS2003) Professional Edition compiler. Once the compiler is installed, we need to run VCVARS32.BAT to set the compiler environment variables

Setting environment for using Microsoft Visual Studio .NET 2003 tools.
(If you have another version of Visual Studio or Visual C++ installed and wish
to use its tools from the command line, run vcvars32.bat for that version.)

- Set the Path to include installed jdk bin folder (jdk6 or jdk7)
set PATH=D:\\Java\\jdk1.6.0_03\\bin;%PATH%

Now, run create.bat

Usage: create HotSpotWorkSpace HotSpotBuildSpace HotSpotJDKDist

This is the interactive build setup script (as opposed to the batch
build execution script). It creates HotSpotBuildSpace if necessary,
copies the appropriate files out of HotSpotWorkSpace into it, and
builds and runs MakeDeps in it. This has the side-effect of creating
the file in the build space, which is then used in Visual C++.
The HotSpotJDKDist defines place where JVM binaries should be placed.
Environment variable FORCE_MSC_VER allows to override MSVC version autodetection.

The generated project file depends upon the include databases. If
those are changed then MakeDeps is rerun.

NOTE that it is now NOT safe to modify any of the files in the build
space, since they may be overwritten whenever this script is run or
nmake is run in that directory.

Passing those required arguments:

D:\\openjdk7\\openjdk\\hotspot\\build\\windows>create.bat D:\\openjdk7\\openjdk\\hotspot d:\\hotspot\\build d:\\hotspot\\bin
Will generate VC7 project
NOTE: Using the following settings:
javac -g -d d:\\hotspot\\build\\jvmtifiles D:\\openjdk7\\openjdk\\hotspot/src/share/vm/prims/jvmtiGen.java
Generating d:\\hotspot\\build\\jvmtifiles/jvmtiEnv.hpp
Generating d:\\hotspot\\build\\jvmtifiles/jvmtiEnter.cpp
..... cut the output here ....
creating c1_ValueType.cpp
writing grand include file

writing dependencies file

Writing .vcproj file...
rm -f ad_x86_32.cpp ad_x86_32.hpp ad_x86_32_clone.cpp ad_x86_32_expand.cpp ad_x86_
d_x86_32_peephole.cpp ad_x86_32_pipeline.cpp adGlobals_x86_32.hpp dfa_x86_32.cpp
adlc -q -T -U_LP64 x86_32.ad

This creates vm.vcproj file in d:\\hotspot\\build folder.

Open this in Visual Studio .Net and have fun!




« March 2015