Tuesday Feb 03, 2015

Analysis of Strange Hotspot Crashes

In this blog post, I am going to talk about the core file analysis of some strange looking hotspot crashes using gdb and Serviceability Agent debugging tools. At the first glance, these crashes appeared to be caused by some Hotspot Server Compiler issue but in the end these actually turned out to be due to a Linux Kernel bug.

This first crash happened in the compiled code of method SafeHashMap.put()

# Problematic frame:
# J 3575 C2
com.tangosol.util.SafeHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (190 bytes) @ 0x00002aaaabda7ffd [0x00002aaaabda7da0+0x25d]
#
# Core dump written.

I opened the core with gdb and the stack trace of the crashing thread looked like this:

#15 0x00002afd41aa5e7f in JVM_handle_linux_signal (sig=11, info=0x40fd7c90, ucVoid=0x40fd7b60, abort_if_unrecognized=1)
#16 0x00002afd41a9ce13 in signalHandler (sig=11, info=0x40fd7c90, uc=0x40fd7b60)
#17 <signal handler called>
#18 0x00002aaaabda7ffd in ?? ()
#19 0x00000000c005e3c8 in ?? ()
#20 0x00000000c005e3c8 in ?? ()
#21 0x00000000c005e3c8 in ?? ()
#22 0xc005e3c82000cb42 in ?? ()
#23 0x0000000000000007 in ?? ()
#24 0x0000000000000000 in ?? ()

Opened the core file with Serviceability Agent tool too and inspected the addresses present in the above stack trace:

hsdb> whatis 0x00002aaaabda7ffd
Address 0x00002aaaabda7ffd: In code in NMethod for com/tangosol/util/SafeHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; content: [0x00002aaaabda7da0, 0x00002aaaabda9280),  code: [0x00002aaaabda7da0, 0x00002aaaabda9280),  data: [0x00002aaaabda9280, 0x00002aaaabda9f18),  oops: [0x00002aaaabda9280, 0x00002aaaabda9288),  frame size: 144

hsdb> whatis 0x00000000c005e3c8
Address 0x00000000c005e3c8: In unknown section of Java heap

So, all the above digging confirmed that the crash happened in the compiled code of com.tangosol.util.SafeHashMap.put(). I looked at the assembly of the generated code in gdb:

(gdb) x/610i 0x00002aaaabda7da0
....
....
   0x2aaaabda7f83:      mov    $0x2000cbc5,%r9d
   0x2aaaabda7f89:      movabs $0x0,%r11
   0x2aaaabda7f93:      lea    (%r11,%r9,8),%r11
   0x2aaaabda7f97:      mov    0xa8(%r11),%r11
   0x2aaaabda7f9e:      mov    %r11,0x0(%r13)
   0x2aaaabda7fa2:      movl   $0x2000cbc5,0x8(%r13)
   0x2aaaabda7faa:      mov    %r12d,0xc(%r13)
   0x2aaaabda7fae:      mov    %r12,0x10(%r13)
   0x2aaaabda7fb2:      mov    %r12,0x18(%r13)
   0x2aaaabda7fb6:      mov    %r13,%r9
   0x2aaaabda7fb9:      mov    %r14,%r11
   0x2aaaabda7fbc:      mov    %r11d,0x10(%r13)
   0x2aaaabda7fc0:      shr    $0x9,%r9
   0x2aaaabda7fc4:      movabs $0x2aaab9f5b000,%r11
   0x2aaaabda7fce:      mov    %r12b,(%r11,%r9,1)
   0x2aaaabda7fd2:      mov    (%rsp),%r9d
   0x2aaaabda7fd6:      mov    %r9d,0x14(%r13)
   0x2aaaabda7fda:      mov    %r13,%r11
   0x2aaaabda7fdd:      shr    $0x9,%r11
   0x2aaaabda7fe1:      movabs $0x2aaab9f5b000,%r9
   0x2aaaabda7feb:      mov    %r12b,(%r9,%r11,1)
   0x2aaaabda7fef:      lock addl $0x0,(%rsp)
   0x2aaaabda7ff4:      mov    %r10d,0xc(%r13)
   0x2aaaabda7ff8:      movq   %xmm0,%r11
   0x2aaaabda7ffd:      mov    (%r11),%rax  <---crash here, r11 is 0
   0x2aaaabda8000:      mov    %rax,%r11
....
....

The thread crashed while accessing the value at an address stored in r11 register. Looking at the value of r11 register:
r11 0x0

r11 contains 0 and accessing the value at null address caused the crash.

Now, if we look at the previous instruction "movq   %xmm0,%r11", value in r11 was copied over from xmm0 register. The value in xmm0 register is 0.

There is a bug in Linux kernel where the xmm registers may get corrupted under certain circumstances. Details on the kernel bug: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=547893


The machine where this crash was observed:
RELEASE=2.6.18-308.4.1.0.1.el5xen  VERSION=#1 SMP Tue Apr 17 16:41:30 EDT 2012  MACHINE=x86_64

So, the process was running on a machine which is vulnerable to that Linux kernel bug.

Let's take a look at another such Hotspot crash.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:833), pid=26946, tid=1082554688
#  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00002aaaab1f5340

Current thread (0x000000001ee9b000):  JavaThread "main" [_thread_in_Java, id=26949, stack(0x0000000040767000,0x0000000040868000)]

Stack: [0x0000000040767000,0x0000000040868000],  sp=0x00000000408631a0,  free space=1008k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xaaa6ea]  VMError::report_and_die()+0x2ba
V  [libjvm.so+0x4f2749]  report_fatal(char const*, int, char const*)+0x59
V  [libjvm.so+0x9a952a]  SharedRuntime::continuation_for_implicit_exception(JavaThread*, unsigned char*, SharedRuntime::ImplicitExceptionKind)+0x33a
V  [libjvm.so+0x91370a]  JVM_handle_linux_signal+0x48a
V  [libjvm.so+0x909c83]  signalHandler(int, siginfo*, void*)+0x43
C  [libpthread.so.0+0xebe0]
J 528 C2 com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.characters(Lcom/sun/org/apache/xerces/internal/xni/XMLString;Lcom/sun/org/apache/xerces/internal/xni/Augmentations;)V (110 bytes) @ 0x00002aaaab607423 [0x00002aaaab606ec0+0x563]

(gdb) where
#0  0x0000003968a30265 in raise ()
#1  0x0000003968a31d10 in abort ()
#2  0x00002b96ba0bbea5 in os::abort (dump_core=true)
#3  0x00002b96ba258af3 in VMError::report_and_die (this=0x40863320)
#4  0x00002b96b9ca0749 in report_vm_error (detail_msg=<optimized out>,
error_msg=<optimized out>, line=<optimized out>, file=<optimized out>)   
#5  report_fatal (file=0x2b96ba2e0330, line=833, message=0x408633b8)   
#6  0x00002b96ba15752a in SharedRuntime::continuation_for_implicit_exception(thread=0x1ee9b000,
    pc=0x2aaaab1f5340, exception_kind=<optimized out>)   
#7  0x00002b96ba0c170a in JVM_handle_linux_signal (sig=11, info=0x408637c0,
ucVoid=0x40863690, abort_if_unrecognized=1)   
#8  0x00002b96ba0b7c83 in signalHandler (sig=11, info=0x408637c0, uc=0x40863690)
#9  <signal handler called>
#10 0x00002aaaab1f5340 in ?? ()
#11 0x00000000fe8bace8 in ?? ()
#12 0x00002aaaab607423 in ?? ()
#13 0x00000000fe94d0a0 in ?? ()
#14 0x00000000fe8bacd0 in ?? ()
#15 0x00000000fe8bacd0 in ?? ()
#16 0x00002aaac0cdd850 in ?? ()
#17 0x0000000000000007 in ?? ()


Inspecting the crashing PC 0x00002aaaab1f5340 with SA:

hsdb> whatis 0x00002aaaab1f5340
Address 0x00002aaaab1f5340: In code in StubRoutines (2) content:
[0x00002aaaab1f4e20, 0x00002aaaab1fa410),  code: [0x00002aaaab1f4e20,
0x00002aaaab1fa410),  data: [0x00002aaaab1fa410, 0x00002aaaab1fa410),  frame
size: 0


Frame #12:

hsdb> whatis 0x00002aaaab607423
Address 0x00002aaaab607423: In code in NMethod for
com/sun/org/apache/xerces/internal/impl/xs/opti/SchemaDOMParser.characters(Lcom/sun/org/apache/xerces/internal/xni/XMLString;Lcom/sun/org/apache/xerces/inte
rnal/xni/Augmentations;)V content: [0x00002aaaab606ec0, 0x00002aaaab6083d8),  code: [0x00002aaaab606ec0, 0x00002aaaab6083d8),  data: [0x00002aaaab6083d8,
0x00002aaaab608e98),  oops: [0x00002aaaab6083d8, 0x00002aaaab6083e8),  frame size: 144

Disassembly around the crashing PC:

0x2aaaab1f5280:      movdqu 0x30(%rdi,%rdx,8),%xmm0
   0x2aaaab1f5286:      movdqu %xmm0,0x30(%rsi,%rdx,8)
   0x2aaaab1f528c:      movdqu 0x20(%rdi,%rdx,8),%xmm1
   0x2aaaab1f5292:      movdqu %xmm1,0x20(%rsi,%rdx,8)
   0x2aaaab1f5298:      movdqu 0x10(%rdi,%rdx,8),%xmm2
   0x2aaaab1f529e:      movdqu %xmm2,0x10(%rsi,%rdx,8)
   0x2aaaab1f52a4:      movdqu (%rdi,%rdx,8),%xmm3
   0x2aaaab1f52a9:      movdqu %xmm3,(%rsi,%rdx,8)
   0x2aaaab1f52ae:      sub    $0x8,%rdx
   0x2aaaab1f52b2:      jge    0x2aaaab1f5280
   0x2aaaab1f52b4:      add    $0x4,%rdx
   0x2aaaab1f52b8:      jl     0x2aaaab1f52d4
   0x2aaaab1f52ba:      movdqu 0x10(%rdi,%rdx,8),%xmm0
   0x2aaaab1f52c0:      movdqu %xmm0,0x10(%rsi,%rdx,8)
   0x2aaaab1f52c6:      movdqu (%rdi,%rdx,8),%xmm1
   0x2aaaab1f52cb:      movdqu %xmm1,(%rsi,%rdx,8)
   0x2aaaab1f52d0:      sub    $0x4,%rdx
   0x2aaaab1f52d4:      add    $0x4,%rdx
   0x2aaaab1f52d8:      jg     0x2aaaab1f5265
   0x2aaaab1f52da:      xor    %rax,%rax
   0x2aaaab1f52dd:      leaveq
   0x2aaaab1f52de:      retq
   0x2aaaab1f52df:      nop
   0x2aaaab1f52e0:      push   %rbp
   0x2aaaab1f52e1:      mov    %rsp,%rbp
   0x2aaaab1f52e4:      mov    %rdx,%rcx
   0x2aaaab1f52e7:      shr    $0x2,%rdx
   0x2aaaab1f52eb:      lea    -0x8(%rdi,%rdx,8),%rdi
   0x2aaaab1f52f0:      lea    -0x8(%rsi,%rdx,8),%rsi
   0x2aaaab1f52f5:      neg    %rdx
   0x2aaaab1f52f8:      jmpq   0x2aaaab1f5370
   0x2aaaab1f52fd:      mov    0x8(%rdi,%rdx,8),%rax
   0x2aaaab1f5302:      mov    %rax,0x8(%rsi,%rdx,8)
   0x2aaaab1f5307:      inc    %rdx
   0x2aaaab1f530a:      jne    0x2aaaab1f52fd
   0x2aaaab1f530c:      test   $0x2,%ecx
   0x2aaaab1f5312:      je     0x2aaaab1f5322
   0x2aaaab1f5314:      mov    0x8(%rdi),%eax
   0x2aaaab1f5317:      mov    %eax,0x8(%rsi)
   0x2aaaab1f531a:      add    $0x4,%rdi
   0x2aaaab1f531e:      add    $0x4,%rsi
   0x2aaaab1f5322:      test   $0x1,%ecx
   0x2aaaab1f5328:      je     0x2aaaab1f5332
   0x2aaaab1f532a:      mov    0x8(%rdi),%ax
   0x2aaaab1f532e:      mov    %ax,0x8(%rsi)
   0x2aaaab1f5332:      xor    %rax,%rax
   0x2aaaab1f5335:      leaveq
   0x2aaaab1f5336:      retq
   0x2aaaab1f5337:      nopw   0x0(%rax,%rax,1)
=> 0x2aaaab1f5340:      movdqu -0x38(%rdi,%rdx,8),%xmm0
   0x2aaaab1f5346:      movdqu %xmm0,-0x38(%rsi,%rdx,8)
   0x2aaaab1f534c:      movdqu -0x28(%rdi,%rdx,8),%xmm1
   0x2aaaab1f5352:      movdqu %xmm1,-0x28(%rsi,%rdx,8)
   0x2aaaab1f5358:      movdqu -0x18(%rdi,%rdx,8),%xmm2
   0x2aaaab1f535e:      movdqu %xmm2,-0x18(%rsi,%rdx,8)
   0x2aaaab1f5364:      movdqu -0x8(%rdi,%rdx,8),%xmm3
   0x2aaaab1f536a:      movdqu %xmm3,-0x8(%rsi,%rdx,8)
   0x2aaaab1f5370:      add    $0x8,%rdx

In the assembly code of the StubRoutine, we can see that the values are being moved to and from xmm registers and the crash happens while accessing value at an address present in rdi register.

rdx 0xffffffffffffffc8
rdi 0x1f8

Machine details where this crash happened:
RELEASE=2.6.18-308.4.1.0.1.el5xen  VERSION=#1 SMP Tue Apr 17 16:41:30 EDT 2012  MACHINE=x86_64

If you encounter such weird looking Hotspot crashes where the cause of the register/memory corruption seems to be unreasonable, please make sure that you are running on a latest patched Linux kernel. Kernels of version older that 2.6.32 may be vulnerable to this kernel bug.

Saturday Jan 04, 2014

UseLargePages on Linux

There is a JVM option UseLargePages (introduced in JDK 5.0u5) that can be used to request large memory pages from the system if large pages memory is supported by the system. The goal of the large page support is to optimize processor Translation-Lookaside Buffers and hence increase performance.

Recently we saw few instances of HotSpot crashes with JDK7 on the Linux platform when using the large memory pages.

8013057: assert(_needs_gc || SafepointSynchronize::is_at_safepoint()) failed: only read at safepoint
https://bugs.openjdk.java.net/browse/JDK-8013057

8007074: SIGSEGV at ParMarkBitMap::verify_clear()
https://bugs.openjdk.java.net/browse/JDK-8007074

Cause: The cause of these crashes is the way mmap works on the Linux platform. If the large page support is enabled on the system, commit_memory() implementation of HotSpot on Linux platform tries to commit the earlier reserved memory with 'mmap' call using the large pages. If there are not enough number of large pages available, the mmap call fails releasing the reserved memory, allowing the same memory region to be used for other allocations. This causes the same memory region to be used for different purposes and leads to unexpected behaviors.

Symptoms: With the above mentioned issue, we may see crashes with stack traces something like this:
 V  [libjvm.so+0x759a1a]  ParMarkBitMap::mark_obj(HeapWord*, unsigned long)+0x7a
 V  [libjvm.so+0x7a116e]  PSParallelCompact::MarkAndPushClosure::do_oop(oopDesc**)+0xce
 V  [libjvm.so+0x485197]  frame::oops_interpreted_do(OopClosure*, RegisterMap const*, bool)+0xe7
 V  [libjvm.so+0x863a4a]  JavaThread::oops_do(OopClosure*, CodeBlobClosure*)+0x15a
 V  [libjvm.so+0x77c97e]  ThreadRootsMarkingTask::do_it(GCTaskManager*, unsigned int)+0xae
 V  [libjvm.so+0x4b7ec0]  GCTaskThread::run()+0x130
 V  [libjvm.so+0x748f90]  java_start(Thread*)+0x100

Here the crash happens while writting to an address 0x00007f2cf656eef0 in the mapped region of ParMarkBitMap. And that memory belongs to the rt.jar (from hs_err log file):
7f2cf6419000-7f2cf65d7000 r--s 039dd000 00:31 106601532                  /jdk/jdk1.7.0_21/jre/lib/rt.jar

Due to this bug, the same memory region got mapped for two different allocations and caused this crash.

Fixes:

8013057 strengthened the error handling of mmap failures on Linux platform and also added some diagnostic information for these failures. It is fixed in 7u40.

8007074 fixes the reserved memory mapping loss issue when using the large pages on the Linux platform. Details on this fix: http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-July/010117.html. It is fixed in JDK 8 and will also be included into 7u60, scheduled to be released in May 2014.

Workarounds:

1. Disable the use of large pages with JVM option -XX:-UseLargePages.

2. Increase the number of large pages available on the system. By having the sufficient number of large pages on the system, we can reduce the risk of memory commit failures and thus reduce the chances of hitting the large pages issue. Please see the details on how to configure the number of large pages here:
http://www.oracle.com/technetwork/java/javase/tech/largememory-jsp-137182.html

Other related fixes:

8026887: Make issues due to failed large pages allocations easier to debug
https://bugs.openjdk.java.net/browse/JDK-8026887

With the fix of 8013057, diagnostic information for the memory commit failures was added. It printed the error messages something like this:
os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed;
error='Cannot allocate memory' (errno=12)

With this fix of 8026887, this error message has been modified to suggest that the memory commit failed due to the lack of large pages, and it now looks like the following:
os::commit_memory(0x00000006b1600000, 352321536, 2097152, 0) failed;
error='Cannot allocate large pages, falling back to small pages' (errno=12)

This change has been integrated into 7u51.

The fix for 8007074 will be available in 7u60 and could not be included into 7u51, so this change (JDK-8026887) makes the error messages printed for the large pages related commit memory failures more informative. If we see these messages in the JVM logs that indicates that we have the risk of hitting the unexpected behaviors due to the bug 8007074.

8024838: Significant slowdown due to transparent huge pages
https://bugs.openjdk.java.net/browse/JDK-8024838

With the fix of 8007074, significant performance degradation was detected. This regression has been fixed with JDK-8024838 in JDK 8 and will also be included in JDK 7u60.

About

Poonam-Oracle

Search

Categories
Archives
« May 2015
SunMonTueWedThuFriSat
     
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
      
Today