
Poonam Bajaj's Blog

  • February 3, 2015

Analysis of Strange Hotspot Crashes

Poonam Parhar
Consulting Member of Technical Staff


In this blog post, I am going to talk about the core-file analysis of some strange-looking HotSpot crashes using gdb and the Serviceability Agent (SA) debugging tools. At first glance, these crashes appeared to be caused by a HotSpot Server Compiler issue, but in the end they actually turned out to be due to a Linux kernel bug.
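
For reference, both tools can be pointed at a core file roughly like this (a minimal sketch; the java launcher and core-file paths are placeholders, and the sa-jdi.jar location assumes a JDK 7/8 layout):

# gdb: open the core against the java launcher that produced it
gdb /path/to/jdk/bin/java /path/to/core

# Serviceability Agent: start the command-line tool (CLHSDB) and attach it to the core
java -classpath /path/to/jdk/lib/sa-jdi.jar sun.jvm.hotspot.CLHSDB
hsdb> attach /path/to/jdk/bin/java /path/to/core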

The first crash happened in the compiled code of the method SafeHashMap.put():

# Problematic frame:
# J 3575 C2
com.tangosol.util.SafeHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (190 bytes) @ 0x00002aaaabda7ffd [0x00002aaaabda7da0+0x25d]
#
# Core dump written.

I opened the core with gdb and the stack trace of the crashing thread looked like this:

#15 0x00002afd41aa5e7f in JVM_handle_linux_signal (sig=11, info=0x40fd7c90, ucVoid=0x40fd7b60, abort_if_unrecognized=1)
#16 0x00002afd41a9ce13 in signalHandler (sig=11, info=0x40fd7c90, uc=0x40fd7b60)
#17 <signal handler called>
#18 0x00002aaaabda7ffd in ?? ()
#19 0x00000000c005e3c8 in ?? ()
#20 0x00000000c005e3c8 in ?? ()
#21 0x00000000c005e3c8 in ?? ()
#22 0xc005e3c82000cb42 in ?? ()
#23 0x0000000000000007 in ?? ()
#24 0x0000000000000000 in ?? ()

I also opened the core file with the Serviceability Agent tool and inspected the addresses present in the above stack trace (the frames below the signal handler show up as "?? ()" because gdb has no symbol information for JIT-compiled code):

hsdb> whatis 0x00002aaaabda7ffd
Address 0x00002aaaabda7ffd: In code in NMethod for com/tangosol/util/SafeHashMap.put(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; content: [0x00002aaaabda7da0, 0x00002aaaabda9280),  code: [0x00002aaaabda7da0, 0x00002aaaabda9280),  data: [0x00002aaaabda9280, 0x00002aaaabda9f18),  oops: [0x00002aaaabda9280, 0x00002aaaabda9288),  frame size: 144

hsdb> whatis 0x00000000c005e3c8
Address 0x00000000c005e3c8: In unknown section of Java heap

So, all the above digging confirmed that the crash happened in the compiled code of com.tangosol.util.SafeHashMap.put(). I looked at the assembly of the generated code in gdb:

(gdb) x/610i 0x00002aaaabda7da0
....
....
   0x2aaaabda7f83:      mov    $0x2000cbc5,%r9d
   0x2aaaabda7f89:      movabs $0x0,%r11
   0x2aaaabda7f93:      lea    (%r11,%r9,8),%r11
   0x2aaaabda7f97:      mov    0xa8(%r11),%r11
   0x2aaaabda7f9e:      mov    %r11,0x0(%r13)
   0x2aaaabda7fa2:      movl   $0x2000cbc5,0x8(%r13)
   0x2aaaabda7faa:      mov    %r12d,0xc(%r13)
   0x2aaaabda7fae:      mov    %r12,0x10(%r13)
   0x2aaaabda7fb2:      mov    %r12,0x18(%r13)
   0x2aaaabda7fb6:      mov    %r13,%r9
   0x2aaaabda7fb9:      mov    %r14,%r11
   0x2aaaabda7fbc:      mov    %r11d,0x10(%r13)
   0x2aaaabda7fc0:      shr    $0x9,%r9
   0x2aaaabda7fc4:      movabs $0x2aaab9f5b000,%r11
   0x2aaaabda7fce:      mov    %r12b,(%r11,%r9,1)
   0x2aaaabda7fd2:      mov    (%rsp),%r9d
   0x2aaaabda7fd6:      mov    %r9d,0x14(%r13)
   0x2aaaabda7fda:      mov    %r13,%r11
   0x2aaaabda7fdd:      shr    $0x9,%r11
   0x2aaaabda7fe1:      movabs $0x2aaab9f5b000,%r9
   0x2aaaabda7feb:      mov    %r12b,(%r9,%r11,1)
   0x2aaaabda7fef:      lock addl $0x0,(%rsp)
   0x2aaaabda7ff4:      mov    %r10d,0xc(%r13)
   0x2aaaabda7ff8:      movq   %xmm0,%r11
   0x2aaaabda7ffd:      mov    (%r11),%rax  <---crash here, r11 is 0
   0x2aaaabda8000:      mov    %rax,%r11
....
....

The thread crashed while accessing the value at the address stored in the r11 register. Looking at the value of the r11 register:
r11 0x0

r11 contains 0, and dereferencing the null address caused the crash.

Now, if we look at the previous instruction, "movq %xmm0,%r11", the value in r11 was copied over from the xmm0 register, and the value in the xmm0 register is 0. So the bogus null pointer came from an xmm register.
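
The xmm0 contents can be confirmed from the same gdb session (a quick sketch; gdb exposes each xmm register as a union, so its raw 128-bit value can be printed directly, and the 0 shown here is the value quoted above):

(gdb) p/x $xmm0.uint128
$1 = 0x0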

There is a bug in the Linux kernel where the xmm registers may get corrupted under certain circumstances. Details on the kernel bug: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=547893


The machine where this crash was observed:
RELEASE=2.6.18-308.4.1.0.1.el5xen  VERSION=#1 SMP Tue Apr 17 16:41:30 EDT 2012  MACHINE=x86_64

So, the process was running on a machine that is vulnerable to that Linux kernel bug.

Let's take a look at another such Hotspot crash.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:833), pid=26946, tid=1082554688
#  fatal error: exception happened outside interpreter, nmethods and vtable stubs at pc 0x00002aaaab1f5340

Current thread (0x000000001ee9b000):  JavaThread "main" [_thread_in_Java, id=26949, stack(0x0000000040767000,0x0000000040868000)]

Stack: [0x0000000040767000,0x0000000040868000],  sp=0x00000000408631a0,  free space=1008k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xaaa6ea]  VMError::report_and_die()+0x2ba
V  [libjvm.so+0x4f2749]  report_fatal(char const*, int, char const*)+0x59
V  [libjvm.so+0x9a952a]  SharedRuntime::continuation_for_implicit_exception(JavaThread*, unsigned char*, SharedRuntime::ImplicitExceptionKind)+0x33a
V  [libjvm.so+0x91370a]  JVM_handle_linux_signal+0x48a
V  [libjvm.so+0x909c83]  signalHandler(int, siginfo*, void*)+0x43
C  [libpthread.so.0+0xebe0]
J 528 C2 com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.characters(Lcom/sun/org/apache/xerces/internal/xni/XMLString;Lcom/sun/org/apache/xerces/internal/xni/Augmentations;)V (110 bytes) @ 0x00002aaaab607423 [0x00002aaaab606ec0+0x563]

(gdb) where
#0  0x0000003968a30265 in raise ()
#1  0x0000003968a31d10 in abort ()
#2  0x00002b96ba0bbea5 in os::abort (dump_core=true)
#3  0x00002b96ba258af3 in VMError::report_and_die (this=0x40863320)
#4  0x00002b96b9ca0749 in report_vm_error (detail_msg=<optimized out>,
error_msg=<optimized out>, line=<optimized out>, file=<optimized out>)   
#5  report_fatal (file=0x2b96ba2e0330, line=833, message=0x408633b8)   
#6  0x00002b96ba15752a in SharedRuntime::continuation_for_implicit_exception(thread=0x1ee9b000,
    pc=0x2aaaab1f5340, exception_kind=<optimized out>)   
#7  0x00002b96ba0c170a in JVM_handle_linux_signal (sig=11, info=0x408637c0,
ucVoid=0x40863690, abort_if_unrecognized=1)   
#8  0x00002b96ba0b7c83 in signalHandler (sig=11, info=0x408637c0, uc=0x40863690)
#9  <signal handler called>
#10 0x00002aaaab1f5340 in ?? ()
#11 0x00000000fe8bace8 in ?? ()
#12 0x00002aaaab607423 in ?? ()
#13 0x00000000fe94d0a0 in ?? ()
#14 0x00000000fe8bacd0 in ?? ()
#15 0x00000000fe8bacd0 in ?? ()
#16 0x00002aaac0cdd850 in ?? ()
#17 0x0000000000000007 in ?? ()


Inspecting the crashing PC 0x00002aaaab1f5340 with SA:

hsdb> whatis 0x00002aaaab1f5340
Address 0x00002aaaab1f5340: In code in StubRoutines (2) content: [0x00002aaaab1f4e20, 0x00002aaaab1fa410),  code: [0x00002aaaab1f4e20, 0x00002aaaab1fa410),  data: [0x00002aaaab1fa410, 0x00002aaaab1fa410),  frame size: 0


Frame #12:

hsdb> whatis 0x00002aaaab607423
Address 0x00002aaaab607423: In code in NMethod for com/sun/org/apache/xerces/internal/impl/xs/opti/SchemaDOMParser.characters(Lcom/sun/org/apache/xerces/internal/xni/XMLString;Lcom/sun/org/apache/xerces/internal/xni/Augmentations;)V content: [0x00002aaaab606ec0, 0x00002aaaab6083d8),  code: [0x00002aaaab606ec0, 0x00002aaaab6083d8),  data: [0x00002aaaab6083d8, 0x00002aaaab608e98),  oops: [0x00002aaaab6083d8, 0x00002aaaab6083e8),  frame size: 144

Disassembly around the crashing PC:

   0x2aaaab1f5280:      movdqu 0x30(%rdi,%rdx,8),%xmm0
   0x2aaaab1f5286:      movdqu %xmm0,0x30(%rsi,%rdx,8)
   0x2aaaab1f528c:      movdqu 0x20(%rdi,%rdx,8),%xmm1
   0x2aaaab1f5292:      movdqu %xmm1,0x20(%rsi,%rdx,8)
   0x2aaaab1f5298:      movdqu 0x10(%rdi,%rdx,8),%xmm2
   0x2aaaab1f529e:      movdqu %xmm2,0x10(%rsi,%rdx,8)
   0x2aaaab1f52a4:      movdqu (%rdi,%rdx,8),%xmm3
   0x2aaaab1f52a9:      movdqu %xmm3,(%rsi,%rdx,8)
   0x2aaaab1f52ae:      sub    $0x8,%rdx
   0x2aaaab1f52b2:      jge    0x2aaaab1f5280
   0x2aaaab1f52b4:      add    $0x4,%rdx
   0x2aaaab1f52b8:      jl     0x2aaaab1f52d4
   0x2aaaab1f52ba:      movdqu 0x10(%rdi,%rdx,8),%xmm0
   0x2aaaab1f52c0:      movdqu %xmm0,0x10(%rsi,%rdx,8)
   0x2aaaab1f52c6:      movdqu (%rdi,%rdx,8),%xmm1
   0x2aaaab1f52cb:      movdqu %xmm1,(%rsi,%rdx,8)
   0x2aaaab1f52d0:      sub    $0x4,%rdx
   0x2aaaab1f52d4:      add    $0x4,%rdx
   0x2aaaab1f52d8:      jg     0x2aaaab1f5265
   0x2aaaab1f52da:      xor    %rax,%rax
   0x2aaaab1f52dd:      leaveq
   0x2aaaab1f52de:      retq
   0x2aaaab1f52df:      nop
   0x2aaaab1f52e0:      push   %rbp
   0x2aaaab1f52e1:      mov    %rsp,%rbp
   0x2aaaab1f52e4:      mov    %rdx,%rcx
   0x2aaaab1f52e7:      shr    $0x2,%rdx
   0x2aaaab1f52eb:      lea    -0x8(%rdi,%rdx,8),%rdi
   0x2aaaab1f52f0:      lea    -0x8(%rsi,%rdx,8),%rsi
   0x2aaaab1f52f5:      neg    %rdx
   0x2aaaab1f52f8:      jmpq   0x2aaaab1f5370
   0x2aaaab1f52fd:      mov    0x8(%rdi,%rdx,8),%rax
   0x2aaaab1f5302:      mov    %rax,0x8(%rsi,%rdx,8)
   0x2aaaab1f5307:      inc    %rdx
   0x2aaaab1f530a:      jne    0x2aaaab1f52fd
   0x2aaaab1f530c:      test   $0x2,%ecx
   0x2aaaab1f5312:      je     0x2aaaab1f5322
   0x2aaaab1f5314:      mov    0x8(%rdi),%eax
   0x2aaaab1f5317:      mov    %eax,0x8(%rsi)
   0x2aaaab1f531a:      add    $0x4,%rdi
   0x2aaaab1f531e:      add    $0x4,%rsi
   0x2aaaab1f5322:      test   $0x1,%ecx
   0x2aaaab1f5328:      je     0x2aaaab1f5332
   0x2aaaab1f532a:      mov    0x8(%rdi),%ax
   0x2aaaab1f532e:      mov    %ax,0x8(%rsi)
   0x2aaaab1f5332:      xor    %rax,%rax
   0x2aaaab1f5335:      leaveq
   0x2aaaab1f5336:      retq
   0x2aaaab1f5337:      nopw   0x0(%rax,%rax,1)
=> 0x2aaaab1f5340:      movdqu -0x38(%rdi,%rdx,8),%xmm0
   0x2aaaab1f5346:      movdqu %xmm0,-0x38(%rsi,%rdx,8)
   0x2aaaab1f534c:      movdqu -0x28(%rdi,%rdx,8),%xmm1
   0x2aaaab1f5352:      movdqu %xmm1,-0x28(%rsi,%rdx,8)
   0x2aaaab1f5358:      movdqu -0x18(%rdi,%rdx,8),%xmm2
   0x2aaaab1f535e:      movdqu %xmm2,-0x18(%rsi,%rdx,8)
   0x2aaaab1f5364:      movdqu -0x8(%rdi,%rdx,8),%xmm3
   0x2aaaab1f536a:      movdqu %xmm3,-0x8(%rsi,%rdx,8)
   0x2aaaab1f5370:      add    $0x8,%rdx

In the assembly code of the StubRoutine, we can see that values are being moved to and from the xmm registers, and the crash happens on a read from an address computed from the rdi and rdx registers:

rdx 0xffffffffffffffc8
rdi 0x1f8
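
Plugging these values into the addressing mode of the faulting instruction, "movdqu -0x38(%rdi,%rdx,8),%xmm0", shows why the read faults. Interpreted as a signed value, rdx is -0x38 (-56), so:

effective address = rdi + rdx*8 - 0x38
                  = 0x1f8 - 0x1c0 - 0x38
                  = 0x0

So the stub ends up reading from the null address, and rdi itself (0x1f8) is far too small to be a valid array pointer. As in the first crash, the register contents do not look like anything the generated code could legitimately have produced.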

Machine details where this crash happened:
RELEASE=2.6.18-308.4.1.0.1.el5xen  VERSION=#1 SMP Tue Apr 17 16:41:30 EDT 2012  MACHINE=x86_64

If you encounter such weird-looking HotSpot crashes where the cause of the register/memory corruption seems unreasonable, please make sure that you are running on the latest patched Linux kernel. Kernels older than 2.6.32 may be vulnerable to this kernel bug.
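
A quick way to check the kernel release a machine is running (the example output matches the vulnerable kernel from the crashes above):

$ uname -r
2.6.18-308.4.1.0.1.el5xen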

Join the discussion

Comments (3)
  • Amit Mishra Wednesday, March 25, 2015

    Hi Poonam,

    I have gone through your blogs regarding long GC pauses, and as commenting on that particular entry has been closed, I have to write via this blog.

    Basically, we are facing long, random minor GC pauses while the disk busy % is very high.

    I want to know the correlation between high I/O (due to high disk busy %, as seen in iostat -xtc) and long minor GC pauses. Can you please explain how a GC operation gets prolonged because of disk busy %, while memory and CPU remain normal all the time?

    If possible, share your email and I can provide you the logs.

    It would be great if you could help me out on this.


  • Poonam Wednesday, March 25, 2015

    Hi Amit,

    Which JDK are you using and which garbage collector are you running with?

    It is possible that the system is going through paging/swapping activity during the GC. If a sufficient amount of memory is not available, then the GC could be paging memory pages in and out while traversing the object graph in the heap.

    So, please check the memory health of your system, and either reduce the memory usage of the processes running on the system or increase the RAM.

    Thanks,

    Poonam


  • guest Friday, September 18, 2015

    Hi Poonam,

    I am not very familiar with memory management, the garbage collector, or how threads allocate and deallocate memory in Java 7. I am working on a core Java application which parses a binary file and keeps it in memory; from there we save the data to a DB. But as the data in the binary file has grown, it is exceeding the heap limit and generating a dump.

    Threads are being used in this, and I see that the memory does not decrease even after thread completion. Could you please provide some information on this, or a link where I can find some material regarding memory management, GC, and thread memory management?

    Thanks

    Shyam Kishore

    shyamivy@gmail.com

