Sunday Oct 19, 2008

Windows crash dumps for Java Processes

Windows Crash Dump is memory dump of a process running on a Windows system. These dumps can be very useful for debugging Java process crashes. In this entry I discuss how to collect sane Crash Dumps for Java process crashes on Windows machines that can later be analyzed using Windbg or other 'Debugging Tools For Windows'.

I have a simple java class test which uses native library test.dll using JNI. test.dll implements a 'native' method where it accesses a null pointer and causes a crash. Complete source of this crashing program is here


Windows NT, 2000 and XP

Let's first run this program on a Windows NT/2000/XP machine:

D:\\demo>java test
#
# An unexpected error has been detected by Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x10011f68, pid=2020, tid=640
#
# Java VM: Java HotSpot(TM) Client VM (10.0-b19 mixed mode, sharing windows-x86)
# Problematic frame:
# C [test.dll+0x11f68]
#
# An error report file with more information is saved as:
# D:\\demo\\hs_err_pid2020.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Running this program created an hs_err log file in the same folder. Error log file contains the following stack trace:

C [test.dll+0x1032]
j test.f()V+0
j test.main([Ljava/lang/String;)V+19
v ~StubRoutines::call_stub
V [jvm.dll+0x873ed]
V [jvm.dll+0xdfb96]
V [jvm.dll+0x872be]
V [jvm.dll+0x8e501]
C [java.exe+0x14c5]
C [java.exe+0x69cd]
C [kernel32.dll+0x16fd7]

Now, let's run it with -XX:+ShowMessageBoxOnError JVM option. I get the following message box:

Attaching Visual Studio or Windbg to this process or collecting a crash dump with Dr Watson and opening that crash dump with Windbg shows the following call trace:

ntdll!DbgBreakPoint
jvm!VMError::show_message_box+0x7f
jvm!VMError::report_and_die+0xe7
jvm!report_error+0x2d
jvm!topLevelExceptionFilter+0x54d
jvm!os::os_exception_wrapper+0x7d
msvcr71!except_handler3+0x61
jvm!JavaCalls::call+0x23
jvm!jni_invoke_static+0xb1
jvm!jni_CallStaticVoidMethod+0x86
java+0x209e
java+0x898f
kernel32!GetModuleFileNameA+0x1b4

This shows JVM's error handling frames and does not show that the crash happened in function test.f().

This is because, by default, the First Chance Exceptions are not sent to the System Debugger. And Debuggers receive only the Second Chance Exceptions. The exception in test.f() was a first Chance Exception that was hanlded by JVM. Please see details on First and Second chance exceptions: http://support.microsoft.com/kb/286350

So how can we collect crash dumps that contain the correct stack trace of the crash. Let's try 'adplus' to collect the crash dumps.

Start adplus in crash mode:
D:\\windbg>adplus -crash -pn java.exe

and then start your java process with -XX:+ShowMessageBoxOnError

By default, adplus creates mini dumps on First Chance Exceptions and full memory dumps on Second Chance Exceptions. It can be changed; details here: http://support.microsoft.com/kb/286350

'adplus' would create dump files in a folder like Crash_Mode__Date_08-12-2008__Time_15-12-56PM. Using Windbg, open the dump created at First Chance Exception and it shows the crashing frame as:

test!Java_test_f+0x22

Ah! that's what I was looking for.



Windows 2008 and Windows Vista

Dr. Watson is not available on Windows 2008 and Windows Vista

Starting with Windows Server 2008 and Windows Vista Service Pack1 (SP1), Windows has new error reporting mechanism called 'Windows Error Reporting' or WER. WER can be configured so that full user-mode dumps are collected and stored locally after a user-mode application crashes. This feature is not enabled by default. Enabling this feature requires administrator privileges. To enable and configure the feature, we need to use the following registry values under the
HKEY_LOCAL_MACHINE\\Software\\Microsoft\\Windows\\Windows Error Reporting\\LocalDumps key.

Details here:

http://msdn.microsoft.com/en-us/library/bb787181(VS.85).aspx

And yes, here also we can use 'adplus' to collect Crash Dumps for First Chance and Second Chance Exceptions.



Enjoy Debugging !

Sunday Feb 24, 2008

Debugging JVM hang with 'dbx'

Recently worked on a JVM hang problem, so posting here the steps followed in debugging this issue with dbx.

Customer was facing hard hang on a Solaris-x86 machine with jdk 1.5.0_08 and the process was not even responding to kill -3. They sent us a core file of the hung process.

For dbx to have the debug information, I re-built the product version of 1.5.0_08 libjvm.so with -g flag and loaded and used this new libjvm.so with dbx. With that, dbx was able to get all the debug information for the core file that was generated with a libjvm.so without any debug info.

The pstack on the core file showed that VMThread is trying to acquire Threads_lock, main thread is waiting on a condition on Threads_lock and another thread 3992 was also trying to grab Threads_lock, so this lock seemed to be causing the deadlock.

----------------- lwp# 1 / thread# 1 --------------------
fef4009b lwp_park (0, 0, 0)
fef3a909 cond_wait_queue (807c540, 807c528, 0, 0) + 3b
fef3ae02 _cond_wait (807c540, 807c528) + 66
fef3ae44 cond_wait (807c540, 807c528) + 21
fe92fc64 __1cHMonitorEwait6Mil_i_ (807c4d0, 0, 0) + 2a4
fecf526b __1cHThreadsKdestroy_vm6F_i_ (8073fb9, 0, 806dc14, fee4d33c, fee4a2a0, 10004) + 7b
feb66ea3 jni_DestroyJavaVM (fee43560) + 123
08051f8e main (2, 807017c, 80432b4) + bfb
080512fa ???????? (86, 8043468, 804346d, 8043494, 80434a8, 80434c6)
----------------- lwp# 11 / thread# 11 --------------------
fef4009b lwp_park (0, 0, 0)
fef399bb slow_lock (fbc92000, 807c528, 0) + 3d
fef39ab1 mutex_lock_impl (807c528, 0) + ec
fef39bbd mutex_lock (807c528) + 1a
fe8c9a00 __1cFMutexElock6MpnGThread__v_ (807c4d0, 8261480) + c0
fe8cc1fe __1cFMutexElock6M_v_ (807c4d0) + 4e
fe996295 __1cUSafepointSynchronizeFbegin6F_v_ (fee493fc, 8261480, fedee000, f8199b74, 3, f8199bbc) + 85
fea659c4 __1cIVMThreadEloop6M_v_ (8261480) + 134
fea66478 __1cIVMThreadDrun6M_v_ (8261480) + 88
fecbe0f2 __1cG_start6Fpv_0_ (8261480) + d2
fef3fd52 _thr_setup (fbc92000) + 4e
fef40040 _lwp_start (fbc92000, 0, 0, f8199ff8, fef40040, fbc92000)
----------------- lwp# 3992 / thread# 3992 --------------------
fef4009b lwp_park (0, 0, 0)
fef399bb slow_lock (6c173c00, 807c528, 0) + 3d
fef39ab1 mutex_lock_impl (807c528, 0) + ec
fef39bbd mutex_lock (807c528) + 1a
fe8ea95e __1cFMutexbClock_without_safepoint_check6M_v_ (807c4d0) + 7e
fe95c56e __1cUSafepointSynchronizeFblock6FpnKJavaThread__v_ (a27f780) + ee
fecb6825 __1cGParkerEpark6Mix_v_ (a27f908, 0, d964b198, 45) + 1e5
fed0963d Unsafe_Park (a27f840, 6807aba4, 0, d964b198, 45, 9480bcb8) + 1dd
f876f2f9 ???????? (0, 0, a27c330, 0, 9ae69a8, 0)
00000000 ???????? (fecf6330, fea09d40, fea03c80, fe964f40, fe91cc40, fe939c50)
00000000 ???????? (fffffd2c, 3badd9be, f468fe16, e78b7254, fe1dbbcc, 0)
00000000 ???????? ()

Now let's look at the Threads_lock in dbx.

(dbx) print -fx Threads_lock
Threads_lock = 0x807c4d0

(dbx) print -r \*((Monitor\*)0x807c4d0)
\*((class Monitor \*) 0x807c4d0) = {
Monitor::Mutex::_lock_count = 0
Monitor::Mutex::_lock_event = 0x807c500
Monitor::Mutex::_suppress_signal = 0
Monitor::Mutex::_owner = 0x97b0978 <----------------
Monitor::Mutex::_name = 0xfed97a0f "Threads_lock"
Monitor::Mutex::INVALID_THREAD = 0
Monitor::_event = (nil)
Monitor::_counter = 0
Monitor::_waiters = 1
Monitor::_tickets = 0
}

(dbx) print -r \*((Thread\*)0x97b0978)
\*((class Thread \*) 0x97b0978) = {
Thread::ThreadShadow::_pending_exception = (nil)
Thread::_exception_file = (nil)
Thread::_exception_line = 0
Thread::_real_malloc_address = 0x97b0d28
Thread::_rawmonitor_list = (nil)
Thread::_SR_lock = 0x9b7b890
Thread::_vm_suspend_depth = 0
Thread::_suspend_flags = 0
Thread::_active_handles = 0x97b0b48
Thread::_free_handle_block = (nil)
Thread::_last_handle_mark = 0x65c92618
Thread::_oops_do_parity = 0
Thread::_highest_lock = 0x65c92cb8 "x\^I{\^I\\xe8\^K{\^I\\xb0\\xd6\\x82\\xf4\\xa0H\\xbeu"
Thread::_unboosted_priority = NoPriority
Thread::_tlab = {
Thread::ThreadLocalAllocBuffer::_start = 0x762476d8
Thread::ThreadLocalAllocBuffer::_top = 0x7624c750
Thread::ThreadLocalAllocBuffer::_end = 0x76257308
Thread::ThreadLocalAllocBuffer::_size = 14078U
Thread::ThreadLocalAllocBuffer::_refill_waste_limit = 219U
Thread::ThreadLocalAllocBuffer::_number_of_refills = 6U
Thread::ThreadLocalAllocBuffer::_fast_refill_waste = 0
Thread::ThreadLocalAllocBuffer::_slow_refill_waste = 202U
Thread::ThreadLocalAllocBuffer::_gc_waste = 0
Thread::ThreadLocalAllocBuffer::_slow_allocations = 0
Thread::ThreadLocalAllocBuffer::_allocation_fraction = {
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_average = 0.006992618
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_sample_count = 1U
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_weight = 35U
Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_last_sample = 0.006992618
}
Thread::ThreadLocalAllocBuffer::_target_refills = 50
Thread::ThreadLocalAllocBuffer::_global_stats = 136222936
}
Thread::_vm_operation_started_count = 1
Thread::_vm_operation_completed_count = 1
Thread::_current_pending_monitor = (nil)
Thread::_current_pending_monitor_is_from_java = 1
Thread::_current_waiting_monitor = (nil)
Thread::_osthread = 0x97b0b00 <-------------
Thread::_resource_area = 0x8c55f50
Thread::_handle_area = 0x98dafc0
Thread::_stack_base = 0x65c93000 ""
Thread::_stack_size = 262144U
Thread::_self_raw_id = 1731617792U
}

(dbx) print -r \*((OSThread\*)0x97b0b00)
\*((class OSThread \*) 0x97b0b00) = {
OSThread::_start_proc = (nil)
OSThread::_start_parm = (nil)
OSThread::_state = RUNNABLE
OSThread::_interrupted = 0
OSThread::_thread_id = 11982U
OSThread::_lwp_id = 11982U
OSThread::_caller_sigmask = {
OSThread::__sigbits = (4U, 0, 0, 0)
}
OSThread::_vm_created_thread = 1
OSThread::_current_callback = (nil)
OSThread::_current_callback_lock = (nil)
OSThread::_interrupt_event = 0x97b07d8
OSThread::_saved_interrupt_thread_state = _thread_new
}

This shows that owner of Threads_lock is thread with id 11982. Stack trace of this thread.

(dbx) thread t@11982
t@11982 (l@11982) stopped in __lwp_park at 0xfef4009b
0xfef4009b: __lwp_park+0x000b: jae __lwp_park+0x19 [ 0xfef400a9, .+0xe ]
(dbx) where
current thread: t@11982
=>[1] __lwp_park(0x0, 0x0), at 0xfef4009b
[2] mutex_lock_queue(0x67366400, 0x0, 0x88a0598, 0x0), at 0xfef3911f
[3] slow_lock(0x67366400, 0x88a0598, 0x0), at 0xfef399bb
[4] mutex_lock_impl(0x88a0598, 0x0), at 0xfef39ab1
[5] __mutex_lock(0x88a0598), at 0xfef39bbd
[6] Mutex::wait_for_lock_blocking_implementation(0xa27f908, 0x97b0978), at 0xfe9d93ba
[7] Mutex::lock(0xa27f908, 0x97b0978), at 0xfe8c99ea
[8] Mutex::lock(0xa27f908), at 0xfe8cc1fe
[9] Parker::unpark(0xa27f908), at 0xfecb69bb
[10] os::interrupt(0xa27f780), at 0xfecbf6bb
[11] Thread::interrupt(0xa27f780), at 0xfecf2ed9
[12] JVM_Interrupt(0x97b0a38, 0x65c923fc, 0x65c923d0, 0x0, 0x65c923fc, 0xf4873020, 0x0, 0xf482f600, 0x65c923fc, 0x65c92424,
0xf8202b6b, 0x0, 0xf8206399, 0x9593fdc0, 0x1, 0x0, 0x65c92400, 0xf482db6b, 0x65c92438, 0xf4873020), at 0xfeb92255
[13] 0xf820843a(0x0, 0xf8206399, 0x9593fdc0, 0x1, 0x0, 0x65c92400), at 0xf820843a
[14] 0xf8202b6b(0x0, 0x0, 0x97f59c18, 0x9593fdc0, 0x65c9243c, 0xf5746e95), at 0xf8202b6b
[15] 0xf8202b6b(0x967a0c40, 0x9, 0x968f6eb8, 0x65c92464, 0xf57472be, 0x65c92498), at 0xf8202b6b
[16] 0xf8202b6b(0x0, 0x968f6eb8, 0x967a0c40, 0x65c9249c, 0xf5845a0c, 0x65c924c4), at 0xf8202b6b
[17] 0xf8202b6b(0x1b58, 0x967a0c40, 0x65c924c8, 0xf5845674, 0x65c92524, 0xf58a1f78), at 0xf8202b6b
[18] 0xf8202b6b(0x7624c168, 0x7624c138, 0x7624c108, 0x7624c0d8, 0x7624bee8, 0x7624bd00), at 0xf8202b6b

It is trying to acquire Monitor 0xa27f908

Now, look at this Monitor:
(dbx) print -r \*((Monitor\*)0xa27f908)
\*((class Monitor \*) 0xa27f908) = {
Monitor::Mutex::_lock_count = 0
Monitor::Mutex::_lock_event = 0x88a0570 <-----------
Monitor::Mutex::_suppress_signal = 0
Monitor::Mutex::_owner = (nil)
Monitor::Mutex::_name = 0xfed97c99 "Park Semaphore"
Monitor::Mutex::INVALID_THREAD = 0
Monitor::_event = (nil)
Monitor::_counter = 0
Monitor::_waiters = 0
Monitor::_tickets = 0
}
(dbx) print -r \*((Event\*)0x88a0570)
More than one identifier 'Event'.
Select one of the following:
0) Cancel
1) class os::Solaris::Event
2) class Solaris::Event
> 1
\*((class os::Solaris::Event \*) 0x88a0570) = {
Event::_count = 0
Event::_nParked = 0
Event::CachePad = (-6.7529205832009e-205, -6.6220952168053e-205, -1.2786650622506e-204, -1.9668306807887e-204)
Event::_mutex = (
{
Event::_lwp_mutex::flags = {
Event::_lwp_mutex::flag1 = 4U
Event::_lwp_mutex::flag2 = '\\0'
Event::_lwp_mutex::ceiling = '\\0'
Event::_lwp_mutex::mbcp_type_un = {
Event::_lwp_mutex::bcptype = 0
Event::_lwp_mutex::mtype_rcount = {
Event::_lwp_mutex::count_type1 = '\\0'
Event::_lwp_mutex::count_type2 = '\\0'
}
}
Event::_lwp_mutex::magic = 19800U
}
Event::_lwp_mutex::lock = {
Event::_lwp_mutex::lock64 = {
Event::_lwp_mutex::pad = ""
}
Event::_lwp_mutex::lock32 = {
Event::_lwp_mutex::ownerpid = 0
Event::_lwp_mutex::lockword = 16842752U
}
Event::_lwp_mutex::owner64 = 72339069014638592ULL
}
Event::_lwp_mutex::data = 1813462016ULL
}

)
Event::_cond = (
{
Event::_lwp_cond::flags = {
Event::_lwp_cond::flag = ""
Event::_lwp_cond::type = 0
Event::_lwp_cond::magic = 17238U
}
Event::_lwp_cond::data = 0
}

)
Event::FreeNext = (nil)
Event::Immortal = 0
}

Here _lwp_mutex->data which is 0x6C173C00 is the owner thread (ulwp_t)

looking at it:
(dbx) x 0x6C173C00/30X
0x6c173c00: 0x6c173c00 0x00000000 0x00000000 0x00000000
0x6c173c10: 0x00000000 0x00000000 0x00000000 0x00000000
0x6c173c20: 0x00000000 0x00000000 0x00000000 0xfef6b740
0x6c173c30: 0x00000000 0x00000000 0x6c174000 0x6c173800
0x6c173c40: 0x00000000 0x00000000 0x00000000 0x6803b000
0x6c173c50: 0x00040000 0x00000000 0x6807b000 0x00040000
0x6c173c60: 0x6803b000 0x00040000 0x00000000 0x00000398
0x6c173c70: 0x00000f98 0x0000007f

0x00000f98 (which is 3992) is the thread id holding this Event::mutex

Now looking at thread 3992:
(dbx) thread t@3992
t@3992 (l@3992) stopped in __lwp_park at 0xfef4009b
0xfef4009b: __lwp_park+0x000b: jae __lwp_park+0x19 [ 0xfef400a9, .+0xe ]
(dbx) where
current thread: t@3992
=>[1] __lwp_park(0x0, 0x0), at 0xfef4009b
[2] mutex_lock_queue(0x6c173c00, 0x0, 0x807c528, 0x0), at 0xfef3911f
[3] slow_lock(0x6c173c00, 0x807c528, 0x0), at 0xfef399bb
[4] mutex_lock_impl(0x807c528, 0x0), at 0xfef39ab1
[5] __mutex_lock(0x807c528), at 0xfef39bbd
[6] Mutex::lock_without_safepoint_check(0x807c4d0), at 0xfe8ea95e
[7] SafepointSynchronize::block(0xa27f780), at 0xfe95c56e
[8] Parker::park(0xa27f908, 0x0, 0xd964b198, 0x45), at 0xfecb6825
[9] Unsafe_Park(0xa27f840, 0x6807aba4, 0x0, 0xd964b198, 0x45, 0x9480bcb8), at 0xfed0963d

It is waiting to lock Threads_lock.

So threads 11982 and 3992 have a deadlock.

Thread 11982 – holds Thread_lock and waiting to acquire Monitor 0xa27f908
Thread 3992 – holds lock on 0xa27f908 and waiting to acquire Threads_lock

This problem has been fixed with 6601725 in 5.0u15 and mustang.

About

poonam

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today