X
  • November 2, 2016

Hung JVM due to the threads stuck in pthread_cond_timedwait()

Poonam Parhar
Consulting Member of Technical Staff

In this post, I am going to discuss a couple of scenarios where the Java process hangs are actually NOT because of a Java/JVM issue but happen due to the OS problems. Both the hangs being discussed here occur on the Linux operating system.

Lets take a look at the first case where a thread seems to be stuck in the Java Thread.sleep() call. From the process stack trace, the two threads of interest are:

Thread 26755: (state = IN_VM)
 - java.lang.Thread.interrupt0() @bci=0 (Interpreted frame)
 - java.lang.Thread.interrupt() @bci=51, line=852 (Interpreted frame)
 - oracle.core.ojdl.BufferedLogWriter.stopFlusher() @bci=17, line=380
(Interpreted frame)

Thread 21714: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - oracle.core.ojdl.BufferedLogWriter$Flusher.run() @bci=30, line=401
(Interpreted frame)

Thread 26755 is attempting to interrupt thread 21714 which is executing Thread.sleep(). Thread 21714 stays in the BLOCKED state preventing other threads in the process from making any progress, and thus causing the process to hang.

Looking at their native stack traces:

Thread 26755:
#0  0xf77e7430 in __kernel_vsyscall ()
#1  0x4dcc31b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x4dcbe550 in _L_lock_677 () from /lib/libpthread.so.0
#3  0x4dcbe421 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x064ff034 in os::Linux::Event::unpark() ()
#5  0x064fc60a in os::interrupt(Thread*) ()

Thread 21714:
#0  0xf77e7430 in __kernel_vsyscall ()
#1  0x4dc0fc43 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x4db94348 in _L_lock_9835 () from /lib/libc.so.6
#3  0x4db91f27 in calloc () from /lib/libc.so.6
#4  0x4dcc0e1c in pthread_cond_timedwait@GLIBC_2.0 () from /lib/libpthread.so.0
#5  0x064fefa1 in os::Linux::Event::timedwait(timespec*) ()
#6  0x064fc2bf in os::sleep(Thread*, long long, bool) ()
#7  0x063d5d90 in JVM_Sleep ()

You might be wondering as to what was the value of time duration passed to pthread_cond_timedwait() function. It was 5 secs, and so after 5 seconds pthread_cond_timedwait() should have come out of its wait.

pthread_cond_timedwait() accepts 'pthread_cond_t* cond' and 'pthread_mutex_t* mutex' arguments that it uses to coordinate with the other threads (https://linux.die.net/man/3/pthread_cond_timedwait). Take a look at the cond and the mutex arguments:

(gdb) print *((pthread_cond_t*)0xad533ff0)
$3 = {__c_lock = {__status = 0, __spinlock = 17}, __c_waiting = 0xad534050}

(gdb) print *((pthread_mutex_t*)0xad533fd8)
$4 = {__m_reserved = 2, __m_count = 0, __m_owner = 0x54d2, __m_kind = 0, __m_lock = {__status = 1, __spinlock = 0}}
__m_owner = 0x54d2 = 21714

(gdb) print *$11->_osthread->_interrupt_event
$14 = {<CHeapObj> = {<No data fields>}, _count = 0, _nParked = 1,
cachePad = {-2.3622328335110874e-90, 4.2439915835115547e-313,
    1.4480588116526359e-314, 4.2439915866735748e-313}, _mutex =
{{__m_reserved = 2, __m_count = 0, __m_owner = 0x54d2, __m_kind = 0,
__m_lock = {
        __status = 1, __spinlock = 0}}}, _cond = {{__c_lock = {__status
= 0, __spinlock = 17}, __c_waiting = 0xad534050}}, FreeNext = 0xbad, Immortal = 1}

From the above data, we can see that the sleeping thread owns the _mutex, and the interrupting thread is waiting to acquire that _mutex to signal the sleeping thread. If we look closely, the sleeping thread is actually waiting on a native heap lock to complete some allocation (in calloc) before it could release the _mutex.

So it turned out not to be a JVM issue but some problem with the lock synchronization at the linux kernel level, causing the thread to be stuck in pthread_cond_timedwait() for infinite amount of time.

Let's take look at another case where there were three threads stuck in __pthread_cond_timedwait(), stopping the progress of other threads in the process.

----------------- 11374 -----------------
0x00000032f1c0ba0e      __pthread_cond_timedwait + 0x13e
0x00007fb1453bbf94      _ZN13ObjectMonitor4waitElbP6Thread + 0x224
0x00007fb1454a0d83      _ZN18ObjectSynchronizer4waitE6HandlelP6Thread + 0x53
0x00007fb14520b34b      JVM_MonitorWait + 0x1fb
0x00007fb13d010eee      * java.lang.Object.wait(long) bci:0 (Interpreted frame)
0x00007fb13d005a82      * java.lang.Thread.join(long) bci:70 line:1214 (Interpreted frame)
0x00007fb13d005a82      * iasauto.execution.GetCfgRunDTE.execute() bci:864 line:504 (Interpreted frame)
0x00007fb13d005a82      * iasauto.execution.RunOneJob.execute() bci:1675 line:1521 (Interpreted frame)
0x00007fb13d005a82      * iasauto.execution.RunOneJob.main(java.lang.String[]) bci:18 line:58 (Interpreted frame)
0x00007fb13d000438      <StubRoutines>
0x00007fb14519e8d0      _ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread + 0x1e0
0x00007fb1453ca829      _ZN2os20os_exception_wrapperEPFvP9JavaValueP12methodHandleP17JavaCallArgumentsP6ThreadES1_S3_S5_S7_ + 0x19
0x00007fb14519e6e5      _ZN9JavaCalls4callEP9JavaValue12methodHandleP17JavaCallArgumentsP6Thread + 0x25
0x00007fb1451d99f7      _Z17jni_invoke_staticP7JNIEnv_P9JavaValueP8_jobject11JNICallTypeP10_jmethodIDP18JNI_ArgumentPusherP6Thread + 0x147
0x00007fb1451c753f      jni_CallStaticVoidMethod + 0x20f
0x0000000040002284      JavaMain + 0x2a4

----------------- 31490 -----------------
0x00000032f1c0ba0e      __pthread_cond_timedwait + 0x13e
0x00007fb1453bbf94      _ZN13ObjectMonitor4waitElbP6Thread + 0x224
0x00007fb1454a0d83      _ZN18ObjectSynchronizer4waitE6HandlelP6Thread + 0x53
0x00007fb14520b34b      JVM_MonitorWait + 0x1fb
0x00007fb13d010eee      * java.lang.Object.wait(long) bci:0 (Interpreted frame)
0x00007fb13d005a82      * java.lang.Thread.join(long) bci:70 line:1214 (Interpreted frame)
0x00007fb13d005a82      * iasauto.engine.LocalTaskProcess.execute() bci:70 line:206 (Interpreted frame)
0x00007fb13d005a82      * iasauto.engine.TaskProcess.executeLocal(boolean) bci:277 line:751 (Interpreted frame)
0x00007fb13d005a82      * iasauto.engine.TaskProcess.run() bci:631 line:289 (Interpreted frame)
0x00007fb13d005f5c      * java.lang.Thread.run() bci:11 line:682 (Interpreted frame)
0x00007fb13d000438      <StubRoutines>
0x00007fb14519e8d0      _ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread + 0x1e0
0x00007fb1453ca829      _ZN2os20os_exception_wrapperEPFvP9JavaValueP12methodHandleP17JavaCallArgumentsP6ThreadES1_S3_S5_S7_ + 0x19
0x00007fb14519e246      _ZN9JavaCalls12call_virtualEP9JavaValue11KlassHandle12symbolHandleS3_P17JavaCallArgumentsP6Thread + 0x116
0x00007fb14519e2c7      _ZN9JavaCalls12call_virtualEP9JavaValue6Handle11KlassHandle12symbolHandleS4_P6Thread + 0x47
0x00007fb145230c12      _Z12thread_entryP10JavaThreadP6Thread + 0xa2
0x00007fb1454d3401      _ZN10JavaThread3runEv + 0x121

----------------- 11681 -----------------
0x00000032f1c0ba0e      __pthread_cond_timedwait + 0x13e
0x00007fb1453bbf94      _ZN13ObjectMonitor4waitElbP6Thread + 0x224
0x00007fb1454a0d83      _ZN18ObjectSynchronizer4waitE6HandlelP6Thread + 0x53
0x00007fb14520b34b      JVM_MonitorWait + 0x1fb
0x00007fb13d010eee      * java.lang.Object.wait(long) bci:0 (Interpreted frame)
0x00007fb13d005a82      * java.util.TimerThread.mainLoop() bci:201 line:509 (Interpreted frame)
0x00007fb13d005a82      * java.util.TimerThread.run() bci:1 line:462 (Interpreted frame)
0x00007fb13d000438      <StubRoutines>
0x00007fb14519e8d0      _ZN9JavaCalls11call_helperEP9JavaValueP12methodHandleP17JavaCallArgumentsP6Thread + 0x1e0
0x00007fb1453ca829      _ZN2os20os_exception_wrapperEPFvP9JavaValueP12methodHandleP17JavaCallArgumentsP6ThreadES1_S3_S5_S7_ + 0x19
0x00007fb14519e246      _ZN9JavaCalls12call_virtualEP9JavaValue11KlassHandle12symbolHandleS3_P17JavaCallArgumentsP6Thread + 0x116
0x00007fb14519e2c7      _ZN9JavaCalls12call_virtualEP9JavaValue6Handle11KlassHandle12symbolHandleS4_P6Thread + 0x47
0x00007fb145230c12      _Z12thread_entryP10JavaThreadP6Thread + 0xa2
0x00007fb1454d3401      _ZN10JavaThread3runEv + 0x121
0x00007fb1453cbcdf      _Z10java_startP6Thread + 0x13f

From the investigations so far, we believe it to be a Linux Kernel issue, probably related to the fix done by the following commit.
https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0

To summarize, even when on the surface the Java process hangs look to be Java/JVM issues, they may not be. We should be looking closely at the native stack traces of the threads to understand what is really holding them up, and where they are stuck.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
Oracle

Integrated Cloud Applications & Platform Services