Quickly Diagnose the Root Cause of Stuck Threads using Oracle Enterprise Manager 12c JVM Diagnostics
By Shiraz Kanga-Oracle on May 12, 2014
One of the hidden gems in Oracle Enterprise Manager 12c is JVM Diagnostics (JVMD). If you purchased the WebLogic Management Pack license, then you already own it. JVMD allows administrators to diagnose performance problems in production Java applications. By eliminating the need to reproduce these “production only” problems in QA, it reduces the time required to resolve them. It does not require complex instrumentation or an application restart to get in-depth application details. Application administrators can identify the Java problems or database issues that are causing application downtime without detailed knowledge of the application internals. It is also very well suited to diagnosing issues with “Stuck Threads”, which are the focus of this post.
What is a [STUCK] Thread
In a WebLogic server, all incoming requests are handled by a thread pool that is controlled by a work manager. Worker threads that are taken out of the pool and not returned within a specified time period are marked as [STUCK] by the work manager. This time period defaults to 600 seconds (10 minutes) and is configurable on a per work manager basis using the "StuckThreadMaxTime" parameter.
Note that some of your threads may be doing legitimate work for over 10 minutes with no issues. If you have such threads, consider placing them in another work manager with an appropriate setting for the "StuckThreadMaxTime" parameter.
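To make the rule concrete, here is a minimal sketch of the timing check described above. It is an illustrative model only, not WebLogic's actual implementation: the class name StuckThreadCheck and the method isStuck are hypothetical, and treating a request that has run for exactly the limit as "not yet stuck" is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model of the stuck-thread rule; not WebLogic's actual code. */
final class StuckThreadCheck {
    /** Mirrors the default StuckThreadMaxTime of 600 seconds. */
    static final Duration DEFAULT_MAX_TIME = Duration.ofSeconds(600);

    /**
     * Returns true if a request that started at requestStart has been
     * running for longer than maxTime as of now.
     */
    static boolean isStuck(Instant requestStart, Instant now, Duration maxTime) {
        return Duration.between(requestStart, now).compareTo(maxTime) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2014-05-12T01:00:00Z");
        Instant now = start.plus(Duration.ofMinutes(11));
        // An 11 minute request compared against the default 10 minute limit
        System.out.println(isStuck(start, now, DEFAULT_MAX_TIME));
        // The same request compared against a more generous 30 minute limit
        System.out.println(isStuck(start, now, Duration.ofMinutes(30)));
    }
}
```

The second call shows why a separate work manager with a larger limit helps: the same long-running request is no longer flagged.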
Why JVMD is Well Suited to Diagnosing [STUCK] Threads
Traditionally, developers use a stack trace generated by jstack or kill -3 to try to determine the cause of a stuck thread. However, in my experience, most of the time the stuck thread's own stack is not the culprit. The problem often lies in another tier of the application or even in another thread of the same application. JVMD can provide additional context, such as the name of the request and which tier it called out to (e.g. RDBMS servers, LDAP servers, web servers, RMI servers). Using fine-grained thread states (DB, Network, IO, CPU, RMI, Lock, etc.) and the ability to see additional details about the thread, JVMD users can quickly pinpoint the root cause of the problem. Since JVMD is always on, it can also debug issues that happened in the past and can proactively notify you about stuck threads, e.g. by sending you an email at 1am when stuck threads occur. Lastly, developers sometimes cannot run command-line tools at all because they lack credentials for the target host; JVMD avoids this problem because it is used from the Enterprise Manager console.
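For comparison, the traditional approach can also be done from inside the JVM with the standard java.lang.management API. The sketch below collects a jstack-like dump and keeps only threads whose name contains a marker such as "[STUCK]". The class name StuckThreadDump and the method findByNameMarker are hypothetical. Note that it shows only the raw stack of each thread, with none of the request or tier context discussed above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** In-process equivalent of filtering a jstack dump by thread name. */
final class StuckThreadDump {
    /** Returns info for live threads whose name contains the marker, e.g. "[STUCK]". */
    static List<ThreadInfo> findByNameMarker(String marker) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<ThreadInfo> matches = new ArrayList<>();
        // false, false: skip locked monitor and synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadName().contains(marker)) {
                matches.add(info);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        for (ThreadInfo info : findByNameMarker("[STUCK]")) {
            System.out.println(info.getThreadName() + " state=" + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```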
In some cases, a thread may be marked as stuck while it is doing legitimate work. In such scenarios JVMD allows you to scan back and forth through a large number of samples to see what work is being processed by the thread. In addition, you can look at other threads that serviced the same request to see whether they behaved similarly. This allows you to quickly determine whether there really is a problem.
Real-Time [STUCK] Thread Analysis
With JVMD there are two main use cases for stuck thread analysis. If you get notified about a stuck thread in real time (via email, etc.), then you can perform a real-time stuck thread analysis. Alternatively, if you are investigating a thread that was stuck in the past but is no longer present, then you can perform a historical stuck thread analysis. In either case, the first thing to do is to navigate to the JVM (or JVM pool) where the thread is stuck. We do this by clicking on Targets -> Middleware.
From here we can filter the list of targets by target type or by target name. Your most recent filter request will be remembered the next time you visit the page. Select the Target Type of JVM to see all of the JVM targets.
Pick the JVM for the WebLogic server that is having the stuck thread issues and click on it. This takes you to the target home page. Click the button at the top that says “Live Thread Analysis”. Type the word "stuck" into the thread name search box and click on the arrow to filter the table. You should now see all the stuck threads. In this case we can see a thread that is stuck in the “Network Wait” state. It is stuck on line 358 in the writeBuffer() method of OutputRecord.java, which is in package com.sun.net.ssl.internal.ssl. This makes it clear that the stuck thread has made an SSL call and the remote server has not responded in a reasonable amount of time, so the client thread is stuck.
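The reasoning in this example (an SSL class on the stack implies a network call) can be sketched as a rough heuristic. This is not how JVMD determines thread states; its classification logic is not described here. The class name WaitCategoryGuess, the method guess, and the chosen package prefixes are all assumptions made for illustration.

```java
/** Rough heuristic only; JVMD's real thread state detection is not shown here. */
final class WaitCategoryGuess {
    /** Guesses a coarse wait category from the classes that appear on a thread's stack. */
    static String guess(StackTraceElement[] stack) {
        // Check for JDBC first: a database call usually also has socket frames above it
        for (StackTraceElement frame : stack) {
            String cls = frame.getClassName();
            if (cls.startsWith("oracle.jdbc.") || cls.startsWith("java.sql.")) {
                return "DB Wait";
            }
        }
        // Otherwise look for SSL or plain socket classes (assumed prefixes)
        for (StackTraceElement frame : stack) {
            String cls = frame.getClassName();
            if (cls.startsWith("com.sun.net.ssl.")
                    || cls.startsWith("sun.security.ssl.")
                    || cls.startsWith("java.net.")) {
                return "Network Wait";
            }
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        // A single frame modeled on the example above
        StackTraceElement[] stack = {
            new StackTraceElement("com.sun.net.ssl.internal.ssl.OutputRecord",
                    "writeBuffer", "OutputRecord.java", 358)
        };
        System.out.println(guess(stack));
    }
}
```

A stack-only guess like this cannot tell you which remote server or which SQL statement is involved, which is the extra context JVMD adds.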
Here is another example of stuck threads, this time in the “DB Wait” state. Notice how the tool tip over the SQL ID field shows the SQL being executed. Click on it to view longer SQL statements. Also try clicking on the DB Wait link, which takes you directly to this specific database session in the Oracle Database Diagnostics section of EM for further analysis. The columns displayed are controlled by the “View” drop-down menu. Here we added the “User” column to show the logged-in user who executed the request.
Historical [STUCK] Thread Analysis
To start a historical stuck thread analysis, navigate to the JVM target home page in the same way as discussed in the real-time section. From the target home page, click on the “JVM Performance Diagnostics” button at the top of the page. On the performance diagnostics page you can filter the data to make it more relevant to your task. The first filter to apply is time. If you know the exact time, you can use the “Edit Date and Time” button to specify it. Otherwise use the handy shortcut links for preset time ranges (for example, 1 Hour or 15 Minutes) as needed.
The next thing to filter is the Thread Name. Expand the filter options region if necessary and add a Thread Name filter of “[STUCK]*” so that you only see threads whose name starts with [STUCK].
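If you want to apply the same kind of filter to thread names in your own tooling, be aware that “[STUCK]*” is not a usable regular expression as written, because the brackets would be read as a character class. Below is a minimal sketch that treats '*' as the only wildcard and everything else as literal text. The class name ThreadNameFilter is hypothetical, and the assumption that the EM filter follows exactly these wildcard rules is ours.

```java
import java.util.regex.Pattern;

/** Converts a simple '*' wildcard filter into a regular expression. */
final class ThreadNameFilter {
    /** Treats '*' as "any characters" and quotes every other character literally. */
    static Pattern compile(String filter) {
        String[] parts = filter.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            // Pattern.quote keeps '[' and ']' from being read as a character class
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole thread name matches the filter. */
    static boolean matches(String filter, String threadName) {
        return compile(filter).matcher(threadName).matches();
    }

    public static void main(String[] args) {
        String stuck = "[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'";
        String active = "[ACTIVE] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'";
        System.out.println(matches("[STUCK]*", stuck));
        System.out.println(matches("[STUCK]*", active));
    }
}
```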
Below the filter region, the “General” tab shows you the Thread States, Top Requests, Top Methods, Top SQLs, Top DB Wait Events and Top Databases for the filtered data only, i.e. only for threads that are stuck. Try clicking on method names to see the call stack for the method. The charts are all interactive and fetch additional data about the item clicked.
If you want to find a specific thread, move from the “General” tab over to the “Threads” tab. This is fine-grained data with each sample and state transition visible. You can click on any sample to view it in the sample analyzer, which should look familiar if you saw the threads in real time. Details about SQLs, wait states, etc. are also available here, along with the complete call stack, which can be exported to a CSV file.
In conclusion, JVMD provides a rich set of additional details, only a mouse click away, that help you diagnose the root causes of your stuck threads.
NOTE: Many of the screenshots here were taken using testing and debug code that deliberately tries to create stuck threads. This does not and should not reflect on the nature of any Oracle products being shipped to customers.