Monday May 12, 2014

Quickly Diagnose the Root Cause of Stuck Threads using Oracle Enterprise Manager 12c JVM Diagnostics

Note: Clicking on any image will open the same image in full size in a new window

One of the hidden gems in Oracle Enterprise Manager 12c is JVM Diagnostics. If you purchased the Weblogic Management Pack license then you already own it. JVMD allows administrators to diagnose performance problems in production Java applications. By eliminating the need to reproduce these “production only” problems in QA, it reduces the time required to resolve them. It does not require complex instrumentation or restarting of the application to get in-depth application details. Application administrators will be able to identify Java problems or database issues that are causing application downtime without any detailed knowledge of the application internals. It is also very well suited to diagnosing issues with “Stuck Threads” which will be the focus of this blog.

What is a [STUCK] Thread

In a Weblogic server, all incoming requests are handled by a thread pool which is controlled by a work manager. Worker threads that are taken out of the pool and not returned after a specified time period are marked as [STUCK] by the work manager. This time period is 10 minutes by default but it is configurable on a per work manager basis using the "StuckThreadMaxTime" parameter (default is set to 600 seconds).

Note that it is possible that some of your threads are doing legitimate work for over 10 min with no issues. If you have such threads then you should consider placing them in a another work manager with proper setting for the "StuckThreadMaxTime" parameter

Why JVMD is Well Suited to Diagnosing [STUCK] Threads

Traditionally, developers will use a stack trace generated by jstack or kill -3 and try to determine the cause of a stuck thread. However, in my experience a majority of the time this stack is not even the culprit. The problem often lies in another tier of the application or even in another thread of the same application. JVMD has the ability to provide additional context such as the name of the request and which tier it called out to Eg: RDBMS servers, LDAP servers, Web servers, RMI servers, etc. Using fine grained thread states (i.e. DB, Network, IO, CPU, RMI, Lock, etc) and the ability to see additional details about the thread, JVMD users can quickly pinpoint the root cause of the problem. Since JVMD is always on, it can also debug these issues that happened in the past and can proactively notify you about stuck threads Eg: Get an email at 1am when you had stuck threads. And lastly, sometimes developers have no access to the target host due to lack of credentials needed to run command line applications.

On several occasions, the thread may be stuck but is doing legitimate work. In such scenarios JVMD allows you to scan back and forth through a large number of samples to see what work is being processed by the thread. In addition, you can take a look at other threads that were serviced the same request to see if they behaved similarly or not. This will allow you to quickly determine whether there is really a problem or not.

Real-Time [STUCK] Thread Analysis

With JVMD there are two real use cases for stuck thread analysis. If you get notified about a stuck thread in real-time (via email, etc) then you can perform a real-time stuck thread analysis. Alternatively, if you are investigating a thread that was stuck in the past but is not present any more, then you can perform a historical stuck thread analysis. In either case the first thing to do is to navigate to the JVM (or JVM pool) where the thread is stuck. We do this by clicking on Targets -> Middleware as shown

From here we can filter the list of targets by target type or by target name. Your most recent filter request will be remembered the next time you visit the page. Select the Target Type of JVM to see all of the JVM targets.

Pick the JVM for the Weblogic server which is having the stuck thread issues and click on it. This will take you to the target home page. Click the button at the top that says “Live Thread Analysis”. Type the word "stuck" into the thread name search box and click on the arrow to filter the table. Now you should see all the stuck threads. In this case we can see a thread that is stuck in the “Network Wait” state. It is stuck on line 358 in function writeBuffer() of OutputRecord.java which is in package com.sun.ssl.internal.ssl which makes it clear that this stuck thread has made an SSL call and the remote server has not responded in a reasonable amount of time so the client thread is stuck.

Here is another example of stuck threads, this time in the “DB Wait” state. Notice how the tool tip over the SQL ID field shows the SQL being executed. Click on it to view longer SQL statements. Also try clicking on the DB Wait link which takes you directly to this specific database session in the Oracle Database Diagnostics section of EM for further analysis. The columns displayed are controlled by “View” drop-down menu. Here we added the “User” column to show the logged in user who executed the request.

Historical [STUCK] Thread Analysis

In order to start historical stuck thread analysis you need to navigate to the JVM target home page in the same way as discussed in the real-time section. From the target home page clicking on the “JVM Performance Diagnostics” button at the top of the page. On the performance diagnostics page you can filter the data to make it more relevant to your task. The first filter to apply is of time. If you know the exact time you can use the “Edit Date and Time” button to specify it as shown. Otherwise use the handy shortcut links for Day, 1 Hours, 1 Hour or 15 Minutes as needed.

The next thing to filter is the Thread Name. Expand the filter options region if necessary and add the Thread Name filter to be “[STUCK]*” so you only see threads whose name starts with [STUCK]

Below the filter region the “General” tab will show you the Thread States, Top Requests, Top Methods, Top SQLs, Top DB Wait Events and Top Databases – only for the filtered data i.e. for only threads that are stuck. Try clicking on method names to see the call stack for the method. The charts are all interactive and fetch additional data about the item clicked.

If you want to find a specific thread move from the “General” tab over to the “Threads” tab. This is fine grained data with each sample and state transition visible. You can click on any sample to view it in the sample analyzer which should look familiar to you if you saw the threads in real-time. Details about SQLs, Wait states, etc are all available here also along with the complete call stack which can also be exported to a CSV file.

In conclusion, we can see the JVMD provides a rich set of additional details which are only a mouse click away that help you to diagnose the root causes of your stuck threads.

NOTE: Many of the screen shots taken here are using testing & debug code, which deliberately tries to create stuck threads. This does not and should not reflect on the nature of any Oracle products being shipped to customers.


Thursday Feb 16, 2012

Managing Oracle Database 11g—Questions and Answers from the Oracle Enterprise Management Online Forum

We received tons of questions from our recent Oracle Enterprise Manager 12c Online Forum, we wanted to continue to provide answers to the most popular ones. In this installment, we'll cover questions from the DBA and developer track.

Q. What kind of testing do you recommend for upgrading from Oracle Database 11gR1 to 11gR2?
A. Oracle Real Application Testing using Database Replay and SQL Performance Analyzer's capabilities are recommended for any Oracle Database upgrade. Check out these resources for more details: Oracle Real Application Testing datasheet and OTN for details on upgrading your database.

Q. Can I manage Oracle Exadata and traditional Oracle Databases from the same console?
A. Yes, you can manage Oracle Exadata and single instance and/or RAC databases from the same Oracle Enterprise Manager Cloud Control console, as well as WebLogic and many other targets. Check out this demo to see how.

Q. I thought Active Session History (ASH) was intended for real-time?
A. ASH can be use for both real-time and historical analysis. It is a black box that records session activities and helps to analyze across several performance dimensions. Click here to see a quick demo.

Q. What is the difference between Oracle Enterprise Manager 12c and Oracle Database Control?
A. Database Control is a subset which runs off a single database. Oracle Enterprise Manager 12c is a central repository which allows administrators to manage and monitor from a single console.

Q. How does Real-Time Automatic Database Diagnostics Monitor (ADDM) work – the database is hung right?
A. Real-Time ADDM, included in the Oracle Diagnostic Pack for Oracle Database, uses 2 different modes of connection to the database. A normal connection and a diagnostic mode that is a lock less, latch less connection which allows only few actions. Using the diagnostic mode connection Real-Time ADDM performs a hang analysis and determines any blockers in the systems. Check out this demo to see how Real-Time ADDM works.

Q. Can we achieve all the new functionality in Oracle Enterprise Manager 12c using terminal?
A. No. Features like ASH Analytics, Real-Time ADDM and Compare Period ADDM are only available using Oracle Enterprise Manager 12c's console.

Q. Is Real-Time ADDM available only in Oracle Enterprise Manager 12c?
A. Yes, Real-Time ADDM is a new feature in Oracle Enterprise Manager 12c.

Q. Can you explain the difference between Oracle Database 11g Monitor in Memory Access Mode vs. Real-Time ADDM direct access to SGA?
A. Real-Time ADDM does not use Memory attach mode. We use a proprietary connection method which connects using a lock less, latch less connection bypassing the SQL Access layer.

Q. Is there any limit on the number of days on which ASH can be used for analysis?
A. The in memory ASH data is typically available for 1 hour or till the point the memory buffer is flushed to disk. You can find out ASH retention by using: select min(sample_time), max(sample_time) from sys.WRH$_ACTIVE_SESSION_HISTORY;

From more Oracle Database Management product information check out these resources:

Stay Connected:

Twitter | Facebook | YouTube | Linkedin | Newsletter

Wednesday Aug 31, 2011

How to create a consolidated backlog indicator report for EM11

While working with customers, and looking through some of the setups and configurations of Enterprise Manager, one question typically comes up all the time:

Q) How do I know my Enterprise Manager site is running healthy?

And the one thing, that is by far the most important for Enterprise Manager, is to keep an eye on the backlog indicators Enterprise Manager itself is collecting.
By looking at those numbers, you can more or less predict performance 'hiccups', or some bad behavior starting to form.

To help Enterprise Administrators keeping an eye on these important infrastructure metrics, a report can be created in Enterprise Manager to show all the relevant data in one single page.
To create this report, follow these steps:

  • In the Console, go to the 'Reports' tab, and use the 'Create' button to create a new report
  • On the report creation screen, enter this information
General Tab
 Title      : Backlog Indicators
 Category   : Enterprise Manager Setup
 Subcategory: Enterprise Manager Health
 Target     : Specify a 'specific' one:
              Name: Management Services and Repository
              Type: OMS and Repository
 Time Period: Check the 'Allow the report viewer to customize the time period' checkbox

Elements Tab
Add 8 elements, 4 charts, 4 tables, and use the layout button to represent then on 4 rows like this:

     Row 1   
       Chart from SQL
       Table from SQL
     Row 2
       Chart from SQL
       Table from SQL
     Row 3
       Chart from SQL
       Table from SQL
     Row 4
       Chart from SQL
       Table from SQL

  • Details for Row 1 (XML Loader backlog)
    Chart:
    Header    : XML Loader Backlog - Historical
    Type      : Timeseries chart
    Legend    : Bottom
    SQL to use:

          SELECT SUBSTR(key_value,0,INSTR(key_value,':')-1), rollup_timestamp, average
          FROM   sysman.mgmt$metric_hourly
          WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
            AND  metric_guid = HEXTORAW('B72713257822A65853FDF0C77554F660')
            AND  rollup_timestamp BETWEEN ??EMIP_BIND_START_DATE?? AND ??EMIP_BIND_END_DATE??

          ORDER BY rollup_timestamp, key_value

Table:
Header: XML Loader Backlog - Current
SQL to use:
      SELECT SUBSTR(key_value,0,INSTR(key_value,':')-1), value
      FROM   sysman.mgmt$metric_current
      WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
        AND  metric_guid = HEXTORAW('B72713257822A65853FDF0C77554F660')
      ORDER BY key_value

  • Details for Row 2 (EM Job backlog)
    Chart:
    Header: Job Backlog - Historical
    Type  : Timeseries chart
    Legend: Bottom
    SQL to use:
          SELECT 'System', rollup_timestamp, average
          FROM   sysman.mgmt$metric_hourly
          WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
            AND  metric_guid = HEXTORAW('CA4FF4BB045B18ADD7CA465C47A696F5')
            AND  rollup_timestamp BETWEEN ??EMIP_BIND_START_DATE?? AND ??EMIP_BIND_END_DATE??
          ORDER BY rollup_timestamp
Table:
Header: Job Backlog - Current
SQL to use:
      SELECT 'System', value
      FROM   sysman.mgmt$metric_current
      WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
        AND  metric_guid = HEXTORAW('CA4FF4BB045B18ADD7CA465C47A696F5')
  • Details for Row 3 (Notification backlog)
    Chart:
    Header: Notification Delivery Backlog - Historical
    Type  : Timeseries chart
    Legend: Bottom
    SQL to use:
          SELECT key_value, rollup_timestamp, average
          FROM   sysman.mgmt$metric_hourly
          WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
            AND  metric_guid = HEXTORAW('F4450F5AD8E95174CBFA21A261D5993C')
            AND  rollup_timestamp BETWEEN ??EMIP_BIND_START_DATE?? AND ??EMIP_BIND_END_DATE??
            AND  key_value != 'RCA'
          ORDER BY rollup_timestamp, key_value
Table:
Header: Notification Delivery Backlog - Current
SQL to use:
      SELECT key_value, value
      FROM   sysman.mgmt$metric_current
      WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
        AND  metric_guid = HEXTORAW('F4450F5AD8E95174CBFA21A261D5993C')
        AND  key_value != 'RCA'
      ORDER BY key_value

  • Details for Row 4 (Repository Metrics backlog)
    Chart::
    Header: Repository Metrics Backlog - Historical
    Type  : Timeseries chart
    Legend: Bottom
    SQL to use:
          SELECT key_value, rollup_timestamp, average
          FROM   sysman.mgmt$metric_hourly
          WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
            AND  metric_guid = HEXTORAW('5175E215E86BCCD6A55CF3D883B6AF2D')
            AND  rollup_timestamp BETWEEN ??EMIP_BIND_START_DATE?? AND ??EMIP_BIND_END_DATE??
          ORDER BY rollup_timestamp, key_value
Table:
Header: Repository Metrics Backlog - Current
SQL to use:
      SELECT key_value, value
      FROM   sysman.mgmt$metric_current
      WHERE  target_guid = ??EMIP_BIND_TARGET_GUID??
        AND  metric_guid = HEXTORAW('5175E215E86BCCD6A55CF3D883B6AF2D')
      ORDER BY key_value
  • Now save the report, and click on it, so you can see the consolidated backlog report of the site.
    On the top of the screen, you will have the option to set a time period, to report on backlog of a given period.
    The default is the last 24 hours, but you can change this to any period containing hourly rollup date (which typically will be the last 31 days, unless the system has been tweaked to change this default retention time)
About

Latest information and perspectives on Oracle Enterprise Manager.

Related Blogs




Search

Archives
« March 2015
SunMonTueWedThuFriSat
1
2
3
4
5
7
8
9
10
11
12
13
14
15
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
    
       
Today