Monday Nov 05, 2012

Oracle Proactive Support: Automatic Capture of Diagnostic Data Upon First Failure!

Oracle Proactive Support have written a blog post covering how to use the FMW Diagnostic Framework tooling to generate an incident package and upload it to an Oracle Service Request. It also covers how to use the WLST and ADRCI command-line interfaces to inspect incidents and generate incident packages.

Check it out here

Thursday May 03, 2012

DFW JRockit Flight Recording Dump added in FMW 11.1.1.6.0

Starting in FMW 11.1.1.6.0, the FMW Diagnostic Framework includes a new diagnostic dump, "jvm.flightRecording", which wraps the calls required to generate a JRockit Flight Recorder file. Because the Flight Recording is available as a DFW diagnostic dump, it is easier to generate a recording and automatically add the output to a diagnostic incident. You can do this with the WebLogic Server Scripting Tool (WLST) "executeDump" command.

The example output below shows the dump being executed and its output added to incident "1":

wls:/base_domain/serverConfig> executeDump(name="jvm.flightRecording", id="1")
Dump file jvm_flightRecording10_i1.txt added to incident 1

wls:/base_domain/serverConfig> showIncident(id="1")
Incident Id: 1
Problem Id: 1
Problem Key: DFW-99999 [MANUAL]
Incident Time: Thu May 03 08:00:18 PDT 2012
Error Message Id: DFW-99999
Execution Context:
Flood Controlled: false
Dump Files :
   readme.txt
   jvm_threads3_i1.txt
   dms_metrics4_i1.txt
   odl_logs7_i1.txt
   diagnostic_image_AdminServer_2012_05_03_08_00_19.zip
   jvm_flightRecording10_i1.txt
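showIncident prints plain text, so a script can check whether a dump such as jvm.flightRecording was attached. The following is a minimal sketch in plain Python. It is not part of WLST or DFW. It assumes the output was captured as a string in the form WLST prints it: one "Key: value" field per line, with the dump file names listed after "Dump Files :". The helper name parse_show_incident is hypothetical.

```python
def parse_show_incident(text):
    """Parse captured showIncident() text into (fields, dump_files).

    Hypothetical helper. It assumes one 'Key: value' field per line and
    dump file names on the lines after 'Dump Files :' (one per line or
    space-separated).
    """
    fields = {}
    dump_files = []
    in_dumps = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_dumps:
            # Every remaining token is treated as a dump file name.
            dump_files.extend(line.split())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that are not 'Key: value' pairs (e.g. the prompt echo)
        key = key.strip()
        if key == "Dump Files":
            in_dumps = True
            dump_files.extend(value.split())
        else:
            fields[key] = value.strip()
    return fields, dump_files
```

To check whether the flight recording is attached to the incident, you could use `any(name.startswith("jvm_flightRecording") for name in dumps)`.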

You can read further information on DFW here

You can read further information on JRockit here

Tuesday Jul 12, 2011

Remote Diagnostic Agent (RDA) & DFW Collections

“It's late, you have a tight deadline, and there is a problem in Fusion Middleware. What do you do?”

One thing you could do is run the Remote Diagnostic Agent (RDA) tool to capture diagnostic information (collections) about the installation, which may be used to help resolve the issue.

My Oracle Support has some good knowledge articles that give you a grounding in RDA, so that is outside the scope of this article. Search for notes 314422.1 (Getting Started) and 330363.1 (FAQ). Instead, I’m going to focus on the diagnostics captured for the FMW Diagnostic Framework (DFW). I assume familiarity with How the FMW Diagnostic Framework Works, because I want to show where to find the DFW diagnostics in RDA rather than re-explain DFW.

So when did RDA start capturing DFW data?

RDA is available as a separate standalone download. From 11g onwards, RDA also ships with FMW. RDA includes a series of modules, each of which captures a collection of data. The WREQ module performs the DFW collection (among other diagnostics). For an optimal DFW collection, I recommend RDA 4.24 or later. This version will ship with FMW 11g PS5. Alternatively, RDA may be downloaded from OTN.

Where can I find DFW data in an RDA output?

In FMW 11g, RDA is located under $ORACLE_HOME. See the readme there for more details on how to configure RDA and perform collections. Once RDA has been configured and collections obtained, start by opening the resulting *__start.htm output file in a browser to view the RDA output.
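If you collect RDA output on one host and copy it elsewhere to review, a small script can locate the report's entry page. This is a minimal sketch. It assumes the RDA output has been copied into a local directory. find_rda_start_pages is a hypothetical helper that only matches the *__start.htm naming mentioned above.

```python
from pathlib import Path


def find_rda_start_pages(output_dir):
    """Return the RDA report entry pages (names ending in '__start.htm') under output_dir.

    Hypothetical helper: rglob searches output_dir and all of its subdirectories.
    """
    return sorted(Path(output_dir).rglob("*__start.htm"))
```

You can then open the first path returned in a browser, for example with Python's `webbrowser.open(path.resolve().as_uri())`.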

Figure 1) WLS Servers

In the top left-hand frame, select a Server link under Oracle WebLogic Server. In this example, two managed servers are highlighted, ‘Server-0’ and ‘Server-1’. Notice also the ‘AdminServer’.

I’m going to select “AdminServer”.

Figure 2) Diagnostic Repository

For the selected WLS Server, scroll almost to the bottom of the bottom left-hand frame until you see "Log Files / Diagnostic Repository" (or use your browser's find function). There are three links:

- Problem Overview
- Most Recent Incidents
- Files from Latest Incidents

These are the links we are interested in.

Problem Overview

Clicking the first link, Problem Overview, displays something like the following in the right-hand frame of your browser.

Figure 3) Problem Overview

Problems are listed with details of their last occurrence (incident ID and timestamp). Notice that in this sample the last incident for problem ID 3 is incident 7.

Most Recent Incidents

Figure 4) Most Recent Incidents

From the overview, we know the ID of the most recent incident for each problem, so we can drill down further by selecting an incident we are interested in.

Drilling down on Incident Id 7, for example…

Figure 5) Incident 7

Quite a lot of information is captured about the incident, including when it first occurred and where on the file system the associated dump files reside.

Files from Latest Incidents

From Problem Overview and Most Recent Incidents we know which recent incidents we may want to investigate. Files from Latest Incidents lets us review the associated dump files.

Figure 6) Files from Latest Incidents

Note: Because of bug 12720093, which affects DFW collections, the incidents listed under Files from Latest Incidents are not in sync with Problem Overview and Most Recent Incidents. This example nevertheless shows the expected behaviour. This issue will be fixed in RDA 4.25.

The first file to examine is readme.txt, which gives a good summary of the incident. You can then mine the diagnostic dump files for further details. The wls_image file is a diagnostic snapshot produced by the Diagnostic Image Capture component of the Oracle WebLogic Diagnostic Framework (WLDF). The other .dmp files may be opened in a text editor.

Monday Nov 15, 2010

How the FMW Diagnostic Framework Works

In my last post I introduced the FMW Diagnostic Framework (DFW), covering its goals, key components and concepts. In this post I will drill down into how DFW detects and creates diagnostic incidents.

At the heart of the incident detection and creation process is an internal component named "Diagnostics Data Extractor" (DDE). Below is a summary of the features that DDE covers:

  • Active in each WebLogic Server
  • Rule driven incident detection and creation
  • Incident detection
    • java.util.logging log filter
    • WLDF Diagnostic Module JMX notifications
  • Incident creation
    • Default set of dumps to execute
    • Rule based dump execution
  • Java API for creating incidents
  • Gathers incident metadata (correlation keys, component info, etc.)
  • Flood control
  • Support for creating manual incidents
  • Default rules detect INCIDENT_ERROR level log messages
  • Default WLDF Watch/Notification Configuration:
    • Out of memory errors
    • Uncaught exceptions
    • Thread deadlocks
    • Stuck threads

The Diagnostic Framework is active in each server and provides automatic error detection through predefined rules. Oracle Fusion Middleware components and applications automatically benefit from this always-on checking. Customers cannot currently modify the rules. FMW components define and register them. Diagnostic rules specify when an incident should be created and which diagnostics to collect with it. By default, DFW creates an incident when a message is logged at the INCIDENT_ERROR log level. The rules allow fine-grained collection of the diagnostics most likely to help diagnose an incident.

Incidents are automatically detected in two ways:

  1. By the incident detection log filter, which is automatically configured to detect critical errors.
  2. By the WLDF Watch and Notification component. The Diagnostics Framework listens for a predefined notification type and creates incidents when it receives such notifications.

In addition Oracle Fusion Middleware components may use the DFW Java API directly to signal an incident.

Log detection based incident creation

The diagram below shows the interaction when the incident is detected by the incident log detector. It shows the interaction between the log detection part of DDE, diagnostic dumps and ADR.

(Diagram: log detection based incident creation)

The steps represented above are:

  1. The incident detection log filter is initialized with component and application diagnostic rules.
  2. An application or component (in this case Oracle WebCenter) logs a message using the java.util.logging API.
  3. The ODL log handler passes the message to the incident detection log filter.
  4. The incident detection log filter inspects the log message to see whether an incident should be created, basing its decision on the diagnostic rules for the component. If the diagnostic rule indicates that an incident should be created, it creates an incident in the ADR.
  5. The ODL log handler writes the log message to the log file, and returns control back to Oracle WebCenter.

  6. When an incident is created, a message similar to the following is written to the log file:

    [2010-02-16T06:37:59.264-07:00] [dfw] [NOTIFICATION] [DFW-40104] [oracle.dfw]
    [tid: 10] [ecid: 0000IF34gtMC8xT6uBf9EH1AgEck000000,0] [errid: 6]
    [detailLoc: /middleware/user_projects/base_domain/servers/AdminServer/adr/diag/ofm/base_domain/AdminServer]
    [probKey: MDS-123456 [testComponent][testModule]] incident 6 created with
    problem key "MDS-123456 [testComponent][testModule]", in directory
    /middleware/user_projects/base_domain/servers/AdminServer/adr/diag/ofm/base_domain/AdminServer/incident/incdir_6

  7. The Diagnostic Framework executes the diagnostic dumps that are indicated by the diagnostic rules for the component.
  8. The Diagnostic Framework writes the dumps to ADR, in the directory created for the incident.
  9. The Diagnostic Framework invokes the WLDF Diagnostic Image MBean requesting that a Diagnostic Image be created in ADR.
  10. WLDF writes the Diagnostic Image to ADR.
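The DFW-40104 message in step 6 carries the incident ID, problem key and incident directory, so a script can pick it out of a log file. The sketch below is an illustration, not an Oracle tool. The regular expression is derived only from the single sample message in step 6 and assumes that exact wording. Other log formats or message versions may need a different pattern. parse_incident_created is a hypothetical name.

```python
import re

# Pattern derived from the one sample DFW-40104 message; the message may
# span several lines, so DOTALL lets '.*?' cross line breaks.
INCIDENT_CREATED = re.compile(
    r'\[DFW-40104\].*?incident (?P<incident_id>\d+) created with\s+'
    r'problem key "(?P<problem_key>[^"]+)", in directory\s+(?P<directory>\S+)',
    re.DOTALL,
)


def parse_incident_created(message):
    """Return the incident ID, problem key and directory, or None if no match."""
    match = INCIDENT_CREATED.search(message)
    if match is None:
        return None
    return {
        "incident_id": int(match.group("incident_id")),
        "problem_key": match.group("problem_key"),
        "directory": match.group("directory"),
    }
```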

At step 4 above, DFW may also execute dumps that are designed to run on the same thread as the code logging the message. One example is capturing an accurate stack trace leading up to the logging call.
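The following sketch makes the log detection flow more concrete. It uses Python's standard logging module as a stand-in for java.util.logging and the ODL handler. It is an analogy under stated assumptions, not DFW code. The INCIDENT_ERROR level value (60), the rule format, the class name IncidentDetectionFilter and the dump names are all hypothetical. Incidents are kept in a list rather than written to ADR. The comments map each part to the numbered steps above.

```python
import logging

# Hypothetical stand-in for ODL's INCIDENT_ERROR level, placed above CRITICAL (50).
INCIDENT_ERROR = 60
logging.addLevelName(INCIDENT_ERROR, "INCIDENT_ERROR")


class IncidentDetectionFilter(logging.Filter):
    """Sketch of an incident detection log filter.

    rules maps a logger (component) name to the dumps to run for it; the ""
    entry is the default rule. A list stands in for ADR.
    """

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.incidents = []

    def filter(self, record):
        # Step 4: decide from the rules whether this record creates an incident.
        if record.levelno >= INCIDENT_ERROR:
            dumps = self.rules.get(record.name, self.rules.get("", []))
            self.incidents.append({
                "id": len(self.incidents) + 1,
                "message": record.getMessage(),
                # The filter runs synchronously, i.e. on the thread that logged the message.
                "thread": record.threadName,
                # Steps 7-8 would execute these dumps and write them to ADR.
                "dumps": list(dumps),
            })
        # Step 5: always return True so the handler still writes the message.
        return True


# Step 1: initialize the filter with (illustrative) rules and attach it to a handler.
incident_filter = IncidentDetectionFilter({
    "": ["jvm.threads"],
    "webcenter": ["jvm.threads", "odl.logs"],
})
handler = logging.StreamHandler()
handler.addFilter(incident_filter)   # step 3: the handler passes records to the filter
logger = logging.getLogger("webcenter")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Step 2: the component logs messages; only the first one creates an incident.
logger.log(INCIDENT_ERROR, "Critical failure rendering page")
logger.warning("Slow response from content repository")
```

Returning True from the filter matters. The filter only observes the message, so logging continues normally whether or not an incident was created.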

WLDF Notification based incident creation

The diagram below shows the interaction when an incident is detected by the WLDF Watch and Notification system. It shows the interaction among the DDE incident notification listener, the WLDF Watch and Notification system, diagnostics dumps and ADR.

(Diagram: WLDF notification based incident creation)

The steps represented above are:

  1. The incident notification listener is initialized with component and application diagnostic rules.
  2. The Oracle Fusion Middleware Diagnostic Framework registers a JMX notification listener with WLDF. The listener listens for events from the WLDF Watch and Notification system. It only processes notifications of type oracle.dfw.wldfnotification.
  3. Something in the system causes the configured WLDF watch to be triggered, causing a notification to be sent to the incident notification listener. The notification includes event information describing the data that caused the watch to trigger.
  4. The Diagnostic Framework creates an incident in ADR.
  5. The Diagnostic Framework executes the diagnostic dumps that are indicated by the diagnostic rules.
  6. The Diagnostic Framework writes the dumps to ADR, in the directory created for the incident.
  7. The Diagnostic Framework invokes the WLDF Diagnostic Image MBean requesting that a Diagnostic Image be created in ADR.
  8. WLDF writes the Diagnostic Image to ADR.
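The key point of step 2 is that the listener ignores every notification except those of type oracle.dfw.wldfnotification. The sketch below models that filtering in plain Python. It is not the JMX API. The Notification class, the handle_notification method name (borrowed from the shape of the Java NotificationListener interface) and the incident structure are all hypothetical.

```python
from dataclasses import dataclass, field

DFW_NOTIFICATION_TYPE = "oracle.dfw.wldfnotification"


@dataclass
class Notification:
    """Minimal stand-in for a JMX notification: a type string plus event data."""
    type: str
    event_data: dict = field(default_factory=dict)


class IncidentNotificationListener:
    """Sketch: create an incident only for DFW-typed WLDF notifications."""

    def __init__(self, dumps):
        self.dumps = list(dumps)   # dumps named by the (hypothetical) rules, step 1
        self.incidents = []        # stands in for ADR

    def handle_notification(self, notification):
        if notification.type != DFW_NOTIFICATION_TYPE:
            return None            # step 2: other notification types are ignored
        incident = {
            "id": len(self.incidents) + 1,                # step 4: create the incident
            "watch_data": dict(notification.event_data),  # step 3: data that triggered the watch
            "dumps": list(self.dumps),                    # steps 5-6: dumps to execute and store
        }
        self.incidents.append(incident)
        return incident
```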

Fusion Middleware configures a WLDF Diagnostics Module that contains a set of Watch and Notification rules for detecting a specific set of critical errors and creating an incident for each occurrence of those errors. The module is called Module-FMWDFW and contains the following set of Watch conditions:

  • Deadlock - Two or more Java threads have circular lock chains among their Java Monitor object usage.
  • StuckThread - An Oracle WebLogic Server ExecuteThread that is blocked or busy for longer than the time specified by the Oracle WebLogic Server StuckThreadMaxTime parameter.
  • UncheckedException - All unchecked exceptions (RuntimeException and Error subclasses) caught by an Oracle WebLogic Server ExecuteThread, such as NullPointerException, StackOverflowError, or OutOfMemoryError.
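The Deadlock condition describes a cycle in the "waits for" relationship between threads. The sketch below shows that idea as cycle detection over a map from each blocked thread to the thread that owns the monitor it wants. It illustrates the definition only. It is not how WLDF or the JVM implement detection; in Java, ThreadMXBean.findDeadlockedThreads performs this check. The input format and the function name are assumptions.

```python
def find_deadlocked_threads(waits_for):
    """Return the set of threads that sit on a circular wait chain.

    waits_for maps a thread name to the name of the thread owning the monitor
    it is blocked on; running threads are absent. Each blocked thread waits on
    one monitor, so following the map from any thread either reaches a running
    thread or enters a cycle.
    """
    deadlocked = set()
    for start in waits_for:
        path = []            # threads visited while following the chain from start
        current = start
        while current in waits_for and current not in path:
            path.append(current)
            current = waits_for[current]
        if current in path:
            # The chain looped back: threads from the first repeat onward form the cycle.
            deadlocked.update(path[path.index(current):])
    return deadlocked
```

A thread that only waits on a deadlocked thread is blocked but is not part of the cycle, so it is not reported.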

The Diagnostic Module also includes a configured WLDF JMX Notification FMWDFW-notification of type oracle.dfw.wldfnotification. You can reuse this WLDF JMX Notification for your own WLDF Watch conditions in order to create an incident:

  1. Display the Administration Console.
  2. In the Change Center, click Lock & Edit.
  3. In the left pane, expand Diagnostics and select Diagnostic Modules. The Summary of Diagnostic Modules page is displayed.
  4. Click Module-FMWDFW. The Settings for Module-FMWDFW page is displayed.
  5. Select the Watches and Notifications tab.
  6. Select the Watches tab and click New. The Create Watch page is displayed.
  7. For Name, enter a name for the watch.
  8. For Watch Type, select a type.
  9. Click Next.
  10. For Current Watch Rule, construct an expression. For example, (SEVERITY = 'Error') AND (MSGID = 'BEA-000337').
  11. Click Next.
  12. Select an alarm type.
  13. For Notifications, select FMWDFW-notification.
  14. Click Finish.
  15. In the Change Center, click Activate Changes.

For more information on creating watches, see "Construct watch rule expressions" in the Administration Console Online Help.

Incident flood control

It is conceivable that a problem could generate dozens or perhaps hundreds of incidents in a short period of time. This would generate too much diagnostic data, which would consume too much space in the ADR and could possibly slow down your efforts to diagnose and resolve the problem. For these reasons, the Diagnostic Framework applies flood control to incident generation after certain thresholds are reached. A flood-controlled incident is an incident that is not recorded in the ADR. Instead, a message at the WARNING level is written to the log file. Flood-controlled incidents provide a way of informing you that a critical error is ongoing, without overloading the system with diagnostic data.
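Flood control can be thought of as a rate limit per problem key. The following is a minimal sketch of that idea. It assumes a limit of max_incidents recorded incidents per problem key within a sliding window of window_seconds. Both parameter names and the sliding-window approach are hypothetical. They are not DFW's actual configuration or algorithm.

```python
from collections import defaultdict, deque


class FloodControl:
    """Sketch of per-problem-key incident flood control (hypothetical parameters).

    At most max_incidents incidents are recorded per problem key within any
    window_seconds-long period; further occurrences are flood controlled.
    """

    def __init__(self, max_incidents, window_seconds):
        self.max_incidents = max_incidents
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)   # problem key -> times of recorded incidents

    def should_record(self, problem_key, now):
        recent = self._recent[problem_key]
        # Drop times that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) < self.max_incidents:
            recent.append(now)
            return True
        return False   # flood controlled: nothing is recorded in ADR
```

When should_record returns False, the caller would write a WARNING-level log message instead of creating an incident in ADR, as described above.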

Full documentation on DFW can be found in chapter 10 of the Oracle Fusion Middleware Administrator's Guide.

In further posts I will cover:

- Working with Problems and Incidents
- Diagnostic dumps in detail
- Configuring DFW
- Supportability flows

Wednesday Oct 06, 2010

Introduction to FMW Diagnostic Framework

Oracle Fusion Middleware includes a Diagnostic Framework (DFW) which aids in detecting, diagnosing, and resolving problems. The problems that are targeted in particular are critical errors such as those caused by code bugs, deadlocked threads, out of memory errors, and uncaught exceptions. DFW is available with all FMW 11g installations that run on WebLogic Server.

The goals of DFW are:

  • First-failure diagnostics
  • Reducing problem diagnostic time
  • Reducing problem resolution time
  • Simplifying customer interaction with Oracle Support
  • Speeding up internal testing cycles

For me, the important thing is that the relevant diagnostics are captured at the moment of failure. This gives the customer and Oracle Support the best possible start towards resolving the issue. For example, when a deadlocked thread is detected, a thread dump should be captured automatically, detailing all threads and pinpointing those that are deadlocked.

The framework grew out of a diagnostics project that started with the Oracle Database 11g release. In that release, the database development group came up with a set of concepts and infrastructure for capturing, recording and indexing diagnostics in a consistent way. Out of this project came ADR (Automatic Diagnostic Repository), a file-system repository for cataloguing occurrences of failures and storing the associated diagnostic data. ADR was designed so that other Oracle products could integrate with it. The goal was consistency not only for Oracle Database diagnostics but for products across the Oracle stack. In FMW 11gR1, ADR and its concepts were adopted, and a framework was built to extend it to support FMW environments.

For more information on ADR refer to the Oracle Database Administrator's Guide.

What are the concepts?

A Problem is a critical error. It could be due to an internal error, a server error (e.g. a thread deadlock) or a configuration error that results in a critical condition. Each Problem has a Problem Key, a text value used to associate incidents with problems. The key is based on the error message ID and other context values. Problems are tracked in ADR.

An Incident is a single occurrence of a problem. An incident is created for each occurrence of a problem (critical error), subject to flood control. Each incident has a unique ID. When DFW logs a message indicating that an incident has been created, an administrator can use that ID to look up the associated diagnostics in ADR.
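The relationship between problems and incidents is essentially a grouping by problem key. The sketch below illustrates it in plain Python under stated assumptions. The key format simply imitates the sample key "MDS-123456 [testComponent][testModule]". The Problem, ProblemTracker and make_problem_key names are hypothetical and are not DFW or ADR APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    problem_id: int
    problem_key: str
    incident_ids: list = field(default_factory=list)


def make_problem_key(message_id, *context):
    """Hypothetical key format: message ID followed by bracketed context values."""
    suffix = "".join(f"[{value}]" for value in context)
    return f"{message_id} {suffix}" if suffix else message_id


class ProblemTracker:
    """Sketch: each incident gets a new ID and is filed under the problem for its key."""

    def __init__(self):
        self.problems = {}          # problem key -> Problem
        self.next_incident_id = 1

    def record_incident(self, problem_key):
        problem = self.problems.get(problem_key)
        if problem is None:
            # First occurrence of this key: start tracking a new problem.
            problem = Problem(len(self.problems) + 1, problem_key)
            self.problems[problem_key] = problem
        incident_id = self.next_incident_id
        self.next_incident_id += 1
        problem.incident_ids.append(incident_id)
        return incident_id
```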

What are the components?

Automatic Diagnostic Repository

The Automatic Diagnostic Repository (ADR) is a file-based repository for storing diagnostics data associated with incidents. It consists of metadata that describes each Problem and Incident, along with the set of diagnostic dump output generated for each incident.

Each WebLogic Server has its own ADR. The ADR root directory is known as the ADR base. By default, the ADR base is located in the following directory:

DOMAIN_HOME/servers/server_name/adr

Within ADR base, there can be multiple ADR homes, where each ADR home is the root directory for all incident data for a particular instance of Oracle WebLogic Server. The following path shows the location of the ADR home:

ADR_BASE/diag/ofm/domain_name/server_name
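The default locations compose mechanically, so a script can work out where a server's incidents live. This sketch assumes the default layout described here (DOMAIN_HOME/servers/server_name/adr and ADR_BASE/diag/ofm/domain_name/server_name), the incident/incdir_n subdirectories, and a UNIX-style path. adr_paths is a hypothetical helper. A non-default ADR base would need different handling.

```python
from pathlib import PurePosixPath


def adr_paths(domain_home, domain_name, server_name, incident_id=None):
    """Build default ADR locations for one WebLogic Server (hypothetical helper)."""
    adr_base = PurePosixPath(domain_home) / "servers" / server_name / "adr"
    adr_home = adr_base / "diag" / "ofm" / domain_name / server_name
    paths = {"adr_base": adr_base, "adr_home": adr_home, "alert": adr_home / "alert"}
    if incident_id is not None:
        # Each incident has its own incdir_n subdirectory under ADR home.
        paths["incident_dir"] = adr_home / "incident" / f"incdir_{incident_id}"
    return paths
```

With DOMAIN_HOME /middleware/user_projects/base_domain, domain base_domain, server AdminServer and incident 6, this produces the incident directory shown in the sample DFW-40104 log message.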

The image below shows the ADR directory structure for Fusion Middleware.

(Diagram: ADR directory structure for Fusion Middleware)

The subdirectories in the ADR home contain the following information:

  • alert - The XML-formatted alert log.
  • incident - A directory that can contain multiple subdirectories, where each subdirectory is named for a particular incident. The subdirectories are named incdir_n, with n representing the number of the incident. Each subdirectory contains information and diagnostic dumps pertaining only to that incident.
  • (others) - Other subdirectories of ADR home, which store incident packages and other information.
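Each incident has its own incdir_n subdirectory, so you can list the incidents held in an ADR home by scanning its incident directory. This is a minimal sketch that relies only on the naming convention described above. list_incident_dirs is a hypothetical helper. ADRCI remains the supported way to query and package incidents.

```python
import re
from pathlib import Path

INCDIR = re.compile(r"incdir_(\d+)$")


def list_incident_dirs(adr_home):
    """Return {incident_id: path} for each incdir_n directory under ADR_HOME/incident."""
    incident_root = Path(adr_home) / "incident"
    if not incident_root.is_dir():
        return {}
    found = {}
    for entry in incident_root.iterdir():
        match = INCDIR.match(entry.name)
        if entry.is_dir() and match:
            found[int(match.group(1))] = entry
    # Sort by numeric incident ID rather than by directory name.
    return dict(sorted(found.items()))
```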

 The ADR Command Interpreter (ADRCI) is a utility that enables you to investigate problems, and package and upload first-failure diagnostic data to Oracle Support, all within a command-line environment. ADRCI also enables you to view the names of the dump files in the ADR, and to view the alert log with XML tags stripped, with and without content filtering.

ADRCI is installed in the following directory:

(UNIX) MW_HOME/wlserver_10.3/server/adr
(Windows) MW_HOME\wlserver_10.3\server\adr

Diagnostic Dumps

Diagnostic dumps perform targeted diagnostic data capture. They run either when an incident is created or on demand when an administrator requests them. FMW components and applications generally implement them, and each dump is configured to run for the appropriate types of critical error. Example diagnostic dumps include:

  • Thread dump
  • Execution Context (all active ones)
  • Active HTTP requests
  • Class histogram
  • DMS Metrics
  • Logs (by ECID)
  • Logs (by timestamp up to -/+ 5 minutes)
  • WLDF Diagnostic Image
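The "Logs (by timestamp)" dump gathers log entries around the incident time. The sketch below shows only that windowing idea. It assumes log entries have already been parsed into (datetime, message) pairs and uses a window of 5 minutes on either side. logs_near_incident is a hypothetical name, and this is not the DFW dump's implementation.

```python
from datetime import datetime, timedelta


def logs_near_incident(entries, incident_time, window=timedelta(minutes=5)):
    """Return (timestamp, message) entries within +/- window of incident_time (inclusive)."""
    start, end = incident_time - window, incident_time + window
    return [(ts, msg) for ts, msg in entries if start <= ts <= end]


# Example: an incident at 08:00:18 keeps entries from 07:55:18 through 08:05:18.
incident_time = datetime(2012, 5, 3, 8, 0, 18)
entries = [
    (incident_time - timedelta(minutes=6), "old"),
    (incident_time - timedelta(minutes=5), "edge"),
    (incident_time, "incident"),
    (incident_time + timedelta(minutes=4), "after"),
    (incident_time + timedelta(minutes=10), "late"),
]
selected = logs_near_incident(entries, incident_time)
```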

These dumps will be looked at in more detail in a future post.

Management MBeans

DFW provides MBeans that allow you to:


  • Configure DFW
  • Query Problems and Incidents
  • Create manual incidents
  • Upload files and associate them with an existing incident
  • Query available diagnostic dumps
  • Execute diagnostic dumps
  • Download diagnostic dump data

All of the DFW MBeans are available under the "oracle.dfw" MBean domain.

WLST Commands

DFW provides WLST commands that you can use to view information about problems and incidents, create incidents, execute specific dumps and query the set of diagnostic dump types. Refer to the "Diagnostic Framework Custom WLST Commands" documentation for more information.


Full documentation on DFW can be found in chapter 12 of the Oracle Fusion Middleware Administrator's Guide.

In further posts I will cover:

  • The process of detecting and creating incidents
  • Working with Problems and Incidents
  • Integration with WLDF
  • Diagnostic dumps in detail
  • Configuring DFW
  • Supportability flows

Monday Aug 23, 2010

Welcome to the Fusion Middleware Diagnostics Blog

Fusion Middleware consists of many product suites and components, some new to the Oracle suite of products and some that have been around for a long time. Oracle has done a great job of aligning these products on multiple levels, including manageability and diagnosability, providing a consistent environment and set of tools for customers to work with.

With customers building out both small and large scale deployments from these components it becomes even more important to have good management and diagnostics support in the products and infrastructure. This blog will attempt to provide insight into the diagnostic features and tools available in Fusion Middleware, as well as providing tips and techniques for diagnosing issues.

Some of the areas we hope to cover in the coming months include:

  • FMW Diagnostic Framework (DFW)
  • Dynamic Monitoring Service (DMS)
  • Oracle Diagnostic Logging (ODL)
  • Remote Diagnostic Agent (RDA)

If you have any suggestions for topics to be covered in the area of Fusion Middleware diagnostics feel free to leave a comment.

About

bocadmin_ww
