
Proactive insights, news and tips from Oracle WebLogic Server Support. Learn Oracle from Oracle.

  • October 29, 2015

Using Diagnostic Context for Correlation

The WebLogic Diagnostics Framework (WLDF) and Fusion Middleware Diagnostics Monitoring System (DMS) provide correlation information in diagnostic artifacts such as logs and Java Flight Recorder (JFR).

The correlation information flows along with a Request across threads within and between WebLogic server processes, and can also flow across process boundaries to and from other Oracle products (such as from OTD or to the Database). This correlation information is exposed in the form of unique IDs which can be used to identify and correlate the flow of a specific request through the system. It can also provide details on the ordering of the flow.

The correlation IDs are described as follows:

  • DiagnosticContextID (DCID) and ExecutionContextID (ECID). This is the unique identifier for the Request flowing through the system. While the name of the ID differs depending on whether you are using WLDF or DMS, it is the same ID. I will be using the term ECID, as that is the name used in the broader set of Oracle products.
  • Relationship ID (RID). This ID describes where in the overall flow (or tree) the Request currently is. The ID itself is an ordered set of numbers that describes the location of each task in the tree of tasks. The leading number is usually a zero. A leading number of 1 indicates that it has not been possible to track the location of the sub-task within the overall sub-task tree.
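
To make the tree idea concrete, here is a minimal sketch of how RID values could be interpreted. It assumes an RID is written as colon-separated integers such as "0:1:2"; that textual form, the class name RidTree, and the sample values are assumptions for illustration and are not taken from WLDF or DMS. The ordering it produces is tree order (parents before children), which is not necessarily chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RidTree {
    // Assumption: an RID is rendered as colon-separated integers, e.g. "0:1:2".
    public static int[] parse(String rid) {
        return Arrays.stream(rid.split(":")).mapToInt(Integer::parseInt).toArray();
    }

    // Number of levels below the root task: "0" -> 0, "0:1" -> 1.
    public static int depth(String rid) {
        return parse(rid).length - 1;
    }

    // RID of the parent task, or null when the RID has no parent.
    public static String parent(String rid) {
        int cut = rid.lastIndexOf(':');
        return cut < 0 ? null : rid.substring(0, cut);
    }

    // Per the description above, a leading 1 means the location of the
    // sub-task in the tree could not be tracked.
    public static boolean isLocationTracked(String rid) {
        return parse(rid)[0] != 1;
    }

    // Tree order: compare number by number; a parent sorts before its children.
    public static final Comparator<String> TREE_ORDER = (a, b) -> {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return Integer.compare(x.length, y.length);
    };

    public static void main(String[] args) {
        // Hypothetical RIDs collected for a single ECID.
        List<String> rids = new ArrayList<>(Arrays.asList("0:1:1", "0", "0:2", "0:1", "0:10"));
        rids.sort(TREE_ORDER);
        for (String rid : rids) {
            StringBuilder indent = new StringBuilder();
            for (int i = 0; i < depth(rid); i++) {
                indent.append("  ");
            }
            System.out.println(indent + rid);
        }
    }
}
```

Comparing the numbers rather than the raw strings matters once a task has ten or more children, since "0:10" would otherwise sort before "0:2".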

These correlation IDs have been around for quite a long time. What is new in 12.2.1 is that WLDF now picks up some capabilities from DMS (even when DMS is not present):

  1) The RelationshipID (RID) feature from DMS is now supported
  2) The ability to handle correlation information coming in over HTTP
  3) The ability to propagate correlation out over HTTP when using the WebLogic HTTP client
  4) The concept of a non-inheritable Context (not covered in this blog, may be the topic of another blog)

For
this blog, we will walk through a simple contrived scenario to show how
an administrator can make use of this correlation information to
quickly find the data available related to a particular Request flow. This diagram shows the basic scenario:


Each arrow in the diagram shows where a Context propagation could occur; however, in our example propagation occurs only where we have solid blue arrows. The reason is that we are using a Browser client, which does not supply a Context, so the first place where a Context is created is when MySimpleServlet is called. Note that a Context could propagate into MySimpleServlet if it is called by a client capable of providing the Context (for example, a DMS-enabled HTTP client, a 12.2.1+ WebLogic HTTP client, or OTD).

In our contrived applications, each level queries the value of the ECID/RID using the DiagnosticContextHelper API, and the servlet reports these values. A real application would not do this; it is done here only so that our servlet can display them.
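
As a rough illustration of that kind of lookup, the sketch below reads the two values reflectively, so that it compiles and runs even where the WebLogic classes are not on the classpath. The class name ContextIds is hypothetical. The static method getContextId() is part of weblogic.diagnostics.context.DiagnosticContextHelper; the RID accessor name getRelationshipId() is an assumption here and should be checked against the 12.2.1 Javadoc. This is not the code from blog_example.zip.

```java
import java.lang.reflect.Method;
import java.util.Optional;

public class ContextIds {
    private static final String HELPER =
            "weblogic.diagnostics.context.DiagnosticContextHelper";

    // Invoke a static no-argument method on the helper class, if it is available.
    // Returns empty when WebLogic is absent, the method is missing, or the value is null.
    static Optional<String> call(String methodName) {
        try {
            Class<?> helper = Class.forName(HELPER);
            Method method = helper.getMethod(methodName);
            Object value = method.invoke(null);
            return Optional.ofNullable(value).map(Object::toString);
        } catch (ReflectiveOperationException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static Optional<String> ecid() {
        return call("getContextId");
    }

    public static Optional<String> rid() {
        // Assumption: the accessor name for the RID; verify before relying on it.
        return call("getRelationshipId");
    }

    public static void main(String[] args) {
        System.out.println("ECID: " + ecid().orElse("(not available)"));
        System.out.println("RID:  " + rid().orElse("(not available)"));
    }
}
```

Inside an application deployed to WebLogic, a direct static call on the helper class would be the more natural choice; reflection is used here only to keep the sketch free of a compile-time dependency.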

We also have the EJB hard-coded to throw an Exception if the servlet request was supplied with a query string. The application logs warnings when that is detected, and the warning log messages automatically get the ECID/RID values included in them. The application does not need to do anything special to get them.

The applications used here, as well as some basic instructions, are attached in blog_example.zip.

First we will show hitting our servlet with a URL that is not expected to fail (http://myhost:7003/MySimpleServlet/MySimpleServlet):




From the screen shot above we can see that all of the application components are reporting the same ECID (f7cf87c6-9ef3-42c8-80fa-e6007c56c21f-0000022f). We can also see that the RIDs reported by the components are different and show the relationship between each of the components:


Next we will show hitting our servlet with a URL that is expected to fail (http://myhost:7003/MySimpleServlet/MySimpleServlet?fail):

We see that the EJB reported that it failed. In our contrived example app, we can see that the ECID for the entire flow where the failure occurred was "f7cf87c6-9ef3-42c8-80fa-e6007c56c21f-00000231". In a real application, the ECID would not be displayed like this. An administrator would most likely first see warnings reported in the various server logs, and see the ECID reported with those warnings. Since we know the ECID in this case, we can "grep" for it to show what those warnings would look like and that they have the ECID/RID reported in them:
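
Where grep is not at hand, the same search can be approximated with a few lines of Java. This is a minimal sketch: the class name EcidGrep is hypothetical, the log file paths are whatever is passed on the command line, and it assumes only that the ECID appears verbatim somewhere on each relevant log line, without relying on any particular log record layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EcidGrep {
    // Return every line of the log file that contains the given ECID.
    // Files.lines decodes the file as UTF-8.
    public static List<String> linesWithEcid(Path logFile, String ecid) throws IOException {
        try (Stream<String> lines = Files.lines(logFile)) {
            return lines.filter(line -> line.contains(ecid)).collect(Collectors.toList());
        }
    }

    // Usage: java EcidGrep <ecid> <logfile> [<logfile> ...]
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: EcidGrep <ecid> <logfile>...");
            return;
        }
        String ecid = args[0];
        for (int i = 1; i < args.length; i++) {
            Path log = Paths.get(args[i]);
            for (String line : linesWithEcid(log, ecid)) {
                // Prefix each hit with its file name, as grep does for multiple files.
                System.out.println(log + ": " + line);
            }
        }
    }
}
```

Passing the log files of all servers involved in one invocation gives a single combined view of the warnings for the Request.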

Upon
seeing that we had a failure, the admin will capture JFR data from all
of the servers involved. In a real scenario, the admin may have noticed
the warnings in the logs, or perhaps had a Policy/Action (formerly known as Watch/Notification) configured to automatically notify or capture data. For our simple example, a WLST script is included to capture the JFR data.


We assume that readers are familiar with JFR and Java Mission Control (JMC), and that they have installed the WebLogic Plugin for JMC (see the video on installing the plugin).

Since
we have an ECID in hand already related to the failure (in a real case
this would be from the warnings in the logs), we will pull up the JFR
data in JMC and go directly to the "ECIDs" tab in the "WebLogic" tab
group. This tab initially shows us an unfiltered view from the
AdminServer JFR, which includes all ECIDs present in that JFR recording:

Next we will copy/paste the ECID "f7cf87c6-9ef3-42c8-80fa-e6007c56c21f-00000231" into the "Filter Column" for "ECID":


With only the specific ECID displayed, we can select it and see the JFR events in the recording that are related to that ECID. We can right-click to add those associated events to the "operative set" feature in JMC. Once the events are in the "operative set", other views in JMC can also be set to show only the operative set. See Using WLDF with Java Flight Recorder for more information.

Here we see screen shots showing the same filtered view for the ejbServer and webappServer JFR data:



In
our simple contrived case, the failure we forced was entirely within
application code. As a result, the JFR data we see here shows us the
overall flow for example purposes, but it is not going to give us more
insight into the failure in this specific case itself. In cases where
something that is covered by JFR events caused a failure, it is a good
way to see what failed and what happened leading up to the failure.

For more related information, see:
