Tuesday Apr 16, 2013

Demystifying WebLogic and Fusion Middleware Management --- by Glen Hawkins, Senior Director, Product Management

For this week’s blog on Cloud Application Platform, we are going to switch gears and talk about something that is near and dear to everyone responsible for running applications and middleware in their environment: monitoring and management, with specific emphasis on the Oracle Enterprise Manager Cloud Control solution.

Often, this particular topic is dismissed early in the architectural discussion and doesn’t rear its (sometimes ugly) head until development of a new application is fairly far along and the team is planning its deployment or, worse, until problems in production begin to impact the overall service levels of an application to the point that the end-users are complaining or top-line revenue is being lost because of poor performance or reliability problems.  The result is that the inexperienced will treat monitoring and management of their middle tier and their application system as a whole as an afterthought, while those that are more experienced or forward looking will tackle it from day one.

So, let's start with some common pitfalls or myths that people run into when considering or planning the deployment of their management along with some discussions on each of these points:


I think that most that have attempted this in the past have learned the error of their ways.  Most tools such as administration consoles like the WebLogic Administration Console are designed to get the product up and running and for general configuration and administration purposes of a single domain.  They are not intended as a solution to monitor and manage many domains (possibly even multiple versions of those domains) as well as the entire application infrastructure (i.e. Databases, Hosts, Message Queues, Service Buses, etc) at once.  And, they routinely don’t provide any historical metrics or real 24/7 diagnostics.  No administrator wants to be in a situation where a problem occurred an hour ago and they no longer have any information on it because they only have real-time data.  You need both real-time and historical monitoring and diagnostics capabilities. 

In addition, administrators routinely want to be able to answer the usual question that comes up when everything was running fine one day and fails to perform on the next, which is “what has changed”.  You need historical information to refer to at all tiers of the application including the host as well as visibility across the stack including both monitoring and configuration data to answer that question. 

Possible answers could be that the number of end-users has increased, the way the end-users were using the application has changed (i.e. that marketing event you didn’t know about changed behavior), the application changed, the WebLogic domain changed, the JVM changed, a patch was applied, or someone may even have started running something new on the machine or impacted the OS.

Correlating these changes and coming to a quick conclusion is key to ensuring optimal application service levels for your end-users in a production environment.  That means that you need a full-stack, 24/7, real-time and historical monitoring solution that can also provide meaningful diagnostics and track and compare configuration standards across the entire application system stack, which, in the case of the Oracle stack, is something that only Oracle Enterprise Manager Cloud Control is able to provide.
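To make the “what has changed” question a little more concrete, here is a minimal sketch of comparing two configuration snapshots taken at different times.  This is not how Enterprise Manager Cloud Control implements its comparisons; the class name, the keys, and the values below are all hypothetical and only meant to show the idea of diffing a “before” and an “after” view of the stack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: the snapshot keys and values are made up for this example.
class ConfigDiff {

    // Returns one human-readable line per added, removed, or changed setting,
    // ordered by setting name.
    static List<String> diff(Map<String, String> before, Map<String, String> after) {
        List<String> changes = new ArrayList<>();
        TreeSet<String> keys = new TreeSet<>(before.keySet());
        keys.addAll(after.keySet());
        for (String key : keys) {
            String oldVal = before.get(key);
            String newVal = after.get(key);
            if (oldVal == null) {
                changes.add("added " + key + "=" + newVal);
            } else if (newVal == null) {
                changes.add("removed " + key + " (was " + oldVal + ")");
            } else if (!oldVal.equals(newVal)) {
                changes.add("changed " + key + ": " + oldVal + " -> " + newVal);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // Hypothetical snapshots from the day things ran fine and the day they did not.
        Map<String, String> monday = Map.of("jvm.maxHeap", "2g", "jdbc.pool.max", "15");
        Map<String, String> tuesday = Map.of("jvm.maxHeap", "2g", "jdbc.pool.max", "50", "os.patch", "applied");
        diff(monday, tuesday).forEach(System.out::println);
    }
}
```

The point of the sketch is simply that the question can only be answered if yesterday’s snapshot was kept somewhere, which is the argument for historical configuration data at every tier.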


This one is quite simple at the end of the day, especially for those that have been pulled into a war room for a production application emergency with all the finger pointing and frustration that routinely ensues.  The various team members responsible for the different portions of an application system almost always need to collaborate to resolve problems.  When each team uses separate tools, collaboration can be slow and frustrating.

A single pane of glass, with different roles and privileges governing who can see what, allows everyone to speak the same language.  At the end of the day, when a fire drill arises, communication and collaboration will allow you to pull through, and both are greatly enhanced with the correct solution.

Oracle’s Enterprise Manager Cloud Control solution was designed to promote this level of communication between roles.  Flexible dashboards provide different views of the application to different team members, and the diagnostics are meaningful: bi-directional navigation between JVM threads and Oracle database sessions goes well beyond just isolating SQL calls, and the Middleware Diagnostics Advisor provides recommendations and diagnostic findings for the WebLogic stack.  Together these quickly cut down on your time to resolution, as opposed to raw metrics, which force you to piece together fragments of the story from completely separate tools.


I think this particular myth tends to surprise those that are new to application and middle-tier management.  In development environments, particularly during the QA and load testing phases for most applications, the environments are usually well controlled and, because they are not in production, you can more easily reproduce errors and attempt to resolve them there.  However, in production environments, it becomes extremely difficult to reproduce issues as the load, network, application environment, and overall intermittent behavior of all of the tiers can challenge even the most technical operations person including those who developed the application in the first place.

We routinely see issues reported by end-users in production environments where monitoring is minimal. Often, hours, days, even weeks are spent trying to reproduce issues, or waiting for them to happen again if they are intermittent and no historical monitoring and diagnostics are available in the environment.  The bottom line is that you need to be able to diagnose problems in the production environment itself.

Within Enterprise Manager Cloud Control, both historical and real-time metrics are available 24/7 across all tiers and they are correlated together.  Let me provide a quick simple example of a possible root cause analysis scenario where an application is perhaps degrading in performance over time.  Memory analysis tools by themselves are not able to pinpoint the problem, but it is clear that there is a buildup of referenced objects on the heap (i.e. possibly falling under the high level classification of a “memory leak” like issue, but then again there are possibly other causes).  The historical solution might be to attempt to restart servers on a regular basis trying to maintain high availability as you do, but that will not get you closer to finding the real issue and it is a band-aid at the end of the day that may very well fail when and if capacity increases for your application.

Let’s say we start with getting a notification from Enterprise Manager Cloud Control that a critical alert has occurred on the Work Manager – Pending Requests metric, indicating there is a buildup of requests in the application.  This is an early indicator, and the Request Processing Time alert is likely soon to follow if the trend continues, so let’s jump in and diagnose the problem.
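For readers who have not worked with threshold-based alerting, the general idea behind an alert like this can be sketched in a few lines.  This is only an illustration: the warning and critical values, and the rule that the last few samples must all breach the threshold before an alert fires, are assumptions of mine for the example rather than the product’s actual defaults or logic.

```java
import java.util.List;

// Illustrative only: threshold values and the "consecutive samples" rule are assumptions.
class PendingRequestsAlert {

    enum Severity { CLEAR, WARNING, CRITICAL }

    // Returns the highest severity that each of the most recent `occurrences`
    // samples reaches, so a single spike does not raise an alert.
    static Severity evaluate(List<Integer> samples, int warning, int critical, int occurrences) {
        if (occurrences < 1) {
            throw new IllegalArgumentException("occurrences must be at least 1");
        }
        if (samples.size() < occurrences) {
            return Severity.CLEAR; // not enough history yet
        }
        int lowestRecent = Integer.MAX_VALUE;
        for (int value : samples.subList(samples.size() - occurrences, samples.size())) {
            lowestRecent = Math.min(lowestRecent, value);
        }
        if (lowestRecent >= critical) {
            return Severity.CRITICAL;
        }
        if (lowestRecent >= warning) {
            return Severity.WARNING;
        }
        return Severity.CLEAR;
    }

    public static void main(String[] args) {
        // Hypothetical pending-request counts, oldest first.
        List<Integer> pending = List.of(2, 5, 12, 14, 15);
        System.out.println(evaluate(pending, 5, 10, 3));
    }
}
```

Requiring several consecutive breaches is one common way to avoid paging someone for a momentary blip while still catching a sustained buildup like the one in this scenario.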

First, let’s look at one of the higher level customizable dashboards in the product to see the lay of the land:


We can see from our WebLogic application above (just a simple Medrec example in this case) that all of our servers look like they are up and running, some of our heap and other metrics look high, but not unreasonable with the exception of some of our JVMs which show some DB Wait locks in red in the right-most bottom table.  This is a sure indicator that the pending requests that we were alerted to earlier are likely associated with calls of some kind to the back-end database.  If I click on the JVM in question, I can take this down a level.


Now we are on our JVM target home page within our WebLogic Domain hierarchy (many more metrics and capabilities there that we won’t go into in this blog, but I will provide links below to see those capabilities) where we can see a bit more detail and filter on anything to our heart’s delight by clicking on the various hour glasses to search on methods, requests, SQL, thread state, ECID (a transaction ID in FMW), and other criteria, which will filter the graphs further down the page which show thread breakdowns by many of these dimensions.  I could also immediately create a diagnostic snapshot of the data to look at later if I so desired.  I can also click on the Threads tab (next to the highlighted “General” tab above) and look at historical thread data or play with the timeframe, but we can see just by looking at this that we were correct about the threads in the DB Wait state and it has been going on for some time now.  Let’s navigate from historical to JVM live threads (collected every 2 secs using native thread sampling as opposed to byte code instrumentation) to try to determine the root cause of why so many threads are stuck in the DB Wait state.
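To give a feel for what sampling thread states looks like, as opposed to instrumenting byte code, here is a small sketch that uses only the standard java.lang.management API.  It is not the JVMD agent and does not work the way the agent does: it runs inside the JVM it observes, it only knows the standard Java thread states (RUNNABLE, BLOCKED, WAITING, and so on) rather than richer states such as DB Wait, and the class name and loop are my own for illustration.  The two-second interval simply mirrors the sampling interval mentioned above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: an in-process sampler built on the standard JDK management API.
class ThreadStateSampler {

    // Takes one sample: counts the live threads of this JVM by Thread.State.
    static Map<Thread.State, Integer> sample() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details to keep the sample cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Print three samples, two seconds apart.
        for (int i = 0; i < 3; i++) {
            System.out.println(sample());
            Thread.sleep(2000);
        }
    }
}
```

Because nothing in the application classes is rewritten, a sampler like this needs no restart and sees whatever the threads happen to be doing at the moment of the sample, which is the trade-off sampling makes against per-transaction tracing.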


Looking above, it is apparent that we are running an SQL prepared statement originating from a front-end request from the “/registerPatient.action” URL.  I could then click on the “SQL ID” to actually bring myself to the SQL in question within a tuning screen, but the route of more interest is to click on the DB Wait link highlighted in the lower half of the screen for one of the threads.  This will take me into a read-only view of the actual Oracle database session itself.


Here we are in the database session itself.  As an operations person or developer, my options are obviously very restricted, but I can see that there is a blocking session ID.  Better yet, I can now click on that blocking session ID and see that something entirely outside of my WLS container or JVM is causing contention, and I can now communicate with my DBA to address the problem.  This could just as easily have been a badly tuned SQL statement or an index problem.  Likewise, I could have discovered that my threads were locked by one another or were stuck in a Network Wait or even File IO.  There are a multitude of possibilities, but because I have a tool that can see across these tiers, I can quickly diagnose the issue and I am speaking the same language as my DBA.  DBAs can also drill back up, by the way, from SQL statements to the JVM and WLS container (also in read-only mode, obviously), so they can be proactive about maintaining the application.  This is just one simple example of how Enterprise Manager Cloud Control facilitates this type of communication between roles.  There are many other similar features, from the dashboards, which can be tweaked per role to give each team member the appropriate visibility, to the incident management, which is designed to allow teams to collaborate or even work with Oracle Support via the WebLogic Support Workbench if necessary.
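Following a blocking session ID back to whatever is ultimately holding things up is, conceptually, just a walk along a chain.  The sketch below shows that walk over a plain map of “session X is blocked by session Y” entries.  The session IDs are invented for the example and the code does not query a database or reflect how the product navigates sessions; it only illustrates the reasoning.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: session IDs and the blockedBy map are hypothetical inputs.
class BlockerChain {

    // blockedBy maps a session ID to the ID of the session blocking it;
    // a session that does not appear as a key is not blocked by anyone.
    // If the chain loops back on itself, the walk stops at the first
    // session it sees twice and returns it.
    static int rootBlocker(int sessionId, Map<Integer, Integer> blockedBy) {
        Set<Integer> seen = new HashSet<>();
        int current = sessionId;
        while (blockedBy.containsKey(current) && seen.add(current)) {
            current = blockedBy.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // Two application sessions wait on session 57, which itself waits on session 12.
        Map<Integer, Integer> blockedBy = Map.of(101, 57, 102, 57, 57, 12);
        System.out.println("root blocker of 101: " + rootBlocker(101, blockedBy));
    }
}
```

In the scenario above, the session at the end of that chain is the one worth raising with the DBA.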


It is true that most Java transaction tracing solutions create overhead because of byte code instrumentation.  There is certainly a time and place for this type of diagnostics, which can be very detailed and rich in its analysis.  Within Oracle Enterprise Manager Cloud Control, we do have an optional advanced diagnostics feature that provides this functionality.  Overhead is routinely much lower than just about any other solution out there, and it is indeed able to run 24/7 without incurring much overhead.  For many, the little overhead required is reasonable and well worth the enormous amount of visibility you get by being able to track individual transactions or groups of transactions through each tier of your application, isolating problems based on the actual payload.

However, for those who prefer to not use byte code instrumentation, the entire example provided above does not require any.  It simply uses the stack metrics collected from the Enterprise Manager Cloud Control agent, which sits on the host (not in the WLS container and thus out of process) and the JVMD agent, an extremely lightweight agent (just a WAR file) that uses native code sampling (no byte code instrumentation and thus no restart of the managed server).  The bottom line is that you can get a ton of visibility without incurring any noticeable overhead and decide where and if you want to also trace transactions on an individual basis.  This type of flexibility ensures that all diagnostics needs are met.

Alright, so that was my last myth to dispel for this blog.  I could go on for quite some time and show the many other capabilities of the Enterprise Manager product such as the earlier mentioned Middleware Diagnostics Advisor, log viewing and alerting, the multitude of dashboards, thresholds, lifecycle management, disaster recovery, and patch automation features that span the full capabilities of Oracle’s solution for WebLogic and Fusion Middleware management, but perhaps there will be time for another blog on those topics later.

For now, I will leave you with some resources to help you leap beyond the myths.

Additional Resources

Oracle Enterprise Manager Cloud Control 12c Middleware Management OTN Page

Free Online Self-Study Courses from Oracle Learning Library (OLL)

· Best Practices for WebLogic and SOA Management Self-Study Course

· Oracle Real User Experience Insight: Oracle's Approach to User Experience

· Oracle Real User Experience Insight: Basic Navigation, Data Structures, and Workflows

WLS Performance Monitoring and Diagnostics

· Navigate the Middleware Routing Topology

· Customize Middleware Performance Summaries

· Diagnose WebLogic and JVM Performance Bottlenecks

· Capture Diagnostics Snapshots

· Use the Middleware Diagnostics Advisor to Size the JDBC Connection Pool

· Diagnose Performance Issues End-to-End

· Construct a Service Level Agreement

· Overview of Business Transaction Management

· Service Dashboard

· Business Application Dashboard

WLS Configuration and Lifecycle Management

· Use and Report on Out-of-the-Box Compliance Standards

· Create WebLogic Domain Provisioning Profile

· Clone an Oracle WebLogic Domain from the Software Library

· Redeploy a Java EE Application

· Patching WebLogic Server

· Automate Disaster recovery with "Oracle Site Guard"

Coherence Management

· Manage and Monitor Oracle Coherence

· Provision Coherence

Real User Experience Insight

· Manage End User Performance with Real User Experience Insight


Monday Apr 15, 2013

Cloud Application Foundation Week in Review

Cloud Application Foundation blogs this week include Market Share results for WebLogic Server based on the Gartner Market Share, All Software Markets, Worldwide, 2012 report released on March 29, 2013; a customer case study on how Thomson Reuters Westlaw uses Oracle Coherence to improve scalability, availability and performance; and information about how easy it is to develop and deploy Tuxedo services in Java with step by step descriptions and sample code.

WebLogic Server

Oracle is #1 in the Application Server Market segment again with 40.7% of the market share according to the Gartner Market Share report that was released in March 2013. Read the WebLogic blog to find out more and for links to supporting information.

Coherence

Thomson Reuters Westlaw, one of the primary online legal research services for lawyers and legal professionals, provides proprietary data services and information from more than 30,000 databases of case law, statutes, synopses, treatises, best practices, news articles and public records. Read the Coherence blog to find out how they improved scalability and performance with Oracle Coherence.

Tuxedo

Developing Tuxedo services in Java is easy and straightforward. Read the Tuxedo blog for more information about Java Server support in Tuxedo and the programming environment. Find out the steps needed to develop and deploy a Java service on Tuxedo and how to configure the Tuxedo Java Server. Sample implementation and configuration code is included.


Friday Apr 05, 2013

Cloud Application Foundation: What's New?

This week’s Cloud Application Foundation (CAF) blogs—WebLogic, Coherence and Tuxedo—all talk about exciting new updates in each product area. The WebLogic blog describes new capabilities made available by WebLogic Server on Oracle Database Appliance 2.5. The Coherence blog announces the spring edition of the New York Coherence Special Interest Group held on April 10. And, the Tuxedo blog focuses on three new high availability features in the Tuxedo 12c release.


WebLogic

This week, Oracle announced exciting news about WebLogic Server on Oracle Database Appliance 2.5, a complete solution for building and deploying enterprise Java EE applications in a fully integrated system of software, servers, storage, and networking. Not only does this solution deliver highly available database and WebLogic services, it also reduces IT cost with a unique capacity-on-demand software licensing model.

Read the blog, WebLogic Server on Oracle Database Appliance, for more details.  

Coherence

The Coherence blog, New York Coherence SIG on April 10, describes the NYCSIG, scheduled to take place on April 10, 2013 from 1:00-5:00 pm ET at the Oracle Office, 120 Park Avenue, 26th Floor, New York, NY. Don’t miss this opportunity to hear what’s new in Coherence and talk with subject matter experts. Links to registration, the agenda and additional information are included in the blog.

Tuxedo

In the Tuxedo blog, Chief Oracle Tuxedo Architect Todd Little discusses three new features in Tuxedo 12cR1 that help improve the availability of Tuxedo-based applications. Todd describes how the *ROUTING section can now specify up to three server groups that can be associated with a range of values, allowing an application partition to span up to three machines. He also talks about how automatic migration of machines and server groups and service versioning enhance Tuxedo’s capability to support highly available applications. Check out the blog for more details.

Let us know what you think

What CAF topics would you like to hear more about?

About

This blog covers the concepts, architecture, practices, technologies, and products that provide foundational infrastructure for the cloud.
