By Anand Akela on Jul 24, 2012
To try and make the new features a bit more understandable, I’ll be writing a number of blog entries over the coming months to highlight just some of my favourite new features for EM12c. From an administrator’s perspective, one of those standout features (and the subject of today’s entry) has to be incident management.
The goal of incident management is to enable administrators to monitor and resolve service disruptions that may be occurring in their data centre as quickly and efficiently as possible. Instead of managing the numerous discrete individual events that may be raised as the result of any of these service disruptions, we want to manage a smaller number of more meaningful incidents, and to manage them based on business priority across the lifecycle of those incidents.
To do this, Enterprise Manager now provides a centralized incident console called Incident Manager that will enable the administrator to track, diagnose, and resolve incidents, as well as providing features to help rectify the root causes of recurrent incidents. Incident Manager also directly leverages Oracle’s own expertise via My Oracle Support knowledge base articles and documentation to enable administrators to accelerate the process of diagnosing and resolving incidents and problems. Finally, Incident Manager also offers the ability to do lifecycle operations for incidents, so you can assign ownership of an incident to a specific user, acknowledge an incident, set priority for an incident, track an incident’s status, escalate an incident or suppress it so you can defer it to a later time. You can also raise notifications on an incident or open a helpdesk ticket via the helpdesk connectors.
Enterprise Manager continues to be the primary tool for managing and monitoring the Oracle data centre. It manages and monitors Oracle applications as well as the application stack, from presentation layer to middleware, databases to hosts and the operating system, as well as non-Oracle technology. When Enterprise Manager detects issues in any of this infrastructure, it raises events. Sample events might be:
1. Metric alerts (for example, CPU utilization or tablespace usage alerts) where a critical threshold you set has been crossed
2. Job events – events are raised by the job system for job statuses that you specify; for example, an event is raised to signal the failure of a job.
3. Standards violations – if you are using compliance standards and any of the targets that are being monitored violate any of the compliance standards, then a standards violation event could be raised.
4. Availability events – if a target is down and Enterprise Manager detects that, an availability event that the target is down can be raised
5. Other events – there are other types of events that occur as well
All these events signal particular issues have occurred in the managed data centre. As an administrator, you really want to be able to determine which of these events are significant. From these significant events, you then want to be able to correlate discrete events that are related to the same underlying issue, so you in fact have to manage a smaller number of significant incidents.
An incident could then be defined as an object containing a significant event (such as a target being down, for example) or it could be a combination of events that all relate to the same issue (for example, running out of space could be detected by Enterprise Manager as separate events raised from the database, host and storage target types). For example, you may have a performance incident that amalgamates a number of performance events, another incident related to space, and a different incident based on availability problems.
Sound good? OK, so how do we do this? Well, events are significant occurrences in your IT infrastructure that Enterprise Manager detects and raises. Each event has a set of attributes: what type of event it is, the severity (fatal, critical and so on), the object or entity on which the event is raised (typically a target, but it can also be a job or some other object), the message associated with the event, the timestamp at which it occurred, and the functional category (such as availability, security etc.).
Some examples of the different types of events include:
· Target availability: raised when a target is down or has gone into an agent unreachable state.
· Metric alert: raised when a metric crosses its threshold.
· Job status change: raised, for example, when a job fails.
· Compliance standard rule: raised when a compliance standard rule is violated.
· Metric evaluation: raised when there is an error with the evaluation of a metric.
· Other events such as SLA Alert, High Availability and Compliance Standard Score violation can also be raised, and of course, users can cause an event to be raised.
Associated with these event types are event severities. The first of these, “Fatal”, is a new severity level in Enterprise Manager specifically associated with the target availability event type for when the target is down. Critical and warning events have the same meaning as they had in previous releases, and then we have the Advisory level. Typically, this is associated with non-service-impacting events such as compliance standard violation events. The informational level is an event severity used to indicate simply that an event has occurred, but there is no need to do anything about it.
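To make the event attributes and severity levels a little more concrete, here is a minimal Python sketch of how you might picture an event. This is purely illustrative and is not Enterprise Manager's internal model or API: the class names, field names, numeric ranking and sample timestamp are all my own assumptions, with the severities ranked in the order they are described above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the text; the numeric ranking is an assumption."""
    INFORMATIONAL = 1
    ADVISORY = 2
    WARNING = 3
    CRITICAL = 4
    FATAL = 5


@dataclass(frozen=True)
class Event:
    """Hypothetical container for the event attributes listed in the text."""
    event_type: str      # e.g. "Target availability", "Metric alert"
    severity: Severity
    entity: str          # typically a target, but could be a job or other object
    message: str
    timestamp: datetime
    category: str        # functional category, e.g. "Availability", "Security"


# A target-down event is a target availability event, so it is Fatal.
# The timestamp is an arbitrary illustrative value.
db_down = Event(
    event_type="Target availability",
    severity=Severity.FATAL,
    entity="DB1",
    message="Database DB1 is down",
    timestamp=datetime(2012, 7, 24, 9, 30),
    category="Availability",
)
```

Using an ordered enumeration means severities can be compared directly, which is handy when an incident has to take its severity from the events it contains.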
As we discussed previously, an actual incident will contain one or more events. Let’s look at the details of an incident with one event. For example, Figure 1 shows us an availability event:
Figure 1: Incident with one event
The event signals that the database DB1 is down and includes a timestamp of when the event was raised. Because this is a target availability event and the database is down, the severity is marked as Fatal. An incident can be created for that event, so the incident contains only one event. In order to manage and track the resolution of the incident, the incident has other attributes such as owner (the Enterprise Manager user that is working on the incident), status, incident severity (which is based on the event severity), priority and a comment field.
Many incidents will instead contain multiple events, where those events are related and point to the same underlying cause. In the example shown in Figure 2, we have two metric alert events on a host target (a memory utilization metric alert event and a CPU utilization metric alert event) because the host is starting to suffer from heavy load. We have a warning severity memory utilization metric alert event, and a short time later a critical severity CPU utilization metric alert event.
Figure 2: Incident with multiple events
An incident can be created containing both events in order to manage and track the resolution of the incident. In the current release, the administrator needs to manually combine events into an incident in the Enterprise Manager console (the automatic grouping of related events into an incident is a future enhancement). Again, we have additional attributes associated with the incident like we had in the previous example. Enterprise Manager automatically assigns the incident severity, based on the worst case event severity of all the events contained in the incident. Since the worst event severity is Critical, the incident severity is also set to Critical. Finally, the incident has a summary which is a short description of what the incident is about. The individual events are indicating the machine load is high so you can set the summary to that. Alternatively, you can set the incident summary to be the same as the event messages.
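The worst-case severity rule is easy to picture in code. The sketch below is only an illustration of the rule as described here, not how Enterprise Manager implements it; the `Incident` and `Event` classes, their fields and the event messages are hypothetical names I have made up for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    """Severity levels from the text; the numeric ranking is an assumption."""
    INFORMATIONAL = 1
    ADVISORY = 2
    WARNING = 3
    CRITICAL = 4
    FATAL = 5


@dataclass
class Event:
    message: str
    severity: Severity


@dataclass
class Incident:
    """Hypothetical incident: a summary, its events and tracking attributes."""
    summary: str
    events: List[Event] = field(default_factory=list)
    owner: Optional[str] = None
    priority: Optional[str] = None

    @property
    def severity(self) -> Severity:
        # The incident takes the worst severity of any event it contains.
        if not self.events:
            raise ValueError("an incident must contain at least one event")
        return max(event.severity for event in self.events)


# The host example: a Warning memory alert followed by a Critical CPU alert.
incident = Incident(
    summary="Machine load is high",
    events=[
        Event("Memory utilization metric alert", Severity.WARNING),
        Event("CPU utilization metric alert", Severity.CRITICAL),
    ],
)
print(incident.severity.name)
```

Deriving the severity from the events, rather than storing it separately, keeps the incident consistent if another event is added to it later.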
If you are using one of the helpdesk connectors to interface to a helpdesk system, an incident might also result in a helpdesk ticket which can allow the helpdesk analyst to work on the ticket. Within Enterprise Manager, we’ll be able to track both the ticket number and the status of that particular ticket.
A problem is the underlying root cause of an incident. In Enterprise Manager terms, a problem is specifically related to either an Automatic Diagnostic Repository (ADR) incident or Oracle software incident. Enterprise Manager will automatically create a problem whenever it detects an ADR incident has been raised. An ADR incident can be thought of as a critical Oracle software problem where the resolution of the software problem typically involves contacting Oracle Support, opening a service request and possibly receiving a patch for that problem.
Whenever an ADR incident is raised, we generate one incident in Enterprise Manager for that ADR incident, and we also automatically generate a problem as well. All the ADR incidents that have the same problem signature (that is, the same root cause) will be linked into a single problem object. The administrator can manage the problem in Incident Manager in the same way as you would manage an incident, so you can assign an owner to the problem, track the resolution and so on. In addition, there are in-context links to Support Workbench functionality which allows the administrator to package the diagnostic material, open a service request and view the status of diagnostic activity such as the SR number and ultimately bug number (if one is generated) within the user interface.
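One way to think about the link between ADR incidents and problems is as a simple grouping by problem signature. The Python sketch below illustrates that idea only; the `AdrIncident` class, the `group_into_problems` function and the signature strings are hypothetical placeholders, and real problem signatures will look different.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AdrIncident:
    """Hypothetical stand-in for an ADR incident."""
    incident_id: int
    problem_signature: str  # incidents with the same signature share a root cause


def group_into_problems(incidents: List[AdrIncident]) -> Dict[str, List[AdrIncident]]:
    """Link every incident with the same problem signature to one problem."""
    problems: Dict[str, List[AdrIncident]] = defaultdict(list)
    for incident in incidents:
        problems[incident.problem_signature].append(incident)
    return dict(problems)


# Placeholder signatures: two incidents share one, a third has its own.
incidents = [
    AdrIncident(1, "ORA-600 [example-a]"),
    AdrIncident(2, "ORA-600 [example-a]"),
    AdrIncident(3, "ORA-600 [example-b]"),
]
problems = group_into_problems(incidents)
for signature, linked in problems.items():
    print(signature, [i.incident_id for i in linked])
```

Each key in the result stands for one problem object, which is what you would then assign an owner to and track through to resolution.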
Figure 3 shows a diagrammatic example of how incidents and problems are related. Two ADR incidents have occurred; in this example, two ORA-600 errors have occurred in my database. Both of these incidents are of critical severity. Enterprise Manager automatically creates a problem containing those incidents. Within the Incident Manager interface you can link to the Support Workbench to open a service request, which you can then track from Incident Manager.
Figure 3: Incidents and problems
So now you have an understanding of the terminology and relationships between these terms, what’s next? Well, the next thing to understand is just how you deal with these incidents. That will be the topic of my next blog, so stay tuned for more!
Contributed by Pete Sharman, Principal Product Manager, Oracle Enterprise Manager