As dynamic IT environments grow in scale, ITOps and DevOps teams face an increasing volume and variety of incidents to manage. With this increasing volume, it becomes more important to find efficient, smarter ways to manage and resolve incidents in a timely and consistent manner. A key part of the solution is transforming the expertise and tribal knowledge used in managing incidents into standardized runbook procedures that everyone can follow when responding to incidents.

In Enterprise Manager 13.5 Release Update 7 (13.5.0.7), we support this solution with our new Dynamic Runbooks feature. Dynamic Runbooks allow your subject matter experts (e.g., senior DBAs, IT engineering teams) to capture a standard set of steps for triaging, diagnosing and resolving incidents as runbooks that can be executed directly in Enterprise Manager (EM) in the context of an incident. Without this feature, operations teams that have runbook procedures for managing incidents typically consult external wiki pages and use various tools to execute the procedures. With Dynamic Runbooks, these runbook procedures can be defined within EM itself, and thus become more easily and immediately accessible to the operations teams who manage incidents in EM.

 

Live runbook authoring

When you author a runbook in EM, you don’t start with a blank page. Instead, you create it in the context of a live incident of the kind the runbook is meant to handle. Authoring a runbook against an actual, live incident provides an easy way to test the steps in your runbook as you create them. For example, let’s say you want to create a runbook to manage FRA (Fast Recovery Area) incidents. To do so, locate an FRA incident in EM and start the runbook creation flow from that incident. This gives your runbook access to the context of the FRA incident, such as the database target of the incident, the FRA metric for the incident, the time at which the event occurred, etc. This incident context helps you define and test the steps in the runbook.

A runbook itself consists of a series of executable steps. The first step of a runbook is the Overview & Prerequisites step, which defines the overall purpose of the runbook and any prerequisites required for its successful execution. An example of a prerequisite is access to the named credentials needed to run SQL against a target database.

Create Runbook - Step 1
Figure 1: When creating a new runbook, specify its purpose and any prerequisites in the first step.

After the Overview & Prerequisites step, add other steps to the runbook to triage and resolve the incident. Several step types are available for building these steps, including the Metric Data step for showing a metric time series chart in the context of the event, the Target SQL step for running SQL against a target database, the Repository SQL step for running SQL against the EM repository, etc. As you define each step, you can immediately test it by running it against the incident context. For example, you may want to create a Target SQL step that queries the target database to check which file types are using the most FRA space. Once you define the SQL query in the Target SQL step, you can immediately run the step, which executes the query against the target database that is part of the incident context. Check the output of the step to verify that the query works, and fix the query if it does not. This iterative create-execute-verify-fix process can be repeated for each step until all runbook steps have been defined.

Create Runbook - SQL Step
Figure 2: In the Target SQL step, you can immediately check your SQL by clicking on the ‘Run’ button to execute and display query results.
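
To make the FRA example concrete, here is a minimal sketch of the kind of query a Target SQL step might hold. The post does not show the actual SQL, so the query below is an assumption: it reads Oracle's V$RECOVERY_AREA_USAGE view, which reports space usage per file type. The Python wrapper, the helper name top_fra_consumers, and the sample rows are hypothetical and only illustrate how the step output could be ranked; they are not part of EM.

```python
# Hypothetical query for a Target SQL step; the article does not give the actual SQL.
# V$RECOVERY_AREA_USAGE is an Oracle dynamic performance view with one row per file type.
FRA_USAGE_SQL = """
SELECT file_type,
       percent_space_used,
       percent_space_reclaimable,
       number_of_files
  FROM v$recovery_area_usage
 ORDER BY percent_space_used DESC
"""


def top_fra_consumers(rows, limit=3):
    """Return the file types using the most FRA space, largest first.

    rows: iterable of (file_type, percent_space_used) tuples, for example
    fetched with any Oracle client. Rows with zero usage are dropped.
    """
    used = [(file_type, pct) for file_type, pct in rows if pct > 0]
    used.sort(key=lambda row: row[1], reverse=True)
    return used[:limit]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        ("ARCHIVED LOG", 61.5),
        ("BACKUP PIECE", 22.0),
        ("FLASHBACK LOG", 9.3),
        ("CONTROL FILE", 0.0),
    ]
    for file_type, pct in top_fra_consumers(sample, limit=2):
        print(f"{file_type}: {pct}%")
```

In a runbook, only the SQL text would go into the Target SQL step; the ranking is already handled by the ORDER BY clause, and the Python helper simply mirrors it for rows processed outside EM.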

Once you’ve completed and tested the steps in the runbook, publish the runbook.  Once published, it is ready for general use.

 

Easy access to Runbooks within Incident Manager

Runbooks that have been published are easily accessible within Incident Manager.   An operator working on an incident in Incident Manager can navigate to the Runbook Sessions section, and start a runbook session. Once the runbook session is started, locate and select the appropriate runbook and start executing its steps.  Each step in the runbook is executed by clicking on the play button on the step. Once the steps in the runbook have been followed and executed, and the incident is resolved, mark the runbook session as done.

Runbook sessions are automatically saved for up to 14 days and are available to teams who want to do a postmortem analysis of the incident and its resolution.

Use Runbook
Figure 3: In a runbook session, click on the play button on each step to execute the step.

 

More efficient incident resolution

Dynamic Runbooks enable DevOps and ITOps teams to manage and resolve incidents more efficiently. They give teams a way to encapsulate subject matter expertise in resolving various types of incidents and to make that expertise easily accessible to all operations staff. Working on incidents in a standard, consistent way enables IT teams to resolve incidents more quickly and meet their SLAs for incident response.

Resources: