Monitoring your most critical targets, such as databases, applications, and servers, is all about minimizing risk. Many Oracle customers use Enterprise Manager (EM) to drive culture and best practices for risk management. Why? Because EM 13c is itself built to be a highly available, scalable, and performant platform, designed for administrators of large, critical IT estates who need monitoring and management at scale. Collectively, these qualities come from the Manage the Manager (MTM) functionality. This blog describes how to leverage the latest MTM features to improve the monitoring of EM 13.5 sub-systems.
Administrators need early warning signals. Early detection can prevent critical issues from impacting performance and reduce the risk of failure. Examples of performance issues are listed below in Figure 1:
Figure 1: Early Warning Symptoms
How can Enterprise Manager help mitigate these issues?
You can access the Manage the Manager capabilities from the Enterprise Manager console, by navigating to Setup and then Manage Cloud Control. The various dashboards help keep an eye on the health of the Enterprise Manager sub-systems. You can drill further into each component, from the respective dashboards.
On the Health Overview dashboard, you can check the status of key components such as the EM services, application deployments, job system backlog, and WebLogic deployments. Clicking the status icon for a service or application drills down to its availability details. Each component is represented as a target in EM. If any component is down, you can use the information on that target’s home page to help diagnose and resolve the availability issue. Figure 2 illustrates the four major areas covered under Manage the Manager.
Figure 2: Manage the Manager
The Enterprise Manager Job system automates many administrative tasks such as backup, cloning, and patching. It enables you to create your own jobs using custom OS and SQL scripts, as well as multi-task jobs composed of multiple steps over different targets. A healthy Job system is essential to keeping the OMS running at optimal performance.
In Enterprise Manager 13.5 (also back-ported to 13.4 Release Update 8) we have introduced the Job Diagnostics dashboards so that Administrators can monitor key components of the Job system. The Job system is often choked by an underlying issue with the OMS, the agent, the network, or scripts that do not have the right permissions. These dashboards allow you to focus on the problem areas and to debug down to the activity level of each job. You can diagnose the following:
• Jobs and Steps being re-tried, thus overloading the resources
• Queues blocked due to issues with the underlying system or agents
• Job execution performance data
• Job Dispatcher services Dashboard
• Health status for all Dispatcher services in a single or multi-OMS system
• Thread Pool and Connection Pool details for each Dispatcher
• Performance details on Long and Short Running User Jobs, as well as System Jobs
On the navigation pane of the Job Diagnostics home page, you can select various job system areas to analyze. You can also check how many jobs have been executing for the last month (or over 1 day, 3 days, and 1 week) in this particular environment.
Figure 3: Job Diagnostics Home Page
The Overview section displays at-a-glance information for 3 main elements of the Job System:
1. Number of Dispatchers and their status – the Job Dispatcher runs locally on each OMS and dispatches the jobs found by the Step Scheduler to the job worker threads.
If the dispatcher cannot keep up with the work in the queue, the backlog increases.
2. Steps Scheduler Status – indicates when the Step Scheduler last ran and when its next run is scheduled.
The Job Step Scheduler is a global component so there is only one per Enterprise Manager environment. It is scheduled to run by the DBMS Scheduler. The primary purpose of this component is to mark steps ready for the dispatcher to execute.
3. Bookkeeping steps – essential functionality that keeps the Job System running and healthy.
These bookkeeping steps are internal Job System steps which help maintain continuity of the job execution when various sub-systems of Enterprise Manager perform specific actions. For example, bookkeeping steps include: marking job executions and steps as failed, scheduled or suspended based on various system events such as agent bounce, blackouts, or group changes.
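To make the backlog behavior described for the dispatcher concrete, here is a minimal sketch. It is only a toy model, not how EM measures backlog: the function name, the per-interval arrival numbers, and the fixed dispatch capacity are all assumptions made for illustration.

```python
def backlog_over_time(arrivals_per_interval, dispatch_capacity):
    """Track a step backlog for a dispatcher with a fixed capacity per interval.

    Hypothetical model: each interval, new steps arrive and the dispatcher
    hands off at most `dispatch_capacity` steps to worker threads.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_interval:
        # Anything the dispatcher cannot hand off carries over to the next interval.
        backlog = max(0, backlog + arrivals - dispatch_capacity)
        history.append(backlog)
    return history


# Arrivals exceed capacity for three intervals, then drop off.
print(backlog_over_time([30, 30, 30, 10], dispatch_capacity=25))  # [5, 10, 15, 0]
```

The backlog only shrinks once arrivals fall below what the dispatcher can hand off, which is why a steadily growing backlog is worth investigating early.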
Figure 4 shows the Retried Jobs page, which lists the jobs retried most often in the last month and how many times each was retried. In this example, 51 jobs were retried, each of them 100 times. That is 51 × 100 = 5,100 job cycles spent on retries, which can represent a significant amount of system resource. The top job on this page, SI_NMR, was retried 100 times before it failed. Drilling down indicates that the NMR scripts were not run when the agent was installed. A permission issue of this kind has created an unnecessary load on the Job System.
Figure 4: Retried Jobs
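The retry arithmetic from the Figure 4 example can be sketched in a few lines. The `RetriedJob` structure and the generated job names are hypothetical and exist only to mirror the numbers quoted above; this is not an EM API.

```python
from dataclasses import dataclass


@dataclass
class RetriedJob:
    name: str      # hypothetical job name
    retries: int   # number of times the job was retried


def total_retry_cycles(jobs):
    """Sum the retry attempts across all retried jobs."""
    return sum(job.retries for job in jobs)


# 51 retried jobs, each retried 100 times, as in the Figure 4 example.
jobs = [RetriedJob(name=f"job_{i}", retries=100) for i in range(51)]
print(total_retry_cycles(jobs))  # 5100
```

Seen this way, fixing the single root cause behind the retries frees all 5,100 cycles at once.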
The Jobs System dashboard allows you to see the details of the longest queues. When a series of jobs are meant to be executed in a certain order, they are put into a job queue. This ensures a sequential execution of the jobs. If a job at the head of the queue is stuck for any reason, none of the other jobs in that queue will be able to run. Figure 5 shows the name of the Target at which the head of the queue is stuck, the name and type of Job that is stuck and its Status. You can click on the Head of the Queue job name to go to the Job Activity page for the head job to determine why the jobs in the queue are not getting processed.
Figure 5: Queue Details
As shown in Figure 6, you can drill down to the Job Activity page, where you will find specific details about why the job at the head of the queue is causing the backlog. In this case, the current status is "Agent is not ready", which tells you to investigate why the Agent is not up and running.
Figure 6: Drill-Down to Job Activity Page
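The head-of-queue behavior is easy to see in a small simulation. This is a sketch under stated assumptions: the job names, the `agent_ready` lookup, and the `runnable_jobs` helper are all hypothetical and only illustrate strictly sequential queue processing.

```python
from collections import deque


def runnable_jobs(queue, is_ready):
    """Pop and return the jobs that can run now from a strictly sequential queue.

    Jobs run in order, so the first job that is not ready blocks
    every job behind it.
    """
    ran = []
    while queue and is_ready(queue[0]):
        ran.append(queue.popleft())
    return ran


# Hypothetical queue: the head job targets an agent that is not ready.
queue = deque(["backup_db1", "patch_db1", "clone_db1"])
agent_ready = {"backup_db1": False, "patch_db1": True, "clone_db1": True}

print(runnable_jobs(queue, lambda job: agent_ready[job]))  # []
print(len(queue))  # 3, nothing behind the head can run

# Once the agent for the head job is ready, the whole queue drains in order.
agent_ready["backup_db1"] = True
print(runnable_jobs(queue, lambda job: agent_ready[job]))
# ['backup_db1', 'patch_db1', 'clone_db1']
```

Two of the three jobs were ready all along, yet none could run until the head job was unblocked, which is why the dashboard points you at the head of the queue first.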
As mentioned earlier, Job Dispatchers are services that handle dispatching the Job Steps for execution. A new dashboard allows you to view the current status of all Dispatchers in your Enterprise Manager environment (one dispatcher per OMS). Figure 7 shows that this environment has 5 Job dispatchers. For each dispatcher, you can view its Status and the time since which it has been available. Each dispatcher has several thread pools, such as User Short, User Long, System Critical, System Normal, and Internal. Thread pools provide a way to scope the resources used by the Job System. For example, the user short pool defaults to 25 threads, which allows each OMS to run up to 25 different user steps marked "short running" concurrently. Similarly, the user long pool defaults to 10 threads. Each Dispatcher also has its own Connection Pool, and you can view the Configuration, Usage, and Status of each one.
Figure 7: Job Dispatcher Dashboard
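As a rough illustration of how the per-OMS thread pool defaults scale, the sketch below computes an upper bound on concurrent user steps. It assumes every OMS keeps the defaults quoted above (25 short, 10 long); the dictionary keys and function name are hypothetical, and the other pools are left out because their defaults are not stated here.

```python
# Defaults quoted in the text; the key names are hypothetical labels.
DEFAULT_POOL_THREADS = {"user_short": 25, "user_long": 10}


def max_concurrent_steps(num_oms, pool, pools=DEFAULT_POOL_THREADS):
    """Upper bound on concurrent steps of one pool type across all OMSs.

    Assumes one dispatcher per OMS and unchanged default pool sizes.
    """
    return num_oms * pools[pool]


# The Figure 7 environment has 5 dispatchers (one per OMS).
print(max_concurrent_steps(5, "user_short"))  # 125
print(max_concurrent_steps(5, "user_long"))   # 50
```

These are ceilings rather than expected throughput; the dashboard's Usage figures show how much of each pool is actually in use.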
These Job Diagnostics dashboards indicate what types of jobs are executing in the system and whether the system is idle, overloaded, or blocked, and they depict the health of the overall Job System. For more details, refer to the Diagnosing Job System Issues section of the Cloud Control Administrator’s Guide.
To ensure Enterprise Manager is configured and optimized properly, implementation planning should take into account the sizing recommendations provided in the Oracle Enterprise Manager Cloud Control Advanced Installation and Configuration Guide. Sizing is based on a combination of the number of agents, targets and concurrent users. After implementation, Administrators should review the system sizing and usage on a regular basis to account for system growth.
Another new feature added in the 13.5 release is the Enterprise Manager Sizing Compliance Standards, a framework that continuously evaluates and recommends EM sizing configuration for OMS and Repository based on the system workloads.
Notifications are sent to the administrator through the out-of-band notification system, and the Compliance dashboard shows the OMS and Repository scores, warnings, and related details so that users can make the appropriate sizing choices. You can configure out-of-band notification by following My Oracle Support (MOS) note 1472854.1: EM 13c, 12c: How to Set up Out Of Band Email Notification for OMS and Repository Database Targets in Enterprise Manager Cloud Control.
Figure 8: Sizing Standards for OMS and Repository
Enterprise Manager comes with out-of-the-box sizing compliance standard rules for periodic health checks of the OMS and repository. Figure 8 shows that these rules are enabled by default and associated with the OMS and the repository.
The Compliance Dashboard in Figure 9 shows the violations for the OMS and repository targets, along with their overall compliance scores. This enables Administrators to act on the violations so that the OMS and Repository meet the sizing recommendations.
Figure 9: Compliance Dashboard
As shown in Figure 10, the compliance standard sizing rules automatically evaluate the current hardware and storage configurations. These rules notify administrators of any violations and recommend setting the parameters to the expected values for better performance.
Figure 10: Compliance Results for Hardware and Storage Requirements
Similarly, compliance standard sizing rules also periodically evaluate the repository database initialization parameters. Notifications are sent to Administrators if there are violations, including recommended database parameter values that would enable better performance.
Figure 11: Compliance Results for Repository Database Parameters
The consolidated Compliance Evaluation Report is available to show the overall results for hardware, storage requirements, and repository database parameters. As seen in Figure 12, the reports show the failed rules and the overall compliance score for the Enterprise Manager sizing standard. Administrators can review the report and take corrective actions to remediate the failed rules and achieve better EM performance.
Figure 12: Compliance Evaluation Report
Out-of-band notifications can be configured in Enterprise Manager 13c to send an email or trigger a script when certain fatal conditions occur. This functionality allows the EM Administrators to receive notifications when there is a failure in an EM component. The notification is triggered in the following scenarios:
· Single OMS environment, if the OMS is down, but the Agent is up
· Multi-OMS environment, if all OMSs are down, but the Agent is up
· If the Repository database is unavailable (for example, database down, archiver hung, or listener down)
· If the notification job is broken or has an invalid schedule
Figure 13: Proactive Alerts
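The four trigger scenarios listed above can be summarized as a small decision function. This is only a sketch of the conditions as described in this blog, not EM's implementation: the `EmState` fields and the `out_of_band_reasons` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmState:
    oms_up: List[bool]          # one entry per OMS (a single entry for single-OMS)
    agent_up: bool
    repository_available: bool  # False if down, archiver hung, listener down, ...
    notification_job_ok: bool   # False if the job is broken or has an invalid schedule


def out_of_band_reasons(state):
    """Return the reasons an out-of-band notification would be triggered."""
    reasons = []
    # Covers both single-OMS and multi-OMS: every OMS is down, but the Agent is up.
    if state.agent_up and not any(state.oms_up):
        reasons.append("all OMSs down while the Agent is up")
    if not state.repository_available:
        reasons.append("repository database unavailable")
    if not state.notification_job_ok:
        reasons.append("notification job broken or has an invalid schedule")
    return reasons


# Multi-OMS environment with one OMS still up: nothing is triggered.
print(out_of_band_reasons(EmState([True, False], True, True, True)))  # []
# Both OMSs down while the Agent is up: a notification is triggered.
print(out_of_band_reasons(EmState([False, False], True, True, True)))
# ['all OMSs down while the Agent is up']
```

Note that the OMS condition depends on the Agent being up, since in these scenarios it is the Agent that is still able to raise the alert.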
Configure out-of-band notifications by following the steps in the MOS note: EM 13c, 12c: How to Set up Out Of Band Email Notification for OMS and Repository Database Targets in Enterprise Manager Cloud Control.
Figure 14: Out-of-Band Notifications
As an enterprise's environment grows, Oracle Enterprise Manager 13c becomes essential for monitoring and administering it. The Manage the Manager features help maintain and support Enterprise Manager and keep it highly available, thus increasing its operational efficiency. Refer to the Manage the Manager Documentation for more details on how to leverage the dashboards and features discussed in this blog.