Oracle Support Master Note for Diagnostic Tools for 10g Enterprise Manager Grid Control Components (Doc ID 1098262.1)

 

 

For the most current information, refer to Master Note for Diagnostic Tools for 10g Enterprise Manager Grid Control Components (Doc ID 1098262.1).

 

 

 

In this Document
  Purpose
  Scope and Application
  Master Note for Diagnostic Tools for 10g Enterprise Manager Grid Control Components
     Diagnostic Tools for the 10g Oracle Management Service (OMS)
     Diagnostic Tools for the Enterprise Manager Grid Control Repository
     Diagnostic Tools for the 10g Grid Agent
     Diagnostic Tools for the Enterprise Manager Console Operations
     Diagnostic Tools Common for all Enterprise Manager Components
     Best Practices (Certification, Maintenance Activities, OCM, Healthcheck, CPU & PSU)
  References


Applies to:

Enterprise Manager Grid Control - Version: 10.1.0.2 to 10.2.0.5 - Release: 10.1 to 10.2
Information in this document applies to any platform.

Purpose

This Master Note describes the diagnostic tools available for investigating Enterprise Manager Grid Control issues.

Scope and Application

This document is intended to assist Enterprise Manager Grid Control Administrators in reading the log/trace files and using the available Diagnostic tools. This document covers the following topics:

1. Diagnostic Tools available for Enterprise Manager Grid Control Components.
2. Best Practices (Certification, Maintenance Activities, OCM, Healthcheck, CPU & PSU)

Master Note for Diagnostic Tools for 10g Enterprise Manager Grid Control Components

Before referring to this document, the user should have a thorough understanding of the installation and usage of the Enterprise Manager Grid Control Components.

Diagnostic Tools for the 10g Oracle Management Service (OMS)

Problems at the OMS level could be related to:

1. Process control - unable to start / stop / check status
2. Performance - OMS re-starting / crashing at frequent intervals, with or without core dump.
3. Communication problems from OMS to the other EM components.
4. Failure of some of the OMS components to function properly - notifications not being sent, xml files not being uploaded, etc.

The following diagnostic tools are available for diagnosing the problems at the OMS level:

  • RDA

    The Remote Diagnostic Agent (RDA) can be executed specifically with the Grid Control / OMS profile name: GridControl in order to reduce the number of questions that need to be answered and also to collect all details of the OMS home correctly. 

    The steps to execute the RDA with the GridControl profile are explained in:

    Note 1057051.1: How to Run the RDA against a Grid Control Installation

    After executing the RDA, navigate to the 'output' directory of the RDA and open the file RDA__start.htm in any browser. The RDA output consists of three frames: two on the left side and one on the right side.

The topic chosen in the top frame on the left has sub-topics that are shown in the frame below it. Clicking on the sub-topics will show the details in the right frame.

To access the OMS details click on the 'Enterprise Manager Server' in the top frame on the left.

The RDA output is useful for gathering a lot of information about the OMS, namely:

    • Obtaining the Hostnames and IP addresses of the OMS machine.
    • Performing a network ping test to one or more remote machines including the Grid console machine, Repository Database machine to test their accessibility.
    • The outputs from 'emctl' command options showing whether the OMS is running, secured etc.
    • Oracle Inventory details from the OMS home - list of one-off patches, Patchsets
    • Configuration files from 
      <OMS_HOME>/sysman and sub-directories
      <OMS_HOME>/Apache/Apache/conf
      <OMS_HOME>/opmn/conf
    • Log files from 
      <OMS_HOME>/sysman/log
      <OMS_HOME>/cfgtoollogs and sub-directories
      <OMS_HOME>/opmn/logs
    • Two memory dumps taken at an interval of 3 mins (from 10.2.0.4 GC onwards)
    • Output of network ping test to one or more Agent machines to test their accessibility.
    • Collection of the outputs of earlier EMDiagkit runs (if available in the filesystem), etc.

Note:
1. It is highly recommended that the latest EMDiagkit is installed and executed in the OMS home, before running the RDA. This will ensure that the RDA picks up the latest data collected by the EMDiagkit.
2. Always download and use the latest RDA, to take advantage of the new features introduced in that version.

 

  • OMS Log/Trace Files

The OMS generates log/trace files during its startup and operation. These log/trace files contain a wealth of information about the working of the OMS. The trace level can be raised from the default WARN to DEBUG to get additional information in the trace files. In addition, the stdout/stderr output (for example, an OMS thread dump) is redirected to the <OMS_HOME>/opmn/logs/OC4J~OC4J_EM~default_island~1 file.

For more details, refer to 
Note 229627.1: How to Locate 10g Grid Control OMS Log / Trace Files and Control their Size and Other Details

The OMS is a multi-threaded Java application. Hence, the emoms.log and emoms.trc will have entries from many threads interspersed with each other. The format of an entry is as follows:

<timestamp> [<thread name>] <trace level> <detail>

For example:

2010-05-07 15:20:27,334 [PingHeartBeatRecorder] DEBUG emdrep.pingHBRecorder mainTask.884 - Ping heartbeat recorder mainTask called

where:
2010-05-07 15:20:27,334 : timestamp
PingHeartBeatRecorder : thread name
DEBUG : trace level
'emdrep.pingHBRecorder mainTask.884...' : details of the entry.
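The entry format described above can be split into its fields programmatically. The following is a minimal sketch in Python, not an Oracle-supplied tool; the function name parse_oms_entry and the regular expression are illustrative assumptions based only on the format and sample entry shown in this note.

```python
import re
from datetime import datetime

# Assumed layout, taken from the format line in this note:
#   <timestamp> [<thread name>] <trace level> <detail>
OMS_ENTRY = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+) +"
    r"(?P<detail>.*)$"
)

def parse_oms_entry(line):
    """Return a dict for one emoms.log / emoms.trc entry, or None if the line does not match."""
    match = OMS_ENTRY.match(line.rstrip("\n"))
    if match is None:
        return None  # for example a continuation line of a stack trace
    entry = match.groupdict()
    # The milliseconds follow a comma, so the comma is part of the format string.
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%Y-%m-%d %H:%M:%S,%f")
    return entry

sample = ("2010-05-07 15:20:27,334 [PingHeartBeatRecorder] DEBUG "
          "emdrep.pingHBRecorder mainTask.884 - Ping heartbeat recorder mainTask called")
entry = parse_oms_entry(sample)
print(entry["thread"], entry["level"], entry["timestamp"])
```

Lines that do not start with a timestamp return None, so a caller can decide whether to attach them to the preceding entry or to skip them.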

Suggestions for Reading the emoms.log / emoms.trc files:

    • Not every line in the log/trace file is an error message or indicative of a problem. There are a lot of informational messages which are logged to depict the OMS operations. When reading the emoms.log / emoms.trc, it is necessary to know the operation / activity for which the entries are being checked. 
      For example, if you need to check for Loader related problems, search in the emoms.trc for entries related to 'XMLLoader'. Depending on the number of xml loader threads configured, there could be multiple threads in the emoms.trc performing the Data Upload. 
      Entries will look like this:

2010-05-10 12:55:02,979 [XMLLoader0 40000000995.xml] DEBUG XMLLoader.Splitter startElement.794 - m_docOmsProtocolVer = 10.2.0.5.0 m_version = null
..
2010-05-10 13:01:54,913 [XMLLoader1 10000000246.xml] DEBUG emdrep.XMLLoaderContext flushAllStatements.1402 - Rows updated during flushing for table MGMT_METRICS_RAW insert = 13 update = 0

 

    • Understand the Java threads in the OMS and their corresponding entries. These are described in Note 1097545.1: Description of Important Java Threads in a 10g Grid Control Oracle Management Service
    • Find the Thread name in the log / trace messages for the problematic sub-system. If necessary, you can filter the contents of the log / trace file for this thread name, for example:

      On Unix:

      grep XMLLoader emoms.trc > emoms_loader.trc

      On Windows:

      findstr XMLLoader emoms.trc > emoms_loader.trc
    • If the timestamp at which the problem occurred is known, then always check the log / trace file for entries around this timestamp. If checking for a problem spanning across the OMS and Agent components, for example: execution of an EM-level job, then check the TimeZone differences between the OMS and Agent machines. It is also possible that the machine time is not correctly set. The log / trace file entries should be read after considering the above timezone / timestamp settings.
    • As explained in Note 229627.1, both the emoms.log and emoms.trc files are written in a cyclic manner. After the filesize limit is reached, the log / trace file is rotated. The file rotation is done till the limit for total number of files is reached. Hence, it is not helpful if only the entries in the emoms.trc file are checked. Instead, all the emoms.trc.n files should be checked for the entries related to a problem. 
    • If the OMS is running in DEBUG mode, the emoms.trc may fill up at a very fast rate, depending on the total number of targets being monitored and the activities being performed. The trace entries may be lost quite quickly as the files get rotated. Hence, it is important that immediately after reproducing the issue, all the files in the <OMS_HOME>/sysman/log directory are backed up. If needed, the limits for the filesize and total number of files can be increased by modifying the parameters listed in Note 229627.1.
    • If you are diagnosing an OMS start / stop problem, then refer to the flow of steps in 
      Note 730308.1: How to Troubleshoot Process Control (start, stop, check status) the 10g Oracle Management Service(OMS) Component in 10g Enterprise Manager Grid Control.
    • If you are diagnosing a communication problem from the OMS to Repository / Agent, refer to the steps in 
      Note 1089693.1: How to Troubleshoot Communication Between the Oracle Management Service (OMS) and Grid Control Repository Database Components in 10g Enterprise Manager Grid Control
      Note 1088414.1: How to Troubleshoot Communication Between the Oracle Management Service (OMS) and Grid Agent Components in 10g Enterprise Manager Grid Control
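The suggestions above about checking all the rotated emoms.trc.n files for a given thread name can be sketched in Python as follows. This is an illustration under stated assumptions, not an Oracle utility: the helper names rotated_files and lines_for_thread are hypothetical, and the ordering assumes that a higher suffix number means an older file, which should be verified against the actual file timestamps. The demonstration writes the sample entries from this note into a temporary directory rather than reading a real <OMS_HOME>/sysman/log.

```python
import os
import re
import tempfile

def rotated_files(log_dir, base="emoms.trc"):
    """List base.n files and base, oldest first (assumes a higher n means an older file)."""
    suffix = re.compile(re.escape(base) + r"\.(\d+)$")
    numbered = []
    for name in os.listdir(log_dir):
        match = suffix.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    ordered = [name for _, name in sorted(numbered, reverse=True)]
    if os.path.exists(os.path.join(log_dir, base)):
        ordered.append(base)  # the live file holds the newest entries
    return [os.path.join(log_dir, name) for name in ordered]

def lines_for_thread(paths, thread_prefix):
    """Yield lines whose [thread name] field starts with thread_prefix."""
    marker = "[" + thread_prefix
    for path in paths:
        with open(path, errors="replace") as handle:
            for line in handle:
                if marker in line:
                    yield line.rstrip("\n")

# Demonstration with the sample entries quoted in this note.
demo_dir = tempfile.mkdtemp()
older = (
    "2010-05-07 15:20:27,334 [PingHeartBeatRecorder] DEBUG emdrep.pingHBRecorder "
    "mainTask.884 - Ping heartbeat recorder mainTask called\n"
    "2010-05-10 12:55:02,979 [XMLLoader0 40000000995.xml] DEBUG XMLLoader.Splitter "
    "startElement.794 - m_docOmsProtocolVer = 10.2.0.5.0 m_version = null\n"
)
newer = (
    "2010-05-10 13:01:54,913 [XMLLoader1 10000000246.xml] DEBUG emdrep.XMLLoaderContext "
    "flushAllStatements.1402 - Rows updated during flushing for table MGMT_METRICS_RAW "
    "insert = 13 update = 0\n"
)
with open(os.path.join(demo_dir, "emoms.trc.1"), "w") as handle:
    handle.write(older)
with open(os.path.join(demo_dir, "emoms.trc"), "w") as handle:
    handle.write(newer)

matches = list(lines_for_thread(rotated_files(demo_dir), "XMLLoader"))
for line in matches:
    print(line)
```

Matching on the opening bracket plus the thread prefix avoids picking up entries from other threads that merely mention the same word in their detail text.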




********************************************************************************

Diagnostic Tools for the Enterprise Manager Grid Control Repository

Problems at the Repository Database level could be due to:

1. Database and Listener level issues such as Database Hang, Archive destination full, ORA-600 / 7445 errors causing database crash, inaccessibility of the Datafiles/Redologs etc, block corruption, dbms_jobs not running at scheduled time, Tablespaces / Datafiles running out of space, Listener Hang / crash etc.

2. EM Repository Schema (sysman) level issues such as Invalid Objects, design flaw in the pl/sql objects, deadlocks caused by EM sessions, EM related dbms_jobs not submitted etc.

The following diagnostic tools can be used for investigating problems at the Repository Database level:

  • EMDiagkit

    The EMDiagkit is a diagnostic tool developed to assist in diagnosis and correction of Enterprise Manager 10g Framework issues. At present, the tool allows us to extract necessary troubleshooting data from the EM Repository Schema using the repvfy utility. 

    Details about repvfy:
    • The repvfy utility is certified for both 10gR1 and 10gR2 Grid Control
    • The shell and PERL scripts used by repvfy have been verified on the following platforms: 
      • Linux 
      • Windows 
      • Solaris 
      • HP-UX 
      • AIX 
      • Mac OS
    • There are 3 diagnostic actions that can be performed against the EM Repository:
      • Validate the entire repository for data integrity by running "verify"
      • Analyze individual repository objects by running "analyze"
      • Dump details of a repository object and the source of a PL/SQL object.
    • The available "levels" for repository verification and object analysis are:

      Level  Description
      0      Fatal issues (functional breakdown). These tests highlight fatal errors found in the repository. These errors will prevent EM from functioning normally and should be addressed straight away. All tests numbered 0 to 99 are in this category.
      1      Critical issues (functionality blocked). All tests numbered 100 to 199 are in this category.
      2      Severe issues (restricted functionality). All tests numbered 200 to 299 are in this category.
      3      Warning issues (potential functional issue). These tests are meant as a 'warning', to highlight issues which could lead to potential problems. All tests numbered 300 to 399 are in this category.
      4      Informational issues (potential functional issue). These tests are informational only. They represent best practices, potential issues, or just areas to verify. All tests numbered 400 to 499 are in this category.
      5      Currently not used.
      6      Best practice violation. These tests highlight discrepancies between the known best practices and the actual implementation of the EM environment. All tests numbered 600 to 699 are in this category.
      7      Purging issues (obsolete data). These tests highlight failures to clean up (all the) historical data, or problems with orphan data left behind in the repository. All tests numbered 700 to 799 are in this category.
      8      Failure Reports (historical failures). These tests highlight historical errors that have occurred. All tests numbered 800 to 899 are in this category.
      9      Tests and internal verifications. These tests are internal tests, or temporary and diagnostic tests added to resolve specific problems. They are not part of the 'regular' kit, and are usually added while debugging or testing specific issues. All tests numbered 900 to 999 are in this category.

    For more details on the verify / analyze / dump options of repvfy, refer to
    Note 421563.1: EMDiagkit - How to Use the EMDiag Kit

    Also, refer to
    Note 421053.1: EMDiagkit Download and Master Index
    Note 421499.1: EMDiagkit - How to Install - Deinstall
    Note 421586.1: EMDiagkit - Environment Variables
    Note:

    1. Always download and use the latest EMDiagkit. The tests and options are regularly modified and improved by Development to ensure that the latest bugs / issues identified are discovered by the Diagkit.
    2. As seen in the table above, Tests having a higher level have lower impact. So, it is necessary to check the level of the test to understand whether the issue reported is critical or not.
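The relationship between a repvfy test number and its level can be expressed as simple integer arithmetic. The sketch below is only an illustration of the numbering scheme listed in this note; the function name repvfy_level and the short category labels are assumptions made for the example and are not part of the EMDiagkit.

```python
# Short labels derived from the level descriptions in this note (illustrative only).
LEVEL_NAMES = {
    0: "Fatal",
    1: "Critical",
    2: "Severe",
    3: "Warning",
    4: "Informational",
    6: "Best practice violation",
    7: "Purging",
    8: "Failure report",
    9: "Internal test",
}

def repvfy_level(test_number):
    """Map a test number (0 to 999) to its (level, label) pair."""
    if not 0 <= test_number <= 999:
        raise ValueError("test numbers range from 0 to 999")
    level = test_number // 100  # tests 0-99 are level 0, 100-199 are level 1, and so on
    return level, LEVEL_NAMES.get(level, "Currently not used")

for number in (42, 150, 705):
    print(number, repvfy_level(number))
```

Because lower levels carry higher impact, sorting reported test numbers in ascending order lists the most serious findings first.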
  • Using emctl

    From 10.2.0.5 Grid Control onwards, the emctl utility has an additional option for dumping the details of the Repository. This can be extremely useful for verifying any incorrect configuration or performance issue in the Repository Database. The output is similar to a repvfy dump.
    For details, check 
    Note 842677.1: How To Get Repository Dump For Assessment Using emctl
  • PL/SQL Tracing

    It is possible to trace the PL/SQL routines executed in the Grid Control Repository for certain OMS modules. This feature is available from Grid Control 10.2.0.1 onwards and is very useful when debugging a SQL exception or narrowing down a problem with one of the internal OMS server sub-systems.

    The modules / context-types which can be traced are:

         COMPLIANCE_EVALUATION
         DEFAULT
         EM.BLACKOUT
         EM.JOBS
         LOADER
         NOTIFICATION
         REPOCOLLECTION
         SVCTESTAVAIL
         TRACER


    For more details regarding the steps, refer to 
    Note 435055.1: How to Enable Tracing for PL/SQL Routines in the 10g Enterprise Manager Grid Control Repository

Note:

1. Do not leave modules in trace mode for prolonged periods of time. Enabling the tracing for a long time can drastically increase the size of the EMDW_TRACE_DATA table in the database, where the tracing data is stored.

2. Enable the tracing to get more information about a specific issue, then deactivate it by calling the 'set_trace_level' routine again with a value of '0' (zero).

3. Enable the tracing only for the problematic module, which needs to be traced.

4. There is no need to stop and restart any Enterprise Manager Component (OMS, Repository, OMS and Agent monitoring the Repository) for the tracing level change to take effect.

5. The tracing level change takes effect immediately and you will see rows created in the SYSMAN schema table EMDW_TRACE_DATA right after you changed the tracing level.
However if you do not see any logging in the table EMDW_TRACE_DATA, there may be two reasons:

    • The tracing changes have not been committed.
    • The module that was being traced has not yet been executed. For example, if tracing the Notification, then any metric for which the notification has been enabled should be triggered so that the pl/sql routine related to the notification sub-system gets executed.

 

  • RDA

    The Remote Diagnostic Agent (RDA) can be executed specifically with a Database profile name (DB10g or DB11g) in order to reduce the number of questions that need to be answered and also to collect all details of the Database home correctly.

    The steps to execute the RDA with a Database profile are explained in:

    Note 1057051.1: How to Run the RDA against a Grid Control Installation
  • Database-level Diagnostic Utilities

    There are many tools at the Database side, which are helpful in diagnosing performance issues in the Repository Database. A Database-level problem can manifest itself in several ways affecting the performance and operation of the Grid Control components:
    • OMS components such as the Loader and Notifications are slow, resulting in a high backlog.
    • The OMS is crashing frequently due to inaccessibility of the repository database. This can be verified by attempting a connection to the database via sqlplus, outside the EM setup.
    • Certain Grid Console pages are slow in responding, etc.

 

Note: The scope of this document does not include detailed description of all the Database diagnostic tools, but will list the useful tools available.


1. If the Grid Console is accessible and the Repository Database is discovered as a target, then most of the Database details can be viewed from the Console itself:

    • Check the metric alerts in the database homepage by navigating to Targets -> Databases -> click on the database name. Any problem reported by a critical alert should be corrected immediately. All warning alerts should be checked and preventive action taken, if necessary, to avoid critical problems in future.
    • Check the Targets -> Databases -> click on the database name -> Performance page to:
      • Identify the cause of any bottlenecks.
      • Access information for database locks, top SQL, top sessions, top files, and top objects.
      • Generate AWR / ADDM / ASH reports for a performance analysis. Multiple AWR/ADDM/ASH Reports should be collected for a 1-hour slot, to understand the performance of the Database. 
      • For ASH Report: In the Grid Console, go to Targets -> Databases -> click on the repository database -> Performance tab -> click on 'Run ASH Report'
      • For AWR Report: In the Grid Console, go to Targets -> Databases -> click on the repository database -> Click on one of the instances -> Server -> Automatic Workload Repository -> 'Run AWR report'
      • For ADDM Report: In the Grid Console, go to Targets -> Databases -> click on the repository database -> Advisor Central -> ADDM -> Choose "Run ADDM to analyze current performance".
      • Switch to Memory Access Mode for slow or hung systems, etc.
      • If you select Historical from the View Data drop-down list, another page appears with a Historical Interval Selection chart. Drag the shaded box to the desired 24-hour interval to update the charts on the page. The historical view provides the following monitoring links:
        • Period SQL 
        • Instance Activity 
        • Snapshots 
      • For a particular database session: enable SQL (10046) trace, check the wait events, open cursors, statistics and SQL ID, or kill the session.
      • Click on the online help button to view more details about the features in the Database Performance page.

2. To collect Database diagnostics outside of Grid Console, refer to:

    
    Note 438452.1: Performance Tools Quick Reference Guide
    Note 1062507.1: Database 11.2 Product Info Center: Diagnostics
    Note 556679.1: Data Gathering for Troubleshooting RAC Issues
    Note 402983.1: Database Performance - FAQ
    Oracle Documentation: Oracle® Database Performance Tuning Guide



*******************************************************************************

Diagnostic Tools for the 10g Grid Agent

Problems at the Agent level could be related to:

1. Process control - unable to start / stop / check status
2. Performance - Agent re-starting / crashing at frequent intervals, with or without core dump, high usage of cpu / memory / file handles, etc.
3. Communication problems from Agent to the OMS or monitored targets.
4. Failure of some of the Agent components to function properly - metrics not being collected at correct intervals, metric collection is failing or returning incorrect results, Agent unable to upload files to OMS, etc.

The following diagnostic tools are available for diagnosing the problems at the Agent level:

  • RDA

    The Remote Diagnostic Agent (RDA) can be executed specifically with Agent profile name: AGT in order to reduce the number of questions that need to be answered and also to collect all details of the Agent home correctly. 
    The steps to execute the RDA with the AGT profile are explained in:

    Note 1057051.1: How to Run the RDA against a Grid Control Installation

    After executing the RDA, navigate to the 'output' directory of the RDA and open the file RDA__start.htm in any browser. The RDA output consists of three frames: two on the left side and one on the right side.


The topic chosen in the top frame on the left has sub-topics that are shown in the frame below it. Clicking on the sub-topics will show the details in the right frame.

To access the Agent details click on the 'EM Agent (agentmachine.domain:1830)' in the top frame on the left. 

The RDA output is useful for gathering a lot of information about the Agent, namely:

    • Obtaining the Hostnames and IP addresses of the Agent machine.
    • Performing a network ping test to the OMS machine to test its accessibility. 
    • The output of the emctl command showing whether the Agent is running, able to upload, its list of monitored targets, metric execution schedule, agent timezone etc.
    • List of current metric severity states for all the monitored targets.
    • List of files pending upload to the OMS.
    • Oracle Inventory details from the Agent home - list of one-off patches, Patchsets
    • Configuration files from
      <AGENT_HOME>/sysman and sub-directories
    • Log files from
      <AGENT_HOME>/sysman/log
      <AGENT_HOME>/cfgtoollogs and sub-directories, etc.

Note: Always download and use the latest RDA, to take advantage of the new features introduced in the new version.

 

  • Agent Log/Trace Files

    The Agent component generates log/trace files during its startup and operation. These log/trace files have a wealth of information about the working of the Agent. It is also possible to increase the trace level from the default WARN to DEBUG, to get additional information in the trace files.

    There are multiple log/trace files in the Agent home, which can be checked for problems.
    For more details, refer to 
    Note 229624.1: How to Locate and Control the Size and the Content of Grid Control Agent 10g Log / Trace Files

    The Agent is a multi-threaded C Application. Hence, the log/trace files will have entries from many threads interspersed with each other. 

    The format of an entry in the emagent.trc / emagent.log is as follows:

    <timestamp> <thread ID> <trace level> <component name>: <detail>

    For example:

    2010-05-07 15:20:27,334 Thread-3572 ERROR engine: [oracle_database,V920_agentmachine.domain,db_recSegmentSettings] : nmeegd_GetMetricData failed : ORA-12541: TNS:no listener

    where:
    2010-05-07 15:20:27,334 : timestamp
    Thread-3572 : Thread ID
    ERROR : Trace level
    engine : Component name, in this case refer to Metric Engine
    '[oracle_database,V920_agentmachine.domain,db_recSegmentSettings]..' : details of the entry.


    The format of an entry in the emagent_perl.trc is as follows:

    <perl script>: <timestamp>: <trace level>: <details from the script>

    For example:

    emrepresp.pl: Fri May 14 00:25:23 2010: DEBUG: Connectdescriptor (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=em11gc.idc.oracle.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=emrep.oracle.com)))

    where:
    emrepresp.pl : name of the perl script being executed.
    Fri May 14 00:25:23 2010 : timestamp
    DEBUG : Trace level
    'Connectdescriptor (DESCRIPTION=...' : details from the perl script
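    The emagent_perl.trc layout differs from the emagent.trc layout, mainly in its timestamp style, so it needs its own pattern. The Python sketch below is an illustration based only on the sample entry in this note; the name parse_perl_entry is hypothetical, and the timestamp parsing assumes English day and month abbreviations.

```python
import re
from datetime import datetime

# Assumed layout, based on the sample entry in this note:
#   <perl script>: <Day Mon DD HH:MM:SS YYYY>: <trace level>: <details>
PERL_ENTRY = re.compile(
    r"^(?P<script>\S+): "
    r"(?P<timestamp>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): "
    r"(?P<level>[A-Z]+): "
    r"(?P<detail>.*)$"
)

def parse_perl_entry(line):
    """Return a dict for one emagent_perl.trc entry, or None if the line does not match."""
    match = PERL_ENTRY.match(line.rstrip("\n"))
    if match is None:
        return None
    entry = match.groupdict()
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%a %b %d %H:%M:%S %Y")
    return entry

sample = ("emrepresp.pl: Fri May 14 00:25:23 2010: DEBUG: Connectdescriptor "
          "(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=em11gc.idc.oracle.com)"
          "(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=emrep.oracle.com)))")
perl_entry = parse_perl_entry(sample)
print(perl_entry["script"], perl_entry["level"], perl_entry["timestamp"])
```

    Converting both trace formats to datetime values allows entries from emagent.trc and emagent_perl.trc to be compared on a common timeline.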

    Suggestions for Reading the Agent log / trace files:
    • Not every line in the log/trace file is an error message or indicative of a problem. There are a lot of informational messages which are logged to depict the Agent operations. When reading the Agent log/trace files, it is necessary to know the operation / activity for which the entries are being checked.
      For example, if you need to check for Metric execution problems, search in the emagent.trc for entries related to the internal metric name. Entries will look like this:


2009-01-27 08:57:27,353 Thread-1236 ERROR engine: [oracle_database,orcl.domain,alertLog] : nmeegd_GetMetricData failed : Missing Properties : [log_file_absolute]
2009-01-27 08:57:27,353 Thread-1236 WARN collector: <nmecmc.c> Error exit. Error message: Missing Properties : [log_file_absolute]


where the entry details include:
oracle_database: target type
orcl.domain: target name
alertLog: internal name of the metric, which can be identified using the steps in Note 435975.1
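The target type, target name and internal metric name can be pulled out of the Metric Engine error entries to see which metrics fail most often. The following Python sketch is illustrative and assumes the bracketed [type,name,metric] layout shown in the sample entries of this note; the name failing_metrics is hypothetical.

```python
import re
from collections import Counter

# Matches the bracketed triplet that follows "ERROR engine:" in the sample entries.
METRIC_CONTEXT = re.compile(r"ERROR engine: \[([^,\]]+),([^,\]]+),([^,\]]+)\]")

def failing_metrics(lines):
    """Count Metric Engine errors per (target type, target name, metric) triplet."""
    counts = Counter()
    for line in lines:
        match = METRIC_CONTEXT.search(line)
        if match:
            counts[match.groups()] += 1
    return counts

sample_lines = [
    "2009-01-27 08:57:27,353 Thread-1236 ERROR engine: [oracle_database,orcl.domain,alertLog] : "
    "nmeegd_GetMetricData failed : Missing Properties : [log_file_absolute]",
    "2009-01-27 08:57:27,353 Thread-1236 WARN collector: <nmecmc.c> Error exit. "
    "Error message: Missing Properties : [log_file_absolute]",
    "2010-05-07 15:20:27,334 Thread-3572 ERROR engine: "
    "[oracle_database,V920_agentmachine.domain,db_recSegmentSettings] : "
    "nmeegd_GetMetricData failed : ORA-12541: TNS:no listener",
]
metric_counts = failing_metrics(sample_lines)
for (target_type, target_name, metric), count in metric_counts.items():
    print(target_type, target_name, metric, count)
```

The collector WARN line is skipped because it carries no target triplet; only the engine ERROR entries are counted.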

    • Understand how the various components and threads of the Agent interact with each other. These are described in Note 1101615.1: Description of Important Components / Threads in a 10g Enterprise Manager Grid Control Agent
    • Find the component name / Thread ID in the log / trace messages for the problematic sub-system. If necessary, you can filter the contents of the log / trace file for this thread name, for example:

      On Unix:

      grep engine emagent.trc > emagent_engine.trc

      On Windows:

      findstr engine emagent.trc > emagent_engine.trc
    • If the timestamp at which the problem occurred is known, then always check the log / trace file for entries around this timestamp. If checking for a problem spanning across the OMS and Agent components, for example: execution of an EM-level job, then check the TimeZone differences between the OMS and Agent machines. It is also possible that the machine time is not correctly set. The log / trace file entries should be read after considering the above timezone / timestamp settings.
    • As explained in Note 229624.1, both the agent log/trace files are written in a cyclic manner. After the filesize limit is reached, the log / trace file is rotated. The file rotation is done till the limit for total number of files is reached. Hence, it is not helpful if only the entries in the emagent.trc file are checked. Instead, all the emagent.trc.n files should be checked for the entries related to a problem. 
    • If the DEBUG mode is enabled for many of the Agent components, then the emagent.trc may fill up at a very fast rate, depending on the total number of targets being monitored and the activities being performed. The trace entries may be lost quite quickly as the files get rotated. Hence, it is important that immediately after reproducing the issue, all the files in the <AGENT_HOME>/sysman/log directory are backed up. If needed, the limits for the filesize and total number of files can be increased by modifying the parameters listed in Note 229624.1.
    • If you are diagnosing an Agent start / stop problem, then refer to the flow of steps in 
      Note 1082009.1: Master Note for 10g Grid Control Agent Process Control (Start, Stop & Status) & Configuration
    • If you are diagnosing a communication problem from the Agent to OMS, refer to the steps in 
      Note 951076.1: How to Troubleshoot Communication Between the Grid Agent and Oracle Management Service (OMS) Components in 10g Enterprise Manager Grid Control

      If the network configuration is OK, and the two end-points can talk to each other, the HTTP[S] protocol exchange can still throw errors:
      • (OMS) Apache timeouts: If the Apache webserver is not responding to the Agent requests in time, the Agent will shutdown the connection, and retry at a later time.
      • (Agent) Agent timeouts / Agent busy: If too many concurrent requests are coming in at the Agent side, the Agent will run out of threads to handle the workload, and will reject new incoming requests with an 'Agent Busy' error.
    • If you are diagnosing an upload problem from the Agent to OMS, refer to the steps in 
      Note 550617.1: How To Effectively Investigate & Diagnose 10g Grid Agent Upload Problems to the Oracle Management Service (OMS).
    • If you are diagnosing an Agent performance issue, follow the steps in
      Note 1087997.1: Master Note for 10g Enterprise Manager Grid Control Agent Performance & Core Dump issues
    • Tracing SQL*Net Connection to Target Database:

      By default, the Agent installation does not enable any kind of SQL*Net tracing for connections made to the Target database. Only a SQLNET.LOG file is created in the <AGENT_HOME>/sysman/emd directory, recording any failed login attempt. There is no rollover mechanism defined for this file, so it can continue to grow if the SQL*Net configuration is not adjusted. The sqlnet.ora file in $TNS_ADMIN or in the <AGENT_HOME>/network/admin directory can be modified to change the logging and tracing of the SQL*Net connections. SQL*Net logging can be done for the SQL fetchlet, the AQ recvlet and the PERL scripts making a SQL*Net connection.
    • Debugging metrics and metric collections:
      • Before enabling any debugging at the agent, always make sure that the metric in question is getting scheduled and executed:
        emctl status agent scheduler
      • If a target is considered 'down', the metrics for this target will not get scheduled. The scheduling of the other metrics continues only after the 'Response' metric of this target flags the 'all clear'.
      • If the emagent.trc and/or emagent_perl.trc shows that the metric is executing but throwing errors, edit the emd.properties, and set the following properties:

        tracelevel.collector=DEBUG
        tracelevel.engine=DEBUG
        tracelevel.scheduler=DEBUG
      • These settings will trace the various Agent modules involved with executing metrics at the Agent side. All log and trace information will be logged to the emagent.log and emagent.trc files.
      • PERL scripts can be debugged by activating the PERL debugging:

        EMAGENT_PERL_TRACE_LEVEL=DEBUG

        All PERL logging will be written to the emagent_perl.trc file.
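The emd.properties edit described above amounts to replacing or appending three key=value lines. The following Python sketch shows one way to do that on the text of a properties file; it is an illustration only, the helper name set_properties and the sample file content are hypothetical, and a backup of the real emd.properties should be taken before any change.

```python
# The three trace level settings listed in this note.
DEBUG_SETTINGS = {
    "tracelevel.collector": "DEBUG",
    "tracelevel.engine": "DEBUG",
    "tracelevel.scheduler": "DEBUG",
}

def set_properties(text, updates):
    """Return properties text with each key in updates set, replacing or appending lines."""
    remaining = dict(updates)
    output = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if not line.lstrip().startswith("#") and key in remaining:
            output.append(f"{key}={remaining.pop(key)}")  # replace the existing value
        else:
            output.append(line)  # keep comments and unrelated properties unchanged
    # Append any setting that was not already present in the file.
    output.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(output) + "\n"

# Hypothetical excerpt, used only to demonstrate the helper.
sample_text = "# hypothetical excerpt of emd.properties\ntracelevel.collector=WARN\n"
updated_text = set_properties(sample_text, DEBUG_SETTINGS)
print(updated_text)
```

The same helper can restore the original values after the issue has been reproduced, by passing the previous trace levels as the updates.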

  • Agent Metric Browser

    The Agent Metric Browser is a tool that can be used to view the 'raw' metric data being collected by the Management Agent at that particular instant of time; that is, accessing the metric details forces the Agent to execute the metric and show the current results.

    For steps to enable the Metric Browser at the Agent, refer to

    Note 276350.1: How to Enable the Metric Browser/Agent Browser for the Oracle Management Agent

    After enabling, the Metric Browser URL can be accessed from any web browser. The main page shows the list of Target types and Target names that are monitored by the Agent. Each target name is a link; clicking it shows the list of Metrics being collected for that target.

    To view the details of a particular metric, click on the metric name, for example 'dumpFull(Dump Area)'.

Suggestions for Using the Agent Metric Browser

    • By default, the metric browser is disabled and not accessible.
    • Accessing the metric details from the Metric Browser will not result in any Alerts being raised for that metric. The Metric Browser only retrieves the metric data. It does not compare this data against the Metric thresholds.
    • Its main purpose is to help debug metric issues and is useful in checking metric collection errors. No agent functionality is based on the metric browser. 
    • Every request made from the metric browser is treated like a real-time metric request to schedule and execute that metric immediately to get the results.
    • Enabling the metric browser and accessing the metric details at very frequent intervals can consume more resources on the Agent machine, as the Agent is forced to execute the metric out of its scheduled turn.
    • Avoid using the Metric Browser to test metrics that scan a log/trace file and keep track of the last scanned point (line). The next metric evaluation starts only from this last scanned point.
      Examples are the host-level Logfile monitoring metric and the database-level Alert Log monitoring metric. Log errors retrieved when such a metric is run via the Metric Browser do not raise alerts, because the Metric Browser only runs the underlying script. When the Agent next collects the metric on schedule, the metric script examines the log only from the point where the Metric Browser run finished, so the errors already scanned by the Metric Browser run are never alerted on. For more details, refer to:

      Note 976982.1: Monitoring 10g Database Alert Log Errors in Enterprise Manager
      Note 403264.1: Troubleshooting a Database Tablespace Used(%) Alert problem
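The effect of the 'last scanned point' described above can be illustrated with a toy model. This is only a sketch of the behavior as this note describes it: the LogScanner class, its attribute names and the "ORA-" match are hypothetical and do not reflect the actual Agent metric scripts.

```python
class LogScanner:
    """Toy model of a log-scanning metric that remembers its last scanned line."""

    def __init__(self):
        # Index of the first line that has not been scanned yet.
        self.last_line = 0

    def scan(self, log_lines, raise_alerts):
        """Scan only the lines added since the previous scan.

        raise_alerts=False models a Metric Browser request: the script runs
        and the scan position moves forward, but no alert is produced.
        raise_alerts=True models a scheduled Agent collection.
        """
        new_lines = log_lines[self.last_line:]
        self.last_line = len(log_lines)
        errors = [line for line in new_lines if "ORA-" in line]
        return errors if raise_alerts else []


log = ["instance started", "ORA-00600: internal error"]
scanner = LogScanner()

# Metric Browser request: the error line is read, but no alert is raised.
browser_alerts = scanner.scan(log, raise_alerts=False)

# Scheduled collection: scanning resumes after the error line, so it is missed.
log.append("checkpoint complete")
scheduled_alerts = scanner.scan(log, raise_alerts=True)
```

In this model both browser_alerts and scheduled_alerts are empty, even though the log contains an error. A scanner that was never driven from the Metric Browser would have reported the error at the scheduled collection.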
  • Custom Troubleshooting Scripts 

    Oracle Support and development provides custom scripts, as and when needed, which can be used for collecting more data at the Agent. For example:

    Note 464414.1: Script for Checking the Agent CPU, Memory & Threads Usage 

    This script is available for Linux, Oracle Solaris & AIX OS.




********************************************************************************

Diagnostic Tools for the Enterprise Manager Console Operations

Problems at the Console level can manifest in multiple ways:

1. Homepage of the Grid Console is very slow after Login
2. Particular page in the Console is very slow.
3. Communication errors are seen when connecting to Agents and other targets.
4. Some pages in the Console show incorrect data, etc.

The following diagnostic tools can be used for investigating problems for Console-level Operations:

  • Extended UI Tracing for Console Operations against Repository Database

    The SQL queries that are run against the Repository Database for a console-level activity can be traced using the Extended SQL / Console UI Trace URL:

    http://<omshost>:<port>/em/console/admin/rep/extendedSQLTrace

    The UI tracing is similar to enabling 10046 trace in the repository database for the session from the console. For more details, refer to
    Note 436592.1: Steps to Enable Extended SQL / UI Trace for Enterprise Manager Grid Console Sessions Against Repository Schema

Note:

    • Extended UI trace is useful only for SQL queries that are executed by the OMS against the repository database. It is not useful for real-time data that is gathered from the Target Databases.
    • The UI tracing should not be left enabled for a prolonged period of time. Doing so generates a large number of trace files in the user_dump_dir location of the repository database, which can drastically affect disk usage and performance.

 

  • 'EM SQL History' for Administrative / Real-time Operations against Target Database

    For many of the Database-related pages in the Grid Console, the data is obtained by making a direct SQL*Net connection from the OMS machine to the target Database. This connection bypasses the Grid Agent on the Database machine; hence it is possible to view certain pages of the database even if the Agent is not running. 
    Such database pages can be classified as:
    • Administrative: Pages which are related to DBA activities against the database, for example: checking the Tablespace information from the Database Administration (till 10.2.0.4 GC) or the Server (from 10.2.0.5 GC) page, adding a datafile, performing a DDL operation against a Table, enabling backup, modifying init.ora parameters etc.
    • Real-time Monitoring: Pages where the real-time information for the Database is shown, for example, the Database Performance page.

The OMS, by default, makes use of the host, port, database SID connection details that are provided in the Targets -> Databases -> click on the database name -> click on the 'Monitoring Configuration' link, to connect to the Database. If needed, it is also possible to provide a 'Preferred Connect String' in the Monitoring configuration page, for the OMS to use when connecting to the target database.
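As an illustration of how the host, port and SID details map onto a direct SQL*Net connection, the sketch below assembles a standard Oracle Net connect descriptor from those three values. This note does not state the exact string the OMS builds internally, so treat the function name build_connect_descriptor and the sample values as assumptions made for illustration.

```python
def build_connect_descriptor(host, port, sid):
    """Assemble an Oracle Net connect descriptor from host, port and SID."""
    return (
        "(DESCRIPTION="
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))"
        f"(CONNECT_DATA=(SID={sid})))"
    )


# Hypothetical monitoring configuration values for a target database.
descriptor = build_connect_descriptor("dbhost.example.com", 1521, "orcl")
```

A descriptor of this form can be tested from the OMS machine with any SQL*Net client, which helps separate connectivity problems from Console problems.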

It is possible to view / trace the SQL statements executed by the OMS against the Target Databases, for the operations performed in the Grid Console:

    • Up to the 10.2.0.4 Grid Console, navigate to Targets -> Databases -> click on the database name -> 'SQL History' link at the bottom of the page, under the 'Related Links' section.
    • In the 10.2.0.5 Grid Console, navigate to Targets -> Databases -> click on the database name -> 'EM SQL History' link at the bottom of the page, under the 'Related Links' section.

The EM SQL History page shows the Administration and Real-time monitoring SQL statements.

To collect the SQL statements executed by the Console session, against a particular database, refer to 
Note 357318.1: How to Identify the SQL Used by OMS For Administrative / Real-time Monitoring Pages of a Database Target?

 

Note: Along with the above tools, any operation performed in the Console is also logged in the emoms.log / emoms.trc files of the OMS. Enabling DEBUG level tracing and reproducing the issue can provide more details about the Console thread performing the operations.




***************************************************************************

Diagnostic Tools Common for all Enterprise Manager Components

  • Wget

    GNU Wget (or just Wget) is a computer program that retrieves content from web servers, and is part of the GNU Project. Its name is derived from World Wide Web and get, connotative of its primary function. It currently supports downloading via HTTP, HTTPS, and FTP protocols, the most popular TCP/IP-based protocols used for web browsing.
    For more details, refer:

Note: Wget is a third party tool and problems faced while using this tool cannot be supported by Oracle Support. Also, the above mentioned download links are not maintained by Oracle and hence are subject to change.
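One possible use of Wget when diagnosing Enterprise Manager is to check from the command line whether a given URL responds. The sketch below only assembles the command line; the helper name build_wget_command and the URL are hypothetical placeholders, and the options shown (--timeout, --tries, --output-document, --no-check-certificate) are standard GNU Wget options rather than anything specific to Enterprise Manager.

```python
def build_wget_command(url, timeout_seconds=30, tries=1, skip_cert_check=False):
    """Return a wget argument list that fetches `url` once and prints it to stdout."""
    command = [
        "wget",
        f"--timeout={timeout_seconds}",  # give up if the server does not answer in time
        f"--tries={tries}",              # do not retry, so failures show up immediately
        "--output-document=-",           # write the response to stdout instead of a file
    ]
    if skip_cert_check:
        # Useful only for test systems with self-signed certificates.
        command.append("--no-check-certificate")
    command.append(url)
    return command


# Hypothetical URL; substitute the URL being investigated.
command = build_wget_command("http://omshost.example.com:7777/em", timeout_seconds=15)
```

The resulting list can be passed to subprocess.run(command) on a machine where Wget is installed, or joined with spaces and run directly in a shell.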

 

  • Operating System Level Utilities

    Many tools available at the Operating System level, provided by the OS vendors or by third parties, are useful for tracking the performance of the EM components.
    • Unix/Linux Utilities

      The OS Watcher is a tool developed by the Database Center Of Expertise team that can be very helpful in collecting OS related statistics that can be used when diagnosing a performance problem.
      • OS Watcher (OSW) is a series of shell scripts that collect specific kinds of data, using operating system diagnostic utilities.
      • Control is passed to individually spawned operating system data collector processes, which in turn collect specific data, timestamp the data output, and append the data to pre-generated and named files.
      • Each data collector will have its own file, created and named by the File Manager process.
      • OSW invokes the distinct operating system utilities listed below as data collectors.
      • These utilities, or their equivalents, are supported as available on each supported target platform:

           ps
           top
           mpstat
           iostat
           netstat
           traceroute
           vmstat
      • For more details, refer to Note 301137.1: OS Watcher User Guide
    • Microsoft Windows Utilities
      • OS Watcher (for Microsoft Windows): Similar to the Unix tool described above. For more details, refer to Note 433472.1: OS Watcher For Windows (OSWFW) User Guide 
      • Microsoft Sysinternals Utilities
        • Process Explorer
          A GUI tool that displays detailed information about each running process. Find out what files, registry keys and other objects processes have open, which DLLs they have loaded, and more. It also shows per-process details such as CPU, memory and handle usage.
        • Process Monitor
          Monitor file system, Registry, process, thread and DLL activity in real-time.
        • PsList
          Show information about processes and threads.
        • ProcDump 
          A command-line utility whose primary purpose is monitoring an application for CPU spikes and generating crash dumps during a spike, which an administrator or developer can use to determine the cause of the spike.
        • VMMap
          See a breakdown of a process's committed virtual memory types as well as the amount of physical memory (working set) assigned by the operating system to those types. Identify the sources of process memory usage and the memory cost of application features.
        • Handle
          This handy command-line utility will show you what files are open by which processes, and much more.

 

Note: Microsoft Sysinternals Utilities are third party tools and any problems faced while using these tools cannot be supported by Oracle Support. Also, the above mentioned download links are not maintained by Oracle and hence are subject to change.
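The OS Watcher collector model described earlier in this section (run an operating system utility, timestamp its output, append it to that collector's own file) can be sketched as follows. This is not OSW code: the function names, the "*** " timestamp prefix and the .dat file naming are assumptions for illustration, and for simplicity the collector creates its own file rather than relying on a separate File Manager process.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def collect_once(command, archive_dir, name):
    """Run one utility, timestamp its output and append it to the collector's file."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    out_file = archive_dir / f"{name}.dat"
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # Append, so successive samples accumulate in the same file.
    with out_file.open("a") as handle:
        handle.write(f"*** {stamp}\n{result.stdout}")
    return out_file


def demo():
    """Take two samples with a stand-in command and return the collected text."""
    # A real collector would run a utility such as ["vmstat", "1", "3"]; the
    # Python interpreter is used here only so the sketch runs on any platform.
    command = [sys.executable, "-c", "print('sample reading')"]
    with tempfile.TemporaryDirectory() as archive_dir:
        for _ in range(2):
            out_file = collect_once(command, archive_dir, "demo")
        return out_file.read_text()
```

Calling demo() returns two timestamped samples from the same file, which mirrors how each collector builds up a history that can be reviewed after a performance problem has occurred.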

 




***************************************************************************

Best Practices (Certification, Maintenance Activities, OCM, Healthcheck, CPU & PSU)

This section lists some of the best practices which will help prevent problems with Enterprise Manager components.

EM Certification Checker

It is strongly recommended that you always use a certified combination of OMS, Agent and Repository Database for managing Targets which are certified with this combination.
The Enterprise Manager certification details are available in:

Note 412431.1: Oracle Enterprise Manager 10g Grid Control Certification Checker

Maintenance Activities

  • Execute EMDiagKit at regular intervals (once per week or more frequently, depending on your setup) and check for any new problems that are reported.
  • Take valid backups of the OMS and Repository Database Homes at regular intervals, so that any configuration files deleted by accident can be restored.
  • Regularly monitor the errors in the Setup -> Management Services and Repository -> Errors page and check for any critical errors.
  • Regularly monitor the Loader backlog shown in the Grid Console Setup -> Management Services and Repository -> Overview page.
  • Enable Log Rotation for the opmn files in the OMS home, to ensure that the files do not grow very large. Refer to Note 339819.1: How to Enable Rotation for the HTTP_Server and OC4J_EM Logfiles in the 10g Grid Control OMS Home?
  • Never delete any files under AGENT_HOME/sysman/emd in order to fix problems; this will create more problems.
  • Always download and use the latest RDA to ensure that latest features are used.
  • Always download and use the latest EMDiagKit. The tests and options are regularly modified and improved by Development to ensure that the latest identified bugs / issues are detected by EMDiagKit.
  • Plan to execute, on a regular basis, the Repository Maintenance tasks described in:

    1. The White Paper:
    Enterprise Manager Grid Control Performance Best Practices (page 12/30).
    2. The EM Documentation:
    For the Grid Control 10.1.x.x.x Release, refer to Oracle Enterprise Manager Advanced Configuration.
    For the Grid Control 10.2.x.x.x Release, refer to Oracle Enterprise Manager Advanced Configuration.


OCM

Oracle Configuration Manager (OCM) works with My Oracle Support to enable proactive support capabilities that help you organize, collect and manage your Oracle configurations. It provides: proactive, configuration-specific notification of Security and General Alerts; HealthCheck recommendations based on Support best practices, when using configuration auto-collection; simplified Service Request logging, tracking and reporting; and project cataloging of key milestones and contacts associated with your configurations.

  • Among these, the following topics are related to the Enterprise Manager Components:
    • 2.52 Oracle Enterprise Manager 10g Grid Control Management Agent
    • 2.54 Oracle Enterprise Manager 10g Grid Control Management Service
    • 2.53 Oracle Enterprise Manager 10g Grid Control Management Repository
    • 2.72 Oracle Grid Control Repository (for oracle_emrep target)
    • 2.38 Oracle Agent Deployment Configuration (oracle_emd target)
    • 2.73 Oracle Home
    • 2.23 Host

Note: The above list is expected to expand as new collections are introduced in the future.

  • It is also advisable to review the collections available for the Database instance, so that the Database hosting the repository can be monitored as well:
    • 2.10 Database Instance
    • 2.78 Oracle Listener

Healthcheck

Healthchecks are executed dynamically against the Oracle Configuration Manager uploaded configurations in My Oracle Support. These checks, based on Oracle Best practices, will proactively notify you of potential problems in your environment, and provide recommendations that help you improve system performance and avoid problems in your Oracle environment. 

  • If you are receiving any Healthcheck alerts in My Oracle Support, refer to the following document for the alert details and for the corresponding document that describes how to resolve each alert:

Note 868955.1: My Oracle Support Health Checks Catalog

  • For Healthchecks specific to the Enterprise Manager and Repository Database, refer to the sections titled:
    • Enterprise Manager (for the OMS)
    • Oracle Database (for the Database hosting the Repository)





CPU and PSU

  • CPU

    Critical Patch Updates (CPUs) are the primary means of releasing security fixes for Oracle products. They are released on the Tuesday closest to the 15th day of January, April, July and October. This page lists all the currently available Critical Patch Updates (CPUs) in chronological order and is updated whenever a new Critical Patch Update is released. You can also subscribe to the CPU Email Alerts using the steps listed here.

    To obtain the latest CPU patch details for the Enterprise Manager Grid Control and its dependent products - Oracle Application Server and Oracle Database:

    - In the 
    page, click on the link shown for the latest CPU in the table under the 'Critical Patch Updates'.
    - The next page lists all the products which have security fixes in the chosen CPU release. Scroll down to the 'Patch Availability Table ..' topic and find the table with details for the Product Group and Patch Availability and Installation Information.
    - In the table, find the row related to Product Group: 'Oracle Enterprise Manager' and pick up the document number given in the Patch Availability and Installation Information column. In the document, navigate to: 

                 "Critical Patch Update Availability for Oracle Products" and then to
                 "Oracle Enterprise Manager Grid Control"
  • PSU

    Patch Set Updates (PSU) are proactive cumulative patches containing recommended bug fixes that are released on a regular and predictable schedule. PSUs are on the same quarterly schedule as the Critical Patch Updates (CPU), specifically the Tuesday closest to the 15th of January, April, July, and October. The PSUs serve as a new baseline version for reporting issues to Oracle, hence it is always recommended to be on the latest PSU release.
    • For more details on PSU, refer to Note 854428.1: Patch Set Updates for Oracle Products
    • For Enterprise Manager specific PSU, refer to Note 822485.1: Oracle Recommended Patches -- Oracle Enterprise Manager
  • Choosing between CPU / PSU patches 

    The PSU and CPU released each quarter contain the same security content. However, the patches employ different patching mechanisms, so customers need to choose wisely which patch satisfies their needs better:
    • A PSU can be applied on the CPU released at the same time or on any earlier CPU for the base release version. A PSU can be applied on any earlier PSU or the base release version. CPUs are only created on the base release version.
    • Once a PSU has been installed, the recommended way to get future security content is to apply subsequent PSUs. Reverting from PSU back to CPU, while possible, would require significant effort, and so is not advised. 
  • Getting CPU / PSU patch recommendations via OCM 

    OCM also collects and recommends the latest CPU and PSU patch that can be applied to a particular Oracle Home. These details can be seen in the My Oracle Support -> Patches and Updates -> Patch Recommendations section:
    - 'Security' patch recommendations include the CPU patches.
    - 'Other Recommendations' include the PSU patches.
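The quarterly release rule stated above for both CPUs and PSUs (the Tuesday closest to the 15th of January, April, July and October) can be turned into a small date calculation for planning maintenance windows. This is a sketch of the rule as worded in this note; the function name cpu_release_date is an assumption, and the result should always be confirmed against the schedule published by Oracle.

```python
from datetime import date, timedelta

# Months in which CPUs and PSUs are released, as stated in this note.
RELEASE_MONTHS = (1, 4, 7, 10)


def cpu_release_date(year, month):
    """Return the Tuesday closest to the 15th of the given release month."""
    if month not in RELEASE_MONTHS:
        raise ValueError("releases are scheduled for January, April, July and October")
    fifteenth = date(year, month, 15)
    # Check the 15th first, then move outwards one day at a time. A Tuesday is
    # always found within three days, and there can be no tie.
    for offset in (0, -1, 1, -2, 2, -3, 3):
        candidate = fifteenth + timedelta(days=offset)
        if candidate.weekday() == 1:  # Monday is 0, Tuesday is 1
            return candidate
    raise AssertionError("unreachable: every 7-day window contains a Tuesday")


# The four planned release dates for one year.
schedule_2014 = [cpu_release_date(2014, month) for month in RELEASE_MONTHS]
```

For 2014 this gives 14 January, 15 April, 15 July and 14 October.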

 

References

NOTE:1081865.1 - Master Note for 10g Grid Control OMS Process Control (Start, Stop and Status) & Configuration
NOTE:1082009.1 - Master Note for 10g Grid Control Agent Process Control (Start, Stop & Status) & Configuration
NOTE:1086343.1 - Master Note for 10g Grid Control Enterprise Manager Communication and Upload issues
NOTE:1087997.1 - Master Note for 10g Enterprise Manager Grid Control Agent Performance & Core Dump issues
NOTE:1092513.1 - Master Note for 10g Enterprise Manager Grid Control Security Framework
NOTE:1161003.1 - Master Note for 10g OMS Performance Issues
