How To Clear An Alert
By emdg on Feb 04, 2010
Signs In The Console
One of the most requested things for Grid Control debugging is a way to 'clear' an old alert. The problem is that what people are asking for is typically either not possible or not advisable.
To get into more detail about this problem, let's first discuss what an alert is, what it means, and what it shows in the console.
First of all, let's start with what an alert is not: an alert is NOT an error. An error condition is a problem with the execution of the metric itself. An error cannot be cleared by changing the monitoring setup of the metric: it has to be cleared by first fixing the underlying problem with the execution of the metric, as indicated in the emagent.trc file. Since the metric itself has encountered a problem, changing the thresholds will not make any difference to the error condition.
An alert is a threshold violation of a data point collected by the metric. The metric did evaluate, and a valid data point was collected and uploaded by the Agent. This data point was then compared against the specified threshold, and a violation was detected.
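The error-versus-alert distinction above can be sketched in a few lines. This is an illustrative model only, not Grid Control code: the function name, threshold values, and severity labels are assumptions made for the example.

```python
# Illustrative model (NOT Grid Control code) of the difference between
# a metric ERROR and a metric ALERT.

def evaluate_metric(collect, warning, critical):
    """Run one metric collection and classify the outcome."""
    try:
        value = collect()  # the Agent tries to collect a data point
    except Exception as exc:
        # Collection itself failed: this is an ERROR, not an alert.
        # No threshold change can clear it; the root cause must be fixed.
        return ("ERROR", str(exc))
    # Collection succeeded: compare the valid data point to the thresholds.
    if value >= critical:
        return ("CRITICAL", value)
    if value >= warning:
        return ("WARNING", value)
    return ("CLEAR", value)

# A broken metric raises -> error condition, thresholds are irrelevant
print(evaluate_metric(lambda: 1 / 0, warning=80, critical=95))
# A healthy metric returns a data point -> compared against thresholds
print(evaluate_metric(lambda: 97, warning=80, critical=95))
```

Note how changing `warning` or `critical` only affects the second case: the first case never produces a data point to compare, which is exactly why re-configuring thresholds cannot clear an error.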
This brings up the main point about alerts: there is 'state' kept on both tiers. Both the Agent and the OMS/repository know about an alert. This means you simply cannot sweep something under the rug and clean up just the OMS or just the Agent.
Clearing alerts as told by the cowboys of the wild, wild internet
A lot of people have a lot of creative ideas on how to 'get rid' of an alert in Grid Control. Several of these urban legends get posted on blogs and forums, and most of them involve removing something directly from a repository table. Although this might appear to do what people want, it is in fact CORRUPTING the data in the repository and breaking the information flow between the Agent and the OMS.
Several of these 'solutions' have been published on blogs across the internet. The recommendation here is pretty simple:
- NEVER remove data directly from the repository. No matter how appealing it may sound (or look in the console), it will ALWAYS get you into trouble.
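Why deleting repository rows corrupts things can be shown with a deliberately simplified two-tier model. This is an assumed sketch, not the real repository schema or Agent internals; the dictionaries and the `tiers_consistent` helper are invented for the illustration.

```python
# Illustrative sketch (assumed, simplified model; the real Grid Control
# tables and protocols differ): alert state lives on BOTH tiers, so
# deleting only the repository copy leaves the two sides out of sync.

agent_state = {"disk_usage": "CRITICAL"}  # the Agent's view of the alert
repo_state = {"disk_usage": "CRITICAL"}   # the OMS/repository's view

# The "cowboy" fix: delete the row from the repository only.
del repo_state["disk_usage"]

def tiers_consistent(agent, repo):
    """Both tiers must agree on outstanding alerts."""
    return agent == repo

# The console now looks clean, but the Agent still considers the alert
# open, so the two tiers no longer agree: the data is corrupted.
print(tiers_consistent(agent_state, repo_state))  # False
```

Because the Agent still holds its copy of the state, it has no reason to re-upload a severity for an unchanged condition, and the 'cleaned' console view silently diverges from reality.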
Clearing the alert
The only way to clean up an outstanding alert is to make sure the values collected by the metric are no longer in violation of the thresholds.
- The most obvious way of clearing the alert is to fix the underlying problem.
If a disk is 100% full, making some free space on the disk will clear the condition and remove the outstanding alert.
- If a metric triggered an alert for a non-fatal or non-problematic condition, the thresholds of the metric are not configured correctly. By updating the thresholds and/or changing the number of occurrences required to trigger the alert, the next iteration of the metric will evaluate the data point against the changed thresholds, and update (and preferably clear) the alert.
- If the metric is not important, the big question to ask is whether the metric is even relevant and needs to be collected in the first place. If the data points are not important, the metric itself can be disabled.
Disabling the collection clears all outstanding alerts and instructs the Agent to stop collecting the metric.
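The clearing paths above can be modeled in a small sketch. Again, this is an assumed illustration, not the Agent's actual evaluation code: the `MetricState` class, its attributes, and the threshold values are all invented for the example.

```python
# Illustrative model (NOT the real Agent code) of how the next metric
# evaluation clears an alert once the value no longer violates the
# (possibly updated) thresholds.

class MetricState:
    def __init__(self, warning, occurrences=1):
        self.warning = warning
        self.occurrences = occurrences  # consecutive violations required
        self.violations = 0
        self.alert_open = False

    def evaluate(self, value):
        """Evaluate one data point and update the alert state."""
        if value >= self.warning:
            self.violations += 1
        else:
            self.violations = 0
        self.alert_open = self.violations >= self.occurrences
        return self.alert_open

m = MetricState(warning=80)
m.evaluate(90)
print(m.alert_open)  # True: 90 violates the warning threshold of 80

# Path 1: fix the underlying problem -> next data point is below threshold.
m.evaluate(40)
print(m.alert_open)  # False: the alert clears on the next iteration

# Path 2: the condition was never a problem -> raise the threshold instead.
m.warning = 95
m.evaluate(90)
print(m.alert_open)  # False: 90 no longer violates the new threshold
```

Path 3, disabling the metric entirely, simply means `evaluate` is never called again, and the clearing of outstanding alerts happens as part of removing the collection.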
Acknowledging the alert
Sometimes the thresholds are set correctly and the metric is doing just fine, but the administrator working on the problem would like to signal that the issue is 'under control' and people are working on it.
To signal that the alert has been 'received', an acknowledgement can be added to the alert in the console.
This tells the notification system not to send any further notifications for this alert. It is also an indication to anyone working with that target that somebody is paying attention and is doing something about the reported issue.
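The key property of acknowledgement, that it silences notifications without clearing the alert, can be captured in a minimal sketch. This is an assumed model, not the EM notification system; the `Alert` class and its method names are invented for the illustration.

```python
# Illustrative sketch (assumed model, NOT the EM notification system):
# acknowledging an alert suppresses further notifications while leaving
# the alert itself open until the data point actually clears.

class Alert:
    def __init__(self):
        self.open = True
        self.acknowledged = False

    def acknowledge(self):
        """Signal 'we know, and someone is working on it'."""
        self.acknowledged = True

    def should_notify(self):
        # Notify only for open, unacknowledged alerts.
        return self.open and not self.acknowledged

a = Alert()
print(a.should_notify())  # True: the alert keeps firing notifications
a.acknowledge()
print(a.should_notify())  # False: suppressed, yet a.open is still True
```

The point of the design is the last line: acknowledgement changes who gets nagged, not the actual state of the underlying condition.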