Analytics & Threshold Alerts
By Steve Tunstall on May 03, 2012
Alerts are great not only for letting you know when there's some kind of hardware event; they can also be proactive and let you know there's a bottleneck coming BEFORE it happens. Check these out. There are two kinds of Alerts in the ZFSSA. When you go to Configuration-->Alerts, you first see the plus sign by the "Alert Actions" section. These are pretty self-explanatory and not what I'm talking about today. Click on "Threshold Alerts", and then click the plus sign by those.
This is what I'm talking about. The default one that comes up, "CPU: Percent Utilization", is a good one to start with. I don't mind if my CPUs go to 100% utilized for a short time. After all, we bought them to be used, right? If they stay over 90% for more than 10 minutes, however, something is up: maybe we have workloads on this machine it was not designed for, or we don't have enough CPUs in the system and need more. So we can set up an alert that will keep an eye on this for us and send us an email if this were to occur. Now I don't have to keep watching it all the time. For an even better example, keep reading...
What if you want to keep your eyes on whether your Readzillas or Logzillas are being over-utilized? In other words, do you have enough of them? Perhaps you only have 2 Logzillas, and you think you may be better off with 4, but how do you prove it? No problem. Here in Threshold Alerts, click on the Threshold drop-down box, and choose your "Disk: Percent Utilization for Disk: Jxxxxx 013" choice, which is my Logzilla drive in the Jxxxxx tray.
Wait. What's that? You don't have a choice in your drop-down for the Threshold item you are looking for, such as an individual disk?
Well, we will have to fix that.
Leave Alerts for now, and join me over in Analytics. Start a worksheet with the "Disk: Percent utilization broken down by Disk" chart. You do have this, as it's already one of your built-in datasets.
Now, expand it so you can see all of your disks, and find one of your Readzilla or Logzilla drives. (Hint: It will NOT be disk 13 like my example here. Logzillas are always in the 20, 21, 22, or 23 slots of a disk tray. Go to your Configuration-->Hardware screens and you can easily find out which drives are which for your system.)
Now, click on that drive to highlight it, like this:
Click on the Drill Button, and choose to drill down on that drive as a raw statistic. You will now have a whole new data chart, just for that one drive.
Don't go away yet. You now need to save that chart as a new dataset, which will keep it in your ZFSSA analytic metrics forever. Well, until you delete it.
Click on the "Save" button, the second to last button on that chart. It looks like a circle with white dots on it (it's supposed to look like a reel-to-reel tape spindle).
Now go to your "Analytics-->Datasets", and you will see a new dataset in there for it.
Go back to your Threshold Alerts, and you will now be able to make an alert that will tell you if this specific drive goes over 90% for more than 10 minutes. If this happens a lot, you probably need more Readzillas or Logzillas.
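If it helps to see the "over 90% for 10 minutes" rule spelled out, here is a minimal Python sketch of the idea. This is NOT how the ZFSSA implements its alerts, and it doesn't talk to the appliance at all. The function name first_sustained_breach, the (seconds, percent) sample format, and the choice to fire at exactly 10 minutes are all my own assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_sustained_breach(
    samples: Iterable[Tuple[float, float]],
    limit_pct: float = 90.0,
    min_seconds: float = 600.0,
) -> Optional[float]:
    """Return the timestamp at which utilization has stayed above
    limit_pct for at least min_seconds, or None if that never happens.

    samples: (timestamp_in_seconds, percent_utilization) pairs in time
    order. This format is hypothetical, not an appliance export format.
    """
    breach_start = None  # when the current run above the limit began
    for ts, pct in samples:
        if pct > limit_pct:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= min_seconds:
                return ts
        else:
            # Dropping back under the limit resets the clock.
            breach_start = None
    return None


# One sample per minute: a 3-minute spike to 100%, then back to 40%.
short_spike = [(m * 60, 100.0 if 5 <= m < 8 else 40.0) for m in range(30)]

# One sample per minute: 95% from minute 5 onward.
sustained = [(m * 60, 95.0 if m >= 5 else 40.0) for m in range(30)]

print(first_sustained_breach(short_spike))
print(first_sustained_breach(sustained))
```

The short spike never trips the rule, which is the whole point: a busy drive for a couple of minutes is fine, a drive pinned for 10 minutes straight is worth an email.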
I hope you like these Alerts. They may take some time to set up at first, but in the long run you may thank yourself. It might not be a bad idea to send the email alerts to a mail distribution list, instead of a single person who may be on vacation when the alert fires. Enjoy.