Tuesday Sep 24, 2013
Monday Jun 04, 2012
By Steve Tunstall-Oracle on Jun 04, 2012
If you have upgraded to the new 2011.1.3.0 code, you may find some very useful settings for the Analytics. If you didn't already know, the analytic datasets have the potential to fill up your OS hard drives. The more datasets you use and create, the faster this can happen. Since they take a measurement every second, forever, some of these metrics can grow to multiple GB in a matter of weeks. The traditional 'fix' was that you had to go into Analytics -> Datasets about once a month and clean up the largest datasets. You did this by deleting them. Ouch. Now you lost all of that historical data that you might have wanted to check out many months from now. Or, you had to export each metric individually to a CSV file first. Not very easy or fun. You could also suspend a dataset, and have it not collect data at all. Well, that fixed the problem, didn't it? Of course, you now had no data to go look at. Hmmmm....
All of this is no longer a concern. Check out the new Settings tab under Analytics...
Now, I can tell the ZFSSA to keep every second of data for, say, 2 weeks, and then average those 60 seconds of each minute into a single 'minute' value. I can go even further and ask it to average those 60 minutes of data into a single 'hour' value. This allows me to effectively shrink my older datasets by a factor of up to 3600 (one 'hour' value replaces 3600 'second' values)!!! Very cool. I can now allow my datasets to go forever, and really never have to worry about them filling up my OS drives.
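To make the arithmetic concrete, here is a minimal Python sketch of that kind of roll-up. It is only an illustration of averaging 60 values into one, not the appliance's actual implementation; the `roll_up` helper and the sample values are made up for this example.

```python
from statistics import mean

def roll_up(samples, factor=60):
    """Average each consecutive group of `factor` samples into one value."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]

# One hour of made-up per-second values (3600 samples).
per_second = [float(i % 100) for i in range(3600)]

per_minute = roll_up(per_second)   # 60 'minute' values
per_hour = roll_up(per_minute)     # 1 'hour' value

print(len(per_second), len(per_minute), len(per_hour))
```

Each pass keeps one value for every 60 it reads, so two passes take 3600 per-second points down to a single per-hour point.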
That's great going forward, but what about those huge datasets you already have? No problem. Another new feature in 2011.1.3.0 is the ability to shrink the older datasets in the same way. Check this out. I have here a dataset called "Disk: I/O ops per second" that is about 6.32MB on disk. (You need not worry so much about the "In Core" value, as that is in RAM, and it fluctuates all the time. Once you stop viewing a particular metric, you will see that shrink over time, so just relax.)
When one clicks on the trash can icon to the right of the dataset, it used to delete the whole thing, and you would have to re-create it from scratch to get the data collecting again. Now, however, it gives you this prompt:
As you can see, this allows you to once again shrink the dataset by averaging the second data into minutes or hours.
Here is my new dataset size after I do this. So it shrank from 6.32MB down to 2.87MB, but I can still see my metrics going back to the time I began the dataset.
Now, you do understand that once you do this, as you look back in time at the minute or hour data, you are going to see much coarser time intervals, right? You will need to decide what level of granularity you can live with, and for how long. Check this out.
Here is my Disk: Percent utilized from 5-21-2012 2:42 pm to 4:22 pm:
After I went through the delete process to change everything older than 1 week to "Minutes", the same date and time looks like this:
Just understand what this will do and how you want to use it. Right now, I'm thinking of keeping the last 6 weeks of data as "seconds", and then the last 3 months as "Minutes", and then "Hours" forever after that. I'll check back in six months and see how the sizes look.
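If you want a rough feel for what a policy like that saves, here is a small Python sketch that counts stored data points. It assumes 6 weeks means 42 days of 'seconds' data, followed by 90 more days of 'minutes' data, and 'hours' after that. The `points_stored` function and its parameter names are hypothetical, and data points are not bytes on disk, so treat the result as a ratio only.

```python
SECONDS_PER_DAY = 86_400

def points_stored(age_days, keep_seconds_days=42, keep_minutes_days=90):
    """Count data points kept for one statistic in a dataset `age_days` old.

    Assumed policy: newest `keep_seconds_days` at per-second resolution,
    the next `keep_minutes_days` at per-minute, everything older at per-hour.
    """
    sec_days = min(age_days, keep_seconds_days)
    min_days = min(max(age_days - keep_seconds_days, 0), keep_minutes_days)
    hour_days = max(age_days - keep_seconds_days - keep_minutes_days, 0)
    return (sec_days * SECONDS_PER_DAY
            + min_days * SECONDS_PER_DAY // 60
            + hour_days * SECONDS_PER_DAY // 3600)

one_year_raw = 365 * SECONDS_PER_DAY
one_year_kept = points_stored(365)
print(one_year_kept, one_year_raw, one_year_kept / one_year_raw)
```

Under those assumptions, a one-year-old dataset keeps roughly 12% of the points it would hold at full per-second resolution, and nearly all of what remains is the most recent 6 weeks.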
Thursday May 03, 2012
By Steve Tunstall-Oracle on May 03, 2012
Alerts are great for not only letting you know when there's some kind of hardware event, but they can also be proactive and let you know there's a bottleneck coming BEFORE it happens. Check these out. There are two kinds of Alerts in the ZFSSA. When you go to Configuration-->Alerts, you first see the plus sign by the "Alert Actions" section. These are pretty self-explanatory and not what I'm talking about today. Click on the "Threshold Alerts", and then click the plus sign by those.
This is what I'm talking about. The default one that comes up, "CPU: Percent Utilization" is a good one to start with. I don't mind if my CPUs go to 100% utilized for a short time. After all, we bought them to be used, right? If they go over 90% for over 10 minutes, however, something is up, and maybe we have workloads on this machine it was not designed for, or we don't have enough CPUs in the system and need more. So we can set up an alert that will keep an eye on this for us and send us an email if this were to occur. Now I don't have to keep watching it all the time. For an even better example, keep reading...
What if you want to keep your eyes on whether your Readzillas or Logzillas are being over-utilized? In other words, do you have enough of them? Perhaps you only have 2 Logzillas, and you think you may be better off with 4, but how do you prove it? No problem. Here in Threshold Alerts, click on the Threshold drop-down box, and choose your "Disk: Percent Utilization for Disk: Jxxxxx 013" choice, which is my Logzilla drive in the Jxxxxx tray.
Wait. What's that? You don't have a choice in your drop-down for the Threshold item you are looking for, such as an individual disk?
Well, we will have to fix that.
Leave Alerts for now, and join me over in Analytics. Start with a worksheet with "Disk: Percent utilization broken down by Disk" chart. You do have this, as it's already one of your built-in datasets.
Now, expand it so you can see all of your disks, and find one of your Readzilla or Logzilla drives. (Hint: It will NOT be disk 13 like my example here. Logzillas are always in the 20, 21, 22, or 23 slots of a disk tray. Go to your Configuration-->Hardware screens and you can easily find out which drives are which for your system).
Now, click on that drive to highlight it, like this:
Click on the Drill Button, and choose to drill down on that drive as a raw statistic. You will now have a whole new data chart, just for that one drive.
Don't go away yet. You now need to save that chart as a new dataset, which will keep it in your ZFSSA analytic metrics forever. Well, until you delete it.
Click on the "Save" button, the second to last button on that chart. It looks like a circle with white dots on it (it's supposed to look like a reel-to-reel tape spindle).
Now go to your "Analytics-->Datasets", and you will see a new dataset in there for it.
Go back to your Threshold Alerts, and you will now be able to make an alert that will tell you if this specific drive goes over 90% for more than 10 minutes. If this happens a lot, you probably need more Readzillas or Logzillas.
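The idea behind an "over 90% for more than 10 minutes" rule can be sketched in a few lines of Python. This is only an illustration of a sustained-threshold check, assuming one sample per second. It is not how the ZFSSA evaluates its alerts internally, and the `sustained_breach` function is made up for this example.

```python
def sustained_breach(samples, threshold=90.0, duration=600):
    """Return True if per-second `samples` stay above `threshold`
    for more than `duration` consecutive seconds."""
    run = 0  # length of the current streak above the threshold
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run > duration:
            return True
    return False

# A short spike to 100% does not trip the check...
spike = [50.0] * 300 + [100.0] * 30 + [50.0] * 300
# ...but 11 minutes pinned at 95% does.
pinned = [95.0] * 660

print(sustained_breach(spike), sustained_breach(pinned))
```

The point of the duration is the same as in the CPU example: a brief burst is fine, and only a drive that stays busy is worth an email.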
I hope you like these Alerts. They may take some time to set up at first, but in the long run you may thank yourself. It might not be a bad idea to send the email alerts to a mail distribution list, instead of a single person who may be on vacation when the alert is hit. Enjoy.
Wednesday Mar 28, 2012
By Steve Tunstall-Oracle on Mar 28, 2012
If you read this blog, I am assuming you are at least familiar with the Analytic functions in the ZFSSA. They are basically amazing, very powerful and deep.
However, you may not be aware of some great, hidden functions inside the Analytic screen.
Once you open a metric, the toolbar looks like this:
Now, I’m not going over every tool, as we have done that before, and you can hover your mouse over them and they will tell you what they do. But… check this out.
Open a metric (CPU Percent Utilization works fine), and click on the “Hour” button, which is the 2nd clock icon. That’s easy, you are now looking at the last hour of data. Now, hold down your ‘Shift’ key, and click it again. Now you are looking at 2 hours of data. Hold down Shift and click it again, and you are looking at 3 hours of data. Are you catching on yet?
You can do this with not only the ‘Hour’ button, but also with the ‘Minute’, ‘Day’, ‘Week’, and the ‘Month’ buttons. Very cool. It also works with the ‘Show Minimum’ and ‘Show Maximum’ buttons, allowing you to go to the next iteration of either of those.
One last button you can Shift-click is the handy ‘Drill’ button. This button usually drills down on one specific aspect of your metric. If you Shift-click it, it will display a “Rainbow Highlight” of the current metric. This works best if this metric has many ‘Range Average’ items in the left-hand window. Give it a shot.
Also, one will sometimes click on a certain second of data in the graph, like this:
In this case, I clicked 4:57 and 21 seconds, and the 'Range Average' on the left went away, and was replaced by the time stamp. It seems at this point to some people that you are now stuck, and cannot get back to an average for the whole chart. However, you can click on the time stamp of "4:57:21" right above the chart. Even though your mouse pointer does not change into the typical browser finger that it shows over most links, you can click it, and it will change your range back to the full metric.
Another trick you may like is to save a certain view or look of a group of graphs. Most of you know you can save a worksheet, but did you know you could Sync them, Pause them, and then Save it? This will save the paused state, allowing you to view it forever the way you see it now.
Heatmaps. Heatmaps are cool, and look like this:
Some metrics use them and some don't. If you have one, and wish to zoom it vertically, try this. Open a heatmap metric like my example above (I believe every metric that deals with latency will show as a heatmap). Select one or two of the ranges on the left. Click the "Change Outlier Elimination" button. Click it again and check out what it does.
Enjoy. Perhaps my next blog entry will be the best Analytic metrics to keep your eyes on, and how you can use the Alerts feature to watch them for you.
This blog is a way for Steve to send out his tips, ideas, links, and general sarcasm. Almost all related to the Oracle 7000, code named ZFSSA, or Amber Road, or Open Storage, or Unified Storage. You are welcome to contact Steve.Tunstall@Oracle.com with any comments or questions.