Thursday Dec 17, 2009

Tweeting your Sun Storage 7000 Appliance Alerts

Twitter, instant messages, and mobile alerts have always fascinated me. I truly believe that a storage administrator should not have to leave the comfort of their iPhone, Droid, or Palm Pre to do 90% of their day-to-day management tasks. While scanning the headlines, I saw this great article on Tweeting from Command Line using Python.

So, leading into how to manage a Sun Storage 7000 Appliance using Alerts (the next article in my series), I thought I would take some time to adapt this script to tweet the Sun Storage 7000 Appliance traps I receive. I am going to use the AK MIB traps (explained in more detail in the next article) to achieve this.

Writing the Python Trap Handler

First, create the trap handler (this is based on the Python Script presented in the Blog Article: Tweeting from Command Line using Python).

Here is the Python Script:

import sys
import subprocess

def tweet(user, password, message):
    print('Hold on there %s....Your message %s is getting posted....' % (user, message))

    url = ''  # Twitter status-update URL goes here

    # Adjust the curl path for your system; a list of arguments avoids shell quoting problems
    curl = ['/usr/dist/local/sei/tools/SunOS-sparc/curl', '-s',
            '-u', '%s:%s' % (user, password),
            '-d', 'status=%s' % message, url]

    subprocess.call(curl)  # wait for curl to finish before reporting success
    print('Done...awesome')

if __name__ == '__main__':
    # snmptrapd hands the trap to the handler on stdin, one field per line
    host = sys.stdin.readline().strip()
    ip = sys.stdin.readline().strip()
    uptime = sys.stdin.readline().strip()
    uuid = sys.stdin.readline().strip()
    alertclass = sys.stdin.readline().strip()
    alertcount = sys.stdin.readline().strip()
    alerttype = sys.stdin.readline().strip()
    alertseverity = sys.stdin.readline().strip()
    alertsresponse = sys.stdin.readline().strip()

    message = '\t'.join([host, ip, alerttype])
    message = message[0:140]  # Twitter's 140-character limit

    user = "yourtwitter"  # put your username inside these quotes

    password = "yourpassword"  # put your password inside these quotes

    tweet(user, password, message)

You will have to make the following changes at a minimum:

  • Make sure the indentation is correct (Python is whitespace-sensitive)
  • Ensure the path to curl is appropriate for your system
  • Change the user and password variables to your Twitter account

Once that is done you should be set.
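Before wiring the handler into snmptrapd, it can help to check the tweet formatting by itself. The sketch below pulls the message-building lines of the script into a hypothetical `build_message` helper, assuming the same nine-line input the script reads (host, IP address, then seven more fields, with the alert type on the seventh line). It does not contact Twitter.

```python
import io

TWEET_LIMIT = 140  # Twitter's status length limit at the time of writing

def build_message(stream):
    """Read the nine lines the trap handler expects and return the tweet text."""
    fields = [stream.readline().strip() for _ in range(9)]
    host, ip, alerttype = fields[0], fields[1], fields[6]
    # Same fields and ordering as the handler: host, IP address, alert type
    return '\t'.join([host, ip, alerttype])[:TWEET_LIMIT]

if __name__ == '__main__':
    # Fake trap input; every value here is made up for illustration
    sample = io.StringIO('\n'.join([
        'aie-7110j', '192.0.2.10', 'uptime', 'uuid', 'class',
        'count', 'threshold violated', 'major', 'response']) + '\n')
    print(build_message(sample))
```

Feeding a captured trap through this helper first makes it easy to see exactly what would be posted.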

Adding the Trap Handler to SNMP

Next, set up your snmptrapd.conf to handle traps from the AK MIB by invoking the Python Script above. My /etc/sma/snmp/snmptrapd.conf looks something like this:

traphandle . /export/home/oracle/

The OID . identifies the sunAkTraps portion of the AK-MIB delivered with the Sun Storage 7000 Appliance.
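snmptrapd hands each matching trap to the traphandle program on standard input. As described in the net-snmp snmptrapd.conf documentation, the first line is the sending host's name, the second is its address, and each remaining line is one variable binding (an OID followed by its value); the exact formatting of the address line varies between net-snmp versions. The script above reads these lines by position. A slightly more defensive sketch, assuming that layout, keeps each value paired with its OID instead; `parse_trap` is a hypothetical name.

```python
import sys

def parse_trap(stream):
    """Split snmptrapd traphandle input into (hostname, address, varbinds).

    Assumes the net-snmp traphandle layout: line 1 is the sending host,
    line 2 its address, and every further line is one "OID value" pair.
    """
    lines = [line.rstrip('\n') for line in stream]
    hostname = lines[0] if len(lines) > 0 else ''
    address = lines[1] if len(lines) > 1 else ''
    varbinds = []
    for line in lines[2:]:
        if not line.strip():
            continue  # skip blank lines
        # The OID is the first whitespace-separated token; the rest is the value
        parts = line.split(None, 1)
        value = parts[1] if len(parts) > 1 else ''
        varbinds.append((parts[0], value))
    return hostname, address, varbinds

if __name__ == '__main__':
    host, addr, binds = parse_trap(sys.stdin)
    for oid, value in binds:
        print('%s %s => %s' % (host, oid, value))
```

Looking values up by OID means the handler keeps working if a trap carries its variable bindings in a different order or count than expected.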

Now, invoke snmptrapd using the above configuration file. There are many ways to do this, but this is the way I know will pick up my config file :-)

/usr/sfw/sbin/snmptrapd -c /etc/sma/snmp/snmptrapd.conf -P

Sending Alerts from the Sun Storage 7000

Using the article I posted yesterday, ensure SNMP is enabled with a trapsink identifying the system where your trap receiver is running. Now we have to enable an alert to be sent via SNMP from your Sun Storage 7000 Appliance (this is different from the default Fault Management Traps I discussed yesterday).

For now, trust me on this (I will explain more in my next article): let's set up a simple ARC size threshold that will be violated. Open the Browser User Interface for your system (or the simulator) and go to the Configuration -> Alerts screen. Click through to "Thresholds" and add one that you know will be violated.

Each alert that is sent gets posted to the Twitter account identified within the Python script! And as my friends quickly noted, from there it can go to your Facebook account where you can quickly declare "King of the Lab"!

Thanks Sandip for your inspiring post this morning :-)

Wednesday Dec 16, 2009

The SNMP Service on a Sun Storage 7000 Appliance

Without a doubt, SNMP rules the playground in terms of monitoring hardware assets, and many software assets, in a data center monitoring ecosystem. It is the single biggest integration technology I'm asked about and that I've encountered when discussing monitoring with customers.

Why does SNMP have such amazing staying power?

  • It's extensible (vendors can provide MIBs and extend existing MIBs)
  • It's simple (hierarchical data rules and really it boils down to GET, SET, TRAP)
  • It's ubiquitous (monitoring tools accept SNMP, systems deliver SNMP)
  • It operates on two models, real time (traps) and polling (get)
  • It has aged gracefully (the security extensions in v3 did not slow its adoption)

To keep the SNMP support in the Sun Storage 7000 Appliances relatively succinct, I am going to tackle this in two separate posts. This first post shows how to enable SNMP and what you get "out of the box" once it's enabled. The next post discusses how to deliver more information via SNMP (alerts with more information and threshold violations).

To get more information on SNMP on the Sun Storage 7000 and to download the MIBs that will be discussed here, go to the Help Wiki on a Sun Storage 7000 Appliance (or the simulator):

  • SNMP - https://[hostname]:215/wiki/index.php/Configuration:Services:SNMP

Also, as I work at Sun Microsystems, Inc., all of my examples of walking MIBs on a Sun Storage 7000 Appliance or receiving traps will be from a Solaris-based system. There are plenty of free / open source / trial packages for other Operating System platforms so you will have to adapt this content appropriately for your platform.

One more note as I progress in this series: all of my examples are from the CLI or from scripts, so you won't find many pretty pictures :-)

Enabling SNMP on the Sun Storage 7000 Appliance gives you the ability to:

  • Receive traps (delivered via Sun's Fault Manager (FM) MIB)
  • GET system information (MIB-II System, MIB-II Interfaces, Sun Enterprise MIB)
  • GET information customized to the appliance (using the Sun Storage AK MIB)

Enabling alerts (covered in the next article) extends the SNMP support by delivering targeted alerts via the AK MIB itself.

Enable SNMP

The first thing we'll want to do is log into a target Sun Storage 7000 Appliance via SSH and check if SNMP is enabled.

aie-7110j:> configuration services snmp
aie-7110j:configuration services snmp> ls
<status> = disabled
community = public
network =
syscontact =
trapsinks =

aie-7110j:configuration services snmp>

Here you can see it is currently disabled and that we have to set up all of the SNMP parameters. The most common community string to this day is "public", and since we will not be changing system information via SNMP, we will keep it. For the "network" parameter we use a value that allows access to the MIB from any network. Finally, I will add a single trapsink so that any traps get sent to my management host. The last step shown is to enable the service once the parameters are committed.

aie-7110j:configuration services snmp> set network=
network = (uncommitted)
aie-7110j:configuration services snmp> set syscontact="Paul Monday"
syscontact = Paul Monday (uncommitted)
aie-7110j:configuration services snmp> set trapsinks=
trapsinks = (uncommitted)
aie-7110j:configuration services snmp> commit
aie-7110j:configuration services snmp> enable
aie-7110j:configuration services snmp> show
<status> = online
community = public
network =
syscontact = Paul Monday
trapsinks =

From the appliance perspective we are now up and running!

Get the MIBs and Install Them

As previously mentioned, all of the MIBs that are unique to the Sun Storage 7000 Appliance are also distributed with the appliance. Go to the Help Wiki and download them, then move them to the appropriate location for monitoring.

On the Solaris system I'm using, that location is /etc/sma/snmp/mibs. Be sure to browse the MIB for appropriate tables or continue to look at the Help Wiki as it identifies relevant OIDs that we'll be using below.

Walking and GETting Information via the MIBs

Using standard SNMP operations, you can retrieve quite a bit of information. As an example from the management station, we will retrieve a list of shares available from the system using snmpwalk:

-bash-3.00# ./snmpwalk -c public -v 2c isv-7110h sunAkShareName
SUN-AK-MIB::sunAkShareName.1 = STRING: pool-0/MMC/deleteme
SUN-AK-MIB::sunAkShareName.2 = STRING: pool-0/MMC/data
SUN-AK-MIB::sunAkShareName.3 = STRING: pool-0/TestVarious/filesystem1
SUN-AK-MIB::sunAkShareName.4 = STRING: pool-0/oracle_embench/oralog
SUN-AK-MIB::sunAkShareName.5 = STRING: pool-0/oracle_embench/oraarchive
SUN-AK-MIB::sunAkShareName.6 = STRING: pool-0/oracle_embench/oradata
SUN-AK-MIB::sunAkShareName.7 = STRING: pool-0/AnotherProject/NoCacheFileSystem
SUN-AK-MIB::sunAkShareName.8 = STRING: pool-0/AnotherProject/simpleFilesystem
SUN-AK-MIB::sunAkShareName.9 = STRING: pool-0/default/test
SUN-AK-MIB::sunAkShareName.10 = STRING: pool-0/default/test2
SUN-AK-MIB::sunAkShareName.11 = STRING: pool-0/EC/tradetest
SUN-AK-MIB::sunAkShareName.12 = STRING: pool-0/OracleWork/simpleExport
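Each row in the walk is keyed by its trailing index, and that same index ties the columns of the share table together (sunAkShareName.1 and sunAkShareMountpoint.1 describe the same share, as the next step shows). A minimal sketch of turning captured snmpwalk text into an index-to-value map; the regular expression assumes the default net-snmp output format shown above, and the hypothetical `split_share` helper assumes every name has the pool/project/share shape seen in this listing.

```python
import re

# Matches lines such as: SUN-AK-MIB::sunAkShareName.3 = STRING: pool-0/TestVarious/filesystem1
WALK_LINE = re.compile(
    r'^(?P<mib>[\w-]+)::(?P<name>\w+)\.(?P<index>\d+) = (?P<type>[\w-]+): ?(?P<value>.*)$')

def parse_walk(text):
    """Turn snmpwalk output into {row index: value} for integer-indexed columns."""
    rows = {}
    for line in text.splitlines():
        match = WALK_LINE.match(line.strip())
        if match:
            rows[int(match.group('index'))] = match.group('value')
    return rows

def split_share(path):
    """Split 'pool/project/share' into its three components."""
    pool, project, share = path.split('/', 2)
    return pool, project, share
```

Walking sunAkShareMountpoint the same way and joining the two maps on the index gives a name-to-mount-point table without one snmpget per share.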

Next, I can use snmpget to obtain a mount point for the first share:

-bash-3.00# ./snmpget -c public -v 2c isv-7110h sunAkShareMountpoint.1
SUN-AK-MIB::sunAkShareMountpoint.1 = STRING: /export/deleteme

It is also possible to get a list of problems on the system, identified by their UUIDs:

-bash-3.00# ./snmpwalk -c public -v 2c isv-7110h sunFmProblemUUID
SUN-FM-MIB::sunFmProblemUUID."91e97860-f1d1-40ef-8668-dc8fb85679bb" = STRING: "91e97860-f1d1-40ef-8668-dc8fb85679bb"

And then turn around and retrieve the associated knowledge article identifier:

-bash-3.00# ./snmpget -c public -v 2c isv-7110h sunFmProblemCode.\"91e97860-f1d1-40ef-8668-dc8fb85679bb\"
SUN-FM-MIB::sunFmProblemCode."91e97860-f1d1-40ef-8668-dc8fb85679bb" = STRING: AK-8000-86
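The quoted UUID after sunFmProblemCode is a string table index. Under the SMIv2 rules (RFC 2578, section 7.7), a variable-length string index is encoded in the OID as its length followed by one sub-identifier per character, unless the MIB declares the index IMPLIED, in which case the length is omitted. A small sketch of that encoding, handy when comparing against numeric output from snmpget -On; the function name is hypothetical, and whether SUN-FM-MIB declares its index IMPLIED is not stated here, so check the INDEX clause in the MIB file.

```python
def string_index_to_oid(index, implied=False):
    """Encode a string table index as OID sub-identifiers (RFC 2578, section 7.7).

    A variable-length string index is written as its length followed by one
    sub-identifier per octet; an IMPLIED index omits the length.
    """
    octets = list(index.encode('ascii'))  # bytes -> list of ints
    subids = octets if implied else [len(octets)] + octets
    return '.'.join(str(n) for n in subids)

if __name__ == '__main__':
    print(string_index_to_oid('91e97860-f1d1-40ef-8668-dc8fb85679bb'))
```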

The FM-MIB does not contain severity information, but using the problem UUID I can SSH into the system and retrieve it:

isv-7110h:> maintenance logs select fltlog select uuid="91e97860-f1d1-40ef-8668-dc8fb85679bb"
isv-7110h:maintenance logs fltlog entry-005> ls
timestamp = 2009-12-15 05:55:37
uuid = 91e97860-f1d1-40ef-8668-dc8fb85679bb
desc = The service processor needs to be reset to ensure proper functioning.
type = Major Defect

isv-7110h:maintenance logs fltlog entry-005>

Take time to inspect the MIBs through your MIB browser to understand all of the information available. I tend to shy away from using SNMP to get system information; instead I write scripts and workflows, since much more information is available directly on the system. I'll cover this in a later article.

Receive the Traps

Trap receiving on Solaris is a piece of cake, at least for demonstration purposes. What you choose to do with the traps is a whole different process. Each tool has its own trap monitoring facilities that will hand you the fields in different ways. For this example, snmptrapd on Solaris simply dumps the traps to the console.

Locate the "snmptrapd" binary on your Solaris system and start monitoring:

-bash-3.00# cd /usr/sfw/sbin
-bash-3.00# ./snmptrapd -P
2009-12-16 09:27:47 NET-SNMP version 5.0.9 Started.

From there you can wait for something bad to go wrong with your system or you can provoke it yourself. Fault Management can be a bit difficult to provoke intentionally, since things one thinks would provoke a fault are actually administrator activities. Pulling a disk drive is very different from a SMART drive error on a disk drive. Similarly, pulling a Power Supply is different from tripping over a power cord and yanking it out. The former is not a fault since it is a complex operation requiring an administrator to unseat the power supply (or disk), whereas the latter occurs out in the wild all the time.

Here are some examples of FM traps I've received through this technique using various "malicious" techniques on a lab system ;-)

Here is an FM Trap when I "accidentally" tripped over a power cord in the lab. Be careful when you do this so you don't pull the system off the shelf if it is not racked properly (note that I formatted this a little bit from the raw output):

2009-11-16 12:25:34 isv-7110h []:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1285895753) 148 days, 19:55:57.53
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."2c7ff987-6248-6f40-8dbc-f77f22ce3752" = STRING: "2c7ff987-6248-6f40-8dbc-f77f22ce3752"
SUN-FM-MIB::sunFmProblemCode."2c7ff987-6248-6f40-8dbc-f77f22ce3752" = STRING: SENSOR-8000-3T
SUN-FM-MIB::sunFmProblemURL."2c7ff987-6248-6f40-8dbc-f77f22ce3752" = STRING:

Notice again that I have a sunFmProblemUUID that I can use to shell into the system and obtain more details (as shown in the last section). Again, the next article will explain Alerts. Using the AK MIB and Alerts, we can get many more details pushed out to us via an SNMP trap, and we have finer control over which alerts get pushed.

Here, I purchased a very expensive fan stopper-upper device from a fellow tester. It was quite pricey; it turns out it is also known as a "Twist Tie". Do NOT do this at home. Seriously, the decreased air flow through the system can cause hiccups in your system.

DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1285889746) 148 days, 19:54:57.46
SNMPv2-MIB::snmpTrapOID.0 = OID: SUN-FM-MIB::sunFmProblemTrap
SUN-FM-MIB::sunFmProblemUUID."cf480476-51b7-c53a-bd07-c4df59030284" = STRING: "cf480476-51b7-c53a-bd07-c4df59030284"
SUN-FM-MIB::sunFmProblemCode."cf480476-51b7-c53a-bd07-c4df59030284" = STRING: SENSOR-8000-26
SUN-FM-MIB::sunFmProblemURL."cf480476-51b7-c53a-bd07-c4df59030284" = STRING:
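Each FM trap carries the problem UUID and its knowledge-article code, and the UUID is what the appliance CLI lookup shown earlier needs. A minimal sketch that pulls those pairs out of dumped trap text and builds the matching fltlog command; the regular expression assumes the snmptrapd output format shown above, and both helper names are hypothetical.

```python
import re

# Matches e.g. SUN-FM-MIB::sunFmProblemCode."<uuid>" = STRING: SENSOR-8000-26
CODE_LINE = re.compile(
    r'sunFmProblemCode\."(?P<uuid>[0-9a-f-]+)" = STRING: (?P<code>\S+)')

def problem_codes(trap_text):
    """Return {problem UUID: problem code} for every sunFmProblemCode varbind."""
    return {m.group('uuid'): m.group('code') for m in CODE_LINE.finditer(trap_text)}

def fltlog_command(uuid):
    """Build the appliance CLI command used earlier to look up a fault by UUID."""
    return 'maintenance logs select fltlog select uuid="%s"' % uuid

if __name__ == '__main__':
    dump = ('SUN-FM-MIB::sunFmProblemCode."cf480476-51b7-c53a-bd07-c4df59030284"'
            ' = STRING: SENSOR-8000-26\n')
    for uuid, code in problem_codes(dump).items():
        print(code, fltlog_command(uuid))
```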

You will receive many, many other traps throughout the day, including Enterprise MIB traps letting you know when the system starts up, among other activities.

Wrap it Up

In this article, I illustrated enabling the SNMP Service on the Sun Storage 7000 Appliance via an SSH session. I also showed some basic MIB walking and traps that you'll receive once SNMP is enabled.

This is really simply the "start" of the information we can push through the SNMP pipe from a system. In the next article I'll show how to use Alerts on the system with the SNMP pipe so you can have more control over the events on a system that you wish to be notified about.

Thursday Dec 10, 2009

Monitoring the Sun Storage 7000 Appliance

Over the past several months I've been working on integrating our Sun Storage 7000 Appliances into monitoring products from other companies. The monitoring work I'm doing is a combination of software writing (via a plug-in for a data center monitoring product that will see its release in conjunction with our next Sun Storage 7000 Appliance Software Release) and "consulting" with our customers directly about monitoring the appliances they install after purchase.

The Sun Storage 7000 Appliance comes with a variety of mechanisms for monitoring:
- SNMP (via several different MIBs using traps or GETs)
- Email Alerts
- Remote Syslog

A variety of software and hardware faults, delivered internally as Fault Management Architecture (FMA) events, get pushed to the monitoring environment via the above mechanisms.

As valuable as these capabilities are, customers always have more advanced monitoring needs that require customization of the environment. Some customers want more significant digits than we surface in the CLI, or want to gather data from the industry-leading analytics capabilities delivered with the appliance. Some may want to integrate with an ITIL-style Configuration Management Database; others may want to create a billing system based on user capacity that accounts for levels of service (guaranteed space, thin-provisioned space, etc.).

All of these customizations can easily be achieved using simple SSH navigation of the appliance's environment or more advanced manipulation of the environment using the embedded JavaScript environment on each Sun Storage 7000 Appliance via scripts or Workflows.

Over the next few weeks, I'm going through my email archives (not a pretty sight, to be honest) to mine the greatest hits from the information I've sent to specific audiences on monitoring the boxes and customizing the environment for specific monitoring application use cases. Other articles will focus on how I built the monitoring environment for the upcoming plug-in that will hit the download center with the next software release.

With all of that lead-in, I am going to kick off my monitoring guidance with what I tell everyone right out of the chute, "Use the Built-in Sun Storage 7000 Appliance Help Wiki to get up to speed on these topics and get the latest information". After all, this blog post will age with each release of the Sun Storage 7000 Appliance whereas the Help Wiki is updated with each release.

On a running Sun Storage 7000 Appliance, use the following URLs (substituting the address of the appliance where I put [hostname]):

  • SNMP - https://[hostname]:215/wiki/index.php/Configuration:Services:SNMP
  • Alerts - https://[hostname]:215/wiki/index.php/Configuration:Alerts
  • Scripting - https://[hostname]:215/wiki/index.php/User_Interface:CLI:Scripting
  • Workflows - https://[hostname]:215/wiki/index.php/Maintenance:Workflows

You can download the latest Sun Storage 7000 Appliance Storage Simulator and follow these instructions as well.

In case the pages have moved, be sure to use the Search feature in the Help Text that comes with the Wiki.

There are always cases where customers want more hardcore examples of each of the above tailored to their environments, or a slightly different take on learning these topics. And that, my friends, is what the next few weeks will be about. I'll give more examples and approaches, similar to what I did with my Fun with the FMA and SNMP article.

Friday May 25, 2007

Storage Remote that...

One of my many projects is to tackle the product-side architecture for Remote Monitoring of our storage systems. Remote Monitoring is a fascinating problem to solve for many, many reasons:

  • There are different ways to break the problem up, each being pursued with almost religious fanaticism, but each having its place depending on the customer's needs
  • It is a cross-organizational solution (at least within Sun)
  • It has a classic separation of responsibilities in its architecture
  • It solves real problems for customers and for our own company
  • It is conceptually simple, yet extremely difficult to get right

The problem at hand was to create a built-in remote monitoring solution for our midrange storage systems. Our NAS Product Family and anything being managed by our Common Array Manager was a good start. Our CAM software alone covers our Sun StorageTek 6130, 6140, 6540, 2530, and 2540 Arrays. Our high-end storage already has a level of remote monitoring, and we already have a solution to do remote monitoring of "groups" of systems via a service appliance, so our solution was targeted directly at monitoring individual systems with a built-in solution.

This remote monitoring solution is focused on providing you with a valuable service: "Auto Service Request" (ASR). The Remote Monitoring Web Site has a great definition of ASR: "Uses fault telemetry to automatically initiate a service request and begin the problem resolution process as soon as a problem occurs." This focus lets us trim the information being sent to Sun down to faults, and it also tightens up the service pipeline to get you what you need in a timely manner.

For example, if a serious fault occurs in your system (one that would typically involve Sun Services), we will have a case generated for you within a few minutes...typically less than 15.

The information flow with the "built in" Remote Monitoring is only towards Sun Microsystems (we heard you with security!). If you, the customer, want to work with us remotely to resolve the problem, a second solution known as Shared Shell is in place. With this solution, we work cooperatively with you so that you can collaborate with us to resolve problems.

Remember though, I'm an engineer, so let's get back to the problem...building Remote Monitoring.

The solution is a classic separation of concerns. Here are the major architectural components:

  • HTTPS protocol for connectivity
  • Security (user-based and non-repudiation) via Authentication and Public / Private Key Pairs
  • Information Producer (the product installed at the customer site)
  • Information Consumer (the service information processor that turns events into cases)
  • Routing Infrastructure

The REST-XML API gives us a common information model that abstracts away implementation details yet gives all of the organizations involved in information production and consumption a common language. The relatively tight XML Schema also gives an easily testable output for the product without having to actually deliver telemetry in the early stages of implementation. Further, the back end can easily mock up messages to test its implementation without a product being involved. Early in the implementation we cranked out a set of messages that were common to some of the arrays and sent them to the programmers on the back end; the teams then worked independently on their implementations. When we brought the teams back together, things went off without much of a hiccup, though we did find places where the XML Schema was too tight or too loose for one of the parties, so you do still have to talk. The format also helps us bring teams on board quickly...give them an XSD and tell them to come back later.

Here is an example of a message (real data removed...). Keep in mind there are multiple layers of security to protect this information from prying eyes. We've kept the data to a minimum, just the data we need to help us determine if a case needs to be created and what parts we probably need to ship out:

<?xml version="1.0" encoding="UTF-8"?>
<message xmlns:xsi="" xsi:noNamespaceSchemaLocation="message.xsd">
<message-time timezone="America/Denver">2005-11-22T12:10:11</message-time>
<product-name>Sun StorageTek 6130</product-name>
<event-time timezone="America/Denver">2005-11-22T12:10:11</event-time>
<summary>Critical: Controller 0 write-cache is disabled</summary>
<description>Ctlr 0 Battery Pack Low Power</description>
</message>
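The point made earlier, that producer and consumer only need to agree on the message format, can be illustrated with Python's standard-library ElementTree. This is a sketch, not the product or back-end code: the element names come from the example above, the helper names are hypothetical, the schema-location attribute is left out, and ElementTree does not validate against an XSD.

```python
import xml.etree.ElementTree as ET

def build_telemetry(product, summary, description, message_time, event_time,
                    tz='America/Denver'):
    """Producer side: assemble a message using the element names from the example."""
    root = ET.Element('message')
    ET.SubElement(root, 'message-time', timezone=tz).text = message_time
    ET.SubElement(root, 'product-name').text = product
    ET.SubElement(root, 'event-time', timezone=tz).text = event_time
    ET.SubElement(root, 'summary').text = summary
    ET.SubElement(root, 'description').text = description
    return ET.tostring(root, encoding='unicode')

def read_telemetry(xml_text):
    """Consumer side: pull out the fields a back end might use to decide on a case."""
    root = ET.fromstring(xml_text)
    return {
        'product': root.findtext('product-name'),
        'summary': root.findtext('summary'),
        'event_time': root.findtext('event-time'),
    }

if __name__ == '__main__':
    msg = build_telemetry('Sun StorageTek 6130',
                          'Critical: Controller 0 write-cache is disabled',
                          'Ctlr 0 Battery Pack Low Power',
                          '2005-11-22T12:10:11', '2005-11-22T12:10:11')
    print(msg)
    print(read_telemetry(msg)['summary'])
```

Because each side touches only the shared element names, either half can be tested against hand-written messages before the other exists.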

Use of XML gives us the ability to be very tight with the use of tags and to enforce particular values, like severity, across the product lines.

The format above is heavily influenced by our Fault Management Architecture, though an FMA implementation is not required.

What we've found is that good diagnostics on a device (and FMA helps with this) yields a quick assembly of the information we need and fewer events that are not directly translated into cases. FMA and "self healing" provide an exceptional foundation for remote monitoring with a heavy reduction in "noise".

The rest of the architecture (the services that produce, consume, secure, and transport the information) is handed off to the implementors! The product figures out how to do diagnostics and output the XML via HTTPS to services at Sun Microsystems. Another team deploys services in the data center for security and registration (there are additional XML formats, authentication capabilities and POST headers for this part of the workflow). Another team deploys a service to receive the telemetry, check the signature on the telemetry for non-repudiation purposes, process it, filter it, and create a case.

There are additional steps that each product needs to go through, such as communicating across organizations the actual message-ids that a device can send and what should happen if that message-id is received.

In the end, the centerpiece of the architecture is the information and the language that all teams communicate with. Isn't this the case with any good architecture? Choose the interfaces and the implementations will follow.

Keep in mind, this remote monitoring solution is secure end to end. Further, remote monitoring is only one piece of the broader services portfolio...I'm just particularly excited about this since I was privileged to have worked with a great, cross-organizational team to get it done! The team included Mike Monahan (who KICKS BUTT), Wayne Seltzer, Bill Masters, Todd Sherman, Mark Vetter, Jim Kremer, Pat Ryan and many others (I hope I didn't forget any). There are also lots of folks who were pivotal in getting this done that we lost along the way (Kathy MacDougall and Mike Harding, I hope you are both doing well!).

This post has been a long time in coming! Enjoy!



