Storage Remote Monitoring...got that...

One of my many projects is to tackle the product-side architecture for Remote Monitoring of our storage systems. Remote Monitoring is a fascinating problem to solve for many, many reasons:


  • There are different ways to break the problem up, each being pursued with almost religious fanaticism, but each having its place depending on the customer's needs
  • It is a cross-organizational solution (at least within Sun)
  • It has a classic separation of responsibilities in its architecture
  • It solves real problems for customers and for our own company
  • It is conceptually simple, yet extremely difficult to get right

The problem at hand was to create a built-in remote monitoring solution for our midrange storage systems. Our NAS Product Family and anything managed by our Common Array Manager were a good starting point; the CAM software alone covers our Sun StorageTek 6130, 6140, 6540, 2530, and 2540 Arrays. Our high-end storage already has a level of remote monitoring, and we already have a solution for remote monitoring of "groups" of systems via a service appliance, so this effort targeted built-in monitoring of individual systems.

This remote monitoring solution is focused on providing you with a valuable service: "Auto Service Request", or ASR. The Remote Monitoring Web Site has a great definition of ASR: it "uses fault telemetry to automatically initiate a service request and begin the problem resolution process as soon as a problem occurs." This focus lets us trim the information sent to Sun down to faults, and it gives you a particular value...it tightens up the service pipeline so you get what you need in a timely manner.

For example, if a serious fault occurs in your system (one that would typically involve Sun Services), we will have a case generated for you within a few minutes...typically less than 15.

The information flow with the "built in" Remote Monitoring is only towards Sun Microsystems (we heard you on security!). If you, the customer, want to work with us remotely to resolve a problem, a second solution known as Shared Shell is in place; with it, you collaborate with us directly on problem resolution.

Remember though, I'm an engineer, so let's get back to the problem...building Remote Monitoring.

The solution is a classic separation of concerns. Here are the major architectural components:


  • REST-XML API
  • HTTPS protocol for connectivity
  • Security (user-based and non-repudiation) via Authentication and Public / Private Key Pairs (a minimal signing sketch follows this list)
  • Information Producer (the product installed at the customer site)
  • Information Consumer (the service information processor that turns events into cases)
  • Routing Infrastructure
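
To make the key-pair bullet concrete, here is a minimal sketch of signing a telemetry payload so the receiving side can verify who sent it. This is purely illustrative: a real product would load the key pair provisioned for it rather than generate one on the fly, and the payload here is just a stand-in string.

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class TelemetrySigner {
    public static void main(String[] args) throws Exception {
        // Illustrative only: generate a throwaway RSA key pair. A real product
        // would load the key pair provisioned during registration.
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        KeyPair keyPair = generator.generateKeyPair();

        byte[] payload = "<message>...</message>".getBytes("UTF-8");

        // The producer signs the payload with its private key.
        Signature signer = Signature.getInstance("SHA1withRSA");
        signer.initSign(keyPair.getPrivate());
        signer.update(payload);
        byte[] signature = signer.sign();

        // The consumer verifies the signature with the matching public key,
        // which is what keeps a sender from later repudiating the message.
        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(keyPair.getPublic());
        verifier.update(payload);
        System.out.println("Signature valid: " + verifier.verify(signature));
    }
}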

The REST-XML API gives us a common information model that abstracts away implementation details yet gives all of the organizations involved in information production and consumption a common language. The relatively tight XML Schema also gives the product an easily testable output without having to actually deliver telemetry in the early stages of implementation. Further, the backend can easily mock up messages to test their implementation without a product being involved. Early in the implementation we cranked out a set of messages that were common to some of the arrays and sent them to the programmers on the back end; the teams then worked independently on their implementations. When we brought the teams back together, things went off without much of a hiccup, though we did find places where the XML Schema was too tight or too loose for one of the parties, so you do still have to talk. The format also helps us bring teams on board quickly...give them an XSD and tell them to come back later.
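
As a rough illustration of that early, product-free testing, here is a minimal sketch of validating a candidate telemetry message against the shared schema using the standard Java XML validation API. The file names message.xsd and sample-message.xml are just placeholders for this example.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class MessageValidator {
    public static void main(String[] args) throws Exception {
        // Load the agreed-upon schema (file names are illustrative).
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("message.xsd"));

        // Validate a candidate telemetry message. validate() throws on the
        // first violation, so a clean return means the message conforms.
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("sample-message.xml")));
        System.out.println("sample-message.xml conforms to message.xsd");
    }
}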

Here is an example of a message (real data removed...). Keep in mind there are multiple layers of security to protect this information from prying eyes. We've kept the data to a minimum, just the data we need to help us determine if a case needs to be created and what parts we probably need to ship out:


<?xml version="1.0" encoding="UTF-8"?>
<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="message.xsd">
  <site-id>paul</site-id>
  <message-uuid>uuid:xxxxx</message-uuid>
  <message-time timezone="America/Denver">2005-11-22T12:10:11</message-time>
  <system-id>SERIAL</system-id>
  <asset-id>UNIQUENUMBER</asset-id>
  <product-id>uniqueproductnumber</product-id>
  <product-name>Sun StorageTek 6130</product-name>
  <event>
    <primary-event-information>
      <message-id>STK-8000-5H</message-id>
      <event-uuid>uuid:00001111</event-uuid>
      <event-time timezone="America/Denver">2005-11-22T12:10:11</event-time>
      <severity>Critical</severity>
      <component>
        <hardware-component>
          <name>RAID</name>
        </hardware-component>
      </component>
      <summary>Critical: Controller 0 write-cache is disabled</summary>
      <description>Ctlr 0 Battery Pack Low Power</description>
    </primary-event-information>
  </event>
</message>

Use of XML gives us the ability to be very tight with the tags we use and to enforce particular values, like severity, across the product lines.

The format above is heavily influenced by our Fault Management Architecture, though an FMA implementation is not required.

What we've found is that good diagnostics on a device (and FMA helps with this) yield a quick assembly of the information we need and fewer events that do not translate directly into cases. FMA and "self healing" provide an exceptional foundation for remote monitoring with a heavy reduction in "noise".

The rest of the architecture (the services that produce, consume, secure, and transport the information) is handed off to the implementors! The product figures out how to do diagnostics and output the XML via HTTPS to services at Sun Microsystems. Another team deploys services in the data center for security and registration (there are additional XML formats, authentication capabilities, and POST headers for this part of the workflow). Another team deploys a service to receive the telemetry, check the signature on the telemetry for non-repudiation purposes, process it, filter it, and create a case.
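
On the producer side, the transport itself is deliberately plain HTTPS. Here is a minimal sketch of POSTing a message; the endpoint URL is a placeholder, and the real workflow also involves the registration, authentication, and additional POST headers mentioned above.

import java.io.OutputStream;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public class TelemetrySender {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; the real service URL comes from registration.
        URL endpoint = new URL("https://telemetry.example.com/messages");
        HttpsURLConnection connection =
            (HttpsURLConnection) endpoint.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "text/xml");

        // The body is a schema-conformant message like the example above.
        byte[] body =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><message>...</message>"
                .getBytes("UTF-8");
        OutputStream out = connection.getOutputStream();
        out.write(body);
        out.close();

        // A 2xx response means the routing infrastructure accepted the message.
        System.out.println("HTTP response: " + connection.getResponseCode());
    }
}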

There are additional steps that each product needs to go through, such as communicating across organizations the actual message-ids that a device can send and what should happen if that message-id is received.
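
As a purely hypothetical sketch of what that cross-organizational agreement boils down to on the consuming side, think of it as a lookup from message-id to service action. Only the STK-8000-5H message-id comes from the example above; the action text is made up.

import java.util.HashMap;
import java.util.Map;

public class MessageIdPolicy {
    public static void main(String[] args) {
        // Hypothetical policy table agreed upon across organizations:
        // which message-ids open a case and what gets dispatched.
        Map<String, String> policy = new HashMap<String, String>();
        policy.put("STK-8000-5H", "Open case, dispatch replacement battery pack");

        String incoming = "STK-8000-5H";
        String action = policy.get(incoming);
        System.out.println(action != null ? action : "Log only, no case created");
    }
}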

In the end, the centerpiece of the architecture is the information and the language that all teams communicate with. Isn't this the case with any good architecture? Choose the interfaces and the implementations will follow.

Keep in mind, this remote monitoring solution is secure end to end. Further, remote monitoring is only one piece of the broader services portfolio...I'm just particularly excited about this since I was privileged to have worked with a great, cross-organizational team to get it done! The team included Mike Monahan (who KICKS BUTT), Wayne Seltzer, Bill Masters, Todd Sherman, Mark Vetter, Jim Kremer, Pat Ryan, and many others (I hope I didn't forget anyone). There are also lots of folks who were pivotal in getting this done whom we lost along the way (Kathy MacDougall and Mike Harding, I hope you are both doing well!).

This post has been a long time in coming! Enjoy!
