Sun Cluster Service Level Management


The soon-to-be-released Sun Cluster 3.2 has a great feature called "Sun Cluster Service Level Management". It gives you telemetry on how system resources are being used under Sun Cluster. After a very easy setup with the clsetup/scsetup command, you can view CPU, memory, swap and network utilization for cluster nodes, resource groups and individual system components such as disks and adapters.

Another interesting feature is CPU control for Sun Cluster resource groups. This functionality is built on the CPU control facility available in the Solaris operating system. For example, in Sun Cluster 3.2 running on Solaris 10, you can:

  • Assign CPU shares to resource groups running in global or non-global zones.
  • Set the maximum or minimum number of processors in a dedicated processor set for resource groups.
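
To get a feel for what CPU shares mean, here is a minimal Python 3 sketch of the arithmetic. It assumes the shares are interpreted the way the Solaris fair share scheduler treats them: under full contention, each group is entitled to its shares divided by the total shares of all active groups. The resource group names and share values are hypothetical.

```python
def cpu_entitlement(shares):
    """Return each group's fraction of CPU under full contention.

    shares maps a (hypothetical) resource group name to its CPU shares.
    Assumption: entitlement = own shares / sum of all active shares.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one group must hold a positive number of shares")
    return {name: value / total for name, value in shares.items()}


# Hypothetical resource groups and share values, for illustration only.
example = {"nfs-rg": 30, "db-rg": 60, "web-rg": 10}
for name, fraction in cpu_entitlement(example).items():
    print(f"{name}: {fraction:.0%}")
```

Note that shares only matter when CPUs are contended; an otherwise idle node lets a single group use more than its nominal fraction.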

By monitoring system resource usage through Sun Cluster, you can collect data that reflects how a service using specific system resources is performing, and discover resource bottlenecks, overloads or even underutilized hardware. Based on this data, you can assign applications to nodes that have the necessary resources and choose which node to fail over to.

Sun Cluster Service Level Management uses its own Derby-based database to store telemetry data and must be configured along with a Sun Cluster HAStoragePlus resource. The database needs highly available storage (in the form of a mount point) monitored by an HAStoragePlus resource so that all the nodes of the cluster can access the telemetry data.

Here is a small experiment you can carry out on the Sun Cluster 3.2 beta software to see the power of Service Level Management. In my experiment, I configured a Highly Available NFS (HA-NFS) service and monitored its system resource utilization. I used filebench and four Sun Fire V210 NFS clients to generate traffic, driving a load of about 20,000 files for 3 minutes. I observed disk, network, resource group and node resource utilization using Sun Cluster Manager and the command line interface. I also set a threshold on write throughput for the disks configured in an SVM metaset, so that an alarm is raised if it exceeds 50 KB/sec.

After configuring the HAStoragePlus resource for Derby, all you need to do is:

  • Permit system resource monitoring on resource groups (in this case, HA-NFS).
  • View and enable monitoring of additional telemetry attributes beyond the default ones.
  • View and optionally modify the polling interval for telemetry data collection.
  • Set a threshold on a telemetry attribute (in this case, wbyte.rate for a disk).
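
The threshold in the last step boils down to comparing a rate against a limit. Here is a rough Python 3 sketch of that idea. It is not how Sun Cluster implements it; it simply assumes you have cumulative bytes-written counter readings taken at a fixed polling interval (both hypothetical inputs) and flags the intervals where the write rate exceeds 50 KB/sec.

```python
def write_rate_alarms(samples, interval_sec, threshold_kb_per_sec=50.0):
    """Yield (index, rate_kb_per_sec) for intervals exceeding the threshold.

    samples: cumulative bytes-written counter readings, one per poll.
    interval_sec: seconds between consecutive readings.
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    for i in range(1, len(samples)):
        # Convert the counter delta to KB/sec for this polling interval.
        rate_kb = (samples[i] - samples[i - 1]) / 1024 / interval_sec
        if rate_kb > threshold_kb_per_sec:
            yield i, rate_kb


# Hypothetical counter readings taken every 10 seconds.
readings = [0, 204_800, 307_200, 1_331_200]
for index, rate in write_rate_alarms(readings, interval_sec=10):
    print(f"interval {index}: {rate:.1f} KB/sec exceeds threshold")
```

A shorter polling interval catches brief bursts that a longer one would average away, which is worth keeping in mind when you tune the interval and the threshold together.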

Here are some more graphs generated by Sun Cluster Manager for resource utilization.


With more features coming its way, I think Service Level Management will be a much needed and liked feature of Sun Cluster. With the help of Service Level Management:

  • There won't be any need for hefty shell scripts to monitor disk utilization, disk space and throughput.
  • Similarly, there won't be any need for third-party monitoring products.
  • It is possible to define Service Level Agreements (SLAs) for Sun Cluster.
  • You get a consolidated view of system resource utilization per resource group and per individual resource.
  • You can view 24 hours of resource utilization data.

Atul Vidwansa
Sun Cluster Engineering
Comments:

How do you set up telemetry/SLM in a cluster? I would like to use telemetry to gather performance and utilization stats for my 3.2 cluster. However, I can find little or no documentation on how to set it up. A little help please! George

Posted by George Cebulka on July 31, 2007 at 02:33 AM PDT #

There are a couple of ways to configure and enable telemetry.
  • One is to use "clsetup":
    • Run clsetup
    • Select menu item 8: Other Cluster Tasks
    • Select menu item 2: Configure Telemetry

  • Another option is to use the "sctelemetry" utility:
    • sctelemetry is installed along with Sun Cluster. Its man page includes examples of configuring telemetry.

Posted by Leland Chen on July 31, 2007 at 04:46 AM PDT #

I am trying to set it up with little success. I have a 100 MB LUN, currently called cluster-config-rs, which holds some shared cluster config stuff (such as the NFS dfstab file).

I just get the error "method failed for unknown reason". Are there any prereqs or anything else I need to be aware of?

Posted by Tim Sutton on February 11, 2008 at 04:29 AM PST #

This assumes your cluster-config-rs is an HAStoragePlus resource with a file system mount point, and that you selected that mount point in clsetup.

The problem might be that the JavaDB driver was installed or loaded after cacao (common agent container) had been started. Try the following steps to see if telemetry can then be enabled successfully.

  • Restart cacao by running "cacaoadm restart" on all cluster nodes.

  • Go to clsetup and try enabling it again.

Posted by Leland Chen on February 11, 2008 at 04:59 AM PST #

Just for the record: we've got, on several systems, the same problem reported by Tim Sutton. We have been unable to get it to run at all. Telemetry installs but then "fails to start for unknown reason." We have a support contract with Sun and we've been pestering them about this since February, with no solution so far. We've tried manual installation, using clsetup, and everything we can think of, but no dice. Our only theory is that some package or service is missing, but we have no way of knowing which one because the error message (in syslog) is so vague.

Posted by Stephan Beal on April 15, 2008 at 04:54 PM PDT #

Here are the packages that telemetry needs:

SUNWjavadb-core
SUNWjavadb-client

You can also take a look at the cacao log when you try to enable telemetry.
The cacao log file is:

/var/cacao/instances/default/logs/cacao.0

Posted by Leland Chen on April 16, 2008 at 02:42 AM PDT #

Thanks for the tip, Leland, but still no luck. We only get one line in that log when we start telemetry:

Apr 18, 2008 8:34:40 AM com.sun.cacao.element.ElementSupport setAdministrativeState
FINE: Administrative state change to UNLOCKED : com.sun.cacao:type=module,instance="com.sun.cluster.ganymede"

To me that says "success", but telemetry "fails to start for unknown reason" (syslog message). We've got a case open with Sun, and hopefully they can help us resolve this.

Posted by Stephan Beal on April 17, 2008 at 05:04 PM PDT #

We finally got it solved, but only by accident: a Sun technician told us to copy the /etc/cacao/.../security dir from one node to all others. That didn't do it. But copying the /etc/cacao tree in its entirety does work (I did that by accident while trying out the directions from the Sun technician). After putting an identical copy of /etc/cacao on all nodes, telemetry finally starts. It took us 6 frigging months to figure that out, though.

Posted by Stephan Beal on May 19, 2008 at 09:21 PM PDT #

Thanks for the update.

Did you restart cacao after you copied the /etc/cacao/instances/default/security directory?

There are only 3 directories under /etc/cacao. They are:
../security, which contains all security-related configuration.
../private, which contains the configuration for cacao itself.
../modules, which contains all add-on cacao modules. All cluster management modules (including ganymede) are in this directory.

If you restarted cacao after syncing the security directory on all nodes and it still didn't work, then files under the other 2 directories must be out of sync. If you still have archives of the original /etc/cacao from each node, please send them to the support people. That would be very valuable for analysis.

Thanks

Posted by Leland Chen on May 20, 2008 at 03:26 AM PDT #
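
For anyone chasing the same problem, one quick way to see which files under /etc/cacao differ between two nodes is to compare per-file checksums. Below is a minimal Python 3 sketch of that check. It assumes you have first copied each node's tree to a local directory; the paths in the usage comment are hypothetical, and the function names are made up for this example.

```python
import hashlib
import os


def tree_digest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def tree_differences(root_a, root_b):
    """Return sorted relative paths that are missing or differ between trees."""
    a, b = tree_digest(root_a), tree_digest(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


# Usage (hypothetical local copies of each node's tree):
# print(tree_differences("/tmp/node1/etc/cacao", "/tmp/node2/etc/cacao"))
```

An empty list means the two trees hold the same files with the same contents. Anything listed is either present on only one node or differs between them, which narrows down what to send to support.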
