

Sun Cluster Service Level Management

Guest Author

The soon-to-be-released Sun Cluster 3.2 has a great feature called "Sun Cluster Service Level Management". It lets you view telemetry data on how Sun Cluster uses system resources. After a very easy setup with the clsetup/scsetup command, you can view CPU, memory, swap and network utilization for cluster nodes, resource groups and individual system components such as disks and adapters.

Another interesting feature is CPU control for Sun Cluster resource groups. This functionality is built on the CPU control facilities available in the Solaris operating system. For example, in Sun Cluster 3.2 running on Solaris 10, you can:

  • Assign CPU shares to resource groups running in global or non-global zones.
  • Set the minimum or maximum number of processors in a dedicated processor set for resource groups.
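
As a rough illustration of what CPU shares mean, assume the shares are enforced by the Solaris Fair Share Scheduler: when the CPUs are fully contended, each active group is entitled to its own shares divided by the total shares of all active groups. The Python sketch below only does that arithmetic. The resource group names and share counts are made up for illustration and do not come from Sun Cluster.

```python
from fractions import Fraction


def cpu_entitlement(shares):
    """Return each group's fraction of CPU under full contention.

    shares maps a group name to its share count. Groups with zero
    shares are treated as inactive and left out of the result.
    """
    active = {name: n for name, n in shares.items() if n > 0}
    total = sum(active.values())
    if total == 0:
        return {}
    return {name: Fraction(n, total) for name, n in active.items()}


# Hypothetical resource groups and share counts, for illustration only.
example = cpu_entitlement({"nfs-rg": 20, "db-rg": 60, "web-rg": 20})
for name, frac in example.items():
    print(f"{name}: {float(frac):.0%}")
```

Shares only matter when there is contention. A group can use more than its entitlement while other groups are idle.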

By monitoring system resource usage through Sun Cluster, you can collect data that reflects how a service using specific system resources is performing, and discover resource bottlenecks, overloads or even underutilized hardware. Based on this data you can assign applications to nodes that have the necessary resources and choose which node to fail over to.
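
To make the idea of choosing a failover node from utilization data concrete, here is a small Python sketch. It is not how Sun Cluster selects nodes. The function name, the node names, the utilization figures and the "largest remaining headroom" rule are all assumptions made for illustration.

```python
def pick_failover_target(usage, need, exclude):
    """Pick the node with the most spare capacity left after placing a service.

    usage:   node name -> {"cpu": used fraction, "mem": used fraction}
    need:    {"cpu": required fraction, "mem": required fraction}
    exclude: the node the service is failing away from
    Returns a node name, or None if no node can fit the service.
    """
    candidates = []
    for node, used in usage.items():
        if node == exclude:
            continue
        headroom = {k: 1.0 - used[k] for k in need}
        if all(headroom[k] >= need[k] for k in need):
            # Score a node by its tightest resource after placement.
            spare = min(headroom[k] - need[k] for k in need)
            candidates.append((spare, node))
    return max(candidates)[1] if candidates else None


# Made-up utilization snapshot for a three-node cluster.
usage = {
    "node1": {"cpu": 0.9, "mem": 0.8},
    "node2": {"cpu": 0.3, "mem": 0.5},
    "node3": {"cpu": 0.6, "mem": 0.2},
}
need = {"cpu": 0.25, "mem": 0.25}
print(pick_failover_target(usage, need, exclude="node1"))
```

Scoring by the tightest resource avoids picking a node that has plenty of memory but almost no CPU left.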

Sun Cluster Service Level Management uses its own Derby-based database to store telemetry data and needs to be configured along with a Sun Cluster HAStoragePlus resource. The database's highly available storage (in the form of a mount point) must be managed by the HAStoragePlus resource so that all the nodes of the cluster can access the telemetry data.

Here is a small experiment you can carry out on the Sun Cluster 3.2 beta software to see the power of Service Level Management. In my experiment, I configured a Highly Available NFS (HA-NFS) service and monitored its system resource utilization. I used filebench and four Sun Fire V210 NFS clients to generate traffic. A load of about 20,000 files was driven for 3 minutes. Disk, network, resource group and node resource utilization were observed using Sun Cluster Manager and the command line interface. A threshold was set on write throughput for the disks configured in an SVM metaset, to raise an alarm if throughput exceeded 50 KB/sec.



After configuring the HAStoragePlus resource for Derby, all you need to do is:

  • Enable system resource monitoring on resource groups. In this case, HA-NFS.
  • View and enable monitoring of additional telemetry attributes beyond the default ones.
  • View and optionally modify the polling interval for telemetry data collection.
  • Set a threshold on a telemetry attribute. In this case, wbyte.rate for a disk.
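
The threshold step can be pictured with a small Python sketch. This is not Sun Cluster's implementation. It assumes the write rate is derived from cumulative byte counters sampled at a fixed polling interval, that 1 KB means 1024 bytes, and that an alarm is raised each time the rate rises above the threshold. The function names and sample values are made up for illustration.

```python
def write_rates(samples, interval_s):
    """Convert cumulative bytes-written samples into rates in bytes/sec."""
    return [(b - a) / interval_s for a, b in zip(samples, samples[1:])]


def rising_alarms(rates, threshold):
    """Return the indices where the rate crosses from not-above to above."""
    alarms = []
    above = False
    for i, rate in enumerate(rates):
        if rate > threshold and not above:
            alarms.append(i)
        above = rate > threshold
    return alarms


THRESHOLD = 50 * 1024  # 50 KB/sec, assuming 1 KB = 1024 bytes

# Made-up cumulative byte counters, sampled every 60 seconds.
samples = [0, 1_000_000, 2_000_000, 8_000_000, 14_000_000, 14_500_000]
rates = write_rates(samples, 60)
print(rising_alarms(rates, THRESHOLD))  # prints [2]
```

Only the first interval above the limit raises an alarm here, so a sustained burst is reported once rather than at every poll.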

Here are some more graphs generated by Sun Cluster Manager for resource utilization.





With more features on the way, I think Service Level Management will be a much-needed and well-liked feature of Sun Cluster. With the help of Service Level Management:

  • There won't be any need for hefty shell scripts to monitor disk utilization, disk space and throughput.
  • Similarly, there won't be any need for third-party monitoring products.
  • It is possible to define Service Level Agreements (SLAs) for Sun Cluster.
  • You get a consolidated view of system resource utilization per resource group and per individual resource.
  • You can view 24 hours of resource utilization data.



Atul Vidwansa

Sun Cluster Engineering


Comments ( 8 )
  • George Cebulka Tuesday, July 31, 2007
    How do you set up telemetry/SLM in a cluster? I would like to use telemetry to gather performance and utilization stats for my 3.2 cluster. However, I can find little or no documentation on how to set it up.
    A little help please!
    George
  • Tim Sutton Monday, February 11, 2008

    I am trying to set it up with little success. I have a 100 MB LUN which is currently called cluster-config-rs and holds some shared cluster config stuff (such as the nfs dfstab file).

    I just get an error: method failed for unknown reason. Are there any prereqs or anything I need to be aware of?


  • Leland Chen Monday, February 11, 2008

    I'm assuming your cluster-config-rs is an HAStoragePlus resource with a file system mount point, and that you selected that mount point in clsetup.

    The problem might be that the JavaDB driver was installed or loaded after cacao (common agent container) had been started. Try the following steps to see whether telemetry can be enabled successfully.

    * Restart cacao by running "cacaoadm restart" on all cluster nodes.

    * Go to clsetup and try enabling it again.


  • Stephan Beal Tuesday, April 15, 2008

    Just for the record: we've got, on several systems, the same problem reported by Tim Sutton. We have been unable to get it to run at all. Telemetry installs but then "fails to start for unknown reason." We have a support contract with Sun and we've been pestering them about this since February, with no solution so far. We've tried manual installation, using clsetup, and everything we can think of, but no dice. Our only theory is that some package or service is missing, but we have no way of knowing which one because the error message (in syslog) is so vague.


  • Leland Chen Wednesday, April 16, 2008

    Here are the packages that telemetry needs.

    SUNWjavadb-core

    SUNWjavadb-client

    You can take a look at the cacao log when you try to enable telemetry.

    The cacao log file is

    /var/cacao/instances/default/logs/cacao.0


  • Stephan Beal Friday, April 18, 2008

    Thanks for the tip, Leland, but still no luck. We only get one line in that log when we start telemetry:

    Apr 18, 2008 8:34:40 AM com.sun.cacao.element.ElementSupport setAdministrativeState

    FINE: Administrative state change to UNLOCKED : com.sun.cacao:type=module,instance="com.sun.cluster.ganymede"

    To me that says "success", but telemetry "fails to start for unknown reason" (syslog message). We've got a case open with Sun, and hopefully they can help us resolve this.


  • Stephan Beal Tuesday, May 20, 2008

    We finally got it solved, but only by accident: a Sun technician told us to copy the /etc/cacao/.../security dir from one node to all others. That didn't do it. But copying the /etc/cacao tree in its entirety does work (I did that by accident while trying out the directions from the Sun technician). After putting an identical copy of /etc/cacao on all nodes, telemetry finally starts. It took us 6 frigging months to figure that out, though.


  • Leland Chen Tuesday, May 20, 2008

    Thanks for the update.

    Did you restart cacao after you copied the /etc/cacao/instances/default/security directory?

    There are only 3 directories under /etc/cacao. They are

    ../security, which contains all security related configuration.

    ../private, which contains the configuration for cacao itself.

    ../modules, which contains all add-on cacao modules. All cluster management modules (including ganymede) are in this directory.

    If you restarted cacao after syncing the security directory on all nodes and it still didn't work, then the files under the other 2 directories were not in sync. If you still have the archives of the original /etc/cacao on each node, please send the archives to the support people. That would be very valuable for analysis.

    Thanks

