Monday Mar 16, 2009

Open HA Cluster on OpenSolaris - a milestone to look at

You might have noticed the brief message to the ha-clusters-discuss mailing list that a new Colorado source drop is available.  If you have read my blog about why getting Open HA Cluster running on OpenSolaris is not just a "re-compile and run" experience, then it is worth mentioning that the team working on project Colorado has made some great progress:

  • The whole source can now be compiled with the latest SunStudioExpress compiler on OpenSolaris.
  • Logic has been implemented to create the IPS manifest content within the source gate (usr/src/ipsdefs/) and to send the IPS packages to a configurable repository as part of the build process. That way you can easily install your own set of IPS packages on any OpenSolaris system that can reach that repository.
  • IPS package dependencies have been analysed and are now defined explicitly at a finer granularity.
  • scinstall has been enhanced to perform all the required steps at initial configuration time that were previously done within individual postinstall/preremove SVR4 package scripts.
  • Shell scripts have been verified to work with KSH93, or have been changed to use /usr/xpg4/bin/sh.
  • While the framework gate still has a build dependency on JATO (which is part of webconsole), all run-time dependencies have been removed.
  • pconsole has been made available for OpenSolaris, which can be used instead of the Solaris Cluster adminconsole (which uses Motif).
  • Changes have been made to work with new networking features introduced by projects Crossbow, Clearview and Volo. In particular, this means that VNICs can be used to set up a minimal cluster on systems which have just one physical network interface.
  • Changes have been made to improve the DID layer so that it works with non-shared storage exported as Solaris iSCSI targets, which in turn can be used by configuring the Solaris iSCSI initiator on both nodes. This can be combined with ZFS to achieve failover for the corresponding zpool. Details can be found within the iSCSI design document.
  • Changes have been made to implement a new feature called weak membership. This allows a two node cluster to be formed without requiring a quorum device (neither quorum disk nor quorum server). To better understand this new functionality, read the weak membership design document.
  • Changes have been made to the HA Containers agent to work with the ipkg brand type for non-global zones on OpenSolaris.
The source has been verified to compile on OpenSolaris 2009.06 build 108 on x86. Instructions on how to compile the framework and agent source code should enable you to try the various new possibilities out on your OpenSolaris system. Give it a try and report back your experience! Of course be aware that this is still work in progress.

Monday Feb 16, 2009

Participate: Colorado Phase 1 Open Design review

Colorado is the OpenSolaris project to have Open HA Cluster running on the OpenSolaris binary distribution. In a past blog I explained why this is not just a "compile and run" experience. In September and October 2008, two CLARC (CLuster Architecture Review Committee) slots were assigned to review the Colorado requirements specification.

This week, on Thursday 19th February 2009, you can participate in an Open CLARC review from 9am - 11am PT (Pacific Time). Free dial-in information can be found at the Open CLARC wiki page.

Prior to the call it makes sense to first read the design documents:

  • The main document explains how the generic requirements get implemented: necessary (ksh) script changes, modifications to some agents (specifically support for the ipkg brand type within the HA Containers agent), details on the IPS support implementation, build changes to support compilation with Sun Studio Express, etc.
  • The document about weak membership explains the modifications needed to achieve a two node cluster without requiring an extra quorum device. Instead, a new mechanism is implemented which makes use of health checks. The document explains how the behavior differs from the existing strong membership model, as well as the modifications to the command line interface to switch between those two membership models.
  • The networking design document describes the changes necessary to work with the OpenSolaris projects Clearview and Crossbow. The latter in particular also requires changes within the cluster CLI to support the configuration of virtual network interfaces (VNIC).
  • The iSCSI design document describes the changes required within the DID (device id) driver and the interaction with the Solaris iSCSI subsystem.

You can also send your comments to the ha-clusters-discuss-at-opensolaris-dot-org mailing list.

Don't miss this interesting opportunity to participate and learn a lot about the Open HA Cluster internals. Hear you at the CLARC slot or read you on the mailing list :-)

Monday Jan 19, 2009

Unexpected ksh93 behaviour with built-in commands sent to the background

While working on modifying the HA Containers agent to support the ipkg zone brand type from OpenSolaris, I stumbled over a difference in behavior between ksh88 and ksh93 that you should be aware of, since it will at least impact the GDS based agents.

The difference affects commands that exist both as user land programs and as shell built-ins, like sleep, and how they then appear in the process table.

Example: Consider the following script called "/var/tmp/mytesting.ksh":


#!/bin/ksh
# send one sleep into background - it will run longer than the script itself
sleep 100 &

# another sleep, this time not in the background
sleep 20
echo "work done"
exit 0

If you then invoke the script on OpenSolaris and, while the second sleep runs, invoke "ps -ef | grep mytesting" in a different window, it will show something like:

    root  8185  8159   0 06:48:13 pts/4       0:00 /bin/ksh /var/tmp/mytesting.ksh
    root  8248  8178   0 06:48:32 pts/5       0:00 grep mytesting
    root  8186  8185   0 06:48:13 pts/4       0:00 /bin/ksh /var/tmp/mytesting.ksh

Note the two processes with the same name.

After the second sleep (sleep 20) has finished, you will see that the script terminates after printing "work done".

However, "ps -ef | grep mytesting" will show:

    root  8262  8178   0 06:48:37 pts/5       0:00 grep mytesting
    root  8186     1   0 06:48:13 pts/4       0:00 /bin/ksh /var/tmp/mytesting.ksh

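until the first sleep has also finished.

What is interesting is that the sleep put into the background carries the script name in the process table: ksh93 runs its built-in sleep in a forked copy of the shell instead of executing a separate program.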

On a system where /bin/ksh is ksh88 based, you would see a process called "sleep 100", and "/bin/ksh /var/tmp/mytesting.ksh" just once.

If, on the other hand, you create the following script called "/var/tmp/mytesting2.ksh":


#!/bin/ksh
# send one sleep into background - it will run longer than the script itself
/usr/bin/sleep 100 &

# another sleep, this time not in the background
/usr/bin/sleep 20
echo "work done"
exit 0

Repeat the test above, and you will see:

# ps -ef | grep mytesting
    root  8276  8159   0 06:57:31 pts/4       0:00 /bin/ksh /var/tmp/mytesting2.ksh
    root  8292  8178   0 06:57:36 pts/5       0:00 grep mytesting

I.e. the script appears just once. And you can see:

# ps -ef | grep sleep
    root  8278  8276   0 06:57:31 pts/4       0:00 /usr/bin/sleep 20
    root  8296  8178   0 06:57:43 pts/5       0:00 grep sleep
    root  8277  8276   0 06:57:31 pts/4       0:00 /usr/bin/sleep 100

This is the output while the second sleep is still running and the script has not finished.

Once it has finished, you just see:

# ps -ef | grep sleep
    root  8306  8178   0 06:57:55 pts/5       0:00 grep sleep
    root  8277     1   0 06:57:31 pts/4       0:00 /usr/bin/sleep 100

and no longer the /var/tmp/mytesting2.ksh process.

This makes a difference to the logic within our GDS based agents, where we disable the PMF action script. Before doing that, we invoke a sleep with the START_TIMEOUT length to ensure that at least one process is within the tag until the action script is disabled.

And in our probe script, the wait_for_online mechanism checks whether the start_command is still running. If it is, the probe returns 100 to indicate that the resource is not online yet.

So far, much of our code invokes sleep instead of /usr/bin/sleep. Combined with the wait_for_online mechanism above, this will cause our start commands to always run into a timeout, although the script itself terminates and everything worked fine. Manual testing will not show anything obvious either.

The fix is to always invoke /usr/bin/sleep, not the shell built-in sleep.
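To make the effect concrete, here is a minimal sketch of a wait_for_online style check. It is not the actual GDS code: the function name probe_start_command and the use of pgrep are assumptions made only for this illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a wait_for_online style check (not the real GDS code).
# probe_start_command is an invented name for this example.
# It returns 100 while a process whose command line matches the start
# command is still in the process table, and 0 once none is left.
probe_start_command() {
    start_command=$1
    # pgrep -f matches the pattern against the full argument string
    if pgrep -f "$start_command" >/dev/null 2>&1; then
        return 100
    fi
    return 0
}
```

With a check along these lines, a built-in sleep sent to the background under ksh93 keeps an entry such as "/bin/ksh /var/tmp/mytesting.ksh" in the process table, so the check would keep reporting 100 even though the script itself has already exited.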
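A small sketch for checking how a background sleep started through an explicit path shows up in the process table. The names SLEEP_BIN and bg_sleep_name are made up for this example, and the fallback to /bin/sleep is only an assumption for systems that do not ship /usr/bin/sleep.

```shell
#!/bin/sh
# Sketch: run a background sleep through an explicit path and report how
# it appears in the process table. SLEEP_BIN and bg_sleep_name are
# hypothetical names; the /bin/sleep fallback is an assumption for
# systems without /usr/bin/sleep.
SLEEP_BIN=/usr/bin/sleep
[ -x "$SLEEP_BIN" ] || SLEEP_BIN=/bin/sleep

bg_sleep_name() {
    "$SLEEP_BIN" 5 >/dev/null 2>&1 &
    bg_pid=$!
    # give the child a moment to exec before looking at it
    "$SLEEP_BIN" 1
    # "args=" prints only the command line, without a header
    ps -o args= -p "$bg_pid"
    # clean up the background process again
    kill "$bg_pid" 2>/dev/null || true
    wait "$bg_pid" 2>/dev/null || true
    return 0
}
```

Calling bg_sleep_name should print the command line of the external sleep program rather than the name of the calling script, which is what the process based checks in the agents rely on.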

Took me a while to understand it, and I write it here so others are not scratching their head as I did ;-)

Thursday Oct 16, 2008

Recordings of the SourceTalk 2008 talks

Some talks were recorded with the TeleTeachingTool during the SourceTalk Days 2008. They have recently become available on the event's web pages. Among them is my experience report "Portierung von Open HA Cluster auf OpenSolaris" (Porting Open HA Cluster to OpenSolaris): without video (12MB) and with video (53MB).

My blog entry from last year explains how to watch the recording on Solaris as well. I have updated the ttt.ksh script; it now plays the MP3 audio correctly.

You are also welcome to download my presentation as a PDF.

Monday Sep 29, 2008

Building Open HA Cluster on OpenSolaris

The first source code tarball for project Colorado, to compile the Open HA Cluster core framework on the OpenSolaris binary distribution, is available. It also contains a set of updated scbld tools needed to run with KSH93 on OpenSolaris.

Have a look at the detailed instructions on how to get it compiled on OpenSolaris 2008.05.

The following parts have been excluded from compilation:
  • dsconfig
  • agent_snmp_event_mib/mib
  • spm
  • adminconsole
and are thus not part of the created SVR4 packages, since the source code for those components relies on headers/libraries that are not available for the OpenSolaris binary distribution.

Next steps: design on how to create/send IPS packages instead of building the SVR4 packages. Any help is welcome. Have a look at the Wiki page for the current plans.

Friday Sep 19, 2008

Invitation to SourceTalk 2008

From 23 to 25 September, the SourceTalk Days 2008 take place for the fourth time at the Mathematisches Institut of the Universität Göttingen. On Wednesday, 24 September, one series of talks is dedicated to OpenSolaris. As part of it, I will give a talk at 11:15 titled "Erfahrungsbericht - Portierung von Open HA Cluster auf OpenSolaris" (Experience report: porting Open HA Cluster to OpenSolaris).

The experience report covers the activities so far within the OpenSolaris project Colorado. You can read a preview in my related blog entry (in English).

Together with my colleagues, I will also be available at around 5 pm for a "Meet the Experts" session for questions of all kinds about Solaris, OpenSolaris and Open HA Cluster.

See you in Göttingen!

Monday Sep 08, 2008

Project Colorado: Running Open HA Cluster on OpenSolaris

If you have ever asked yourself why Solaris Cluster Express (the Open HA Cluster binary distribution) is running on Solaris Express Community Edition and not yet on the OpenSolaris binary distribution, then you might be interested in project Colorado. This project is endorsed by the HA Clusters community group and has as its goal to provide a minimal and extensible binary distribution of Open HA Cluster that runs on the OpenSolaris binary distribution.

As always, the devil is in the details. Here are some of the reasons why this isn't just a "recompile and run" experience:

  • The packaging system changed:
    One of the big changes with the OpenSolaris binary distribution is the switch to the image packaging system (IPS). As Stephen Hahn explains in some of his blogs, it is a key design criterion of IPS to not include something like the rich scripting hooks found within the System V packaging system.
    Have a look at our initial analysis on how and where those scripting hooks are currently used. Note that installation is one aspect, uninstallation another. In the more modular world of a network based package system, this means that dependencies on other packages need to be explicit and fine grained. And within the Cluster world, this needs to be in agreement across multiple nodes. Since we also deliver kernel modules and plan to deliver a set of optional functionality, this raises the question of where to put the tasks that have so far lived within the scripting hooks for configuration and uninstallation.
    Further, if you know the details of the Open HA Cluster build steps (which are similar to the ON build steps), you know that building the packages via the pkgdefs hierarchy is an important element for later assembling the deliverables for the DVD image. Since we do not just want a 1:1 conversion of our existing packages into IPS packages, we need to come up with a mechanism to deliver the IPS packages into a network repository as part of the build step, since there is no delivery of a DVD image going forward.
  • Zones behaviour changed:
    The native zones brand type introduced in Solaris 10 inherited the packages to install from the global zone. For Solaris Cluster this means that cluster related packages got automatically installed and configured when a zone got created. On OpenSolaris this behaviour changed. Currently there is only the ipkg brand type available. This means that, out of the box, none of the ways Solaris Cluster integrates with zones works without various changes.
  • KSH93 vs KSH88:
    OpenSolaris switched to KSH93 for /bin/sh and /usr/bin/ksh; the previous KSH88 shell is no longer available. Again the devil is in the details. For example, KSH93 deals with local vs. global variables differently. Some scripts required for building (like /opt/scbld/bin/nbuild) or for installing Solaris Cluster (like /usr/cluster/bin/scinstall) break with KSH93. The full set of impacted scripts still needs to be determined.
  • The Sun Java Web Console (webconsole) is not part of OpenSolaris:
    The Java Web Console provides a common location for users to access web-based system management applications. The single entry point that the web console provides eliminates the need to learn URLs for multiple applications. In addition, the single entry point provides user authentication and authorization for all applications that are registered with the web console. All web console-based applications conform to the same user interface guidelines, which enhances ease of use.
    Those are all reasons why Solaris Cluster chose to deliver its browser user interface (BUI), named Sun Cluster Manager, on top of the Sun Java Web Console framework. In addition it also uses the Web Application Framework (JATO) and Lockhart Common Components.
    Since those components are not available for the OpenSolaris binary distribution, this raises the question of which management framework to use (and develop against) instead. Of course a substitution is not trivial and can be quite time consuming. And it is not certain whether existing code can be reused.
  • Dependencies on encumbered Solaris code:
    Besides components that the OpenSolaris binary distribution chose not to deliver anymore, there is the goal to create a freely redistributable binary distribution. This means OpenSolaris also does not deliver the Common Desktop Environment (CDE), which includes the Motif libraries. The adminconsole delivered with Solaris Cluster uses Motif and ToolTalk.
    The adminconsole tools need to be redesigned to use libraries available within OpenSolaris.
  • No SPARC support for OpenSolaris yet:
    The OpenSolaris binary distribution is currently only available for the i386 platform. Solaris Express Community Edition also provides a distribution for SPARC. While this is not a strong inhibitor to running on OpenSolaris, it is nonetheless a reason why providing Solaris Cluster Express is still a requirement.
    The good news is that there are plans to provide SPARC support for OpenSolaris within future releases.
  • OpenSolaris Installer does not support network installations yet:
    While this is not a direct problem, it becomes one if you consider that developers for Open HA Cluster are distributed around the world and most engineers only have access to remote systems, without the possibility to perform an installation requiring keyboard and monitor.
    Again the good news is that there are plans to add support for automated network installations within future OpenSolaris releases.

Besides solving the above challenges, we also want to offer some new possibilities within Colorado. You can read the details within the umbrella requirement specification. There are separate requirement specifications to outline specific details for the planned private-interconnect changes, cluster infrastructure changes involving the weaker membership, enhancements to make the proxy file system (PxFS) optional, and changes to use iSCSI with ZFS for non-shared storage configurations.

You can provide feedback on those documents to the ha-clusters-discuss mailing list. There is a review scheduled with the Cluster Architecture Review Committee on 18 September 2008, where you are invited to participate by phone if you are interested.


This Blog is about my work at Availability Engineering: Wine, Cluster and Song :-) The views expressed on this blog are my own and do not necessarily reflect the views of Sun and/or Oracle.

