HA For Grid Engine at osgc2008
By ashu on May 29, 2008
Last week I presented Open HA Cluster at the Open Source Grid Cluster Conference in Oakland, California. The conference had three different tracks, dedicated to Globus (GlobusWorld), Grid Engine (Grid Engine Workshop), and Rocks (Rocks Cluster Workshop).
My presentation about making Sun Grid Engine highly available using Open HA Cluster (OHAC) was part of the Grid Engine Workshop.
I noticed that the term "cluster" was a bit overused at this conference, with different products and technologies using it in slightly different ways. So I started by clarifying that "HA Cluster" refers to the technology OHAC brings to the arena, which is about keeping services available in spite of failures. A quick show of hands revealed that about 25% of the participants were aware of the concept of "HA Clusters" in general, with about 15% actually being aware of OHAC itself. Given that, I spent a larger portion of my talk on the concepts of single points of failure, redundancy, and failover, and on how OHAC recovers from system failures. Towards the end of my talk, I also covered using OHAC to make Sun Grid Engine highly available and the key advantages of the HA solution based on OHAC. These points and the slides are courtesy of Thorsten Frueauf. The key points about how OHAC helps improve the availability of Sun Grid Engine are noted in this blog entry.
The presentation did generate a couple of questions from the audience. One question I remember was about how OHAC takes care of the MAC address change when it fails over an HA IP address from one node to another. I explained that OHAC uses gratuitous ARPs to update the ARP caches of any routers on the network, and that this works with all but a very few exceptions. Another question was about data recovery during disk/mirror failures and whether the end application needs to worry about it. I explained that this recovery is typically performed by a volume manager and the end application is blissfully unaware of it. The OHAC framework makes sure that the data is available on the node where the application runs at the time the application is started. Another question was about the speed of failover (how fast the recovery is upon various failures). I turned that question into an advantage by explaining that OHAC is tightly integrated with Solaris and thus can detect failures quickly and recover from them quickly. I then invited folks to view the failover demo on my laptop the next day, in the "Grill the Gurus" portion of the conference.
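To make the gratuitous ARP answer a bit more concrete, here is a minimal Python sketch of what such a frame looks like on the wire. This is only an illustration of the general technique, not how OHAC implements it: the function names, the MAC address, and the IP address are all made up for the example. The node taking over the address broadcasts an ARP request in which the sender and target IP are both the failed-over address, so neighbors and routers that already cache that IP refresh the entry with the new MAC.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6   # Ethernet broadcast destination
ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes (hypothetical helper)."""
    return bytes.fromhex(mac.replace(":", ""))


def build_gratuitous_arp(new_mac: str, ha_ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request.

    new_mac: MAC of the node that now hosts the address (assumed value).
    ha_ip:   the failed-over IP address (assumed value).
    """
    sender_mac = mac_to_bytes(new_mac)
    ip_bytes = socket.inet_aton(ha_ip)

    # Ethernet header: destination, source, EtherType.
    eth_header = BROADCAST_MAC + sender_mac + struct.pack("!H", ETHERTYPE_ARP)

    # ARP header: hardware type 1 (Ethernet), protocol type 0x0800 (IPv4),
    # hardware address length 6, protocol address length 4, operation.
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_REQUEST)

    # Sender MAC/IP, then target MAC (zeros) and target IP.
    # Sender IP == target IP is what makes the ARP "gratuitous".
    arp_body = sender_mac + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_header + arp_body
```

Calling `build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.10")` returns the raw frame bytes. Actually putting the frame on the network needs a raw socket and the right privileges, which I left out on purpose to keep the sketch short.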
I was also somewhat curious about the audience mix, and whether the larger share came from the academic community or the commercial community. A quick show of hands revealed that the commercial users were well represented, in roughly the same numbers as the academic/research users. After the talk, I spoke to a variety of people during the coffee and lunch breaks. Here are some folks I remember: a sysadmin at a European oil company interested in using Grid Engine to minimize application license usage for commercial software he uses for geological data analysis; an IT manager for a medical software startup based in San Francisco who was interested in Open Source software as a way to minimize costs; a deployment architect for an IT consultancy who was interested in geographical data replication and content-based routing of incoming jobs; a lab manager from an Ivy League university who wanted to figure out an easy way for his students to be effective at managing his compute lab environment; and an IT admin for a storage manufacturer who was interested in learning about techniques for efficient monitoring of workloads.
For the demo the next day, I had Sun Grid Engine configured as an HA server across two zones on my laptop. I was able to demo the very quick restart of the Grid Engine qmaster and scheduler daemons. People seemed curious about how that works, which led me to explain how Solaris Contracts are used by the process monitoring implementation in OHAC, which makes for quick detection of and recovery from application failures. Most people were simply interested in chatting about the general concept of clusters and discussing their own "Grid and Cluster" scenarios.
If you are interested in the actual slides I used for the talk, you can check them out here. If you missed this conference, you will have an opportunity to learn more about Open HA Cluster and OpenSolaris at the upcoming LinuxTag conference in Berlin, Germany, from May 28th to May 31st, 2008.
The picture at the top was taken during a coffee break at the conference. Check out this link for other photos I took at the conference. Also, Deirdré Straughan made a video of my talk, complete with neat fading in and out of the presentation slides. Click in the embedded window below to watch the presentation in Flash.
If you'd like, you can also get the video in iPod format and watch it on your video iPod. Beware that the file is rather big, though.
This conference was a nice opportunity for me to talk to a lot of people and make them aware of Open HA Cluster, and also to learn about what is going on in other Open Source communities such as Grid. I hope you found some of the things in this blog useful and interesting.
Solaris Cluster Engineering