Reliable Failovers, an example hits home!

Recently, on a trip to Australia, I had the opportunity to go diving on the Great Barrier Reef, and it gave me a different perspective on what it means to have Reliable Failovers. Basically, while diving, you have one tube through which you breathe and another one which is a backup. See illustration on the right. This has parallels to how Sun Cluster manages reliable failovers to achieve the greatest possible availability. Consider:

  • There needs to be a backup for every component which can fail. This is the No Single Point Of Failure philosophy which Sun Cluster employs for mission critical applications.
  • First, when the primary breathing tube fails, the detection of that failure has to be solid. You cannot be randomly switching back and forth between primary and secondary breathing tubes while 50 feet underwater. This is the Reliable Failure Detection point which Sun Cluster provides for you.
  • Next, you remove the failed tube from your mouth, but you still have to keep your airways open by continuing to exhale through your mouth (slowly, of course! You don't want to bubble away the limited amount of air you have). Otherwise there is a danger that the airways can collapse. With Sun Cluster, this is roughly akin to how SC reserves and fences the shared resources (storage etc.) so that the shared resources (a la your airways) are not negatively impacted by the failed component.
  • Lastly, you can't simply shove the secondary tube in your mouth and start breathing. The intake mouthpiece would contain water (and sand, depending upon where you have been) and you would start choking. So you first have to blow hard into the intake mouthpiece to clear it before you can start breathing. With Sun Cluster, this is akin to cleaning up any leftover state of the application, such as application lock files, pid files, or files which were open on the cluster filesystem on the failed node. The Sun Cluster core framework takes care of expunging (not by blowing hard, but similar :-)) system level state for the failed node, and the job of cleaning out leftover application state falls on the Application Agents, which remove leftover files and application state which could interfere with a reliable failover of the application.
  • Nothing like 50 feet of water on top of you and your life itself at stake to make you realize what "Reliability" means!! For a mission critical application, "50 feet of water" == "Thousands of dollars of revenue per minute of downtime" and "life itself at stake" == "Your job and your credibility as the keeper of the system at stake"!! Believe me, all those steps I outlined above don't feel so easy if you have to do them either under 50 feet of water, or when your mission critical application is down and you need to recover it quickly and reliably.
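
For the code-minded, here is a rough Python sketch of the sequence above: detect the failure solidly, fence, clear out leftover state, and only then start on the backup. To be clear, this is not Sun Cluster code or its API. Every name in it (FailureDetector, clean_stale_state, fail_over, the threshold of 3 missed probes) is made up for illustration, and the fence and start steps are just callbacks you would supply.

```python
import os
from typing import Callable, Iterable, List, Optional


class FailureDetector:
    """Hypothetical detector: declare failure only after `threshold`
    consecutive missed probes, so one bad probe does not cause flapping."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.misses = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return True once the primary counts as failed."""
        self.misses = 0 if probe_ok else self.misses + 1
        return self.misses >= self.threshold


def clean_stale_state(paths: Iterable[str]) -> List[str]:
    """Remove leftover lock/pid files and return the ones actually removed."""
    removed = []
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed


def fail_over(
    probe_results: Iterable[bool],
    stale_paths: List[str],
    fence: Callable[[], None],
    start_on_backup: Callable[[], None],
) -> Optional[List[str]]:
    """Run the steps in order; return the cleaned-up paths, or None if
    the probes never added up to a solid failure."""
    detector = FailureDetector(threshold=3)
    for ok in probe_results:
        if detector.record(ok):
            fence()                                   # protect shared resources first
            removed = clean_stale_state(stale_paths)  # clear the mouthpiece
            start_on_backup()                         # only now start breathing
            return removed
    return None
```

The ordering is the point: a single missed probe resets nothing but a counter, and the backup is started only after fencing and cleanup have both finished.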

    Hope I did not gross you out already by being such a geek!! I mean, here I am, on the trip of a lifetime, diving on the famous Great Barrier Reef (BTW: Did you know that the GBR can be seen from space?) and all I can think about is Clustering and what it means to have reliable failovers?? Fear not, my dear friends, I shook these geeky feelings off and went on to have 2 great dives lasting about half an hour each. Saw tons of exotic fish, and even a mid-sized (4-5 feet) whitetip reef shark, which, to my untrained eyes, looked exactly like a great white. But that is another blog.

    No worries mate!


    Nice illustration Ashu! Apart from this blog entry, it would be nice if you can blog about NFSv4 support in Sun Cluster

    Posted by Atul Vidwansa on April 18, 2007 at 06:27 PM PDT #

    Hi Atul, Thanks. About NFSv4 support, believe it or not, I am halfway through writing one! I wanted to get some feedback from NFSv4 folks before posting it. So, look for it here in the next couple of weeks. -ashu

    Posted by guest on April 19, 2007 at 06:32 AM PDT #


    Oracle Solaris Cluster Engineering Blog

