Tuesday Oct 09, 2007

Announcing Solaris Cluster support in LDoms I/O domains

If you keep track of Solaris Cluster developments, you have probably already seen the marketing announcement that went out recently. In this blog entry we will talk about this new support from a more technical point of view.

First, just to make sure we are on the same page, this is about supporting Solaris Cluster in LDoms I/O domains. For details on what an LDoms I/O domain is, please see the LDoms Admin Guide. Informally, an LDoms I/O domain "owns" at least one PCI bus on the system and thus has direct physical access to the devices on that bus. The I/O domain can then export services to other guest domains on the system. These "services" take the form of virtual devices which are made available to other domains.

So, where does that leave support for Solaris Cluster and LDoms guest domains? You can create guest domains on the same system where SC is running in the I/O domain and deploy non-HA applications into those guest domains. This allows for better hardware utilization and more flexibility in application deployment. The ability to manage HA applications inside LDoms guest domains is something we are currently working on, so stay tuned.

With that bit of informal taxonomy and scope clarification out of the way, let us look at how Solaris Cluster 3.2 can be deployed in such LDoms I/O domains. The first thing to note is that on some LDoms-capable servers, there is only one PCI bus available and hence there can be only one I/O domain on such systems (which would also, by definition, be the control domain). The figure below illustrates this deployment scenario.

The thing to note in the configuration above is that the non-clustered guest domains use the public network provided via the control domain, which is running SC. Network bandwidth is therefore shared between the non-clustered applications running inside the guest domains and the HA applications running inside the control domain. Keep this in mind when sizing the system: it affects how many guest domains you can run and how much network load the system can carry. Similar considerations apply to any I/O bandwidth shared between guest domains and I/O domains. There is no specific restriction on the use of LDoms virtualization features inside the guest domains, such as different kinds of virtual storage devices, dynamic assignment of CPUs, etc.

For our next deployment scenario, we will pick a server platform which has more than one PCI bus. This allows us to create additional I/O domains on the system and build more interesting (and potentially more useful to customers) scenarios. We will take the Sun Fire T2000 (Ontario) as the target system, as it has two PCI buses. Note that not all systems have that flexibility; some have only one PCI bus. In this deployment scenario, we take two Ontario machines in a split bus configuration. For details on how to set up a split bus configuration, see Alex's blog entry on split bus ldom configurations with T2000s. The resulting four I/O domains (two on each Ontario system) are then configured as two different clusters. The picture below should help clarify what we are talking about.

In the configuration above, the bus pci@7c0 (bus_b) has been assigned to the "primary" domain and bus pci@780 (bus_a) has been assigned to a domain named "alternate". One point to note here is that on the Ontario, both internal disks are on bus_b, so a dual-channel fiber storage HBA card has been added to the "alternate" domain on its PCI bus (bus_a, slot 0) to provide local storage for the OS image etc. as well as access to shared storage. Note that the disk which holds the OS image needs to be fiber bootable for this to work. On bus_b, we also need another HBA card to provide access to the shared storage. That exhausts all available I/O slots, so there is no way to add network cards to the system. That means that on the "alternate" cluster, we are left with only the two onboard NICs (e1000g0 and e1000g1) to provide both public and private network connectivity. Having just a single NIC for the public network is no problem (IPMP and SC support that just fine), but with a single private interconnect you have to use the custom option of the scinstall command to install the cluster, because the standard installation option enforces a minimum of two private network interfaces. For mission critical deployments, you may want to avoid this configuration because the single interconnect link can lead to reduced availability in some scenarios.

The above configuration creates two separate 2 node clusters on two T2000 boxes. That is useful for scenarios where a mission critical application and another HA application need to be consolidated on the same hardware to save cost. It also caters to scenarios where increased isolation between applications is desired, both resource-wise and administratively; two separate clusters provide the maximum isolation.
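The layout described above can also be written down as data. This is only a sketch: the machine and cluster names are made up for illustration, and the pairing of the two "primary" domains into one cluster and the two "alternate" domains into the other is our reading of the text. The small check at the end confirms that each cluster has its nodes on both physical machines.

```python
from collections import defaultdict

# (machine, domain, PCI bus, cluster)
# Machine and cluster names are hypothetical; bus assignments are from the text.
IO_DOMAINS = [
    ("t2000-1", "primary", "pci@7c0", "cluster-primary"),
    ("t2000-1", "alternate", "pci@780", "cluster-alternate"),
    ("t2000-2", "primary", "pci@7c0", "cluster-primary"),
    ("t2000-2", "alternate", "pci@780", "cluster-alternate"),
]


def machines_per_cluster(domains):
    """Map each cluster name to the set of physical machines hosting its nodes."""
    spread = defaultdict(set)
    for machine, _domain, _bus, cluster in domains:
        spread[cluster].add(machine)
    return dict(spread)


spread = machines_per_cluster(IO_DOMAINS)
for cluster, machines in sorted(spread.items()):
    print(cluster, sorted(machines))
```

Because every cluster spans both boxes, losing one T2000 costs each cluster one node, never both.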

In the next configuration we study, we take the same two Ontario machines in split bus configuration as before. However, instead of creating two different 2 node clusters, we create a single 4 node cluster. The schematic below illustrates that configuration:

Note that the two cluster nodes which are on the "alternate" domains have only single interconnect cards. To get the cluster to install in this configuration, you may want to first install a 4 node cluster with single interconnect, then add the additional interconnect on the primary domains using clsetup. Alternatively, one can first create a 2 node cluster on the primary domains which have the two interconnect cards, then add the two alternate domain nodes to the cluster via the clnode add command after installing the SC software. For example:
clnode add -n node1 -c clustername -e node3:e1000g4,switch1

The above command, when executed on node3, adds it to the cluster by using node1 as the sponsor node and using a single network card, e1000g4, connected to switch1 as the private interconnect. Also note that while a quorum device is not strictly required for a 4 node cluster, in this situation you would definitely want to configure one so that the loss of a single physical machine does not lead to the loss of the whole cluster.
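The quorum reasoning above can be checked with a little arithmetic. The sketch below assumes the usual Solaris Cluster vote counting: each node contributes one vote, a quorum device connected to N nodes contributes N-1 votes, and the cluster needs a strict majority of all configured votes to keep running. It also assumes the quorum device is connected to all four nodes and stays reachable by the survivors. The function names are illustrative and not part of any SC tool.

```python
# Illustrative quorum arithmetic; names are hypothetical, not an SC API.

def votes_needed(total_votes: int) -> int:
    """Strict majority of the configured votes."""
    return total_votes // 2 + 1


def survives(surviving_nodes: int, total_nodes: int, quorum_device: bool) -> bool:
    """True if the surviving nodes still hold a majority of the votes.

    Assumes one vote per node and, if present, a quorum device connected
    to every node (worth total_nodes - 1 votes) that the survivors can
    still reach.
    """
    qd_votes = total_nodes - 1 if quorum_device else 0
    total = total_nodes + qd_votes
    return surviving_nodes + qd_votes >= votes_needed(total)


# Four nodes on two T2000s: losing one machine takes out two nodes.
print(survives(2, 4, quorum_device=False))  # False: 2 of 4 votes, 3 needed
print(survives(2, 4, quorum_device=True))   # True: 2 + 3 = 5 of 7 votes, 4 needed
```

Without a quorum device, the two nodes left after a machine failure hold exactly half the votes, which is not a majority, so the whole cluster goes down.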

The above 4 node configuration can be useful in scenarios where the increased isolation of two different clusters is not necessary for the applications being deployed and the ease of administering a single cluster is desired. Note that all the power of LDoms is available to you in these configurations (my favourite: dynamically move CPUs from one domain into another!), making your deployments very flexible and cost effective.

Note that the configurations displayed in this blog are for the first generation of Sun Fire T2000 systems. On newer systems, such as the just-released Sun SPARC Enterprise T5x20 systems, the device assignments are a bit different. The basic considerations on how to deploy SC on such platforms are still the same. Check out the details of the T5520 system here. Read more about what people are saying about CMT and UltraSPARC T2 technologies on Allan Packer's weblog.

Hope this was useful. Stay tuned for Solaris Cluster support for LDoms guest domains, which will allow you to cluster LDoms guest domains and thus bring the full power of SC application/data management to guest domains.

Ashutosh Tripathi - Solaris Cluster Engineering
Alexandre Chartre - LDoms Engineering

Tuesday Apr 10, 2007

Reliable Failovers, an example hits home!

Recently on a trip to Australia, I had the opportunity to go diving on the Great Barrier Reef and got a different perspective on what it means to have Reliable Failovers. Basically, while diving, you have a tube through which you breathe, and another one which is a backup. See illustration on the right. This has parallels to how Sun Cluster manages reliable failovers to achieve the greatest possible availability. Consider:

  • There needs to be a backup for all components which can fail. This is the No Single Point Of Failure philosophy which Sun Cluster follows for mission critical applications.
  • First, when the primary breathing tube fails, the detection of that failure has to be solid. You cannot be randomly switching back and forth between primary and secondary breathing tubes while 50 feet underwater. This is the Reliable Failure Detection point which Sun Cluster provides for you.
  • Next, you remove the failed tube from your mouth, but you still have to keep your airways open by continuing to exhale air out your mouth (slowly, of course!, you don't wanna bubble out the limited amount of air you have). Otherwise there is a danger that the airways can collapse. With Sun Cluster, this is roughly akin to how SC reserves and fences the shared resources (storage etc.) so that the shared resources (a la your airways) are not negatively impacted by the failed component.
  • Lastly, you can't simply shove the secondary tube into your mouth and start breathing. The intake mouth piece would contain water (and sand, depending upon where you have been) and you would start choking. So you first have to blow hard into the intake mouth piece to clear it, before you can start breathing. With Sun Cluster, this is akin to cleaning up any leftover state of the application, such as application lock files, pid files, files which were open on the cluster filesystem on the failed node, etc. The Sun Cluster core framework takes care of expunging (not by blowing hard, but similar :-)) system level state for the failed node, and the job of cleaning out leftover application state falls on the Application Agents, which remove leftover files and application state that could interfere with a reliable failover of the application.
  • Nothing like 50 feet of water on top of you and your life itself at stake to make you realize what "Reliability" means!! For a mission critical application, "50 feet of water" == "Thousands of dollars of revenue per minute of downtime" and "life itself at stake" == "Your job and your credibility as the keeper of the system at stake"!! Believe me, all those steps I outlined above don't feel so easy if you have to do them while either under 50 feet of water, or when your mission critical application is down and you need to recover it quickly and reliably.

    Hope I did not gross you out already by being such a geek!! I mean, here I am, on the trip of a lifetime, diving on the famous Great Barrier Reef (BTW: Did you know that GBR can be seen from the moon?) and all I can think about is Clustering and what it means to have reliable failovers?? Fear not, my dear friends, I shook these geeky feelings off and went on to have 2 great dives lasting about 1/2 hour each. Saw tons of exotic fish, and even a mid-sized (4-5 feet) Whitetip reef shark, which, to my untrained eyes, looked exactly like a great white. But that is another blog.

    No worries mate!


    Oracle Solaris Cluster Engineering Blog