Tuesday Apr 21, 2009

Solaris Cluster + MySQL Demo at an Exhibition Hall near you


If you happen to be at the MySQL User Conference at the Santa Clara conference center, there are some great opportunities to see MySQL and Solaris Cluster in action. We have a demo in the exhibition hall where you can see MySQL running in a zone cluster and MySQL with Sun Cluster Geographic Edition.

I will host a birds-of-a-feather session on Wednesday evening, 4/22/09, at 7 pm in meeting room 205. The title is "Configuring MySQL in Open HA Cluster, an Easy Exercise"; check out the details here. I will also give a presentation on Thursday morning, 4/23/09, at 10:50 in Ballroom G, titled "Solutions for High Availability and Disaster Recovery with MySQL". Check out the details here.

I hope to see you there in person.

-Detlef Ulherr

Availability Products Group

Friday Mar 06, 2009

Cluster Chapter in OpenSolaris Bible

In addition to my day job as an engineer on the Sun Cluster team, I spent most of my nights and weekends last year writing a tutorial and reference book on OpenSolaris. OpenSolaris Bible, as it's titled, was released by Wiley last month and is available from amazon.com and all other major booksellers. At almost 1000 pages, the book allowed my co-authors Dave, Jerry, and me to be fairly comprehensive, covering topics from the bash shell to the xVM Hypervisor and almost everything in between. You can examine the table of contents and index on the book website.

Of particular interest to readers of this blog will be Chapter 16, “Clustering OpenSolaris for High Availability.” (After working on Sun Cluster for more than 8 years, I couldn't write a book like this without a chapter on HA clusters!) Coming at the end of Part IV, “OpenSolaris Reliability, Availability, and Serviceability”, this chapter is a 70-page tutorial on using Sun Cluster / Open HA Cluster. After the requisite introduction to HA clustering, the chapter jumps in with instructions for configuring a cluster. Next, it goes through two detailed examples. The first shows how to make Apache highly available in failover mode using a ZFS failover file system. The second demonstrates how to configure Apache in scalable mode using the global file system. Following the two examples, the chapter covers the details of resources, resource types, and resource groups, shows how to use zones as logical nodes, and goes into more detail on network load balancing. After a section on writing your own agents using the SMF Proxy or the GDS, the chapter concludes with an introduction to Geographic Edition.

This chapter should be useful both as a tutorial for novices and as a reference for more advanced users. I enjoyed writing it (and even learned a thing or two in the process), and hope you find it helpful. Please don't hesitate to give me your feedback!

Nicholas Solter

Wednesday Feb 25, 2009

Disaster Recovery Protection Options for Oracle with Sun Cluster Geographic Edition 01/09

With the announcement of Solaris Cluster 3.2 01/09 comes a new version of Sun Cluster Geographic Edition (SCGE). Among the features delivered in this release is support for replication of Oracle RAC databases using Oracle Data Guard. So it seems like a good opportunity to summarise the ways you can protect your Oracle infrastructure against disaster using the replication support provided by Sun Cluster Geographic Edition.

I'll start by breaking the discussion into two halves: the first covers deployments using highly available (HA) Oracle, the second deployments using Oracle Real Application Clusters (RAC). Additionally, let me reiterate the replication technologies that SCGE supports, namely: EMC Symmetrix Remote Data Facility (SRDF), Hitachi TrueCopy, Sun StorageTek Availability Suite (AVS) and, last but not least, Oracle Data Guard (ODG). One final point to make is that SCGE support for SRDF/A is limited to takeover operations.

HA Oracle Deployments

HA-Oracle deployments are found in environments where a cost/benefit analysis shows that you are prepared to accept the longer outages involved in switching over or failing over an Oracle database, rather than pay the additional licensing costs of Oracle RAC in return for the near-continuous service it can offer.

Deployments of HA-Oracle can be on a file system, UFS or VxFS (stay tuned for ZFS support), or on raw disk with or without a volume manager: Solaris Volume Manager (SVM) or Veritas Volume Manager (VxVM). Why not Oracle Automatic Storage Management (ASM), you might ask? Well, while ASM is indeed supported on Oracle RAC, it poses problems when employed in a failover environment: either the ASM instance or just the disk groups used by the dependent databases would need to be failed over. These requirements currently preclude ASM from being supportable. Are we working on this? Of course we are!

So this gives us a set of storage deployment options that must be married with the replication options that SCGE supports, together with any restrictions that may come into play when deploying HA-Oracle in a Solaris Container (a.k.a. zone).

Coverage is extensive: Oracle 9i, 10g and 11g are supported on file systems (UFS or VxFS) or raw devices, with or without containers and with or without VxVM, using either AVS, SRDF or TrueCopy. In contrast, using SVM restricts the replication technology to AVS only.

Why isn't Oracle Data Guard supported here, especially given that it's one of the new replication modules in SCGE 3.2 01/09? The answer lies in the use of the Oracle Data Guard broker as the interface for controlling replication. Unfortunately, the ODG broker stores a physical host name in its configuration files, and after a failover that name doesn't match the new host, which invalidates the configuration. Consequently, Oracle does not support the ODG broker on 'cold failover' database implementations, even where this host name change could be avoided, say by putting the database into a Solaris Container.
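To make the HA-Oracle combinations above easier to check, here is a small Python sketch that encodes only the rules stated in this post: Oracle 9i, 10g or 11g; UFS, VxFS or raw devices; no volume manager, VxVM or SVM; AVS, SRDF or TrueCopy; SVM implies AVS; no ASM and no ODG. The function name and string labels are hypothetical, and the table reflects SCGE 3.2 01/09 as described here, not an official support matrix.

```python
# Illustrative decision table for HA-Oracle + SCGE 3.2 01/09, built only from
# the rules stated in this post. Not an official support matrix.

HA_ORACLE_VERSIONS = {"9i", "10g", "11g"}
HA_STORAGE = {"ufs", "vxfs", "raw"}           # ZFS and ASM are not (yet) supported
HA_VOLUME_MANAGERS = {None, "vxvm", "svm"}    # None = no volume manager
HA_REPLICATION = {"avs", "srdf", "truecopy"}  # ODG is not supported for HA-Oracle


def ha_oracle_supported(version, storage, volume_manager, replication):
    """Return True if the combination satisfies the rules described above.

    Running the database in a Solaris Container does not change the
    answer, so containers are not modelled as a parameter.
    """
    if version not in HA_ORACLE_VERSIONS:
        return False
    if storage not in HA_STORAGE or volume_manager not in HA_VOLUME_MANAGERS:
        return False
    if replication not in HA_REPLICATION:
        return False
    # Using SVM restricts the replication technology to AVS only.
    if volume_manager == "svm" and replication != "avs":
        return False
    return True


print(ha_oracle_supported("10g", "ufs", "vxvm", "srdf"))     # True
print(ha_oracle_supported("11g", "raw", "svm", "truecopy"))  # False: SVM means AVS only
print(ha_oracle_supported("10g", "asm", None, "avs"))        # False: no ASM for HA-Oracle
print(ha_oracle_supported("11g", "vxfs", None, "odg"))       # False: ODG broker issue
```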

Oracle RAC Deployments

With HA-Oracle options covered I'll now turn to Oracle RAC. As you will no doubt know from reading "Solaris Cluster 3.2 Software: Making Oracle Database 10G R2 and 11G RAC Even More Unbreakable", Solaris Cluster brings a number of additional benefits to Oracle RAC deployments including support for shared QFS as a means of storing the Oracle data files. So now you'll need to know what deployment options exist when you include SCGE in your architecture.

As we're still working on adding Solaris Container Cluster support to SCGE, there is currently no support for Oracle RAC, or indeed any other data service, using this virtualisation technique.

Furthermore, I should remind you that AVS is not an option for any Oracle RAC deployment simply because it cannot intercept the writes coming from more than one node simultaneously.

On the positive side, storage replication products such as SRDF and TrueCopy are an option, as they intercept writes at the storage array level rather than at the kernel level. These replication technologies are restricted to configurations using raw disk on hardware RAID or raw VxVM/CVM volumes. These storage options can then be used with Oracle 9i, 10g or 11g RAC. For a write-up of just such a configuration, please read EMC's white paper on our joint demonstration at Oracle Open World 2008.

Configurations that use shared QFS or ASM are currently precluded because of the additional steps that must be performed before an SCGE switchover or takeover can be effected. Are we looking to address this? Absolutely!

If you want an unfettered choice of storage options on Solaris Cluster when replicating Oracle 10g and 11g RAC data, then the new Oracle Data Guard module for SCGE is the answer. You are free to choose whatever combination of raw disk, ASM and shared QFS deployments makes sense to you. You can configure a physical standby partner in either single-instance to single-instance or dual-instance to single-instance combinations, i.e. primary site to standby site configurations. All ODG replication modes are supported: maximum performance, maximum availability and maximum protection. Although SCGE can control logical standby configurations, Sun have not yet announced formal support for this feature.
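The RAC rules can be summarised the same way. This Python sketch encodes only what this post states for SCGE 3.2 01/09: no zone clusters, no AVS, SRDF or TrueCopy only on raw disk on hardware RAID or raw VxVM/CVM volumes (9i, 10g, 11g), and ODG for 10g and 11g RAC on raw disk, ASM or shared QFS. The function name and labels are hypothetical, and this is not an official support matrix.

```python
# Illustrative decision table for Oracle RAC + SCGE 3.2 01/09, built only from
# the rules stated in this post. Not an official support matrix.

ARRAY_REPLICATION = {"srdf", "truecopy"}
ARRAY_STORAGE = {"raw_hw_raid", "raw_cvm"}    # raw disk on HW RAID, raw VxVM/CVM volumes
ODG_STORAGE = {"raw_hw_raid", "raw_cvm", "asm", "shared_qfs"}


def rac_supported(version, storage, replication, zone_cluster=False):
    """Return True if the RAC combination satisfies the rules described above."""
    if zone_cluster:
        return False  # SCGE does not yet support zone clusters
    if replication == "avs":
        return False  # AVS cannot intercept writes from several nodes at once
    if replication in ARRAY_REPLICATION:
        # Shared QFS and ASM are precluded for storage-based replication.
        return version in {"9i", "10g", "11g"} and storage in ARRAY_STORAGE
    if replication == "odg":
        # The Oracle Data Guard module covers 10g and 11g RAC on any of these.
        return version in {"10g", "11g"} and storage in ODG_STORAGE
    return False


print(rac_supported("9i", "raw_cvm", "srdf"))          # True
print(rac_supported("10g", "shared_qfs", "truecopy"))  # False
print(rac_supported("11g", "asm", "odg"))              # True
```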

You've Read The Blog, Now See The Movie....

I hope that gives you a clear picture of how you can use a combination of Solaris Cluster and the various replication technologies that Sun Cluster Geographic Edition supports to create disaster recovery solutions for your Oracle databases. If you would like to see demonstrations of some of these capabilities, please watch the video of an ODG setup and an SRDF configuration on Sun Learning Exchange.

Tim Read
Staff Engineer
Solaris Cluster Engineering

Monday Feb 09, 2009

Software Quorum

This blog post describes a new feature called software quorum, which was introduced in Solaris Cluster 3.2 1/09.

Let us start with a simple explanation of the concept of quorum.

A cluster can potentially break up into multiple partitions due to the complete failure of all paths across the private interconnects for any pair of cluster nodes. In such scenarios, it is imperative for the cluster software to ensure that only one of those partitions survives as the running cluster, and all other partitions kill themselves. This is essential to ensure data integrity on storage shared between cluster nodes, and to ensure that only one set of cluster nodes functions as the cluster hosting services and applications.

How do we decide which partition survives? There is a concept of 'quorum', quite similar to the usual meaning of the word. Each cluster node is given a vote. If the sum of the votes given to cluster nodes is V, a cluster partition can survive as a running cluster only when the sum of the votes of the nodes present in the partition is more than half of the total votes V; in other words, it needs at least (1 + V/2) votes, with V/2 rounded down.
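The majority rule fits in a one-line function. This Python sketch is illustrative only (the function names are hypothetical); note that with V = 7 it gives 4, matching the Needed column in the `clquorum status` output later in this post. It also shows why ties are a problem: in a 2-node cluster, a 1-vote partition cannot survive on its own.

```python
def votes_needed(total_votes: int) -> int:
    """Votes a partition needs to survive: one more than half of V (V/2 rounded down)."""
    return total_votes // 2 + 1


def partition_survives(partition_votes: int, total_votes: int) -> bool:
    return partition_votes >= votes_needed(total_votes)


for v in (2, 3, 4, 7):
    print(v, votes_needed(v))
# 2 2
# 3 2
# 4 3
# 7 4

# A 2-node cluster split into two 1-vote partitions: neither side survives.
print(partition_survives(1, 2))  # False
```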

What happens if two partitions have the same number of votes? Consider a simple example. When network communication fails, each node of a 2-node cluster wants to survive as an independent cluster. We could decide that neither partition forms a cluster, but we want high availability as well. So we have the concept of a quorum device (think of it as a 'token' for simplicity). In such an equal-partitions case, the partition that holds the token forms the surviving cluster. So the idea is: all partitions 'race' to get the quorum device ('token'), but only one partition can win the race and survive. The other partition simply commits suicide. That is the basic, simplified idea of a quorum device. There are many complications in practice, but let's leave it at that simple explanation.
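Here is a toy Python model of the race. It assumes, consistently with the d4 example later in this post where a device shared by 4 nodes carries 3 votes, that a quorum device connected to N nodes carries N - 1 votes. The class and function names are hypothetical, and the real arbitration involves the many complications mentioned above.

```python
from typing import List, Optional, Tuple


class QuorumDevice:
    """Toy 'token' that only one partition can win."""

    def __init__(self, connected_nodes: int) -> None:
        # Assumption: a device connected to N nodes carries N - 1 votes.
        self.votes = connected_nodes - 1
        self.owner: Optional[str] = None

    def try_acquire(self, partition: str) -> bool:
        if self.owner is None:
            self.owner = partition  # first arrival wins the race
        return self.owner == partition


def resolve_split(partitions: List[Tuple[str, int]], device: QuorumDevice) -> List[str]:
    """partitions: (name, node_votes) in the order they reach the device."""
    total = sum(votes for _, votes in partitions) + device.votes
    needed = total // 2 + 1
    survivors = []
    for name, node_votes in partitions:
        votes = node_votes + (device.votes if device.try_acquire(name) else 0)
        if votes >= needed:
            survivors.append(name)
        # otherwise this partition "commits suicide"
    return survivors


# 2-node cluster: each node has 1 vote, the shared device has 1 vote (total 3).
print(resolve_split([("node1", 1), ("node2", 1)], QuorumDevice(connected_nodes=2)))
# ['node1']
```

With the device's extra vote, the total becomes odd, so a tie between the node partitions can no longer leave both sides short or let both survive.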

Now what do we use as a quorum device? Note that the term 'device' does not mean a disk necessarily. There are various entities that could constitute a quorum device. Apart from shared disks, Solaris Cluster can use NAS units as quorum devices. An external machine can be configured to run a quorum server for Solaris Cluster, and the quorum server will serve the purpose of a quorum device for the cluster. But we will not discuss those as part of this blog post.

Solaris Cluster can use a shared disk as a quorum device. But there is a caveat. Traditionally, Solaris Cluster could not use just any shared disk as a quorum device. For the quorum device acquisition 'race', Solaris Cluster uses SCSI reservation protocols on the shared disk. Over the years, we have encountered multiple cases where a device did not correctly support the SCSI reservation protocols. These configurations need an alternative mechanism to support their shared disks as quorum devices. Moreover, some disks today do not support SCSI reservations at all, which is another reason for having an alternative mechanism to support a shared disk as a quorum device.

The software quorum feature, introduced in Solaris Cluster 3.2 1/09, addresses this need. The Solaris Cluster software quorum protocol completely supports all aspects of quorum device behavior for a shared disk, and does so entirely in software without any SCSI reservation-related operations.

So how does software quorum work? Solaris Cluster has exclusive use of 65 sectors on the disk, set aside for it by the Solaris disk driver. Traditionally, these sectors have been used to implement what is called Persistent Group Reservation emulation in Solaris Cluster software (more on that in a separate post). The software quorum subsystem reuses this reserved space on shared disks to store quorum-related information (lock data, registration keys and reservation keys) and performs block reads and writes to access this information. Using these reserved sectors, software quorum essentially implements the Persistent Group Reservation (PGR) emulation algorithm in software, without using SCSI reservation operations for shared-disk access control. During cluster reconfiguration, different cluster nodes (possibly in different partitions) attempt to access (read, write, remove) registration and reservation keys on the quorum device in a distributed fashion, in their 'race' to acquire the quorum device. At any time, only one node can manipulate (write, preempt) the keys on the quorum device; software quorum ensures proper mutual exclusion for concurrent access (read, write) to the reservation data.
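The register/preempt race can be illustrated with a hypothetical in-memory model. This Python sketch is not the on-disk format or algorithm used by Solaris Cluster: the real feature stores its keys in the reserved sectors and uses its own block-level mutual exclusion, while here a `threading.Lock` stands in for that protocol. All class and method names are assumptions made for illustration.

```python
import threading
from typing import Optional, Set


class EmulatedPGRArea:
    """Hypothetical in-memory stand-in for the reserved quorum sectors.

    The real feature keeps lock data, registration keys and the reservation
    key in reserved disk sectors and uses block reads/writes; here a
    threading.Lock plays the role of its mutual-exclusion protocol.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._registrations: Set[str] = set()
        self._reservation: Optional[str] = None

    def register(self, key: str) -> None:
        with self._lock:
            self._registrations.add(key)

    def read_keys(self) -> Set[str]:
        with self._lock:
            return set(self._registrations)

    def reservation(self) -> Optional[str]:
        with self._lock:
            return self._reservation

    def preempt(self, my_key: str, victim_keys: Set[str]) -> bool:
        """Remove the other partition's keys and take the reservation.

        Fails if a faster competitor has already removed my_key.
        """
        with self._lock:  # only one node manipulates keys at a time
            if my_key not in self._registrations:
                return False
            self._registrations -= victim_keys
            self._reservation = my_key
            return True


def race(area: EmulatedPGRArea, me: str, other: str, results: dict) -> None:
    results[me] = area.preempt(me, {other})


area = EmulatedPGRArea()
for key in ("node1", "node2"):
    area.register(key)

results: dict = {}
threads = [
    threading.Thread(target=race, args=(area, "node1", "node2", results)),
    threading.Thread(target=race, args=(area, "node2", "node1", results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

winner = area.reservation()
print(sum(results.values()), winner in ("node1", "node2"))  # 1 True
```

Whichever node preempts first removes the other's key, so the loser finds its own registration gone and backs off; exactly one node ends up holding the reservation.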

To summarize this concept in a few words: Any Sun-supported disk that is connected to multiple nodes in a cluster can serve as a software-quorum disk. To highlight the advantages further, notice that the disk you choose need not support SCSI reservation protocols - it might well be a SATA disk. So if you do not have a shared disk that supports SCSI reservation protocols and you do not have an external machine to use as a quorum server, then you now have the flexibility to choose any Sun-supported shared disk.

To use a shared disk as a software-quorum disk, SCSI fencing should be disabled on that disk. The Solaris Cluster device-management framework uses SCSI fencing to control access to the shared disks from cluster nodes and external machines. There are certain configurations where users want to turn off fencing for some shared disks. A new feature called optional fencing, which comes with Solaris Cluster 3.2 1/09, provides users with a simple method to achieve this flexibility. We will look at the optional fencing feature in another blog post. For the purpose of software quorum, let's simply assume that optional fencing provides a simple command (as shown below in the example) to turn off SCSI fencing for the shared disk to be used as a software-quorum disk.

Now let's see an example of how to configure such a software-quorum disk. Let us assume that d4 is the shared disk that is connected to multiple cluster nodes, and you want to use d4 as a software-quorum disk.

(1) First disable SCSI fencing for the shared disk, from one cluster node:

schost1$ cldevice set -p default_fencing=nofencing d4
Updating shared devices on node 1
Updating shared devices on node 2
Updating shared devices on node 3
Updating shared devices on node 4

Let's check whether that worked fine:

schost1$ cldevice show d4
=== DID Device Instances ===

DID Device Name:                                /dev/did/rdsk/d4
 Full Device Path:                                schost1:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost2:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost3:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Full Device Path:                                schost4:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
 Replication:                                     none
 default_fencing:                                 nofencing

Notice that the 'default_fencing' parameter above for d4 is set to 'nofencing', which means SCSI fencing is disabled for d4.

(2) Now, from one cluster node, let's add d4 as a quorum device. The cluster software automatically detects that d4 is set to 'nofencing' and therefore uses the software quorum protocol for d4.

schost1$ clquorum add d4

Let's check the result:

schost1$ clquorum show d4
=== Quorum Devices ===

Quorum Device Name:                             d4
 Enabled:                                         yes
 Votes:                                           3
 Global Name:                                     /dev/did/rdsk/d4s2
 Type:                                            shared_disk
 Access Mode:                                     sq_disk
 Hosts (enabled):                                 schost1, schost2, schost3, schost4

Notice that the 'Access Mode' parameter above for d4 is set to 'sq_disk', which is a short form for 'software-quorum disk'. Let's check the quorum status:

schost1$ clquorum status
=== Cluster Quorum ===
--- Quorum Votes Summary ---
           Needed   Present   Possible
           ------   -------   --------
           4        7         7
--- Quorum Votes by Node ---
Node Name       Present       Possible       Status
---------       -------       --------       ------
schost1         1             1              Online
schost2         1             1              Online
schost3         1             1              Online
schost4         1             1              Online
--- Quorum Votes by Device ---
Device Name       Present      Possible      Status
-----------       -------      --------      ------
d4                3            3             Online

That's it! You have configured d4 as a software-quorum disk; it functions as any other traditional quorum device.
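To check the transcripts programmatically, here is a small Python sketch that parses the `cldevice show` and `clquorum status` text shown above and verifies the vote arithmetic. Possible should equal node votes plus device votes (4 + 3 = 7), Needed should equal one more than half of Possible, and Present should be at least Needed. The parser relies solely on the output format in this post, which is an assumption; on a live node you might feed it the `stdout` of `subprocess.run(["clquorum", "status"], capture_output=True, text=True)` instead of the sample strings.

```python
from typing import Dict, Optional

# Sample text copied from the transcripts above; real output may differ by release.
CLDEVICE_SHOW = """\
=== DID Device Instances ===

DID Device Name:                                /dev/did/rdsk/d4
  Full Device Path:                                schost1:/dev/rdsk/c4t600C0FF00000000009280F30888FB604d0
  Replication:                                     none
  default_fencing:                                 nofencing
"""

CLQUORUM_STATUS = """\
=== Cluster Quorum ===
--- Quorum Votes Summary ---
           Needed   Present   Possible
           ------   -------   --------
           4        7         7
--- Quorum Votes by Node ---
Node Name       Present       Possible       Status
---------       -------       --------       ------
schost1         1             1              Online
schost2         1             1              Online
schost3         1             1              Online
schost4         1             1              Online
--- Quorum Votes by Device ---
Device Name       Present      Possible      Status
-----------       -------      --------      ------
d4                3            3             Online
"""


def default_fencing(show_output: str) -> Optional[str]:
    """Return the default_fencing value from `cldevice show` text, if present."""
    for line in show_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "default_fencing":
            return value.strip()
    return None


def quorum_summary(status_output: str) -> Dict[str, int]:
    """Parse `clquorum status` text and total the per-node and per-device votes."""
    section = ""
    result = {"needed": -1, "present": -1, "possible": -1, "node_votes": 0, "device_votes": 0}
    for line in status_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("--- ") and stripped.endswith(" ---"):
            section = stripped.strip("- ")  # e.g. "Quorum Votes by Node"
            continue
        tokens = stripped.split()
        if section == "Quorum Votes Summary" and len(tokens) == 3 and all(t.isdigit() for t in tokens):
            result["needed"], result["present"], result["possible"] = map(int, tokens)
        elif len(tokens) == 4 and tokens[1].isdigit() and tokens[2].isdigit():
            if section == "Quorum Votes by Node":
                result["node_votes"] += int(tokens[2])
            elif section == "Quorum Votes by Device":
                result["device_votes"] += int(tokens[2])
    return result


s = quorum_summary(CLQUORUM_STATUS)
print(default_fencing(CLDEVICE_SHOW))                        # nofencing
print(s["possible"] == s["node_votes"] + s["device_votes"])  # True: 4 + 3 == 7
print(s["needed"] == s["possible"] // 2 + 1)                 # True: 7 // 2 + 1 == 4
print(s["present"] >= s["needed"])                           # True: the cluster has quorum
```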

So, don't ignore non-SCSI disks anymore - you can use them as software-quorum disks.

Sambit Nayak

Solaris Cluster Engineering

Wednesday Feb 04, 2009

Zone Clusters

The Solaris(TM) Cluster 3.2 update 2 release, also called Sun Cluster 3.2 1/09, introduces a new feature called Zone Clusters, also known as Solaris Containers Clusters, and this blog introduces the reader to Zone Clusters. Here you will find an overview that defines a Zone Cluster and identifies some important reasons why you would want to use one. Blogs should be short and concise, so this will be the introductory blog. I plan to provide a series of blogs, each covering one important aspect of Zone Clusters. Subsequent blogs will cover the major use cases, a comparison of Zone Clusters with other zone solutions, and explanations of various aspects of the technologies that support a Zone Cluster.

Now let’s begin by defining the feature.

A Zone Cluster is a virtual cluster, where each virtual node is a non-global zone.

Thus we are entering a world where a set of machines (where a machine is anything that can host an operating system image) can now support multiple clusters. Prior to this feature, there was exactly one cluster, and we did not have a unique name for that kind of cluster. In the original cluster type, all of the global zones are voting member nodes, which led us to name that kind of cluster the Global Cluster. Starting with SC3.2 1/09 (also called update 2), there will always be exactly one Global Cluster on a set of machines that Sun Cluster software supports.

The same set of machines can optionally also support an arbitrary number of concurrent Zone Clusters. The number of Zone Clusters is limited by the amount of CPUs, memory, and other resources needed to support the applications in the Zone Clusters. Exactly one Solaris operating system instance and exactly one Sun Cluster instance support the one Global Cluster and all Zone Clusters. A Zone Cluster cannot be up unless the Global Cluster is also up. The Global Cluster does not contain the Zone Clusters. Each cluster has its own private name spaces for a variety of purposes, including application management.

A Zone Cluster appears to applications as a cluster dedicated for those applications. This same principle applies to administrators logged in to a Zone Cluster.

The Zone Cluster design takes a minimalist approach to what is present: items that are not directly used by the applications running in a Zone Cluster are not available in that Zone Cluster.

A typical application A stores data in a file system F. The application needs a network resource N (authorized IP address and NIC combination) to communicate with clients. The Zone Cluster would contain just the application A, file system F, and network resource N. Normally, the storage device for the file system would not be present in that Zone Cluster.

Many people familiar with the Global Cluster will remember that the Global Cluster has other things, such as a quorum device. The Zone Cluster applications do not directly use the quorum device, so there is no quorum device in the Zone Cluster. When dealing with the Zone Cluster, the administrator can ignore quorum devices and other things that exist only in the Global Cluster.

The Zone Cluster design results in a much simpler cluster that greatly reduces administrative support costs.

A Zone Cluster provides the following major features:

  • Application Fault Isolation – A problem with an application in one Zone Cluster does not affect applications in other Zone Clusters. Operations that might crash an entire machine are generally disallowed in a Zone Cluster, and some operations have been made safe. For example, a reboot operation in a Zone Cluster becomes a zone reboot. So even an action that boots or halts one Zone Cluster will not affect another Zone Cluster.

  • Security Isolation – An application in one Zone Cluster cannot see and cannot affect resources not explicitly configured to be present in that specific Zone Cluster. A resource only appears in a Zone Cluster when the administrator explicitly configures that resource to be in that Zone Cluster.

  • Resource Management – The Solaris Resource Management facilities can operate at the granularity of the zone. We have made it possible to manage resources across the entire Zone Cluster. All of the facilities of Solaris Resource Management can be applied to a Zone Cluster, including controls on CPUs, memory, etc. This enables the administrator to manage Quality of Service and to control application license fees that are based on CPU count.
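The minimal-contents and security-isolation rules can be pictured with a toy Python model of the application A / file system F / network resource N example above. It is not how Solaris Cluster implements isolation, which relies on Solaris zones; the class, the second zone cluster, and all resource names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ZoneCluster:
    """Toy model: a zone cluster sees only what the administrator configured into it."""

    name: str
    configured: Set[str] = field(default_factory=set)

    def add(self, resource: str) -> None:
        self.configured.add(resource)

    def can_see(self, resource: str) -> bool:
        return resource in self.configured


# Resources that exist only in the Global Cluster (hypothetical names).
GLOBAL_ONLY = {"quorum_device", "storage_device_for_F"}

zc_app = ZoneCluster("zc-app")
for resource in ("application_A", "file_system_F", "network_resource_N"):
    zc_app.add(resource)

zc_test = ZoneCluster("zc-test", {"application_B"})

print(zc_app.can_see("file_system_F"))              # True
print(any(zc_app.can_see(r) for r in GLOBAL_ONLY))  # False
print(zc_test.can_see("application_A"))             # False
```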

We recognize that administrators are overworked. So we designed Zone Clusters to reduce the amount of work that administrators must do. We provide a single command that can create/modify/destroy an entire Zone Cluster from any machine. This eliminates the need for the administrator to go to each machine to create the various zones.

Since a Zone Cluster is created after the creation of the Global Cluster, we use knowledge of the Global Cluster to further reduce administrative work. At this point we already know the configuration of the cluster private interconnect, and thus can automate the private interconnect set up for a Zone Cluster. We can specify reasonable default values for a wide range of parameters. For example, a Zone Cluster usually runs with the same time zone as the Global Cluster.

Once you have installed Sun Cluster 3.2 1/09 on Solaris 10 5/08 (also called update 5) or later release, the Zone Cluster feature is ready to use. There is no need to install additional software. The Zone Cluster feature is maintained by the regular patches and updates for the Sun Cluster product.

So a Zone Cluster is a truly simplified cluster.

Now, let’s talk at a high level about why you would use a Zone Cluster.

Many organizations run multiple applications or multiple data bases. It has been common practice to place each application or data base on its own hardware. Figure 1 shows an example of three data bases running on different clusters.

Moore’s Law continues to apply to computers, and the industry continues to produce ever more powerful computers. The trend towards ever more powerful processors has been accompanied by increases in storage capacity, network bandwidth, etc. Along with greater power has come improved price/performance ratios. Over time, application processing demands have grown, but in many cases the application processing demands have grown at a much slower rate than that of the processing capacity of the system. The result is that many clusters now have considerable surplus processing capacity in all areas: processor, storage, and networking.

Such large amounts of idle processing capacity present an almost irresistible opportunity for better system utilization. Organizations seek ways to reclaim this unused capacity. Thus, they are choosing to host multiple cluster applications on a single cluster. However, concerns about interactions between cluster applications, especially in the areas of security and resource management, make people wary. Zone Clusters provide safe ways to host multiple cluster applications on a single cluster hardware configuration. Figure 2 shows the same data bases from the previous example now consolidated onto one set of cluster hardware using three Zone Clusters.

Zone Clusters can support a variety of use cases:

  • Data Base Consolidation – You can run separate data bases in separate Zone Clusters. We have run Oracle RAC 9i, RAC 10g, and RAC 11g in separate Zone Clusters on the same hardware concurrently.

  • Functional Consolidation – Test and development activities can occur concurrently while also being independent.

  • Multiple Application Consolidation – Zone Clusters can support applications generally. So you can run both data bases and also applications that work with data bases in the same or separate Zone Clusters. We will be announcing certification of other applications in Zone Clusters in the coming months.

  • License Fee Cost Containment – Resource controls can be used to control costs. There are many use cases where someone can save many tens of thousands of dollars per year. The savings are heavily dependent upon the use case.

    Here is an arbitrary example: the cluster runs two applications, where each application needs half of the CPU resources. The two applications come from different vendors, each of which charges a license fee where Total_Charge = Number_CPUs * Per_CPU_Charge. The administrator places each application in its own Zone Cluster with half of the CPUs. This halves the number of CPUs available to, and therefore licensed for, each application. The result is that the administrator has reduced the total charge by 50%.
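The arithmetic behind the 50% figure can be checked with a short Python sketch. The 8-CPU cluster size and the per-CPU prices are hypothetical. Halving each application's CPU count halves each vendor's charge, so the total drops by 50% whatever the individual prices are. Real license terms vary by vendor and may not count CPUs this way.

```python
def total_charge(number_cpus: int, per_cpu_charge: int) -> int:
    # Total_Charge = Number_CPUs * Per_CPU_Charge, as in the example above.
    return number_cpus * per_cpu_charge


CLUSTER_CPUS = 8                                           # hypothetical
PER_CPU_CHARGE = {"vendor_1": 10_000, "vendor_2": 25_000}  # hypothetical prices

# Without zone clusters, each application can run on (and is licensed for) every CPU.
before = sum(total_charge(CLUSTER_CPUS, p) for p in PER_CPU_CHARGE.values())
# With one zone cluster per application, each capped at half of the CPUs.
after = sum(total_charge(CLUSTER_CPUS // 2, p) for p in PER_CPU_CHARGE.values())

savings = 1 - after / before
print(before, after, f"{savings:.0%}")  # 280000 140000 50%
```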

In future blogs, I plan to explain how to take full advantage of Zone Clusters in these various use cases.

Please refer to this video blog, which provides a longer, more detailed explanation of Zone Clusters.

Dr. Ellard Roush

Technical Lead Solaris Cluster Infrastructure


Oracle Solaris Cluster Engineering Blog

