

How does Sun Cluster decide on which node a service runs?

Guest Author
A customer asked us how Sun Cluster decides where to bring a resource group online.
The selection of a primary node for a resource group is determined first of all by the Nodelist property that is configured for the group. The Nodelist specifies a list of nodes or zones where the resource group can be brought online, in order of preference. These nodes or zones are known as the potential primaries (or masters) of the resource group.
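For example, here is a minimal sketch (the node and group names are hypothetical) showing how the Nodelist is established when a resource group is created, and how its preference order can be changed later:
# Hypothetical names: node1 is the preferred primary for ora-rg, node2 the backup
clrg create -n node1,node2 ora-rg
# Change the preference order of an existing group
clrg set -p Nodelist=node2,node1 ora-rg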
The nodelist preference order is modified by the following factors:
- "Ping pong" prevention: If the resource group has recently failed to start on a given node or zone, that node or zone is given lower priority than one on which it has not yet failed.
- Resource group affinities: This is a configurable resource group property that indicates a positive or negative affinity of one group for another. The node selection algorithm always satisfies strong affinities, and makes a best-effort to satisfy weak affinities.
The resource group affinities property (RG_affinities) was introduced in Sun Cluster 3.1 9/04. This property allows you to express that the RGM should do one of the following:
1. Attempt to locate the resource group on a node that is a current master of another group, referred to as positive affinity.
2. Attempt to locate the resource group on a node that is not a current master of a given group, referred to as negative affinity.
Resource group affinities come in five flavors:
+, or weak positive affinity
++, or strong positive affinity
+++, or strong positive affinity with failover delegation
-, or weak negative affinity
--, or strong negative affinity
At this point you may well be wondering, what are affinities used for, and how do they work? To answer that, here are a few examples:
Example 1: Enforcing collocation of a resource group with another resource group
Suppose that our cluster is running an Oracle database server controlled by a failover resource group, ora-rg. We also have an application in resource group dbmeasure-rg, whose job it is to measure and log the performance of the database. The dbmeasure application, if it runs, must run on the same node as the Oracle server. However, the measurement application is not mandatory, and Oracle can run fine without it.
We can force dbmeasure to run only on a node where Oracle is running, by declaring a strong positive affinity:
clrg set -p RG_affinities=++ora-rg dbmeasure-rg
When we initially switch ora-rg online, dbmeasure-rg will automatically come online on the same node. If ora-rg fails over to a different node or zone, then dbmeasure-rg will follow it automatically. While ora-rg remains online we can switch dbmeasure-rg offline; however, dbmeasure-rg cannot switch over or fail over onto any node where ora-rg is *not* running.
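For example, as a sketch (the node name is hypothetical), manually switching ora-rg causes dbmeasure-rg to move with it:
# Hypothetical node name; dbmeasure-rg follows ora-rg because of the strong positive affinity
clrg switch -n node2 ora-rg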
Note: Besides the RG_affinities setting, we may also configure a dependency of the dbmeasure resource upon the Oracle server resource. This ensures that the dbmeasure resource does not get started until the Oracle server resource is online. Resource group affinities are enforced independently of resource dependencies: while resource dependencies control the order in which resources are started and stopped, RG_affinities control the _locations_ where resource groups are brought online across multiple nodes or zones of a cluster.
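As a sketch, assuming hypothetical resource names ora-server-rs and dbmeasure-rs, such a dependency could be declared as follows:
# Hypothetical resource names: dbmeasure-rs starts only after ora-server-rs is online
clrs set -p Resource_dependencies=ora-server-rs dbmeasure-rs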
But suppose that dbmeasure is a more critical application, and it is important to keep it up and running. In that case, we might want to allow dbmeasure-rg itself to initiate a failover onto a different node, dragging ora-rg along with it. To accomplish this, we use the strong positive affinity with delegated failover:
clrg set -p RG_affinities=+++ora-rg dbmeasure-rg
Example 2: Specifying a preferred collocation of a resource group with another resource group
Assume again a cluster running our Oracle database resource group, ora-rg. On the same cluster, we are running a customer service application that uses the database; this application is configured in a separate failover resource group, app-rg. The application and the database _can_ run on two different nodes, but perhaps we have discovered that the application is database-intensive and runs faster if it is hosted on the same node as the database. Therefore, we prefer to start the application on the same node as the database.
However, it might also be the case that we want to avoid switching the application from one node to another, even if the database changes nodes. To avoid breaking client connections or for some other reason, we would rather keep the application on its current master, even if it incurs some performance penalty.
To achieve these semantics, we give app-rg a weak positive affinity for ora-rg:
clrg set -p RG_affinities=+ora-rg app-rg
With this affinity, the RGM will start app-rg on the same node as ora-rg when possible, but will not force it to always run on the same node.
Example 3: Balancing the load of a set of resource groups
Now suppose that we have a cluster that is hosting three independent applications in resource groups app1-rg, app2-rg, and app3-rg. By giving each resource group a weak negative affinity for the other two groups, we can achieve a rudimentary form of load balancing on our cluster:
clrg set -p RG_affinities=-app2-rg,-app3-rg app1-rg
clrg set -p RG_affinities=-app1-rg,-app3-rg app2-rg
clrg set -p RG_affinities=-app1-rg,-app2-rg app3-rg
With these settings, the RGM will try to bring each resource group online on a node that is not currently hosting either of the other two groups. If there are three or more nodes available, this will place each resource group onto its own node. If there are fewer than three nodes available, then the RGM will "double-up" or "triple-up" the resource groups onto the available node(s). Conceptually, the resource group with weak negative affinity is trying to stay away from the other group, sort of like electrostatic charges that repel one another.
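As a quick sketch, you could bring the three groups online and then check where the RGM placed them:
# Bring the groups under RGM management and online, then check their current masters
clrg online -M app1-rg app2-rg app3-rg
clrg status app1-rg app2-rg app3-rg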
Example 4: Specifying that a critical service has precedence
In this example, a critical service -- let's say it's our Oracle database in the ora-rg resource group -- is sharing the cluster with a non-critical service, for example, a prototype of a newer version of our software which is undergoing testing and development. Supposing that we have a two-node cluster, we want ora-rg to start on one node and test-rg to start on the other node. Suppose that the first node, hosting ora-rg, dies, causing ora-rg to fail over to the second node. In that case, we want the non-critical service in test-rg to go offline on that node.
To accomplish this behavior, we give test-rg a strong negative affinity for ora-rg:
clrg set -p RG_affinities=--ora-rg test-rg
When the first node dies and ora-rg fails over to the second node where test-rg is currently running, test-rg will get "bumped off" of the second node and will remain offline (assuming a two-node cluster). When the first node reboots, it takes on the role of backup node, and test-rg is automatically started on it.
Example 5: Combining different flavors of RG_affinities to achieve more complex behavior
In the Sun Cluster HA for SAP Replicated Enqueue Service, we configure the enqueue server in one resource group enq-rg, and the replica server in a second resource group repl-rg.
A requirement of this data service is that the enqueue server, if it fails on the node where it is currently running, must fail over to the node where the replica server is running. The replica server needs to move immediately to a different node. Setting a weak positive affinity from the enqueue server resource group to the replica server resource group ensures the enqueue server resource group will fail over to the node where the replica server is currently running:
clrg set -p RG_affinities=+repl-rg enq-rg
Setting a strong negative affinity from the replica server resource group to the enqueue server resource group ensures the replica server resource group is offloaded from the replica server node, before the enqueue server resource group is brought online on the same node:
clrg set -p RG_affinities=--enq-rg repl-rg
The replica server resource group will be started up on another node if one is available.
Thus by using the simple declarative mechanism of RG_affinities, we can achieve robust recovery behavior for our data services running on Sun Cluster.
Martin Rattner
Sun Cluster Engineering


Comments (16)
  • guest Friday, April 27, 2007
    interesting
  • guest Tuesday, May 29, 2007
    Thanks for the answer. That helps.
  • Mohammad Ali Tuesday, June 17, 2008

    Hi,

    For example:

    I have two resource groups, each of them having two resources, like:

    rg1 - rg1rs1 rg1rs2

    rg2 - rg2rs1 rg2rs2

    Now I want rg1 and all of its resources to start first then rg2 and its resources to start.

    Now if I set dependency like this:

    clrg set -p RG_Dependencies=rg1 rg2

    Do I need to set the resource dependencies as well? If yes then what is the difference between RG_Dependencies & Resource_dependencies?

    Regards,

    Mohammad Ali


  • Martin Rattner Tuesday, June 17, 2008

    The RG_dependencies are a weaker form of dependency, in that they are applied only within a given node. In Ali's example above, if rg1 and rg2 are starting on the same node, then both resources of rg1 would start on that node before any resources of rg2 would be started. However, in the event that rg1 and rg2 are starting on two different nodes, there would be no guarantees about start ordering of the resources. [Note, if you wanted to force rg1 and rg2 to always start on the same node, you could use RG_affinities.]

    The only way to enforce resource start ordering across different nodes is to use resource dependencies:

    clrs set -p resource_dependencies=rg1rs1,rg1rs2 rg2rs1 rg2rs2

    Note, RG_dependencies is an older feature which we continue to support. However, it has mostly been superseded by resource dependencies.


  • Mohammad Ali Tuesday, June 17, 2008

    Hi Martin Rattner,

    Thanks a lot for your quick response. This blog is really helpful. I appreciate it.

    One more answer please...

    On the same box, will the global zone and a local zone be considered as two different nodes for resource group or resource dependencies?

    Regards,

    Mohammad Ali


  • Martin Rattner Tuesday, June 17, 2008

    That's a good question. The answer is yes, currently each zone including the global zone is considered as if it were a different "node" on which to locate a resource group.

    Currently, resource group affinities work only at the zone level. For example, if rg1 has a strong positive affinity for rg2, then rg1 must start in the same zone (global or non-global) in which rg2 is started.

    We have discovered that many applications would prefer a physical node affinity; that is, rg1 should start on the same physical node (but not necessarily in the same zone) as rg2. We are working on this enhancement to RG affinities behavior, and hope to provide it in a future release of Solaris Cluster and/or Open HA Cluster.


  • Mohammad Ali Wednesday, June 18, 2008

    Hi Martin Rattner,

    I really appreciate your support. Thanks again.

    Regards,

    Mohammad Ali


  • Andrew Dibbins Friday, July 11, 2008

    Hi Martin,

    In your example 4, when a resource group has precedence over another resource group, what happens if the subordinate resource group is only configured on one node?

    I have a two-node cluster with a production Oracle database failing over between nodes, and a QA Oracle database running solely on node 2. My customer doesn't want his QA Oracle database to fail over off of node 2, but to be taken offline. Setting the RG_affinities parameter to "--<proddb-rg>" in my qadb-rg resource group doesn't appear to do this?

    Also, suppose I have an application group where I want the same behaviour, but against each group, i.e. the RG_affinities for my qadb-rg needs to be "--<proddb-rg>,--<prodapp-rg>". I basically want an "OR" behaviour and not an "AND" behaviour?

    Regards

    Andy


  • Martin Rattner Friday, July 11, 2008

    Andy,

    In example 4, if the subordinate RG (in your example, the QA Oracle database) is configured to run on only one node, and it declares a strong negative affinity for the production RG; then if the production RG fails over onto that node, the subordinate RG will go offline as your customer would want.

    Setting RG_affinities=--<proddbrg> on the subordinate RG should have the desired effect of forcing it offline. I don't know why you are not observing this behavior. Check for syslog messages in /var/adm/messages on each node to see if you can find further information about what is happening.

    In your second example where an application RG declares strong negative affinity for two different production RGs, then either production rg alone (if it fails over onto the node where the application rg is running) should force the subordinate RG offline. I think this is what you are referring to as "OR" behavior and not AND behavior.


  • Martin Rattner Wednesday, July 23, 2008

    If you've been experimenting with RG affinities, you might have noticed the following problem:

    Suppose that two resource groups run on the same set of physical nodes, but are configured to run in different zones within those nodes. For example, suppose that RG1's nodelist is "node1:zoneA,node2:zoneA" while RG2's nodelist is "node1:zoneB,node2:zoneB". In this case, RG_affinities cannot currently be used between RG1 and RG2, because the affinity is interpreted at the zone level, i.e., it requires the two RGs to run in the same zone or in different zones.

    The question is: Is there a way to get the RG_affinities to ignore the zone component of the node list, and just to concentrate on the physical machines, rather than the combination of physical machine and zone?

    This issue is considered to be a design defect, addressed by OpenSolaris change request number 6443496. A fix is underway.


  • Martin Rattner Wednesday, July 23, 2008

    Correction to the preceding comment: The change request number given (6443496) is a Sun CR number, not an OpenSolaris CR number. I still need to get familiar with the OpenSolaris processes, so I am not sure how the community gets access to existing Sun Cluster (now Open HA Cluster) bug reports.


  • David Schramm Wednesday, August 6, 2008

    What about the case where you want each node to have locally installed binaries, but still have the HA Oracle (failover) resource group depend on those file systems (devices) being online before the Oracle RG starts to come up?

    thanks!


  • Martin Rattner Wednesday, August 6, 2008

    Re. David's question about locally installed Oracle binaries: Even though locally installed binaries are not strictly HA resources, you can use an HAStoragePlus resource on each node to manage the availability of the local file system on that node. The HASP resource for each node would be configured in its own single-mastered resource group (i.e., there would be one node in the RG nodelist). The failover ha-oracle resource group would declare RG_dependencies upon all of the single-node HASP resource groups. The RGM has special logic in this case, such that RG_dependencies upon single-node RGs are enforced only on the local node where HA-oracle is going online.

    If anyone is curious about the details, this RGM feature is described in Sun change request (CR) number 4778869.
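
    As a rough sketch, assuming a two-node cluster and hypothetical group, resource, and mount-point names, the configuration could look like this:

    # Register the HAStoragePlus resource type if it is not already registered
    clrt register SUNW.HAStoragePlus
    # One single-mastered RG per node, each managing the node-local file system (names hypothetical)
    clrg create -n node1 orahome-node1-rg
    clrs create -t SUNW.HAStoragePlus -g orahome-node1-rg -p FileSystemMountPoints=/oracle orahome-node1-rs
    clrg create -n node2 orahome-node2-rg
    clrs create -t SUNW.HAStoragePlus -g orahome-node2-rg -p FileSystemMountPoints=/oracle orahome-node2-rs
    # The failover Oracle RG declares RG_dependencies on both single-node RGs
    clrg set -p RG_dependencies=orahome-node1-rg,orahome-node2-rg ha-oracle-rg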


  • Dave Schramm Thursday, August 7, 2008

    Ok, I set up a special HASP RG for the Oracle Home FS on the "local" diskset. I set the dependency like this:

    "clrs set -p Resource_dependencies_weak=orahome-rs{LOCAL_NODE} oracle_serverdb480rd1-rs"

    It appears to be working. Our local diskset has multiple file systems, so will I still be able to create multiple HASP RGs for them? This will allow me to set weak dependencies on those file systems for the other various resource groups which rely on them being there before they mount up/run.

    thanks for your help.


  • Martin Rattner Thursday, August 7, 2008

    OK, you opted to use Resource_dependencies_weak. That should work OK if the HASP resource goes online. However, if the HAStoragePlus resource remains offline, the Oracle resource will attempt to start anyway due to the weak dependency. To avoid this problem, you can set a strong dependency (Resource_dependencies) with the {LOCAL_NODE} qualifier. This will cause the dependency to be enforced only on the local node where the Oracle RG is going online. For more info, see the r_properties(5) man page.
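
    For example, reusing the resource names from your command above, the strong per-node dependency would look like this:

    # Strong dependency, enforced only on the node where the Oracle resource is coming online
    clrs set -p Resource_dependencies=orahome-rs{LOCAL_NODE} oracle_serverdb480rd1-rs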

    In my earlier comment above, I had suggested using RG_dependencies instead of resource dependencies. That should also work as an alternative.


  • Martin Rattner Monday, July 27, 2009

    Just to give an update on the physical-node affinities feature (change request 6443496, mentioned in the comments above).

    This feature was integrated in Solaris Cluster 3.2 1/09, the same update release in which Zone Cluster support was introduced. Physical-node affinities are now the default behavior, except when zones are configured as logical nodes (usually for prototyping or demo purposes) on a single-machine cluster.

