How does Sun Cluster decide on which node a service runs?
A customer asked us how Sun Cluster decides where to bring a resource group online.
The selection of a primary node for a resource group is determined first of all by the Nodelist property that is configured for the group. The Nodelist specifies a list of nodes or zones where the resource group can be brought online, in order of preference. These nodes or zones are known as the potential primaries (or masters) of the resource group.
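For example, a failover resource group whose preferred primary is phys-node1 and whose backup is phys-node2 (hypothetical node names) could be created roughly like this, with the Nodelist given in preference order:

clrg create -n phys-node1,phys-node2 my-rg

The RGM walks this list in order when it chooses where to bring the group online, subject to the factors described below.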
The Nodelist preference order is modified by the following factors:
- "Ping pong" prevention: If the resource group has recently failed to start on a given node or zone, that node or zone is given lower priority than one on which it has not yet failed.
- Resource group affinities: This is a configurable resource group property that indicates a positive or negative affinity of one group for another. The node selection algorithm always satisfies strong affinities and makes a best effort to satisfy weak affinities.
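A quick aside on the first factor: the time window that the RGM uses for ping-pong prevention is controlled by the resource group's Pingpong_interval property (in seconds). For example, to shorten it to 600 seconds on a hypothetical group my-rg:

clrg set -p Pingpong_interval=600 my-rg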
The resource group affinities property (RG_affinities) was introduced in Sun Cluster 3.1 9/04. This property allows you to express that the RGM should do one of the following:
1. Attempt to locate the resource group on a node that is a current master of another group, referred to as positive affinity.
2. Attempt to locate the resource group on a node that is not a current master of a given group, referred to as negative affinity.
Resource group affinities come in five flavors:
+, or weak positive affinity
++, or strong positive affinity
+++, or strong positive affinity with failover delegation
-, or weak negative affinity
--, or strong negative affinity
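In each case the operator is prefixed to the name of the target resource group, and several affinities can be combined in a comma-separated list, as the examples below show. To inspect the current setting on a hypothetical group my-rg, or to clear it (setting the property to an empty string should remove all affinities):

clrg show -p RG_affinities my-rg
clrg set -p RG_affinities="" my-rg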
At this point you may well be wondering, what are affinities used for, and how do they work? To answer that, here are a few examples:
Example 1: Enforcing collocation of a resource group with another resource group
Suppose that our cluster is running an Oracle database server controlled by a failover resource group, ora-rg. We also have an application in resource group dbmeasure-rg, whose job it is to measure and log the performance of the database. The dbmeasure application, if it runs, must run on the same node as the Oracle server. However, the measurement application is not mandatory, and Oracle can run fine without it.
We can force dbmeasure to run only on a node where Oracle is running, by declaring a strong positive affinity:
clrg set -p RG_affinities=++ora-rg dbmeasure-rg
When we initially switch ora-rg online, dbmeasure-rg will automatically come online on the same node. If ora-rg fails over to a different node or zone, dbmeasure-rg will follow it automatically. While ora-rg remains online we can switch dbmeasure-rg offline; however, dbmeasure-rg cannot switch over or fail over onto any node where ora-rg is *not* running.
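To see this "follow" behavior in action on a hypothetical pair of nodes, we could switch ora-rg to the other node and then check where both groups ended up:

clrg switch -n phys-node2 ora-rg
clrg status ora-rg dbmeasure-rg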
Note: Besides the RG_affinities setting, we may also configure a dependency of the dbmeasure resource upon the Oracle server resource. This ensures that the dbmeasure resource does not get started until the Oracle server resource is online. Resource group affinities are enforced independently of resource dependencies. While resource dependencies control the order in which resources are started and stopped, RG_affinities control the _locations_ where resource groups are brought online across multiple nodes or zones of a cluster.
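As a sketch, assuming the Oracle server resource is named ora-server-rs and the measurement resource is named dbmeasure-rs (hypothetical resource names), such a dependency is declared on the resource itself rather than on the group:

clrs set -p Resource_dependencies=ora-server-rs dbmeasure-rs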
What if dbmeasure is a more critical application, and it is important to keep it up and running? In that case, we might want to allow dbmeasure-rg itself to initiate a failover onto a different node, dragging ora-rg along with it. To accomplish this, we use the strong positive affinity with failover delegation:
clrg set -p RG_affinities=+++ora-rg dbmeasure-rg
Example 2: Specifying a preferred collocation of a resource group with another resource group
Assume again a cluster running our Oracle database resource group, ora-rg. On the same cluster, we are running a customer service application that uses the database; this application is configured in a separate failover resource group, app-rg. The application and the database _can_ run on two different nodes, but perhaps we have discovered that the application is database-intensive and runs faster if it is hosted on the same node as the database. Therefore, we prefer to start the application on the same node as the database.
However, it might also be the case that we want to avoid switching the application from one node to another, even if the database changes nodes. To avoid breaking client connections or for some other reason, we would rather keep the application on its current master, even if it incurs some performance penalty.
To achieve these semantics, we give app-rg a weak positive affinity for ora-rg:
clrg set -p RG_affinities=+ora-rg app-rg
With this affinity, the RGM will start app-rg on the same node as ora-rg when possible, but will not force it to always run on the same node.
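Unlike the strong affinity of Example 1, the weak affinity does not restrict where app-rg may run. For instance, assuming ora-rg is currently mastered by phys-node1 (hypothetical node names), we are still free to move the application to another node, and the RGM will not force it back onto ora-rg's node:

clrg switch -n phys-node2 app-rg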
Example 3: Balancing the load of a set of resource groups
Now suppose that we have a cluster that is hosting three independent applications in resource groups app1-rg, app2-rg, and app3-rg. By giving each resource group a weak negative affinity for the other two groups, we can achieve a rudimentary form of load balancing on our cluster:
clrg set -p RG_affinities=-app2-rg,-app3-rg app1-rg
clrg set -p RG_affinities=-app1-rg,-app3-rg app2-rg
clrg set -p RG_affinities=-app1-rg,-app2-rg app3-rg
With these settings, the RGM will try to bring each resource group online on a node that is not currently hosting either of the other two groups. If there are three or more nodes available, this will place each resource group onto its own node. If there are fewer than three nodes available, the RGM will "double-up" or "triple-up" the resource groups onto the available node(s). Conceptually, a resource group with a weak negative affinity tries to stay away from the groups it names, somewhat like electrostatic charges that repel one another.
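After bringing the three groups online, their placement can be verified; on a cluster with three or more nodes, each group should end up on a different node:

clrg status app1-rg app2-rg app3-rg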
Example 4: Specifying that a critical service has precedence
In this example, a critical service -- let's say it's our Oracle database in the ora-rg resource group -- is sharing the cluster with a non-critical service, for example, a prototype of a newer version of our software that is undergoing testing and development. On a two-node cluster, we want ora-rg to start on one node and test-rg to start on the other node. Now suppose that the first node, hosting ora-rg, dies, causing ora-rg to fail over to the second node. We want the non-critical service in test-rg to go offline on that node.
To accomplish this behavior, we give test-rg a strong negative affinity for ora-rg:
clrg set -p RG_affinities=--ora-rg test-rg
When the first node dies and ora-rg fails over to the second node where test-rg is currently running, test-rg will get "bumped off" of the second node and will remain offline (assuming a two-node cluster). When the first node reboots, it takes on the role of backup node, and test-rg is automatically started on it.
Example 5: Combining different flavors of RG_affinities to achieve more complex behavior
In the Sun Cluster HA for SAP Replicated Enqueue Service, we configure the enqueue server in one resource group enq-rg, and the replica server in a second resource group repl-rg.
A requirement of this data service is that the enqueue server, if it fails on the node where it is currently running, must fail over to the node where the replica server is running; the replica server then needs to move immediately to a different node. Setting a weak positive affinity from the enqueue server resource group to the replica server resource group ensures that the enqueue server resource group fails over to the node where the replica server is currently running:
clrg set -p RG_affinities=+repl-rg enq-rg
Setting a strong negative affinity from the replica server resource group to the enqueue server resource group ensures the replica server resource group is offloaded from the replica server node, before the enqueue server resource group is brought online on the same node:
clrg set -p RG_affinities=--enq-rg repl-rg
The replica server resource group will be started up on another node if one is available.
Thus by using the simple declarative mechanism of RG_affinities, we can achieve robust recovery behavior for our data services running on Sun Cluster.
Sun Cluster Engineering