Clustering SOA Suite
By Antony Reynolds on Jun 30, 2009
Building a SOA Suite Cluster
Having spent a couple of weeks working on a SOA Suite cluster, I thought I would share some thoughts on clustering and SOA Suite. Clustering both BPEL Process Manager and Oracle Service Bus (OSB) is relatively straightforward, but there are a few gotchas. Both BPEL and OSB are stateless in the way they implement clustering, although BPEL does of course persist state to a database.
SOA Suite Clusters
Both BPEL and OSB clusters expect to be fronted by a load balancer. Both can provide load balancing through a front-end web server, but a hardware load balancer is the best approach, as shown in the diagram.
In this example we have an Oracle Real Application Clusters (RAC) database running on three nodes to provide a highly available database environment. We then have two clusters: a cluster of 5 BPEL Process Manager instances, all pointing to the same RAC database, and a cluster of 5 OSB instances. The BPEL cluster and the OSB cluster are both fronted by two hardware load balancers.
The BPEL instances and the OSB instances are in an active-active mode, meaning that all nodes are processing requests at the same time. The load balancers are in active-passive mode, meaning that one load balancer processes all the traffic with the other load balancer acting as a hot standby in case of failure of the first load balancer.
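The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: the class names, node names and the round-robin policy are assumptions made for the example, not how any particular hardware load balancer works.

```python
class LoadBalancer:
    """Hypothetical round-robin balancer in front of active-active nodes."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)  # nodes currently able to take requests
        self.up = True                  # is this load balancer itself alive?
        self._next = 0

    def route(self):
        """Return the next healthy node in round-robin order."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes left in the cluster")


class ActivePassivePair:
    """One balancer takes all traffic; the standby takes over only on failure."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def route(self):
        active = self.primary if self.primary.up else self.standby
        return active.name, active.route()


nodes = ["bpel1", "bpel2", "bpel3"]
pair = ActivePassivePair(LoadBalancer("lb1", nodes), LoadBalancer("lb2", nodes))
first_three = [pair.route() for _ in range(3)]  # every node shares the load
pair.primary.up = False                         # the primary balancer fails
after_failover = pair.route()                   # the standby now takes traffic
```

While the primary is up, all requests go through lb1 and are spread across every node; once it is marked down, the same calls are answered through lb2.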
This configuration avoids a single point of failure, as every component is duplicated. The system has been sized to sustain the expected load even in the event of losing a machine. This is known as an "n+1" architecture, meaning that we need "n" machines to meet the requirements and so we provide "n+1" machines to allow for machine failure. In the example shown we actually have two machines more than we need in normal operation, because the RAC database is running an "n+1" configuration as well as the SOA Suite. Note that by running OSB on the same machines as BPEL we reduce the amount of extra hardware needed for failover and also reduce the latency of OSB-BPEL communication.
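The "n+1" sizing rule is simple arithmetic, sketched below. The load and capacity figures are made-up numbers chosen purely for illustration; they are not measurements from the system described here.

```python
import math


def machines_required(peak_load, capacity_per_machine, spares=1):
    """Return n + spares, where n machines are enough to carry the peak load."""
    n = math.ceil(peak_load / capacity_per_machine)
    return n + spares


def survives_single_failure(machines, peak_load, capacity_per_machine):
    """True if the remaining machines still carry the peak load after one is lost."""
    return (machines - 1) * capacity_per_machine >= peak_load


# Illustrative figures only: 1000 requests/s peak, 250 requests/s per machine.
needed = machines_required(1000, 250)
```

With these assumed figures "n" is 4, so we would provide 5 machines, and losing any one of them still leaves enough capacity for the peak load.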
In the event of a failure, only requests that are in progress at the time would be impacted. Any requests that are "idempotent" (meaning they can be resubmitted with no ill effects) can be set up to automatically retry, further reducing the impact of a software or hardware failure. Both BPEL and OSB can be set to automatically retry requests in the event of failure, making the failover transparent.
Note that clustering does not help if we have a site failure due to, for example, fire, flooding, power failure or air conditioning failure. In those cases we would need some sort of disaster recovery site, perhaps using Oracle Data Guard to keep the sites in sync at the database level.
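The idea behind automatic retry can be sketched as follows. This is a generic illustration, not the BPEL or OSB retry mechanism itself; the function name, the `NodeFailure` exception and the default of three attempts are all assumptions made for the example.

```python
import time


class NodeFailure(Exception):
    """Hypothetical error raised when the node handling a request fails."""


def call_with_retry(operation, idempotent, max_attempts=3, delay_seconds=0.0):
    """Retry an operation after node failure, but only if it is idempotent."""
    attempts = max_attempts if idempotent else 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except NodeFailure:
            if attempt == attempts:
                raise  # out of attempts, so the caller sees the failure
            time.sleep(delay_seconds)  # brief pause before resubmitting
```

A non-idempotent request gets exactly one attempt, because resubmitting it could repeat a side effect such as placing an order twice.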
A BPEL cluster is effectively defined by a shared dehydration store. Synchronous interactions must be processed within a single BPEL server instance, as the client has connected to a socket and expects a response on that same socket. Asynchronous interactions are like any other long-running BPEL process and may start processing on one node and then have processing resumed on another node, either due to failure or some other event. The dehydration store (an Oracle RAC database in the example above) provides a common location for process state that allows any BPEL instance to resume execution of a process instance.
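A toy model shows why a shared store lets any node pick up an asynchronous process. The classes below are hypothetical stand-ins (an in-memory dictionary plays the part of the database) and do not reflect the real BPEL engine or its schema.

```python
class DehydrationStore:
    """In-memory stand-in for the shared database holding process state."""

    def __init__(self):
        self._state = {}

    def dehydrate(self, instance_id, state):
        self._state[instance_id] = dict(state)

    def rehydrate(self, instance_id):
        return dict(self._state[instance_id])


class BpelNode:
    """Hypothetical cluster node; it keeps no process state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def start(self, instance_id, payload):
        # Run up to the point where the process waits, then persist its state.
        state = {"payload": payload, "step": "waiting_for_callback",
                 "last_node": self.name}
        self.store.dehydrate(instance_id, state)

    def resume(self, instance_id, callback):
        # Any node can load the state and carry on where the process stopped.
        state = self.store.rehydrate(instance_id)
        state.update(step="completed", result=callback, last_node=self.name)
        self.store.dehydrate(instance_id, state)
        return state


store = DehydrationStore()
node1, node2 = BpelNode("node1", store), BpelNode("node2", store)
node1.start("order-1", {"item": "book"})
final_state = node2.resume("order-1", "approved")  # resumed on a different node
```

Because the nodes hold nothing between steps, it makes no difference which node receives the callback.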
When installing a BPEL cluster the best place to start is the High Availability Guide. This outlines that you create a BPEL cluster by doing the following:
- Get the address of the load balancer.
- Run the repository creation assistant to create the BPEL meta-data in the database.
- Install OHS and OC4J components on the machines.
- Install BPEL Process Manager into the app server instances installed previously.
- If using a RAC database, make sure that the JNDI data sources are using all RAC nodes. See the Enterprise Deployment Guide for instructions on using Fast Connection Failover.
- Configure BPEL in a cluster as outlined in the Enterprise Deployment Guide.
- Set up JGroups to make all nodes in the cluster aware of each other. Note that in 11g, Coherence will be used for this, which will simplify configuration.
- Set enableCluster and ClusterName in the collaxa-config.xml file.
- Make sure the BPEL PM instances all use the load balancer address for the server URL and callback URLs. This ensures that in the event of node failure, requests and responses are rerouted to the remaining instances.
- Deploy processes on each node to make sure that all components are available on all nodes. If you don't do this some processes will work because they don't have dependencies on non-BPEL components, but others will not.
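The last step is easy to get wrong, so it is worth checking. The sketch below assumes you have already collected the set of deployed process names from each node by some means (it does not call any BPEL API), and the node and process names are invented for the example.

```python
def missing_deployments(deployed_by_node):
    """Map each node to the processes deployed elsewhere but missing on it."""
    all_processes = set().union(*deployed_by_node.values())
    return {
        node: sorted(all_processes - deployed)
        for node, deployed in deployed_by_node.items()
        if all_processes - deployed
    }


# Hypothetical inventory gathered from two nodes.
inventory = {
    "node1": {"OrderProcess", "CreditCheck"},
    "node2": {"OrderProcess"},
}
gaps = missing_deployments(inventory)
```

An empty result means every process is present on every node; here node2 would be reported as missing CreditCheck.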
Once set up, use the BPEL fault handling framework to make sure that any calls to OSB are automatically retried in the event of failure.
An OSB domain may have a single OSB cluster. This cluster is installed like any other cluster in WebLogic. Details on configuring the cluster can be found in the Creating WebLogic Domains Using the Configuration Wizard documentation. For normal operation the OSB cluster is completely stateless; however, metrics gathering and aggregation take place in a singleton service that by default is assigned to the first machine created in the cluster. If this server fails, metrics will stay in the message queues to which they are delivered until a new instance starts. You create an OSB cluster by doing the following:
- Get the address of the load balancer.
- Install OSB software onto each machine or into a single shared location. If you use a shared location, make sure that the reference to it is the same on each machine.
- Create an OSB domain.
- Create a cluster and machines, and assign servers to machines as explained in the Creating WebLogic Domains Using the Configuration Wizard documentation.
- Use the load balancer address for the cluster address.
- If you are not using a shared file location for your OSB install then you need to copy the contents of the OSB domain directory to all nodes. This ensures that the correct scripts are available for the node manager to launch managed servers.
- Run node manager on each machine.
- You can now launch your admin server and start the managed servers one at a time. It is recommended that you start the OSB server running the data collectors first. This will avoid timeouts on the other machines in the cluster and ensure that metrics are available.
- If message queues are required to be highly available then they should use persistent storage, either on a shared highly available disk (a SAN, for example) or in a database.
- If you are using an Oracle database then you should configure your OSB domain to store metrics in the Oracle database rather than in PointBase, as explained in Creating WebLogic Domains Using the Configuration Wizard.
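The recommended start order from the steps above can be written down as a small helper. The server names are hypothetical, and the function only computes an order; it does not start anything or talk to the node manager.

```python
def startup_order(admin_server, managed_servers, collector_server):
    """Admin server first, then the data collector server, then the rest."""
    if collector_server not in managed_servers:
        raise ValueError("collector_server must be one of the managed servers")
    rest = [s for s in managed_servers if s != collector_server]
    return [admin_server, collector_server, *rest]


# Hypothetical domain in which osb2 hosts the data collectors.
order = startup_order("AdminServer", ["osb1", "osb2", "osb3"], "osb2")
```

The remaining managed servers keep their original relative order and would be started one at a time.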
When calling BPEL or external web services from OSB, make sure that you specify retries to allow for node failure. Intra-OSB calls should use the "local" transport for efficiency.
Configuring a cluster is a bit more involved than configuring a single instance, but it is not massively more complicated and it does provide both scalability and high availability. Both BPEL and OSB scale linearly with increased nodes. The only limitation on BPEL is the load on the backend dehydration store. So go ahead, enjoy a cluster!