What I Learnt About Clustering
By Antony Reynolds on Oct 27, 2009
Since moving to support I have learnt a lot about clustering. Some of the things I have learnt are:
- Lots of customers are running SOA Suite clusters
- Lots of them haven't read the High Availability Guide (10g or 11g)
- Lots of them haven't read the Enterprise Deployment Guide or EDG (10g or 11g)
- Many of them have problems because of the points above
Part of the problem for many customers is that setting up a cluster involves a lot of steps and a few gotchas that can come back to bite you. I just got off the phone with a customer who was having problems with a cluster install; nothing too serious, but irritating and slowing him down. Unless the HA Guide and EDG are followed very carefully it is easy to make mistakes. A few common problem areas I have seen are:
- Failing to separate the design and run time of the ESB
The ESB design time is a singleton and there must never be more than one instance of the ESB design time active against the same repository at the same time.
- Poor configuration of JGroups or Coherence
In 10g, JGroups is used to identify cluster membership; in 11g, Coherence plays the same role. If these are not configured consistently across the cluster then one part of the cluster may be unaware of the existence of other parts, with dire consequences for some shared resources.
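In 11g, for example, the EDG approach is to pin Coherence to unicast with a well-known-address (WKA) list rather than relying on multicast. A minimal sketch of the managed server start arguments, assuming a two-node cluster (the host names are placeholders, and the property names should be checked against your Coherence version):

```shell
# Sketch: Coherence unicast settings for a two-node SOA cluster,
# added to the soa_server start arguments (e.g. in setDomainEnv.sh).
# soahost1/soahost2 are placeholder host names.
-Dtangosol.coherence.wka1=soahost1
-Dtangosol.coherence.wka2=soahost2
# Set per node to that node's own listen address:
-Dtangosol.coherence.localhost=soahost1
```

The key point is that every node must share the same well-known-address list; a node with a different list can quietly form its own one-node cluster, which is exactly the inconsistency described above.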
- Failure to set up virtual addresses correctly
The cluster should present a single virtual address. This virtual address needs to be configured in the HTTP listener, in the OC4J servlet engine, and in the BPEL URL settings, among other places. Failure to do this can lead to odd behavior.
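In 10g BPEL, for instance, the URLs the engine hands out for callbacks and WSDLs are controlled by the soapServerUrl and soapCallbackUrl properties, normally set through the BPEL Admin console. If these still point at an individual node rather than the virtual address, callbacks bypass the load balancer. A sketch, with a placeholder virtual address:

```
# Sketch: 10g BPEL server properties (set via the BPEL Admin console).
# mycluster.example.com is a placeholder for the cluster's virtual address.
soapServerUrl   = http://mycluster.example.com:80
soapCallbackUrl = http://mycluster.example.com:80
```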
- Failure to test on a cluster
It seems many companies have clusters in production but not in test and dev. Surely no-one is that stupid, you say. Well, they may configure test instances as a cluster but, for resource reasons, run only one node of it; after all, a one-node cluster is still a cluster, right? They then wonder why certain problems only appear in production and cannot be reproduced in test.
- Many customers fail to have suitable test load balancers
Often customers will not have a hardware load balancer in their test or dev environments and will rely on a software load balancer instead. Some of these software load balancers use IP stickiness to keep affinity between clients and servers. This is bad for testing because a test script run from a single machine will only ever target a single node in the cluster. Make sure that when testing with a software load balancer it uses HTTP cookie affinity rather than IP address affinity.
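If Apache with mod_proxy_balancer is standing in for the hardware load balancer, cookie-based affinity can be sketched like this (the host names, ports, and routes are placeholders, and the backend must append the route to the session cookie for stickiness to work):

```apache
# Sketch: cookie-based (not IP-based) affinity with Apache mod_proxy_balancer.
# soahost1/soahost2 and the ports are placeholders for your cluster nodes.
<Proxy balancer://soacluster>
    BalancerMember http://soahost1:7777 route=node1
    BalancerMember http://soahost2:7777 route=node2
    ProxySet stickysession=JSESSIONID|jsessionid
</Proxy>
ProxyPass /orabpel balancer://soacluster/orabpel
```

With route and stickysession in place, affinity follows the session cookie rather than the client IP, so a single test machine can still exercise every node in the cluster.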
- Poor testing with BPEL drivers
Often we use BPEL processes as test harnesses to exercise functionality. This works well, but not in a cluster without some modifications to the driver process. By default BPEL optimizes communications: within the same JVM it uses Java calls, between JVMs in the same cluster it uses ORMI, and it only uses HTTP if it has to. To test properly we need to distribute load through the load balancer, and hence want to use HTTP. On the invokes we can set the property optSoapShortcut to false to force calls to go through HTTP and hence through the load balancer.
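In 10g this property is set on the partner link binding in the driver process's bpel.xml; a sketch, where TargetService and the WSDL URL are placeholders:

```xml
<!-- Sketch: bpel.xml fragment for the driver process.
     TargetService and the wsdlLocation URL are placeholders. -->
<partnerLinkBinding name="TargetService">
  <property name="wsdlLocation">http://mycluster.example.com:80/orabpel/default/TargetProcess?wsdl</property>
  <!-- Force the call out over HTTP so it passes through the load balancer -->
  <property name="optSoapShortcut">false</property>
</partnerLinkBinding>
```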
The above isn't an exhaustive list. If you have others then feel free to share them; I would love to add more to the list.