Coping with Failure
By Antony Reynolds-Oracle on Sep 22, 2011
Handling Endpoint Failure in OSB
Recently I was working on a POC in which we had demonstrated stellar performance with OSB fronting a BPEL composite that called back-end EJBs. The final test was a failover test: kill an OSB server and bring it back online, then kill a SOA (BPEL) server and bring it back online, and finally kill a back-end EJB server and bring it back online. All was going well until the BPEL failover test, when for some reason OSB refused to mark the BPEL server as down. It turned out we had forgotten a very important setting, so this entry outlines how to handle endpoint failure in OSB.
Step 1 – Add Multiple End Points to Business Service
The first thing to do is create multiple endpoints for the business service, pointing to all available back ends. This is required for HTTP/SOAP bindings. In theory, if you use the T3 protocol then a single cluster address is sufficient and load balancing is taken care of by T3 smart proxies. In this scenario, though, we will focus on HTTP/SOAP endpoints.
Navigate to the Business Service->Configuration Details->Transport Configuration and add all your endpoint URIs. Make sure that Retry Count is greater than 0 if you don’t want to pass failures back to the client. In my example I set up links to three back-end web service instances. Go to Last and Save the changes.
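To make the effect of Retry Count concrete, here is a minimal sketch in Python. It is a toy model of the behaviour described above, not OSB's actual implementation: the function names, the EndpointDown exception and the example.com URIs are all made up for illustration, and it assumes each retry simply moves on to the next URI in the list.

```python
class EndpointDown(Exception):
    """Raised by the hypothetical send function when an endpoint cannot be reached."""


def call_with_retries(endpoints, send, retry_count):
    """Make one attempt plus up to retry_count retries, moving to the next URI each time."""
    last_error = None
    for attempt in range(retry_count + 1):
        uri = endpoints[attempt % len(endpoints)]  # wrap around if retries exceed URIs
        try:
            return send(uri)
        except EndpointDown as err:
            last_error = err  # remember the failure and try the next URI
    # Every attempt failed, so the failure is passed back to the client.
    raise last_error


# Hypothetical back ends: the first one is down, the second is healthy.
ENDPOINTS = ["http://soa1.example.com/svc", "http://soa2.example.com/svc"]
DOWN = {"http://soa1.example.com/svc"}


def fake_send(uri):
    """Stand-in for the real transport call."""
    if uri in DOWN:
        raise EndpointDown(uri)
    return "response from " + uri
```

In this model, call_with_retries(ENDPOINTS, fake_send, 1) hides the failure of the first endpoint from the caller, while a retry count of 0 passes the EndpointDown error straight back.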
Step 2 – Enable Offlining & Recovery of Endpoint URIs
When a back end service instance fails we want to take it offline, meaning we want to remove it from the pool of instances to which OSB will route requests. We do this by navigating to the Business Service->Operational Settings and selecting the Enable check box for Offline Endpoint URIs in the General Configuration section. This causes OSB to stop routing requests to a backend that returns errors (if the transport setting Retry Application Errors is set) or fails to respond at all.
Offlining the service is good because we won’t send any more requests to a broken endpoint, but we also want to add the endpoint again when it becomes available. We do this by setting the Enable with Retry Interval in General Configuration to some non-zero value, such as 30 seconds. Then every 30 seconds OSB will add the failed service endpoint back into the list of endpoints. If the endpoint is still not ready to accept requests then it will error again and be removed again from the list. In my example I set up a 30-second retry interval. Remember to hit Update and then commit all the session changes.
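The offline-and-recover cycle can be pictured with a small sketch. Again this is only a simplified model under my own assumptions, not how OSB is written: the EndpointPool class and its method names are hypothetical, time is passed in explicitly as seconds, and the retry interval is assumed to be greater than zero.

```python
class EndpointPool:
    """Toy model of offlining endpoints and re-adding them after a retry interval."""

    def __init__(self, uris, retry_interval_s):
        self.uris = list(uris)
        self.retry_interval_s = retry_interval_s  # assumed to be greater than zero
        self.offline_since = {}  # uri -> time (seconds) at which it was taken offline

    def mark_failed(self, uri, now):
        """Take a failed endpoint out of the routing pool."""
        self.offline_since[uri] = now

    def available(self, now):
        """Return routable endpoints, re-adding any whose retry interval has elapsed."""
        for uri, since in list(self.offline_since.items()):
            if now - since >= self.retry_interval_s:
                # Put the endpoint back; if it is still broken, the next
                # request to it fails and mark_failed removes it again.
                del self.offline_since[uri]
        return [u for u in self.uris if u not in self.offline_since]
```

With a 30 second interval, an endpoint marked failed at time 0 is missing from available(now=10) and present again in available(now=30).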
Considerations on Retry Count
A couple of things to be aware of on retry count.
If you set retry count to greater than zero then endpoint failures will be transparent to OSB clients, other than the additional delay they experience. However, if the request is mutative (changes the back end) then there is no guarantee that the request was not executed before the endpoint failed; the endpoint may have done the work and then died before returning the result, in which case a retry submits the mutative operation twice. If your back-end service can’t cope with this then don’t set retries.
If your back-end service can’t cope with retries then you can still get the benefit of transparent retries for non-mutative operations by creating two business services, one with retry enabled that handles non-mutative requests, and the other with retry set to zero that handles mutative requests.
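The split into two business services amounts to a routing decision per operation. A minimal sketch of that decision follows; the operation names and business service names are invented for illustration, and in OSB itself the choice would be made in the proxy service's routing logic rather than in Python.

```python
# Hypothetical operations that only read data and are therefore safe to retry.
NON_MUTATIVE_OPERATIONS = {"getOrder", "listOrders"}

# Hypothetical names of the two business services.
SERVICE_WITH_RETRY = "OrderService_WithRetry"  # Retry Count greater than 0
SERVICE_NO_RETRY = "OrderService_NoRetry"      # Retry Count set to 0


def choose_business_service(operation):
    """Send read-only operations to the retrying service and everything else to the other."""
    if operation in NON_MUTATIVE_OPERATIONS:
        return SERVICE_WITH_RETRY
    # Unknown operations are treated as mutative so they are never submitted twice.
    return SERVICE_NO_RETRY
```

Defaulting to the no-retry service is the cautious choice: an operation missing from the list loses transparent failover rather than risking a duplicate update.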
Considerations on Retry Interval for Offline Endpoints
If you set the retry interval to too small a value then it is very likely that your failed endpoint will not have recovered, so you will waste time on a request that fails to contact that endpoint before failing over to another one. This increases the client response time. Work out a typical unplanned outage time for a node (such as one caused by a JVM failure and subsequent restart) and set the retry interval to, say, half of this as a compromise between causing additional client response time delays and adding the endpoint back into the mix as soon as possible.
Always remember to set the Operational Setting to Enable Offlining and then you won’t be surprised in a failover test!
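The trade-off can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic under simplifying assumptions of my own: the endpoint is re-added exactly every retry interval after it goes down, and each time it is re-added while still down costs one failed request. The function names are hypothetical.

```python
import math


def suggested_retry_interval(typical_outage_s):
    """Rule of thumb from the text: half the typical unplanned outage time."""
    return typical_outage_s / 2


def failed_probes(outage_s, retry_interval_s):
    """Count how often the endpoint is re-added while it is still down.

    Assumes re-adds happen at retry_interval_s, 2 * retry_interval_s, ...
    after the failure, and that a re-add at or after outage_s succeeds.
    """
    return math.ceil(outage_s / retry_interval_s) - 1
```

For a node that typically takes 120 seconds to restart, the rule of thumb gives a 60 second interval and, in this model, one wasted request during the outage. A 5 second interval would waste 23 requests for the same outage.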