Threading in OSB
I recently led an OSB POC where we needed to get high throughput from an OSB pipeline that had the following logic:
1. Receive Request
2. Send Request to External System
3. If Response has a particular value
3.1 Modify Request
3.2 Resend Request to External System
4. Send Response back to Requestor
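The logic above can be sketched in a few lines of plain Java. This is only an illustration of the control flow, not OSB code: the `ResubmitFlow` class, the `RETRY` marker and the `:modified` suffix are all hypothetical, since the actual value checked and the modification made were specific to the POC.

```java
import java.util.function.UnaryOperator;

final class ResubmitFlow {
    // Hypothetical marker; stands in for the "particular value" in step 3.
    static final String RETRY_MARKER = "RETRY";

    // externalSystem stands in for the Business Service call.
    static String handle(String request, UnaryOperator<String> externalSystem) {
        String original = request;                        // copy kept for step 3
        String response = externalSystem.apply(request);  // step 2
        if (response.contains(RETRY_MARKER)) {            // step 3
            String modified = original + ":modified";     // step 3.1 (made-up change)
            response = externalSystem.apply(modified);    // step 3.2
        }
        return response;                                  // step 4
    }

    public static void main(String[] args) {
        // Stub external system: rejects the first attempt, accepts the modified one.
        UnaryOperator<String> stub =
                req -> req.endsWith(":modified") ? "OK for " + req : "RETRY";
        System.out.println(handle("order-1", stub));
    }
}
```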
It all looks very straightforward, with no nasty wrinkles along the way. The flow was implemented in OSB as follows (see diagram for more details):
- Proxy Service to Receive Request and Send Response
  - Request Pipeline
    - Copies Original Request for use in step 3
  - Route Node
    - Sends Request to External System exposed as a Business Service
  - Response Pipeline
    - Checks Response to Check If Request Needs to Be Resubmitted
      - Modify Request
      - Callout to External System (same Business Service as Route Node)
The Proxy and the Business Service were each assigned their own Work Manager, effectively giving each of them their own thread pool.
Imagine our surprise when, on stressing the system, we saw it lock up with large numbers of blocked threads. The lock-up was caused by some subtleties in the OSB thread model, which are the topic of this post.
Basic Thread Model
OSB goes to great lengths to avoid holding on to threads. Let's start by looking at how OSB deals with a simple request/response routing to a business service in a route node.
Most Business Services are implemented by OSB in two parts. The first part uses the request thread to send the request to the target. In the diagram this is represented by the thread T1. After sending the request to the target (the Business Service in our diagram) the request thread is released back to whatever pool it came from. A multiplexor (muxer) is used to wait for the response. When the response is received, the muxer hands it off to a new thread that is used to execute the response pipeline; this is represented in the diagram by T2.
OSB allows you to assign different Work Managers, and hence different thread pools, to each Proxy Service and Business Service. In our example we have the “Proxy Service Work Manager” assigned to the Proxy Service and the “Business Service Work Manager” assigned to the Business Service. Note that the Business Service Work Manager is only used to assign the thread that processes the response; it is never used to process the request.
This architecture means that while waiting for a response from a business service there are no threads in use, which makes for better scalability in terms of thread usage.
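The hand-off can be imitated with two ordinary Java thread pools. This is a rough analogy rather than how OSB is implemented: the `RouteNodeModel` class is hypothetical, the two executors stand in for the two Work Managers, and a `CompletableFuture` plays the part of the muxer waiting for the reply.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class RouteNodeModel {
    /** Returns {name of the request thread, name of the response thread}. */
    static String[] route() {
        // One thread each, named so we can see which pool did the work.
        ExecutorService proxyWm = Executors.newFixedThreadPool(1, r -> new Thread(r, "proxy-wm"));
        ExecutorService businessWm = Executors.newFixedThreadPool(1, r -> new Thread(r, "business-wm"));
        try {
            // Stand-in for the muxer: completed when the backend replies.
            CompletableFuture<String> backendReply = new CompletableFuture<>();
            // Response pipeline (T2) is scheduled on the business service pool.
            CompletableFuture<String> responseThread = backendReply.thenApplyAsync(
                    reply -> Thread.currentThread().getName(), businessWm);
            // Request pipeline (T1) "sends" the request and returns, freeing its thread.
            String requestThread = CompletableFuture
                    .supplyAsync(() -> Thread.currentThread().getName(), proxyWm)
                    .get(5, TimeUnit.SECONDS);
            // No pool thread is held at this point; the reply now arrives.
            backendReply.complete("response");
            return new String[] {requestThread, responseThread.get(5, TimeUnit.SECONDS)};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            proxyWm.shutdownNow();
            businessWm.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String[] threads = route();
        System.out.println("request on " + threads[0] + ", response on " + threads[1]);
    }
}
```

The point to notice is that nothing blocks between sending the request and the reply arriving, so neither pool has a thread tied up during the wait.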
Note that if the Proxy and the Business Service both use the same Work Manager then there is potential for starvation. For example:
- Request Pipeline makes a blocking callout, say to perform a database read.
- Business Service response tries to allocate a thread from thread pool but all threads are blocked in the database read.
- New requests arrive and contend with responses arriving for the available threads.
Similar problems can occur if the response pipeline blocks for some reason, maybe a database update for example.
The solution to this is to make sure that the Proxy and Business Service use different Work Managers so that they do not contend with each other for threads.
Do Nothing Route Thread Model
So what happens if there is no route node? In this case OSB just echoes the Request message as a Response message, but what happens to the threads? OSB still uses a separate thread for the response, but in this case the Work Manager used is the Default Work Manager.
So this is really a special case of the Basic Thread Model discussed above, except that the response pipeline will always execute on the Default Work Manager.
Proxy Chaining Thread Model
So what happens when the route node is actually calling a Proxy Service rather than a Business Service? Does the second Proxy Service use its own thread, or does it re-use the thread of the original Request Pipeline?
As you can see from the diagram, when a route node calls another proxy service the original Work Manager is used for both request pipelines. Similarly, the response pipeline uses the Work Manager associated with the ultimate Business Service invoked via a Route Node. This fits in with the earlier description I gave of Business Services, and by extension Route Nodes: the first part “… uses the request thread to send the request to the target”.
Call Out Threading Model
So what happens when you make a Service Callout to a Business Service from within a pipeline? The documentation says that “The pipeline processor will block the thread until the response arrives asynchronously” when using a Service Callout. What this means is that the target Business Service is called using the pipeline thread, and the response is ultimately handled by the pipeline thread as well, so the pipeline thread blocks while waiting for the response. It is the handling of this response that behaves in an unexpected way.
When a Business Service is called via a Service Callout, the calling thread is suspended after sending the request, but unlike the Route Node case the thread is not released; it waits for the response. The muxer uses the Business Service Work Manager to allocate a thread to process the response, but in this case processing the response means getting the response and notifying the blocked pipeline thread that the response is available. The original pipeline thread can then continue to process the response.
This leads to an unfortunate wrinkle. If the Business Service is using the same Work Manager as the Pipeline then it is possible for starvation or a deadlock to occur. The scenario is as follows:
- Pipeline makes a Callout and the thread is suspended but still allocated
- Multiple Pipeline instances using the same Work Manager are in this state (common for a system under load)
- Response comes back but all Work Manager threads are allocated to blocked pipelines.
- Response cannot be processed and so pipeline threads never unblock – deadlock!
The solution to this is to make sure that any Business Services used by a Callout in a pipeline use a different Work Manager to the pipeline itself.
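The scenario is easy to reproduce outside OSB with plain Java executors. The sketch below is an analogy under stated assumptions, not OSB internals: the `CalloutStarvation` class is hypothetical, each fixed-size pool stands in for a Work Manager with a thread limit, and a one-second timeout is used to detect the lock-up instead of hanging forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class CalloutStarvation {
    /** Returns true if every simulated callout completed within one second. */
    static boolean runCallouts(boolean sharedPool) {
        int poolSize = 2;
        ExecutorService pipelineWm = Executors.newFixedThreadPool(poolSize);
        // Either reuse the pipeline pool or give the business service its own.
        ExecutorService businessWm =
                sharedPool ? pipelineWm : Executors.newFixedThreadPool(poolSize);
        CountDownLatch allBusy = new CountDownLatch(poolSize); // every pipeline thread in use
        CountDownLatch done = new CountDownLatch(poolSize);
        try {
            for (int i = 0; i < poolSize; i++) {
                pipelineWm.execute(() -> {
                    CompletableFuture<String> response = new CompletableFuture<>();
                    try {
                        allBusy.countDown();
                        allBusy.await(); // simulate load: all pipeline threads occupied
                        // The notification needs a thread from the business service pool.
                        businessWm.execute(() -> response.complete("reply"));
                        response.get();  // pipeline thread blocks but stays allocated
                        done.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // interrupted at shutdown
                    } catch (ExecutionException e) {
                        throw new IllegalStateException(e);
                    }
                });
            }
            return done.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pipelineWm.shutdownNow(); // interrupts any pipeline threads still blocked
            businessWm.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool completed:    " + runCallouts(true));
        System.out.println("separate pools completed: " + runCallouts(false));
    }
}
```

With a shared pool the notification tasks sit in the queue behind the very threads that are waiting for them, so nothing completes; with separate pools the notifications always find a free thread.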
The Solution to My Problem
Looking back at my original workflow, we see that the same Business Service is called twice: once in a Route Node and once in a Response Pipeline Callout. This was what was causing my problem. The response pipeline was executing on the Business Service Work Manager, and the Service Callout wanted to use the same Work Manager to handle its responses. Eventually my response pipelines hogged all the available threads, so no callout responses could be processed.
The solution was to create a second Business Service pointing to the same location as the original Business Service; the only difference was that it was assigned a different Work Manager. This ensured that when the Service Callout completed there were always threads available to process the response, because the response processing from the Service Callout had its own dedicated Work Manager.
Summary
- Request Pipeline
  - Executes on Proxy Work Manager (WM) Thread so limited by setting of that WM. If no WM specified then uses WLS default WM.
- Route Node
  - Request sent using Proxy WM Thread
  - Proxy WM Thread is released before getting response
  - Muxer is used to handle response
  - Muxer hands off response to Business Service (BS) WM
- Response Pipeline
  - Executes on Routed Business Service WM Thread so limited by setting of that WM. If no WM specified then uses WLS default WM.
- No Route Node (Echo functionality)
  - Proxy WM thread released
  - New thread from the default WM used for response pipeline
- Service Callout
  - Request sent using proxy pipeline thread
  - Proxy thread is suspended (not released) until the response comes back
  - Notification of response handled by BS WM thread so limited by setting of that WM. If no WM specified then uses WLS default WM.
    - Note this is a very short lived use of the thread
  - After notification by callout BS WM thread that thread is released and execution continues on the original pipeline thread.
- Route/Callout to Proxy Service
  - Request Pipeline of callee executes on requestor thread
  - Response Pipeline of caller executes on response thread of requested proxy
  - Request message may be queued if limit reached.
  - Requesting thread is released (route node) or suspended (callout)
What this means is that you may get deadlocks caused by thread starvation if you use the same thread pool for the business service in a route node and the business service in a callout from the response pipeline, because the callout needs a notification thread from the same thread pool that the response pipeline is running on. This was the problem we were having.
You get a similar problem if you use the same work manager for the proxy request pipeline and a business service callout from that request pipeline.
It also means you may want to have different work managers for the proxy and business service in the route node.
Basically you need to think carefully about how threading impacts your proxy services.
Thanks to Jay Kasi, Gerald Nunn and Deb Ayers for helping to explain this to me. Any errors are my own and not theirs. Also thanks to my colleagues Milind Pandit and Prasad Bopardikar who travelled this road with me.