Waiting to be Served, And Waiting, And Waiting …

Have you ever been sat in a waiting room, say at the doctor's or, worse, the Dept of Motor Vehicles, and noticed that everyone else is being called up but you?  It turns out that they somehow “lost” you from their queue.  I have been working with a customer recently who has been seeing a large number of one-way invocation messages appearing in the recovery list in BPEL 10.1.3.4.  These messages just sit there waiting to be delivered, but never actually get delivered.  Each message just sits there wondering why it is being ignored.

To understand what is happening here, let's look at how messages are delivered for one-way invocations.

Message Delivery for One Way Invocations

In an earlier post I spoke about the threading mode in 10.1.3.4.  In this I explained how one-way invocations that will create new processes are stored in the database and a notification message is placed on an in-memory queue.  The actual message is placed in the INVOKE_MESSAGE table in the orabpel schema.  The notification message placed on the in-memory queue holds a key that can be used to retrieve the message from the INVOKE_MESSAGE table.  The dsp_invoke threads wait for messages to appear on this queue, and as soon as a message is available one of the dsp_invoke threads will remove it from the queue and start executing the process associated with that message.
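The flow described above can be sketched as a small model.  This is only an illustration, not the BPEL engine's code: a dict stands in for the INVOKE_MESSAGE table, a Python queue stands in for the in-memory notification queue, and the names receive_one_way and dsp_invoke_worker are made up for the sketch.

```python
import queue
import threading

# Stand-in for the INVOKE_MESSAGE table: key -> payload (durable in the real system).
invoke_message_table = {}
# Stand-in for the in-memory notification queue: it carries keys only, never payloads.
notification_queue = queue.Queue()
processed = []

def receive_one_way(key, payload):
    """Persist the message first, then enqueue a notification holding only its key."""
    invoke_message_table[key] = payload
    notification_queue.put(key)

def dsp_invoke_worker():
    """Model of a dispatcher thread: wait for a key, load the message, run it."""
    while True:
        key = notification_queue.get()
        if key is None:  # sentinel used only by this sketch to stop the worker
            break
        payload = invoke_message_table.pop(key)  # retrieve and mark as delivered
        processed.append((key, payload))

worker = threading.Thread(target=dsp_invoke_worker)
worker.start()
receive_one_way("msg-1", "order 42")
receive_one_way("msg-2", "order 43")
notification_queue.put(None)
worker.join()
print(processed)
```

The point to notice is that the notification is only a pointer.  The message itself is safe in the table, but nothing will look at it unless a notification for it exists.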

What Could Possibly Go Wrong?

So it all seems pretty straightforward.  Message is stored in database and a notification placed on a queue.  A thread reads the queue and retrieves the message and executes it.  All is well and good except for a couple of things.

Server Shutdown

If the BPEL server is shut down, either as a result of a planned outage or due to some system failure, then any notification messages left in the queue will be lost because it is an in-memory queue and not persisted.  When the server starts again the messages are sat in the INVOKE_MESSAGE table but there is no notification on the queue, so the messages just sit there.
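A minimal sketch of this failure mode, assuming a hypothetical ServerModel class in which the table is handed in from outside (and so survives a restart) while the queue is created fresh every time the server starts:

```python
import queue

class ServerModel:
    """Hypothetical model: the table outlives the process, the queue does not."""

    def __init__(self, durable_table):
        self.table = durable_table          # stands in for INVOKE_MESSAGE
        self.notifications = queue.Queue()  # recreated empty on every start

    def receive(self, key, payload):
        self.table[key] = payload
        self.notifications.put(key)

database = {}
server = ServerModel(database)
server.receive("msg-1", "order 42")

# Simulate a shutdown before any dispatcher thread picked up the notification.
server = ServerModel(database)

stranded = sorted(database)                      # still stored after the restart
queue_is_empty = server.notifications.empty()    # but nothing points at it
print(stranded, queue_is_empty)
```

After the simulated restart the message is still in the table, yet the new queue is empty, so no thread will ever be asked to process it.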

Instance Rollback

Another way to have messages back up in the INVOKE_MESSAGE table is if a rollback occurs while they are being processed.  In this case, if the transaction rolls back to the initial receive then the message stays in the same state (0 meaning unhandled) in the table because the transaction marking it as removed from the table never commits.  In this case you would normally hope to see a faulted instance.  If the fault is transitory then it may be possible to re-execute the process, but there is no notification message on the queue to ask for this to occur.
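The effect of the rollback on the message state can be shown with an in-memory SQLite table.  This is a sketch under assumptions: the table layout and the column names message_key and state are simplified for illustration, and the value used for a handled message is invented; only state 0 meaning unhandled comes from the description above.

```python
import sqlite3

UNHANDLED = 0  # from the text: state 0 means unhandled
HANDLED = 2    # assumed value, used only for this illustration

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoke_message (message_key TEXT PRIMARY KEY, state INTEGER)"
)
conn.execute("INSERT INTO invoke_message VALUES ('msg-1', ?)", (UNHANDLED,))
conn.commit()

def process(key, fail):
    """Mark the message handled and run the process in one transaction."""
    try:
        conn.execute(
            "UPDATE invoke_message SET state = ? WHERE message_key = ?",
            (HANDLED, key),
        )
        if fail:
            raise RuntimeError("transient fault while executing the process")
        conn.commit()
    except RuntimeError:
        conn.rollback()  # the state change is undone along with everything else

def state_of(key):
    row = conn.execute(
        "SELECT state FROM invoke_message WHERE message_key = ?", (key,)
    ).fetchone()
    return row[0]

process("msg-1", fail=True)
state_after_rollback = state_of("msg-1")
process("msg-1", fail=False)
state_after_retry = state_of("msg-1")
print(state_after_rollback, state_after_retry)
```

The first attempt leaves the row exactly as it was, which is why the message looks undelivered.  The second call succeeds only because the sketch calls process again by hand; in the engine, something has to raise a new notification for that retry to happen.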

Recovering Manually

Just like when the secretary loses your name from the queue you need to put it in front of them again, we need to raise a new notification message.  This can be done manually from the BPEL console using the Instance/Recovery tab as shown in the screenshot below.  Select the messages you want to recover and hit the recover button.  This will place a notification message on the queue, requesting that the message be processed.

image

Avoiding Manual Labor

You probably don’t want to spend a lot of time recovering messages manually; surely there must be a way to automate this.  Sure enough there is.  In 10.1.3.4 Oracle introduced the auto-recovery console, which is configured from the Configuration/Auto-Recovery tab of the BPEL console.

image

This allows you to resubmit notification messages automatically.  It consists of two parts, the startup schedule and the recurring schedule.

Dealing with Message Loss Due to Server Downtime

The startup schedule deals with our first notification message loss scenario, when notifications are lost due to server shutdown.  In this case at startup the INVOKE_MESSAGE table will be scanned for undelivered messages.  Undelivered messages have notification messages generated for them in batches of maxMessageRaiseSize.  There will be a delay of subsequentTriggerDelay seconds between batches.  This will continue until either all undelivered messages have a notification message placed on the in-memory queue, or until the startupRecoveryDuration time has been exceeded.  To enable this activity, all that is required is to set startupRecoveryDuration greater than 0.
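The batching behaviour can be sketched as a function that returns the schedule of batches instead of actually sleeping.  The function name is hypothetical, and two details are assumptions rather than documented behaviour: that startupRecoveryDuration is measured in the same unit as subsequentTriggerDelay (seconds), and that no batch is raised once the elapsed time reaches the duration.

```python
def startup_recovery(undelivered, max_message_raise_size,
                     subsequent_trigger_delay, startup_recovery_duration):
    """Return (seconds_after_startup, batch) pairs for the notifications raised."""
    if startup_recovery_duration <= 0 or max_message_raise_size <= 0:
        return []  # startup recovery is switched off
    schedule = []
    elapsed = 0
    remaining = list(undelivered)
    # Assumed boundary: stop as soon as the duration has been used up.
    while remaining and elapsed < startup_recovery_duration:
        batch = remaining[:max_message_raise_size]
        remaining = remaining[max_message_raise_size:]
        schedule.append((elapsed, batch))
        elapsed += subsequent_trigger_delay
    return schedule

messages = ["m1", "m2", "m3", "m4", "m5"]
schedule = startup_recovery(messages, max_message_raise_size=2,
                            subsequent_trigger_delay=300,
                            startup_recovery_duration=600)
disabled = startup_recovery(messages, 2, 300, startup_recovery_duration=0)
print(schedule, disabled)
```

With these illustrative numbers the fifth message is never reached, which shows why the duration needs to be long enough for the backlog you expect after a restart.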

Dealing with Message Loss Due to Intermittent Rollbacks

The recurring schedule deals with our second notification message loss scenario, when notifications are lost due to a transient error causing a process to roll back.  In this case the INVOKE_MESSAGE table is scanned on a regular basis for undelivered messages.

image

Similar to the startup recovery this scheduled recovery will place up to maxMessageRaiseSize notification messages onto the queue every subsequentTriggerDelay seconds.  Unlike the startup scenario this will occur every day between startWindowTime and stopWindowTime, even if the previous check showed no messages to recover.

Note that there is a problem with this in that a message may already have a notification in the queue when the recovery is run, causing the message to be processed twice.  This can be avoided by setting the threshHoldTimeInMinutes to a suitable number of minutes; the default is 10.  This causes the recovery to ignore messages younger than threshHoldTimeInMinutes, giving the BPEL engine time to process the original notification.

We can turn on scheduled recovery by making sure that the stopWindowTime comes after the startWindowTime.
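The age check can be sketched as a simple filter.  This assumes each undelivered message carries a receive timestamp; the function name and the tuple layout are made up for the illustration.

```python
from datetime import datetime, timedelta

def eligible_for_recovery(messages, now, thresh_hold_time_in_minutes=10):
    """Keep only messages at least thresh_hold_time_in_minutes old.

    messages is a list of (key, received_at) tuples.
    """
    cutoff = now - timedelta(minutes=thresh_hold_time_in_minutes)
    return [key for key, received_at in messages if received_at <= cutoff]

now = datetime(2014, 4, 1, 12, 0)
undelivered = [
    ("old", datetime(2014, 4, 1, 11, 30)),    # 30 minutes old: recover it
    ("fresh", datetime(2014, 4, 1, 11, 55)),  # 5 minutes old: leave it alone
]
to_recover = eligible_for_recovery(undelivered, now)
print(to_recover)
```

A young message is skipped on this pass only; if it is still undelivered once it is older than the threshold, a later pass will pick it up.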
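As a rough sketch, the window rule amounts to two checks.  The helper names are hypothetical, and the times are represented as Python time values purely for illustration; how the console actually stores them is not shown here.

```python
from datetime import time

def scheduled_recovery_enabled(start_window_time, stop_window_time):
    """Recurring recovery is on only when the window has a positive length."""
    return stop_window_time > start_window_time

def in_recovery_window(now, start_window_time, stop_window_time):
    """True when recurring recovery should be scanning at the given time of day."""
    return (scheduled_recovery_enabled(start_window_time, stop_window_time)
            and start_window_time <= now < stop_window_time)

start, stop = time(0, 0), time(4, 0)
print(scheduled_recovery_enabled(start, stop))        # window midnight to 04:00
print(in_recovery_window(time(2, 30), start, stop))   # inside the window
print(in_recovery_window(time(5, 0), start, stop))    # outside the window
```

Setting both values to the same time gives a zero-length window, which in this model is the same as switching the recurring schedule off.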

Versions

The auto-recovery feature first appeared in release 10.1.3.4.  The threshHoldTimeInMinutes property was added in 10.1.3.5 and also in 10.1.3.4 MLR#8.

Summary

I strongly recommend that you configure the startup schedule auto-recovery feature as it will ensure that all messages get at least one chance to be delivered.  If you suffer from intermittent process rollbacks due to transient errors then you will also benefit from the recurring scheduled auto-recovery feature.  But be careful, because you might just keep resubmitting messages that can never be recovered, and over time your system will spend more of its time trying to reschedule unprocessable messages than doing real work.

So remember that recurring schedule auto-recovery is very powerful, but with great power comes great responsibility.  Something that secretaries in doctors' surgeries and at the Dept of Motor Vehicles know and exploit!

About

Musings on Fusion Middleware and SOA.  Antony works with customers across the US and Canada in implementing SOA and other Fusion Middleware solutions.  Antony is the co-author of the SOA Suite 11g Developers Cookbook, the SOA Suite 11g Developers Guide and the SOA Suite Developers Guide.