Workflow performance case study: Dont Repeat History, Learn from it
By gaurav.verma on Mar 01, 2008
A Sanskrit Shlok underlying the importance of careful planning
PrefaceThey say: "Those who do not know history are condemned to repeat it."
But why repeat history if you dont have to?
Many questions come to light on this topic of discussion:
- Is there a better way of designing conditional implementation in workflow?
- What are the merits/demerits of repeated checking of a particular condition as opposed to executing something when the actual condition happens? What are the performance implications?
- Are there workflow APIs available which simulate event based subscription model?
- What are the performance advantages of using the basic Workflow APIs that simulate this event based subscription model?
The intended audience is workflow designers, support personnel and Oracle Apps DBA alike. This case study can potentially help users to design their workflow more optimally and succintly, which would have a minimal impact on the execution. The I/O profile and time taken for Workflow Background Process bears out all this information. As far as possible, trends have been depicted from the customer's Production system.
The Business Scenario
The business requirement to be implemented was during the ATP (Available to Promise) phase in the life of an Order line. ATP simply means that we can guarentee the customer that a particular Item can be made available from a particular warehouse on a particular date.
When scheduling an Order Line, if a financial hold existed on the Order line due to some reason, it needed to be unschedule. Also, as soon as the financial hold was released by the Accounts Receivables department, the requirement was to proceed with the remaining business flow in the OEOL workflow design.
How It Was Designed
This is how it was designed:
Let us translate the flow:
- When the WF background program executes custom ATP process, it checks if the order is on hold or not.
- If the order header or one of the lines is on hold the custom activity "Unschedule Lines On Hold" will unschedule all those lines and waits for an hour before checking the line status again.
- This loop repeats until all holds on the header/line are released.
Checking its Node Attributes tab:
Now, checking its actual timeout value:
So, the bottomline is that the WAIT activity was timing out after every 15 mins (60 x 0.0416 ~ 15 mins). Just imagine the volume of OEOL item keys at this activity on any given hour. What kind of a load would they be putting on the system.
Now, is that an intelligent way of designing the workflow for this requirement?
Some shocking revelations..A little analysis of the volume in WF_ITEM_ACTIVITY_STATUSES_H (history) table showed something shocking (!!):
The XXONT_CUSTOM_ATP->WAIT step was generating the highest history records. In this case, it had created 27 million rows (75% of the entire volume) out of 35 millions rows in in WF_ITEM_ACTIVITY_STATUSES_H table.
PS: This is one of the reasons as to why this case study is very interesting: simply because its impact was tremendous.
How are history records created?
Lets clear up some air and get our basics right about how and when are history records created.
When the workflow engine encounters an activity which has a status of COMPLETE, it moves (figuratively) that record from WF_ITEM_ACTIVITY_STATUSES to WF_ITEM_ACTIVITY_STATUSES_H and creates a new record in WF_ITEM_ACTIVITY_STATUSES with status ACTIVE.
Therefore, the more times a COMPLETed activity is re-visited, the more history records for that activity are created. This can happen when a loop having a WAIT activity with a short timeout is designed. The more the workflow background process flows through the activities of such a loop, the more records will get created in WF_ITEM_ACTIVITY_STATUSES_H.
The performance implications...
Lets try to think as to what are the other implications of such poor design?
- The system keeps looping on idle timeout activity every 15 mins
- Immediately after the timeout completes, the activity becomes eligible to be processed by the workflow.
- The next activity (check if hold has been released) checks if the hold has been released?
- Statistically speaking, as the probability of a financial hold being released is not immediate, the control goes back to the WAIT activity, wherein the cycle repeats
- Every 15 mins, the backlog (timedout or deferred) count of unprocessed OEOL item type bloats
- Due to such frequent time-out activities, other more important processes/activities in the OEOL workflow are not executed in a realistic timeframe
- Valuable system resource are spent on scheduling and executing the loop's activities on the Production machine
- The OS on Production machine spends vital I/O resources in servicing these application calls
This is like a vicious cycle. The more data you have in in WF_ITEM_ACTIVITY_STATUSES and WF_ITEM_ACTIVITY_STATUSES_H tables, the more time it takes to query and insert into them (also, the more indexes you have, the more time it would take to delete/update/insert due to automatic index re-organization effort required). It just kills the online performance at times.
The following diagram bears out the above ideas. It shows the top activity (in terms of number of item_keys) in the breakup of OEOL backlog. It can be seen that, without fail, XXONT_CUSTOM_ATP was the top wait activity in the OEOL backlog. Not only was it a liablity to itself, it was also a liability to other legitimate activites perfectly eligible to be processed:
Design Considerations: The Danger Of Introducing Infinite Loops
There is another big danger here: that of introducing infinite loop in the execution of Workflow Background Process (also referred to as FNDWFBG) for OEOL.
Let us assume that there is a standard WAIT activity with a very short timeout/wait time in a loop with a condition checking activity. When the condition is checked, the activity will be deferred or timed out for a very short time.
Meanwhile, Workflow Background Process would be busy processing all other deferred or timed out activities. If FNDWFBG is not done processing other deferred activities by the time the wait/time-out time of previous activities expires, the RE-timedout or deferred activities will be enqueued and re-processed AGAIN by the very same FNDWFBG (Workflow Background Process for say item_type=OEOL/OEOH) job over and over, as long as the condition (e.g "line on hold") is met.
This creates an infinite loop, so FNDWFBG process almost never completes, eating up CPU time, creating massive DBMS archive logs (if turned on), and pumping rows into the WF activity statuses table. This would also create unnecessary rows in the WF_ITEM_ACTIVITY_STATUS_H table, which can be nemesis of online order entry performance in the 11i EBusiness Suite later on.
It is a good idea for workflow activities in custom workflows to have Wait/Timeout time greater than the longest normal execution time of Workflow Background process.
Also, a workflow activity having a near zero Relative (Wait) Time, or Wait Mode stated as Absolute Date, but no Absolute Date specified, which is equivalent to zero Relative Wait Time can cause the same issue (infinite loop) to occur for FNDWFBG (Workflow Background Process for say item_type=OEOL/OEOH) runs with Process Deferred = Yes.
Hence, these poor design practices should be avoided, because, remember, once a runtime for a particular WF definition gets initiated and even if the workflow definition is corrected later and uploaded, the previous runtime data will not be automatically corrected. So its prudent to take in good design considerations before letting your production data run on it.
More design considerations..
Whenever possible, standard Wait activity ( WF_STANDARD.WAIT() ) should not be placed in a loop with a small wait time (shorter than several hours). Alternative workflow designs should be evaluated.
WF_STANDARD.WAIT() with wait time of zero or near zero, or standard Defer activity (WF_STANDARD.DEFER()), should be NEVER be placed in a loop in a workflow process design diagram.
Extending the Wait time should only be used if this solution is not acceptable for a very good reason.
The Superior Design Alternative....
For the above business scenario, the best solution is not to use Wait activity in a loop at all. A much more attractive option is to use published Workflow blocking/contnuation APIs like WF_STANDARD.BLOCK() or OE_STANDARD_WF.STANDARD_BLOCK() should be used for stopping processing flow of a workflow item_key. WF_ENGINE.COMPLETEACTIVITY() should be used for resuming processing flow of a workflow item_key.
An additional advantage of using OE_STANDARD_WF.STANDARD_BLOCK is that any OEOL workflow item which is blocked on OE_STANDARD_WF.STANDARD_BLOCK can also be progressible through Order Entry form via "Actions" button Progress Order option. This gives more flexibility to the user/designer.
If one wants to progress the flow of a workflow item programatically, WF_ENGINE.COMPLETEACTIVITY() API can be called. This API notifies the Workflow Engine that the specified activity has been completed for a particular item. It signals the Workflow Engine that an asynchronous activity has been completed. This procedure requires that the activity currently has a status of 'Notified' (Internal status equivalent to BLOCKED). An optional activity completion result can also be passed. The result can determine what transition the process takes next.
/* Complete the BLOCK activity which is used in multiple
* subprocesses in parallel splits. */
To enhance the design AND achieve the same business functionality, we need to replace the WAIT activity with a BLOCK activity which will wait till the last hold is released from the order header/line without causing any looping. The COMPLETEACTIVITY API would be called from the Financial Hold release form at Order Header level.
Its as simple as that. The new design would look like:
Seeing the Performance Benefits in Production..
Adopting this approach will create only 1 record in WF_ITEM_ACTIVITY_STATUSES table and 2 records in WF_ITEM_ACTIVITY_STATUSES_H table.
Contrast this against continuous looping till a condition is met: for each execution of the WF logic, records are created in WF_ITEM_ACTIVITY_STATUS* tables.
As a result of this change, the landscape of OEOL backlog also changed drastically.
The average runtime of the Workflow background Process for OEOL also came down consistently.
Summary of Learnings
- It pays to know the published Workflow APIs. The Workflow Developer Guide is a good place to start. Spend time reading the fine manual. More information can never hurt.
- It pays to spend time and effort in evaluating re-design of loops in workflow design early, especially since once initiated, a specific version of workflow runtime data does not automatically get updated after a newer version of workflow design is uploaded (through WFLOAD)
- Poor design involving WAIT and poll activities in a loop can cause infinitely running Workflow Background Process (FNDWFBG)
- Short-timeout activities in loops should be replaced with OE_STANDARD_WF.STANDARD_BLOCK or WF_STANDARD.BLOCK APIs and WF_ENGINE.COMPLETEACTIVITY APIs
- Simple design changes can bring HUGE performance benefits, especially when a lot of volume is involved
- One thing which was also learnt in hindsight was that the Development/Testing team needs to be extremely careful to account for all instantiated financial Holds. We ran into a problem with the removal of the loop where a certain set of financial holds were not being completed. For this set of scenarios (automatically released holds not going through the quick order form) we created a custom concurrent program to identify the blocks and call the WF_ENGINE.COMPLETEACTIVITY API.
Simply put, Know your facts, Be smart and Dont repeat history if you dont have to. Above all, do you testing well.
Suggested Related Readings
When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization --
Posted on 17 May 2007; 12:09:31 PM
A Performance Case Study: Taming the Workflow Purge --
Posted on 21 March 2007; 6:58:11 PM
We're Skewed! Using 10053 event for being Histogram-wise --
Posted on 27 March 2007; 3:02:33 PM