When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization

small things put together can achieve great things.gif:

Foreword

Order management workflow interacts with a variety of components like core workflow, order management and pricing application call interfaces.

Customizing it without understanding the implications can be a double edged sword and the nature of customization may unwittingly worsen the situation. An undesirable combination of the aspects can cause a very difficult to diagnose and troubleshoot situation as shall be demonstrated in this case study.

This article is essentially the distilled wisdom of severity 1 situation at a client site who was unable to ship ordered items to their customers since OM order line workflow background process had a severe performance issue. The business impact was tremendous: $ 1.5 to 2 million worth of shipping and invoicing was being prevented.

Many a times, performance troubleshooting involves having some functional insight as well. TKPROF and 10046 trace is not the panacea for all performance issues. In this case, some insight into the nature of Order Management APIs was also required for getting traction on the performance problem. A painful discovery path followed, riddled with bumps and insightful discoveries. The troubleshooting approach used was really out of the box and involved breaking down the code being executed and using logical deduction.

The generic learnings from the ordeal are presented in this case study. It is hoped that the learnings will help oracle applications user community to be concious of hidden implications of customizing OM workflow.

Feeling the pinch...

oeol high backlog image.JPG:
The client had heavily customized their order management workflow during an implementation of Order to cash business cycle of Oracle applications 11i.

Immediately after the April 2007 financial quarter end, the OEOL workflow background process started exhibiting extremely long execution times, such that the backlog of unprocessed OEOLs came to ~150,000. This was a sharp contrast from the normal processing throughput.

This was a moving problem. The hanging of OEOL workflow background process was quite apparenly due to a 'few' order lines, belonging to a few orders. Once the 'problematic' orders' OEOL workflow were either CANCELLED or ERRORED, the backlog started clearing at a slow pace. This went on for a little while, until similar Order lines OEOL workflow seemed to take inordinately long to process and held up the entire workflow queue.

Coming to grips with it..

As a knee jerk approach, many things were tried, including killing the currently running workflow background process and re-submitting it. The clincher was that once or twice the workflow background process was killed , it marked the OEOL ITEM_KEY being processed into ERROR state and hence took it out of the deferred queue. This forced the Workflow background process to consider the next deferred OEOL item_keys in line.

In ~50% of the cases, it would work and the newly thread would start exhibiting a more acceptable throughput. But after a certain point, even killing and re-submitting workflow background process was simply not working. It did seem that the logic being executed by FNDWFBG was taking extremely long for processing one order line.

The observation was baffling. The following query was used to find out approximately which order line was a single thread of workflow background:

Query 1:

SQL> select * from (select to_char(enq_time,'dd-mon-rr hh24:mi') , 
     user_data


from wf_deferred_table_m where state = 0 and corrid = 'APPSOEOL'


order by priority, enq_time) where rownum < 5;


TO_CHAR(ENQ_TIM USER_DATA(ITEMTYPE, ITEMKEY, ACTID, FUNCTION_NAME, PARAM_LIST, RESULT)


----------------------------------------------------------------------


02-apr-07 14:25 WF_PAYLOAD_T('OEOL', '30273337', 141255, NULL, NULL, NULL)


02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458068', 140850, NULL, NULL, NULL)


02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458075', 140850, NULL, NULL, NULL)


02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458089', 140850, NULL, NULL, NULL)


4 rows selected.


Tying it back to the order management tables, the following query tells us order level details like order_number, line_number, ordered_item etc, which is even more useful:

Query 2:

select distinct oh.org_id, oh.order_number,
ol.line_number, ol.shipment_number,              
ol.option_number, ol.ordered_item, ol.line_id, wiash.begin_date,
priority, wdtm.enq_time
from applsys.wf_deferred_table_m wdtm,
    applsys.wf_process_activities wpa,
     applsys.wf_item_activity_statuses wiash,
     apps.oe_order_lines_all ol,
     apps.oe_order_headers_all oh
where substr (wdtm.corrid,5) = 'OEOL'
and (wdtm.delay < sysdate or wdtm.delay is null)
and wpa.instance_id=wdtm.user_data.actid
and wiash.item_type=wdtm.user_data.itemtype
and wiash.item_key=wdtm.user_data.itemkey
and wiash.process_activity=wdtm.user_data.actid
and wiash.item_key = ol.line_id
and oh.header_id = ol.header_id
and wdtm.state = 0
group by
oh.org_id, oh.order_number, ol.line_number, ol.shipment_number,
ol.option_number,
ol.ordered_item, ol.line_id, wiash.begin_date, priority, wdtm.enq_time
order by priority, wdtm.enq_time ;

Observing the output over sometime gave the capability of identifying if workflow background process was stuck on a specific OEOL workflow.

Limited insight by conventional analysis..

The conventional debugging approaches provided limited insight into the root cause of the issue.

10046 trace could not bring out the drill down of time spent on pl/sql API calls clearly. In addition, there was an OS file limit on the trace file on most production servers, even if the max_dump_file_size=unlimited is used.

Another option was to use dbms_profiler API on FNDWFBG call. It could have worked, except the FNDWFBG API call never completed, so the data could never be collected.

Searching the metalink gives a host of notes for getting OM debugs (level 5), Pricing debug etc.

1. Note 255040.1:bde_wf_data.sql - Profiling Workflow Runtime Data related

2. Note 163208.1:bde_last_analyzed.sql - Verifies Statistics for all

3. Note 353991.1:OMSuiteDataChk.sql and OMSuiteDataChk11i10.sql


This is all good information, except that there can also be too much of un-meaningful and un-directed information. Interpreting detailed application debug logs can be useful when it is known what should be checked.

Taking the whitebox approach....

Lot of debug/trace information could be generated, but it was probably best to take the FNDWFBG execution apart in slow motion. Once the execution model has been understood, more brainstorming using divide and conquer strategy is effective.

An alternate way of looking at FNDWFBG is that it is nothing but a serial simulator of steps outlined in the oracle workflow designer.

compressed OM workflow from start fulfillment.JPG:

A whitebox (a software engineering term, as opposed to blackbox) approach was the need of the hour, coupled with some simple thinking. Using the workflow status monitor from Workflow Administrator responsbility in Oracle Applications 11i, the step at which an OEOL is currently at can be seen.

compressed workflow monitor output for a suspended OEOL workflow.JPG:

This may also be verified by the output of $FND_TOP/sql/wfstat.sql. Combining the observations from Query 2 above and drilling down via Workflow Monitor, it was clear that almost all the lines having performance issue were at Fulfill-Deferred step, which means that if FNDWFBG were to pick them up, it would then execute the next activity in line, which was Fulfill step.

Note: A quick workaround was to 1) Kill the FNDWFBG process 2) SUSPEND (this status avoids any processing by workflow) the problematic OEOL workflow via workflow monitor and 3) resubmit the FNDWFBG process.

Irreconcilable paradoxes of a Test case..

We really needed to analyze the drill down on the bad test case.

From the online examination of sqls being executed by FNDWFBG session, it was seen that a lot of time was being spent on XXONT Cascade Terms and Agreement, multiple times for apparently processing one OEOL ITEM_KEY. This was a really paradoxical observation and did not make any sense.

The only acid test of this was to call OE_FULFILL_WF.START_FULFILLMENT for specific OEOL line id or ITEM_KEY. The script run_start_fulfillment_with_om_debug.sql (Listing in Appendix A) was used to simulate a single execution of Fulfill step, which in turn called OE_FULFILL_WF.START_FULFILLMENT and did a rollback at the end. One such case ran for ~6 hours and finally finished.

Most surprisingly, an examination of raw trace file did show that sub-sequent steps like XXONT Cascade Terms and Agreement and Invoice Interface- Line had indeed been called multiple times. Therefore, it seemed that a single call to OE_FULFILL_WF.START_FULFILLMENT for one kind of OEOL was triggering off sequential execution of OM workflow for other OEOLs in the same Order, which were somehow related to the order line in question.

$ grep -i xxont_cascade_pkg.set_ln_trm_agr_proc prod_ora_677526.trc 
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end; 
value="XXONT_CASCADE_PKG"
value="XXONT_CASCADE_PKG"

This simply did not make any sense because Fulfill and XXONT Cascade Terms and Agreement step were discrete steps, as verfiable from viewing the OM workflow diagram flow above.

Another very interesting and puzzling observation was that without fail, the top 2 sql calls in TKPROF output (raw trace sorted with sort=prsela,exeela,fchela) showed that ~45% of the time was spent on OE_FULFILL_WF.START_FULFILLMENT API and ~45% of it was spent on the call to xxont_cascade_pkg.set_ln_trm_agr_proc API.

"tkprof output for 10046 of start_fulfillment API call"

Resolving the paradox...

The answer, which is difficult to digest, does make sense. Order management has unique processing requirements for configured ordered items created using Oracle Configurator module. Since OE_FULFILL_WF.START_FULFILLMENT is a workflow aware API (notice the _WF in its name), it was obviously calling more Workflow routines inside it to progress the business flow to next logical set of activities.

We first need to understand the concept of Fulfillment set. A fulfillment set is a set of related order lines that represent a configurable item. Whenever a fulfillment set pr a model is ready to fulfill or get built in production floor - i.e., all members' OEOL workflow are internally in FULFILL_LINE:Notified (blocked) status. The last member reaching this activity signals completion of FULFILL_LINE activities of all other members before proceeding itself. Since the XXONT Cascade Terms and Agreement step comes right after FULFILL_LINE step, XXONT Cascade Terms and Agreement step will be executed for all members of the current model or fulfillment set in that one go.

The relation between other lines whose OM workflow was being resumed and the problematic line in question could be found using the following sql:

Select top_model_line_id from oe_order_lines_all where line_id=<problematic_line>;

This query defines a FULFILLMENT SET (a set of lines which are related together

and constitute an orderable item):

Select line_id from oe_order_lines_all where top_model_line_id=<top_model_line_id_from_above_query>;

The user first configures the components of a confgurable product/item and then the item is actually built. Depending on the kind of configured model (Assemble to order -- ATO or Pick to order -- PTO), the oe_order_lines_all.flow_status_code (order management workflow status) stays at either SHIPPED or BOOKED status or becomes AWAITING_FULFILLMENT (internally blocked).

Here is an example of this intermediate stage of processing for different kind of configured Items (ATO/PTO/Stock):

compressed sql output of an example of configured Item.JPG:

Now, since a lot of time was being spent on XXONT Cascade Terms and Agreement step, After reviewing the code in the package XXONT_CASCADE_PKG.pkb, it could be seen that OE_ORDER_PUB.PROCESS_ORDER() API was being called to sync up the terms and agreements across Order header and line in question.

call sequence of xxont_cascade_pkg.GIF:

There were some penalties for doing this: the payment term and agreement fields on the order line triggered repricing of the order lines. Also, for especially large orders (~4000 lines) with mostly mis-matched payment terms and agreement lines, repeated calls to OE_ORDER_PUB.PROCESS_ORDER() were causing a huge overhead.

The reason why the timings for OE_FULFILL_WF.START_FULFILLMENT() showed high in the trace files was that FULFILL_LINE step syncronizes accross all components in a configurable model or fulfillment set. The following output from a workflow query in Appendix B lists out the time recorded by workflow for executing XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE step:

complexity of the start_fulfillment api call.GIF:

Visualizing it in slow motion...

This means that, whenever a fulfillment set pr a model is ready to fulfill - i.e., all members OEOL workflow is internally in FULFILL_LINE:Notified (blocked) status, the last member reaching this activity will complete FULFILL_LINE activities of all other members before proceeding itself.

Since the Cascading activity comes right after FULFILL_LINE, it means that it will be executed for all members of the current model or fulfillment set in that one go. The aggregate timings for completing all other related OEOL workflows in the same fulfillment set goes into the timing recorded for OE_FULFILL_WF.START_FULFILLMENT() step.

The following diagrams attempt to represent the dynamics more clearly:

BEFORE calling OE_FULFILL_WF.START_FULFILLMENT() step:

before - status of oeol fulfillment set.GIF:

AFTER calling OE_FULFILL_WF.START_FULFILLMENT() step:

After - status of oeol fulfillment set.GIF:

Projecting the poor performance..

Based on the situation described above, a simple projection of poor performance can be done for a particular kind of order. Lets assume that there are 100 lines in a configurable model/Item. Also, lets assume that for 75% of the order lines, payment terms or/and agreements do not match with those in the Header. For each mis-matched (terms and agreements - with header) line, the time taken for pricing varied from 150 seconds to 750 seconds.

When we calcualte the cumulative time taken for executing xxont_cascade_pkg.set_ln_trm_agr_proc API for about 75% mis-matched lines of 100 lines, with ~5 mins taken for each line, the overall time taken comes to 75 x 5 mins ~ 8 hrs. That's the time for processing OE_FULFILL_WF.START_FULFILLMENT() step for one model. That was exactly the performance problem being experienced.

Now, just imagine the performance strain if the order had 10 such configurable items.

Mathmatically speaking:

complexity of the start_fulfillment api call.GIF:

The Solution: Simplify by taking the Highest Common Factor...

There were two important dimensions to the solution:

There was a bug in 11.5.9 Order management module which prevented the cascading order header changes to order lines from working properly (Profile option OM: Sales Order Form: Cascade Header Changes to Line should be Yes). This was prompting the calls to OE_ORDER_PUB.PROCESS_LINE() APIs.

Patch 5981436 is to be delivered by Oracle order management development for resolving this issue. This fix is also made part of 5950291 -11.5.9 OM CUMULATIVE PATCH. This would automatically match the payment terms and agreements from Order Header to order lines for majority of the cases, except if the user purposely chose to say No to the cascade option.

Another smart option was to do batch processing of all the mis-matched component lines of a configured model in ONE call to PROCESS_LINE() API by passing them as a pl/sql table argument. Also, this was to be done ONLY if the order line in question was for a Shippable Item. This was especially beneficial since only 1-5% of sub-components in a configured model are shippable. So that made it quite restrictive. This single change took out 95-97% of the repetitive work being done by OE_ORDER_PUB.PROCESS_ORDER() calls.

       --
       -- Calling API to update Terms and/or Agreement value(s) on the order line
       --
       oe_order_pub.process_line (p_line_tbl      => l_line_tbl,
                                  x_line_out_tbl  => o_line_tbl,
                                  x_return_status => l_return_status,
                                  x_msg_count     => l_msg_count,
                                  x_msg_data      => l_msg_data);

Seeing the performance Boost...

Before putting in the code change:

For a fulfillment set with top_model_line_id=9905479, there were 59 mis-matched lines, and the avg time for processing one call of process_line() was ~ 580 seconds.

before the code change.JPG:

After putting in the code change:

The time taken to same 59 mis-matched lines (of the same fulfillment set with top_model_line_id=29905479) in one pls/sql table (as an argument) in process_line() API call was ~ 625 seconds.

after the code change.JPG:

Putting statspack data into perspective...

This also explained why statspack reports showed xxont_cascade_pkg.set_ln_trm_agr_proc as one of  the major I/O offenders per execution:

statspack snippet for xxont_cascade_pkg api.JPG:

Summary of learnings from the Case study..


  • Don't have expensive custom activities defined after START_FULFILLMENT step in OM workflow. It will pronounce the performance hit many times, especially when Oracle configurator and order management modules are being used in tamdem (Better, don't  have expensive workflow custom activities at all)
  • Batch processing should be over piece-meal processing, especially when an API has provision for the same. This reduces most of the repetitive processing
  • The whitebox (drill down) approach works for taking apart a baffling performance problem. Trying to simulate a test case usually leads to the heart of the performance issue in a more reliable way
  • Getting extensive debugs and traces is great, but only when it is known what is being looked for. Asking the right probing questions is very important. Question, question, question.
  • A well though out plan with minimal tracing and using a drill down approach, can bring better results than a shot gun or blunderbuss approach
  • Sometimes, a high level functional knowledge of the processing being done can be very useful in understanding the nature of problem. A balance between strict technical and pure functional knowledge can be fruitful towards solving performance issues

TRIVIA: Why was the customization put in the first place?

Cascading is the process in which values from the sales order header level automatically populate to the corresponding fields at the sales order line level.

Due to a bug in 11.5.9, Agreement field at Header level was not populating to the child lines of a configuration.

Due to this, the autoinvoicing program split lines into separate invoices by agreement. There were thousands of invoices where the components were being invoicing separately from the top model. It was an invoicing fiasco.

Appendix A:

sandbox:dev> more run_start_fulfillment_with_om_debug.sql

var resultout varchar2(1000)
set serveroutput on time on timing on

prompt
prompt setting userid, responsbility and application id for JHASLAGE FOR order management super user
prompt

define user_id=7962
define resp_id=52560
define appl_id=660

define item_key=29904309
define activity_id=141236

declare

l_debug_file varchar2(400);
l_org_id NUMBER;
l_count NUMBER;
l_result VARCHAR2(30);
l_spid varchar2(20);

begin

fnd_global.apps_initialize(&user_id,&resp_id,&appl_id);

select p.spid
into l_spid
from v$process p, v$session s
where s.audsid=userenv('sessionid')
and paddr = addr;

OE_Standard_WF.OEOL_SELECTOR
(p_itemtype => 'OEOL'
,p_itemkey => '&&item_key'
,p_actid => 12345
,p_funcmode => 'SET_CTX'
,p_result => l_result
);

oe_debug_pub.debug_off;

--enable OM debug
oe_debug_pub.debug_on;
oe_debug_pub.initialize;
l_debug_file := OE_DEBUG_PUB.Set_Debug_Mode('FILE');
oe_debug_pub.setdebuglevel(5);

dbms_output.put_line('Debug file is : '|| l_debug_file);

--enable trace
EXECUTE IMMEDIATE ('alter session set max_dump_file_size=unlimited');
EXECUTE IMMEDIATE ('alter session set tracefile_identifier=''gverma_start_fulfillment_with_om_debug_&&item_key''');
EXECUTE IMMEDIATE ('ALTER SESSION SET EVENTS ''10046 TRACE NAME CONTEXT FOREVER, LEVEL 12''');

OE_FULFILL_WF.START_FULFILLMENT('OEOL','&&item_key',&&activity_id,'RUN',:resultout);

--disable trace
EXECUTE IMMEDIATE ('ALTER SESSION SET EVENTS ''10046 TRACE NAME CONTEXT OFF''');

--turn debug off
oe_debug_pub.debug_off;
dbms_output.put_line('File name '||OE_DEBUG_PUB.G_DIR||'/'||OE_DEBUG_PUB.G_FILE);
dbms_output.put_line('Trace file generated for this request:'||'/*'||l_spid||'*.trc');

rollback;

EXCEPTION WHEN OTHERS THEN

dbms_output.put_line('reached error in script ' ||sqlerrm);

rollback;
end;
/

select :resultout from dual
/

Appendix B:

col ITEM_KEY for a10 Head "Item Key"

col BEGIN_DATE for a18 head "Begin Date"


col END_DATE for a18 head "End Date"


col ROOT_ACTIVITY for a30 head "Root Activity"


col RESULT for a15


col STATUS for a8 Head Status


col time_taken_mins for 99999 head "Time(Mins)"


col step format a60

set pages 100 time on timing on lines 150

spool get_details_for_wf_step

select ias.item_key,

to_char(ias.begin_date,'DD-MON-RR HH24:MI:SS') begin_date, to_char(ias.end_date,'DD-MON-RR HH24:MI:SS') end_date,


(ias.end_date-ias.begin_date)*24*60 time_taken_mins,


pa.process_name||'->'||pa.instance_label step,


ias.activity_status Status,


ias.activity_result_code Result


from wf_item_activity_statuses ias,


wf_process_activities pa


where ias.item_type = 'OEOL'


and ias.process_activity = pa.instance_id


and ias.begin_date >= sysdate-14


and pa.process_name||'->'||pa.instance_label = 'XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE'


and ias.activity_status = 'COMPLETE'


--


-- and the step was not cancelled


--


and ias.activity_result_code != '#FORCE'


--


-- it took more than so many mins to complete this step


--


and (ias.end_date-ias.begin_date)*24*60 > 30


order by (ias.end_date-ias.begin_date)*24*60 desc, ias.begin_date

/

spool off

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

bocadmin_ww

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today