« February 2007 | Main | April 2007 »

March 2007 Archives

March 12, 2007

Best Practices For Custom Order Entry Workflow Design

Customize, Customize...

Every organization's way of doing business is unique and this prompts customers to extend the seeded Oracle supplied order management workflow definition.

Unfortunately, there is a lack of authoritative documents on following best practices for workflow customization. In the absence of such documents, customization practices vary and become person dependent.

The Caveats..

A recently published  Note 402144.1 -- FAQ: Best Practices For Custom Order Entry Workflow Design can help you change all that. As the name suggests, it enumerates best practices which can help you manage your workflow design risks well ahead in advance.

Examples..

Some example FAQs answered are:

Question 6: How is an infinite loop situation created for FNDWFBG concurrent program?

Question 7: How can we avoid Infinite loop situations while executing workflow activities in the background?

...

Related suggested readings..

The experience has been that extensions and customizations  sometimes lead to imperfect and faulty workflow designs that cause problems, errors, and even data corruption which is not readily apparent and can only be corrected by data fix scripts provided by Oracle Support or, even more often, Oracle Development.

To help OM customers evaluate and validate their proposed workflow extensions and customizations before they are used in order processing, Development is introducing functionality called "OM Workflow Validation".

March 17, 2007

Lesser known method of mass retrying ERRORed workflow items through Oracle Applications Manager

Using Oracle Application Manager Workflow User Interface optimally for Mass re-trying ERRORed Activities..

After hitting some system wide exceptions or improper workflow background concurrent process improper termination, we may need to mass retry several different kind of workflow item types.

The Oracle Application Manager UI provides a lot of possibilities into executing mass Retry and Purge actions on different kind of workflow items. Lets talk about a lesser known method of mass retrying ERROR'ed workflow items.

After clicking on Workflow Administrator Web Applications responsbility, click on "Workflow" menu item. Click on "Site Map" tab to get the following screen:

first.JPG:

Click on the "Work Item Metrics" link under Workflow heading:

second.JPG:

Click on Work Items under the Throughput category at the bottom right

By default, you get this screen which shows all COMPLETED status workflow items:

smaller mass wf retry screen1.JPG:

Select the drop down box on the top right corner to show Errored Activities:

smaller mass wf retry screen2.JPG:

Select a particular kind of Activity which needs to be re-tried:

smaller mass wf retry screen3.JPG:

Now that ALL the errored items for a particular workflow Activity are listed, you can click on the Retry All button at the top of the output to do a mass re-try of them:

Alternately, you can also do an Abort All action on these items.

smaller mass wf retry screen4.JPG:


How to Detect Data Corruption and Purge More Eligible OEOH/OEOL Workflow Items for Order Management Workflow

                                                  looking closely:


The Possibility of Order Management Customizations causing huge non-purgable Workflow data volume in 11i Apps


Often customers have to customize the seeded Order management workflow to tailor it to their business needs. It is not un-common to open service requests with Oracle Support to diagnose huge order management workflow data volume in WF_ITEM_ACTIVITY_STATUSES and WF_ITEM_ACTIVITY_STATUSES_H tables.

If not enough focus is invested in maintaining and developing customizations, it can easily lead to difficult to diagnose problems.


It may get to a point where one may ask - Why does not the data purge automatically? Does the purge even work?


The Caveats...


Interestingly, following non-standard customization practices for Oracle Order management workflow  and  having a custom mechanism of retrying ERRORed workflow activities can and will lead to data corruption between Order management and OEOH/OEOL workflow items. A recently published  Note 402144.1 -- FAQ: Best Practices For Custom Order Entry Workflow Design will help understand the caveats and common pitfalls.


Data corruption may be defined as a situation wherein order management status is out of sync with OEOH or OEOL workflow item status. This leads to a lot of non-purgeable workflow data, confusion in issue recognization, lengthy and effort taking data fix TARs and Bugs, follow-up etc.


 


More Options.. Additional Big Picture Data Mining Whitepaper


 


How would you like to see the big picture of Order Management and corresponding Workflow information? In a nutshell, if we do not know what we are looking for, it can also be an un-necessary waste of everyone's time and energy. 


The recently published white paper (Note 405275.1) presents a high level data mining approach to Oracle community for identifying and debugging Oracle Management Workflow data corruption issues in addition to the BDE  (Bugs, Diagnostics and Escalations committee) debugging scripts from Note:183643.1.


What  is the New Approach in this Whitepaper ?


Through a data mining approach, it is attempted to provide some new tables and data analysis scripts (which may be further extended). This white paper essentially  helps identify the nature of data volume which is not getting purged and evaluate the additional hidden purge potential of OEOH/OEOL workflow items.

 This summarized tool can be very useful for Support Analysts, Users and Development personnel to find out the extent of issue at hand or the damage done due to customizations in a particular release. Extended reports can be developed at will to dig deeper or look at data from a different view.


They can be used as a supplement to the standard Workflow debugging and troubleshooting scripts available on metalink.


A quick preview...


An example is the following report which shows the extra hidden potential in OM workflow data which CAN get purged out of WF_ITEM_ACTIVITY_STATUSES and WF_ITEM_ACTIVITY_STATUSES_H:


q11 output:


Through reports and extra aggregate tables, the user gains a much clearer high level integrated view of Order management and Workflow, and has a better understanding of the nature of data volume which is not getting purged.


This approach should be followed for exercising a focused analysis of the order management workflow data volume.


This whitepaper, does not, however, give the tools for fixing the data corruption.


Innovate and Reap the Benefits..


More reports can be designed and explored to suit the user's criteria. They would only limited by imagination and cretivity.


 


A good example would be a WHAT-IF report .E.g. How much OEOL Workflow volume is out there waiting to be purged, IF we completed processing for Order lines in AWAITING_SHIPPING or AWAITING_RETURN status for Orders which more than 3 months old and are 90-95% closed as per Order Management status?


Additional Readings Suggested..


It is strongly suggested to read a recently published  Note 402144.1 -- FAQ: Best Practices For Custom Order Entry Workflow Design to understand the caveats and common pitfalls while undertaking workflow customizations.


Additionally, please check Note 398822.1 regularly for knowing about the latest and greatest data fix patches (and preventive patches, if identified) for different OM data corruption scenarios.

March 21, 2007

A Performance Case Study: Taming the Workflow Purge

                                                       investigate.JPG:


Preface


What happens when the oracle buffer cache and SGA is simply over-flooded with I/O, which may not be required? Not only does the system underperform due to I/O contention, buffer busy waits for other batch concurrent programs, but the actual database machine's I/O subsystem is not able to do perform upto its capacity, because there is a tremendous bottleneck at the data buffer cache section of the SGA.


In this case study, we shall see how somehow studying the code's logic and the overall affect on the I/O performance of the system can be hugely beneficial.


This case study is also paradoxical in the sense that reducing the I/O volume caused by one major contender caused more I/O to be done in lesser time by other batch concurrent programs and RMAN backup, thereby making it very interesting.


The Top I/O Hitters..


In an Oracle Applications 11i Order Management implementation, the "Purge Obsolete Workflow Runtime Data" was found to be the one of the longest running concurrent jobs in the Production system.


purge high io contender.JPG:


Now, "Purge Obsolete Workflow Runtime Data" was scheduled to run daily for approximately 40 different workflow item types, and the requests were launched all at the same time (8:30pm).


By checking the logical and physical read/write statistics for each submitted concurrent request, it was clear that OEOL purge was was the highest hitter on the I/O.


comparision of IO across item_types for FNDWFPR.JPG:


The finger of suspicion..


The logical reads being done by OEOL purge program were obviously magnitude(s) higher than its other cousins (e.g. XXONTOIM, POSCHORD..). While the OEOL workflow data was kind of huge (millions of records), an examination of purge code could not hurt.


Tearing apart the code revealed that in 11.5.9 Oracle Applications 11i, APIs were called in the following sequence:


wf_purge.totalconcurrent-> wf_purge.total --> wf_purge.aborterrorprocess
                                                                            --> wf_purge.items  


wf_purge.items() is the real API which does the purging, based on eligibility criteria.


The loose structure of the code is:


purge code structure.JPG:


So, for every batch of data to be processed (purged), we are opening the *same* cursor EVERY TIME, which means that we are issuing the *Same* query to the system FOR EVERY BATCH.  Therefore, If we had an overall ovolume of 1000000 rows to be processed and passed a parameter of 1 as batchsize, the cursor would be Opened and executed 1000000 (1 million times).


A better approach? Well, uh, Almost...


This seemed extremely in-efficient. Instead of this, since we were ultimately storing all the queried data in memory structures (pl/sql stables), did it make more sense to ONLY query the main cursor ONCE and then loop and process over the data in batches as per the batchsize passed to the API? The new code could look like this:


purge code structure new.JPG:


TAR  6168657.994 - CODE ENHANCEMENT IN FNDWFPR PROGRAM WITH PERFORMANCE IMPLICATIONS was logged with Oracle Support to get Development's thoughts on this.


As it turned out, this code design was chosen with a good reason:


To avoid fetch across commits (which would compromise Oracle's read consistency) because a delete is performed against the table (applsys.wf_items) within the main driving cursor. And a commit would need be performed within the fetch in order to control rollback.


In large volume systems, instead of doing the code change, if undo_retention database initialization parameter was set to a high value (to avoid ORA-01555 "snapshot too old: rollback segment number %s with name \"%s" too small" error message) and a very large batchsize was selected (which would reduce the frequency of the open/close of the cursor), it would simulate the same effect as the code change proposed without inducing fetch across commits.


SQL> show parameters undo


NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------------
undo_management                      string      AUTO
undo_retention                       integer     28800
undo_suppress_errors                 boolean     TRUE
undo_tablespace                      string      RBS


Will this help us?


Only benchmarking could prove or disprove this. So, different batch commit sizes were selected for benchmarking the performance of OEOL purge: 1000, 2000, 5000, 10000, 999,999,999,999 (maximum possible -- the idea of infinity for the purge program). 3-5 samples were collected for each batch commit size and the purge concurrent program was run for OEOL. An average was taken across 3-5 samples for each batch commit size and the time taken to purge 1 item from the system. On charting out the average time taken per item purged, the gains were very visible.


chart of avg time taken to purge an item for different batch sizes.JPG:


Seeing the benefit in Production...


Voila! Using the highest commit batchsize seemed to be a positive improvement. After due deliberations, when it was put into production, the effect was quite remarkable.


It could be seen that the logical reads for OEOL dropped from 984 million to 71 million (about 7.2% of the original). Also looked at the total runtime of all of the purges. There was  a little fluctuation from day to day, but it looked like all of the purges took a total of 71.5 hours per night last week, versus 51.4 hours per night, after the change.


Another interesting bit was that the total I/O waits from 20:00 - 21:00 hours (when the purge jobs get submitted for all workflow item types) did not change much before and after the change, but buffer busy waits came down to about 35% lower, and the number of overall logical reads actuall became higher -- which means that the system was able to do more work, with less contention, in the same period of time.


The system went from 5.5 billion gets to 9.8 billion in that 1 hour timeslice. Not only this, the RMAN production backups started finishing 1 hour earlier. It seemed that this one change freed up a lot of IO strain on the Production system.


trend of oeol purge io and timings with batchsize increase.JPG:


Summary


Not only was it beneficial to understand the code of workflow purge, but experimenting performance and watching the I/O information through audit was a very positive experience for the 11i Oracle Applications Production environment. Opening a TAR with Oracle Support and getting a response from Development was a long winded experience, but it paid off in the end.


Note of Thanks


Special thanks to Brian Fane, resident Apps DBA at the client site, for his insightful analysis of the benefits realized by this change. Its been a pleasure to work with him.


Taking inspiration


The situation reminds us of some motivational quotes:


"We are constantly faced with problems brilliantly disguised as great opportunities."


"You can not discover new oceans unless you have the courage to lose sight of the shore."


More quotes can be found here..


 

March 27, 2007

We're Skewed! Using 10053 event for being Histogram-wise

                                                 histogram.JPG:

Preface


The Cost based optimizer is usually intelligent in many situations and will get the optimal plan for a given query, provided the statistics of the base objects are upto date. Many a times though, if the underlying data volume is unusually skewed towards a few distinct values, it needs to be educated to factor it and thats where histograms can be especially helpful.

This case study shall focus on one such situation at a client site where adding a height based histogram (the only kind provided by Oracle), brought wonders to the production system and reduced bad press for Oracle Workflow.

This article is a follow up on A Performance Case Study: Taming the Workflow Purge.

Swaying the Customer's perception

For almost two years after implementation of Order-to-Cash cycle of 11i Oracle Applications, the client had been very unhappy about the overall workflow purge performance and constantly blamed it for it's order entry performance issues.

This exercise was positive in the sense that it blew away some of that negative perception and brought to spotlight on some other areas of performance improvement in their Production system.

Noticing it in Statspack report..

It all started when Brian Fane, resident Apps DBA smaller bfane.JPG: (and a fine one, at that), noticed that the Workflow Purge program (Module: FNDWFPR) was waiting on logical and physical I/O from the statspack reports between 9pm -10 pm.

ecx purge statspack1.JPG:

A lot of physical I/O was being done as well for the same API calls:

ecx purge statspack2.JPG:

Not only was the ECX_DATA tablespace being pounded with excessive I/O waits.

ecx purge statspack3.JPG:

Also, both db file scattered read (Full table scans) and buffer busy waits contributed to quite a bit of the total wait time recorded.

ecx purge statspack4.JPG:

So it was quite clear that between 9pm and 10pm, WF_PURGE.TotalConcurrent() was causing the highest logical reads (gets) and physical reads. Internally, WF_PURGE.TotalConcurrent() called ECX_PURGE.Purge_Items() which did the following query on ECX.ECX_DOCLOGS:

SELECT MSGID, ITEM_TYPE, ITEM_KEY FROM ECX_DOCLOGS WHERE (ITEM_TYPE = :B7 ) AND (:B6 IS NULL OR TRANSACTION_TYPE = :B6 ) AND
(:B5 IS NULL OR TRANSACTION_SUBTYPE = :B5 ) AND (:B4 IS NULL OR
PARTYID = :B4 ) AND (:B3 IS NULL OR PARTY_SITE_ID = :B3 ) AND (:B2
IS NULL OR TIME_STAMP >=:B2 ) AND (:B1 IS NULL OR TIME_STAMP <=....

Piecing together the CBO's behaviour

The problem was that the above query's plan was using a Full table scan of ECX.ECX_DOCLOGS.

ecx_doclogs plan.JPG:

Interestingly, while an index was present on ECX_DOCLOGS.ITEM_TYPE column, somehow the query was not using it at all. Obviously, the CBO thought that doing a Full table scan was more efficient than using an index on ECX_DOCLOGS.ITEM_TYPE.

The behaviour was puzzling. The 10053 event (tracing the optimizer stats) trace file for the abovesaid sql could really help diagnose the behaviour better. The following documents were used for analyzing the cryptic 10053 trace file for the sql: (A look under the hood of 10053) and Metalink notes like  338137.1 and 343226.1 to understand the number infested output of 10053 trace file.

Doing the Math...

An indepth study of the 10053 event trace file revealed some interesting numbers. It did appear that the Full table scan was lest costly (cost=26376.00) than the index scan of ECX_DOCLOGS_N2 (cost=48785). For column # 56 (ITEM_TYPE), the following statistics were being shown:

ecx_doclogs query 10053.JPG:

Since the total cost of doing an index scan is : blevel + FF*leaf_blocks + FF*clustering_factor, using the information for ECX_DOCLOGS_N2 from BASE STATISTICAL INFORMATION of the trace, the following values can be arrived at:

  blevel=2 (LVLS:)
 FF (Filtering Factor) = 1/NDV =
1.2500e-01 (DENS)
  leaf_blocks = 11600 (#LB:)
  clustering_factor = 378671 (CLUF:)

Plugging in the numbers, the cost of index range scan of ECX_DOCLOGS_N2  = 2 + 1.2500e-01 ( 11600 + 378671 ) = 2 + 48783.85 ~ 48785.

As can be observed, there are two major contributors for the above cost being so high: either the Filtering Factor - FF (which is too low) or the sum of leaf_blocks + clustering_factor. The filtering factor was more likely to be the culprit and it seemed that somehow the optimizer did not feel that the index was selected, whereas in reality there was no reason for it to be unselective. This appeared like a very good case of data being skewed towards a particular distinct value, thereby causing the index selectivity to become unselective.

Improving the selectivity...

One of the best ways of checking if a histograms will help is to run the FND_STATS.CHECK_HISTOGRAM_COLS  API for the table.

ecx_doclogs check_fnd_histogram.JPG:

It was clear from the above that the columns in the tables in question could use Histograms. As you can see, it is apparent that if the 'Max-Count' is a large percentage of the 'Total-Count; it implies that there is at least one value that occurs 'Max-Count'  times and this introduces a skew  in the data distribution.

In 11i Oracle Applications, the best way for the  a histogram to be automatically made  is to insert a row in APPLSYS.FND_HISTOGRAM_COLS for the specific column and gather statistics on the underlying table (ECX_DOCLOGS) using FND_STATS API. This automatically creates the histogram.

SQL> insert into fnd_histogram_cols values (174, 'ECX_DOCLOGS','ITEM_TYPE',null,254,sysdate,1,sysdate,1,null,null);

1 row created.

SQL> Commit;

After FND_STATS were gathered on ECX_DOCLOGS, the sql plan for the underlying query changed as follows:

ecx_doclogs new plan.JPG:

Seeing it work in Production

In production, since all the workflow purges used to get submitted around 8:30pm, there used to be a huge spike in the reads of the system on normal working days. Since this change enabled the ECX_PURGE related sql to use an index, the projection was that this would bring down the overall time taken by all the FNDWFPR (workflow purges) programs submitted on any given night.

It did turn out to be that way. By and large, all of the item types improved, with the exception of OEOL (its runtime had more to do with the OEOL volume than the ECX_DOCLOGS step of the workflow purge). The largest improvements were for several item types that dropped from over an hour to 4 minutes or less (one went from 3 hours 34 minutes to 0 seconds).

Physical I/Os also came down. In the following graph, the black box represents the start of the workflow purges. On the far right is last night's purge. After the change, the physical I/Os started blending into the background:

ecx_doclogs post histogram physical reads profile.JPG:

The cumulative runtimes of ALL the workflow purges also came down considerably:

ecx_doclogs post histogram cumulative runtimes for purge.JPG:

How was it done?

Audit needs to be enabled at the database level:

SQL> l
1* select name, value from v$parameter where name like '%audit%'
SQL> /

NAME VALUE
------------------------------ ------------------------------
transaction_auditing TRUE
audit_sys_operations FALSE
audit_file_dest ?/rdbms/audit
audit_trail DB

SQL> audit create session;

Audit succeeded.

SQL> select * from dba_stmt_audit_opts
union
select * from dba_priv_audit_opts;

USER_NAME                      PROXY_NAME
------------------------------ ------------------------------
AUDIT_OPTION                             SUCCESS    FAILURE
---------------------------------------- ---------- ----------

CREATE SESSION                           BY ACCESS  BY ACCESS

This sql will tell us the logical and physical I/O information for a concurrent program:

$ more oeol_purge_check.sql

set lines 200
col wdate format a9 heading 'DATE'
col logon format a8 heading 'LOGON'
col logoff format a8 heading 'LOGOFF'
col total_min format b9999 heading 'TOTAL|MIN'
col lrate format 999999 heading 'LREAD|SEC'
col prate format b99999 heading 'PREAD|SEC'
col wrate format b99999 heading 'LWRITE|SEC'
col logread format 99999999999 heading 'LOGICAL|READS'
col physread format b9999999999 heading 'PHYSICAL|READS'
col logwrite format b999999999 heading 'LOGICAL|WRITES'
col concurrent_program_name format a32 heading 'PROGRAM|NAME'
col request_id format 999999999 heading 'REQUEST|ID'
col concurrent_program_name format a32 heading 'PROGRAM|NAME'
col request_id format 999999999 heading 'REQUEST|ID'
col argument6 format a10 heading 'BATCH|COMMIT|SIZE'

select to_char( r.REQUEST_DATE, 'mm/dd_Dy' ) wdate,
to_char( decode( logoff_time, null, actual_start_date, timestamp ), 'hh24:mi' ) logon,
to_char( decode( logoff_time, null, actual_completion_date, logoff_time ), 'hh24:mi' ) logoff,
nvl( logoff_time - timestamp,
actual_completion_date - r.actual_start_date ) * 1440 total_min,
logoff_lread / ( ( logoff_time - timestamp ) * 86400 ) lrate,
logoff_lread logread,
logoff_pread / ( ( logoff_time - timestamp ) * 86400 ) prate,
logoff_pread physread,
logoff_lwrite / ( ( logoff_time - timestamp ) * 86400 ) wrate,
logoff_lwrite logwrite,
p.concurrent_program_name,
r.argument6 ,
r.request_id
from sys.dba_audit_session a,
applsys.fnd_concurrent_programs p,
applsys.fnd_concurrent_requests r,
applsys.fnd_application fa
where r.program_application_id = p.application_id
and fa.application_id=p.application_id
and r.concurrent_program_id = p.concurrent_program_id
and r.oracle_session_id = a.sessionid
and logoff_time - timestamp > 0
and trunc(r.REQUEST_DATE) >= trunc(sysdate)-10
and r.REQUEST_DATE < trunc(sysdate)+1
and p.concurrent_program_name = 'FNDWFPR'
and r.argument1='OEOL'
order by trunc(r.actual_start_date) desc,logoff_lread desc
/

Summary

If you have implemented Statspack, it really pays to observe the worst I/O offenders (logical  gets, physical reads) and wait information out there for batch load windows. This is one of the strengths of Statspack.

Once the queries to be tuned have been identified, half the battle is won already. Again, using the DB audit feature really helps in prioritizing or knowing which programs are causing excessive I/O and inundating the data buffer area in SGA.

Apart from this, using the 10053 trace for understanding the thought process of Cost based optimizer is also a worthwhile exercise, as is using the FND_STATS.CHECK_HISTOGRAM_COLS  API for checking which columns can use a histogram.

Putting all these aspects together bring up a winning combination!

Suggested Readings:

A Performance Case Study: Taming the Workflow Purge


About March 2007

This page contains all entries posted to Experiments from the Field..Based on True Stories in March 2007. They are listed from oldest to newest.

February 2007 is the previous archive.

April 2007 is the next archive.

Many more can be found on the main index page or by looking through the archives.

Creative Commons License
This weblog is licensed under a Creative Commons License.
Powered by
Movable Type and Oracle