We're Skewed! Using 10053 event for being Histogram-wise

[Image: histogram.JPG]

Preface


The cost-based optimizer is intelligent in most situations and will arrive at the optimal plan for a given query, provided the statistics of the base objects are up to date. Often, though, if the underlying data volume is unusually skewed towards a few distinct values, the optimizer needs to be educated to factor that in, and that is where histograms can be especially helpful.

This case study focuses on one such situation at a client site, where adding a height-balanced histogram (one of the two kinds Oracle provides, the other being the frequency histogram) worked wonders for the production system and reduced bad press for Oracle Workflow.

This article is a follow up on A Performance Case Study: Taming the Workflow Purge.

Swaying the Customer's perception

For almost two years after implementing the Order-to-Cash cycle of 11i Oracle Applications, the client had been very unhappy about the overall workflow purge performance and constantly blamed it for their order-entry performance issues.

This exercise was positive in the sense that it blew away some of that negative perception and brought the spotlight onto some other areas of performance improvement in their production system.

Noticing it in the Statspack reports...

It all started when Brian Fane, the resident Apps DBA (and a fine one, at that), noticed from the Statspack reports that the Workflow Purge program (Module: FNDWFPR) was the top consumer of logical and physical I/O between 9 pm and 10 pm.

[Image: ecx purge statspack1.JPG]

A lot of physical I/O was being done as well for the same API calls:

[Image: ecx purge statspack2.JPG]


The ECX_DATA tablespace was also being pounded with excessive I/O waits:

[Image: ecx purge statspack3.JPG]

In addition, both db file scattered read (full table scans) and buffer busy waits contributed to a sizeable chunk of the total recorded wait time:

[Image: ecx purge statspack4.JPG]
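For readers who want to reproduce this kind of check, a Statspack report for the 9 pm - 10 pm window can be produced with the stock spreport script, assuming snapshots bracket that window (a minimal sketch, run as the PERFSTAT owner; pick the two snapshot ids around the batch window when prompted):

SQL> connect perfstat
SQL> @?/rdbms/admin/spreport
     -- prompts for begin_snap and end_snap (the 9 pm and 10 pm snapshots) and a report_name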


So it was quite clear that between 9pm and 10pm, WF_PURGE.TotalConcurrent() was causing the highest logical reads (gets) and physical reads. Internally, WF_PURGE.TotalConcurrent() called ECX_PURGE.Purge_Items() which did the following query on ECX.ECX_DOCLOGS:

SELECT MSGID, ITEM_TYPE, ITEM_KEY
  FROM ECX_DOCLOGS
 WHERE (ITEM_TYPE = :B7)
   AND (:B6 IS NULL OR TRANSACTION_TYPE = :B6)
   AND (:B5 IS NULL OR TRANSACTION_SUBTYPE = :B5)
   AND (:B4 IS NULL OR PARTYID = :B4)
   AND (:B3 IS NULL OR PARTY_SITE_ID = :B3)
   AND (:B2 IS NULL OR TIME_STAMP >= :B2)
   AND (:B1 IS NULL OR TIME_STAMP <= ....

Piecing together the CBO's behaviour

The problem was that the above query's plan was using a Full table scan of ECX.ECX_DOCLOGS.

[Image: ecx_doclogs plan.JPG]
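To confirm what the CBO was choosing, the statement can be explained straight from SQL*Plus; a minimal sketch, shortened here to the ITEM_TYPE predicate that matters (DBMS_XPLAN assumed available, i.e. 9.2 onwards):

SQL> explain plan for
  2  select msgid, item_type, item_key
  3    from ecx.ecx_doclogs
  4   where item_type = :b7;

Explained.

SQL> select * from table(dbms_xplan.display);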

Interestingly, while an index was present on the ECX_DOCLOGS.ITEM_TYPE column, the query was not using it at all. Evidently, the CBO thought that a full table scan was more efficient than using the index on ECX_DOCLOGS.ITEM_TYPE.

The behaviour was puzzling. A 10053 event trace (which records the optimizer's costing decisions) for the SQL above was the obvious way to diagnose it further. The paper A Look under the Hood of 10053 and Metalink notes 338137.1 and 343226.1 were used to decipher the number-infested output of the trace file.
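For reference, the 10053 trace is produced by setting the event in a session and then hard parsing the statement; a minimal sketch (the trace file lands in user_dump_dest, tagged with the tracefile_identifier):

SQL> alter session set tracefile_identifier = 'ecx_doclogs_10053';
SQL> alter session set events '10053 trace name context forever, level 1';
SQL> -- hard parse the ECX_DOCLOGS statement here, e.g. via EXPLAIN PLAN FOR ...
SQL> alter session set events '10053 trace name context off';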

Doing the Math...

An in-depth study of the 10053 event trace file revealed some interesting numbers. The full table scan did indeed appear less costly (cost = 26376.00) than an index scan of ECX_DOCLOGS_N2 (cost = 48785). For column #56 (ITEM_TYPE), the following statistics were shown:

[Image: ecx_doclogs query 10053.JPG]

Since the cost of an index range scan is roughly blevel + FF*leaf_blocks + FF*clustering_factor, using the information for ECX_DOCLOGS_N2 from the BASE STATISTICAL INFORMATION section of the trace, the following values can be read off:

  blevel = 2 (LVLS)
  FF (filtering factor) = 1/NDV = 1.2500e-01 (DENS)
  leaf_blocks = 11600 (#LB)
  clustering_factor = 378671 (CLUF)

Plugging in the numbers, the cost of an index range scan of ECX_DOCLOGS_N2 = 2 + 1.2500e-01 * (11600 + 378671) = 2 + 48783.88 ~ 48785, matching the trace.
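The same arithmetic can be sanity-checked straight from the dictionary statistics; a rough sketch (DENSITY stands in for 1/NDV here because there was no histogram on the column yet):

select i.blevel
       + c.density * (i.leaf_blocks + i.clustering_factor) as est_index_cost
  from dba_indexes i,
       dba_tab_col_statistics c
 where i.owner       = 'ECX'
   and i.index_name  = 'ECX_DOCLOGS_N2'
   and c.owner       = i.table_owner
   and c.table_name  = i.table_name
   and c.column_name = 'ITEM_TYPE';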

As can be observed, two things could make the above cost so high: the filtering factor FF (which, at 0.125, is high, meaning the index looks unselective) or the sum of leaf_blocks + clustering_factor. The filtering factor was the more likely culprit: the optimizer did not consider the index selective, whereas in reality there was no reason for it to be unselective. This looked like a textbook case of data being skewed towards a particular distinct value, which made the uniform 1/NDV selectivity misrepresent the index as unselective.

Improving the selectivity...

One of the best ways of checking whether a histogram will help is to run the FND_STATS.CHECK_HISTOGRAM_COLS API for the table.

[Image: ecx_doclogs check_fnd_histogram.JPG]

It was clear from the above that the columns of the table in question could use histograms: if the 'Max-Count' is a large percentage of the 'Total-Count', there is at least one value that occurs 'Max-Count' times, and that introduces a skew in the data distribution.
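The same skew is visible with a plain group-by against the table itself (a quick manual check, not the FND_STATS API):

select item_type,
       count(*) cnt,
       round( 100 * ratio_to_report(count(*)) over (), 1 ) pct_of_total
  from ecx.ecx_doclogs
 group by item_type
 order by cnt desc;

If one ITEM_TYPE accounts for the bulk of the rows, the uniform 1/NDV density the optimizer was using badly misrepresents how selective the index really is for the other values.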

In 11i Oracle Applications, the best way to have a histogram created automatically is to insert a row into APPLSYS.FND_HISTOGRAM_COLS for the specific column and then gather statistics on the underlying table (ECX_DOCLOGS) using the FND_STATS API, which creates the histogram as part of the gather.

SQL> insert into fnd_histogram_cols values (174, 'ECX_DOCLOGS','ITEM_TYPE',null,254,sysdate,1,sysdate,1,null,null);

1 row created.

SQL> Commit;
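Statistics were then regathered with the FND_STATS API so that the new FND_HISTOGRAM_COLS row takes effect; a sketch of the kind of call used (parameter names follow the DBMS_STATS convention, check the FND_STATS package spec on your instance):

SQL> exec fnd_stats.gather_table_stats(ownname => 'ECX', tabname => 'ECX_DOCLOGS');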

After FND_STATS were gathered on ECX_DOCLOGS, the execution plan for the underlying query changed as follows:

[Image: ecx_doclogs new plan.JPG]
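Whether the histogram actually got created can be double-checked from the dictionary (a generic check, nothing Apps-specific; more than one bucket means a histogram is in place):

select num_buckets
  from dba_tab_columns
 where owner = 'ECX'
   and table_name = 'ECX_DOCLOGS'
   and column_name = 'ITEM_TYPE';

select endpoint_number, endpoint_value
  from dba_histograms
 where owner = 'ECX'
   and table_name = 'ECX_DOCLOGS'
   and column_name = 'ITEM_TYPE'
 order by endpoint_number;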

Seeing it work in Production

In production, all the workflow purges used to be submitted around 8:30 pm, so there used to be a huge spike in reads on normal working days. Since this change enabled the ECX_PURGE-related SQL to use an index, the projection was that it would bring down the overall time taken by all the FNDWFPR (Workflow Purge) programs submitted on any given night.

It did turn out to be that way. By and large, all of the item types improved, with the exception of OEOL (its runtime had more to do with the OEOL volume than the ECX_DOCLOGS step of the workflow purge). The largest improvements were for several item types that dropped from over an hour to 4 minutes or less (one went from 3 hours 34 minutes to 0 seconds).

Physical I/Os also came down. In the following graph, the black box represents the start of the workflow purges. On the far right is last night's purge. After the change, the physical I/Os started blending into the background:

[Image: ecx_doclogs post histogram physical reads profile.JPG]

The cumulative runtimes of ALL the workflow purges also came down considerably:

[Image: ecx_doclogs post histogram cumulative runtimes for purge.JPG]

How was the per-program I/O measured?

Audit needs to be enabled at the database level:



SQL> select name, value from v$parameter where name like '%audit%';

NAME                           VALUE
------------------------------ ------------------------------
transaction_auditing           TRUE
audit_sys_operations           FALSE
audit_file_dest                ?/rdbms/audit
audit_trail                    DB

SQL> audit create session;

Audit succeeded.

SQL> select * from dba_stmt_audit_opts
  2  union
  3  select * from dba_priv_audit_opts;

USER_NAME                      PROXY_NAME
------------------------------ ------------------------------
AUDIT_OPTION                             SUCCESS    FAILURE
---------------------------------------- ---------- ----------
CREATE SESSION                           BY ACCESS  BY ACCESS


The following SQL reports the logical and physical I/O for each run of a concurrent program:


$ more oeol_purge_check.sql

set lines 200
col wdate format a9 heading 'DATE'
col logon format a8 heading 'LOGON'
col logoff format a8 heading 'LOGOFF'
col total_min format b9999 heading 'TOTAL|MIN'
col lrate format 999999 heading 'LREAD|SEC'
col prate format b99999 heading 'PREAD|SEC'
col wrate format b99999 heading 'LWRITE|SEC'
col logread format 99999999999 heading 'LOGICAL|READS'
col physread format b9999999999 heading 'PHYSICAL|READS'
col logwrite format b999999999 heading 'LOGICAL|WRITES'
col concurrent_program_name format a32 heading 'PROGRAM|NAME'
col request_id format 999999999 heading 'REQUEST|ID'
col argument6 format a10 heading 'BATCH|COMMIT|SIZE'

select to_char( r.request_date, 'mm/dd_Dy' ) wdate,
       to_char( decode( logoff_time, null, actual_start_date, timestamp ), 'hh24:mi' ) logon,
       to_char( decode( logoff_time, null, actual_completion_date, logoff_time ), 'hh24:mi' ) logoff,
       nvl( logoff_time - timestamp,
            actual_completion_date - r.actual_start_date ) * 1440 total_min,
       logoff_lread / ( ( logoff_time - timestamp ) * 86400 ) lrate,
       logoff_lread logread,
       logoff_pread / ( ( logoff_time - timestamp ) * 86400 ) prate,
       logoff_pread physread,
       logoff_lwrite / ( ( logoff_time - timestamp ) * 86400 ) wrate,
       logoff_lwrite logwrite,
       p.concurrent_program_name,
       r.argument6,
       r.request_id
  from sys.dba_audit_session a,
       applsys.fnd_concurrent_programs p,
       applsys.fnd_concurrent_requests r,
       applsys.fnd_application fa
 where r.program_application_id = p.application_id
   and fa.application_id = p.application_id
   and r.concurrent_program_id = p.concurrent_program_id
   and r.oracle_session_id = a.sessionid
   and logoff_time - timestamp > 0
   and trunc(r.request_date) >= trunc(sysdate) - 10
   and r.request_date < trunc(sysdate) + 1
   and p.concurrent_program_name = 'FNDWFPR'
   and r.argument1 = 'OEOL'
 order by trunc(r.actual_start_date) desc, logoff_lread desc
/
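As written, the script is hard-wired to the OEOL item type through r.argument1 and looks back ten days; it can be run as the APPS user (or any account that can see SYS.DBA_AUDIT_SESSION and the FND tables), for example:

$ sqlplus apps @oeol_purge_check.sql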

Summary

If you have implemented Statspack, it really pays to look at the worst I/O offenders (logical gets, physical reads) and the wait information for batch-load windows. This is one of Statspack's strengths.

Once the queries to be tuned have been identified, half the battle is already won. The DB audit feature, in turn, really helps in prioritizing: it shows which programs are causing excessive I/O and flooding the buffer cache in the SGA.

Apart from this, using a 10053 trace to understand the thought process of the cost-based optimizer is a worthwhile exercise, as is using the FND_STATS.CHECK_HISTOGRAM_COLS API to check which columns could use a histogram.

Putting all these aspects together makes for a winning combination!

Suggested Readings:

A Performance Case Study: Taming the Workflow Purge


Comments:

Hi Gourav, Its a nice case study. Worth reading. Please tell me, whats the source of "Physical I/O profile due to FNDWFPR" and "Cumulative Runtime for FNDWFPR conc programs". Neeraj

Posted by Neeraj on April 01, 2007 at 04:08 PM EDT #
