A Performance Case Study: Taming the Workflow Purge



Preface


What happens when the Oracle buffer cache and SGA are simply flooded with I/O that may not even be required? Not only does the system underperform due to I/O contention and buffer busy waits for other batch concurrent programs, but the database machine's I/O subsystem is also unable to perform up to its capacity, because there is a tremendous bottleneck at the data buffer cache section of the SGA.


In this case study, we shall see how studying the code's logic and its overall effect on the I/O performance of the system can be hugely beneficial.


This case study is also paradoxical in the sense that reducing the I/O volume caused by one major contender allowed more I/O to be done in less time by other batch concurrent programs and the RMAN backup, which makes it all the more interesting.


The Top I/O Hitters..


In an Oracle Applications 11i Order Management implementation, "Purge Obsolete Workflow Runtime Data" was found to be one of the longest-running concurrent jobs in the Production system.


[Figure: purge as the high I/O contender]


Now, "Purge Obsolete Workflow Runtime Data" was scheduled to run daily for approximately 40 different workflow item types, and the requests were launched all at the same time (8:30pm).


By checking the logical and physical read/write statistics for each submitted concurrent request, it was clear that the OEOL purge was the highest hitter on I/O.
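For the curious, one way of pulling such numbers for requests that are still running is sketched below. This is a hedged sketch, not the exact script used in the study; it assumes the 11i convention that FND_CONCURRENT_REQUESTS.ORACLE_SESSION_ID holds the AUDSID of the request's database session, so that V$SESS_IO can be joined in for the session-level I/O counters.

SELECT fcr.request_id,
       fcr.argument_text,                          -- first argument is usually the item type (e.g. OEOL)
       io.block_gets + io.consistent_gets AS logical_reads,
       io.physical_reads,
       io.block_changes
  FROM apps.fnd_concurrent_requests fcr,
       v$session                    s,
       v$sess_io                    io
 WHERE fcr.phase_code = 'R'                        -- running requests only
   AND s.audsid       = fcr.oracle_session_id
   AND io.sid         = s.sid
 ORDER BY logical_reads DESC;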


[Figure: comparison of I/O across item_types for FNDWFPR]


The finger of suspicion..


The logical reads being done by the OEOL purge program were obviously orders of magnitude higher than those of its cousins (e.g. XXONTOIM, POSCHORD). While the OEOL workflow data was admittedly huge (millions of records), an examination of the purge code could not hurt.


Tearing apart the code revealed that in Oracle Applications 11i (11.5.9), the APIs were called in the following sequence:


wf_purge.totalconcurrent --> wf_purge.total --> wf_purge.aborterrorprocess
                                            --> wf_purge.items


wf_purge.items() is the real API which does the purging, based on eligibility criteria.


The loose structure of the code is:


[Figure: purge code structure (original)]
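For readers who cannot see the image, a rough PL/SQL sketch of the pattern follows. It illustrates the structure only and is not Oracle's actual wf_purge.items code; the cursor, variable names and simplified eligibility check are all hypothetical.

DECLARE
   TYPE t_keys IS TABLE OF wf_items.item_key%TYPE;
   l_item_keys  t_keys;
   p_itemtype   VARCHAR2(8)  := 'OEOL';
   p_batchsize  PLS_INTEGER  := 1000;

   CURSOR c_purgeable_items IS
      SELECT item_key
        FROM wf_items
       WHERE item_type = p_itemtype
         AND end_date <= SYSDATE;      -- simplified eligibility check
BEGIN
   LOOP
      -- the same cursor is opened and executed once per batch
      OPEN  c_purgeable_items;
      FETCH c_purgeable_items BULK COLLECT INTO l_item_keys LIMIT p_batchsize;
      CLOSE c_purgeable_items;
      EXIT WHEN l_item_keys.COUNT = 0;

      -- purge this batch, then commit to keep rollback usage small
      FORALL i IN 1 .. l_item_keys.COUNT
         DELETE FROM wf_items
          WHERE item_type = p_itemtype
            AND item_key  = l_item_keys(i);
      COMMIT;
   END LOOP;
END;
/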


So, for every batch of data to be processed (purged), we are opening the *same* cursor EVERY TIME, which means that we are issuing the *same* query to the system FOR EVERY BATCH. Therefore, if we had an overall volume of 1,000,000 rows to be processed and passed a batchsize of 1, the cursor would be opened and executed 1,000,000 times (1 million times).


A better approach? Well, uh, Almost...


This seemed extremely inefficient. Since we were ultimately storing all the queried data in memory structures (PL/SQL tables) anyway, did it not make more sense to query the main cursor only ONCE and then loop over the data, processing it in batches as per the batchsize passed to the API? The new code could look like this:


[Figure: purge code structure (new)]
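Again as a hypothetical sketch (not actual Oracle code), the proposed shape was roughly this: open the driving cursor only once and keep fetching from it in batches, committing between fetches. Note that the commit now happens while the cursor is still open, which is precisely the fetch-across-commits concern in Development's reply below.

DECLARE
   TYPE t_keys IS TABLE OF wf_items.item_key%TYPE;
   l_item_keys  t_keys;
   p_itemtype   VARCHAR2(8)  := 'OEOL';
   p_batchsize  PLS_INTEGER  := 1000;

   CURSOR c_purgeable_items IS
      SELECT item_key
        FROM wf_items
       WHERE item_type = p_itemtype
         AND end_date <= SYSDATE;      -- simplified eligibility check
BEGIN
   OPEN c_purgeable_items;             -- single open/execute of the query
   LOOP
      FETCH c_purgeable_items BULK COLLECT INTO l_item_keys LIMIT p_batchsize;
      EXIT WHEN l_item_keys.COUNT = 0;

      FORALL i IN 1 .. l_item_keys.COUNT
         DELETE FROM wf_items
          WHERE item_type = p_itemtype
            AND item_key  = l_item_keys(i);
      COMMIT;                          -- commit while the cursor is still open
   END LOOP;
   CLOSE c_purgeable_items;
END;
/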


TAR  6168657.994 - CODE ENHANCEMENT IN FNDWFPR PROGRAM WITH PERFORMANCE IMPLICATIONS was logged with Oracle Support to get Development's thoughts on this.


As it turned out, this code design had been chosen for a good reason:


To avoid a fetch across commits (which would compromise Oracle's read consistency), because a delete is performed against the table (applsys.wf_items) within the main driving cursor, and a commit would need to be performed within the fetch in order to control rollback.


In large-volume systems, instead of making the code change, if the undo_retention database initialization parameter were set to a high value (to avoid the ORA-01555 "snapshot too old: rollback segment number %s with name \"%s\" too small" error) and a very large batchsize were selected (which would reduce the frequency of the cursor open/close), it would simulate the same effect as the proposed code change without inducing fetches across commits.


SQL> show parameters undo


NAME                                 TYPE        VALUE
------------------------------------ ----------- -------------------
undo_management                      string      AUTO
undo_retention                       integer     28800
undo_suppress_errors                 boolean     TRUE
undo_tablespace                      string      RBS
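For reference, undo_retention is specified in seconds (28800 above is 8 hours) and, with automatic undo management, it can be raised online without a bounce:

SQL> alter system set undo_retention = 28800;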


Will this help us?


Only benchmarking could prove or disprove this. So, different batch commit sizes were selected for benchmarking the performance of the OEOL purge: 1000, 2000, 5000, 10000 and 999,999,999,999 (the maximum possible -- effectively infinity for the purge program). For each batch commit size, the purge concurrent program was run for OEOL and 3-5 samples were collected; the time taken to purge one item from the system was then averaged across those samples. On charting the average time taken per item purged, the gains were very visible.


[Figure: chart of average time taken to purge an item for different batch sizes]
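As an aside, one hedged way of pulling the raw elapsed times for such a comparison (an assumption on my part, not the actual benchmarking script) is to query the standard FND tables for the FNDWFPR runs; the per-item figure then only needs the purged-item count, which can be read off each request's log file.

SELECT fcr.request_id,
       fcr.argument_text,              -- assuming the item type is the first argument
       ROUND((fcr.actual_completion_date - fcr.actual_start_date) * 24 * 60, 1)
          AS elapsed_minutes
  FROM apps.fnd_concurrent_requests fcr,
       apps.fnd_concurrent_programs fcp
 WHERE fcp.concurrent_program_id   = fcr.concurrent_program_id
   AND fcp.application_id          = fcr.program_application_id
   AND fcp.concurrent_program_name = 'FNDWFPR'
   AND fcr.argument_text LIKE 'OEOL%'
 ORDER BY fcr.actual_start_date;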


Seeing the benefit in Production...


Voila! Using the highest commit batchsize looked like a positive improvement. After due deliberation, when it was put into production, the effect was quite remarkable.


The logical reads for OEOL dropped from 984 million to 71 million (about 7.2% of the original). We also looked at the total runtime of all of the purges. There was a little fluctuation from day to day, but the purges took a total of roughly 71.5 hours per night in the week before the change, versus 51.4 hours per night after it.


Another interesting bit was that the total I/O waits from 20:00-21:00 hours (when the purge jobs are submitted for all workflow item types) did not change much before and after the change, but buffer busy waits came down by about 35%, and the overall number of logical reads actually became higher -- which means that the system was able to do more work, with less contention, in the same period of time.


The system went from 5.5 billion gets to 9.8 billion gets in that one-hour timeslice. In addition, the RMAN production backups started finishing an hour earlier. It seemed that this one change relieved a lot of I/O strain on the Production system.
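(As a rough illustration of how such an hourly figure can be sampled -- the study's own numbers came from the audited I/O statistics, not from this query -- one can snapshot the cumulative counter in v$sysstat at 20:00 and again at 21:00 and take the difference:)

-- run at 20:00 and again at 21:00, then subtract the two values
SELECT name, value
  FROM v$sysstat
 WHERE name = 'session logical reads';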


[Figure: trend of OEOL purge I/O and timings with batchsize increase]


Summary


Not only was it beneficial to understand the code of the workflow purge, but experimenting with performance and watching the I/O information through auditing was a very positive experience for the 11i Oracle Applications Production environment. Opening a TAR with Oracle Support and getting a response from Development was a long-winded experience, but it paid off in the end.


Note of Thanks


Special thanks to Brian Fane, resident Apps DBA at the client site, for his insightful analysis of the benefits realized by this change. It's been a pleasure to work with him.


Taking inspiration


The situation reminds us of some motivational quotes:


"We are constantly faced with problems brilliantly disguised as great opportunities."


"You can not discover new oceans unless you have the courage to lose sight of the shore."


More quotes can be found here..


 

Comments:

Hi Gaurav, Thanks a lot for providing all this valuable information for tunning workflow at http://blogs.oracle.com/gverma/2007/03/a_performance_case_study_tamin.html. Please provide the sql script to find logical and physical read/write statistics for each submitted concurrent request if possible to my mail ID suraj_sharma1981@yahoo.com Thanks, Suraj Sharma

Posted by Suraj Sharma on March 12, 2008 at 03:27 PM EDT #

Suraj, the script et al is present in http://blogs.oracle.com/gverma/2007/03/27/ under How was it done? section

Posted by Gaurav Verma on March 13, 2008 at 02:01 AM EDT #
