« March 2007 | Main | May 2007 »

April 2007 Archives

April 20, 2007

Performance issue with "Retry Activities in Error" concurrent program (delivered in patch 5248424 - Order Management Exception Management (11.5.9)

Preface

From Note 402144.1 -- FAQ: Best Practices For Custom Order Entry Workflow Design and Note 405275.1 -- How to Detect Data Corruption and Purge More Eligible OEOH/OEOL Workflow Items for Order Management Workflow, one of the most recommended practices from the Order Management product team is to utilize the Retry Activities in Error" Concurrent Program which is  delivered in Patch 5248424 - Order Management Exception Management (11.5.9) and Patch 4420026 (11.5.10).

The need for it?

Order Management Workflow for order lines fails all the time, due to a variety of different reasons. While the Standard Order Entry form and Quick Sales Order form does allow the user to manually progress the Order line or Retry workflow from Workflow Monitor self service screen, this is SIMPLY not a scalable way or retrying Errored workflow activities.

Hack it!

So then customers come up with their own custom concurrent program for retrying errored workflow activities for OEOH/OEOL workflow items. What they do not realize is that using homegrown OM workflow retry scripts causes huge volume of OM workflow data to remain stuck in the WF_ITEM_ACTIVITY_STATUSES, WF_ITEM_ACTIVITY_STATUSES_H, WF_ITEM_ATTRIBUTE_VALUES tables.

Over time, this volume grows to millions of rows, which further results in all sorts of arcane workflow performance issues.

Is there a different way?

Could we have done something differently? Well, newsflash. OM product team has come up with a "Retry Activities in Error" Concurrent Program which does all this AND a whole lot of other good things like making the existing OM workflow data more purge eligible.

What does it do?

This concurrent program delivers new functionality. It also helps purge WFERROR/OMERROR child workflows of OEOH/OEOL which are still ACTIVE and shouldn't be (if parent OEOH/OEOL doesn't have any open errors currently). Therefore, it makes more OEOH/OEOL workflow items eligible for purge by "Purge Obsolete Runtime Workflow Data" concurrent program.

This "Retry Activities in Error" concurrent program is is  delivered in Patch 5248424 - Order Management Exception Management (11.5.9) and Patch 4420026 (11.5.10).

The caveat in 11.5.9

If you are running Oracle Applications 11.5.9, this program has a severe performance problem. Let's try to understand how this performance issue can be bridged.

First blood??

The first time the customer ran this concurrent program in a test environment, the timings blew the daylights out of everyone:

OEOL :

    Date Submitted :      16-FEB-2007 15:28:57
    Date Started :           16-FEB-2007 15:29:16
    Date Completed :     19-FEB-2007 06:34:38

OEOH :

    Date Submitted :      16-FEB-2007 15:29:41
    Date Started :          19-FEB-2007 06:34:40
    Date Completed :     19-FEB-2007 07:49:36

Did you notice it? It took more than 2 days for the first run to finish off!!

This was totally unacceptable. TAR 6162173.993 was opened with Oracle Support on this issue.

Getting to the bottom of it

TKPROF utility was used to find out the SQL which was using the maximum parse, execute and fetch times. This sql came at the top:

tkprof of retry activities in error first run.JPG:

It really seemed that if there were an index on OE_PROCESSING_MSGS.TRANSACTION_ID would be a lot better. http://etrm.oracle.com is an authoritative source on what indexes are available on any table for any release. A simple comparision of indexes available on OE_PROCESSING_MSGS in 11.5.9 and 11.5.10 revealed that the OE_PROCESSING_MSGS_U1 index was acutally missing in 11.5.9!

etrm comparision of oe_processing_msgs in 11.5.9 and 11.5.10.jpg:

The fix

Bug 5895614 was logged with OM development for providing the patch for missing index. The resolution was simply to provide the following index in 11.5.9:

CREATE UNIQUE INDEX ONT.OE_PROCESSING_MSGS_U1 ON ONT.OE_PROCESSING_MSGS (TRANSACTION_ID) LOGGING STORAGE (INITIAL 4K NEXT 3M MINEXTENTS 1 MAXEXTENTS UNLIMITED PCTINCREASE 0 FREELIST GROUPS 4 FREELISTS 4) PCTFREE
 10 INITRANS 11 MAXTRANS 255 COMPUTE STATISTICS;

 Patch 5895614 -- To improve the performance of concurrent program 'Retry Activities in Error' for 11.5.9 customers, an index is created on the table OE_PROCESSING_MSGS -- was provided on 27th March 2007.

Suggested Followup Readings

Please followup on the discussion on implementing good Order Management workflow best practices:

Best Practices For Custom Order Entry Workflow Design 

How to Detect Data Corruption and Purge More Eligible OEOH/OEOL Workflow Items for Order Management Workflow.

April 23, 2007

Mining useful monitoring/troubleshooting information from Oracle Configurator (CZ) logs


                                 mining uneful information from config log files.JPG:

Preface


Configurator is fairly new product and there arent as many published monitoring/troubleshooting  scripts out there to really find out if a particular model was preloaded properly or what was the response time of a model launch from the log files.

Im a strong believer in open source and in this article, some scripts are pur forward for mining the information in the Configurator logs which are created in $IAS_ORACLE_HOME/Apache/Jserv/logs.

These are just some scripts which were made to dig deeper into escalated configurator issues at one of the large implementors of Oracle Configurator module in North americas.

But before that...

These scripts are simply a starting point. They are provided "as is". The user would be responsible extending them as per the environment specific to different installations.

Having said so, the scripts are fairly simple and should be easy enough to extend.

PS: If you do happen to have the time to extend them more, please let me know at gaurav.verma@oracle.com.

I wish there were something ...

...which could tell me if a model preloading actually worked or not and if it was loaded into the JVM heap on the startup of configurator servlet?

...which could tell me how long it took for a model to preload?

...which could tell me accurately how long did it take for a user to actually launch a configurator model on the browser. The user claims that it took a long time for it to actually launch. How can I verify it?

... which could tell me why a user is getting a wierd error while launching configurator model? How can I find the root cause?

The list goes on. Well, we can certainly get answers for most of the above questions.

You would know, if you looked...

There is a goldmine of information available at $IAS_ORACLE_HOME/Apache/Jserv/logs. Essentially, there are at least three kinds of files:

Error log files for failed configurator launch sessions:

config_machine:web_prod> ls cz*.log
cz_exc_Applet18_20Apr07_0135PM_37068.log  cz_exc_DHTML172_20Apr07_0246PM_25823.log 
cz_exc_DHTML17_20Apr07_0110PM_35240.log   cz_exc_DHTML5_20Apr07_0107PM_34927.log
....
....


Log files for each session:

config_machine:web_prod>  ls cz-session-DHTML*.txt
cz-session-DHTML101_20Apr07_0209PM_50243.txt  cz-session-DHTML24_20Apr07_0140PM_31102.txt
cz-session-DHTML10_20Apr07_0108PM_28629.txt   cz-session-DHTML25_20Apr07_0115PM_03127.txt
....
....

The master configurator servlet log file:

This file keeps information about each session or model launch. Its like a double entry system.

usellx27:web_prod27a> ls *serv*.txt
cz-servlet_20Apr07_0105PM_55204.txt  cz-servlet_20Apr07_0105PM_55221.txt


Each file has a timestamp in its name when the servlet was started. In this case, we see that the 9iAS home was configured to start two instances of the CZ servlet. The following command tells us the way AutoConfig has been setup on the configurator tier:


config_machine:web_test> grep "s_oacore_nprocs"  $CONTEXT_FILE
         <oacore_nprocs oa_var="s_oacore_nprocs">2</oacore_nprocs>
config_machine:web_test>

Controlling runtime caching of models..

Know that, by default, configurator servlet caches models at runtime if they have been launched once. The following setting in $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties file controls it:

config_machine:web_test> grep dio_share jserv.properties
# wrapper.bin.parameters=-Dcz.uiservlet.dio_share=false
config_machine:web_test>

In this case, runtime caching of models has been enabled by default. Un-commenting the above setting would cause
caching of models to be disabled.

How does preloading happen and finding if a model has been preloaded..

Sometimes, due to missing directives, preloading may not work as expected. This kind of a situation is covered in Note 404719.1.

It is assumed here that the preloading file is $IAS_ORACLE_HOME/Apache/Jserv/etc/preload.txt.

To find if a model is been preloaded or not, the following script
getpreloadtimings.sh may be used. This script assumes that there is a read only database account called ivci. This may of course, be modified as per the new installation.

Here is a sample run of this script:

output of getpreloadtimings.sh.jpg:

The script also pre-assumes that an auxiliary script timediff.sh be present in the same directory. timediff.sh tells the difference between two timestamps.:


config_machine:web_test> more timediff.sh
#!/bin/ksh

( [ -z $1 ] || [ -z $2 ] ) && exit 1

time=$1
h1=$(expr "$time" : "\(..\):..:..")
m1=$(expr "$time" : "..:\(..\):..")
s1=$(expr "$time" : "..:..:\(..\)")

time=$2
h2=$(expr "$time" : "\(..\):..:..")
m2=$(expr "$time" : "..:\(..\):..")
s2=$(expr "$time" : "..:..:\(..\)")

#echo "from=$h1:$m1:$s1 to=$h2:$m2:$s2"

seconds=$(echo "$h2*3600+$m2*60+$s2-($h1*3600+$m1*60+$s1)" | bc)

if [ "$seconds" -lt 0 ] ; then
((seconds=seconds+86400))
fi

hh=$(expr $seconds / 3600)
mm=$(expr ( $seconds - $hh * 3600 ) / 60)
ss=$(expr $seconds - $hh * 3600 - $mm * 60)
echo "elapsed: $hh:$mm:$ss ($seconds seconds)"

Extending this concept, a summary script getpreloadtimings_summary.sh can also be created to give min and max preloading timings across different models in a group. Here is a sample run of getpreloadtimings_summary.sh :


config_machine:web_test> ./getpreloadtimings_summary.sh
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Preloading Min Max Statistics:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Max preloading time was 794 secs taken by TDX3SE
 Min preloading time was 642 secs taken by TRSX50FB
 The following models could not be preloaded so far: ALL models were preloaded
 Avg preloading time was 0:12:19 ( 739 seconds ) secs/model over a total of 42 models
config_machine:web_test>

Is that not useful information? Well, that depending on how escalated your user is, it pretty much can be. Its amazing what simple observation and some unix pattern matching skills can fetch you, if the concepts are applied properly.

Verifying time taken to launch a model....

calculate_response_time_for_model_launch.sh is a model script which can written to find out what DHTML session launched a model at what time and whether it ended or not. Here is a sample run of the the script:

calculate_response_time_for_model_launch.sh_output.JPG:

What happened to my model launch??

Lets say a DHTML session did not end (end time undefined), there would be a very good probability that it either was not pre-loaded and timed out OR there was a corresponding log file created for that DHTML session. The DHTML logfile with the session number will have extended error messages which will point you in the correct direction.

Here you see an example of 4 different error log files and different kind of error messages, ranging from ora-01000 (maximum open cursors exceeded) to Configurator developer model rule contradiction to other runtime exceptions. This is just a starting point for troubleshooting model launch failures.

example of configurator error log files.JPG:

The Good stuff: source code of the scripts


Listing of timediff.sh  - Auxiliary script used for all other scripts

Listing of getpreloadtimings.sh

Listing of getpreloadtimings_summary.sh

Listing of calculate_response_time_for_model_launch.sh


April 28, 2007

Seeing the advantage of using Hugemem and Hugepages with JDK1.5, Good bye JDK1.4

Preface

Well, J2SE1.5 has been out for a long time now. Along with the new version, we now have capability to use some really exciting features like HugePages. Using this feature has been covered in great detail in an earlier article Using +LargePages feature with JDK1.5 on Linux hugemem kernel on 11.5.10.

Performance Benchmark

In this article, we'll see performance benchmarks done for preloading different configurator models. The models were loosely grouped into ~20 groups having different number of models. This was done to keep the results objective and reduce skew.

The experiments were done on Oracle Applications 11i Configurator environment. The dedicated configurator 4 CPU (dual-core) machines were running linux RHAS3 (32-bit architechture). 

For JDK1.5, the linux Hugemem Kernal was used so that the Virtual process size coule be as high as 3.5g. Please read through Redhat Linux Kernels and process space limits.. Some eye-openers for having a better idea on process address limits in different linux kernels. Common Incorrect beliefs about process address limits on 32bit platforms is also a good article that you can go through to clear memory address basics.

However, in these experiments, an address space of 1.5g was used for JDK1.5 JVM heapsize. In case its not clear at this point, +LargePages feature allows you to 'lock' JVM heap in real memory and not let it to be swapped out.

The comparision graph of preloading models across Groups using JDK1.5 without hugemem and Largepages Vs JDK1.4 for 1g heap

Now, it was seen that JDK1.4 actually performed better for executing Oracle configurator code while doing the preloading against the same database.

jdk1.5 without hugemem and largepages versus jdk1.4:

The comparision graph of preloading models across Groups using JDK1.5 WITH hugemem and Largepages (1.5g heap) Vs JDK1.4 (1g) heap


using JDK1.5 WITH hugemem and Largepages Vs JDK1.4:

As can be seen, JDK1.5 behaves more if the LargePage feature is used.

Advantage Summary

The good thing about largepages is that the JVM heap's contents are never swapped out. Also, with JDK1.5, you can actually address more than 1G of heapsize and not start swapping out, if the hugemem kernel is used. Again, Redhat Linux Kernels and process space limits.. Some eye-openers is a good place to start for understanding the differences between different Linux kernels and their respective impact on virtual process size limitations.


About April 2007

This page contains all entries posted to Experiments from the Field..Based on True Stories in April 2007. They are listed from oldest to newest.

March 2007 is the previous archive.

May 2007 is the next archive.

Many more can be found on the main index page or by looking through the archives.

Creative Commons License
This weblog is licensed under a Creative Commons License.
Powered by
Movable Type and Oracle