Tuesday Dec 04, 2012

How to Achieve OC4J RMI Load Balancing

This is an old, Oracle SOA and OC4J 10G topic. In fact this is not even a SOA topic per se. Questions of RMI load balancing arise when you developed custom web applications accessing human tasks running off a remote SOA 10G cluster. Having returned from a customer who faced challenges with OC4J RMI load balancing, I felt there is still some confusions in the field how OC4J RMI load balancing work. Hence I decide to dust off an old tech note that I wrote a few years back and share it with the general public.

Here is the tech note:

Overview

A typical use case in Oracle SOA is that you are building web based, custom human tasks UI that will interact with the task services housed in a remote BPEL 10G cluster. Or, in a more generic way, you are just building a web based application in Java that needs to interact with the EJBs in a remote OC4J cluster. In either case, you are talking to an OC4J cluster as RMI client. Then immediately you must ask yourself the following questions:

1. How do I make sure that the web application, as an RMI client, even distribute its load against all the nodes in the remote OC4J cluster?

2. How do I make sure that the web application, as an RMI client, is resilient to the node failures in the remote OC4J cluster, so that in the unlikely case when one of the remote OC4J nodes fail, my web application will continue to function?

That is the topic of how to achieve load balancing with OC4J RMI client.

Solutions

You need to configure and code RMI load balancing in two places:

1. Provider URL can be specified with a comma separated list of URLs, so that the initial lookup will land to one of the available URLs.

2. Choose a proper value for the oracle.j2ee.rmi.loadBalance property, which, along side with the PROVIDER_URL property, is one of the JNDI properties passed to the JNDI lookup.(http://docs.oracle.com/cd/B31017_01/web.1013/b28958/rmi.htm#BABDGFBI)

More details below:

About the PROVIDER_URL

The JNDI property java.name.provider.url's job is, when the client looks up for a new context at the very first time in the client session, to provide a list of RMI context

The value of the JNDI property java.name.provider.url goes by the format of a single URL, or a comma separate list of URLs.
  • A single URL. For example: opmn:ormi://host1:6003:oc4j_instance1/appName1
  • A comma separated list of multiple URLs. For examples:  opmn:ormi://host1:6003:oc4j_instanc1/appName, opmn:ormi://host2:6003:oc4j_instance1/appName, opmn:ormi://host3:6003:oc4j_instance1/appName

When the client looks up for a new Context the very first time in the client session, it sends a query against the OPMN referenced by the provider URL. The OPMN host and port specifies the destination of such query, and the OC4J instance name and appName are actually the “where clause” of the query.

When the PROVIDER URL reference a single OPMN server

Let's consider the case when the provider url only reference a single OPMN server of the destination cluster. In this case, that single OPMN server receives the query and returns a list of the qualified Contexts from all OC4Js within the cluster, even though there is a single OPMN server in the provider URL. A context represent a particular starting point at a particular server for subsequent object lookup.

For example, if the URL is opmn:ormi://host1:6003:oc4j_instance1/appName, then, OPMN will return the following contexts:

  • appName on oc4j_instance1 on host1
  • appName on oc4j_instance1 on host2,
  • appName on oc4j_instance1 on host3, 
(provided that host1, host2, host3 are all in the same cluster)

Please note that
  • One OPMN will be sufficient to find the list of all contexts from the entire cluster that satisfy the JNDI lookup query. You can do an experiment by shutting down appName on host1, and observe that OPMN on host1 will still be able to return you appname on host2 and appName on host3.
When the PROVIDER URL reference a comma separated list of multiple OPMN servers


When the JNDI propery java.naming.provider.url references a comma separated list of multiple URLs, the lookup will return the exact same things as with the single OPMN server: a list of qualified Contexts from the cluster.

The purpose of having multiple OPMN servers is to provide high availability in the initial context creation, such that if OPMN at host1 is unavailable, client will try the lookup via OPMN on host2, and so on. After the initial lookup returns and cache a list of contexts, the JNDI URL(s) are no longer used in the same client session. That explains why removing the 3rd URL from the list of JNDI URLs will not stop the client from getting the EJB on the 3rd server.


About the oracle.j2ee.rmi.loadBalance Property

After the client acquires the list of contexts, it will cache it at the client side as “list of available RMI contexts”.  This list includes all the servers in the destination cluster. This list will stay in the cache until the client session (JVM) ends. The RMI load balancing against the destination cluster is happening at the client side, as the client is switching between the members of the list.

Whether and how often the client will fresh the Context from the list of Context is based on the value of the  oracle.j2ee.rmi.loadBalance. The documentation at http://docs.oracle.com/cd/B31017_01/web.1013/b28958/rmi.htm#BABDGFBI list all the available values for the oracle.j2ee.rmi.loadBalance.

Value Description
client
If specified, the client interacts with the OC4J process that was initially chosen at the first lookup for the entire conversation.
context
Used for a Web client (servlet or JSP) that will access EJBs in a clustered OC4J environment.
If specified, a new Context object for a randomly-selected OC4J instance will be returned each time InitialContext() is invoked.
lookup
Used for a standalone client that will access EJBs in a clustered OC4J environment.
If specified, a new Context object for a randomly-selected OC4J instance will be created each time the client calls Context.lookup().


Please note the regardless of the setting of oracle.j2ee.rmi.loadBalance property, the “refresh” only occurs at the client. The client can only choose from the "list of available context" that was returned and cached from the very first lookup. That is, the client will merely get a new Context object from the “list of available RMI contexts” from the cache at the client side. The client will NOT go to the OPMN server again to get the list. That also implies that if you are adding a node to the server cluster AFTER the client’s initial lookup, the client would not know it because neither the server nor the client will initiate a refresh of the “list of available servers” to reflect the new node.

About High Availability (i.e. Resilience Against Node Failure of Remote OC4J Cluster)

What we have discussed above is about load balancing. Let's also discuss high availability.

This is how the High Availability works in RMI: when the client use the context but get an exception such as socket is closed, it knows that the server referenced by that Context is problematic and will try to get another unused Context from the “list of available contexts”. Again, this list is the list that was returned and cached at the very first lookup in the entire client session.

Using BPEL Performance Statistics to Diagnose Performance Bottlenecks

Tuning performance of Oracle SOA 11G applications could be challenging. Because SOA is a platform for you to build composite applications that connect many applications and "services", when the overall performance is slow, the bottlenecks could be anywhere in the system: the applications/services that SOA connects to, the infrastructure database, or the SOA server itself.How to quickly identify the bottleneck becomes crucial in tuning the overall performance.

Fortunately, the BPEL engine in Oracle SOA 11G (and 10G, for that matter) collects BPEL Engine Performance Statistics, which show the latencies of low level BPEL engine activities. The BPEL engine performance statistics can make it a bit easier for you to identify the performance bottleneck.

Although the BPEL engine performance statistics are always available, the access to and interpretation of them are somewhat obscure in the early and current (PS5) 11G versions.

This blog attempts to offer instructions that help you to enable, retrieve and interpret the performance statistics, before the future versions provides a more pleasant user experience.

Overview of BPEL Engine Performance Statistics 

SOA BPEL has a feature of collecting some performance statistics and store them in memory.

One MBean attribute, StatLastN, configures the size of the memory buffer to store the statistics. This memory buffer is a "moving window", in a way that old statistics will be flushed out by the new if the amount of data exceeds the buffer size. Since the buffer size is limited by StatLastN, impacts of statistics collection on performance is minimal. By default StatLastN=-1, which means no collection of performance data.

Once the statistics are collected in the memory buffer, they can be retrieved via another MBean oracle.as.soainfra.bpel:Location=[Server Name],name=BPELEngine,type=BPELEngine.>

My friend in Oracle SOA development wrote this simple 'bpelstat' web app that looks up and retrieves the performance data from the MBean and displays it in a human readable form. It does not have beautiful UI but it is fairly useful.

Although in Oracle SOA 11.1.1.5 onwards the same statistics can be viewed via a more elegant UI under "request break down" at EM -> SOA Infrastructure -> Service Engines -> BPEL -> Statistics, some unsophisticated minds like mine may still prefer the simplicity of the 'bpelstat' JSP. One thing that simple JSP does do well is that you can save the page and send it to someone to further analyze

Follows are the instructions of how to install and invoke the BPEL statistic JSP. My friend in SOA Development will soon blog about interpreting the statistics. Stay tuned.

Step1: Enable BPEL Engine Statistics for Each SOA Servers via Enterprise Manager

First st you need to set the StatLastN to some number as a way to enable the collection of BPEL Engine Performance Statistics

  • EM Console -> soa-infra(Server Name) -> SOA Infrastructure -> SOA Administration -> BPEL Properties
  • Click on "More BPEL Configuration Properties"
  • Click on attribute "StatLastN", set its value to some integer number. Typically you want to set it 1000 or more.

Step 2: Download and Deploy bpelstat.war File to Admin Server,

Note: the WAR file contains a JSP that does NOT have any security restriction. You do NOT want to keep in your production server for a long time as it is a security hazard. Deactivate the war once you are done.
  • Download the bpelstat.war to your local PC
  • At WebLogic Console, Go to Deployments -> Install
  • Click on the "upload your file(s)"
  • Click the "Browse" button to upload the deployment to Admin Server
  • Accept the uploaded file as the path, click next
  • Check the default option "Install this deployment as an application"
  • Check "AdminServer" as the target server
  • Finish the rest of the deployment with default settings


  • Console -> Deployments
  • Check the box next to "bpelstat" application
  • Click on the "Start" button. It will change the state of the app from "prepared" to "active"

Step 3: Invoke the BPEL Statistic Tool


  • The BPELStat tool merely call the MBean of BPEL server and collects and display the in-memory performance statics. You usually want to do that after some peak loads.
  • Go to http://<admin-server-host>:<admin-server-port>/bpelstat
  • Enter the correct admin hostname, port, username and password
  • Enter the SOA Server Name from which you want to collect the performance statistics. For example, SOA_MS1, etc.
  • Click Submit
  • Keep doing the same for all SOA servers.

Step 3: Interpret the BPEL Engine Statistics

You will see a few categories of BPEL Statistics from the JSP Page.

First it starts with the overall latency of BPEL processes, grouped by synchronous and asynchronous processes. Then it provides the further break down of the measurements through the life time of a BPEL request, which is called the "request break down".

1. Overall latency of BPEL processes

The top of the page shows that the elapse time of executing the synchronous process TestSyncBPELProcess from the composite TestComposite averages at about 1543.21ms, while the elapse time of executing the asynchronous process TestAsyncBPELProcess from the composite TestComposite2 averages at about 1765.43ms. The maximum and minimum latency were also shown.

Synchronous process statistics
<statistics>
    <stats key="default/TestComposite!2.0.2-ScopedJMSOSB*soa_bfba2527-a9ba-41a7-95c5-87e49c32f4ff/TestSyncBPELProcess" min="1234" max="4567" average="1543.21" count="1000">
    </stats>
</statistics>



Asynchronous process statistics
<statistics>
    <stats key="default/TestComposite2!2.0.2-ScopedJMSOSB*soa_bfba2527-a9ba-41a7-95c5-87e49c32f4ff/TestAsyncBPELProcess" min="2234" max="3234" average="1765.43" count="1000">
    </stats>
</statistics>


2. Request break down

Under the overall latency categorized by synchronous and asynchronous processes is the "Request breakdown". Organized by statistic keys, the Request breakdown gives finer grain performance statistics through the life time of the BPEL requests.It uses indention to show the hierarchy of the statistics.

Request breakdown
<statistics>
    <stats key="eng-composite-request" min="0" max="0" average="0.0" count="0">
        <stats key="eng-single-request" min="22" max="606" average="258.43" count="277">
            <stats key="populate-context" min="0" max="0" average="0.0" count="248">



Please note that in SOA 11.1.1.6, the statistics under Request breakdown is aggregated together cross all the BPEL processes based on statistic keys. It does not differentiate between BPEL processes. If two BPEL processes happen to have the statistic that share same statistic key, the statistics from two BPEL processes will be aggregated together. Keep this in mind when we go through more details below.


2.1 BPEL process activity latencies

A very useful measurement in the Request Breakdown is the performance statistics of the BPEL activities you put in your BPEL processes: Assign, Invoke, Receive, etc. The names of the measurement in the JSP page directly come from the names to assign to each BPEL activity. These measurements are under the statistic key "actual-perform"

Example 1: 
Follows is the measurement for BPEL activity "AssignInvokeCreditProvider_Input", which looks like the Assign activity in a BPEL process that assign an input variable before passing it to the invocation:


                               <stats key="AssignInvokeCreditProvider_Input" min="1" max="8" average="1.9" count="153">
                                    <stats key="sensor-send-activity-data" min="0" max="1" average="0.0" count="306">
                                    </stats>
                                    <stats key="sensor-send-variable-data" min="0" max="0" average="0.0" count="153">
                                    </stats>
                                    <stats key="monitor-send-activity-data" min="0" max="0" average="0.0" count="306">
                                    </stats>
                                </stats>

Note: because as previously mentioned that the statistics cross all BPEL processes are aggregated together based on statistic keys, if two BPEL processes happen to name their Invoke activity the same name, they will show up at one measurement (i.e. statistic key).

Example 2:
Follows is the measurement of BPEL activity called "InvokeCreditProvider". You can not only see that by average it takes 3.31ms to finish this call (pretty fast) but also you can see from the further break down that most of this 3.31 ms was spent on the "invoke-service". 

                                <stats key="InvokeCreditProvider" min="1" max="13" average="3.31" count="153">
                                    <stats key="initiate-correlation-set-again" min="0" max="0" average="0.0" count="153">
                                    </stats>
                                    <stats key="invoke-service" min="1" max="13" average="3.08" count="153">
                                        <stats key="prep-call" min="0" max="1" average="0.04" count="153">
                                        </stats>
                                    </stats>
                                    <stats key="initiate-correlation-set" min="0" max="0" average="0.0" count="153">
                                    </stats>
                                    <stats key="sensor-send-activity-data" min="0" max="0" average="0.0" count="306">
                                    </stats>
                                    <stats key="sensor-send-variable-data" min="0" max="0" average="0.0" count="153">
                                    </stats>
                                    <stats key="monitor-send-activity-data" min="0" max="0" average="0.0" count="306">
                                    </stats>
                                    <stats key="update-audit-trail" min="0" max="2" average="0.03" count="153">
                                    </stats>
                                </stats>



2.2 BPEL engine activity latency

Another type of measurements under Request breakdown are the latencies of underlying system level engine activities. These activities are not directly tied to a particular BPEL process or process activity, but they are critical factors in the overall engine performance. These activities include the latency of saving asynchronous requests to database, and latency of process dehydration.

My friend Malkit Bhasin is working on providing more information on interpreting the statistics on engine activities on his blog (https://blogs.oracle.com/malkit/). I will update this blog once the information becomes available.

Update on 2012-10-02: My friend Malkit Bhasin has published the detail interpretation of the BPEL service engine statistics at his blog http://malkit.blogspot.com/2012/09/oracle-bpel-engine-soa-suite.html.

Retrieve Performance Data from SOA Infrastructure Database

Here I would like offer examples of some basic SQL queries you can run against the infrastructure database of Oracle SOA Suite 11G to acquire the performance statistics for a given period of time. The final version of the script will prompt for the start and end time of the period of your interest.[Read More]
About


This is the blog for the Oracle FMW Architects team fondly known as the A-Team. The A-Team is the central, technical, outbound team as part of the FMW Development organization working with Oracle's largest and most important customers. We support Oracle Sales, Consulting and Support when deep technical and architectural help is needed from Oracle Development.
Primarily this blog is tailored for SOA issues (BPEL, OSB, BPM, Adapters, CEP, B2B, JCAP)that are encountered by our team. Expect real solutions to customer problems, encountered during customer engagements.
We will highlight best practices, workarounds, architectural discussions, and discuss topics that are relevant in the SOA technical space today.

Search

Archives
« December 2012 »
SunMonTueWedThuFriSat
      
1
2
3
5
6
7
8
9
11
12
13
14
16
17
18
19
20
22
23
24
25
26
27
28
29
30
31
     
Today