Thursday Mar 27, 2014

Not Just a Cache

Coherence as a Compute Grid

Coherence is best known as a data grid, providing distributed caching with the ability to move processing to the data in the grid.  Less well known is the fact that Coherence can also function as a compute grid, distributing work across multiple servers in a cluster.  In this entry, which was co-written with my colleague Utkarsh Nadkarni, we will look at using Coherence as a compute grid through the use of the Work Manager API and compare it to manipulating data directly in the grid using Entry Processors.

Coherence Distributed Computing Options

The Coherence documentation identifies several methods for distributing work across the cluster; see Processing Data in a Cache.  They can be summarized as follows:

  • Entry Processors
    • The InvocableMap interface, which the NamedCache interface extends, provides support for executing an agent (EntryProcessor or EntryAggregator) on individual entries within the cache.
    • The entries may or may not already exist; either way the agent is executed once for each key provided, or, if no keys are provided, once for each object in the cache.
    • In Enterprise and Grid editions of Coherence the entry processors are executed on the primary cache nodes holding the cached entries.
    • Agents can return results.
    • One agent executes multiple times per cache node, once for each key targeted on the node.
  • Invocation Service
    • An InvocationService provides support for executing an agent on one or more nodes within the grid.
    • Execution may be targeted at specific nodes or at all nodes running the Invocation Service.
    • Agents can return results.
    • One agent executes once per node.
  • Work Managers
    • A WorkManager class provides a grid-aware implementation of the CommonJ WorkManager, which can be used to run tasks across multiple threads on multiple nodes within the grid.
    • WorkManagers run on multiple nodes.
    • Each WorkManager may have multiple threads.
    • Tasks implement the Work interface and are assigned to specific WorkManager threads to execute.
    • Each task is executed once.

Three Models of Distributed Computation

The previous section listing the distributed computing options in Coherence shows that there are three distinct execution models:

  • Per Cache Entry Execution (Entry Processor)
    • Execute the agent on the entry corresponding to a cache key.
    • Entries processed on a single thread per node.
    • Parallelism across nodes.
  • Per Node Execution (Invocation Service)
    • Execute the same agent once per node.
    • Agent processed on a single thread per node.
    • Parallelism across nodes.
  • Per Task Execution (Work Manager)
    • Each task executed once.
    • Parallelism across nodes and across threads within a node.

The entry processor is good for operating on individual cache entries.  It is not so good for working on groups of cache entries.
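
As an illustration, a minimal entry processor that increments a counter held against a key might look like the sketch below.  This is not code from the downloadable sample; the class name is made up for the example.

import com.tangosol.util.InvocableMap;
import com.tangosol.util.processor.AbstractProcessor;

// Hypothetical sketch: increments an Integer counter stored against a cache key.
public class IncrementProcessor extends AbstractProcessor {
    public Object process(InvocableMap.Entry entry) {
        // Runs on the storage node that owns this entry.
        Integer current = (Integer) entry.getValue();
        int next = (current == null) ? 1 : current.intValue() + 1;
        entry.setValue(Integer.valueOf(next));
        return Integer.valueOf(next); // returned to the caller via NamedCache.invoke
    }
}

Invoking it is then a single call against the cache, for example cache.invoke(key, new IncrementProcessor()) for one entry or cache.invokeAll(keys, new IncrementProcessor()) for a set of keys.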

The invocation service is good for performing checks on a node, but is limited in its parallelism.
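
For comparison, a hedged sketch of an invocation service agent is shown below.  The service name "InvocationService" is assumed to be defined in the Coherence operational configuration, and the class name is made up for the example.

import java.util.Map;

import com.tangosol.net.AbstractInvocable;
import com.tangosol.net.CacheFactory;
import com.tangosol.net.InvocationService;

// Hypothetical sketch: runs once per targeted node and reports that node's free heap.
public class FreeMemoryCheck extends AbstractInvocable {
    public void run() {
        setResult(Long.valueOf(Runtime.getRuntime().freeMemory()));
    }

    public static void main(String[] args) {
        // The service name must match an invocation service defined in the configuration.
        InvocationService service =
                (InvocationService) CacheFactory.getService("InvocationService");
        // A null member set targets every node running the invocation service.
        Map results = service.query(new FreeMemoryCheck(), null);
        System.out.println(results); // one result per member
    }
}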

The work manager is good for operating on groups of related entries in the cache or performing non-cache related work in parallel.  It has a high degree of parallelism.
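
A corresponding sketch of a CommonJ Work implementation is shown below; the class name, constructor and cache name are assumptions for illustration only, not part of the downloadable sample.

import java.io.Serializable;

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

import commonj.work.Work;

// Hypothetical sketch: a serializable unit of work that operates on a group of related entries.
public class ReconcileOrderWork implements Work, Serializable {
    private final Object orderKey;

    public ReconcileOrderWork(Object orderKey) {
        this.orderKey = orderKey;
    }

    public void run() {
        // Executed on one Work Manager thread somewhere in the grid.
        NamedCache orders = CacheFactory.getCache("orders"); // assumed cache name
        Object order = orders.get(orderKey);
        // ... reconcile the order against any related entries ...
        orders.put(orderKey, order);
    }

    public boolean isDaemon() {
        return false; // a short-lived task, not a long-running daemon
    }

    public void release() {
        // Called if the Work Manager needs to stop the work early; nothing to do here.
    }
}

The work is then submitted with something like WorkItem item = workManager.schedule(new ReconcileOrderWork(key));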

As you can see, the primary choice for distributed computing comes down to the Work Manager and the Entry Processor.

Differences between using Entry Processors and Work Managers in Coherence

The main differences are summarized below, aspect by aspect:

  • Degree of parallelization
    • Entry Processors: A function of the number of Coherence nodes.  Entry processors run concurrently across all nodes in the cluster, but within each node only one instance of the entry processor executes at a time.
    • Work Managers: A function of the number of Work Manager threads.  The Work is run concurrently across all threads in all Work Manager instances.
  • Transactionality
    • Entry Processors: Transactional.  If an entry processor running on one node does not complete (say, due to that node crashing), the entries targeted will be processed by an entry processor on another node.
    • Work Managers: Not transactional.  The specification does not explicitly state what the response should be if a remote server crashes during execution; the current implementation returns WORK_COMPLETED with a WorkCompletedException as the result.  If a Work does not run to completion, it is the responsibility of the client to resubmit the Work to the Work Manager.
  • How is the cache accessed or mutated?
    • Entry Processors: Operations against the cache contents are executed by (and thus within the localized context of) the cache.
    • Work Managers: Accesses and changes to the cache are made directly through the cache API.
  • Where is the processing performed?
    • Entry Processors: In the same JVM where the entries-to-be-processed reside.
    • Work Managers: In the Work Manager server, which may not be the same JVM where the entries-to-be-processed reside.
  • Network traffic
    • Entry Processors: A function of the size of the EntryProcessor, one copy of which is transmitted to each cache node.  Typically an EntryProcessor is much smaller than the data transferred across nodes in the Work Manager approach, which makes the EntryProcessor approach more network-efficient and hence more scalable.
    • Work Managers: A function of the number of Work objects (multiple of which may be sent to each server) and the size of the data set transferred from the backing map to the Work Manager server.
  • Distribution of “tasks”
    • Entry Processors: Tasks are moved to the location at which the entries-to-be-processed are being managed.  This may result in a random distribution of tasks, but the distribution tends to become equitable as the number of entries increases.
    • Work Managers: Tasks are distributed equally across the threads in the Work Manager instances.
  • Implementation of the EntryProcessor or Work class
    • Entry Processors: Create a class that extends AbstractProcessor and implement the process method.
    • Work Managers: Create a serializable class that implements commonj.work.Work and implement the run method.
  • Implementation of the “task”
    • Entry Processors: In the process method, update the cache item based on the key passed in.
    • Work Managers: In the run method, get a reference to the named cache, get the cache item, change it and put it back into the named cache.
  • Completion notification
    • Entry Processors: When the NamedCache.invoke method completes, all the entry processors have completed executing.
    • Work Managers: When a task is submitted for execution it executes asynchronously on the Work Manager threads in the cluster.  Status may be obtained by registering a commonj.work.WorkListener when calling WorkManager.schedule; this provides updates when the Work is accepted, started, and completed or rejected.  Alternatively the WorkManager.waitForAll and WorkManager.waitForAny methods allow blocking waits for all or one result respectively.
  • Returned results
    • Entry Processors: java.lang.Object when executed on a single cache item, containing the result returned from the EntryProcessor; java.util.Map when executed on a collection of keys, containing the results of invoking the EntryProcessor against each of the specified keys.
    • Work Managers: commonj.work.WorkItem, with three possible outcomes: the Work is not yet complete, in which case WorkItem.getResult returns null; the Work started but completed with an exception (perhaps because a Work Manager instance terminated abruptly), indicated by an exception thrown by WorkItem.getResult; or the Work ran to completion, in which case WorkItem.getResult returns a non-null result and throws no exception.
  • Error handling
    • Entry Processors: Failure of a node results in all the work assigned to that node being executed on the new primary.  This may result in some work being executed twice, but Coherence ensures that the cache is only updated once per item.
    • Work Managers: Failure of a node results in the loss of the scheduled tasks assigned to that node.  Completed tasks are sent back to the client as they complete.
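
To make the completion notification options above concrete, here is a minimal sketch that uses only the CommonJ interfaces; the class name and the printing of results are illustrative assumptions.

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import commonj.work.Work;
import commonj.work.WorkException;
import commonj.work.WorkItem;
import commonj.work.WorkManager;

// Hypothetical driver: schedules a batch of Work and blocks until it all completes.
public class WorkDriver {
    public static void runAll(WorkManager workManager, Collection<Work> tasks)
            throws WorkException, InterruptedException {
        List<WorkItem> items = new ArrayList<WorkItem>();
        for (Work task : tasks) {
            items.add(workManager.schedule(task)); // asynchronous submission
        }
        // Block until every WorkItem reports completion.
        workManager.waitForAll(items, WorkManager.INDEFINITE);
        for (WorkItem item : items) {
            // getResult returns the completed Work (null if not yet complete)
            // and throws a WorkException if the Work failed.
            System.out.println(item.getResult());
        }
    }
}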

Fault Handling Extension

Entry processors have excellent error handling within Coherence.  Work Managers less so.  In order to provide resiliency on node failure I implemented a “RetryWorkManager” class that detects tasks that have failed to complete successfully and resubmits them to the grid for another attempt.

A JDeveloper project with the RetryWorkManager is available for download here.  It includes sample code to run a simple task across multiple work manager threads.

To create a new RetryWorkManager that will retry failed work twice, you would use this:

WorkManager manager = new RetryWorkManager("WorkManagerName", 2);  // Number of retries; if no retry count is provided then the default is 0.

You can control the number of retries at the individual work level as shown below:

WorkItem workItem = schedule(work); // Use the number of retries set at WorkManager creation
WorkItem workItem = schedule(work, workListener); // Use the number of retries set at WorkManager creation
WorkItem workItem = schedule(work, 4); // Change the number of retries
WorkItem workItem = schedule(work, workListener, 4); // Change the number of retries

Currently the RetryWorkManager defaults to having 0 threads.  To change this, use the following schedule call:

WorkItem workItem = schedule(work, workListener, 3, 4); // Change the number of threads (3) and retries (4)
Note that none of this sample code is supported by Oracle in any way, and is provided purely as a sample of what can be done with Coherence.

How the RetryWorkManager Works

The RetryWorkManager delegates most operations to a Coherence WorkManager instance.  It creates a listener, the RetryWorkManagerListener, to intercept status updates.  On receiving a WORK_COMPLETED callback the listener checks the result to see if the completion is due to an error.  If an error occurred and there are retries left then the work is resubmitted.  The WorkItem returned by scheduling an event is wrapped in a RetryWorkItem.  This RetryWorkItem is updated with a new Coherence WorkItem when the task is retried.  If the client registers its own WorkListener then the RetryWorkManagerListener delegates non-retriable events to the client listener.  Finally the waitForAll and waitForAny methods are modified to deal with work items being resubmitted in the event of failure.
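
A much-simplified sketch of that retry mechanism is shown below.  It is not the downloadable RetryWorkManager code; the class and field names are assumptions, and it only illustrates the idea of intercepting WORK_COMPLETED and resubmitting the work.

import commonj.work.Work;
import commonj.work.WorkEvent;
import commonj.work.WorkException;
import commonj.work.WorkListener;
import commonj.work.WorkManager;

// Hypothetical illustration of retry-on-failure, not the downloadable RetryWorkManager.
public class RetryingListener implements WorkListener {
    private final WorkManager delegate;
    private final Work work;                    // kept so it can be resubmitted
    private final WorkListener clientListener;  // may be null
    private int retriesLeft;

    public RetryingListener(WorkManager delegate, Work work,
                            WorkListener clientListener, int retriesLeft) {
        this.delegate = delegate;
        this.work = work;
        this.clientListener = clientListener;
        this.retriesLeft = retriesLeft;
    }

    public void workAccepted(WorkEvent event) { if (clientListener != null) clientListener.workAccepted(event); }
    public void workRejected(WorkEvent event) { if (clientListener != null) clientListener.workRejected(event); }
    public void workStarted(WorkEvent event)  { if (clientListener != null) clientListener.workStarted(event); }

    public void workCompleted(WorkEvent event) {
        if (event.getException() != null && retriesLeft > 0) {
            retriesLeft--;
            try {
                // Resubmit the original Work for another attempt; a wrapping WorkItem
                // would be updated here to point at the new submission.
                delegate.schedule(work, this);
            } catch (WorkException e) {
                if (clientListener != null) clientListener.workCompleted(event);
            }
        } else if (clientListener != null) {
            clientListener.workCompleted(event); // genuine completion, or retries exhausted
        }
    }
}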

Sample Code for EntryProcessor and RetryWorkManager

The downloadable project contains sample code for running the work manager and an entry processor.

The demo implements a 3-tier architecture:

  1. Coherence Cache Servers
    • Can be started by running RunCacheServer.cmd
    • Runs a distributed cache used by the Task to be executed in the grid
  2. Coherence Work Manager Servers
    • Can be started by running RunWorkManagerServer.cmd
    • Takes no parameters
    • Runs two threads for executing tasks
  3. Coherence Work Manager Clients
    • Can be started by running RunWorkManagerClient.cmd
    • Takes three parameters currently
      • Work Manager name - should be "AntonyWork" - default is "AntonyWork"
      • Number of tasks to schedule - default is 10
      • Time to wait for tasks to complete in seconds - default is 60

The task stores the number of times it has been executed in the cache, so multiple runs will see the counter incrementing.  The choice between EntryProcessor and WorkManager is controlled by changing the value of USE_ENTRY_PROCESSOR between false and true in the RunWorkManagerClient.cmd script.

The SetWorkManagerEnv.cmd script should be edited to point to the Coherence home directory and the Java home directory.

Summary

If you need to perform operations on cache entries and don’t need to have cross-checks between the entries then the best solution is to use an entry processor.  The entry processor is fault tolerant and updates to the cached entity will be performed once only.

If you need to perform generic work that may need to touch multiple related cache entries then the work manager may be a better solution.  The extensions I created in the RetryWorkManager provide a degree of resiliency to deal with node failure without impacting the client.

The RetryWorkManager can be downloaded here.

Wednesday Feb 26, 2014

Clustering Events

Setting up an Oracle Event Processing Cluster

Recently I was working with Oracle Event Processing (OEP) and needed to set it up as part of a high availability cluster.  OEP uses Coherence for quorum membership in an OEP cluster.  Because the solution used caching it was also necessary to include access to external Coherence nodes.  Input messages needed to be duplicated across multiple OEP streams, so a JMS Topic adapter had to be configured.  Finally, only one copy of each output event was desired, requiring the use of an HA adapter.  In this blog post I will go through the steps required to implement a true HA OEP cluster.

OEP High Availability Review

The diagram below shows a very simple non-HA OEP configuration:

Events are received from a source (JMS in this blog).  The events are processed by an event processing network which makes use of a cache (Coherence in this blog).  Finally any output events are emitted.  The output events could go to any destination but in this blog we will emit them to a JMS queue.

OEP provides high availability by having multiple event processing instances processing the same event stream in an OEP cluster.  One instance acts as the primary and the other instances act as secondary processors.  Usually only the primary will output events as shown in the diagram below (top stream is the primary):

The actual event processing is the same as in the previous non-HA example.  What is different is how input and output events are handled.  Because we want to minimize or avoid duplicate events we have added an HA output adapter to the event processing network.  This adapter acts as a filter, so that only the primary stream will emit events to our output queue.  If the processing of events within the network depends on the time at which events are received, then it is necessary to synchronize the arrival timestamps of events across the cluster by using an HA input adapter.

OEP Cluster Creation

Let's begin by setting up the base OEP cluster.  To do this we create new OEP configurations on each machine in the cluster.  The steps are outlined below.  Note that the same steps are performed on each machine for each server that will run on that machine:

  • Run ${MW_HOME}/ocep_11.1/common/bin/config.sh.
    • MW_HOME is the installation directory; note that multiple Fusion Middleware products may be installed in this directory.
  • When prompted, choose “Create a new OEP domain”.
  • Provide administrator credentials.
    • Make sure you provide the same credentials on all machines in the cluster.
  • Specify a “Server name” and “Server listen port”.
    • Each OEP server must have a unique name.
    • Different servers can share the same “Server listen port” unless they are running on the same host.
  • Provide keystore credentials.
    • Make sure you provide the same credentials on all machines in the cluster.
  • Configure any required JDBC data source.
  • Provide the “Domain Name” and “Domain location”.
    • All servers must have the same “Domain name”.
    • The “Domain location” may be different on each server, but I would keep it the same to simplify administration.
    • Multiple servers on the same machine can share the “Domain location” because their configuration will be placed in the directory corresponding to their server name.
  • Create domain!

Configuring an OEP Cluster

Now that we have created our servers we need to configure them so that they can find each other.  OEP uses Oracle Coherence to determine cluster membership.  Coherence clusters can use either multicast or unicast to discover already running members of a cluster.  Multicast has the advantage that it is easy to set up and scales better (see http://www.ateam-oracle.com/using-wka-in-large-coherence-clusters-disabling-multicast/) but has a number of challenges, including failure to propagate by default through routers and accidentally joining the wrong cluster because someone else chose the same multicast settings.  We will show how to use both unicast and multicast to discover the cluster.

Multicast Discovery

Coherence multicast uses a class D multicast address that is shared by all servers in the cluster.  On startup a Coherence node broadcasts a message to the multicast address looking for an existing cluster.  If no one responds then the node will start the cluster.

Unicast Discovery

Coherence unicast uses Well Known Addresses (WKAs).  Each server in the cluster needs a dedicated listen address/port combination.  A subset of these addresses is configured as WKAs and shared between all members of the cluster.  As long as at least one of the WKAs is up and running, servers can join the cluster.  If a server does not find any cluster members then it checks to see if its listen address and port are in the WKA list.  If they are then that server will start the cluster, otherwise it will wait for a WKA server to become available.
  To configure a cluster the same steps need to be followed for each server in the cluster:
  • Set an event server address in the config.xml file.
    • Add the following to the <cluster> element:
      <cluster>
          <server-name>server1</server-name>
          <server-host-name>oep1.oracle.com</server-host-name>
      </cluster>
    • The “server-name” is displayed in the visualizer and should be unique to the server.

    • The “server-host-name” is used by the visualizer to access remote servers.

    • The “server-host-name” must be an IP address or it must resolve to an IP address that is accessible from all other servers in the cluster.

    • The listening port is configured in the <netio> section of the config.xml.

    • The server-host-name/listening port combination should be unique to each server.

 
  • Set a common cluster multicast listen address shared by all servers in the config.xml file.
    • Add the following to the <cluster> element:
      <cluster>
          …
          <!-- For use in Coherence multicast only! -->
          <multicast-address>239.255.200.200</multicast-address>
          <multicast-port>9200</multicast-port>
      </cluster>
    • The “multicast-address” must be able to be routed through any routers between servers in the cluster.

  • Optionally you can specify the bind address of the server; this allows you to control port usage and determine which network is used by Coherence.

    • Create a “tangosol-coherence-override.xml” file in the ${DOMAIN}/{SERVERNAME}/config directory for each server in the cluster.
      <?xml version='1.0'?>
      <coherence>
          <cluster-config>
              <unicast-listener>
                   <!-- This server's Coherence address and port number -->
                  <address>192.168.56.91</address>
                  <port>9200</port>
              </unicast-listener>
          </cluster-config>
      </coherence>
  • Configure the Coherence WKA cluster discovery (unicast discovery only).

    • Create a “tangosol-coherence-override.xml” file in the ${DOMAIN}/{SERVERNAME}/config directory for each server in the cluster.
      <?xml version='1.0'?>
      <coherence>
          <cluster-config>
              <unicast-listener>
                  <!-- WKA Configuration -->
                  <well-known-addresses>
                      <socket-address id="1">
                          <address>192.168.56.91</address>
                          <port>9200</port>
                      </socket-address>
                      <socket-address id="2">
                          <address>192.168.56.92</address>
                          <port>9200</port>
                      </socket-address>
                  </well-known-addresses>
                <!-- This server's Coherence address and port number -->
                  <address>192.168.56.91</address>
                  <port>9200</port>
              </unicast-listener>
          </cluster-config>
      </coherence>

    • List at least two servers in the <socket-address> elements.

    • For each <socket-address> element there should be a server whose own <address> and <port> elements (directly under <unicast-listener>) match that socket address.

    • One of the servers listed in the <well-known-addresses> element must be the first server started.

    • Not all servers need to be listed in <well-known-addresses>, but see previous point.

 
  • Enable clustering using a Coherence cluster.
    • Add the following to the <cluster> element in config.xml.
      <cluster>
          …
          <enabled>true</enabled>
      </cluster>
    • The “enabled” element tells OEP that it will be using Coherence to establish cluster membership; this can also be achieved by setting the value to “coherence”.

 
  • The following shows the <cluster> config for another server in the cluster when using multicast discovery; the differences from the first server are the server-name and server-host-name elements:
    <cluster>
        <server-name>server2</server-name>
        <server-host-name>oep2.oracle.com</server-host-name>
        <!-- For use in Coherence multicast only! -->
        <multicast-address>239.255.200.200</multicast-address>
        <multicast-port>9200</multicast-port>
        <enabled>true</enabled>
    </cluster>

  • The following shows the <cluster> config for another server in the cluster when using unicast discovery; again the differences from the first server are the server-name and server-host-name elements:
    <cluster>
        <server-name>server2</server-name>
        <server-host-name>oep2.oracle.com</server-host-name>
        <enabled>true</enabled>
    </cluster>

 
  • The following shows the “tangosol-coherence-override.xml” file for another server in the cluster when using unicast discovery; the difference from the first server is this server’s own <address> value:
    <?xml version='1.0'?>
    <coherence>
        <cluster-config>
            <unicast-listener>
                <!-- WKA Configuration -->
                <well-known-addresses>
                    <socket-address id="1">
                        <address>192.168.56.91</address>
                        <port>9200</port>
                    </socket-address>
                    <socket-address id="2">
                        <address>192.168.56.92</address>
                        <port>9200</port>
                    </socket-address>
                </well-known-addresses>
                <!-- This server's Coherence address and port number -->
                <address>192.168.56.92</address>
                <port>9200</port>
            </unicast-listener>
        </cluster-config>
    </coherence>

You should now have a working OEP cluster.  Check the cluster by starting all the servers.

Look for a message like the following on the first server to start to indicate that another server has joined the cluster:

<Coherence> <BEA-2049108> <The domain membership has changed to [server2, server1], the new domain primary is "server1">

Log on to the Event Processing Visualizer of one of the servers – http://<hostname>:<port>/wlevs.  Select the cluster name on the left and then select group “AllDomainMembers”.  You should see a list of all the running servers in the “Servers of Group – AllDomainMembers” section.

Sample Application

Now that we have a working OEP cluster let us look at a simple application that can be used as an example of how to cluster enable an application.  This application models service request tracking for hardware products.  The application we will use performs the following checks:

  1. If a new service request (identified by SRID) arrives (indicated by status=RAISE) then we expect some sort of follow up in the next 10 seconds (seconds because I want to test this quickly).  If no follow up is seen then an alert should be raised.
    • For example if I receive an event (SRID=1, status=RAISE) and after 10 seconds I have not received a follow up message (SRID=1, status<>RAISE) then I need to raise an alert.
  2. If a service request (identified by SRID) arrives and there has been another service request (identified by a different SRID) for the same physical hardware (identified by TAG) then an alert should be raised.
    • For example if I receive an event (SRID=2, TAG=M1) and later I receive another event for the same hardware (SRID=3, TAG=M1) then an alert should be raised.

Note use case 1 is nicely time bounded – in this case the time window is 10 seconds.  Hence this is an ideal candidate to be implemented entirely in CQL.

Use case 2 has no time constraints, hence over time there could be a very large number of CQL queries running looking for a matching TAG but a different SRID.  In this case it is better to put the TAGs into a cache and search the cache for duplicate tags.  This reduces the amount of state information held in the OEP engine.
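
To make this concrete, a hedged sketch of the duplicate-tag lookup against the cache is shown below; the cache name, the getTag accessor and the class name are assumptions for illustration, and this is not the DuplicateTagProcessor from the sample application.

import java.util.Iterator;
import java.util.Map;
import java.util.Set;

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.extractor.ReflectionExtractor;
import com.tangosol.util.filter.EqualsFilter;

// Hypothetical sketch of the duplicate-tag check described above.
public class DuplicateTagCheck {
    // Returns true if the hardware tag is already recorded against a different SRID.
    public static boolean isDuplicate(String tag, String srid) {
        NamedCache tags = CacheFactory.getCache("ServiceTagCache"); // assumed cache name
        // Entries are assumed to be keyed by SRID with a getTag() accessor on the value.
        Set matches = tags.entrySet(new EqualsFilter(new ReflectionExtractor("getTag"), tag));
        for (Iterator it = matches.iterator(); it.hasNext();) {
            Map.Entry entry = (Map.Entry) it.next();
            if (!srid.equals(entry.getKey())) {
                return true; // same TAG already held by another service request
            }
        }
        return false;
    }
}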

The sample application to implement this is shown below:

Messages are received from a JMS Topic (InboundTopicAdapter).  Test messages can be injected via a CSV adapter (RequestEventCSVAdapter).  Alerts are sent to a JMS Queue (OutboundQueueAdapter), and also printed to the server standard output (PrintBean).  Use case 1 is implemented by the MissingEventProcessor.  Use case 2 is implemented by inserting the TAG into a cache (InsertServiceTagCacheBean) using a Coherence event processor and then querying the cache for each new service request (DuplicateTagProcessor); if the same tag is already associated with an SR in the cache then an alert is raised.  The RaiseEventFilter is used to filter out existing service requests from the use case 2 stream.

The non-HA version of the application is available to download here.

We will use this application to demonstrate how to HA enable an application for deployment on our cluster.

A CSV file (TestData.csv) and a load generator properties file (HADemoTest.prop) are provided to test the application by injecting events using the CSV Adapter.

Note that the application reads a configuration file (Server.properties) which should be placed in the domain directory of each event server.

Deploying an Application

Before deploying an application to a cluster it is a good idea to create a group in the cluster.  Multiple servers can be members of this group.  To add a group to an event server just add an entry to the <cluster> element in config.xml as shown below:

<cluster>
    …
    <groups>HAGroup</groups>
</cluster>

Multiple servers can be members of a group and a server can be a member of multiple groups.  This allows you to have different levels of high availability in the same event processing cluster.

Deploy the application using the Visualizer.  Target the application at the group you created, or the AllDomainMembers group.

Test the application, typically using a CSV Adapter.  Note that using a CSV adapter sends all the events to a single event server.  To fix this we need to add a JMS output adapter (OutboundTopicAdapter) to our application and then send events from the CSV adapter to the outbound JMS adapter as shown below:

So now we are able to send events via CSV to an event processor that in turn sends the events to a JMS topic.  But we still have a few challenges.

Managing Input

The first challenge is managing input.  Because OEP relies on the same event stream being processed by multiple servers, we need to make sure that all our servers get the same messages from the JMS Topic.  To do this we configure the JMS connection factory to have an Unrestricted Client ID.  This allows multiple clients (OEP servers in our case) to use the same connection factory.  Client IDs are mandatory when using durable topic subscriptions.  We also need each event server to have its own subscriber ID for the JMS Topic; this ensures that each server will get a copy of all the messages posted to the topic.  If we used the same subscriber ID for all the servers then the messages would be distributed across the servers, with each server seeing a completely disjoint set of messages from the other servers in the cluster.  This is not what we want, because each server should see the same event stream.  We can use the server name as the subscriber ID, as shown in the excerpt from our application below:

<wlevs:adapter id="InboundTopicAdapter" provider="jms-inbound">
    …
    <wlevs:instance-property name="durableSubscriptionName"
            value="${com_bea_wlevs_configuration_server_ClusterType.serverName}" />
</wlevs:adapter>

This works because I have placed a ConfigurationPropertyPlaceholderConfigurer bean in my application as shown below; this same bean is also used to access properties from a configuration file:

<bean id="ConfigBean"
        class="com.bea.wlevs.spring.support.ConfigurationPropertyPlaceholderConfigurer">
        <property name="location" value="file:../Server.properties"/>
    </bean>

With this configuration each server will now get a copy of all the events.

As our application relies on elapsed time we should make sure that the timestamps of the received messages are the same on all servers.  We do this by adding an HA Input adapter to our application.

<wlevs:adapter id="HAInputAdapter" provider="ha-inbound">
    <wlevs:listener ref="RequestChannel" />
    <wlevs:instance-property name="keyProperties"
            value="EVID" />
    <wlevs:instance-property name="timeProperty" value="arrivalTime"/>
</wlevs:adapter>

The HA Adapter sets the given “timeProperty” in the input message to be the current system time.  This time is then communicated to other HAInputAdapters deployed to the same group.  This allows all servers in the group to have the same timestamp in their event.  The event is identified by the “keyProperties” key field.

To allow the downstream processing to treat the timestamp as an arrival time, the downstream channel is configured with an “application-timestamped” element to set the arrival time of the event.  This is shown below:

<wlevs:channel id="RequestChannel" event-type="ServiceRequestEvent">
    <wlevs:listener ref="MissingEventProcessor" />
    <wlevs:listener ref="RaiseEventFilterProcessor" />
    <wlevs:application-timestamped>
        <wlevs:expression>arrivalTime</wlevs:expression>
    </wlevs:application-timestamped>
</wlevs:channel>

Note the property set in the HAInputAdapter is used to set the arrival time of the event.

So now all servers in our cluster have the same events arriving from a topic, and each event arrival time is synchronized across the servers in the cluster.

Managing Output

Note that an OEP cluster has multiple servers processing the same input stream.  Obviously if we have the same inputs, synchronized to appear to arrive at the same time, then we will get the same outputs, which is central to OEP's promise of high availability.  So when an alert is raised by our application it will be raised by every server in the cluster.  If we have 3 servers in the cluster then we will get 3 copies of the same alert appearing on our alert queue.  This is probably not what we want.  To fix this we take advantage of an HA Output Adapter.  Unlike input, where there is a single HA Input Adapter, there are multiple HA Output Adapters, each with distinct performance and behavioral characteristics.  The table below is taken from the Oracle® Fusion Middleware Developer's Guide for Oracle Event Processing and shows the different levels of service and performance impact:

Table 24-1 Oracle Event Processing High Availability Quality of Service

High Availability Option Missed Events? Duplicate Events? Performance Overhead
Section 24.1.2.1, "Simple Failover" Yes (many) Yes (few) Negligible
Section 24.1.2.2, "Simple Failover with Buffering" Yes (few) Yes (many) Low
Section 24.1.2.3, "Light-Weight Queue Trimming" No Yes (few) Low-Medium
Section 24.1.2.4, "Precise Recovery with JMS" No No High

I decided to go for the lightweight queue trimming option.  This means I won’t lose any events, but I may emit a few duplicate events in the event of primary failure.  This setting causes all output events to be buffered by the secondaries until they are told by the primary that a particular event has been emitted.  To configure this option I add the following adapter to my EPN:

    <wlevs:adapter id="HAOutputAdapter" provider="ha-broadcast">
        <wlevs:listener ref="OutboundQueueAdapter" />
        <wlevs:listener ref="PrintBean" />
        <wlevs:instance-property name="keyProperties" value="timestamp"/>
        <wlevs:instance-property name="monotonic" value="true"/>
        <wlevs:instance-property name="totalOrder" value="false"/>
    </wlevs:adapter>

This uses the time of the alert (timestamp property) as the key used to identify events which have been trimmed.  This works in this application because the alert time is the time of the source event, and the times of the source events are synchronized using the HA Input Adapter.  Because this is a time value it will increase, and so I set monotonic=”true”.  However I may get two alerts raised with the same timestamp, and in that case I set totalOrder=”false”.

I also added the additional configuration to config.xml for the application:

<ha:ha-broadcast-adapter>
    <name>HAOutputAdapter</name>
    <warm-up-window-length units="seconds">15</warm-up-window-length>
    <trimming-interval units="millis">1000</trimming-interval>
</ha:ha-broadcast-adapter>

This causes the primary to tell the secondaries which is its latest emitted alert every second.  This will cause the secondaries to trim from their buffers all alerts prior to and including the latest emitted alerts.  So in the worst case I will get one second of duplicated alerts.  It is also possible to set a number of events rather than a time period.  The trade-off here is that I can reduce synchronization overhead by having longer time intervals or more events, causing more memory to be used by the secondaries, or I can synchronize more frequently, using less memory in the secondaries and generating fewer duplicate alerts, but with more communication between the primary and the secondaries to trim the buffers.

The warm-up window is used to stop a secondary joining the cluster before it has been running for that time period.  The window is based on the time that the EPN needs to be running to have the same state as the other servers.  In our example application we have a CQL query that runs over a period of 10 seconds, so I set the warm-up window to 15 seconds to ensure that a newly started server has the same state as all the other servers in the cluster.  The warm-up window should be greater than the longest query window.

Adding an External Coherence Cluster

When we are running OEP as a cluster we have additional overhead in the servers.  The HA Input Adapter is synchronizing event time across the servers, the HA Output Adapter is synchronizing output events across the servers, and the HA Output Adapter is also buffering output events in the secondaries.  We can’t do anything about this, but we can move the Coherence cache we are using outside of the OEP servers, reducing the memory pressure on those servers and also moving some of the processing outside of the servers.  Making our Coherence caches external to our OEP cluster is a good idea for the following reasons:

  • Allows moving storage of cache entries outside of the OEP server JVMs hence freeing more memory for storing CQL state.
  • Allows storage of more entries in the cache by scaling cache independently of the OEP cluster.
  • Moves cache processing outside OEP servers.

To create the external Coherence cache do the following:

  • Create a new directory for our standalone Coherence servers, perhaps at the same level as the OEP domain directory.
  • Copy the tangosol-coherence-override.xml file previously created for the OEP cluster into a config directory under the Coherence directory created in the previous step.
  • Copy the coherence-cache-config.xml file from the application into a config directory under the Coherence directory created in the previous step.
  • Add the following to the tangosol-coherence-override.xml file in the Coherence config directory:
    • <coherence>
          <cluster-config>
              <member-identity>
                  <cluster-name>oep_cluster</cluster-name>
                  <member-name>Grid1</member-name>
              </member-identity>
              …
          </cluster-config>
      </coherence>
    • Important Note: The <cluster-name> must match the name of the OEP cluster as defined in the <domain><name> element in the event server's config.xml.
    • The member name is used to help identify the server.
  • Disable storage for our caches in the event servers by editing the coherence-cache-config.xml file in the application and adding the following element to the caches:
    • <distributed-scheme>
          <scheme-name>DistributedCacheType</scheme-name>
          <service-name>DistributedCache</service-name>
          <backing-map-scheme>
              <local-scheme/>
          </backing-map-scheme>
          <local-storage>false</local-storage>
      </distributed-scheme>
    • The local-storage flag stops the OEP server from storing entries for caches using this cache schema.
    • Do not disable storage at the global level (-Dtangosol.coherence.distributed.localstorage=false) because this will disable storage on some OEP specific cache schemes as well as our application cache.  We don’t want to put those schemes into our cache servers because they are used by OEP to maintain cluster integrity and have only one entry per application per server, so are very small.  If we put those into our Coherence Cache servers we would have to add OEP specific libraries to our cache servers and enable them in our coherence-cache-config.xml, all of which is too much trouble for little or no benefit.
  • If using Unicast Discovery (this section is not required if using Multicast) then we want the Coherence grid servers to be the Well Known Address servers, because we want to disable storage of entries on our OEP servers and Coherence nodes with storage disabled cannot initialize a cluster.  To enable the Coherence servers to be primaries in the Coherence grid do the following:
    • Change the unicast-listener addresses in the Coherence servers tangosol-coherence-override.xml file to be suitable values for the machine they are running on – typically change the listen address.
    • Modify the WKA addresses in the OEP servers and the Coherence servers tangosol-coherence-override.xml file to match at least two of the Coherence servers listen addresses.
    • The following shows how this might be configured for 2 OEP servers and 2 Cache servers:

      OEP Server 1:

      <?xml version='1.0'?>
      <coherence>
        <cluster-config>
          <unicast-listener>
            <well-known-addresses>
              <socket-address id="1">
                <address>192.168.56.91</address>
                <port>9300</port>
              </socket-address>
              <socket-address id="2">
                <address>192.168.56.92</address>
                <port>9300</port>
              </socket-address>
            </well-known-addresses>
            <address>192.168.56.91</address>
            <port>9200</port>
          </unicast-listener>
        </cluster-config>
      </coherence>

      OEP Server 2:

      <?xml version='1.0'?>
      <coherence>
        <cluster-config>
          <unicast-listener>
            <well-known-addresses>
              <socket-address id="1">
                <address>192.168.56.91</address>
                <port>9300</port>
              </socket-address>
              <socket-address id="2">
                <address>192.168.56.92</address>
                <port>9300</port>
              </socket-address>
            </well-known-addresses>
            <address>192.168.56.92</address>
            <port>9200</port>
          </unicast-listener>
        </cluster-config>
      </coherence>

      Cache Server 1:

      <?xml version='1.0'?>
      <coherence>
        <cluster-config>
          <member-identity>
            <cluster-name>oep_cluster</cluster-name>
            <member-name>Grid1</member-name>
          </member-identity>
          <unicast-listener>
            <well-known-addresses>
              <socket-address id="1">
                <address>192.168.56.91</address>
                <port>9300</port>
              </socket-address>
              <socket-address id="2">
                <address>192.168.56.92</address>
                <port>9300</port>
              </socket-address>
            </well-known-addresses>
            <address>192.168.56.91</address>
            <port>9300</port>
          </unicast-listener>
        </cluster-config>
      </coherence>

      Cache Server 2:

      <?xml version='1.0'?>
      <coherence>
        <cluster-config>
          <member-identity>
            <cluster-name>oep_cluster</cluster-name>
            <member-name>Grid2</member-name>
          </member-identity>
          <unicast-listener>
            <well-known-addresses>
              <socket-address id="1">
                <address>192.168.56.91</address>
                <port>9300</port>
              </socket-address>
              <socket-address id="2">
                <address>192.168.56.92</address>
                <port>9300</port>
              </socket-address>
            </well-known-addresses>
            <address>192.168.56.92</address>
            <port>9300</port>
          </unicast-listener>
        </cluster-config>
      </coherence>

    • Note that the OEP servers do not listen on the WKA address/port combinations; they use different port numbers even though they run on the same machines as the cache servers.
    • Also note that it is the Coherence servers that listen on the WKA addresses.
  • Now that the configuration is complete we can create a start script for the Coherence grid servers as follows:
    • #!/bin/sh
      MW_HOME=/home/oracle/fmw
      OEP_HOME=${MW_HOME}/ocep_11.1
      JAVA_HOME=${MW_HOME}/jrockit_160_33
      CACHE_SERVER_HOME=${MW_HOME}/user_projects/domains/oep_coherence
      CACHE_SERVER_CLASSPATH=${CACHE_SERVER_HOME}/HADemoCoherence.jar:${CACHE_SERVER_HOME}/config
      COHERENCE_JAR=${OEP_HOME}/modules/com.tangosol.coherence_3.7.1.6.jar
      JAVAEXEC=$JAVA_HOME/bin/java
      # specify the JVM heap size
      MEMORY=512m
      if [[ $1 == '-jmx' ]]; then
          JMXPROPERTIES="-Dcom.sun.management.jmxremote -Dtangosol.coherence.management=all -Dtangosol.coherence.management.remote=true"
          shift
      fi
      JAVA_OPTS="-Xms$MEMORY -Xmx$MEMORY $JMXPROPERTIES"
      $JAVAEXEC -server -showversion $JAVA_OPTS -cp "${CACHE_SERVER_CLASSPATH}:${COHERENCE_JAR}" com.tangosol.net.DefaultCacheServer $1
    • Note that I put the tangosol-coherence-override and the coherence-cache-config.xml files in a config directory and added that directory to the classpath (CACHE_SERVER_CLASSPATH=${CACHE_SERVER_HOME}/HADemoCoherence.jar:${CACHE_SERVER_HOME}/config) so that Coherence would find the override file.
    • Because my application uses in-cache processing (entry processors) I had to add a jar file containing the required classes for the entry processor to the classpath (CACHE_SERVER_CLASSPATH=${CACHE_SERVER_HOME}/HADemoCoherence.jar:${CACHE_SERVER_HOME}/config).
    • The classpath references the Coherence jar shipped with OEP to avoid version mismatches (COHERENCE_JAR=${OEP_HOME}/modules/com.tangosol.coherence_3.7.1.6.jar).
    • This script is based on the standard cache-server.sh script that ships with standalone Coherence.
    • The –jmx flag can be passed to the script to enable Coherence JMX management beans.

We have now configured OEP to use an external Coherence data grid for its application caches.  When starting, we should always start at least one of the grid servers before starting the OEP servers.  This allows the OEP servers to find the grid.  If we do start things in the wrong order then the OEP servers will block waiting for a storage enabled node to start (one of the WKA servers if using Unicast).

Summary

We have now created an OEP cluster that makes use of an external Coherence grid for application caches.  The application has been modified to ensure that the timestamps of arriving events are synchronized and that the output events are only emitted by one of the servers in the cluster.  In the event of failure we may get some duplicate events with our configuration (there are configurations that avoid duplicate events) but we will not lose any events.  The final version of the application with full HA capability is shown below:

Files

The following files are available for download:

  • Oracle Event Processing
    • Includes Coherence
  • Non-HA version of application
    • Includes test file TestData.csv and Load Test property file HADemoTest.prop
    • Includes Server.properties.Antony file to customize to point to your WLS installation
  • HA version of application
    • Includes test file TestData.csv and Load Test property file HADemoTest.prop
    • Includes Server.properties.Antony file to customize to point to your WLS installation
  • OEP Cluster Files
    • Includes config.xml
    • Includes tangosol-coherence-override.xml
    • Includes Server.properties that will need customizing for your WLS environment
  • Coherence Cluster Files
    • Includes tangosol-coherence-override.xml and coherence-cache-config.xml
    • Includes cache-server.sh start script
    • Includes HADemoCoherence.jar with required classes for entry processor

References

The following references may be helpful:

Friday Nov 15, 2013

Postscript on Scripts

More Scripts for SOA Suite

Over time I have evolved my startup scripts and thought it would be a good time to share them.  They are available for download here.  I have finally converted to using WLST, which has a number of advantages.  To me the biggest advantage is that the output and log files are automatically written to a consistent location in the domain directory or node manager directory.  In addition the WLST scripts wait for the component to start and then return; this lets us string commands together without worrying about the dependencies.

The following are the key scripts (available for download here):

Script Description Pre-Reqs Stops when Task Complete
startWlstNodeManager.sh Starts Node Manager using WLST None Yes
startNodeManager.sh Starts Node Manager None Yes
stopNodeManager.sh Stops Node Manager using WLST Node Manager running Yes
startWlstAdminServer.sh Starts Admin Server using WLST Node Manager running Yes
startAdminServer.sh Starts Admin Server None No
stopAdminServer.sh Stops Admin Server Admin Server running Yes
startWlstManagedServer.sh Starts Managed Server using WLST Node Manager running Yes
startManagedServer.sh Starts Managed Server None No
stopManagedServer.sh Stops Managed Server Admin Server running Yes

Samples

To start Node Manager and Admin Server

startWlstNodeManager.sh ; startWlstAdminServer.sh

To start Node Manager, Admin Server and SOA Server

startWlstNodeManager.sh ; startWlstAdminServer.sh ; startWlstManagedServer.sh soa_server1

Note that the Admin server is not started until the Node Manager is running, similarly the SOA server is not started until the Admin server is running.

Node Manager Scripts

startWlstNodeManager.sh

Uses WLST to start the Node Manager.  When the script completes the Node manager will be running.

startNodeManager.sh

The Node Manager is started in the background and the output is piped to the screen. This causes the Node Manager to continue running in the background if the terminal is closed. Log files, including a .out capturing standard output and standard error, are placed in the <WL_HOME>/common/nodemanager directory, making them easy to find. This script pipes the output of the log file to the screen and keeps doing this until terminated. Terminating the script does not terminate the Node Manager.

stopNodeManager.sh

Uses WLST to stop the Node Manager.  When the script completes the Node manager will be stopped.

Admin Server Scripts

startWlstAdminServer.sh

Uses WLST to start the Admin Server.  The Node Manager must be running before executing this command.  When the script completes the Admin Server will be running.

startAdminServer.sh

The Admin Server is started in the background and the output is piped to the screen. This causes the Admin Server to continue running in the background if the terminal is closed.  Log files, including the .out capturing standard output and standard error, are placed in the same location as if the server had been started by Node Manager, making them easy to find.  This script pipes the output of the log file to the screen and keeps doing this until terminated.  Terminating the script does not terminate the server.

stopAdminServer.sh

Stops the Admin Server.  When the script completes the Admin Server will no longer be running.

Managed Server Scripts

startWlstManagedServer.sh <MANAGED_SERVER_NAME>

Uses WLST to start the given Managed Server. The Node Manager must be running before executing this command. When the script completes the given Managed Server will be running.

startManagedServer.sh <MANAGED_SERVER_NAME>

The given Managed Server is started in the background and the output is piped to the screen. This causes the given Managed Server to continue running in the background if the terminal is closed. Log files, including the .out capturing standard output and standard error, are placed in the same location as if the server had been started by Node Manager, making them easy to find. This script pipes the output of the log file to the screen and keeps doing this until terminated. Terminating the script does not terminate the server.

stopManagedServer.sh <MANAGED_SERVER_NAME>

Stops the given Managed Server. When the script completes the given Managed Server will no longer be running.

Utility Scripts

The following scripts are not called directly but are used by the previous scripts.

_fmwenv.sh

This script is used to provide information about the Node Manager and WebLogic Domain and must be edited to reflect the installed FMW environment; in particular the following values must be set:

  • DOMAIN_NAME – the WebLogic domain name.
  • NM_USERNAME – the Node Manager username.
  • NM_PASSWORD – the Node Manager password.
  • MW_HOME – the location where WebLogic and other FMW components are installed.
  • WEBLOGIC_USERNAME – the WebLogic Administrator username.
  • WEBLOGIC_PASSWORD - the WebLogic Administrator password.

The following values may also need changing:

  • ADMIN_HOSTNAME – the server where AdminServer is running.
  • ADMIN_PORT – the port number of the AdminServer.
  • DOMAIN_HOME – the location of the WebLogic domain directory, defaults to ${MW_HOME}/user_projects/domains/${DOMAIN_NAME}
  • NM_LISTEN_HOST – the Node Manager listening hostname, defaults to the hostname of the machine it is running on.
  • NM_LISTEN_PORT – the Node Manager listening port.

_runWLst.sh

This script runs the WLST script passed in environment variable ${SCRIPT} and takes its configuration from _fmwenv.sh.  It dynamically builds a WLST properties file in the /tmp directory to pass parameters into the scripts.  The properties filename is of the form <DOMAIN_NAME>.<PID>.properties.

_runAndLog.sh

This script runs the command passed in as an argument, writing standard out and standard error to a log file.  The log file is rotated between invocations to avoid losing the previous log files.  The log file is then tailed and output to the screen.  This means that this script will never finish by itself.

WLST Scripts

The following WLST scripts are used by the scripts above, taking their properties from /tmp/<DOMAIN_NAME>.<PID>.properties:

  • startNodeManager.py
  • stopNodeManager.py
  • startServer.py
  • startServerNM.py

Relationships

The dependencies and relationships between my scripts and the built in scripts are shown in the diagram below.

Thanks for the Memory

Controlling Memory in Oracle SOA Suite

Within WebLogic you can specify the memory to be used by each managed server in the WebLogic console.  Unfortunately if you create a domain with Oracle SOA Suite it adds a new config script, setSOADomainEnv.sh, that overwrites any USER_MEM_ARGS passed in to the start scripts.  setDomainEnv.sh only sets a single set of memory arguments that are used by all servers, so an admin server gets the same memory parameters as a BAM server.  This means some servers will have more memory than they need and others will be too tightly constrained.  This is a bad thing.

A Solution

To overcome this I wrote a small script, setMem.sh, that checks to see which server is being started and then sets the memory accordingly.  It supports up to 9 different server types, each identified by a prefix.  If the prefix matches then the max and min heap and max and min perm gen are set.  Settings are controlled through variables set in the script.  If using JRockit then leave the PERM settings empty and they will not be set.

SRV1_PREFIX=Admin
SRV1_MINHEAP=768m
SRV1_MAXHEAP=1280m
SRV1_MINPERM=256m
SRV1_MAXPERM=512m

The above settings match any server whose name starts with Admin.  They will result in the following USER_MEM_ARGS value:

USER_MEM_ARGS=-Xms768m -Xmx1280m -XX:PermSize=256m -XX:MaxPermSize=512m

If the prefix were soa then it would match soa_server1, soa_server2 etc.

There is a set of DEFAULT_ values that allow you to default the memory settings and only set them explicitly for servers that have settings different from your default.  Note that there must still be a matching prefix; otherwise the USER_MEM_ARGS will be the ones from setSOADomainEnv.sh.

This script needs to be called from the setDomainEnv.sh.  Add the call immediately before the following lines:

if [ "${USER_MEM_ARGS}" != "" ] ; then
        MEM_ARGS="${USER_MEM_ARGS}"
        export MEM_ARGS
fi

The script can be downloaded here.

Thoughts on Memory Setting

I know conventional wisdom is that Xms (initial heap size) and Xmx (max heap size) should be set the same to avoid needing to allocate memory after startup.  However I don’t agree with this.  Setting Xms==Xmx works great if you are omniscient; I don’t claim to be omniscient, my omni is a little limited, so I prefer to set Xms to what I think the server needs, which is what I would use if I were setting Xms=Xmx.  I then set Xmx to be a little higher than I think the server needs.  This allows me to get the memory requirement wrong without having my server fail due to running out of memory.  In addition I can monitor the server, and if the heap memory usage goes above my Xms setting I know that I calculated wrong and can change the Xms setting and restart the servers at a suitable time.  Setting them unequal buys me time to deal with increased memory needs, for example due to a change in usage patterns or new functionality; it also allows me to be less than omniscient.

ps Please don’t tell my children I am not omniscient!

Thursday Sep 26, 2013

Enterprise Deployment Presentation at OpenWorld

Presentation Today Thursday 26 September 2013

Today Matt & I, together with Ram from Oracle Product Management and Craig from Rubicon Red, will be talking about building a highly available, highly scalable enterprise deployment. We will go through Oracle's Enterprise Deployment Guide, explain why it recommends what it does, and also identify alternatives to its recommendations. Come along to Moscone West 2020 at 2pm today. It would be great to see you there.

Update

Thanks to all who attended; we were gratified to see around 100 people turn out on Thursday afternoon. I know I am usually presentationed out by then! I have uploaded the presentation here.

Tuesday Aug 13, 2013

Oracle SOA Suite 11g Performance Tuning Cookbook

Just received this to review.

It’s a Java World

The first chapter identifies tools and methods to identify performance bottlenecks, generally covering low-level JVM and database issues.  Useful material, but not really SOA specific, and I think the authors missed the opportunity to share the knowledge they obviously have of how to relate these low-level JVM measurements to SOA causes.

Chapter 2 uses the EMC Hyperic tool to monitor SOA Suite, so this chapter may be of limited use to many readers.  Many, but not all, of the recipes could have been accomplished using the FMW Control that ships with, and is included in the license of, SOA Suite.  One of the recipes uses DMS, which is the built-in FMW monitoring system built by Oracle before the acquisition of BEA.  Again this seems to be more about Hyperic than SOA Suite.

Chapter 3 covers performance testing using Apache JMeter.  Like the previous chapters there is very little specific to SOA Suite; indeed in my experience many SOA Suite implementations do not have a Web Service to initiate composites, instead relying on adapters.

Chapter 4 covers JVM memory management; this is another good general Java section but it has few SOA specifics in it.

Chapter 5 is yet more Java tuning, in this case generic garbage collection tuning.  Like the earlier chapters, good material but not very SOA specific.  I can’t help feeling that the authors could have made more connections with SOA Suite specifics in their recipes.

Chapter 6 is called platform tuning, but it could have been titled miscellaneous tuning.  This includes a number of Linux optimizations, WebLogic optimizations and JVM optimizations.  I am not sure that yet another explanation of how to create a boot.properties file was needed.

Chapter 7 homes in on JMS & JDBC tuning in WebLogic.

SOA at Last

Chapter 8 finally turns to SOA specifics.  Unfortunately the description of what dispatcher invoke threads do is misleading: they only control the number of threads retrieving messages from the request queue, and synchronous web service calls do not use the request queue and hence do not use these threads.  Several of the recipes in this chapter do more than alter the performance characteristics, they also alter the semantics of the BPEL engine (such as “Changing a BPEL process to be transient”), and I wish there was more discussion of the impacts of these in the chapter.  I didn’t see any reference to the impact on recoverability of processes when turning on in-memory message delivery.  That said, the recipes do cover a lot of useful optimizations, and if used judiciously they will boost performance.

Chapter 9 covers optimizing the Mediator, primarily tweaking Mediator threading.  The descriptions of the impacts of changes in this chapter are very good, and give some helpful indications of whether they will apply to your environment.

Chapter 10 touches very lightly on Rules and Human Workflow; this chapter would have benefited from more recipes.  The two recipes for Rules do offer very valuable advice.  The two workflow recipes seem less valuable.

Chapter 11 takes us into the area where the greatest performance optimizations are to be found: the SOA composite itself.  Seven generally useful recipes are provided, and I would have liked to see more in this chapter, perhaps at the expense of some of the Java tuning in the first half of the book.  I have to say that I do not agree with the “Designing BPEL processes to reduce persistence” recipe; there are better, more maintainable and faster ways to deal with this.  The other recipes provide valuable ideas that may help the performance of your composites.

Chapter 12 promises “High Performance Configuration”.  Three of the recipes, on creating a cluster, configuring an HTTP plug-in and setting up distributed queues, are covered better in the Oracle documentation, particularly the Enterprise Deployment Guide.  There are however some good suggestions in the recipe about deploying on virtualized environments; I wish they had spoken more about this.  The JMS bridges recipe is also a very valuable one that people should be aware of.

The Good, the Bad, and the Ugly

A lot of the recipes are really just trivial variations on other recipes; for example they have one recipe on “Increasing the JVM heap size” and another on “Setting Xmx and Xms to the same value”.

Although the book spends a lot of time on Java tuning, that is in itself reasonable, as a lot of SOA performance tuning is tweaking JVM and WLS parameters.  I would have found it more useful if the dots were connected to relate the Java/WLS tuning sections to specific SOA use cases.

As the authors say when talking about adapter tuning “The preceding sets of recipes are the basics … available in Oracle SOA Suite. There are many other properties that can be tuned, but their effectiveness is situational, so we have chosen to focus on the ones that we feel give improvement for the most projects.”.  They have made a good start, and maybe in a 12c version of the book they can provide more SOA specific information in their Java tuning sections.

Add the book to your library; you are almost certain to find useful ideas in it.  But make sure you understand the implications of the changes you are making, as the authors do not always spell out the impact on the semantics of your composites.

A sample chapter is available on the Packt Web Site.

Friday Jun 28, 2013

Free WebLogic Administration Cookbook

Free WebLogic Admin Cookbook

Packt Publishing are offering free copies of Oracle WebLogic Server 12c Advanced Administration Cookbook : http://www.packtpub.com/oracle-weblogic-server-12c-advanced-administration-cookbook/book  in exchange for a review either on your blog or on the title’s Amazon page.

Here’s the blurb:

  • Install, create and configure WebLogic Server
  • Configure an Administration Server with high availability
  • Create and configure JDBC data sources, multi data sources and gridlink data sources
  • Tune the multi data source to survive database failures
  • Setup JMS distributed queues
  • Use WLDF to send threshold notifications
  • Configure WebLogic Server for stability and resilience

If you’re a datacenter operator, system administrator or even a Java developer, this book could be exactly what you are looking for to take you one step further with Oracle WebLogic Server, and this is a good way to bag yourself a free cookbook (current retail price $25.49).

Free review copies are available until Tuesday 2nd July 2013, so if you are interested, email Harleen Kaur Bagga at: harleenb@packtpub.com.

I will be posting my own review shortly!

Tuesday May 21, 2013

Target Verification

Verifying the Target

I just built a combined OSB, SOA/BPM, BAM clustered domain.  The biggest hassle is validating that the resource targeting is correct.  There is a great appendix in the documentation that lists all the modules and resources with their associated targets.  The only problem is that the appendix is six pages of small print.  I manually went through the first page, verifying my targeting, until I thought ‘there must be a better way of doing this’.  So this blog post is the better way.

WLST to the Rescue

WebLogic Scripting Tool allows us to query the MBeans and discover what resources are deployed and where they are targeted.  So I built a script that iterates over each of the following resource types and verifies that they are correctly targeted:

  • Applications
  • Libraries
  • Startup Classes
  • Shutdown Classes
  • JMS System Resources
  • WLDF System Resources

Source Data

To get the data to verify my domain against, I copied the tables from the documentation into a text file.  The copy ended up putting the resource on the first line and the targets on the second line.  Rather than reformat the data I just read the lines in pairs, storing the resource as a string and splitting apart the targets into a list of strings.  I then stored the data in a dictionary with the resource string as the key and the target list as the value.  The code to do this is shown below:

# Load resource and target data from file created from documentation
# File format is a one line with resource name followed by
# one line with comma separated list of targets
# fileIn - Resource & Target File
# accum - Dictionary containing mappings of expected Resource to Target
# returns - Dictionary mapping expected Resource to expected Target
def parseFile(fileIn, accum) :
  # Load resource name
  line1 = fileIn.readline().strip('\n')
  if line1 == '':
    # Done if no more resources
    return accum
  else:
    # Load list of targets
    line2 = fileIn.readline().strip('\n')
    # Convert string to list of targets
    targetList = map(fixTargetName, line2.split(','))
    # Associate resource with list of targets in dictionary
    accum[line1] = targetList
    # Parse remainder of file
    return parseFile(fileIn, accum)

This makes it very easy to update the lists by just copying and pasting from the documentation.

Each table in the documentation has a corresponding file that is used by the script.

The data read from the file has the target names mapped to the actual domain target names, which are provided in a properties file.
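Purely for illustration, a fragment of one of these resource files looks something like the following, with a resource name on one line and a comma separated list of its expected targets on the next (the names here are examples borrowed from elsewhere in this post; the real files are simply copied from the documentation tables):

Healthcare UI
SOA_Cluster
ALSB Cluster Singleton Marker Application
WLS_OSB1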

Listing & Verifying the Resources & Targets

Within the script I move to the domain configuration MBean and then iterate over the deployed resources; for each resource I iterate over its targets, validating them against the corresponding targets read from the file, as shown below:

# Validate that resources are correctly targeted
# name - Name of Resource Type
# filename - Filename to validate against
# items - List of Resources to be validated
def validateDeployments(name, filename, items) :
  print name+' Check'
  print "====================================================="
  fList = loadFile(filename)
  # Iterate over resources
  for item in items:
    try:
      # Get expected targets for resource
      itemCheckList = fList[item.getName()]
      # Iterate over actual targets
      for target in item.getTargets() :
        try:
          # Remove actual target from expected targets
          itemCheckList.remove(target.getName())
        except ValueError:
          # Target not found in expected targets
          print 'Extra target: '+item.getName()+': '+target.getName()
      # Iterate over remaining expected targets, if any
      for refTarget in itemCheckList:
        print 'Missing target: '+item.getName()+': '+refTarget
    except KeyError:
      # Resource not found in expected resource dictionary
      print 'Extra '+name+' Deployed: '+item.getName()
  print

Obtaining the Script

I have uploaded the script here.  It is a zip file containing all the required files together with a PDF explaining how to use the script.

To install just unzip VerifyTargets.zip. It will create the following files:

  • verifyTargets.sh
  • verifyTargets.properties
  • VerifyTargetsScriptInstructions.pdf
  • scripts/verifyTargets.py
  • scripts/verifyApps.txt
  • scripts/verifyLibs.txt
  • scripts/verifyStartup.txt
  • scripts/verifyShutdown.txt
  • scripts/verifyJMS.txt
  • scripts/verifyWLDF.txt

Sample Output

The following is sample output from running the script:

Application Check
=====================================================
Extra Application Deployed: frevvo
Missing target: usermessagingdriver-xmpp: optional
Missing target: usermessagingdriver-smpp: optional
Missing target: usermessagingdriver-voicexml: optional
Missing target: usermessagingdriver-extension: optional
Extra target: Healthcare UI: soa_cluster
Missing target: Healthcare UI: SOA_Cluster ??
Extra Application Deployed: OWSM Policy Support in OSB Initializer Aplication

Library Check
=====================================================
Extra Library Deployed: oracle.bi.adf.model.slib#1.0-AT-11.1-DOT-1.2-DOT-0
Extra target: oracle.bpm.mgmt#11.1-DOT-1-AT-11.1-DOT-1: AdminServer
Missing target: oracle.bpm.mgmt#11.1.1-AT-11.1.1: soa_cluster
Extra target: oracle.sdp.messaging#11.1.1-AT-11.1.1: bam_cluster

StartupClass Check
=====================================================

ShutdownClass Check
=====================================================

JMS Resource Check
=====================================================
Missing target: configwiz-jms: bam_cluster

WLDF Resource Check
=====================================================

IMPORTANT UPDATE

Since posting this I have discovered a number of issues.  I have updated the configuration files to correct these problems.  The changes made are as follows:

  • Added WLS_OSB1 server mapping to the script properties file (verifyTargets.properties) to accommodate OSB singletons and modified script (verifyTargets.py) to use the new property.
  • Changes to verifyApplications.txt
    • Changed target from OSB_Cluster to WLS_OSB1 for the following applications:
      • ALSB Cluster Singleton Marker Application
      • ALSB Domain Singleton Marker Application
      • Message Reporting Purger
    • Added following application and targeted at SOA_Cluster
      • frevvo
    • Added following application and targeted at OSB_Cluster & Admin Server
      • OWSM Policy Support in OSB Initializer Aplication
  • Changes to verifyLibraries.txt
    • Added following library and targeted at OSB_Cluster, SOA_Cluster, BAM_Cluster & Admin Server
      • oracle.bi.adf.model.slib#1.0-AT-11.1.1.2-DOT-0
    • Modified targeting of following library to include BAM_Cluster
      • oracle.sdp.messaging#11.1.1-AT-11.1.1

Make sure that you download the latest version.  It is at the same location but now includes a version file (version.txt).  The contents of the version file should be:

FMW_VERSION=11.1.1.7

SCRIPT_VERSION=1.1

Wednesday Oct 31, 2012

Event Processed

Installing Oracle Event Processing 11g

Earlier this month I was involved in organizing the Monument Family History Day.  It was certainly a complex event, with dozens of presenters, guides and 100s of visitors.  So with that experience of a complex event under my belt I decided to refresh my acquaintance with Oracle Event Processing (CEP).

CEP has a developer side based on Eclipse and a runtime environment.

Server install

The server install is very straightforward (documentation).  It is recommended to use the JRockit JDK with CEP so the steps to set up a working CEP server environment are:

  1. Download required software
    • JRockit – I used Oracle “JRockit 6 - R28.2.5” which includes “JRockit Mission Control 4.1” and “JRockit Real Time 4.1”.
    • Oracle Event Processor – I used “Complex Event Processing Release 11gR1 (11.1.1.6.0)”
  2. Install JRockit
    • Run the JRockit installer; the download is an executable binary that just needs to be marked as executable.
  3. Install CEP
    • Unzip the downloaded file
    • Run the CEP installer; the unzipped file is an executable binary that may need to be marked as executable.
    • Choose a custom install and add the examples if needed.
      • It is not recommended to add the examples to a production environment but they can be helpful in development.

Developer Install

The developer install requires several steps (documentation).  A developer install needs access to the software for the server install, although JRockit isn’t necessary for development use.

  1. Download required software
    • Eclipse (Linux) – It is recommended to use version 3.6.2 (Helios)
  2. Install Eclipse
    • Unzip the download into the desired directory
  3. Start Eclipse
  4. Add Oracle CEP Repository in Eclipse
    • http://download.oracle.com/technology/software/cep-ide/11/
  5. Install Oracle CEP Tools for Eclipse 3.6
    • You may need to set the proxy if behind a firewall.
  6. Modify eclipse.ini (the combined edits are shown in the excerpt after this list)
    • If using Windows edit with WordPad rather than Notepad
    • Point to 1.6 JVM
      • Insert following lines before –vmargs
        • -vm
        • \PATH_TO_1.6_JDK\jre\bin\javaw.exe
    • Increase PermGen Memory
      • Insert following line at end of file
        • -XX:MaxPermSize=256M
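Putting those edits together, the relevant parts of eclipse.ini end up looking something like the excerpt below; the existing launcher and -vmargs entries are omitted for brevity and the JDK path is the placeholder from above, so substitute your own.

-vm
\PATH_TO_1.6_JDK\jre\bin\javaw.exe
-vmargs
-XX:MaxPermSize=256M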

Restart Eclipse and verify that everything is installed as expected.

Voila The Deed Is Done

With CEP installed you are now ready to start a server.  If you didn’t install the demos then you will need to create a domain before starting the server.

Once the server is up and running (using startwlevs.sh) you can verify that the visualizer is available at http://hostname:port/wlevs; the default port for the demo domain is 9002.

With the server running you can test the IDE by creating a new “Oracle CEP Application Project” and creating a new target environment pointing at your CEP installation.

Much easier than organizing a Family History Day!

Wednesday Apr 25, 2012

Scripting WebLogic Admin Server Startup

How to Script WebLogic Admin Server Startup

My first car was a 14 year old Vauxhall Viva.  It is the only one of my cars that has ever been stolen, and to this day how they stole it is a mystery to me as I could never get it to start.  I always parked it pointing down a steep hill so that I was ready to jump start it!  Of course its ability to start was dramatically improved when I replaced the carburetor butterfly valve!

Getting SOA Suite or other WebLogic based systems to start can sometimes be a problem because the default WebLogic start scripts require you to stay logged on to the computer where you started the script.  Obviously this is awkward, and a better approach is to run the script in the background.  The problem can also be avoided by using a WLST script to start the AdminServer, but that is more work, so I never bother with it.

If you just run the startup script in the background, the standard output and standard error still go to the session where you started the script, which is not helpful if you log off and later want to see what is happening.  So the next thing to do is to redirect standard out and standard error from the script.

Finally it would be nice to have a record of the output of the last few runs of the Admin Server, but these should be purged to avoid filling up the directory.

Doing the above three tasks is the job of the script I use to start WebLogic.  The script is shown below:

Startup Script

#!/bin/sh

# SET VARIABLES

SCRIPT_HOME=`dirname $0`

MW_HOME=/home/oracle/app/Middleware

DOMAIN_HOME=$MW_HOME/user_projects/domains/dev_domain

LOG_FILE=$DOMAIN_HOME/servers/AdminServer/logs/AdminServer.out

# MOVE EXISTING LOG FILE

logrotate -f -s $SCRIPT_HOME/logrotate.status $SCRIPT_HOME/AdminServerLogRotation.cfg

#RUN ADMIN SERVER

touch $LOG_FILE

nohup $DOMAIN_HOME/startWebLogic.sh > $LOG_FILE 2>&1 &

tail -f $LOG_FILE

Explanation

Let’s walk through each section of the script.

SET VARIABLES

The first few lines of the script just set the environment.  Note that I put the output of the start script into the same location and same filename that it would go to if I used the Node Manager to start the server.  This keeps it consistent with other servers that are started by the node manager.

MOVE EXISTING LOG FILE

The next section keeps a copy of the previous output file by using the logrotate command.  This reads its configuration from the “AdminServerLogRotation.cfg” file shown below:

/home/oracle/app/Middleware/user_projects/domains/dev_domain/servers/AdminServer/logs/AdminServer.out {
  rotate 10
  missingok
}

This tells the logrotate command to keep 10 copies (rotate 10) of the log file and if there is no previous copy of the log file that is not an error condition (missingok).

The logrotate.status file is used by logrotate to keep track of what it has done.  It is ignored when the -f flag is used, causing the log file to be rotated every time the command is invoked.
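The net effect, assuming logrotate’s default numbering, is that after a few restarts the logs directory holds the live file plus up to ten numbered copies, something like:

AdminServer.out
AdminServer.out.1
AdminServer.out.2
...
AdminServer.out.10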

RUN ADMIN SERVER

UPDATE: Sometimes the tail command starts before the shell has created the log file for the startWebLogic.sh command. To avoid an error in the tail command I "touch" the log file to make sure that it is there.

The final section actually invokes the standard command to start an admin server (startWebLogic.sh) and redirects the standard out and standard error to the log file.  Note that I run the command in the background and set it to ignore the death of the parent shell.

Finally I tail the log file so that the user experience is the same as running the start command directly.  However in this case if I Ctrl-C the command only the tail will be terminated; the Admin Server will continue to run as a background process.

This approach allows me to watch the output of the AdminServer but not shut it down if I accidentally hit Ctrl-C or close the shell window.

Restart Script

I also have a restart script shown below:

#!/bin/sh
# SET VARIABLES
SCRIPT_HOME=`dirname $0`
MW_HOME=/home/oracle/app/Middleware
DOMAIN_HOME=$MW_HOME/user_projects/domains/dev_domain

# STOP ADMIN SERVER
$DOMAIN_HOME/bin/stopWebLogic.sh

# RUN ADMIN SERVER
$SCRIPT_HOME/startAdminServer.sh

This is just like the start script except that it runs the stop WebLogic command (stopWebLogic.sh) followed by my start script.

Summary

The above scripts are quick and easy to put in place for the Admin Server and make the stdout and stderr logging consistent with other servers that are started from the node manager.  Now can someone help me push start my car!

Thursday Dec 22, 2011

SOA in a Windows World

Installing BPM Suite on Windows Server 2008 Domain Controller under VirtualBox

It seems I am working with a number of customers for whom Windows is an important part of their infrastructure.  Security is tied in with Windows Active Directory and many services are hosted using Windows Communication Framework.  To better understand these customers’ environments I got myself a Windows 2008 server license and decided to install BPM Suite on a Windows 2008 Server running as a domain controller.  This entry outlines the process I used to get it to work.

The Environment

I didn’t want to dedicate a physical server to running Windows Server so I installed it under Oracle Virtual Box.

My target environment was Windows 2008 Server with Active Directory and DNS roles.  This would give me access to the Microsoft security infrastructure and I could use this to make sure I understood how to properly integrate WebLogic and SOA Suite security with Windows security.  I wanted to run Oracle under a non-Administrator account, as this is often the way I have to operate on customer sites.  This was my first challenge.  For very good security reasons the only accounts allowed to log on to a Windows Domain controller are domain administrator accounts.  Now I only had resources (and licenses) for a single Windows server so I had to persuade Windows to let me log on with a non-Domain Admin account.

Logging On with a non-Domain Admin Account

I found this very helpful blog entry on how to log on using a non-domain account - Allow Interactive Logon to Domain Controllers in Windows Server 2008.  The key steps from this post are as follows:

  • Create a non-Admin user – I created one called “oracle”.
  • Edit the “Default Domain Controllers” group policy to add the user “oracle” to the “Allow log on locally” policy in Computer Configuration > Policies > Windows Settings >Security Settings > Local Policies > User Rights Assignment:
  • Force a policy update.

If you didn’t get it right then you will get the following error when trying to log on: "You cannot log on because the logon method you are using is not allowed on this computer".  This means that you correctly created the user but the policy has not been modified correctly.

Acquiring Software

The best way to acquire the software needed is to go to the BPM download page.  If you choose the Microsoft Windows 32bit JVM option you can get a list of all the required components and a link to download them directly from OTN.  The only download link I didn’t use was the database download because I opted for an 11.2 database rather than the XE link that is given.  The only additional software I added was the 11.1.1.5 BPM feature pack (obtained from Oracle Support as patch #12413651: 11.1.1.5.0 BPM FEATURES PACK) and the OSB software.  The BPM feature pack patch is applied with OPatch so I also downloaded the latest OPatch from Oracle support (patch 6880880 for 11.1.0.x releases on Windows 32-bit).

Installing Oracle Database

I began by setting the system environment variable ORACLE_HOSTNAME to be the hostname of my Windows machine.  I also added this hostname to the hosts file, mapping it to 127.0.0.1.  When launching the installer as a non-Administrator account you will be asked for Administrator credentials in order to install.

Gotcha with Virtual Box Paths

I mounted the install software as a VirtualBox shared Folder and told it to auto-mount.  Unfortunately this auto-mount in Windows only applied to the current user, so when the software tried to run as administrator it couldn’t find the path.  The solution to this was to launch the installer using a UNC path “\\vboxsrv\<SHARE_NAME>\<PATH_TO_INSTALL_FILES>” because the mount point is available to all users, but the auto-mapping is only done at login time for the current user.

Database Install Options

When installing the database I made the following choices to make life easier later, in particular I made sure that I had a UTF-8 character set as recommended for SOA Suite.

  • Declined Security Updates (this is not a production machine)
  • Created and Configured a Database during install
  • Chose Server Class database to get character set options later
  • Chose single instance database installation
  • Chose advanced install to get character set options later
  • Chose English as my language
  • Chose Enterprise Edition
    • Included Oracle Data Extensions for .NET
  • Set Oracle Base as C:\app\oracle
  • Selected General Purpose/Transaction Processing
  • Changed default database name
  • Configuration Options
    • Accepted default memory management settings
    • Changed character set to be AL32UTF8 as required by SOA Suite
    • Unselected “Assert all new security settings” to relax security as this is not a production system
    • Chose to create sample schemas
  • Used database control for database management
  • Used file system for database storage
  • Didn’t enable automated backups (that is what Virtual Box snapshots are for)
  • Used same non-compliant password for all accounts

I set up the environment variable ORACLE_UNQNAME to be the database name; this is provided on the last screen of the Oracle Database Configuration Assistant.

Configuring for Virtual Box

Because VirtualBox port forwarding settings are global I changed the DB console listen port (from 1158 using emca) and the database listener port (from 1521 using the EM console) before setting up port forwarding in VirtualBox to the new ports.  This required me to re-register the database with the listener and to reconfigure EM.

Firewall Restrictions

After changing my ports I had a final task to do before snapshotting my image: I had to add a new Windows Firewall rule to open up the database ports (EM & listener).

Installing WebLogic Server

With a working database I was now able to install WebLogic Server.  I decided to do a 32-bit install to simplify the process (no need for a separate JDK install).  As this was intended to be an all in one machine (developer and server) I accepted the Coherence (needed for SOA Suite) and OEPE (needed for OSB design time tooling) options.  After installing I set the oracle user to have full access permissions on the Middleware home I created in C:\app\oracle\product\FMW.

Installing SOA/BPM Suite

Because I was using a 32-bit JVM I had to provide the “-jreLoc” option to the setup.exe command in order to run the SOA Suite installer (see release notes).  The installer correctly found my Middleware Home and installed the SOA/BPM Suite.  After installing I set the oracle user to have full access to the new SOA home created in C:\app\oracle\product\FMW\Oracle_SOA and the Oracle common directory (C:\app\oracle\product\FMW\oracle_common).

Running Repository Creation Utility

I ran the RCU from my host OS rather than from within the Windows guest OS.  This helps avoid any unnecessary temporary files being created in the virtual machine.  I selected the SOA and BPM Infrastructure component and left the prefix at the default DEV.  Using DEV makes life easier when you come to create a SOA/BPM domain because you don’t need to change the username in the domain config wizard.  Because this isn’t a production environment I also set all the passwords to be the same; again this will simplify things in the config wizard.

Adding BPM Feature Pack

With SOA installed I updated it to include the BPM feature pack.

Installing OPatch Update

First I needed to apply patch 6880880 to get the latest OPatch.  The patch can be applied to any Oracle home and I chose to apply it to the oracle_common home, it seemed to make more sense there rather than the Oracle_SOA home.  To apply the patch I moved the original OPatch directory to OPatch.orig and then unzipped the patch in the oracle_common directory which created a new OPatch directory for me.  Before applying the feature set patch I opened a command prompt and set the ORACLE_HOME environment variable to the Oracle_SOA home and added the new OPatch directory to the path.  I then tested the new OPatch by running the command “opatch lsinventory” which showed me the SOA Suite install version.

Fixing a Path Problem

OPatch uses setupCCR.exe, which has a dependency on msvcr71.dll.  Unfortunately this DLL is not on the path, so by default the call to setupCCR fails with the error “This application failed to start because MSVCR71.dll was not found”.  To fix this I found a helpful blog entry that led me to create a new key in the registry at “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\setupCCR.exe” with the default value set to “<MW_HOME>\utils\ccr\bin\setupCCR.exe”.  I added a String value to this key with a name of “Path” and a value of “<Oracle_Common_Home>\oui\lib\win32”.  This registers the setupCCR application with Windows and adds a custom path entry for this application so that it can find the MSVCR71 DLL.

Patching oracle_common Home

I then applied the BPM feature pack patch to oracle_common by

  • Setting ORACLE_HOME environment variable to the oracle_common directory
  • Creating a temporary directory “PATCH_TOP”
  • Unzipping the following files from the patch into PATCH_TOP
    • p12413651_ORACOMMON_111150_Generic.zip
    • p12319055_111150_Generic.zip
    • p12614083_111150_Generic.zip
  • From the PATCH_TOP directory run the command “<Oracle_Common_Home>\OPatch\opatch napply”
    • Note I didn’t provide the inventory pointer parameter (invPtrLoc) because I had a global inventory that was found just fine by OPatch and I didn’t have a local inventory as the patch readme seems to expect.
  • Deleting the PATCH_TOP directory

After successful completion of this “opatch lsinventory” showed that 3 patches had been applied to the oracle_common home.

Patching Oracle_SOA Home

I applied the BPM feature pack patch to Oracle_SOA by

  • Setting ORACLE_HOME environment variable to the Oracle_SOA directory
  • Creating a temporary directory “PATCH_TOP”
  • Unzipping the following file from the patch into PATCH_TOP
    • p12413651_SOA_111150_Generic.zip
  • From the PATCH_TOP directory run the command “<Oracle_Common_Home>\OPatch\opatch napply”
    • Note again I didn’t provide the inventory pointer parameter (invPtrLoc) because I had a global inventory that was found just fine by OPatch and I didn’t have a local inventory as the patch readme seems to expect.
  • Deleting the PATCH_TOP directory

After successful completion of this “opatch lsinventory” showed that 1 patch had been applied to the Oracle_SOA home.

Updating the Database Schemas

    Having updated the software I needed to update the database schemas which I did as follows:

    • Setting ORACLE_HOME environment variable to the Oracle_SOA directory
    • Setting JAVA_HOME to <MW_HOME>\jdk160_24
    • Running “psa -dbType Oracle -dbConnectString //<DBHostname>:<ListenerPort>/<DBServiceName> -dbaUserName sys -schemaUserName DEV_SOAINFRA”
      • Note that again I elided the invPtrLoc parameter

    Because I had not yet created a domain I didn’t have to follow the post installation steps outlined in the Post-Installation Instructions.

    Creating a BPM Development Domain

    I wanted to create a development domain.  So I ran config from <Oracle_Common_Home>\common\bin selecting the following:

    • Create a New Domain
    • Domain Sources
      • Oracle BPM Suite for developers – 11.1.1.0
        • This will give me an Admin server with BPM deployed in it.
      • Oracle Enterprise Manager – 11.1.1.0
      • Oracle Business Activity Monitoring – 11.1.1.0
        • Adds a managed BAM server.
    • I changed the domain name and set the location of the domains and applications directories to be under C:\app\oracle\MWConfig
      • This removes the domain config from the user_projects directory and keeps it separate from the installed software.
    • Chose Development Mode and Sun JDK rather than JRockit
    • Selected all Schema and set password, service name, host name and port.
      • Note that when testing, the test for SOA Infra will fail because it is looking for version 11.1.1.5.0 but the BPM feature pack incremented it to 11.1.1.5.1.  If the reason for the failure is “no rows were returned from the test SQL statement” then you can continue and select OK when warned that “The JDBC configuration test did not fully complete”.  This is covered in Oracle Support note 1386179.1.
    • Selected Optional Configuration for Administration Server and Managed Servers so that I could change the listening ports.
      • Set Admin Server Listen port to 7011 to avoid clashes with other Admin Servers in other guest OS.
      • Set bam_server Listen port to 9011 to avoid clashes with other managed servers in other guest OS.
      • Changed the name of the LocalMachine to reflect the hostname of the machine I was installing on.
      • Changed the node manager listen port to 5566 to avoid clashes with other Node Managers in other guest OS.

    Having created my domain I then created a boot.properties file for the bam_server.

    Configuring Node Manager

    With the domain created I set up Node Manager to use start scripts by running setNMProps.cmd from <oracle_common>\common\bin.

    I then edited the <MW_Home>\wlserver_10.3\common\nodemanager\nodemanager.properties file and added the following property:

    • ListenPort=5566

    Firewall Policy Updates

    I had to add the Admin Server, BAM Server and Node Manager ports to the Windows firewall policy to allow access to those ports from outside the Windows server.

    Set Node Manager to Start as a Windows Service

    I wanted node manager to automatically run on the machine as a Windows service so I first edited the <MW_HOME>\wlserver_10.3\server\bin\installNodeMgrSvc.cmd and changed the port to 5566.  Then I ran the command as Administrator to register the service.  The service is automatically registered for automatic startup.

    Set Admin Server to Start as a Windows Service

    I also wanted the Admin Server to run as a Windows service.  There is a blog entry about how to do this using the installSvc command but I found it much easier to use NSSM. To use this I did the following:

    • Downloaded NSSM and put the 64-bit version in my MWConfig directory.
      • Once you start using NSSM the Services you create will point to the location from which you ran NSSM so don’t move it after installing a service!
    • Created a simple script to start the admin server and redirect its standard out and standard error to a log file (I redirected to “%DOMAIN_HOME%\servers\AdminServer\logs\AdminServer.out” because this is the location that would be used if the AdminServer were started by the node manager).

        @REM Point to Domain Directory
        set DOMAIN_HOME=C:\app\oracle\MWConfig\domains\bp_domain
        @REM Point to Admin Server logs directory
        set LOGS_DIR=%DOMAIN_HOME%\servers\AdminServer\logs
        @REM Redirect WebLogic stdout and stderr
        set JAVA_OPTIONS=-Dweblogic.Stdout="%LOGS_DIR%\AdminServer.out" -Dweblogic.Stderr="%LOGS_DIR%\AdminServer.out"
        @REM Start Admin Server
        call %DOMAIN_HOME%\startWebLogic.cmd

    • Registered the script as a Windows service using NSSM
      • nssm install “Oracle WebLogic AdminServer” “C:\app\oracle\MWConfig\startAdminServer.cmd”

    Note that when you redirect WebLogic stdout and stderr as I have done, the log file does not capture the first few lines of output, so test your script from the command line before registering it as a service.

    By default the AdminServer will be restarted if it fails, allowing you to bounce the Admin Server without having to log on to the Windows machine.

    Configuring for Virtual Box

    Having created the domain and configured Node Manager I enabled port forwarding in VirtualBox to expose the Admin Server (port 7011), BAM Server (port 9011) and the Node Manager (port 5566).

    Testing It

    All that is left is to start the node manager as a service, start the Admin server as a service, start the BAM server from the WebLogic console and make sure that things work as expected.  In this case all seemed fine.  When I shut down the machine and then restarted everything came up as expected!

    Conclusion

    The steps above create a SOA/BPM installation running under Windows Server 2008 that is automatically started when Windows Server starts.  The log files can be accessed and read by a non-admin user so the status of the environment can be checked.  Additional managed servers can be started from the Admin console because we have node manager installed.  The database, database listener, database control, node manager and Admin Server all start up as Windows services when the server is started avoiding the need for an Administrator to start them.

    Friday Sep 23, 2011

    Oracle & JBoss Comparison

    It’s All About TCO!

    Crimson Research has just published a paper comparing total cost of ownership (TCO) of WebLogic versus JBoss.

    You can download the paper here.  The key point it makes is that acquisition of an application server platform is only a small part of the total cost of ownership over a 5 year period.  What I found surprising was the speed with which the report suggests the lower TCO of WebLogic becomes noticeable; it indicates that the break-even point is about 18 months into the deployment.

    The study was sponsored by Oracle, but Crimson conducted it using their own methodology, and it certainly sets out a good case for the benefits of “-ility” features in bringing down the implementation costs of an application server infrastructure.  This gels with my own experience, where the more I work with operations staff the more I learn how important things like script recording and WLST are to them.

    Thursday Sep 22, 2011

    Coping with Failure

    Handling Endpoint Failure in OSB

    Recently I was working on a POC and we had demonstrated stellar performance with OSB fronting a BPEL composite calling back end EJBs.  The final test was a failover test, which involved killing an OSB server and bringing it back online, then killing a SOA (BPEL) server and bringing it back online, and finally killing a back end EJB server and bringing it back online.  All was going well until the BPEL failover test, when for some reason OSB refused to mark the BPEL server as down.  It turns out we had forgotten a very important setting, so this entry outlines how to handle endpoint failure in OSB.

    Step 1 – Add Multiple End Points to Business Service

    The first thing to do is create multiple endpoints for the business service, pointing to all available backends.  This is required for HTTP/SOAP bindings.  In theory if using a T3 protocol then a single cluster address is sufficient and load balancing will be taken care of by T3 smart proxies.  In this scenario though we will focus on HTTP/SOAP endpoints.

    Navigate to the Business Service->Configuration Details->Transport Configuration and add all your endpoint URIs.  Make sure that Retry Count is greater than 0 if you don’t want to pass failures back to the client.  In the example below I have set up links to three back end web service instances.  Go to Last and Save the changes.

    [Screenshot: Business Service transport configuration with multiple endpoint URIs]

    Step 2 – Enable Offlining & Recovery of Endpoint URIs

    When a back end service instance fails we want to take it offline, meaning we want to remove it from the pool of instances to which OSB will route requests.  We do this by navigating to the Business Service->Operational Settings and selecting the Enable check box for Offline Endpoint URIs in the General Configuration section.  This causes OSB to stop routing requests to a backend that returns errors (if the transport setting Retry Application Errors is set) or fails to respond at all.

    Offlining the service is good because we won’t send any more requests to a broken endpoint, but we also want to add the endpoint again when it becomes available.  We do this by setting the Enable with Retry Interval in General Configuration to some non-zero value, such as 30 seconds.  Then every 30 seconds OSB will add the failed service endpoint back into the list of endpoints.  If the endpoint is still not ready to accept requests then it will error again and be removed again from the list.  In the example below I have set up a 30 second retry interval.  Remember to hit update and then commit all the session changes.

    [Screenshot: Business Service operational settings with endpoint offlining enabled and a 30 second retry interval]

    Considerations on Retry Count

    A couple of things to be aware of on retry count.

    If you set the retry count to greater than zero then endpoint failures will be transparent to OSB clients, other than the additional delay they experience.  However if the request is mutative (changes the backend) then there is no guarantee that the request was not executed before the endpoint failed to return the result, in which case a retry would submit the mutative operation twice.  If your back end service can’t cope with this then don’t set retries.

    If your back-end service can’t cope with retries then you can still get the benefit of transparent retries for non-mutative operations by creating two business services, one with retry enabled that handles non-mutative requests, and the other with retry set to zero that handles mutative requests.

    Considerations on Retry Interval for Offline Endpoints

    If you set the retry interval to too small a value then it is very likely that your failed endpoint will not have recovered, so you will waste time on a request failing to contact that endpoint before failing over to a working endpoint, which will increase the client response time.  Work out what a typical unplanned outage time for a node would be (such as one caused by a JVM failure and subsequent restart) and set the retry interval to, say, half of this as a compromise between causing additional client response time delays and adding the endpoint back into the mix as soon as possible.  For example, if a failed server typically takes around four minutes to restart, a retry interval of about two minutes is a reasonable starting point.

    Conclusion

    Always remember to set the Operational Setting to Enable Offlining and then you won’t be surprised in a fail over test!

    Tuesday Aug 23, 2011

    Moving Address

    Managing IP Addresses with Node Manager

    Moving house and changing address is always a problem, Auntie Matilda and the Mega Credit card company continue to send letters to the old address for years, which are dutifully forwarded by the new occupants.  Every few months the dear folks at the Bristol England Oracle office start to feel guilty about the amount of mail addressed to me, so they stick it in a FedEx envelope and send it out to the Colorado Springs Oracle office, where I open it and throw it all in the recycling bin.  So it is with some relief that I can reveal how easy it is to have node manager take care of all moving address requirements for your managed WebLogic servers.

    My colleague James Bayer pointed out to me last week that there have been some enhancements to Node Manager in the way it handled IP addresses.

    Justification

    Some WebLogic managed servers need to be able to move from one machine to another, and when they move, so that everyone else can find them, we need to move their IP listening addresses at the same time.  This “Whole Server Migration”, sometimes referred to as “WSM” to deliberately cause confusion with “Web Services Manager”, can occur for a number of reasons:

    • Allow recovery of XA transactions by the transaction manager in that WebLogic instance
    • Allow recovery of messages in a JMS Queue managed by a JMS server in that WebLogic instance
    • Allow migration of singleton services, for instance the BAM server in SOA Suite

    Early History

    We used to enable whole server migration in the following way:

    • Grant sudo privileges to the “oracle” user on the ifconfig and arping commands (which are used by the wlsifcfg script called by Node Manager) by adding the following to the /etc/sudoers file:
      • oracle ALL=NOPASSWD: /sbin/ifconfig, /sbin/arping
    • Tell Node Manager which interface (in the example below the eth1 network interface) it is managing and what the netmask is for that interface (in our example an 8-bit sub-net) by adding the following to the nodemanager.properties file:
      • Interface=eth1
      • NetMask=255.255.255.0
      • UseMACBroadcast=true
    • Identify target machines for the cluster in WebLogic console Environment->Clusters->cluster_name->Migration
    • Identify a target machine for the managed server in WebLogic console Environment->Servers->server_name
    • Identify a failover machine or machines for the managed server in WebLogic console Environment->Servers->server_name->Migration

    Once configured like this, when a managed server is started the Node Manager will first check that the IP address is not already in use and then bring it up on the given interface before starting the Managed Server; the IP address is brought up on a sub-interface of the given interface, such as eth0:1.  Similarly, when the Managed Server is shut down or fails the Node Manager will release the IP address if it allocated it.  When failover occurs the Node Manager again checks for IP usage before starting the managed server on the failover machine.
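    For reference, what the wlsifcfg script does on the Node Manager’s behalf is roughly equivalent to the following commands.  This is only an illustrative sketch; the interface, address and netmask are made up, and in practice you let Node Manager drive this rather than running it by hand.

    # Bring the floating IP up on a sub-interface (run via sudo, as granted above)
    /sbin/ifconfig eth1:1 10.1.1.130 netmask 255.255.255.0 up
    # Send gratuitous ARPs so the subnet learns the new MAC/IP mapping
    /sbin/arping -q -U -c 3 -I eth1 10.1.1.130
    # Release the address when the managed server stops or migrates away
    /sbin/ifconfig eth1:1 down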

    A Problem

    The problem with this approach is what happens when you have a managed server listening on more than one address!  We can only provide one Interface and even if multiple interface properties were allowed (they are not) we would not know which NetMask to apply.

    So for many years we have labored under the inability to have a server support both multiple listening addresses (channels in WebLogic parlance) and whole server migration – until now.

    Solution

    The latest Node Manager comes with a different syntax for managing IP addresses.  Instead of an Interface and NetMask property we now have a new property corresponding to the name of an interface on our computer.  For this interface we identify the range of IP addresses we manage on that interface and the NetMask associated with that address range.  This allows us to have multiple listening addresses in our managed server listening on different interfaces.  An example of adding support for multiple listening addresses on two interfaces (bond0 and eth0) is shown below:

    bond0=10.1.1.128-10.1.1.254,NetMask=255.255.255.0

    eth0=10.2.3.10-10.2.3.127,NetMask=255.255.240.0

    UseMACBroadcast=true

    So now when a server has multiple listen channels the Node Manager will check which address range each listen channel falls into and bring up the address on the appropriate interface.

    A Practical Application

    A practical example of where this is useful is when we have an ExaLogic machine with an external Web Server or load balancer.  We want the managed servers to talk to each other using the internal InfiniBand network, but we want to access the managed servers externally using the 10 Gigabit Ethernet network.  With the enhancements to the Node Manager we can do just this.

    Acknowledgements

    Thanks to James Bayer, Will Howery and Teju for making me aware of this functionality on a recent ExaLogic training course.  Thanks to them my managed servers can now move multiple addresses without the constant forwarding of mail that I get from the Bristol office.  Now if they can just get the Bristol office to stop forwarding my mail…

    ExaLogic Documentation on Server Migration

    Tuesday Apr 26, 2011

    The Horses Mouth
