Synchronizing Coherence Clusters – A Tour of Push Replication

Lately I've been doing some Coherence work with local customers and playing with the Coherence Incubator projects.  This entry showcases one of the Push Replication examples, which shares data among separate Coherence clusters.

Inter-cluster Data Replication

Coherence clustering technology makes a lot of sense for customers who need to scale applications horizontally and reliably with fast, predictable performance.  This is of course easy to do within one data center on a fast network.  A common challenge for customers is High Availability and Disaster Recovery, which means keeping data synchronized across data centers.  Normally Coherence is optimized to use UDP unicast or multicast, but what happens if the network is unreliable or has high latency, as is common when networking multiple data centers?  The answer is Coherence TCP Extend, which addresses these challenges by using TCP so that multiple Coherence clusters can communicate with each other.

Multiple data centers are not the only reason you might need to keep multiple clusters in sync.  I recently came across a use case where it made sense to have separate Coherence clusters on different physical machines.  The JVM processes on each machine have a single-digit millisecond SLA for round trip time per request, so they are very sensitive to any outside events, and it was a requirement to isolate the processes as much as possible.  In this scenario one physical machine was Active and one was Passive for fail-over.  During load-testing on the Active machine we found that bringing up cluster members on the Passive machine in the middle of the test impacted performance when both machines were part of the same cluster.  Using a single Coherence cluster per machine ensures that the impact of cluster membership events is isolated to that machine only.

Leveraging the Coherence Incubator Push Replication pattern, multiple Coherence clusters can keep data in sync whether they are on the same low-latency network subnet or hundreds of miles apart in separate data centers.  Let's take a look at the simplest example, which is the Active-Passive scenario.  In this situation we'll use Push Replication to make sure that Cache Entry operations (insert/update/delete) in the Active cache are replicated in the Passive cache.
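To make the Active-Passive idea concrete, here is a minimal sketch in plain Java.  It is a toy model only: the class and method names are my own invention, two HashMaps stand in for the Active and Passive caches, and nothing here is the Incubator API.  It just shows entry operations being captured on the active side and replayed in order on the passive side.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of Active-Passive replication. Hypothetical names; not the Incubator API. */
class ActivePassiveSketch {

    /** One captured entry operation; a null value marks a delete. */
    static final class EntryOp {
        final String key;
        final String value;

        EntryOp(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> active = new HashMap<>();
    private final Map<String, String> passive = new HashMap<>();
    private final Queue<EntryOp> pending = new ArrayDeque<>();

    /** Insert or update on the active side (value assumed non-null) and record the operation. */
    void put(String key, String value) {
        active.put(key, value);
        pending.add(new EntryOp(key, value));
    }

    /** Delete on the active side and record the operation. */
    void remove(String key) {
        active.remove(key);
        pending.add(new EntryOp(key, null));
    }

    /** Replays the recorded operations, in order, against the passive side. */
    void publish() {
        EntryOp op;
        while ((op = pending.poll()) != null) {
            if (op.value == null) {
                passive.remove(op.key);
            } else {
                passive.put(op.key, op.value);
            }
        }
    }

    Map<String, String> activeView() { return new HashMap<>(active); }

    Map<String, String> passiveView() { return new HashMap<>(passive); }

    public static void main(String[] args) {
        ActivePassiveSketch sketch = new ActivePassiveSketch();
        sketch.put("a", "1");   // insert
        sketch.put("a", "2");   // update
        sketch.put("b", "3");   // insert
        sketch.remove("b");     // delete
        sketch.publish();
        System.out.println("in sync: " + sketch.activeView().equals(sketch.passiveView()));
    }
}
```

Replaying operations in order, rather than copying a snapshot, is what lets the passive side see deletes as well as inserts and updates.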

[Image: Incubator-PushReplication-Active-Passive]

Run the Example

To run this example, called ActivePassiveExample and included in the src distribution of the Push Replication Pattern, I used the following software:

The Ant Approach

One of my colleagues, Randy Stafford, introduced me to using an Ant build.xml file to organize different types of Coherence cluster processes.  I found it to be a much simpler and cleaner approach than what I had been doing (multiple shell scripts or multiple Eclipse launch configurations).  This way it is simple to centralize the shared configuration in a set of properties that are used by all of the Ant targets, and each process can easily override one of the properties with a -DpropertyName=propertyValue at the command line.  Take a look at the build.xml file and let me know what you think.

Step by Step

  • Set the build.properties paths according to your environment
  • Set up shells with ANT_HOME, JAVA_HOME, and the PATH; setAntEnv.txt is an example
  • Run ant from those shells with the appropriate targets as described in build.xml.  Minimum for this example:
    ant compile (you can reuse this shell)
    ant run_active_cache_server
    ant run_passive_cache_server
    ant run_active_publisher
  • Most likely you will also want to run the JMX and Console processes to see the values.
    ant run_active_jmx
    ant run_passive_jmx
    ant run_passive_console

Guided Tour

Once the JMX processes are running for Active and Passive, you can use jconsole to bind locally to the two MBeanConnector processes.

[Screenshot: cluster]

To verify that the passive cache received all the updates you can execute the run_passive_console target and execute two commands to see the contents.

  • cache passive-cache
  • list

Notice at the bottom of my output that the entry values are the last ten numbers leading up to 10000, which is what we expect from looking at the sample publishing code.

     [java]   ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINED), Id=0, Version=3.5, OldestMemberId=1}
     [java]   InvocationService{Name=Management, State=(SERVICE_STARTED), Id=1, Version=3.1, OldestMemberId=1}
     [java]   )
cache passive-cache
     [java] 2009-10-22 17:07:31.797/10.344 Oracle Coherence GE 3.5.1/461p2 <Info> (thread=Main Thread, member=3): Loaded cache configuration from "jar:file:/C:/Oracle/coherence-v3.5.1b461/coherence/lib/coherence.jar!/coherence-cache-config.xml"
     [java] 2009-10-22 17:07:32.156/10.703 Oracle Coherence GE 3.5.1/461p2 <D5> (thread=DistributedCache, member=3): Service DistributedCache joined the cluster with senior service member 1
     [java] 2009-10-22 17:07:32.172/10.719 Oracle Coherence GE 3.5.1/461p2 <D5> (thread=DistributedCache, member=3): Service DistributedCache: received ServiceConfigSync containing 259 entries
     [java] <distributed-scheme>
     [java]   <scheme-name>example-distributed</scheme-name>
     [java]   <service-name>DistributedCache</service-name>
     [java]   <backing-map-scheme>
     [java]     <local-scheme>
     [java]       <scheme-ref>example-binary-backing-map</scheme-ref>
     [java]     </local-scheme>
     [java]   </backing-map-scheme>
     [java]   <autostart>true</autostart>
     [java] </distributed-scheme>
     [java] Map (?):
list
     [java] 2 = 9992
     [java] 3 = 9993
     [java] 1 = 9991
     [java] 0 = 10000
     [java] 5 = 9995
     [java] 4 = 9994
     [java] 6 = 9996
     [java] 8 = 9998
     [java] 9 = 9999
     [java] 7 = 9997
     [java] Map (passive-cache):
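The values in the listing above are consistent with a publisher that writes the numbers 1 through 10000 while cycling over ten keys.  The sketch below is my reconstruction of that idea, not the actual sample code: the method name, the modulo key scheme, and the loop bounds are assumptions inferred from the output, and a plain java.util.Map stands in for the cache (a Coherence NamedCache also implements java.util.Map).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical reconstruction of a publisher loop; inferred from the console output only. */
class PublisherLoopSketch {

    /** Writes 1..count into the map, reusing ten keys so later values overwrite earlier ones. */
    static Map<Integer, Integer> publish(Map<Integer, Integer> cache, int count) {
        for (int i = 1; i <= count; i++) {
            cache.put(i % 10, i);   // assumed key scheme: key is the last digit of the value
        }
        return cache;
    }

    public static void main(String[] args) {
        // A TreeMap is used only so the entries print in key order.
        Map<Integer, Integer> cache = publish(new TreeMap<Integer, Integer>(), 10000);
        for (Map.Entry<Integer, Integer> entry : cache.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Under these assumptions the map ends with ten entries, 0 = 10000 and 1 = 9991 through 9 = 9999, which matches the list output shown for passive-cache.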

Let's look at the Active cluster processes.  You can see that there are caches defined for many of the Incubator patterns.  Below you can see that the publishing-active cache should have a size of 10 after running the run_active_publisher ant target.

[Screenshot: publishingCacheStore]

Consider the role of some of the other caches:

  • DistributedCacheForCommandPattern - stores command contexts and co-located commands
  • DistributedCacheForMessages - stores messages (in this case, entries that need to be pushed to the other cluster)
  • DistributedCacheForSubscriptions - maintains durable subscribers for the publishers

You can learn more about these caches by looking at the Coherence Incubator Wiki and reviewing the cache-config xml files that correspond to the Incubator project they are named for.

Looking at the MBean for the PublishingService, you can find operations to suspend, resume, and drain messages that have yet to be published.

[Screenshot: publishingService]

So if you suspend() the PublishingService, any changes to the publishing-active cache will be queued up as messages in the coherence.messagingpattern.messages cache (listed under DistributedCacheForMessages).  You can try this by executing the suspend() operation and then running the run_active_publisher target again.  You should then see the messaging cache full of the operations waiting to be replicated to the passive-cache.

[Screenshot: messagesCache]

Now if you execute the resume() operation on the PublishingService, the queued messages will replicate to the passive-cache.  Alternatively, you could execute the drain() operation, in which case all of the messages waiting to be replicated would be deleted and the passive-cache would never receive those updates.
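The suspend/resume/drain behavior described above can also be driven from code instead of jconsole.  The sketch below is a self-contained toy: ToyPublisher, its MBean interface, and the ObjectName "sketch:type=ToyPublisher" are all hypothetical stand-ins that I registered locally, not the Incubator's PublishingService MBean.  Only the javax.management calls are the standard JDK API.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Toy publisher controlled over JMX. Hypothetical names; not the Incubator MBean. */
class PublisherControlSketch {

    /** Standard MBean interface: must be public and named <ImplClass>MBean. */
    public interface ToyPublisherMBean {
        void suspend();
        void resume();
        void drain();
        int getPendingCount();
    }

    public static class ToyPublisher implements ToyPublisherMBean {
        private final Queue<String> pending = new ArrayDeque<>();
        private final List<String> delivered = new ArrayList<>();
        private boolean suspended;

        /** Queues a message and delivers it immediately unless suspended. */
        public synchronized void offer(String message) {
            pending.add(message);
            if (!suspended) {
                flush();
            }
        }

        @Override public synchronized void suspend() { suspended = true; }
        @Override public synchronized void resume() { suspended = false; flush(); }
        @Override public synchronized void drain() { pending.clear(); }
        @Override public synchronized int getPendingCount() { return pending.size(); }

        public synchronized List<String> delivered() { return new ArrayList<>(delivered); }

        private void flush() {
            while (!pending.isEmpty()) {
                delivered.add(pending.poll());
            }
        }
    }

    /** Registers the toy MBean, drives it through JMX, and returns what was delivered. */
    static List<String> demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("sketch:type=ToyPublisher");
            ToyPublisher publisher = new ToyPublisher();
            server.registerMBean(publisher, name);
            try {
                publisher.offer("a");                                            // delivered right away
                server.invoke(name, "suspend", new Object[0], new String[0]);
                publisher.offer("b");                                            // queued
                publisher.offer("c");                                            // queued
                System.out.println("pending: " + server.getAttribute(name, "PendingCount"));
                server.invoke(name, "drain", new Object[0], new String[0]);      // b and c discarded
                server.invoke(name, "resume", new Object[0], new String[0]);
                publisher.offer("d");                                            // delivered again
                return publisher.delivered();
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("delivered: " + demo());
    }
}
```

In this toy run the messages queued while suspended are dropped by drain(), so only "a" and "d" are delivered.  Calling resume() without drain() would have delivered "b" and "c" as well.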

On the Passive cluster, you can see the TCP Extend working by looking at the ConnectionManager and the Connection MBeans.  The passive process is listening in this case on port 20000 and has 1 active connection.
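If you want a quick check from the command line that something is actually listening on that port before digging through MBeans, a plain socket probe is enough.  This is a generic JDK sketch rather than anything Coherence-specific; the class name is hypothetical and the host and port in main are simply the values from this example setup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Generic TCP probe; not part of Coherence. */
class PortProbeSketch {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 20000 is the port the passive process listens on in this example setup.
        System.out.println("port 20000 listening: " + isListening("localhost", 20000, 1000));
    }
}
```

A successful connect only shows that a listener exists; the ConnectionManager MBean is still the place to confirm it is the Extend proxy and to see the connection count.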

[Screenshot: tcp_extend]

Conclusion

This should give you a quick idea of the capabilities of the Push Replication Pattern and some of the JMX-enabled features in the Coherence Incubator projects.  This is the most basic example; there are also samples for the more complex Hub-and-Spoke, Active-Active, and Federated patterns.

Comments:

wow great article, it was really informative. One question...Is it user friendly?

Posted by Lisa on October 22, 2009 at 05:45 AM PDT #

The documentation on the Incubator Wiki is good. However, the reason I decided to write this blog entry in the first place is that it took me some time and browsing of example code to understand the role of some of the caches and how the Incubator projects build on each other. If all you are interested in is Push Replication, I think this should help jump start you: which cache has entries yet to be replicated, how to suspend/resume/drain, etc. Having the JMX instrumentation really helps for manageability.

Posted by james.bayer on October 22, 2009 at 11:15 AM PDT #

Hi. It will be great if you can update this to use the Push Replication Pattern 3.0.3. (for 3.6 and above). This is pointing to version 2.4.0. Thanks.

Posted by guest on July 25, 2011 at 12:14 AM PDT #

Well the latest Push Replication Pattern is now 4.0.x
http://coherence.oracle.com/display/INC10/coherence-pushreplication-pattern
and it's primarily based on the new Event Distribution Pattern
http://coherence.oracle.com/display/INC10/coherence-eventdistribution-pattern
I'll give Brian Oliver a notice that there is interest in a post like this for the new patterns and updated versions, but I have yet to work with it.

Cheers,

James

Posted by james.bayer on July 25, 2011 at 01:49 AM PDT #

About

James Bayer
I was formerly a Product Manager on the WebLogic Server team based out of Oracle HQ. You can find my new blog at http://iamjambay.com.