Impact of Server Failure on Coherence Request Processing

Requests against a given cache server may be temporarily blocked for several seconds following the failure of other cluster members. This may cause issues for applications that can not tolerate multi-second response times even during failover processing (ignoring for the moment that in practice there are a variety of issues that make such absolute guarantees challenging even when there are no server failures).


In general, Coherence is designed around the principle that failures in one member should not affect the rest of the cluster if at all possible. However, it's obvious that if that failed member was managing a piece of state that another member depends on, the second member will need to wait until a new member assumes responsibility for managing that state. This transfer of responsibility is (as of Coherence 3.7) performed by the primary service thread for each cache service. The finest possible granularity for transferring responsibility is a single partition. So the question becomes how to minimize the time spent processing each partition.


Here are some optimizations that may reduce this period:

  • Reduce the size of each partition (by increasing the partition count)
  • Increase the number of JVMs across the cluster (increasing the total number of primary service threads)
  • Increase the number of CPUs across the cluster (making sure that each JVM has a CPU core when needed)
  • Re-evaluate the set of configured indexes (as these will need to be rebuilt when a partition moves)
  • Make sure that the backing map is as fast as possible (in most cases this means running on-heap)
  • Make sure that the cluster is running on hardware with fast CPU cores (since the partition processing is single-threaded)


As always, proper testing is required to make sure that configuration changes have the desired effect (and also to quantify that effect).

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

The primary contributors to this blog are comprised of the Exalogic and Cloud Application Foundation contingent of Oracle's Fusion Middleware Architecture Team, fondly known as the A-Team. As part of the Oracle development organization, The A-Team supports some of Oracle's largest and most strategic customers worldwide. Our mission is to provide deep technical expertise to support various Oracle field organizations and customers deploying Oracle Fusion Middleware related products. And to collect real world feedback to continuously improve the products we support. In this blog, our experts and guest experts will focus on Exalogic, WebLogic, Coherence, Tuxedo/mainframe migration, Enterprise Manager and JDK/JRockIT performance tuning. It is our way to share some of our experiences with Oracle community. We hope our followers took away something of value from our experiences. Thank you for visiting and please come back soon.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today