Friday Sep 25, 2009

How Does Sailfin (Sun GlassFish Communications Server) SIP Session Replication Module Select Replica Instances?

For scalable, highly available middleware deployments, persisting session state to all instances in the cluster can be a sub-optimal solution. Replicating sessions to every instance generates significantly more network traffic just for replicating state, which reduces the bandwidth available for a growing volume of application user requests. Sharing sessions across all instances is perhaps suited to small clusters with a limited number of concurrent requests.

A better approach for securing scaling advantages is buddy replication. In this approach, each instance selects one (or more) other instance(s) in the cluster to which it replicates all of its sessions. This is a superior approach and, in fact, works for fairly large deployments. There are factors to consider here, however, in terms of the overhead the replication subsystem must handle at the cost of performance, particularly when a large number of concurrent sessions are being processed and later expired. One such overhead is the need for instances to form ring-like replica partnerships based on a certain order in which buddies become available and are selected. When a buddy instance fails, there is the cost of re-adjusting and forming a new buddy relationship with another surviving instance; when the original buddy recovers, one of the instances in the cluster must re-adjust again to use the recovered instance as its replica partner. Think of this as a ring of chain links from which links are randomly removed, with the constant goal of keeping the ring connected and the overhead of relinking each time a link is removed or added.
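
The ring-style partnership described above can be sketched roughly as follows. This is only an illustration and not GlassFish code: the class `BuddyRing`, its methods, and the instance names are all hypothetical, and the "certain order" of instances is assumed here to be plain alphabetical order.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of ring-style buddy replication; not GlassFish code.
final class BuddyRing {
    // Live instances kept in sorted order so every member derives the same ring.
    private final TreeSet<String> live = new TreeSet<>();

    void join(String instance) { live.add(instance); }

    void fail(String instance) { live.remove(instance); }

    // The buddy is the next live instance in ring order, wrapping around at the end.
    // Returns null when the instance is not live or has no possible partner.
    String buddyOf(String instance) {
        if (!live.contains(instance) || live.size() < 2) {
            return null;
        }
        String next = live.higher(instance);
        return next != null ? next : live.first();
    }

    public static void main(String[] args) {
        BuddyRing ring = new BuddyRing();
        for (String name : List.of("i1", "i2", "i3", "i4")) {
            ring.join(name);
        }
        System.out.println(ring.buddyOf("i1")); // i2
        ring.fail("i2");                        // i1 must re-partner
        System.out.println(ring.buddyOf("i1")); // i3
        ring.join("i2");                        // i2 recovers, i1 re-partners again
        System.out.println(ring.buddyOf("i1")); // i2
    }
}
```

Every membership change forces at least one instance to pick a new partner, and in a real system to move or re-replicate all of its sessions to that partner, which is the relinking overhead discussed above.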

There is also a cost to consider (if such were the design approach) each time the cluster shape changes: dynamically updating any cookie information about replica locations that is sent back to the LB as part of the response headers. That cost, too, is typically best avoided through more efficient means.

In the case of GlassFish, we have used buddy replication fairly successfully, with each instance having a single replica buddy. When an instance that was processing requests fails and the LB directs a request to any random instance, we locate the session within the cluster. This has worked well for reasonably large mission-critical environments where the scalability and availability requirements are within the boundaries of this approach.

In Sailfin 2.0, the scaling and reliability needs of telco applications are typically very high, and we needed a scalable approach to ensure that performance stayed good despite the overhead of SIP Session Replication, along with the added reliability and availability. We therefore used a consistent hashing algorithm to dynamically assign a replica instance for each new session. We did this by leveraging the consistent hashing mechanism that the Sailfin Converged Load Balancer (CLB) uses for proxying requests to a target instance using a BEKey. In the case of replication, the same logic of using a hashed key for target instance assignment is taken a bit further.

For replica selection, for each new session we pre-calculate the most likely target instance that the CLB would fail over to if the current primary instance serving the session were to fail in the future. This gives us the instance to which the current primary instance should replicate. This approach gave us significant benefits: no client cookie updates were required to include replica partner information dynamically, and no readjustment of replica partnerships was needed when a particular instance failed, as the hashing algorithm would provide another instance to replicate to with just an API call. When the failed instance comes back into the cluster, the unexpired sessions that it owned in its prior incarnation migrate back to it to maintain a balanced set of sessions across the cluster, and the replica selection algorithm assigns the original failover instance for this primary as the replication partner.
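
The idea of pre-calculating the failover target can be sketched roughly as follows. This is not the actual Sailfin CLB or SSR implementation. It is a generic consistent-hash ring built on assumptions: the class `ReplicaSelector`, the CRC32 hash, the number of ring points per instance, and the plain string key (standing in for something like a BEKey) are all hypothetical choices made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical sketch of hash-based replica selection; not the Sailfin code.
final class ReplicaSelector {
    private static final int POINTS_PER_INSTANCE = 64; // assumed virtual-node count
    private final TreeMap<Long, String> ring = new TreeMap<>();

    ReplicaSelector(Collection<String> instances) {
        for (String instance : instances) {
            for (int i = 0; i < POINTS_PER_INSTANCE; i++) {
                ring.put(hash(instance + "#" + i), instance);
            }
        }
    }

    // CRC32 is used only because it is deterministic and in the JDK.
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // First instance clockwise from the key's hash, skipping 'excluded' (may be null).
    private String firstFrom(String key, String excluded) {
        long h = hash(key);
        for (String candidate : ring.tailMap(h, true).values()) {
            if (!candidate.equals(excluded)) return candidate;
        }
        for (String candidate : ring.headMap(h, false).values()) {
            if (!candidate.equals(excluded)) return candidate;
        }
        return null; // no eligible instance
    }

    // Instance the load balancer would route this key to.
    String primaryFor(String key) {
        return firstFrom(key, null);
    }

    // Instance the same key would be routed to if the primary were gone,
    // which is therefore where the primary should replicate the session.
    String replicaFor(String key) {
        String primary = primaryFor(key);
        return primary == null ? null : firstFrom(key, primary);
    }

    public static void main(String[] args) {
        ReplicaSelector selector = new ReplicaSelector(List.of("i1", "i2", "i3", "i4"));
        for (String key : List.of("call-1", "call-2", "call-3")) {
            System.out.println(key + ": primary=" + selector.primaryFor(key)
                    + ", replica=" + selector.replicaFor(key));
        }
    }
}
```

Because the replica is computed per key rather than per instance, the sessions of one primary are spread across several replicas, and no partnership has to be re-formed when the cluster shape changes: the next lookup simply returns a different instance.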

Since this is based on a hashed selection algorithm with a predetermined failover target, replica selection is dynamic and does not depend on instances becoming ready in a particular order before one instance can direct all of its sessions to another as its replication partner. More importantly, because failover goes to the specific instance where the replica data is located, there is significantly less network overhead in locating a particular session in the cluster when a request within the session's scope is sent to the CLB. This leaves more bandwidth available for serving a larger number of user sessions. This approach is thus superior to buddy replication, and it helped us scale to higher throughput and sustain a larger number of long-running sessions.

It must be emphasized that system-level and application-server-level tuning and sizing are essential to ensure sustained performance, scalability, and reliability, in addition to the improvements provided by the SSR replication scheme and other parts of the Sailfin v2 server (aka Sun GlassFish Communications Server 2.0).

As always, we welcome your feedback and encourage you to try Sailfin and send us any input and questions you may have on this topic.

Sailfin promoted builds are available here: Sailfin Downloads


Tuesday Sep 08, 2009

Sun GlassFish Communications Server (Sailfin) adds High Availability Feature

Project Sailfin is building version 2.0 of the JSR 289 compliant application server. The Sailfin 2.0 release, also known as Sun GlassFish Communications Server 2.0, will have a notable new feature: the SIP Session Replication component. Sailfin 2.0 will provide high availability of SIP artifacts, giving telco deployments resilience and availability of conversational state. Sailfin 2.0 is targeted for release around the end of October or early November 2009.

High availability through the SIP Session Replication component (aka the SSR component) allows for replication of SIP artifacts such as SIP Application Sessions, SIP Sessions, Timers, and Dialog Fragments, in addition to Converged Sessions. Combined with the existing GlassFish replication support for HTTP sessions, deployments can now be highly available for both SIP-only applications and converged (SIP and HTTP) applications. To support the large-scale load that can typically be expected with telco applications, the HA team employed a dynamic replica selection algorithm for each SIP artifact, based on a consistent hashing algorithm. This obviates the need for buddy-based replication, where one would need to react to cluster shape changes and re-partner with another instance when a failure occurs - an expensive operation under high load. (See this blog entry for more details.)

The SSR component, along with the rest of Sailfin, is currently undergoing intensive quality testing, including 24x7 longevity, scalability, reliability, and fault tolerance testing, and we are making progress every day.

Turning on SSR in Sailfin 2.0 builds is extremely easy, much as it is with GlassFish. You only need to deploy your SIP (JSR 289 compliant) or converged application with the availability-enabled option checked in the Admin GUI console, or use the --availabilityenabled=true switch with the asadmin deploy command when deploying your SIP archive (SAR) file.

Here's your call to action: go ahead and download the latest promoted build of Sailfin v2, deploy your SAR archive with availability-enabled set to true (SSR enabled), and give us feedback.

Wednesday Mar 12, 2008

GlassFish High Availability Session at Sun Tech Days Hyderabad

At the recent Sun Tech Days event in Hyderabad, I gave a talk covering GlassFish's High Availability features, particularly the In-Memory Replication support, as part of GlassFish Day (Feb 29th).

I had the privilege of talking to a full house of around 500 people. The session covered an introduction to HA, how easy it is to create and configure a cluster of instances, and how to configure an application for in-memory replication based availability. The session elicited very good questions, ranging from the basics to more involved ones on topics such as sizing the heap and sticky session support. I spent an hour after the session outside the hall answering questions from interested folks from several companies.

Many attendees wanted to get a copy of the slide deck. Look here for it.

Needless to say, we would very much appreciate any feedback or questions on GlassFish's High Availability. Please send these to us at the GlassFish user mailing alias.


Tuesday Aug 07, 2007

Shoal Clustering Framework 1.0 Early Access available

It's been a fairly long time since I blogged.

Over the past few months, we (the GlassFish HA team and the JXTA team) have been concentrating on improving GlassFish high availability features and addressing associated bugs. In the process, Shoal's Group Management Service benefited from intensive QE cycles on 8-node GlassFish clusters under scores of scenarios and test cases. We have been focused on fixing bugs, improving performance, and progressively gating changes in Shoal to manage risks for the upcoming GlassFish v2 FCS release.

At the moment, Shoal is in good shape, and we decided to release the Shoal 1.0 Early Access version a couple of days ago. This will be followed by the 1.0 Final release before the GlassFish v2 release, once we know that we have delivered the final acceptable bits for that product.

After that, our next step is to address the unique requirements arising from the Sailfin project, which is building an application server with SIP support based on GlassFish.

We'd love to hear from our user community about your success stories, issues, and enhancement requests from using Shoal.

We have seen anecdotal evidence of how useful and easy to use this library is, and more specific feedback would help improve our adoption and project growth. Our Statcounter statistics show companies from very interesting industry segments going through the Shoal site, downloads, and documents, so this is a huge boost to our commitment to building a good library based on a group communications API.

If you have a success story or feedback using Shoal that you can share, drop us a line at the Shoal user alias

users[at]shoal[dot]dev[dot]java[dot]net

and we will highlight it on our blogs and on the Shoal web site.

Also we welcome code and design contributions from experienced clustering and distributed systems developers.

About

Shreedhar Ganapathy
