Monday Oct 28, 2013

Oracle Coherence, Split-Brain and Recovery Protocols In Detail

This article provides a high-level conceptual overview of Split-Brain scenarios in distributed systems, focusing on a specific example of cluster communication failure and recovery in Oracle Coherence. This includes a discussion of the witness protocol (used to remove failed cluster members) and the panic protocol (used to resolve Split-Brain scenarios).

Note that the removal of cluster members does not necessarily indicate a Split-Brain condition. Oracle Coherence does not (and cannot) detect a Split-Brain as it occurs; the condition is only detected when cluster members that previously lost contact with each other regain contact.

Cluster Topology and Configuration

To provide a concrete example for the article, let's assume a cluster topology and configuration. In this example we have a six-member cluster, consisting of one JVM on each of six physical machines. The member IDs are as follows:

Member ID   IP Address
1           10.149.155.76
2           10.149.155.77
3           10.149.155.236
4           10.149.155.75
5           10.149.155.79
6           10.149.155.78

Members 1, 2, and 3 are connected to a switch, and members 4, 5, and 6 are connected to a second switch. There is a link between the two switches, which provides network connectivity between all of the machines.


Member 1 is the first member to join this cluster, thus making it the senior member. Member 6 is the last member to join this cluster. Here is a log snippet from Member 6 showing the complete member set:

2010-02-26 15:27:57.390/3.062 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=main, member=6): Started DefaultCacheServer...

SafeCluster: Name=cluster:0xDDEB

Group{Address=224.3.5.3, Port=35465, TTL=4}

MasterMemberSet
  (
  ThisMember=Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
  OldestMember=Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
  ActualMemberSet=MemberSet(Size=6, BitSetCount=2
    Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
    Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer)
    Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
    Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
    Member(Id=5, Timestamp=2010-02-26 15:27:49.095, Address=10.149.155.79:8088, MachineId=1103, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:3229, Role=CoherenceServer)
    Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
    )
  RecycleMillis=120000
  RecycleSet=MemberSet(Size=0, BitSetCount=0
    )
  )

At approximately 15:30, the connection between the two switches is severed.


Thirty seconds later (the default packet timeout in development mode), the logs indicate communication failures across the cluster. In this example, the communication failure was caused by a network failure. In a production setting, this type of communication failure can have many root causes, including (but not limited to) network failures, excessive GC, high CPU utilization, swapping/virtual memory, and exceeding maximum network bandwidth. In addition, this type of failure is not necessarily indicative of a Split-Brain; any communication failure will be logged in this fashion. Member 2 logs a communication failure with Member 5:

2010-02-26 15:30:32.638/196.928 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=5, Timestamp=2010-02-26 15:27:49.095, Address=10.149.155.79:8088, MachineId=1103, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:3229, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
  Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
  Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
  )

The Coherence clustering protocol (TCMP) is a reliable transport mechanism built on UDP. In order for the protocol to be reliable, it requires an acknowledgement (ACK) for each packet delivered. If a packet fails to be acknowledged within the configured timeout period, the Coherence cluster member will log a packet timeout (as seen in the log message above). When this occurs, the cluster member will consult with other members to determine who is at fault for the communication failure. If the witness members agree that the suspect member is at fault, the suspect is removed from the cluster; if the witnesses unanimously disagree, the accuser is removed. This process is known as the witness protocol. Since Member 2 cannot communicate with Member 5, it selects two witnesses (Members 1 and 4) to determine whether the communication issue is with Member 5 or with itself (Member 2). However, Member 4 is on the switch that is no longer accessible by Members 1, 2, and 3; thus a packet timeout for Member 4 is recorded as well:

2010-02-26 15:30:35.648/199.938 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
  Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
  Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
  )
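
To make the witness decision concrete, here is a minimal sketch of the rule described above. The names are invented for illustration and this is not the Coherence implementation: the accuser asks its witnesses whether they can still reach the suspect, removes the suspect when the witnesses confirm the failure, and removes itself when the witnesses unanimously disagree:

    public class WitnessSketch {

        enum Verdict { REMOVE_SUSPECT, REMOVE_ACCUSER, UNDECIDED }

        /** Each element reports whether one witness can still reach the suspect. */
        static Verdict decide(boolean[] witnessCanReachSuspect) {
            if (witnessCanReachSuspect.length == 0) {
                return Verdict.UNDECIDED;              // no witnesses available
            }
            int reachable = 0;
            for (boolean canReach : witnessCanReachSuspect) {
                if (canReach) reachable++;
            }
            if (reachable == 0) {
                return Verdict.REMOVE_SUSPECT;         // departure confirmed
            }
            if (reachable == witnessCanReachSuspect.length) {
                return Verdict.REMOVE_ACCUSER;         // witnesses unanimously disagree
            }
            return Verdict.UNDECIDED;                  // mixed reports; keep probing
        }
    }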

Member 1 is able to confirm the departure of Member 4; however, Member 6 cannot, as it is also inaccessible. At the same time, Member 3 sends a request to remove Member 6, which is followed by a report from Member 3 indicating that Member 6 has departed the cluster:

2010-02-26 15:30:35.706/199.996 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=2): MemberLeft request for Member 6 received from Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
2010-02-26 15:30:35.709/199.999 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=2): MemberLeft notification for Member 6 received from Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)

The log for Member 3 shows how Member 6 departed the cluster:

2010-02-26 15:30:35.161/191.694 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=PacketPublisher, member=3): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
  Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
  Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer)
  )
2010-02-26 15:30:35.165/191.698 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=Cluster, member=3): Member departure confirmed by MemberSet(Size=2, BitSetCount=2
  Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
  Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer)
  ); removing Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)

In this case, Member 3 happened to select two witnesses that it still had connectivity with (Members 1 and 2) thus resulting in a simple decision to remove Member 6.


Given the departure of Member 6, Member 2 is left with a single witness to confirm the departure of Member 4:

2010-02-26 15:30:35.713/200.003 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=Cluster, member=2): Member departure confirmed by MemberSet(Size=1, BitSetCount=2
  Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
  ); removing Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)

In the meantime, Member 4 logs a missing heartbeat from the senior member. This message is also logged on Members 5 and 6.

2010-02-26 15:30:07.906/150.453 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=PacketListenerN, member=4): Scheduled senior member heartbeat is overdue; rejoining multicast group.

Next, Member 4 logs a TcpRing failure with Member 2, resulting in the removal of Member 2:

2010-02-26 15:30:21.421/163.968 Oracle Coherence GE 12.1.2.0.0 <D4> (thread=Cluster, member=4): TcpRing: Number of socket exceptions exceeded maximum; last was "java.net.SocketTimeoutException: connect timed out"; removing the member: 2

For quick process-termination detection, Oracle Coherence utilizes a feature called TcpRing, which is a sparse collection of TCP/IP-based connections between different members of the cluster. Each member in the cluster is connected to at least one other member, which (if at all possible) is running on a different physical box. This connection is not used for any data transfer; only heartbeats are sent, once per second over each link. If a certain number of exceptions are thrown while trying to re-establish a connection, the member throwing the exceptions is removed from the cluster.

Member 5 logs a packet timeout with Member 3 and cites witnesses Members 4 and 6:

2010-02-26 15:30:29.791/165.037 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=PacketPublisher, member=5): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
  Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
  Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
  )
2010-02-26 15:30:29.798/165.044 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=Cluster, member=5): Member departure confirmed by MemberSet(Size=2, BitSetCount=2
  Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
  Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
  ); removing Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
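
Returning to the TcpRing mechanism described earlier, here is a minimal sketch of the idea. The class name, the threshold, and the timeout are assumptions for illustration, not Coherence internals: each member periodically probes a buddy member over TCP and declares it dead after too many consecutive connection failures:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class TcpRingProbe {

        private static final int MAX_EXCEPTIONS = 3;       // assumed threshold
        private static final int CONNECT_TIMEOUT_MS = 1000;

        /** Returns true if the buddy member should be removed from the cluster. */
        static boolean buddyIsDead(InetSocketAddress buddy) {
            int exceptions = 0;
            while (exceptions < MAX_EXCEPTIONS) {
                try (Socket socket = new Socket()) {
                    socket.connect(buddy, CONNECT_TIMEOUT_MS);
                    return false;                          // buddy is alive
                } catch (IOException e) {
                    exceptions++;                          // e.g. SocketTimeoutException
                }
            }
            return true;                                   // exceeded maximum; remove the member
        }
    }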

Eventually we are left with two distinct clusters consisting of Members 1, 2, 3 and Members 4, 5, 6, respectively. In the latter cluster, Member 4 is promoted to senior member.


The connection between the two switches is restored at 15:33. Upon the restoration of the connection, the cluster members immediately receive cluster heartbeats from the two senior members. In the case of Members 1, 2, and 3, the following is logged:

2010-02-26 15:33:14.970/369.066 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=Cluster, member=1): The member formerly known as Member(Id=4, Timestamp=2010-02-26 15:30:35.341, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.

Likewise for Members 4, 5, and 6:

2010-02-26 15:33:14.343/336.890 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=Cluster, member=4): The member formerly known as Member(Id=1, Timestamp=2010-02-26 15:30:31.64, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.

This message indicates that a senior heartbeat is being received from members that were previously removed from the cluster; in other words, something that should not be possible. For this reason, the recipients of these messages will initially ignore them. After several iterations of these messages, the existence of multiple clusters is acknowledged, triggering the panic protocol to reconcile the situation. When a Coherence member detects the presence of more than one cluster (i.e. a Split-Brain), the panic protocol is invoked in order to resolve the conflicting clusters and consolidate them into a single cluster. The protocol consists of removing the smaller clusters until only one cluster remains. In the case of equal-size clusters, the one with the older senior member will survive. Member 1, being the oldest member, initiates the protocol:

2010-02-26 15:33:45.970/400.066 Oracle Coherence GE 12.1.2.0.0 <Warning> (thread=Cluster, member=1): An existence of a cluster island with senior Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) containing 3 nodes have been detected. Since this Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) is the senior of an older cluster island, the panic protocol is being activated to stop the other island's senior and all junior nodes that belong to it.
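
As an aside, the survivor-selection rule just described can be sketched as follows. This is an illustration with invented names, not Coherence source code: the larger island survives, and when the islands are the same size, the island whose senior member joined earlier wins:

    public class PanicSketch {

        /**
         * Returns true if the local cluster island survives the panic protocol.
         * Timestamps are join times, so a smaller value means an older senior.
         */
        static boolean localIslandSurvives(int localSize, long localSeniorJoined,
                                           int remoteSize, long remoteSeniorJoined) {
            if (localSize != remoteSize) {
                return localSize > remoteSize;             // larger island wins
            }
            return localSeniorJoined < remoteSeniorJoined; // older senior wins
        }
    }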

Member 3 receives the panic:

2010-02-26 15:33:45.803/382.336 Oracle Coherence GE 12.1.2.0.0 <Error> (thread=Cluster, member=3): Received panic from senior Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) caused by Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)

Member 4, the senior member of the younger cluster, receives the kill message from Member 3:

2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 12.1.2.0.0 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service.

In turn, Member 4 requests the departure of its junior Members 5 and 6. Member 6 logs the kill message it receives from Member 4:

2010-02-26 15:33:43.343/349.015 Oracle Coherence GE 12.1.2.0.0 <Error> (thread=Cluster, member=6): Received a Kill message from a valid Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer); stopping cluster service.

Once Members 4, 5, and 6 restart, they rejoin the original cluster with senior Member 1. The log below is from Member 4. Note that it receives a different member ID when it rejoins the cluster.

2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 12.1.2.0.0 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service.
2010-02-26 15:33:46.921/369.468 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Service Cluster left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Invocation:InvocationService, member=4): Service InvocationService left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=OptimisticCache, member=4): Service OptimisticCache left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=ReplicatedCache, member=4): Service ReplicatedCache left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=DistributedCache, member=4): Service DistributedCache left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Invocation:Management, member=4): Service Management left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Member 6 left service Management with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Member 6 left service DistributedCache with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Member 6 left service ReplicatedCache with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Member 6 left service OptimisticCache with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Member 6 left service InvocationService with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=4): Member(Id=6, Timestamp=2010-02-26 15:33:47.046, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) left Cluster with senior member 4
2010-02-26 15:33:49.218/371.765 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=main, member=n/a): Restarting cluster
2010-02-26 15:33:49.421/371.968 Oracle Coherence GE 12.1.2.0.0 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
2010-02-26 15:33:49.625/372.172 Oracle Coherence GE 12.1.2.0.0 <Info> (thread=Cluster, member=n/a): This Member(Id=5, Timestamp=2010-02-26 15:33:50.499, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=1) joined cluster "cluster:0xDDEB" with senior Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2)

Cool, isn't it?

Sunday Jul 08, 2012

The Developers Conference 2012: Presentation about CEP & BAM

This year I again had the pleasure of being one of the speakers at the TDC ("The Developers Conference") event. I have now spoken at this event for three years. This year, the main theme of the SOA track was EDA ("Event-Driven Architecture"), and I decided to deliver a comprehensive presentation about one of my favorite personal subjects: real-time business intelligence using Complex Event Processing. The theme of the presentation was "Business Intelligence in Real-time using CEP & BAM", and I would like to share here the presentation that I gave. The material is in Portuguese, since it was a Brazilian event that took place in São Paulo.

Since my presentation has a lot of videos, I decided to share the material as a YouTube video, so you can pause, rewind, and replay it as many times as you want. I strongly recommend that before you start watching the video, you change the video quality setting to 1080p High Definition.

Wednesday Apr 18, 2012

Oracle Technical Workshop: WebLogic Suite 12c (March 28, São Paulo)

On March 28, 2012, I presented a whole-day workshop at the Pullman hotel in São Paulo about the technical innovations in Oracle WebLogic 12c, plus the related middleware stack that Oracle has created around it. This workshop, for those who could attend in person, was very informative and productive, since without exception every presentation was given hands-on with the audience.

I would like to share with you the slides I used in this workshop. The slide content is written in Portuguese, since it was a Brazilian workshop for the Brazilian audience. Please enjoy!

Friday Oct 28, 2011

Enabling HPC (High Performance Computing) with InfiniBand in Java™ Applications

For many IT managers, choosing a computing fabric often falls short of receiving the center-stage attention it deserves. Familiarity and time have set infrastructure architects on a seemingly intransigent course toward settling for fabrics that have become the "de-facto" norm. On one hand is Ethernet, which is regarded as the solution for an almost ridiculously broad range of needs, from web surfing to high-performance storage traffic where every microsecond counts. On the other hand is Fibre Channel, which provides more deterministic performance and bandwidth, but at the cost of bifurcating the enterprise network while doubling the cabling, administrative overhead, and total cost of ownership. But even with separate fabrics, these networks sometimes fall short of today's requirements. So, slowly but surely, some enterprises are waking up to the potential of new fabrics.

There is no other component of the data center that is as important as the fabric, and the I/O challenges facing enterprises today have been setting the stage for next-generation fabrics for several years. One of the more mature solutions, InfiniBand, is sometimes at the forefront when users are driven to select a new fabric. New customers are often surprised at what InfiniBand can deliver, and how easily it integrates into today's data center architectures. InfiniBand has been leading the charge toward consolidated, flexible, low latency, high bandwidth, lossless I/O connectivity. While other fabrics have just recently turned to addressing next generation requirements, with technologies such as Fibre Channel over Ethernet (FCoE) still seeking general acceptance and even standardization, InfiniBand's maturity and bandwidth advantages continue to attract new customers.

Although InfiniBand remains a small industry compared to the Ethernet juggernaut, it continues to grow aggressively, and this year it is growing beyond projections. InfiniBand often plays a role in enterprises with huge datasets. The majority of Fortune 1000 companies are involved in high-throughput processing behind a wide variety of systems, including business analytics, content creation, content transcoding, real-time financial applications, messaging systems, consolidated server infrastructures, and more. In these cases, InfiniBand has worked its way into the enterprise as a localized fabric that, via transparent interconnection to existing networks, is sometimes even hidden from the eyes of administrators.

For high performance computing environments, the capacity to move data across a network quickly and efficiently is a requirement. Such networks are typically described as requiring high throughput and low latency. High throughput refers to an environment that can deliver a large amount of processing capacity over a long period of time. Low latency refers to the minimal delay between processing input and providing output, such as you would expect in a real-time application.

In these environments, conventional networking using socket streams can create bottlenecks when it comes to moving data. Introduced in 1999 by the InfiniBand Trade Association, InfiniBand (IB for short) was created to address the need for high performance computing. One of the most important features of IB is Remote Direct Memory Access (RDMA). RDMA enables moving data directly from the memory of one computer to another, bypassing the operating systems of both computers and resulting in significant performance gains. Using an InfiniBand network is particularly interesting in engineered systems like Oracle Exalogic and SPARC SuperCluster T4-4, where you can boost the performance of your application by 3X to 10X without changing a single line of code.

The Sockets Direct Protocol, or SDP for short, is a networking protocol developed to support stream connections over InfiniBand fabric. SDP support was introduced in the JDK 7 release of the Java Platform for applications deployed on the Solaris and Linux operating systems. The Solaris operating system has supported SDP and InfiniBand since Solaris 10. In the Linux world, the InfiniBand package is called OFED (OpenFabrics Enterprise Distribution). The JDK 7 release supports the 1.4.2 and 1.5 versions of OFED.

SDP support is essentially a TCP bypass technology. When SDP is enabled and an application attempts to open a TCP connection, the TCP mechanism is bypassed and communication goes directly to the InfiniBand network. For example, when your application attempts to bind to a TCP address, the underlying software will decide, based on information in the configuration file, whether it should be rebound to the SDP protocol. This process can happen during the binding process or the connecting process, but happens only once for each socket.

There are no API changes required in your code to take advantage of the SDP protocol: the implementation is transparent and is supported by both classic networking and the NIO (New I/O) channels. SDP support is disabled by default. The steps to enable SDP support, illustrated in the sketch after this list, are:

  • Create a text file that will act as the SDP configuration file.
  • Set a system property that specifies the location of this configuration file.
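
Before diving into the configuration file format, here is a minimal sketch of what "no API changes" means in practice. This is ordinary java.net server code with nothing SDP-specific in it; under the assumptions of this sketch (the bind address matches a "bind" rule in the configuration file), the JDK transparently uses SDP instead of TCP:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class PlainEchoServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket()) {
                // If a rule like "bind 192.168.0.1 *" matches this address,
                // the bind below transparently becomes an SDP bind.
                server.bind(new InetSocketAddress("192.168.0.1", 8080));
                try (Socket client = server.accept();
                     OutputStream out = client.getOutputStream()) {
                    out.write("hello over SDP (or TCP)\n".getBytes("UTF-8"));
                }
            }
        }
    }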

Creating a Configuration File for SDP

An SDP configuration file is a text file, and you decide where on the file system this file will reside. Every line in the configuration file is either a comment or a rule. A comment is indicated by the hash character "#" at the beginning of the line, and everything following the hash character will be ignored.

There are two types of rules, as follows:

  • A "bind" rule indicates that the SDP protocol transport should be used when a TCP socket binds to an address and port that match the rule.
  • A "connect" rule indicates that the SDP protocol transport should be used when an unbound TCP socket attempts to connect to an address and port that match the rule.

A rule has the following format:

("bind"|"connect")1*LWSP-char(hostname|ipaddress)["/"prefix])1*LWSP-char("*"|port)["-"("*"|port)]
The first keyword indicates whether the rule is a bind or a connect rule. The next token specifies either a host name or a literal IP address. When you specify a literal IP address, you can also specify a prefix, which indicates an IP address range. The third and final token is a port number or a range of port numbers. Consider the following notation in this sample configuration file:
# Enable SDP when binding to 192.168.0.1
bind 192.168.0.1 *

# Enable SDP when connecting to all
# application services on 192.168.0.*
connect 192.168.0.0/24     1024-*

# Enable SDP when connecting to the HTTP server
# or a database on oracle.infiniband.sdp.com
connect oracle.infiniband.sdp.com   80
connect oracle.infiniband.sdp.com   3306

The first rule in the sample file specifies that SDP is enabled for any port (*) on the local IP address 192.168.0.1. You would add a bind rule for each local address assigned to an InfiniBand adaptor. An InfiniBand adaptor is the equivalent of a network interface card (NIC) for InfiniBand. If you had several InfiniBand adaptors, you would use a bind rule for each address that is assigned to those adaptors.

The second rule in the sample file specifies that whenever connecting to 192.168.0.* and the target port is 1024 or greater, SDP will be enabled. The prefix on the IP address /24 indicates that the first 24 bits of the 32-bit IP address should match the specified address. Each portion of the IP address uses 8 bits, so 24 bits indicates that the IP address should match 192.168.0 and the final byte can be any value. The "-*" notation on the port token specifies "and above." A range of ports, such as 1024-2056, would also be valid and would include the end points of the specified range.

The final rules in the sample file specify a host name called oracle.infiniband.sdp.com, first with the port assigned to an HTTP server (80) and then with the port assigned to a database (3306). Unlike a literal IP address, a host name can translate into multiple addresses. When you specify a host name, it matches all addresses that the host name is registered to in the name service.
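
To make the address-matching rules concrete, here is a hypothetical sketch of how a rule such as "connect 192.168.0.0/24 1024-*" could be evaluated. The real matching logic is internal to the JDK; the names and structure below are invented for illustration:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class SdpRuleMatch {

        /** True if the first prefixBits of both IPv4 addresses are equal. */
        static boolean prefixMatches(String ruleAddr, String target, int prefixBits)
                throws UnknownHostException {
            byte[] a = InetAddress.getByName(ruleAddr).getAddress();
            byte[] b = InetAddress.getByName(target).getAddress();
            int fullBytes = prefixBits / 8;
            int remBits = prefixBits % 8;
            for (int i = 0; i < fullBytes; i++) {
                if (a[i] != b[i]) return false;
            }
            if (remBits == 0) return true;
            int mask = (0xFF << (8 - remBits)) & 0xFF;  // high-order bits of the next byte
            return (a[fullBytes] & mask) == (b[fullBytes] & mask);
        }

        public static void main(String[] args) throws UnknownHostException {
            // "connect 192.168.0.0/24 1024-*" should match 192.168.0.42:2000
            boolean addressMatches = prefixMatches("192.168.0.0", "192.168.0.42", 24);
            boolean portMatches = 2000 >= 1024;         // "1024-*" means 1024 and above
            System.out.println(addressMatches && portMatches);  // prints true
        }
    }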

How to Enable the SDP Protocol

SDP support is disabled by default. To enable SDP support, set the com.sun.sdp.conf system property by providing the location of the configuration file. The following example starts an application named MyApplication using a configuration file named sdp.conf:

     java -Dcom.sun.sdp.conf=sdp.conf -Djava.net.preferIPv4Stack=true MyApplication

MyApplication refers to the client application written in Java that is attempting to connect to the InfiniBand adaptor. Note that this example specifies another system property, java.net.preferIPv4Stack. Be aware that not everything discussed so far may work perfectly on the first attempt: there are some technical issues you should know about when enabling SDP, most of them related to the IPv4 or IPv6 stacks. If something goes wrong in your tests, it is a good idea to confirm with your operating system administrators that InfiniBand is correctly configured at that layer.
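
One more tip, under the assumption that you are on JDK 7: the SDP implementation also recognizes a com.sun.sdp.debug system property which, when set, prints a message to the console as sockets are bound or connected using SDP. It is a convenient way to verify that your rules are actually matching:

     java -Dcom.sun.sdp.conf=sdp.conf -Dcom.sun.sdp.debug -Djava.net.preferIPv4Stack=true MyApplication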

APIs and Supported Java Classes

All classes in the JDK that read or write network sockets can use SDP. As said before, there are no changes in the way you write the code, which is particularly interesting if you are simply installing an already deployed application on a cluster that uses InfiniBand: when SDP support is enabled, it just works without any change to your code, and recompiling is not necessary. However, it is important to know that a socket is bound only once. A connection is an implicit bind, so if the socket hasn't been previously bound and connect is invoked, the binding occurs at that time. For example, consider the following code:

    import java.net.InetSocketAddress;
    import java.nio.channels.AsynchronousSocketChannel;
    import java.util.concurrent.Future;

    AsynchronousSocketChannel asyncSocketChannel = AsynchronousSocketChannel.open();
    InetSocketAddress localAddress = new InetSocketAddress("10.0.0.3", 8020);
    InetSocketAddress remoteAddress = new InetSocketAddress("192.168.0.1", 1521);
    asyncSocketChannel.bind(localAddress);            // the socket is bound here
    Future<Void> result = asyncSocketChannel.connect(remoteAddress);
    

In this example, the asynchronous socket channel is bound to a local TCP address when bind is invoked on the socket. Then, the code attempts to connect to a remote address by using the same socket. If the remote address uses InfiniBand, as specified in the configuration file, the connection will not be converted to SDP because the socket was previously bound.

Conclusion

InfiniBand is a "de-facto" standard for high performance computing environments that need to guarantee extremely low latency and superior throughput. But to really take advantage of this technology, your applications need to be properly prepared. This article showed how you can maximize the performance of Java applications by enabling SDP support, the networking protocol that "talks" to the InfiniBand network.

About

Ricardo Ferreira is just a regular person who lives in Brazil and is passionate about technology, movies, and his whole family. He currently works at Oracle Corporation.
