Leveraging InfiniBand to bypass the BRIDGE
By Todd Little-Oracle on Jan 28, 2012
As Deepak mentioned in the previous post, the Tuxedo team has put a lot of effort into leveraging some of the unique features of the Exalogic platform. Specifically, we've developed support for Remote Direct Memory Access (RDMA). This is a feature of InfiniBand that allows one node to read or write directly into the memory of another node. In particular, this can all be done from user mode, meaning there is no need to enter the operating system kernel to pass information from one node to another.
Tuxedo 11gR1PS2 uses this RDMA capability to bypass the BRIDGE process used in a Tuxedo cluster (an MP mode domain). In standard hardware environments, when a request is made to a server on a remote node, the request is given to the local BRIDGE, which passes it to the remote BRIDGE, which eventually places it on the appropriate server's queue. The reply message takes the reverse path: it is placed on the BRIDGE queue on the server's node, relayed over a network connection to the BRIDGE on the client's node, and finally placed on the client's reply queue. Because the BRIDGE is only partially multi-threaded, it can become a throughput bottleneck on high core count systems where a lot of requests are made to remote servers. The BRIDGE also introduces substantial latency, as the total round trip requires four System V IPC messages and two network messages. Where a local request/response can be performed in about 35 microseconds, a remote request/response through the BRIDGE can take about 1100 microseconds. This diagram shows the message flow:
With the BRIDGE bypass feature, a native client uses RDMA to place its request directly on the remote server's queue, and the reply is placed directly on the client's reply queue. This eliminates two message queue operations and two network operations. The net result is that bypassing the BRIDGE increases throughput sevenfold for remote operations and reduces latency from 1100 microseconds to 160 microseconds. This diagram shows the message flow:
For the next release of Tuxedo, due sometime this summer, we're planning further optimizations to achieve even higher throughput and lower latency for remote operations.
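To put the latency figures quoted above side by side, here is a small back-of-the-envelope sketch in Python. It uses only the three numbers given in this post (35, 1100 and 160 microseconds); the variable names and the `ratio` helper are made up for illustration and are not part of Tuxedo.

```python
# Round-trip latencies quoted in the post, in microseconds.
# The names below are illustrative only, not Tuxedo identifiers.
LOCAL_US = 35      # local request/response
BRIDGE_US = 1100   # remote request/response through the BRIDGE
BYPASS_US = 160    # remote request/response with BRIDGE bypass (RDMA)


def ratio(slower_us, faster_us):
    """Return how many times longer the slower path takes."""
    return slower_us / faster_us


bridge_vs_local = ratio(BRIDGE_US, LOCAL_US)    # BRIDGE path vs. local call
bypass_vs_local = ratio(BYPASS_US, LOCAL_US)    # bypass path vs. local call
bridge_vs_bypass = ratio(BRIDGE_US, BYPASS_US)  # latency gain from the bypass
saved_us = BRIDGE_US - BYPASS_US                # microseconds saved per round trip

print(f"BRIDGE vs local:   {bridge_vs_local:.1f}x")
print(f"Bypass vs local:   {bypass_vs_local:.1f}x")
print(f"BRIDGE vs bypass:  {bridge_vs_bypass:.2f}x")
print(f"Saved per request: {saved_us} microseconds")
```

On these numbers, the latency reduction from the bypass is a little under 7x, which is in the same range as the throughput improvement quoted above, while a remote call with bypass still costs several times what a local call does.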