• February 11, 2012

InfiniBand Building Blocks

In this article, I will refer to Oracle's latest Engineered Systems, which are built on InfiniBand technology. Yes, I am talking about the Exalogic, Exadata and SuperCluster platforms.

In the smallest configuration, we have a set of hosts connected to a pair of InfiniBand switches in a dual-star topology. Why two? To provide redundancy. Each host has one dual-port IB HCA, and its two ports connect to independent IB switches via QSFP copper cables. The switches also have inter-switch links to provide alternate paths between hosts. This pretty much completes the hardware configuration and connectivity inside the rack.

The picture below shows a basic connectivity block diagram. The notion of active and passive is explained further below.

Next come the software components. Each host has the required IB software stack built into the operating system. Each IB switch also has its own software to discover and manage the connected end points. In Linux environments, the IB software is based on some version of OFED.

One piece of software worth a special mention is the Subnet Manager. If it is not running anywhere in the network, what we get is an unmanaged InfiniBand network, and that is not something we want. The main purpose of the subnet manager is to enable communication paths between attached hosts, monitor the network periodically for physical changes and adjust accordingly. In the context of this article, this role is taken up by the IB switches. We have more than one switch in the rack, so which one? The answer is either one, or more than one for redundancy. The subnet manager instances exchange messages and negotiate which of them will actually manage the subnet. The winner is known as the Master Subnet Manager; any others stay as Standby Subnet Managers. If the master switch fails, the next switch, chosen by predefined criteria, takes over the role.
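To make the master/standby negotiation a little more concrete, here is a minimal Python sketch. It assumes the commonly described rule that the instance with the highest priority wins and that a tie goes to the lowest GUID. The class, the function names, the switch labels and the GUID values are all made up for illustration, and a real subnet manager does far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    name: str      # hypothetical label for the switch hosting this SM
    priority: int  # SM priority; higher wins (assumed range 0 to 15)
    guid: int      # GUID of the node running the SM; lower wins a tie


def elect_master(candidates):
    """Pick the master: highest priority first, then lowest GUID."""
    if not candidates:
        raise ValueError("no subnet manager running: the fabric is unmanaged")
    return min(candidates, key=lambda sm: (-sm.priority, sm.guid))


def failover(candidates, failed):
    """Re-run the election among the survivors after one SM fails."""
    return elect_master([sm for sm in candidates if sm != failed])


if __name__ == "__main__":
    # Illustrative values only, not taken from a real rack.
    sms = [
        SubnetManager("switch-a", priority=5, guid=0x0021280001A0A0A0),
        SubnetManager("switch-b", priority=5, guid=0x0021280001B0B0B0),
    ]
    master = elect_master(sms)
    print("master:", master.name)
    print("after failure:", failover(sms, master).name)
```

With equal priorities, the tie-break on GUID decides the master, and the remaining instance takes over when the master is removed from the candidate list.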

To give you some details: the subnet manager sweeps the fabric periodically for physical changes, assigns LIDs to end points, builds forwarding tables based on the routing algorithm specified in a config file, and performs a few more critical functions which I will defer to a later discussion.
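As a rough illustration of the LID assignment and forwarding table steps, here is a small Python sketch. It is a toy model under my own assumptions: LIDs are handed out sequentially in name order, routing is plain shortest-path (breadth-first search), and the table maps a destination LID to a neighbor name, whereas a real switch table maps it to an output port. The node names and helper functions are hypothetical.

```python
from collections import deque


def assign_lids(end_points, first_lid=1):
    """Give every end point a unique LID, in a stable (sorted) order."""
    return {ep: lid for lid, ep in enumerate(sorted(end_points), start=first_lid)}


def build_forwarding_table(links, lids, switch, switches):
    """Map destination LID -> first hop from `switch` on a shortest path.

    Only switches forward traffic, so the search never transits a host port.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    table = {}
    visited = {switch}
    queue = deque()
    for neighbor in sorted(adjacency.get(switch, [])):
        visited.add(neighbor)
        queue.append((neighbor, neighbor))  # (node, first hop used to reach it)

    while queue:
        node, first_hop = queue.popleft()
        table[lids[node]] = first_hop
        if node not in switches:
            continue  # host ports are leaves, they do not forward
        for nxt in sorted(adjacency[node]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, first_hop))
    return table


if __name__ == "__main__":
    # Two switches, two hosts, each host port cabled to a different switch.
    switches = {"sw1", "sw2"}
    links = [("sw1", "sw2"),
             ("host1/1", "sw1"), ("host1/2", "sw2"),
             ("host2/1", "sw1"), ("host2/2", "sw2")]
    lids = assign_lids({n for link in links for n in link})
    print(lids)
    print(build_forwarding_table(links, lids, "sw1", switches))
```

In this toy fabric, sw1 reaches its directly attached host ports itself and sends traffic for the ports cabled to sw2 over the inter-switch link.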

With this setup, you are ready to communicate at layer 2 of the OSI model. I have mentioned earlier that the technology remains transparent to the upper level protocols (ULPs), so layer-3 IP addresses are assigned on individual hosts just as you would in Ethernet-based networks. For redundancy, we place the pair of IB interfaces in a bonded interface in active-standby mode and assign it an IPv4 address. Let me remind you that this redundancy, or high availability, is achieved from the host's perspective at layer 3. From the switch and InfiniBand network perspective, both links from a host are always active.

Let me show you an example to make this clearer.

The screenshot below shows the status of a host's InfiniBand ports. You can see that both are 'Active' with LIDs assigned, and the rate is 40, which means they have auto-negotiated to 4X QDR.
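Where does the 40 come from? A 4X link has four lanes, and each QDR lane signals at 10 Gb/s, so 4 x 10 = 40 Gb/s. The short Python sketch below works this out, and also the usable data rate after 8b/10b line encoding. The per-lane figures are the standard SDR/DDR/QDR values; the function and table names are mine.

```python
# Per-lane signaling rate in Gb/s for the 8b/10b-encoded InfiniBand speeds.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

# 8b/10b encoding carries 8 data bits in every 10 bits on the wire.
DATA_BITS, CODED_BITS = 8, 10


def link_rates(width, speed):
    """Return (signaling rate, data rate) in Gb/s for a link such as 4X QDR."""
    signaling = width * LANE_GBPS[speed]
    data = signaling * DATA_BITS / CODED_BITS
    return signaling, data


if __name__ == "__main__":
    signaling, data = link_rates(4, "QDR")
    print(f"4X QDR: {signaling} Gb/s signaling, {data} Gb/s data")
```

So the "40" reported for the port is the signaling rate, and the payload bandwidth of a 4X QDR link works out to 32 Gb/s.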

The next screenshot shows the layer-3 configuration, which is IPoIB.

Let's look at the bonding status now. The following screenshot shows that interface ib0 is active while ib1 is standby. So this redundancy and high availability is seen by the host at layer 3.
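If you would rather check this from a script than by eye, the following Python sketch parses the text that the Linux bonding driver exposes under /proc/net/bonding/ and reports which interface is active and which is standby. The SAMPLE text is an illustrative stand-in I wrote for the example, not a copy of the screenshot, and the function names are hypothetical.

```python
# Illustrative sample only; on a real host, read /proc/net/bonding/bond0 instead.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: ib0
MII Status: up

Slave Interface: ib0
MII Status: up

Slave Interface: ib1
MII Status: up
"""


def parse_bonding_status(text):
    """Extract the bonding mode, active interface and per-interface link status."""
    status = {"mode": None, "active": None, "slaves": {}}
    current = None  # interface whose section we are currently reading
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active"] = value
        elif key == "Slave Interface":
            current = value
            status["slaves"][current] = None
        elif key == "MII Status" and current is not None:
            status["slaves"][current] = value
    return status


def standby_interfaces(status):
    """Interfaces in the bond that are not currently carrying traffic."""
    return [name for name in status["slaves"] if name != status["active"]]


if __name__ == "__main__":
    info = parse_bonding_status(SAMPLE)
    print("mode:", info["mode"])
    print("active:", info["active"], "standby:", standby_interfaces(info))
```

Both interfaces can show a link status of "up" while only one is active, which matches the point that the links are always active from the InfiniBand side and the active/standby choice is made on the host.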

Other hosts in the connected fabric will look similar, each with its own IPoIB address. This concludes the basic setup of an InfiniBand network. From here on, we should be able to customize and fine-tune it further in order to use this high-speed, efficient switching fabric for our upper layer applications and protocols.

In my next section, I will write more about how and where to go from this point: querying the neighbors, checking fabric health, communicating with other hosts and so on.

Comments ( 2 )
  • Manjinder Singh Tuesday, March 6, 2012

    Hi Neeraj,

    Yet another excellent post. Kudos!!!

    One question: since the HCA is a single card with 2 ports, doesn't that make it a Single Point Of Failure? What I mean is, suppose the card develops a HW failure; then both PORTS will be unavailable and hence the node will no longer be connected to the InfiniBand network via BOND0. Am I correct to think that?

    For your views.....


    Manjinder Singh

  • Neeraj Gupta Wednesday, March 7, 2012

    Hello Manjinder, your understanding is correct. If the HCA card fails, then the host will lose connectivity over InfiniBand. True, but let me add some more details here.

    High Availability is a multi-dimensional thing, if you ask me. The two ports on a single HCA in a bonded interface allow us to have redundant paths with respect to the switches. Clustering applications, load balancers and so on provide HA and redundancy from the computer's perspective. But what is more likely to fail is the key. Shall we consider each component's MTBF?

    The single HCA card failing is less likely than an external fault, e.g. a cable being removed or a switch going into patching or upgrade mode. Moreover, the applications' architecture is usually such that the cluster as a whole can tolerate the failure of one computer.

    Thanks for driving a good discussion. Appreciate it.
