Infiniband: a highperformance network fabric - Part I
By Karoly Vegh on Nov 13, 2013
At the OpenWorld this year I managed to chat with interesting people again - one of them answering InfiniBand deepdive questions with ease by coffee turned out to be one of Oracle's IB engineers, Ted Kim, who actually actively participates in the InfiniBand Trade Association and integrates Oracle solutions with this highspeed network. This is why I love attending OOW. He granted me an hour of his time to talk about IB.
This post is mostly based on that tech interview.
Start of the actual post:
Traditionally datatransfer between servers and storage elements happens in networks with up to 10 gigabit/seconds or in SANs with up to 8 gbps fiberchannel connections. Happens. Well, data rather trickles through.
But nowadays data amounts grow well over the TeraByte order of magnitude, and multisocket/multicore/multithread Servers hunger data that these transfer technologies just can't deliver fast enough, causing all CPUs of this world do one thing at the same speed - waiting for data.
And once again, I/O is the bottleneck in computing. FC and Ethernet can't keep up. We have half-TB SSDs, dozens of TB RAM to store data to be modified in, but can't transfer it. Can't backup fast enough, can't replicate fast enough, can't synchronize fast enough, can't load fast enough.
The bad news is, everyone is used to this, like back in the '80s everyone was used to start compile jobs and go for a coffee. Or on vacation.
The good news is, there's an alternative. Not so-called "bleeding-edge" 8gbps, but (as of now) 56. Not layers of overhead, but low latency. And it is available now. It has been for a while, actually. Welcome to the world of Infiniband.
Infiniband was born as a result of joint efforts of HPAQ, IBM, Intel, Sun and Microsoft. They planned to implement a next-generation I/O fabric, in the 90s. In the 2000s Infiniband (from now on: IB) was quite popular in the high-performance computing field, powering most of the top500 supercomputers. Then in the middle of the decade, Oracle realized its potential and used it as an interconnect backbone for the first Database Machine, the first Exadata.
Since then, IB has been booming, Oracle utilizes and supports it in a large set of its HW products, it is the backbone of the famous Engineered Systems: Exadata, SPARC SuperCluster, Exalogic, OVCA and even the new DB backup/recovery box. You can also use it to make servers talk highspeed IP to eachother, or to a ZFS Storage Appliance.
Following Oracle's lead, even IBM has jumped the wagon, and leverages IB in its PureFlex systems, their first InfiniBand Machines.
IB Structural Overview:
If you want to use IB in your servers, the first thing you will need is PCI cards, in IB terms Host Channel Adapters, or HCAs. Just like NICs for Ethernet, or HBAs for FC. In these you plug an IB cable, going to an IB switch providing connection to other IB HCAs. Of course you're going to need drivers for those in your OS. Yes, these are long-available for Solaris and Linux.
Now, what protocols can you talk over IB? There's a range of choices. See, IB isn't accepting packet loss like Ethernet does, and hence doesn't need to rely on TCP/IP as a workaround for resends. That is, you still can run IP over IB (IPoIB), and that is used in various cases for control functionality, but the datatransfer can run over more efficient protocols all mapped directly on native IB.
About PCI connectivity: IB cards, as you see are fast. They bring low latency, which is just as important as their bandwidth. Current IB cards run at 56 gbit/s. That is slightly more than double of the capacity of a PCI Gen2 slot (of ~25 gbit/s). And IB cards are equipped usually with two ports - that is, altogether you'd need 112 gbit/s PCI slots, to be able to utilize FDR IB cards in an active-active fashion. PCI Gen3 slots provide you with around ~50gbps. This is why the most IB cards are configured in an active-standby way if both ports are used. Once again the PCI slot is the bottleneck.
Anyway, the new Oracle servers are equipped with Gen3 PCI slots, an the new IB HCAs support those too. Oracle utilizes the QDR HCAs, running at 40gbp/s brutto, which translates to a 32gbp/s net traffic due to the 10:8 signal-to-data information ratio. This means for some peak bandwidth cases, active-active must be used with QDR to get all the slot bandwidth.
Technology never stops to evolve. Mellanox is working on the 100 gbps (EDR) version already, which will be optical, since signal technology doesn't allow EDR to be copper. Also, I hear you say "100gbps? I will never use/need that much". Are you sure? Have you considered consolidation scenarios, where (for example with Oracle Virtual Network) you could consolidate your platform to a high densitiy virtualized solution providing many virtual 10gbps interfaces through that 100gbps? Technology never stops to evolve. I still remember when a 10mbps network was impressively fast. Back in those days, 16MB of RAM was a lot. Now we usually run servers with around 100.000 times more RAM. If network infrastrucure speends could grow as fast as main memory capacities, we'd have a different landscape now :)
You can utilize SRIOV as well for consolidation. That is, if you run LDoms (aka Oracle VM Server for SPARC) you do not have to add physical IB cards to all your guest LDoms, and you do not need to run VIO devices through the hypervisor either (avoiding overhead). You can enable SRIOV on those IB cards, which practically virtualizes the PCI bus, and you can dedicate Physical- and Virtual Functions of the virtualized HCAs as native, physical HW devices to your guests. See Raghuram's excellent post explaining SRIOV. SRIOV for IB is supported since LDoms 3.1.
This post is getting lengthier, so I will rename it to Part I, and continue it in a second post.