February 13, 2012

Switches Inside Oracle's Engineered Systems

Continuing from my last blog about InfiniBand building blocks, let's now review the network switches used inside Oracle's Engineered Systems in a little more detail. This will help you understand the overall integration, network design, architecture and troubleshooting in later articles.

Two categories of network switches are used to build the computing environment inside the rack.

InfiniBand Switches - two models used depending on requirements
  1. Sun Oracle 36-port InfiniBand Switch
  2. Sun Oracle InfiniBand Gateway Switch

Ethernet Switch - primarily for management purposes

  1. Cisco Catalyst 4948

The following table will get you started quickly and save me a lot of writing.

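Sun Oracle 36-port InfiniBand Switch
  • IB Ports: 36

Sun Oracle InfiniBand Gateway Switch
  • IB Ports: 36-4=32
  • EoIB Ports: 0A-ETH-[1 to 4] and 1A-ETH-[1 to 4], 10Gbps per port

Cisco Catalyst 4948
  • Ethernet Ports: 48 [1-48]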

Let me first give you some more insight into the InfiniBand switches, and then we will talk about the Cisco Catalyst 4948. The following picture shows the 36-port IB switch. The Gateway switch looks similar, with a slight difference for the EoIB ports on the extreme right.

Common information that applies to both of these InfiniBand switches
  1. Form Factor: One rack unit (1U) height
  2. Power Supplies: Two
  3. Cooling Fans: Five
  4. IB Subnet Management: Yes
  5. Firmware Upgradeable: Yes
  6. Command Line Access: Yes. Via ssh and usb-serial access
  7. Web Based Management: Yes
  8. SNMP Access: Yes

As you might have figured out by now, the IB Gateway switch is almost a superset of the 36-port switch in terms of features and capabilities.

Differences between 36-port and Gateway InfiniBand switches

The 36-port switch has four more usable IB ports than the Gateway switch. On the Gateway switch, these four ports are consumed internally to enable Ethernet over InfiniBand (EoIB) functionality. I am sure you are wondering how this is done. The simple explanation is that two additional hardware devices are installed inside the IB Gateway switch. These are called Bridge-X, and each of them connects internally to the InfiniBand fabric via two IB ports. Hence the math of 36-4=32 in the table above. Towards the external world, they expose EoIB ports as 0A-ETH and 1A-ETH in QSFP+ form factor. But not all devices in the Ethernet world understand QSFP+, and 40Gbps Ethernet is not in common use either, so each of these is split into four (4) SFP+ links at a 10Gbps signalling rate each. That's why the final port labels on the EoIB side are 0A-ETH-[N] and 1A-ETH-[N], where N ranges from 1 to 4.
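The port arithmetic and labelling above can be sketched in a few lines of Python. This is only an illustration of the numbers in this blog: the constant and function names are made up for the example and are not part of any switch tool, and the assumption that each Bridge-X maps to one of the two QSFP+ connectors is mine.

```python
# Illustrative sketch only: all names here are hypothetical, not a switch API.
TOTAL_IB_PORTS = 36        # IB ports available on the 36-port switch
BRIDGEX_DEVICES = 2        # Bridge-X devices inside the Gateway switch
IB_PORTS_PER_BRIDGEX = 2   # each Bridge-X consumes two IB ports internally
SFP_LINKS_PER_QSFP = 4     # each QSFP+ connector is split into four SFP+ links


def gateway_external_ib_ports():
    """IB ports left for cabling on the Gateway switch (36 - 4)."""
    return TOTAL_IB_PORTS - BRIDGEX_DEVICES * IB_PORTS_PER_BRIDGEX


def eoib_port_labels():
    """EoIB port labels: 0A-ETH-[1 to 4] followed by 1A-ETH-[1 to 4]."""
    return [
        f"{connector}A-ETH-{link}"
        for connector in range(BRIDGEX_DEVICES)  # assumed: one connector per Bridge-X
        for link in range(1, SFP_LINKS_PER_QSFP + 1)
    ]


print(gateway_external_ib_ports())  # 32
print(eoib_port_labels())
```

The second call lists eight labels in total, four per QSFP+ connector, each carrying 10Gbps.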

Why do we have two Ethernet ports on the InfiniBand switches?

For those who have seen or will get their hands on these two InfiniBand switches, let me clarify something about the Ethernet management port. Visually, you will see two RJ45 ports on the switch, but there is only one target interface inside. A small bridge inside the switch connects to the management Ethernet and provides two connections to the outside world. No, this is not for redundancy or high availability. It is there to allow you to create a linear bus topology if you need it. In simple terms, you can daisy chain more than one such switch.

What about these Leaf and Spine switches?

Okay, now that I have talked about these two InfiniBand switches, let me introduce two keywords that you will be hearing a lot. They will set the ground for further discussions.

  1. Spine Switch
  2. Leaf Switch

These are roles that a switch takes in the topology or connectivity layout. I may write more about the topologies later, but for now let's keep this blog short, concise and in the context of Oracle's Engineered Systems.

The switch where hosts are directly connected takes up the role of Leaf Switch.

A switch that has no hosts directly attached, but does have inter switch links (ISL) to provide alternate paths or to expand the fabric, takes up the role of Spine Switch.

In Exadata and SuperCluster racks, both roles are provided by 36-port InfiniBand switches.

In Exalogic racks, the Leaf role is provided by Gateway switches, whereas the Spine role is provided by a 36-port switch.
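To make the two roles concrete, here is a minimal Python sketch of the rule described above, together with the role assignment per rack type. The function name, the "unconnected" fallback and the dictionary layout are assumptions made for illustration only.

```python
# Illustrative sketch only: names and the "unconnected" case are hypothetical.
def switch_role(attached_hosts, inter_switch_links):
    """Classify a switch by what is cabled to it."""
    if attached_hosts > 0:
        return "leaf"          # hosts are directly connected
    if inter_switch_links > 0:
        return "spine"         # no hosts, only ISLs
    return "unconnected"       # not part of the fabric yet


# Which switch model provides each role, as described in this blog.
ROLE_BY_RACK = {
    "Exadata":      {"leaf": "36-port", "spine": "36-port"},
    "SuperCluster": {"leaf": "36-port", "spine": "36-port"},
    "Exalogic":     {"leaf": "Gateway", "spine": "36-port"},
}

print(switch_role(attached_hosts=8, inter_switch_links=2))  # leaf
print(switch_role(attached_hosts=0, inter_switch_links=4))  # spine
print(ROLE_BY_RACK["Exalogic"]["leaf"])                     # Gateway
```

Note that a leaf switch may carry ISLs too. What makes it a leaf is that hosts are attached directly.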

How is the InfiniBand connectivity and topology built out?

Consider all hosts with one dual-port HCA installed in their PCI-E slots. Connect port-1 of each host to the designated leaf switch-1 with an IB cable. When you are done, this completes a star topology. Now repeat the same with port-2, but this time use the designated leaf switch-2. Each host is now connected to two leaf switches via independent ports. This sets up your dual star topology. But wait, we also need some inter switch links. Why? To guarantee communication in an asymmetric topology. For example, host-A may be using port-1 while host-B may have switched to port-2 for some reason. Without a link between the two leaf switches, those two ports would have no path to each other.

Inter switch links may be as simple as cables between two leaf switches, or they may go through another switch, which is known as the Spine switch. I will not go into micro-level details here; you can read more about how ISLs are chosen in various rack configurations in the respective product guides.
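The asymmetric case can be checked with a small Python model of the dual star. This is a rough sketch and not how the subnet manager computes routes: the node names and helper functions are invented for the example, each HCA port is modelled as its own node because hosts do not forward traffic between their ports, and the ISL is shown as a direct cable between the two leaf switches.

```python
from collections import deque


def build_fabric(hosts, with_isl):
    """Adjacency map for a dual star: port-1 to leaf-1, port-2 to leaf-2."""
    links = {"leaf-1": set(), "leaf-2": set()}
    for host in hosts:
        for port, leaf in ((1, "leaf-1"), (2, "leaf-2")):
            node = f"{host}:port-{port}"
            links.setdefault(node, set()).add(leaf)
            links[leaf].add(node)
    if with_isl:
        # Simplest ISL: a direct cable between the two leaf switches.
        links["leaf-1"].add("leaf-2")
        links["leaf-2"].add("leaf-1")
    return links


def reachable(links, src, dst):
    """Breadth-first search over cables and switches."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


hosts = ["host-A", "host-B"]
no_isl = build_fabric(hosts, with_isl=False)
isl = build_fabric(hosts, with_isl=True)

# host-A is active on port-1 while host-B has moved to port-2.
print(reachable(no_isl, "host-A:port-1", "host-B:port-2"))  # False
print(reachable(isl, "host-A:port-1", "host-B:port-2"))     # True
```

When both hosts use the same port number, they share a leaf switch and can talk without any ISL. The ISL only matters in the mixed case shown here.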

Cisco Catalyst 4948

Each host and end point has a management network port. This is always Ethernet based. The Cisco 4948 switch integrates all such management ports inside the rack. Everything is pre-wired, and all you need to do is connect an uplink from this Cisco switch to your data center access switch. Be careful: do not connect two cables to your data center access switch without planning for Spanning Tree Protocol, since two parallel uplinks can form a Layer 2 loop. This switch is fully managed and also provides VLAN capabilities based on the 802.1Q specification. By default, all hosts inside the rack connected to this switch are on the same VLAN.

Overall Network Design

At a very high level, we have the following setup:

  1. Ethernet based management network served through Cisco Catalyst 4948 switch
  2. InfiniBand internal network served through InfiniBand switches in redundant configuration for high availability
    • This network facilitates all the internal communications within the Engineered Systems framework
  3. Ethernet based external world connectivity
    • In Exadata and SuperCluster, this is achieved via physical 10Gbps Ethernet from individual hosts. There are dedicated 10Gbps NICs installed in hosts. Their switching environment is outside of the rack.
    • In Exalogic, this is achieved via virtual 10Gbps Ethernet from individual hosts. We have been referring to this as EoIB. From the hosts' view, there is no additional hardware or cable. The same IB media path carries this traffic as well.

Next time, I will talk more about the virtual networks that are carried over this physical network. Thanks for reading and I welcome all your comments and questions.

Comments ( 6 )
  • Eli Kleinman Monday, February 13, 2012

    Keep up the great work, I am wondering if vLan tags are fully supported on the IB switches? and another question is about lacp support between two IB switches for fail over?

  • Neeraj Gupta Monday, February 13, 2012

    Hi Eli, these topics are next on my agenda to publish. The short answer is that InfiniBand supports virtualization on LAN using partition keys (Pkeys) at layer 2. LACP in ethernet world is a little different but we do have alternate paths and fail over mechanisms in InfiniBand switches as well. More about it later. Thanks !

    -Neeraj Gupta

  • Daniel Friday, March 30, 2012

    HI Neeraj

    Great blog by the way !!!

    It's interesting you mentioned "Bridge X" for EoIB. Is this the same technology as the Mellanox BridgeX BX5020 ?

    I was trying to find out if oracle solaris 11 would be compatible with the Mellanox gateway switch for EoIB but no one could say for sure ?

    Regards, Daniel

  • guest Sunday, July 15, 2012

    Hi Neeraj,

    I have a basic question about how the 4 GW switches in a full rack are used.

    I understand that there are 2 IB ports per compute node, and 30 compute nodes in a full rack.


    IB port-1 on each node --> 30 ports --> IB switch1

    IB port-2 on each node --> 30 ports --> IB switch2

    How are IB switches 3 and 4 in a full rack connected?

  • guest Tuesday, October 22, 2013

    Great insights! Thank you very much for taking the time to share your knowledge & keep the Tech Community up to date

  • guest Monday, August 11, 2014

    great blog