• September 22, 2015

Myths, facts and other jargon about InfiniBand Partitions

When it comes to InfiniBand Partitions, we often hear two terms: Partition and Pkey.

Let me try to draw a distinction between these two terms, to make them easier to understand and to explain how they are used in Oracle's Engineered Systems.

On Oracle's InfiniBand switch, the partition administration syntax (smpartition) allows us to define custom partition IDs from 0x0002 to 0x7FFE. 0x7FFF (last) is reserved for the default partition and 0x0001 (first) is reserved for internal use.

After putting the partition table in edit mode, the first step is to create a partition ID and the next step is to add member(s) to it. These members are nothing but the InfiniBand port GUIDs of the compute hosts. Each port GUID is treated individually, even if two ports are part of the same physical HCA card inside a machine. Once all administration is done, we must commit and save the changes; otherwise the table is not propagated to the end points.

When we add a member to a partition ID, it obtains a pkey value. Here is how this value is derived.

If the membership is limited, then

pkey value = partition ID

If the membership is full, then the pkey value is calculated as follows:

pkey value = (partition ID) OR 0x8000

If the membership is both, then the pkey value is the same as for full membership. The difference is that a Virtual Machine created on that compute node can also have its IB ports as limited members of that partition ID. So it effectively allows "both" memberships on Virtual Machines.

Now let's take some numerical examples. Let's say that, on the switch, we created partition ID 0x2001.
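The derivation for limited and full members can be written as a small Python sketch. The function name `pkey_value` and its `full` parameter are made up for illustration and are not part of any Oracle tool. The range check mirrors the custom partition ID range given earlier. The sketch assumes that full membership simply sets the high bit (0x8000) of the 16-bit value.

```python
FULL_MEMBER_BIT = 0x8000  # high bit of the 16-bit pkey marks full membership

def pkey_value(partition_id: int, full: bool) -> int:
    """Return the pkey value a port obtains for a custom partition ID.

    Hypothetical helper for illustration only.
    """
    # Custom partition IDs range from 0x0002 to 0x7FFE.
    if not 0x0002 <= partition_id <= 0x7FFE:
        raise ValueError(f"partition ID {partition_id:#06x} is outside 0x0002-0x7FFE")
    if full:
        # Full member: set the high bit with a bitwise OR.
        return partition_id | FULL_MEMBER_BIT
    # Limited member: the pkey value equals the partition ID.
    return partition_id

print(hex(pkey_value(0x2001, full=False)))
print(hex(pkey_value(0x2001, full=True)))
```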

From host-A, we added both IB port GUIDs to partition 0x2001 as limited members. These ports will obtain pkey values as 0x2001 (same as partition ID).

From host-B, we added both port GUIDs to partition 0x2001 as full members. These ports will obtain pkey values as 0xA001 (final value after OR operation with 0x8000).

From host-C, we added both port GUIDs to partition 0x2001 with membership set to both. On the physical host (dom0), both IB ports will obtain the pkey values 0x2001 and 0xA001. On this physical dom0 host, we can choose to create two Virtual Machines.

Virtual Machine C-1 with its ports as limited members to partition 0x2001 resulting in pkey values as 0x2001.

Virtual Machine C-2 with its ports as full members to partition 0x2001 resulting in pkey values as 0xA001.
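Going the other way, a pkey value seen on a host can be split back into its partition ID and membership level. The sketch below is only an illustration. `decode_pkey` is a hypothetical name, and the sketch assumes the low 15 bits carry the partition ID while the high bit carries the membership.

```python
def decode_pkey(pkey: int) -> tuple[int, str]:
    """Split a 16-bit pkey value into (partition ID, membership level).

    Hypothetical helper for illustration only.
    """
    partition_id = pkey & 0x7FFF                       # low 15 bits
    membership = "full" if pkey & 0x8000 else "limited"  # high bit
    return partition_id, membership

# The pkey values from the host-A, host-B and host-C examples above.
for pkey in (0x2001, 0xA001):
    pid, level = decode_pkey(pkey)
    print(f"pkey {pkey:#06x} -> partition ID {pid:#06x}, {level} member")
```

Both values decode to the same partition ID 0x2001; only the membership level differs.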

Now let's look at some common myths, followed by the facts that correct them.

Myth: Limited members of an IB partition can communicate within partition but not outside

Myth: Full members of an IB partition can communicate with other full members of a different partition

Myth: Port GUID with pkey value 0xA001 is different from partition ID 0x2001

Myth: Ports connected to a specific IB switch must be added to desired partition IDs on the same IB switch

Myth: All switches must run subnet manager in order for IB partitions to work 

Myth: Every port GUID must specify membership level when added to a partition ID

Myth: A partition ID cannot have a mix of full and limited members defined under it

Myth: If a partition is created with default membership set to full, then all members must be full only. Or vice versa. 

Myth: Partition based members can always use TCP/UDP over IPoIB

Myth: Various Engineered Systems like Exadata, Exalogic etc need to have their own partition tables and administration

Fact: The most important fact is that nothing can communicate across two distinct partition IDs. For example, members of 0x2001 cannot communicate with members of 0x2002, irrespective of membership levels.

Fact: The second fact is that limited members cannot communicate with each other, even if they are in the same partition ID. For example, two or more GUIDs with limited membership of 0x2001 cannot communicate with each other.

Fact: Limited members can only communicate with full members of the same partition ID. For example, limited members of partition ID 0x2001 can communicate with full members of 0x2001.

Fact: IB partitions are administered globally on only one switch in the entire IB fabric: the one where the master Subnet Manager is running. The rest of the IB switches need not be configured manually. All standby subnet manager switches receive the latest partition table automatically as soon as we execute "smpartition commit". This is done through the list of admin IPs of all subnet manager switches kept in "smnodes list".

Fact: If we have a large IB fabric with several Oracle Engineered Systems (multi-rack), it is not necessary to run a subnet manager instance on each and every IB switch. Only the master subnet manager switch delivers the partition table to every end point of the fabric. Standby switches receive updated tables, but those tables are not used until that switch takes over the master role. Lastly, even hosts directly connected to a switch that does not run a subnet manager (unmanaged) will receive the appropriate partition table, because the master subnet manager is the same for all nodes.

Fact: When we create a partition ID, we can optionally specify its default membership level. Thereafter, when a member is added without an explicit level, it automatically inherits the membership level specified by the partition ID's default membership rule.

Fact: Each partition ID can have several members (port GUIDs) and they need not be of the same membership type. In other words, we can have a mix of limited and full members of a partition ID.

Fact: Individual members (port GUIDs) can override the default membership rule of the partition ID. For example, if partition ID 0x2001 has default membership set to full, we can still add one or more members by explicitly enforcing their own membership to limited. Or vice versa.

Fact: In order to enable IPoIB for any partition ID, a flag must be set. This flag is simply labeled 'ipoib'. Without this flag, members can use native InfiniBand communication but not legacy TCP/UDP over IP.

Fact: As I mentioned before, partitions are administered globally on the one and only master subnet manager switch. So if we have various types of Engineered Systems connected together, they all need to be administered in the same partition table. Interestingly, we may still have some partitions (with members) which are unique to each Engineered System. But if things are connected together, I would guess there is definitely a need to have some common partitions too. Wouldn't you agree? For example, we may have a partition 0x000E with members from Exalogic and Exadata so that they can communicate with each other, whereas partition 0x2001 may be strictly for Exadata clustering, 0x2010 for Exadata storage and 0x8006 for Exalogic EoIB.
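The first three facts above (no traffic across partition IDs, none between two limited members, limited members talking only to full members) boil down to a simple check on two pkey values. Here is a minimal Python sketch of that check. `can_communicate` is a hypothetical name, and the sketch assumes the membership level lives in the high bit (0x8000) of the pkey value.

```python
def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Return True if two ports with these pkey values may talk to each other.

    Hypothetical helper for illustration only.
    """
    # Both ports must belong to the same partition ID (low 15 bits).
    same_partition = (pkey_a & 0x7FFF) == (pkey_b & 0x7FFF)
    # At least one of the two must be a full member (high bit set).
    at_least_one_full = bool((pkey_a | pkey_b) & 0x8000)
    return same_partition and at_least_one_full

pairs = [
    (0x2001, 0x2001),  # limited and limited, same partition ID
    (0x2001, 0xA001),  # limited and full, same partition ID
    (0xA001, 0xA001),  # full and full, same partition ID
    (0xA001, 0xA002),  # full and full, different partition IDs
]
for a, b in pairs:
    print(f"{a:#06x} <-> {b:#06x}: {can_communicate(a, b)}")
```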
