
Recent Posts

WAIT-VHUB? What's Going On?

I know many of you have been working on Oracle's Exalogic and other Engineered Systems. With partitions enabled, things have gone multi-dimensional. But it's fun, isn't it? If you have some EoIB configurations together with InfiniBand partitions and the VNICs are not coming up, staying in the WAIT-VHUB state, chances are that you have forgotten to add the InfiniBand Gateway switches' Bridge-X port GUIDs to your partition. These must be added as FULL members for EoIB to work properly. VHUB means a virtual hub in EoIB. Bridge-X is the access point for hosts to work over EoIB, which is why it must be a full member of the partition.

Step 1: Find the port GUIDs of your Bridge-X devices on the IB Gateway switch.

# showgwports

INTERNAL PORTS:
---------------
Device   Port Portname    PeerPort PortGUID           LID    IBState  GWState
---------------------------------------------------------------------------
Bridge-0  1   Bridge-0-1    4      0x0010e00c1b60c001 0x0002 Active   Up
Bridge-0  2   Bridge-0-2    3      0x0010e00c1b60c002 0x0006 Active   Up
Bridge-1  1   Bridge-1-1    2      0x0010e00c1b60c041 0x0026 Active   Up
Bridge-1  2   Bridge-1-2    1      0x0010e00c1b60c042 0x002a Active   Up

Step 2: Add these port GUIDs to the IB partition associated with EoIB. Log in to the master SM switch for this task.

# smpartition start
# smpartition add -pkey <PKey> -port <port GUID> -m full
# smpartition commit

Enjoy!
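The two steps above can be sketched as a small script. This is only an illustration: the sample showgwports output is hard-coded below, and the PKey 0x8005 is a made-up example. On a real system you would run showgwports on the gateway switch and substitute your own EoIB partition's PKey.

```shell
# Extract Bridge-X port GUIDs (column 5) from saved showgwports output
# and print the smpartition commands to run on the master SM switch.
cat > /tmp/gwports.txt <<'EOF'
Bridge-0  1   Bridge-0-1    4    0x0010e00c1b60c001 0x0002 Active   Up
Bridge-0  2   Bridge-0-2    3    0x0010e00c1b60c002 0x0006 Active   Up
Bridge-1  1   Bridge-1-1    2    0x0010e00c1b60c041 0x0026 Active   Up
Bridge-1  2   Bridge-1-2    1    0x0010e00c1b60c042 0x002a Active   Up
EOF

PKEY=0x8005   # hypothetical example; use your EoIB partition's PKey

echo "smpartition start"
awk -v pkey="$PKEY" '/^Bridge-/ { print "smpartition add -pkey " pkey " -port " $5 " -m full" }' /tmp/gwports.txt
echo "smpartition commit"
```

Review the printed commands before pasting them into the master SM switch session.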


InfiniBand Enabled Diskless PXE Boot

When you want to bring up a compute server in your environment and need InfiniBand connectivity, you usually go through various installation steps: an operating system such as Linux, followed by a compatible InfiniBand software distribution, associated dependencies and configurations. What if you just want to run some InfiniBand diagnostics or troubleshooting tools from a test machine? What if something happened to your primary machine and, while recovering in rescue mode, you also need access to your InfiniBand network? Often we use open-source, community-supported small Linux distributions, but they don't come with the required InfiniBand support and tools. In this weblog, I am going to provide instructions on how to add InfiniBand support to a specific Linux image, Parted Magic. This is a free-to-use open-source Linux distro often used to recover or rescue machines. The distribution itself will not be changed at all. Yes, you heard it right! I have built an InfiniBand add-on package that will be passed to the default kernel and initrd to get this all working.

Pre-requisites

You will need to have a PXE server ready on your Ethernet-based network. The compute server you are trying to PXE boot should have a compatible IB HCA and must be connected to an active IB network.

Required Downloads

Download the Parted Magic small distribution for PXE from the Parted Magic website. Download the InfiniBand PXE add-on package. Do not extract the contents of this file; you need to use it as is.

Prepare PXE Server

Extract the contents of the downloaded pmagic distribution into a temporary directory. Inside the directory structure, you will see a pmagic directory containing two files, bzImage and initrd.img. Copy this directory into your TFTP server's root directory. This is usually /tftpboot unless you have a different setup.
For example:

cp pmagic_pxe_2012_2_27_x86_64.zip /tmp
cd /tmp
unzip pmagic_pxe_2012_2_27_x86_64.zip
cd pmagic_pxe_2012_2_27_x86_64

# ls -l
total 12
drwxr-xr-x  3 root root 4096 Feb 27 15:48 boot
drwxr-xr-x  2 root root 4096 Mar 17 22:19 pmagic

cp -r pmagic /tftpboot

As I mentioned earlier, we don't change anything in the default pmagic distro. Simply provide the add-on package via the PXE append options. If you are using a menu-based PXE server, add an entry to your menu. For example, /tftpboot/pxelinux.cfg/default can be appended with the following section.

LABEL Diskless Boot With InfiniBand Support
MENU LABEL Diskless Boot With InfiniBand Support
KERNEL pmagic/bzImage
APPEND initrd=pmagic/initrd.img,pmagic/ib-pxe-addon.cgz edd=off load_ramdisk=1 prompt_ramdisk=0 rw vga=normal loglevel=9 max_loop=256
TEXT HELP
* A Linux Image which can be used to PXE Boot w/ IB tools
ENDTEXT

Note: Keep the line starting with "APPEND" as a single line. If you use host-specific files in pxelinux.cfg, you can add the above entry to that specific file instead.

Boot Computer over PXE

Now boot your desired compute machine over PXE. This does not have to be over InfiniBand; just use your standard Ethernet interface and network. If using menus, pick the new entry that you created in the previous section. After a few minutes, you will be booted into the Parted Magic environment.

Enable IPoIB

Well, I have made things a bit easy for you :) The add-on package that we passed while booting starts IPoIB automatically. All you need to do is assign an IP address to the ib0 or ib1 interfaces. Open a terminal session and check the status. You can use commands like:

ifconfig -a
ibstat
ibv_devices
ibv_devinfo

If you are connected to an InfiniBand network with an active Subnet Manager, your IB interfaces must have come online by now. You can proceed and assign IP addresses to them. This will enable you at the IPoIB layer.
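Before assigning an IP address, you can confirm that the port has actually reached the ACTIVE state. Here is a tiny sketch with sample ibstat output embedded so it runs anywhere; on the booted machine you would pipe the real `ibstat` output instead, and the address in the comment is hypothetical.

```shell
# Parse (sample) ibstat output and decide whether the port is ready for an IP.
cat > /tmp/ibstat.txt <<'EOF'
CA 'mlx4_0'
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
EOF

if grep -q "State: Active" /tmp/ibstat.txt; then
    # e.g. ifconfig ib0 192.168.10.20 netmask 255.255.255.0 up
    echo "port ACTIVE: safe to assign an IP"
else
    echo "port not active: check cabling and the Subnet Manager"
fi
```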
Example InfiniBand Diagnostic Tools

I have also added several InfiniBand diagnostic tools to this add-on. You can use any from the following list:

ibstat, ibstatus, ibv_devinfo, ibv_devices
perfquery, smpquery
ibnetdiscover, iblinkinfo.pl
ibhosts, ibswitches, ibnodes

Wrap Up

This concludes this weblog. Here we saw how to bring up a computer with IPoIB and InfiniBand diagnostic tools without installing anything on it. It's almost like running diskless!


Advanced Routing for Multi-Homed Hosts

Earlier we discussed a host participating in different networks or subnets, referred to as a multi-homed host. Here I am going to talk about how to handle layer 3 routing. I will break down my discussion into two scenarios: Simple and Not-So-Simple.

Simple Scenario

Let's assume a host with four network interfaces, each connected to a unique layer 3 subnet. Three subnets are private LANs and the fourth is a bigger one, a WAN. The smaller networks could be for your management, development or testing, let's say. The bigger one is the Internet or an intranet, where it is not easy to define how many hosts or services will be there. This bigger network may even be subdivided into more networks, and almost always a router is present here. A router's main function is to route traffic across unique broadcast subnets. So our multi-homed host will have a default gateway defined towards this bigger network or WAN. Whenever communication has to happen with someone outside of our known networks, we forward it to the default gateway, which is the router.

Let me write this down in simple terms here.

Host's Network Participation Requirements

eth0  - 201.19.23.128 / 24  with gateway IP 201.19.23.1
bond0 - 192.168.10.1  / 24  with no gateway requirements
bond1 - 10.214.28.101 / 24  with no gateway requirements
bond2 - 172.23.7.128  / 24  with no gateway requirements

Looks like this machine only needs to talk to the corporate network through eth0 via 201.19.23.1. Problem solved! We can simply put this default gateway in /etc/sysconfig/network or /etc/sysconfig/network-scripts/ifcfg-eth0.

Not-So-Simple Scenario

Now take the same host from the above scenario, and instead of one connection to a bigger network, make two such connections. One could be towards the real Internet, and the other towards a corporate wide area network. We still maintain the other two for management and internal communications.
If we continue to use the standard way of configuring our default gateway, only one of the two bigger networks will be accessible, simply because default gateways are interface or layer 3 subnet bound.

Host's Network Participation Requirements

eth0  - 201.19.23.128 / 24  with gateway IP 201.19.23.1
bond0 - 192.168.10.1  / 24  with no gateway requirements
bond1 - 10.214.28.101 / 24  with gateway IP 10.214.28.1
bond2 - 172.23.7.128  / 24  with no gateway requirements

As you can see, eth0 and bond1 need their own respective default gateways. bond0 and bond2 do not have any default gateway requirements; they are simply confined to their actual layer 3 subnets. If you simply add a default route, only one can be in effect at a time.

Problem

Let me rephrase the above discussion in the form of a problem statement: how can a multi-homed host be made accessible over more than one network, across different routers?

Solution

Linux has advanced routing capabilities, made possible through the iproute2 tools. These allow us to specify more than one default gateway or router address. I am presenting a sample config based on Oracle Enterprise Linux 5, but this can be easily adapted to other flavors, including 'vanilla' distributions :) Basically, we create some rules and tables for routing lookups. We will need some unique table IDs. I am going to use 224 and 225; they should not have been used before. You can check like this:

ip rule list

Look at the first column; the output should not have 224 or 225.
Otherwise, use some other numbers.

For eth0, create the following two files.

vi /etc/sysconfig/network-scripts/rule-eth0
from 201.19.23.128/32 table 224
to 201.19.23.128 table 224

vi /etc/sysconfig/network-scripts/route-eth0
201.19.23.0/24 dev eth0 table 224
default via 201.19.23.1 dev eth0 table 224

For bond1, create the following two files.

vi /etc/sysconfig/network-scripts/rule-bond1
from 10.214.28.101/32 table 225
to 10.214.28.101 table 225

vi /etc/sysconfig/network-scripts/route-bond1
10.214.28.0/24 dev bond1 table 225
default via 10.214.28.1 dev bond1 table 225

Now you can restart the network to make these new configs effective. But do it at some planned time, because this will interrupt your host's access. You may also use 'ip' commands for a runtime change. That's all, and your host should now be accessible across both routers.

Static Routes

Some of you may be wondering by now why I have not mentioned anything about static routes. I have not forgotten! Static routes are for the scenarios in between. If you have a well-known subnet beyond a router, then you should certainly add a static route for it. For example, if one of the machines connected to the bond1 network also knows about another network and has routing capabilities, you can use a static route through it.

Ok, so that is all for this post. As always, your comments are most welcome. Thanks!
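The file contents shown above can also be generated with a small helper. This is just a sketch using the example addresses from this post; it writes to /tmp so you can try it without root, whereas on a real host the files belong in /etc/sysconfig/network-scripts.

```shell
# Generate a rule-/route- file pair for one interface and routing table.
gen_policy_routing() {
    local ifname=$1 addr=$2 subnet=$3 gw=$4 table=$5
    cat > "/tmp/rule-$ifname" <<EOF
from $addr/32 table $table
to $addr table $table
EOF
    cat > "/tmp/route-$ifname" <<EOF
$subnet dev $ifname table $table
default via $gw dev $ifname table $table
EOF
}

# Example values from the scenario above
gen_policy_routing eth0  201.19.23.128 201.19.23.0/24 201.19.23.1 224
gen_policy_routing bond1 10.214.28.101 10.214.28.0/24 10.214.28.1 225
cat /tmp/rule-eth0 /tmp/route-eth0 /tmp/rule-bond1 /tmp/route-bond1
```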


Deployments over InfiniBand Infrastructure

I have been talking to many people who are working hard towards migrating an existing Ethernet-based infrastructure, or starting from the ground up, to deploy their services and applications on InfiniBand-based computing environments. At the end of the talk, it all looks very simple, right? But where to start? That's the fun part. Let me give some insight into these types of discussions. Recall one of my previous blogs, where I focused on a few keywords: Consolidation, Isolation and Virtualization. This blog builds on the discussion I presented there.

First things first! Go through your server sizing activity and get an idea of how many applications you can fit into a compute machine based on its resources like CPU, memory, disk space etc. When you have a number of applications to deploy, or even migrate from Ethernet environments, you would probably go with a model where you group some servers to perform a certain set of functions. Right? Now you want to plan how these groups communicate with each other and, at a lower level, how the servers within each group communicate with each other. All this work creates the basis for network design and architecture.

Network Design with InfiniBand

For simplicity, I will assume that we have a total of eight compute machines divided equally into four groups, and that by now you have already decided what applications run where. Networks can be classified into two main categories: internal and external. As these terms suggest, internal networks are primarily used for communications like server management, patching, updates, synchronization, health checks etc., while external networks are used to advertise and offer the running services to your outside world, which may even be the Internet.

Question: will all eight machines be talking to the external network? Maybe, maybe not. Let's assume that we have a few select machines for this, and they are categorized in group A. This group could be serving the purpose of a load balancer. Sure!
The two machines in this group may also be hosting some other service applications, e.g. a name service and a proxy. Sure! What I am trying to highlight here is that a machine may be deployed with multiple applications, each of which may need to participate in multiple networks. So network participation is not driven by the machine as hardware; it is a function of each running service or application inside the machine.

External Networks

Group A faces the external world, and in the case of Exalogic hardware, external access is made possible through Ethernet over InfiniBand, or EoIB. If we have two different functions exposed to the external world, e.g. a load balancer and a name service, we may wish to isolate them. This can be done by creating separate virtual NICs and assigning them their own IP addresses. These may even be on unique VLANs. If isolation is not needed, we can simply use the same IP address; most applications have a "listen" feature which dictates what socket (IP and port) they bind to. Use of in-line firewalls on these external networks is not uncommon, and this will provide security to our edge computing group A. We can even go a step further and write iptables rules on the machines inside group A for further security and hardening.

Internal Networks

Ok, so we took care of how group A faces the external world. In our example, where we used load balancing as the function of the edge group A, it will have to communicate with the internal groups where the actual services may be running. As I initially mentioned, we have already subdivided the rest into three more groups: B, C and D. To build up a use case, let's assume that the services to be load balanced are deployed only in group B. That means group A will be communicating with B on some specific sockets, and we don't want any exposure from A to C and D. So what do we do here? We know that everything is connected over the same physical segment, which is InfiniBand.
Just like we would use VLANs in Ethernet, we use partition keys here. The hosts in groups A and B will share a common partition key to allow secured communication. Let's name this network AB-10. It may so happen that when the machines in group B receive jobs from A, they need some help from the machines in group C. Earlier, we excluded group C from network AB-10 for security. So what we do here is create a new network between groups B and C just for this internal usage. Let's call it BC-20. And let me remind you that when you create a new network using VLANs or partition keys, you get a new set of network interfaces and their associated IP addresses. This is what I discussed in my earlier blog, Networks and Virtualization.

What about group D? Well, I left it alone on purpose, just to demonstrate a use case where this group may be designated for some internal development and testing. But still, some network definition must exist for this lone group D, because there are two machines inside it and they may need to communicate with each other. Let's call it D-44. It will remain isolated from A, B and C, as well as from the external EoIB network created for A.

I have not forgotten about the NFS server, so let's finish our conversation here. If all machines need to mount their respective shares from a common NFS server, then we can allow it to participate in all sub-networks: AB-10, BC-20 and D-44. Access control policies can be enforced to allow only specific hosts to use the exported shares.

Wrap Up

What I have discussed here is that a set of machines can be grouped into functional roles, which eventually drives the need for networks for communication, internal and external. Network isolation can be achieved in InfiniBand through partition keys, just like VLANs in Ethernet. In-line firewalls can be used on EoIB network paths. Host-level iptables rules can also be used if desired.
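As one concrete illustration of the host-level hardening mentioned above, here is a hypothetical iptables fragment for a group B machine on network AB-10. The interface name ib0.8001, the port 7001 and the subnets are all made-up examples; the idea is simply to accept the service port only from group A's subnet and drop everything else on that interface.

```shell
# Hypothetical /etc/sysconfig/iptables fragment for a group B host:
# allow the load-balanced service port only from group A's AB-10 subnet.
-A INPUT -i ib0.8001 -p tcp --dport 7001 -s 192.168.10.0/24 -j ACCEPT
-A INPUT -i ib0.8001 -p tcp --dport 7001 -j DROP
```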


Bonding Parameters Based on Network Layout

Quick Introduction to Linux Bonding

As the name suggests, the bonding driver creates a logical network interface using multiple physical network interfaces underneath. There are various reasons to do so, including link aggregation for higher bandwidth, redundancy and high availability. Upper layers communicate through the logical bond interface, which has an IP address, but eventually the active physical interface(s) communicate at layer 2. It also provides transparency to upper layers by hiding the actual interface.

Bonding Parameters

Other than specifying the physical interfaces that are part of the logical bond interface, we also specify how we want this whole thing to work. There are several possible configurations, but I am only going to focus on one mode, called "active-backup", which has the numerical identifier 1. You can list all parameters of the kernel bonding driver installed in your system; look for the lines beginning with 'parm' below. You can work with these options in /etc/modprobe.conf, and some of them can also be set directly for each bonding interface via /etc/sysconfig/network-scripts/ifcfg-bond1.

[root@hostA ~]# modinfo /lib/modules/2.6.32-100.23.80.el5/kernel/drivers/net/bonding/bonding.ko
filename:       /lib/modules/2.6.32-100.23.80.el5/kernel/drivers/net/bonding/bonding.ko
author:         Thomas Davis, tadavis@lbl.gov and many others
description:    Ethernet Channel Bonding Driver, v3.5.0
version:        3.5.0
license:        GPL
srcversion:     4D5495287BB364C8C5A5ABE
depends:        ipv6
vermagic:       2.6.32-100.23.80.el5 SMP mod_unload
parm:           max_bonds:Max number of bonded devices (int)
parm:           num_grat_arp:Number of gratuitous ARP packets to send on failover event (int)
parm:           num_unsol_na:Number of unsolicited IPv6 Neighbor Advertisements packets to send on failover event (int)
parm:           miimon:Link check interval in milliseconds (int)
parm:           updelay:Delay before considering link up, in milliseconds (int)
parm:           downdelay:Delay before considering link down, in milliseconds (int)
parm:           use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
parm:           mode:Mode of operation : 0 for balance-rr, 1 for active-backup, 2 for balance-xor, 3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, 6 for balance-alb (charp)
parm:           primary:Primary network device to use (charp)
parm:           lacp_rate:LACPDU tx rate to request from 802.3ad partner (slow/fast) (charp)
parm:           ad_select:803.ad aggregation selection logic: stable (0, default), bandwidth (1), count (2) (charp)
parm:           xmit_hash_policy:XOR hashing method: 0 for layer 2 (default), 1 for layer 3+4 (charp)
parm:           arp_interval:arp interval in milliseconds (int)
parm:           arp_ip_target:arp targets in n.n.n.n form (array of charp)
parm:           arp_validate:validate src/dst of ARP probes: none (default), active, backup or all (charp)
parm:           fail_over_mac:For active-backup, do not set all slaves to the same MAC.  none (default), active or follow (charp)

How to Verify Bonding Status and Configuration

Under active-backup mode, the most common configuration is link-based failure detection via a pair of parameters, miimon and use_carrier. Here is how it looks on a running system.
[root@hostA ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup) (fail_over_mac active)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 5000

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:21:28:4a:cd:80

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:21:28:4a:cd:81
[root@hostA ~]#

What we see here is that bond1 is set to active-backup mode with two physical interfaces, or slaves: eth0 and eth1. Their link status is monitored every 100 ms. If a link goes down, the bonding driver waits 5000 ms before actually declaring it DOWN; when the lost link recovers, the driver again waits 5000 ms before declaring it UP. The option 'primary' is set to none, which means the bonding driver has no preference between eth0 and eth1 if both are UP at the same time. The link failure counter tracks how many times each link has failed since the host has been running.

Limitations of MII Monitoring Based Bonding

Now let's review the following topology diagram. Here a host A has two physical interfaces, and their links are marked 1 and 2 respectively. They are connected to independent Ethernet switches for redundancy and high availability. These switches further connect into a bigger network which is external to us; it may be a corporate network or even the Internet. The uplinks from our local Ethernet switches are labeled 3 and 4 respectively. Host A, with its bond1 interface, currently has eth0 as the active interface and is expected to communicate with the external network as shown.

Scenario 1: When link number 1 is out of service, the bonding driver will detect it within a specific period of time and activate the backup interface eth1. Service will be restored at this point.
:)

Scenario 2: When link number 3 is out of service, the bonding driver will be completely unaware of it, because both of its local physical interfaces are fully in service. However, host A will be unable to reach the external world, since link number 3 is down. :(

Alternate Bonding Configuration to Detect Failures at OSI Layer 3

The bonding driver offers an alternate set of parameters to solve the problem illustrated above. Instead of miimon, we use arp_ip_target and arp_interval. The modified configuration looks like this.

[root@hostA ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 60
ARP IP target/s (n.n.n.n form): 192.168.70.1

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:21:28:4a:cd:80

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:21:28:4a:cd:81
[root@hostA ~]#

As you can see, the bonding driver is now monitoring accessibility to 192.168.70.1 every 60 milliseconds. If this check is not successful, it will attempt to use eth1, irrespective of local link status.

Conclusion

MII monitoring based bonding is ideal when you are communicating within a LAN and do not go across a router. IPoIB is a good example here, because currently InfiniBand networks are limited to the same broadcast subnet; in other words, they do not use a layer 3 router. ARP IP target based monitoring should be preferred if your setup is similar to what we just discussed: if the bonded interface is expected to communicate with the outside world across a router, then it is better to monitor accessibility to a set of external IP addresses instead of just the local link status.
Client access networks created with EoIB are a good example here.
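For completeness, here is roughly what the ARP-target variant looks like in configuration on an OEL 5 style system. This is a sketch only: the bond name, target IP and interval are the example values from this post, and your distribution may expect these options as BONDING_OPTS in ifcfg-bond1 instead of in /etc/modprobe.conf.

```shell
# /etc/modprobe.conf fragment (example values; adjust for your network)
alias bond1 bonding
options bond1 mode=active-backup arp_interval=60 arp_ip_target=192.168.70.1
```

After loading the driver with these options, /proc/net/bonding/bond1 should show the ARP polling interval and target as in the output above.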


Networks and Virtualization

Virtualization is always a heavily discussed subject among infrastructure owners. Why? Often, the capacity of the hardware used to deploy services is more than the services can actually utilize, and data center real estate and power are expensive. So what do we want to do? Consolidation. Sure, but what about security? Someone will not like the fact that they are running on a shared set of resources with everything exposed. Now what do we want here? Isolation. These two keywords, Consolidation and Isolation, may appear to contradict each other, but together they form the basis of Virtualization in computing environments. In the networking world, virtualization is achieved when you can consolidate multiple data flows over a given medium and yet maintain isolation.

I have been hearing people ask these questions quite a lot: What's with these VLANs? How many should I create? What are InfiniBand partitions? How do I implement them? Let's twist these questions around and ask differently: How many different local area networks do we need to participate in? Do we need security amongst these networks?

In a modern computing environment, almost all hosts, and especially servers, are part of more than one network. There is a term for this: multi-homing. Let's elaborate a little. The meaning is quite literal: it is about participating in different networks at the same time. Let's start with one network. What do you need here? Simple enough: one network port with one IP address and subnet mask. You will add a gateway address in case your destinations are out there in the bigger network beyond your local area. Okay, things grow and your requirements may expand too. Add another network port, another IP address, mask and gateway etc. But can you scale with this model beyond a certain limit? No. Very soon you will realize that you are hitting a physical limit to expansion.
And nobody would like to spend more money on hardware and then manage more things on the data center floor. Those who have seen the cabling know what I am talking about. So what do we need to do here? Consolidation. How do we do that? Let's see, and keep in mind the first three layers of OSI, because I will be referring to them going forward. As soon as we virtualize at a layer, the layers above it inherit.

Virtualization at Layer 1 (Hardware)

Even if only for the sake of a simple explanation, I would say we do have some consolidation happening at hardware, or layer 1. Haven't you seen network interface cards with more than one network port? How about those dual-band wireless routers? At this layer, this is the kind of consolidation we have, and isolation is pretty much built in. Instead of having two single-port network cards, we may use dual-port or even quad-port cards. Different ports on the same card have their own hardware address and physical path to the external world.

            NIC #1      NIC #2      NIC #3      NIC #4
Ethernet    eth0, eth1  eth2, eth3  eth4, eth5  eth6, eth7, eth8, eth9
InfiniBand  ib0, ib1    ib2, ib3    ib4, ib5    ib6, ib7

As you can see, each individual network interface card provides a unique interface to be used at layer 2 and layer 3. The consolidation shown here is only at the network card level, purely to build up the conversation.

Virtualization at Layer 2 (Link)

The fun starts here :). In this section, we stay on the same physical medium, let's say the same network port, which has a fixed hardware address. In Ethernet this is known as the MAC; in InfiniBand it is the port GUID, or the LID under subnet management. We can consolidate multiple streams at layer 2, since the upper layer 3 can participate in multiple IP subnets, but how do we achieve isolation here? That is the key to virtualization at this layer.
The Ethernet world implements virtual LANs (VLANs) based on the 802.1Q specification; InfiniBand implements partition keys (PKeys) based on the IBTA specification. Basically, the idea is to give each stream consolidated at link layer 2 a unique ID and present a new interface to the upper layer with isolation. The overall result is network virtualization at layer 2.

            NIC #1 Interfaces   Layer 2 Unique ID   Interface Presented to Layer 3
Ethernet    eth0, eth1          VLAN=20             eth0.20, eth1.20
InfiniBand  ib0, ib1            PKey=0x8001         ib0.8001, ib1.8001

In this approach, when you analyse the packets traversing the wire, there will be unique fields set at layer 2 to isolate packets. Layer 3 will also have its own source and destination IP subnet information.

Virtualization at Layer 3 (Network)

This is the layer where we use IP addresses at the end points, and I will refer to Linux in my examples. The assumption here is that we have the same network interface with the same hardware address and no virtualization at layer 2. So how do we do this at layer 3? IP aliasing with unique subnets. Let's say our interface is eth0. Several operating systems, including Linux, allow us to add more virtual interfaces with a naming format like eth[n]:[y], where 'n' is our interface instance and 'y' is a virtual instance on top of it. So we can have several virtual layer 3 instances participating in their own IP subnets. What did we do here? We consolidated data streams and also isolated them via unique subnetting.

            NIC #1 Interfaces   IP Aliases or Virtual Interfaces
Ethernet    eth0, eth1          eth0:1, eth0:2, eth1:1, eth1:2
InfiniBand  ib0, ib1            ib0:1, ib0:2, ib1:1, ib1:2

In this approach, when you analyse the packets traversing the wire, the only differences will be at layer 3, in the source and destination IP subnets. There will be no isolation at layer 2.

What Works Best?

We just saw how virtualization can be done at the three lower layers of the OSI model. Let's evaluate and recap.
Layer 1: Requires more hardware and becomes unmanageable. Many people will debate that this is not virtualization at all. I agree; my purpose was to give you an idea of what happens at each layer.

Layer 2: This is the closest layer to hardware, and once virtualized, the upper layers inherit the environment. It provides the best confidence level in terms of security and isolation.

Layer 3: Virtualization here does the job, but the confidence level falls due to the unprotected layer 2.

The choice is yours to make, and it all depends on how you design your infrastructure. The advantages of virtualization at layer 2 seem to outweigh the other options. The technologies at hand are VLANs for Ethernet and partition keys for InfiniBand. In my upcoming blogs, I will give you more insight into how partitions and VLANs are actually implemented using the products at hand, and how we consolidate services while maintaining isolation to achieve virtualization at the network layers.
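To make the layer 2 option concrete, here is roughly what a VLAN child interface looks like in a Red Hat style ifcfg file, matching the eth0.20 example above. This is a sketch with hypothetical addresses; the file would live at /etc/sysconfig/network-scripts/ifcfg-eth0.20.

```shell
# ifcfg-eth0.20: layer 2 isolated interface on VLAN 20 (example values)
DEVICE=eth0.20
VLAN=yes
BOOTPROTO=static
IPADDR=192.168.20.10
NETMASK=255.255.255.0
ONBOOT=yes
```

Bringing this interface up tags all of its traffic with VLAN ID 20 at layer 2, which is exactly the isolation the table above illustrates.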


Switches Inside Oracle's Engineered Systems

Continuing from my last blog about InfiniBand building blocks, let's now review the network switches used inside Oracle's Engineered Systems in a little more detail. This will help you understand the overall integration, network design, architecture and troubleshooting in later articles. There are two categories of network switches used to prepare the computing environment inside the rack.

InfiniBand Switches - two models used depending on requirements
- Sun Oracle 36-port InfiniBand Switch
- Sun Oracle InfiniBand Gateway Switch
Ethernet Switch - primarily for management purposes
- Cisco Catalyst 4948

The following table will get you started quickly and save me a lot of writing.

  Switch                                 External IB Ports   IB Signal Bitrate   IB Port Labels    Ethernet Ports
  Sun Oracle 36-port InfiniBand Switch   36, QSFP+           40Gbps              0A-17A, 0B-17B    -
  Sun Oracle InfiniBand Gateway Switch   36-4=32, QSFP+      40Gbps              0A-15A, 0B-15B    EoIB: two QSFP+, 10Gbps per port, 0A-ETH-[1 to 4], 1A-ETH-[1 to 4]
  Cisco Catalyst 4948                    -                   -                   -                 48 [1-48], 10/100/1000Base-T

Let me first give you some more insight on the InfiniBand switches, and then we will talk about the Cisco Catalyst 4948. The following picture shows the 36-port IB switch. The Gateway switch looks similar, with a slight difference for the EoIB ports on the extreme right.

Common information that applies to both of these InfiniBand switches:
- Form Factor: one rack unit (1U) height
- Power Supplies: two
- Cooling Fans: five
- IB Subnet Management: yes
- Firmware Upgradeable: yes
- Command Line Access: yes, via ssh and usb-serial access
- Web Based Management: yes
- SNMP Access: yes

As you might have figured out by now, the IB Gateway switch is almost a superset of the 36-port switch in terms of features and capabilities.

Differences between the 36-port and Gateway InfiniBand switches

Comparatively, there are four additional external IB ports on the 36-port switch. On the Gateway switch these are internally consumed to enable Ethernet over InfiniBand (EoIB) functionality. I am sure you are wondering how this is done.
The simple explanation is that there are two additional hardware devices installed inside the IB Gateway switch. These are called Bridge-X, each of which internally connects to the InfiniBand fabric via two IB ports. Hence the math of 36-4=32 in the table above. Towards the external world, they expose EoIB ports as 0A-ETH and 1A-ETH in QSFP+ form factor. But not all devices in the Ethernet world understand QSFP+, and 40Gbps Ethernet is not yet in common use, so each of these is split into four (4) SFP+ ports at a 10Gbps signalling rate each. That's why the final port labels on the EoIB side are 0A-ETH-[N] and 1A-ETH-[N], where N has a fixed value from 1 to 4.

Why do we have two Ethernet ports on the InfiniBand switches?

For those who have seen or will get their hands on these two InfiniBand switches, let me clarify something about the Ethernet management port. Visually, you will see two RJ45 ports on the switch, but there is only one target interface inside. A small bridge inside the switch connects to the management Ethernet and provides two connections to the outside world. No, this is not for redundancy or high availability. It is there to allow you to create a linear bus topology, if you need it. In simple terms, you can daisy-chain more than one such switch.

What about these Leaf and Spine switches?

Okay, now that I have talked about these two InfiniBand switches, let me introduce two keywords which you will be hearing a lot; this will set the ground for further discussions.
- Spine Switch
- Leaf Switch
These are roles of a switch in the topology or connectivity layout. I may write more about topologies later, but for now let's keep this blog short, concise and in the context of Oracle's Engineered Systems. The switch where hosts are directly connected takes up the role of Leaf Switch.
The switch with no directly attached hosts, but which does have inter-switch links (ISLs) to provide alternate paths or to expand the fabric, takes up the role of Spine Switch. In Exadata and SuperCluster racks, both roles are provided by 36-port InfiniBand switches. In Exalogic racks, the Leaf role is provided by Gateway switches, whereas the Spine role is provided by a 36-port switch.

How is the InfiniBand connectivity and topology built out?

Consider all hosts with one dual-port HCA installed in their PCI-E slots. Connect port-1 to designated leaf switch-1 with an IB cable. When you are done, this completes a star topology. Now repeat the same on port-2, but this time use designated leaf switch-2. So each host is connected to two leaf switches via independent ports. This sets up your dual star topology. But wait, we need some inter-switch links also. Why? To ensure guaranteed communication in an asymmetric topology. For example, host A may be using port-1 while host B may switch to port-2 for some reason. Inter-switch links may be as simple as cables between two leaf switches, or they may go through another switch, which is known as a Spine switch. I will not go into micro-level details here, as you can read more about how ISLs are chosen in various rack configurations in the respective product guides.

Cisco Catalyst 4948

Each host and end point has a management network port. This is always Ethernet based. The Cisco 4948 switch integrates all such management ports inside the rack. Everything is pre-wired, and all you need is to connect an uplink from this Cisco switch to your data center access switch. Be careful not to connect two cables into your data center access switch without planning for Spanning Tree Protocol. This switch is fully managed and also provides VLAN capabilities based on the 802.1Q specification. By default, all hosts inside the rack connected to this switch are on the same VLAN.
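If you have OFED diagnostic tools on a host, you can list the switches in the fabric and read their node descriptions to tell gateways apart from plain 36-port switches. The node GUIDs and descriptions below are made up for illustration only; on a live fabric you would run `ibswitches` directly.

```shell
# On a live fabric:  ibswitches
# Made-up sample output (GUIDs and node descriptions are illustrative):
sample='Switch : 0x0010e04071a2b0a0 ports 36 "SUN IB QDR GW switch el01gw02" enhanced port 0 lid 21 lmc 0
Switch : 0x0010e04071a2c0a0 ports 36 "SUN IB QDR GW switch el01gw03" enhanced port 0 lid 22 lmc 0
Switch : 0x0010e04071c4d0a0 ports 36 "SUN DCS 36P QDR el01ib01" enhanced port 0 lid 5 lmc 0'

# Count the gateway (leaf) switches versus the total:
gateways=$(printf '%s\n' "$sample" | grep -c 'GW switch')
total=$(printf '%s\n' "$sample" | grep -c '^Switch')
echo "gateways=${gateways} total=${total}"
```

In an Exalogic-style layout, this sample would correspond to two Gateway switches in the leaf role and one 36-port switch in the spine role.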
Overall Network Design

At a very high level, we have the following setup:
- Ethernet based management network, served through the Cisco Catalyst 4948 switch.
- InfiniBand internal network, served through InfiniBand switches in a redundant configuration for high availability. This network facilitates all the internal communications within the Engineered Systems framework.
- Ethernet based external world connectivity. In Exadata and SuperCluster, this is achieved via physical 10Gbps Ethernet from individual hosts; there are dedicated 10Gbps NICs installed in the hosts, and their switching environment is outside of the rack. In Exalogic, this is achieved via virtual 10Gbps Ethernet from individual hosts, which we have been referring to as EoIB. From the hosts' view, there is no additional hardware or cable; the same IB media path carries this traffic as well.

Next time, I will talk more about the virtual networks that are carried over this physical network. Thanks for reading, and I welcome all your comments and questions.


InfiniBand Building Blocks

While writing this article, I will be referencing Oracle's latest Engineered Systems, which are built upon InfiniBand technology. Yes, I am talking about the Exalogic, Exadata and SuperCluster platforms. In the smallest, most minimalistic configuration, we have a set of hosts connected to a pair of InfiniBand switches in a dual star topology. Why two? To provide redundancy. Each host has one dual-port IB HCA, and these ports connect to independent IB switches via QSFP copper cables. The switches also have inter-switch links to provide alternate paths across hosts. This pretty much completes our hardware configuration and connectivity inside the rack. The picture below shows a basic connectivity block diagram. The notion of active and passive is explained further below.

Next come the software components. Each host has the required IB software stack built into the operating system. Each IB switch also has its own software implementation to understand and manage the connected end points. In Linux computing environments, the IB software is based on some version of OFED. One special software component worth mentioning here is the Subnet Manager. If this software is not enabled in the network, then what we get is an unmanaged InfiniBand network. This is not something we want. The main purpose of the subnet manager in an IB network is to enable communication paths across attached hosts, monitor physical changes in the network periodically, and adjust accordingly. In the context of this article, this role is taken up by the IB switches. Now, we have more than one switch in the rack, so which one? The answer is either one, or more than one for redundancy. There is a messaging protocol amongst connected instances of the subnet manager, and they negotiate with each other over which one will actually serve the role of subnet management. That one is known as the Master Subnet Manager. If there are more, they stay as Standby Subnet Managers.
In case of a failure on the master switch, the next switch, chosen by predefined criteria, takes up the role. To give you some details: the subnet manager sweeps the fabric periodically for physical changes, assigns LIDs to end points, creates forwarding tables based on the routing algorithm specified in a config file, and performs a few more critical functions which I will defer to a later discussion. With this setup, you are now ready to communicate over layer 2 of the OSI model. I mentioned earlier that the technology remains transparent to the upper level protocols (ULP), so IP addresses for layer 3 are assigned at individual hosts, just like in Ethernet based networks. For redundancy, we have the pair of IB interfaces in a bonded interface in active-standby mode, with an IPv4 address assigned. Let me remind you here that redundancy, or high availability, is achieved from the hosts' perspective at layer 3. Both links from the hosts are always active from the switch and InfiniBand network perspective. Let me show you an example to make this clearer. The screenshot below shows the status of a host's InfiniBand ports. You can see they are both 'Active' with LIDs assigned, and the rate is 40, which means they have auto-negotiated to 4X QDR. The next screenshot shows the layer 3 configuration, which is IPoIB. Let's look at the bonding status now. The following screenshot shows that interface ib0 is active while ib1 is standby. So, this redundancy and high availability is perceived from the host at layer 3. Other hosts in the connected fabric will look similar, with their own IPoIB addresses. This concludes the basic setup of an InfiniBand network, and from here on, we should be able to customize and fine-tune further in order to utilize this high speed, efficient switching fabric for our upper layer applications and protocols. In my next section, I will write more about how and where to go from this point: querying the neighbors, checking fabric health, communicating with other hosts and so on.
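The port and bonding state described above can also be checked from the shell. The bonding status below is a made-up sample in the format of /proc/net/bonding/bond0; on a live host you would run `ibstat` and read the bonding file directly.

```shell
# On a live host:  ibstat                          # port state, LID, rate
#                  cat /proc/net/bonding/bond0     # bonding status
# Made-up sample bonding status for illustration:
sample='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: ib0
MII Status: up

Slave Interface: ib0
MII Status: up

Slave Interface: ib1
MII Status: up'

# Extract which slave is carrying traffic right now:
active=$(printf '%s\n' "$sample" | awk -F': ' '/Currently Active Slave/ {print $2}')
echo "active slave: ${active}"
```

Both slave interfaces report 'up' because, as noted above, both links are always active from the fabric's perspective; only the host's layer 3 view treats ib1 as standby.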


InfiniBand Vocabulary

Before I talk more about InfiniBand, I thought it's a good idea to have a blog space with a collection of all the terms and abbreviations we use. The InfiniBand jargon, per se. The first one is easy... the list is long... so let's keep it live and add more as we progress.

IB - InfiniBand.
IBTA - InfiniBand Trade Association. They deserve the top position here, being the founder and maintainers of the InfiniBand specifications since 1999.
OFED - Open Fabrics Enterprise Distribution. This is an open source, community driven software package for InfiniBand. The group is also active in interoperability, workshops, architectures and protocol development.
OSI - Open Systems Interconnection. A standard for communication systems and modern day networking based on seven layers.
HCA - Host Channel Adapter. A piece of hardware installed inside an end point participating in an IB network, similar to the network interface card (NIC) you see in the Ethernet world. In the OSI model, this enables Layer 1.
MAC - Media Access Control. An entity is worthless without an identity. The MAC address provides the identity of an end point in the network at the hardware layer. In the OSI model, this enables Layer 2.
GUID - Globally Unique Identifier. To keep it simple, this is a fixed hardware address of an end point participating in an IB network. Conceptually it is similar to a MAC address but has a longer address length.
LID - Local Identifier. A 16-bit address assigned dynamically to end points in an operational IB network. In the OSI model, this enables Layer 2 switching. You may wonder: if MAC and GUID addresses are similar to each other, why does IB introduce another number called the LID? Well, this is an IB implementation to ensure sequential and simplified addressing within a network. Who would like to remember those long hex format GUIDs :) On the flip side, the limitation is that we cannot have more than 2^16 end points.
SLID - Source LID. In a messaging sequence, this is the originator of the packet.
DLID - Destination LID. And this is the destination of the packet.
SM - Subnet Manager. A software implementation that takes care of IB network management. I will have more on this topic later, but in short, the SM is responsible for monitoring the connected network periodically for any changes, assigning LIDs to end points, creating switching tables, managing quality of service (QoS), etc.
IPoIB - Internet Protocol over InfiniBand. We are moving up the OSI layered stack now. Remember, in my last blog post I mentioned that our socket based messaging still works over IB. IPoIB is the first step in enabling that. What it means is that we simply assign a Layer 3 IP address to an underlying IB device. The address can be IPv4 or IPv6.

When it comes to networks, we talk about numbers like 10/100/1000/10000 Mbps (Ethernet), 11/54/150/300 Mbps (WiFi) etc. It's about the signalling rate of bits. As a standard, I will use a small 'b' for bits and a capital 'B' for bytes.

SDR - Single Data Rate. This is the baseline, at 2.5 Gbps.
DDR - Double Data Rate. The next level from SDR. 5.0 Gbps.
QDR - Quad Data Rate. The next level from DDR. 10.0 Gbps.
SFP - Small Form-factor (Hot) Pluggable Transceiver. These are special connectors at the end of the cables we use for connecting network equipment. A '+' is added for the enhanced version, which is capable of 10Gbps signalling rates.
QSFP - Add Quad to the above mentioned SFP, and we get four SFP links in a single cable. A '+' version is also available for enhanced rates, and this is what gives us 4X 10Gbps. Sometimes you will also see a splitter cable with QSFP+ on one end and four SFP+ on the other.

Hmm... everybody says InfiniBand is capable of providing 40Gbps. How do we get that, then? Basically, it's done at the hardware level by the product manufacturers. The IB specification allows link aggregation at the hardware layer. Two possibilities are 4X and 12X. I am sure you already got it by now.
4X QDR InfiniBand provides us with a 40Gbps link. But can you actually see it from your application's perspective? The short answer is "No". Let's see why. I told you earlier that these numbers are signalling rates for bits - on the wire! Let me introduce you to yet another term.

8b/10b - 8 bits over 10 bits. This is an encoding mechanism at OSI Link Layer 2. For every 10 bits transmitted on the wire, 8 bits carry actual data. The other two bits are not waste; they are used to control the overall signalling. The math is simple: we use only 80% of the signalling rate, and 80% of 40Gbps = 32Gbps.

Okay! So can we see 32Gbps throughput then? Hmm... maybe not. Without going into too much detail, I will simplify it for you. A few factors further reduce the realistic throughput available to an end user's applications. The first is hardware capability: compute machines interface with the IB HCA through a PCI or PCI-E bus, which has its own limitations and hence reduces our throughput. The second is communication protocol overhead: there may be a few layers between our IB hardware and the actual application, based on OSI's 7-layer model, and each layer introduces some processing overhead. So overall, our throughput is reduced from 40Gbps, and this is not a fixed number. Think of an InfiniBand network as a high speed freeway. It's up to you what and how you want to drive on it, but watch for the speed limit :) More next time... Thanks for your time.
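The 8b/10b math above is easy to sketch in the shell, using the figures from this post (4X link aggregation, QDR at 10Gbps per lane, 80% data efficiency):

```shell
lanes=4          # 4X link aggregation
lane_gbps=10     # QDR signalling rate per lane, in Gbps
signal=$(( lanes * lane_gbps ))   # 40 Gbps signalled on the wire
data=$(( signal * 8 / 10 ))       # 8b/10b: 8 data bits per 10 signalled bits
echo "signalling: ${signal} Gbps, data: ${data} Gbps"
```

This prints "signalling: 40 Gbps, data: 32 Gbps" - the same 80% figure derived above, before PCI-E and protocol overheads shave it down further.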


InfiniBand: What about it ?

Heard the buzzword - InfiniBand? And wondering what it is? Here is some information to get you started. I am quite sure you are already familiar with the more common networking technologies, like Ethernet and the various wireless media in use these days. InfiniBand is yet another, but it does not reach into our daily lives as much as the others, and that is probably the reason you are still interested in reading about it here :) InfiniBand is meant to provide the interconnect for high end computing environments, offering high bandwidth under extremely low latency. In other words, it enables computing end points to exchange more data, faster. Let's compare InfiniBand with Ethernet based on the various product offerings today. Ethernet most commonly offers 1Gb/s and 10Gb/s bandwidth. InfiniBand offers up to 40Gb/s bandwidth, with lower latency than observed on Ethernet media. I would like to point out that these are raw bandwidths; the actual throughput is usually lower and depends on the messaging protocols across end points. I will talk about this more later. In recent years of technology evolution, computing platforms' capabilities have reached a point where they can use a better, higher speed network to communicate with peer platforms more efficiently. We use the term bottleneck when such scenarios occur. In highly demanding computing environments, InfiniBand solves this problem by allowing computers to exchange more data, faster. So, what do you need to get on this high speed data highway? It is not likely that your existing equipment will work. You are right! InfiniBand requires specialized hardware. Each computing end point needs an I/O card that we call a Host Channel Adapter, or HCA. These connect to InfiniBand switches using special cables engineered to carry your data at this high rate with precision. Oh wow! So, do I need to rewrite my applications? I do not have time to do that! I know you will ask this at this point. The answer is "no".
Before I go any further, let me just state that InfiniBand follows the well known industry standard model for networking, known as Open Systems Interconnection, or OSI. This model defines seven layers, and just as with Ethernet, they apply to InfiniBand as well. Now, let me come back to the original point. We don't need to rewrite our applications, because InfiniBand technology enables very seamless integration. The new hardware we just talked about integrates and presents itself to your application in a very similar way to Ethernet. Your view into the network remains the same, and you continue to interact with sockets comprised of IP addresses and ports. That's all for this blog. I will come back with more information later and open up the topic in detail. Thanks for reading!

