Use InfiniBand with Solaris X86 10/09 and HPC-ClusterTools 8.2

I will describe here how to set up HPC-ClusterTools (HPC-CT) 8.2 on Solaris 10 X86 10/09 (SunOS 5.10 Generic_141445-09) to run over an InfiniBand network (here a QDR IB fabric). Attention: as I am behind a firewall, I use very open and possibly insecure settings, avoiding passwords, etc. If your cluster is connected to the outside world, such settings could make it an easy target for hackers. This blog does not describe how to cable the nodes and configure the switches; I am counting on your IT admin to do this.
Set up a local NFS file system

In order to install HPC-CT, you need a shared filesystem visible from all nodes of your cluster.
In my case, I had to set this up first.
Let us call node0 your headnode (which will be the server of the NFS filesystem).
Start on node0:
%svcadm -v enable -r network/nfs/server
%mkdir /tools
%chmod 777 /tools
%share -F nfs -o rw /tools


Add the share command to /etc/dfs/dfstab:
%cat /etc/dfs/dfstab
share -F nfs -o rw /tools

and the filesystem will be shared automatically after a reboot.
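
If you do not want to reboot just to test this, shareall re-reads dfstab, and share without arguments lists what is currently exported. The output below is only what I would expect to see, not a captured transcript:
%shareall
%share
-               /tools   rw   ""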

Now on all other client nodes (node1 to nodeN) do
%mkdir /tools
%mount -F nfs node0:/tools /tools

and, so that the mount survives a reboot, add a line at the end of /etc/vfstab following the generic pattern server:/disk - /mount_point nfs - yes rw,soft:

%cat /etc/vfstab
node0:/tools   -   /tools         nfs   -   yes    rw,soft
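
A quick way to verify the mount on a client is df; the sizes shown below are of course just placeholders:
%df -h /tools
Filesystem             size   used  avail capacity  Mounted on
node0:/tools           500G    10G   490G     2%    /tools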
Password-free rsh

The next step is to get password-free rsh for root.
Edit or create a ~/.rhosts file containing the hostnames and the login:

%cat ~/.rhosts
node0 root
node1 root
node2 root
nodeN root

and add the hostnames to the file

%cat /etc/hosts.equiv
node0
node1
node2
nodeN

You should now be able to create files under /tools and run an rsh nodeN command from any node
without a password prompt.
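For example (node1 stands for any of your client nodes):
%rsh node1 uname -a
%rsh node1 touch /tools/testfile
%ls -l /tools/testfile

If both commands come back without asking for a password, you are set.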

Installing HPC-CT

Now it is time to install HPC-CT 8.2. Download the latest version from here.
Stay on your headnode, node0, and put sun-hpc-ct-8.2-SunOS-i386.tar.gz under the shared filesystem /tools:
%cd /tools
%gunzip -c sun-hpc-ct-8.2-SunOS-i386.tar.gz | tar xvf -
%cd sun-hpc-ct-8.2-SunOS-i386/Product/Install_Utilities/bin
%./ctinstall -n node0,node1,node2,nodeN -r rsh


For more information, see the HPC-CT installation guide here. You do not need to have the IB network up during the installation of HPC-CT; the interconnect is selected at run time, not at install time.
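
To check that the installation really landed on a node, ompi_info from the freshly installed tree is handy; it should also list the available BTL components, among them udapl, tcp, sm and self, which we will use below (the path is assumed from the install prefix used later in this post):
%/opt/SUNWhpc/HPC8.2/sun/bin/ompi_info | grep btl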

For the time being, Solaris uses the uDAPL protocol. This protocol requires that a TCP interface be up and running.
Check with

% ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
     inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
         inet 10.60.20.183 netmask ffffff00 broadcast 10.60.20.255
         ether 0:1e:68:2f:1d:9e

that this is the case. You can already try to run an MPI program by specifying the tcp interface (if you do not yet have an MPI binary, see the hello-world example after the hostfile):
%mpirun -np 2 -mca btl sm,tcp,self -mca plm_rsh_agent rsh -hostfile ./hostfile ./a.out
%cat hostfile
node0 slots=1
node1 slots=1
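
If you do not have an MPI binary at hand yet, a trivial hello-world is enough for this test. Put the source somewhere visible from all nodes, e.g. /tools; the mpicc wrapper path below is an assumption based on the install prefix used in the rest of this post:
%cat hello.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
%/opt/SUNWhpc/HPC8.2/sun/bin/mpicc -o a.out hello.c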

Configuring the IB interface

Check that the IB updates and packages are installed.
Run
%pkginfo -x | grep -i ib
Within the long list you should see something like this:
<snip>

SUNWhermon                        Sun IB Hermon HCA driver
SUNWib                            Sun InfiniBand Framework
SUNWibsdp                         Sun InfiniBand layered Sockets Direct Protocol
SUNWibsdpib                       Sun InfiniBand Sockets Direct Protocol
SUNWibsdpu                        Sun InfiniBand pseudo Sockets Direct Protocol Admin

<snip>
If you see nothing here, you will have to install the IB packages and patches from the install image.
If you are using an earlier version of Solaris 10 X86 (5/09), you can get these packages from here.
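
You can also check that the HCA driver itself is attached. On my hardware this is hermon (matching the SUNWhermon package above); older HCAs use the tavor or arbel driver. A simple, if crude, check:
%modinfo | grep -i hermon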

Check the /usr/sbin/datadm command:
%datadm -v
If it prints nothing, check whether you have this file:

%cat /usr/share/dat/SUNWudaplt.conf
#
# Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
# ident "@(#)SUNWudaplt.conf    1.3     08/10/16 SMI"
#
driver_name=tavor  u1.2 nonthreadsafe default udapl_tavor.so.1 SUNW.1.0 " "
driver_name=arbel  u1.2 nonthreadsafe default udapl_tavor.so.1 SUNW.1.0 " "
driver_name=hermon u1.2 nonthreadsafe default udapl_tavor.so.1 SUNW.1.0 " "
Run the following command on all nodes
%datadm -a /usr/share/dat/SUNWudaplt.conf
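
Since password-free rsh is already in place, you can also push this from node0 to the other nodes in one go instead of logging in everywhere (hostnames as above):
%sh -c 'for h in node1 node2 nodeN; do rsh $h datadm -a /usr/share/dat/SUNWudaplt.conf; done'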

Now datadm should display this
%datadm -v
ibd0 u1.2 nonthreadsafe default udapl_tavor.so.1 SUNW.1.0 " " "driver_name=hermon"

and you should have a file
%cat /etc/dat/dat.conf
ibd0 u1.2 nonthreadsafe default udapl_tavor.so.1 SUNW.1.0 " " "driver_name=hermon"


Now reboot all nodes.
If you have made no mistakes, they should all come back with the NFS-mounted directory /tools,
password-free rsh, and datadm -v returning the line shown above.

Check that the IB interface is visible under /dev:
%ls -l /dev/ib*
3120    2 lrwxrwxrwx   1 root     other         29 Nov 11 15:43 /dev/ibd -> 
../devices/pseudo/clone@0:ibd
92901    2 lrwxrwxrwx   1 root     root          72 Nov 16 10:09 /dev/ibd0 -> 
../devices/pci@0,0/pci8086,25f8@4/pci15b3,673c@0/ibport@2,ffff,ipib:ibd0


Here my interface is called ibd0. You may have another number at the end.

Now we have to configure the ibd0 interface. In my example, I decided to give the following IP addresses to the ibd0 interfaces (before doing this, check with ping that these addresses are really unused; see the quick check after the list):
node0  5.6.134.50
node1  5.6.134.51
node2  5.6.134.52

etc ...
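
Solaris ping accepts a timeout in seconds, so a quick pre-check per address can look like this; if the address is already in use and reachable, ping answers with "... is alive", otherwise it times out or reports that there is no route yet:
%ping 5.6.134.50 2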

Now run the ifconfig command on every node with its IP address.
On node0:

%ifconfig ibd0 plumb 5.6.134.50 broadcast 5.6.255.255 netmask 255.255.0.0 up

On node1:

%ifconfig ibd0 plumb 5.6.134.51 broadcast 5.6.255.255 netmask 255.255.0.0 up

etc

The ibd0 interface should now be plumbed and up, and show

%ifconfig ibd0
ibd0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 2044 index 4
      inet 5.6.134.50 netmask ffff0000 broadcast 5.6.255.255
      ipib 0:1:0:4a:fe:80:0:0:0:0:0:0:0:21:28:0:1:3e:5c:90


Finally, to make this interface persistent across reboots, you have to create on every node a file /etc/hostname.ibd0 that contains the IP address of the ibd0 interface (see also the note after the examples).
On node0:

%cat /etc/hostname.ibd0
5.6.134.50

and on node1
%cat /etc/hostname.ibd0
5.6.134.51

etc ...
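
As far as I know, the contents of /etc/hostname.ibd0 are simply handed to ifconfig at boot time, so you can optionally carry the netmask and broadcast in the same file instead of relying on /etc/netmasks; untested sketch for node0:
%cat /etc/hostname.ibd0
5.6.134.50 netmask 255.255.0.0 broadcast + up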

As a test, you should be able to ping all these IP addresses from all nodes.
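For example, from node0 (Solaris ping simply reports that the target is alive):
%ping 5.6.134.51
5.6.134.51 is alive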

Do a last sanity check by running
%ldd /opt/SUNWhpc/HPC8.2/sun/lib/openmpi/mca_btl_udapl.so
and verifying that all libraries are found.

Now you are ready to rock'n'roll and can run
%setenv LD_LIBRARY_PATH /opt/SUNWhpc/HPC8.2/sun/lib
%mpirun -np 2 -mca btl sm,self,udapl -mca plm_rsh_agent rsh -x LD_LIBRARY_PATH -hostfile ./hostfile ./a.out

with the very same hostfile as above:
%cat hostfile
node0 slots=1
node1 slots=1
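
If you want to see with your own eyes which BTL was picked, Open MPI's btl_base_verbose parameter makes the selection visible; the exact messages differ between releases, so treat this as a debugging aid rather than a fixed output:
%mpirun -np 2 -mca btl sm,self,udapl -mca btl_base_verbose 30 -mca plm_rsh_agent rsh -x LD_LIBRARY_PATH -hostfile ./hostfile ./a.out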


Some additional remarks

As you have seen from the examples above, HPC-CT will look for the best way to communicate with the hosts mentioned in the hostfile by picking the fastest available interconnect.
Let us suppose that node0 and node1 are connected over IB (as described above), while node2 and node3 are only reachable over the TCP network. Running
%mpirun -np 4 -mca btl sm,self,tcp,udapl -mca plm_rsh_agent rsh -x LD_LIBRARY_PATH -hostfile ./hostfile ./a.out
%cat hostfile
node0 slots=1
node1 slots=1
node2 slots=1
node3 slots=1

will use IB between node0 and node1 and the TCP network for the rest. If you were to impose the IB network everywhere by setting -mca btl sm,self,udapl, the run would fail with an error message.


Comments:

Is it possible to use RDMA+SDP? It should be faster than using IPoIB, which adds the TCP layer.

Posted by jobic on November 26, 2009 at 02:03 AM CET #

Can I set up NFS over RDMA? Would IOPS be faster? What about bonding channels?

Posted by sid wilroy on October 31, 2010 at 06:21 PM CET #
