Wednesday Nov 25, 2009

Use InfiniBand with Solaris X86 10/09 and HPC-ClusterTools 8.2

I will describe here how to set up HPC-ClusterTools (HPC-CT) 8.2 on Solaris 10 X86 10/09 (SunOS 5.10 Generic_141445-09) to run over an InfiniBand (here QDR IB) network. Attention: as I am behind a firewall, I use very open and possibly insecure settings, avoiding passwords, etc. If connected to the outside world, your cluster could become an easy target for hackers. This blog does not describe how to cable or configure the switches; I am counting on your IT admin to do this.
Set up a local NFS file system

In order to install HPC-CT, you need a shared filesystem that is visible from all nodes of your cluster.
In my case, I had to set this up first.
Let us call node0 your head node (it will be the server of the NFS filesystem).
Start on node0:
%svcadm -v enable -r network/nfs/server
%mkdir /tools
%chmod 777 /tools
%share -F nfs -o rw /tools

Add the share command into
%cat /etc/dfs/dfstab
share -F nfs -o rw /tools

and the directory will be shared automatically after a reboot.

Now on all other client nodes (node1 to nodeN) do
%mkdir /tools
%mount -F nfs node0:/tools /tools

and add a line at the end of

%cat /etc/vfstab
server:/disk   -   /mount_point   nfs   -   yes    rw,soft
node0:/tools -   /tools         nfs   -   yes    rw,soft
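
Because the same entry has to be added on every client (and the share line on the server), it can help to make the edit repeatable. The following is a minimal sketch in Bourne shell, not part of the original procedure: the function name add_line_once is my own, and it assumes a grep that understands -q, -x and -F (on Solaris 10 the XPG4 grep in /usr/xpg4/bin is the safer choice).

```shell
#!/bin/sh
# Append LINE to FILE unless an identical line is already present.
# add_line_once is a hypothetical helper name used only for this sketch.
# Usage: add_line_once FILE LINE
add_line_once() {
    file=$1
    line=$2
    # -x: match the whole line, -F: fixed string, -e: allow lines starting with "-"
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example against a scratch copy rather than the live /etc/vfstab:
# add_line_once /tmp/vfstab.test 'node0:/tools -   /tools         nfs   -   yes    rw,soft'
```

Running the helper twice with the same arguments leaves a single copy of the line, so it is safe to call it again after a partial setup.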
Password-free rsh

The next step is to set up password-free rsh for root.
Edit or create an .rhosts file containing the hostnames and the login name:

%cat ~/.rhosts
node0 root
node1 root
node2 root
nodeN root

and add the hostnames in the file

%cat /etc/hosts.equiv

You should now be able to create files under /tools and run an rsh nodeN command
without any password prompt.
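
For more than a handful of nodes, the .rhosts content can be generated instead of typed. This is only a sketch under my own assumptions: the function name make_rhosts is hypothetical, and the output is printed so that you can inspect it before redirecting it into ~/.rhosts.

```shell
#!/bin/sh
# Print one "<host> <user>" line per node, the layout used in ~/.rhosts above.
# make_rhosts is a hypothetical helper name used only for this sketch.
# Usage: make_rhosts USER HOST...
make_rhosts() {
    user=$1
    shift
    for host in "$@"; do
        printf '%s %s\n' "$host" "$user"
    done
}

# Example: prints the entries for three nodes to stdout.
make_rhosts root node0 node1 node2
```

Keep in mind the warning from the introduction: entries like these give root password-free access and are only acceptable on an isolated network.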

Installing HPC-CT

Now it is time to install HPC-CT 8.2. Download the latest version from here.
Stay on your headnode, node0, and put sun-hpc-ct-8.2-SunOS-i386.tar.gz under the shared filesystem /tools
%cd /tools
%gunzip -c sun-hpc-ct-8.2-SunOS-i386.tar.gz | tar xvf -
%cd sun-hpc-ct-8.2-SunOS-i386/Product/Install_Utilities/bin
%./ctinstall -n node0,node1,node2,nodeN -r rsh

For more information, see here for the HPC-CT installation guide. You do not need to have the IB network during the installation of HPC-CT: the interconnect is selected at run time, not at install time.

For the time being, Solaris uses the uDAPL protocol. This protocol requires a TCP interface to be up and running.
Check with

% ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
     inet netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
         inet netmask ffffff00 broadcast
         ether 0:1e:68:2f:1d:9e

that this is the case. You can already try to run an MPI program by specifying the TCP interface:
%mpirun -np 2 -mca btl sm,tcp,self -mca plm_rsh_agent rsh -hostfile ./hostfile ./a.out
%cat hostfile
node0 slots=1
node1 slots=1
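
When the hostfile grows, it is handy to know how many slots it declares before choosing the value for -np. The snippet below is a small sketch, not something shipped with HPC-CT: total_slots is a name I made up, and it simply sums the slots=N fields. On Solaris, nawk or /usr/xpg4/bin/awk is probably the better interpreter for it.

```shell
#!/bin/sh
# Sum the slots=N values of a hostfile read from stdin.
# total_slots is a hypothetical helper name used only for this sketch.
total_slots() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { split($i, kv, "="); total += kv[2] }
    }
    END { print total + 0 }'
}

# Example with the two-node hostfile shown above:
printf 'node0 slots=1\nnode1 slots=1\n' | total_slots
```

With a real file you would run it as total_slots < ./hostfile and compare the result with the -np value on the mpirun line.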

Configuring the IB interface

Check that the IB updates and packages are installed.
Run
%pkginfo -x | grep -i ib
Within a long list you should see something like this:

SUNWhermon                        Sun IB Hermon HCA driver
SUNWib                            Sun InfiniBand Framework
SUNWibsdp                         Sun InfiniBand layered Sockets Direct Protocol
SUNWibsdpib                       Sun InfiniBand Sockets Direct Protocol
SUNWibsdpu                        Sun InfiniBand pseudo Sockets Direct Protocol Admin

If you see nothing here, you will have to install the IB patches from the install image.
If you are using an earlier version of Solaris10 X86 (5/09), you can get these packages from here.

Check the /usr/sbin/datadm command
%datadm -v
If you see nothing, you have to check whether or not you have this file :

%cat /usr/share/dat/SUNWudaplt.conf
# Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
# ident "@(#)SUNWudaplt.conf    1.3     08/10/16 SMI"
driver_name=tavor  u1.2 nonthreadsafe default SUNW.1.0 " "
driver_name=arbel  u1.2 nonthreadsafe default SUNW.1.0 " "
driver_name=hermon u1.2 nonthreadsafe default SUNW.1.0 " "
Run the following command on all nodes
%datadm -a /usr/share/dat/SUNWudaplt.conf

Now datadm should display this
%datadm -v
ibd0 u1.2 nonthreadsafe default SUNW.1.0 " " "driver_name=hermon"

and you should have a file
%cat /etc/dat/dat.conf
ibd0 u1.2 nonthreadsafe default SUNW.1.0 " " "driver_name=hermon"

If you wish, reboot all nodes now.
If you have made no mistakes, they should all come back with the NFS-mounted directory /tools and
password-free rsh, and datadm should return the line shown above.
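
The datadm check can also be scripted, which is convenient when you run it over rsh on many nodes. This is a minimal sketch under my own assumptions: udapl_adapters is a hypothetical function name, it only parses text in the format of the datadm -v line shown above, and it needs an awk with -v support (nawk or /usr/xpg4/bin/awk on Solaris).

```shell
#!/bin/sh
# Read `datadm -v` output on stdin and print the adapter name (first field)
# of every entry registered for the given driver.
# udapl_adapters is a hypothetical helper name used only for this sketch.
# Usage: datadm -v | udapl_adapters DRIVER
udapl_adapters() {
    awk -v pat="driver_name=$1" 'index($0, pat) > 0 { print $1 }'
}

# Example on a node (not executed here):
# datadm -v | udapl_adapters hermon
```

An empty result means that no uDAPL entry is registered for that driver on the node.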

Check if the IB interface is seen under
%ll /dev/ib*
3120    2 lrwxrwxrwx   1 root     other         29 Nov 11 15:43 /dev/ibd -> 
92901    2 lrwxrwxrwx   1 root     root          72 Nov 16 10:09 /dev/ibd0 -> 

Here my interface is called ibd0. You may have another number at the end.

Now we have to configure the ibd0 interface. In my example, I decided
to assign the following IP addresses to the ibd0 interfaces:
(Before doing this, check with ping that these addresses are really unused ...)

etc ...

Now on every node run the ifconfig command with the correct IP address.
On node0

%ifconfig ibd0 plumb broadcast netmask up

on node1

%ifconfig ibd0 plumb broadcast netmask up


The ibd0 interface should now be plumbed and show

%ifconfig ibd0
ibd0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 2044 index 4
      inet netmask ffff0000 broadcast
      ipib 0:1:0:4a:fe:80:0:0:0:0:0:0:0:21:28:0:1:3e:5c:90

Finally, to make this interface persistent across reboots, you have to create on every node a file that contains the IP address for the ibd0 interface.
On node0

%cat /etc/hostname.ibd0

and on node1
%cat /etc/hostname.ibd0

etc ...

As a test, you should be able to ping all IP addresses from all nodes.

Do a last sanity check by looking at
%ldd /opt/SUNWhpc/HPC8.2/sun/lib/openmpi/
and check that all libraries are found

Now you are ready for rock'n roll and you can run
%setenv LD_LIBRARY_PATH /opt/SUNWhpc/HPC8.2/sun/lib
%mpirun -np  2 -mca btl sm,self,udapl -mca plm_rsh_agent rsh -x LD_LIBRARY_PATH -hostfile ./hostfile ./a.out

with the same hostfile as above
%cat hostfile
node0 slots=1
node1 slots=1

Some additional remarks

As you have seen from the examples above, HPC-CT looks for the best way to communicate with the hosts listed in the hostfile by choosing the fastest available interconnect.
Let us suppose that node0 and node1 are connected over IB (as described above), while node2 and node3 are on the TCP interconnect. Running
%mpirun -np 4 -mca btl sm,self,tcp,udapl -mca plm_rsh_agent rsh -x LD_LIBRARY_PATH -hostfile ./hostfile ./a.out
%cat hostfile
node0 slots=1
node1 slots=1
node2 slots=1
node3 slots=1

will use IB between nodes 0 and 1 and the TCP network for the rest. If you force the IB network by setting -mca btl sm,self,udapl, the run will fail with an error message.


Wednesday Nov 18, 2009

Porting Python 2.5 on Solaris 10 X86

Python is a foundation for many software developers, and the latest versions should configure correctly on Solaris 10 X86. If you use older versions, you may run into problems; in that case, use these workarounds.

I am using the SunStudio12 or later compiler release, normally installed under

Download and untar the Python file.

1. Edit the Python configure file (vi configure) and add the following after line 3931:
# disable check for SUN Studio cc since it seems to pass, but generates a warning
if test "$CC" = cc

Next change is in line 10971 :
              else LDSHARED='$(CC) -G';
where you have to add the $(BASECFLAGS)
             else LDSHARED='$(CC) $(BASECFLAGS) -G';

and finally in line 11073, where you have to change the -xcode option to -Kpic:
              else CCSHARED="-xcode=pic32";
             else CCSHARED="-Kpic";


./configure --prefix=your_prefix_path/Python-2.5 --enable-shared --without-gcc \
CC=/opt/SUNWspro/bin/cc LDFLAGS='-fast -m64 -xarch=sse2a' \
BASECFLAGS='-fast -m64 -xarch=sse2a'
make all
make install

Tip :

Sometimes you need to build your applications with the C++ libraries using STLport (instead of the standard libC).
There is no C++ in Python itself, but you can compile the main program with C++:

--prefix=/export/home/gunter/install64_stl/Python-2.5 --enable-shared
--without-gcc CC=/export/home/SUNWspro/bin/cc LDFLAGS='-fast -m64
-xarch=sse2a' BASECFLAGS='-fast -m64 -xarch=sse2a' --with-cxx-main='CC


Tuesday May 19, 2009

Dtrace solution for mission critical applications

Please first read Frederic's blog for a detailed introduction and for how the solution to this tricky problem was identified with dtrace.

In summary, a telephone application of the service and hosting provider Portrix had a problem: after a couple of days of running the application, the system time went up and eventually became so high that the voice quality suffered. Restarting the application was not a solution, and the machine had to be rebooted.

Our first suggestion, and the traditional approach, was to use the Sun profiler. We wanted to get three profiles: a first one at the beginning, a second after some normal load, and a last one when the system time is high. All three profiles can then be loaded into the Sun analyzer for an easy comparison.
This would have given us two things: first, a timeline showing where the unexpectedly high system time is spent, and second, a performance profile which can be used to tune the application.

The latest version of the Sun profiler has an option to attach to a running process and record a profile:

%collect -P <pid> -t 180

This should use dbx to attach, collect data for 180 seconds, and then detach and finish the experiment.

If you are using an older version of the profiler, you can do this through the debugger directly

  attach PID
  collector enable
  cont
  let it run for as long as you want and stop it with Control-C

Either of these methods uses the debugger to attach to a running process. The debugger has to stop the process to do this (this is why you see the continue after the attach command). Furthermore, there is another risk: the debugger and profiler may catch or delay signals that are important to the application. For a mission-critical application serving real users, this is completely unacceptable. At best, the application can handle the interrupts and users will experience a small delay; at worst, the telephone conversation is cut.

How can you possibly debug or profile an application without even being able to use a debugger? Well, Solaris has a solution: more than 30,000 probes come with the kernel and are waiting for you to listen to them. The D scripting language is used to enable these probes and to get formatted, readable output.

The most general way to query a running system is with these dtrace command lines. The first will grab the functions that are running, while the second will dive into the call stacks.

#dtrace -n 'profile-10ms{@a[func(arg0)]=count()} END{trunc(@a,100)}'
#dtrace -n 'profile-10ms{@a[stack()]=count()} END{trunc(@a,100)}' 

Start either of these probes, let it run for one minute and stop it with Control-C. Firing up the first one, we see the following output:

 unix`mutex_vector_enter                                         629
 unix`default_lock_delay                                         770
 SUNW,UltraSPARC-T1`cpu_mutex_delay                              889
 unix`cas_delay                                                 1214
 zaptel`process_timers                                          6242
 unix`cpu_halt                                                 40277 

while the second one looks into the stacks of the functions, and we see at the end of this list


From this listing you can already see that the ztdummy driver is doing strange things, and that it is the culprit of the high system time usage. This explains why restarting the application did not help: the driver was not restarted.
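
The function counts from the first listing become easier to read once the idle samples are taken out. The following is a rough sketch, not part of the original analysis: busy_share is a name I made up, treating unix`cpu_halt as idle time is my assumption, and the percentages only cover the lines you feed in, not the whole aggregation. It needs an awk with -v support (nawk or /usr/xpg4/bin/awk on Solaris).

```shell
#!/bin/sh
# Read "function count" lines (DTrace aggregation output) on stdin and print
# each function's share of the samples not spent in the idle function.
# busy_share is a hypothetical helper; the idle pattern defaults to cpu_halt,
# which is an assumption about what counts as idle time.
busy_share() {
    idle=${1:-cpu_halt}
    awk -v idle="$idle" '
        NF >= 2 {
            n++
            name[n] = $1
            count[n] = $NF
            if (index($1, idle) == 0) busy += $NF
        }
        END {
            if (busy == 0) exit
            for (i = 1; i <= n; i++)
                if (index(name[i], idle) == 0)
                    printf "%-40s %5.1f%%\n", name[i], 100 * count[i] / busy
        }'
}

# Example with the six lines of the first listing above:
busy_share <<'EOF'
 unix`mutex_vector_enter                                         629
 unix`default_lock_delay                                         770
 SUNW,UltraSPARC-T1`cpu_mutex_delay                              889
 unix`cas_delay                                                 1214
 zaptel`process_timers                                          6242
 unix`cpu_halt                                                 40277
EOF
```

For these six lines, zaptel`process_timers accounts for about 64% of the non-idle samples, which points in the same direction as the stack listing.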

The problem with the dtrace command lines above may be that the samples are taken across the entire system, in both user and kernel space, with no restriction to a PID, program, user or CPU. The profile probe fires with either arg0 or arg1 set: arg0 set means you are in the kernel, arg1 set means you are in user space. Typically you want to narrow it down to a PID or at least an execname (pid == $1, execname == "myprog", etc.). You can do this by adding a predicate such as /cpu==6/ before the action block, and putting either arg0 or arg1 into the func calls.

dtrace -n 'profile-10ms/cpu==6/{@a[func(arg0)]=count()} END{trunc(@a,100)}'
dtrace -n 'profile-10ms/pid==1234/{@a[func(arg0)]=count()} END{trunc(@a,100)}'

Or you can use this script (kindly provided by Jim Fiori), which restricts the profile to a single PID.

# cat uprof.d

#!/usr/sbin/dtrace -qs

BEGIN
{
        interval = $2 - 1;
}

profile-997
/arg1 && pid == $1/
{
        @s1[ustack(5)] = count();
        @s2[ufunc(arg1)] = count();
        s1total++;
}

tick-1sec
/--interval <= 0/
{
        printf("\nHottest stacks...\n");
        trunc(@s1, 30);
        printa(@s1);
        printf("\nHottest functions...\n");
        trunc(@s2, 30);
        printa(@s2);
        printf("\nTOTAL samples %d\n", s1total);
        exit(0);
}

# chmod 755 uprof.d
# ./uprof.d <pid> <#seconds>

There is another tool in the Sun Studio war chest, project D-Light, that will allow you to attach to a running process and observe whatever attributes you want. Under the covers it uses DTrace, so anything you can collect with DTrace can be shown in D-Light. Since D-Light is graphical (think of the timeline view that Analyzer has), the best DTrace data to display has time on the X-axis and the data you want to observe on the Y-axis. Here is a whitepaper that gives an overview of this.


Wednesday Apr 22, 2009

Attention with Lustre 1.6.7 and new configuration whitepaper

Lustre 1.6.7 had some serious bugs with the MDT server, and it was withdrawn from the download list. These bugs have been fixed, and a new version has been placed in the download area. If you downloaded or used Lustre 1.6.7, please upgrade. Personally, I had a problem mounting the Lustre filesystem (for both OSS and MDT). After creating the volumes and installing the patches, here on a RedHat CentOS 5.2 machine, I could run
# mkfs.lustre --fsname lustre --mdt --mgs /dev/vg00/mdt

but then I could not mount it
# mount -t lustre /dev/vg00/mdt /mdt
mount.lustre: mount /dev/vg00/mdt1 at /mdt failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
Note 'alias lustre llite' should be removed from modprobe.conf

and /proc/filesystems lists only ldiskfs and no lustre.
With Lustre 1.6.6 everything works fine.

Furthermore, there is an excellent whitepaper explaining the configuration and benchmarks of two different hardware setups. The first one uses a Sun Fire X4250 OSS server connected to a Sun Storage J4200 array with 12 300GB SAS drives, while the other uses one Sun Fire X4540 server (THOR) with 48 internal 1TB 7200rpm SATA drives. The first one uses the disk in a RAID0, while the second uses a RAID6 setup. All configuration descriptions include a HA (high availability) version.  Please download this excellent paper from here.


Thursday Mar 19, 2009

Lustre quick start guide is available

As promised in my last blog, Torben Kling's Whitepaper about a step by step Lustre set-up is available now! Please get it from here. This whitepaper explains everything from the installation of Linux, creating the virtual volumes, downloading the Lustre packages, setting up a Metadata server, the Object Store Servers and the clients and finally some examples of managing the file system. Congratulations to Torben for this !

Tuesday Mar 03, 2009

Lustre Parallel File System for CFD analysis Part 2

As already said, Lustre stores files, or blocks of files, which are considered as objects, on one or more OSTs. This is called striping in the Lustre terminology. You will need striping:

- if your file is too large to be stored on a single OST.
- if the required aggregate bandwidth for a single file cannot be offered by a single OST.
- if a client, i.e., your program running on the cluster, needs more bandwidth than a single OSS can offer.

Lustre allows you to configure the number of stripes, the stripe size and the servers (OSS) to use, for every file, directory or directory tree.

Practically, the smallest recommended stripe size is 512 KB because Lustre tries to batch I/O into 512 KB chunks over the network. This is a good amount of data to transfer at one time.
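
To get a feeling for how stripe size and stripe count interact, the placement of a byte offset can be worked out by hand. The sketch below assumes plain round-robin (RAID0-style) striping; stripe_index is a name I made up, and the result is an index into the file's own list of objects, not an absolute OST number. The real layout is set with the lfs setstripe command, whose options I do not reproduce here because they differ between Lustre releases.

```shell
#!/bin/sh
# Which of a file's stripe objects holds a given byte offset, assuming
# round-robin striping. The result is an index 0..STRIPE_COUNT-1.
# stripe_index is a hypothetical helper name used only for this sketch.
# Usage: stripe_index OFFSET_BYTES STRIPE_SIZE_BYTES STRIPE_COUNT
stripe_index() {
    echo $(( ($1 / $2) % $3 ))
}

# 1 MB stripes over 4 objects: byte offset 5 MB falls into chunk 5, object 1.
stripe_index 5242880 1048576 4
```

With a stripe count of 1 the function always returns 0, so every chunk of the file ends up in the same object.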

You may run into the problem that the file you are using is treated as a single object. In this case, the file (or object) is stored on a single OST, i.e., a single disk, and you do not see any performance improvement.

Tuesday Feb 24, 2009

Lustre Parallel File System for CFD analysis

Whether you are looking at crash simulations, implicit-explicit computations, or CFD analysis, all of which compute numerical solutions for very different physical models, they have one thing in common: the size of the data sets becomes bigger and bigger. This is true for the input data, the temporarily computed scratch files, and the final output data. Generally, I/O times have been considered small compared to the run times of the solver. This may no longer be true today. Not all ISV codes offer a parallel I/O option, and when they do, it is not always easy to use.

Look, for example, at the StarCD input files for StarCD V3, V4 and finally StarCCM+: the same geometry, a 34M-cell case, uses 3.5GB for the StarCD V3 .geom file and climbs to 5.1GB when converted into the V4 .ccmg file (or the CCM+ .sim file).

Thursday Jan 08, 2009

Blogging is like HPC benchmarking

HPC benchmarks are characterized by the urgency to run them, and by the common thought that everything is doable in a single day.

Apparently, this is very similar to setting up a blog. What was thought to be a five-minute exercise took the whole day (minus lunch time, as I live in France and lunch time is a French must).

It took the whole morning to prepare my iMac to run a VNC server. As I am running MacOSX 10.3.9, many applications do not work, as they need 10.4. Finally I could install the "Vine Server" from Redstone. Then, I had to configure IP forwarding on my Linksys router. After this, we entered the already mentioned lunch break.

This can be compared to finding the desired benchmark hardware and installing the desired operating system and the correct software versions. By the way, OpenSolaris is out, and running the most stunning desktop I have ever seen.

During the afternoon, back to business, with my boss running a VNC client on my desktop ... no more secrets from now on. The next pain we had to go through was to reconfigure my SUN accounts, as I already had an older private subscription. Finally, I managed to change the settings in my account, and I could use the default SUN account for login. For quite a while, I felt like a dog biting its tail ..

Again, this is very similar to benchmarking, when you try to enter the NFS-mounted directory and you see that the NFS server is down, or that the FlexLM license is not working. By the way, the latest FlexLM problem was due to the geographical location of the physical server .. I still wonder how they can figure this out, but they can ..

Now, we had to create three additional accounts, on Technorati, Google and FeedBurner ... and finally, yes, it was a five-minute exercise.

In parallel to HPC benchmarking, you can now take your scripts and execute the data sets for 10 minutes of run time .... and your day is over.

Here my day was over as well, but as I spent it virtually with my real boss, he did not ask any questions about what I had done during the day.

