Tuesday Jul 28, 2009

A New GUI monitoring tool for SGE

For many years I've been begging for a web/GUI-based monitoring tool for SGE besides Qmon. While xml-qstat by the BioTeam is a great tool, SGE needed something that is officially supported for our customers. Well, the latest SGE 6.2 update 3 release includes a new monitoring tool called Inspect, which is developed purely in Java and uses JMX, so your SGE installation has to have JMX enabled.

Here are some screenshots of the Inspect GUI provided by Chris Dag. Besides SGE, Inspect can also monitor SDM. Now, the last thing that is lacking is a GUI-based job submission interface...

Tuesday Apr 14, 2009

Nehalem Memory Configuration

Sun is announcing several new Nehalem-based servers today, so do watch out for the news. Among other things, we are also announcing a Sun Rapid Solution for highly scalable storage based on Lustre. I'll blog about it more soon.

Speaking of Nehalem, a few days ago I read a nice article giving an introductory guide to the different memory configurations for the Xeon 5500, the official model name for Nehalem. Those who understand the new Nehalem architecture will know that there are performance implications in how you configure the memory.

Here is my summary of the main points:

\* Nehalem supports 3 memory channels per socket, 3 DIMMs per memory channel.

\* All memory channels must be filled with at least 1 DIMM, else there will be a performance hit.

\* Memory speed varies by the number of DIMMs in each channel:
1 DIMM in each memory channel gives the best performance at 1333 MHz
2 DIMMs in any memory channel drops the speed to 1066 MHz
3 DIMMs in any memory channel drops the speed to 800 MHz

\* It is strongly recommended to use a "balanced" configuration - meaning filling up the memory channels in 3s. So that would be 3 DIMMs (1 per channel), 6 DIMMs (2 per channel) or 9 DIMMs (everything populated). An unbalanced memory config can reduce bandwidth by as much as 23%.

\* Use same-size DIMMs, else you will lose 5-10% memory bandwidth.

\* For Nehalem processors with a QPI speed of 6.4 GT/s and using three 1333 MHz DIMMs (one per memory channel) per socket, expected memory bandwidth is 35 GB/s. For 1066 MHz and 800 MHz, expected about 32 GB/s and 25 GB/s respectively.
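To see how the guidelines above apply to a particular box, you can inspect the actual DIMM population and speed from the OS. A minimal sketch using dmidecode (requires root; the exact field names vary by BIOS vendor, so treat the grep pattern as an assumption):

```shell
# List DIMM slot, size and configured speed from the SMBIOS tables.
# Falls back to a message when dmidecode is missing or needs root.
dmidecode -t memory 2>/dev/null | grep -E 'Locator|Size|Speed' \
  || echo "dmidecode unavailable (needs root or the dmidecode package)"
```

Counting the populated slots per channel locator then tells you whether the node is in a balanced 3/6/9-DIMM configuration.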

If you are totally new to Nehalem and have no idea about the new memory design or what QPI or Turbo Boost is, this other article might be useful for you.

Tuesday Mar 24, 2009

New SGE screencasts

I just scanned through the SGE user community mailing list and saw postings of two new SGE-related screencasts that I thought I'd share in this blog entry. The first is a video by Lubomir Petrik
demonstrating the new GUI installer from the latest SGE 6.2u2 version. You can see the screencast at his blog.

The other screencast is by James Coomer, Sun HPC architect, showing the use of Sun Secure Global Desktop working with SGE and integrating xml-qstat for monitoring. The neat demo can be viewed on YouTube here.

Wednesday Mar 04, 2009

Asia Top 500

A new Top500 list has been started just for Asia and the Middle East. As someone who was born, raised and is working in Asia, I'm intrigued by this list. Submissions are now open and the first list will be published in Oct'09. Like the Top500, the Asia Top500 will be ranked by Linpack performance.

Friday Feb 27, 2009

Tuning SLES10

The Sun Compute Cluster currently uses SLES10 as the OS of choice for the compute nodes. While working on the software stack for the solution, I'm currently looking for more information on tuning the performance of SLES10. Of course, the first step is to turn off the unnecessary services that a typical compute node will not require. I took a look at a freshly installed server with the default installation settings and found these running services that I think can be safely turned off:


I would also change the default runlevel to 3 to prevent starting up of X. If anyone has other suggestions or tuning guide, please do let me know!
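Both steps can be sketched as shell commands. The service names below are hypothetical examples (the actual list depends on your install), and the runlevel edit is demonstrated on a scratch copy of the file rather than the real /etc/inittab:

```shell
# Demonstrate the runlevel change on a scratch copy of inittab;
# on a real node you would edit /etc/inittab itself.
printf 'id:5:initdefault:\n' > /tmp/inittab.demo
sed 's/^id:5:initdefault:/id:3:initdefault:/' /tmp/inittab.demo > /tmp/inittab.new
cat /tmp/inittab.new

# Disable unneeded services with chkconfig (example names only;
# check your own freshly installed server for the real list):
for svc in cups postfix alsasound; do
  command -v chkconfig >/dev/null 2>&1 && chkconfig "$svc" off \
    || echo "would run: chkconfig $svc off"
done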

Wednesday Feb 18, 2009

Rocks Cluster Customization Part 1

Rocks is a great HPC management software suite that allows very quick and easy deployment of HPC clusters. Using the default configuration and software packages that come with Rocks, a typical average-size cluster can be ready in no more than a few days. Besides doing bare-metal provisioning of the OS, additional software packaged as Rocks rolls can be added to the base Rocks distribution. Once these rolls are installed, provisioning a compute node will also automatically deploy and configure the corresponding services.

Rocks uses its own scheme for assigning hostnames. The suggested order of provisioning compute nodes is to start from the first rack, from the top server to the bottom server. The default hostnames are in the form "compute-x-y", where x is the rack number and y is the position of the node in the rack. For example, the topmost node in the first rack will be compute-0-0, the second node compute-0-1 and so on. The nodes in the next rack will then be compute-1-0, compute-1-1...
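Spelled out, the default naming for two racks of three nodes each would come out in this provisioning order (a toy illustration, not a Rocks command):

```shell
# Default Rocks naming: compute-<rack>-<rank>, racks and ranks from 0.
for rack in 0 1; do
  for rank in 0 1 2; do
    echo "compute-${rack}-${rank}"
  done
done
# → compute-0-0, compute-0-1, compute-0-2, compute-1-0, compute-1-1, compute-1-2
```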

Although this works well for most, it definitely does not work for all. There are several ways to change the hostnames, depending on how customized you want them to be. If you want to keep the numbering scheme but just want to change the prefix (e.g. to mynode-0-0), what you can do is add a new appliance.

$ rocks add appliance mynode membership="My Compute Node" node=compute

What the command does is add a new appliance using the same compute configuration. Then, when provisioning new nodes, select the "My Compute Node" appliance in the insert-ethers main menu. When a new node is provisioned, it will be assigned a hostname like mynode-x-y.

If you want to go to the extent of changing the hostname numbering, what you can do is fix the hostname and IP address in the Rocks database.

$ rocks add host n00 membership=compute rack=1 rank=1
$ rocks add host n01 membership=compute rack=1 rank=2
...
$ rocks add host n07 membership=compute rack=1 rank=8
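Rather than typing each command by hand, a loop can generate them. A sketch (it echoes the commands so you can review them first; pipe the output to sh on the frontend, or drop the echo, to actually run them):

```shell
# Generate "rocks add host" commands for eight nodes in rack 1,
# named n00..n07 with ranks 1..8.
for i in $(seq 0 7); do
  name=$(printf 'n%02d' "$i")
  rank=$((i + 1))
  echo rocks add host "$name" membership=compute rack=1 rank="$rank"
done
```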

In addition, if you want to assign the IP addresses manually:

$ rocks add host interface n00 iface=eth0 ip= mac=00:11:22:33:44:EE subnet=private name=n00
$ rocks add host interface n01 iface=eth0 ip= mac=00:11:22:33:44:FF subnet=private name=n01

Or if you want to configure a 2nd interface (e.g. ib0), you can add a new network and configure the interface in the Rocks database.

$ rocks set network IB subnet= netmask=
$ rocks add host interface n00 iface=ib0 ip= mac=00:11:22:33:44:55:AA subnet=IB name=n00-ib
$ rocks add host interface n01 iface=ib0 ip= mac=00:11:22:33:44:55:BB subnet=IB name=n01-ib

After the changes are made, make sure to sync the configurations.

$ rocks sync config
$ make -C /var/411 #use "make -C /var/411 force" if necessary, but it's slower
$ insert-ethers --update

Tuesday Feb 03, 2009

Good pNFS Overview

There is a good introductory article on pNFS for those who are interested:


Wednesday Jan 21, 2009

Windows HPCS 2008 Infiniband support

Looks like Microsoft HPCS is gaining more momentum in the HPC arena, with Mellanox announcing "its line of 20 and 40Gb/s InfiniBand adapters has passed Microsoft Windows Hardware Quality Labs testing for Microsoft Windows HPC Server 2008."

Full story here.

On the OpenFabrics front, the current version of WinOF is 2.0, with support for protocols like IPoIB, SRP, OpenSM and Network Direct. The roadmap for future releases (copied from here):

WinOF 2.1 scheduled release in June'09; functionality freeze in April'09
\* Connected mode IPoIB (NDIS 5.1)
\* OFED IB verbs API in place.
\* WinVerbs and rdma_cm fully supported.
\* Some OFED IB diagnostics
\* Server 2008 HPC supported via WDM install.
\* IPoIB enhanced error logging to the system event log.

WinOF 2.2 release in Q4'09; freeze Q3'09
\* Connected mode IPoIB based on NDIS 6.0
\* QLogic HCA driver support.

Monday Dec 22, 2008

Solution Factory and HPC

I'm a little late in posting this, so this may be old news to some of you. Two quarters ago, I moved to a new team that was put together to develop new solutions that are easily repeatable, pre-configured, pre-tested and quickly deployable. What this means to customers is less risk, reduced costs and higher efficiency. We have announced three new solutions, branded as the Sun Rapid Solutions, targeted at specific customer network infrastructure requirements around global web buildout, datacenter efficiency and high performance computing.

I'm part of the team that works on the HPC solution called Sun Compute Cluster. The solution is designed to provide customers with a simple, flexible, and scalable approach to addressing their compute-intensive environments, enabling them to quickly deploy a tested, reliable, and efficient architecture into production. It uses a Linux software stack comprising Sun software like Sun Grid Engine, xVM Ops Center, HPC ClusterTools and Sun Studio. Customers can also choose to use the new Sun HPC Software, Linux Edition, or any other software stack they wish by engaging Sun Professional Services or partners for customization.

Friday Jun 27, 2008

IDC: Software is #1 roadblock

The recent IDC presentation at the ISC2008 conference stated that software is the biggest roadblock for HPC users now... I couldn't agree more. As clusters grow larger and more complex, better management tools are required. Managing and monitoring an HPC cluster typically involves different bits and pieces of different tools; setting up and operating the cluster becomes very difficult. Sun has recognized this and is taking a very serious look at it, which is why we have the Sun HPC Software, Linux Edition. It is currently based on CentOS and still needs more work to become a complete, easy-to-use management solution. We have also started the HPC-Stack project under the OpenSolaris community for the OpenSolaris edition of the software stack.

Management software is only one piece of the puzzle. The other piece is the development tools and parallel libraries. As processors get more cores and the number of cores per cluster grows, many applications will need to be redesigned, and new programming paradigms are needed to ease development and improve efficiency.


Looks like more and more applications are moving towards nVidia GPUs and the CUDA SDK. SciFinance from SciComp is a code synthesis technology for building derivatives pricing and risk models. Just by changing certain keywords, it can generate CUDA-enabled code that is, according to the website, 30X-80X faster than the serial code. Also, check out the CUDA Zone website, which shows many successes in accelerating application performance using CUDA and nVidia's GPUs.

Tuesday Jun 03, 2008

Learning MPI tutorial

I came across a nice and simple tutorial for programmers who want a crash course in MPI; I think it is a good place to start. It does not go into detail about the message passing programming paradigm, but it gives step-by-step procedures for building GCC, setting up the environment variables, installing Open MPI and so on, on a Linux machine. A warning though: some knowledge of Linux/Unix is required. Once you get everything up and running, the tutorial walks you through several examples of increasing complexity. The last example is matrix multiplication, where it teaches how to parallelize the code using MPI.
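The overall flow of such a setup looks roughly like this. The install prefix and the commented-out build/run steps are illustrative placeholders, not taken from the tutorial itself:

```shell
# Typical from-source Open MPI setup, in outline:
# ./configure --prefix=$HOME/openmpi && make && make install
export PATH="$HOME/openmpi/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/openmpi/lib:$LD_LIBRARY_PATH"
# mpicc hello.c -o hello    # compile an MPI program with the wrapper compiler
# mpirun -np 4 ./hello      # launch it on 4 processes
echo "$PATH" | grep -q 'openmpi' && echo "Open MPI prepended to PATH"
```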

See the tutorial here:

Monday May 05, 2008

Apache Hadoop

Apache Hadoop is gaining a lot of attention in the web community, especially with support from Yahoo. It has a distributed filesystem and supports data-intensive distributed applications using the MapReduce computational model. It is viewed as an important piece of the puzzle in cloud computing, but can also be very useful for datamining types of applications. I think it won't be long before it catches attention in HPC, if it hasn't yet. With its high scalability and fault-tolerant nature, I think it has a lot of uses in HPC. Given the data-intensive nature, I wonder if there is any value in using Hadoop with Lustre. If anyone has any insight into the I/O characteristics, I'll be glad to hear about it.
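For those unfamiliar with the MapReduce model, the classic word-count example can be mimicked as a Unix pipeline: map splits input into words, the shuffle is a sort, and reduce counts the runs. Hadoop's contribution is running each of these stages distributed across nodes with fault tolerance:

```shell
# MapReduce in miniature: word count as map (tr) | shuffle (sort) | reduce (uniq -c)
printf 'lustre hadoop lustre\nhadoop hadoop\n' \
  | tr ' ' '\n' | sort | uniq -c | sort -rn
# → 3 hadoop, then 2 lustre
```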

Friday Apr 25, 2008

China Shanghai ERC 2008

Just want to share the recent events that took place in Shanghai last week. We had an HPC track during the China ERC and also organized two workshops - Sun Grid Engine and Lustre - for our customers and partners.

This is me giving the presentation during the Sun Grid Engine workshop. :)

More photos and details of the events here.

Monday Feb 11, 2008

TACC Ranger in Production

The TACC Ranger cluster is now fully up and running. It is the first Sun Constellation System (and hopefully many more to come!), consisting of "62976 CPU cores, 504 peak TFlops, 123 TBytes memory, and 1730 TBytes disk". Read more about it here.

Melvin Koh

