Monday May 03, 2010

Low Latency Scheduling with Oracle Coherence

I have been too caught up with my transition to Oracle, so I have not been able to spend much time writing my blog. Now that I'm part of Oracle, I have been spending some of my free time getting familiar with Oracle products, in particular Oracle Coherence.

Coherence provides a distributed in-memory data grid, something like a shared memory space across multiple systems. Similar competing products include Gigaspaces, Gemfire and Terracotta. Coherence is extremely versatile and can be used for different purposes, e.g. with application servers to scale your web application, in financial trading for low-latency transaction processing, and for high performance data-intensive computing.

Coherence provides a way to do low-latency scheduling of small independent tasks. Traditional HPC job schedulers like Grid Engine schedule jobs at a fixed interval (usually several seconds), which is fine for jobs that run for a few hours but not for many small tasks that need to be processed quickly. One way that Coherence addresses this is through the WorkManager interface.

Tuesday Jul 28, 2009

A New GUI monitoring tool for SGE

For many years I've been begging for a web/GUI-based monitoring tool for SGE besides Qmon. While xml-qstat by the BioTeam is a great tool, SGE needed something that is officially supported for our customers. Well, the latest SGE 6.2 update 3 release includes a new monitoring tool called Inspect, which is developed purely in Java and uses JMX, so your SGE installation has to have JMX enabled.

Here are some screenshots of the Inspect GUI provided by Chris Dag. Besides SGE, Inspect can also monitor SDM. Now, the last thing that is lacking is a GUI-based job submission interface...

Tuesday Apr 14, 2009

Nehalem Memory Configuration

Sun is announcing several new Nehalem-based servers today, so do watch out for the news. Among other things, we are also announcing a Sun Rapid Solution for highly scalable storage based on Lustre. I'll blog about it more soon.

Speaking of Nehalem, a few days ago I read a good article giving an introductory guide to the different memory configurations for the Xeon 5500, the official model name for Nehalem. Those who understand the new Nehalem architecture will know that there are performance implications to how you configure the memory.

Here is my summary of the main points:

\* Nehalem supports 3 memory channels per socket, and 3 DIMMs per memory channel.

\* All memory channels must be filled with at least 1 DIMM, otherwise there will be a performance hit.

\* Memory speed varies with the number of DIMMs in each channel:
  - 1 DIMM in each memory channel gives the best performance at 1333 MHz
  - 2 DIMMs in any memory channel drops the speed to 1066 MHz
  - 3 DIMMs in any memory channel drops the speed to 800 MHz

\* It is strongly recommended to use a "balanced" configuration, meaning filling up the memory channels in threes: 3 DIMMs (1 per channel), 6 DIMMs (2 per channel) or 9 DIMMs (everything populated). An unbalanced memory configuration will reduce bandwidth (by as much as 23%).

\* Use same-size DIMMs, or you will lose 5-10% of memory bandwidth.

\* For Nehalem processors with a QPI speed of 6.4 GT/s and using three 1333 MHz DIMMs (one per memory channel) per socket, the expected memory bandwidth is 35 GB/s. For 1066 MHz and 800 MHz, expect about 32 GB/s and 25 GB/s respectively.
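The speed rule above can be sketched as a quick shell check. Note that the DIMM counts here are illustrative assumptions of mine, not figures from the article: the most heavily populated channel determines the speed for the whole socket.

```shell
#!/bin/sh
# Effective memory speed per socket on Nehalem: the channel with the
# most DIMMs sets the speed for all channels. Example: a "balanced"
# 6-DIMM config, i.e. 2 DIMMs in each of the 3 channels.
max_dimms_per_channel=2

case "$max_dimms_per_channel" in
  1) speed=1333 ;;
  2) speed=1066 ;;
  3) speed=800 ;;
  *) echo "invalid: 1-3 DIMMs per channel" >&2; exit 1 ;;
esac

echo "Memory runs at ${speed} MHz"
```

So even a balanced 6-DIMM configuration already gives up the top 1333 MHz speed; capacity and bandwidth have to be traded off.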

If you are totally new to Nehalem and have no idea about the new memory design or what QPI and Turbo Boost are, this other article might be useful for you.

Tuesday Mar 24, 2009

New SGE screencasts

I just scanned through the SGE user community mailing list and saw postings of two new SGE-related screencasts that I thought I'd share in this blog entry. The first is a video by Lubomir Petrik demonstrating the new GUI installer from the latest SGE 6.2u2 version. You can see the screencast at his blog.

The other screencast is by James Coomer, Sun HPC architect, showing Sun Secure Global Desktop working with SGE and integrating xml-qstat for monitoring. The neat demo can be viewed on YouTube here.

Wednesday Mar 04, 2009

Asia Top 500

A new Top500 list has been started just for Asia and the Middle East. Having been born, raised and now working in Asia, I'm intrigued by this list. Submissions are now open and the first list will be published in Oct'09. Like the Top500, the Asia Top500 will be ranked by Linpack performance.

Friday Feb 27, 2009

Tuning SLES10

The Sun Compute Cluster currently uses SLES10 as the OS of choice for the compute nodes. While working on the software stack for the solution, I've been looking for more information on tuning the performance of SLES10. Of course the first step is to turn off the unnecessary services that a typical compute node does not require. I took a look at a freshly installed server with the default installation settings and found these running services that I think can be safely turned off:


I would also change the default runlevel to 3 to prevent X from starting up. If anyone has other suggestions or tuning guides, please do let me know!
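As a sketch, disabling a batch of services can be scripted. The service names below are hypothetical examples of common SLES10 defaults, not the exact list I found:

```shell
#!/bin/sh
# Dry run: print the chkconfig commands instead of executing them.
# Drop the "echo" (and run as root) to actually disable the services.
# Service names here are illustrative, not the exact list from this post.
for svc in cups postfix alsasound nfs autofs; do
    cmd="chkconfig $svc off"
    echo "$cmd"
done

# The default runlevel is set in /etc/inittab; runlevel 3 (no X) means:
#   id:3:initdefault:
```

A dry run like this is handy for reviewing the list before touching a production node.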

Wednesday Feb 18, 2009

Rocks Cluster Customization Part 1

Rocks is a great HPC management software package that allows very quick and easy deployment of HPC clusters. Using the default configuration and software packages that come with Rocks, a typical average-size cluster can be ready in no more than a few days. Besides doing bare-metal provisioning of the OS, additional software packaged as Rocks rolls can be added to the base Rocks distribution. Once these rolls are installed, provisioning a compute node will also automatically deploy and configure the corresponding services.

Rocks uses its own way of assigning hostnames. The suggested order of provisioning compute nodes is to start from the first rack, from the top server to the bottom server. The default hostnames are in the form "compute-x-y", where x is the rack number and y is the position of the node in the rack. For example, the topmost node in the first rack will be compute-0-0, the second node compute-0-1 and so on. The nodes in the next rack will then be compute-1-0, compute-1-1...

Although this works well for most, it definitely does not work for all. There are several ways to change the hostnames depending on how customized you want them to be. If you want to keep the numbering scheme but just change the prefix (e.g. to mynode-0-0), you can add a new appliance.

$ rocks add appliance mynode membership="My Compute Node" node=compute

What the command does is add a new appliance using the same compute configuration. Then when provisioning new nodes, select the "My Compute Node" appliance in the insert-ethers main menu. When a new node is provisioned, it will be assigned a hostname like mynode-x-y.

If you want to go to the extent of changing the hostname numbering, you can fix the hostname and IP address in the Rocks database.

$ rocks add host n00 membership=compute rack=1 rank=1
$ rocks add host n01 membership=compute rack=1 rank=2
$ rocks add host n08 membership=compute rack=1 rank=8

In addition, if you want to assign the IP addresses manually:

$ rocks add host interface n00 iface=eth0 ip= mac=00:11:22:33:44:55:GG subnet=private name=n00
$ rocks add host interface n01 iface=eth0 ip= mac=00:11:22:33:44:55:FF subnet=private name=n01

Or if you want to configure a 2nd interface (e.g. ib0), you can add a new network and configure the interface in the Rocks database.

$ rocks set network IB subnet= netmask=
$ rocks add host interface n00 iface=ib0 ip= mac=00:11:22:33:44:55:AA subnet=IB name=n00-ib
$ rocks add host interface n01 iface=ib0 ip= mac=00:11:22:33:44:55:BB subnet=IB name=n01-ib

After the changes are made, make sure to sync the configurations.

$ rocks sync config
$ make -C /var/411 #use "make -C /var/411 force" if necessary, but it's slower
$ insert-ethers --update

Tuesday Feb 03, 2009

Good pNFS Overview

There is a good introductory article on pNFS for those who are interested.

Wednesday Jan 28, 2009

Eli Lilly uses Cloud Computing

Anyone who thinks that cloud computing is not real and that no one really uses it seriously should think again. Here is an article on InformationWeek reporting that Eli Lilly "uses Amazon Web Services and other cloud services to provide high-performance computing, as needed, to hundreds of its scientists". The advantage of using Amazon WS is that "a new server can be up and running in three minutes... and a 64-node Linux cluster can be online in five minutes (compared with three months internally)". I'd advise everyone to take what it says with a pinch of salt, but it's still cool to know about commercial uses of AWS and what Eli Lilly is planning next.

Read the full article here.

Wednesday Jan 21, 2009

Windows HPCS 2008 Infiniband support

Looks like Microsoft HPCS is gaining more momentum in the HPC arena, with Mellanox announcing "its line of 20 and 40Gb/s InfiniBand adapters has passed Microsoft Windows Hardware Quality Labs testing for Microsoft Windows HPC Server 2008."

Full story here.

On the OpenFabrics front, the current version of WinOF is 2.0, with support for protocols like IPoIB, SRP, OpenSM and Network Direct. The roadmap for future releases (copied from here):

WinOF 2.1 scheduled release in June'09; functionality freeze in April'09
\* Connected mode IPoIB (NDIS 5.1)
\* OFED IB verbs API in place.
\* WinVerbs and rdma_cm fully supported.
\* Some OFED IB diagnostics
\* Server 2008 HPC supported via WDM install.
\* IPoIB enhanced error logging to the system event log.

WinOF 2.2 release in Q4'09; freeze Q3'09
\* Connected mode IPoIB based on NDIS 6.0
\* Qlogic HCA driver support.

Monday Jan 19, 2009

SGE GUI Installation

One of the nice new features in the upcoming update, SGE 6.2u2, is a GUI installer. The beta version of the update is available now and I have been playing around with the GUI installer. Those who have done automated installations of SGE on a large cluster will know about the installation config file, but with the GUI installer it's becoming more straightforward to do mass installations. However, there is still the usual preparation that needs to be done before installation:

1. Create an SGE admin user - not mandatory, but recommended.
2. Copy/install the SGE binaries to all the hosts in the same directory if you're not using a shared directory.
3. Set up passwordless SSH. The GUI installer uses SSH to invoke $SGE_ROOT/util/arch on the remote hosts to determine the host type. If a host is listed as "unreachable", it means the installer is unable to execute the command to get the host type. Try running the SSH command manually to see what's wrong.
4. If the SGE directory is not on a shared directory, then after installing the qmaster you still have to manually copy the contents of the SGE cell's common directory (e.g. if your cell's name is default, it will be $SGE_ROOT/default/common) to all the hosts that you are going to install as execution hosts.
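For step 3, a minimal sketch of generating a passphrase-less key pair is below. The key path and hostnames are placeholders of mine; in practice the key would typically live in ~/.ssh/id_rsa and be pushed to each execution host with ssh-copy-id.

```shell
#!/bin/sh
# Generate a passphrase-less SSH key pair for the installer to use.
# A temp dir is used here for safety; in practice use ~/.ssh/id_rsa.
keydir=$(mktemp -d)
keyfile="$keydir/id_rsa"
ssh-keygen -t rsa -N "" -f "$keyfile" -q

# Then push the public key to each execution host (placeholder names):
#   ssh-copy-id -i "$keyfile.pub" exechost01
# and verify what the installer will run:
#   ssh exechost01 '$SGE_ROOT/util/arch'
```

Running the final ssh command by hand is also the quickest way to debug a host that the installer reports as "unreachable".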

Some screen shots of the GUI installer:

Monday Dec 22, 2008

Solution Factory and HPC

I'm a little late in posting this, so this may be old news to some of you. Two quarters ago, I moved to a new team that was put together to develop new solutions that are repeatable, pre-configured, pre-tested and quickly deployable. What this means to customers is less risk, reduced costs and higher efficiency. We have announced three new solutions, branded as the Sun Rapid Solutions, targeted at specific customer network infrastructure requirements around global web buildout, datacenter efficiency and high performance computing.

I'm part of the team that works on the HPC solution called Sun Compute Cluster. The solution is designed to provide customers with a simple, flexible, and scalable approach to addressing their compute-intensive environments, thereby enabling them to quickly deploy a tested, reliable, and efficient architecture into production. It uses a Linux software stack comprising Sun software like Sun Grid Engine, xVM Ops Center, HPC ClusterTools and Sun Studio. Customers can also choose to use the new Sun HPC Software, Linux Edition, or any other software stack they wish, by engaging Sun Professional Services or partners for customization.

Thursday Nov 27, 2008

Gigaspaces integrates with Sun Grid Engine

I was at Supercomputing last week and got to talk to a rep from Gigaspaces who was at our Sun booth. I found out from him that they have integrated Gigaspaces XAP with Sun Grid Engine. One of the things missing in SGE is the capability to do low-latency scheduling, where many small transactions need to be dispatched and the results returned very quickly. An example is financial trading, where performance is measured in transactions per second. Gigaspaces provides a scalable platform that fills this gap. Using our Sun SPARC T5240 server, Gigaspaces was able to achieve very impressive benchmark results.

The integration allows SGE to manage Gigaspaces XAP instances and dynamically provision new instances to satisfy the SLA if the load is too high. Here is a video of a short presentation showing a demo of using SGE to automatically provision Gigaspaces XAP.

Friday Aug 01, 2008

Sun Shared Visualization System

Recently, one of the local IT architects asked about the Shared Visualization System and whether Sun Rays can be used to view 3D graphics. The answer is YES! The general idea behind the Shared Visualization solution is that you can have a central pool of graphics resources, which can potentially be a grid of multiple, different systems with accelerated graphics capabilities. Users then have the ability to remotely access and share 3D visualization applications on a variety of client platforms.

The main software used to enable this is VirtualGL, an open source program which redirects the 3D rendering commands from Unix and Linux OpenGL applications to 3D accelerator hardware in a dedicated server and displays the rendered output interactively to a thin client located elsewhere on the network. The thin client can therefore be a Sun Ray, but I believe a plug-in is required to be installed on the Sun Ray server. Sun Grid Engine is also part of the software stack, to manage access to the graphics resources.

See also the slides by my Sun colleague Torben, who presented the solution at a Grid conference.

Friday Jun 27, 2008

IDC: Software is #1 roadblock

The recent IDC presentation at the ISC2008 conference states that software is the biggest roadblock for HPC users now... I couldn't agree more. As clusters grow larger and more complex, better management tools are required. Managing and monitoring an HPC cluster typically involves bits and pieces of different tools, which makes setting up and operating the cluster very difficult. Sun has recognized this and is taking a very serious look at it, which is why we have the Sun HPC Software, Linux Edition. It is currently based on CentOS and still needs more work to be a complete, easy-to-use management solution. We have also started the HPC-Stack project under the OpenSolaris community for the OpenSolaris edition of the software stack.

Management software is only one piece of the puzzle. The other piece is the development tools and parallel libraries. As processors gain more cores and the number of cores per cluster grows, many applications will need to be redesigned, and new programming paradigms are needed to ease development and improve efficiency.


Melvin Koh

