Wednesday Jun 24, 2009

HPC in Hamburg: Sun Customers Speak at the HPC Consortium

It's crazy time again. I'm in Hamburg for two HPC events: Sun's HPC Consortium customer event, and ISC '09, the International Supercomputing Conference. The Consortium ran all day Sunday and Monday and then ISC started on Tuesday. It is now Wednesday and this is the first break I've had to post a summary of the talks given at the Consortium. Due to the sheer number of presentations, including a wide range of Sun and partner talks, I summarize only those given by our customers. The full agenda is here.

Our first customer talk on Sunday was given by Dr. James Leylek, Executive Director of CU-CCMS, the Clemson University Computational Center for Mobility Systems, which focuses on problems in the automotive, aviation/aerospace, and energy industries.

The mission of CU-CCMS is not unique -- there are numerous university-based centers that work closely with industry by bringing resources and expertise to bear in a variety of problem domains. What sets CU-CCMS apart is its focus on addressing the mismatch between typical university time-scales and those of their industrial partners. Businesses need results quickly; universities move more slowly.

CU-CCMS has addressed this need in a few ways. They've staffed the center with full-time MS and PhD level engineers who have no teaching responsibilities. And they have provided a significant amount of computing gear to enable those engineers to work effectively with their industrial partners and generate results in a timely way.

Heterogeneity is another key part of the CU-CCMS strategy. By offering a range of computing platforms from clusters to very large shared-memory machines (from Sun) they are able to map problems to appropriate resources to deliver the fast turnaround times required by their industrial partners.

Dr. Leylek also briefly discussed the challenge of introducing HPC to industry as detailed in the Council on Competitiveness study, Reveal. As he noted, many companies are "sitting on the sidelines" of HPC and not engaging even though they could increase their competitiveness by using HPC techniques. He believes CU-CCMS offers a model for how such engagements can be run successfully: assemble a team of expert, dedicated technical resources with appropriate domain knowledge, algorithmic expertise, etc., and combine that with ample high performance computing infrastructure and an understanding that turnaround time is critical for successful industrial engagements. And then generate valuable results. Lather, rinse, repeat.

Thomas Nau from the University of Ulm gave the next talk, which was a quick tour through several OpenSolaris technologies. He talked about COMSTAR, gave a quick demo of the new OpenSolaris Time Slider, and spent most of his time talking about ZFS, specifically about the benefits of solid state disks for increasing ZFS performance. Thomas identified the ZIL -- the ZFS Intent Log -- as the component most often affecting performance. Experiments in which he moved the ZIL from a standard hard disk to a ramdisk showed significant ZFS performance improvements. In addition, only a small amount of solid state storage is needed to achieve good performance -- perhaps 1-4 GB even for multi-terabyte pools. Thomas noted that while one could theoretically increase ZFS performance by disabling the ZIL, DO NOT DO THIS. He then ended with the following statement, with which I can only agree: "Hardware RAID is dead, dead, dead. Just use ZFS." :-)
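For readers who want to try this themselves, attaching a dedicated log device to a pool is a one-line operation. This is a minimal sketch, not Thomas's actual setup -- the pool name and device names below are placeholders.

```shell
# Attach a small SSD as a dedicated ZFS intent log (slog) device.
# "tank" and the device names are placeholders for illustration.
zpool add tank log c2t5d0

# Alternatively, mirror the log device so a single SSD failure
# cannot lose in-flight synchronous writes:
# zpool add tank log mirror c2t5d0 c3t5d0

# Confirm the log device appears in the pool configuration:
zpool status tank
```

The slog only absorbs synchronous writes before they are committed to the main pool, which is why such a small device can make such a large difference.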

Our first customer talk on Monday was given by Prof. Dr. Thomas Lippert, head of the Jülich Supercomputing Centre (JSC), site of Sun's largest European deployment to date of our Sun Constellation System architecture. He first gave a brief history of the Jülich Research Center, which is one of the largest civilian research centers in Europe with over 4000 researchers in nine departments, one of which is the new Institute for Advanced Simulation of which JSC is a part. The site has a very long history of computer acquisitions, starting in 1957. This year JSC purchased three systems: a Sun system (JuRoPA), a Bull system (HPC-FF), and an IBM system (JUGENE.) These systems have, respectively, 200 TFLOPs, 100 TFLOPs, and 1 PFLOPs of peak performance. Since the Sun and Bull systems are interconnected at the highest level of their switch hierarchies, the two machines can be run as a single system. This combined system delivered 274.8 TFLOPs on LINPACK, which earned it the #10 entry on the latest edition of the TOP500 list. Collectively, JSC serves about 250 projects across Europe, including 20-30 highly scalable projects that are chosen by international referees for their potential for producing breakthrough science.

Dr. Lippert also spoke briefly about PRACE, the Partnership for Advanced Computing in Europe, which is radically changing the supercomputing landscape across Europe. Due to earlier studies, computing is now considered to be a crucial pillar of research infrastructure and, as such, it is now receiving considerable attention from funding agencies.

In closing, Dr. Lippert presented specific details of the JuRoPA system (2208 nodes, 17664 cores, 207 TFLOPs, 48 GB/node, and Sun's new M9 QDR switch.) He also described some of the specific issues that will be explored with these systems, including control of jitter through the use of gang scheduling, daemon reduction, a SLERT kernel, etc. And some additional secret sauce from Sun perhaps. :-)

Prof. Satoshi Matsuoka from the Tokyo Institute of Technology spoke next. While he did mention Tsubame, Tokyo Tech's Sun-based supercomputer, he primarily spoke about the return of vector machines to HPC. They have, he believes, been reincarnated as GPGPU-based machines. Dinosaurs are once again walking the earth. :-) In particular, the GPGPU's high compute density, high memory bandwidth, and low memory latency echo some of the fundamental capabilities of vector machines that make them interesting for both tightly coupled codes like N-body as well as sparse codes like CFD. In his view, the GPGPU essentially becomes the main processor while the CPU becomes an ancillary processor.

Computers, however, are not useful unless they can be used to solve problems. To support the fact that GPGPU-based clusters can be effective HPC platforms, Prof. Matsuoka presented results from several new algorithms that have been developed at Tokyo Tech to take advantage of GPGPU-based systems. He showed impressive results for 3D FFTs used for protein folding and results for CFD with speedups up to 70X over CPU-based algorithms.

Our next customer speaker was Henry Tufo of the University of Colorado at Boulder (UCB) and the National Center for Atmospheric Research (NCAR.) He gave an update on UCB's upcoming Constellation-based HPC system and also spoke about some of the challenges related to climate modeling. It seems clear at this point that accurate climate modeling is going to be critical for understanding our future and our planet's future. It was a bit daunting to hear that climate modelers would like to increase many dimensions of their simulations, including spatial resolution by 10^3 to 10^5, the completeness of their models by 100x, the length of their simulation runs by 100x, and the number of modeled parameters by 100x. All told, their desires would increase computational needs by 10^10 to 10^12 over current requirements. It was sobering to hear that current technology trajectories predict that a 10^7 improvement will take about 30 years. Not good.

Their new Sun-based system will consist of 12 Constellation racks, Nehalem blades, QDR InfiniBand, and about 500 TB of storage, with about 10% of the cluster's nodes accelerated with GPUs. The system will be located next to an existing physics building in three containers -- one for the IT components, one for electrical, and one for cooling.

Stephane Thiell from the Commissariat à l'Énergie Atomique (CEA) gave an overview of CEA, talked a bit about CEA's TERA-100 project, and then detailed CEA's planned use of Lustre for TERA-100. The CEA computing complex currently has two computing centers, one classified (TERA) and one open (CCRT.) TERA-100 will be a follow-on to TERA-10, which is a 60 TFLOPs, Linux-based system built by Bull in 2005. It includes an impressive 1 PetaByte Lustre filesystem and uses HPSS to archive to Sun StorageTek tape libraries with a 15 PetaByte capacity.

TERA-100 aims to increase CEA's classified computing capacity by about 20x with a final size of one PFLOPs or perhaps a little larger. CEA plans to continue with their COTS-based, general-purpose approach rather than move off the main sequence to something more exotic. It will be x86-based with more than 500 GFLOPs per node using 4-socket nodes. There will be 2-4 GB per core, and two Lustre file systems will be supported, one with a 300 GB/s transfer requirement and the other with a 200 GB/s requirement. The system will consume less than 5 MW. A 40 TFLOPs demonstrator system will be built first and it will include scaled-down versions of the Lustre file systems as well. In the final system the Lustre servers will be built with four-socket nodes, and a four-node HA architecture will be used to guard against failure and to avoid long failover times.

CEA is involved in some interesting Lustre-based development, including joint work with Sun on a binding between Lustre and external HSM systems with the goal of supporting Lustre levels of performance with transparent access to hierarchical storage management. CEA is also working on Shine, a management tool for Lustre.

Dieter an Mey gave some general information about computing at RWTH Aachen University and then gave an update on their latest acquisition, a Sun-based supercomputer. He ended with a discussion about the pleasures and perils of workload placement on current generation systems. Along the way he shared some feedback on Sun products -- one of those habits that makes customers like Dieter such valuable partners for Sun.

Aachen provides both Linux and Windows-based HPC resources for their users. On Linux they record about 40,000 batch jobs per month and perhaps 150 interactive sessions per day. The Windows cluster is used primarily for interactive jobs. It was interesting to hear that Windows is gaining ground with respect to Linux at Aachen: a previous study at Aachen had shown that Windows lagged Linux in performance by about 24%, but a recent re-run of the study now shows the gap to be on the order of about 7%.

Aachen's new system will support both Linux and Windows equally with a flexible dividing line between them. The facility is designed to be general purpose with a mix of thin and fat nodes and with the required high-speed interconnect for those who use MPI. A new building is being erected to house this machine which will come fully online over the course of 2009-2010. When complete, the system will have a peak floating point rate in excess of 200 TFLOPs and it will include a 1 PetaByte Lustre file system. Speaking of Lustre, Dieter rated its configuration as "complex", something Sun is working on. The system will also include two of Sun's latest InfiniBand switches, the new 648-port QDR M9 switch.

Dieter's final topic was the correct placement of workload on non-uniform system architectures. In particular, he described the difference between compact and scatter placement on multicore NUMA systems. Compact placement uses threads on the same core first, then cores in the same socket -- a strategy that is used to minimize latency and to maximize cache sharing. Scatter placement uses threads on different sockets first, and then threads on different cores -- a strategy that maximizes memory bandwidth. Which strategy is best depends on the details of an application's underlying algorithms. (Dieter noted that currently Sun Grid Engine is not aware of these issues -- it treats nodes as flat arrays of threads or cores.) Placement decisions are further complicated when attempting to schedule more than one application onto a fat node. For example, different strategies would be used depending on whether single job turnaround is more important than overall throughput of jobs.
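As a rough illustration of the two strategies (not Dieter's actual setup), later OpenMP runtimes expose exactly this compact-versus-scatter choice through standard environment variables; "./a.out" stands in for any OpenMP application.

```shell
# Standard OpenMP 4.0 environment controls that map onto the two
# placement strategies; "./a.out" is a placeholder binary.

# Compact placement: bind threads to nearby cores first,
# minimizing latency and maximizing cache sharing.
OMP_PLACES=cores OMP_PROC_BIND=close ./a.out

# Scatter placement: spread threads across sockets first,
# maximizing aggregate memory bandwidth.
OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out
```

Running a bandwidth-bound kernel under both settings is a quick way to see which side of the trade-off a given application sits on.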

Our last customer talk at the Consortium was given by the tag team of Arnie Miles from Georgetown University and Tim Bornholtz of the Bornholtz Group. Their topic was the Thebes Consortium, for which they presented current status, did a short demo, and announced that the source code would be made available by the end of June.

The Thebes Consortium aims to promote the widespread adoption of distributed computing technologies by creating an enabling infrastructure that focuses on scalability, security, and simplicity.

Arnie described (and Tim demo'ed) the instantiation of the Thebes first use case which assumes 1) that users have usernames and passwords in their home domain, 2) that one or more local resources have a trust relationship with a local STS (secure token service), 3) that these resources are known to users, and 4) that all resources are able to consume SAML.

The use case itself consists of the following actions: 1) users create job submission files using the client application or a command line, 2) users use institution usernames and passwords to acquire a signed SAML token, 3) users perform no other logins and do not have to go to a resource command line interface, 4) users manually choose their resources, 5) job scheduling is handled by resources. Note that in this instance "resource" refers to a DRM-managed cluster which will accept the incoming request and then schedule the job appropriately on its managed cluster. In the prototype as it currently exists, a service is a compute service, though there is also some level of support for a file system service.


Josh Simons

