Sun HPC Consortium: Day II Customer Talks
By Josh Simons on Nov 13, 2006
As we started Day II of the Sun HPC Consortium in Tampa, I was impressed to see 80 or so customers and partners in their seats at 8:30 on a Sunday morning. We had three customer talks: A tour of HPC at Mississippi State, a discussion of Thumper and ZFS as a high-capacity data store for particle physics research, and a status report on TSUBAME, currently the 7th largest supercomputer in the world. We also learned about the surprising perils of bubble formation in nuclear reactor cooling systems.
HPC at Mississippi State
Trey Breckenridge, HPC Research and Operations Administrator, and Roger Smith, Senior Computer Specialist, gave a tag-team talk about the science and engineering foci at Mississippi State. They also gave a brief history of HPC systems at MSU, including their latest, big HPC cluster from Sun. We also heard about how staff from the site helped save lives in the aftermath of Katrina.
The HPC Collaboratory (HPC^2) at Mississippi State includes five centers:
- The Center for Advanced Vehicular Systems (CAVS) investigates hybrid vehicles, power electronics, particulate materials, human factors, etc. The Center has worked with the nearby Nissan manufacturing plant, and improvements it suggested based on its research have saved Nissan millions of dollars.
- The Center for Computational Sciences (CCS) is involved in a number of traditional HPC areas, including computational physics and computational chemistry.
- The GeoResources Institute (GRI) is the largest of the Collaboratory's five centers. It focuses primarily on earth sciences and climate modeling. Within two days of Katrina coming ashore, Center staff were working from an RV to translate victims' verbal 911 location descriptions into map coordinates that pilots unfamiliar with the area could use for rescue flights. They helped in a number of rescues, including five that saved lives that would otherwise have been lost.
- The Center for DoD Programming Environment and Training (PET) supplies high-level user support to many of the U.S. Department of Defense Major Shared Resource Centers (MSRC), including programming expertise, training, etc.
- The Computational Simulation and Design Center (SimCenter) is focused primarily on CFD, with applications to biomedical, automotive, turbomachinery, and space shuttle problem analysis, to name a few. We learned that Mississippi State was asked by NASA to perform an in-mission shuttle simulation to determine the probable impact of the loss of a panel from the shuttle during launch. The analysis, which concluded the loss was not critical to safety, was done on a Sun Fire E10K system at the Center. The Center also mails out an excellent calendar every year that's filled with beautiful images from simulations done at the Center.
Roger Smith then gave us an overview of Mississippi State's long involvement with HPC and their long relationship with Sun and Sun hardware. They built their first cluster, MADEM, in 1987 from Sun gear. Their second cluster, called SuperMSPARC, was built in 1993 and included Ethernet, Myrinet, and ATM interconnects between eight SPARCstation 10s with a total of 32 processors. I didn't realize that the original Myrinet drivers for SunOS/Solaris were done at MSU and that this cluster predated the first Beowulf cluster by a year. Cool.
MSU's latest system, Raptor, is a cluster with 512 diskless Sun X2200 M2 nodes, each with 8 GB of memory. They are using GbE between nodes and 10 GbE pipes between the 16 racks that comprise the system. Their 2048 Opteron cores deliver a peak performance of about 10 TFLOPS. As Roger put it, a human doing one computation per second by hand would take about 338,000 years to do the work this system can do in one second.
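Roger's comparison is easy to sanity-check with a little arithmetic. The sketch below is mine, not from the talk: the 2.6 GHz clock and the 2 floating-point operations per core per cycle are assumptions I picked because they land near the quoted numbers, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the Raptor peak and the "by hand" comparison.
# ASSUMED_CLOCK_HZ and ASSUMED_FLOPS_PER_CYCLE are assumptions, not figures
# from the talk; only the core count comes from the presentation.

SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years

CORES = 2048                 # from the talk
ASSUMED_CLOCK_HZ = 2.6e9     # assumption
ASSUMED_FLOPS_PER_CYCLE = 2  # assumption


def peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires flops_per_cycle each cycle."""
    return cores * clock_hz * flops_per_cycle


def years_by_hand(flops):
    """Years a person needs, at one computation per second, to match
    what a machine with the given peak does in a single second."""
    return flops / SECONDS_PER_YEAR


peak = peak_flops(CORES, ASSUMED_CLOCK_HZ, ASSUMED_FLOPS_PER_CYCLE)
print(f"peak: {peak / 1e12:.2f} TFLOPS")
print(f"by hand: {years_by_hand(peak):,.0f} years")
```

Under those assumptions the peak comes out a little above 10 TFLOPS and the by-hand figure lands close to the 338,000 years Roger quoted. A flat 10 TFLOPS would give roughly 317,000 years, which suggests the quoted figure was computed from the unrounded peak.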
Sun has a program called Customer Ready Systems (CRS) that can pre-build and pre-configure systems for customers prior to shipment. I've been aware of the program for a long time, but never heard a customer talk about the experience until today. It was impressive. The first eleven of their racks arrived by truck at 5pm on a Monday. The final five arrived that Wednesday at 5pm. By 7pm that same day, the entire system was assembled, booted, and running. As Roger said, the CRS approach hugely reduced the effort needed by MSU staff since so much work had been done previously back at Sun. And it also drastically reduced the amount of trash left on-site, which can be a very significant issue for large systems like this.
The presentation closed with a short time-lapse movie documenting the installation procedure. It looked a lot different from the typical build-a-cluster-from-scratch movie, which shows each system cabinet being populated incrementally, server by server. Contrast that with the MSU installation, in which full racks appear in quick succession on the floor. I nominate the MSU movie for the "shortest movie in this genre" award. And that's a good thing.
Thumper for the Teeny
Martin Gasthuber from Deutsches Elektronen-Synchrotron (DESY) in Hamburg spoke about the enormous data processing and storage requirements of the tiny world of particle physics. As he said, they are "hunting the smallest, using the biggest." He estimates they need about 100K of today's fastest CPUs, and they currently generate about 15 PB of new data per year (moving toward exabytes).
dCache is a key component of their multi-tiered, grid-based approach. It is designed to be used as a building block to create very large, modular storage systems that deliver both high bandwidth and large capacity. The system has few dependencies: it requires only a JVM, a local file system, and one or more GbE connections. Most of the components are written in Java and testing has shown there is no real I/O penalty with this approach--they get excellent performance.
Having now built four generations of storage boxes, the dCache team has learned several important lessons:
- File systems (xfs, ext3, ufs) run into problems after very heavy loads over 3-6 months
- Data corruptions occur on volumes (including RAID5/6)
- Single GbE connections are not sufficient
- Disks should not be trusted, even if they report an OK status because they still sometimes return wrong data
- Large disks make data corruption and uncorrectable read errors a problem
- Single redundancy is too weak
- Better fault monitoring than just IPMI or SNMP is needed to scale
- End-to-end data integrity is critical for success
- dCache requires an I/O density of 50 MB/s per TB of usable space
According to Gasthuber, Thumper and ZFS fit their requirements perfectly, and they've already validated that the combination addresses most of their issues. Performance is already higher than expected and they are looking forward to moving to 10 GbE over time. They will soon have about 160 TB of Thumper space online. See http://www.dcache.org for details.
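To get a feel for what the 50 MB/s per TB requirement means at this scale, here is a rough sketch of the arithmetic. It is my own illustration, not something shown in the talk: it treats the full 160 TB as usable space, uses theoretical line rates (125 MB/s for GbE, 1250 MB/s for 10 GbE) with no protocol overhead, and the function names are hypothetical.

```python
import math

# Figures from the talk
IO_DENSITY_MB_S_PER_TB = 50   # dCache requirement per TB of usable space
CAPACITY_TB = 160             # Thumper space coming online (assumed usable)

# Theoretical line rates, ignoring protocol overhead (assumption)
GBE_MB_S = 125        # 1 Gbit/s
TEN_GBE_MB_S = 1250   # 10 Gbit/s


def required_bandwidth_mb_s(capacity_tb, density=IO_DENSITY_MB_S_PER_TB):
    """Aggregate bandwidth needed to meet the I/O density target."""
    return capacity_tb * density


def min_links(bandwidth_mb_s, link_mb_s):
    """Smallest number of links whose combined line rate covers the target."""
    return math.ceil(bandwidth_mb_s / link_mb_s)


need = required_bandwidth_mb_s(CAPACITY_TB)
print(f"aggregate bandwidth needed: {need} MB/s")
print(f"GbE links at line rate:     {min_links(need, GBE_MB_S)}")
print(f"10 GbE links at line rate:  {min_links(need, TEN_GBE_MB_S)}")
```

With these assumptions, 160 TB needs 8000 MB/s in aggregate, which is 64 GbE links or 7 10 GbE links even at ideal line rate. That makes it easy to see why single GbE connections made the lessons-learned list.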
The People's Supercomputer at Tokyo Institute of Technology
Professor Satoshi Matsuoka presented an update on Tokyo Tech's TSUBAME supercomputer, currently the 7th largest supercomputer in the world and the largest in the world based on Sun hardware.
The system, which comprises 76 racks of compute, storage, and networking infrastructure, sits in approximately 350 square meters of floor space. The equipment weighs about 60 tons. The system has 648 Sun X4600 nodes, each with 16 Opteron cores and 32/64 GB memory. The interconnect is Voltaire Infiniband and storage capacity is about 1.1 PB, using Thumpers. There are also 360 Clearspeed accelerator cards installed in the system.
Cooling and power were perhaps the largest challenges in deploying TSUBAME. It took over a year for Tokyo Tech, Sun, and NEC to work out a solution. Given the space available, the installation required a power density of 700 watts per square foot, well above the current datacenter state of the art, which can handle only 500 watts per square foot. The solution includes some interesting aspects. For example, the under-floor space is used for cabling, but not airflow. All cooling is handled through large ceiling vents with a low (3m) ceiling. Airflow is very fast and made faster through the use of narrow aisles. Hairdos do not survive for long in this machine room.
Matsuoka-san commented that the choice of fat nodes was important in that it allows for maximum parallel programming flexibility and reduces node count, which increases both manageability and availability. The system is designed for both capability (very large jobs) and capacity (lots of smaller jobs).
Since its installation this spring, the system has had an availability of over 99%. There have been frequent faults, as you'd expect with a system of this size, but with local effects only: any affected jobs are automatically restarted by Sun Grid Engine. Most of the issues have been software problems that were fixed with either reboots or patches. There have been very few hardware problems.
Matsuoka-san ended his talk with a short survey of the science being done with TSUBAME. Areas include simulation of an Earth magnetosphere inversion (ported from the Earth Simulator), high resolution typhoon simulation, protein folding, and TNT explosions. Most interesting to me, however, was the bubble simulation work being done on TSUBAME.
Due to the high temperatures involved in nuclear reactor cooling, bubbles naturally form in cooling pipes as water is vaporized. Vapor, however, conducts heat less well than liquid water, which means that if bubbles somehow adhere to the walls of cooling pipes, there is a real danger that the pipes may melt, creating a serious safety problem.
Summaries of Day III talks are here.