HPC and Virtualization: Oak Ridge Trip Report
By Josh Simons on Jan 09, 2009
Just before Sun's Winter Break, I attended a meeting at Oak Ridge National Laboratory in Tennessee with Stephen Scott, Geoffroy Vallee, Christian Engelmann, Thomas Naughton, and Anand Tikotekar, all of the Systems Research Team (SRT) at ORNL. Attending from Sun were Tim Marsland, Greg Lavender, Rebecca Arney, and myself. The topic was HPC and virtualization, an area the SRT has been exploring for some time and one I've been keen on as well, since it has become clear that virtualization (v12n) has much to offer the HPC community. This is my trip report.
I arrived at Logan Airport in Boston early enough on Monday to catch an earlier flight to Dulles, narrowly avoiding the five-hour delay that eventually afflicted my original flight. The flight from Boston to Knoxville via Dulles went smoothly and I arrived without difficulty on a rainy and chilly Tennessee evening. I was thrilled to have made it through Dulles without incident since more often than not I have some kind of travel difficulty when my trips pass through IAD (more on that later). The 25-mile drive to the Oak Ridge DoubleTree was uneventful.
Oak Ridge is still very much a Lab town from what I could see, much like Los Alamos, but certainly less isolated. Movie reviews in the Oak Ridge Observer are rated with atoms rather than stars. Stephen Scott, who leads the Systems Research Team (SRT) at ORNL, mentioned that the plot plan for his house is stamped "Top Secret -- Manhattan Project" because the plan shows the degree difference between "ORNL North" and "True North", an artifact of the war years, when maps of the area deliberately skewed the position of Oak Ridge to lessen the chance that a map could be used to successfully bomb ORNL targets from the air.
We spent all day Tuesday with Stephen and most of the Systems Research Team. Tim talked about what Sun is doing with xVM and our overall virtualization strategy and ended with a set of questions that we spent some time discussing. Greg then talked in detail about both Crossbow and InfiniBand, specifically the aspects of each that relate to virtualization. We spent the rest of the day hearing about some of the work on resiliency and virtualization being done by the team. See the end of this blog entry for pointers to some of the SRT papers as well as other HPC/virtualization papers I have found to be interesting.
Resiliency isn't something the HPC community has traditionally cared much about. Nodes were thin and cheap. If a node crashed, restart the job, replace the node, use checkpoint-restart if you can. Move on; life on the edge is hard. But the world is changing. Nodes are getting fatter again--more cores, more memory, more IO. Big SMPs in tiny packages with totally different economics from traditional large SMPs. Suddenly there is enough persistent state on a node that people start to care how long their nodes stay up. Capabilities like Fault Management start to look really interesting, especially if you are a commercial HPC customer using HPC in production.
In addition, clusters are getting larger. Much larger, even with fatter nodes. Which means more frequent hardware failures. Bad news for MPI, the world's most brittle programming model. Certainly, some more modern programming models would be welcome, but in the meantime what can be done to keep these jobs running longer in the presence of continual hardware failures? This is one promise of virtualization. And one reason why a big lab like ORNL is looking seriously at virtualization technologies for HPC.
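To put rough numbers on why larger clusters mean more frequent failures, here is a minimal back-of-the-envelope sketch. It assumes node failures are independent and exponentially distributed, which is a simplification, and the 50,000-hour node MTBF and 1,000-node cluster are made-up illustrative figures, not ORNL data. The function names are my own.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Expected time until the first node failure somewhere in the cluster.

    Assumes independent, exponentially distributed node failures, so the
    cluster-wide failure rate is simply node_count times the per-node rate.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


def job_survival_probability(node_mtbf_hours: float, node_count: int,
                             job_hours: float) -> float:
    """Probability that no node fails during a job that spans every node."""
    return math.exp(-job_hours * node_count / node_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical figures chosen only for illustration.
    node_mtbf = 50_000.0   # hours between failures for a single node
    nodes = 1_000          # nodes used by one MPI job
    print(system_mtbf_hours(node_mtbf, nodes))
    print(job_survival_probability(node_mtbf, nodes, job_hours=50.0))
```

Under these assumptions a node that fails once every several years turns into a cluster that loses a node every couple of days, and an MPI job that needs every node for that long is more likely to die than to finish.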
Live migration -- the ability to shift running OS instances from one node to another -- is particularly interesting from a resiliency perspective. Linking live migration to a capable fault management facility (see, for example, what Sun has been doing in this area) could allow jobs to avoid interruption due to an impending node failure. Research by the SRT (see the Proactive Fault Tolerance paper, below) and others has shown this is a viable approach for single-node jobs and also for increasing the survivability of MPI applications in the presence of node failures. Admittedly, the current prototype depends on Xen TCP tricks to handle MPI traffic interruption and continuation, but with sufficient work to virtualize the InfiniBand fabric, this technique could be extended to that realm as well. In addition, the use of an RDMA-enabled interconnect can itself greatly increase the speed of live migration as is demonstrated in the last paper listed in the reference section below.
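The idea of linking fault management to live migration can be sketched as a simple planning step: look at per-node health readings, flag nodes that look like they are about to fail, and decide where their guests should go. This is only an illustration and not the SRT prototype. The `NodeHealth` fields, the thresholds, and the round-robin placement are all assumptions of mine, and the sketch stops at producing a plan rather than invoking any real hypervisor migration command.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NodeHealth:
    """Hypothetical health snapshot for one physical node."""
    name: str
    cpu_temp_c: float
    corrected_ecc_errors: int


def is_failing(health: NodeHealth, max_temp_c: float = 85.0,
               max_ecc_errors: int = 100) -> bool:
    """Flag a node as suspect when any reading crosses an (assumed) threshold."""
    return (health.cpu_temp_c >= max_temp_c
            or health.corrected_ecc_errors >= max_ecc_errors)


def plan_migrations(nodes: List[NodeHealth],
                    guests_by_node: Dict[str, List[str]]
                    ) -> List[Tuple[str, str, str]]:
    """Return (guest, source_node, target_node) moves off suspect nodes.

    Guests are spread round-robin over the healthy nodes. If no healthy
    node exists, the plan is empty and the caller must fall back to
    something else, such as checkpoint-restart.
    """
    suspect = [n for n in nodes if is_failing(n)]
    healthy = [n.name for n in nodes if not is_failing(n)]
    plan: List[Tuple[str, str, str]] = []
    if not healthy:
        return plan
    next_target = 0
    for node in suspect:
        for guest in guests_by_node.get(node.name, []):
            plan.append((guest, node.name, healthy[next_target % len(healthy)]))
            next_target += 1
    return plan


if __name__ == "__main__":
    nodes = [NodeHealth("n1", 90.0, 0), NodeHealth("n2", 60.0, 0),
             NodeHealth("n3", 55.0, 2)]
    guests = {"n1": ["vm-a", "vm-b"], "n2": ["vm-c"]}
    for move in plan_migrations(nodes, guests):
        print(move)
```

A real deployment would feed this from the platform's fault management data and hand each planned move to the hypervisor's live migration facility. It would also need to account for capacity on the target nodes, which this sketch ignores.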
We discussed other benefits of virtualization. Among them is the use of multiple virtual machines per physical node to simulate a much larger cluster for demonstrating an application's basic scaling capabilities in advance of being allowed access to a real, full-scale (and expensive) compute resource. Such pre-testing becomes very important in situations in which large user populations are vying for access to relatively scarce, large-scale, centralized research resources.
Geoffroy also spoke about "adapting systems to applications, not applications to systems" by which he meant that virtualization allows an application user to bundle their application into a virtual machine instance with any other required software, regardless of the "supported" software environment available on a site's compute resource. Being able to run applications using either old versions of operating systems or perhaps operating systems with which a site's administrative staff has no experience does truly allow the application provider to adapt the system to their application without placing an additional administrative burden on a site's operational staff. Of course, this does push the burden of creating a correct configuration onto the application provider, but the freedom and flexibility should be welcomed by those who need it. Those who don't could presumably bundle their application into a "standard" guest OS instance. This is completely analogous to the use and customization of Amazon Machine Images (AMIs) on the Amazon Elastic Compute Cloud (EC2) infrastructure.
Observability was another simpatico area of discussion. DTrace has taken low-cost, fine-grained observability to new heights (new depths, actually). Similarly, SRT is looking at how one might add dynamic instrumentation at the hypervisor level to offer a clearer view of where overhead is occurring within a virtualized environment to promote user understanding and also offer a debugging capability for developers.
A few final tidbits to capture before closing. Several other research efforts are looking at HPC and virtualization. Among them are V3VEE (University of New Mexico and Northwestern University) and XtreemOS (a bit of a different approach to virtualization for HPC and Grids). SRT is also working on a virtualized version of OSCAR called OSCAR-V.
The Dulles Vortex of Bad Travel was more successful on my way home. My flight from Knoxville was delayed by an unexplained mechanical problem that could not be fixed in Knoxville, requiring a new plane to be flown from St. Louis. I arrived very late into Dulles, about 10 minutes before my connection to Boston was due to leave from the other end of the terminal. I ran to the gate, arriving two minutes before the flight was scheduled to depart, and it was already gone--no sign of the gate agents or the plane. I spent the night at an airport hotel and flew home first thing the next morning. Dulles had struck again--this was at least the third time I've had problems like this when passing through IAD. I have colleagues who refuse to travel with me through this airport. With good reason, apparently.
Proactive Fault Tolerance for HPC with Xen Virtualization, Nagarajan, Mueller, Engelmann, Scott
The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software, Youseff, Seymour, You, Dongarra, Wolski
Performance Implications of Virtualizing Multicore Cluster Machines, Ranadive, Kesavan, Gavrilovska, Schwan
High Performance Virtual Machine Migration with RDMA over Modern Interconnects, Huang, Gao, Liu, Panda