Friday Feb 12, 2010

Mirror, Mirror

Today is my last day at Sun, so I write to say goodbye to all my friends and colleagues, whom I will dearly miss.

I believe Oracle plans to maintain all previous blog content, but just in case I have made a copy of this entire blog, which is called, fittingly, the Mirror of the Navel of Narcissus. Yes, even more references to self-indulgence. :-)

Follow the link to find out what I'm doing next!

Friday Jan 15, 2010

Virtualization for HPC: The Heterogeneity Issue

I've been advocating for a while now that virtualization has much to offer HPC customers (see here.) In this blog entry I'd like to focus on one specific use case, heterogeneity. It's an interesting case because, depending on your viewpoint, heterogeneity is either desirable or something to be avoided, and virtualization can help in either case.

The diagram above depicts a typical HPC cluster installation with each compute node running whichever distro was chosen as that site's standard OS. Homogeneity like this eases the administrative burden, but it does so at the cost of flexibility for end-users. Consider, for example, a shared compute resource like a national supercomputing center or a centralized cluster serving multiple departments within a company or other organization. Homogeneity can be a real problem for end-users whose applications run only on other versions of the chosen cluster OS or, worse, on completely different operating systems. These users are generally not able to use these centralized facilities unless they can port their application to the appropriate OS or convince their application provider to do so.

The situation with respect to heterogeneity for software providers -- ISVs, or independent software vendors -- is quite different. These providers have been wrestling with expenses and other difficulties related to heterogeneity for years. For example, while ISVs typically develop their applications on a single platform (OS 0 above), they must often port and support their application on several operating systems in order to address the needs of their customer base. Assuming the ISV decides correctly which operating systems should be supported to maximize revenue, it must still incur considerable expense to continually qualify and re-qualify its application on each supported operating system version, and it must maintain a complex, multi-platform testing infrastructure and the in-house expertise to support these efforts.

Imagine instead a virtualized world, as shown above. In such a world, cluster nodes run hypervisors on which pre-built and pre-configured software environments (virtual machines) are run. These virtual machines include the end-user's application and the operating system required to run that application. So far as I can see, everyone wins. Let's look at each constituency in turn:

  • End-users -- End-users have complete freedom to run any application using any operating system because all of that software is wrapped inside a virtual machine whose internal details are hidden. The VM could be supplied by an ISV, built by an open-source application's community, or created by the end-user. Because the VM is a black box from the cluster's perspective, the choice of application and operating system need no longer be restricted by cluster administrators.
  • Cluster admins -- In a virtualized world, cluster administrators are in the business of launching and managing the lifecycle of virtual machines on cluster nodes and no longer need deal with the complexities of OS upgrades, configuring software stacks, handling end-user special software requests, etc. Of course, a site might still opt to provide a set of pre-configured "standard" VMs for end-users who do not have a need for the flexibility of providing their own VMs. (If this all sounds familiar -- it should. Running a shared, virtualized HPC infrastructure would be very much like running a public cloud infrastructure like EC2. But that is a topic for another day.)
  • ISVs -- ISVs can now significantly reduce the complexity and cost of their business. Since ISV applications would be delivered wrapped within a virtual machine that also includes an operating system and other required software, ISVs would be free to select a single OS environment for developing, testing, AND deploying their application. Rather than basing their operating system choice on market share considerations, the decision could be made based on the quality of the development environment, or perhaps the stability or performance levels achievable with a particular OS, or perhaps on the ability to partner closely with an OS vendor to jointly deliver a highly-optimized, robust, and completely supported experience for end-customers.
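The scheduling consequence for end-users can be sketched in a few lines of Python. This is a toy illustration of the idea -- all names and data structures here are hypothetical, not a real scheduler API:

```python
# Sketch: why a virtualized cluster removes the OS-matching constraint.
# Nodes and jobs are hypothetical illustrations, not a real scheduler.

def schedule_native(job_os, nodes):
    """Traditional cluster: a job only runs on nodes with a matching OS."""
    return next((n for n in nodes if n["os"] == job_os), None)

def schedule_virtualized(vm_image, nodes):
    """Virtualized cluster: any node with a hypervisor and free capacity can
    host the VM; the guest OS packaged inside vm_image is irrelevant."""
    return next((n for n in nodes if n["hypervisor"] and n["free"]), None)

nodes = [{"name": "node1", "os": "SiteLinux 5", "hypervisor": True, "free": True}]

# A Solaris-based application is unschedulable on a native SiteLinux cluster...
assert schedule_native("Solaris 10", nodes) is None
# ...but schedules fine once its OS is encapsulated in the VM it arrives in.
assert schedule_virtualized("solaris10-app.img", nodes)["name"] == "node1"
```

The point of the contrast is that the second scheduler never inspects the guest OS at all -- the VM is a black box, exactly as described above.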

Monday Dec 21, 2009

Sun HPC Consortium Videos Now Available

Thanks to Rich Brueckner and Deirdré Straughan, videos and PDFs are now available from the Sun HPC Consortium meeting held just prior to Supercomputing '09 in Portland, Oregon. Go here to see a variety of talks from Sun, Sun partners, and Sun customers on all things HPC. Highlights for me included Dr. Happy Sithole's presentation on Africa's largest HPC cluster (PDF|video), Marc Parizeau's talk about CLUMEQ's Colossus system and its unique datacenter design (PDF|video), and Tom Verbiscer's talk describing Univa UD's approach to HPC and virtualization, including some real application benchmark numbers illustrating the viability of the approach (PDF|video).

My talk, HPC Trends, Challenges, and Virtualization (PDF|video) is an evolution of a talk I gave earlier this year in Germany. The primary purposes of the talk were to illustrate the increasing number of common challenges faced by enterprise, cloud, and HPC users and to highlight some of the potential benefits of this convergence to the HPC community. Virtualization is specifically discussed as one such opportunity.

Monday Oct 19, 2009

Workshop on High Performance Virtualization

As discussed in several earlier posts (here and here), virtualization technology is destined to play an important role in HPC. If you have an opinion about that or are interested in learning more, consider attending or submitting a technical paper or position paper to the Workshop on High Performance Virtualization: Architecture, Systems, Software, and Analysis, which is being held in conjunction with HPCA-16, the 16th International Symposium on High Performance Computing Architecture, and also in conjunction with PPoPP 2010, the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Apparently, Bangalore is the place to be in early January 2010!

Time is short: Workshop submissions are due November 10, 2009.

Wednesday Jul 15, 2009

HPC Trends and Virtualization

Here are the slides (with associated commentary) that I used for a talk I gave recently in Hamburg at Sun's HPC Consortium meeting just prior to the International Supercomputing Conference. The topic was HPC trends with a focus on virtualization and the continued convergence of HPC, Enterprise, and Cloud IT. A PDF version of the slides is available here.

Challenges faced by the HPC community arise from several sources. Some are created by the arrival and coming ubiquity of multi-core processors. Others stem from the continuing increases in problem sizes at the high end of HPC and the commensurate need for ever-larger compute and storage clusters. And still others derive from the broadening of HPC to a wider array of users, primarily commercial/industrial users for whom HPC techniques may offer a significant future competitive advantage.

Perhaps chief among these challenges is the increasingly difficult issue of cluster management complexity, which has been identified by IDC as a primary area of concern. This is especially true at the high end due to the sheer scale involved, but problems exist in the low and midrange as well since commercial/industrial customers are generally much less interested in or tolerant of approaches with a high degree of operational complexity--they expect more than is generally available currently.

Application resilience is also becoming an issue in HPC circles. At the high end there is a recognition that large distributed applications must be able to make meaningful forward progress while running on clusters whose components may be experiencing failures on a near-continuous basis due to the extremely large sizes of these systems. At the midrange and low end, individual nodes will include enough memory and CPU cores that their continued operation in the presence of failures becomes a significant issue.

Gone are the days of macho bragging about the MegaWatt rating of one's HPC datacenter. Power and cooling must be minimized while still delivering good performance for a site's HPC workload. How this will be accomplished is an area of significant interest to the HPC community--and their funding agencies.

While the ramifications of multi-core processors for HPC are a critical issue, as are issues related to future programming models and high-productivity computing, these are not dealt with in this talk due to time constraints.

Most HPC practitioners are comfortable with the idea that innovations in HPC eventually become useful to the wider IT community. Extreme adherents to this view may point out that the World Wide Web itself is a byproduct of work done within the HPC community. Absent that claim, there are still plenty of examples illustrating the point that HPC technologies and techniques do eventually find broader applicability.

This system was recently announced by Oracle. Under the hood, its architecture should be familiar to any HPC practitioner: it is a DDR InfiniBand, x86-based cluster in a box, expandable in a scalable way to eight cabinets with both compute and storage included.

The value of a high-bandwidth, low-latency interconnect like InfiniBand is a good example of the leverage of HPC technologies in the Enterprise. We've also seen significant InfiniBand interest from the Financial Services community for whom extremely high messaging rates are important in real-time trading and other applications.

It is also important to realize that "benefit" flows in both directions in these cases. While the Enterprise may benefit from HPC innovations, the HPC community also benefits any time Enterprise uptake occurs since adoption by the much larger Enterprise market virtually assures that these technologies will continue to be developed and improved by vendors. Widespread adoption of InfiniBand outside of its core HPC constituency would be a very positive development for the HPC community.

A few months ago, several colleagues and I held a panel session in Second Life to introduce Sun employees to High Performance Computing as part of our "inside outreach" to the broader Sun community.

As I noted at the time, it was a strange experience to be talking about "HPC" while sitting inside what is essentially an HPC application -- that is, Second Life itself. SL is a great example of how HPC techniques can be repurposed to deliver business value in areas far from what we would typically label High Performance Computing. In this case, SL executes about 30M concurrent server-side scripts at any given time and uses the Havok physics engine to simulate a virtual world that has been block-decomposed across about 15K processor cores. Storage requirements are about 100 TB in over one billion files. It sure smells like HPC to me.

Summary: HPC advances benefit the Enterprise in numerous ways. Certainly with interconnect technology, as we've discussed. With respect to horizontal scaling, HPC has been the poster child for massive horizontal scalability since clusters first made their appearance in HPC over a decade ago (but more about this later.)

Parallelization for performance has not been broadly addressed beyond HPC, at least not yet. With the free ride offered by rocketing clock speed increases coming to an end, parallelization for performance is going to rapidly become everyone's problem and not an HPC-specific issue. The question at hand is whether parallelization techniques developed for HPC can be repurposed for use by the much broader developer community. It is beyond the scope of this talk to discuss this in detail, but I believe the answer is that both the HPC community and the broader community have a shared interest in developing newer, easier-to-use parallelization techniques.

A word on storage. While Enterprise is the realm of Big Database, it is the HPC community that has been wrestling with both huge data storage requirements and equally huge data transfer requirements. Billions of files and PetaBytes of storage are not uncommon in HPC at this point with aggregate data transfer rates of hundreds of GigaBytes per second.

One can also look at HPC technologies that are of benefit to Cloud Computing. To do that, however, realize that before "Cloud Computing" there was "Grid Computing", which came directly from work by the HPC community. The idea of allowing remote access to large, scalable compute and storage resources is very familiar to HPC practitioners since that is the model used worldwide to allow a myriad of individual researchers access to HPC resources in an economically feasible way. Handling horizontal scale and the mapping of workload to available resources are core HPC competencies that translate directly to Cloud Computing requirements.

Of course, Clouds are not the same as Grids. Clouds offer advanced APIs and other mechanisms for accessing remote resources. And Clouds generally depend on virtualization as a core technology. But more on that later.

As a community, we tend to think of HPC as being leading and bleeding edge. But is that always the case? Have there been advances in either Enterprise or Cloud that can be used to advantage HPC? There is no question in my mind that the answer to this question is a very strong Yes.

Let's talk in more detail about how Enterprise and Cloud advances can help address application resilience, cluster management complexity, effective use of resources, and power efficiency for HPC. Specifically, I'd like to discuss how the virtualization technologies used in Enterprise and Cloud can be used to address these current and emerging HPC pain points.

I am going to use this diagram frequently on subsequent slides so it is important to define our terms. When I say "virtualization" I am referring to OS virtualization of the type done with, for example, Xen on x86 systems or LDOMs on SPARC systems. With such approaches, a thin layer of software or firmware (the hypervisor) works in conjunction with a control entity (called DOM0 or the Control Domain) to mediate access to a server's physical hardware and to allow multiple operating system instances to run concurrently on that hardware. These operating system instances are usually called guest OS instances or virtual machines.

This particular diagram illustrates server consolidation, the common Enterprise use-case for virtualization. With server consolidation, workload previously run on physically separate machines is aggregated onto a single system, usually to achieve savings on either capital or operational expense or both. This virtualization is essentially transparent to an application running within a guest OS instance, which is part of the power of the approach since applications need not be modified to run in a virtualized environment. Note that while there are cases in which consolidating multiple guest OSes onto a single node as above would be useful in an HPC context, the more common HPC scenario involves running a single guest OS instance per node.
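As a toy illustration of why consolidation saves money, consider first-fit packing of lightly loaded workloads onto virtualized hosts. The utilization figures below are invented for illustration; this is not a real capacity planner:

```python
# Sketch: first-fit consolidation of lightly loaded workloads onto
# virtualized hosts. Loads are fractional CPU utilization; the numbers
# are made up for illustration.
def consolidate(loads, host_capacity=1.0):
    hosts = []  # each entry is the summed load on one physical host
    for load in sorted(loads, reverse=True):
        for i, used in enumerate(hosts):
            if used + load <= host_capacity:
                hosts[i] += load   # fits on an existing host
                break
        else:
            hosts.append(load)     # need another physical host
    return hosts

# Six servers each running at 15% utilization fit on a single host.
assert len(consolidate([0.15] * 6)) == 1
```

As noted above, this packing scenario is the Enterprise use-case; the common HPC scenario instead runs a single guest OS instance per node.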

While server consolidation is important in a Cloud context to reduce operational costs for the Cloud provider, the encapsulation of pre-integrated and pre-tested software in a portable virtual machine is perhaps the most important aspect of virtualization for Cloud Computing. Cloud users can create an entirely customized software environment that supports their application, create a virtual machine file that includes this software, and then upload and run this software on a Cloud's virtualized infrastructure. As we will see, this encapsulation can be used to advantage in certain HPC scenarios as well.

Before discussing specific HPC use-cases for virtualization, we must first address the issue of performance since any significant reduction in application performance would not be acceptable to HPC users, rendering virtualization uninteresting to the community.

Yes, I know. You can't read the graphics because they are too small. That was actually intentional even for the slides that were projected at the conference. The graphs show comparisons of performance in virtualized and non-virtualized (native) environments for several aspects of large, linear algebra computations. The full paper, The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software by Youseff, Seymour, You, Dongarra, and Wolski is available here.

The tiny graphs show that there was essentially no performance difference found between native and virtualized environments. The curves in the two lower graphs are basically identical, showing essentially the same performance levels for virtual and native. The top-left histogram shows the same performance across all virtual and native test cases. In the top-right graph each of the "fat" histogram bars represents a different test with separate virtual and native test results shown within each fat bar. The fat bars are flat because there was little or no difference between virtual and native performance results.

These results, while comforting, should not be surprising. HPC codes are generally compute intensive so one would expect such straight-line code to execute at full speed in a virtualized environment. These tests also confirm, however, that aspects of memory performance are essentially unaffected by virtualization as well. These results are a tribute primarily to the maturity of virtualization support included in current processor architectures.

Note, however, that the explorations described in this paper focused solely on the performance of computational kernels running within a single node. For virtualization to be useful for HPC, it must also offer good performance for distributed, parallel applications. And there, dear reader, is the problem.

Current virtualization approaches typically use what is called a split driver model to handle IO operations: a device driver within the guest OS instance communicates with the "bottom" half of the driver, which runs in DOM0. The DOM0 driver has direct control of the real hardware, which it accesses on behalf of any guest OS instances that make IO requests. While this correctly virtualizes the IO hardware, it does so with a significant performance penalty. That extra hop (in both directions) plays havoc with achievable IO bandwidths and latencies, clearly not appropriate for either high-performance MPI communications or high-speed access to storage.
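A back-of-the-envelope model makes the penalty concrete. The numbers here are invented purely for illustration:

```python
# Toy model of the split driver cost: each IO crosses from the guest's
# front-end driver to DOM0's back-end and back, adding two traversals
# per operation. All latency figures are invented for illustration.
def proxied_latency_us(native_us, hop_us, hops=2):
    return native_us + hops * hop_us

native = 2.0  # suppose ~2 us of native interconnect latency
assert proxied_latency_us(native, hop_us=5.0) == 12.0  # a 6x degradation
```

Even a modest per-hop cost dominates once native latencies are measured in microseconds, which is exactly the regime MPI lives in.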

Far preferable would be to allow each guest OS instance direct access to real hardware resources. This is precisely the purpose of PCI-IOV (IO Virtualization), a part of the PCI specification that specifies how PCI devices should behave in a virtualized environment. Simply put, PCI-IOV allows a single physical device to masquerade as several separate physical devices, each with their own hardware resources. Each of these pseudo-physical devices can then be assigned directly to a guest OS instance for its own use, avoiding the proxied IO situation shown on the previous slide. Such an approach should greatly improve the current IO performance situation.
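Here is a conceptual model of the idea in plain Python -- no real PCI programming involved, and the class and names are hypothetical illustrations:

```python
# Conceptual sketch of PCI-IOV: a single physical device (the physical
# function) exposes several virtual functions (VFs), each of which can be
# assigned directly to one guest OS instance. Once assigned, the guest
# drives its VF itself rather than proxying IO through DOM0.
class IOVDevice:
    def __init__(self, num_vfs):
        self.free_vfs = [f"vf{i}" for i in range(num_vfs)]
        self.assigned = {}  # vf -> guest name

    def assign(self, guest):
        if not self.free_vfs:
            raise RuntimeError("no virtual functions left on this device")
        vf = self.free_vfs.pop(0)
        self.assigned[vf] = guest
        return vf

hca = IOVDevice(num_vfs=4)  # e.g. an InfiniBand HCA exposing 4 VFs
assert hca.assign("guest-os-1") == "vf0"
assert hca.assign("guest-os-2") == "vf1"
```

Each guest ends up with what looks like its own device, which is the property that removes DOM0 from the data path.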

PCI-IOV requires hardware support from the IO device and such support is beginning to appear. It also requires software support at the OS and hypervisor level and that is beginning to appear as well.

While not based on PCI-IOV, the work done by Liu, Huang, Abali, and Panda and reported in their paper, High Performance VMM-Bypass I/O in Virtual Machines, gives an idea as to what is possible when the proxied approach is bypassed and the high-performance aspects of the underlying IO hardware are made available to the guest OS instance. The full paper is available here.

The graph on the top-left shows MPI latency as a function of message size for the virtual and native cases while the top-right graph shows throughput as a function of message size in both cases. You will note that the curves on each graph are essentially identical. These tests used MVAPICH in polling mode to achieve these results.

By contrast, the bottom-left graph reports virtual and native Netperf results shown as transactions per second across a range of message sizes. In this case, the virtualized results are not as good, especially at the smaller message sizes. This is due to the fact that interrupt processing is still proxied through DOM0 and for small message sizes it has an appreciable effect on throughput, about 25% in the worst case.

The final graph compares the performance of NAS parallel benchmarks in the virtual and native cases and shows little performance impact in these cases for virtualization.

The work makes a plausible case for the feasibility of virtualization for HPC applications, including parallel distributed HPC applications. More exploration is needed, as is PCI-IOV capable InfiniBand hardware.

This and the next few slides outline several of the dozen or so use-cases for virtualization in HPC. The value of these use-cases to an HPC customer needs to be measured against any performance degradations introduced by virtualization to assess the utility of a virtualized HPC approach. I believe that several of these use-cases are compelling enough for virtualization to warrant serious attention from the HPC community as a future path for the entire community.

Heterogeneity. Many HPC sites support end users with a wide array of software requirements. While some may be able to use the default operating system version installed at a site, others may require a different version of the same OS due to application constraints. Yet others may require a completely different OS or the installation of other non-standard software on their compute nodes. With virtualization, the choice of operating system, application, and software stack is left to end users (or ISVs) who package their software into pre-built virtual machines for execution on an HPC hardware resource, much the way cloud computing offerings work today. In such an environment, site administrators manage the life cycle of virtual machines rather than configuring and providing a proliferation of different software stacks to meet their users' disparate needs. Of course, a site may still elect to make standard software environments available for those users who do not require custom compute environments.

Effective distributed resource management is an important aspect of HPC site administration. Scheduling a mixture of jobs of various sizes and priorities onto shared compute resources is a complex process that can often lead to less than optimal use of available resources. Using Live Migration, a distributed resource management system can make dynamic provisioning decisions to shift running jobs onto different compute nodes to free resources for an arriving high-priority job, or to consolidate running workloads onto fewer resources for power management purposes.

For those not familiar, Live Migration allows a running guest OS instance to be shifted from one physical machine to another without shutting it down. More precisely, the OS instance continues to run as its memory pages are migrated and then at some point, it is actually stopped so the remaining pages can be migrated to the new machine, at which point its execution can be resumed. The technique is described in detail in this paper.
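The iterative pre-copy scheme just described can be modeled in a few lines. This is a toy model with made-up page counts and a fixed re-dirty rate, not the real algorithm's bookkeeping:

```python
# Toy model of pre-copy live migration: memory pages are copied while the
# guest keeps running, but pages the guest dirties during a copy round
# must be re-sent in the next round. The guest is stopped only when the
# remaining dirty set is small enough to send during a brief pause.
# The page counts and re-dirty fraction are invented for illustration.
def precopy_rounds(total_pages, redirty_fraction=0.02,
                   stop_threshold=256, max_rounds=30):
    remaining = total_pages
    rounds = 0
    while remaining > stop_threshold and rounds < max_rounds:
        remaining = int(remaining * redirty_fraction)  # dirtied during this round
        rounds += 1
    return rounds, remaining  # guest pauses only to send `remaining` pages

rounds, pause_pages = precopy_rounds(total_pages=1_000_000)
assert rounds == 3 and pause_pages == 8  # downtime is tiny vs. 1M pages copied
```

The converging dirty set is what makes the observable downtime so short compared with naively stopping the guest and copying everything.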

Live migration can be significantly accelerated by bringing the speed and efficiency of InfiniBand RDMA to bear on the movement of data between systems as described in High Performance Virtual Machine Migration with RDMA over Modern Interconnects by Huang, Gao, Liu, and Panda. See the full paper for details on yet another example of how the Enterprise can benefit from advances made by the HPC community.

The ability to checkpoint long-running jobs has long been an HPC requirement. Checkpoint-restart (CPR) can be used to protect jobs from underlying system failures, to deal with scheduled maintenance events, and to allow jobs to be temporarily preempted by higher-priority jobs and later restarted. While an important requirement, HPC vendors have generally been unable to deliver adequate CPR functionality, requiring application writers to code customized checkpointing capabilities within their applications. Using virtualization to save the state of a virtual machine, both single-VM and multi-VM (e.g. MPI) jobs can be checkpointed more easily and more completely than was achievable with traditional, process-based CPR schemes.
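The difference in scope can be sketched as follows. A real hypervisor serializes guest memory and device state; the dict below is a hypothetical stand-in used only to show the shape of the operation:

```python
import pickle, tempfile

# Sketch of VM-level checkpoint-restart. The point is scope: the
# checkpoint captures the whole machine (memory, device state, OS state),
# not just one process image as traditional process-based CPR does.
# The vm dict is a hypothetical stand-in for real serialized guest state.
def checkpoint(vm_state, path):
    with open(path, "wb") as f:
        pickle.dump(vm_state, f)

def restore(path):
    with open(path, "rb") as f:
        return pickle.load(f)

vm = {"memory": [0] * 8, "devices": {"nic": "up"}, "app": {"iteration": 41}}
with tempfile.NamedTemporaryFile(suffix=".ckpt") as f:
    checkpoint(vm, f.name)
    vm["app"]["iteration"] = 42      # the job advances, then the node fails...
    vm = restore(f.name)             # ...so restart elsewhere from the checkpoint
assert vm["app"]["iteration"] == 41  # whole-machine state rolled back intact
```

Because the unit of capture is the VM rather than a process, the application needs no custom checkpointing code at all.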

As HPC cluster sizes continue to grow and with them the sizes of distributed parallel applications, it becomes increasingly important to protect application state from underlying hardware failures. Checkpointing is one method for offering this protection, but it is expensive in time and resources since the state of the entire application must be written to disk at each checkpoint interval.

Using Live Migration, it will be possible to dynamically relocate individual ranks of a running MPI application from failing nodes to other healthy nodes. In such a scenario, applications will pause briefly and then continue as affected MPI ranks are migrated and required MPI connections are re-established to the new nodes. Coupled with advanced fault management capabilities, this becomes a fast and incremental method of maintaining application forward progress in the presence of underlying system failures. Both multi-node and single-node applications can be protected with this mechanism.
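The policy side of this can be sketched as follows. The data structures and placement rule are hypothetical; a real implementation would sit between the fault manager, the hypervisor, and the MPI runtime:

```python
# Sketch: proactive evacuation of MPI ranks (each wrapped in its own VM)
# away from a node that fault management predicts will fail. The mapping
# and the trivial placement policy are hypothetical illustrations.
def evacuate(rank_to_node, failing_node, healthy_nodes):
    """Reassign every rank on failing_node to some healthy node."""
    targets = list(healthy_nodes)
    moved = {}
    for rank, node in rank_to_node.items():
        if node == failing_node:
            moved[rank] = targets[rank % len(targets)]  # round-robin placement
    return {**rank_to_node, **moved}

ranks = {0: "n1", 1: "n2", 2: "n2", 3: "n3"}
after = evacuate(ranks, failing_node="n2", healthy_nodes=["n1", "n3"])
assert "n2" not in after.values()     # failing node fully evacuated
assert sorted(after) == [0, 1, 2, 3]  # every rank survives; the job continues
```

The job never loses a rank; it merely pauses while the affected VMs move, which is the incremental alternative to a full checkpoint-restart cycle.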

In summary, virtualization holds much promise for HPC as a technology that can be used to mitigate a number of significant pain points for the HPC community. With appropriate hardware support, both compute and IO performance should be acceptable in a virtualized environment especially when judged against the benefits to be accrued from a virtualized approach.

Virtualization for HPC is mostly a research topic currently. While incremental steps are feasible now, full support for all high-impact use cases will require significant engineering work to achieve.

This talk focused on virtualization and the benefits it can bring to the HPC community. This is one example of a much larger trend towards convergence between HPC, Enterprise, and Cloud. As this convergence continues, we will see many additional opportunities to leverage advances in one sphere to the future benefit of all. I see it as fortuitous that this convergence is underway since it allows significant amounts of product development effort to be leveraged across multiple markets, an approach that is particularly compelling in a time of reduced resources and economic uncertainty.

Friday Jan 09, 2009

HPC and Virtualization: Oak Ridge Trip Report

Just before Sun's Winter Break, I attended a meeting at Oak Ridge National Laboratory in Tennessee with Stephen Scott, Geoffroy Vallee, Christian Engelmann, Thomas Naughton, and Anand Tikotekar, all of the Systems Research Team (SRT) at ORNL. Attending from Sun were Tim Marsland, Greg Lavender, Rebecca Arney, and myself. The topic was HPC and virtualization, an area the SRT has been exploring for some time and one I've been keen on as well, since it has become clear v12n has much to offer the HPC community. This is my trip report.

I arrived at Logan Airport in Boston early enough on Monday to catch an earlier flight to Dulles, narrowly avoiding the five-hour delay that eventually afflicted my original flight. The flight from Boston to Knoxville via Dulles went smoothly and I arrived without difficulty to a rainy and chilly Tennessee evening. I was thrilled to have made it through Dulles without incident since more often than not I have some kind of travel difficulty when my trips pass through IAD (more on that later.) The 25 mile drive to the Oak Ridge DoubleTree was uneventful.

Oak Ridge is still very much a Lab town from what I could see, much like Los Alamos, but certainly less isolated. Movie reviews in the Oak Ridge Observer are rated with atoms rather than stars. Stephen Scott, who leads the System Research Team (SRT) at ORNL, mentioned that the plot plan for his house is stamped "Top Secret -- Manhattan Project" because the plan shows the degree difference between "ORNL North" and "True North", an artifact of the time when period maps of the area deliberately skewed the position of Oak Ridge to lessen the chance that a map could be used to successfully bomb ORNL targets from the air during the war.

We spent all day Tuesday with Stephen and most of the System Research Team. Tim talked about what Sun is doing with xVM and our overall virtualization strategy and ended with a set of questions that we spent some time discussing. Greg then talked in detail about both Crossbow and InfiniBand, specifically with respect to aspects related to virtualization. We spent the rest of the day hearing about some of the work on resiliency and virtualization being done by the team. See the end of this blog entry for pointers to some of the SRT papers as well as other HPC/virtualization papers I have found to be interesting.

Resiliency isn't something the HPC community has traditionally cared much about. Nodes were thin and cheap. If a node crashed, restart the job, replace the node, use checkpoint-restart if you can. Move on; life on the edge is hard. But the world is changing. Nodes are getting fatter again--more cores, more memory, more IO. Big SMPs in tiny packages with totally different economics from traditional large SMPs. Suddenly there is enough persistent state on a node that people start to care how long their nodes stay up. Capabilities like Fault Management start to look really interesting, especially if you are a commercial HPC customer using HPC in production.

In addition, clusters are getting larger. Much larger, even with fatter nodes. Which means more frequent hardware failures. Bad news for MPI, the world's most brittle programming model. Certainly, some more modern programming models would be welcome, but in the meantime what can be done to keep these jobs running longer in the presence of continual hardware failures? This is one promise of virtualization. And one reason why a big lab like ORNL is looking seriously at virtualization technologies for HPC.

Live migration -- the ability to shift running OS instances from one node to another -- is particularly interesting from a resiliency perspective. Linking live migration to a capable fault management facility (see, for example, what Sun has been doing in this area) could allow jobs to avoid interruption due to an impending node failure. Research by the SRT (see the Proactive Fault Tolerance paper, below) and others has shown this is a viable approach for single-node jobs and also for increasing the survivability of MPI applications in the presence of node failures. Admittedly, the current prototype depends on Xen TCP tricks to handle MPI traffic interruption and continuation, but with sufficient work to virtualize the InfiniBand fabric, this technique could be extended to that realm as well. In addition, the use of an RDMA-enabled interconnect can itself greatly increase the speed of live migration as is demonstrated in the last paper listed in the reference section below.

We discussed other benefits of virtualization. Among them, the use of multiple virtual machines per physical node to simulate a much larger cluster for demonstrating an application's basic scaling capabilities in advance of being allowed access to a real, full-scale (and expensive) compute resource. Such pre-testing becomes very important in situations in which large user populations are vying for access to relatively scarce, large-scale, centralized research resources.

Geoffroy also spoke about "adapting systems to applications, not applications to systems," by which he meant that virtualization allows an application user to bundle their application into a virtual machine instance along with any other required software, regardless of the "supported" software environment available on a site's compute resource. Being able to run applications on old operating system versions, or on operating systems with which a site's administrative staff has no experience, truly does allow the application provider to adapt the system to their application without placing an additional administrative burden on a site's operational staff. Of course, this pushes the burden of creating a correct configuration onto the application provider, but the freedom and flexibility should be welcomed by those who need it. Those who don't could presumably bundle their application into a "standard" guest OS instance. This is completely analogous to the use and customization of Amazon Machine Images (AMIs) on Amazon's Elastic Compute Cloud (EC2) infrastructure.

Observability was another simpatico area of discussion. DTrace has taken low-cost, fine-grained observability to new heights (new depths, actually). Similarly, SRT is looking at how one might add dynamic instrumentation at the hypervisor level to offer a clearer view of where overhead is occurring within a virtualized environment to promote user understanding and also offer a debugging capability for developers.

A few final tidbits to capture before closing. Several other research efforts are looking at HPC and virtualization, among them V3VEE (a joint project of the University of New Mexico and Northwestern University) and XtreemOS (a somewhat different approach to virtualization for HPC and Grids). SRT is also working on a virtualized version of OSCAR called OSCAR-V.

The Dulles Vortex of Bad Travel was more successful on my way home. My flight from Knoxville was delayed with an unexplained mechanical problem that could not be fixed in Knoxville, requiring a new plane to be flown in from St. Louis. I arrived very late into Dulles, about 10 minutes before my connection to Boston was due to leave from the other end of the terminal. I ran to the gate, arriving two minutes before the flight was scheduled to depart, but it was already gone--no sign of the gate agents or the plane. Spent the night at an airport hotel and flew home first thing the next morning. Dulles had struck again--this was at least the third time I've had problems like this when passing through IAD. I have colleagues who refuse to travel with me through this airport. With good reason, apparently.

Reading list:

Proactive Fault Tolerance for HPC with Xen Virtualization, Nagarajan, Mueller, Engelmann, Scott

The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software, Youseff, Seymour, You, Dongarra, Wolski

Performance Implications of Virtualizing Multicore Cluster Machines, Ranadive, Kesavan, Gavrilovska, Schwan

High Performance Virtual Machine Migration with RDMA over Modern Interconnects, Huang, Gao, Liu, Panda

Wednesday Aug 08, 2007

UltraSPARC T2: Be My Guest, Guest, Guest...

I mentioned in my last blog entry that our new UltraSPARC T2 processor (code name Niagara 2) would make a nice consolidation platform with its 64 threads and with Sun's SPARC virtualization product (LDOMs) that allows multiple OS instances (Solaris or Linux) to be run simultaneously on a single system.

Lest you dismiss this as random bluster, check out Ash's blog and watch his flash demo. Picture a roughly pizza-box sized computer running 64 separate operating systems.

Pretty slick!


Josh Simons

