Friday Feb 12, 2010

Mirror, Mirror

Today is my last day at Sun, so I write to say goodbye to all my friends and colleagues, whom I will dearly miss.

I believe Oracle plans to maintain all previous blog content, but just in case I have made a copy of this entire blog, which is called, fittingly, the Mirror of the Navel of Narcissus. Yes, even more references to self-indulgence. :-)

Follow the link to find out what I'm doing next!

Friday Jan 15, 2010

Virtualization for HPC: The Heterogeneity Issue

I've been advocating for a while now that virtualization has much to offer HPC customers (see here). In this blog entry I'd like to focus on one specific use case, heterogeneity. It's an interesting case because while heterogeneity is either desirable or to be avoided, depending on your viewpoint, virtualization can help in either case.

The diagram above depicts a typical HPC cluster installation with each compute node running whichever distro was chosen as that site's standard OS. Homogeneity like this eases the administrative burden, but it does so at the cost of flexibility for end-users. Consider, for example, a shared compute resource like a national supercomputing center or a centralized cluster serving multiple departments within a company or other organization. Homogeneity can be a real problem for end-users whose applications only run on either other versions of the chosen cluster OS or, worse, on completely different operating systems. These users are generally not able to use these centralized facilities unless they can port their application to the appropriate OS or convince their application provider to do so.

The situation with respect to heterogeneity for software providers, or ISVs (independent software vendors), is quite different. These providers have been wrestling with expenses and other difficulties related to heterogeneity for years. For example, while ISVs typically develop their applications on a single platform (OS 0 above), they must often port and support their application on several operating systems in order to address the needs of their customer base. Even assuming the ISV decides correctly which operating systems should be supported to maximize revenue, it must still incur considerable expense to continually qualify and re-qualify its application on each supported operating system version. It must also maintain a complex, multi-platform testing infrastructure and the in-house expertise to support these efforts.

Imagine instead a virtualized world, as shown above. In such a world, cluster nodes run hypervisors on which pre-built and pre-configured software environments (virtual machines) are run. These virtual machines include the end-user's application and the operating system required to run that application. So far as I can see, everyone wins. Let's look at each constituency in turn:

  • End-users -- End-users have complete freedom to run any application using any operating system because all of that software is wrapped inside a virtual machine whose internal details are hidden. The VM could be supplied by an ISV, built by an open-source application's community, or created by the end-user. Because the VM is a black box from the cluster's perspective, the choice of application and operating system need no longer be restricted by cluster administrators.
  • Cluster admins -- In a virtualized world, cluster administrators are in the business of launching and managing the lifecycle of virtual machines on cluster nodes and no longer need deal with the complexities of OS upgrades, configuring software stacks, handling end-user special software requests, etc. Of course, a site might still opt to provide a set of pre-configured "standard" VMs for end-users who do not have a need for the flexibility of providing their own VMs. (If this all sounds familiar -- it should. Running a shared, virtualized HPC infrastructure would be very much like running a public cloud infrastructure like EC2. But that is a topic for another day.)
  • ISVs -- ISVs can now significantly reduce the complexity and cost of their business. Since ISV applications would be delivered wrapped within a virtual machine that also includes an operating system and other required software, ISVs would be free to select a single OS environment for developing, testing, AND deploying their application. Rather than basing their operating system choice on market share considerations, the decision could be made based on the quality of the development environment, or perhaps the stability or performance levels achievable with a particular OS, or perhaps on the ability to partner closely with an OS vendor to jointly deliver a highly-optimized, robust, and completely supported experience for end-customers.

Thursday Jan 14, 2010

Sun Grid Engine: Still Firing on All Cylinders

The Sun Grid Engine team has just released the latest version of SGE, humbly called Sun Grid Engine 6.2 update 5. It's a yawner of a name for a release that actually contains some substantial new features and improvements to Sun's distributed resource management software, among them Hadoop integration, topology-aware scheduling at the node level (think NUMA), and improved cloud integration and power management capabilities.

You can get the bits directly here. Or you can visit Dan's blog for more details first. And then get the bits.

Monday Dec 21, 2009

Sun HPC Consortium Videos Now Available

Thanks to Rich Brueckner and Deirdré Straughan, videos and PDFs are now available from the Sun HPC Consortium meeting held just prior to Supercomputing '09 in Portland, Oregon. Go here to see a variety of talks from Sun, Sun partners, and Sun customers on all things HPC. Highlights for me included Dr. Happy Sithole's presentation on Africa's largest HPC cluster (PDF|video), Marc Parizeau's talk about CLUMEQ's Colossus system and its unique datacenter design (PDF|video), and Tom Verbiscer's talk describing Univa UD's approach to HPC and virtualization, including some real application benchmark numbers illustrating the viability of the approach (PDF|video).

My talk, HPC Trends, Challenges, and Virtualization (PDF|video) is an evolution of a talk I gave earlier this year in Germany. The primary purposes of the talk were to illustrate the increasing number of common challenges faced by enterprise, cloud, and HPC users and to highlight some of the potential benefits of this convergence to the HPC community. Virtualization is specifically discussed as one such opportunity.

Thursday Nov 19, 2009

You Put Your HPC Cluster in a...WHAT??

Judging from a quick look at the survey results from this weekend's Sun HPC Consortium meeting in Portland, Oregon, Marc Parizeau's talk was a favorite with both customers and Sun employees.

Marc is Deputy Director of CLUMEQ and a professor at Université Laval in Québec City. His talk, Colossus: A cool HPC tower! [PDF, 10MB], describes with many photos how a 1960s era Van de Graaff generator facility was turned into an innovative, state of the art, supercomputing installation featuring Sun Constellation hardware. Very much worth a look.

A nicely-produced CLUMEQ / Constellation video that describes the creation of this computing facility is also available on YouTube.

Wednesday Nov 18, 2009

Climate Modeling: How much computing required to run a Century Experiment?

Henry Tufo from NCAR and CU-Boulder spoke this weekend at the Sun HPC Consortium meeting here in Portland, OR. As part of his talk, More than a Big Machine: Why We Need Multiple Breakthroughs to Tackle Cloud Resolving Climate [PDF], he estimated the number of floating-point operations (FLOPs) needed to compute a climate model over a one-century time scale with a 1 km atmosphere model.

His answer was the highlight of the Consortium for me: A Century Experiment requires about a mole of FLOPs. :-)
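To get a feel for the size of that number, here is a back-of-the-envelope sketch in Python. Avogadro's number is exact by definition; the sustained rates are hypothetical values picked purely for illustration and do not come from the talk.

```python
# Avogadro's number: the count of items in one mole (exact by SI definition).
AVOGADRO = 6.02214076e23

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_compute(total_flops, sustained_flops_per_second):
    """Return the wall-clock years needed at a constant sustained rate."""
    return total_flops / sustained_flops_per_second / SECONDS_PER_YEAR


# Hypothetical sustained rates, chosen only for illustration.
for label, rate in [("1 PFLOP/s", 1e15), ("1 EFLOP/s", 1e18)]:
    print(f"{label}: {years_to_compute(AVOGADRO, rate):.2f} years")
```

At a sustained petaflop per second, a mole of floating-point operations works out to roughly 19 years of continuous computing, which is why the title of the talk asks for more than a big machine.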

Monday Oct 19, 2009

Workshop on High Performance Virtualization

As discussed in several earlier posts (here and here), virtualization technology is destined to play an important role in HPC. If you have an opinion about that or are interested in learning more, consider attending or submitting a technical paper or position paper to the Workshop on High Performance Virtualization: Architecture, Systems, Software, and Analysis, which is being held in conjunction with HPCA-16, the 16th International Symposium on High Performance Computing Architecture, and also in conjunction with PPoPP 2010, the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Apparently, Bangalore is the place to be in early January 2010!

Time is short: Workshop submissions are due November 10, 2009.

Friday Oct 16, 2009

Fortress: Parallel by Default

I gave a short talk about Fortress, a new parallel language, at the Sun HPC Workshop in Regensburg, Germany and thought I'd post the slides here with commentary. Since I'm certainly not a Fortress expert by any stretch of the imagination, my intent was to give the audience a feel for the language and its origins rather than attempt a deep dive in any particular area. Christine Flood from the SunLabs Programming Languages Research Group helped me with the slides. I also stole liberally from presentations and other materials created by other Fortress team members.

The unofficial Fortress tag line, inspired by Fortress's emphasis on programmer productivity. With Fortress, programmer/scientists express their algorithms in a mathematical notation that is much closer to their domain of expertise than the syntax of the typical programming language. We'll see numerous examples in the following slides.

At the highest level, there are two things to know about Fortress. First, that it started as a SunLabs research project, and, second, that the work is being done in the open as the Project Fortress Community, whose website is here. Source code downloads, documentation, code samples, etc., are all available on the site.

Fortress was conceived as part of Sun's involvement in a DARPA program called High Productivity Computing Systems (HPCS), which was designed to encourage the development of hardware and software approaches that would significantly increase the productivity of the application developers and users of High Performance Computing systems. Each of the three companies selected to continue past the introductory phase of the program proposed a language designed to meet these requirements. IBM chose essentially to extend Java for HPC, while both Cray and Sun proposed new object-oriented languages. Michèle Weiland at the University of Edinburgh has written a short technical report that offers a comparison of the three language approaches. It is available in PDF format here.

I've mentioned productivity, but not defined it. I recommend visiting Michael Van De Vanter's publications page for more insight. Michael was a member of the Sun HPCS team who focused with several colleagues on the issue of productivity in an HPCS context. His considerable publication list is here.

Because I don't believe Sun's HPCS proposal has ever been made public, I won't comment further on the specific scalability goals set for Fortress other than to say they were chosen to complement the proposed hardware approach. Because Sun was not selected to proceed to the final phase of the HPCS program, we have not built the proposed system. We have, however, continued the Fortress project and several other initiatives that we believe are of continuing value.

Growability was a philosophical decision made by Fortress designers and we'll talk about that later. For now, note that Fortress is implemented as a small core with an extensive and growing set of capabilities provided by libraries.

As mentioned earlier, Fortress is designed to accommodate the programmer/scientist by allowing algorithms to be expressed directly in familiar mathematical notation. It is also important to note that Fortress constructs are parallel by default, unlike many other languages, which require an explicit declaration to create parallelism. Actually, to be more precise, Fortress is "potentially parallel" by default. If parallelism can be found, it will be exploited.

Finally, some code. We will look at several versions of a factorial function over the next several slides to illustrate some features of Fortress. (For additional illustrative Fibonacci examples, go here.) The first version of the function is shown here beneath a concise, mathematical definition of factorial for reference.

The red underlines highlight two Fortressisms. First, the condition in the first conditional is written naturally as a single range rather than as the more conventional (and convoluted) two-clause condition. And, second, the actual recursion shows that juxtaposition can be used to imply multiplication as is common when writing mathematical statements.
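The slide itself is not reproduced in this text, so here is a rough analogue in Python rather than Fortress. It assumes the range condition has the form 0 <= n <= 1, which Python happens to accept as a chained comparison. Python has no juxtaposition, so the multiplication is spelled out.

```python
def factorial(n):
    """Recursive factorial for non-negative integers."""
    if n < 0:
        raise ValueError("factorial is undefined for negative integers")
    # Chained comparison: one range test instead of `n >= 0 and n <= 1`.
    if 0 <= n <= 1:
        return 1
    # No juxtaposition in Python; the multiplication must be explicit.
    return n * factorial(n - 1)
```

The function name and the negative-input check are choices made for this sketch and are not taken from the slide.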

This version defines a new operator, the "!" factorial operator, and then uses that operator in the recursive step. The code has also been run through the Fortress pretty printer that converts it from ASCII form to a more mathematically formatted representation. As you can see, the core logic of the code now closely mimics the mathematical definition of factorial.

This non-recursive version of the operator definition uses a loop to compute the factorial.

Since Fortress is parallel by default, all iterations of this loop could theoretically be executed in parallel, depending on the underlying platform. The "atomic" keyword ensures that the update of the variable result is performed atomically to ensure correct execution.
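As a sketch of the same idea in Python (not Fortress), the loop body below is handed to a thread pool and a lock stands in for the "atomic" keyword. The function name, the worker count, and the use of a lock are all assumptions of this example; Fortress's own mechanism is not shown here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_factorial(n, workers=4):
    """Multiply 1..n together, with loop iterations running on a thread pool."""
    result = 1
    lock = threading.Lock()

    def body(i):
        nonlocal result
        # The lock plays the role of `atomic`: the read-multiply-write
        # of `result` cannot interleave with another iteration's update.
        with lock:
            result *= i

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every iteration and surfaces any exception.
        list(pool.map(body, range(1, n + 1)))
    return result
```

Because multiplication is commutative and associative, the order in which iterations acquire the lock does not change the answer. In CPython the threads will not make this faster; the sketch only shows the shape of a parallel loop with a protected update.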

This slide shows an example of how Fortress code is written with a standard keyboard and what the code looks like after it is formatted with Fortify, the Fortress pretty printer. Several common mathematical operators are shown at the bottom of the slide along with their ASCII equivalents.

A few examples of Fortress operator precedence. Perhaps the most interesting point is the fact that white space matters to the Fortress parser. Since the spacing in the 2nd negative example implies a precedence different than the actual precedence, this statement would be rejected by Fortress on the theory that its execution would not compute the result intended by the programmer.

Don't go overboard with juxtaposition as a multiplication operator -- there is clearly still a role for parentheses in Fortress, especially when dealing with complex expressions. While these two statements are supposed to be equivalent, I should point out that the first statement actually has a typo and will be rejected by Fortress. Can you spot the error? It's the "3n" that's the problem because it isn't a valid Fortress number, illustrating one case in which a juxtaposition in everyday math isn't accepted by the language. Put a space between the "3" and the "n" to fix the problem.

Here is a larger example of Fortress code. On the left is the algorithmic description of the conjugate gradient (CG) component of the NAS parallel benchmarks, taken directly from the original 1994 technical report. On the right is the Fortress code. Or do I have that backwards? :-)

More Fortress code is available on the Fortress by Example page at the community site.

Several ways to express ranges in Fortress.

The first static array definition creates a one dimensional array of 1000 32-bit integers. The second definition creates a one dimensional array of length size, initialized to zero. I know it looks like a 2D array, but the 2nd instance of ZZ32 in the Array construct refers to the type of the index rather than specifying a 2nd array dimension.

The last array subexpression is interesting, since it is only partially specified. It extracts a 20x10 subarray from array b starting at its origin.

Tuple components can be evaluated in parallel, including arguments to functions. As with for loops, do clauses execute in parallel.

In Fortress, generators control how loops are run and generators generally run computations in any order, often in parallel. As an example, the sum reduction over X and Y is controlled by a generator that will cause the summation of the products to occur in parallel or at the least in a non-deterministic order if running on a single-processor machine.
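A loose Python analogue of such a reduction, assuming X and Y are equal-length sequences: the index space is split into chunks, each chunk's partial sum is computed on a thread pool, and the partial sums are then added together. The function name, chunk size, and worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_dot(xs, ys, workers=4, chunk=256):
    """Sum of xs[i] * ys[i], reduced chunk by chunk on a thread pool."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    def partial_sum(start):
        stop = min(start + chunk, len(xs))
        return sum(xs[i] * ys[i] for i in range(start, stop))

    starts = range(0, len(xs), chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Chunks may execute in any order; for integers the total is the
        # same regardless. Floating-point totals can vary with grouping.
        return sum(pool.map(partial_sum, starts))
```

The reduction is safe to reorder only because addition is associative, which is the property a parallel generator relies on.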

In Fortress, when parallelism is generated the execution of that work is handled using a work stealing strategy similar to that used by Cilk. Essentially, when a compute resource finishes executing its tasks, it pulls work items from other processors' work queues, ensuring that compute resources stay busy by load balancing the available work across all processors.
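The scheduling idea can be sketched with a small single-threaded simulation in Python. This is not the Fortress or Cilk scheduler: the round-robin stepping and the "steal from the fullest queue" victim choice are simplifications made here to keep the example deterministic.

```python
from collections import deque


def run_work_stealing(queues):
    """Simulate workers draining per-worker deques, stealing when idle.

    `queues` is a list of lists of zero-argument callables, one list per
    worker. Returns (results, steals): results[w] holds what worker w
    computed, and steals counts tasks taken from another worker's queue.
    """
    deques = [deque(q) for q in queues]
    results = [[] for _ in deques]
    steals = 0
    while any(deques):
        for worker, own in enumerate(deques):
            if own:
                # The owner takes the newest task from the tail of its deque.
                task = own.pop()
            else:
                # An idle worker steals the oldest task from the fullest victim.
                victim = max(deques, key=len)
                if not victim:
                    continue  # nothing left anywhere
                task = victim.popleft()
                steals += 1
            results[worker].append(task())
    return results, steals


# All eight tasks start on worker 0; workers 1 and 2 begin idle.
tasks = [[(lambda i=i: i * i) for i in range(8)], [], []]
results, steals = run_work_stealing(tasks)
```

Even though every task starts on one queue, all three workers end up doing work. Owners and thieves operate on opposite ends of a deque, which is the usual arrangement in work-stealing schedulers.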

Essentially, a restatement of an earlier point: In Fortress, generators play the role that iterators play in other languages. By relegating the details of how the index space is processed to the generator, it is natural to then also allow the generator to control how the enclosed processing steps are executed. A generator might execute computations serially or in parallel.

A generator could conceivably also control whether computations are done locally on a single system or distributed across a cluster, though the Fortress interpreter currently only executes within a single node. To me, the generator concept is one of the nicer aspects of Fortress.
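A minimal Python sketch of that separation of concerns follows. The two classes and their run method are hypothetical names invented for this example and are not part of Fortress: the loop body stays the same while the generator object decides whether it runs serially or on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


class SerialGenerator:
    """Visits the index space one element at a time, in order."""

    def __init__(self, indices):
        self.indices = list(indices)

    def run(self, body):
        return [body(i) for i in self.indices]


class ParallelGenerator(SerialGenerator):
    """Same index space, but the body runs on a thread pool."""

    def __init__(self, indices, workers=4):
        super().__init__(indices)
        self.workers = workers

    def run(self, body):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(body, self.indices))


def squares(generator):
    # The loop body is unchanged; the generator decides how it executes.
    return generator.run(lambda i: i * i)
```

Calling squares(SerialGenerator(range(5))) and squares(ParallelGenerator(range(5))) gives the same list. A generator that farmed the body out to other nodes would slot in the same way.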

Guy Steele, who is Fortress Principal Investigator along with Eric Allen, has been working in the programming languages area long enough to know the wisdom of these statements. Watch him live the reality of growing a language in his keynote at the 1998 ACM OOPSLA conference. Be amazed at the cleverness, but listen to the message as well.

The latest version of the Fortress interpreter (source and binary) is available here. If you would like to browse the source code online, do so here.

Some informational pointers. Christine also tells me that the team is working on an overview talk like this one. Except I expect it will be a lot better. :-) Though I only scratched the surface in a superficial way, I hope this brief overview has given you at least the flavor of what Project Fortress is about.

Friday Sep 04, 2009

HPC Virtual Conference: No Travel Budget? No Problem!

Sun is holding a virtual HPC conference on September 17th featuring Andy Bechtolsheim as keynote speaker. Andy will be talking about the challenges around creating Exaflop systems by 2020, after which he will participate in a chat session with attendees. In fact, each of the conference speakers (see agenda) will chat with attendees after their presentations.

There will also be two sets of exhibits to "visit" to find information on HPC solutions for specific industries or to get information on specific HPC technologies. Industries covered include MCAE, EDA, Government/Education/Research, Life Sciences, and Digital Media. There will be technology exhibits on storage software and hardware, integrated software stack for HPC, compute and networking hardware, and HPC services.

This is a free event. Register here.

Thursday Aug 27, 2009

Parallel Computing: Berkeley's Summer Bootcamp

Two weeks ago the Parallel Computing Laboratory at the University of California Berkeley ran an excellent three-day summer bootcamp on parallel computing. I was one of about 200 people who attended remotely while another large pile of people elected to attend in person on the UCB campus. This was an excellent opportunity to listen to some very well known and talented people in the HPC community. Video and presentation material is available on the web and I would recommend it to anyone interested in parallel computing or HPC. See below for details.

The bootcamp, which was called the 2009 Par Lab Boot Camp - Short Course on Parallel Programming, covered a wide array of useful topics, including introductions to many of the current and emerging HPC parallel computing models (pthreads, OpenMP, MPI, UPC, CUDA, OpenCL, etc.), hands-on labs for in-person attendees, and some nice discussions on parallelism and how to find it with an emphasis on the motifs (patterns) of parallelism identified in The Landscape of Parallel Computing Research: A View From Berkeley. There was also a presentation on performance analysis tools and several application-level talks. It was an excellent event.

The bootcamp agenda is shown below. Session videos and PDF decks are available here.

Talk title | Speaker(s)
Introduction and Welcome | Dave Patterson (UCB)
Introduction to Parallel Architectures | John Kubiatowicz (UCB)
Shared Memory Programming with Pthreads, OpenMP and TBB | Katherine Yelick (UCB & LBNL), Tim Mattson (Intel), Michael Wrinn (Intel)
Sources of parallelism and locality in simulation | James Demmel (UCB)
Architecting Parallel Software Using Design Patterns | Kurt Keutzer (UCB)
Data-Parallel Programming on Manycore Graphics Processors | Bryan Catanzaro (UCB)
OpenCL | Tim Mattson (Intel)
Computational Patterns of Parallel Programming | James Demmel (UCB)
Building Parallel Applications | Ras Bodik (UCB), Nelson Morgan (UCB)
Distributed Memory Programming in MPI and UPC | Katherine Yelick (UCB & LBNL)
Performance Analysis Tools | Karl Fuerlinger (UCB)
Cloud Computing | Matei Zaharia (UCB)

Wednesday Jul 15, 2009

HPC Trends and Virtualization

Here are the slides (with associated commentary) that I used for a talk I gave recently in Hamburg at Sun's HPC Consortium meeting just prior to the International Supercomputing Conference. The topic was HPC trends with a focus on virtualization and the continued convergence of HPC, Enterprise, and Cloud IT. A PDF version of the slides is available here.

Challenges faced by the HPC community arise from several sources. Some are created by the arrival and coming ubiquity of multi-core processors. Others stem from the continuing increases in problem sizes at the high end of HPC and the commensurate need for ever-larger compute and storage clusters. And still others derive from the broadening of HPC to a wider array of users, primarily commercial/industrial users for whom HPC techniques may offer a significant future competitive advantage.

Perhaps chief among these challenges is the increasingly difficult issue of cluster management complexity, which has been identified by IDC as a primary area of concern. This is especially true at the high end due to the sheer scale involved, but problems exist in the low and midrange as well since commercial/industrial customers are generally much less interested in or tolerant of approaches with a high degree of operational complexity--they expect more than is generally available currently.

Application resilience is also becoming an issue in HPC circles. At the high end there is a recognition that large distributed applications must be able to make meaningful forward progress while running on clusters whose components may be experiencing failures on a near-continuous basis due to the extremely large sizes of these systems. At the midrange and low end, individual nodes will include enough memory and CPU cores that their continued operation in the presence of failures becomes a significant issue.

Gone are the days of macho bragging about the MegaWatt rating of one's HPC datacenter. Power and cooling must be minimized while still delivering good performance for a site's HPC workload. How this will be accomplished is an area of significant interest to the HPC community--and their funding agencies.

While the ramifications of multi-core processors for HPC are a critical issue, as are issues related to future programming models and high productivity computing, these are not dealt with in this talk due to time constraints.

Most HPC practitioners are comfortable with the idea that innovations in HPC eventually become useful to the wider IT community. Extreme adherents to this view may point out that the World Wide Web itself is a byproduct of work done within the HPC community. Absent that claim, there are still plenty of examples illustrating the point that HPC technologies and techniques do eventually find broader applicability.

This system was recently announced by Oracle. Under the hood, its architecture should be familiar to any HPC practitioner: it is a DDR InfiniBand, x86-based cluster in a box, expandable in a scalable way to eight cabinets with both compute and storage included.

The value of a high-bandwidth, low-latency interconnect like InfiniBand is a good example of the leverage of HPC technologies in the Enterprise. We've also seen significant InfiniBand interest from the Financial Services community for whom extremely high messaging rates are important in real-time trading and other applications.

It is also important to realize that "benefit" flows in both directions in these cases. While the Enterprise may benefit from HPC innovations, the HPC community also benefits any time Enterprise uptake occurs since adoption by the much larger Enterprise market virtually assures that these technologies will continue to be developed and improved by vendors. Widespread adoption of InfiniBand outside of its core HPC constituency would be a very positive development for the HPC community.

A few months ago, several colleagues and I held a panel session in Second Life to introduce Sun employees to High Performance Computing as part of our "inside outreach" to the broader Sun community.

As I noted at the time, it was a strange experience to be talking about "HPC" while sitting inside what is essentially an HPC application -- that is, Second Life itself. SL is a great example of how HPC techniques can be repurposed to deliver business value in areas far from what we would typically label High Performance Computing. In this case, SL executes about 30M concurrent server-side scripts at any given time and uses the Havok physics engine to simulate a virtual world that has been block-decomposed across about 15K processor cores. Storage requirements are about 100 TB in over one billion files. It sure smells like HPC to me.

Summary: HPC advances benefit the Enterprise in numerous ways. Certainly with interconnect technology, as we've discussed. With respect to horizontal scaling, HPC has been the poster child for massive horizontal scalability for over a decade, ever since clusters first made their appearance in HPC (but more about this later).

Parallelization for performance has not been broadly addressed beyond HPC, at least not yet. With the free ride offered by rocketing clock speed increases coming to an end, parallelization for performance is going to rapidly become everyone's problem and not an HPC-specific issue. The question at hand is whether parallelization techniques developed for HPC can be repurposed for use by the much broader developer community. It is beyond the scope of this talk to discuss this in detail, but I believe the answer is that both the HPC community and the broader community have a shared interest in developing newer and easier to use parallelization techniques.

A word on storage. While Enterprise is the realm of Big Database, it is the HPC community that has been wrestling with both huge data storage requirements and equally huge data transfer requirements. Billions of files and PetaBytes of storage are not uncommon in HPC at this point with aggregate data transfer rates of hundreds of GigaBytes per second.

One can also look at HPC technologies that are of benefit to Cloud Computing. To do that, however, realize that before "Cloud Computing" there was "Grid Computing", which came directly from work by the HPC community. The idea of allowing remote access to large, scalable compute and storage resources is very familiar to HPC practitioners since that is the model used worldwide to allow a myriad of individual researchers access to HPC resources in an economically feasible way. Handling horizontal scale and the mapping of workload to available resources are core HPC competencies that translate directly to Cloud Computing requirements.

Of course, Clouds are not the same as Grids. Clouds offer advanced APIs and other mechanisms for accessing remote resources. And Clouds generally depend on virtualization as a core technology. But more on that later.

As a community, we tend to think of HPC as being leading and bleeding edge. But is that always the case? Have there been advances in either Enterprise or Cloud that can be used to HPC's advantage? There is no question in my mind that the answer to this question is a very strong Yes.

Let's talk in more detail about how Enterprise and Cloud advances can help address application resilience, cluster management complexity, effective use of resources, and power efficiency for HPC. Specifically, I'd like to discuss how the virtualization technologies used in Enterprise and Cloud can be used to address these current and emerging HPC pain points.

I am going to use this diagram frequently on subsequent slides so it is important to define our terms. When I say "virtualization" I am referring to OS virtualization of the type done with, for example, Xen on x86 systems or LDOMs on SPARC systems. With such approaches, a thin layer of software or firmware (the hypervisor) works in conjunction with a control entity (called DOM0 or the Control Domain) to mediate access to a server's physical hardware and to allow multiple operating system instances to run concurrently on that hardware. These operating system instances are usually called "guest OS" or virtual machine instances.

This particular diagram illustrates server consolidation, the common Enterprise use-case for virtualization. With server consolidation, workload previously run on physically separate machines is aggregated onto a single system, usually to achieve savings on either capital or operational expense or both. This virtualization is essentially transparent to an application running within a guest OS instance, which is part of the power of the approach since applications need not be modified to run in a virtualized environment. Note that while there are cases in which consolidating multiple guest OSes onto a single node as above would be useful in an HPC context, the more common HPC scenario involves running a single guest OS instance per node.

While server consolidation is important in a Cloud context to reduce operational costs for the Cloud provider, the encapsulation of pre-integrated and pre-tested software in a portable virtual machine is perhaps the most important aspect of virtualization for Cloud Computing. Cloud users can create an entirely customized software environment that supports their application, create a virtual machine file that includes this software, and then upload and run this software on a Cloud's virtualized infrastructure. As we will see, this encapsulation can be used to advantage in certain HPC scenarios as well.

Before discussing specific HPC use-cases for virtualization, we must first address the issue of performance since any significant reduction in application performance would not be acceptable to HPC users, rendering virtualization uninteresting to the community.

Yes, I know. You can't read the graphics because they are too small. That was actually intentional even for the slides that were projected at the conference. The graphs show comparisons of performance in virtualized and non-virtualized (native) environments for several aspects of large, linear algebra computations. The full paper, The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software by Youseff, Seymour, You, Dongarra, and Wolski is available here.

The tiny graphs show that there was essentially no performance difference found between native and virtualized environments. The curves in the two lower graphs are basically identical, showing essentially the same performance levels for virtual and native. The top-left histogram shows the same performance across all virtual and native test cases. In the top-right graph each of the "fat" histogram bars represents a different test with separate virtual and native test results shown within each fat bar. The fat bars are flat because there was little or no difference between virtual and native performance results.

These results, while comforting, should not be surprising. HPC codes are generally compute intensive so one would expect such straight-line code to execute at full speed in a virtualized environment. These tests also confirm, however, that aspects of memory performance are essentially unaffected by virtualization as well. These results are a tribute primarily to the maturity of virtualization support included in current processor architectures.

Note, however, that the explorations described in this paper focused solely on the performance of computational kernels running within a single node. For virtualization to be useful for HPC, it must also offer good performance for distributed, parallel applications as well. And there, dear reader, is the problem.

Current virtualization approaches typically use what is called a split driver model to handle IO operations: a device driver within the guest OS instance communicates with the "bottom" half of the driver, which runs in DOM0. The DOM0 driver has direct control of the real hardware, which it accesses on behalf of any guest OS instances that make IO requests. While this correctly virtualizes the IO hardware, it does so with a significant performance penalty. That extra hop (in both directions) plays havoc with achievable IO bandwidths and latencies, clearly not appropriate for either high-performance MPI communications or high-speed access to storage.

Far preferable would be to allow each guest OS instance direct access to real hardware resources. This is precisely the purpose of PCI-IOV (IO Virtualization), a part of the PCI specification that specifies how PCI devices should behave in a virtualized environment. Simply put, PCI-IOV allows a single physical device to masquerade as several separate physical devices, each with its own hardware resources. Each of these pseudo-physical devices can then be assigned directly to a guest OS instance for its own use, avoiding the proxied IO situation shown on the previous slide. Such an approach should greatly improve the current IO performance situation.

PCI-IOV requires hardware support from the IO device and such support is beginning to appear. It also requires software support at the OS and hypervisor level and that is beginning to appear as well.

While not based on PCI-IOV, the work done by Liu, Huang, Abali, and Panda and reported in their paper, High Performance VMM-Bypass I/O in Virtual Machines, gives an idea as to what is possible when the proxied approach is bypassed and the high-performance aspects of the underlying IO hardware are made available to the guest OS instance. The full paper is available here.

The graph on the top-left shows MPI latency as a function of message size for the virtual and native cases while the top-right graph shows throughput as a function of message size in both cases. You will note that the curves on each graph are essentially identical. These tests used MVAPICH in polling mode to achieve these results.

By contrast, the bottom-left graph reports virtual and native Netperf results shown as transactions per second across a range of message sizes. In this case, the virtualized results are not as good, especially at the smaller message sizes. This is due to the fact that interrupt processing is still proxied through DOM0 and for small message sizes it has an appreciable effect on throughput, about 25% in the worst case.

The final graph compares the performance of NAS parallel benchmarks in the virtual and native cases and shows little performance impact in these cases for virtualization.

The work makes a plausible case for the feasibility of virtualization for HPC applications, including parallel distributed HPC applications. More exploration is needed, as is PCI-IOV-capable InfiniBand hardware.

This and the next few slides outline several of the dozen or so use-cases for virtualization in HPC. The value of these use-cases to an HPC customer needs to be measured against any performance degradations introduced by virtualization to assess the utility of a virtualized HPC approach. I believe that several of these use-cases are compelling enough for virtualization to warrant serious attention from the HPC community as a future path for the entire community.

Heterogeneity. Many HPC sites support end users with a wide array of software requirements. While some may be able to use the default operating system version installed at a site, others may require a different version of the same OS due to application constraints. Yet others may require a completely different OS or the installation of other non-standard software on their compute nodes. With virtualization, the choice of operating system, application, and software stack is left to end users (or ISVs) who package their software into pre-built virtual machines for execution on an HPC hardware resource, much the way cloud computing offerings work today. In such an environment, site administrators manage the life cycle of virtual machines rather than configuring and providing a proliferation of different software stacks to meet their users' disparate needs. Of course, a site may still elect to make standard software environments available for those users who do not require custom compute environments.

Effective distributed resource management is an important aspect of HPC site administration. Scheduling a mixture of jobs of various sizes and priorities onto shared compute resources is a complex process that can often lead to less than optimal use of available resources. Using Live Migration, a distributed resource management system can make dynamic provisioning decisions to shift running jobs onto different compute nodes to free resources for an arriving high-priority job, or to consolidate running workloads onto fewer resources for power management purposes.
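
To make the consolidation idea a bit more concrete, here is a minimal sketch of how a resource manager might decide where running guests should end up. It uses a plain first-fit-decreasing heuristic, which is my choice for illustration and not something any particular DRM is known to do. The function name, the load units (cores), and the capacity figure are all hypothetical, and the sketch only computes a target placement; the actual moves would be carried out with Live Migration.

```python
def consolidate(vm_loads, node_capacity):
    """Pack VM loads onto nodes using first-fit decreasing.

    vm_loads      -- hypothetical per-VM load, e.g. cores in use
    node_capacity -- hypothetical capacity of each identical node
    Returns a list of nodes, each a list of the VM loads placed on it.
    This is a heuristic: it does not always find the minimum node count.
    """
    nodes = []
    for load in sorted(vm_loads, reverse=True):
        if load > node_capacity:
            raise ValueError("a single VM exceeds node capacity")
        for node in nodes:
            # Place the VM on the first node that still has room.
            if sum(node) + load <= node_capacity:
                node.append(load)
                break
        else:
            # No existing node has room, so keep another node powered on.
            nodes.append([load])
    return nodes


if __name__ == "__main__":
    # Five guests spread over five nodes could be packed onto two.
    print(consolidate([5, 4, 3, 2, 2], node_capacity=8))
```

Any nodes left empty by such a plan are candidates for powering down, which is the power management case mentioned above.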

For those not familiar, Live Migration allows a running guest OS instance to be shifted from one physical machine to another without shutting it down. More precisely, the OS instance continues to run as its memory pages are migrated and then at some point, it is actually stopped so the remaining pages can be migrated to the new machine, at which point its execution can be resumed. The technique is described in detail in this paper.
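
The iterative copy-then-stop behavior just described can be sketched as a toy model. This is only an illustration under simplifying assumptions of my own: the guest re-dirties a fixed fraction of the pages sent in each round, and the names `simulate_precopy`, `dirty_fraction`, and `stop_threshold` are hypothetical rather than taken from any hypervisor.

```python
def simulate_precopy(total_pages, dirty_fraction, stop_threshold, max_rounds=30):
    """Toy model of live migration with iterative page copying.

    total_pages    -- memory pages in the guest (all must be sent once)
    dirty_fraction -- assumed fraction of the pages sent in a round that the
                      still-running guest dirties again during that round
    stop_threshold -- once this few pages remain, stop the guest and copy them
    max_rounds     -- give up on iterating if the dirty set never shrinks
    """
    to_send = total_pages  # in the first round every page counts as dirty
    live_rounds = []
    while to_send > stop_threshold and len(live_rounds) < max_rounds:
        live_rounds.append(to_send)  # copied while the guest keeps running
        to_send = int(to_send * dirty_fraction)  # pages dirtied meanwhile
    return {
        "live_rounds": live_rounds,
        # The guest is paused only while these last pages are copied.
        "stop_and_copy_pages": to_send,
        "total_pages_sent": sum(live_rounds) + to_send,
    }


if __name__ == "__main__":
    print(simulate_precopy(total_pages=1024, dirty_fraction=0.25, stop_threshold=20))
```

In this model the pause seen by the guest depends only on the final, small batch of pages, at the cost of sending some pages more than once.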

Live migration can be significantly accelerated by bringing the speed and efficiency of InfiniBand RDMA to bear on the movement of data between systems as described in High Performance Virtual Machine Migration with RDMA over Modern Interconnects by Huang, Gao, Liu, and Panda. See the full paper for details on yet another example of how the Enterprise can benefit from advances made by the HPC community.

The ability to checkpoint long-running jobs has long been an HPC requirement. Checkpoint-restart (CPR) can be used to protect jobs from underlying system failures, to deal with scheduled maintenance events, and to allow jobs to be temporarily preempted by higher-priority jobs and later restarted. While CPR is an important requirement, HPC vendors have generally been unable to deliver adequate CPR functionality, requiring application writers to code customized checkpointing capabilities within their applications. Using virtualization to save the state of virtual machines, both single-VM and multi-VM (e.g. MPI) jobs can be checkpointed more easily and more completely than was achievable with traditional, process-based CPR schemes.

As HPC cluster sizes continue to grow and with them the sizes of distributed parallel applications, it becomes increasingly important to protect application state from underlying hardware failures. Checkpointing is one method for offering this protection, but it is expensive in time and resources since the state of the entire application must be written to disk at each checkpoint interval.

Using Live Migration, it will be possible to dynamically relocate individual ranks of a running MPI application from failing nodes to other healthy nodes. In such a scenario, applications will pause briefly and then continue as affected MPI ranks are migrated and required MPI connections are re-established to the new nodes. Coupled with advanced fault management capabilities, this becomes a fast and incremental method of maintaining application forward progress in the presence of underlying system failures. Both multi-node and single-node applications can be protected with this mechanism.

In summary, virtualization holds much promise for HPC as a technology that can be used to mitigate a number of significant pain points for the HPC community. With appropriate hardware support, both compute and IO performance should be acceptable in a virtualized environment especially when judged against the benefits to be accrued from a virtualized approach.

Virtualization for HPC is mostly a research topic currently. While incremental steps are feasible now, full support for all high-impact use cases will require significant engineering work to achieve.

This talk focused on virtualization and the benefits it can bring to the HPC community. This is one example of a much larger trend towards convergence between HPC, Enterprise, and Cloud. As this convergence continues, we will see many additional opportunities to leverage advances in one sphere to the future benefit of all. I see it as fortuitous that this convergence is underway since it allows significant amounts of product development effort to be leveraged across multiple markets, an approach that is particularly compelling in a time of reduced resources and economic uncertainty.

Monday Jul 13, 2009

Performance Facts, Performance Wisdom

I was genuinely excited to see that members of Sun's Strategic Applications Engineering team have started a group blog about performance called BestPerf. These folks are the real deal -- they are responsible for generating all of the official benchmark results published by Sun -- and they collectively have a deep background in all things related to performance. I like the blog because while they do cover specific benchmark results in detail, they also share best practices and include broader discussions about achieving high performance as well. There is a lot of useful material for anyone seeking a better understanding of performance.

Here are some recent entries that caught my eye.

Find out how a Sun Constellation system running SLES 10 beat IBM BlueGene/L on a NAMD Molecular Dynamics benchmark here.

See how the Solaris DTrace facility can be used to perform detailed IO analyses here.

Detailed Fluent cluster benchmark results using the Sun Fire x2270 and SLES 10? Go here.

How to use Solaris containers, processor sets, and scheduling classes to improve application performance? Go here.

Wednesday Jul 01, 2009

Run an HPC Cluster...On your Laptop

With one free download, you can now turn your laptop into a virtual three-node HPC cluster that can be used to develop and run HPC applications, including MPI apps. We've created a pre-configured virtual machine that includes all the components you need:

Sun Studio C, C++, and Fortran compilers with performance analysis, debugging tools, and high-performance math library; Sun HPC ClusterTools -- MPI and runtime based on Open MPI; and Sun Grid Engine -- Distributed resource management and cloud connectivity

Inside the virtual machine, we use OpenSolaris 2009.06, the latest release of OpenSolaris, to create a virtual cluster using Solaris zones technology and have pre-configured Sun Grid Engine to manage it so you don't need to. MPI is ready to go as well---we've configured everything in advance.

If you haven't tried OpenSolaris before, this will also give you a chance to play with ZFS, with DTrace, with Time Slider (like Apple's Time Machine, but without the external disk) and a host of other cool new OpenSolaris capabilities.

For full details on Sun HPC Software, Developer Edition for OpenSolaris check out the wiki.

To download the virtual image for VMware, go here. (VirtualBox image coming soon.)

If you have comments or questions, send us a note at

Wednesday Jun 24, 2009

HPC in Hamburg: Sun Customers Speak at the HPC Consortium

It's crazy time again. I'm in Hamburg for two HPC events: Sun's HPC Consortium customer event, and ISC '09, the International Supercomputing Conference. The Consortium ran all day Sunday and Monday and then ISC started on Tuesday. It is now Wednesday and this is the first break I've had to post a summary of talks given at the Consortium. Due to the sheer number of presentations, including a wide range of Sun and partner talks, I only summarize those given by our customers. The full agenda is here.

Our first customer talk on Sunday was given by Dr. James Leylek, Executive Director of CU-CCMS, the Clemson University Computational Center for Mobility Systems, which focuses on problems in the automotive, aviation/aerospace, and energy industries.

The mission of CU-CCMS is not unique -- there are numerous university-based centers that work closely with industry by bringing resources and expertise to bear in a variety of problem domains. What sets CU-CCMS apart is its focus on addressing the mismatch between typical university time-scales and those of their industrial partners. Businesses need results quickly; universities move more slowly.

CU-CCMS has addressed this need in a few ways. They've staffed the center with full-time MS and PhD level engineers who have no teaching responsibilities. And they have provided a significant amount of computing gear to enable those engineers to work effectively with their industrial partners and generate results in a timely way.

Heterogeneity is another key part of the CU-CCMS strategy. By offering a range of computing platforms from clusters to very large shared-memory machines (from Sun) they are able to map problems to appropriate resources to deliver the fast turnaround times required by their industrial partners.

Dr. Leylek also briefly discussed the challenge of introducing HPC to industry as detailed in the Council on Competitiveness study, Reveal. As he noted, many companies are "sitting on the sidelines" of HPC and not engaging even though they could increase their competitiveness by using HPC techniques. He believes CU-CCMS offers a model for how such engagements can be run successfully: assemble a team of expert, dedicated technical resources with appropriate domain knowledge, algorithmic expertise, etc, and combine that with ample high performance computing infrastructure, and an understanding that turnaround time is critical for successful industrial engagements. And then generate valuable results. Lather, rinse, repeat.

Thomas Nau from the University of Ulm gave the next talk, which was a quick tour through several OpenSolaris technologies. He talked about COMSTAR, gave a quick demo of the new OpenSolaris Time Slider, and spent most of his time talking about ZFS, specifically about the benefits of solid state disks for increasing ZFS performance. Thomas identified the ZIL -- the ZFS Intent Log -- as the component most often affecting performance. Experiments he has done that involved moving the ZIL from a standard hard disk to a ramdisk have shown significant ZFS performance improvements. In addition, only a small amount of solid state storage is needed to achieve good performance, e.g. perhaps 1-4 GB even for multi-TeraByte drives. Thomas noted that while one could theoretically increase ZFS performance by disabling the ZIL, DO NOT DO THIS. He then ended with the following statement, with which I can only agree: "Hardware RAID is dead, dead, dead. Just use ZFS." :-)

Our first customer talk on Monday was given by Prof. Dr. Thomas Lippert, head of the Jülich Supercomputing Center (JCS), site of Sun's largest European deployment to date of our Sun Constellation System architecture. He first gave a brief history of the Jülich Research Center, which is one of the largest civilian research centers in Europe with over 4000 researchers in nine departments, one of which is the new Institute for Advanced Simulation of which JCS is a part. The site has a very long history of computer acquisitions, starting in 1957. This year JCS purchased three systems: a Sun system (JuRoPa), a Bull system (HPC-FF), and an IBM system (Jugene.) These systems have, respectively, 200 TFLOPs, 100 TFLOPs, and 1 PFLOPs of peak performance. Since the Sun and Bull systems are interconnected at the highest level of their switch hierarchies, the two machines can be run as a single system. This combined system delivered 274.8 TFLOPs on LINPACK which earned it the #10 entry on the latest edition of the TOP500 list. Collectively, JCS serves about 250 projects across Europe, including 20-30 highly scalable projects that are chosen by international referees for their potential for producing breakthrough science.

Dr. Lippert also spoke briefly about PRACE, the Partnership for Advanced Computing in Europe, which is radically changing the supercomputing landscape across Europe. Due to earlier studies, computing is now considered to be a crucial pillar of research infrastructure and, as such, it is now receiving considerable attention from funding agencies.

In closing, Dr. Lippert presented specific details of the JuRoPa system (2208 nodes, 17664 cores, 207 TFLOPs, 48 GB/node, and Sun's new M9 QDR switch.) He also described some of the specific issues that will be explored with these systems, including control of jitter through the use of gang scheduling, daemon reduction, a SLERT kernel, etc. And some additional secret sauce from Sun perhaps. :-)
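
For the curious, the JuRoPa figures quoted above can be turned into a few per-node and per-core numbers with simple arithmetic. This is just a back-of-the-envelope sketch: the function name is hypothetical, I assume cores and memory are spread evenly across nodes, and I use decimal units (1 TB = 1000 GB).

```python
def derive_per_core_figures(nodes, cores, peak_tflops, mem_per_node_gb):
    """Derive per-node and per-core figures from the quoted system totals.

    Assumes a homogeneous machine: every node has the same core count
    and the same amount of memory.
    """
    cores_per_node = cores // nodes
    return {
        "cores_per_node": cores_per_node,
        "peak_gflops_per_core": peak_tflops * 1000 / cores,
        "mem_per_core_gb": mem_per_node_gb / cores_per_node,
        "total_mem_tb": nodes * mem_per_node_gb / 1000,
    }


if __name__ == "__main__":
    # Inputs are the numbers given in the talk summary.
    figures = derive_per_core_figures(
        nodes=2208, cores=17664, peak_tflops=207, mem_per_node_gb=48
    )
    for name, value in figures.items():
        print(f"{name}: {value}")
```

With these inputs the system works out to 8 cores and 6 GB per core on each node, roughly 11.7 GFLOPs of peak per core, and about 106 TB of memory in total.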

Prof. Satoshi Matsuoka from the Tokyo Institute of Technology spoke next. While he did mention Tsubame, Tokyo Tech's Sun-based supercomputer, he primarily spoke about the return of vector machines to HPC. They have, he believes, been reincarnated as GPGPU-based machines. Dinosaurs are once again walking the earth. :-) In particular, the GPGPU's high compute density, high memory bandwidth, and low memory latency echo some of the fundamental capabilities of vector machines that make them interesting for both tightly coupled codes like N-body as well as sparse codes like CFD. In his view, the GPGPU essentially becomes the main processor while the CPU becomes an ancillary processor.

Computers, however, are not useful unless they can be used to solve problems. To support the fact that GPGPU-based clusters can be effective HPC platforms, Prof. Matsuoka presented results from several new algorithms that have been developed at Tokyo Tech to take advantage of GPGPU-based systems. He showed impressive results for 3D FFTs used for protein folding and results for CFD with speedups up to 70X over CPU-based algorithms.

Our next customer speaker was Henry Tufo of the University of Colorado at Boulder (UCB) and the National Center for Atmospheric Research (NCAR.) He gave an update on UCB's upcoming Constellation-based HPC system and also spoke about some of the challenges related to climate modeling. It seems clear at this point that accurate climate modeling is going to be critical for understanding our future and our planet's future. It was a bit daunting to hear that climate modelers would like to increase many dimensions of their simulations, including spatial resolution by 10^3 or 10^5, the completeness of their models by a factor of 100x, the length of their simulation runs by 100x, and increase the number of modeled parameters by 100x. All told, their desires would increase computational needs by 10^10 or 10^12 over current requirements. It was sobering to hear that current technology trajectories predict that a 10^7 improvement will take about 30 years. Not good.
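
To put that last figure in perspective, here is a rough extrapolation. It assumes, and this is purely my assumption, that improvement continues at the same constant exponential rate implied by "a factor of 10^7 in about 30 years". The function names are hypothetical.

```python
import math


def years_to_reach(target_exponent, known_exponent=7, known_years=30):
    """Years needed for a 10**target_exponent improvement.

    Assumes constant exponential growth calibrated to the quoted trend:
    a 10**known_exponent improvement in known_years.
    """
    return known_years * target_exponent / known_exponent


def implied_doubling_time(known_exponent=7, known_years=30):
    """Years per doubling implied by the same trend."""
    doublings = known_exponent * math.log2(10)
    return known_years / doublings


if __name__ == "__main__":
    for exponent in (10, 12):
        print(f"10^{exponent}: about {years_to_reach(exponent):.0f} years")
    print(f"doubling time: about {implied_doubling_time():.2f} years")
```

Under that assumption, the 10^10 and 10^12 targets would be something like 43 and 51 years away, with capability doubling roughly every 1.3 years.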

Their new Sun-based system will consist of 12 Constellation racks, Nehalem blades, QDR InfiniBand, and about 500 TB of storage, with about 10% of the cluster's nodes accelerated with GPUs. The system will be located next to an existing physics building in three containers -- one for the IT components, one for electrical, and one for cooling.

Stephane Thiell from the Commissariat à l’Énergie Atomique (CEA) gave an overview of CEA, talked a bit about CEA's TERA-100 project and then detailed CEA's planned use of Lustre for TERA-100. The CEA computing complex currently has two computing centers, one classified (TERA) and one open (CCRT.) TERA-100 will be a follow-on to TERA-10, which is a 60 TFLOPs, Linux-based system built by Bull in 2005. It includes an impressive 1 PetaByte Lustre filesystem and uses HPSS to archive to Sun StorageTek tape libraries with a 15 PetaByte capacity.

TERA-100 aims to increase CEA's classified computing capacity by about 20x with a final size of one PFLOPs or perhaps a little larger. CEA plans to continue with their COTS-based, general-purpose approach rather than move off the main sequence to something more exotic. It will be x86-based with more than 500 GFLOPs per node using 4-socket nodes. There will be 2-4 GB per core and two Lustre file systems will be supported, one with a 300 GB/s transfer requirement and the other with a 200 GB/s requirement. The system will consume less than 5 MW. A 40 TFLOPs demonstrator system will be built first and it will include scaled-down versions of the Lustre file systems as well. In the final system the Lustre servers will be built with four-socket nodes and a four-node HA architecture will be used to guarantee against failure and to avoid long failover times.

CEA is involved in some interesting Lustre-based development, including joint work with Sun on a binding between Lustre and external HSM systems with the goal of supporting Lustre levels of performance with transparent access to hierarchical storage management. CEA is also working on Shine, a management tool for Lustre.

Dieter an Mey gave some general information about computing at RWTH Aachen University and then gave an update on their latest acquisition, a Sun-based supercomputer. He ended with a discussion about the pleasures and perils of workload placement on current generation systems. Along the way he shared some feedback on Sun products -- one of those habits that makes customers like Dieter such valuable partners for Sun.

Aachen provides both Linux and Windows-based HPC resources for their users. On Linux they record about 40,000 batch jobs per month and perhaps 150 interactive sessions per day. The Windows cluster is used primarily for interactive jobs. It was interesting to hear that Windows is gaining ground with respect to Linux at Aachen: a previous study at Aachen had shown that Windows lagged Linux in performance by about 24%, but a recent re-run of the study now shows the gap to be on the order of about 7%.

Aachen's new system will support both Linux and Windows equally with a flexible dividing line between them. The facility is designed to be general purpose with a mix of thin and fat nodes and with the required high-speed interconnect for those who use MPI. A new building is being erected to house this machine which will come fully online over the course of 2009-2010. When complete, the system will have a peak floating point rate in excess of 200 TFLOPs and it will include a 1 PetaByte Lustre file system. Speaking of Lustre, Dieter rated its configuration as "complex", something Sun is working on. The system will also include two of Sun's latest InfiniBand switches, the new 648-port QDR M9 switch.

Dieter's final topic was the correct placement of workload on non-uniform system architectures. In particular, he described the difference between compact and scatter placement on multicore NUMA systems. Compact placement uses threads on the same core first, then cores in the same socket --- a strategy that is used to minimize latency and to maximize cache sharing. Scatter placement uses threads on different sockets first, and then threads on different cores -- a strategy that maximizes memory bandwidth. Which strategy is best depends on the details of an application's underlying algorithms. (Dieter noted that currently Sun Grid Engine is not aware of these issues -- it treats nodes as flat arrays of threads or cores.) Placement decisions are further complicated when attempting to schedule more than one application onto a fat node. For example, different strategies would be used depending on whether single job turnaround is more important than overall throughput of jobs.
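
The two placement strategies are easy to show in a few lines. The sketch below enumerates hardware threads on a small, made-up node in compact and in scatter order, following the descriptions above. The topology (2 sockets, 2 cores per socket, 2 threads per core) and the function names are hypothetical, and real systems would take the topology from the OS or a tool rather than hard-coding it.

```python
def compact_order(sockets, cores_per_socket, threads_per_core):
    """Fill threads on one core first, then cores in the same socket,
    and only then move on to the next socket."""
    return [
        (s, c, t)
        for s in range(sockets)
        for c in range(cores_per_socket)
        for t in range(threads_per_core)
    ]


def scatter_order(sockets, cores_per_socket, threads_per_core):
    """Spread across sockets first, then across cores, and only then
    start using additional threads on already-busy cores."""
    return [
        (s, c, t)
        for t in range(threads_per_core)
        for c in range(cores_per_socket)
        for s in range(sockets)
    ]


def place(n_workers, order):
    """Assign the first n_workers slots, as (socket, core, thread) tuples."""
    return order[:n_workers]


if __name__ == "__main__":
    topology = (2, 2, 2)  # hypothetical node: sockets, cores/socket, threads/core
    print("compact:", place(4, compact_order(*topology)))
    print("scatter:", place(4, scatter_order(*topology)))
```

With four workers on this node, compact placement keeps everything on one socket (shared cache, low latency), while scatter placement puts one worker on each core across both sockets (more aggregate memory bandwidth).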

Our last customer talk at the Consortium was given by the tag team of Arnie Miles (left) from Georgetown University and Tim Bornholtz (right) of the Bornholtz Group. Their topic was the Thebes Consortium for which they presented current status, did a short demo, and announced that the source code would be available by the end of June on

The Thebes Consortium aims to help the widespread adoption of distributed computing technologies by creating an enabling infrastructure that focuses on scalability, security, and simplicity.

Arnie described (and Tim demo'ed) the instantiation of the first Thebes use case, which assumes 1) that users have usernames and passwords in their home domain, 2) that one or more local resources have a trust relationship with a local STS (secure token service), 3) that these resources are known to users, and 4) that all resources are able to consume SAML.

The use case itself consists of the following actions: 1) users create job submission files using the client application or a command line, 2) users use institution usernames and passwords to acquire a signed SAML token, 3) users perform no other logins and do not have to go to a resource command line interface, 4) users manually choose their resources, 5) job scheduling is handled by resources. Note that in this instance "resource" refers to a DRM-managed cluster which will accept the incoming request and then schedule the job appropriately on its managed cluster. In the prototype as it currently exists, a service is a compute service though there is also some level of support for a file system service as well.

Thursday Jun 18, 2009

FORTRAN: Calling All Dinosaurs!


Please ASSIGN some time to RECORD your opinions about current and future FORTRAN needs in our non-COMPLEX online survey. It is in your INTRINSIC self-interest to PAUSE and DO so.

It is IMPLICIT and LOGICAL that you also CALL on your colleagues (those CHARACTERs) to READ this, get REAL, and make an ENTRY as well.

You can OPEN the survey IF you GOTO here.

(Something we share in COMMON: I am a FORTRAN TYPE as well and am eligible to join the Dinosaur UNION.)


Josh Simons

