Monday Dec 21, 2009

Sun HPC Consortium Videos Now Available

Thanks to Rich Brueckner and Deirdré Straughan, videos and PDFs are now available from the Sun HPC Consortium meeting held just prior to Supercomputing '09 in Portland, Oregon. Go here to see a variety of talks from Sun, Sun partners, and Sun customers on all things HPC. Highlights for me included Dr. Happy Sithole's presentation on Africa's largest HPC cluster (PDF|video), Marc Parizeau's talk about CLUMEQ's Colossus system and its unique datacenter design (PDF|video), and Tom Verbiscer's talk describing Univa UD's approach to HPC and virtualization, including some real application benchmark numbers illustrating the viability of the approach (PDF|video).

My talk, HPC Trends, Challenges, and Virtualization (PDF|video), is an evolution of a talk I gave earlier this year in Germany. The primary purposes of the talk were to illustrate the increasing number of common challenges faced by enterprise, cloud, and HPC users and to highlight some of the potential benefits of this convergence to the HPC community. Virtualization is specifically discussed as one such opportunity.

Thursday Aug 27, 2009

Parallel Computing: Berkeley's Summer Bootcamp

Two weeks ago the Parallel Computing Laboratory at the University of California, Berkeley ran an excellent three-day summer bootcamp on parallel computing. I was one of about 200 people who attended remotely, while another large pile of people elected to attend in person on the UCB campus. It was a great opportunity to listen to some very well-known and talented people in the HPC community. Video and presentation materials are available on the web, and I recommend them to anyone interested in parallel computing or HPC. See below for details.

The bootcamp, called the 2009 Par Lab Boot Camp - Short Course on Parallel Programming, covered a wide array of useful topics, including introductions to many of the current and emerging parallel programming models used in HPC (pthreads, OpenMP, MPI, UPC, CUDA, OpenCL, etc.), hands-on labs for in-person attendees, and some nice discussions on parallelism and how to find it, with an emphasis on the motifs (patterns) of parallelism identified in The Landscape of Parallel Computing Research: A View From Berkeley. There was also a presentation on performance analysis tools and several application-level talks. It was an excellent event.

The bootcamp agenda is shown below. Session videos and PDF decks are available here.

  • Introduction and Welcome: Dave Patterson (UCB)
  • Introduction to Parallel Architectures: John Kubiatowicz (UCB)
  • Shared Memory Programming with Pthreads, OpenMP and TBB: Katherine Yelick (UCB & LBNL), Tim Mattson (Intel), Michael Wrinn (Intel)
  • Sources of parallelism and locality in simulation: James Demmel (UCB)
  • Architecting Parallel Software Using Design Patterns: Kurt Keutzer (UCB)
  • Data-Parallel Programming on Manycore Graphics Processors: Bryan Catanzaro (UCB)
  • OpenCL: Tim Mattson (Intel)
  • Computational Patterns of Parallel Programming: James Demmel (UCB)
  • Building Parallel Applications: Ras Bodik (UCB), Nelson Morgan (UCB)
  • Distributed Memory Programming in MPI and UPC: Katherine Yelick (UCB & LBNL)
  • Performance Analysis Tools: Karl Fuerlinger (UCB)
  • Cloud Computing: Matei Zaharia (UCB)

Wednesday Apr 15, 2009

Tickless Clock for OpenSolaris

I've been talking a lot to people about the convergence we see happening between Enterprise and HPC IT requirements and how developments in each area can bring real benefits to the other. I should probably do an entire blog entry on specific aspects of this convergence, but for now I'd like to talk about the Tickless Clock OpenSolaris project.

Tickless kernel architectures will be familiar to HPC experts as one method for reducing application jitter on large clusters. For those not familiar with the issue, "jitter" refers to variability in the running time of application code due to underlying kernel activity, daemons, and other stray workloads. MPI programs typically run in alternating compute and communication phases and develop a natural synchronization as they do so, which means applications can be slowed down significantly when some nodes arrive late at these synchronization points. The larger the MPI job, the more likely it is that this type of noise will cause a problem. Measurements have shown surprisingly large slowdowns associated with jitter.
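To make the mechanism concrete, here is a minimal sketch (my own illustration, with a purely artificial workload) of the alternating compute/communicate pattern described above. The MPI_Allreduce acts as the synchronization point: if kernel activity delays even one rank during its compute phase, every rank waits.

```c
/* Minimal sketch of an MPI compute/communicate loop, illustrating why
   per-node jitter hurts: every rank must reach the Allreduce before any
   rank can proceed, so the slowest rank sets the pace each iteration. */
#include <mpi.h>
#include <stdio.h>

static double compute_step(int n)
{
    double s = 0.0;
    for (int i = 1; i <= n; i++)
        s += 1.0 / (double)i;   /* stand-in for real computation */
    return s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;
    for (int step = 0; step < 100; step++) {
        /* Compute phase: if a daemon or clock interrupt steals cycles
           from this rank, it arrives late at the reduction below. */
        local = compute_step(1000000);

        /* Communication phase: an implicit synchronization point; the
           whole job now runs at the speed of its slowest rank. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }
    if (rank == 0)
        printf("final value: %f\n", global);
    MPI_Finalize();
    return 0;
}
```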

Jitter can be lessened by reducing the number of daemons running on a system, turning off all non-essential kernel services, and so on. Even with these changes, however, other sources of jitter remain. One notable source is the clock interrupt used in virtually all current operating systems. This interrupt, which fires 100 times per second, is used to periodically perform housekeeping chores required by the OS, and it is a known contributor to jitter. It is for this reason that IBM implemented a tickless kernel on its Blue Gene systems.

Sun is starting a Tickless Clock project in OpenSolaris to remove the clock interrupt entirely and switch OpenSolaris to an event-based architecture. While I expect this will be very useful for HPC users of OpenSolaris, HPC is not the primary motivator of the project.

As you'll hear in the video interview with Eric Saxe, Senior Staff Engineer in Sun's Kernel Engineering group, the primary reasons he is looking at Tickless Clock are power management and virtualization. For power management, it is important that when the system is idle, it really IS idle: waking up 100 times per second to do nothing wastes power and prevents the system from entering deeper power-saving states. For virtualization, since multiple OS instances may share the same physical server resources, it is important that idle guest OSes really do stay idle. Again, waking up 100 times per second to do nothing steals cycles from active guest OS instances, reducing performance in a virtualized environment.
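For a feel for what "event-based" means here, the toy user-space sketch below (entirely my own illustration, far simpler than real kernel callout handling) contrasts the two approaches: rather than waking at every tick to poll for expired timers, the program sleeps until exactly the next pending deadline.

```c
/* Toy illustration of tickless timer dispatch: instead of waking HZ
   times per second to check for expired timers, compute the earliest
   pending deadline and sleep until precisely then. */
#include <stdio.h>
#include <time.h>

#define NTIMERS 3

int main(void)
{
    /* Deadlines (seconds from program start) for pending "callouts". */
    double deadline[NTIMERS] = { 0.3, 1.1, 2.0 };
    int fired[NTIMERS] = { 0 };

    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int done = 0; done < NTIMERS; ) {
        /* Find the earliest pending deadline... */
        double next = 1e9;
        for (int i = 0; i < NTIMERS; i++)
            if (!fired[i] && deadline[i] < next)
                next = deadline[i];

        /* ...and sleep until exactly that moment, rather than waking
           100 times per second to discover nothing has expired. */
        struct timespec t = { start.tv_sec + (time_t)next,
                              start.tv_nsec + (long)((next - (long)next) * 1e9) };
        if (t.tv_nsec >= 1000000000L) { t.tv_sec++; t.tv_nsec -= 1000000000L; }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &t, NULL);

        for (int i = 0; i < NTIMERS; i++)
            if (!fired[i] && deadline[i] <= next) {
                fired[i] = 1;
                done++;
                printf("timer %d fired at %.1fs\n", i, deadline[i]);
            }
    }
    return 0;
}
```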

While I would argue that both power management and virtualization will become increasingly important to HPC users (more of that convergence thing), it is interesting to see these traditional enterprise issues stimulating new projects that will benefit both enterprise and HPC customers in the future.

Interested in getting involved with implementing a tickless architecture for OpenSolaris? The project page is here.


Wednesday Mar 18, 2009

More Free HPC Developer Tools for Solaris and Linux

The Sun Studio team just released the latest version of our HPC developer tools with so many enhancements and additions it's hard to know where to start this blog entry. I suppose with the basics: As usual, all of the software is free. And available for both Solaris and Linux, specifically Solaris, OpenSolaris, RHEL, SuSE, and Ubuntu. Frankly, Sun would like to be your preferred provider for high-performance Fortran, C, and C++ compilers and tools. Given the performance and capabilities we deliver for HPC with Sun Studio, that seems a pretty reasonable goal to me. We think the price has been set correctly to achieve that as well. :-)

I have to admit to being confused by the naming convention for this release, but it goes something like this. The release is an EA (Early Access) version of Sun Studio 12 Update 1 -- the first major update to Sun Studio 12 since it was released in the summer of 2007. Since Sun Studio's latest and greatest bits are released every three months as part of the Express program, this release can also be called Sun Studio Express 3/09. Different names, same bits. Don't worry about it -- just focus on the fact that they make great compilers and tools. :-)

Regardless of what they call it, the release can be downloaded here. Take it for a spin and let the developers know what you think on the forum or file a request for enhancement (RFE) or a bug report here.

For the full list of new features, go here. For my personal list of favorite new features, read on.

  • Full OpenMP 3.0 compiler and tools support. For those not familiar, OpenMP is the industry standard for directives-based threaded application parallelization. Or, the answer to the question, "So how do I use all the cores and threads in my spiffy new multicore processor?" (A short example follows this list.)
  • ScaLAPACK 1.8 is now included in the Sun Performance Library! It works with Sun's MPI (Sun HPC ClusterTools), which is based on Open MPI 1.3. The Perflib team has also made significant performance enhancements to BLAS, LAPACK, and the FFT routines, including support for the latest Intel and AMD processors. Nice.
  • MPI performance analysis integrated into the Sun Performance Analyzer. Analyzer has for years been a kick-butt performance tool for single-process applications. It has now been extended to help MPI programmers deal with message-passing-related performance problems.
  • Continued, aggressive attention paid to optimizing for the latest SPARC, Intel, and AMD processors. C, C++, and Fortran performance will all benefit from these changes.
  • A new standalone GUI debugger. Go ahead, graduate from printf() and try a real debugger. It won't bite.
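For readers new to OpenMP, here is a minimal sketch (mine, not from the release notes) of the directives-based style referred to in the first item above: one pragma turns a serial loop into a multithreaded one, and the thread count is controlled at run time.

```c
/* Minimal OpenMP example: parallelize a loop with a single directive.
   The array size is arbitrary; with Sun Studio, compile using
   cc -xopenmp (gcc uses -fopenmp). */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    enum { N = 1000000 };
    static double a[N];
    double sum = 0.0;

    /* The iteration space is split across threads; reduction(+:sum)
       safely combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    /* Thread count is set at run time, e.g. via OMP_NUM_THREADS. */
    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```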

As I mentioned above, full details on these new features and many, many more are all documented on this wiki page. And, again, the bits are here.

Thursday Nov 13, 2008

Big News for HPC Developers: More Free Stuff

'Tis the Season. Supercomputing season, that is. Every November the HPC community--users, researchers, and vendors--attends the world's biggest conference on HPC: Supercomputing. This year SC08 is being held in Austin, Texas, to which I'll be flying in a few short hours.

As part of the seasonal rituals, vendors announce new products, showcase new technologies, and generally strut their stuff at the show--and, in some cases, even before it. Sun is no exception, as you will see if you visit our booth at the show and take note of two announcements we made today that should be a Big Deal to HPC developers. The first concerns MPI; the second, our Sun Studio developer tools.

The first announcement extends Sun's support of Open MPI to Linux with the release of ClusterTools 8.1. This is huge news for anyone looking for a pre-built and extensively tested version of Open MPI for RHEL 4 or 5, SLES 9 or 10, OpenSolaris, or Solaris 10. Support contracts are available for a fee if you need one, but you can download the CT 8.1 bits here for free and use them to your heart's content, no strings attached.

Here are some of the major features supported in ClusterTools 8.1:

  • Support for Linux (RHEL 4&5, SLES 9&10), Solaris 10, OpenSolaris
  • Support for Sun Studio compilers on Solaris and Linux, plus the GNU/gcc toolchain on Linux
  • MPI profiling support with Sun Studio Analyzer (see SSX 11.2008), plus support for VampirTrace and MPI PERUSE
  • InfiniBand multi-rail support
  • Mellanox ConnectX InfiniBand support
  • DTrace provider support on Solaris
  • Enhanced performance and scalability, including processor affinity support
  • Support for InfiniBand, GbE, 10GbE, and Myrinet interconnects
  • Plug-ins for Sun Grid Engine (SGE) and Portable Batch System (PBS)
  • Full MPI-2 standard compliance, including MPI I/O and one-sided communication (see the short sketch after this list)
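Since one-sided communication may be the least familiar item on that list, here is a minimal sketch (my own illustration, with arbitrary values) of the MPI-2 put/fence style: rank 0 writes directly into a window of memory exposed by every other rank, with no matching receives posted.

```c
/* Minimal MPI-2 one-sided sketch: rank 0 deposits a value into a
   memory window exposed by each other rank using MPI_Put; fences
   delimit the access epoch. */
#include <mpi.h>
#include <stdio.h>

#define MAX_RANKS 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = -1;                      /* each rank exposes one int */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open the access epoch */
    if (rank == 0) {
        static int vals[MAX_RANKS];      /* origin buffers must persist until the fence */
        for (int r = 1; r < size && r < MAX_RANKS; r++) {
            vals[r] = 100 + r;
            MPI_Put(&vals[r], 1, MPI_INT,  /* origin buffer */
                    r, 0, 1, MPI_INT,      /* target rank and displacement */
                    win);
        }
    }
    MPI_Win_fence(0, win);               /* close the epoch; puts complete */

    printf("rank %d sees local = %d\n", rank, local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```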

The second event was the release of Sun Studio Express 11/08, which among other enhancements adds complete support for the new OpenMP 3.0 specification, including tasking. If you are questing for ways to extract parallelism from your code to take advantage of multicore processors, you should be looking seriously at OpenMP. And you should do it with the Sun Studio suite, our free compilers and tools, which really kick butt on OpenMP performance. You can download everything--the compilers, the debugger, the performance analyzer (including new MPI performance analysis support), and other tools--for free from here. Solaris 10, OpenSolaris, and Linux (RHEL 5/SuSE 10/Ubuntu 8.04/CentOS 5.1) are all supported. That includes an extremely high-quality (and free) Fortran compiler, among other goodies. (Is it sad that we HPC types still get a little giddy about Fortran? What can I say...)

The new capabilities in this Express release are too numerous to list here, so check out this feature list or visit the wiki.


Wednesday Jul 30, 2008

Fresh Bits: Attention all OpenMP and MPI Programmers!

The latest preview release of Sun's compiler and tools suite for C, C++, and Fortran users is now available for free download. Called Sun Studio Express 07/08, this release of Sun Studio marks an important advance for HPC customers and for any customer interested in extracting high performance from today's multi-threaded and multi-core processors. In addition to numerous compiler performance enhancements, the release includes beta-level support for the latest OpenMP standard, OpenMP 3.0. It also includes some nice Performance Analyzer enhancements that support simple and intuitive performance analysis of MPI jobs. More detail on both of these below.

As the industry-standard approach for achieving parallel performance on multi-CPU systems, OpenMP has long been a mainstay of the HPC developer community. Version 3.0, which is supported in this new Sun Studio preview release, is a major enhancement to the standard. Most notably it includes support for tasking, a major new feature that can help programmers achieve better performance and scalability with less effort than previous approaches using nested parallelism. There are a host of other enhancements as well. The OpenMP expert will find the latest specification useful. For those new to parallelism who have stumbled into a maze of twisty passages all alike, you may find Using OpenMP: Portable Shared Memory Parallel Programming to be a useful introduction to parallelism and OpenMP.


A parallel quicksort example, written using the new OpenMP tasking feature supported in Sun Studio Express 07/08
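For the curious, a tasking-based quicksort along those lines might look like the sketch below. This is my reconstruction of the pattern, not the code from the original example; the cutoff and data set are arbitrary.

```c
/* A minimal sketch of a parallel quicksort using OpenMP 3.0 tasks:
   each recursive half becomes a task, with an if clause falling back
   to immediate execution for small ranges to limit task overhead. */
#include <stdio.h>
#include <omp.h>

#define CUTOFF 1000

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

static int partition(int *a, int lo, int hi)
{
    int pivot = a[hi], i = lo - 1;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap(&a[++i], &a[j]);
    swap(&a[i + 1], &a[hi]);
    return i + 1;
}

static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi)
        return;
    int p = partition(a, lo, hi);
    #pragma omp task shared(a) if (p - 1 - lo > CUTOFF)
    quicksort(a, lo, p - 1);
    #pragma omp task shared(a) if (hi - p - 1 > CUTOFF)
    quicksort(a, p + 1, hi);
    #pragma omp taskwait   /* wait for both halves before returning */
}

int main(void)
{
    enum { N = 1000000 };
    static int a[N];
    for (unsigned i = 0; i < N; i++)
        a[i] = (int)((i * 1103515245u + 12345u) & 0x7fffffffu);  /* cheap pseudo-random fill */

    #pragma omp parallel
    #pragma omp single   /* one thread seeds the task tree; the team executes it */
    quicksort(a, 0, N - 1);

    printf("a[0]=%d a[N-1]=%d\n", a[0], a[N - 1]);
    return 0;
}
```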

Sun Studio Express 07/08 also includes enhancements for programmers of parallel, distributed applications who use MPI. With this release of Sun Studio Express we have introduced tighter integration with Sun's MPI library (Sun HPC ClusterTools). Sun's Performance Analyzer has been enhanced to examine the performance of MPI jobs by viewing information related to message transfers and messaging performance using a variety of visualization methods. This extends Analyzer's already-sophisticated on-node performance analysis capabilities. The screenshots below give some idea of the types of information that can be viewed. Note in particular the idea of viewing "MPI states" (e.g. MPI Wait and MPI Work) to get a high-level view of the MPI portion of an application: understanding how much time is spent doing actual work versus sitting in a wait state can yield useful insights into the performance of these parallel, distributed codes.

A source code viewer window augmented with several MPI-specific capabilities, one of which is illustrated here: the ability to quickly see how much work (or waiting) is performed within a function.
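The Analyzer derives that work/wait split automatically, but the underlying idea can be mimicked by hand. The sketch below (my illustration, with a deliberately skewed stand-in workload) times how long each rank spends computing versus waiting at a synchronizing MPI call, which is essentially the MPI Work versus MPI Wait picture described above.

```c
/* Hand-rolled "MPI Work vs. MPI Wait" measurement: time the compute
   phase and the synchronizing MPI call separately on each rank. The
   per-rank sleep skews the work so the imbalance is visible. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_work = 0.0, t_wait = 0.0;
    for (int step = 0; step < 10; step++) {
        double t0 = MPI_Wtime();
        usleep(10000 * (rank + 1));      /* skewed stand-in for real work */
        double t1 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);     /* fast ranks sit and wait here */
        double t2 = MPI_Wtime();
        t_work += t1 - t0;
        t_wait += t2 - t1;
    }
    printf("rank %d: work %.3fs, wait %.3fs\n", rank, t_work, t_wait);
    MPI_Finalize();
    return 0;
}
```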

In addition to supporting direct viewing of specific MPI performance issues within an application, Analyzer now also supports a range of visualization tools useful for understanding the messaging portion of an MPI code. Zoomable timelines with MPI events are supported, as is the ability to map various metrics to the X and Y axes of a plotting area to display interesting characteristics of the MPI run, as shown below.

Just one example of Sun Studio's new MPI charting capabilities. Shown here is a display showing the volume of messages transferred between communicating pairs of MPI processes during an application run.

This blog entry has barely scratched the surface of the new OpenMP and MPI capabilities available in this release. If you are a Solaris or Linux HPC programmer, please take these new capabilities for a test drive and let us know what you think. I know the engineering teams are excited by what they've accomplished and I hope you will share their enthusiasm once you've tried these new capabilities.

Sun Studio Express 07/08 is available for Solaris 9 & 10, OpenSolaris 2008.05, and Linux (SLES 9, RHEL 4) and can be downloaded here.


Monday Jun 23, 2008

ClusterTools 8: Early Access 2 Now Available

The latest early access version of Sun HPC ClusterTools -- Sun's MPI library -- has just been made available for download here. As an active member of the Open MPI community, we continue to build our MPI offering on the Open MPI code base, making pre-compiled libraries freely available and offering a paid support option for interested customers. Wondering why we would base our MPI implementation on Open MPI? Read this.

What is particularly cool about CT 8 is that in addition to supporting Solaris, we've added Sun support for Linux (RHEL 4 & 5 and SLES 9 & 10), including use of both the Sun Studio compilers and tools and the GNU toolchain. We've also included a DTrace provider for enhanced MPI observability under Solaris, as well as additional performance analysis capabilities and a number of other enhancements, all detailed on the Early Access webpage.


Open MPI on the Biggest Supercomputer in the World

Los Alamos National Laboratory and IBM recently announced they had broken the PetaFLOP barrier with a LINPACK run on the Roadrunner supercomputer. The Open MPI community, including Sun Microsystems, was proud to have played a role in this HPC milestone. As described by Brad Benton, member of the Roadrunner team, the 1.026 PetaFLOP/s LINPACK run was achieved using an early, unmodified snapshot of Open MPI v1.3 as the messaging layer that tied together Roadrunner's 3000+ AMD-powered nodes. For more details on specific MPI tunables used, read this subsequent message from Brad and this follow-up message from Jeff Squyres, Open MPI contributor from Cisco.

About two years ago, we decided to change Sun's MPI strategy from one of continuing to develop our own proprietary implementation of MPI to instead joining a community-based effort to create a scalable, high-performance, and portable implementation of MPI. We joined the Open MPI community because we felt (and still feel) strongly that combining forces with other vendors and other organizations is the most effective path to creating the middleware infrastructure needed to support the needs of the HPC community into the future.

Sun was the second commercial member to join the Open MPI effort, which at the time consisted of a small handful of research and academic organizations. Two years later, the community looks like this:

[Image: the Open MPI community's current mix of academic, research, and commercial members]
This mix of academic/research members and commercial members brings together into one community a focus on quality, stability and customer requirements on the one hand, with a passion for research and innovation on the other. Of course, it does also create some challenges as the community works to achieve an appropriate balance between these sometimes opposing forces, but the results to date have been impressive, as witnessed by the use of Open MPI to set a new LINPACK world record on the biggest supercomputer in the world.

Tuesday Jan 08, 2008

MPI Library Updated: Sun ClusterTools 7.1 Released

The latest version of Sun's MPI library for Solaris x86 and Solaris SPARC is now available for free download on the ClusterTools 7.1 download area. Our MPI library is based on Open MPI, an open source MPI effort to which Sun contributes actively as a corporate member.

This new release adds Intel support, improved parallel debugger support, PBS Pro validation, improved memory usage for communication operations, and other bug fixes. Sun Studio 12, the latest version of Sun's high performance compiler and tools suite, is also supported.

ClusterTools 7.1 is based on Open MPI 1.2.4.



Tuesday Oct 09, 2007

CMT for HPC: Sun Launches UltraSPARC T2 Servers


[Image: UltraSPARC T2 chip]

Today we announced our first servers based on the UltraSPARC T2 (Niagara2) processor. They are officially named the Sun SPARC Enterprise T5120, the Sun SPARC Enterprise T5220, and the Sun Blade T6320. For those who enjoy code names, the rack servers are known internally as "Huron," following in the Great Lakes theme from our UltraSPARC T1-based systems. The blade is called "Glendale." For detailed specifications on these new machines, start here. UltraSHORT summary: 64 threads, eight floating point units, on-chip 10GbE, low power, 1RU or 2RU or blade form factors. And looking interesting for some HPC workloads.

The UltraSPARC T2 is Sun's second generation CMT (chip multithreaded) processor. The first-generation UltraSPARC T1, which has 32 threads and only one floating point unit, performs well on many throughput-oriented tasks, but isn't suitable as a general-purpose processor for High Performance Computing. Some HPC areas like life sciences and some parts of the intelligence community have integer-intensive workloads and can use the UltraSPARC T1 to advantage. For example, see the numerous entries on Lawrence Spracklen's blog.

So, what can we say about the UltraSPARC T2 and its platforms relative to HPC?

As usual, application performance will depend greatly on the specifics of your application, but having seen the results of several benchmarks on the UltraSPARC T2, I can make some observations. First, remember the primary value proposition of these CMT systems is throughput, and not single-thread performance. We use relatively low-performing cores, but give you eight of them on a single chip, each with multiple threads. Therefore your application or workload must benefit from lots of threads and from the CMT's ability to hide memory latency by performing real work while waiting for memory operations to complete.

I'll leave it to the benchmarking folks to give you the official story on exact results and instead make some general observations. First, these new systems generate leading performance numbers on a popular floating-point rate (i.e. throughput) benchmark. However, to achieve those numbers we obviously must run enough instances of the benchmark to make use of all of our threads, which increases the memory footprint and therefore the cost of the system. How much that matters to you in real life depends on how your application's memory footprint scales in practice.

Consider, for example, an OpenMP application. Using OpenMP to parallelize an application leaves the memory footprint essentially unchanged and instead varies the number of threads used within the application. As you'd expect, the thread-rich T2-based systems deliver some very interesting OpenMP benchmark results.
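To make that concrete, here is a minimal sketch (my illustration, not a benchmark): a single process allocates one large array, and only the thread count changes with OMP_NUM_THREADS. Contrast this with a rate-style throughput run, where each of 64 benchmark copies carries its own full copy of the data.

```c
/* One process, one copy of the data, many threads: the OpenMP model
   that suits a thread-rich CMT chip. The footprint is the same whether
   OMP_NUM_THREADS is 1 or 64; the array size is arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t n = 50000000;                   /* a single shared ~400 MB array */
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return 1;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {   /* all threads share one array */
        a[i] = (double)i;
        sum += a[i];
    }

    printf("max threads = %d, sum = %e\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}
```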

Beyond performance issues, let's not lose sight of the fact that these tiny boxes have 64 hardware threads (eight FPUs), making them interesting platforms for HPC developers working on parallel algorithms, possibly even for MPI developers wanting to debug their distributed applications on a single machine. And, of course, you should expect to be able to cluster these machines for building larger HPC systems using either the on-board 10GbE or InfiniBand.

For other Sun blogger perspectives on these new systems, start with Allan Packer's cross-reference entry.

