Monday Feb 01, 2010

DTrace Deep Dive in Boston this Week!

Jim Mauro will be doing a two-hour deep dive on DTrace at this week's NEOSUG (New England OpenSolaris Users Group) meeting. And Shannon Sylvia from Northeastern University will give a talk on using LDOMs and ZFS. The NEOSUG meeting will be held in two locations with the same agenda -- pick the date and location that works best for you. And please do RSVP so we have a rough head count. See below for details.

Where and When:

  • Tues Feb 2nd, 6-9pm, Sun Microsystems Burlington Campus, One Network Drive, Burlington, MA
  • Wed Feb 3rd, 6-9pm, Boston University, Electrical and Computer Engineering Department Photonics Center -- Room PHO 339, 8 Saint Mary's Street, Boston, MA 02215

Registration Required: RSVP to Linda Wendlandt: lwendlandt at

Agenda:

    6:00-6:20: Registration, Pizza and Beverages

    6:20-6:30: Introductions: Peter Galvin, CTO, Corporate Technologies

    6:30-8:30: Solaris Dynamic Tracing - DTrace – Jim Mauro, Principal Engineer, Sun Microsystems

    8:30-9:00: LDOM Domains and ZFS: An example of creating a ZFS bootable root LDOM domain using jumpstart - Shannon Sylvia, Sysadmin, Northeastern University

    9:00 Q&A and Discussion

Also we’ll be giving out official NEOSUG T-Shirts and other trinkets, and copies of the OpenSolaris CD and instruction manual.

The Talks:

Solaris Dynamic Tracing – DTrace

DTrace is a revolutionary observability tool introduced in Solaris 10, and currently available in all Solaris 10 releases, OpenSolaris, Mac OS X 10.5 and FreeBSD 7.2. DTrace provides unprecedented observability of the kernel and the entire application software stack without requiring code modifications. It is completely dynamic, and introduces zero probe effect when no DTrace probes are enabled.

This talk will introduce the basic components of DTrace: providers, probes, predicates, the D language, actions and subroutines, and DTrace variables. We will then dive into examples of DTrace one-liners and scripts that demonstrate the use of DTrace for understanding and root-causing system and application performance issues.
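To give a flavor of the sort of one-liners the talk will cover, here is a hedged sketch (requires root or DTrace privileges on a Solaris 10, OpenSolaris, Mac OS X, or FreeBSD system; the PID is hypothetical):

```shell
# Count system calls by process name -- the classic first DTrace one-liner.
# The aggregation (@) is printed automatically when you hit Ctrl-C.
dtrace -n 'syscall:::entry { @[execname] = count(); }'

# Which files is process 1234 (hypothetical PID) reading? The predicate
# between slashes restricts the probe firings to that one process.
dtrace -n 'syscall::read:entry /pid == 1234/ { @[fds[arg0].fi_pathname] = count(); }'
```

Note how the pieces named above line up: `syscall:::entry` names the provider and probe, `/pid == 1234/` is a predicate, and `@[...] = count()` is an action using an aggregation variable.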

Jim Mauro is a Principal Engineer in Sun Microsystems' Systems Group, where he focuses on the performance of volume commercial workloads on Sun technology. Jim co-authored Solaris Internals (1st Ed), Solaris Internals (2nd Ed), and Solaris Performance and Tools (1st Ed), and is currently working on a DTrace book.

LDOM Domains and ZFS: An example of creating a ZFS bootable root LDOM domain using jumpstart

Using the Solaris 10 10/09 release on a SPARC T5120 with LDoms 1.2, Shannon Sylvia creates guest domains that are independent of one another. Each guest domain contains its own separately configured operating system and its own virtual disks. Using a “cookbook” approach, new guest domains can easily be added and configured, or removed, without affecting the control domain or any of the other guest domains. Each domain is created using ZFS as the root, bootable volume. Shannon will provide examples of how the control domain, the jumpstart/boot server, and the guest domains should be configured.
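The cookbook steps look roughly like the following hedged sketch (all domain, service, and dataset names are hypothetical; consult the LDoms 1.2 administration guide for the real recipe):

```shell
# Run on the control domain as root. Create a guest and give it resources:
ldm add-domain ldom1
ldm add-vcpu 8 ldom1
ldm add-memory 4G ldom1
ldm add-vnet vnet1 primary-vsw0 ldom1      # virtual NIC for jumpstart boot

# Back the guest's boot disk with a ZFS volume on the control domain:
zfs create -V 20g rpool/ldom1disk
ldm add-vdsdev /dev/zvol/dsk/rpool/ldom1disk vol1@primary-vds0
ldm add-vdisk vdisk1 vol1@primary-vds0 ldom1

# Bind and start; the guest can then be jumpstarted over vnet1 and
# installed with a ZFS root on its virtual disk:
ldm bind-domain ldom1
ldm start-domain ldom1
```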

Shannon Sylvia has 15+ years of experience as a Unix Systems Administrator. She is responsible for installing and maintaining Solaris, AIX, and Linux at Northeastern University. In addition, she is an adjunct professor at Northeastern University's College of Professional Studies. She has a strong interest in IT in the health field, and has recently completed 2 1/2 years of nursing school and clinicals. She is currently involved in volunteer work including Salesforce and website development. She earned a bachelor's degree in Computer Science from National University, a bachelor's degree in English from San Diego State University, and a Master's Degree in Computer Information Systems from Boston University.

Wednesday Nov 11, 2009

NEOSUG at Boston University TONIGHT!

The New England OpenSolaris User Group is holding its first meeting at Boston University this evening, hosted by the BU Department of Electrical & Computer Engineering. It is open to anyone interested in learning more about OpenSolaris -- both students and professionals are welcome. This first meeting features three talks: What's So Cool About OpenSolaris Anyway, OpenSolaris: Clusters and Clouds from your Laptop, and OpenSolaris as a Research and Teaching Tool.

The meeting runs from 6-9pm tonight (Wed, Nov 11th, 2009) at the BU Photonics Center Building. Follow this link for directions, full agenda details, etc. If you think you'll be coming, please RSVP so we have a rough headcount for food.

See you there -- I'm bringing the pizza!

Monday Jul 13, 2009

Performance Facts, Performance Wisdom

I was genuinely excited to see that members of Sun's Strategic Applications Engineering team have started a group blog about performance called BestPerf. These folks are the real deal -- they are responsible for generating all of the official benchmark results published by Sun -- and they collectively have a deep background in all things related to performance. I like the blog because while they do cover specific benchmark results in detail, they also share best practices and include broader discussions about achieving high performance as well. There is a lot of useful material for anyone seeking a better understanding of performance.

Here are some recent entries that caught my eye.

Find out how a Sun Constellation system running SLES 10 beat IBM BlueGene/L on a NAMD Molecular Dynamics benchmark here.

See how the Solaris DTrace facility can be used to perform detailed IO analyses here.

Detailed Fluent cluster benchmark results using the Sun Fire x2270 and SLES 10? Go here.

How to use Solaris containers, processor sets, and scheduling classes to improve application performance? Go here.

Thursday Dec 18, 2008

Fresh Bits: InfiniBand Updates for Solaris 10

Fresh InfiniBand bits for Solaris 10 Update 6 have just been announced by the IB Engineering Team:

The Sun InfiniBand Team is pleased to announce the availability of the Solaris InfiniBand Updates 2.1. This comprises updates to the previously available Solaris InfiniBand Updates 2. InfiniBand Updates 2 has been removed from the current download pages. (Previous versions of InfiniBand Updates need to be carefully matched to the OS Update versions that they apply to.)

The primary deliverable of Solaris InfiniBand Updates 2.1 is a set of updates of the Solaris driver supporting HCAs based on Mellanox's 4th generation silicon, ConnectX. These updates include the fixes that have been added to the driver since its original delivery, and functionality in this driver is equivalent to what was delivered as part of OpenSolaris 2008.11. In addition, there continues to be a cxflash utility that allows Solaris users to update firmware on the ConnectX HCAs. This utility is only to be used for ConnectX HCAs.

Other updates include:

  • uDAPL InfiniBand service provider library for Solaris (compatible with Sun HPC ClusterTools MPI)
  • Tavor and Arbel/memfree drivers that are compatible with new interfaces in the uDAPL library
  • Documentation (README and man pages)
  • A renamed flash utility for Tavor-, Arbel memfull-, Arbel memfree-, and Sinai-based HCAs. Instead of "fwflash", this utility is renamed "ihflash" to avoid possible namespace conflicts with a general firmware flashing utility in Solaris

All are compatible with Solaris 10 10/08 (Solaris 10, Update 6), for both SPARC and X86.

You can download the package from the "Sun Downloads" A-Z page by scrolling down or searching for the link for "Solaris InfiniBand (IB) Updates 2.1", or alternatively use this link.

Please read the README before installing the updates. It contains installation instructions and other information you will need to know before running this product.

Please note again that this Update package is for use on Solaris 10 10/08 (Solaris 10, Update 6) only. A version of the Hermon driver has also been integrated into Update 7 and will be available with that Update's release.

Congratulations to the Solaris IB Hermon project team and the extended IB team for their efforts in making this product available!

Wednesday Dec 10, 2008

A Quantum of Solaris

We emitted our latest wad of Solaris goodness today with the official release of OpenSolaris 2008.11. Lest you think engineering used a partially undenary nomenclature for the release name, rest assured the bits were in fact done and ready to go in November. The official announcement was delayed slightly due to other proximate product announcements.

I've been running 2008.11 for several weeks, having taken part in the internal testing cycles at Sun. I found and reported several mostly minor problems, but have generally found the 2008.11 experience to be quite good. The Live CD boot and install to disk all worked smoothly within VirtualBox, our free desktop virtualization product, on my MacBook Pro. With VirtualBox extensions installed, I can use 2008.11 in fullscreen mode and with mouse integration enabled.

While my primary interest in OpenSolaris is as a substrate on which we are building a full, integrated HPC software stack, I can't help but note a few generally cool things about this release.

First is Time Slider. Yes, okay, Apple did it first with Time Machine. But try THIS with Time Machine: I turned on Time Slider and then immediately deleted a file from my Desktop without first doing any kind of backup. I then recovered the file using the TS slider on a File Browser window. This works because Time Slider is built on top of ZFS, which uses copy-on-write for safety, and that same copy-on-write design yields an immediate snapshot facility. I was able to recover my file because when it was deleted (that is, when the metadata representing the directory containing the file was changed), the metadata was copied, modified, and then written. But with the snapshots enabled by Time Slider, the old metadata is retained as well, making it possible to slide back in time and recover deleted or altered files by revisiting the state of the file system at an earlier point. Nifty.
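The mechanics underneath the slider are plain ZFS, and you can drive them by hand. A hedged sketch (dataset and file names hypothetical; assumes your home directory is its own ZFS dataset, as the 2008.11 installer sets up):

```shell
# Snapshots are immediate and nearly free -- only the old metadata
# and any subsequently overwritten blocks are retained:
zfs snapshot rpool/export/home/josh@before-cleanup

# ... accidentally delete a file ...
rm ~/Desktop/important.txt

# The snapshot still references the old directory metadata and data
# blocks, so the file can be copied back out of the hidden .zfs
# directory at the dataset's mountpoint:
cp ~/.zfs/snapshot/before-cleanup/Desktop/important.txt ~/Desktop/
```

Time Slider simply automates the snapshot schedule and puts a GUI on the recovery step.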

My second pick is perhaps somewhat esoteric, but I thought it was cool: managing boot environments with OpenSolaris. I think much of this was available in 2008.05, but it is new to me, so I've included it. In any case, managing multiple boot environments has been completely demystified as you can see in this article. Yet another admin burden removed through use of ZFS. For full documentation on boot environments, go here.
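For the flavor of it, a hedged sketch of the boot environment workflow with the beadm tool (BE names hypothetical):

```shell
beadm list                # show existing boot environments and which is active
beadm create testbe       # clone the active BE -- a ZFS clone, so nearly instant
beadm activate testbe     # make it the default for the next reboot

# If the new environment misbehaves after an upgrade or experiment,
# just activate the old one and reboot back into it:
beadm activate opensolaris
```

Because each BE is a ZFS clone sharing unchanged blocks with its origin, keeping several of them around costs very little space.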

We've also made significant progress supporting Suspend/Resume, which is frankly an absolute requirement for any bare-metal OS one might run on a laptop. For me it isn't so important because I run OpenSolaris as a guest OS in VirtualBox. For those doing bare metal installations, this page details the requirements and limitations of the current Suspend/Resume support in 2008.11.

Putting my HPC hat back on for this last item, I note that a prototype release of the Automated Installer (AI) Project has been included in 2008.11. AI is basically the Jumpstart replacement for OpenSolaris--the mechanism that will be used to install OpenSolaris onto servers, including large numbers of servers, hence my interest from an HPC perspective. For more information on AI, check out the design documents or, better, install the SUNWinstalladm-tools package using the Package Manager and then read the installadm man page. Full installation details are here. AI is still a work in progress, so feel free to pitch in if this area interests you: all of the action happens on the Caiman mailing list, which you can subscribe to here.
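Getting a first AI service up looks roughly like this hedged sketch (service name and paths are hypothetical, and since AI is still a prototype the exact syntax may shift between releases -- check the installadm man page on your system):

```shell
# Install the AI tools, then create an install service from the AI ISO.
# Clients that PXE-boot against this service get installed automatically.
pkg install SUNWinstalladm-tools
installadm create-service -n ai-svc \
    -s /export/ai/osol-0811-ai-x86.iso /export/aiserver/ai-svc
```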

Sunday Nov 16, 2008

Using SPARC and Solaris for HPC: More of this, please!

Ken Edgecombe – Executive Director of HPCVL spoke today at the HPC Consortium Meeting in Austin about experiences with SPARC and HPC at his facility.

HPCVL has a massive amount of Sun gear, the newest of which includes a cluster of eight Sun SPARC Enterprise M9000 nodes, our largest SMP systems. Each node has 64 quad-core, dual-threaded SPARC64 processors and includes 2TB of RAM. With a total of 512 threads per node, the cluster has a peak performance of 20.5 TFLOPs. As you'd expect, these systems offer excellent performance for problems with large memory footprints or for those requiring extremely high bandwidths and low latencies between communicating processors.

In addition to their M9000 cluster, HPCVL has another new resource that consists of 78 Sun SPARC Enterprise T5140 (Maramba) nodes, each with two eight-core Niagara2+ processors (a.k.a. UltraSPARC T2plus). With eight threads per core, these systems make almost 10,000 hardware threads available to users at HPCVL.

Ken described some of the challenges of deploying the T5140 nodes in his HPC environment. The biggest issue is that researchers invariably first try running a serial job on these systems and then report they are very disappointed with the resulting performance. No surprise since these systems run at less than 1.5 GHz as compared to competing processors that run at over twice that rate. As Ken emphasized several times, the key educational issue is to re-orient users to thinking less about single-threaded performance and more about "getting more work done." In other words, throughput computing. For jobs that can scale to take advantage of more threads, excellent overall performance can be achieved by consuming more (slower) threads to complete the job in a competitive time. This works if one can either extract more parallelism from a single application, or run multiple instances of applications to make efficient use of the threads within these CMT systems. With 256 threads per node, there is a lot of parallelism available for getting work done.

As he closed, Ken reminded attendees of the 2009 High Performance Computing Symposium which will be held June 14-17 in Kingston, Ontario at HPCVL.

Thursday Nov 13, 2008

Big News for HPC Developers: More Free Stuff

'Tis the Season. Supercomputing season, that is. Every November the HPC community--users, researchers, and vendors--attend the world's biggest conference on HPC: Supercomputing. This year SC08 is being held in Austin, Texas, to which I'll be flying in a few short hours.

As part of the seasonal rituals vendors often announce new products, showcase new technologies and generally strut their stuff at the show and even before the show in some cases. Sun is no exception as you will see if you visit our booth at the show and if you take note of two announcements we made today that should be seen as a Big Deal to HPC developers. The first concerns MPI and the second our Sun Studio developer tools.

The first announcement extends Sun's support of Open MPI to Linux with the release of ClusterTools 8.1. This is huge news for anyone looking for a pre-built and extensively tested version of Open MPI for RHEL 4 or 5, SLES 9 or 10, OpenSolaris, or Solaris 10. Support contracts are available for a fee if you need one, but you can download the CT 8.1 bits here for free and use them to your heart's content, no strings attached.

Here are some of the major features supported in ClusterTools 8.1:

  • Support for Linux (RHEL 4&5, SLES 9&10), Solaris 10, OpenSolaris
  • Support for Sun Studio compilers on Solaris and Linux, plus the GNU/gcc toolchain on Linux
  • MPI profiling support with Sun Studio Analyzer (see SSX 11.2008), plus support for VampirTrace and MPI PERUSE
  • InfiniBand multi-rail support
  • Mellanox ConnectX InfiniBand support
  • DTrace provider support on Solaris
  • Enhanced performance and scalability, including processor affinity support
  • Support for InfiniBand, GbE, 10GbE, and Myrinet interconnects
  • Plug-ins for Sun Grid Engine (SGE) and Portable Batch System (PBS)
  • Full MPI-2 standard compliance, including MPI I/O and one-sided communication
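As a quick hedged sketch of what using the ClusterTools bits looks like (host names, file names, and the SGE parallel environment name are all hypothetical):

```shell
# Compile and launch a simple MPI job across two nodes:
mpicc -o ring ring.c
mpirun -np 16 -host node1,node2 ./ring

# With the Sun Grid Engine plug-in, mpirun discovers the slots that
# SGE allocated, so a batch submission is just:
qsub -pe orte 16 myjob.sh
```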

The second event was the release of Sun Studio Express 11/08, which among other enhancements adds complete support for the new OpenMP 3.0 specification, including tasking. If you are questing for ways to extract parallelism from your code to take advantage of multicore processors, you should be looking seriously at OpenMP. And you should do it with the Sun Studio suite, our free compilers and tools which really kick butt on OpenMP performance. You can download everything--the compilers, the debugger, the performance analyzer (including new MPI performance analysis support) and other tools for free from here. Solaris 10, OpenSolaris, and Linux (RHEL 5/SuSE 10/Ubuntu 8.04/CentOS 5.1) are all supported. That includes an extremely high-quality (and free) Fortran compiler among other goodies. (Is it sad that us HPC types still get a little giddy about Fortran? What can I say...)
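If you want to kick the tires, a minimal hedged sketch of the build-and-run cycle (source file name hypothetical; -xopenmp is the Studio flag that turns on OpenMP recognition):

```shell
# Compile an OpenMP 3.0 program (tasking and all) with Sun Studio's cc,
# then run it with eight threads:
cc -fast -xopenmp -o tasks tasks.c
OMP_NUM_THREADS=8 ./tasks
```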

The full list of capabilities in this Express release is too long to include here, so check out this feature list or visit the wiki.

Thursday Oct 09, 2008

LISA '08: What Every Admin Needs to Know About Solaris

Admins, fasten your seatbelts: The 22nd Large Installation System Administration (LISA '08) Conference promises to be as jammed with useful and interesting technical content as ever and at least as much fun. Come to San Diego from Nov 9-14 to find out!

For those of you looking to dig deeper into Solaris or for those looking to understand what the fuss is all about, there is a ton of Solaris and OpenSolaris related content scheduled at LISA thanks to a lot of hard work by people both inside and outside of Sun. Here are some of the highlights.

Jim Mauro is doing a full-day POD training session. That's Performance, Observability, and Debugging. If you only make it to one Solaris session, pick this one. Jim is a very knowledgeable and engaging speaker and the material is excellent. I enjoyed Jim's presentation of a much compressed version of this at a recent NEOSUG meeting--it was excellent. You will definitely emerge 1) exhausted, and 2) with a much better understanding of how to use a variety of Solaris tools to solve performance problems and to better understand your systems' workloads. Jim will lead you on a foray into the depths of the various Solaris tools that let you look at all aspects of system performance, including DTrace. Whether you are a seasoned UNIX admin, but new to Solaris, or just wondering what all the DTrace fuss is about, you will find this taste-o-DTrace pretty exciting. And if you really want to know a lot more about DTrace, Jim is also doing an all-day DTrace training session at the conference.

Peter Galvin, long-time Solaris expert and trainer and also chair of NEOSUG, and Marc Staveley will be giving a two-day Solaris workshop that has been broken into four half-day sessions. The sessions are Administration, Virtualization, File Systems, and Security. These are all hands-on sessions so Peter and Marc recommend you bring a laptop. Solaris installation not required--the instructors will supply a Solaris machine for remote access.

For something higher level and more strategic, Jim Hughes (Chief Technologist for Solaris) will give an invited talk on OpenSolaris and the Direction of Future Operating Systems. And Janice Gelb will also deliver an invited talk provocatively titled, WTFM: Documentation and the System Administrator.

There will be two Solaris-focused Guru sessions at LISA as well. Scott Davenport and Louis Tsien will cover Solaris Fault Management, while Richard Elling will speak about ZFS. These both promise to be interesting sessions with technical people who really know their stuff.

Solaris Containers are an innovative virtualization technology that is built right into Solaris, and Jeff Victor will be leading a full-day workshop to take attendees on a detailed tour of this capability. Check out Resource Management with Solaris Containers.

There will also be a full-day deep dive workshop on ZFS offered by Richard Elling. Many people have heard about this new file system, but you won't really understand exactly why it is getting so much attention until you experience how it changes the administrative experience around file systems.

Sun will also be hosting a vendor BOF to talk about BigAdmin, the mega-hub for metric tons of useful and very detailed information for administrators. If you aren't familiar with BigAdmin, check out the BOF or at the very least pop over to the website for a peek. Cool stuff.

Sun will also have a booth in the exhibit area. Booth 52, I believe. Stop by for some good conversation and maybe some giveaways.

Saturday Apr 26, 2008

HPC User Forum: Operating System Panel

As I mentioned in an earlier entry, I participated in the HPC interconnect panel discussion at IDC's HPC User Forum meeting in Norfolk, Virginia last week. I also sat on the Operating System panel, the subject of this blog entry.

Because the organizers opted against panel member presentations, the following slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.

[os panel slide 1]

My fellow panelists during this session were Kenneth Rozendal from IBM, John Hesterberg from SGI, Benoit Marchand from eXludus, Ron Brightwell from Sandia National Laboratory, John Vert from Microsoft, Ramesh Joginpaulli from AMD, and Richard Walsh from IDC.

The framing topic areas suggested by Richard Walsh prior to the conference were used to guide the discussion:

  • Ensuring scheduling efficiency on fat nodes
  • Managing cache and bandwidth resources on multi-core chips
  • Linux and Windows: strengths, weaknesses, and alternatives
  • OS scalability and resiliency requirements of petascale systems

As you'll see, we covered a wider array of topics during the course of the panel session.

[os panel slide 2]

Beowulf clusters have been popular with the HPC community since about 1998. The idea arose in part as a reaction against expensive, "fat" SMP systems and proprietary, expensive software. Typical Beowulf clusters were built of "thin" nodes (typically one or two single-CPU sockets), commodity ethernet, and an open source software stack customized for HPC environments.

With multi-core and multi-threaded processors now becoming the norm, nodes are maintaining their svelte one or two rack unit form factors while becoming much beefier internally. As an extreme example, consider Sun's new SPARC Enterprise T5140 server, which crams 128 hardware threads, 16 FPUs, 64 GB of memory, and close to 600 GBs of storage into a single rack unit (1.75") form factor. Or the two rack-unit version (the T5240) that doubles the memory to 128 GB and supports up to almost 2.4 TB of local disk storage in the chassis. I call nodes like these Sparta nodes because they are slim and trim...and very powerful. Intel and AMD's embracing of multicore ensures that future systems will generally become more Spartan over time.

Clusters need an interconnect. While traditional Beowulf clusters have used commodity Ethernet, they have often done so at the expense of performance for distributed applications that have significant bandwidth and/or latency requirements. As was discussed in the interconnect panel session at the HPC User Forum, InfiniBand (IB) is now making significant inroads into HPC at attractive price points and will continue to do so. Ethernet will also continue to play a role, but commodity 1 GbE is not at all in the same league as IB with respect to either bandwidth or latency. And InfiniBand currently enjoys a significant price advantage over 10 GbE, which does offer (at least currently) comparable bandwidths to IB, though without a latency solution. The use of IB in Sparta clusters allows the nodes to be more tightly coupled in the sense that a broader range of distributed applications will perform well on these systems due to the increased bandwidth and much lower latencies achievable with InfiniBand OS bypass capabilities.

This trend towards beefier nodes will have profound effects on HPC operating system requirements. Or said in a different way, this trend (and others discussed below) will alter the view of the HPC community towards operating systems. The traditional HPC view of an OS is one of "software that gets in the way of my application." In this new world, while we must still pay attention to OS overhead and deliver good application performance, the role of the OS will expand and deliver significant value for HPC.

[os panel slide 3]

The above is a photo I shot of the T5240 I described earlier. This is the 2RU server that recently set a new two-socket SPEComp record as well as a SPECcpu record. Details on the benchmarks are here. If you'd like a quick walkthrough of this system's physical layout, check out my annotated version of the above photo here.

[os panel slide 4]

The industry shift towards multicore processors has created concern within the HPC community and more broadly as well. There are several challenges to be addressed if the value and power of these processors are to be realized.

The increased number of CPUs and hardware threads within these systems will require that careful attention be paid to operating system scalability to ensure that application performance does not suffer due to inefficiencies in the underlying OS. Vendors like Sun, IBM, and SGI, which have worked on OS scaling issues for many years, have experience in this area, but there will doubtless be continuing scalability and performance challenges as these more closely coupled hardware complexes become available with ever larger memory configurations and ever faster IO subsystems.

There was some disagreement within the panel session over the ramifications to application architectures of these beefier nodes when they are used as part of an HPC cluster. Will users continue to run one MPI process per CPU or thread, or will a fewer number of MPI processes be used per node with each process then consuming additional on-node parallelism via OpenMP or some other threading model? I am of the opinion that the mixed/hybrid style (combined MPI and threads) will be necessary for scaling to very large-size clusters because at some point scaling MPI will become problematic. In addition, regardless of the cluster size under consideration, using MPI within a node is not very efficient. MPI libraries can be optimized in how they use shared memory segments for transferring message data between MPI processes, but any data transfers are much less efficient than using a threading model which takes full advantage of the fact that all of the memory on a node is immediately accessible to all of the threads within one address space.

The tradeoff is that this shift from pure MPI programming to hybrid programming does require application changes and the mixed model can be more difficult since it requires thinking about two levels of parallelism. If this shift to multi-core and multi-threaded processors were not such a fundamental sea change, I would agree that recoding would not be worthwhile. However, I do view this as profound a shift as that which caused the HPC community to move to distributed programming models with PVM and MPI and to recode their applications at that time.
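To make the two styles concrete, here is a hedged sketch of how each might be launched with Open MPI's mpirun (the option names are real Open MPI options; the application names and counts are hypothetical, imagining a small cluster of the 128-thread, two-socket nodes described above):

```shell
# Pure MPI: one single-threaded MPI process per hardware thread.
# Intra-node messages go through shared memory, with copying overhead:
mpirun -np 256 ./app

# Hybrid MPI + OpenMP: one MPI process per socket, with OpenMP threads
# filling the socket's 64 hardware threads and sharing one address space:
mpirun -np 4 -npernode 2 -x OMP_NUM_THREADS=64 ./app_hybrid
```

The second form keeps the MPI process count (and its scaling costs) low while still exploiting all of the on-node parallelism.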

Another challenge is that of efficient use of available memory bandwidth, both between sockets and between sockets and memory. As more computational power is crammed into a socket, it becomes more important for 1) processor and system designers to increase available memory bandwidth, 2) operating system designers to provide efficient and effective capabilities that allow applications to make effective use of that bandwidth, and 3) tool vendors to provide visibility into application memory utilization to help application programmers optimize their use of the memory subsystem. In many cases, memory performance will become the gating factor on performance rather than CPU.

As the compute, memory, and IO capacities of these beefier nodes continue to grow, the resiliency of these nodes will become a more important factor. With more applications and more state within a node, downtime will be less acceptable within the HPC community. This will be especially true in commercial HPC environments where many ISV applications are commonly used and where these applications may often be able to run within a single beefy node. In such circumstances, OS capabilities like proactive fault management, which identifies nascent problems and takes action to avoid system interruption, become much more important to HPC customers. An interesting development, since capabilities like fault management have traditionally been developed for the enterprise computing market.

The last item--interconnect pressures--is fairly obvious. As nodes get beefier and perform more work, they put a larger demand on the cluster interconnect, both for compute-related communication and for storage data transfers. InfiniBand, with its aggressive bandwidth roadmap, and an ability to construct multi-rail fabrics, will play an important role in maintaining system balance. Well-crafted low level software (OS, IB stack, MPI) will be needed to handle the larger loads at scale.

[os panel slide 5]

Beyond the challenges of multi-core and multi-threading, there are opportunities. For all but the highest end customers, beefier nodes will allow node counts to grow more slowly, allowing datacenter footprints to grow more slowly, decreasing the rate of complexity growth and scaling issues.

Much more important, however, is that with more hardware resources per node it will now be possible to dedicate some small to moderate amount of processing power to handling OS tasks while minimizing the impact of that processing on application performance. The ability to fence off application workloads from OS processing using mechanisms like processor sets, processor bindings, and the disabling of device interrupt processing on selected CPUs should allow applications to run with reduced jitter while still supporting a full, standard OS environment on each node, without having to resort to microkernel or other approaches to deliver low jitter. Using standard OSes (possibly stripped down by removing or disabling unneeded functions) is very important for several reasons.
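On Solaris those mechanisms are ordinary administrative commands. A hedged sketch (CPU ranges, set id, and application name are hypothetical; run as root):

```shell
# Fence the application onto CPUs 8-127, leaving 0-7 for the OS:
psrset -c 8-127             # create a processor set; prints its id (say, 1)
psradm -i 8-127             # stop directing device interrupts at those CPUs
psrset -e 1 ./my_hpc_app    # run the application inside the set

# Or pin an individual process (pid hypothetical) to a single CPU:
pbind -b 8 12345
```

The OS and its interrupt handling stay on the reserved CPUs, so the application's threads see far fewer involuntary preemptions.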

As mentioned earlier, OS capabilities are becoming more important, not less. I've mentioned fault management and scalability. In a few slides we'll talk about power management, virtualization, and other capabilities that will be needed for successful HPC installations in the future. Attempt to build all of that into a microkernel and you end up with a kernel. You might as well start with all of the learning and innovation that has accrued to standard OSes and minimize and improve where necessary rather than building one-off or few-off custom software environments.

I worry about how well-served the very high end of the HPC market will be in the future. It isn't a large market or one that is growing like other segments of HPC. While it is a segment that solves incredibly important problems, it is also quite a difficult market for vendors to satisfy. The systems are huge, the software scaling issues tremendously difficult, and, frankly, both the volume and the margins are low. This segment has argued for many years that meeting their scaling and other requirements guarantees that a vendor will be well-positioned to satisfy any other HPC customer's requirements. It is essentially the "scale down" argument. But that argument is in jeopardy to the extent that the high-end community embraces a different approach than is needed for the bulk of the HPC market. Commercial HPC customers want and need a full instance of Solaris or Linux on their compute nodes because they have both throughput and capability problems to run and because they run lots of ISV applications. They don't want a microkernel or some other funky software environment.

I absolutely understand and respect the software work being done at our national labs and elsewhere to take advantage of the large-scale systems they have deployed. But this does not stop me from worrying about the ramifications of the high-end delaminating from the rest of the HPC market.

[os panel slide 6]

The HPC community is accustomed to being on the leading/bleeding edge, creating new technologies that eventually flow into the larger IT markets. InfiniBand is one such example that is still in process. Parallel computing techniques may be another as the need for parallelization begins to be felt far beyond HPC due to the tailing off of clock speed increases and the emergence of multi-core CPUs.

Virtualization is an example of the opposite. This is a trend that is taking the enterprise computing markets by storm. The HPC community to date has not been interested in a technology that, in its view, simply adds another layer of "stuff" between the hardware and its applications, reducing performance. I would argue that virtualization is coming and will be ubiquitous, and that the HPC community needs to engage actively to 1) influence virtualization technology to align it with HPC needs, and 2) find ways in which virtualization can be used to advantage within the HPC community rather than simply being victimized by it. It is coming, so let's embrace it.

The two immediate challenges are both performance-related: the base performance of applications in a virtualized environment, and virtualizing the InfiniBand (or Ethernet) interconnect while still being able to deliver high performance on distributed applications, including both compute and storage. The first issue may not be a large one, since most HPC codes are compute-intensive and such code should run at nearly full speed in a virtualized environment. And early research on virtualized IB, for example by DK Panda's Network-based Computing Laboratory at OSU, has shown promising results. In addition, the PCI-SIG I/O Virtualization (IOV) standards will add hardware support for PCI virtualization that should help achieve high performance.

What about the potential benefits of virtualization for HPC? I can think of several possibilities:

  • Coupling live migration [PDF] with fault management to dynamically shift a running guest OS instance off of a failing node, thereby avoiding application interruptions.
  • Using the clean interface between hypervisor and Guest OS instances to perform checkpointing of a guest OS instance (or many instances in the case of an MPI job) rather than attempting to checkpoint individual processes within an OS instance. The HPC community has tried for many years to create the latter capability, but there are always limitations. Perhaps we can do a more complete job by working at this lower level.
  • Virtualization can enable higher utilization of available resources (those beefy nodes again) while maintaining a security and failure barrier between applications and users. This is ideal in academic or other service environments in which multi-tenancy is an issue, for example in cases where academic users and industry partners with privacy or security concerns share the same compute resources.
  • Virtualization can also decrease the burden on system administration staff and allow them to be more responsive to the needs of their user population. For example, a Solaris or Linux-based HPC installation could easily allow virtualized Windows-based ISV applications to be run dynamically on its systems without having to permanently maintain Windows systems in the environment.

The key point is that virtualization is coming, and we as a community should find the best ways of using the technology to our advantage. The above are some ideas; I'd like to hear others.

[os panel slide 7]

The power and cooling challenges are straightforward; the solutions are not. We must deliver effective power management capabilities for compute, storage, and interconnect that support high performance, but also deliver significant improvements over current practice. To do this effectively requires work across the entire hardware and software stack. Processor and system design. Operating system design. System management framework design. And it will require a very comprehensive review of coding practices at all levels of the stack. Polling is not eco-efficient, for example.
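To make the polling point concrete, here is a small hypothetical Python sketch contrasting a busy-poll loop with a blocking event wait; the timings, the wakeups counter, and the worker's payload are illustrative only.

```python
import threading
import time

done = threading.Event()
result = {}

def worker():
    # Simulate work, then publish a result and signal completion.
    time.sleep(0.05)
    result["value"] = 42
    done.set()

threading.Thread(target=worker).start()

# Polling: the waiter wakes up repeatedly whether or not anything has
# happened, keeping the CPU out of its deeper idle states.
wakeups = 0
while not done.is_set():
    wakeups += 1
    time.sleep(0.01)

# Event-driven: a single blocking wait lets the OS idle the CPU until the
# worker signals, which is far friendlier to power management.
done.wait()
print(result["value"], "after", wakeups, "poll wakeups")
```

Multiply the polling pattern across thousands of nodes and daemons and the eco-efficiency cost becomes a system design issue, not a coding nit.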

Power management issues are yet another reason why operating systems become more important to HPC as capabilities developed primarily for the much larger enterprise computing markets gain relevance for HPC customers.

[os panel slide 8]

Here I sketch a little of Sun's approach with respect to OSes for HPC. First, we will offer both Linux and Solaris-based HPC solutions, including a full stack of HPC software on top of the base operating systems. We recognize quite clearly the position currently held by Linux in HPC and see no reason why we should not be a preferred provider of such systems. At the same time, we believe there is a strong value proposition for Solaris in HPC and that we can deliver performance along with an array of increasingly relevant, enterprise-derived capabilities that will benefit the HPC community. We also realize it is incumbent upon us to prove this point to you, and we intend to do so.

I will finish by commenting on one bullet on this final slide. For the other products and technologies, I will defer to future blog posts. The item I want to end with is Project Indiana and OpenSolaris due to its relevance to HPC customers.

In 2005, Sun created an open source and open development effort based on the source code for the Solaris Operating System. Called OpenSolaris, the community now numbers well over 75,000 members and it continues to grow.

Project Indiana is an OpenSolaris project whose goal is to produce OpenSolaris binary distributions that will be made freely available to anyone with optional for-fee support available from Sun. An important part of this project is a modernization effort that moves Solaris to a network-based package management system, updates many of the open-source utilities that are included in the distro, and adds open-source programs and utilities that are commonly expected to be present. To those familiar with Linux, the OpenSolaris user experience should become much more familiar as these changes roll out. In my view, this was a necessary and long-overdue step towards lowering the barrier for Linux (and other) users, enabling them to more easily step into the Solaris environment and benefit from the many innovations we've introduced (see slide for some examples.)

In addition to OpenSolaris binary distros, you will see other derivative distros appearing. In particular, we are working to define an OpenSolaris-based distro that will include a full HPC software stack and will address both developers and deployers. This effort has been running within Sun for a while and will soon transition to an OpenSolaris project so we can more easily solicit community involvement. This Solaris HPC distro is meant to complement similar work being done by a Linux-focused engineering team within our expanded Lustre group, which is also doing its work in the open and also encourages community involvement.

There was some grumbling at the HPC User Forum about the general Linux community and its lack of focus or interest in HPC. While clearly there have been some successes (for example, some of the Linux scaling work done by SGI), there is frustration. One specific example mentioned was the difficulty in getting InfiniBand support into the Linux kernel. My comment on that? We don't need to ask Linus' permission to put HPC-enabling features into Solaris. In fact, with our CEO making it very clear that HPC is one of the top three strategic focus areas for Sun [PDF, 2MB], we welcome HPC community involvement in OpenSolaris. It's free, it's open, and we want Solaris to be your operating system of choice for HPC.

Tuesday Jan 08, 2008

MPI Library Updated: Sun ClusterTools 7.1 Released

The latest version of Sun's MPI library for Solaris x86 and Solaris SPARC is now available for free download on the ClusterTools 7.1 download area. Our MPI library is based on Open MPI, an open source MPI effort to which Sun contributes actively as a corporate member.

This new release adds Intel support, improved parallel debugger support, PBS Pro validation, and improved memory usage for communication operations, along with assorted bug fixes. Sun Studio 12, the latest version of Sun's high-performance compiler and tools suite, is also supported.

ClusterTools 7.1 is based on Open MPI 1.2.4.

Thursday Sep 06, 2007

Too Cool: Solaris 8 running on Solaris 10

I just read Dan Price's blog entry on Project Etude, which shows in detail how to package up a Solaris 8 environment and transfer it to run within a Solaris 8 container on a Solaris 10 system using new technology developed by Dan and his team. If you'd like to move a legacy application forward onto new Sun hardware and start a gentle migration to Solaris 10, consider investigating this technology.

For the Marketing view, which is also worth reading, visit Marc Hamilton's blog.


Josh Simons

