Tuesday Jan 06, 2009

Connecting All the Dots

The last couple of weeks before the holidays I worked on an interesting project. It involved assembling pretty much everything Sun offers for HPC into a single coherent demo and throwing in Amazon EC2 to boot. This post will explain what I did and how I did it. Let's start at the beginning.

One of the new offerings from Sun is the Sun HPC Software. Beneath the excessively generic name is a complete, integrated stack of HPC software components. Currently there are two editions: the Sun HPC Software, Linux Edition (aka Project Giraffe) and the Sun HPC Software, Solaris Developer Edition. (A Sun HPC Software, Solaris Edition and Sun HPC Software, OpenSolaris Edition will be following shortly.) The Linux edition is exactly what the name implies. It's a full stack of open source HPC tools bundled into a CentOS image, ready to push out to your cluster. The Solaris developer edition is a slightly different animal. It is targeted at developers interested in writing HPC applications for Solaris. The Solaris developer edition is a virtual machine image (available for VMware and VirtualBox) that includes Solaris 10 and a pre-installed suite of Sun's HPC products, including Sun Grid Engine, Sun HPC ClusterTools, Sun Studio, and Sun Visualization, all integrated together.

For this demo, I used the Solaris developer edition. The end goal was to produce a version of the virtual machine image that was capable of automatically borrowing resources from a local pool or from the cloud in order to test or deploy developed HPC applications. Inside the developer edition virtual machine, there are already two Zones that act as virtual execution nodes for testing applications. That's a nice start, but what about testing on real machines or a larger number of machines? That's where the resource borrowing comes in. In the end, I had a VM image that was capable of automatically borrowing and releasing resources first from a local pool and later from the cloud, on demand.

The first step was to get the developer edition running as-is. Sounded simple enough. The first wrinkle was that I was doing this demo on a Mac. The regular VMware Player is not available for Mac, so I had to download an eval copy of VMware Fusion. Once I had Fusion installed, I was able to bring up the developer edition VM without a hitch.

Step 2 was to get the VM networked. The network configuration for the developer edition beta 1 is such that the global and non-global Zones can see each other, but nobody can get into or out of the VM. Getting the networking working was probably the hardest part of the demo, and honestly, I can't tell you how I finally did it. Per the suggestion of the pop-up dialogs from VMware, I installed the VMware Tools in the VM's Solaris instance. That changed the name of the primary interface from pcn0 to vmxnet0, but didn't actually help. Solaris was still unable to plumb the interface. After twiddling the VM's network settings several times and doing several reconfiguration boots, I eventually ended up with a working vmxnet1 interface (and a dead pcn0 and vmxnet0). As usual in such adventures, I'd swear that the last thing I did before it started working should not have had any appreciable effect. Oh, well. It worked, and I wasn't interested in understanding why.

Now that I had a functional network interface, the next step was to reinstall the Sun Grid Engine product. The VM comes with a preinstalled instance, but this demo requires features that are not enabled in a default installation such as the one the VM provides. I left the original cell (default) intact and installed a new cell (hpc) with the -jmx and -csp options. -jmx enables the Java thread in the qmaster that serves up the JGDI API over JMX. I needed JGDI so that the demo GUI that I was building could receive event updates from the qmaster about job and host changes. With Sun Grid Engine 6.2, I was unable to connect to the JMX server unless I installed the qmaster with certificate-based security, hence the -csp option. After the installation was complete, I then had to do the usual CSP certificate juggling, plus a new wrinkle. In order to connect to the JMX server, I also had to create a keystore for the connecting user with $SGE_ROOT/util/sgeCA/sge_ca -ks <user>. There's a quirk to the sge_ca -ks command, though. By default, it fails, explaining that it can't find the certificates. The reason is that the path to the certificates is hard-coded in the sge_ca script to a ridiculous default value. To change it to the correct value, I had to use the -calocaltop switch. After the certificates were squared away, I installed execution daemons in both Zones. At least that part was easy.

The next thing I did was to create some more Zones. Yes, I know this demo was supposed to be using real machines from a local pool and the cloud. Because it's a demo on a laptop, the "local machines" had to be equally portable. Because of firewall issues, I also wanted to have a backup for the cloud. In an effort to be clever, I moved the file systems for the two existing Zones onto their own ZFS volumes. I wanted to create the new Zones as cloned snapshots of the old Zones. Unfortunately, it turns out that even though the man page for zfs(1M) says that it's possible, the version of Solaris installed in the VM is the last version on which it isn't possible. After chasing my tail a bit, I decided to just do it the old-fashioned way instead of trying to force the newfangled way to work.

Now that I had six non-global Zones running, the next step was to get Service Domain Manager installed. It is neither installed nor included in the developer edition VM, so I had to scp it over from my desktop. Technically, I could probably have managed to download it directly from the VM, but I had already downloaded it to my desktop before I started. For the Service Domain Manager installation, I followed Chansup's blog rather than the documentation. Chansup's blog posts detail exactly what steps to follow without the distraction of all the other possibilities that the docs explain. Following the steps in the blog, I was able to get the Service Domain Manager master and agents installed with little difficulty. The hardest part is that the sdmadm command has extremely complicated syntax, and it took a while before I could execute a command without having the docs or blog in front of me as a reference. To prove that the installation worked, I manually forced Service Domain Manager to add one of the new Zones to the existing Sun Grid Engine cluster, and much to my shock and wonderment, it worked.

The last step of VM (re)configuration was to configure the Service Domain Manager with a local spare pool and a cloud spare pool and a set of policies to govern when resources should be moved around. This step proved about as tricky as I expected. As one of the original architects and developers of the product, I had a good idea of what I wanted to do and how to make it happen, but the syntax and the details were still problematic. The syntax was the first hurdle. The docs have issues with both understandability and accuracy, and Chansup's blog was too narrowly focused for my purposes. After I poked around a bit, I figured out how to do what I wanted, but actually doing it was the next challenge. What I wanted to do was create two MaxPendingJobsSLO's...

We interrupt your regularly scheduled blog post to bring you a public service announcement. Please, for your own well-being and the well-being of others who might use your software, test all of your code contributions thoroughly on all supported platforms, and have them reviewed by an experienced member of the development team before committing, especially if you're working on the Firefox source base. This point in the blog post is the last time I saved my text before completing the post. Before I could save it again, Firefox segfaulted, causing me to lose a significant amount of work. What follows is a downtrodden, half-hearted attempt to complete the post again. We now return you to your regularly scheduled blog post.

What I wanted to do was create two MaxPendingJobsSLO's for the Sun Grid Engine instance. The first would post a moderate need (50) when the pending job list was more than 6 jobs long. The second would post a high need (99) when the pending job list was more than 12 jobs long. I also wanted to have a local spare pool with a low (20) PermanentRequestSLO and a low FixedUsageSLO, and a cloud spare pool with a moderate (60) PermanentRequestSLO and a moderate FixedUsageSLO. The idea was that when the Sun Grid Engine cluster was idle, all the resources would stay where they were. When the pending job list was longer than 6 jobs, resources would be taken from the local spare pool. When the pending job list was longer than 12 jobs, additional resource would be taken from the cloud spare pool. When the pending job list grew shorter, the resources would be returned to their spare pools. In theory. (The philosophy of setting up Service Domain Manager SLOs is a full topic unto itself and will have to wait for another blog post.)
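The intent of that policy can be modeled in a few lines of Python. This is only a toy sketch of the threshold logic, not Service Domain Manager code or configuration. The function and constant names are hypothetical, and it assumes that a need only wins resources from a spare pool whose own urgency is lower.

```python
# Toy model of the intended policy. All names here are hypothetical, and
# this is not Service Domain Manager code or configuration.

# (pending-job threshold, urgency of the need posted once it is exceeded)
MAX_PENDING_JOBS_SLOS = [(6, 50), (12, 99)]

# Urgency with which each spare pool asks to keep its resources.
SPARE_POOL_URGENCY = {"local": 20, "cloud": 60}


def cluster_need(pending_jobs):
    """Return the highest urgency posted by any SLO whose threshold is exceeded."""
    needs = [urgency for threshold, urgency in MAX_PENDING_JOBS_SLOS
             if pending_jobs > threshold]
    return max(needs, default=0)


def pools_to_borrow_from(pending_jobs):
    """List the spare pools whose urgency is lower than the cluster's need.

    Assumption: a need only wins resources from a lower-urgency pool.
    """
    need = cluster_need(pending_jobs)
    return sorted(pool for pool, urgency in SPARE_POOL_URGENCY.items()
                  if urgency < need)
```

With 7 pending jobs the model borrows only from the local pool (50 beats 20 but not 60), and with 13 it borrows from both.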

The first problem I ran into was that Service Domain Manager does not allow a spare pool to have a FixedUsageSLO. An issue has been filed for the problem, but that didn't help me set up the demo. The result was that I had no way to force Service Domain Manager to take the local spare pool resources before the cloud spare pool resources. The best I could do was set the averageSlotsPerHost value for the MaxPendingJobsSLO's to a high number so that Service Domain Manager would only take hosts one at a time, rather than one from each spare pool simultaneously.

The next problem was quite unexpected. With the SLOs in place, I submitted an array job with 100 tasks. I waited. Nothing happened. I waited some more. Still nothing happened. It turns out that the MaxPendingJobsSLO only counts whole jobs, not job tasks like DRMAA would. The workaround was easy. I just had to be sure the demo submitted enough individual jobs instead of relying on array tasks.
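The difference between the two ways of counting is easy to see in a small Python sketch. The pending-list representation below is made up for illustration and has nothing to do with how Sun Grid Engine or Service Domain Manager store jobs.

```python
# Hypothetical pending list: each entry is (job_id, number_of_tasks).
def pending_counts(pending):
    """Return (whole jobs, total tasks) for a pending list."""
    whole_jobs = len(pending)
    tasks = sum(task_count for _, task_count in pending)
    return whole_jobs, tasks


# One array job with 100 tasks versus eight individual single-task jobs.
array_job = [("array", 100)]
individual_jobs = [(f"job{i}", 1) for i in range(8)]

THRESHOLD = 6  # the lower MaxPendingJobsSLO threshold from the demo


def exceeds_threshold(pending):
    """Count whole jobs only, as the MaxPendingJobsSLO does."""
    whole_jobs, _ = pending_counts(pending)
    return whole_jobs > THRESHOLD
```

Counted by whole jobs, the 100-task array job is a single pending job and never crosses the threshold, while eight individual jobs do.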

The last problem was one that I had been expecting. After a long pending job list had caused Service Domain Manager to assign all the available resources to the cluster, when the pending job list went to zero, the borrowed resources didn't always end up where they started. Service Domain Manager does not track the origin of resources. Fortunately, the issue is resolved by an easy idiom. I created a source property for every resource, and I set the value of the property to either "cloud", "spare", or "sge". I then set up the spare pools' PermanentRequestSLO's to only request resources with appropriate source settings. I also added a MinResourceSLO for the cluster that wants at least 2 resources that didn't come from a spare pool, just to be complete.
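As a rough illustration of the idiom, here is a Python sketch in which every resource carries a source tag and each pool only asks for resources tagged as its own. The class, function, and host names are hypothetical. Only the three source values come from the setup described above.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    source: str  # "cloud", "spare", or "sge"


def requested_by_pool(pool_source, resources):
    """Names of the resources a spare pool would ask for: only its own."""
    return [r.name for r in resources if r.source == pool_source]


def cluster_owned(resources):
    """Names of the resources that did not come from a spare pool."""
    return [r.name for r in resources if r.source == "sge"]


# Hypothetical state after a long pending list: the cluster holds everything.
in_cluster = [
    Resource("zone1", "sge"),
    Resource("zone2", "sge"),
    Resource("zone3", "spare"),
    Resource("ec2-host", "cloud"),
]
```

Because the tag travels with the resource, each pool can only get back what it originally lent, no matter in what order resources are released.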

With the SLOs in place, the configuration actually did what it was supposed to. When the cluster had enough pending jobs, hosts were borrowed first from the local spare pool and then from the cloud. When the pending jobs were processed, the resources went back to the appropriate spare pools. To make the configuration more demo-friendly, I changed the sloUpdateInterval for the Sun Grid Engine instance to a few seconds (from the default of a few minutes). I also changed the quantity for the spare pools' PermanentRequestSLO's to 1 so that they would only reclaim their resources one at a time, rather than all at once. With those last changes made, I was ready to move on to the UI.

The idea of the demo was to present a clear graphical representation of what was going on with Sun Grid Engine and Service Domain Manager. From past experience building a similar demo for SuperComputing, I knew that JavaFX™ Script was the best tool for the job. (OK. It's not the best tool for the job in a general sense, but I'm a long-time Java™ geek, I don't know Flash, and I didn't have any budget to buy tools. Under those constraints, it was the best I could do.) Before I could get to building the UI, though, I first needed a JGDI shim to talk to the qmaster. Richard kindly provided me with some JGDI sample code, and from there it was pretty easy. The hardest part was figuring out what the events actually meant. In the end, my shim registered for job add events (to recognize job submissions), task modified events (to recognize job tasks being scheduled), and job deleted events (to recognize job completions). It also registered for host added and deleted events to recognize when Service Domain Manager reassigned a host.
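The bookkeeping the shim has to do can be sketched roughly as follows. This is a Python model, not JGDI code. The event names are plain strings invented for the sketch, and it assumes that a task modified event always means the job was scheduled.

```python
class ClusterView:
    """Tracks pending jobs, running jobs, and hosts from a stream of events."""

    def __init__(self):
        self.pending = set()
        self.running = set()
        self.hosts = set()

    def handle(self, event, name):
        if event == "job_added":          # job submission
            self.pending.add(name)
        elif event == "task_modified":    # assumed to mean "job was scheduled"
            self.pending.discard(name)
            self.running.add(name)
        elif event == "job_deleted":      # job completion
            self.pending.discard(name)
            self.running.discard(name)
        elif event == "host_added":       # host assigned to the cluster
            self.hosts.add(name)
        elif event == "host_deleted":     # host taken away from the cluster
            self.hosts.discard(name)
        else:
            raise ValueError(f"unknown event: {event!r}")
```

A UI can then redraw from the three sets after each event instead of polling the qmaster.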

With the shim working smoothly, I turned to the actual UI. Given the complexity of the animations that I wanted to do, it was shockingly simple to achieve with JavaFX Script, especially considering that there was not yet a graphical tool equivalent to Matisse for Swing. Every bit of it was hand-coded, but it still was fast, easy, and came out looking great. In the end, the whole UI, counting the shim, was about 1500 lines of code, and about 500 lines of that was the shim. (JGDI is rather verbose, especially when establishing a connection to the qmaster.)

And with that, I ran out of time. The next step would have been to actually populate the cloud spare pool with machines provisioned from the cloud. Torsten graciously provided me a Solaris AMI that included Sun Grid Engine and Service Domain Manager. The plan was to pre-provision two hosts to populate the pool and then create a script that would provision an additional host each time the cloud pool dropped below two hosts and release a host every time it grew larger than two hosts. Now that the demo has been presented, the pressure is off, and other things are higher priority. I do plan, however, to eventually come back and put the last piece of the puzzle in place.
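The planned script was never written, but its core decision is simple enough to sketch in Python. The function names are hypothetical, and the actual provisioning and release calls are left out entirely. Only the target of two hosts and the one-host-at-a-time behavior come from the plan above.

```python
def next_action(pool_size, target=2):
    """Decide what to do with the cloud spare pool, one host at a time."""
    if pool_size < target:
        return "provision"
    if pool_size > target:
        return "release"
    return "none"


def settle(pool_size, target=2):
    """Apply one action at a time until the pool is back at its target size."""
    actions = []
    while (action := next_action(pool_size, target)) != "none":
        actions.append(action)
        pool_size += 1 if action == "provision" else -1
    return actions
```

For example, settle(0) yields two provision actions, and settle(3) yields a single release.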

Below is a video of the demo, showing how jobs can be submitted from the Sun Studio IDE, and how Sun Grid Engine and Service Domain Manager work together with the local spare pool and the cloud to handle the workload. The job that is being submitted is a short script that submits eight sleeper jobs. Because the MaxPendingJobsSLO ignores array tasks, I needed to submit a bunch of individual jobs, but I didn't want to have to click the submit button multiple times in the demo.

Filming the video turned out to be an interesting challenge unto itself. I did the screencap using Snapz Pro on the Mac. It has no problem with JavaFX Script or with VMware VMs, but it apparently can't film JavaFX Script running inside a VMware VM. I ended up having to twiddle the UI a bit so that I could run it directly on the Mac. That's why in the demo, when I switch from Sun Studio to the UI, I swap Mac desktops instead of Solaris workspaces. The voice over and zooming effects are courtesy of Final Cut, by the way.

Thursday May 29, 2008

Sun Cluster Core Is Now Open Source!

The Solaris Cluster team today contributed over 2 million lines of source code to the open source community, completing the promise we made almost one year ago to open source the complete Solaris Cluster product under the name, Open High Availability Cluster. Listen to a podcast with Meenakshi Kaul-Basu, Director of Availability Products or read the official press release.

Solaris Cluster is Sun's High Availability Cluster offering. The product community hosts a blog and wiki where you can also find more information. Open HA Cluster is part of OpenSolaris, available in the HA Clusters Community Group on OpenSolaris.org. The Open HA Cluster source code is available under the Common Development and Distribution License (CDDL).

This source code release is the third of three phases announced one year ago.

Phase 1, on June 27, 2007, contained the source for almost all the Sun Cluster agents.

Phase 2, on December 4, 2007, included the source for Sun Cluster Geographic Edition disaster recovery software.

Phase 3, announced today, and delivered six months ahead of schedule, contains the source for the core Solaris Cluster product, consisting of over 2 million lines of source code!

The open source code does not include some encumbered Solaris Cluster source code. Nonetheless, users can build a completely usable HA Cluster from this source with the Sun Studio 11 product. Also available is source for parts of the Solaris Cluster Automated Test Environment (SCATE), source for the Solaris Cluster man pages, and source for Solaris Cluster Globalization (G11N). CTI for TET, which is part of the SCATE test infrastructure, has been separately open sourced on the testing community under the Artistic License. This framework supports both Solaris Cluster and ON test suites.

In addition to the source code, there is a binary distribution of OHAC, called Solaris Cluster Express (SCX), that runs on Solaris Express Community Edition.

Consider getting involved in the HA Clusters community group:

Tuesday May 27, 2008


Even if you can't speak German, this is a really cool video:

The basic gist is that Constantin and crew think the Thumper is a really cool machine, but that at $50k, it's a bit expensive for a developer to have under his desk. As an alternative, they propose a stack of inexpensive storage, in this case, a jumble of USB sticks. Using ZFS, Constantin pools together the sticks in RAID groups of three as one big storage pool. To demonstrate ZFS's recovery features, he copies a video onto the storage pool, starts the video playing, and then disconnects one of the USB hubs. Even though the storage pool loses 1/4 of its devices, the video continues playing without interruption. He then plugs the missing hub back in and shows that ZFS automatically reconstructs the pool, reintegrating the missing USB sticks. Constantin then does a ZFS export, removes all the sticks, shuffles them thoroughly, sticks them all back in in an unknown order, and then does a ZFS import. ZFS then sorts out which sticks are which and rebuilds the pool.

You can find more background information in Constantin's blog.


There's a new effort afoot in the OpenSolaris world to build a community around using OpenSolaris (and Solaris) in high performance computing. If you're interested in HPC on OpenSolaris, head over to the OpenSolaris HPC Community and have a look-see.

As a product of the community, the HPC Stack project is attempting to define what software would be needed to build a complete and useful HPC software stack for OpenSolaris. Right now, the main discussion for the HPC stack project is actually happening on the HPC Community mailing list, but feel free to jump on either list and voice your opinion.

Keep in mind that both communities are still young and in an evolving state, but you should count that as a good thing. It gives you the chance to jump in early and make a big difference in the community and/or project direction. I'm looking forward to seeing your input on the lists!

Tuesday May 29, 2007


Ever since I installed Solaris Nevada on my desktop, I've been plagued by an obnoxious problem. Periodically, particularly when using NetBeans, I would type something that would zap the X server. Normally it was while fat-fingering, so I was never sure exactly what it was that I typed to cause the problem, and so I never researched it. I had always assumed that it was going to be some deep and odd issue with the way NetBeans uses Swing.

Well, I just managed to zap my X server while using an xterm. More importantly, I did it by pressing only two keys: CTRL and backspace. A quick search on the Internet turned up OpenSolaris issue 6404762. Pressing CTRL and backspace while num lock is on zaps the X server. The problem is actually a minor issue with some screwed up key mappings. While not stated in the workaround for the issue, the issue description provides enough information to figure out that the way to fix it is to edit the /usr/X11/lib/X11/xkb/symbols/sun file for your locale and swap the Mod2 and Mod3 mappings. Works like a charm!

Tuesday Apr 17, 2007

Nevada Is Further Away Than Crazy

After an obscenely long struggle I finally have both Solaris 10 and Solaris Nevada installed on my Ultra 20 workstation. As usual, since I have gone through the pain of getting it working, I'm explaining the process here to save others the trouble.

Installing these two friendly operating systems in a dual-boot configuration is not as simple as it may sound. The first thing you need to know is that Solaris doesn't like there to be more than one Solaris partition on a disk. The trick to installing two different versions of Solaris on the same disk is installing them both in the same partition. Because Solaris subdivides partitions into slices, it's possible to install Solaris 10 in one set of slices and Solaris Nevada in another set of slices. An advantage of this configuration is that some of the slices can be shared between both operating systems.

Here's my slice layout:

  • Slice 0: Solaris 10 root
  • Slice 1: Swap
  • Slice 3: Solaris Nevada root
  • Slice 6: /usr/local
  • Slice 7: /export/home

Notice that only slices 0 and 3 are specific to an operating system. All of the rest of my slices are shared, including the swap space. Not only does this help conserve disk space, but it also saves me from having to install two copies of everything. I use /usr/local as the repository for all my shared software. Note that /opt is not in a shared slice. The reason is that /opt is where software packages live, and software packages don't necessarily live only under /opt. I found it was less hassle to maintain two separate /opt directories than to try to coerce one instance of Solaris into recognizing software packages from another instance.

When Solaris 10 is loaded, slice 0 is mounted as /, and slice 3 is mounted as /s11. When Solaris Nevada is loaded, slice 0 is mounted as /s10, and slice 3 is mounted as /. By cross-mounting the root directories, I'm able to get even better sharing between operating systems. It also makes it very easy to do maintenance.

That was the easy part. The hard part was getting Solaris Nevada installed. When I installed Solaris 10 update 2, I laid out my slices more or less as I described above, with the assumption that at some point in the future I would install Solaris Nevada. When I eventually got around to trying it, I found I was unable to get past the install step where Solaris Nevada tried to lay out the disk and build its file systems. Apparently there was something about the partition that Solaris 10 update 2 had inherited from the previous install of Solaris 10 (no update) that was inherently bad. Unfortunately, at the time I didn't know that. Instead I went on banging my head against the problem intermittently for about 4 months. Finally, last week I decided to try to solve the problem by upgrading my Solaris 10 update 2 to update 3, in the hopes that an upgrade install would make something better. To make a long story short, I left my system in a state where starting over was my best option. During my fresh install of update 3, I used a terminal window to delete the previous Solaris partition, allowing the update 3 install to really start from scratch. After the install finished, I rebooted and tried Solaris Nevada build 61. I was much relieved to find that the Solaris Nevada install worked without a hitch. Unfortunately, after rebooting, there was a problem with the video driver which prevented the X console from running properly. In a last-ditch effort, I pulled out my old build 55b DVD, and it installed and ran fine. I'm writing this post from 55b.

Now, here's where the cross-linked roots come in handy. After installing Solaris Nevada, the Solaris Nevada boot loader was in charge, and it didn't know anything about Solaris 10. In order to make Solaris 10 bootable, I had to edit the /boot/grub/menu.lst file to add menu entries for Solaris 10. Essentially I copied the information from /s10/boot/grub/menu.lst into the file. According to the Solaris docs, there's a better way to do the same thing with the eeprom command, but it wasn't obvious to me how. An important part of adding the Solaris 10 boot information into the Solaris Nevada boot loader menu was the root (hd0,0,a) line, which tells the boot loader to root the boot paths at slice 0 of the first partition of disk 0. (GRUB numbers disks and partitions from zero, and the letter a stands for slice 0.) Don't forget to include that!
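To make the (hd0,0,a) notation concrete, here is a small Python sketch that splits such a device spec into its parts. The function name is hypothetical, and the sketch only handles the three-part form used above.

```python
import re


def parse_grub_root(spec):
    """Split a GRUB device spec such as '(hd0,0,a)' into its parts."""
    match = re.fullmatch(r"\(hd(\d+),(\d+),([a-z])\)", spec.strip())
    if match is None:
        raise ValueError(f"not a (hdN,P,s) device spec: {spec!r}")
    disk, partition, slice_letter = match.groups()
    return {
        "disk": int(disk),                      # zero-based disk number
        "partition": int(partition),            # zero-based partition index
        "slice": ord(slice_letter) - ord("a"),  # 'a' is slice 0, 'b' is slice 1, ...
    }
```

With the slice layout above, (hd0,0,a) points at the Solaris 10 root in slice 0, and (hd0,0,d) would point at the Solaris Nevada root in slice 3.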

The last little bit of advice I can offer is about application paths. Because /export/home is shared, users' desktops will have the same path information associated with menus and icons under both operating systems. I used symbolic links to smooth over any differences in paths between the two operating systems. Also pay attention to the application paths associated with MIME types in your browsers. One other thing you have to watch is version issues with configuration files, particularly with GNOME. Since you're now using the same home directory for multiple desktop versions, you have the potential to run into problems.

Monday Mar 26, 2007

How to Create a File System on a USB Mass Storage Without vold Running

The other day I pulled my memory stick out of my Mac laptop without ejecting it first, and it managed to hose the file system on the stick. It took me a while before I finally found this doc describing the process of putting a new FAT32 file system on a USB memory stick from Solaris 10. Hopefully this post saves someone else a little search time.



