Tuesday Jun 30, 2009

Quick Blog Links

Just wanted to push out some quick links to other blog write-ups involving the Sun HPC Software, Linux Edition. Earlier this week (or was it late last week?) we put up the blog entry on using the next version of the stack with VirtualBox (if you missed it, here it is). Here are some other write-ups about the upcoming 2.0 release that you shouldn't miss:

As time goes on, we'll post other write-ups that we come across.

Sunday Jun 28, 2009

Coming soon ... 2.0

Just wanted to get a note out there that 2.0 is on its way. It's had its delays, but we're close to the final release. Included will be the following major changes:
  • Support for RHEL 5.3, CentOS 5.3 and now SLES 10
  • OFED 1.3.1
  • Lustre 1.8.0.1
When we finally get the release out the door, we'll send out more information. In the meantime, you can take a sneak peek at the documentation or check out how to build a 2.0 cluster using VirtualBox here.

Tuesday Mar 03, 2009

Nagging Nagios Feelings

So I've been doing some configuration and testing with Nagios, and I have this nagging feeling that it is going to lead to some pretty major issues in the future. Monitoring is a "need-to-have" in the HPC world, and the leaders of the pack so far are Nagios and Ganglia. While we've been including Ganglia in the stack, we've never really helped with its configuration, since other tasks took priority. For the next release, it seemed like a good idea to pause and see whether Ganglia is the right choice, or whether Nagios can provide more options.

My biggest issue right now is the scalability of Nagios. It shows up as soon as you look at how Nagios is configured. To define a cluster, you must create a host entry for each host in the cluster. That is easy enough and scriptable, but it raises the question: "are you thinking about 1,000, or 10,000, or even more nodes?" Yes, this is only the configuration file, but the same concern carries over into the monitoring itself. Nagios uses a polling method to check every service on every node. With 10K nodes, how long will it take before the same node is checked a second or third time? How long before we find out that it's down?
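
To make the concern concrete, here is a rough Python sketch. It generates one Nagios host definition per node and then estimates how long a full polling sweep might take. The `generic-host` template name, the `node` prefix, the 10.0.0.0-based addressing, and the simple "checks divided by parallelism" timing model are all assumptions for illustration, not how the stack or Nagios itself schedules checks.

```python
import ipaddress
import math


def host_definition(name, address, template="generic-host"):
    """Render one Nagios host object definition (template name is assumed)."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def cluster_definitions(count, prefix="node"):
    """Render host definitions for `count` nodes named prefix + zero-padded index."""
    width = len(str(count))
    base = ipaddress.IPv4Address("10.0.0.0")  # hypothetical management network
    blocks = []
    for i in range(1, count + 1):
        name = f"{prefix}{i:0{width}d}"
        blocks.append(host_definition(name, str(base + i)))
    return "\n".join(blocks)


def sweep_seconds(nodes, services_per_node, seconds_per_check, parallel_checks):
    """Estimate one full pass over every service on every node.

    Simplified model: checks run in batches of `parallel_checks`, and each
    batch takes `seconds_per_check`. Real schedulers behave differently.
    """
    total_checks = nodes * services_per_node
    return math.ceil(total_checks / parallel_checks) * seconds_per_check


if __name__ == "__main__":
    print(cluster_definitions(3))
    # Illustrative numbers only: 10,000 nodes, 5 services each,
    # 1 second per check, 50 checks running at once.
    print(sweep_seconds(10_000, 5, 1.0, 50), "seconds per sweep")
```

With those made-up numbers, a single sweep works out to 1,000 seconds, so a dead node could go unnoticed for roughly 17 minutes. Generating the file is the easy part; the time between checks is what worries me.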

Perhaps I just don't understand the configuration options available to me (which is why I'm writing, hoping someone tells me I'm stupid). Perhaps there are other ways to approach this with Nagios (e.g., use scalable units that each only monitor a subset of nodes). Any thoughts out there?
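
As a sketch of the "scalable units" idea, the Python below splits a node list into fixed-size groups, each of which would be handed to its own monitoring instance. The unit size of 500 and the node naming are hypothetical, and how the per-unit results would be rolled up into one view is left out entirely.

```python
def partition_hosts(hosts, unit_size):
    """Split the host list into consecutive units of at most `unit_size` hosts."""
    if unit_size < 1:
        raise ValueError("unit_size must be at least 1")
    return [hosts[i:i + unit_size] for i in range(0, len(hosts), unit_size)]


if __name__ == "__main__":
    # Hypothetical cluster: 10,000 nodes named node00001 .. node10000.
    hosts = [f"node{i:05d}" for i in range(1, 10_001)]
    units = partition_hosts(hosts, 500)
    for number, unit in enumerate(units, start=1):
        print(f"unit {number:02d}: {unit[0]} .. {unit[-1]} ({len(unit)} hosts)")
```

Each monitoring instance then only has to sweep its own 500 nodes, so the time between checks of any one node depends on the unit size rather than on the size of the whole cluster.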

(This has been cross-posted on our mailing list here.)

Friday Jan 09, 2009

Sun HPC Software, Linux Edition 1.2 Now Available!

The Sun HPC Software Stack, Linux Edition team (aka Giraffe) is pleased to announce the release of version 1.2 of our HPC Software Stack.

This release builds on the features included in version 1.1 of the stack, including:

  • Support for Red Hat 5.2 and CentOS 5.2
  • OFED 1.3.1
  • Lustre 1.6.6
  • Optional Kickstart-based installation

Improvements in version 1.2 of the stack include:

  • Boot Over InfiniBand (BoIB) support
  • New versions of oneSIS (2.0.1) and SLURM (1.3.10)
  • Inclusion of Mellanox Firmware Tools (version 2.5.0)
  • Optional perfctr-patched kernel for compute nodes

Sun HPC Software, Linux Edition is an integrated open-source software solution for Linux-based HPC clusters running on Sun hardware. It provides a framework of software components to simplify the process of deploying and managing large-scale Linux HPC clusters.

For more information or to download this new version, please visit the Sun HPC Software, Linux Edition product page at the following URL: http://www.sun.com/software/products/hpcsoftware/

Thanks to everyone in the community who helped us by providing early feedback and testing.

About

A forum to allow those of us in Giraffe to help the Linux community in our own little ways.
