Nagging Nagios Feelings

So I've been doing some configuration and testing with Nagios, and I have a nagging feeling that it is going to lead to some pretty major issues in the future. Monitoring is a "need-to-have" in the HPC world, and the leaders in this pack so far are Nagios and Ganglia. While we've been including Ganglia in the stack, we've never really helped with its configuration, focusing instead on other tasks. For the next release, it seemed like a good idea to pause and see whether Ganglia is the right choice, or whether Nagios can provide some more options.

My biggest issue, right now, is the scalability of Nagios. This becomes apparent when you look at just how Nagios is configured. To define a cluster, you must create a host entry for each host within the cluster; while this is easy enough and scriptable, it really raises the question "are you thinking about 1,000, or 10,000, or even more nodes?" Yes, this is only the configuration file, but the concern carries over into the monitoring itself. Nagios uses a polling method to check every service on every node. In the case of 10K nodes, how long will it take for the same node to be checked twice, or three times? How long will it take before we find out that it's down?
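To sketch what that per-host configuration looks like (the host name, address, and intervals below are made-up examples, not an actual site config), every node needs a stanza along these lines, or a scripted equivalent:

```
define host {
    use                  generic-host       ; inherit defaults from a template
    host_name            node0001
    address              10.0.0.1
    check_command        check-host-alive
    check_interval       5                  ; minutes between polls
    max_check_attempts   3
}
```

Multiply that by 10,000 hosts, plus a service definition per check, and both the file size and the polling load grow with the node count.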

Perhaps I just don't understand the configuration options available to me (which is why I'm writing, hoping someone tells me I'm stupid). Perhaps there are other ways to approach this with Nagios (e.g., use scalable units that each only monitor a subset of nodes). Any thoughts out there?

(This has been cross-posted on our mailing list here.)


I was planning to give Nagios a try, but now I am using Zabbix, which seems to be used by Ben Rockwood too. Maybe you should take a look at it.

Posted by MichaelLi on March 03, 2009 at 12:53 PM EST #

Thanks for that pointer to Zabbix. I'll have to investigate it further. One thing that concerns me is a line on their front page: "Up-to 1000 of nodes". Someday we'll get people to think outside of "cluster" nomenclature and really start scaling.

Posted by Makia Minich on March 05, 2009 at 04:19 AM EST #

There are a few things we have used for 1k-2k nodes:

- Have Nagios only check whether a node is up, and then let each node, via a cron job, report any other issues it is having directly to the Nagios server.

- Use information from other systems that scale better. Slurm, for example, can tell you which nodes are up or down.

- Cluster Monitoring Plugin: monitor a threshold of down nodes before triggering critical failures. This allows you to have a few percent of nodes out before you wake up the admin.

- Use templates to configure nodes that are exactly the same (such as 10,000 identical compute nodes).
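Roughly, a template-based configuration looks like this (the template and host names are illustrative): the shared settings live in one template stanza, and each host only supplies what differs.

```
define host {
    name                 compute-node       ; a template, not a real host
    register             0                  ; do not register the template itself
    check_command        check-host-alive
    max_check_attempts   3
}

define host {
    use        compute-node
    host_name  node0001
    address    10.0.0.1
}
```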

Posted by Evan Felix on March 09, 2009 at 07:10 AM EDT #

There are two possible solutions to the scalability question within Nagios:

- I am the most familiar with this approach and have seen many sites scale to thousands of monitored nodes without problem. Checks are issued from remote systems, and status is reported to the central console, which limits the load and activity on the primary Nagios monitoring host.

- This is a newer project that promises to ease the scalability question in a seamless fashion. I have not had a chance to work with it, but the architecture seems sound.

As for managing the Nagios configuration: it is all standard text and can be generated by scripts from information in a central database. It is very common for large sites to select all hosts from the DB and then generate the Nagios configuration files from that information.
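A minimal sketch of such a generation script, with a hard-coded host list standing in for the database query (the host names, addresses, and the `generic-host` template are assumptions for illustration):

```python
# Sketch: generate Nagios host definitions from a host inventory.
# In practice the rows would come from a SQL query against a site
# database; here a hard-coded list stands in for that query result.

def render_host(name, address):
    """Render one Nagios host stanza, inheriting from a 'generic-host'
    template assumed to be defined elsewhere."""
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )

def render_config(rows):
    """Render stanzas for every (name, address) row in the inventory."""
    return "\n".join(render_host(name, address) for name, address in rows)

if __name__ == "__main__":
    # Fake inventory: 3 nodes with sequential names and addresses.
    inventory = [(f"node{i:04d}", f"10.0.{i // 256}.{i % 256}")
                 for i in range(1, 4)]
    print(render_config(inventory))
```

Regenerating the files and reloading Nagios whenever the DB changes keeps the configuration and the inventory from drifting apart.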

Posted by Joey Jablonski on March 14, 2009 at 11:30 AM EDT #
