Thursday Aug 21, 2008

Agent vs agent-less monitoring

Baron posted something interesting about agent vs agent-less monitoring in response to Rob.

While reading it, I couldn’t help thinking that the distinction is somewhat misleading, if not wrong.

I’d go so far as to say that agent-less monitoring doesn’t exist as such. Why would I say such a heretical thing?

Trivially, you need some piece of software to collect data. With munin you configure a server that triggers scripts on the monitored servers. The set of data sources is governed by what you install in the correct directory on the monitored server. Cacti relies heavily on SNMP and also allows you to write plugins which then connect out from the central Cacti server to the individual hosts to be monitored. I believe you aren’t limited to SNMP and can report data over a different protocol, but since I haven’t done it myself I don’t know the mechanics.
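To make the munin side concrete: a data source is just an executable dropped into the plugin directory, which munin-node runs with “config” to get the graph description and with no argument to fetch current values. A minimal sketch in Python (the graph and field names here are invented for the example):

```python
#!/usr/bin/env python
# Minimal munin-style plugin sketch. munin-node executes every plugin it
# finds in its plugin directory, passing "config" to ask for the graph
# description and no argument to ask for current values.
import os
import sys

def config_lines():
    # Static description of the graph; the names are just examples.
    return ["graph_title Load average", "load.label load"]

def value_lines():
    # Report the one-minute load average in munin's "field.value N" format.
    return ["load.value %.2f" % os.getloadavg()[0]]

if __name__ == "__main__":
    asked_for_config = len(sys.argv) > 1 and sys.argv[1] == "config"
    print("\n".join(config_lines() if asked_for_config else value_lines()))
```

The point being: even though the server triggers it, all of this code lives and runs on the monitored host.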

Now, with Cacti the scripts are all in one central place, which is nice. That’s certainly not the case for munin, and that’s where the problem comes in.

Munin, in fact, is an agent-based system by my definition: the code necessary to monitor services is not in a central place. Nagios and Cacti appear to be “truly” agent-less, but you still effectively exercise a non-trivial amount of code on the client side.

If your definition of an agent (in the monitoring world) is that it runs even when not connected to the monitor server, then yes, there’s a distinction.

However, I prefer to think about it this way: with agent-less systems, you are effectively retransmitting the “agent” each time you retrieve data. Not only do you establish connections at the rate of your most frequent collection interval, you are also retransmitting much more data in a given time. I find that wasteful.

One other problem I’ve encountered with “agent-less” systems (or rather with “mini-agent” systems like munin) is that if there is a problem with the network between the monitoring server and the monitored ones, or if the monitoring server itself goes down, you lose all data for that interval. There’s no way to collect the data, simple as that.

In my experience it’s not uncommon for an outage to affect only part of your network (especially if you have redundant datacenters - which is why you have them in the first place…), and losing much of your monitoring ability with no way to recover from a temporary problem (say, one lasting 5-10 minutes) is a bummer, especially in high-traffic environments.
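This is exactly what a local buffer inside an agent can mitigate. As a sketch (the send callback and sample format are hypothetical, not from any particular product), the agent queues samples while the server is unreachable and drains the backlog once it comes back:

```python
import collections

class BufferingAgent:
    """Sketch of an agent that survives monitoring-server outages by
    buffering samples locally and draining the backlog on reconnect.
    'send' is a hypothetical callable(sample) -> bool, True on success."""

    def __init__(self, send, max_backlog=10000):
        self.send = send
        # Bounded queue so a long outage can't exhaust memory;
        # the oldest samples are dropped first once it is full.
        self.buffer = collections.deque(maxlen=max_backlog)

    def record(self, sample):
        self.buffer.append(sample)
        self.flush()

    def flush(self):
        # Deliver oldest-first; stop at the first failure and keep
        # the rest of the backlog for the next attempt.
        while self.buffer:
            if not self.send(self.buffer[0]):
                return
            self.buffer.popleft()
```

A polling server has no equivalent: if it cannot reach the host at collection time, that sample simply never existed.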

Another interesting point is that agent-less systems poll constantly. That is fine for sampling data that is constantly changing, but for other things it’s far from ideal. An agent can use an “interrupt” model when reporting data back to the monitoring server, because it is an active component able to report data of its own accord. Take query analysis, for instance: if there’s no data to report, don’t report it. That means less work on both sides and less traffic going back and forth. The same goes for thresholds: you can report once a threshold is exceeded, perhaps continue to send actual values while the readings are out of their allowed range, and stop again once everything is back to normal. This approach greatly reduces the amount of data flow, but you cannot do it with a polling scheme.
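That threshold behaviour can be sketched in a few lines, assuming a hypothetical report callback on the agent side:

```python
class ThresholdReporter:
    """Sketch of the "interrupt" reporting model: send a reading only
    while it is outside its allowed range, plus one final report when
    it returns to normal. 'report' is a hypothetical callback that
    pushes a value to the monitoring server."""

    def __init__(self, threshold, report):
        self.threshold = threshold
        self.report = report
        self.alerting = False

    def sample(self, value):
        if value > self.threshold:
            self.alerting = True
            self.report(value)      # keep sending while out of range
        elif self.alerting:
            self.alerting = False
            self.report(value)      # one "back to normal" report
        # otherwise: nothing to report, no traffic at all
```

Feeding it the readings 1, 5, 12, 15, 8, 3 against a threshold of 10 produces exactly three reports (12, 15, and the recovery value 8) instead of six polled samples.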

Agent systems do have problems of their own, including higher maintenance effort, but the idea is that each system is separate and isolated from the others. And quite frankly, I don’t buy the “it’s hard to update agents” argument. There are quite a few free management solutions that make software deployment pretty easy once they are in place. Every place I’ve worked at had some kind of system like that, and performing a software update took almost no effort at all (that includes MySQL, Apache and even kernel updates). It’s quite possible to include agents in such a system and have them deployed automatically - after testing in a staging environment, of course.

I believe agents have their place and a properly designed system can mitigate almost all maintenance issues.

Obviously, if you want to measure network latency or service availability (is the HTTP service still up and responding), there’s little sense in deploying agents. Just use Nagios, it’s good for something like that!


Kay Roepke

