Agent vs agent-less monitoring

Baron posted something interesting about agent vs agent-less monitoring in response to Rob.

While reading it, I couldn’t help thinking that the distinction is somewhat misleading, if not wrong.

I’d go so far as to say that agent-less doesn’t exist as such. Why do I say such heresy?

Trivially, you need some piece of software to collect data. With munin you configure a server that triggers scripts on the monitored servers. The set of data sources is governed by what you install in the correct directory on the monitored server. Cacti relies heavily on SNMP and also allows you to write plugins which then connect out from the central Cacti server to the individual hosts to be monitored. I believe you don’t have to use SNMP at all and can report data over a different protocol, but since I haven’t done it I don’t know the mechanics.

Now, with Cacti the scripts are all in one central place, which is nice. It’s certainly not the case for munin and that’s where the problem comes in.

Munin is in fact an agent-based system by my definition: the code necessary to monitor services is not in one central place. Nagios and Cacti appear to be “truly” agent-less, but effectively you exercise a non-trivial amount of code on the client side as well.

If your definition of an agent (in the monitoring world) is that it runs even when not connected to the monitor server, then yes, there’s a distinction.

However, I prefer to think about it this way: With agent-less systems, you are effectively retransmitting the “agent” each time you retrieve data. Not only do you establish connections at the rate of your most frequently collected metric, but you also transmit much more data in a given time. I find that wasteful.

One other problem I’ve encountered with “agent-less” systems (or rather with “mini-agent” systems like munin) is that if there is a problem with the network between the monitoring server and the monitored ones, or the monitoring server says bye-bye, you lose all data for that interval. There’s no way to collect the data, simple as that.

In my experience it’s not uncommon for an outage to affect only part of your network (especially if you have redundant datacenters, which is why you have them in the first place…), and losing much of your monitoring ability without any way to recover from a temporary problem (say, one lasting about 5-10 minutes) is a bummer, especially in high-traffic environments.
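To make the contrast concrete, here is a minimal sketch of how an agent could ride out such an outage by spooling samples locally and sending them once the monitoring server is reachable again. It is an illustration only, not how munin or any other tool mentioned here works. The class name SpoolingAgent, the send callback, the pending method and the simulated link are all hypothetical.

```python
from collections import deque


class SpoolingAgent:
    """Hypothetical agent that keeps samples locally until they can be delivered."""

    def __init__(self, send, max_spool=1000):
        # `send` is assumed to raise ConnectionError when the server is unreachable.
        self._send = send
        # Bounded spool: the oldest samples are dropped once max_spool is reached.
        self._spool = deque(maxlen=max_spool)

    def record(self, timestamp, value):
        """Store a sample, then try to deliver everything that is pending."""
        self._spool.append((timestamp, value))
        self.flush()

    def flush(self):
        """Send pending samples in order; stop at the first failure and keep the rest."""
        while self._spool:
            try:
                self._send(self._spool[0])
            except ConnectionError:
                return
            self._spool.popleft()

    def pending(self):
        return len(self._spool)


# Simulated monitoring server and network link (assumptions for the demo).
received = []
link = {"up": True}


def fake_send(sample):
    if not link["up"]:
        raise ConnectionError("monitoring server unreachable")
    received.append(sample)


agent = SpoolingAgent(fake_send)
agent.record(1, 0.5)
link["up"] = False          # outage begins
agent.record(2, 0.7)
agent.record(3, 0.9)
pending_during_outage = agent.pending()
link["up"] = True           # outage ends
agent.record(4, 0.6)
print(received)
```

The samples taken during the simulated outage arrive late but are not lost, which a pure polling setup cannot offer. The spool is bounded here so that a long outage cannot exhaust memory on the monitored host. A real agent would probably spool to disk instead.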

Another interesting point is that agent-less systems poll constantly. That is fine for sampling data that changes constantly, but for other things it’s far from ideal. An agent can use an “interrupt” model when reporting data back to the monitoring server, because it’s an active component able to report data of its own accord. Take query analysis, for instance: if there’s no data to report, don’t report it. That means less work on both sides and less traffic going back and forth. The same goes for thresholds: you can report once a threshold is exceeded, perhaps continue to send actual values while the readings are out of their allowed range, and stop again once everything is back to normal. This approach greatly reduces the amount of data flowing, but you cannot do it with a polling scheme.
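The threshold idea can be sketched in a few lines. This is a rough illustration under assumptions of my own: the function name report_on_threshold, the send callback and the single “recovered” message on returning to normal are hypothetical and not taken from any particular monitoring tool.

```python
def report_on_threshold(readings, low, high, send):
    """Report readings only while they are outside [low, high].

    One extra "recovered" message is sent when the value returns to the
    allowed range (an assumption of this sketch), after which the agent
    goes silent again.
    """
    out_of_range = False
    for value in readings:
        if value < low or value > high:
            send(("alert", value))       # keep sending actual values while out of range
            out_of_range = True
        elif out_of_range:
            send(("recovered", value))   # tell the server things are back to normal
            out_of_range = False
        # In range and previously in range: nothing to report.


sent = []
report_on_threshold([10, 12, 95, 97, 11, 10], low=0, high=90, send=sent.append)
print(sent)  # [('alert', 95), ('alert', 97), ('recovered', 11)]
```

Six readings produce three messages here. With mostly well-behaved values the saving grows accordingly, at the price that the monitoring server only sees data when something is off.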

Agent systems do have their own problems, including higher maintenance effort, but the idea is that each system is separate and isolated from the others. And quite frankly, I don’t buy the “it’s hard to update agents” argument. There are quite a few free management solutions that make software deployment pretty easy once they are in place. Every place I worked at had some kind of system like that, and performing a software update took almost no effort at all (that includes MySQL, Apache and even kernel updates). It’s quite possible to include agents in that system and have them deployed automatically, after testing them in a staging environment, of course.

I believe agents have their place and a properly designed system can mitigate almost all maintenance issues.

Obviously, if you want to measure network latency or service availability (is the HTTP service still up and responding), there’s little sense in deploying agents. Just use Nagios, it’s good for something like that!

Comments:

Actually SNMP in Cacti is totally optional. I don't use it.

I don't get how you're retransmitting the agent each time you collect data. If my collection method is "ssh [server] cat /proc/vmstat" that's something like 20 bytes. The shell script that does something with the result is not getting transmitted. I don't think I understand what you mean here.

Posted by Xaprb on August 21, 2008 at 07:25 AM CEST #

Baron,

what I meant is something more involved than just getting the content of one file.

Of course you can offload almost any computation on the polling end, but then you'll end up with one overloaded server, at least for bigger environments.

I agree that for small shops it might not make a difference at all, but in my experience this approach doesn't work for 50+ servers.

Posted by Kay Röpke on August 21, 2008 at 07:37 AM CEST #

Kay,

I think there are much better solutions to collect data in agent-less tools than those mentioned by you.

For example, there is absolutely no need to re-connect for every collection. Just keep a ssh connection open and read raw /proc files. This hardly puts any load on the servers being monitored.

Posted by Rohit Nadhani on August 21, 2008 at 09:17 AM CEST #

OK, I get it now :)

Posted by Xaprb on August 21, 2008 at 09:24 AM CEST #

Rohit,

AFAIK those tools are most often used in OpenSource environments.

If you have any links to the tools you are referring to, please send them to me. I'd love to take a look :)

And, sure, you can leave the connections open, but essentially you are then programming an agent (the execution environment being the shell) that still has the drawbacks of polling:
- either you perform all readings serially or you have to open many connections
- post-processing of the acquired data is still being done on the monitoring server

The load on the monitored server is normally low for monitoring tools, unless you need high resolution data.

Posted by Kay Röpke on August 21, 2008 at 09:25 AM CEST #

Kay,

Sure! Please download an eval of MONyog from www.webyog.com and test yourself.

I would love to hear your thoughts.

Posted by Rohit Nadhani on August 21, 2008 at 09:43 AM CEST #

About

Kay Roepke
