Introduction

In this blog, we will discuss how we can use Performance Metrics Inference Engine (PMIE) to write custom rules to evaluate system performance metrics, detect anomalies, and take predefined actions based on the ruleset defined for automated system monitoring.

What is PMIE?

Performance Co-Pilot (PCP) is a framework for monitoring and managing system performance metrics across diverse platforms. An important component of PCP is the Performance Metrics Inference Engine (PMIE), which enables automated analysis and response based on system performance metrics.

PMIE allows users to create rules that evaluate system performance metrics, detect anomalies, and take predefined actions as we will see in this blog. It operates alongside PCP, collecting and analyzing metrics from various sources, such as CPU usage, memory statistics, network activity, etc.

Getting Started with PMIE

Before writing rules, let’s go through the performance metrics available in PCP. The following command lists all performance metrics, each with a one-line help summary:

$ pminfo -t
mem.util.free [free memory metric from /proc/meminfo]
mem.physmem [total system memory metric reported by /proc/meminfo]
mem.util.swapTotal [Kbytes swap, from /proc/meminfo]
mem.util.swapFree [Kbytes free swap, from /proc/meminfo]
kernel.cpu.util.sys [percentage of sys time across all CPUs]
kernel.cpu.util.idle [percentage of idle time across all CPUs]
kernel.cpu.util.intr [percentage of interrupt time across all CPUs]
kernel.cpu.util.wait [percentage of wait time across all CPUs]
...

The pmval command fetches the current or archived values of a given performance metric:

$ pmval mem.physmem

metric:    mem.physmem
host:      test-01
semantics: discrete instantaneous value
units:     Kbyte
 16074012

$ pmval mem.util.swapFree

metric:    mem.util.swapFree
host:      test-01
semantics: instantaneous value
units:     Kbyte
 4192036

PMIE functions as an inference engine that evaluates a set of user-defined expressions involving performance metrics. These expressions can perform arithmetic and logical operations, and when certain conditions are met, pmie can print messages, activate alarms, write syslog entries, and launch arbitrary programs.

Working with pmie involves three steps:

  1. Define Expressions: Create a configuration file containing expressions that specify the performance metrics to monitor and the conditions.
  2. Specify Actions: Define the actions PMIE should take when certain conditions are met, such as sending alerts or executing programs.
  3. Run PMIE: Execute PMIE with the appropriate options to start monitoring based on your configuration.

The basic anatomy of a pmie rule is:

lexpr -> actions ;

If the logical expression lexpr evaluates true, then perform the actions that follow. Otherwise, do not perform the actions.

Now with an understanding of performance metrics and how pmie handles them, let’s write a simple pmie rule.

Create a file named low_memory.conf:

delta = 10 sec;    //rule evaluation frequency

100 * mem.util.free / mem.physmem < 5
 -> print "low free memory!";

This rule reads as follows: if free memory falls below 5% of total memory, print a log message. The delta setting is the rule evaluation frequency, here every 10 seconds.
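The arithmetic behind the predicate can be sketched in Python. This is only an illustration of the expression, not how pmie evaluates it internally; the function names and the sample free-memory values are hypothetical, while the total memory figure is the mem.physmem value from the pmval output above.

```python
def free_memory_percent(free_kb: float, physmem_kb: float) -> float:
    """Mirror the pmie expression 100 * mem.util.free / mem.physmem."""
    return 100 * free_kb / physmem_kb


def low_memory(free_kb: float, physmem_kb: float, threshold_percent: float = 5) -> bool:
    """True when free memory is below the threshold, i.e. when the rule would fire."""
    return free_memory_percent(free_kb, physmem_kb) < threshold_percent


PHYSMEM_KB = 16074012  # mem.physmem from the pmval output above

# Hypothetical free-memory samples, in Kbytes.
print(low_memory(500000, PHYSMEM_KB))   # about 3.1% free -> True
print(low_memory(2000000, PHYSMEM_KB))  # about 12.4% free -> False
```

On this host, 5% of total memory is roughly 803700 Kbytes, so the rule fires whenever mem.util.free drops below that value.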

Execute the rule on a system that is low on free memory:

$ pmie low_memory.conf
Thu Feb 13 08:28:30 2025: low free memory!
Thu Feb 13 08:28:40 2025: low free memory!
Thu Feb 13 08:28:50 2025: low free memory!
Thu Feb 13 08:29:00 2025: low free memory!
Thu Feb 13 08:29:10 2025: low free memory!

Creating pmie Rules with pmieconf

pmieconf generates a pmie configuration file from a set of generalized pmie rules. The directory /var/lib/pcp/config/pmieconf/ contains the default generalized rules and their variables, including default values for all variables. These files are in the pmieconf-rules format. From these generalized rules, the pmieconf utility generates the rule file /var/lib/pcp/config/pmie/config.default.

Let’s look at the format of a generalized rule, using /var/lib/pcp/config/pmieconf/memory/swap_low as an example:

rule    memory.swap_low
 summary = "$rule$"
 predicate =
"some_host (
 ( 100 * ( swap.free $hosts$ / swap.length $hosts$ ) )
 < $threshold$
 && swap.length $hosts$ > 0  // ensure swap in use
)"
 enabled = no
 version = 1
 help    =
"There is only threshold percent swap space remaining - the system
may soon run out of virtual memory.  Reduce the number and size of
the running programs or add more swap(1) space before it completely
runs out.";

percent threshold
 default = 10
 help    =
"Threshold percent of total swap space which is free, in the range
0 (none free) to 100 (all swap is unused).";

string  action_expand
 default = "%v%free@%h"
 modify  = no
 display = no;

// The action_expand variable uses placeholders to insert dynamic values into the log message.
// %v is replaced with the value of the metric or expression being monitored,
// %h is replaced with the hostname.
// Other available placeholders include %c (connection specification for reaching the host) and %i (instance name).

This defines the rule memory.swap_low, which monitors free swap space and triggers an alert when it falls below a defined threshold. With pmieconf, we can display and modify the variables that control the details of the generated pmie rules. Let’s see how to do this interactively:

$ pmieconf -f /var/lib/pcp/config/pmie/config.default
Updates will be made to /var/lib/pcp/config/pmie/config.default

pmieconf> list memory.swap_low
 rule: memory.swap_low  [Low free swap space]
 help: There is only threshold percent swap space remaining - the system
 may soon run out of virtual memory.  Reduce the number and size of
 the running programs or add more swap(1) space before it completely
 runs out.
 predicate = 
 some_host (
 ( 100 * ( swap.free $hosts$ / swap.length $hosts$ ) )
 < $threshold$
 && swap.length $hosts$ > 0        // ensure swap in use
 )
 vars: enabled = no
 threshold = 10%

pmieconf> enable memory.swap_low

pmieconf> modify memory.swap_low threshold 20

pmieconf> quit

The rule was disabled by default, so we enabled it, and we also raised the threshold from 10 to 20. Now let’s look at the generated rule in /var/lib/pcp/config/pmie/config.default.

This file is in pmieconf-pmie format. It starts with a header of the form // pmieconf-pmie version pmieconf_path, where version (currently always 1) specifies the syntax and pmieconf_path points to the location of the original pmieconf-rules files. Each customization entry has the form // rule_version rule_name rule_variable = value, associating a modified variable with a specific rule. The // end line closes the customization section, marking the boundary between the custom settings and the actual pmie rules.

// pmieconf-pmie 1 /var/lib/pcp/config/pmieconf
// 1 memory.swap_low enabled = yes
// 1 memory.swap_low threshold = 20%
// end
//
// --- START GENERATED SECTION (do not change this section) ---
//     generated by pmieconf on:  Fri Feb 14 07:18:27 2025
//

// 1 memory.swap_low
delta = 2 min;          //rule evaluation frequency
memory.swap_low = 
some_host (
 ( 100 * ( swap.free  / swap.length  ) )
 < 20
 && swap.length  > 0 // ensure swap in use
) -> syslog 10 min "Low free swap space" " %v%free@%h";

// --- END GENERATED SECTION (changes below will be preserved) ---

Here, we can observe that the generated configuration file /var/lib/pcp/config/pmie/config.default contains line entries for each of the modified variables.
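To make the header layout concrete, here is a minimal Python sketch that reads the customization entries back out of such a file. It assumes only the line format described above; the function name and regular expression are illustrative and not part of PCP, and real files may contain value forms (for example quoted strings) that this sketch does not treat specially.

```python
import re

# Matches lines like: // 1 memory.swap_low threshold = 20%
CUSTOMIZATION = re.compile(r"^//\s+(\d+)\s+(\S+)\s+(\S+)\s*=\s*(.*)$")


def read_customizations(text):
    """Return {(rule_name, rule_variable): value} from a pmieconf-pmie header."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "// end":
            break  # customization section is over; pmie rules follow
        match = CUSTOMIZATION.match(line)
        if match:
            _version, rule, variable, value = match.groups()
            settings[(rule, variable)] = value.strip()
    return settings


SAMPLE = """\
// pmieconf-pmie 1 /var/lib/pcp/config/pmieconf
// 1 memory.swap_low enabled = yes
// 1 memory.swap_low threshold = 20%
// end
"""

for (rule, variable), value in read_customizations(SAMPLE).items():
    print(rule, variable, value)
# memory.swap_low enabled yes
# memory.swap_low threshold 20%
```

The first header line is skipped because it does not start with a numeric rule version, and parsing stops at // end so that nothing in the generated section is mistaken for a customization.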

Let’s now examine the action string of the generated rule. The syslog 10 min syntax specifies that when the condition is met, a message is written to the system log (syslog) at most once every 10 minutes (this interval can be modified). Even if the condition is met multiple times within a 10-minute window, the message is logged only once during that window, preventing excessive logging.
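The interplay between the 2-minute evaluation interval and the 10-minute rate limit can be illustrated with a small Python simulation. This is a sketch of the general idea only: the class and function names are hypothetical, and pmie's own handling of the interval (for instance at exact boundaries) may differ.

```python
class HoldoffAction:
    """Fire an action at most once per holdoff interval."""

    def __init__(self, holdoff_seconds):
        self.holdoff_seconds = holdoff_seconds
        self.last_fired = None

    def maybe_fire(self, now_seconds, condition):
        if not condition:
            return False
        recently_fired = (
            self.last_fired is not None
            and now_seconds - self.last_fired < self.holdoff_seconds
        )
        if recently_fired:
            return False  # condition is true, but the action is suppressed
        self.last_fired = now_seconds
        return True


def simulate(delta_seconds, holdoff_seconds, evaluations):
    """Return the times at which the action fires if the condition is always true."""
    action = HoldoffAction(holdoff_seconds)
    fired = []
    for n in range(evaluations):
        now = n * delta_seconds
        if action.maybe_fire(now, condition=True):
            fired.append(now)
    return fired


# delta = 2 min, rate limit = 10 min, condition true for 11 evaluations (0 to 20 minutes)
print(simulate(120, 600, 11))  # [0, 600, 1200]
```

Although the condition holds at all 11 evaluations, the action fires only three times, 10 minutes apart.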

To verify that the rule is working as expected, let’s execute the PMIE rule using the generated configuration file and check the syslog entries. To do this, we need to enable and start the PMIE service, which will be discussed in more detail in the next section:

$ systemctl enable pmie.service 
$ /usr/libexec/pcp/lib/pmie start
$ tail -F /var/log/messages
Feb 14 08:11:32 test-host-01 pcp-pmie[2393863]: Low free swap space 15%free@test-host-01
Feb 14 08:21:32 test-host-01 pcp-pmie[2393863]: Low free swap space 12%free@test-host-01

These messages indicate that the PMIE rule has detected low free swap space on the system and has logged a message to the syslog accordingly.
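As a rough illustration of how the predicate and the action string combine into log lines like these, the following Python sketch evaluates the swap condition per host and expands the %v and %h placeholders. The host names, swap figures, and function names are hypothetical, and the sketch ignores details of pmie's real expansion such as instance handling.

```python
def swap_free_percent(swap_free_kb, swap_length_kb):
    """Mirror 100 * ( swap.free / swap.length ) from the rule predicate."""
    return 100 * swap_free_kb / swap_length_kb


def swap_low_messages(hosts, threshold_percent, template="%v%free@%h"):
    """Return one message per host for which the swap_low predicate is true.

    hosts maps a host name to (swap_free_kb, swap_length_kb).
    """
    messages = []
    for host, (swap_free_kb, swap_length_kb) in hosts.items():
        if swap_length_kb <= 0:
            continue  # same guard as "swap.length > 0": skip hosts without swap
        percent = swap_free_percent(swap_free_kb, swap_length_kb)
        if percent < threshold_percent:
            expanded = template.replace("%v", f"{percent:g}").replace("%h", host)
            messages.append("Low free swap space " + expanded)
    return messages


# Hypothetical sample data, in Kbytes.
HOSTS = {
    "test-host-01": (600000, 4000000),   # 15% free
    "test-host-02": (3000000, 4000000),  # 75% free
    "test-host-03": (0, 0),              # no swap configured
}

print(swap_low_messages(HOSTS, 20))
# ['Low free swap space 15%free@test-host-01']
```

The some_host quantifier in the real rule makes the rule true as soon as at least one host satisfies the predicate, which corresponds here to the returned list being non-empty.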

Similarly, we can modify and use any of the generalized rules for monitoring and analysis. We can also add a new generalized rule under /var/lib/pcp/config/pmieconf/ in pmieconf-rules format to suit our requirements, then use pmieconf to modify and control it. By default, the rule is added to the /var/lib/pcp/config/pmie/config.default file.

Automating System Monitoring with PMIE

The pmie process can run as a daemon at system startup, enabling automated, live performance monitoring. To start it, enable the service:

$ systemctl enable pmie.service 
$ /usr/libexec/pcp/lib/pmie start

By default, this starts pmie for the local host with the predefined rules that pmieconf generated in /var/lib/pcp/config/pmie/config.default, using the default thresholds discussed earlier. There can be only one primary pmie instance on each host; pmie instances are controlled by /etc/pcp/pmie/control and the files in /etc/pcp/pmie/control.d/. To add a new host, add an entry to a file in /etc/pcp/pmie/control.d/.

To verify that the pmie processes have started, we can use the pcp command:

$ pcp
Performance Co-Pilot configuration on test-host-01:

 platform: Linux test-host-01 5.15.0-210.163.7.el8uek.x86_64 #2 SMP Tue Sep 10 18:31:09 PDT 2024 x86_64
 hardware: 8 cpus, 1 disk, 1 node, 15697MB RAM
 timezone: GMT
 services: pmcd
 pmcd: Version 5.3.7-20, 10 agents, 2 clients
 pmda: root pmcd proc pmproxy xfs linux nfsclient[4] mmv kvm jbd2
 dm openmetrics[4]
 pmlogger: primary logger: /var/log/pcp/pmlogger/test-host-01/20250214.00.10
 pmie: primary engine: /var/log/pcp/pmie/test-host-01/pmie.log

To monitor the PMIE service, you can check the logs at the location /var/log/pcp/pmie/test-host-01/pmie.log. To manage the service, you can use the following commands:

  • To stop the service: /usr/libexec/pcp/lib/pmie stop
  • To reload the service after modifying rules using pmieconf: /usr/libexec/pcp/lib/pmie reload

Conclusion

In this blog, we have explored PMIE and how to use it for automated system monitoring, covering how to write custom PMIE rules, use default configurations, and use pmieconf for easier rule management. By integrating PMIE into your system, you can detect performance issues and automate responses, ensuring efficient monitoring with minimal manual intervention.
