Introduction

oomwatch monitors a system’s memory and swap utilization and kills the top memory-using process (out of a list of candidate processes) once defined thresholds are exceeded. The tool uses a feature of Performance Co-Pilot (PCP) called the Performance Metrics Inference Engine (PMIE). The configurable PMIE rule checks the amount of free memory and free swap at a defined interval and takes action as needed. Configuring, enabling, and disabling the rule is handled by Oracle Linux Enhanced Diagnostics (OLED).

“But wait,” I hear you say, “doesn’t the system already have a built-in out-of-memory (OOM) killer?” Yes, that’s true. However, the kernel OOM killer is a last-ditch effort to keep the OS up and running, and it is called when the system is already in trouble (or very close to it). oomwatch can catch those badly behaved processes before the situation gets desperate.

Installation

For installation instructions and more information about OLED, please see Oracle Linux Enhanced Diagnostics.

For installation instructions and more information about PCP, please see Better Diagnostics With Performance Co-Pilot.

Using oomwatch

Because oomwatch is part of OLED, you need to preface the commands with oled, and you need to run them as root. For example, here is how you would view the help information:

# oled oomwatch --help
usage: oomwatch [-h] [-v] [-e] [-d] [-r] [-s] [-k] {configure} ...

oled oomwatch v1.0.0

positional arguments:
  {configure}
    configure    Configure oomwatch parameters

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  Print version
  -e, --enable   Enable oomwatch
  -d, --disable  Disable oomwatch
  -r, --reload   Reload the oomwatch settings stored in
                 /etc/oled/oomwatch.json
  -s, --status   Display oomwatch status
  -k, --kill     Kill processes matching specifications

Two of these options deserve a little more explanation.

Configure

As the name suggests, this is how you configure the PMIE rule used to monitor memory. This command has its own help output.

# oled oomwatch configure --help
usage: oomwatch configure [-h] [--show] [--delta DELTA] [--holdoff HOLDOFF]
                          [--memfree_threshold MEMFREE_THRESHOLD]
                          [--swapfree_threshold SWAPFREE_THRESHOLD]
                          [--monitored_process MONITORED_PROCESS]

optional arguments:
  -h, --help            show this help message and exit
  --show                Show current configuration
  --delta DELTA         PMIE rule delta parameter (time str)
  --holdoff HOLDOFF     PMIE rule holdoff parameter (time str)
  --memfree_threshold MEMFREE_THRESHOLD
                        Set memfree threshold (float)
  --swapfree_threshold SWAPFREE_THRESHOLD
                        Set swapfree threshold (float)
  --monitored_process MONITORED_PROCESS
                        Set programs to kill (comma-delimited str)

--help and --show are self-explanatory. Here are some more details about the other options:

  • --delta: Specifies the interval, in seconds, at which the PMIE rule is checked. Must be greater than or equal to 1.
  • --holdoff: This is the “cool off” period, in seconds, between actions. For example, if you set this to 60 and oomwatch fires and kills a process, it will be at least 60 seconds before it attempts to kill the next candidate process.
  • --memfree_threshold: The amount of free memory, expressed as a percentage. Setting this to 10 means that if free memory drops below 10%, the rule should evaluate to true. Setting this to 100 means it will always evaluate as true.
  • --swapfree_threshold: The amount of free swap, expressed as a percentage. NOTE: The memory and swap thresholds are ANDed together; both must be true for the rule to fire. Setting this to 100 will cause it to evaluate to true, assuming there is some swap usage.
  • --monitored_process: The candidate processes, listed here separated by commas.
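
The threshold semantics described above can be sketched in a few lines of Python. This is only an illustration, not oomwatch’s code or its PMIE rule: the function names (parse_meminfo, percent_free, rule_fires) are hypothetical, the sample numbers are made up, and reading MemFree and SwapFree from /proc/meminfo-style text is an assumption about where such values might come from. The strict less-than comparison is inferred from the behavior described for a threshold of 100.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def percent_free(free_kb, total_kb):
    """Return the free amount as a percentage of the total."""
    if total_kb == 0:
        # How a system with no swap is handled is not covered here,
        # so this sketch refuses to guess.
        raise ValueError("total is zero; percentage is undefined")
    return 100.0 * free_kb / total_kb


def rule_fires(meminfo, memfree_threshold, swapfree_threshold):
    """True only when BOTH free percentages are below their thresholds."""
    mem_pct = percent_free(meminfo["MemFree"], meminfo["MemTotal"])
    swap_pct = percent_free(meminfo["SwapFree"], meminfo["SwapTotal"])
    return mem_pct < memfree_threshold and swap_pct < swapfree_threshold


# Invented sample: 5% of memory free and 5% of swap free.
SAMPLE = """MemTotal: 16000000 kB
MemFree: 800000 kB
SwapTotal: 4000000 kB
SwapFree: 200000 kB
"""

info = parse_meminfo(SAMPLE)
print(rule_fires(info, 10, 10))  # both below 10%
print(rule_fires(info, 10, 5))   # swap is not below 5%
print(rule_fires(info, 0, 0))    # nothing can be below 0%
```

Under this reading, thresholds of 0 can never be satisfied, because a free percentage cannot drop below zero.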

After installation, the oomwatch rule is not enabled. You need to configure it and enable it the first time. With the default configuration, nothing will happen even if the rule is enabled, because both thresholds are 0 and no processes are monitored:

# oled oomwatch configure --show
memfree_threshold : 0
swapfree_threshold: 0
delta             : 30 sec
holdoff           : 0
monitored_process : ['']

--kill

Please use caution when running oled oomwatch --kill. This command assumes that the thresholds for the rule have been exceeded and will evaluate the candidate processes and kill the top memory consumer. You can use this for testing, but once you have the rules configured, just let oomwatch do its thing.

Monitoring

To monitor oomwatch, watch the /var/log/pcp/pmie/`hostname`/pmie.log file. Every time you enable oomwatch or change the configuration, this file is rotated to pmie.log.prior. In addition, the log file is rotated daily to a file with a date extension. 14 days’ worth of files are retained, and the oldest files are compressed. The applicable notifications contain the string “INFO”.

Here is an example of the output when the rule fires:

2024-12-24 20:45:32.791 INFO - Thresholds exceeded. Searching for processes to terminate...
2024-12-24 20:45:32.814 INFO - Killing process application1 (PID: 50299) using 1559.46 MB.
2024-12-24 20:45:32.816 INFO - Process application1 (PID: 50299) killed using 'kill -9'.
2024-12-24 20:45:32.817 INFO - Execution Time: 0.03 seconds
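
If you want to pull the kill events out of pmie.log programmatically, a small filter is enough. The Python sketch below is an illustration only: the regular expression is derived solely from the sample lines shown above, the function name find_kills is hypothetical, and the real log may contain other message formats that this pattern does not match.

```python
import re

# Pattern based only on the sample "Killing process" line; an assumption,
# not a documented log format.
KILL_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+) INFO - Killing process (?P<name>\S+) "
    r"\(PID: (?P<pid>\d+)\) using (?P<mb>[\d.]+) MB\.$"
)


def find_kills(lines):
    """Return one dict per 'Killing process' line found in lines."""
    kills = []
    for line in lines:
        match = KILL_RE.match(line.strip())
        if match:
            kills.append({
                "timestamp": match.group("timestamp"),
                "name": match.group("name"),
                "pid": int(match.group("pid")),
                "mb": float(match.group("mb")),
            })
    return kills


# The sample output from this article, used as test input.
SAMPLE_LOG = """\
2024-12-24 20:45:32.791 INFO - Thresholds exceeded. Searching for processes to terminate...
2024-12-24 20:45:32.814 INFO - Killing process application1 (PID: 50299) using 1559.46 MB.
2024-12-24 20:45:32.816 INFO - Process application1 (PID: 50299) killed using 'kill -9'.
2024-12-24 20:45:32.817 INFO - Execution Time: 0.03 seconds
"""

for kill in find_kills(SAMPLE_LOG.splitlines()):
    print(kill["timestamp"], kill["name"], kill["pid"], kill["mb"])
```

To run it against a real log, pass an open file object for pmie.log to find_kills instead of the sample text.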

Examples

Let’s say you have three applications that won’t behave themselves when it comes to memory, and you want oomwatch to check for low free memory (10%) and low free swap (10%) every 10 seconds and kill the highest memory user:

# oled oomwatch configure \
--delta=10 \
--memfree_threshold=10 \
--swapfree_threshold=10 \
--monitored_process=application1,application2,application3

# oled oomwatch configure --show
memfree_threshold : 10.0
swapfree_threshold: 10.0
delta             : 10
holdoff           : 0
monitored_process : ['application2', 'application3', 'application1']

oomwatch will now:

  • Check whether free memory and free swap are both below 10%.
  • If they are:
      • Check all instances of all the candidate processes.
      • Sort by memory used (RSS).
      • Kill the top process.
      • Wait holdoff seconds or delta seconds, whichever is longer.
  • If they aren’t:
      • Wait delta seconds.
  • Start over.
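
The selection and wait logic in the steps above can be illustrated with a short Python sketch. It is not oomwatch’s source: the Proc tuple, pick_victim, and next_wait are hypothetical names, the process table is invented sample data, and matching candidates by exact process name is an assumption.

```python
from collections import namedtuple

# Hypothetical record for one running process instance.
Proc = namedtuple("Proc", ["pid", "name", "rss_mb"])


def pick_victim(processes, monitored):
    """Return the monitored process with the largest RSS, or None."""
    candidates = [p for p in processes if p.name in monitored]
    if not candidates:
        return None
    # Sort by memory used (RSS), largest first, and take the top entry.
    return sorted(candidates, key=lambda p: p.rss_mb, reverse=True)[0]


def next_wait(delta, holdoff, fired):
    """Seconds to wait before the next check."""
    # After a kill, wait the longer of holdoff and delta; otherwise delta.
    return max(delta, holdoff) if fired else delta


# Invented sample data: two instances of application1, one unmonitored process.
PROCESSES = [
    Proc(101, "application1", 1559.46),
    Proc(102, "application2", 820.0),
    Proc(103, "application1", 300.5),
    Proc(104, "database", 4096.0),  # not in the candidate list, never selected
]
MONITORED = {"application1", "application2", "application3"}

victim = pick_victim(PROCESSES, MONITORED)
print(victim)
print(next_wait(delta=10, holdoff=0, fired=True))
```

Note that the unmonitored process is never considered, even though it uses the most memory; only the candidate list is eligible.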

To disable the rule:

# oled oomwatch -d

To enable/re-enable the rule:

# oled oomwatch -e

Conclusion

oomwatch is a preemptive, userspace OOM killer. You can use oomwatch to help manage misbehaving processes and protect your system from OOM situations. If you have any issues with oomwatch, please open a Service Request with Oracle Linux Support.