Introduction
oomwatch can be used to monitor a system’s memory and swap utilization and kill the top memory-using process (out of a list of candidate processes) once defined thresholds are exceeded. The tool uses a feature of Performance Co-Pilot (PCP) called the Performance Metric Inference Engine (PMIE). The configurable PMIE rule checks the amount of free memory and free swap at a defined interval and takes action as needed. Configuring, enabling, and disabling the rule is handled by Oracle Linux Enhanced Diagnostics (OLED).

“But wait,” I hear you say, “doesn’t the system already have a built-in out of memory (OOM) killer?” Yes, that’s true. However, the kernel OOM killer is a last-ditch effort to keep the OS up and running, and it is called when the system is already in trouble (or very close to it). oomwatch can be used to catch those badly behaved processes before the situation gets desperate.
Installation
For installation instructions and more information about OLED, please see Oracle Linux Enhanced Diagnostics.
For installation instructions and more information about PCP, please see Better Diagnostics With Performance Co-Pilot.
Using oomwatch
Because oomwatch is part of OLED, you need to preface the commands with oled, and you need to run them as root. For example, here is how you would view the help information:
# oled oomwatch --help
usage: oomwatch [-h] [-v] [-e] [-d] [-r] [-s] [-k] {configure} ...

oled oomwatch v1.0.0

positional arguments:
  {configure}
    configure    Configure oomwatch parameters

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  Print version
  -e, --enable   Enable oomwatch
  -d, --disable  Disable oomwatch
  -r, --reload   Reload the oomwatch settings stored in /etc/oled/oomwatch.json
  -s, --status   Display oomwatch status
  -k, --kill     Kill processes matching specifications
Two of these options deserve a little more explanation.
Configure
As the name suggests, this is how you configure the PMIE rule used to monitor memory. This command has its own help output:
# oled oomwatch configure --help
usage: oomwatch configure [-h] [--show] [--delta DELTA] [--holdoff HOLDOFF]
                          [--memfree_threshold MEMFREE_THRESHOLD]
                          [--swapfree_threshold SWAPFREE_THRESHOLD]
                          [--monitored_process MONITORED_PROCESS]

optional arguments:
  -h, --help            show this help message and exit
  --show                Show current configuration
  --delta DELTA         PMIE rule delta parameter (time str)
  --holdoff HOLDOFF     PMIE rule holdoff parameter (time str)
  --memfree_threshold MEMFREE_THRESHOLD
                        Set memfree threshold (float)
  --swapfree_threshold SWAPFREE_THRESHOLD
                        Set swapfree threshold (float)
  --monitored_process MONITORED_PROCESS
                        Set programs to kill (comma-delimited str)
--help and --show are self-explanatory. Here are some more details about the other options:
- --delta – Specifies the interval, in seconds, at which the PMIE rule is checked. Must be greater than or equal to 1.
- --holdoff – This is the “cool off” period, in seconds, between actions. For example, if you set this to 60 and oomwatch fires and kills a process, it will be at least 60 seconds before it attempts to kill the next candidate process.
- --memfree_threshold – The amount of free memory, expressed as a percentage. Setting this to 10 means that the condition evaluates to true if free memory drops below 10%. Setting this to 100 means it will always evaluate to true.
- --swapfree_threshold – The amount of free swap, expressed as a percentage. Setting this to 100 will cause it to evaluate to true, assuming there is some swap usage. NOTE: The memory and swap conditions are AND’d together; both must be true for the rule to fire.
- --monitored_process – The candidate processes, listed here separated by commas.
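To make the threshold semantics concrete, here is a minimal Python sketch of the comparison described above. It is not the actual PMIE rule, which is written in PMIE's own rule language and reads PCP metrics. The function names are hypothetical, and the strict less-than comparison is an assumption chosen to match the behavior described: a threshold of 100 is true whenever anything is in use, and a threshold of 0 never fires.

```python
def free_percent(free_kb: float, total_kb: float) -> float:
    """Return free space as a percentage of the total.

    Assumption: a total of 0 (for example, no swap configured) is treated
    as 100% free, because nothing is in use.
    """
    if total_kb <= 0:
        return 100.0
    return 100.0 * free_kb / total_kb


def thresholds_exceeded(mem_free_pct: float, swap_free_pct: float,
                        memfree_threshold: float,
                        swapfree_threshold: float) -> bool:
    """The two conditions are AND'd: the rule fires only if both are true."""
    mem_low = mem_free_pct < memfree_threshold
    swap_low = swap_free_pct < swapfree_threshold
    return mem_low and swap_low


# Hypothetical readings: 16 GiB of RAM with 1 GiB free,
# and 4 GiB of swap with 3 GiB free (values in KiB).
mem_pct = free_percent(1 * 1024 * 1024, 16 * 1024 * 1024)   # 6.25
swap_pct = free_percent(3 * 1024 * 1024, 4 * 1024 * 1024)   # 75.0

print(thresholds_exceeded(mem_pct, swap_pct, 10, 10))    # False: swap is not low
print(thresholds_exceeded(mem_pct, swap_pct, 10, 100))   # True: both conditions hold
```

With both thresholds at 0, neither comparison can be true, so a rule evaluated this way would never fire.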
Initially, after installation, the oomwatch rule is not enabled. You need to configure it and enable it the first time. The default configuration is such that nothing will happen if it’s enabled.
# oled oomwatch configure --show
memfree_threshold : 0
swapfree_threshold: 0
delta             : 30 sec
holdoff           : 0
monitored_process : ['']
--kill
Please use caution when running oled oomwatch --kill. This command assumes that the thresholds for the rule have been exceeded and will evaluate the candidate processes and kill the top memory consumer. You can use this for testing, but once you have the rules configured, just let oomwatch do its thing.
Monitoring
To monitor oomwatch, watch the /var/log/pcp/pmie/`hostname`/pmie.log file. Every time you enable oomwatch or change the configuration, this file is rotated to pmie.log.prior. In addition, the log file is rotated daily to a file with a date extension. 14 days’ worth of files are retained, and the oldest files are compressed. The applicable notifications contain the string “INFO”.
Here is an example of the output when the rule fires:
2024-12-24 20:45:32.791 INFO - Thresholds exceeded. Searching for processes to terminate...
2024-12-24 20:45:32.814 INFO - Killing process application1 (PID: 50299) using 1559.46 MB.
2024-12-24 20:45:32.816 INFO - Process application1 (PID: 50299) killed using 'kill -9'.
2024-12-24 20:45:32.817 INFO - Execution Time: 0.03 seconds
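If you want to pull the kill events out of the log programmatically, something like the following Python sketch could work. The regular expression is derived only from the sample output above, so treat it as an assumption: other oomwatch versions may format these lines differently, and process names containing spaces would not match. The function name find_kills is hypothetical.

```python
import re

# Pattern for the "Killing process" line, based solely on the sample output.
KILL_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) INFO - "
    r"Killing process (?P<name>\S+) \(PID: (?P<pid>\d+)\) "
    r"using (?P<mb>[\d.]+) MB\.$"
)


def find_kills(log_lines):
    """Return one dict per 'Killing process' entry found in log_lines."""
    kills = []
    for line in log_lines:
        match = KILL_LINE.match(line.strip())
        if match:
            kills.append({
                "timestamp": match.group("timestamp"),
                "name": match.group("name"),
                "pid": int(match.group("pid")),
                "memory_mb": float(match.group("mb")),
            })
    return kills


# The four lines from the example above.
sample = [
    "2024-12-24 20:45:32.791 INFO - Thresholds exceeded. Searching for processes to terminate...",
    "2024-12-24 20:45:32.814 INFO - Killing process application1 (PID: 50299) using 1559.46 MB.",
    "2024-12-24 20:45:32.816 INFO - Process application1 (PID: 50299) killed using 'kill -9'.",
    "2024-12-24 20:45:32.817 INFO - Execution Time: 0.03 seconds",
]

for kill in find_kills(sample):
    print(kill["timestamp"], kill["name"], kill["pid"], kill["memory_mb"])
```

To run it against a real log, pass an open file object for pmie.log instead of the sample list; iterating over the file yields one line at a time.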
Examples
Let’s say you have 3 applications that won’t behave themselves when it comes to memory, and you want oomwatch to check every 10 seconds for low free memory (below 10%) and low free swap (below 10%) and kill the highest memory user:
# oled oomwatch configure \
    --delta=10 \
    --memfree_threshold=10 \
    --swapfree_threshold=10 \
    --monitored_process=application1,application2,application3

# oled oomwatch configure --show
memfree_threshold : 10.0
swapfree_threshold: 10.0
delta             : 10
holdoff           : 0
monitored_process : ['application2', 'application3', 'application1']
oomwatch will now:
- Check free memory and free swap to see if they are both less than 10%.
- If they are:
  - Check all instances of all the candidate processes.
  - Sort by memory used (RSS).
  - Kill the top process.
  - Wait holdoff seconds or delta seconds, whichever is longer.
- If they aren’t:
  - Wait delta seconds.
- Start over.
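The selection and wait steps above can be sketched in Python as follows. This only illustrates the logic as described. It is not oomwatch's actual code, it does not kill anything, and every name in it (Proc, pick_victim, seconds_until_next_check) is hypothetical. Matching candidates by exact process name is also an assumption.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    name: str
    rss_kb: int  # resident set size in KiB


def pick_victim(processes, monitored):
    """Return the monitored process with the largest RSS, or None."""
    candidates = [p for p in processes if p.name in monitored]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.rss_kb)


def seconds_until_next_check(delta, holdoff, killed):
    """After a kill, wait the longer of holdoff and delta; otherwise delta."""
    return max(holdoff, delta) if killed else delta


# Hypothetical process snapshot; every instance of a candidate is considered.
procs = [
    Proc(50299, "application1", 1_596_887),
    Proc(50311, "application1", 412_000),
    Proc(50420, "application2", 903_500),
    Proc(1234, "sshd", 2_000_000),  # not monitored, so never a candidate
]
monitored = {"application1", "application2", "application3"}

victim = pick_victim(procs, monitored)
print(victim.pid)                                                    # 50299
print(seconds_until_next_check(delta=10, holdoff=0, killed=True))    # 10
print(seconds_until_next_check(delta=10, holdoff=60, killed=True))   # 60
```

Note that sshd uses the most memory in this snapshot but is never selected, because only processes named in monitored_process are candidates.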
To disable the rule:
# oled oomwatch -d
To enable/re-enable the rule:
# oled oomwatch -e
Conclusion
oomwatch is a preemptive, userspace-based OOM killer. You can use oomwatch to help manage misbehaving processes and protect your system from OOM situations. If you have any issues with oomwatch, please open a Service Request with Oracle Linux Support.