syswatch, part of Oracle Linux Enhanced Diagnostics (OLED), is a tool that watches system CPU utilization and executes user-specified commands when a configured CPU utilization threshold is reached. Its main purpose is to aid in troubleshooting ephemeral system loads that would otherwise be difficult to track manually.
Installation
For installation instructions and more information about OLED, please see Oracle Linux Enhanced Diagnostics.
Usage
$ sudo oled syswatch -h
usage: syswatch [-h] [-b] [-C NR_CPUS] -s STAT:PERCENTAGE [-t TARGET_DIR]
                [-M MAX_FS_UTIL] [-I INTERVAL] -c COMMAND

Execute user specified commands if configured CPU utilization thresholds are
reached. See oled-syswatch(8) for a detailed description.

optional arguments:
  -h, --help          show this help message and exit
  -b                  Run indefinitely until manually terminated. (default:
                      False)
  -C NR_CPUS          # of CPUs to apply match criteria. A value <= 0 means
                      apply system wide. (default: 0)
  -s STAT:PERCENTAGE  Triggers action commands if the specified CPU STAT
                      utilization reaches or exceeds PERCENTAGE. Valid values
                      of STAT are: usr, nice, sys, idle, iowait, irq, soft,
                      steal, guest, gnice. PERCENTAGE must be an integer in
                      the range [1, 100]. This option is required and can be
                      specified multiple times. (default: None) NOTE: The
                      logic for idle percentage is inverted. When specifying
                      idle, the commands will execute when the utilization
                      reaches or is LESS THAN the percentage specified.
  -t TARGET_DIR       Output directory. The program will cd to this directory
                      before performing any actions. (default:
                      /var/oled/syswatch)
  -M MAX_FS_UTIL      Max Filesystem Utilization. If the filesystem is at or
                      above the set %, then exit and don't take any action.
                      (default: 85)
  -I INTERVAL         Interval, in seconds, between CPU utilization snapshots.
                      (default: 5)
  -c COMMAND          Command to execute when CPU stat thresholds are reached.
                      This is mandatory. If a more complex command is needed
                      this can point to a script. This option can be specified
                      multiple times, in which case all the commands will be
                      executed in parallel. (default: None)
syswatch requires at least one CPU stat and threshold to watch, given as -s <STAT>:<PERCENTAGE> (e.g. sys, usr, irq, etc.), and at least one command, given as -c <CMD>, to execute when that threshold is reached within the snapshot interval (5 seconds by default). For example, the following command executes logger "HIGH LOAD" if the average sys CPU load reaches 90% system-wide in 10-second snapshots:
$ sudo oled syswatch -I 10 -s sys:90 -c 'logger "HIGH LOAD"'
See the oled-syswatch(8) man page for more thorough documentation on its usage and other important considerations.
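The help text notes that -c can point to a script when a more complex command is needed. The following is a minimal sketch of what such a script might look like, not something shipped with OLED: the file name collect-diag.sh, the function names capture and collect_diag, and the output file names are all hypothetical. It relies on the behavior described in the Example section, where each command runs with its CWD set to its own subdirectory, so files written with relative paths should land next to the command's output file.

```shell
#!/bin/sh
# collect-diag.sh (hypothetical name): helper script to pass to syswatch with -c.

# capture FILE COMMAND [ARGS...]
# Runs COMMAND and stores its stdout/stderr in FILE in the current directory.
# A failing command is noted in FILE instead of aborting the whole script.
capture() {
    out=$1
    shift
    "$@" > "$out" 2>&1 || echo "command failed: $*" >> "$out"
}

collect_diag() {
    # Steps run one after another, unlike separate -c options,
    # which syswatch executes in parallel.
    capture captured_at.txt date -u '+%Y-%m-%dT%H:%M:%S+0000'
    capture loadavg.txt cat /proc/loadavg
    capture meminfo.txt cat /proc/meminfo
    capture ps.txt ps -ef
}

collect_diag
```

Assuming the script were saved as /usr/local/bin/collect-diag.sh (a hypothetical path) and made executable, it could be passed as -c /usr/local/bin/collect-diag.sh. A single script is worth considering when the captures need to happen in a fixed order, since multiple -c options are executed in parallel.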
Example
Let's say there is a system on which a random CPU occasionally reaches 90%+ sys utilization. You can see this behavior in the monitoring logs, but you don't know exactly what is causing it, so you cannot reproduce it at will; you only see that it happens at random intervals. You are interested in knowing what the kernel is doing at those moments, which processes are running, and what the memory consumption is.
You can use syswatch to collect perf(1) data, the process list and the memory state when at least 1 CPU has sustained an average of 90% sys utilization over a 5-second interval:
$ sudo oled syswatch -C 1 -s sys:90 -c 'perf record -ag -F99 -- sleep 15' -c 'ps -ef' -c 'cat /proc/meminfo'
2024-03-14T21:28:16+0000 INFO - Log file: /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/syswatch.log
2024-03-14T21:28:16+0000 INFO - /usr/libexec/oled-tools/syswatch -C 1 -s sys:90 -c 'perf record -ag -F99 -- sleep 15' -c 'ps -ef' -c 'cat /proc/meminfo'
2024-03-14T21:28:16+0000 INFO - Config:
    Run continuously: False
    # CPUs: 1
    Thresholds: sys: 90%
    Working dir: /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381
    Max FS utilization: 85%
    Snapshot interval: 5 sec
    Actions commands: ['perf record -g -F99 -- sleep 15', 'ps -ef', 'cat /proc/meminfo']
2024-03-14T21:28:16+0000 INFO - CPU utilization watch started...
2024-03-14T21:30:01+0000 INFO - Reached CPU utilization thresholds:
    Snapshot: 2024-03-14T21:29:56 - 2024-03-14T21:30:01
    Stats:
    CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    all    2.07    0.00   31.71    0.00    0.00    0.00    0.00    0.00    0.00   66.22
      0    5.40    0.00   94.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00
      1    0.60    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.20
      3    0.40    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.40
2024-03-14T21:30:01+0000 INFO - Executing action commands in directory '/var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/2024-03-14T21-30-01'...
2024-03-14T21:30:18+0000 INFO - done executing actions
2024-03-14T21:30:18+0000 INFO - Stop watching
2024-03-14T21:30:18+0000 INFO - Finished
The command started by printing the location of the log for this particular invocation, the full command being executed, and the configuration. Of importance here is the working directory, which is where the stdout/stderr of the executed commands (perf(1), ps(1) and cat(1) in this case) is stored.
Then it started watching CPU utilization at 21:28:16, and at 21:30:01 it detected the utilization thresholds had been reached. At that time it dumped the interval of the snapshot where the CPU util threshold was detected (21:29:56 – 21:30:01) as well as the CPU stats in that snapshot. We can see that it was CPU 0 which reached an average of 94%+ sys utilization in that snapshot.
It proceeded to execute the configured commands (in parallel) and inform us of the directory where the output of the commands would be stored. The latter is a subdirectory within the working directory mentioned above, named after the timestamp when the CPU util threshold was detected.
If we look at the structure of the working directory, we see that one subdirectory was created for each of the commands executed, named after the command itself. Each command is run with its CWD set to its own subdirectory. This is why the perf command created perf.data inside its own subdirectory.
$ tree -F /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381
/var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381
├── 2024-03-14T21-30-01/
│   ├── cat____proc__meminfo/
│   │   └── output
│   ├── perf__record__-ag__-F99__--__sleep__15/
│   │   ├── output
│   │   └── perf.data
│   └── ps__-ef/
│       └── output
└── syswatch.log

4 directories, 5 files
The stdout/stderr of each command is redirected to a file named output in the command's own subdirectory.
$ head /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/2024-03-14T21-30-01/cat____proc__meminfo/output
MemTotal:       16078272 kB
MemFree:         3024300 kB
MemAvailable:   12890108 kB
Buffers:            5308 kB
Cached:          9753152 kB
SwapCached:          300 kB
Active:          6429876 kB
Inactive:        4372496 kB
Active(anon):     774296 kB
Inactive(anon):   281040 kB

$ head /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/2024-03-14T21-30-01/ps__-ef/output
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0  2023 ?        00:35:09 /usr/lib/systemd/systemd --system --deserialize 22
root           2       0  0  2023 ?        00:00:08 [kthreadd]
root           3       2  0  2023 ?        00:00:00 [rcu_gp]
root           4       2  0  2023 ?        00:00:00 [rcu_par_gp]
root           6       2  0  2023 ?        00:00:00 [kworker/0:0H-events_highpri]
root           8       2  0  2023 ?        00:04:32 [kworker/0:1H-events_highpri]
root           9       2  0  2023 ?        00:00:00 [mm_percpu_wq]
root          10       2  0  2023 ?        00:09:54 [ksoftirqd/0]
root          11       2  0  2023 ?        00:24:25 [rcu_sched]
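When scripting around these results, it can help to predict the subdirectory name for a given command. The sketch below reproduces the three names seen in the tree listing by replacing every space and every "/" with two underscores. This rule is only inferred from that listing, not taken from the syswatch source or man page, so commands containing other special characters may be named differently; cmd_to_dirname is a hypothetical helper name.

```shell
#!/bin/sh
# cmd_to_dirname COMMAND
# Prints a subdirectory name in the style of the tree listing: each space
# and each "/" in COMMAND is replaced by two underscores.
# Assumption: this rule is inferred from the example output only.
cmd_to_dirname() {
    printf '%s\n' "$1" | sed 's|[ /]|__|g'
}

cmd_to_dirname 'cat /proc/meminfo'
cmd_to_dirname 'perf record -ag -F99 -- sleep 15'
cmd_to_dirname 'ps -ef'
```

For anything beyond the simple commands used here, listing the timestamped directory is the more reliable way to find the right subdirectory.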
Another example shows the use of the idle stat. Keep in mind that the logic is inverted when specifying idle: the commands execute when idle reaches or falls below the given percentage. For example, if we want to record the output of a ps command when the system drops to 80% idle or lower, we can do the following:
# oled syswatch -s idle:80 -c 'ps -elf'
2024-05-24T17:36:41+0000 INFO - Log file: /var/oled/syswatch/syswatch_2024-05-24T17-36-41_13805/syswatch.log
2024-05-24T17:36:41+0000 INFO - /usr/libexec/oled-tools/syswatch -s idle:80 -c 'ps -elf'
2024-05-24T17:36:41+0000 INFO - Config:
    Run continuously: False
    # CPUs: ALL
    Thresholds: idle: 80%
    Working dir: /var/oled/syswatch/syswatch_2024-05-24T17-36-41_13805
    Max FS utilization: 85%
    Snapshot interval: 5 sec
    Actions commands: ['ps -elf']
2024-05-24T17:36:41+0000 INFO - CPU utilization watch started...
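To make the inverted idle logic concrete, here is a small sketch of the comparison rule as stated in the help text. It is illustrative only and is not syswatch's actual code; threshold_reached is a hypothetical function name, and awk is used simply because the utilization values are not integers. The sample values come from the sys:90 snapshot shown earlier (94.60 is %sys on CPU 0, 31.71 and 66.22 are the system-wide %sys and %idle).

```shell
#!/bin/sh
# threshold_reached STAT VALUE PERCENTAGE
# Succeeds (exit status 0) when a measured utilization VALUE would satisfy
# a -s STAT:PERCENTAGE threshold, following the rule in the help text.
threshold_reached() {
    stat=$1
    value=$2
    pct=$3
    if [ "$stat" = "idle" ]; then
        # idle is inverted: match when the value is at or below PERCENTAGE.
        awk -v v="$value" -v p="$pct" 'BEGIN { exit !(v <= p) }'
    else
        # Every other stat: match when the value reaches or exceeds PERCENTAGE.
        awk -v v="$value" -v p="$pct" 'BEGIN { exit !(v >= p) }'
    fi
}

threshold_reached sys 94.60 90 && echo "sys:90 matched on CPU 0"
threshold_reached sys 31.71 90 || echo "sys:90 not matched system-wide"
threshold_reached idle 66.22 80 && echo "idle:80 matched system-wide"
```

Note that the same snapshot would not trigger a system-wide sys:90 watch but would trigger idle:80, which is why the choice of stat and of -C matters.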
Once again, consult the oled-syswatch(8) man page for a much more thorough description of all this and other details not mentioned in this blog post.
Conclusion
syswatch can be a useful tool for collecting arbitrary diagnostic data, or taking arbitrary actions (commands), at the moment a given CPU utilization threshold is reached, something that would otherwise be difficult to do manually.
As always, to report any issue, please open a Service Request with Oracle Linux Support.