Oracle Linux Enhanced Diagnostics: Use syswatch to trigger diagnostics based on CPU usage

syswatch, part of Oracle Linux Enhanced Diagnostics (OLED), is a tool that watches system CPU utilization and executes user-specified commands when a configured CPU utilization threshold is reached. Its main purpose is to aid in troubleshooting ephemeral system loads that would otherwise be difficult to track manually.

Installation

For installation instructions and more information about OLED, please see Oracle Linux Enhanced Diagnostics.

Usage

$ sudo oled syswatch -h
usage: syswatch [-h] [-b] [-C NR_CPUS] -s STAT:PERCENTAGE [-t TARGET_DIR]
                [-M MAX_FS_UTIL] [-I INTERVAL] -c COMMAND

Execute user specified commands if configured CPU utilization thresholds are
reached. See oled-syswatch(8) for a detailed description.

optional arguments:
  -h, --help          show this help message and exit
  -b                  Run indefinitely until manually terminated. (default:
                      False)
  -C NR_CPUS          # of CPUs to apply match criteria. A value <= 0 means
                      apply system wide. (default: 0)
  -s STAT:PERCENTAGE  Triggers action commands if the specified CPU STAT
                      utilization reaches or exceeds PERCENTAGE. Valid values
                      of STAT are: usr, nice, sys, idle, iowait, irq, soft,
                      steal, guest, gnice. PERCENTAGE must be an integer in
                      the range [1, 100]. This option is required and can be
                      specified multiple times. (default: None)
                      NOTE: The logic for idle percentage is inverted.  When
                      specifying idle, the commands will execute when the
                      utilization reaches or is LESS THAN the percentage
                      specified.
  -t TARGET_DIR       Output directory. The program will cd to this directory
                      before performing any actions. (default:
                      /var/oled/syswatch)
  -M MAX_FS_UTIL      Max Filesystem Utilization. If the filesystem is at or
                      above the set %, then exit and don't take any action.
                      (default: 85)
  -I INTERVAL         Interval, in seconds, between CPU utilization snapshots.
                      (default: 5)
  -c COMMAND          Command to execute when CPU stat thresholds are reached.
                      This is mandatory. If a more complex command is needed
                      this can point to a script. This option can be specified
                      multiple times, in which case all the commands will be
                      executed in parallel. (default: None)

syswatch requires at least one CPU stat and threshold to watch, given as -s <STAT>:<PERCENTAGE> (e.g. sys, usr, irq), and at least one command, given as -c <CMD>, to execute when a threshold is reached within a snapshot interval (5 seconds by default). For example, the following command executes logger "HIGH LOAD" if the average system-wide sys CPU utilization reaches 90% over a 10-second snapshot:

$ sudo oled syswatch -I 10 -s sys:90 -c 'logger "HIGH LOAD"'
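
To make "average sys utilization over a snapshot" concrete, here is a minimal sketch of how such a percentage can be derived from two readings of a cpu line in /proc/stat. This is an illustration only, not syswatch's actual implementation; the function name sys_pct is hypothetical, and the sketch assumes the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# sys_pct LINE1 LINE2
# LINE1 and LINE2 are the same "cpu" or "cpuN" line from /proc/stat, read at
# the start and at the end of the interval. Prints the share of elapsed CPU
# time spent in system mode, as a percentage with two decimals.
sys_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2..9 are user, nice, system, idle, iowait, irq,
            # softirq and steal; guest time is already included in user.
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            sys = $4
        }
        NR == 1 { t0 = total; s0 = sys }
        NR == 2 {
            if (total > t0) printf "%.2f\n", 100 * (sys - s0) / (total - t0)
            else print "0.00"
        }'
}

# Example with live data (takes 10 seconds, so it is left commented out):
#   a=$(grep '^cpu ' /proc/stat); sleep 10; b=$(grep '^cpu ' /proc/stat)
#   sys_pct "$a" "$b"
```

A result of 90.00 or more from such a calculation corresponds to the sys:90 condition used in the command above.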

See the oled-syswatch(8) man page for more thorough documentation on its usage and other important considerations.

Example

Let's say there is a system on which, occasionally, a random CPU reaches 90%+ sys utilization. You can see this behavior in the monitoring logs, but you don't know what exactly is causing it, so you cannot reproduce it at will; you just see that it happens at random intervals. You are interested in knowing what the kernel is doing at those occurrences, which processes are running, and how much memory is being consumed.

You can use syswatch to collect perf(1) data, the process list and the memory state when at least 1 CPU has sustained an average of 90% sys utilization over a 5-second interval:

$ sudo oled syswatch -C 1 -s sys:90 -c 'perf record -ag -F99 -- sleep 15' -c 'ps -ef' -c 'cat /proc/meminfo'
2024-03-14T21:28:16+0000 INFO - Log file: /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/syswatch.log
2024-03-14T21:28:16+0000 INFO - /usr/libexec/oled-tools/syswatch -C 1 -s sys:90 -c 'perf record -ag -F99 -- sleep 15' -c 'ps -ef' -c 'cat /proc/meminfo'
2024-03-14T21:28:16+0000 INFO - Config:
        Run continuously: False
        # CPUs: 1
        Thresholds:
                sys: 90%
        Working dir: /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381
        Max FS utilization: 85%
        Snapshot interval: 5 sec
        Actions commands: ['perf record -ag -F99 -- sleep 15', 'ps -ef', 'cat /proc/meminfo']
2024-03-14T21:28:16+0000 INFO - CPU utilization watch started...
2024-03-14T21:30:01+0000 INFO - Reached CPU utilization thresholds:
Snapshot: 2024-03-14T21:29:56 - 2024-03-14T21:30:01
Stats:
   CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest %gnice  %idle
   all   2.07   0.00  31.71    0.00   0.00   0.00   0.00   0.00   0.00  66.22
     0   5.40   0.00  94.60    0.00   0.00   0.00   0.00   0.00   0.00   0.00
     1   0.60   0.00   0.20    0.00   0.00   0.00   0.00   0.00   0.00  99.20
     3   0.40   0.00   0.20    0.00   0.00   0.00   0.00   0.00   0.00  99.40
2024-03-14T21:30:01+0000 INFO - Executing action commands in directory '/var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/2024-03-14T21-30-01'...
2024-03-14T21:30:18+0000 INFO - done executing actions
2024-03-14T21:30:18+0000 INFO - Stop watching
2024-03-14T21:30:18+0000 INFO - Finished

The command started by printing the location of the log for this particular invocation, the full command being executed, and the configuration. Of importance here is the working directory, which is where the stdout/stderr of the executed commands (perf(1), ps(1) and cat(1) in this case) is stored.

Then it started watching CPU utilization at 21:28:16, and at 21:30:01 it detected that the utilization thresholds had been reached. At that time it dumped the interval of the snapshot in which the threshold was detected (21:29:56 – 21:30:01), as well as the CPU stats for that snapshot. We can see that it was CPU 0 that reached an average of 94%+ sys utilization in that snapshot.
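
When a snapshot covers many CPUs, picking out the offending ones by eye gets tedious. The following is a small sketch, assuming the Stats table has been copied out of the log in the format shown above; the function name cpus_over is hypothetical and is not part of syswatch.

```shell
# cpus_over COLUMN THRESHOLD
# Reads a Stats table on stdin and prints the id of every CPU whose value in
# COLUMN (for example %sys) is greater than or equal to THRESHOLD. The "all"
# summary row is skipped.
cpus_over() {
    awk -v col="$1" -v thr="$2" '
        # Header row: remember which field holds the requested column.
        $1 == "CPU" { for (i = 2; i <= NF; i++) if ($i == col) c = i; next }
        # Data rows: compare numerically once the column is known.
        c && $1 != "all" && $c + 0 >= thr + 0 { print $1 }'
}
```

Fed the table from the run above with the arguments %sys and 90, it prints 0, which matches the CPU identified in the walkthrough.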

It proceeded to execute the configured commands (in parallel) and inform us of the directory where the output of the commands would be stored. The latter is a subdirectory within the working directory mentioned above, named after the timestamp when the CPU util threshold was detected.

If we look at the structure of the working directory, we see that one subdirectory for each one of the commands executed was created, named after the command itself. Each command is run with its CWD set to its own subdirectory. In the case of the perf command, we can see it created perf.data inside its own subdirectory because of this.

$ tree -F /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381
/var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381
├── 2024-03-14T21-30-01/
│   ├── cat____proc__meminfo/
│   │   └── output
│   ├── perf__record__-ag__-F99__--__sleep__15/
│   │   ├── output
│   │   └── perf.data
│   └── ps__-ef/
│       └── output
└── syswatch.log

4 directories, 5 files
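
Judging only from the three directory names in the listing above, each space and each slash in a command appears to be replaced by two underscores. The sketch below encodes that observation so a script can locate a command's output; the function name cmd_to_dirname is hypothetical, and how syswatch treats other characters is not covered here (see oled-syswatch(8)).

```shell
# cmd_to_dirname COMMAND
# Prints the subdirectory name that the listing above suggests for COMMAND:
# every space and every "/" becomes "__". This mapping is inferred from the
# example output, not taken from the syswatch source.
cmd_to_dirname() {
    printf '%s\n' "$1" | sed 's#[ /]#__#g'
}
```

For example, cmd_to_dirname 'cat /proc/meminfo' prints cat____proc__meminfo, the name seen in the tree output.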

The stdout/stderr of each command is redirected to a file named output in that command's own subdirectory.

$ head /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/2024-03-14T21-30-01/cat____proc__meminfo/output
MemTotal:       16078272 kB
MemFree:         3024300 kB
MemAvailable:   12890108 kB
Buffers:            5308 kB
Cached:          9753152 kB
SwapCached:          300 kB
Active:          6429876 kB
Inactive:        4372496 kB
Active(anon):     774296 kB
Inactive(anon):   281040 kB

$ head /var/oled/syswatch/syswatch_2024-03-14T21-28-16_4192381/2024-03-14T21-30-01/ps__-ef/output
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0  2023 ?        00:35:09 /usr/lib/systemd/systemd --system --deserialize 22
root           2       0  0  2023 ?        00:00:08 [kthreadd]
root           3       2  0  2023 ?        00:00:00 [rcu_gp]
root           4       2  0  2023 ?        00:00:00 [rcu_par_gp]
root           6       2  0  2023 ?        00:00:00 [kworker/0:0H-events_highpri]
root           8       2  0  2023 ?        00:04:32 [kworker/0:1H-events_highpri]
root           9       2  0  2023 ?        00:00:00 [mm_percpu_wq]
root          10       2  0  2023 ?        00:09:54 [ksoftirqd/0]
root          11       2  0  2023 ?        00:24:25 [rcu_sched]
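
The help text notes that -c can point to a script when a more complex command is needed. Because each command runs with its CWD set to its own subdirectory, such a script can write extra files relative to the current directory, much as perf did with perf.data. Below is a minimal sketch of a helper of this kind; the function name collect_extra, the file names it writes, and the choice of /proc files are all illustrative assumptions.

```shell
# collect_extra
# Writes a UTC timestamp and copies of a few /proc files into the current
# directory. Files that are not readable on this system are skipped.
collect_extra() {
    date -u +%Y-%m-%dT%H:%M:%SZ > timestamp
    for f in /proc/loadavg /proc/meminfo /proc/interrupts; do
        if [ -r "$f" ]; then
            cat "$f" > "$(basename "$f")"
        fi
    done
    return 0
}
```

Saved as an executable script with a final line that calls collect_extra, it could be passed to syswatch like any other command, for example with -c /usr/local/bin/collect_extra.sh (a hypothetical path).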

Another example shows the use of the idle stat. Keep in mind that the logic is inverted when specifying idle: the commands execute when idle utilization is at or below the given percentage. For example, if we want to record the output of a ps command when the system drops to 80% idle or less, we can do the following:

# oled syswatch -s idle:80 -c 'ps -elf'
2024-05-24T17:36:41+0000 INFO - Log file: /var/oled/syswatch/syswatch_2024-05-24T17-36-41_13805/syswatch.log
2024-05-24T17:36:41+0000 INFO - /usr/libexec/oled-tools/syswatch -s idle:80 -c 'ps -elf'
2024-05-24T17:36:41+0000 INFO - Config:
    Run continuously: False
    # CPUs: ALL
    Thresholds:
        idle: 80%
    Working dir: /var/oled/syswatch/syswatch_2024-05-24T17-36-41_13805
    Max FS utilization: 85%
    Snapshot interval: 5 sec
    Actions commands: ['ps -elf']
2024-05-24T17:36:41+0000 INFO - CPU utilization watch started...
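
The inverted comparison for idle can be summarized in a few lines. This is a sketch of the rule as described in the help text, not code taken from syswatch; the function name threshold_reached is hypothetical.

```shell
# threshold_reached STAT VALUE PERCENTAGE
# Exit status 0 if VALUE counts as reaching the threshold, 1 otherwise.
# For idle the test is VALUE <= PERCENTAGE; for every other stat it is
# VALUE >= PERCENTAGE.
threshold_reached() {
    awk -v stat="$1" -v val="$2" -v pct="$3" 'BEGIN {
        if (stat == "idle") exit !(val + 0 <= pct + 0)   # inverted for idle
        exit !(val + 0 >= pct + 0)
    }'
}
```

With idle:80, a measured idle value of 66.22 triggers the commands while 99.20 does not; with sys:90, a value of 94.60 triggers them and 31.71 does not.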

Once again, consult the oled-syswatch(8) man page for a much more thorough description of all this and other details not mentioned in this blog post.

Conclusion

syswatch is a useful tool for collecting diagnostic data, or taking arbitrary actions (commands), when CPU utilization reaches a configured threshold, in situations where doing so manually would be difficult.

As always, to report any issue, please open a Service Request with Oracle Linux Support.

Jeffery Yoder
