Issue
Systems may experience recurring RPM database (rpmdb) corruption or package management failures, for example:
# rpm -qa
error: rpmdb: BDB0113 Thread/process 9773/140219155286080 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 - (-30973)
error: cannot open Packages database in /var/lib/rpm
In many cases, the root cause is that a process using the RPM database is terminated abruptly (for example via SIGKILL), leaving the database in an inconsistent state.
To understand who is killing RPM-related processes and when, we can use rpm_db_snooper from the Oracle Linux Enhanced Diagnostics (OLED) toolkit.
What is rpm_db_snooper?
rpm_db_snooper is a systemd service and DTrace-based diagnostic tool for Oracle Linux that:
- Monitors access to the RPM database directory /var/lib/rpm by rpm, yum, and dnf.
- Watches for signals (such as SIGKILL) sent to those processes.
- Logs relevant events (open/close, signals, exits) via the standard system journal.
It is implemented using DTrace and packaged as part of the Oracle Linux Enhanced Diagnostics (OLED) tools for Oracle Linux.
Note:
rpm_db_snooper is a diagnostics/forensics utility. It does not repair corruption; it helps you capture evidence about what was happening to the RPM database when problems occur.
Why rpm_db_snooper is Useful
RPM database corruption and hangs are often intermittent and hard to reproduce. Once the database is already in a bad state, there may be no standard logging that explains who killed yum/dnf/rpm or which process interrupted a transaction.
rpm_db_snooper addresses this by providing:
- An audit trail of which RPM-related processes were using /var/lib/rpm.
- A record of signals (especially SIGKILL) delivered to those processes.
- Timestamps that let you correlate events with other system activity (automation jobs, monitoring agents, OOM kills, etc.).
By reviewing this data when corruption reoccurs, you can often identify:
- Automation or monitoring tools that time out and kill yum/dnf/rpm.
- Operators manually sending kill -9 to package processes.
- Other system components (e.g., the OOM killer) terminating those processes.
Temporary Recovery vs Root Cause
When rpmdb corruption is already present, you typically need to:
- Ensure no processes are using /var/lib/rpm:
  $ fuser -v /var/lib/rpm
- Stop or kill any processes reported by fuser.
- Perform the appropriate rpmdb recovery/repair for your distribution, which may include removing transient Berkeley DB environment files (__db.*) or following vendor-specific recovery procedures.
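The manual steps above can be sketched as a small shell helper. This is illustrative only: recover_rpmdb is a hypothetical function name (not an OLED tool), the target directory is a parameter so the sketch can be exercised safely, and the final rebuild is printed rather than executed because rpm --rebuilddb requires root and a real rpmdb.

```shell
# Sketch of the temporary recovery steps, parameterized so the target
# directory can be passed explicitly (on a real system: /var/lib/rpm).
recover_rpmdb() {
    db_dir="$1"

    # Steps 1-2: refuse to continue while any process still holds the
    # directory open (fuser exits 0 only when the directory is in use).
    if command -v fuser >/dev/null 2>&1 && fuser "$db_dir" >/dev/null 2>&1; then
        echo "Processes are still using $db_dir; stop them first." >&2
        return 1
    fi

    # Step 3: remove the transient Berkeley DB environment files.
    rm -f "$db_dir"/__db.*

    # The rebuild itself is printed, not run: it needs root and a real rpmdb.
    echo "Next step: rpm --rebuilddb"
}
```

On a live system you would call it as recover_rpmdb /var/lib/rpm, then run the printed rebuild command yourself.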
These steps can temporarily clear the issue, but if the underlying cause is not fixed, corruption can reoccur. That is where rpm_db_snooper helps: it needs to be running before the issue in order to capture signal and access activity around the problematic time.
Installing rpm_db_snooper (OLED tools)
rpm_db_snooper is included in the Oracle Linux Enhanced Diagnostics (OLED) tools collection.
- For installation and repository setup details, see the Oracle Linux Enhanced Diagnostics (OLED) documentation for your release.
Once the OLED tools are installed, rpm_db_snooper will be provided as a systemd service and supporting DTrace script.
Service Architecture
Key components:
- rpm_db_snooper.service
  - A systemd unit that runs the monitoring DTrace script as a persistent root-level service.
- monitor_rpmdb.d
  - The DTrace script that:
    - Detects when yum, dnf, or rpm open or interact with database files under /var/lib/rpm and, where applicable, history.sqlite.
    - Tracks active RPM DB consumer PIDs.
    - Logs signal delivery (kill) targeted at those consumers.
    - Monitors consumer exit events.
- Key paths
  - /var/lib/rpm – RPM database directory.
  - /usr/libexec/oled-tools/scripts.d/monitor_rpmdb.d – DTrace script.
  - /var/log/messages – standard system log, if configured.
  - journalctl -u rpm_db_snooper.service – primary journal view (SYSLOG_IDENTIFIER: rpm_db_snooper).
Service ExecStart (typical):
/usr/sbin/dtrace -Cs /usr/libexec/oled-tools/scripts.d/monitor_rpmdb.d $RPM_DB_SNOOPER_DEBUG
$RPM_DB_SNOOPER_DEBUG can be set to enable additional debug output.
Starting and Managing the Service
On Oracle Linux systems with OLED tools installed, rpm_db_snooper is managed via systemd.
The service is not enabled by default when OLED tools are installed; you must explicitly enable and start it.
Start the service:
systemctl start rpm_db_snooper.service
Check the status:
systemctl status rpm_db_snooper.service
Restart if you change configuration or enable debug:
systemctl restart rpm_db_snooper.service
View logs:
journalctl -u rpm_db_snooper.service
# or
grep rpm_db_snooper /var/log/messages
Important: The service must be running before rpmdb corruption or hangs occur in order to capture useful diagnostic data.
To have it start automatically on boot, also run:
systemctl enable rpm_db_snooper.service
Enabling Debug Output (for troubleshooting rpm_db_snooper itself)
In normal use, you do not need debug output enabled to diagnose rpmdb corruption. The standard service logs are sufficient for tracking signals and basic rpmdb access patterns.
Enable debug mode only if you need more verbose output to troubleshoot the rpm_db_snooper script or service behavior.
Temporary debug via environment variable
export RPM_DB_SNOOPER_DEBUG=-DDEBUG
systemctl restart rpm_db_snooper.service
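Note that a variable exported in an interactive shell is not inherited by a systemd-managed service unless the unit is configured to read it (for example from an EnvironmentFile). A more reliable way to set the flag persistently is a systemd drop-in; the sketch below uses the standard drop-in location, and the variable name comes from the ExecStart line shown earlier:

```
# /etc/systemd/system/rpm_db_snooper.service.d/debug.conf
[Service]
Environment=RPM_DB_SNOOPER_DEBUG=-DDEBUG
```

After creating the file, run systemctl daemon-reload followed by systemctl restart rpm_db_snooper.service for the setting to take effect.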
Or, running the script directly for low-level debugging:
export DEBUG=1
dtrace -Cs /usr/libexec/oled-tools/scripts.d/monitor_rpmdb.d -DDEBUG
How rpm_db_snooper Works (High-Level)
At a high level, the DTrace script:
- Attaches to RPM-related processes (rpm, yum, dnf).
- Monitors operations on files under /var/lib/rpm (and related history files).
- Tracks which PIDs currently hold the RPM DB open.
- Observes signals sent to those PIDs (e.g., SIGKILL, SIGTERM).
- Logs structured lines with:
- Process name and PID
- Parent process name and PID
- Signal number (if applicable)
- Target process PID
- Open/close or exit events
This information lets you reconstruct a timeline around rpmdb activity and identify when a package manager was killed.
Example: Using rpm_db_snooper When Corruption Is Discovered Later
In many real-world cases, rpmdb corruption is not noticed immediately. You might only discover it much later, for example when a new yum/dnf/rpm command fails.
To make rpm_db_snooper useful in such scenarios, it should be enabled and kept running in the background before any issues occur. Then, when corruption is eventually detected, you can look back in time through its logs.
- Ensure rpm_db_snooper is running (and preferably enabled at boot):
  $ systemctl status rpm_db_snooper.service
  If it is not active, start (and optionally enable) it:
  $ systemctl start rpm_db_snooper.service
  $ systemctl enable rpm_db_snooper.service   # optional, for automatic start on boot
- When rpmdb corruption is detected (even if long after the triggering event), do not discard logs. Instead, review rpm_db_snooper output for a time window that reasonably covers when the corruption might have occurred. For example:
  journalctl -u rpm_db_snooper.service --since "2 days ago"
  Adjust the --since window based on how long ago you suspect the problematic activity might have taken place.
- Look for signal events or unexpected exits of rpm, yum, or dnf around the time of any suspicious package operations (maintenance windows, automation runs, scripted updates, etc.). Correlate these with other system logs (cron jobs, monitoring, OOM killer, operator actions) to narrow down the root cause.
Sample Log Output and Interpretation
Below is an example snippet of what rpm_db_snooper-related activity might look like in the system logs when monitoring RPM database access and a kill signal sent to a package manager process:
Feb 11 10:28:37 test_setup systemd[1]: Started Dtrace Program to track kill signals sent to yum/dnf/rpm processes.
Feb 11 10:28:40 test_setup dtrace[3081414]: -------------------------
Feb 11 10:28:40 test_setup dtrace[3081414]: Rpm db snooper - start
Feb 11 10:29:26 test_setup dtrace[3081414]: Process(PID) Parent Process(PID) Signal No Target Process(PID)
Feb 11 10:29:26 test_setup dtrace[3081414]: pkill(3081985) bash(3081982) 9 yum(3081878)
Feb 11 10:31:00 test_setup dtrace[3081414]: Rpm db snooper - end
Feb 11 10:31:00 test_setup dtrace[3081414]: -------------------------
Feb 11 10:31:00 test_setup systemd[1]: Stopping Dtrace Program to track kill signals sent to yum/dnf/rpm processes...
Feb 11 10:31:00 test_setup systemd[1]: rpm_db_snooper.service: Succeeded.
Feb 11 10:31:00 test_setup systemd[1]: Stopped Dtrace Program to track kill signals sent to yum/dnf/rpm processes.
From this example, you can see:
- systemd starts the DTrace-based monitoring service.
- rpm_db_snooper announces its start and end.
- A header line documents the columns for subsequent signal events.
- A pkill command (PID 3081985), running under a bash shell (PID 3081982), sends signal 9 (SIGKILL) to a yum process (PID 3081878).
If rpmdb corruption is later detected, these lines show that yum was killed with SIGKILL by the pkill process while accessing the RPM database. You can then investigate why pkill was run (automation, scripts, operator action, etc.) and adjust timeouts or behavior to avoid abrupt termination.
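When scanning a long journal window, the signal lines can be pulled out mechanically. The sketch below is an illustration, not part of OLED: snooper_signals is a hypothetical helper name, and the regular expression simply follows the column layout of the header line in the sample log above.

```shell
# Extract signal events from rpm_db_snooper log lines read on stdin.
# Matches lines shaped like:  pkill(3081985) bash(3081982) 9 yum(3081878)
snooper_signals() {
    sed -n -E 's/.*[[:space:]]([A-Za-z0-9_.-]+)\(([0-9]+)\)[[:space:]]+([A-Za-z0-9_.-]+)\(([0-9]+)\)[[:space:]]+([0-9]+)[[:space:]]+([A-Za-z0-9_.-]+)\(([0-9]+)\).*/sender=\1(\2) parent=\3(\4) signal=\5 target=\6(\7)/p'
}
```

Typical use would be to pipe the journal through it, for example: journalctl -u rpm_db_snooper.service --since "2 days ago" | snooper_signals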
Summary
- Recurring rpmdb corruption is often caused by package manager processes (rpm, yum, dnf) being killed while they hold the RPM database open.
- Standard system logs usually do not record who sent the kill signal.
- rpm_db_snooper, part of the Oracle Linux Enhanced Diagnostics tools, uses DTrace to monitor rpmdb activity and capture relevant signals and exits.
- Run rpm_db_snooper before the problem occurs, then use its logs to correlate corruption events with the processes or automation responsible for terminating RPM-related workloads.
This information can then be used to adjust automation, monitoring, or operational practices to prevent future rpmdb corruption.