Introduction

Memory is a critical resource, and monitoring the health and use of this resource is vital for Linux systems to function optimally.

Many memory-related statistics are collected by system health monitoring tools like oswatcher, sosreport, etc. These stats give us a fairly good high-level summary of whether there are any issues on the system, or potential issues that an admin might want to watch out for. But sometimes we want to dig a little deeper than these high-level statistics – or monitor them over time. oled memstate makes it easier to observe the health of memory as a snapshot, or track how it changes over time. It analyzes raw data already exported to userspace via various procfs and sysfs files – and in some cases, prints appropriate warnings. Let’s delve deeper into these files and the memstate options appropriate for each use case in the sections below.

Using Memstate

memstate is part of the oled-tools rpm. For more information about the OLED project and installation instructions, please see Oracle Linux Enhanced Diagnostics.

After the oled-tools rpm is installed, memstate can be invoked as:

$ sudo oled memstate
    KERNEL: 5.4.17-2136.315.5.8.el8uek.x86_64
  HOSTNAME: aruramak-test
      TIME: 10/26/2023 10:59:12

MEMORY USAGE SUMMARY (in GB):
Total memory                         251.3
Free memory                          167.1
Used memory                           84.2
  Userspace                           50.5
    Processes                          9.4
    Page cache                        42.2
    Shared mem                         1.2
  Kernel                               6.7
    Slabs                              2.7
    RDS                                0.8
    Unknown                            2.3
  Total Hugepages (2048 KB)           27.0
  Free Hugepages (2048 KB)             0.0
Swap used                              0.0
...

When invoked without any options or parameters, memstate prints a summary of current memory usage to the console, including process and kernel memory usage, the top 10 slab caches, the top 10 memory-consuming processes, health checks (more on that below), as well as fragmentation status. The data in the ‘SUMMARY’ section is read mostly from /proc/meminfo and presented in a user-friendly format. The various categories in /proc/meminfo can be a little confusing for the average user, so here the tool sums up related categories – userspace allocations (e.g., page cache, shmem, mapped, anonpages, etc.) and kernel allocations (slab, page tables, etc.). If there is a memory leak or memory pressure on the system, a glance at this section will tell us whether it’s the kernel or the processes that need to be looked into.
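
If you want to see the raw fields that this summary rolls up, you can read them straight from /proc/meminfo. The exact set of fields the tool sums is an implementation detail; the following are just the obvious candidates:

$ grep -E '^(MemTotal|MemFree|Buffers|Cached|Shmem|Mapped|AnonPages|Slab|PageTables|HugePages_Total|HugePages_Free):' /proc/meminfo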

A brief note about the “unknown” kernel allocations seen here: most memory allocations in the kernel are made via two families of APIs. The first is kmalloc() or kmem_cache_alloc(), which allocate objects from the appropriate slab cache; this memory is accounted for in /proc/meminfo, /proc/slabinfo, etc. The second allocates directly from the buddy allocator – via __get_free_pages() and friends. These allocations are made in units of pages, and drivers or other kernel modules might use them to allocate temporary buffers for I/O, for instance. This memory is not accounted for anywhere. The “unknown” figure in the memstate output is a guesstimate of how much memory the tool thinks the kernel is using via __get_free_pages() and similar APIs. Note that this is not an accurate number – it’s an approximation. If this value is too large, it’s likely that there’s a memory leak somewhere in the kernel. Otherwise, the system is healthy and a small amount of “unknown” memory usage is very normal.
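
As a rough back-of-the-envelope sketch (explicitly not memstate’s formula), you can subtract everything /proc/meminfo does account for from the total and see what is left over; anything large and persistent here warrants a closer look:

$ awk '/^(MemTotal|MemFree|Buffers|Cached|AnonPages|Slab|PageTables|KernelStack|HugePages_Total|Hugepagesize):/ { v[$1] = $2 }
       END {
           # everything below is in KB; Cached already includes Shmem
           known = v["MemFree:"] + v["Buffers:"] + v["Cached:"] + v["AnonPages:"]
           known += v["Slab:"] + v["PageTables:"] + v["KernelStack:"]
           known += v["HugePages_Total:"] * v["Hugepagesize:"]   # hugepage pool, in KB
           printf "unaccounted (approx): %.1f GB\n", (v["MemTotal:"] - known) / 1048576
       }' /proc/meminfo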

Another line item in the summary output above is hugepages. If the system has hugepages configured (either 2 MB or 1 GB), the total number reserved as well as the number of free hugepages of each size will be displayed here. This is extracted from files under /sys/devices/system/node/nodeX/hugepages/hugepages-Y/. /proc/meminfo only displays usage stats about the default hugepage size, which is 2 MB on x86_64.
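
These per-node, per-size counters can be read directly from sysfs, for instance:

$ grep -H . /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages
$ grep -H . /sys/devices/system/node/node*/hugepages/hugepages-*/free_hugepages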

Apart from this general ‘SUMMARY’ section, oled memstate also prints some data about NUMA node allocations, slab cache usage, process usage, etc. which we’ll review in separate sections below.

Memstate supports a few different arguments:

$ sudo oled memstate --help
usage: oled memstate [-h] [-p [PID]] [-s [FILE]] [-n [FILE]] [-a] [-v] [-f [INTERVAL]]

memstate: Capture and analyze memory usage data on this system.

optional arguments:
  -h, --help            show this help message and exit
  -p [PID], --pss [PID]
                        display per-process memory usage
  -s, --slab            analyze/display slab usage
  -n [FILE], --numa [FILE]
                        analyze/display NUMA stats
  -v, --verbose         verbose data capture; combine with other options
  -f [INTERVAL], --frequency [INTERVAL]
                        interval at which data should be collected (default:
                        30s)

We’ll discuss each of these below. Note that the memstate command has to be run as root.

Using -f [INTERVAL] will result in the specified memstate command being run every [INTERVAL] seconds – the default, if no number is specified, is 30 seconds. This will not print the output to the console, though – instead, the data is captured in the file /var/oled/memstate/memstate.log, which is log-rotated and compressed as it grows. So one could run memstate in the background for hours or days (or even weeks) and continuously monitor the state of memory using the captured logs. We’ll talk about a few use cases where that might be useful in the sections below.

Per-process memory usage

In case of memory exhaustion or pressure, our first step is to determine what kind of allocations are causing the pressure. If it’s kernel allocations, we can check whether any of the slab caches are growing more than expected (which is usually the case), or whether there’s a memory leak in a kernel module or driver. If the latter, one can run a DTrace or eBPF script to track all kernel allocations and frees, and dump stack traces to pinpoint the function that is leaking memory – there is an eBPF-based script in the bcc-tools rpm which does exactly that.
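
For reference, one such script shipped with bcc-tools is memleak (the install path below is typical, but may vary by distribution). When run without a target pid, it traces kernel allocators and periodically prints the stack traces of outstanding, not-yet-freed allocations:

# /usr/share/bcc/tools/memleak        # press Ctrl-C to stop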

On the other hand, if it’s userspace allocations that are growing, we’ll need to know how much memory each process is using. oled memstate -p prints exactly that.

# oled memstate -p
    KERNEL: 4.14.35-2047.518.4.3.el7uek.x86_64
  HOSTNAME: scao08adm03.us.oracle.com
      TIME: 01/25/2024 12:21:53

TOP 10 MEMORY CONSUMERS (in KB, ordered by PSS):
PROCESS(PID)                               PSS             RSS         PRIVATE            SWAP
java(109499)                            905735          908388          905716               0
java(52119)                             596949          607452          589200               0
ora_ipc0_cdb11(105038)                  569761          865868          381772               0
java(102656)                            366326          401008          359332               0
java(102674)                            218670          251460          212576               0
ocssd.bin(66311)                        191520          286568          177540               0
ologgerd(102388)                        154181          250640          139664               0
ora_mmon_cdb11(105326)                  140677          272136          134824               0
ora_rcbg_cdb11(109629)                  119240          347748            4864               0
oraagent.bin(102167)                    107811          182424           88920               0

>> Total memory used by all processes: 7.7 GB

This gathers usage data for all processes from their /proc/<pid>/smaps_rollup files and sorts them in descending order by PSS to display the most memory-intensive processes on the system. It displays the following metrics per process (a quick way to cross-check these numbers for a single process, straight from the kernel, is shown after this list):

  • PSS (proportional set size): PSS reports the proportional size of memory that a process is using. For instance, if process A is using a shared library of size 10 KB, and 9 other processes are also using this shared library, PSS will count only 1 KB towards process A’s memory accounting. This value is more accurate than RSS, and is a reliable indicator of a process’s memory footprint.

  • RSS (resident set size): RSS is the total memory held in RAM for a given process. This can be misleading and bloated – as it includes 100% of the shared libraries for a given process, even though that library could be shared by many processes (and only loaded once in memory). This value is included here since most standard tools (e.g. ps or top) report RSS values for the processes, but it’s not a reliable indicator of how much memory a process is using.

  • Private: Memory not shared with any other process.

  • Swap: Amount of swap space used by this process. A process that is using swap space might be a victim of memory pressure on the system – not the cause of it.
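
As mentioned above, these fields can be read straight from the kernel for a single process, without memstate; a minimal example (pid 1 is just a placeholder – substitute the pid you care about):

$ sudo grep -E '^(Pss|Rss|Private_Clean|Private_Dirty|Swap):' /proc/1/smaps_rollup

(Presumably, the PRIVATE column above is the sum of the clean and dirty private values, but the exact rollup is an implementation detail of the tool.)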

Adding a -v to the previous command will list all the processes in the system and their memory consumption metrics, rather than just the top 10.

If we suspect that a process is leaking memory, one could run this command periodically, using the option -f:

$ sudo oled memstate -p -v -f 60 &
[1] 47583
Capturing memstate data every 60 seconds, in log: /var/oled/memstate/memstate.log; press Ctrl-C to stop.

This will capture the memory usage of each process on the system every 60 seconds, in /var/oled/memstate/memstate.log. A quick glance at the logs after 24 or 48 hours will tell us which process is growing, and at what rate.

Alternatively, if a process’s PID is passed as input, this option will display the memory mappings and usage summary for that process only, extracted from its /proc/<pid>/smaps file.

# oled memstate -p 109629
    KERNEL: 4.14.35-2047.518.4.3.el7uek.x86_64
  HOSTNAME: scao08adm03.us.oracle.com
      TIME: 01/25/2024 14:08:54

Memory usage summary for process ora_rcbg_cdb11 (pid: 109629):
Pss                       118844 KB
Shared                    342884 KB
Private                     4864 KB
Hugetlb                 11548672 KB

Displaying process VMAs >= 256 KB (numbers are in KB):
ADDR                                     PSS          SHARED            PRIV         HUGETLB            MAPPING
70000000-1990000000                        0               0               0        11534336            /SYSV00000000 (deleted)
400010000000-400020000000             113455          262144               0               0            /dev/shm/ora_ffffffffdd586ae0_2e455_1_KSIPC_MGA_NMSPC_1_0_1.dat

Slab caches

Slab caches are contiguous pages which are carved up into small objects for kernel data structures or general-purpose buffers. If slab caches grow too big, they can cause memory fragmentation, since slab cache pages cannot be moved around in order to coalesce small free chunks into larger ones. In other words, large slab caches get in the way of memory reclaim and compaction.

We have seen quite a few customer bugs where the system was badly fragmented due to heavy slab cache growth. To see how fragmented the memory is, one can look at /proc/buddyinfo. Memstate also prints this data in its default summary output. If the memory is too fragmented – i.e., there are not enough chunks available in the order-3 bucket and all higher-order buckets have either low numbers or 0 – that can lead to compaction stalls. On fragmented systems, we’ll see most of the free memory in the order-0 and maybe order-1 buckets. A large value for order-0 implies that the system is unable to merge those 4 KB pages into 8 KB chunks, perhaps because an allocated, immovable page (maybe a slab page) is adjacent to a free page.

This is a healthy (and most likely, idle) system without much fragmentation:

# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      3
Node 0, zone    DMA32      7      4      2      3      2      3      3      4      1      2    335
Node 0, zone   Normal   9916   5648   4340   1183    542    139     50     20     15      5  21441
Node 1, zone   Normal   8748   4896   1737    807    398    211     57     15      8      5  25654

Note the large number of order-10 chunks. When a request for a smaller chunk comes in, the kernel will split a higher-order chunk into smaller ones to fulfill the allocation request. Over time (months or years), this can lead to most of the higher-order chunks being used up: the numbers in those buckets shrink, and the free memory skews heavily toward the lower-order buckets. But the kernel does compact free memory by migrating allocated pages which are adjacent to free pages (if they are movable), thus creating a contiguous span of free pages – i.e., a higher-order chunk. However, this process will fail if there are a lot of immovable pages allocated in the system, leading to fragmentation.
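
Each numeric column in /proc/buddyinfo is the count of free chunks of a given order – order 0 is 4 KB, order 1 is 8 KB, and so on up to order 10 (4 MB) on x86_64. As a rough sketch, the counts can be converted into free memory per zone like this:

$ awk '{ kb = 0; for (i = 5; i <= NF; i++) kb += $i * 4 * 2^(i - 5);
         printf "%s %s zone %-8s %12.1f MB free\n", $1, $2, $4, kb / 1024 }' /proc/buddyinfo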

On an unhealthy system, you might see numbers like this:

# cat /proc/buddyinfo
Node 0, zone      DMA      0      1      1      0      0      0      1      0      0      1      3
Node 0, zone    DMA32      6      5      5      5      5      4      3      4      3      2    338
Node 0, zone   Normal 1084411 393679    495     59     3      0      1      0      0      0      1
Node 1, zone   Normal 523667 240672 109678  53514  15280   2708    663      7      0      0      1

Here, there are only 59 chunks available in the order-3 bucket on node 0. When a process requests an order-3 chunk and one is not readily available, it enters direct reclaim mode – i.e. the process tries to free up and compact memory in the allocation path. This can make memory allocation very slow, and if the process was holding a lock or another critical resource before it tried to allocate memory, it could cause problems for other processes running on that system. Memory fragmentation can cause issues like device timeouts, ping failures, high %sys utilization, high load average, etc.

A scarcity of order-4 (or higher) chunks will result in page allocation failure warnings being printed to the syslog, but it will not trigger the direct reclaim mode of allocation. In other words, a lack of higher-order chunks is normal – expected, even, on long-running systems – and the system does not try too hard to reclaim/compact memory when a higher-order chunk is unavailable. Therefore, kernel modules and drivers should always have a fallback path that retries lower-order allocations when a higher-order one fails.

Anyway, all this is to say that it’s quite useful to keep an eye on the slab caches – which one can do using the -s option of memstate:

$ sudo oled memstate -s -v
    KERNEL: 5.4.17-2136.318.7.1.el8uek.x86_64
  HOSTNAME: aruramak-test
      TIME: 10/17/2023 11:14:31

SLAB CACHES (in KB):
SLAB CACHE                           SIZE (KB)            ALIASES
proc_inode_cache                        855872            (null)
dentry                                  666504            (null)
task_struct                             187168            (null)
xfs_inode                               183776            (null)
vm_area_struct                          167456            (null)
kmalloc-512                             148960            (null)
inode_cache                             101632            (null)
kmalloc-2k                               94048            (null)
ip_fib_alias                             89180            nsproxy, avc_xperms_node, zswap_entry, Acpi-Parse, uhci_urb_priv, xfs_bmap_free_item, lsm_inode_cache, file_lock_ctx
radix_tree_node                          86560            (null)
kmalloc-rcl-96                           81312            (null)
kmalloc-1k                               80672            (null)
filp                                     77232            (null)
sighand_cache                            53664            (null)
ip_mrt_cache                             45856            virtio_scsi_cmd, rds_ib_frag, uid_cache, ip6_mrt_cache, inet_peer_cache, xfs_ili, t10_alua_lu_gp_cache, dmaengine-unmap-16
kmalloc-4k                               40768            (null)
PING                                     34176            UNIX, signal_cache
kmalloc-8k                               27872            (null)
mm_struct                                27232            (null)
fs_cache                                 26832            anon_vma_chain
anon_vma                                 25544            (null)
files_cache                              22752            (null)
...

The slab usage data is gathered from the /sys/kernel/slab/<cache>/ files. The output here consists of the slab cache name, its size (in KB) and all the aliases for that cache. The Linux kernel merges together slab caches with similar attributes (object size, alignment, etc.). Typically, for a set of merged slabs, only one name (alias) is displayed in /proc/slabinfo. Here, we see the list of all aliases for a given cache, which can be useful in debugging slab leaks. Enabling any of the SLUB debug options will disable this merging.
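
Under the hood, merged (aliased) cache names appear in /sys/kernel/slab as symlinks that resolve to the same underlying cache. Assuming a SLUB kernel with merging enabled (the default), the alias groups can be listed by sorting the names by their link target:

$ sudo find /sys/kernel/slab -maxdepth 1 -type l -printf '%l  %f\n' | sort

Names that resolve to the same target have been merged; /proc/slabinfo typically shows only one of them.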

Running this command with an additional -v lists the slab cache sizes of all caches on the system, in descending order. If you suspect that one of the slab caches is growing, you could run this command periodically (using -f to specify periodicity) to monitor the growth over time:

$ nohup sudo oled memstate -s -v -f 600 &
Capturing memstate data every 600 seconds, in log: /var/oled/memstate/memstate.log; press Ctrl-C to stop.

A final note before we close the chapter on fragmentation: the system keeps track of the compaction/reclamation effort in various counters, which are exported to userspace via /proc/vmstat. These counters, which are included in the oled memstate summary output, can also be useful when debugging memory fragmentation-related issues:

# oled memstate
...
Vmstat:
  allocstall_normal 0
  zone_reclaim_failed 0
  kswapd_low_wmark_hit_quickly 0
  kswapd_high_wmark_hit_quickly 0
  drop_pagecache 0
  drop_slab 0
  oom_kill 0
  compact_migrate_scanned 0
  compact_free_scanned 0
  compact_isolated 0
  compact_stall 0
  compact_fail 0
  compact_success 0
  compact_daemon_wake 0
  compact_daemon_migrate_scanned 0
  compact_daemon_free_scanned 0
...
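
It can also be useful to read the raw counters directly and check whether they are climbing while a problem is in progress – for example, by sampling them a few minutes apart:

$ grep -E '^(allocstall|compact_|kswapd_|oom_kill)' /proc/vmstat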

If there is a large slab cache on the system (say, hundreds of GB) causing memory fragmentation, one way to temporarily mitigate that is to forcibly release memory from that cache. For the dentry or inode slab caches, that can be achieved by writing 2 to /proc/sys/vm/drop_caches. Note that we do not recommend doing this on a regular basis – it can cause performance issues for any I/O-intensive applications. At best, it’s a temporary band-aid, and further debugging is necessary to root-cause the dentry/inode slab growth. If it’s another slab cache (e.g. task_struct) that’s growing, then it calls for extended debugging to pinpoint if/how those objects are being leaked – drop_caches will not help.
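
For completeness, the temporary mitigation itself is a one-liner (shown here as a one-off, hedged example – not something to put in cron). Running sync first helps, since only clean objects can be dropped:

$ sync
$ echo 2 | sudo tee /proc/sys/vm/drop_caches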

NUMA node allocations

On systems with multiple NUMA nodes, memory allocation needs are not always evenly distributed across the NUMA nodes. Typically, memory is allocated from the “local” node – i.e. the node the process is running on. In some corner cases, we might have processes whose memory needs are far higher than the others’ – some backup and recovery processes, for instance. These will then put memory pressure on one NUMA node, causing an imbalance.

For example:

# numastat -m
Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal              1547405.20      1548190.00      3095595.20
MemFree                 13461.10       848778.94       862240.04
MemUsed               1533944.09       699411.07      2233355.16
...

This is an extreme example of imbalanced allocations, caused by an I/O-intensive process running on node 0 allocating a lot of memory from node 0. This can cause memory pressure for other processes running on node 0, and MemFree might dip below the low watermark, waking up kswapd to do background reclaim and compaction even though there is over 800 GB of free, unused memory on node 1. This is not a common scenario, but it’s an interesting case that might call for looking more closely into how the processes’ allocations are split across the NUMA nodes, checking whether any process is explicitly bound to NUMA node 0, etc.

The NUMA stats for some select categories (like page cache, slab caches, etc.) can be printed using the -n option – it extracts per-NUMA node meminfo data from /sys/devices/system/node/nodeX/meminfo.

$ sudo oled memstate -n
...
NUMA STATISTICS:
NUMA is enabled on this system; number of NUMA nodes is 2.
Per-node memory usage summary (in KB):
                                       NODE 0         NODE 1
MemTotal                            131413236      132071696
MemFree                              14468156       10625716
FilePages                            41377828       37372572
AnonPages                            12178764       10885112
Slab                                 19184668       32837680
Shmem                                 1077016         262980
Total Hugepages (2048 KB)            38912000       38912000
Free Hugepages (2048 KB)              3600384        3117056
[WARN]  Shmem is imbalanced across NUMA nodes.
...

If there is a significant imbalance across one or more of these categories, the tool will print a warning. The warnings that memstate prints to the console are informational – they highlight things that a sysadmin might want to keep an eye on.
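
The raw per-node numbers behind this summary can also be eyeballed directly, for example:

$ grep -E '(MemFree|FilePages|AnonPages|Shmem|Slab):' /sys/devices/system/node/node*/meminfo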

Running this with an additional -v option will scan the /proc/<pid>/numa_maps file for each process and print the per-NUMA-node memory usage for each pid, but this processing can be quite expensive in terms of time and CPU cycles, especially if run on a heavily loaded system. Therefore, it is not recommended to invoke -n/--numa with -v/--verbose too often.

# oled memstate -n -v
...
Reading /proc/<pid>/numa_maps (this could take a while) ...

Per-node memory usage, per process (in KB):
PROCESS(PID)                           NODE 0         NODE 1
systemd-journal(1632)                 61224.0         2120.0
firewalld(2499)                        5920.0        46932.0
rsyslogd(3065)                        36760.0        17728.0
tuned(3058)                           10604.0        34156.0
polkitd(2745)                          3760.0        19468.0
systemd(45133)                        16404.0          584.0
systemd(1)                            13752.0          580.0
NetworkManager(2558)                   7564.0        11952.0
python3(60492)                        11440.0         3264.0
python3(thread-self)                  10864.0         3264.0
python3(self)                         10844.0         3264.0
...
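
If you only want to dig into a single process, its per-node page counts are in /proc/<pid>/numa_maps. A hedged one-liner (pid 1 is just a placeholder; it assumes 4 KB pages, so ranges backed by hugepages make the result approximate):

$ sudo grep -o 'N[0-9]*=[0-9]*' /proc/1/numa_maps | awk -F'[N=]' '{ pages[$2] += $3 }
      END { for (n in pages) printf "node %s: %.1f MB\n", n, pages[n] * 4 / 1024 }'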

Health checks

In the output of oled memstate, there’s a section that checks for common, known issues that we’ve seen before:

# oled memstate
...
HEALTH CHECKS:
[OK]    The value of vm.min_free_kbytes is: 3954221 KB.

[OK]    The value of vm.watermark_scale_factor is: 10.

[WARN]  Page tables are larger than expected (7.1 GB); if this is an Exadata system, check if the DB parameter USE_LARGE_PAGES is set to ONLY.

[OK]    RDS receive cache size is: 2.3 GB.

[OK]    Unaccounted kernel memory is: 5.6 GB.
...

While debugging various memory-related issues, we have come across some common patterns/themes that the memstate tool explicitly tracks and warns the user about. Among them (and this is not a comprehensive list – more checks like these will be added as we discover more useful patterns):

  • vm.min_free_kbytes: This is a sysctl tunable that tells the system the minimum amount of memory to be kept free to satisfy critical allocations, some of which are done in memory management code’s housekeeping paths. If this is too low, it will affect the system’s ability to free/compact memory (i.e. reduce fragmentation). In some cases, processes might end up doing direct reclaim while allocating memory – which can affect performance.

    On Oracle’s engineered systems, we recommend setting this value to max(0.5% of RAM, 1 GB per NUMA node); a quick way to compute this is sketched after this list. Note that setting this value too high might cause the system to invoke the OOM-killer, as this memory is effectively reserved and unusable for all the ‘normal’ allocations in the kernel.

  • vm.watermark_scale_factor: This is also a sysctl tunable that controls how aggressively kswapd reclaims memory. It defines the distance between the min/low/high watermarks of a zone, which dictate when kswapd wakes up and how long it needs to run before it can go to sleep again.

  • Page table size: The amount of memory used by page tables is dictated by the number of in-use pages on the system. Page tables are used by the OS to keep track of the mapping between the addresses processes use and the corresponding physical pages. The larger the RAM, the larger the number of pages of the default size (4 KB) – which translates to bigger page tables. The good news is that most, if not all, large systems use hugepages. On x86_64, a hugepage can be 2 MB or 1 GB. So one page table entry can map an entire 2 MB region (with 2 MB hugepages) that would otherwise require 512 page table entries (with the default 4 KB pages).

    Due to the use of hugepages on Oracle’s engineered systems, page tables are typically small in size (a couple of GBs, if that). If a DB is misconfigured or there was an issue with the hugepages setting, it might fall back to using regular 4 KB pages. When that happens, there might be a host of other issues (performance, for one), and one way to detect that is to check the size of page tables. That’s one of the reasons memstate checks page table size and prints a warning in this section, if it is larger than expected.

  • “Unknown” kernel memory: We’ve discussed this in an earlier section. If the amount of memory allocated by kernel components or drivers directly from the buddy allocator (using __get_free_pages() and sister APIs) is too large, it will print a warning here. It could be a legitimate use case by a kernel component, say for large I/O buffers, or it could be a sign of a memory leak. Further investigation will be needed to rule out any issues.

  • RDS receive cache: This is one of the known, legitimate users of kernel memory allocated directly from the buddy allocator. We compute the cache size and mention it in the ‘HEALTH CHECKS’ section – so that it does not add to the “unknown” memory footprint.
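
As mentioned in the vm.min_free_kbytes bullet above, here is a minimal sketch to compare the current setting against the max(0.5% of RAM, 1 GB per NUMA node) guideline. Treat it as a starting point rather than a prescription:

mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
nodes=$(ls -d /sys/devices/system/node/node[0-9]* | wc -l)
half_percent_kb=$(( mem_kb / 200 ))                # 0.5% of RAM, in KB
per_node_kb=$(( nodes * 1024 * 1024 ))             # 1 GB per NUMA node, in KB
recommended_kb=$(( half_percent_kb > per_node_kb ? half_percent_kb : per_node_kb ))
echo "current: $(cat /proc/sys/vm/min_free_kbytes) KB, guideline: ${recommended_kb} KB"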

Future Work

The memstate tool consolidates, analyzes and monitors useful memory-related statistics that are already exported to userspace in various procfs and sysfs files. It could also be enhanced to run eBPF scripts to collect stack traces when an issue is detected, which would help get to the source of any memory leaks in the kernel. Based on the real-world use cases where memstate is used to debug issues, we will continue to refine and enhance the tool. If you have a specific use case that you’d like to debug with memstate and it isn’t supported, we’d like to hear about it – please open an SR with Linux Support.