Introduction
Memory is a critical resource, and monitoring its health and usage is vital for Linux systems to function optimally.
Many memory-related statistics are collected by system health monitoring tools such as oswatcher, sosreport, etc. These give a fairly good high-level summary of whether there are any issues on the system, or potential issues that an admin might want to watch out for. But sometimes we want to dig a little deeper than these high-level statistics, or monitor them over time. oled memstate makes it easier to observe the health of memory as a snapshot, or how it changes over time. It analyzes raw data already exported to userspace via various procfs or sysfs files, and in some cases prints appropriate warnings. Let's delve deeper into these files and the memstate options appropriate for each use case in the sections below.
Using Memstate
memstate is part of the oled-tools rpm. For more information about the OLED project and installation instructions, please see Oracle Linux Enhanced Diagnostics.
After the oled-tools rpm is installed, memstate can be invoked as:
$ sudo oled memstate
KERNEL: 5.4.17-2136.315.5.8.el8uek.x86_64
HOSTNAME: aruramak-test
TIME: 10/26/2023 10:59:12

MEMORY USAGE SUMMARY (in GB):
  Total memory                 251.3
  Free memory                  167.1
  Used memory                   84.2
    Userspace                   50.5
      Processes                  9.4
      Page cache                42.2
      Shared mem                 1.2
    Kernel                       6.7
      Slabs                      2.7
      RDS                        0.8
      Unknown                    2.3
  Total Hugepages (2048 KB)     27.0
  Free Hugepages (2048 KB)       0.0
  Swap used                      0.0
...
Invoked without any options or parameters, it summarizes and prints to the console the current memory usage, including process and kernel memory usage, the top 10 slab caches, the top 10 processes consuming the most memory, health checks (more on that below), as well as fragmentation status. The data in the 'SUMMARY' section is essentially read from /proc/meminfo and presented in a user-friendly format. The various categories in /proc/meminfo can be a little confusing for the average user, so the tool sums up related categories: user space allocations (e.g., page cache, shmem, mapped, anonpages, etc.) and kernel allocations (slab, page tables, etc.). If there is a memory leak or memory pressure on the system, a glance at this section tells us whether it's the kernel or the processes that need to be looked into.
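To illustrate the kind of grouping this section does, here is a minimal sketch (not memstate's exact code; the field list memstate uses is more complete) that reads /proc/meminfo and sums a few related categories:

# Minimal sketch: group a few /proc/meminfo fields into "userspace" and
# "kernel" buckets, roughly the way the SUMMARY section groups categories.
def meminfo_kb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])   # values are reported in kB
    return info

mi = meminfo_kb()
# Note: "Cached" already includes "Shmem", so Shmem is not added separately.
userspace_kb = mi["Cached"] + mi["Buffers"] + mi["AnonPages"]
kernel_kb    = mi["Slab"] + mi["PageTables"] + mi["KernelStack"]

GIB_KB = 1024 * 1024
print(f"Userspace: {userspace_kb / GIB_KB:6.1f} GB")
print(f"Kernel:    {kernel_kb / GIB_KB:6.1f} GB")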
A brief note about the "unknown" kernel allocations seen here: most memory allocations in the kernel are made via two families of APIs. The first is kmalloc() or kmem_cache_alloc(), which allocate an object from the appropriate slab cache; this is accounted for in /proc/meminfo, /proc/slabinfo, etc. The second allocates directly from the buddy allocator, via __get_free_pages() and friends. This allocates in units of pages, and drivers or other kernel modules might use it to allocate temporary buffers for I/O, for instance. This memory is not accounted for anywhere. The "unknown" figure in the memstate output is a guesstimate of how much memory the tool thinks the kernel is using via __get_free_pages() and similar APIs. Note that this is not an accurate number; it's an approximation. If this value is too large, it's likely that there's a memory leak somewhere in the kernel. Otherwise, the system is healthy, and a small amount of "unknown" memory usage is entirely normal.
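One plausible way to arrive at such an estimate, shown purely as an illustration (this is not memstate's actual formula), is to subtract everything /proc/meminfo can account for from used memory and treat the remainder as "unknown":

# Rough, illustrative estimate of unaccounted kernel memory from /proc/meminfo.
# NOT memstate's exact formula; it only demonstrates the idea that whatever is
# left after subtracting all accounted categories is likely memory taken
# directly from the buddy allocator (__get_free_pages() and friends).
def meminfo_kb():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])   # kB
    return info

mi = meminfo_kb()
used_kb = mi["MemTotal"] - mi["MemFree"]
accounted_kb = sum(mi.get(k, 0) for k in (
    "Cached", "Buffers", "AnonPages", "Slab", "PageTables",
    "KernelStack", "Hugetlb"))   # "Hugetlb" is only present on newer kernels

unknown_kb = max(0, used_kb - accounted_kb)
print(f"Unknown (approx.): {unknown_kb / (1024 * 1024):.1f} GB")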
Another line item in the summary output above is hugepages. If the system has hugepages configured (either 2 MB or 1 GB), the total number reserved as well as the number of free hugepages of each size are displayed here. This is extracted from files under /sys/devices/system/node/nodeX/hugepages/hugepages-Y/. /proc/meminfo only displays usage stats for the default hugepage size, which is 2 MB on x86_64.
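These sysfs files are plain counters and easy to read directly; here is a small sketch (using the standard nr_hugepages and free_hugepages files) that prints the per-node totals for every configured hugepage size:

# Sketch: read per-node hugepage totals from sysfs, for every configured
# hugepage size (not just the default size shown in /proc/meminfo).
import glob, os

for d in sorted(glob.glob("/sys/devices/system/node/node*/hugepages/hugepages-*")):
    node = d.split("/")[5]        # e.g. "node0"
    size = d.rsplit("-", 1)[1]    # e.g. "2048kB"
    with open(os.path.join(d, "nr_hugepages")) as f:
        total = int(f.read())
    with open(os.path.join(d, "free_hugepages")) as f:
        free = int(f.read())
    print(f"{node}: {size} hugepages: total={total} free={free}")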
Apart from this general 'SUMMARY' section, oled memstate also prints data about NUMA node allocations, slab cache usage, process usage, etc., which we'll review in separate sections below.
Memstate supports a few different arguments:
$ sudo oled memstate --help
usage: oled memstate [-h] [-p [PID]] [-s [FILE]] [-n [FILE]] [-a] [-v]
                     [-f [INTERVAL]]

memstate: Capture and analyze memory usage data on this system.

optional arguments:
  -h, --help            show this help message and exit
  -p [PID], --pss [PID]
                        display per-process memory usage
  -s, --slab            analyze/display slab usage
  -n [FILE], --numa [FILE]
                        analyze/display NUMA stats
  -v, --verbose         verbose data capture; combine with other options
  -f [INTERVAL], --frequency [INTERVAL]
                        interval at which data should be collected (default: 30s)
We’ll discuss each of these below. Note that the memstate command has to be run as root.
Using -f [INTERVAL] will run the specified memstate command every INTERVAL seconds; the default, if no number is specified, is 30 seconds. This does not print output to the console; instead, the data is captured in the file /var/oled/memstate/memstate.log, which is logrotated and compressed as it grows. So one could run memstate in the background for hours or days (or even weeks) and continuously monitor the state of memory using the captured logs. We'll talk about a few use cases where that might be useful in the sections below.
Per-process memory usage
In case of memory exhaustion or pressure, our first step is to determine what kind of allocations are causing the pressure. If it’s kernel allocations, we can check if any of the slab caches are growing more than expected (which is usually the case), or if there’s a memory leak in a kernel module or driver. If the latter, one can run a dtrace or ebpf script to track all kernel allocations and frees, and dump stacktraces to pinpoint the function which is leaking memory – there is an ebpf script in the bcc-tools rpm which does that.
On the other hand, if it's userspace allocations that are growing, we'll need to know how much memory each process is using. oled memstate -p prints exactly that.
# oled memstate -p
KERNEL: 4.14.35-2047.518.4.3.el7uek.x86_64
HOSTNAME: scao08adm03.us.oracle.com
TIME: 01/25/2024 12:21:53

TOP 10 MEMORY CONSUMERS (in KB, ordered by PSS):
PROCESS(PID)                   PSS        RSS        PRIVATE    SWAP
java(109499)                   905735     908388     905716     0
java(52119)                    596949     607452     589200     0
ora_ipc0_cdb11(105038)         569761     865868     381772     0
java(102656)                   366326     401008     359332     0
java(102674)                   218670     251460     212576     0
ocssd.bin(66311)               191520     286568     177540     0
ologgerd(102388)               154181     250640     139664     0
ora_mmon_cdb11(105326)         140677     272136     134824     0
ora_rcbg_cdb11(109629)         119240     347748     4864       0
oraagent.bin(102167)           107811     182424     88920      0

>> Total memory used by all processes: 7.7 GB
This gathers usage data for all processes from the /proc/<pid>/smaps_rollup files and sorts them in descending order by PSS, to display the most memory-intensive processes on the system. It displays the following metrics per process (a simplified sketch of this aggregation follows the list):
- PSS (proportional set size): PSS reports the proportional size of memory that a process is using. For instance, if process A is using a shared library of size 10 KB, and 9 other processes are also using this shared library, PSS counts only 1 KB towards process A's memory accounting. This value is more accurate than RSS, and is a reliable number for the memory footprint of a process.
- RSS (resident set size): RSS is the total memory held in RAM for a given process. This can be misleading and bloated, as it includes 100% of the shared libraries mapped by a given process, even though a library could be shared by many processes (and only loaded once in memory). This value is included here since most standard tools (e.g. ps or top) report RSS values for processes, but it's not a reliable indicator of how much memory a process is using.
- Private: Memory not shared with any other process.
- Swap: Amount of swap space used by this process. A process that is using swap space might be a victim of memory pressure on the system, not the cause of it.
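For reference, a minimal sketch of this kind of aggregation (not memstate's actual implementation) could look like the following; it reads each process's /proc/<pid>/smaps_rollup, sums the relevant fields, and sorts by PSS:

# Minimal sketch (not memstate's implementation): aggregate Pss/Rss/Private/
# Swap from /proc/<pid>/smaps_rollup and print the top consumers by PSS.
# Needs to run as root to read every process's smaps_rollup.
import os

def rollup(pid):
    stats = {"Pss": 0, "Rss": 0, "Private": 0, "Swap": 0}
    try:
        with open(f"/proc/{pid}/smaps_rollup") as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key in ("Pss", "Rss", "Swap"):
                    stats[key] += int(rest.split()[0])        # kB
                elif key in ("Private_Clean", "Private_Dirty"):
                    stats["Private"] += int(rest.split()[0])  # kB
    except OSError:
        return None   # process exited or is otherwise inaccessible; skip it
    return stats

procs = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    stats = rollup(pid)
    if stats:
        with open(f"/proc/{pid}/comm") as f:
            name = f.read().strip()
        procs.append((name, pid, stats))

procs.sort(key=lambda p: p[2]["Pss"], reverse=True)
for name, pid, s in procs[:10]:
    print(f"{name}({pid}): PSS={s['Pss']} RSS={s['Rss']} "
          f"PRIVATE={s['Private']} SWAP={s['Swap']} (KB)")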
Adding a -v to the previous command will list all the processes in the system and their memory consumption metrics, rather than just the top 10.
If we suspect that a process is leaking memory, one could run this command periodically, using the option -f:
$ sudo oled memstate -p -v -f 60 &
[1] 47583
Capturing memstate data every 60 seconds, in log: /var/oled/memstate/memstate.log; press Ctrl-C to stop.
This captures the memory usage of each process on the system every 60 seconds, in /var/oled/memstate/memstate.log. A quick glance at the logs after 24 or 48 hours will tell us which process is growing, and at what rate.
Alternatively, if a process's pid is passed as input, this option will display the memory mappings and usage summary for that process only, extracted from the /proc/<pid>/smaps file.
# oled memstate -p 109629
KERNEL: 4.14.35-2047.518.4.3.el7uek.x86_64
HOSTNAME: scao08adm03.us.oracle.com
TIME: 01/25/2024 14:08:54

Memory usage summary for process ora_rcbg_cdb11 (pid: 109629):
Pss          118844 KB
Shared       342884 KB
Private        4864 KB
Hugetlb    11548672 KB

Displaying process VMAs >= 256 KB (numbers are in KB):
ADDR                        PSS      SHARED   PRIV   HUGETLB    MAPPING
70000000-1990000000         0        0        0      11534336   /SYSV00000000 (deleted)
400010000000-400020000000   113455   262144   0      0          /dev/shm/ora_ffffffffdd586ae0_2e455_1_KSIPC_MGA_NMSPC_1_0_1.dat
Slab caches
Slab caches are contiguous pages which are carved up into small objects for kernel data structures or general purpose buffers. If slab caches grow too big, they could cause memory fragmentation since slab cache pages cannot be moved around in order to coalesce small, free chunks into larger free chunks. In other words, large slab caches get in the way of memory reclaim and compaction.
We have seen quite a few customer bugs where the system was badly fragmented due to heavy slab cache growth. To check how fragmented memory is, one can look at /proc/buddyinfo. Memstate also prints this data in its default summary output. If memory is too fragmented (i.e., there are not enough chunks available in the order-3 bucket, and all higher-order buckets have either low counts or 0), that can lead to compaction stalls. On fragmented systems, we'll see most of the free memory in the order-0 and maybe order-1 buckets. A large value for order-0 implies that the system is unable to merge those 4 KB pages into 8 KB chunks, perhaps because an allocated, immovable page (maybe a slab page) is adjacent to a free page.
This is a healthy (and most likely, idle) system without much fragmentation:
# cat /proc/buddyinfo
Node 0, zone      DMA      0     0     0     0     0     0     0     0     1     1     3
Node 0, zone    DMA32      7     4     2     3     2     3     3     4     1     2   335
Node 0, zone   Normal   9916  5648  4340  1183   542   139    50    20    15     5 21441
Node 1, zone   Normal   8748  4896  1737   807   398   211    57    15     8     5 25654
Note the large number of order-10 chunks. When a request for a smaller chunk comes in, the kernel will split a higher-order chunk into smaller ones to fulfill the allocation request. Over time (months or years), this can lead to most of the higher-order chunks being used up; the numbers in those buckets shrink, and free memory skews heavily toward the lower-order buckets. The kernel does, however, compact free memory by migrating allocated pages that are adjacent to free pages (if they are movable), thus creating a contiguous span of free pages, i.e. a higher-order chunk. This process will fail if there are a lot of immovable pages allocated on the system, leading to fragmentation.
On an unhealthy system, you might see numbers like this:
# cat /proc/buddyinfo
Node 0, zone      DMA        0       1      1      0      0     0    1    0  0  1  3
Node 0, zone    DMA32        6       5      5      5      5     4    3    4  3  2  338
Node 0, zone   Normal  1084411  393679    495     59      3     0    1    0  0  0  1
Node 1, zone   Normal   523667  240672 109678  53514  15280  2708  663    7  0  0  1
Here, there are only 59 chunks available in the order-3 bucket on node 0. When a process requests an order-3 chunk and one is not readily available, it enters direct reclaim mode, i.e. the process tries to free up and compact memory in the allocation path. This can make memory allocation very slow, and if the process holds a lock or some other critical resource while it allocates memory, it can cause problems for other processes running on that system. Memory fragmentation can lead to issues like device timeouts, ping failures, high %sys utilization, high load average, etc.
A scarcity of order-4 (or higher) chunks results in page allocation failure warnings being printed to the syslog, but does not trigger the direct reclaim mode of allocation. In other words, a lack of higher-order chunks is normal, expected even, on long-running systems, and the system does not try too hard to reclaim/compact memory when a higher-order chunk is unavailable. Therefore, kernel modules and drivers should always have a fallback option in the code to retry lower-order allocations when a higher-order one fails.
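To keep an eye on this outside of memstate, a small script can parse /proc/buddyinfo and flag zones that are running low on order-3 and higher chunks. This is only an illustrative sketch, and the threshold is arbitrary:

# Illustrative sketch: parse /proc/buddyinfo and warn if a zone has very few
# free chunks of order 3 or higher. The threshold below is arbitrary.
LOW_ORDER3_PLUS = 100   # hypothetical threshold; tune for your system

with open("/proc/buddyinfo") as f:
    for line in f:
        parts = line.split()
        node, zone = parts[1].rstrip(","), parts[3]
        counts = [int(c) for c in parts[4:]]   # counts[i] = free order-i chunks
        order3_plus = sum(counts[3:])
        status = "LOW" if order3_plus < LOW_ORDER3_PLUS else "ok"
        print(f"Node {node}, zone {zone:>7}: free chunks of order >= 3: {order3_plus} [{status}]")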
Anyway, all this is to say that it's quite useful to keep an eye on the slab caches, which one can do using the -s option of memstate:
$ sudo oled memstate -s -v
KERNEL: 5.4.17-2136.318.7.1.el8uek.x86_64
HOSTNAME: aruramak-test
TIME: 10/17/2023 11:14:31

SLAB CACHES (in KB):
SLAB CACHE           SIZE (KB)  ALIASES
proc_inode_cache     855872     (null)
dentry               666504     (null)
task_struct          187168     (null)
xfs_inode            183776     (null)
vm_area_struct       167456     (null)
kmalloc-512          148960     (null)
inode_cache          101632     (null)
kmalloc-2k           94048      (null)
ip_fib_alias         89180      nsproxy, avc_xperms_node, zswap_entry, Acpi-Parse, uhci_urb_priv, xfs_bmap_free_item, lsm_inode_cache, file_lock_ctx
radix_tree_node      86560      (null)
kmalloc-rcl-96       81312      (null)
kmalloc-1k           80672      (null)
filp                 77232      (null)
sighand_cache        53664      (null)
ip_mrt_cache         45856      virtio_scsi_cmd, rds_ib_frag, uid_cache, ip6_mrt_cache, inet_peer_cache, xfs_ili, t10_alua_lu_gp_cache, dmaengine-unmap-16
kmalloc-4k           40768      (null)
PING                 34176      UNIX, signal_cache
kmalloc-8k           27872      (null)
mm_struct            27232      (null)
fs_cache             26832      anon_vma_chain
anon_vma             25544      (null)
files_cache          22752      (null)
...
The slab usage data is gathered from the /sys/kernel/slab/<cache>/ files. The output consists of the slab cache name, its size (in KB), and all the aliases for that cache. The Linux kernel merges slab caches with similar attributes (object size, alignment, etc.); typically, for a set of merged slabs, only one name (alias) is displayed in /proc/slabinfo. Here, we see the list of all aliases for a given cache, which can be useful in debugging slab leaks. Enabling any of the SLUB debug options disables this merging.
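To give a rough idea of where such numbers come from, here is a sketch (not memstate's code) that estimates each cache's footprint from its SLUB sysfs attributes; the 'slabs' and 'order' files are assumed to be present, which is the case on SLUB kernels, but exact accounting (partial slabs, per-CPU slabs) is more involved than this:

# Sketch: estimate per-cache memory usage from SLUB sysfs attributes.
# Not memstate's implementation; exact accounting is more involved.
import os, resource

PAGE_KB = resource.getpagesize() // 1024
SLAB_ROOT = "/sys/kernel/slab"

def read_int(cache, attr):
    with open(os.path.join(SLAB_ROOT, cache, attr)) as f:
        return int(f.read().split()[0])   # some attrs append per-node details

sizes, seen = {}, set()
for cache in os.listdir(SLAB_ROOT):
    real = os.path.realpath(os.path.join(SLAB_ROOT, cache))
    if real in seen:          # aliases of merged caches are symlinks; count once
        continue
    seen.add(real)
    try:
        slabs = read_int(cache, "slabs")   # number of slabs in this cache
        order = read_int(cache, "order")   # pages per slab = 2**order
        sizes[cache] = slabs * (1 << order) * PAGE_KB
    except (OSError, ValueError):
        continue

for cache, kb in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{cache:25s} {kb} KB")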
Running this command with an additional -v lists the slab cache sizes of all caches on the system, in descending order. If you suspect that one of the slab caches is growing, you could run this command periodically (using -f to specify the interval) to monitor the growth over time:
$ nohup sudo oled memstate -s -v -f 600 &
Capturing memstate data every 600 seconds, in log: /var/oled/memstate/memstate.log; press Ctrl-C to stop.
A final note before we close the chapter on fragmentation: the system keeps track of the compaction/reclaim effort in various counters, which are exported to userspace via /proc/vmstat. These counters can also be useful in debugging memory fragmentation-related issues, and they are included in the oled memstate summary output:
# oled memstate
...
Vmstat:
allocstall_normal                  0
zone_reclaim_failed                0
kswapd_low_wmark_hit_quickly       0
kswapd_high_wmark_hit_quickly      0
drop_pagecache                     0
drop_slab                          0
oom_kill                           0
compact_migrate_scanned            0
compact_free_scanned               0
compact_isolated                   0
compact_stall                      0
compact_fail                       0
compact_success                    0
compact_daemon_wake                0
compact_daemon_migrate_scanned     0
compact_daemon_free_scanned        0
...
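These counters can also be read directly from /proc/vmstat outside of memstate; a quick sketch (counter names vary somewhat across kernel versions, so missing ones are simply skipped):

# Sketch: read a few reclaim/compaction counters straight from /proc/vmstat.
# Counter names vary by kernel version; missing ones are skipped.
WATCH = ("allocstall_normal", "compact_stall", "compact_fail",
         "compact_success", "oom_kill")

with open("/proc/vmstat") as f:
    vmstat = dict(line.split() for line in f)

for name in WATCH:
    if name in vmstat:
        print(f"{name:20s} {vmstat[name]}")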
If there is a large slab cache on the system (say, hundreds of GB) causing memory fragmentation, one way to temporarily mitigate that is to forcibly release memory from that cache. For the dentry or inode slab caches, that can be achieved by writing 2 to /proc/sys/vm/drop_caches. Note that we do not recommend doing this on a regular basis; it can cause performance issues for I/O-intensive applications. At best, it's a temporary band-aid, and further debugging is necessary to root-cause the dentry/inode slab growth. If it's another slab cache (e.g. task_struct) that's growing, it calls for extended debugging to pinpoint if/how those objects are being leaked; drop_caches will not help.
NUMA node allocations
On systems with multiple NUMA nodes, memory allocations are not always distributed evenly across the nodes. Typically, memory is allocated from the "local" node, i.e. the node the process is running on. In some corner cases, we might have processes whose memory needs are far higher than the others', some backup and recovery processes, for instance. These can then put memory pressure on one NUMA node, causing an imbalance.
For example:
# numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal              1547405.20      1548190.00      3095595.20
MemFree                 13461.10       848778.94       862240.04
MemUsed               1533944.09       699411.07      2233355.16
...
This is an extreme example of imbalanced allocations, caused by an I/O-intensive process running on node 0 and allocating a lot of memory from node 0. This can cause memory pressure for other processes running on node 0, and MemFree might dip below the low watermark, waking up kswapd to do background reclaim and compaction even though there is over 800 GB of free, unused memory on node 1. This is not a common scenario, but it's an interesting case that might call for looking more closely into how each process's allocations are split across the NUMA nodes, checking whether any process is explicitly bound to NUMA node 0, etc.
The NUMA stats for some select categories (like page cache, slab caches, etc.) can be printed using the -n option; it extracts per-NUMA-node meminfo data from /sys/devices/system/node/nodeX/meminfo.
$ sudo oled memstate -n
...
NUMA STATISTICS:
NUMA is enabled on this system; number of NUMA nodes is 2.

Per-node memory usage summary (in KB):
                              NODE 0        NODE 1
MemTotal                   131413236     132071696
MemFree                     14468156      10625716
FilePages                   41377828      37372572
AnonPages                   12178764      10885112
Slab                        19184668      32837680
Shmem                        1077016        262980
Total Hugepages (2048 KB)   38912000      38912000
Free Hugepages (2048 KB)     3600384       3117056

[WARN] Shmem is imbalanced across NUMA nodes.
...
If there is a significant imbalance across one or more of these categories, the tool prints a warning. The warnings that memstate prints to the console are informational; they are things that a sysadmin might want to keep an eye on.
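For illustration, per-node meminfo can be compared along the following lines. This is a sketch only; memstate's actual imbalance heuristics are not reproduced here, and the 2x ratio threshold is arbitrary:

# Sketch: read per-NUMA-node meminfo from sysfs and flag categories that look
# imbalanced. The 2x ratio threshold is arbitrary, not memstate's rule.
import glob, re

CATEGORIES = ("FilePages", "AnonPages", "Slab", "Shmem")

def node_meminfo(path):
    """Parse /sys/devices/system/node/nodeN/meminfo into {field: kB}."""
    info = {}
    with open(path) as f:
        for line in f:
            # Lines look like: "Node 0 MemTotal:  131413236 kB"
            m = re.match(r"Node \d+\s+(\w+):\s+(\d+)", line)
            if m:
                info[m.group(1)] = int(m.group(2))
    return info

nodes = [node_meminfo(p)
         for p in sorted(glob.glob("/sys/devices/system/node/node*/meminfo"))]

for cat in CATEGORIES:
    values = [n.get(cat, 0) for n in nodes]
    if min(values) and max(values) / min(values) > 2:
        print(f"[WARN] {cat} is imbalanced across NUMA nodes: {values}")
    else:
        print(f"[OK]   {cat}: {values}")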
Running this with an additional -v option will scan the /proc/<pid>/numa_maps file for each process and print the per-NUMA-node memory usage for each pid, but this processing can be quite expensive in terms of time and CPU cycles, especially if run on a heavily loaded system. Therefore, it is not recommended to invoke -n/--numa with -v/--verbose too often.
# oled memstate -n -v
...
Reading /proc/<pid>/numa_maps (this could take a while) ...

Per-node memory usage, per process (in KB):
PROCESS(PID)               NODE 0      NODE 1
systemd-journal(1632)     61224.0      2120.0
firewalld(2499)            5920.0     46932.0
rsyslogd(3065)            36760.0     17728.0
tuned(3058)               10604.0     34156.0
polkitd(2745)              3760.0     19468.0
systemd(45133)            16404.0       584.0
systemd(1)                13752.0       580.0
NetworkManager(2558)       7564.0     11952.0
python3(60492)            11440.0      3264.0
python3(thread-self)      10864.0      3264.0
python3(self)             10844.0      3264.0
...
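The per-process breakdown above comes from /proc/<pid>/numa_maps, where each mapping line carries N<node>=<pages> counters. A simplified sketch of that aggregation (not memstate's code; it assumes 4 KB pages and ignores hugepage mappings, whose counts are reported in hugepage units) might look like this:

# Simplified sketch: sum per-node page counts from /proc/<pid>/numa_maps.
# Assumes 4 KB pages; hugepage mappings report counts in hugepage units,
# which this sketch does not handle. Reading numa_maps for every process
# can be slow on a busy system.
import re

def numa_usage_kb(pid, page_kb=4):
    per_node = {}
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
                per_node[int(node)] = per_node.get(int(node), 0) + int(pages) * page_kb
    return per_node

print(numa_usage_kb(1))   # per-node usage (in KB) for pid 1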
Health checks
In the output of oled memstate, there's a section that checks for common, known issues that we've seen before:
# oled memstate
...
HEALTH CHECKS:
[OK] The value of vm.min_free_kbytes is: 3954221 KB.
[OK] The value of vm.watermark_scale_factor is: 10.
[WARN] Page tables are larger than expected (7.1 GB); if this is an Exadata system, check if the DB parameter USE_LARGE_PAGES is set to ONLY.
[OK] RDS receive cache size is: 2.3 GB.
[OK] Unaccounted kernel memory is: 5.6 GB.
...
While debugging various memory-related issues, we have come across some common patterns that the memstate tool explicitly tracks and warns the user about. Among them are the following (this is not a comprehensive list; more checks will be added as we discover more useful patterns):
- vm.min_free_kbytes: This is a sysctl tunable that tells the system the minimum amount of memory to be kept free to satisfy critical allocations, some of which are done in the memory management code's housekeeping paths. If this is too low, it will affect the system's ability to free/compact memory (i.e., reduce fragmentation). In some cases, processes might end up doing direct reclaim while allocating memory, which can affect performance. On Oracle's engineered systems, we recommend setting this value to max(0.5% of RAM, 1 GB per NUMA node); see the sketch after this list. Note that setting this value too high might cause the system to invoke the OOM-killer, as this memory is effectively reserved and unusable for all the 'normal' allocations in the kernel.
- vm.watermark_scale_factor: This is also a sysctl tunable; it controls how aggressively kswapd reclaims memory. It defines the distance between the min/low/high watermarks of a zone, which dictate when kswapd wakes up and how long it needs to run before it can go back to sleep.
- Page table size: The amount of memory used by page tables is dictated by the number of pages on the system. A page table is used by the OS to keep track of the mapping between the addresses processes use and the corresponding physical pages. The larger the RAM, the larger the number of pages of the default size (4 KB), which translates to bigger page tables. The good news is that most, if not all, large systems use hugepages. On x86_64, a hugepage can be 2 MB or 1 GB, so one page table entry can map an entire 2 MB region (with 2 MB hugepages) that would otherwise require 512 page table entries (with the default 4 KB pages). Due to the use of hugepages on Oracle's engineered systems, page tables are typically small (a couple of GBs, if that). If a DB is misconfigured or there is an issue with the hugepages setting, it might fall back to using regular 4 KB pages. When that happens, there might be a host of other issues (performance, for one), and one way to detect it is to check the size of the page tables. That's one of the reasons memstate checks the page table size and prints a warning in this section if it is larger than expected.
- "Unknown" kernel memory: We discussed this in an earlier section. If the amount of memory allocated by kernel components or drivers directly from the buddy allocator (using __get_free_pages() and sister APIs) is too large, a warning is printed here. It could be a legitimate use case by a kernel component, say for large I/O buffers, or it could be a sign of a memory leak; further investigation is needed to rule out any issues.
- RDS receive cache: This is one of the known, legitimate users of kernel memory allocated directly from the buddy allocator. We compute the cache size and report it in the 'HEALTH CHECKS' section, so that it does not add to the "unknown" memory footprint.
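For reference, here is a small sketch of the vm.min_free_kbytes recommendation described above, i.e. max(0.5% of RAM, 1 GB per NUMA node). It is only an illustration of the rule, not memstate's exact check:

# Sketch: compute the recommended vm.min_free_kbytes per the rule above,
# max(0.5% of RAM, 1 GB per NUMA node), and compare with the current setting.
import glob

with open("/proc/meminfo") as f:
    memtotal_kb = int(next(l for l in f if l.startswith("MemTotal")).split()[1])

num_nodes = len(glob.glob("/sys/devices/system/node/node[0-9]*"))

recommended_kb = max(int(memtotal_kb * 0.005), num_nodes * 1024 * 1024)

with open("/proc/sys/vm/min_free_kbytes") as f:
    current_kb = int(f.read())

print(f"MemTotal: {memtotal_kb} KB, NUMA nodes: {num_nodes}")
print(f"Recommended vm.min_free_kbytes: {recommended_kb} KB (current: {current_kb} KB)")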
Future Work
The memstate tool consolidates, analyzes, and monitors useful memory-related statistics that are already exported to userspace in various procfs and sysfs files. It could also be enhanced to run ebpf scripts to collect stacktraces when an issue is detected, which would help get to the source of any memory leaks in the kernel. We will continue to refine and enhance the tool based on real use cases where it is used to debug issues. If you have a specific use case that you'd like to debug with memstate and it's not supported, we'd like to hear about it; please open an SR with Linux Support.