Understanding Linux Kernel Memory Statistics

February 20, 2024 | 14 minute read

Introduction

The Memory Management (MM) subsystem stands as a vital cornerstone of the Linux kernel. It manages the underlying memory hardware, specifically RAM, and provides memory facilities that cater to the diverse needs of other kernel components and user-space processes. Because these facilities are designed for many different use cases, the MM subsystem has evolved into a highly intricate component.

Debugging memory issues on a Linux system often leads engineers to delve into the system-level statistics of the MM subsystem. Unsurprisingly, such a complex subsystem comes with an equally complex measurement system and a large number of statistics. Understanding and using these metrics during debugging can be a daunting task.

This article aims to introduce and explain the memory statistics collected in /proc/meminfo. To do so, we will explore three questions:

  1. Why does the Linux kernel need these memory statistics?
  2. What are these statistics?
  3. How can one leverage them when troubleshooting memory-related issues?

Gathering Detailed Memory Information

To obtain an overview of the MM subsystem statistics on a live Linux system, a simple command, cat /proc/meminfo, suffices. Another command, cat /sys/devices/system/node/<node name>/meminfo, can be useful for getting each NUMA node’s memory statistics for those interested.

But what if the system has experienced a crash?

In that case, kdump, a kernel feature that creates crash dumps when a kernel crash occurs, can be helpful. With kdump enabled, the Linux kernel exports a memory image called vmcore that can be analyzed to determine the cause of a crash. For a detailed guide on generating vmcore in Oracle Linux, refer to this blog post. Certain kernel debugging tools, such as crash and drgn, can be used to analyze Linux crash dumps and parse Linux kernel variables within a vmcore.

However, obtaining the list of meminfo statistics can still require significant effort. Fortunately, Oracle’s open-source drgn-tools simplifies the process. drgn-tools builds on the programmability of drgn and provides many kernel debugging utilities. Among them, corelens can seamlessly gather information from a vmcore to produce system statistics reports for various Linux kernel subsystems. Using corelens, one can retrieve meminfo statistics from a crashed Linux system with a single command:

corelens <vmcore dir> -M meminfo

So Many Statistics. Why Do We Need Them?

With a big list of memory statistics dumped to the terminal, it is natural to wonder why such an extensive list is necessary. The short answer lies in the diverse use cases that the MM subsystem is designed for, and correspondingly, its monitoring system mirrors this complexity. This section offers an overview of memory utilities and their properties to help understand the meminfo statistics.

Memory allocation requests manifest in different forms: byte-level, page-level, and hugepage allocations. These requests can come from different consumers, including user-space processes and kernel threads.

For byte-level memory, kmalloc and vmalloc are involved. kmalloc allocates physically contiguous memory regions, while vmalloc returns virtually contiguous memory that may be backed by physically non-contiguous pages.
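
On a live system, the size-binned slab caches that serve kmalloc can be inspected directly. The snippet below is a minimal illustration; reading /proc/slabinfo may require root privileges:

# kmalloc requests are served from size-binned slab caches; list them
# along with their object counts and sizes
grep '^kmalloc-' /proc/slabinfo | head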

For page-level memory, there are page caches and anonymous pages. Page caches or file-backed pages are associated with a specific file on persistent storage. They act as a caching layer to enhance I/O efficiency. In contrast, anonymous pages are not tied to a file. They are created through dynamic allocations, such as those forming VM areas during program execution.

The Linux MM subsystem employs a Least-Recently-Used (LRU)-like system to manage the above pages. File pages and anonymous pages reside in distinct lists and are treated separately. A page is promoted to the head of its corresponding LRU list upon access, while memory pressure prompts page evictions from the tails of LRU lists.

Swap space is involved when evicting anonymous pages under memory pressure. Anonymous pages, lacking file backing, require swap space so that their contents are preserved during memory reclaim. With swap enabled, the kernel reserves a portion of persistent storage. The kernel can then release anonymous pages by moving them from RAM to swap space (swap out). When needed again, these pages can be swapped in to main memory as if they had never left.

Beyond these, the Linux kernel has supported hugepages since kernel 2.6. The primary motivation is to boost the system’s performance, as hugepages reduce Translation Lookaside Buffer (TLB) misses and shrink page table footprints. Two variants, standard hugepages and Transparent Huge Pages (THP), are implemented in the kernel. The former requires explicit configuration and pre-allocation, demanding careful management for optimal performance, while the latter offers dynamic allocation with reduced complexity.
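
For a quick check of how hugepages are configured on a given system, the following commands can help (assuming THP support is compiled in and sysfs is mounted at the standard location):

# Current THP policy; the bracketed value is the active one
cat /sys/kernel/mm/transparent_hugepage/enabled
# Default standard hugepage size reported by meminfo
grep Hugepagesize /proc/meminfo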

In short, comprehensive statistics are needed to cover all these memory utilities, which results in the extensive list of meminfo.

Next, What Do They All Mean?

Meminfo statistics serve as valuable guides in the debugging process. Using the meminfo output of a Linux kernel as an example, this section delves into these statistics, offering an in-depth exploration to decipher their meanings.

Basic System Information

For the output of /proc/meminfo, the top section is an overview of the server’s memory. Key metrics include:

  • MemTotal is the total usable physical memory (RAM).
  • MemFree is the amount of free, unused RAM.
  • MemAvailable provides an estimate of the RAM currently available for new workloads. This value is an estimation rather than an exact statistic. The basic idea is to subtract various occupied memory from MemTotal and add back memory that can be easily reclaimed, such as page caches, slab caches, and some other kernel memory. A rough approximation from other meminfo fields is sketched after the sample output below.

MemTotal:      114836372 kB
MemFree:        58050652 kB
MemAvailable:  110577328 kB
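
As a rough illustration of the idea behind MemAvailable, the easily reclaimable categories can be added to MemFree. This is only a sketch; the kernel’s actual heuristic also accounts for low watermarks and other details:

# Simplified estimate: free memory + file-backed LRU pages + reclaimable slab
awk '/^(MemFree|Active\(file\)|Inactive\(file\)|SReclaimable):/ {sum += $2}
     END {printf "~%d kB available (rough estimate)\n", sum}' /proc/meminfo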

If the system is under memory pressure, we can look into the following sections to identify which component(s) have utilized the most RAM.

Page-level memory

Page-level memory introduces several statistics relevant to the management of pages, involving page caches, anonymous pages, and the Slab allocator. The kernel also uses pages for its own purposes.

LRU lists

Currently, the kernel maintains an active list and an inactive list for each of the two page types. When the system’s free memory falls below a watermark threshold, the kernel starts to scan the tails of the inactive lists to reclaim pages that are likely to stay idle for a while. If an inactive list becomes too short, the kernel scans the tail of the corresponding active list to deactivate pages and move them to the inactive list.

  • Active is the total amount of memory on active LRU lists; Inactive is the total amount on inactive LRU lists.
  • For anonymous pages, Active(anon) and Inactive(anon) are the amounts on the active and inactive lists, respectively.
  • Similarly, for the page cache, Active(file) and Inactive(file) are the amounts on the active and inactive lists.
Active:          7888268 kB
Inactive:       46164148 kB
Active(anon):     478584 kB
Inactive(anon):   933240 kB
Active(file):    7409684 kB
Inactive(file): 45230908 kB
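
The scanning and reclaim activity described above can be observed through the kernel’s event counters; exact counter names vary slightly across kernel versions:

# Pages scanned and reclaimed by kswapd and by direct reclaim since boot
grep -E '^(pgscan|pgsteal)' /proc/vmstat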

Of the two page types, page cache is easier to reclaim. A page cache page can be directly freed if it is not dirty; otherwise, a write-back operation is needed first. Reclaiming an anonymous page, however, requires saving the page to swap space.

The swappiness parameter in /proc/sys/vm/swappiness affects the ratio of page cache pages to anonymous pages reclaimed under memory pressure. It ranges from 0 (strongly prefer reclaiming page cache pages) to 200 (strongly prefer reclaiming anonymous pages). When vm.swappiness is 100, the kernel treats both page types equally. The default value of 60 makes the kernel prefer reclaiming page cache pages over anonymous pages.
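
The parameter can be inspected and tuned at run time, for example:

# Check the current swappiness value
cat /proc/sys/vm/swappiness
# Bias reclaim toward page cache; takes effect immediately but is not
# persistent across reboots unless added to sysctl configuration
sysctl -w vm.swappiness=10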

Some pages must be kept in main memory. Examples are pages in shared memory segments locked with SHM_LOCK, pages mapped into VM_LOCKED VM areas via mlock(), and pages owned by ramfs. The Linux kernel keeps these pages on an Unevictable page list and uses the same set of kernel functions (e.g., list manipulation operations, page migration, and memory statistics) to manage them.

  • Unevictable is the amount of memory on the Unevictable list.
  • Mlocked is the total amount of memory pinned in RAM by mlock(). With mlock(), an application can specify a memory range whose pages won’t be removed from main memory during memory reclaim (an example of finding such processes follows the sample output below).
Unevictable:       12388 kB
Mlocked:           12388 kB
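
To see which processes contribute to Mlocked, the per-process VmLck field can be read, for example:

# List processes that currently hold mlock()ed memory (VmLck > 0)
grep -s VmLck /proc/[0-9]*/status | awk '$2 > 0'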

Swap space

Swap space is used for backing up anonymous and shared memory pages. Key swap-related statistics include the following:

  • SwapTotal is the total available swap space.
  • SwapFree is the current remaining swap space available.
  • SwapCached is the amount of pages which are both in memory and in swap space on disk. A non-zero SwapCached indicates memory pressure in the past.
SwapTotal:       8024060 kB
SwapFree:        8024060 kB
SwapCached:            0 kB

Most swapped-out pages do not contribute to SwapCached. The reason is that once a page is swapped out, the kernel removes it from main memory after it has been written back via pageout(). That page then lives in swap space but is not “cached” in memory. Only when the page is accessed again and swapped back in is it counted towards SwapCached.
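
When SwapFree drops or SwapCached grows, the per-process VmSwap field helps attribute swap usage, for example:

# Top swap consumers by swapped-out anonymous memory (VmSwap, in kB)
grep -s VmSwap /proc/[0-9]*/status | sort -k2 -n -r | head -10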

Caches

Memory used for file buffers (Buffers) and cached file data (Cached) is detailed in these two statistics.

Buffers:            6044 kB
Cached:         53358656 kB
  • Cached is the size of the page cache, i.e., file data that has been read and kept in memory. If the processes on the system keep reading new file contents, Cached can become large (an example of flushing clean caches for testing follows this list).
  • Buffers is a transient, in-memory cache for block devices. It covers filesystem metadata as well as in-flight pages between an application and the device. Buffers is used to consolidate data from multiple read/write requests so they can be executed as a single block I/O operation. Typically, Buffers has a small value.
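
When experimenting, clean page cache and reclaimable slab objects can be released on demand. This discards cached data and temporarily hurts I/O performance, so it is a diagnostic aid rather than a tuning knob:

# Flush dirty pages first, then drop the clean page cache (1);
# writing 3 also drops reclaimable slab objects such as dentries and inodes
sync
echo 1 > /proc/sys/vm/drop_caches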

Kernel memory usage

Various kernel memory metrics are shown as follows.

KernelStack:        9072 kB
PageTables:        14928 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
  • KernelStack is the sum of all kernel stack memory. Each user-space task (thread) gets a kernel stack used for executing kernel code (e.g., system calls, traps, and exceptions). Kernel stack memory is always in main memory; it is not attached to LRU lists and is not included in a process’s resident set size (RSS).
  • PageTables is the memory used for page tables. Page tables track the mapping from virtual to physical memory addresses. A process’s page table size grows as it maps more pages (see the example after this list).
  • NFS_Unstable is the amount of memory sent to NFS servers but not yet written to the persistent storage.
  • Bounce is the amount of memory allocated for the block device “bounce buffer”. The mechanism is employed due to certain hardware devices being constrained to accessing low memory addresses, such as the DMA32 memory region. If an application initiates a Direct Memory Access (DMA) request to store data in high memory, the kernel allocates a bounce area in low memory to copy data from high memory and then performs the DMA.
  • WritebackTmp is the amount of memory allocated for the temporary writeback buffer used by the Filesystem in User-space (FUSE) module. As a user-space filesystem, FUSE is a software interface that allows user-space programs to export a virtual filesystem to the Linux kernel. Note that this field is different from Writeback.
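
For instance, the per-process contribution to PageTables can be read from the VmPTE field in /proc/<PID>/status:

# Processes with the largest page-table footprint (VmPTE, in kB)
grep -s VmPTE /proc/[0-9]*/status | sort -k2 -n -r | head -5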

Memory overcommitment

The Linux kernel’s MM subsystem allows memory overcommitment. The rationale is that processes typically request more memory than they actually use. Without memory overcommitment, a substantial portion of allocated memory would remain unused, resulting in underutilization of RAM.

CommitLimit is a memory threshold: it is the maximum amount of memory the kernel will commit when strict overcommit accounting is enabled (vm.overcommit_memory = 2); allocation requests that would push the committed total beyond this limit are refused. The calculation for CommitLimit is as follows:

CommitLimit = TotalRAM * vm.overcommit_ratio / 100 + SwapTotal

Committed_AS is the total amount of memory that has been allocated (committed) by all processes, even if it is not all in use yet. When an out-of-memory (OOM) situation occurs, this metric provides an estimate of the amount of memory that should be provisioned to avert OOM under the target workload.

CommitLimit:    65442244 kB
Committed_AS:    4068492 kB
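
The limit can be recomputed from its inputs to sanity-check these numbers. Note that newer kernels also honor vm.overcommit_kbytes and subtract hugepage reservations, which this sketch ignores:

# Recompute CommitLimit = MemTotal * overcommit_ratio / 100 + SwapTotal
awk -v ratio="$(cat /proc/sys/vm/overcommit_ratio)" '
    /^MemTotal:/  { total = $2 }
    /^SwapTotal:/ { swap  = $2 }
    END { print "CommitLimit ~= " (total * ratio / 100 + swap) " kB" }
' /proc/meminfo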

Byte-level memory

Slab allocator

The kernel uses various data structures in different kernel components and drivers. Some examples are struct mm_struct, used for each process’s address space, and struct buffer_head, used in filesystem I/O. The Slab allocator manages caches of commonly used objects, keeping them in an initialized state so that they are ready for later use. This saves the time of repeatedly allocating, initializing, and freeing objects.

The following statistics offer insights into the Slab allocator’s impact.

  • Slab is the total memory used by all Slab caches.
  • SReclaimable is the amount of Slab memory that might be reclaimed, such as dentry cache objects.
  • SUnreclaim is the amount of Slab memory that cannot be reclaimed. When unreclaimable slab memory takes up a high percentage of the total memory, system performance may be affected.
Slab:            1622084 kB
SReclaimable:    1000692 kB
SUnreclaim:       621392 kB

Slab memory can grow when a kernel component or a driver requests memory but fails to release the memory properly, which is a typical memory leak case. Delving deeper into the specifics of /proc/slabinfo is crucial for identifying the particular slab object that has accumulated substantial memory usage.
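
A convenient starting point is the slabtop utility (part of procps), which sorts the slabinfo data; reading full slab statistics usually requires root:

# Largest slab caches by total cache size, printed once
slabtop -o -s c | head -15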

Vmalloc memory space

The kernel divides its virtual address space into two parts: lowmem (or direct map) and vmalloc memory space. The vmalloc() function allocates physically non-contiguous memory within the vmalloc space.

VmallocTotal is the total size of the vmalloc address space, which depends on the system’s architecture, or more precisely, on the layout of the virtual address space in its implementation. VmallocUsed is the amount of used vmalloc space. VmallocChunk is the size of the largest remaining free vmalloc chunk. This field has been set to zero in recent Linux kernels because computing its value is costly.

VmallocTotal:   34359738367 kB
VmallocUsed:          69144 kB
VmallocChunk:             0 kB
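
Individual vmalloc allocations and their callers are listed in /proc/vmallocinfo, which requires root to read. Assuming the size (in bytes) is the second column, the largest consumers can be found with:

# Largest vmalloc areas together with the functions that allocated them
sort -k2 -n -r /proc/vmallocinfo | head -5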

Other Statistics

  • Dirty is the total amount of file pages that have been modified and are waiting to be written back.
  • Writeback is the total amount of actively written-back pages.
  • AnonPages is the amount of anonymous pages. These pages back a process’s heap, stack, and other private dynamic allocations, with no backing from files on disk.
  • Mapped is the amount of memory used for mapping files (e.g., data files or shared libraries) into processes’ address spaces.
  • Shmem is the total amount of memory used by shared memory and the tmpfs filesystem; tmpfs is a file system which keeps all of its files in memory (see the example after the sample output below).
  • Percpu is the memory allocated for per-cpu objects. For performance and scalability, per-cpu objects (such as counters, caches, and locks) are used extensively in different kernel components. Each CPU possesses a dedicated memory area to hold these per-cpu objects.
  • HardwareCorrupted is the total amount of memory that has been taken out of service due to hardware problems. A substantial number in this field may indicate underlying hardware issues with memory devices.
Dirty:               148 kB
Writeback:             0 kB
AnonPages:        676988 kB
Mapped:           250748 kB
Shmem:            713508 kB
Percpu:           552192 kB
HardwareCorrupted:     0 kB
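
To see what contributes to Shmem, check tmpfs mounts and System V shared memory segments, for example:

# tmpfs filesystems; their used space is counted in Shmem
df -h -t tmpfs
# System V shared memory segments also count toward Shmem
ipcs -m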

Hugepages

Standard hugepages and Transparent Huge Pages (THP) differ in their implementations. The design of standard hugepages makes them a standalone memory component that is isolated from the normal page management system: standard hugepages are not managed by LRU lists and are not accounted for as cache or buffer. In contrast, THPs are placed on LRU lists and accounted towards each process’s Resident Set Size (RSS).

Moreover, standard hugepages must be set aside at the system level and reserved in advance by processes. If they are unused, they remain free but cannot be used by other processes for regular allocations. THPs require no such reservations.

THP (Transparent Huge Pages)

AnonHugePages:    243712 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:    282624 kB
FilePmdMapped:     40960 kB
  • AnonHugePages represents the cumulative size of anonymous transparent huge pages across all processes. To identify the applications utilizing these pages, read /proc/<PID>/smaps and sum the AnonHugePages fields for each mapping.
# grep AnonHugePages /proc/[1-9]*/smaps | awk '{total+=$2}; END {print total}'
243712
# grep AnonHugePages /proc/meminfo
AnonHugePages:    243712 kB
  • FileHugePages is the size of file-backed transparent huge pages, i.e., page cache backed by huge pages. To attribute them, explore /proc/<PID>/smaps and sum the FilePmdMapped fields for each mapping.
  • FilePmdMapped is the amount of those file-backed hugepages that are mapped into user-space page tables.

Shared memory (shmem/tmpfs) can also be backed by transparent huge pages: ShmemHugePages is the amount of shmem/tmpfs memory allocated with huge pages, and ShmemPmdMapped is the portion of it mapped into user-space with huge page-table entries.

HugeTLB (Standard Hugepages)

The following metrics are relevant to standard hugepages.

HugePages_Total:       1024
HugePages_Free:        1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2097152 kB
  • Hugetlb is the total amount of memory allocated for hugepages of all sizes.
  • Hugepagesize specifies the default hugepage size, in this case, 2 MB. The following statistics are all for hugepages of the default size.
  • HugePages_Total is the total number of hugepages of the default size. These memory regions are reserved for hugepages either at boot time or at run time. At run time, one can update vm.nr_hugepages in /proc/sys/vm/nr_hugepages to configure it.
  • HugePages_Free is the current count of free hugepages of the default size.
  • HugePages_Rsvd is the number of reserved hugepages for future allocation.
  • HugePages_Surp is the number of hugepages surpassing the value in /proc/sys/vm/nr_hugepages. The number of surplus hugepages is controlled by the overcommitment setting in /proc/sys/vm/nr_overcommit_hugepages. A non-zero value allows the system to obtain that many hugepages from the normal page pool if the original hugepage pool is exhausted. These surplus hugepages are freed and returned to the normal page pool once they become unused.

The interpretation of HugePages_Free and HugePages_Rsvd might be a bit perplexing without a clear example. Suppose a user-space process reserves 100 2-MB hugepages; in this case, HugePages_Rsvd increments by 100. However, these 100 hugepages remain unallocated until the process actually accesses them. Upon such an access, both HugePages_Free and HugePages_Rsvd are decremented by 1, because that particular hugepage has been allocated and its status changes from free and reserved to owned by the process.

In addition, if both 2-MB and 1-GB hugepages are configured on the system, one can use sysfs files under /sys/kernel/mm/hugepages/hugepages-2048kB/ and /sys/kernel/mm/hugepages/hugepages-1048576kB/ to get statistics for each hugepage type.
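
For example, the default-size pool can be resized at run time, and per-size counters can be read from sysfs (resizing may fail or complete only partially if memory is fragmented, and the 1-GB directory exists only when that page size is supported):

# Grow the default-size hugepage pool to 1024 pages
sysctl -w vm.nr_hugepages=1024
# Per-size counters for the 2 MB and 1 GB pools
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages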

Reference

  1. Generating a vmcore in OCI
  2. crash(8) - Linux manual page
  3. drgn: Programmable debugger
  4. oracle-samples/drgn-tools
  5. Resident set size

Jianfeng Wang

