Introduction

Your phone blasts you awake!

⚠️WARNING: System PROD01 swap utilization > 90%

If you are getting woken up in the middle of the night, or on weekends, by alerts like this one, you may want to reconsider alerting on high swap utilization at all.

Swap is often misunderstood, and the way swap is used in modern kernels may cause some confusion. This blog will briefly cover how swap usage is now more of an optimization than an emergency, and will also answer some frequently asked questions about swap.

Terms, processes and settings

  • MemFree – As viewed in /proc/meminfo, this is the amount of memory that is not being used for anything at all. While it may be tempting to see a large number here, a lot of completely free memory is really a waste of RAM resources.
  • MemAvailable – This is the amount of memory that is available for programs to use, which includes free memory as well as easily reclaimed cache (minus reserved buffers). This is also present in /proc/meminfo.
  • SwapFree – Idle process memory pages which have been evicted from RAM reside on the swap disk. SwapFree is the amount of free/unused space left on the swap disk. Also present in /proc/meminfo.
  • kswapd – The kernel thread that handles memory reclaim and swapping in the background. It evicts least recently used (LRU) pages from RAM – whether that’s from the page cache or process memory – based on a number of complex factors.
  • kcompactd – Another kernel thread that consolidates smaller, free memory chunks into larger, physically contiguous chunks, and reduces memory fragmentation.
  • Low watermark – This is the threshold that controls when kswapd is woken up, to do background reclaim. This threshold can be found in /proc/zoneinfo.
  • Min watermark – This is the threshold that controls when the allocating process itself will block during allocation, to reclaim memory (also called direct reclaim). This can also be found in /proc/zoneinfo.
  • High watermark – This is the threshold that controls when kswapd goes back to sleep, and memory reclaim is considered successful. This threshold can be found in /proc/zoneinfo.
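To make the relationship between these thresholds concrete, here is a minimal Python sketch of the classic watermark arithmetic (low = min + min/4, high = min + min/2). This is a simplification: the real kernel distributes vm.min_free_kbytes across zones and also factors in vm.watermark_scale_factor, both of which are ignored here.

```python
# Toy sketch of how the zone watermarks relate to each other.
# The real kernel (__setup_per_zone_wmarks()) splits min_free_kbytes
# across zones and applies vm.watermark_scale_factor; this simplified
# model treats a single zone and ignores the scale factor.

def zone_watermarks(min_wmark_pages):
    """Classic relationship: low = min + min/4, high = min + min/2."""
    low = min_wmark_pages + min_wmark_pages // 4
    high = min_wmark_pages + min_wmark_pages // 2
    return low, high

low, high = zone_watermarks(100_000)
print(low, high)  # kswapd wakes below `low` and goes back to sleep above `high`
```

Below the min watermark, the allocating process itself has to do direct reclaim; the gap between low and high gives kswapd room to work in the background before that happens.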

A brief introduction to MemFree, kswapd & background reclaim

After your system is booted, most of the memory is in MemFree. As programs allocate memory, MemFree decreases. The Linux kernel, by design, will use up most of the available memory in order to enhance system and application performance. What this means is that it’s expected, normal and healthy for MemFree to be low. Most free memory is used for caching – anything that can prevent a disk I/O is good for application performance.

In general, memory used for page cache is easily reclaimable. Since the data in the page cache is also (for the most part) available on disk, the pages can be easily reclaimed and reused in case of memory pressure (if they’re clean).

When MemFree falls below the low watermark, kswapd is woken up and it tries to reclaim memory. It stops its efforts once free memory reaches the high watermark threshold.

If MemFree continues to fall below the min watermark, then not only will kswapd try and reclaim memory but individual processes that are trying to allocate memory will also attempt to reclaim. This is when you might notice some application latencies as each allocation request has to do some direct reclaim before it can be satisfied.

The pages reclaimed in these flows can either be anonymous pages (i.e. process heap, stack) or file pages (i.e. page cache pages). Some kernel memory (e.g. slab cache) is also reclaimed here, but that is typically not where most gains come from, so we’ll ignore slab cache shrinkers in this post.

Is swap necessary? Can I just disable swap?

Enabling swap space gives the system a way to evict pages that are not backed by disk. If there are allocation bursts, having swap enabled gives the system a little flexibility in deciding what pages to evict. If there is no swap but the system is under memory pressure, the only way to recover memory is to evict page cache pages. Even if there are idle processes with a lot of inactive anonymous memory allocated, they will not be evicted because they are not disk-backed – there’s no backing store for those pages.

If, for instance, there are a lot of processes running on that system (and therefore significant anonymous memory usage), moderate page cache usage, and sudden spikes in memory allocation requests, then the reclaim algorithm will be handicapped by the small amount of page cache. If the page cache also becomes hot (due to a backup application or other file-I/O-intensive workload), then the system will start thrashing – evicting page cache pages only to read them back in again. This is unstable, and might lead to the Out of Memory (OOM) killer being invoked.

There is no need to have cold anonymous pages always in memory – enabling swap space lets the kernel choose between file pages and idle process memory to evict, based on the workload’s memory access patterns. This leads to optimal performance for all applications on the system, as well as optimal resource (i.e. memory) utilization.

On the other hand, swapping during normal workload (on some systems) can increase latency for other processes. In most cases, this is insignificant. But for latency-sensitive or real-time applications, this could be a problem. Most workloads do not fall under this category.

Why is the swap usage high even when my system is not actively swapping now? Shouldn’t SwapFree increase?

SwapFree in /proc/meminfo is the amount of currently unused swap space. As uptime increases, SwapFree will usually decrease; if there was memory pressure in the past (whether one day ago or one month ago) that caused pages to be swapped out, those pages tend to remain in swap unless the process that owns them exits, or the page is modified in memory (which makes the swap copy stale, so it is discarded).

So SwapFree can be quite low due to historic swapping, even if the system’s not under any memory pressure now. If you don’t see active swapping (in the si and so columns of vmstat output), there is typically little ongoing swap I/O impact on the system, though reclaim latency can still exist for other reasons.
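A quick way to check for active swapping is to diff the cumulative pswpin/pswpout counters from /proc/vmstat across an interval. A sketch in Python, with made-up snapshot strings standing in for two real reads of /proc/vmstat:

```python
# Hedged sketch: detect *active* swapping by sampling the cumulative
# pswpin/pswpout counters twice and diffing them. The snapshot strings
# below are invented sample data, not real output.

def parse_vmstat(text):
    return {k: int(v) for k, v in (line.split() for line in text.strip().splitlines())}

snap1 = """pswpin 201575
pswpout 4522469"""
snap2 = """pswpin 201575
pswpout 4522469"""

a, b = parse_vmstat(snap1), parse_vmstat(snap2)
swapped_in = b["pswpin"] - a["pswpin"]
swapped_out = b["pswpout"] - a["pswpout"]
print(swapped_in, swapped_out)  # both zero here: swap is *used*, but not *active*
```

A low SwapFree combined with zero deltas is exactly the "historic swapping" case: the used swap space reflects past pressure, not ongoing I/O.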

If you see thrashing, i.e. pages are swapped out due to memory pressure, only to be swapped back in because they’re actively referenced, that’s an issue and should be addressed, possibly by balancing the workload on the system, or adding more RAM.

Why is my swap usage high even when I have plenty of free memory?

That depends on the watermarks. There might be a lot of free memory globally, but it might be unevenly split among the NUMA (Non-Uniform Memory Access) nodes, where one NUMA node is teetering on the edge of memory pressure and the other is under-utilizing its resources. If this happens, the NUMA node running low on memory could start swapping (depending on allocation policy, cpusets/mempolicy constraints, and zone reclaim behavior).

The NUMA imbalance situation is shown below with a sample of numastat data:

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               772243.23       774101.36      1546344.59
MemFree                  9418.90       106649.16       116068.06    <--
...
Active(file)            81925.87        56966.03       138891.89
Inactive(file)         245829.25       168282.12       414111.36
...
FilePages              333708.79       236806.75       570515.54
...

Meminfo:

zzz <11/27/2023 11:35:24> Count:0
MemTotal:       1583456860 kB
MemFree:        131671936 kB
...
Active(anon):   77319236 kB
Inactive(anon):  3020752 kB
Active(file):   140084028 kB
Inactive(file): 413750580 kB
...
SwapTotal:      25165820 kB
SwapFree:       23269628 kB

Here, NUMA node 0 has just ~9 GB free, whereas NUMA node 1 has over 100 GB free. This is a stark imbalance and node 0 pages could be swapped out if memory pressure worsened on node 0, despite the fact that MemFree is >125 GB (almost all of which is coming from node 1).

Here’s another example, which is more drastic:

                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal              1547405.20      1548190.00      3095595.20
MemFree                 13432.87       848495.85       861928.72 <--
...
Active(file)           864159.89        31523.51       895683.39 <--
Inactive(file)           9656.17         9037.02        18693.19
...
FilePages              881459.59        63041.10       944500.70
Mapped                   7441.12        23012.17        30453.29
AnonPages               25261.50        19255.40        44516.90
Shmem                    7627.10        22435.06        30062.16
...

Summary percentage report: 
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemFree                    0.87%          54.81%          27.84%
MemUsed                   99.13%          45.19%          72.16%

Note: From a Numa.ExaWatcher report

Almost 850 GB free, but all on NUMA node 1. Node 0 will experience memory pressure, sooner rather than later, and the system has multiple strategies to deal with that:

  • Reclaim inactive page cache pages.
  • Demote active pages to inactive more aggressively (which then get reclaimed easily).
  • Allocate from foreign node (i.e. node 1).
  • Swap inactive process memory out to disk.
  • Shrink slab caches.

Swapping is a small piece of the pie, and not the most important piece – all the other strategies are deployed as well. The goal is to reclaim as much memory as efficiently as possible.
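The imbalance in the report above can be quantified with a few lines of Python. The per-node numbers below are copied from the second report; the 25-point cutoff is an arbitrary illustration, not a kernel threshold:

```python
# Sketch: flag per-node imbalance from numastat-style data (values in MB).
# Numbers mirror the second report above; the cutoff is arbitrary.

nodes = {
    0: {"MemTotal": 1547405.20, "MemFree": 13432.87},
    1: {"MemTotal": 1548190.00, "MemFree": 848495.85},
}

free_pct = {n: 100 * v["MemFree"] / v["MemTotal"] for n, v in nodes.items()}
imbalanced = max(free_pct.values()) - min(free_pct.values()) > 25  # arbitrary cutoff
print({n: round(p, 2) for n, p in free_pct.items()}, imbalanced)
```

The computed percentages (0.87% vs. 54.81% free) match the summary percentage report above: node 0 is effectively out of memory while the system as a whole looks comfortable.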

Why is my system swapping even though there’s plenty of page cache? Shouldn’t it reclaim from the page cache first?

Short answer: it does both. Broadly speaking, the kernel reclaims memory from both file pages (i.e. page cache) and anonymous process pages (by evicting them to swap). It prefers reclaiming clean page cache pages, since those are very easy to reuse without I/O. But it does not wait to swap until the page cache is depleted – both sets of pages are reclaimed, although at different rates.

Here’s the relevant kernel code snippet (this, and all other references in this note are from the 5.15 (i.e. Oracle UEK7) kernel):

/*
 * Determine how aggressively the anon and file LRU lists should be
 * scanned.  The relative value of each set of LRU lists is determined
 * by looking at the fraction of the pages scanned we did rotate back
 * onto the active list instead of evict.
 *
 * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
 * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
 */
static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
                           unsigned long *nr)
{
...
        /*
         * If there is enough inactive page cache, we do not reclaim
         * anything from the anonymous working set right now.
         */
        if (sc->cache_trim_mode) {
                scan_balance = SCAN_FILE;
                goto out;
        }

        scan_balance = SCAN_FRACT;
        /*
         * Calculate the pressure balance between anon and file pages.
         *
         * The amount of pressure we put on each LRU is inversely
         * proportional to the cost of reclaiming each list, as
         * determined by the share of pages that are refaulting, times
         * the relative IO cost of bringing back a swapped out
         * anonymous page vs reloading a filesystem page (swappiness).
         *
         * Although we limit that influence to ensure no list gets
         * left behind completely: at least a third of the pressure is
         * applied, before swappiness.
         *
         * With swappiness at 100, anon and file have equal IO cost.
         */
        total_cost = sc->anon_cost + sc->file_cost;
        anon_cost = total_cost + sc->anon_cost;
        file_cost = total_cost + sc->file_cost;
        total_cost = anon_cost + file_cost;

        ap = swappiness * (total_cost + 1);
        ap /= anon_cost + 1;

        fp = (200 - swappiness) * (total_cost + 1);
        fp /= file_cost + 1;

        fraction[0] = ap;
        fraction[1] = fp;
        denominator = ap + fp;

If the number of inactive file pages is low, or if the active file pages have a high refault rate, the reclaim preference will tilt towards anon pages – basically whatever pages the system can reclaim with least cost paid in terms of performance, I/O cost, etc. There’s no single trigger to start swapping.
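To see how swappiness and reclaim cost interact, here is a hedged Python re-derivation of the ap/fp arithmetic from the snippet above (integer math as in the kernel; the cost inputs are invented for illustration):

```python
# Re-derivation of the ap/fp math from get_scan_count() above.
# anon_cost/file_cost stand in for sc->anon_cost / sc->file_cost,
# the relative reclaim costs tracked by the kernel.

def scan_fractions(anon_cost, file_cost, swappiness=60):
    total = anon_cost + file_cost
    a = total + anon_cost          # "at least a third of the pressure" floor
    f = total + file_cost
    total = a + f
    ap = swappiness * (total + 1) // (a + 1)
    fp = (200 - swappiness) * (total + 1) // (f + 1)
    return ap, fp                  # scan pressure ratio ap:fp for anon:file

balanced = scan_fractions(500, 500)   # anon and file equally costly
file_hot = scan_fractions(100, 900)   # file pages refaulting heavily
print(balanced, file_hot)
```

With the default swappiness of 60, file pages still get most of the pressure in both cases, but the anon share of the total rises noticeably when file reclaim becomes expensive – which is exactly why a hot, refaulting page cache pushes the system toward swap.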

Let’s check the behavior on one system:

Before:
zzz <03/08/2022 00:00:02> Count:119
Cached:         40329468 kB
SwapFree:       25165820 kB

Page cache usage shoots up:
zzz <03/08/2022 01:50:23> Count:359
Cached:         225038676 kB
SwapFree:       17709564 kB
...
zzz <03/08/2022 02:50:34> Count:359
Cached:         226585112 kB
SwapFree:        8473852 kB
...
zzz <03/08/2022 03:20:42> Count:359
Cached:         245398456 kB
SwapFree:        7650044 kB

Here, we see page cache usage increase sharply between midnight and 2 AM, probably due to a backup application scheduled to run at that time. This can increase memory pressure on the system, and the system can start swapping if needed. Adding such backup processes to a memory-constrained cgroup ensures that they do not consume all available memory on the system, and thus do not affect other processes.

Related question #1: How do pages get demoted from the active LRU to the inactive LRU list?

Linux categorizes pages into two sets: anonymous pages, which are not file-backed (for instance, heap, stack, etc.) and file pages, which are pages in RAM that have a backing file on disk (for instance, libraries, data files, etc.).

These 2 categories are further divided into two lists: active and inactive, using the LRU (Least Recently Used) algorithm. The active list contains pages that have been recently referenced, and the inactive LRU list contains pages that have not been accessed in a while. If a page is accessed that’s in the inactive LRU list, it gets ‘promoted’ to the active list. Memory reclaim favors pages from the inactive LRU lists, as it’s not very optimal to evict pages actively in use. Similarly, it’s easier to evict clean file-backed pages since they’re already up to date on the disk, as opposed to reclaiming dirty file-backed pages (which would need to be written out first) or anonymous pages (which also need to be written out, but to swap).

Function shrink_active_list() moves pages from the active to the inactive LRU list; it is called by kswapd during page reclaim, typically when the inactive list is too small. By deactivating pages, it provides enough candidate pages for the reclaim algorithm to make progress.

It scans a batch of pages (nr_to_scan of them) at the tail of the active list and, if they can be deactivated, moves them to the inactive list. Pages that cannot be deactivated are rotated back to the head of the active list – this gives a page an extra trip around the active list before it is checked again for demotion.
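Before looking at the kernel code, the demotion/promotion dance can be pictured with a toy two-list model (pure illustration – the names and structure are ours, not the kernel's):

```python
# Toy model (not kernel code) of the active/inactive LRU mechanics:
# pages are demoted from the tail of the active list, and a reference
# promotes a page back from the inactive list to the active list.

from collections import deque

active, inactive = deque(["a", "b", "c", "d"]), deque()

def demote(nr_to_scan):
    for _ in range(min(nr_to_scan, len(active))):
        inactive.appendleft(active.pop())   # tail of active -> head of inactive

def touch(page):
    if page in inactive:                    # referenced while inactive:
        inactive.remove(page)
        active.appendleft(page)             # ...promote it back to active

demote(2)        # kswapd needs reclaim candidates
touch("d")       # page "d" is referenced again before eviction
print(list(active), list(inactive))
```

Page "d" survives because it was touched in time; page "c" stays on the inactive list and is now a reclaim candidate.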

static void shrink_active_list(unsigned long nr_to_scan,
                               struct lruvec *lruvec,
                               struct scan_control *sc,
                               enum lru_list lru)
{
...
        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
                                     &nr_scanned, sc, lru);
...
        while (!list_empty(&l_hold)) {
...
                if (page_referenced(page, 0, sc->target_mem_cgroup,
                                    &vm_flags)) {
                        /*
                         * Identify referenced, file-backed active pages and
                         * give them one more trip around the active list. So
                         * that executable code get better chances to stay in
                         * memory under moderate memory pressure.  Anon pages
                         * are not likely to be evicted by use-once streaming
                         * IO, plus JVM can create lots of anon VM_EXEC pages,
                         * so we ignore them here.
                         */
                        if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
                                nr_rotated += thp_nr_pages(page);
                                list_add(&page->lru, &l_active);
                                continue;
                        }
                }

                ClearPageActive(page);  /* we are de-activating */
                SetPageWorkingset(page);
                list_add(&page->lru, &l_inactive);
        }

        /*
         * Move pages back to the lru list.
         */
        nr_activate = move_pages_to_lru(&l_active);
        nr_deactivate = move_pages_to_lru(&l_inactive);
...
        __count_vm_events(PGDEACTIVATE, nr_deactivate);
        __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
...

The counter that tracks how many pages were deactivated (pgdeactivate) can be read from /proc/vmstat. It is a cumulative, global counter covering all deactivations since boot. If you monitor this file and see pgdeactivate going up, it is a sign that the system is shrinking the active lists because there are not enough inactive pages to reclaim.

Let’s look at inactive LRU list handling – how do pages get reclaimed from the inactive list? The core function that implements this logic is shrink_page_list(). It’s called here:

static unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                     struct scan_control *sc, enum lru_list lru)
{
...
        nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false);

...
        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
        item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
        if (!cgroup_reclaim(sc))
                __count_vm_events(item, nr_reclaimed);
        __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
        __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
...

There are more counters in /proc/vmstat one can monitor to understand current reclaim activity – pgsteal_kswapd and pgsteal_direct indicate how many pages were reclaimed by kswapd and by the direct reclaim flow, respectively, while pgsteal_anon and pgsteal_file indicate which LRU list those pages were reclaimed from.

shrink_page_list() is handed a set of candidate pages which are isolated from the inactive LRU lists (it could be inactive file or inactive anon). It performs a series of checks on each page to ultimately decide if the page can be freed, or if some other course of action is more appropriate. If the page was recently referenced, it is activated (i.e. added to the active LRU list). If a page is dirty, it is queued for writeback, and reclaim is deferred. For anon pages, swap space is allocated, and the mapping is freed. The best case scenario is when a page is clean and file-backed – the page is freed right away.

static enum page_references page_check_references(struct page *page,
                                                  struct scan_control *sc)
{
        int referenced_ptes, referenced_page;
        unsigned long vm_flags;

        referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                                          &vm_flags);
        referenced_page = TestClearPageReferenced(page);
...
        if (referenced_ptes) {
                /*
                 * All mapped pages start out with page table
                 * references from the instantiating fault, so we need
                 * to look twice if a mapped file page is used more
                 * than once.
                 *
                 * Mark it and spare it for another trip around the
                 * inactive list.  Another page table reference will
                 * lead to its activation.
                 *
                 * Note: the mark is set for activated pages as well
                 * so that recently deactivated but used pages are
                 * quickly recovered.
                 */
                SetPageReferenced(page);

                if (referenced_page || referenced_ptes > 1)
                        return PAGEREF_ACTIVATE;

                /*
                 * Activate file-backed executable pages after first usage.
                 */
                if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;

                return PAGEREF_KEEP;
        }

        /* Reclaim if clean, defer dirty pages to writeback */
        if (referenced_page && !PageSwapBacked(page))
                return PAGEREF_RECLAIM_CLEAN;

        return PAGEREF_RECLAIM;
}
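The decision tree above can be condensed into a few lines of Python (a sketch with simplified inputs; the names are ours, not the kernel's):

```python
# Condensed sketch of the decision logic in page_check_references().
# Inputs are simplified booleans/counts; return values mirror the
# PAGEREF_* outcomes in the kernel snippet above.

def page_references_decision(referenced_ptes, referenced_page,
                             vm_exec=False, swap_backed=False):
    if referenced_ptes:
        if referenced_page or referenced_ptes > 1:
            return "ACTIVATE"       # used more than once: promote
        if vm_exec and not swap_backed:
            return "ACTIVATE"       # executable file pages: keep them hot
        return "KEEP"               # spare it one more trip around inactive
    if referenced_page and not swap_backed:
        return "RECLAIM_CLEAN"      # clean file page: cheapest to evict
    return "RECLAIM"

print(page_references_decision(referenced_ptes=2, referenced_page=False))
```

This makes the hierarchy explicit: multiple references win activation, executable file pages get special treatment, and unreferenced clean file pages are the preferred reclaim targets.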

Related question #2: Is MemAvailable an accurate statistic for how much memory is readily available for allocation on my system?

Not always. MemAvailable is a heuristic and can be inaccurate for some workloads.

long si_mem_available(void)
{
...
        /*
         * Not all the page cache can be freed, otherwise the system will
         * start swapping. Assume at least half of the page cache, or the
         * low watermark worth of cache, needs to stay.
         */
        pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
        pagecache -= min(pagecache / 2, wmark_low);
        available += pagecache;
...

MemAvailable is just a heuristic – the calculation assumes that at least half the page cache can be easily reclaimed, which is not true if the page cache is hot. For instance:

zzz <03/08/2022 01:50:23> Count:359
MemAvailable:   204435292 kB
Active(file):   185688096 kB
Inactive(file): 11493160 kB

Most of the page cache is “active” or “hot” here. For more details about how MemAvailable is computed, see Why is MemAvailable sometimes less than MemFree?.
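The heuristic can be mirrored in a few lines of Python (simplified – the real si_mem_available() also accounts for reclaimable slab, and subtracts the full per-zone reserves rather than just the low watermark; the input values are illustrative, loosely based on the sample above):

```python
# Hedged sketch of the si_mem_available() heuristic shown above.
# All values are in kB; mem_free and wmark_low here are invented.

def mem_available(mem_free, active_file, inactive_file, wmark_low):
    pagecache = active_file + inactive_file
    pagecache -= min(pagecache // 2, wmark_low)  # assume half is reclaimable
    return mem_free - wmark_low + pagecache      # kernel subtracts full reserves

est = mem_available(mem_free=7_000_000, active_file=185_688_096,
                    inactive_file=11_493_160, wmark_low=1_000_000)
print(est)  # optimistic if Active(file) is actually hot
```

Note that the estimate counts nearly the entire page cache as available, even though in the sample above almost all of it is on the active list – that is precisely where the heuristic breaks down.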

How does the kernel decide which processes to swap out?

Processes aren’t chosen to be swapped out – rather, individual pages are chosen to be swapped out/evicted, based on a host of parameters, including how recently each was last accessed/referenced. The reclaim algorithm favors pages that have a low cost of reclaim. A file-backed page’s reclaim cost depends on whether it’s clean or dirty (dirty pages have to be written out to disk before they can be evicted, making them costly due to the extra I/O), plus the refault cost (if an evicted page is needed again, it must be read back in, generating more I/O). For pages not backed by a file (i.e. anonymous pages), there is no way to avoid I/O: they must be written out to swap before being freed. The rate of refault (reading pages back in from swap) also increases the reclaim cost of anon pages. Which process an anon page belongs to is irrelevant – what matters is that the page has not been referenced in a while, and so it is evicted so that the memory can be reused elsewhere.

My system is swapping more after a kernel upgrade to Oracle UEK7, even though the workload is the same. Why?

There have been some optimizations merged in the upstream kernel (version 5.15) that affect reclaim behavior and swap usage. Among them (and this is not a comprehensive list):

d483a5dd009a mm: vmscan: limit the range of LRU type balancing
96f8bf4fb1dd mm: vmscan: reclaim writepage is IO cost
7cf111bc39f6 mm: vmscan: determine anon/file pressure balance at the reclaim root
314b57fb0460 mm: balance LRU lists based on relative thrashing
264e90cc07f1 mm: only count actual rotations as LRU reclaim cost
fbbb602e40c2 mm: deactivations shouldn't bias the LRU balance
1431d4d11abb mm: base LRU balancing on an explicit cost model
a4fe1631f313 mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count()
968246874739 mm: remove use-once cache bias from LRU balancing
34e58cac6d8f mm: workingset: let cache workingset challenge anon
6058eaec816f mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
c843966c556d mm: allow swappiness that prefers reclaiming anon over the file workingset
497a6c1b0990 mm: keep separate anon and file statistics on page reclaim activity
5df741963d52 mm: fix LRU balancing effect of new transparent huge pages

These changes mean that memory reclaim no longer treats swap as the last-resort behavior of a system under pressure – swapping is part of the normal reclaim flow, especially if page cache pages are being refaulted in at a high frequency. This upstream patchset makes the system utilize swap space more, even under “normal” memory pressure, if the page cache is hot.

Let’s look at some of these statistics, from /proc/vmstat:

workingset_nodes 1265381
workingset_refault_anon 201574
workingset_refault_file 181043188
workingset_activate_anon 79184
workingset_activate_file 25598943
workingset_restore_anon 2066
workingset_restore_file 7334788
workingset_nodereclaim 171392
...
pswpin 201575
pswpout 4522469
...
pgsteal_kswapd 246507286
pgsteal_direct 253820
pgdemote_kswapd 0
pgdemote_direct 0
pgscan_kswapd 269136889
pgscan_direct 278885
pgscan_direct_throttle 0
pgscan_anon 26736135
pgscan_file 242679639
pgsteal_anon 4493398
pgsteal_file 242267708
...

Some observations:

  • workingset_refault_file is high – it indicates that the page cache pages are hot, and they are being refaulted in at a high frequency after being reclaimed. This will result in increased pressure on anon LRU list (and therefore swap).
  • workingset_restore_file tracks how often pages that are about to be reclaimed are restored back to the working set (because the pages were referenced again before being evicted) – this value is high, indicating that a lot of pages in the inactive(file) LRU list are not really good candidates for reclaim, and they’re promoted back to the active list quickly. Both these counters are indicative of a hot page cache that is not suitable for reclaim.
  • There is high kswapd activity (indicated by pgsteal_kswapd, pgscan_kswapd).
  • There seems to be relatively little direct reclaim, which is good (pgscan_direct, pgsteal_direct).

My swap utilization is close to 100% – will the OOM-killer be invoked now? Should I be concerned about system stability?

Usually no. High swap utilization by itself does not trigger OOM; what matters is whether the current allocation can make reclaim/compaction progress. If there is sufficient reclaimable memory in the page cache, OOM is less likely, but there are exceptions (for example, memcg OOM, strict cpuset/mempolicy constraints, or high-order/GFP constraints).

The OOM killer is invoked only when the kernel has tried repeatedly to reclaim memory in order to satisfy an allocation request, and has failed. Let’s look at the relevant code snippet – __alloc_pages_slowpath() is the function that deals with allocations that cannot be satisfied right away. It wakes up kswapd, tries to reclaim memory, then compacts it (to ensure memory is not too fragmented for higher-order allocations), and then retries the allocation.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
...
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);
...
        /*
         * For costly allocations, try direct compaction first, as it's likely
         * that we have enough base pages and don't need to reclaim. For non-
         * movable high-order allocations, do that as well, as compaction will
         * try prevent permanent fragmentation by migrating from blocks of the
         * same migratetype.
         * Don't try this for allocations that are allowed to ignore
         * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
         */
        if (can_direct_reclaim && can_compact &&
                       (costly_order ||
                           (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
                        && !gfp_pfmemalloc_allowed(gfp_mask)) {
                page = __alloc_pages_direct_compact(gfp_mask, order,
                                                alloc_flags, ac,
                                                INIT_COMPACT_PRIORITY,
                                                &compact_result);
...
        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;
...
        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL and we can compact
         */
        if (costly_order && (!can_compact ||
                             !(gfp_mask & __GFP_RETRY_MAYFAIL)))
                goto nopage;

        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
...
        /* Reclaim has failed us, start killing things */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
                goto got_pg;

        /* Avoid allocations with no watermarks from looping endlessly */
        if (tsk_is_oom_victim(current) &&
            (alloc_flags & ALLOC_OOM ||
             (gfp_mask & __GFP_NOMEMALLOC)))
                goto nopage;

        /* Retry as long as the OOM killer is making progress */
        if (did_some_progress) {
                no_progress_loops = 0;
                goto retry;
        }

As you can gather from the snippets and comments above, the kernel tries very hard to satisfy an allocation request. As long as the background reclaim (done by kswapd) is making some progress, the kernel tries to compact and allocate. As long as there’s memory used by the page cache, it’s very much available to be reclaimed, however slowly. If swap space fills up, the system can no longer optimize for file I/O and will start evicting the page cache – even writing out dirty pages to disk so they can be evicted, and shrinking slab caches, etc. It only invokes the OOM-killer when it has run out of all options. In that case, instead of the entire system grinding to a halt, unable to make progress due to lack of memory, it selects one “victim” process and kills that, hopefully freeing up enough memory to ease the pressure on the system so that the rest of the processes can continue.

In short, if reclaim/compaction keeps making progress, OOM is unlikely. If the page cache has been almost completely evicted, and swap space is 100% full, and the free memory is below the per-zone low watermarks (or the system is too fragmented and compaction fails) – that’s when the system is in trouble and the OOM-killer jumps in.

When should I be concerned about swapping? When is swapping bad?

Swapping is often seen as one of the first symptoms of memory pressure, portending worse things to follow. That is largely a myth. Perhaps this was true back when memory was scarce and disks were super slow, and the system would actively avoid paging data out to disk unless it had no other choice.

That is not the case anymore. With terabytes of RAM and SSDs commonplace now, swap is not a necessary evil – it is simply one of the mechanisms for making sure the system operates at its maximum efficiency. Memory reclaim chooses the best candidate pages to evict for optimal system performance, in many cases keeping active page cache pages in memory while moving idle pages out to swap.

Even if swap space has been used up 100%, it’s not a reason to be alarmed; see the previous question (My swap utilization is close to 100% – will the OOM-killer be invoked now? Should I be concerned about system stability?). It’s when there is constant swapping in and swapping out – also known as thrashing – that the system is in trouble. Thrashing happens because the available RAM is not enough to satisfy the memory demands of the workload. Due to sustained memory pressure, pages are swapped out (or evicted to disk), but those are still part of the active working set. The processes continue to read/write to those pages, which then results in them being swapped in again, which increases memory pressure (as those new pages have to go somewhere), which results in other active pages being swapped out, and so on. This will result in performance degradation for all applications, and the system will slowly grind to a halt.
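A quick way to distinguish harmless swap usage from thrashing is to watch the rate of change of the swap counters, not SwapFree itself. With the vmstat tool these are the si/so columns; the same raw counters are available in /proc/vmstat:

```shell
# Cumulative pages swapped in/out since boot:
grep -E '^pswp(in|out) ' /proc/vmstat
sleep 2
grep -E '^pswp(in|out) ' /proc/vmstat
# Occasional growth in pswpout alone is normal optimization.
# Both counters climbing steadily between samples means pages are
# being swapped out and immediately needed again - thrashing.
```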

Are there any kernel sysctl knobs that can affect swapping on my system?

There are a few knobs that affect memory reclaim behavior – which indirectly affects swap.

vm.swappiness

This is the primary swap-related tunable that controls how aggressively the kernel reclaims file pages vs. anonymous memory – the latter gets evicted to swap space. The value can range from 0 to 200, with 60 being the default. Lowering vm.swappiness generally makes the reclaim algorithm favor page cache eviction over swapping, while increasing it favors evicting process memory to swap and keeping the page cache around a little longer.

Note: Setting vm.swappiness=0 can increase OOM risk on some workloads by greatly reducing anonymous pages reclaimed.
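Inspecting and changing the setting is straightforward (the value 30 below is purely illustrative, not a recommendation):

```shell
# Current value; 60 is the default on most distributions:
cat /proc/sys/vm/swappiness

# Change at runtime (requires root), e.g. via sysctl:
#   sysctl -w vm.swappiness=30
# Persist across reboots with a sysctl.d drop-in:
#   echo 'vm.swappiness = 30' > /etc/sysctl.d/99-swappiness.conf
```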

vm.min_free_kbytes

This directly affects the zone watermark values – which control how much memory the kernel sets aside as “reserved” – i.e. normal allocations will not be able to dip into this pool. This in turn influences when memory reclaim starts and how long it runs for, which will affect how much swap space gets utilized. If this value is set too low, the system will not have enough reserve memory to handle allocation bursts. The kernel will struggle to do reclaim, compaction, dirty page writeback, etc. Even the memory reclaim flow might need to allocate memory (e.g. to migrate pages), which will stall (in direct reclaim) or fail if this setting is too low. On the other hand, if this is set too high, all that memory is set aside as reserved, which cannot be used for regular allocations. Also, this increases the zone watermarks, which means kswapd will run for longer, evicting more pages from memory and swapping more too. It is not advisable to tune this setting unless there is evidence that the current setting is suboptimal for the workload.
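The current reserve, and the per-zone watermarks derived from it, can be inspected directly (exact /proc/zoneinfo formatting varies slightly by kernel version):

```shell
# The global reserve, in KiB:
cat /proc/sys/vm/min_free_kbytes

# The derived per-zone min/low/high watermarks, in pages:
awk '/^Node/ {zone=$0} /^ +(min|low|high) / {print zone ":", $1, $2}' /proc/zoneinfo
```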

vm.compaction_proactiveness

This knob controls how aggressively background compaction is done, by kcompactd. If compaction is successful, this will reduce fragmentation on the system, thus increasing the likelihood that a higher order allocation request will succeed. This will help reduce kswapd’s run time.
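Fragmentation itself can be observed via /proc/buddyinfo, where column N counts free blocks of 2^N contiguous pages. Note that vm.compaction_proactiveness exists only on kernels 5.9 and newer (default 20, range 0-100):

```shell
# Proactive compaction aggressiveness (absent on pre-5.9 kernels):
cat /proc/sys/vm/compaction_proactiveness 2>/dev/null || echo "not available"

# Free chunks per order: large counts on the left (small chunks) and
# zeros on the right (large chunks) indicate a fragmented system
# that compaction could improve:
cat /proc/buddyinfo
```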

vm.watermark_boost_factor and vm.watermark_scale_factor

Like vm.min_free_kbytes, these affect the zone watermarks, altering reclaim aggressiveness, which in turn can increase swap usage.

vm.drop_caches

This sysctl is not a tunable: writing to it triggers a one-time reclaim action that frees clean cached memory, temporarily reducing memory pressure on the system.

  • Writing 1 to this parameter will drop reclaimable page cache.
  • Writing 2 will free up reclaimable dentry and inode slab caches.
  • Writing 3 will free up both slab caches and page cache memory.

⚠️WARNING: Please do not do this on production systems without understanding the performance impact. Dropping the page cache on a system whose heavy workload is actively using that cache is not a good idea. This is a blunt hammer that will evict active and inactive pages alike – regardless of the performance hit to the applications. Other considerations:

  • While the kernel is processing the drop_caches write, the system could experience a performance brownout or stalls.
  • Any memory it frees up will usually be temporary – the system will read all those pages right back in if the workload demands it. It will not fix the real problems on the system, if any.
  • Due to the re-faulting of those evicted pages, some applications can see higher latencies or performance drops after page cache is dropped.
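If you do decide to drop caches on a test system, flush dirty data first, since drop_caches only discards clean pages (the writes below require root and are shown commented out):

```shell
# Write out dirty pages first so more of the cache is clean and
# therefore droppable (drop_caches never writes dirty data back):
sync

# Then, as root:
#   echo 1 > /proc/sys/vm/drop_caches   # page cache only
#   echo 2 > /proc/sys/vm/drop_caches   # dentry/inode slab caches
#   echo 3 > /proc/sys/vm/drop_caches   # both

# Observe the effect by comparing before vs. after:
grep -E '^(MemFree|MemAvailable|Cached):' /proc/meminfo
```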

There are many more kernel tunables that affect memory reclaim behavior, which could affect swap aggressiveness. Please consider all the pros and cons of these tunables carefully before changing them from their default values. It is not advisable to change these without expert recommendations; doing so could have surprising and undesirable consequences.

I would like my system to use less swap. How do I achieve that?

Reducing swap usage in itself is not a worthy goal, unless there are some negative effects or performance issues. Typically, on a well planned system where the workload does not exceed the memory capacity, there should not be heavy swapping or thrashing. If you just see some swap usage now and then, that is completely normal. If SwapFree continues to go down (and stay down), that is also normal.

As we’ve discussed earlier in this document, there are a few things that can contribute to increased swapping even on a healthy system. Here are a couple of things to check:

  • Check if there’s NUMA imbalance.
  • If page cache growth is driven mostly by read-once files (from backups, security scanners, etc.), consider using cgroup limits for those applications so they don’t use too much memory.
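On a systemd-based system with cgroup v2, one way to sketch such a limit is a transient scope with a memory.high cap; the backup.sh command and the 2G limit below are placeholders, not recommendations:

```shell
# Run the read-once workload under a memory.high cap so the kernel
# reclaims its own (mostly clean, read-once) page cache first,
# instead of pressuring the rest of the system (requires root):
#   systemd-run --scope -p MemoryHigh=2G ./backup.sh

# Confirm cgroup v2 (and therefore memory.high) is available:
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || echo "cgroup v2 not mounted"
```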

⚠️WARNING: Do not run swapoff -a and swapon -a in an attempt to free up swap. This will not help. If the system is already running low on memory and you turn off swap, you’re forcing the data that was swapped out to be read back into memory, which could cause a crash/reboot.

There is no advantage in trying to increase SwapFree to be closer to SwapTotal. There’s no benefit in trying to not use swap, if it’s enabled. It’s good for overall system performance to let the kernel decide what pages it wants to keep in memory and what pages it wants to evict. There’s no need to try to change that algorithm unless you’re running into performance issues, or the system is thrashing.

What are the latest updates in swap, in upstream Linux?

On newer kernels, Multi-Generational LRU (MGLRU) may be used rather than just two LRU lists (active and inactive), depending on kernel version, configuration and runtime settings. MGLRU was merged upstream in Linux 6.1 and has continued to evolve through the 6.x series, with the goal of improving page reclamation. It categorizes pages into multiple generations based on how recently they were accessed – pages in older generations have not been accessed recently, whereas pages in newer generations are more active. The kernel can then reclaim older pages much more efficiently and reduce the rate of refaults. Benchmarks have shown this to improve the efficiency of kswapd, as well as reducing working set refaults significantly.

This should (theoretically at least) reduce unnecessary swapping, since there is better accuracy in identifying truly inactive/idle pages. Note that MGLRU will not eliminate swapping; for instance, if most of the pages in the oldest generation belong to idle processes, and the page cache pages are one generation newer, those process pages will be evicted first – i.e. swapped – before the page cache pages are reclaimed. That is still the optimal thing to do for the system overall. The goal is to evict truly cold pages from memory, and MGLRU brings more accuracy in identifying those pages, thus reducing refaults/swap-ins of evicted pages.
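On kernels built with CONFIG_LRU_GEN (generally 6.1 and newer), MGLRU exposes a runtime switch under /sys; the file is simply absent on older kernels:

```shell
# A non-zero bitmask (e.g. 0x0007) means MGLRU is active;
# 0x0000 means it is built in but disabled:
cat /sys/kernel/mm/lru_gen/enabled 2>/dev/null || echo "MGLRU not available on this kernel"

# Enable or disable at runtime (requires root):
#   echo y > /sys/kernel/mm/lru_gen/enabled
#   echo n > /sys/kernel/mm/lru_gen/enabled
```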

Conclusion

What we have shown in this blog is that swap usage is no longer a bellwether of system problems. Swap space is used to improve performance, by freeing up real memory for active processes and the page cache. Alerts that monitor swap free or swap used should be changed to look for increased and near-constant periods of active swapping – i.e. the si and so fields from vmstat. If you do not see this and there are no performance issues, there is nothing to worry about w.r.t. swap space usage. Low SwapFree by itself is not a signal of impending disaster. The default vm.swappiness does favor reclaiming page cache over swapping when possible, but, at the same time, swapping will happen in parallel with page cache eviction. If you want to reduce swap usage, consider using cgroup limits to constrain applications like backups so that they do not end up using all available memory for caching file data.