Introduction

The Linux operating system is designed to use as much of the available memory as possible to cache data, for as long as possible, with the expectation that the cached entries can be efficiently reused in the future. Under normal operation, memory pressure forces cached entries out as new items are created. Linux also provides a user space interface (drop_caches) that allows the system administrator to drop specific groups of cached entries.

At Oracle Linux, we don’t encourage the use of drop_caches on production systems; nevertheless, there are some workloads where customers find it expedient to use drop_caches. We undertook this evaluation to understand whether using drop_caches is harmless, or whether it could have negative effects on a running system. If drop_caches is required on production systems, the best practice is to drop the page cache first, which is fairly safe, and only drop the slab caches if necessary.

Glossary of terms

Superblock

Each active filesystem is represented by a special entry called the superblock. The variable “sb” is the common kernel name for a pointer to a specific superblock.

Inode

Every Linux file, directory, symbolic link and special device is represented in the kernel by an entry called the inode. In this blog, the inode refers to a file’s inode.

Dentry

A directory entry (dentry) is a kernel entry that contains the file name and pointers to the parent, sibling and child entries, which speeds up the file lookup process. A dentry may be valid, inactive or negative. A valid dentry has active references, and it becomes inactive when the last active reference is removed. A negative dentry is a directory entry that did not match a valid file name during the lookup process. Inactive and negative dentries are linked on a per-superblock list_lru.

Page cache

The page cache is a collection of memory pages holding buffered IO data for a file. A valid page holds active data, for example data read from storage. A dirty page holds data that has not yet been written back to storage.

Control Group

A control group (cgroup) is a Linux kernel feature that groups resources (such as CPU, disk IO and memory) into a hierarchical management unit for accounting, limiting and isolation purposes. This blog is interested in memory control group (memcgroup) items.

XFS terms

The XFS filesystem stores metadata in special buffers called xfs_buf. In XFS, metadata has a life cycle designed to accumulate changes in memory while ensuring the data reaches disk so the filesystem can recover from system crashes.

The XFS filesystem stores quota metadata in special entries (xfs_dquot), and these entries reside in an xfs_buf for reading from and writing to storage.

EXT4 terms

In a filesystem, a group of sequential blocks that contains related data or metadata is called an extent. EXT4 maintains each file’s extent list in the extent_status tree.

The EXT4 and OCFS2 filesystems use a journal subsystem called jbd2. A filesystem journal protects metadata from corruption when the system crashes unexpectedly.

About drop_caches

When should I use drop_caches?

Typically, drop_caches is only used to work around bugs in the kernel. When possible, it is always better to fix the bug in the kernel than to suppress the symptoms. We have seen cases where XFS accumulates a large number of inodes, where negative dentries build up, or where other internal kernel data structures grow excessively.

Buffered IO pages remain in the page cache after being written to storage. The administrator may want to ensure that the buffered IO pages are written to disk and removed from the page cache, either to free memory or when running timing tests that must read from disk instead of the page cache. To drop more pages, call sync(1) before dropping the page cache. The sync(1) command writes the outstanding unwritten buffered IO pages to permanent storage and inactivates those pages, so they become eligible to be dropped by the shrinker.
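
For example, a minimal sequence (run as root) that first writes out the dirty buffered IO pages and then drops the now inactive page cache pages:

# sync
# echo 1 > /proc/sys/vm/drop_caches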

When a memory control group (memcgroup) is removed from userspace, prior allocations may still hold references on it. Although the removed memcgroup can no longer be seen from userspace, the page references prevent the memory used by the cgroup from being freed, so it becomes a zombie cgroup. The page cache option of drop_caches will clear these zombie memcgroup entries. Oracle Linux UEK7, starting with v5.15.0-305.171.1, includes improvements for some of the zombie memcgroup page cache issues.
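
One rough way to check for zombie memcgroups, assuming a cgroup v1 layout with the memory controller mounted at /sys/fs/cgroup/memory, is to compare the kernel’s cgroup count against the directories visible from userspace:

# grep memory /proc/cgroups
# find /sys/fs/cgroup/memory -type d | wc -l

If the num_cgroups column reported by /proc/cgroups is much larger than the number of visible directories, zombie memcgroups are likely accumulating.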

There are many ways to misspell a file name. Since the unmatched file name is cached per directory, operations like the shell’s PATH expansion can quickly create negative dentries in multiple directories. A dentry has a reference count and becomes inactive when the last reference is released. Negative and inactive dentries can remain on a list_lru for an extended period of time before being freed. The long lifespan of negative dentry entries can cause memory fragmentation, long dentry cache traversals and high reclaim times. Currently, there is an enhanced UEK change that can limit negative dentry memory consumption to a percentage of the total system memory, but it does not help with excessive entries in an individual directory. Therefore, excessive negative dentry entries can still result in long traversal times and may lead to soft lockup issues.
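
The size of the dentry cache can be checked from the dentry-state file; the fields are nr_dentry, nr_unused, age_limit, want_pages and, on newer kernels, nr_negative (the negative dentry count is not reported on older kernels):

# cat /proc/sys/fs/dentry-state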

So we encourage you to reach out to the kernel team at Oracle Linux if you have a real use case for drop_caches, as it is probably a bug that needs to be fixed. For example, an underlying problem in systemd related to excessive negative dentries in a filesystem using inotify() functions was discovered and fixed in UEK. Additionally, the adoption into Oracle Linux of the obj_cgroup API fix, which prevents long-lived objects from pinning the original memory cgroup in memory, is an example of leveraging internal and community derived ideas to benefit our Enterprise customers.

How to drop caches

To drop the desired caches, write the appropriate integer command mask to the special procfs file, /proc/sys/vm/drop_caches.

To drop the page cache:

# echo 1 > /proc/sys/vm/drop_caches

To drop the slab (dentry, inode) caches:

# echo 2 > /proc/sys/vm/drop_caches

To combine the masks to drop both page cache and slab cache at the same time:

# echo 3 > /proc/sys/vm/drop_caches

Oracle Linux suggests dropping the page cache and the slab caches individually, and this blog looks at each of those operations in turn.
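
As an illustrative (not mandatory) sequence, each cache can be dropped in turn while watching /proc/meminfo to see how much memory each step releases:

# grep -E 'MemFree|^Cached|Slab|SReclaimable' /proc/meminfo
# sync
# echo 1 > /proc/sys/vm/drop_caches
# grep -E 'MemFree|^Cached|Slab|SReclaimable' /proc/meminfo
# echo 2 > /proc/sys/vm/drop_caches
# grep -E 'MemFree|^Cached|Slab|SReclaimable' /proc/meminfo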

The drop caches routines in UEK 5, UEK 6 and UEK 7 are functionally the same.

This blog evaluates stability related issues (potential to hang, lockup, deadlock, panic, etc.). The performance impact depends on the particular application workload and is out of the scope of this blog.

Evaluate the dropping of page caches

The goal of dropping the page caches is to invalidate and free every inactive (not locked, not dirty, not in writeback and not mapped) page in every inode’s address space in every filesystem. The routine intentionally limits the search to inactive pages allocated to a valid inode. Since the desired pages are inactive page cache pages, this routine does not interact with active buffered IO pages.

The page cache drop routine briefly holds the sb_lock spin lock to read the next superblock, and holds a read lock on the sb->s_umount semaphore while processing every inode on that filesystem, to prevent an unmount or a freeze of the filesystem.

For each superblock, the sb->s_inode_list_lock spin lock is briefly held to read the next inode on the superblock list.

For each inode, the inode->i_lock spin lock is held just long enough to verify that the inode is neither being created nor being freed and that the inode’s address space holds pages. With the superblock and inode spin locks released, each page in the current inode’s address space is checked, and each inactive page is removed from the address space and freed.

The shrinker uses cond_resched() to instruct the scheduler to check for another process that needs to use the processor.

Pseudo code of drop page cache

iterate_supers(sb)
 spin_lock(&sb_lock)
 iterate for each superblock (sb) loop
  spin_unlock(&sb_lock)
  down_read(&sb->s_umount)
  drop_pagecache_sb()
   spin_lock(&sb->s_inode_list_lock)
   list_for_each_entry(inode, &sb->s_inodes)
    spin_lock(&inode->i_lock)
    if inode->i_state is finished being created and not freeing and
       inode->i_mapping has pages
     take a reference on the inode
     spin_unlock(&inode->i_lock)
     spin_unlock(&sb->s_inode_list_lock)
     invalidate_mapping_pages(of the entire inode address_space range)
      loop through the address space indexes
       grab array of inactive pages in the range (locks the pages, max 15 pages)
       loop through the array of pages
        invalidate_inode_page()
         if page is not dirty and not in writeback and not mapped
           invalidate page
        unlock page
       cond_resched()
      end of looping through the address space
     remove reference to inode
     cond_resched()
     spin_lock(&sb->s_inode_list_lock)
   end of list_for_each_entry loop
  end of drop_pagecache_sb()
  up_read(&sb->s_umount)
  spin_lock(&sb_lock)
 end of iterate the superblocks loop
 spin_unlock(&sb_lock)
end iterate_supers(sb)

Dropping page_cache conclusion

The dropping of the page cache is limited to inactive buffered IO pages for each inode. The operation holds a read lock on the sb->s_umount semaphore and could block a freeze/umount operation for a long period of time, which may trigger a hung task warning. It may be wise not to freeze or unmount the filesystem while dropping the page cache. If a freeze or unmount of the filesystem could happen while the page cache is being dropped, disable the hung_task_panic sysctl while dropping page caches. The other locks in this shrinker are held only for very short periods and will not cause any lockup issues.
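
If a freeze or unmount cannot be ruled out, one approach (a sketch, to be adapted to local policy) is to note the current setting, disable the panic behavior while dropping the cache, and restore the original value afterwards:

# sysctl kernel.hung_task_panic
# sysctl -w kernel.hung_task_panic=0
# echo 1 > /proc/sys/vm/drop_caches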

Evaluate the dropping of slab caches

The dropping of slab cache entries is performed by calling each registered kernel shrinker for each NUMA node and, where applicable, for each memory control group. Various Linux subsystems register a shrinker; the shrinkers are stored on a common list that is protected by a common semaphore.

Each shrinker has two special functions. The first returns a count of the number of entries available to be reclaimed. The second attempts to reclaim a specified number of entries. The idea is to call the count function to get the number of reclaimable entries and use that number in the call that does the actual reclaim. The shrinker typically runs the reclaim function in batches of up to 128 entries. Using smaller batches and calling cond_resched() to yield the processor prevents monopolizing the processor for too long.

Subsystems using the shrinker API use a list_lru, which is NUMA aware and optionally memory control group (memcgroup) aware. For example, the superblock dentry and inode caches are memcgroup aware, but the XFS xfs_dquot and xfs_buf list_lru are not.
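
Before dropping the slab caches, it can be useful to see which slab caches are actually holding memory; for example (cache names depend on the filesystems in use), slabtop(1) or /proc/slabinfo can be consulted:

# slabtop -o -s c | head -20
# grep -E 'dentry|inode_cache|xfs_inode' /proc/slabinfo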

Pseudo code of drop slab

for_each_numa_node(node)
  for_each_memcg()
   # walk the shrinkers:
   down_read_trylock(&shrinker_rwsem)
   loop for each entry in the shrinker_list
    create a shrink_control structure with gfp mask, NUMA node, memcgroup
    do_shrink_slab()
     call the shrinker->count_object()
     loop using the shrink count value
      call the shrinker->scan_objects() at most batch count
      adjust the count processed
      cond_resched()
   end shrinker_list loop
   up_read(&shrinker_rwsem)
   cond_resched()
  end for_each_memcg
 end for_each_numa_node

There are many subsystems in the Linux kernel that interface with the shrinker API. The core shrinker design is to count the number of items to shrink, loop through those items, and periodically call cond_resched() so other kernel and user threads can run with reasonable response times.

Filesystem Shrinker

Each mounted superblock has a shrinker for unused dentry and inode cleaning.

XFS has shrinkers to release unreferenced XFS quota (xfs_dquot) and XFS buffer (xfs_buf) entries.

EXT4 has a shrinker to release unused extents and journal items.

The Superblock shrinkers

Superblock shrinkers use a common API with routines to lock, add, remove and iterate over the items on a list_lru structure. Each mounted filesystem has a superblock shrinker that knows how to reclaim unreferenced dentry and inode entries. A filesystem may have additional special purpose shrinkers. The count routine for the superblock shrinker also uses the tunable sysctl_vfs_cache_pressure to scale the aggressiveness of the unreferenced dcache and inode pruning. The default value of 100 does not change the count of entries to be cleaned, a value of 10 would clean only 10% of the unreferenced dcache and inode entries, and a value of 1000 would try to clean 10 times the number of available unreferenced dentry and inode list_lru entries. Cleaning more than the originally available entry count allows the shrinker to keep removing entries that are added after the initial call, at the cost of more kernel time spent shrinking.
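
The tunable is exposed through sysctl as vm.vfs_cache_pressure; as a sketch, reading it and lowering it to make dentry and inode reclaim less aggressive (the value 50 is only illustrative) looks like this:

# sysctl vm.vfs_cache_pressure
# sysctl -w vm.vfs_cache_pressure=50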

Dentries and inodes are placed on list_lru lists when they are no longer active. This is the metadata equivalent of the inactive buffered IO pages in the page cache. These available entries are freed under memory pressure or by the shrinkers.

The unreferenced dentry and inode lru cache reclaim follows a similar two step process. First, the list_lru is walked and each entry’s spin lock (dentry->d_lock or inode->i_lock) is taken briefly to check the entry’s state. If the entry is still valid for reclaim, it is removed from the list_lru and moved to a dispose list. The second step walks the dispose list and frees each entry. The second step can take longer, but it no longer depends on the superblock lists and locks.

Most dentries point to a parent directory dentry. Removing a dentry decreases the reference count on the parent dentry, which could result in the parent dentry becoming unused. The dentry reference decrement is done by walking up the parent chain, adjusting locks, decrementing the reference counts and removing dentries as needed.

Inode entries that are disposed of may trigger filesystem activity such as writing out data blocks, trimming speculative preallocation blocks, and writing metadata to the filesystem.

Pseudo code of superblock shrinker

super_cache_scan()
 down_read_trylock(&sb->s_umount)
 fs_objects = sb->s_op->nr_cached_objects(sb)
 inodes = list_lru_shrink_count(&sb->s_inode_lru)
 dentries = list_lru_shrink_count(&sb->s_dentry_lru)
 prune_dcache_sb(sb)
  walk the sb->s_dentry_lru list_lru
   dentry_lru_isolate()
    (move inactive dentries to dispose list, see below)
   end dentry_lru_isolate
   shrink_dentry_list()
    (remove inactive dentries on dispose list, see below)
   end shrink_dentry_list()
 end prune_dcache_sb()
 prune_icache_sb(sb)
  walk the sb->s_inode_lru list_lru
  inode_lru_isolate()
   (move inactive inodes to dispose list, see below)
  end inode_lru_isolate()
  dispose_list()
   (remove the inactive inode)
   evict_inode()
    dispose_inode()
     wait for inode IO to complete
      s_op->destroy_inode()   this may call back into the filesystem
      xfs_fs_destroy_inode() for example xfs call
   cond_resched()
  end of dispose_list()
 end prune_icache_sb(sb)
 up_read(&sb->s_umount)
end super_cache_scan()

# The dentry is not active and is on the LRU. Function moves inactive dentries to dispose list
# another loop will walk the dispose list and free the dentry entries (see below)
dentry_lru_isolate()
 try to take the dentry->d_lock spin lock
  skip dentry if already locked
 if dentry->d_lockref.count != 0
  free dentry->d_lock spin lock
 else
  if dentry->d_flags & DCACHE_REFERENCED
   remove the DCACHE_REFERENCED flag
   free dentry->d_lock spin lock
   put the entry at the end of the sb->s_dentry_lru
  else
   put the dentry on the dispose list
   free dentry->d_lock spin lock
end dentry_lru_isolate()

# Free the entries in the dentry dispose list
shrink_dentry_list()
 loop through the dentry on the dispose list
 take dentry->d_lock spin lock
 take the rcu read lock
 shrink_lock_dentry()
  dentry->d_lockref.count is 0
  dentry->d_inode lock can be taken
  dentry parent lock can be taken
 if the above is not true
  rcu_read_unlock()
  skip this dentry
 else the above is true
  rcu_read_unlock()
  d_shrink_del()
   list_del_init(&dentry->d_lru)
  end d_shrink_del
  __dput_to_list(parent)
   decrement the parent's d_lockref.count
   if the parent's d_lockref.count is zero
    d_shrink_add()
     add the parent to the dispose list to be removed in a future iteration
    end d_shrink_add()
 end __dput_to_list()
 __dentry_kill(dentry)
 end shrink_dentry_list()

# The inode is on the LRU so move inactive inodes to dispose list
inode_lru_isolate()
 try to take the inode->i_lock spin lock
  skip inode if already locked
 if inode->i_count or inode->i_state & ~I_REFERENCED
  free inode->i_lock and skip inode
 else
  if inode->i_state & I_REFERENCED
   remove the I_REFERENCED flag
   free inode->i_lock
   put the entry at the end of the sb->s_inode_lru
  else if inode->i_data.private_list is not empty
    take a reference on the inode (__iget())
    free inode->i_lock and lru_lock
    remove the inode buffers
    release a reference on the inode (iput())
    retry the inode
  else
   inode->i_state |= I_FREEING
   put the inode on the dispose list
   free inode->i_lock spin lock
end inode_lru_isolate()

The xfs_buf shrinker

The xfs_buf shrinker walks the non memcgroup aware list_lru (xfs_buftarg->bt_lru) and puts xfs_buf entries that are not busy (not referenced and unlocked) on a dispose list. The xfs_buf lock could already be taken if the buffer is being written in the background or if the buffer is being reallocated in a new transaction. The second phase walks the dispose list to perform the actual freeing.

Pseudo code of xfs_buf shrinker

# isolate unreferenced xfs_buf on xfs_buftarg bt_lru list_lru to a dispose list
# walk the dispose list and free the xfs_buf
xfs_buftarg_shrink_scan()
 iterate the xfs_buf on the xfs_buftarg->bt_lru
  xfs_buftarg_isolate()
   (xfs_buf that are not referenced are put on a dispose list)
   if can't take the xfs_buf->b_lock spin lock
    skip this xfs_buf
   else
    decrement the xfs_buf->b_lru_ref
    if xfs_buf->b_lru_ref is not 0
     unlock xfs_buf->b_lock spin lock
     put at the end of xfs_buftarg->bt_lru for future scan
    else
     remove xfs_buf from xfs_buftarg->bt_lru and put on the dispose list
     unlock xfs_buf->b_lock spin lock
  end xfs_buftarg_isolate()
  iterate the xfs_buf on the dispose list
   remove xfs_buf from the dispose list
   xfs_buf_rele()
    take the xfs_buf->b_lock spin lock
     decrement the xfs_buf->b_hold
     if xfs_buf->b_hold == 0
     take the appropriate xfs_perag->pag_buf_lock lock
     remove the xfs_buf from AG
     release xfs_perag->pag_buf_lock lock
     xfs_buf_free()
      frees any pages, maps
      unlock xfs_buf->b_lock spin lock
      free the xfs_buf structure
     end xfs_buf_free()
   end xfs_buf_rele()
  end of iterate the xfs_buf dispose list
end xfs_buftarg_shrink_scan()

The xfs_qm (quota) shrinker

The xfs_dquot shrinker walks the non memcgroup aware list_lru (xfs_quotainfo->qi_lru) and puts xfs_dquot (quota) entries that are not busy (not referenced and unlocked), along with their backing xfs_buf entries, on dispose lists. In the second phase, the dispose lists are walked to perform the actual freeing. The xfs_buf entries that back the quota data for the xfs_dquot are written back using asynchronous IO in the second phase, and the xfs_dquot entries are freed.

Pseudo code of xfs_qm (quota) shrinker

# isolate unreferenced xfs_dquot and any dirty xfs_buf that stores them
# to a dispose list. Walk the dispose list and remove the xfs_buf and
# xfs_dquot entries.
xfs_qm_shrink_scan()
 iterate the xfs_dquot on the xfs_quotainfo->qi_lru
  xfs_qm_dquot_isolate()
   if can't take the xfs_dquot->q_lock mutex
    skip the xfs_dquot
   else
    if xfs_dquot->q_nrefs is not 0
     unlock the xfs_dquot->q_lock mutex
     busy so remove from the xfs_quotainfo->qi_lru
    else
      if waiting for xfs_dquot->q_flush would block
      unlock the xfs_dquot->q_lock mutex
      skip xfs_dquot
     else
       if xfs_dquot->dq_flags & XFS_DQ_DIRTY
       xfs_qm_dqflush()
       xfs_buf_delwri_queue()
       xfs_buf_relse()
       retry the xfs_dquot
      else
       (not referenced item)
       complete xfs_dquot->q_flush
       free xfs_dquot->q_lock mutex
       move the xfs_dquot entry to the dispose list
 end xfs_qm_dquot_isolate()
 xfs_buf_delwri_submit() free the xfs_buf entries on the dispose list
  loop over the xfs_buf on the dispose list
   xfs_buf_iowait() wait for IO to complete on the xfs_buf
   xfs_buf_relse()
    unlock the xfs_buf
    free the xfs_buf
  end xfs_buf loop
 end xfs_buf_delwri_submit()
 iterate the xfs_dquot entries on the dispose list
  xfs_qm_dqfree_one()
   lock xfs_mount->m_quotainfo->qi_tree_lock mutex
   remove the xfs_dquot item from the xfs_quotainfo radix tree
   unlock xfs_mount->m_quotainfo->qi_tree_lock mutex
   xfs_qm_dqdestroy()
     free the xfs_dquot
   end xfs_qm_dqdestroy()
  end xfs_qm_dqfree_one()
 end iterate xfs_qm_isolate entries
end xfs_qm_shrink_scan()

The EXT4 extent status shrinker

The ext4 extent status shrinker walks all the EXT4 inodes on the ext4_sb_info->s_es_list and tries to remove discretionary extents (non EXTENT_STATUS_DELAYED). Discretionary extents that do not have an active reference are removed and freed, while referenced extents (those that have been read more than once) are kept for another scan. EXTENT_STATUS_DELAYED extents are delay-allocated extents, which means the extent is active in a buffered IO write into a hole.

Pseudo code of Ext4 shrinker

# the EXT4 shrinker walks the ext4_sb_info->s_es_list list removing discretionary
# entries from the extent status cache. Other entries (marked with DELAYED or
# REFERENCED) must be retained.
# The ext4 code places the ext4 inode on the ext4_sb_info->s_es_list when the
# inode has a regular, unwritten, precached extent type. The number of freeable
# extent status entries is stored in a per-cpu counter
# (ext4_sb_info->s_es_stats.es_stats_shk_cnt) and each ext4 inode stores its count
# of non EXTENT_STATUS_DELAYED extent status entries in ext4_inode_info->i_es_shk_nr.
ext4_es_scan()
 __es_shrink()
  take ext4_sb_info->s_es_lock
  loop while the number of remove items is not 0
   get the next ext4_inode_info from ext4_sb_info->s_es_list
   move the ext4_inode_info to the end of ext4_sb_info->s_es_list
   if ext4_inode_info extent status have been precached (EXT4_STATE_EXT_PRECACHED) or
      ext4_inode_info->i_es_lock is locked
    continue loop to next ext4_inode_info
   endif
   free ext4_sb_info->s_es_lock
   es_reclaim_extents()
    es_do_reclaim_extents()
      (shrink extent status in a given ext4_inode_info until end of range or
      number of shrink records is reached)
     __es_tree_search()
     end __es_tree_search()
      loop until the number of remove items is 0 or the index is beyond the end
      decrement the number of remove items
      get the next red-black node after extent_status found in __es_tree_search()
       if extent_status entry has EXT4_STATE_EXT_PRECACHED or STATUS_REFERENCED
        clear the STATUS_REFERENCED flag
       continue to the next
      else
       remove this entry from the ext4_inode_info->i_es_tree rb tree
       ext4_es_free_extent()
        free the extent_status
       end ext4_es_free_extent()
       get the ext4_inode_info for the next node found above
     end loop
    end es_do_reclaim_extents()
   end es_reclaim_extents()
   free ext4_inode_info->i_es_lock write lock
   take ext4_sb_info->s_es_lock
  end loop
  free ext4_sb_info->s_es_lock
 end __es_shrink()
end ext4_es_scan()

The EXT4/OCFS2 JBD2 shrinker

Introduced in UEK7, the jbd2 shrinker scans the checkpointed transactions for buffers and frees the journal buffers that have already been written. A busy or dirty journal buffer is skipped, and the shrinker breaks out of the scanning loop if rescheduling is required. The shrinker holds the journal->j_list_lock spin lock while it walks the transaction queue and the journal_head items in that queue. A journal_head's buffer_head is released if it is neither dirty nor locked. The journal->j_list_lock is also used for a light cleaning of a transaction's buffer_heads when committing a transaction. This shrinker performs no IO and has minimal interaction with transaction commits.

Pseudo code of jbd2 shrinker

jbd2_journal_shrink_scan
 jbd2_journal_shrink_checkpoint_list
  again_label_loop
   spin_lock(&journal->j_list_lock)
   if journal->j_checkpoint_transactions == NULL
    spin_unlock(&journal->j_list_lock)
    return
   if journal->j_shrink_transaction cursor is not NULL
    transaction = journal->j_shrink_transaction
   else
    transaction = journal->j_checkpoint_transactions (start from the beginning)
   last_transaction = last entry of the journal->j_checkpoint_transactions queue
   loop until we reach the last_transaction in the queue
    journal_shrink_one_cp_list (current transaction t_checkpoint_list)
     last_jh = the last journal_head on the queue
     loop until the current journal_head == last_jh
      jbd2_journal_try_remove_checkpoint(current journal_head)
        if journal_head has active transaction or buffer_head is dirty or locked
         skip (return -EBUSY)
        __jbd2_journal_remove_checkpoint
         __buffer_unlink(jh)
          removes journal_head from the journal_head and transaction queues
         end __buffer_unlink
         percpu_counter_dec()
         __jbd2_journal_drop_transaction
          journal->j_shrink_transaction = NULL
          remove the transaction from the j_checkpoint_transactions queue
         end __jbd2_journal_drop_transaction
         jbd2_journal_free_transaction
          frees the transaction structure allocation
          end jbd2_journal_free_transaction
         end __jbd2_journal_remove_checkpoint
      end jbd2_journal_try_remove_checkpoint
      if (ret < 0)
       continue to the next journal_head because this item is busy
      nr_freed++
      if the thread is marked to need rescheduling
       break from loop so caller can do a cond_resched()
      end jbd2_journal_try_remove_checkpoint
     end loop
    end journal_shrink_one_cp_list
    (adjust the number of records to process in the shrinker)
    if the thread is marked to need rescheduling
     end inner loop
   end loop
    set journal->j_shrink_transaction cursor to the next transaction unless at the end
   spin_unlock(&journal->j_list_lock)
   cond_resched()
    if the number of records to process has been satisfied or no more transactions
   to process
     end outer loop
  end again_label_loop

The Filesystem shrinkers conclusion

The superblock shrinker operation holds a read lock on the sb->s_umount semaphore and could delay a freeze/umount operation for a long period of time. It may be wise to disable the hung_task_panic sysctl while dropping the slab caches.

The dropping of the XFS dquot and xfs_buf entries is done on released entries. Any IO (for example writing the XFS inode) is done with no additional locks held.

On EXT4, the extent status shrinker drops only discretionary extents and leaves extents that have been read multiple times for another scan. Unlike the dcache and inode shrinkers, the EXT4 extent status shrinker can work on an open inode. Under certain workloads, it is possible that the extent entry will be read back from storage after the shrink.

The MM shrinkers

The memory management code has several shrinkers. For example, the working set code has a shrinker to remove unused shadow nodes, the x86 MMU code has a shrinker to remove old MMU pages, and the huge memory code has a huge zero page shrinker and a deferred THP split shrinker.

The workingset_shadow_shrinker

Shadow entries are cached records of evicted pages that track eviction and activation counts. These counts help decide whether a refaulted page should be reactivated and help age out stale cached entries. The number of shadow entries is limited to bound memory consumption.

The shadow shrinker walks the global list_lru (shadow_nodes). The shrinker count function and the shrinker scan routine use the list_lru iteration functions (list_lru_shrink_count() and list_lru_shrink_walk_irq()). The shrinker count function uses rcu_read_lock() for a memory read, and the shrinker scan function uses the list_lru (lru->lock) spin lock to walk the lru and attempts to take the shadow page’s address_space->i_pages xarray lock, so the shrinker can add or remove the shadow entry on a private list based on whether the entry still has real pages.

This shrinker works with shadow page entries, which represent evicted pages, and skips the work if the address_space lock cannot be obtained. This shrinker does not affect working pages.
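
Working set and shadow node activity can be observed through /proc/vmstat (exact counter names vary by kernel version); for example, workingset_nodereclaim counts shadow nodes reclaimed by this shrinker:

# grep workingset /proc/vmstat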

The x86 kvm mmu_shrinker

The kvm mmu_shrinker removes inactive shadow pages in a kernel virtual machine (kvm) after they have been replaced with new entries. The “zapped” shadow entries are referenced on the kvm->kvm_arch->zapped_obsolete_pages list and have an obsolete generation number. The shrinker count function reads a per-cpu counter. The shrinker scan function takes the kvm_lock mutex and scans the global vm_list to find a kvm entry with inactive shadow pages. The current entry’s kvm->mmu_lock write lock is taken to remove the shadow entries. The shadow removal performs a TLB flush on all the CPUs. To avoid monopolizing the processor, cond_resched() is called to yield after every 10 shadow entries removed. Having inactive shadow pages is not common.

The deferred_split_shrinker

The Transparent HugePage (THP) facility is the Linux kernel huge page system where page sizes can be promoted and demoted by powers of two. Unmapping (for example with munmap) just part of a THP page range does not immediately free the memory. Instead the THP is put on a NUMA, or optionally NUMA/memcgroup, specific pg_data_t->split_queue. The THP page remains on this queue in the partially used state until this shrinker runs. The shrinker splits the THP into at least two smaller power of two page size entries. The still active portions of the original THP replace the partially used THP in the address space, and the inactive portions become available for allocation.
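
The amount of deferred splitting activity can be observed through /proc/vmstat; for example, thp_deferred_split_page counts THP pages queued for deferred splitting and thp_split_page counts pages actually split (counter names assume a kernel built with THP support):

# grep -E 'thp_split_page|thp_deferred_split' /proc/vmstat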

The deferred_split shrinker works on each per-NUMA-node, or per-NUMA-node/memcgroup, deferred split queue in the system. The list holds the THP head pages that could be split. In the first pass, if a reference on the page can be taken, the page is moved from the split_queue list to a second-pass working list. After the first pass over the pages on the split_queue, the split_queue_lock can be released and the second pass processes the pages on the working list created in the first pass. The second pass attempts to take the page lock and checks that the page is not in IO writeback, so a page performing IO is not split. The read semaphore (i_mmap_rwsem) for the page’s address space is taken and the local processor interrupt (IRQ) is disabled for the page split. This shrinker is very complicated: splitting a THP requires updates to the memory management page tables and the buckets that hold different size THP entries.

In summary, there is a brief period during which new THP splits or THP deletes are delayed. As the THP is split, new access to the affected page range of the address space briefly stalls and the local IRQ is masked for the page table changes. THP splits are complicated and affect working pages and the MMU page table mapping, but THP splits are not common.

The huge_zero_page_shrinker

There is an optional huge page stored in the global page pointer (huge_zero_page). The huge_zero_page_shrinker will remove this huge page if the corresponding atomic reference counter indicates that this huge page is no longer in use. This routine uses an atomic counter and an atomic pointer and does not hold any locks. If removed, a new huge zero page will be allocated when needed.

The mm shrinkers conclusion

The memory management shrinkers are special cases.

The working set shadow shrinker removes inactive shadow entries, and the KVM mmu shrinker similarly removes inactive shadow pages in the kernel virtual machine implementation. In both cases, the shrinker tries to have minimal impact on the currently running address spaces.

The deferred split page shrinker splits partially unmapped THP memory pages. The implementation is complex, but the shrinker is rarely needed.

The huge zero page shrinker has minimal impact on performance.

Dropping slab cache conclusion

The slab shrinkers are designed to have minimal impact on the running system by selecting inactive entries, but they do reach into filesystem and memory management structures and locks. The slab shrinkers should be used judiciously on production systems.

Conclusion

In this blog, we looked at the page cache and slab shrinkers to understand the implications of dropping the caches. By design, dropping caches operates in a best effort mode and uses cond_resched() to limit processor hold times. Locks are held only where needed, and try-lock variants are used whenever possible. The drop_caches routines should not create a deadlock problem, but if a filesystem freeze or unmount operation could happen during drop_caches calls, make sure the hung_task_panic sysctl is disabled.