Introduction
NUMA (Non-Uniform Memory Access) balancing is a feature (CONFIG_NUMA_BALANCING=y) in the Linux kernel, designed to automatically optimize task and page placement to minimize the performance penalty of remote memory access in a NUMA architecture. The Linux kernel monitors memory access patterns by using page faults to determine whether a memory access is local or remote. The flag TNF_FAULT_LOCAL is set if a folio is found to be local to the CPU during the handling of a NUMA hinting page fault. The numbers of local (numa_hint_faults_local) and total (numa_hint_faults) hinting faults can be found in /proc/vmstat.
if (folio_nid(folio) == numa_node_id()) {
    count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
    *flags |= TNF_FAULT_LOCAL;
} 
Based on that flag, each fault is counted as local or remote in p->numa_faults_locality[local], which feeds into the period of NUMA balancing VMA scanning, p->numa_scan_period. The scan period is initialized from sysctl_numa_balancing_scan_delay (1000 ms by default) and is ultimately determined by the local/remote ratio and the private/shared ratio. A separate set of statistics, p->numa_faults[], tracks details such as the node where the page faults originate and whether the faults are private or shared, and therefore helps determine the preferred NUMA node to which the task and its pages should migrate.
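The two /proc/vmstat counters mentioned above give a quick, system-wide view of how local the hinting faults are. The following is a minimal user-space sketch, not kernel code: the function names vmstat_lookup() and local_fault_percent() are hypothetical helpers introduced for illustration, and the input is assumed to be the text of /proc/vmstat already read into a NUL-terminated buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one "name value" line in text formatted like /proc/vmstat.
 * Returns 0 and stores the value in *out on success, -1 if not found.
 * (Hypothetical helper, not a kernel or libc API.)
 */
int vmstat_lookup(const char *text, const char *name,
                  unsigned long long *out)
{
    const char *line = text;
    size_t len;

    assert(text != NULL && name != NULL && out != NULL);
    len = strlen(name);

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');

        /* Match the whole key: the name must be followed by a space. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ') {
            *out = strtoull(line + len + 1, NULL, 10);
            return 0;
        }
        if (eol == NULL)
            break;
        line = eol + 1;
    }
    return -1;
}

/*
 * Percentage of hinting faults that were local, or -1.0 if a counter
 * is missing or no hinting fault has been recorded yet.
 */
double local_fault_percent(const char *text)
{
    unsigned long long total, local;

    if (vmstat_lookup(text, "numa_hint_faults", &total) != 0 ||
        vmstat_lookup(text, "numa_hint_faults_local", &local) != 0 ||
        total == 0)
        return -1.0;
    return 100.0 * (double)local / (double)total;
}
```

Note that these counters are global, whereas p->numa_faults_locality[] is tracked per task, so the percentage is only a rough indicator of what any single task sees.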
NUMA Balancing VMA scanning
As mentioned above, NUMA balancing introduces the notion of hinting page faults for the sole purpose of gathering memory access patterns across NUMA nodes. It periodically scans a region of size sysctl_numa_balancing_scan_size (256 MB by default, or up to 8x that much virtual address space if no PTE update occurs), starting from the beginning of the address space or from where the previous scan left off, and applies the protection flag MM_CP_PROT_NUMA to accessible VMAs one at a time. The primary effect of this flag is to set PTEs to PROT_NONE, which triggers page faults on later accesses.
nr_updated = change_protection(&tlb, vma, addr, end, MM_CP_PROT_NUMA);
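The two scan budgets described above (the scan size itself, and 8x that much virtual space when nothing is updated) can be illustrated with a simplified user-space model. This is only a sketch under stated assumptions: the address space is modeled as an array of equally sized chunks, each flagged by whether scanning it would update any PTE, and scan_pass() and VIRT_SCAN_FACTOR are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define VIRT_SCAN_FACTOR 8  /* the "8x" virtual-space limit from the text */

/*
 * Model one scan pass. chunk_has_ptes[i] is nonzero if scanning chunk i
 * would update at least one PTE. The pass resumes at *cursor (restarting
 * from 0 once the end was reached) and stops when either budget runs out:
 * "budget" chunks with PTE updates, or VIRT_SCAN_FACTOR * budget chunks
 * of virtual space overall. Returns the number of chunks walked.
 */
size_t scan_pass(const int *chunk_has_ptes, size_t nr_chunks,
                 size_t *cursor, long budget)
{
    long pages = budget;
    long virtpages = budget * VIRT_SCAN_FACTOR;
    size_t scanned = 0;

    assert(chunk_has_ptes != NULL && cursor != NULL && budget > 0);

    /* Previous pass hit the end: start again from the beginning. */
    if (*cursor >= nr_chunks)
        *cursor = 0;

    while (*cursor < nr_chunks && pages > 0 && virtpages > 0) {
        if (chunk_has_ptes[*cursor])
            pages--;        /* only populated chunks consume this budget */
        virtpages--;        /* every chunk consumes virtual-space budget */
        scanned++;
        (*cursor)++;
    }
    return scanned;
}
```

In this model, a pass over a fully unpopulated region walks 8x the budget before giving up, while a pass over populated memory stops as soon as the budget's worth of chunks has been marked.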
While NUMA balancing iterates through the VMAs in the allowed range, it checks whether each VMA is migratable, whether the VMA’s memory policy permits migration, whether it is a hugetlb mapping, and so on, and skips VMAs that are unsuitable. Some of the notable checks are shown below:
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
    is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
    trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
    continue;
}
if (!vma->vm_mm ||
    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) {
    trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SHARED_RO);
    continue;
}
if (!vma_is_accessible(vma)) {
    trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_INACCESSIBLE);
    continue;
} 
The key difference between a normal PROT_NONE fault and a NUMA hinting fault is whether the VMA is accessible. Skipping inaccessible VMAs (the NUMAB_SKIP_INACCESSIBLE case) is therefore critical to correctly identifying hinting faults in the page fault handler.
How does cgroup play a role in NUMA balancing?
VMs or containers that don’t span NUMA nodes can be pinned via cpuset.[mems|cpus]; the same technique can also be applied to any workload for performance purposes. At the same time, people may want to keep NUMA balancing enabled because other (potentially numerous) unpinned tasks can still benefit from it.
The issue here is that NUMA balancing isn’t cgroup-aware at all, although there has been a recent proposal[1] to allow turning it on or off at the cgroup level. During VMA scanning, NUMA balancing doesn’t check whether a task’s memory is pinned to a NUMA node by a cpuset. It carries on with all the scanning and PTE updates, and eventually triggers page faults, even though not a single byte of that task’s memory can be migrated out of its designated node.
The cost can be high: Oracle has observed a 6x regression on some Java workloads due to this unnecessary and entirely avoidable overhead.
The solution
The fix[2] is fairly straightforward. Before VMA scanning even starts, we perform an early return from task_numa_work() if cpuset is enabled and cpuset.mems contains only one node. We also add a tracepoint for debugging purposes.
The reason not to skip earlier in task_tick_numa() or even task_tick_fair() is that there is still a time window between task_work_add() and task_work_run(), during which cpuset.mems can change. Additionally, task_numa_work() is called much less frequently than the other two because of the NUMA balancing scan period.
/*
 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 * no page can be migrated.
 */
if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
    trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
    return;
} 
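From user space, one way to tell whether a task would hit this early return is to count the nodes in its cpuset's memory node list. The sketch below parses a list-format string such as "0", "0-1" or "0,2-3". Both function names are hypothetical, and it is an assumption of this example that the string was read from the cgroup's effective memory node list; the kernel itself checks cpuset_current_mems_allowed directly, as shown above.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Count the nodes in a list-format string such as "0", "0-1" or "0,2-3".
 * Returns -1 on malformed input. Overlapping ranges are not deduplicated.
 * (Hypothetical helper for illustration.)
 */
int count_nodes(const char *list)
{
    const char *p = list;
    int count = 0;

    assert(list != NULL);

    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo, hi;

        lo = strtol(p, &end, 10);
        if (end == p || lo < 0)
            return -1;
        hi = lo;
        p = end;

        if (*p == '-') {            /* a range "lo-hi" */
            p++;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
            p = end;
        }
        count += (int)(hi - lo + 1);

        if (*p == ',')
            p++;
        else if (*p != '\0' && *p != '\n')
            return -1;
    }
    return count;
}

/* Mirrors the "exactly one memory node" condition of the early return. */
int numa_scan_would_be_skipped(const char *mems_list)
{
    return count_nodes(mems_list) == 1;
}
```

This only mirrors the node-count half of the kernel condition; the cpusets_enabled() half has no direct equivalent in this sketch.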
With this patch, we observe approximately a 30% improvement in performance on UEK7 when the test is pinned to a single NUMA node using cpuset.[cpus|mems]. The test allocates 4 GB of THP memory and spawns a 30-vCPU selftest KVM guest, with each vCPU running a loop 10,000 times to read 4 KB pages from that memory. On a dual-socket Intel system with NUMA balancing enabled, the test takes approximately 250 seconds to complete without the patch, but only around 180 seconds with it. In a non-default configuration where NUMA balancing is made more aggressive by halving sysctl_numa_balancing_scan_delay and doubling sysctl_numa_balancing_scan_size, the gap widens: the time-to-complete (TTC) increases to 280 seconds without the patch, while it remains unchanged with the patch.
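For reference, the relative reduction in TTC implied by these measurements can be worked out with a one-line formula. The helper name below is hypothetical; the inputs are the approximate timings quoted above.

```c
#include <assert.h>

/* Relative reduction in time-to-complete, in percent. */
double ttc_reduction_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return 100.0 * (before_s - after_s) / before_s;
}
```

With the default settings, ttc_reduction_percent(250.0, 180.0) gives 28%, and with the more aggressive settings, ttc_reduction_percent(280.0, 180.0) gives roughly 36%.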
This is a small step toward improving NUMA balancing. One consequence is that this approach also rules out NUMA task migrations when CPUs are NOT pinned to a single node, because task placement relies on the same hinting page faults to track memory access patterns, as is the case in the other VMA-skipping scenarios.
However, there may be motivation to allow task migrations even when pages are not migratable, provided that the overall benefit outweighs the cost of the hinting page faults. This is left as future work.
[1]. https://lore.kernel.org/lkml/cover.1740483690.git.yu.c.chen@intel.com/
[2]. https://lore.kernel.org/all/20250424024523.2298272-1-libo.chen@oracle.com/
