Transparent Hugepages (THP) for .text mappings

February 1, 2024 | 3 minute read

In the Oracle Linux UEK7U2 kernel release, which is based on kernel version 5.15, we are enabling Transparent Hugepage (THP) allocations for executable file mappings (.text) through the default setting CONFIG_READ_ONLY_THP_FOR_FS=y. This kernel version has the necessary support to add THP pages to the page cache.

We see a performance benefit when a large .text region of an executable is mapped using THP. In this kernel version, there is no support for creating THP pages for .text mappings on demand in the page fault path; THP allocations are done by the khugepaged daemon. This requires enabling khugepaged by setting /sys/kernel/mm/transparent_hugepage/enabled to madvise or always. If the setting is madvise, a process must issue madvise(.., MADV_HUGEPAGE) on the mapped .text region.
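
As a rough illustration of the setting described above, the following Python sketch parses the contents of /sys/kernel/mm/transparent_hugepage/enabled, where the kernel marks the active mode with square brackets, and applies the always/madvise rule from the text. The helper names parse_thp_mode and khugepaged_may_collapse are made up for this example and are not part of any kernel or library interface.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode, which sysfs marks with square brackets."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def khugepaged_may_collapse(mode, region_is_madvised):
    """Hypothetical helper: apply the always/madvise rule described above."""
    if mode == "always":
        return True
    if mode == "madvise":
        # The process must have called madvise(.., MADV_HUGEPAGE) on the region.
        return region_is_madvised
    return False  # "never": THP is disabled


if __name__ == "__main__":
    try:
        mode = parse_thp_mode(THP_ENABLED.read_text())
    except (OSError, ValueError):
        print("THP setting not readable on this system")
    else:
        print("THP mode:", mode)
```

For example, a file containing "always [madvise] never" means the madvise mode is active, so only regions on which the process called madvise(.., MADV_HUGEPAGE) are considered.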

Khugepaged scalability issue

khugepaged is a single-threaded daemon that regularly scans the address space of all processes and collapses sets of eligible, accessed small pages (4k) into THPs. The daemon runs slowly and sleeps periodically while scanning, to keep its impact on system load low.

A system can host many applications with large executables, and a large number of processes (1000s) can be running from those executables. With such a setup, we encounter a scalability issue with the khugepaged daemon, as it needs to scan the address spaces of many processes. This can result in a long lag time (hours to days) before the .text pages of many of the processes get converted to THP. Once the .text pages accessed by one process are converted to THP, all other processes accessing the same file pages benefit from it. However, with a large file/.text region, not all processes sharing the file may be accessing the same pages.
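
A back-of-the-envelope estimate shows how the lag can reach days. The sketch below assumes khugepaged scans pages_to_scan small pages per wakeup and then sleeps for scan_sleep_millisecs, using the values 4096 and 10000 that are the usual defaults for these tunables on x86-64 (check /sys/kernel/mm/transparent_hugepage/khugepaged/ on your system). It ignores the time spent scanning and collapsing, and the workload numbers are invented for illustration.

```python
PAGE_SIZE = 4096  # bytes in a small page (4k)


def full_scan_seconds(processes, eligible_bytes_per_process,
                      pages_to_scan=4096, scan_sleep_millisecs=10000):
    """Estimate the sleep time accumulated during one full khugepaged pass.

    Simplified model: every wakeup scans pages_to_scan small pages and is
    followed by one sleep of scan_sleep_millisecs.
    """
    total_pages = processes * (eligible_bytes_per_process // PAGE_SIZE)
    wakeups = -(-total_pages // pages_to_scan)  # ceiling division
    return wakeups * scan_sleep_millisecs / 1000.0


if __name__ == "__main__":
    seconds = full_scan_seconds(processes=1000,
                                eligible_bytes_per_process=1 << 30)
    print(f"{seconds:.0f} s, about {seconds / 86400:.1f} days")
```

Under these assumptions, 1000 processes with 1 GiB of eligible mappings each need 640,000 seconds of sleep time for a single pass, roughly 7.4 days, which is consistent with the hours-to-days lag described above.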

A solution to address THP creation lag time

Later kernel versions have support for the creation of THP pages on first access to .text mappings, which is based on folios with the necessary support in the filesystem implementation. Backporting folios and related filesystem changes to UEK7 would not be feasible.

To address this issue in UEK7U2, a smaller change has been implemented which attempts the creation of THP pages at access time (page fault), rather than depending only on the khugepaged daemon. In the page fault path, we call the routine collapse_file() (which is also called by the khugepaged daemon) to convert small pages to THP. Any missing small pages required to fill the THP size are fetched (read in). Converted THP pages are added to the page cache. Subsequent processes that are spawned from the same executable binary find the THP page in the page cache and map it.
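
To make the "fill the THP size" step concrete, here is a small arithmetic sketch. It assumes a 2 MiB THP and 4k small pages, which is the usual x86-64 configuration but is not stated in this article, and it ignores file-offset alignment. The function names are hypothetical and do not correspond to kernel routines.

```python
HPAGE_SIZE = 2 * 1024 * 1024  # assumed THP size (x86-64 with 4k base pages)
PAGE_SIZE = 4096              # small page size


def thp_candidates(start, end, hpage=HPAGE_SIZE):
    """Start addresses of huge-page-aligned chunks fully inside [start, end)."""
    first = (start + hpage - 1) // hpage * hpage  # round start up
    last = end // hpage * hpage                   # round end down
    return list(range(first, last, hpage))


def missing_small_pages(cached, hpage=HPAGE_SIZE, page=PAGE_SIZE):
    """Small pages that still have to be read in to fill one THP."""
    per_thp = hpage // page
    if not 0 <= cached <= per_thp:
        raise ValueError("cached must be between 0 and pages per THP")
    return per_thp - cached


if __name__ == "__main__":
    # Hypothetical 6 MiB .text mapping starting at a 2 MiB boundary.
    print([hex(a) for a in thp_candidates(0x400000, 0xA00000)])
    print(missing_small_pages(cached=100))
```

In this model, a mapping that does not cover at least one full aligned 2 MiB chunk yields no candidates and stays on small pages.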

This change in UEK7U2 applies only to a .text address range (vma) which has the VM_HUGEPAGE flag, i.e. the application has to call madvise(MADV_HUGEPAGE) on the address range. Due to concurrent accesses, the attempt to create or convert to THP can fail, in which case just small pages (4k) are created for .text and added to the page cache. It is then left to khugepaged to convert those small pages to THP as part of its scan.
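
One way to check from user space whether a .text range carries this flag is the VmFlags line in /proc/<pid>/smaps, where hg denotes the huge page advise flag. The following sketch parses smaps-style text and returns the executable mappings that have it. The sample text, its path and the function name are invented for the example.

```python
import re

# Matches the header line of an smaps entry: "start-end perms ..."
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) (\S{4}) ")

SAMPLE = """\
00400000-00a00000 r-xp 00000000 08:01 1234 /opt/app/bin/server
Size:               6144 kB
VmFlags: rd ex mr mw me hg
00a00000-00c00000 r--p 00600000 08:01 1234 /opt/app/bin/server
VmFlags: rd mr mw me
7f0000000000-7f0000200000 r-xp 00000000 08:01 99 /usr/lib/libx.so
VmFlags: rd ex mr mw me
"""


def advised_exec_ranges(smaps_text):
    """Return (start, end) of executable mappings whose VmFlags include 'hg'."""
    ranges = []
    current = None
    for line in smaps_text.splitlines():
        m = HEADER.match(line)
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            # Track only mappings with execute permission.
            current = (start, end) if "x" in m.group(3) else None
        elif line.startswith("VmFlags:") and current is not None:
            if "hg" in line.split()[1:]:
                ranges.append(current)
    return ranges


if __name__ == "__main__":
    for start, end in advised_exec_ranges(SAMPLE):
        print(f"{start:#x}-{end:#x} has the huge page advise flag")
```

On a live system, the same function could be fed the contents of /proc/self/smaps after the application has issued its madvise call.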

In addition, when an application is installed, typically all its executable binary files get copied, which normally leaves the executable file pages cached in the page cache as small pages (4k). Those pages tend to remain in the page cache unless the system is rebooted. When a process is spawned from these executable binaries, all its text pages are found in the page cache during a minor fault, so with the above-mentioned change alone the fault path would skip the attempt to convert these small pages to THP. So, even in this case, it would be up to khugepaged to convert those small pages to THP.

To deal with small pages already in the page cache, an additional change forces a major fault when small pages of .text are found in the page cache, so that conversion to THP is attempted. Again, this change applies only to an address range (vma) on which madvise(MADV_HUGEPAGE) has been called.
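
The combined behavior of the two changes can be summarized as a small decision table. The sketch below is only a simplified model of the prose in this article, not kernel code: the Cache and Action names are invented, and it assumes that a THP already in the page cache is simply mapped.

```python
from enum import Enum


class Cache(Enum):
    MISSING = "not in page cache"
    SMALL = "cached as small pages (4k)"
    THP = "cached as THP"


class Action(Enum):
    MAP_THP = "map the THP found in the page cache"
    TRY_COLLAPSE = "attempt collapse_file() in the fault path"
    SMALL_PAGES = "use small pages and leave conversion to khugepaged"


def text_fault_action(vma_advised, cache):
    """Model of a .text page fault with both UEK7U2 changes applied."""
    if cache is Cache.THP:
        return Action.MAP_THP
    if vma_advised:
        # MISSING: THP creation is attempted at fault time.
        # SMALL: a major fault is forced so that conversion is attempted.
        return Action.TRY_COLLAPSE
    # No madvise(MADV_HUGEPAGE) on the range: the changes do not apply.
    return Action.SMALL_PAGES


def after_failed_collapse():
    """A failed attempt (e.g. due to concurrent access) falls back to 4k pages."""
    return Action.SMALL_PAGES


if __name__ == "__main__":
    for advised in (True, False):
        for cache in Cache:
            print(advised, cache.name, "->", text_fault_action(advised, cache).name)
```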

Prakash Sangappa

