By Jamesmorris-Oracle on Oct 06, 2015
The following is a write-up by Oracle Mainline Linux Kernel Engineer, Mike Kravetz, on his recent upstream work on enhancing hugetlbfs support in the Linux kernel.
Huge page support has been present in the Linux kernel since 2003. When first introduced, the only way to take advantage of huge pages was via hugetlbfs. This often involved modifications to application code and explicit action by system administrators to set up and reserve pools of huge pages. As a result, the use of huge pages was mostly limited to applications, such as large databases, that wanted the very best performance possible and had skilled developers who could modify and tune their code.
More recently, Linux kernel development has focused on Transparent Huge Page (THP) support. THP is a system-wide feature that enables the use of huge pages by any application without source code modification or system administrator intervention. The Linux kernel creates and manages the huge pages transparently.
THP works well for most applications today. However, some application developers want to achieve the best performance possible. To achieve this, they are willing to modify their applications to use the original hugetlbfs interfaces. Of course, this also requires that the application have intimate knowledge of its interaction with system resources. To meet the evolving needs of these applications, two new enhancements were made to hugetlbfs.
Reserving Huge Pages
Users of hugetlbfs typically reserve huge pages at system boot time. This pool of reserved pages is then consumed as applications map and fault in huge pages. Since memory reserved for huge pages is not available for other uses, it is important not to reserve an excessive number of pages. However, if too few pages are reserved, applications may receive out-of-memory errors when the reserved pool is exhausted. Therefore, users attempt to make an accurate estimate of their huge page needs and have their applications make use of all the reserved huge pages.
One concern in this area is that the pool of reserved pages is global. Therefore, it is possible for any user/application on the system with sufficient privilege to use huge pages in the reserved pool. This could cause problems for an application that expects a certain number of huge pages.
An application would like some reasonable assurance that allocations will not fail due to a lack of huge pages. At application start-up time, the application would like to configure itself to use a specific number of huge pages. Before starting, the application can check to make sure that enough huge pages exist in the system global pools. However, there are no guarantees that those pages will be available when needed by the application. The application really wants exclusive use of a subset of huge pages.
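The start-up check described above can be sketched in C. This is a minimal illustration, not code from the kernel work itself: it reads the HugePages_Free field from /proc/meminfo, and the helper names (meminfo_value, free_huge_pages, enough_huge_pages) are hypothetical. As the text notes, the result is only a snapshot; another process may take the pages before this application faults them in.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the number that follows "key:" in meminfo-style text,
 * or -1 if the key is not present. (Hypothetical helper.) */
long meminfo_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match only a full field name, i.e. the key followed by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtol(line + klen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return -1;
}

/* Read /proc/meminfo and return the free huge pages in the default
 * pool, or -1 if the file cannot be read or the field is missing. */
long free_huge_pages(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    size_t n;

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_value(buf, "HugePages_Free");
}

/* Non-zero if at least `needed` huge pages appear free right now.
 * This is a racy snapshot, not a reservation. */
int enough_huge_pages(long needed)
{
    return free_huge_pages() >= needed;
}
```

An application might call enough_huge_pages() before mapping its hugetlbfs files and refuse to start if it returns zero, but nothing in this check prevents another user of the global pool from consuming the pages afterwards.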
A new hugetlbfs mount option 'min_size=
The min_size mount option for hugetlbfs was added to the 4.1 version of
Punching Holes in hugetlbfs files
As mentioned above, applications which make use of huge pages via hugetlbfs often have intimate knowledge of their system resource needs. In addition, these applications may use files within hugetlbfs as huge page backed shared memory. Within the application, many processes will be simultaneously mapping these files. Some of the data in these files is long lived, and is used throughout the life of the application. Other data may only be used for a period of time and then never accessed again. When the application knows that data within these files is no longer needed, it would like to release the huge pages associated with the data so that they can be used for other purposes.
Punching holes within files is accomplished with the fallocate() system call in traditional filesystems. In Linux, the tmpfs filesystem also supports fallocate hole punch. Adding this support to hugetlbfs provides the requested functionality to release huge pages within files. Hole punching in hugetlbfs is actually simpler than in other filesystems because hugetlbfs is a memory-only filesystem: there is no disk or swap space to be concerned with.
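As a rough illustration of hole punching, the sketch below wraps the fallocate() call with the FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE flags. The function names are hypothetical. The helper whole_huge_pages_in_range() rests on an assumption worth verifying against the kernel in use: that hugetlbfs releases only huge pages lying entirely inside the requested range, rounding the start up and the end down to huge page boundaries.

```c
#define _GNU_SOURCE             /* for fallocate() and FALLOC_FL_* */
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdint.h>

/* Round x up or down to a multiple of hp, which must be a power of two. */
static uint64_t round_up_hp(uint64_t x, uint64_t hp)
{
    return (x + hp - 1) & ~(hp - 1);
}

static uint64_t round_down_hp(uint64_t x, uint64_t hp)
{
    return x & ~(hp - 1);
}

/* Number of whole huge pages fully covered by [offset, offset + len).
 * Assumption: only these pages are released by a hole punch. */
uint64_t whole_huge_pages_in_range(uint64_t offset, uint64_t len, uint64_t hp)
{
    uint64_t start, end;

    assert(hp != 0 && (hp & (hp - 1)) == 0);
    start = round_up_hp(offset, hp);
    end = round_down_hp(offset + len, hp);
    return end > start ? (end - start) / hp : 0;
}

/* Punch a hole in an open file without changing its size.
 * Returns 0 on success, -1 with errno set on failure. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

With 2 MB huge pages, for example, a 4 MB hole starting at offset 1 MB fully covers only one huge page, so an application that wants memory back would normally pass offsets and lengths that are multiples of the huge page size.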
In addition to hole punch, fallocate pre-allocation support was also added for hugetlbfs. This allows one to allocate multiple huge pages with a single system call. Without pre-allocation, each huge page would be allocated at page fault time.
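Pre-allocation uses the same system call with a mode of 0. The following sketch is illustrative only: preallocate_huge_pages() is a hypothetical helper, and the caller is assumed to know the huge page size of the mount and to pass a descriptor for a file opened on hugetlbfs.

```c
#define _GNU_SOURCE             /* for fallocate() */
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

/* Ask the kernel to allocate the first `npages` huge pages of the file
 * in a single call, instead of one page per page fault.
 * Returns 0 on success, -1 with errno set on failure. */
int preallocate_huge_pages(int fd, uint64_t npages, uint64_t hpage_size)
{
    /* Reject empty requests and byte counts that would overflow off_t. */
    if (npages == 0 || hpage_size == 0 ||
        npages > (uint64_t)INT64_MAX / hpage_size) {
        errno = EINVAL;
        return -1;
    }
    /* Mode 0 is plain allocation: back the byte range
     * [0, npages * hpage_size) with pages now. */
    return fallocate(fd, 0, 0, (off_t)(npages * hpage_size));
}
```

A failure here tells the application at set-up time that the pool cannot satisfy the request, rather than at some later page fault.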
hugetlbfs fallocate support is part of the 4.3 release candidate series of the Linux kernel.
The 4.3 Linux kernel release candidate series contains support for userfaultfd, by Andrea Arcangeli. This new functionality allows for the handling of page faults in user space. An application can monitor a range of virtual addresses. When a page fault happens within this range, the application is notified and can take various actions.
The initial version of userfaultfd only supports anonymous VMA mappings. Applications using hugetlbfs may also like to use userfaultfd. One identified use case is the monitoring of address ranges that were hole punched with fallocate. Access to these areas may be considered an error by the application. Therefore, the application would like to be notified of such accesses.
The addition of userfaultfd support for hugetlbfs is being considered for a future Linux kernel release.
-- Mike Kravetz