The following is a write-up by Oracle Mainline Linux Kernel Engineer, Mike Kravetz, on his recent upstream work on enhancing hugetlbfs support in the Linux kernel.
Introduction
Huge page support has been present in the Linux kernel since 2003. When
first introduced, the only way to take advantage of huge pages was via
hugetlbfs. This often involved modifications to application code and
explicit action by system administrators to set up and reserve pools of
huge pages. As a result, the use of huge pages was mostly limited to
applications such as large databases which wanted the very best performance
possible and had skilled developers who could modify and tune their code.
More recently, Linux kernel development has been focused on Transparent
Huge Page (THP) support. THP is a system wide feature that enables the
use of huge pages by any application without source code modification or
system administrator intervention. The creation, management and use of
huge pages are handled transparently by the Linux kernel.
THP works well for most applications today. However, some application
developers want to achieve the best performance possible. To achieve this,
they are willing to modify their application to use the original hugetlbfs
interfaces. Of course, this also requires that the application have intimate
knowledge of its interaction with system resources. To meet the evolving
needs of these applications, two new enhancements were made to hugetlbfs.
Reserving Huge Pages
Users of hugetlbfs typically reserve huge pages at system boot time. This
pool of reserved pages is then used as the applications map and fault in
huge pages. Since memory reserved for huge pages is not available for
other uses, it is important not to reserve an excessive number of pages.
However, if too few pages are reserved, the applications may receive
out-of-memory errors when the reserved pool is exhausted. Therefore, users
attempt to make an accurate estimate of their huge page needs and have
their applications make use of all the reserved huge pages.
One concern in this area is that the pool of reserved pages is global.
Therefore, it is possible for any user/application on the system with
sufficient privilege to use huge pages in the reserved pool. This could
cause problems for an application that expects a certain number of huge
pages.
An application would like some reasonable assurance that allocations will
not fail due to a lack of huge pages. At application start-up time, the
application would like to configure itself to use a specific number of
huge pages. Before starting, the application can check to make sure that
enough huge pages exist in the system global pools. However, there are no
guarantees that those pages will be available when needed by the
application.
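The start-up check described above can be sketched as follows. This is a minimal illustration, not part of the hugetlbfs work itself: it reads the HugePages_Free field from /proc/meminfo, and the helper names meminfo_value() and free_huge_pages() are hypothetical. As noted, the result is only a snapshot and guarantees nothing about later availability.

```c
#include <stdio.h>
#include <string.h>

/* Return the number following "key:" in meminfo-style text,
 * or -1 if the key is not present. (Hypothetical helper.) */
long meminfo_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long v;

            if (sscanf(line + klen + 1, "%ld", &v) == 1)
                return v;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Snapshot of free huge pages in the global pool for the default
 * huge page size, or -1 if /proc/meminfo cannot be read. */
long free_huge_pages(void)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_value(buf, "HugePages_Free");
}
```

An application could compare free_huge_pages() against its expected needs before starting, keeping in mind that another user of the global pool may consume those pages afterwards.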
The application really wants exclusive use of a subset of huge pages.
A new hugetlbfs mount option, ‘min_size=’, was developed to indicate
the number of huge pages guaranteed to be available for use by the
filesystem. At mount time, this number of huge pages will be reserved for
exclusive use of the filesystem. If there are not a sufficient number of
free pages, the mount will fail. As applications allocate and free huge
pages from the filesystem, the number of reserved pages is adjusted so that
the specified minimum is maintained. In this way, the application is
assured that the specified number of huge pages will be available for its use.
The min_size mount option for hugetlbfs was added in version 4.1 of the
Linux kernel.
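As a rough sketch of how such a mount might be requested from C, the code below builds an option string and passes it to mount(2). It assumes that min_size is given as a size in bytes, in the same way as the existing size option, so the desired page count is multiplied by the huge page size. The function names, the mount point and the page size are illustrative assumptions, and the call needs sufficient privilege to mount filesystems.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Build "pagesize=<bytes>,min_size=<bytes>" for a hugetlbfs mount.
 * Returns the string length, or -1 if buf is too small.
 * (Hypothetical helper; min_size is assumed to be a byte count.) */
int hugetlbfs_opts(char *buf, size_t len,
                   unsigned long long page_bytes,
                   unsigned long long min_pages)
{
    int n = snprintf(buf, len, "pagesize=%llu,min_size=%llu",
                     page_bytes, page_bytes * min_pages);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Mount hugetlbfs at target with min_pages huge pages reserved for
 * the filesystem. Returns 0 on success, -1 on failure (errno set by
 * mount(2), for example when too few free huge pages exist). */
int mount_hugetlbfs_min(const char *target,
                        unsigned long long page_bytes,
                        unsigned long long min_pages)
{
    char opts[96];

    if (hugetlbfs_opts(opts, sizeof(opts), page_bytes, min_pages) < 0)
        return -1;
    return mount("none", target, "hugetlbfs", 0, opts);
}
```

For example, mount_hugetlbfs_min("/mnt/huge", 2097152ULL, 512) would ask for 512 pages of 2 MB each; if that many free huge pages are not available, the mount is expected to fail as described above.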
Punching Holes in hugetlbfs files
As mentioned above, applications which make use of huge pages via hugetlbfs
often have intimate knowledge of their system resource needs. In addition,
these applications may use files within hugetlbfs as huge page backed shared
memory. Within the application, many processes will be simultaneously mapping
these files. Some of the data in these files is long lived, and is used
throughout the life of the application. Other data may only be used for a
period of time and then never accessed again. When the application knows
that data within these files is no longer needed, it would like to release
the huge pages associated with the data so that they can be used for other
purposes.
Punching holes within files is accomplished with the fallocate() system call
in traditional filesystems. In Linux, the tmpfs filesystem also supports
fallocate hole punch. Adding this support to hugetlbfs provides the requested
functionality to release huge pages within files. Hole punching in hugetlbfs
is actually simpler than for other filesystems. This is because hugetlbfs
is a memory-only filesystem, so there is no disk or swap space to be
concerned with.
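A hole punch request might look like the sketch below. It uses the standard fallocate() flags FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE. Because a huge page can only be released as a whole, the sketch also includes a helper that works out which whole huge pages a byte range covers; the helper names and the explicit huge page size parameter are assumptions made for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for fallocate() */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <linux/falloc.h>       /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */

/* Compute the huge-page-aligned range [*start, *end) lying entirely
 * inside [offset, offset + len). Returns 0 if at least one whole
 * huge page is covered, -1 otherwise. (Hypothetical helper.) */
int whole_pages_in_range(off_t offset, off_t len, off_t hpage,
                         off_t *start, off_t *end)
{
    *start = (offset + hpage - 1) / hpage * hpage;  /* round up */
    *end = (offset + len) / hpage * hpage;          /* round down */
    return (*end > *start) ? 0 : -1;
}

/* Release the pages backing [offset, offset + len) of the file
 * without changing the file size. Returns 0 on success, -1 on error. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

An application would typically pass a file descriptor for a file in a hugetlbfs mount, together with an offset and length that are multiples of the huge page size.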
In addition to hole punch, fallocate pre-allocation support was also added
for hugetlbfs. This allows one to allocate multiple huge pages with a
single system call. Without pre-allocation, each huge page would be allocated
at page fault time.
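Pre-allocation uses the same system call with a mode of 0. The following is a minimal sketch under the assumption that the caller knows the huge page size of the mount; pages_needed() and preallocate_huge_pages() are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for fallocate() */
#endif
#include <fcntl.h>
#include <sys/types.h>

/* Number of huge pages required to hold `bytes` bytes, rounding up. */
off_t pages_needed(off_t bytes, off_t hpage)
{
    return (bytes + hpage - 1) / hpage;
}

/* Allocate npages huge pages starting at huge page index first_page
 * with one system call, instead of one page fault per page.
 * Returns 0 on success, -1 on error with errno set. */
int preallocate_huge_pages(int fd, off_t first_page, off_t npages,
                           off_t hpage)
{
    return fallocate(fd, 0, first_page * hpage, npages * hpage);
}
```

A failure here, for instance because the pool is exhausted, is reported at allocation time rather than later when a page is first touched.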
hugetlbfs fallocate support is part of the 4.3 release candidate series
of the Linux kernel.
Future enhancements
The 4.3 Linux kernel release candidate series contains support for
userfaultfd, by Andrea Arcangeli.
This new functionality allows for the handling of page faults in user space.
An application can monitor a range of virtual addresses. When a page fault
happens within this range, the application is notified and can take various
actions.
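The monitoring described above can be sketched roughly as follows. The code opens a userfaultfd, negotiates the API version, and registers a range for notification of faults on pages that are not yet present. The function names are hypothetical, the range is assumed to be a page-aligned anonymous mapping, and on some systems unprivileged use of userfaultfd is restricted, in which case the call simply fails.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Build a registration request for faults on missing pages in
 * [addr, addr + len). (Hypothetical helper.) */
struct uffdio_register missing_fault_request(void *addr, size_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    return reg;
}

/* Start monitoring the range. Returns the userfaultfd descriptor,
 * or -1 if userfaultfd is unavailable or registration fails. */
int watch_range(void *addr, size_t len)
{
    struct uffdio_api api;
    struct uffdio_register reg = missing_fault_request(addr, len);
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (fd < 0)
        return -1;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    if (ioctl(fd, UFFDIO_API, &api) < 0 ||
        ioctl(fd, UFFDIO_REGISTER, &reg) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After registration, a monitoring thread would poll the returned descriptor and read struct uffd_msg records from it; each record with event UFFD_EVENT_PAGEFAULT carries the faulting address, and the application decides how to respond.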
The initial version of userfaultfd only supports anonymous VMA mappings.
Applications using hugetlbfs may also like to use userfaultfd. One
identified use case is the monitoring of address ranges that were hole
punched with fallocate. Access to these areas may be considered an error
by the application. Therefore, the application would like to be notified
of such accesses.
The addition of userfaultfd support for hugetlbfs is being considered for
a future Linux kernel release.
— Mike Kravetz