hugetlbfs: Not just for databases anymore!

December 11, 2017 | 5 minute read

Mike Kravetz contributed this blog post about uses for hugetlbfs in the community!

Any operating system that implements virtual memory must translate virtual addresses to physical addresses.  Such translations could be done entirely in software, but modern processor memory management units (MMUs) provide a mechanism to speed up the process.  They keep a cache of recently used virtual-to-physical translations in the Translation Lookaside Buffer (TLB).  The size of the TLB is limited, and performance suffers when a translation cannot be found there.  Huge pages can often help avoid TLB misses: because fewer pages are needed to hold the same amount of data, there is less contention for the limited number of TLB entries.
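To make the "fewer pages" point concrete, here is a small arithmetic sketch.  The helper name pages_needed is hypothetical, and the 4KB base page size is an assumption that matches x86; other architectures may differ.

```c
#include <stdint.h>

/* Number of pages of size page_size needed to cover `bytes` bytes,
 * rounding up.  Each page needs one translation, so this is also the
 * number of TLB entries required to map the whole range at once. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

Under these assumptions, 1GB of data needs 262144 translations with 4KB pages, 512 with 2MB huge pages, and a single one with a 1GB huge page.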

Linux added support for huge pages in 2002 with the v2.5.36 kernel.  This initial support was called hugetlbfs, and it followed the traditional Unix model in which everything is a file.  To use hugetlbfs, you had to explicitly mount a hugetlbfs file system.  More importantly, to guarantee availability, huge pages had to be pre-allocated at boot time or shortly thereafter.  In addition, an application had to be modified to make use of hugetlbfs huge pages.  Because of all these requirements, hugetlbfs is not used by typical applications.  Where hugetlbfs did find a niche is in databases that address large amounts of memory and want the best possible performance.  Databases were among the first applications to use hugetlbfs and remain a significant use case today.
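The file-based model described above might look roughly like this from the application's side.  This is a minimal sketch, not a recommended implementation: it assumes an administrator has already mounted hugetlbfs somewhere (for example with mount -t hugetlbfs none /mnt/huge, where the mount point is only an illustration) and has pre-allocated pages (for example through /proc/sys/vm/nr_hugepages).  The function name map_hugetlbfs_file is hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create or open `path`, which is assumed to live on a mounted hugetlbfs,
 * and map `len` bytes of it.  `len` should be a multiple of the huge page
 * size.  Returns NULL if the file cannot be opened or the kernel cannot
 * reserve enough huge pages for the mapping. */
void *map_hugetlbfs_file(const char *path, size_t len)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The mapping keeps the file referenced; the descriptor can go. */
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```

The sketch does not check that the path really is on hugetlbfs.  If it is pointed at an ordinary file system, the mapping is backed by normal pages and touching it beyond the end of the file raises SIGBUS.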

To address the configuration and application modification burden of hugetlbfs, a new huge page model called Transparent Huge Pages (THP) was developed. As the name implies, THP generally requires no configuration or application changes to take advantage of huge pages. If mappings within an application are laid out so that they can be backed by huge pages, the kernel will attempt to use huge pages for those mappings. THP is enabled by default on most Linux distributions, so programs transparently use huge pages where possible.
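Whether THP is active on a given system can be checked from user space. The sketch below reads /sys/kernel/mm/transparent_hugepage/enabled, where the kernel marks the current mode with square brackets, as in "always [madvise] never". The function names thp_active_mode and read_thp_mode are hypothetical, and the code assumes the file holds a single short line.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) token from a line such as
 * "always [madvise] never" into `out`.
 * Returns 0 on success, -1 if no token is found or it does not fit. */
int thp_active_mode(const char *line, char *out, size_t out_len)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;

    /* The token has rb - lb - 1 characters and needs one more byte
     * for the terminating NUL. */
    if (!lb || !rb || (size_t)(rb - lb) > out_len)
        return -1;

    size_t n = (size_t)(rb - lb - 1);
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the system-wide THP mode.  Returns -1 if the sysfs file is
 * missing, for example on a kernel built without THP. */
int read_thp_mode(char *out, size_t out_len)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f)
        return -1;

    int rc = fgets(line, sizeof line, f) ? thp_active_mode(line, out, out_len) : -1;
    fclose(f);
    return rc;
}
```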

Today, most Linux huge page development is focused on THP. Recent additions to THP include page cache support, with ongoing work to add support to the ext4 filesystem. The idea is to make more and more mappings compatible with THP so that transparent use of huge pages becomes more widespread. Although THP is Linux's primary huge page focus, hugetlbfs is still being modified. A small but important number of applications continue to rely on it. These applications, and the systems they target, accept the configuration and application modification requirements in exchange for the control and predictability that hugetlbfs offers. In addition to databases, applications in virtualization, high-frequency trading and even Java employ hugetlbfs today. Below are some of the lesser-known hugetlbfs features these use cases rely on.

  • Multiple huge page sizes
    Hugetlbfs allows the use of all huge page sizes supported by the hardware and kernel.  On x86 systems, this means 1GB huge pages can be used as well as the default 2MB size.  Architectures like PowerPC have an even larger set of huge page sizes:  512KB, 1MB, 2MB, 8MB, 16MB, 1GB and 16GB.  An application with explicit knowledge of its memory usage and access patterns can choose the optimal huge page size.  In recent kernels, most architectures enable all huge page sizes by default.  The mmap(2) system call lets the caller specify the huge page size that backs a hugetlbfs mapping.
  • Dynamically allocating huge pages
    Most users of hugetlbfs want a guarantee that the huge pages an application needs are available when it needs them.  Because of this, huge pages are typically pre-allocated at boot time.  Pre-allocation guarantees that huge pages are available, but it does not guarantee where those huge pages are located.  It is possible to specify the location of the huge pages via a memory policy at pre-allocation time.  However, it is hard to know in advance where an application that runs long after pre-allocation would like its huge pages located.  As a result, some hugetlbfs usage models dynamically allocate huge pages at mmap(2) or page fault time.  This gives the application more control over huge page placement.  However, there is no guarantee that huge pages can be dynamically allocated.  Therefore, the application must be prepared to deal with mmap(2) failure, or with a SIGBUS caused by an unresolved page fault.
  • fallocate support
    This was added at the request of databases.  The pre-allocation functionality is not very exciting as one could pre-allocate a hugetlbfs file with a simple mmap(2) call to reserve huge pages.  The interesting functionality is hole punch.  It allows the removal of huge pages previously allocated to the file.  Since huge pages are typically a scarce commodity, it allows an intelligent application to release pages when no longer needed so that they can be used for other purposes.
  • Userfaultfd support
    The initial use case for userfaultfd was to catch accesses to holes created via fallocate hole punch.  If an intelligent application released a page because it would never be used again, it wanted to catch any subsequent access, as this would indicate an error.  However, after support was added, it turned out that hugetlbfs userfaultfd could also be used in QEMU postcopy live migration.  Base-size pages were previously used for live migration.  With sufficient network bandwidth, however, using larger hugetlbfs pages could speed up the migration.
  • memfd_create and file sealing
    Oracle Java came up with a new garbage collection scheme.  To facilitate this scheme, they needed multiple mappings to heap memory.  The heap is allocated as anonymous memory.  On Linux, one can use the memfd_create(2) system call to create a tmpfs file backed by anonymous memory.  The file descriptor returned by memfd_create(2) can be used to create multiple mappings to this anonymous memory and works well in this new scheme.  However, the JVM currently has the option of using huge pages to back the heap.  To help with this situation, hugetlbfs support was added to memfd_create(2).  This allows one to create multiple mappings of anonymous memory backed by huge pages.  Shortly after this functionality was added another use case surfaced.  memfd_create(2) for hugetlbfs can be used by QEMU for DPDK vHost User support.
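
Several of the features above can be combined in one short sketch: choosing a huge page size in mmap(2), tolerating allocation failure, creating a huge page backed memfd with multiple mappings, and releasing a page with fallocate hole punch.  This is an illustration under stated assumptions rather than a reference implementation.  The function names map_anon_huge and memfd_huge_demo and the memfd name "heap-sketch" are hypothetical, the fallback macro values are those from the kernel headers, and memfd_huge_demo assumes its argument equals the system's default huge page size (2MB on x86).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older C library headers; values match the kernel's. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define FALLOC_FL_PUNCH_HOLE 0x02
#endif

/* Map `len` bytes of anonymous memory backed by huge pages of size
 * 2^log2_size (21 for 2MB, 30 for 1GB).  Returns NULL when the kernel
 * cannot provide the pages, which callers must expect when huge pages
 * are allocated dynamically rather than from a pre-allocated pool. */
void *map_anon_huge(size_t len, unsigned log2_size)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                (int)(log2_size << MAP_HUGE_SHIFT);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Create a two-page hugetlbfs memfd, map it twice, check that both
 * mappings see the same memory, then punch out the second huge page.
 * Returns 0 on success, -1 if any step fails (for example because no
 * huge pages are free). */
int memfd_huge_demo(size_t huge_sz)
{
    int rc = -1;
    size_t len = 2 * huge_sz;
    char *a = MAP_FAILED, *b = MAP_FAILED;

    int fd = memfd_create("heap-sketch", MFD_CLOEXEC | MFD_HUGETLB);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;

    /* Pre-allocate both pages now, so a shortage shows up as an error
     * return here instead of a SIGBUS on first touch. */
    if (fallocate(fd, 0, 0, (off_t)len) != 0)
        goto out;

    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        goto out;

    /* A write through one mapping is visible through the other. */
    strcpy(a, "hello");
    if (strcmp(b, "hello") != 0)
        goto out;

    /* Hole punch: hand the second huge page back to the pool. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  (off_t)huge_sz, (off_t)huge_sz) != 0)
        goto out;
    rc = 0;
out:
    if (a != MAP_FAILED)
        munmap(a, len);
    if (b != MAP_FAILED)
        munmap(b, len);
    close(fd);
    return rc;
}
```

On a system with no free huge pages, both functions are expected to report failure rather than crash, which is the behavior an application relying on dynamic allocation has to plan for.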

With Linux's current focus on THP, it is easy to forget that hugetlbfs is still being used in important applications today.  As new features are added, new use cases emerge.  hugetlbfs is as useful today as when it was introduced more than 15 years ago, and it is not just for databases anymore.


Mike Kravetz
