Oracle Linux kernel developer Mike Kravetz, who is also the hugetlbfs maintainer, attended Linux Plumbers Conference 2018 and shares some of his thoughts about the conference, especially around huge pages, in this blog post.
At the 2018 Linux Plumbers Conference, Huge Page utilization was discussed during the Performance and Scalability microconf, and the topic of Contiguous Allocations was discussed during the RDMA microconf. Christoph Lameter and I gave brief presentations and led discussions on these topics.
Neither of these topics is new to Linux; both are often discussed at conferences and other developer gatherings. One reason for the frequent discussion is that the issues are somewhat complicated and difficult to implement to everyone's satisfaction. As a result, discussions tend to rehash old ideas, review any progress made and look for new ideas. Below are some of my observations from this year's discussions.
One may think that there is little to talk about in the realm of huge pages. After all, they have been available in Linux via hugetlbfs for over 15 years. When Transparent Huge Pages (THP) were added, huge pages could be used without the application changes and sysadmin setup that hugetlbfs requires.
While hugetlbfs functionality is mostly settled, new features have recently been added to THP: notably work by Kirill Shutemov and others to add THP support in shm and tmpfs. Kirill has even proposed patches that add THP support to ext4. In addition to hugetlbfs and THP, DAX (Persistent Memory) defaults to using huge pages for suitably sized mappings. Ongoing Xarray work by Matthew Wilcox will make page cache management of multiple page sizes much easier.
On systems with very large memory sizes, people would ideally like to scale up the base page size. The well-known default base page size is 4K on x86 and most other architectures. However, it is possible to change the base page size on some architectures such as arm64 and powerpc. There is interest in exploring ways to increase the base page size on x86. Jumping to the next size supported by the MMU (2M) would be wasteful in most cases, but for really big memory systems (think multi-TB) it may be worth exploring.
This discussion was a follow-up to the LPC 2017 presentation that formally introduced a new contiguous allocation request. The use case from 2017 was the need for an RDMA driver to have physically contiguous areas for optimal performance. Ideally, these areas would be allocated by and passed in from user space. The ideal size for this driver would be 2G.
Two things make this use case especially difficult. First, there is no interface capable of obtaining a physically contiguous area of such a large size. The in-kernel memory allocators are based on the buddy allocator and have a maximum allocation size of 2^(MAX_ORDER-1) pages (4M by default on x86). CMA (Contiguous Memory Allocator) can allocate such large areas, but it requires administrative overhead and coordination. Second, there is the general problem of memory fragmentation: after the system has been up and running for a while, it becomes less and less likely that large physically contiguous areas can be found. Memory migration is used to try to create large contiguous areas, but some pages become locked and cannot be moved, which prevents migration.
In a separate presentation, work in the area of fragmentation avoidance was presented by Vlastimil Babka: The hard work behind large physical allocations in the kernel. In addition, Mel Gorman has been working on a patch series to help address this issue. Christoph Lameter suggested an idea to protect large order pages from being broken up so that they would be available for contiguous allocations. However, he admits this is a controversial hack that will likely not be accepted due to the "memory reservation" aspect of the approach.
Even though progress toward reliably obtaining large contiguous allocations is slow, an in-kernel interface to obtain contiguous pages has been proposed. alloc_contig_pages() would search for and, if possible, return an arbitrary number of contiguous pages. Similar special-case code exists in the kernel today to allocate gigantic huge pages. The idea is to use this new interface for gigantic huge pages as well as other use cases.