Tuesday Oct 06, 2015

Linux Kernel hugetlbfs Enhancements, by Mike Kravetz

The following is a write-up by Oracle Mainline Linux Kernel Engineer, Mike Kravetz, on his recent upstream work on enhancing hugetlbfs support in the Linux kernel.

Introduction

Linux huge page support has been present in the Linux kernel since 2003. When first introduced, the only way to take advantage of huge pages was via hugetlbfs. This often involved modifications to application code and explicit action by system administrators to set up and reserve pools of huge pages. As a result, the use of huge pages was mostly limited to applications such as large databases which wanted the very best performance possible and had skilled developers who could modify and tune their code.

More recently, Linux kernel development has been focused on Transparent Huge Page (THP) support. THP is a system wide feature that enables the use of huge pages by any application without source code modification or system administrator intervention. The creation, management and use of huge pages is managed transparently by the Linux kernel.

THP works well for most applications today. However, some application developers want to achieve the best performance possible and are willing to modify their applications to use the original hugetlbfs interfaces. Of course, this also requires that the application have intimate knowledge of its interaction with system resources. To meet the evolving needs of these applications, two new enhancements were made to hugetlbfs.

Reserving Huge Pages

Users of hugetlbfs typically reserve huge pages at system boot time. Applications then map and fault in huge pages from this reserved pool. Since memory reserved for huge pages is not available for other uses, it is important not to reserve an excessive number of pages. However, if too few pages are reserved, applications may receive out of memory errors when the reserved pool is exhausted. Therefore, users attempt to make an accurate estimate of their huge page needs and have their applications make use of all the reserved huge pages.

One concern in this area is that the pool of reserved pages is global. It is therefore possible for any user or application on the system with sufficient privilege to use huge pages from the reserved pool. This can cause problems for an application that expects a certain number of huge pages to be available.

An application would like some reasonable assurance that its allocations will not fail due to a lack of huge pages. At start-up time, the application configures itself to use a specific number of huge pages. Before starting, it can check that enough huge pages exist in the system's global pools, but there is no guarantee that those pages will still be available when the application needs them. What the application really wants is exclusive use of a subset of huge pages.

A new hugetlbfs mount option 'min_size=' was developed to indicate the number of huge pages guaranteed to be available for use by the filesystem. At mount time, this number of huge pages will be reserved for exclusive use of the filesystem. If there are not a sufficient number of free pages, the mount will fail. As applications allocate and free huge pages from the filesystem, the number of reserved pages is adjusted so that the specified minimum is maintained. In this way, the application is assured the specified number of huge pages will be available for their use.

The min_size mount option for hugetlbfs was added in version 4.1 of the Linux kernel.
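
As a rough illustration (not taken from the article), the following sketch mounts a hugetlbfs instance with a minimum reservation using mount(2). The mount point is hypothetical, and the option value is assumed to be given as a size (here 2G) that the kernel translates into a number of huge pages:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Hypothetical example: mount a hugetlbfs instance at /mnt/hugepages
	 * with 2GB worth of huge pages guaranteed to the filesystem.
	 * Equivalent to: mount -t hugetlbfs -o min_size=2G none /mnt/hugepages
	 */
	if (mount("none", "/mnt/hugepages", "hugetlbfs", 0, "min_size=2G") == -1) {
		perror("mount");	/* fails if enough free huge pages are not available */
		return 1;
	}
	return 0;
}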

Punching Holes in hugetlbfs files 

As mentioned above, applications which make use of huge pages via hugetlbfs often have intimate knowledge of their system resource needs. In addition, these applications may use files within hugetlbfs as huge page backed shared memory. Within the application, many processes will simultaneously map these files. Some of the data in these files is long lived and is used throughout the life of the application. Other data may only be used for a period of time and then never accessed again. When the application knows that data within these files is no longer needed, it would like to release the associated huge pages so that the memory can be used for other purposes.

Punching holes in files is accomplished with the fallocate() system call in traditional filesystems. In Linux, the tmpfs filesystem also supports fallocate hole punch. Adding this support to hugetlbfs provides the requested functionality to release huge pages within files. Hole punching in hugetlbfs is actually simpler than in other filesystems, because hugetlbfs is a memory-only filesystem: there is no disk or swap space to be concerned with.
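
A minimal sketch of punching a hole in a hugetlbfs file follows (the file path and the assumed 2MB huge page size are placeholders; error handling is abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assuming 2MB huge pages */

int main(void)
{
	int fd = open("/mnt/hugepages/shared_data", O_RDWR);

	if (fd == -1) {
		perror("open");
		return 1;
	}

	/* Release the huge page backing the second huge page sized region. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      HPAGE_SIZE, HPAGE_SIZE) == -1)
		perror("fallocate");

	close(fd);
	return 0;
}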

In addition to hole punch, fallocate pre-allocation support was also added for hugetlbfs. This allows one to allocate multiple huge pages with a single system call. Without pre-allocation, each huge page would be allocated at page fault time.
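
A corresponding sketch of pre-allocation, under the same assumptions, might look like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assuming 2MB huge pages */

int main(void)
{
	int fd = open("/mnt/hugepages/shared_data", O_CREAT | O_RDWR, 0600);

	if (fd == -1) {
		perror("open");
		return 1;
	}

	/* Allocate four huge pages up front instead of at page fault time. */
	if (fallocate(fd, 0, 0, 4 * HPAGE_SIZE) == -1)
		perror("fallocate");

	close(fd);
	return 0;
}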

hugetlbfs fallocate support is part of the 4.3 release candidate series of the Linux kernel.

Future enhancements

The 4.3 Linux kernel release candidate series contains support for userfaultfd, by Andrea Arcangeli. This new functionality allows page faults to be handled in user space. An application can monitor a range of virtual addresses; when a page fault happens within this range, the application is notified and can take various actions.

The initial version of userfaultfd only supports anonymous VMA mappings. Applications using hugetlbfs would also like to use userfaultfd. One identified use case is monitoring address ranges that were hole punched with fallocate. Access to these areas may be considered an error by the application, so the application would like to be notified of such accesses.
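
To give an idea of the interface, here is a minimal sketch of the userfaultfd registration flow as it exists in 4.3 for anonymous mappings (hugetlbfs regions cannot yet be registered); the region size is arbitrary and error handling is abbreviated:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	size_t len = 4 * 4096;	/* arbitrary region size */
	void *area;
	long uffd;

	/* No glibc wrapper exists yet, so invoke the system call directly. */
	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd == -1 || ioctl(uffd, UFFDIO_API, &api) == -1) {
		perror("userfaultfd");
		return 1;
	}

	area = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask to be notified when a missing page in this range is touched. */
	reg.range.start = (unsigned long)area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
		perror("UFFDIO_REGISTER");

	/*
	 * A monitoring thread would now read struct uffd_msg events from
	 * uffd and resolve each fault with UFFDIO_COPY or UFFDIO_ZEROPAGE.
	 */
	return 0;
}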

The addition of userfaultfd support for hugetlbfs is being considered for a future Linux kernel release.

-- Mike Kravetz


Monday Aug 11, 2014

Improving the Performance of Transparent Huge Pages in Linux, by Khalid Aziz

The following is a write-up by Oracle mainline Linux kernel engineer, Khalid Aziz, detailing his and others' work on improving the performance of Transparent Huge Pages in the Linux kernel.


Introduction

The Linux kernel uses a small page size (4K on x86) to allow for efficient sharing of physical memory among processes. Even though this can maximize utilization of physical memory, it results in a large number of pages associated with each process, and each page requires an entry in the Translation Look-aside Buffer (TLB) to associate a virtual address with the physical memory page it represents. The TLB is a finite resource, and the large number of entries required for each process forces the kernel to constantly replace TLB entries. There is a performance impact any time the TLB entry for a virtual address is missing, and this impact is especially large for data intensive applications such as large databases.

To alleviate this, the Linux kernel added support for Huge Pages, which provide significantly larger page sizes for specific uses. The larger page size is variable and depends upon the architecture (from a few megabytes to gigabytes). Huge Pages can be used for shared memory or for memory mapping. They reduce the number of TLB entries required for a process's data by a factor of hundreds and thus significantly reduce the number of TLB misses for the process.

Huge Pages are statically allocated and need to be used through a hugetlbfs API, which requires changing applications at the source level to take advantage of this feature. The Linux kernel therefore added the Transparent Huge Pages (THP) feature, which coalesces multiple contiguous pages in use by a process to create a Huge Page transparently, without the process needing to know about it. This makes the benefits of Huge Pages available to every application without having to rewrite it.
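
For illustration, a minimal sketch of this explicit interface (the hugetlbfs mount point and the assumed 2MB huge page size are hypothetical, not taken from the article) might look like this:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* one 2MB huge page, assumed size */
	int fd = open("/mnt/hugepages/buffer", O_CREAT | O_RDWR, 0600);
	void *p;

	if (fd == -1) {
		perror("open");
		return 1;
	}

	/* The mapping is backed by huge pages from the reserved pool. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	munmap(p, len);
	close(fd);
	return 0;
}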

Unfortunately, adding THP caused performance side-effects. We will explore these performance impacts in more detail in this article, along with how those issues have been addressed.

The Problem

When Huge Pages were introduced in the kernel, they were meant to be statically allocated in physical memory and never swapped out. This made for simple accounting through the use of refcounts for these hugepages. Transparent hugepages, on the other hand, need to be swappable so that a process can take advantage of the performance improvements of hugepages without permanently tying up physical memory. Since the swap subsystem only deals with the base page size, it cannot swap out larger hugepages; the kernel breaks transparent hugepages up into base-size pages before swapping them out.

A page is identified as a hugepage via page flags, and each hugepage is composed of one head page and a number of tail pages. Each tail page has a pointer, first_page, that points back to the head page. The kernel can break a transparent hugepage up any time there is memory pressure and pages need to be swapped out. This creates a race between the code that breaks hugepages up and the code managing free and busy hugepages. When marking a hugepage busy or free, the code needs to ensure the hugepage is not broken up underneath it. This requires taking references to the page multiple times, locking the page to ensure it is not broken up, and executing memory barriers a few times to ensure any updates to the page flags are flushed out to memory so consistency is retained.

Before THP was introduced into the kernel in 2.6.38, the code to release a page was fairly straightforward. A call to put_page() was made, and the first thing put_page() checked was whether it was dealing with a hugepage (also known as a compound page) or a base page:

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);
	...
}

If the page being released is a hugepage, put_compound_page() verifies that the reference count has dropped to zero and then calls the free routine for the compound page, which walks the head page and tail pages and frees them all up:

static void put_compound_page(struct page *page)
{
	page = compound_head(page); 
	if (put_page_testzero(page)) { 
		compound_page_dtor *dtor; 

		dtor = get_compound_page_dtor(page); 
		(*dtor)(page); 
	}
}

This is fairly straightforward code and has virtually no impact on the performance of the page release path. After THP was introduced, additional checks, locks, page references and memory barriers were added to ensure correctness. The new put_compound_page() in 2.6.38 looks like this:

static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* __split_huge_page_refcount can run under us */
		struct page *page_head = page->first_page;
		smp_rmb();
		/*
		 * If PageTail is still set after smp_rmb() we can be sure
		 * that the page->first_page we read wasn't a dangling pointer.
		 * See __split_huge_page_refcount() smp_wmb().
		 */
		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
			unsigned long flags;
			/*
			 * Verify that our page_head wasn't converted
			 * to a a regular page before we got a
			 * reference on it.
			 */
			if (unlikely(!PageHead(page_head))) {
				/* PageHead is cleared after PageTail */
				smp_rmb();
				VM_BUG_ON(PageTail(page));
				goto out_put_head;
			}
			/*
			 * Only run compound_lock on a valid PageHead,
			 * after having it pinned with
			 * get_page_unless_zero() above.
			 */
			smp_mb();
			/* page_head wasn't a dangling pointer */
			flags = compound_lock_irqsave(page_head);
			if (unlikely(!PageTail(page))) {
				/* __split_huge_page_refcount run before us */
				compound_unlock_irqrestore(page_head, flags);
				VM_BUG_ON(PageHead(page_head));
			out_put_head:
				if (put_page_testzero(page_head))
					__put_single_page(page_head);
			out_put_single:
				if (put_page_testzero(page))
					__put_single_page(page);
				return;
			}
			VM_BUG_ON(page_head != page->first_page);
			/*
			 * We can release the refcount taken by
			 * get_page_unless_zero now that
			 * split_huge_page_refcount is blocked on the
			 * compound_lock.
			 */
			if (put_page_testzero(page_head))
				VM_BUG_ON(1);
			/* __split_huge_page_refcount will wait now */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_dec(&page->_count);
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			compound_unlock_irqrestore(page_head, flags);
			if (put_page_testzero(page_head)) {
				if (PageHead(page_head))
					__put_compound_page(page_head);
				else
					__put_single_page(page_head);
			}
		} else {
			/* page_head is a dangling pointer */
			VM_BUG_ON(PageTail(page));
			goto out_put_single;
		}
	} else if (put_page_testzero(page)) {
		if (PageHead(page))
			__put_compound_page(page);
		else
			__put_single_page(page);
	}
}

The level of complexity of the code went up significantly. This complexity guaranteed correctness but sacrificed performance.

Large database applications read large chunks of the database into memory using AIO. When databases started using hugepages for these reads, performance went up significantly due to the much lower number of TLB misses and the significantly smaller amount of memory used by page tables, which in turn resulted in lower swapping activity. When a database application reads data from disk into memory using AIO, pages from the hugepage pool are allocated for the read; the block I/O subsystem grabs references to these pages for the read and later releases those references when the read is done. This traverses the code referenced above, starting with the call to put_page(). With the newly introduced THP code, the additional overhead added up to a significant performance penalty.
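
As a rough sketch of that I/O pattern (assuming libaio is available and linked with -laio; the file path, read size and use of a MAP_HUGETLB buffer are illustrative, not the database's actual code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assuming 2MB huge pages */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd;

	fd = open("/data/dbfile", O_RDONLY | O_DIRECT);
	buf = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (fd == -1 || buf == MAP_FAILED || io_setup(1, &ctx) != 0) {
		perror("setup");
		return 1;
	}

	/* Asynchronously read 1MB from the file into the huge page buffer. */
	io_prep_pread(&cb, fd, buf, 1024 * 1024, 0);
	if (io_submit(ctx, 1, cbs) != 1 ||
	    io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
		perror("aio");
		return 1;
	}

	/* The block layer took and released page references during the read. */
	printf("read returned %ld bytes\n", (long)ev.res);

	io_destroy(ctx);
	munmap(buf, HPAGE_SIZE);
	close(fd);
	return 0;
}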

Over the next several kernel releases, the THP code was refined and optimized, which helped slightly in some cases while performance got worse in others. Subsequent refinements to the THP code to do accurate accounting of tail pages introduced the routine __get_page_tail(), which is called by get_page() to grab tail pages of a hugepage. This added further performance impact to AIO into hugetlbfs pages. All of this code stays in the code path as long as the kernel is compiled with CONFIG_TRANSPARENT_HUGEPAGE. Running echo never > /sys/kernel/mm/transparent_hugepage/enabled does not bypass this new additional code to support THP. Results from a database performance benchmark run using two common read sizes used by databases show this performance degradation clearly:


              2.6.32 (pre-THP)    2.6.39 (with THP)    3.11-rc5 (with THP)
1M read       8384 MB/s           5629 MB/s            6501 MB/s
64K read      7867 MB/s           4576 MB/s            4251 MB/s


This amounts to a 22% degradation for 1M reads and a 45% degradation for 64K reads! perf top during benchmark runs showed the CPU spending more than 40% of its cycles in __get_page_tail() and put_compound_page().

The Solution

An immediate solution to the performance degradation comes from the fact that hugetlbfs pages can never be split, and hence all the overhead added for THP can be bypassed for them. I added code to __get_page_tail() and put_compound_page() to check for a hugetlbfs page up front and bypass all the additional checks for those pages:

static void put_compound_page(struct page *page)
{
	if (PageHuge(page)) {
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_compound_page(page);

		return;
	}
...


bool __get_page_tail(struct page *page)
{
...
	if (PageHuge(page)) {
		page_head = compound_head(page);
		atomic_inc(&page_head->_count);
		got = true;
	} else {
...

This resulted in an immediate performance gain. Running the same benchmark as before with THP enabled, the new performance numbers for AIO reads are below:


              2.6.32         3.11-rc5       3.11-rc5 + patch
1M read       8384 MB/s      6501 MB/s      8371 MB/s
64K read      7867 MB/s      4251 MB/s      6510 MB/s


This patch was sent to the linux-mm and linux-kernel mailing lists in August 2013 [link] and was subsequently integrated into kernel version 3.12. This is a significant performance boost for database applications.

Further review of the original patch by Andrea Arcangeli during its integration into stable kernels exposed issues with the refcounting of pages and revealed that the patch had introduced a subtle bug where a page pointer could become a dangling pointer under certain circumstances. Andrea Arcangeli and the author worked to address these issues and revised the code in __get_page_tail() and put_compound_page() to eliminate extraneous locks and memory barriers, fix the incorrect refcounting of tail pages and remove some of the inefficiencies in the code.

Andrea sent out an initial series of patches to address all of these issues [link].

Further discussions and refinements led to the final version of these patches which were integrated into kernel version 3.13 [link],[link].

AIO performance has gone up significantly with these patches, but for smaller block sizes it is still not at the level it was before THP was introduced to the kernel. The THP and hugetlbfs code in the kernel is better at guaranteeing correctness, but that still comes at a cost in performance, so there is room for improvement.

-- Khalid Aziz.


About

The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.


Our team is globally distributed and includes leading core kernel developers and industry veterans.


This blog is edited by James Morris <james.l.morris@oracle.com>
