Monday Aug 11, 2014

Improving the Performance of Transparent Huge Pages in Linux, by Khalid Aziz

The following is a write-up by Oracle mainline Linux kernel engineer Khalid Aziz, detailing his and others' work on improving the performance of Transparent Huge Pages in the Linux kernel.


Introduction

The Linux kernel uses a small page size (4K on x86) to allow efficient sharing of physical memory among processes. While this maximizes utilization of physical memory, it results in a large number of pages associated with each process, and each page requires an entry in the Translation Look-aside Buffer (TLB) to associate a virtual address with the physical memory page backing it. The TLB is a finite resource, and the large number of entries required for each process forces the kernel to constantly evict TLB entries. There is a performance cost any time the TLB entry for a virtual address is missing, and this cost is especially large for data-intensive applications like large databases.

To alleviate this, the Linux kernel added support for Huge Pages, which provide significantly larger page sizes for specific uses. The huge page size varies by architecture, from a few megabytes to gigabytes. Huge Pages can be used for shared memory or for memory mapping. Because each huge page covers hundreds of base pages, Huge Pages reduce the number of TLB entries required for a process's data by a factor of hundreds and thus significantly reduce the number of TLB misses for the process.

Huge Pages are statically allocated and must be used through the hugetlbfs API, which requires changing applications at the source level to take advantage of the feature. The Linux kernel later added the Transparent Huge Pages (THP) feature, which coalesces multiple contiguous base pages in use by a process into a huge page transparently, without the process even knowing about it. This makes the benefits of Huge Pages available to every application without having to rewrite it.

Unfortunately, adding THP introduced performance side-effects. This article explores these performance impacts in more detail and how they have been addressed.

The Problem

When Huge Pages were introduced in the kernel, they were meant to be statically allocated in physical memory and never swapped out. This made for simple accounting through the use of refcounts on these hugepages. Transparent hugepages, on the other hand, need to be swappable so that a process can take advantage of the performance improvements of hugepages without permanently tying up physical memory. Since the swap subsystem only deals with the base page size, it cannot swap out larger hugepages; the kernel instead breaks a transparent hugepage up into base-size pages before swapping it out.

A page is identified as a hugepage via page flags, and each hugepage is composed of one head page and a number of tail pages. Each tail page has a pointer, first_page, that points back to the head page. The kernel can break a transparent hugepage up any time there is memory pressure and pages need to be swapped out. This creates a race between the code that breaks hugepages up and the code managing free and busy hugepages. When marking a hugepage busy or free, the code must ensure the hugepage is not broken up underneath it. This requires taking references to the page multiple times, locking the page to ensure it is not broken up, and executing memory barriers several times so that any updates to the page flags are flushed out to memory and consistency is retained.

Before THP was introduced into the kernel in 2.6.38, the code to release a page was fairly straightforward. A call was made to put_page(), and the first thing put_page() checked was whether it was dealing with a hugepage (also known as a compound page) or a base page:

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);
...
}

If the page being released is a hugepage, put_compound_page() finds the head page, drops a reference, and, if the refcount has fallen to zero, calls the free routine for the compound page, which walks the head page and tail pages and frees them all up:

static void put_compound_page(struct page *page)
{
	page = compound_head(page); 
	if (put_page_testzero(page)) { 
		compound_page_dtor *dtor; 

		dtor = get_compound_page_dtor(page); 
		(*dtor)(page); 
	}
}

This is fairly straightforward code and adds virtually no overhead to the page-release path. After THP was introduced, additional checks, locks, page references and memory barriers were added to ensure correctness. The new put_compound_page() in 2.6.38 looks like:

static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* __split_huge_page_refcount can run under us */
		struct page *page_head = page->first_page;
		smp_rmb();
		/*
		 * If PageTail is still set after smp_rmb() we can be sure
		 * that the page->first_page we read wasn't a dangling pointer.
		 * See __split_huge_page_refcount() smp_wmb().
		 */
		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
			unsigned long flags;
			/*
			 * Verify that our page_head wasn't converted
			 * to a regular page before we got a
			 * reference on it.
			 */
			if (unlikely(!PageHead(page_head))) {
				/* PageHead is cleared after PageTail */
				smp_rmb();
				VM_BUG_ON(PageTail(page));
				goto out_put_head;
			}
			/*
			 * Only run compound_lock on a valid PageHead,
			 * after having it pinned with
			 * get_page_unless_zero() above.
			 */
			smp_mb();
			/* page_head wasn't a dangling pointer */
			flags = compound_lock_irqsave(page_head);
			if (unlikely(!PageTail(page))) {
				/* __split_huge_page_refcount run before us */
				compound_unlock_irqrestore(page_head, flags);
				VM_BUG_ON(PageHead(page_head));
			out_put_head:
				if (put_page_testzero(page_head))
					__put_single_page(page_head);
			out_put_single:
				if (put_page_testzero(page))
					__put_single_page(page);
				return;
			}
			VM_BUG_ON(page_head != page->first_page);
			/*
			 * We can release the refcount taken by
			 * get_page_unless_zero now that
			 * split_huge_page_refcount is blocked on the
			 * compound_lock.
			 */
			if (put_page_testzero(page_head))
				VM_BUG_ON(1);
			/* __split_huge_page_refcount will wait now */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_dec(&page->_count);
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			compound_unlock_irqrestore(page_head, flags);
			if (put_page_testzero(page_head)) {
				if (PageHead(page_head))
					__put_compound_page(page_head);
				else
					__put_single_page(page_head);
			}
		} else {
			/* page_head is a dangling pointer */
			VM_BUG_ON(PageTail(page));
			goto out_put_single;
		}
	} else if (put_page_testzero(page)) {
		if (PageHead(page))
			__put_compound_page(page);
		else
			__put_single_page(page);
	}
}

The complexity of this code went up significantly. The complexity guaranteed correctness but sacrificed performance.

Large database applications read large chunks of the database into memory using AIO. When databases started using hugepages for these reads, performance went up significantly, due both to a much lower number of TLB misses and to the significantly smaller amount of memory used by page tables, which resulted in lower swapping activity. When a database application reads data from disk into memory using AIO, pages from the hugepages pool are allocated for the read; the block I/O subsystem grabs references to these pages for the read and releases those references when the read completes. This release traverses the code shown above, starting with the call to put_page(). With the newly introduced THP code, the additional overhead added up to a significant performance penalty.

Over the next several kernel releases, the THP code was refined and optimized, which helped slightly in some cases while performance got worse in others. Subsequent refinements to the THP code, to do accurate accounting of tail pages, introduced the routine __get_page_tail(), which is called by get_page() to grab tail pages of a hugepage. This added further performance impact to AIO into hugetlbfs pages. All of this code stays in the code path as long as the kernel is compiled with CONFIG_TRANSPARENT_HUGEPAGE; running "echo never > /sys/kernel/mm/transparent_hugepage/enabled" does not bypass it. Results from a database performance benchmark run with two common read sizes used by databases show this performance degradation clearly:


              2.6.32 (pre-THP)   2.6.39 (with THP)   3.11-rc5 (with THP)
1M read       8384 MB/s          5629 MB/s           6501 MB/s
64K read      7867 MB/s          4576 MB/s           4251 MB/s


This amounts to a 22% degradation for 1M reads and a 45% degradation for 64K reads! perf top during benchmark runs showed the CPU spending more than 40% of its cycles in __get_page_tail() and put_compound_page().

The Solution

An immediate solution to the performance degradation comes from the fact that hugetlbfs pages can never be split, and hence all the overhead added for THP can be bypassed for them. I added code to __get_page_tail() and put_compound_page() to check for a hugetlbfs page up front and bypass all the additional checks for those pages:

static void put_compound_page(struct page *page)
{
	if (PageHuge(page)) {
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_compound_page(page);

		return;
	}
...


bool __get_page_tail(struct page *page)
{
...

	if (PageHuge(page)) {
		page_head = compound_head(page);
		atomic_inc(&page_head->_count);
		got = true;
	} else {

...

This resulted in an immediate performance gain. Running the same benchmark as before with THP enabled, the new performance numbers for AIO reads are below:


              2.6.32      3.11-rc5    3.11-rc5 + patch
1M read       8384 MB/s   6501 MB/s   8371 MB/s
64K read      7867 MB/s   4251 MB/s   6510 MB/s


This patch was sent to the linux-mm and linux-kernel mailing lists in August 2013 [link] and was subsequently integrated into kernel version 3.12. This is a significant performance boost for database applications.

Further review of the original patch by Andrea Arcangeli, during its integration into stable kernels, exposed issues with the refcounting of pages and revealed that the patch had introduced a subtle bug where a page pointer could become a dangling pointer under certain circumstances. Andrea Arcangeli and the author worked to address these issues and revised the code in __get_page_tail() and put_compound_page() to eliminate extraneous locks and memory barriers, fix incorrect refcounting of tail pages, and remove some of the inefficiencies in the code.

Andrea sent out an initial series of patches to address all of these issues [link].

Further discussions and refinements led to the final version of these patches which were integrated into kernel version 3.13 [link],[link].

AIO performance has gone up significantly with these patches, but for smaller block sizes it is still not at the level it was before THP was introduced. The THP and hugetlbfs code in the kernel is better at guaranteeing correctness, but that still comes at a cost in performance, so there is room for improvement.

-- Khalid Aziz.


About

The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.


Our team is globally distributed and includes leading core kernel developers and industry veterans.


This blog is maintained by James Morris <james.l.morris@oracle.com>
