Monday Aug 11, 2014

Improving the Performance of Transparent Huge Pages in Linux, by Khalid Aziz

The following is a write-up by Oracle mainline Linux kernel engineer, Khalid Aziz, detailing his and others' work on improving the performance of Transparent Huge Pages in the Linux kernel.


Introduction

The Linux kernel uses a small page size (4 KB on x86) to allow efficient sharing of physical memory among processes. While this maximizes utilization of physical memory, it results in a large number of pages associated with each process, and each page requires an entry in the Translation Look-aside Buffer (TLB) to associate a virtual address with the physical memory page backing it. The TLB is a finite resource, and the large number of entries required for each process forces the kernel to constantly evict and replace TLB entries. There is a performance impact any time the TLB entry for a virtual address is missing, and this impact is especially large for data-intensive applications like large databases.

To alleviate this, the Linux kernel added support for Huge Pages, which provide a significantly larger page size for specific uses. The huge page size varies by architecture, from a few megabytes to gigabytes. Huge Pages can be used for shared memory or for memory mapping. They reduce the number of TLB entries required for a process's data by a factor of hundreds and thus significantly reduce the number of TLB misses the process incurs.
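
As a rough illustration (assuming x86-64 with a 2 MB huge page size, a detail the article does not spell out): mapping 1 GB of data with 4 KB base pages takes 1 GB / 4 KB = 262,144 translations, while mapping the same region with 2 MB huge pages takes only 512, a 512-fold reduction in the number of TLB entries needed.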

Huge Pages are statically allocated and must be used through the hugetlbfs API, which requires changing applications at the source level to take advantage of the feature. The Linux kernel later added the Transparent Huge Pages (THP) feature, which coalesces multiple contiguous base pages in use by a process into a Huge Page transparently, without the process needing to know about it at all. This makes the benefits of Huge Pages available to every application without having to rewrite it.
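
To make the contrast concrete, here is a minimal userspace sketch (not from the article) of the source-level change hugetlbfs-style allocation requires versus what THP provides for free. MAP_HUGETLB is the explicit request for memory backed by the hugepage pool; with THP enabled, a plain anonymous mapping (optionally hinted with madvise(MADV_HUGEPAGE)) can be backed by huge pages with no application changes:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (16UL * 1024 * 1024)	/* 16 MB, a multiple of the huge page size */

int main(void)
{
	/* Explicit huge pages: needs a configured hugepage pool and a source change. */
	void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(buf, 0, LEN);			/* touch the memory */
	munmap(buf, LEN);

	/* Transparent huge pages: an ordinary mapping, with only an optional hint. */
	buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	madvise(buf, LEN, MADV_HUGEPAGE);	/* ask the kernel to back this with THP */
	memset(buf, 0, LEN);
	munmap(buf, LEN);
	return 0;
}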

Unfortunately, adding THP introduced performance side-effects of its own. We will explore these performance impacts in more detail in this article, along with how they have been addressed.

The Problem

When Huge Pages were introduced in the kernel, they were meant to be statically allocated in physical memory and never swapped out. This made for simple accounting through the use of refcounts on these hugepages. Transparent hugepages, on the other hand, need to be swappable so that a process can take advantage of the performance improvement from hugepages without permanently tying up the physical memory behind them. Since the swap subsystem only deals with the base page size, it cannot swap out larger hugepages; the kernel instead breaks a transparent hugepage up into base-size pages before swapping it out.

A page is identified as a hugepage via page flags, and each hugepage is composed of one head page and a number of tail pages. Each tail page has a pointer, first_page, that points back to the head page. The kernel can break a transparent hugepage up at any time there is memory pressure and pages need to be swapped out. This creates a race between the code that breaks hugepages up and the code managing free and busy hugepages. When marking a hugepage busy or free, the code needs to ensure the hugepage is not broken up underneath it. This requires taking references on the page multiple times, locking the page to ensure it is not split, and executing memory barriers several times so that any updates to the page flags are flushed out to memory and consistency is retained.
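
As an aid to reading the code below, here is a self-contained sketch (simplified, not actual kernel code) of the head/tail relationship just described. The field and helper names mirror the pre-3.13 kernel (first_page, PageTail, compound_head), but the struct is reduced to the parts relevant here:

#include <stdbool.h>
#include <stddef.h>

struct page {
	unsigned long flags;		/* the PG_head/PG_tail bits live here in the real kernel */
	bool tail;			/* stand-in for the PG_tail page flag */
	struct page *first_page;	/* on a tail page: back-pointer to the head page */
};

static bool PageTail(const struct page *page)
{
	return page->tail;
}

/*
 * What compound_head() boils down to: follow the back-pointer if this is a
 * tail page.  The race described above is that a split can clear the tail
 * flag and invalidate first_page between reading the flag and using the
 * pointer, which is why the THP-era put_compound_page() shown later needs
 * memory barriers and the compound lock.
 */
static struct page *compound_head(struct page *page)
{
	return PageTail(page) ? page->first_page : page;
}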

Before THP was introduced into the kernel in 2.6.38, the code to release a page was fairly straightforward. A call to put_page() was made, and the first thing put_page() did was determine whether it was dealing with a hugepage (also known as a compound page) or a base page:

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page))) 
		put_compound_page(page);
	...
}

If the page being released is a hugepage, put_compound_page() drops the reference count and, once it falls to zero, calls the free routine for the compound page, which walks the head page and tail pages and frees them all:

static void put_compound_page(struct page *page)
{
	page = compound_head(page); 
	if (put_page_testzero(page)) { 
		compound_page_dtor *dtor; 

		dtor = get_compound_page_dtor(page); 
		(*dtor)(page); 
	}
}

This is fairly straightforward code and has virtually no impact on the performance of the page release path. After THP was introduced, additional checks, locks, page references and memory barriers were added to ensure correctness. The new put_compound_page() in 2.6.38 looks like this:

static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* __split_huge_page_refcount can run under us */
		struct page *page_head = page->first_page;
		smp_rmb();
		/*
		 * If PageTail is still set after smp_rmb() we can be sure
		 * that the page->first_page we read wasn't a dangling pointer.
		 * See __split_huge_page_refcount() smp_wmb().
		 */
		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
			unsigned long flags;
			/*
			 * Verify that our page_head wasn't converted
			 * to a a regular page before we got a
			 * reference on it.
			 */
			if (unlikely(!PageHead(page_head))) {
				/* PageHead is cleared after PageTail */
				smp_rmb();
				VM_BUG_ON(PageTail(page));
				goto out_put_head;
			}
			/*
			 * Only run compound_lock on a valid PageHead,
			 * after having it pinned with
			 * get_page_unless_zero() above.
			 */
			smp_mb();
			/* page_head wasn't a dangling pointer */
			flags = compound_lock_irqsave(page_head);
			if (unlikely(!PageTail(page))) {
				/* __split_huge_page_refcount run before us */
				compound_unlock_irqrestore(page_head, flags);
				VM_BUG_ON(PageHead(page_head));
			out_put_head:
				if (put_page_testzero(page_head))
					__put_single_page(page_head);
			out_put_single:
				if (put_page_testzero(page))
					__put_single_page(page);
				return;
			}
			VM_BUG_ON(page_head != page->first_page);
			/*
			 * We can release the refcount taken by
			 * get_page_unless_zero now that
			 * split_huge_page_refcount is blocked on the
			 * compound_lock.
			 */
			if (put_page_testzero(page_head))
				VM_BUG_ON(1);
			/* __split_huge_page_refcount will wait now */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_dec(&page->_count);
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			compound_unlock_irqrestore(page_head, flags);
			if (put_page_testzero(page_head)) {
				if (PageHead(page_head))
					__put_compound_page(page_head);
				else
					__put_single_page(page_head);
			}
		} else {
			/* page_head is a dangling pointer */
			VM_BUG_ON(PageTail(page));
			goto out_put_single;
		}
	} else if (put_page_testzero(page)) {
		if (PageHead(page))
			__put_compound_page(page);
		else
			__put_single_page(page);
	}
}

The level of complexity of the code went up significantly. This complexity guarantees correctness, but it sacrifices performance.

Large database applications read large chunks of the database into memory using AIO. When databases started using hugepages for these reads, performance went up significantly, thanks to the much lower number of TLB misses and the significantly smaller amount of memory used by page tables, which in turn reduced swapping activity. When a database application reads data from disk into memory using AIO, pages from the hugepage pool are allocated for the read; the block I/O subsystem grabs a reference to these pages for the read and later releases that reference when the read is done. This release traverses the code referenced above, starting with the call to put_page(). With the newly introduced THP code, the additional overhead added up to a significant performance penalty.
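
For reference, here is a minimal sketch of the kind of I/O being described: an O_DIRECT AIO read into a hugepage-backed buffer using the libaio wrappers. The file name is a placeholder and error handling is pared down; the relevant point is that the block layer takes and drops page references on the huge pages behind buf for the duration of the read, which is the get_page()/put_page() traffic discussed above. Build with -laio:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define READ_SZ (1UL * 1024 * 1024)	/* a 1M read, as in the benchmark below */

int main(void)
{
	int fd = open("datafile", O_RDONLY | O_DIRECT);	/* placeholder path */
	if (fd < 0) { perror("open"); return 1; }

	/* Buffer allocated from the hugepage pool. */
	void *buf = mmap(NULL, READ_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }

	io_context_t ctx = 0;
	if (io_setup(8, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

	struct iocb cb, *cbs[1] = { &cb };
	io_prep_pread(&cb, fd, buf, READ_SZ, 0);	/* read 1M from offset 0 */

	if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

	struct io_event ev;
	io_getevents(ctx, 1, 1, &ev, NULL);		/* wait for the read to complete */

	io_destroy(ctx);
	munmap(buf, READ_SZ);
	close(fd);
	return 0;
}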

Over the next several kernel releases, the THP code was refined and optimized, which helped slightly in some cases while performance got worse in others. Subsequent refinements to the THP code to do accurate accounting of tail pages introduced the routine __get_page_tail(), which is called by get_page() to grab tail pages of a hugepage. This added further performance impact to AIO into hugetlbfs pages. All of this code stays in the code path as long as the kernel is compiled with CONFIG_TRANSPARENT_HUGEPAGE; running "echo never > /sys/kernel/mm/transparent_hugepage/enabled" does not bypass this additional code added to support THP. Results from a database performance benchmark, run using two common read sizes used by databases, show the performance degradation clearly:


             2.6.32 (pre-THP)   2.6.39 (with THP)   3.11-rc5 (with THP)
1M read      8384 MB/s          5629 MB/s           6501 MB/s
64K read     7867 MB/s          4576 MB/s           4251 MB/s

This amounts to a 22% degradation for 1M reads and a 45% degradation for 64K reads! perf top during benchmark runs showed the CPU spending more than 40% of its cycles in __get_page_tail() and put_compound_page().

The Solution

An immediate solution to the performance degradation comes from the fact that hugetlbfs pages can never be split, and hence all the overhead added for THP can be bypassed for them. I added code to __get_page_tail() and put_compound_page() to check for a hugetlbfs page up front and bypass all the additional checks for those pages:

static void put_compound_page(struct page *page)
{
	if (PageHuge(page)) {
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_compound_page(page);

		return;
	}
...

bool __get_page_tail(struct page *page)
{
...
	if (PageHuge(page)) {
		page_head = compound_head(page);
		atomic_inc(&page_head->_count);
		got = true;
	} else {
...

This resulted in an immediate performance gain. Running the same benchmark as before with THP enabled, the new performance numbers for AIO reads are below:


             2.6.32      3.11-rc5    3.11-rc5 + patch
1M read      8384 MB/s   6501 MB/s   8371 MB/s
64K read     7867 MB/s   4251 MB/s   6510 MB/s


This patch was sent to the linux-mm and linux-kernel mailing lists in August 2013 [link] and was subsequently integrated into kernel version 3.12. This is a significant performance boost for database applications.

Further review of the original patch by Andrea Arcangeli during its integration into stable kernels exposed issues with the refcounting of pages and revealed that the patch had introduced a subtle bug where a page pointer could become a dangling pointer under certain circumstances. Andrea Arcangeli and the author worked to address these issues and revised the code in __get_page_tail() and put_compound_page() to eliminate extraneous locks and memory barriers, fix incorrect refcounting of tail pages and eliminate some of the inefficiencies in the code.

Andrea sent out an initial series of patches to address all of these issues [link].

Further discussions and refinements led to the final version of these patches which were integrated into kernel version 3.13 [link],[link].

AIO performance has gone up significantly with these patches, but it is still not at the level it was at for smaller block sizes before THP was introduced to the kernel. The THP and hugetlbfs code in the kernel is better at guaranteeing correctness, but that still comes at a cost in performance, so there is room for improvement.

-- Khalid Aziz.


Tuesday Apr 01, 2014

LSF/MM 2014 and ext4 Summit Notes by Darrick Wong

This is a contributed post from Darrick Wong, storage engineer on the Oracle mainline Linux kernel team.

The following are my notes from LSF/MM 2014 and the ext4 summit, held last week in Napa Valley, CA.

  • Discussed the draft DIX passthrough interface. Based on Zach Brown's suggestions last week, I rolled out a version of the patch with a statically defined io extensions struct, and Martin Petersen said he'd try porting some existing asmlib clients to use the new interface, with a few field-enlarging tweaks. For the most part nobody objected; Al Viro said he had no problems "yet" -- but I couldn't tell if he had no idea what I was talking about, or if he was on board with the API. It was also suggested that I seek the opinion of Michael Kerrisk (the manpages maintainer) about the API. As for the actual implementation, there are plenty of holes in it that I intend to fix this week. The NFS/CIFS developers I spoke to were generally happy to hear that the storage side was finally starting to happen, and that they could get to working on the net-fs side of things now. Nicholas Bellinger noted that targetcli can create DIF disks even with the fileio backend, so he suggested I play with that over scsi_debug.

  • A large part of LSF was taken up with the discussion of how to handle the brave new world of weird storage devices. To recap: in the beginning, software had to deal with the mechanical aspects of a rotating disk; addressing had to be done in terms of cylinders, heads, and sectors (CHS). This made it difficult to innovate drive mechanics, as it was impossible to express things like variable zone density to existing software. SCSI eliminated this pain by abstracting a disk into a big tub of consecutive sectors, which simplified software quite a bit, though at some cost to performance. But most programs weren't trying to wring the last iota of performance out of disks and didn't care. So long as some attention was paid to data locality, disks performed adequately. Fast forward to 2014: now we have several different storage device classes: Flash, which has no seek penalty but prefers large writeouts; SMR drives with hard-disk seek penalties but requirements that all writes within a ~256MB zone be written in linear order; RAIDs, which by virtue of stripe geometries violate a few of the classic assumptions about hard disks; and NVMe devices which implement atomic read and write operations. Dave Chinner suggested that rather than retrofitting each filesystem to deal with each of these devices, it might be worth shoving all the block allocation and mapping operations down to a device mapper (dm) shim layer that can abstract away different types of storage, leaving FSes to manage namespace information. This suggestion is very attractive on a few levels: benefits include the ability to emulate atomic read/writes with journalling, more flexible software-defined FTLs for flash and SMR, and improved communication with cloud storage systems -- Mike Snitzer had a session about dm-thinp and the proper way for FSes to communicate allocation hints to the underlying storage; this would certainly seem to fit the bill. I mentioned that Oracle's plans for cheap ext4 reflink would be trivial to implement with dm shims. Unfortunately, the devil is in the details -- when will we see code? For that reason, Ted Ts'o was openly skeptical.

  • The postgresql developers showed up to complain about stable pages and to ask for a less heavyweight fsync() -- currently, when fsync is called, it assumes that the caller wants all dirty data written out NOW, so it writes dirty pages with WRITE_SYNC, which starves reads. For postgresql this is suboptimal since fsync is typically called by the checkpointing code, which doesn't need to be fast and doesn't care if fsync writeback is not fast. There was an interlock scheduled for Thursday afternoon, but I was unable to attend. See LWN for more detailed coverage of the postgresql (and FB) sessions.

  • At the ext4 summit, we discussed a few cleanups, such as removing the use of buffer_heads and the impending removal of the ext2/3 drivers. Removing buffer_heads in the data path has the potential benefit that it'll make the transition to supporting block/sector size > page size easier, as well as reducing memory requirements (buffer heads are a heavyweight structure now). There was also the feeling that once most enterprise distros move to ext4, it will be a lot easier to remove ext3 upstream because there will be a lot more testing of the use of ext4.ko to handle ext2/3 filesystems. There was a discussion of removing ext2 as well, though that stalled on concerns that Christoph Hellwig (hch) would like to see ext2 remain as a "sample" filesystem; Jan Kara could be heard muttering that nobody wants a bitrotten example.

  • The other major new ext4 feature discussed at the ext4 summit is per-data block metadata. This got started when Lukas Czerner (lukas) proposed adding data block checksums to the filesystem. I quickly chimed in that for e2fsck it would be helpful to have per-block back references to ease reconstruction of the filesystem, at which point the group started thinking that rather than a huge static array of block data, the complexity of a b-tree with variable key size might well be worth the effort. Then again, with all the proposed filesystem/block layer changes, Ted said that he might be open to a quick v1 implementation because the block shim layer discussed in the SMR forum could very well obviate the need for a lot of ext4 features. Time will tell; Ted and I were not terribly optimistic that any of that software is coming soon. In any case, lukas went home to refine his proposal. The biggest problem is ext4's current lack of a btree implementation; this would have to be written or borrowed, and then tested. I mentioned to him that this could be the cornerstone of reimplementing a lot of ext4 features with btrees instead of static arrays, which could be a good thing if RH is willing to spend a lot of engineering time on ext4.

  • Michael Halcrow, speaking at the ext4 summit, discussed implementing a lightweight encrypted filesystem subtree feature. This sounds a lot like ecryptfs, but hopefully less troublesome than the weird shim fs that is ecryptfs. For the most part he seemed to need (a) the ability to inject his code into the read/write path and (b) the ability to store a small amount of per-inode encryption data. His use-case is Chrome OS, which apparently needs the ability for cache management programs to erase parts of a(nother) user's cache files without having the ability to access the file. The discussion concluded that it wouldn't be too difficult for him to start an initial implementation with ext4, but that much of this ought to be in the VFS layer.

 -- Darrick

[Ed: see also the LWN coverage of LSF/MM]

About

The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.


Our team is globally distributed and includes leading core kernel developers and industry veterans.


This blog is maintained by James Morris <james.l.morris@oracle.com>
