The following is a write-up by Oracle mainline Linux kernel engineer, Khalid Aziz, detailing his and others’ work on improving the performance of Transparent Huge Pages in the Linux kernel.
Introduction 
 
The Linux kernel uses a small page size
(4K on x86) to allow for efficient sharing of physical memory among
processes. Even though this maximizes utilization of physical
memory, it results in a large number of pages associated with each
process, and each page requires an entry in the Translation Look-aside
Buffer (TLB) to associate a virtual address with the physical memory
page it maps. The TLB is a finite resource, and the large number of
entries required for each process forces the kernel to constantly
evict TLB entries. There is a performance impact any time the TLB
entry for a virtual address is missing, and this impact is especially
large for data-intensive applications like large databases.
To alleviate this, the Linux kernel added support for
Huge Pages, which provide significantly larger page sizes for
specific uses. The larger page size varies by architecture (from a
few megabytes to gigabytes). Huge Pages can be used for shared memory
or for memory mapping. They reduce the number of TLB entries required
for a process's data by a factor of hundreds and thus significantly
reduce the number of TLB misses for the process. For example, on x86
with 2MB huge pages, mapping 1GB of data takes 512 TLB entries instead
of the 262,144 entries needed with 4K pages.
Huge Pages are statically allocated and must be used through the
hugetlbfs API, which requires changing applications at the source level
to take advantage of the feature. The Linux kernel therefore added a
Transparent Huge Pages (THP) feature that coalesces multiple contiguous
pages in use by a process into a Huge Page transparently, without the
process even needing to know about it. This makes the benefits of
Huge Pages available to every application without having to rewrite
it.
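To make that contrast concrete, here is a minimal sketch, under the assumption of 2MB huge pages on x86 and a pre-reserved huge page pool, of the kind of source-level change explicit Huge Page use requires. It maps an anonymous huge-page-backed region with mmap() and MAP_HUGETLB; mapping a file from a hugetlbfs mount works similarly:
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB huge pages (x86) */

int main(void)
{
	/* Explicitly ask for huge-page-backed memory; this fails unless huge
	 * pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages. */
	void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	memset(buf, 0, HUGE_PAGE_SIZE);	/* touch the memory so it is faulted in */
	munmap(buf, HUGE_PAGE_SIZE);
	return 0;
}
With THP enabled, a plain anonymous mmap() of a large, aligned region can get the same benefit with no source changes at all.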
Unfortunately, adding THP came with performance side-effects. This
article explores those performance impacts in more detail and describes
how the issues have been addressed.
The Problem
When Huge Pages were introduced in
the kernel, they were meant to be statically allocated in physical
memory and never swapped out. This made for simple accounting through
the use of refcounts for these hugepages. Transparent hugepages, on the
other hand, need to be swappable so a process can take advantage of
the performance improvements from hugepages without tying up
physical memory for them. Since the swap subsystem only deals with the
base page size, it cannot swap out larger hugepages. The kernel
therefore breaks transparent hugepages up into base-size pages before
swapping them out.
A page is identified as a hugepage via page flags, and each hugepage is
composed of one head page and a number of tail pages. Each tail page
has a pointer, first_page, that points back to the head page. The
kernel can break transparent hugepages up any time there is memory
pressure and pages need to be swapped out. This creates a race between
the code that breaks hugepages up and the code managing free and busy
hugepages. When marking a hugepage busy or free, the code needs to
ensure the hugepage is not broken up underneath it. This requires
taking a reference to the page multiple times, locking the page to
ensure it is not broken up, and executing memory barriers a few times
to ensure any updates to the page flags are flushed out to memory so
consistency is retained.
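In simplified form, and ignoring the races just described, resolving any page of a compound page back to its head page in kernels of that era looked roughly like the sketch below; the helpers and the first_page field have since changed upstream, so treat this as illustrative rather than the literal current code:
/* Simplified sketch: a tail page carries a back-pointer to its head page. */
static inline struct page *compound_head(struct page *page)
{
	if (unlikely(PageTail(page)))
		return page->first_page;	/* set when the compound page was assembled */
	return page;
}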
Before THP was introduced into the
kernel in 2.6.38, the code to release a page was fairly
straightforward. A call to put_page() was made, and the first thing
put_page() checked was whether it was dealing with a hugepage (also
known as a compound page) or a base page:
void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);
	...
}
If the page being released is a
hugepage, put_compound_page()
verifies that the reference count has dropped to zero and then calls
the free routine for the compound page, which walks the head page and
tail pages and frees them all:
static void put_compound_page(struct page *page)
{
	page = compound_head(page);
	if (put_page_testzero(page)) {
		compound_page_dtor *dtor;
		dtor = get_compound_page_dtor(page);
		(*dtor)(page);
	}
}
This is fairly straightforward code
and has virtually no impact on the performance of the page release path.
After THP was introduced, additional checks, locks, page references
and memory barriers were added to ensure correctness. The new
put_compound_page() in 2.6.38 looks like:
static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* __split_huge_page_refcount can run under us */
		struct page *page_head = page->first_page;
		smp_rmb();
		/*
		 * If PageTail is still set after smp_rmb() we can be sure
		 * that the page->first_page we read wasn't a dangling pointer.
		 * See __split_huge_page_refcount() smp_wmb().
		 */
		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
			unsigned long flags;
			/*
			 * Verify that our page_head wasn't converted
			 * to a a regular page before we got a
			 * reference on it.
			 */
			if (unlikely(!PageHead(page_head))) {
				/* PageHead is cleared after PageTail */
				smp_rmb();
				VM_BUG_ON(PageTail(page));
				goto out_put_head;
			}
			/*
			 * Only run compound_lock on a valid PageHead,
			 * after having it pinned with
			 * get_page_unless_zero() above.
			 */
			smp_mb();
			/* page_head wasn't a dangling pointer */
			flags = compound_lock_irqsave(page_head);
			if (unlikely(!PageTail(page))) {
				/* __split_huge_page_refcount run before us */
				compound_unlock_irqrestore(page_head, flags);
				VM_BUG_ON(PageHead(page_head));
out_put_head:
				if (put_page_testzero(page_head))
					__put_single_page(page_head);
out_put_single:
				if (put_page_testzero(page))
					__put_single_page(page);
				return;
			}
			VM_BUG_ON(page_head != page->first_page);
			/*
			 * We can release the refcount taken by
			 * get_page_unless_zero now that
			 * split_huge_page_refcount is blocked on the
			 * compound_lock.
			 */
			if (put_page_testzero(page_head))
				VM_BUG_ON(1);
			/* __split_huge_page_refcount will wait now */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_dec(&page->_count);
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			compound_unlock_irqrestore(page_head, flags);
			if (put_page_testzero(page_head)) {
				if (PageHead(page_head))
					__put_compound_page(page_head);
				else
					__put_single_page(page_head);
			}
		} else {
			/* page_head is a dangling pointer */
			VM_BUG_ON(PageTail(page));
			goto out_put_single;
		}
	} else if (put_page_testzero(page)) {
		if (PageHead(page))
			__put_compound_page(page);
		else
			__put_single_page(page);
	}
}
The level of complexity of the code
went up significantly. This complexity guaranteed correctness but
sacrificed performance.
Large database applications read
large chunks of the database into memory using AIO. When databases
started using hugepages for these reads into memory, performance went
up significantly due to the much lower number of TLB misses and the
significantly smaller amount of memory used by page tables, which in
turn reduced swapping activity. When a database application reads data
from disk into memory using AIO, pages from the hugepages pool are
allocated for the read; the block I/O subsystem grabs a reference to
these pages for the read and releases that reference when the read is
done. This traverses the code referenced above, starting with the call
to put_page(). With the newly introduced THP code, the additional
overhead added up to a significant performance penalty.
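For context, the I/O pattern in question looks roughly like the following sketch: a direct (O_DIRECT) AIO read submitted with libaio into a buffer backed by huge pages. The file path and sizes are placeholders; the point is that each submitted read makes the block layer take and later drop page references on the huge-page-backed buffer, which is what drives the put_page()/get_page() traffic:
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <sys/mman.h>

#define READ_SIZE      (1UL * 1024 * 1024)	/* 1M read, as in the benchmark */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB huge pages (x86) */

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;

	/* Huge-page-backed buffer; O_DIRECT needs an aligned buffer and a
	 * huge page boundary more than satisfies that requirement. */
	void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	int fd = open("/path/to/datafile", O_RDONLY | O_DIRECT);	/* placeholder path */

	if (buf == MAP_FAILED || fd < 0 || io_setup(1, &ctx) < 0) {
		fprintf(stderr, "setup failed\n");
		return 1;
	}

	/* Submitting the read pins the buffer's pages; completion releases
	 * them again via the put_page() path discussed above. */
	io_prep_pread(&cb, fd, buf, READ_SIZE, 0);
	if (io_submit(ctx, 1, cbs) != 1 ||
	    io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
		fprintf(stderr, "aio read failed\n");
		return 1;
	}

	io_destroy(ctx);
	return 0;
}
Link with -laio; this is only meant to show the access pattern that exercises the page reference code, not the actual database I/O engine.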
Over the next several kernel
releases, the THP code was refined and optimized, which helped slightly
in some cases while performance got worse in others. Subsequent
refinements to the THP code, made to do accurate accounting of tail
pages, introduced the routine __get_page_tail(), which is called by
get_page() to grab tail pages of a hugepage. This added further
performance impact to AIO into hugetlbfs pages. All of this code stays
in the code path as long as the kernel is compiled with
CONFIG_TRANSPARENT_HUGEPAGE; running echo never > /sys/kernel/mm/transparent_hugepage/enabled does not bypass this additional code added to support THP. Results from
a database performance benchmark run using two common read sizes used
by databases show this performance degradation clearly:
|          | 2.6.32 (pre-THP) | 2.6.39 (with THP) | 3.11-rc5 (with THP) |
| 1M read  | 8384 MB/s        | 5629 MB/s         | 6501 MB/s           |
| 64K read | 7867 MB/s        | 4576 MB/s         | 4251 MB/s           |
This amounts to a 22% degradation for the
1M read and a 45% degradation for the 64K read! Running perf top during
the benchmark showed the CPU spending more than 40% of its cycles in
__get_page_tail() and put_compound_page().
The Solution
An immediate solution to the performance degradation comes from the
fact that hugetlbfs pages can never be split, and hence all the
overhead added for THP can be bypassed for them. I added code to
__get_page_tail() and put_compound_page() to check for a hugetlbfs
page up front and bypass all the additional checks for those pages:
static void put_compound_page(struct page *page)
{
	if (PageHuge(page)) {
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_compound_page(page);
		return;
	}
	...

bool __get_page_tail(struct page *page)
{
	...
	if (PageHuge(page)) {
		page_head = compound_head(page);
		atomic_inc(&page_head->_count);
		got = true;
	} else {
	...
This resulted in an immediate performance gain. Running the same
benchmark as before with THP enabled, the new performance numbers for
AIO reads are below:
|          | 2.6.32    | 3.11-rc5  | 3.11-rc5 + patch |
| 1M read  | 8384 MB/s | 6501 MB/s | 8371 MB/s        |
| 64K read | 7867 MB/s | 4251 MB/s | 6510 MB/s        |
 
This patch was sent to the linux-mm and linux-kernel mailing lists in
August 2013 [link]
and was subsequently integrated into kernel version 3.12. This is a
significant performance boost for database applications.
Further review of the original patch by Andrea Arcangeli during its
integration into stable kernels exposed issues with refcounting of
pages and revealed that the patch had introduced a subtle bug where a
page pointer could become a dangling pointer under certain
circumstances. Andrea Arcangeli and the author worked to address these
issues and revised the code in __get_page_tail() and put_compound_page()
to eliminate extraneous locks and memory barriers, fix incorrect
refcounting of tail pages, and eliminate some of the inefficiencies in
the code.
Andrea sent out an initial series of patches to address all
of these issues [link].
Further discussions and refinements led to the final version of these
patches, which were integrated into kernel version 3.13 [link], [link].
AIO performance has gone up
significantly with these patches, but for smaller block sizes it is
still not at the level it was before THP was introduced into the
kernel. The THP and hugetlbfs code in the kernel is better at
guaranteeing correctness, but that still comes at the cost of some
performance, so there is room for improvement.
— Khalid Aziz.
 
