Tuesday Dec 02, 2014

Improving Sunvnet Performance on Linux for SPARC, by Sowmini Varadhan

The following is a write-up by Oracle mainline Linux kernel engineer, Sowmini Varadhan, detailing her recent work on improving the performance of the Sunvnet driver on Linux for SPARC.


Background

In a typical device-driver, the Producer (I/O device) notifies the Consumer (device-driver) that data is available for consumption by triggering a hardware interrupt at a fixed Interrupt Priority Level (IPL). In the purely interrupt-driven model, the Consumer then masks off any additional Rx interrupts from the device, and drains the read-buffers in hardware-interrupt context. A network device-driver would then enqueue packets for the TCP/IP stack, where they would typically be processed in software interrupt (softirq) context.

Dispatching an interrupt is an expensive operation, thus network device drivers should attempt to batch interrupts, i.e., process as many packets as possible within the context of one interrupt. Also, hardware interrupts preempt all tasks running at a lower IPL. Thus the amount of time spent in hardware interrupt context should be kept to a minimum. As pointed out in Mogul [1], "If the event rate is high enough to cause the system to spend all of its time responding to interrupts, then nothing else will happen, and the system throughput will drop to zero". This condition is called receive-livelock, and all purely interrupt-driven systems are susceptible to it.

We will now talk about the various improvements made to the sunvnet driver on Linux to convert it from being a purely interrupt-driven network device driver to one that implements all of the above prescriptions using Linux's most current device-driver infrastructure.

What is Sunvnet?

In a virtualized environment such as LDoms, the guest Operating Systems (DomU) communicate with each other using a virtual link-layer abstraction called Logical Domain Channel (LDC) on SPARC. The LDC provides point-to-point communication channels between the guests, or between the domU and an external entity such as a service processor or the Hypervisor itself. The LDC provides an encapsulation protocol for other upper-layer protocols such as TCP/IP and Ethernet.

Sunvnet is the device driver that implements this virtual link-layer on Linux.

Batching Interrupts

In its simplest mode of operation, when the LDC Producer wishes to send an IP packet to the consumer, it needs to do two things:

  1. Copy the data packet to a descriptor buffer. In the "TxDring" mode, this buffer is a shared-memory region that is "owned" by the Producer.
  2. After the packet has been successfully copied, the Producer needs to signal to the Consumer that data is available. This is achieved by sending a "start" message over the LDC. A "start" message is a 64-byte message sent over the LDC in the format specified by the VIO protocol. The start message has a subtype of VIO_SUBTYPE_DATA, and specifies the index of the descriptor buffer at which data is available.

The transmission of the LDC "start" message is processed at the Hypervisor, and will result in a hardware interrupt at the Consumer, which will invoke the ldc_rx() interrupt handler. The Consumer then processes the interrupt in hardirq context, and when it is done, if the Producer had requested a "stopped" ack for the packet, the Consumer sends back a "stopped" message over LDC. Just like the "start" message, the "stopped" message is specified by the VIO protocol. It has a subtype of VIO_SUBTYPE_ACK (0x2) and allows the Consumer to specify the index at which data was last read.
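
For orientation, here is a rough sketch of the information such a 64-byte VIO descriptor-ring message carries. The layout approximates `struct vio_dring_data' in arch/sparc/include/asm/vio.h; the struct name and comments below are illustrative rather than authoritative.

struct vio_dring_data_sketch {
        struct vio_msg_tag      tag;            /* type, subtype (DATA for "start",
                                                 * ACK for "stopped"), session id */
        u64                     seq;            /* sequence number */
        u64                     start_idx;      /* first descriptor with ready data */
        u64                     end_idx;        /* last ready descriptor, or the last
                                                 * index read for a "stopped" ACK */
        u32                     state;          /* e.g. VIO_DRING_ACTIVE or
                                                 * VIO_DRING_STOPPED */
};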

Note that the VIO protocol does not mandate a "stopped" LDC message for every descriptor read/write: the Consumer is required to send back an LDC "stopped" message if, and only if:

  • the Producer has requested it for the descriptor; or,
  • the Consumer has read a full burst of ready data in descriptors, and there are no more ready descriptors.

LDC messaging is expensive: it requires a slot in the LDC ring, in addition to triggering a hardware interrupt at the receiver. Thus the first step to improving sunvnet performance was to reduce the number of LDC messages sent and received, batching packets as much as possible.

We achieved this with the following patches:

These, along with some other bug fixes, brought sunvnet to a more stable performance level: we observed fewer dev_watchdog hangs (previously seen due to flow-control assertions caused by a full LDC channel) and fewer soft-lockups. It also gave a 25% bump to performance. In iperf tests on a T5-2 using 16 VCPUs and 16 iperf threads, we were now able to handle approximately 100k pps, whereas we could handle at most 80k pps prior to the fixes. (See diagram below).

But all packets were still being received in hard-interrupt context. And as Mogul [1] established in the 1990s, that is toxic to performance.

NAPI

Linux implements the concepts described in Mogul [1] through a common device-driver infrastructure called NAPI. The NAPI framework allows a driver to defer reception of packet-bursts from hardware-interrupt context to a polling-mechanism that is invoked in softirq context. In addition to the benefits of interrupt mitigation and avoidance of receive live-lock, this also has other ramifications:

  • Since packet transmission via NET_TX_ACTION is already done in softirq context, moving Rx processing to softirq context now allows Tx reclaim, and recovery from link-congestion, to be done more efficiently.
  • The locking model is simplified, eliminating a number of spin_[un]lock_irqsave[restore] invocations, and improving system performance in general.
  • Moving the Rx processing to softirq context allows the driver to use the vastly more efficient netif_receive_skb() to pass the packet up to the network-stack, instead of being constrained to defer to netif_rx(), which is invoked in the less-desirable process context.
  • We also get the benefits of ksoftirqd to schedule softirq under scheduler control. Otherwise, everything would get processed on the CPU that receives the hardware interrupt, and you would have to configure RPS to distribute those hardirqs (can be done, but requires extra administration).

We'll now walk through the changes made to NAPIfy sunvnet, to examine each of these items.

The details...

The sunvnet driver has a `struct vnet_port' data-structure for each connected peer. At a minimum, there is one such structure for the vswitch peer in Dom0. In addition, if the Dom0 ldm property `inter-vnet-link' has been set to `on' (the default), DomUs on the same physical host will have a virtual point-to-point channel over LDC. Each such channel is represented by a unique `struct vnet_port' and has its own LDC ring and Rx descriptor buffers.

As part of the device probe callback, sunvnet allocates one `struct napi_struct' instance for each `struct vnet_port'.

    struct vnet_port {
            /* ... */
            struct napi_struct      napi;
            /* ... */
    };

The next NAPI requirement is to modify the driver's Rx interrupt handler. When a new packet becomes available, the driver must disable any additional Rx interrupts (LDC Rx interrupts in this case), and arrange for polling by invoking napi_schedule. This is achieved as follows:

Both sunvnet and the VDC (virtual disk driver) infrastructure share a common set of routines for processing VIO messages and LDC interrupts. The Rx interrupt handler (`ldc_rx()') is therefore common to both modules; it hands off packets destined for sunvnet by invoking the `vnet_event()' callback that is registered by sunvnet. In `vnet_event()', we defer packet processing to the NAPI poll callback by recording the events (which may include both LDC control events such as UP/DOWN notifications, as well as notification about incoming data), disabling hardware interrupts, and scheduling a NAPI callback for the poll handler.

static void vnet_event(void *arg, int event)
{
        struct vnet_port *port = arg;
        struct vio_driver_state *vio = &port->vio;

        port->rx_event |= event;
        vio_set_intr(vio->vdev->rx_ino, HV_INTR_DISABLED);
        napi_schedule(&port->napi);
}

We now need to set up the poll handler itself. We do this in `vnet_poll()' which has the signature:

        static int vnet_poll(struct napi_struct *napi, int budget);

Thus vnet_poll will be called with a pointer to the NAPI instance, so that the `struct vnet_port' can be obtained as

        struct vnet_port *port = container_of(napi, struct vnet_port, napi);

The `budget' parameter is an upper-bound on the number of packets that can be processed in any single ->poll invocation. The intention of the `budget' parameter is to ensure fair-scheduling across drivers, and avoid starvation when a single driver gets flooded with a packet burst. The ->poll() callback, i.e., vnet_poll(), must return the number of packets processed. A return value that is less than the budget can be taken to indicate that we are at the end of a packet burst, i.e., hard-interrupts can be re-enabled. We do this in `vnet_poll()' as

        if (processed < budget) {
                napi_complete(napi);
                port->rx_event &= ~LDC_EVENT_DATA_READY;
                vio_set_intr(vio->vdev->rx_ino, HV_INTR_ENABLED);
        }

Here the value of processed is obtained by calling

        int processed = vnet_event_napi(port, budget);

where vnet_event_napi examines and processes the `rx_event' bits available on the `vnet_port'. If data is available on the port, `vnet_event_napi()' will read the LDC channel for information about the starting descriptor index, and process a batch of descriptors in softirq mode, passing up the received packets to the network stack using `napi_gro_receive()'. The batch processing of descriptors is constrained to at most `budget' descriptors per `vnet_event_napi()' invocation.
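
The inner loop implied by this description is roughly shaped like the sketch below. This is a simplification, not the actual sunvnet code; `vnet_rx_one()' stands in for whatever routine copies the next ready descriptor into an skb.

static int vnet_drain_ring(struct vnet_port *port, int budget)
{
        int processed = 0;
        struct sk_buff *skb;

        while (processed < budget) {
                /* vnet_rx_one() is a stand-in for the code that copies the
                 * next ready descriptor into a freshly allocated skb; it
                 * returns NULL once no more ready descriptors remain. */
                skb = vnet_rx_one(port);
                if (!skb)
                        break;

                /* Pass the packet up in softirq context, with GRO. */
                napi_gro_receive(&port->napi, skb);
                processed++;
        }
        return processed;
}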

The final step is to inform the NAPI infrastructure that `vnet_poll()' is the poll callback. We do this in the vnet_port_probe() routine

        netif_napi_add(port->vp->dev, &port->napi, vnet_poll, NAPI_POLL_WEIGHT);

and actually enable NAPI before marking the port up:

        napi_enable(&port->napi);
        vio_port_up(&port->vio);

Some caveats specific to sunvnet/LDC

The `budget' parameter passed by the NAPI infra to `vnet_poll()' places an upper-bound on the number of packets that may be processed in a single ->poll callback. While this ensures fair-scheduling across drivers, we should be careful not to unnecessarily send LDC stop/start messages at each `budget' boundary when the packet burst size is larger than the `budget'.

This entails tracking additional state in the `vnet_port' to remember (a) when packet processing is truncated prematurely due to `budget' constraints, and (b) the last index processed when (a) occurs.

Both of these items are tracked in the `vnet_port' as

        bool                    napi_resume;
        u32                     napi_stop_idx;
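
A minimal sketch of how these fields come into play (variable names other than the two fields themselves are illustrative):

        /* On entry to the descriptor-processing code: */
        if (port->napi_resume)
                start_idx = port->napi_stop_idx;    /* resume a truncated burst */
        else
                start_idx = ldc_start_idx;          /* index from the "start" message
                                                     * (illustrative name) */

        /* After processing up to `budget' descriptors: */
        if (processed >= budget) {
                port->napi_resume = true;           /* truncated by the budget */
                port->napi_stop_idx = last_idx;     /* where to pick up next time */
        } else {
                port->napi_resume = false;          /* burst fully drained; a
                                                     * "stopped" ACK may go out now */
        }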


Benefits of NAPI

The most obvious benefit of NAPIfication is interrupt mitigation. The ability to process packets in softirq context and pass them up using napi_gro_receive() by itself results in a significant increase in packet processing rate. On a T5-2 with 16 VCPUs, iperf tests using 16 threads result in 230k pps (compared to the newer baseline of 100k pps!). This is a further 130% increase in performance.


In addition, conforming to the NAPI infrastructure automatically provides access to all the newest features and enhancements in the Linux driver infra, such as enhanced RPS.

But there are other benefits as well. With both Tx and Rx packets now being processed in softirq context, the irq save/restore locking done in sunvnet at the port level is eliminated, resulting in lock-less processing. netif_tx_lock() can instead be used to synchronize access in critical sections such as Tx reclaim, which can now be inlined from the ->poll() routine without any preemption concerns with dev_watchdog().
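
As a sketch of what this buys us, Tx reclaim can be called straight from the ->poll() path under the Tx lock; the reclaim helper named below is hypothetical, not the exact sunvnet routine:

        netif_tx_lock(dev);             /* replaces port-level spin_lock_irqsave() */
        vnet_clean_tx_ring(port);       /* hypothetical: release completed Tx descriptors */
        netif_tx_unlock(dev);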

Multiple Tx queues

We've mostly talked about Rx side handling here, but on Tx side, when inter-vnet-link is on, we have a virtual point-to-point link between guests on the same physical host. As mentioned earlier, each such point-to-point link is represented as its own data-structure (`struct vnet_port') and has its own LDC ring and Rx descriptor buffers. Thus a flow-controlled path due to bursty traffic between peers A and B should not impact traffic between peers A and C. The Linux driver infrastructure makes this possible through the support for multiple Tx queues.

Briefly, these were the steps to set up multiple Tx queues:

  1. Queue allocation: invoke alloc_etherdev_mqs(), to set up VNET_MAX_TXQS queues when creating the `struct net_device'.
  2. As each port is added, assign a queue index to the port in a round-robin fashion. The assigned index is tracked in the `vnet_port' structure.
  3. Supply a ->ndo_select_queue callback that returns the selected queue to dev_queue_xmit() when it calls netdev_pick_tx(). In the case of sunvnet, vnet_select_queue() simply returns the index assigned to the vnet_port that will carry the outgoing packet (a condensed sketch follows this list).
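
A condensed sketch of these three steps, with error handling omitted. `q_index', `next_q_index' and the helper that maps an skb to its vnet_port are illustrative names, not necessarily those used in the driver; the ->ndo_select_queue signature shown is the one current in kernels of this era.

/* 1. Allocate the net_device with VNET_MAX_TXQS Tx queues and one Rx queue. */
dev = alloc_etherdev_mqs(sizeof(struct vnet), VNET_MAX_TXQS, 1);

/* 2. As each port is added, hand it a Tx queue index round-robin. */
port->q_index = vp->next_q_index++ % VNET_MAX_TXQS;

/* 3. Let the stack pick the right queue for each outgoing packet. */
static u16 vnet_select_queue(struct net_device *dev, struct sk_buff *skb,
                             void *accel_priv, select_queue_fallback_t fallback)
{
        struct vnet_port *port = vnet_port_for_skb(dev, skb);  /* illustrative helper */

        return port ? port->q_index : 0;
}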

After the integration of multiple Tx queues, we can do even better at recovering from flow-control.

Flow control is asserted on the Tx side when we either exhaust the descriptor rings for data or run out of resources to send LDC messages. After the batched LDC processing optimizations, it is uncommon to run out of resources for LDC messages. Thus flow-control is typically asserted when the Producer generates data much faster than the Consumer can drain it, at which point netif_tx_stop_queue() is invoked, blocking the Tx queue for that specific peer.

The flow-control can thus be released when we get back an LDC stopped ACK from the blocked peer (neatly identified by the LDC message, and by the specific vnet_port and Tx queue!).
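
In code terms this can be as simple as the following sketch, run when a "stopped" ACK arrives from a peer whose queue was stopped (q_index being the Tx queue index assigned to that vnet_port):

        if (netif_tx_queue_stopped(netdev_get_tx_queue(dev, port->q_index)))
                netif_wake_subqueue(dev, port->q_index);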

Conclusions and Future Work

In addition to NAPI, Linux offers other alternatives to drivers for deferring work away from hard-interrupt context, such as bottom-half (BH) handlers and tasklets.

A BH Rx handler would eliminate the problems of hard-interrupt context, and packets could then be received in process context, which speeds things up somewhat. But it still cannot call netif_receive_skb(), since that can deadlock on socket locks with the softirq-based tasklets that do TCP timers, packet retransmission, etc. So the BH handler is constrained to use netif_rx_ni(), which is still less efficient than the straight-through call to pass up the packet via netif_receive_skb().

Both NAPI and tasklet based implementations offer softirq context, which allows the driver to safely invoke netif_receive_skb() to deliver the packet to the IP stack. NAPI, which seamlessly allows softirq context for both Tx and Rx processing, and already has the infrastructure to handle bursts of packets with fair-scheduling, proved to be the best option for sunvnet.

In the near future, we will be adding support for Jumbo Frames and TCP Segmentation Offload, to further leverage hardware support by offloading features where possible. Another feature that offers potential for improving performance is the "RxDring" model, where the Consumer owns the shared-memory buffer for receiving data and the Producer populates it. In the RxDring model, the buffer can then be part of the sk_buff itself, thereby saving one memcpy for the Consumer.

References:

[1] Mogul, "Eliminating Receive Livelock in an Interrupt-Driven Kernel".

-- Sowmini Varadhan

Monday Aug 11, 2014

Improving the Performance of Transparent Huge Pages in Linux, by Khalid Aziz

The following is a write-up by Oracle mainline Linux kernel engineer, Khalid Aziz, detailing his and others' work on improving the performance of Transparent Huge Pages in the Linux kernel.


Introduction

The Linux kernel uses a small page size (4K on x86) to allow for efficient sharing of physical memory among processes. Even though this can maximize utilization of physical memory, it results in a large number of pages associated with each process, and each page requires an entry in the Translation Look-aside Buffer (TLB) to associate a virtual address with the physical memory page it represents. The TLB is a finite resource, and the large number of entries required by each process forces the kernel to constantly evict TLB entries. There is a performance impact any time the TLB entry for a virtual address is missing. This impact is especially large for data-intensive applications like large databases.

To alleviate this, the Linux kernel added support for Huge Pages, which provide significantly larger page sizes for specific uses. The larger page size varies by architecture, from a few megabytes to gigabytes. Huge Pages can be used for shared memory or for memory mapping. On x86, for example, a single 2 MB huge page covers the same address range as 512 base pages of 4K, so one TLB entry does the work of hundreds. Huge Pages thus reduce the number of TLB entries required for a process's data by a factor of hundreds, and significantly reduce the number of TLB misses for the process.

Huge Pages are statically allocated and need to be used through the hugetlbfs API, which requires changing applications at the source level to take advantage of this feature. The Linux kernel added a Transparent Huge Pages (THP) feature that coalesces multiple contiguous pages in use by a process to create a Huge Page transparently, without the process needing to even know about it. This makes the benefits of Huge Pages available to every application without having to rewrite it.

Unfortunately, adding THP caused performance side-effects. We will explore these performance impacts in more detail in this article, along with how those issues have been addressed.

The Problem

When Huge Pages were introduced in the kernel, they were meant to be statically allocated in physical memory and never swapped out. This made for simple accounting through the use of refcounts for these hugepages. Transparent hugepages, on the other hand, need to be swappable so a process can take advantage of the performance improvements of hugepages without tying up physical memory. Since the swap subsystem only deals with the base page size, it cannot swap out larger hugepages. The kernel therefore breaks transparent hugepages up into base-size pages before swapping them out.

A page is identified as a hugepage via page flags, and each hugepage is composed of one head page and a number of tail pages. Each tail page has a pointer, first_page, that points back to the head page. The kernel can break a transparent hugepage up any time there is memory pressure and pages need to be swapped out. This creates a race between the code that breaks hugepages up and the code managing free and busy hugepages. When marking a hugepage busy or free, the code needs to ensure the hugepage is not broken up underneath it. This requires taking a reference to the page multiple times, locking the page to ensure it is not broken up, and executing memory barriers a few times to ensure any updates to the page flags get flushed out to memory so that consistency is retained.

Before THP was introduced into the kernel in 2.6.38, the code to release a page was fairly straightforward. A call to put_page() was made, and the first thing put_page() checked was whether it was dealing with a hugepage (also known as a compound page) or a base page:

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page))) 
		put_compound_page(page);
	/* ... */
}

If the page being released is a hugepage, put_compound_page() checks whether the reference count has dropped to zero and then calls the free routine for the compound page, which walks the head page and tail pages and frees them all up:

static void put_compound_page(struct page *page)
{
	page = compound_head(page); 
	if (put_page_testzero(page)) { 
		compound_page_dtor *dtor; 

		dtor = get_compound_page_dtor(page); 
		(*dtor)(page); 
	}
}

This is fairly straightforward code and has virtually no impact on the performance of the page-release path. After THP was introduced, additional checks, locks, page references and memory barriers were added to ensure correctness. The new put_compound_page() in 2.6.38 looks like:

static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* __split_huge_page_refcount can run under us */
		struct page *page_head = page->first_page;
		smp_rmb();
		/*
		 * If PageTail is still set after smp_rmb() we can be sure
		 * that the page->first_page we read wasn't a dangling pointer.
		 * See __split_huge_page_refcount() smp_wmb().
		 */
		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
			unsigned long flags;
			/*
			 * Verify that our page_head wasn't converted
			 * to a a regular page before we got a
			 * reference on it.
			 */
			if (unlikely(!PageHead(page_head))) {
				/* PageHead is cleared after PageTail */
				smp_rmb();
				VM_BUG_ON(PageTail(page));
				goto out_put_head;
			}
			/*
			 * Only run compound_lock on a valid PageHead,
			 * after having it pinned with
			 * get_page_unless_zero() above.
			 */
			smp_mb();
			/* page_head wasn't a dangling pointer */
			flags = compound_lock_irqsave(page_head);
			if (unlikely(!PageTail(page))) {
				/* __split_huge_page_refcount run before us */
				compound_unlock_irqrestore(page_head, flags);
				VM_BUG_ON(PageHead(page_head));
			out_put_head:
				if (put_page_testzero(page_head))
					__put_single_page(page_head);
			out_put_single:
				if (put_page_testzero(page))
					__put_single_page(page);
				return;
			}
			VM_BUG_ON(page_head != page->first_page);
			/*
			 * We can release the refcount taken by
			 * get_page_unless_zero now that
			 * split_huge_page_refcount is blocked on the
			 * compound_lock.
			 */
			if (put_page_testzero(page_head))
				VM_BUG_ON(1);
			/* __split_huge_page_refcount will wait now */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_dec(&page->_count);
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			compound_unlock_irqrestore(page_head, flags);
			if (put_page_testzero(page_head)) {
				if (PageHead(page_head))
					__put_compound_page(page_head);
				else
					__put_single_page(page_head);
			}
		} else {
			/* page_head is a dangling pointer */
			VM_BUG_ON(PageTail(page));
			goto out_put_single;
		}
	} else if (put_page_testzero(page)) {
		if (PageHead(page))
			__put_compound_page(page);
		else
			__put_single_page(page);
	}
}

The level of complexity of code went up significantly. This complexity guaranteed correctness but sacrificed performance.

Large database applications read large chunks of the database into memory using AIO. When databases started using hugepages for these reads into memory, performance went up significantly, due to the benefit of a much lower number of TLB misses and the significantly smaller amount of memory used by page tables, which results in less swapping activity. When a database application reads data from disk into memory using AIO, pages from the hugepages pool are allocated for the read, and the block I/O subsystem grabs a reference to these pages for the read and later releases that reference when the read is done. This traverses the code referenced above, starting with the call to put_page(). With the newly introduced THP code, the additional overhead added up to a significant performance penalty.

Over the next several kernel releases, the THP code was refined and optimized, which helped slightly in some cases while performance got worse in others. Subsequent refinements to the THP code to do accurate accounting of tail pages introduced the routine __get_page_tail(), which is called by get_page() to grab tail pages of a hugepage. This added further performance impact to AIO into hugetlbfs pages. All of this code stays in the code path as long as the kernel is compiled with CONFIG_TRANSPARENT_HUGEPAGE. Running echo never > /sys/kernel/mm/transparent_hugepage/enabled does not bypass this new additional code added to support THP. Results from a database performance benchmark run using two common read sizes used by databases show this performance degradation clearly:


               2.6.32 (pre-THP)    2.6.39 (with THP)    3.11-rc5 (with THP)
    1M read    8384 MB/s           5629 MB/s            6501 MB/s
    64K read   7867 MB/s           4576 MB/s            4251 MB/s


This amounts to a 22% degradation for 1M reads and a 45% degradation for 64K reads! perf top during benchmark runs showed the CPU spending more than 40% of its cycles in __get_page_tail() and put_compound_page().

The Solution

An immediate solution to the performance degradation comes from the fact that hugetlbfs pages can never be split, and hence all the overhead added for THP can be bypassed for them. I added code to __get_page_tail() and put_compound_page() to check for a hugetlbfs page up front and bypass all the additional checks for those pages:

static void put_compound_page(struct page *page) 
{ 
      if (PageHuge(page)) { 
              page = compound_head(page); 
              if (put_page_testzero(page)) 
                      __put_compound_page(page); 

              return; 
      } 
...


bool __get_page_tail(struct page *page)
{
...

      if (PageHuge(page)) { 
              page_head = compound_head(page); 
              atomic_inc(&page_head->_count); 
              got = true; 
      } else { 

...

This resulted in an immediate performance gain. Running the same benchmark as before with THP enabled, the new performance numbers for AIO reads are below:


               2.6.32        3.11-rc5      3.11-rc5 + patch
    1M read    8384 MB/s     6501 MB/s     8371 MB/s
    64K read   7867 MB/s     4251 MB/s     6510 MB/s


This patch was sent to the linux-mm and linux-kernel mailing lists in August 2013 [link] and was subsequently integrated into kernel version 3.12. This is a significant performance boost for database applications.

Further review of the original patch by Andrea Arcangeli during its integration into stable kernels exposed issues with the refcounting of pages and revealed that this patch had introduced a subtle bug where a page pointer could become a dangling pointer under certain circumstances. Andrea Arcangeli and the author worked to address these issues and revised the code in __get_page_tail() and put_compound_page() to eliminate extraneous locks and memory barriers, fix incorrect refcounting of tail pages, and eliminate some of the inefficiencies in the code.

Andrea sent out an initial series of patches to address all of these issues [link].

Further discussions and refinements led to the final version of these patches which were integrated into kernel version 3.13 [link],[link].

AIO performance has gone up significantly with these patches, but it is still not at the level it was at for smaller block sizes before THP was introduced to the kernel. The THP and hugetlbfs code in the kernel is better at guaranteeing correctness, but that still comes at the cost of performance, so there is room for improvement.

-- Khalid Aziz.


