An interesting exercise investigating copy performance

Recently I was involved in a customer escalation. Our partner was complaining that the copy performance seen by their application was poor compared to what they saw on a flavor of Linux. Since copying is key to the performance of many device drivers and subsystems, we felt it was important to investigate. This blog article discusses our approach and outlines an interesting example of how to measure and investigate performance. To set the tone, this discussion is mainly geared toward x86 systems.

Our first endeavor was to find a benchmark that measures copy performance. While there are many tools, such as libMicro, that could be pressed into service, they exercise much more than copying. As a result, their results exhibit significant variance and may not isolate the copy path.

We wrote a device driver, copydd, which simply emulates a user-to-kernel copy operation. Our benchmark, a simple user program, opens copydd and writes a 512 MB buffer to it in specified chunk sizes. Copydd copies the data from user memory into a preallocated kernel buffer (allocated and initialized via an ioctl() before the benchmark starts). We wrote this program for both Solaris and Linux, and the code is available for download here. Using this device driver, we could focus on the copy operation without getting sidetracked by anything else.

Our very first version of the benchmark measured the latency of writing into copydd, timed at the user level, since we believed that an improvement in latency would translate directly into improved bandwidth. We timed the benchmark for three different cases:

(i) Memory allocated but not touched before the benchmark,

(ii) Memory allocated and page-faulted in, and

(iii) Memory allocated and cached-in.

We discovered that using large pages (page size = 2 MB instead of 4 KB) helped case (i) by 20% but did not affect cases (ii) and (iii). This is indicative of TLB misses occurring while the memory is faulted in. Setting use_sse_pagezero to 0 (using mdb -kw) helped case (ii) by close to 50%. With this setting, the kernel zeroes a newly faulted-in page with ordinary cache-allocating stores rather than non-temporal SSE stores; in the process of zeroing, the page is loaded into the cache, and as a result the benchmark runs much faster.
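For reference, the tunable named above can be flipped on a live Solaris system roughly as follows. Treat this as a sketch (the exact mdb write syntax is an assumption on my part), and try it on a non-production machine first.

```shell
# Write 0 into the kernel variable use_sse_pagezero (as root)
echo "use_sse_pagezero/W 0" | mdb -kw

# To make the change persist across reboots, add to /etc/system:
#   set use_sse_pagezero = 0
```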

We suggested that our partner set use_sse_pagezero to 0 and re-evaluate their application. However, they reported little to no benefit.

We then decided to measure copy bandwidth in our benchmark rather than latency. This led us to investigate the different approaches Solaris uses for copying data. Since copying is such critical functionality, it is implemented in assembly for performance reasons. The code specific to Intel x86_64-based systems is available here. There are two variants of copy available: (i) cache-allocating copy, in which memory is first brought into the cache during the read and then copied, and (ii) non-temporal copy, in which data moves from one memory location to another without its contents being brought into the cache.

The choice between the two depends on whether the copied data will be used immediately by the same CPU. For example, let's take the case of network I/O. In the transmit path, a socket write() call copies the data from the user buffer into the kernel (using uiomove()). Thereafter the packet is usually driven in the receive context (by the processing of TCP ACKs). The CPU driving the data through the network stack is the one on which the receive interrupt lands, which may be very different from the one that did the user-to-kernel copy. Bringing the data into the cache during the uiomove() would therefore pollute the cache of the copying CPU without that CPU ever benefiting from having the data cached. In this case a non-temporal copy is better, since it avoids the unnecessary cache pollution.

On the other hand, in the receive code path, a driver (working in interrupt context) often copies the DMAed data into the kernel so that the receive DMA buffer can be freed as soon as possible. Thereafter the received packets are processed through the network protocol stack and delivered to the socket. In this case, since the data will immediately be accessed by the same thread that copied it, a cache-allocating copy is potentially beneficial. Therefore, copy operations performed by device drivers use bcopy() or ddi_copyin(), which do a cache-allocating copy.

We then analyzed copy bandwidth using non-temporal copy (uiomove()) and cache-allocating copy (ddi_copyin()). Using the copydd driver on a Sun X4270 Lynx server based on the Intel Nehalem architecture, we arrived at the set of curves shown below.

The above curves show that while ddi_copyin() is a small win over uiomove() for small data sizes (<1024 bytes), uiomove() wins by a factor of two when copying 128 KB chunks. The result is very interesting because it shows the tradeoff between bringing contents from memory into the cache versus copying from one memory segment to another.

Finally, to conclude this rather long blog, the tradeoff between non-temporal copy (uiomove()) and cache-allocating copy (ddi_copyin()) depends on:
(i) The likelihood that the copied data will be used by the same thread that did the copy.
(ii) The size of the data segment being copied.

In Solaris, we have this just right: socket write() and related APIs use uiomove(), while device drivers use ddi_copyin(). We went back to the partner who had complained about copy performance and asked whether they were calling the right API. They switched from ddi_copyin() to uiomove() and got nearly a 2x performance benefit!


What was the final performance difference after those changes?

Posted by Mikael Gueck on August 08, 2009 at 08:13 PM PDT #

Very nice study. Thanks for that.

Readers should note that uiomove() can do either type of copy (cached or non-temporal/NTA), depending on how uio_extflg is set up.

About DMA and the network stack: in my observation, fast network drivers pass bound DMA buffers to the network stack, meaning the data is not in any CPU's cache. The network stack has no need to inspect the bulk of the data until it is copied into a user buffer in a read system call. After the data is copied into the user buffer, the application then handles it. If the application is such that the thread doing the read immediately processes the bulk of the data _in the same thread_, then it might benefit from cached copies for small reads that do not overrun the caches. But for more modern, threaded applications that have one thread doing the socket reads and other worker threads processing the data, it becomes unlikely that those threads will benefit from even small cached copies.

So for modern applications and architectures, I think the bias should be toward delaying the cache install as much as possible, and this study helps a lot to promote that viewpoint.

Thanks again.

Posted by Roch Bourbonnais on August 11, 2009 at 07:57 PM PDT #

Hi Mikael,

The performance difference that our partner got after using uiomove() was roughly 2x. This benefit depends on the tradeoff between points (i) and (ii) as listed above.


Posted by Amitab on August 12, 2009 at 02:16 AM PDT #

Thanks to Roch for providing the clarification that uiomove() has a uio_extflg flag for deciding which type of copy. Here is the uiomove() syntax:

int uiomove(caddr_t address, size_t nbytes, enum uio_rw rwflag, uio_t *uio_p);

As an example, for the write() syscall in syscall/rw.c, it is set to the default (non-temporal copy), while for read() it depends on the data size.

in write():

auio.uio_extflg = UIO_COPY_DEFAULT;

in read():
#define COPYOUT_MAX_CACHE (1<<17)                /* 128K */
size_t copyout_max_cached = COPYOUT_MAX_CACHE;   /* global so it's patchable */

if (bcount <= copyout_max_cached)
        auio.uio_extflg = UIO_COPY_CACHED;
else
        auio.uio_extflg = UIO_COPY_DEFAULT;

Thanks again for all these inputs.

Posted by Amitab on August 12, 2009 at 02:23 AM PDT #


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

