Wednesday Mar 30, 2011

Solaris Express 11

Since I have a brief break between projects, I thought it was time to update this blog with what I have been doing in the year and a half since my last entry.

It is exciting that after many years of development, Solaris Express 11 was released in November 2010. In my next few blogs, I will talk about some exciting new features in Solaris Express 11. Stay tuned...

Saturday Aug 08, 2009

An interesting exercise investigating copy performance

Recently I was involved in a customer escalation. Our partner was complaining of poor copy performance seen by their application compared to what they saw on a flavor of Linux. Since copying is key to the performance of many device drivers and subsystems, we felt that it was important to investigate this. This blog article discusses our approach and outlines an interesting example of how to measure and investigate performance. Also, to set the tone, this discussion applies mainly to x86 systems.

Our first step was to build a benchmark that measures copy performance. While tools such as libMicro can do this, they exercise much more than copying. As a result, their results exhibit significant variance and may not isolate copy behavior.

We wrote a device driver, copydd, which simply emulates a user-to-kernel copy operation. Our benchmark, a simple user program, opens copydd and writes a 512 MB buffer to it in chunks of a specified size. Copydd copies from the user memory space to a preallocated memory buffer (which was allocated and initialized using an ioctl() before the benchmark started). We wrote this program for both Solaris and Linux, and the code is available for download here. Using this device driver, we could focus on the copy operation without getting sidetracked by anything else.
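The user-level side of such a benchmark can be sketched roughly as follows. This is only an illustration in Python, not the downloadable benchmark: the function name `write_in_chunks` is made up for this sketch, and `os.devnull` stands in for the copydd device node, whose path is not given in this post.

```python
import os
import time


def write_in_chunks(fd, buf, chunk_size):
    """Write buf to fd in chunk_size pieces; return (bytes_written, seconds)."""
    view = memoryview(buf)
    written = 0
    start = time.perf_counter()
    while written < len(view):
        # os.write may write fewer bytes than requested, so advance by its return value.
        written += os.write(fd, view[written:written + chunk_size])
    elapsed = time.perf_counter() - start
    return written, elapsed


if __name__ == "__main__":
    # Assumption: os.devnull is a placeholder for the copydd device.
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        buf = bytearray(8 * 1024 * 1024)  # 8 MB for a quick run; the post used 512 MB
        for chunk in (512, 4096, 131072):
            n, secs = write_in_chunks(fd, buf, chunk)
            print(f"chunk={chunk:>6} bytes  wrote={n}  time={secs:.4f}s")
    finally:
        os.close(fd)
```

Timing the whole loop rather than each write() keeps the timer overhead out of the measurement, which matters most for small chunk sizes.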

Our very first version of the benchmark measured the latency of writing into copydd by timing the routine at the user level since we believed that an improvement in latency would directly translate to improving bandwidth. We timed the benchmark for three different cases:

(i) Memory allocated but not touched before the benchmark,

(ii) Memory allocated and page-faulted in, and

(iii) Memory allocated and cached-in.
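To make the three cases concrete, here is a rough Python sketch of how a buffer could be put into each state before timing. It is an approximation only: the helper names are invented for this sketch, the 64-byte cache line is an assumption, and a user-level script cannot guarantee what actually stays in the cache (a 512 MB buffer will not fit in any case).

```python
import mmap

PAGE = mmap.PAGESIZE
CACHE_LINE = 64  # assumption: typical x86 cache line size


def alloc_untouched(size):
    """Case (i): anonymous mapping; pages are typically not backed until first access."""
    return mmap.mmap(-1, size)


def fault_in(buf):
    """Case (ii): write one byte per page so every page is faulted in.

    Returns the number of pages touched.
    """
    pages = 0
    for off in range(0, len(buf), PAGE):
        buf[off] = 0
        pages += 1
    return pages


def warm_cache(buf):
    """Case (iii): read one byte per cache line so the data is likely cached.

    Returns the sum of the bytes read, so the reads cannot be skipped.
    """
    total = 0
    for off in range(0, len(buf), CACHE_LINE):
        total += buf[off]
    return total


if __name__ == "__main__":
    buf = alloc_untouched(64 * PAGE)
    print("pages faulted in:", fault_in(buf))
    warm_cache(buf)
    buf.close()
```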

We discovered that using large pages (page size=2M instead of 4K) helped case (i) by 20% but did not affect cases (ii) and (iii). This suggests that TLB misses occur when the memory is faulted in. Setting use_sse_pagezero to 0 (using mdb -kw) helped case (ii) by close to 50%. This setting makes the kernel zero out a page right when it is faulted in. In the process of zeroing, the page gets loaded into the cache, as a result of which the benchmark runs much faster.

We suggested that our partner set use_sse_pagezero to 0 and re-evaluate their application. However, they reported little to no benefit.

We then decided to measure copy bandwidth in our benchmark rather than latency. This led us to investigate the different approaches Solaris uses for copying data. Since copying is such critical functionality, it is implemented in assembly code for performance reasons. The code specific to Intel X86_64 based systems is available here. There are two variants of copy available: (i) Cache-allocating copy, in which memory is first brought into the cache during the read and then copied, and (ii) Non-temporal copy, in which copying happens from one memory location to another without bringing the contents into cache.

The choice between the two depends on whether the data copied will be immediately used by the CPU. For example, let's take the case of network I/O. In the transmit path, a socket write() call copies the data from the user buffer to the kernel (using uiomove()). Thereafter the packet is usually sent in the receive context (driven by processing of TCP ACKs). The CPU driving the data through the network stack is the one on which the receive interrupt lands. This CPU may be very different from the one doing the user-to-kernel copy. Thus, bringing the data into cache (during the uiomove()) may cause unnecessary cache pollution on the CPU that did the copy, without any benefit from having the contents in the cache. In this case a non-temporal copy is better since it does not cause any unnecessary cache pollution.

On the other hand, in the receive codepath, a driver (working in interrupt context) often copies the DMAed data into the kernel so that the receive DMA buffer can be freed as soon as possible. Thereafter the packets received are processed through the network protocol stack and delivered to the socket. In this case, the data will immediately be accessed by the same thread that is copying it, so a cache-allocating copy is potentially beneficial. Therefore, copy operations performed by device drivers use bcopy() or ddi_copyin(), which do a cache-allocating copy.

We then did an analysis of the performance of copy bandwidth using non-temporal copy (uiomove()) and cache-allocating copy (ddi_copyin()). Using the copydd driver on a Sun X4270 Lynx server based on the Intel Nehalem architecture, we arrived at the set of curves shown below.

The above curve shows that while ddi_copyin() is a small win over uiomove() for small data sizes (<1024 bytes), uiomove() beats it by a factor of 2x when it comes to copying 128k chunks. The result is very interesting because it shows the tradeoff between bringing contents from memory to cache and copying from one memory segment to another.
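A bandwidth-versus-size curve of this kind comes from repeating the same total copy at several chunk sizes and dividing bytes moved by elapsed time. The Python sketch below shows only that bookkeeping; it copies between two user-space buffers rather than calling uiomove() or ddi_copyin(), so its numbers say nothing about the kernel routines, and the function name `copy_bandwidth_mb_s` is made up for this example.

```python
import time


def copy_bandwidth_mb_s(total_bytes, chunk_size):
    """Copy total_bytes between two preallocated buffers in chunk_size pieces.

    Returns the observed bandwidth in MBytes/s.
    """
    src = bytearray(total_bytes)
    dst = bytearray(total_bytes)  # preallocated destination, as in copydd
    src_view, dst_view = memoryview(src), memoryview(dst)
    start = time.perf_counter()
    for off in range(0, total_bytes, chunk_size):
        end = min(off + chunk_size, total_bytes)
        dst_view[off:end] = src_view[off:end]
    # Guard against a zero reading from a coarse timer.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (total_bytes / (1024 * 1024)) / elapsed


if __name__ == "__main__":
    total = 16 * 1024 * 1024  # 16 MB keeps the demo quick; the post used 512 MB
    for chunk in (256, 1024, 8192, 131072):
        print(f"{chunk:>7} bytes/chunk: {copy_bandwidth_mb_s(total, chunk):10.1f} MBytes/s")
```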

Finally, to conclude this rather long blog, the tradeoff between non-temporal copy (uiomove()) and cache-allocating copy (ddi_copyin()) depends on:
(i) The likelihood that the copied data will be used by the same thread that did the copy.
(ii) The size of the data segment being copied.

In Solaris, we have this just right: socket write() and related APIs use uiomove(), while device drivers use ddi_copyin(). We contacted the partner who had complained about copy performance and asked whether they were calling the right API. They switched from ddi_copyin() to uiomove() and got a nearly 2x performance benefit!

Wednesday May 27, 2009

Hello world with VirtualBox

Sun VirtualBox is one of the hottest virtualization products around. Although I work with a performance group at Sun Microsystems, I have not been involved with programming or performance aspects of VirtualBox, therefore all my comments here are as an outsider to VirtualBox.

My experience of using VirtualBox has been simply great. I did have some hiccups getting everything working together, so this blog post is intended to be a comprehensive summary of everything I had to scour the web for to get things working.

I required virtualized operating systems (OSs) on my home desktop machine for some key requirements. My host OS is the latest Fedora. I host two websites at home: a vertical search engine for travel, and a blog that my wife maintains. Hosting at home works well for websites which are young and have low volumes of traffic, since you save on hosting fees. If you have a DSL or cable connection which gives you a dynamic IP, you may use dynamic DNS from zoneedit. But hosting at home also means that you would want your machine running 24x7. So if you need any other OS for any other reason, you do need a virtualization tool. VirtualBox is free and open source, and this greatly helps. I needed the following guest OSs:

(i) Windows 7 Beta: I needed a Windows installation to try the latest OS from Microsoft and also for software such as Sopcast and iTunes.

(ii) Fedora 10: I needed another Fedora 10 so that I could run a VPN out of it without disrupting my webserver. Virtualization really helps with this particular requirement.

(iii) Solaris Nevada release: I keep trying the latest unsupported release of Solaris Nevada, available here, since I work with Solaris OS code in my daily job.

Here is what I had to do to get everything working together. The trickiest part is getting a good display resolution in the guest OS, so please pay attention to it.


1) Installing VirtualBox on Fedora 10: For some reason the latest VirtualBox release, 2.2.2, had problems starting up on my Fedora 10. So I went back to 2.2.0, and thereafter this was very smooth. Here are the steps, which are documented in greater detail over here.

Get the kernel development packages:
yum install kernel-devel make automake autoconf gcc

Get the latest VirtualBox rpm and install it. Here is an example:
rpm -ivh VirtualBox-2.2.0_45846_fedora9-1.x86_64.rpm

Setup virtualbox:
/etc/init.d/vboxdrv setup

Add users who can use virtualbox:
usermod -G vboxusers -a

Start virtualbox:
VirtualBox, or open from the GUI.

2) Installing Windows 7 Beta Guest OS: Installing a guest OS on VirtualBox is pretty simple. One of the things I did not realize at first is the utility of the VirtualBox Media Manager. There is no need to burn the iso image of the OS onto a DVD; you can use the Media Manager to upload the iso image. When you start a new instance, VirtualBox asks which OS is to be installed. The other information that VirtualBox needs is the amount of memory you would like to allocate for the guest OS, the amount of video memory, and a hard disk image where the guest OS will be installed. Most of these pop up with default values that VirtualBox recommends.

The installation went on very smoothly and once it completed, Windows came up just fine. The next step is to install the guest additions. This helps in a number of things - It helps in seamless integration of the mouse with the guest OS, and also in improving the display resolution. Once this was done, Windows booted up fine minus networking. I had to manually put in the DNS IP address as the IP address of my wireless router, and then the networking ran great.

I have observed that Windows 7 Beta requires at least 1 GB RAM to perform sanely; anything less makes it remarkably slow.

3) Installing Fedora 10 Guest OS: Installing guest OSs is much the same each time. With Fedora 10, I didn't have to do any hanky-panky for networking. The display came up with 800x600 resolution. Interestingly the xorg.conf file was missing. I generated an xorg.conf file, manually edited it to match the settings in my Fedora 10 host OS xorg.conf, and then the resolution was excellent. Installing the guest additions was trickier. When I ran the script, it complained about missing header files. The solution is to run "yum install kernel kernel-devel" so that all the kernel headers are installed. The script ran fine after this.

I installed vpnc and now I can connect to my work via VPN and also have the server running on the host OS at the same time. For some reason, my work DNS was not working on the guest OS although it worked easily on the host OS. I couldn't resolve this, but an easy solution was to manually add all the IP addresses I connect to, to the /etc/hosts file.

4) Installing Solaris Nevada: This was by far the easiest. The display came up in 1200*1080, with networking going, so I didn't have much reason to play around with either of them.

On the whole, I have found the performance of guest OSs on VirtualBox extremely satisfactory. I worked on Word and PowerPoint presentations on Windows and the difference is barely noticeable. There is a small difference in latency for Internet browsing, but nothing much to bother me.

I think what is important for VirtualBox though is having a lot of RAM on your machine. My machine has 2 GB RAM, and that was sufficient only for running the host OS with 1 GB RAM and one guest OS with 1 GB RAM. Running more than one guest OS with less memory each inevitably hurt performance. For me this was not a limitation since I am not planning to work on multiple OSs at the same time.

What I feel is missing
1) I would love to have a drag-and-drop copy feature so that I can just cut and paste files from one OS to the other. I don't think VirtualBox supports this feature yet, and I am told from Internet postings that VMware Fusion does, so this will definitely be a great feature to have. As of now I am moving files around using scp, which I find a pain given that everything is on the same machine. Similarly there is no way to copy a link from a browser in the host OS to a browser in the guest OS.

2) Sound support: I still couldn't get sound working in any of my guest OSs. I plan to post an update when I get it working.

On the whole I am a very satisfied customer of VirtualBox and I hope they keep the good work up.

Monday Apr 13, 2009

Examining Crypto Performance on Intel Nehalem based Sun Fire X4170

Today, Sun is releasing a vast array of servers and blades with Intel's new Xeon 5560 (Nehalem) processor. We have significantly improved the performance of crypto algorithms (as part of Solaris Cryptographic Framework (SCF)). While some of these changes have been covered in my previous blogs, I would like to summarize them here.

I must first commend Dan Anderson for doing an excellent job in incorporating a lot of hand-coded assembly into the SCF. These enhancements were available in OpenSolaris 2008.10. Since then we have made the following enhancements, which will be available in OpenSolaris 2009.06. You can also try the preview bits of 2009.06 at

1) CR 6799218: RSA using Solaris Kernel Crypto framework lagging behind OpenSSL. We made changes that made RSA decrypt operations 1.8 times faster. The details are documented here.

2) CR 6811474 and CR 6823192 make a number of changes to the big_mont_mul() and big_mul() routines, which form the essence of Montgomery multiplication. These changes improve RSA decrypt operations by 10%.

3) CR 6812615: 64-bit RC4 has poor performance on Intel Nehalem. We made changes to the RC4 encrypt routine which delivered an improvement of 25% on Intel Nehalem. These changes are documented here.

The performance of these and other crypto algorithms may be examined using the PKCS#11 compliant Sun Software Crypto plugin. Applications can be linked to the library, /usr/lib/amd64/ For benchmarking the performance of SCF, we patched OpenSSL 0.9.8j (patch available at Jan Pachanec's blog) to use PKCS#11. The OpenSSL speed benchmark gives us the following numbers on a pre-release Sun Fire X4270 system with two sockets of Intel(r) Xeon(r) CPU X5560 @ 2.8 GHz, HT enabled:

Benchmark                          1-thread          16-threads
RSA-1024 encrypt                   24.9 K ops/s      199.2 K ops/s
RSA-1024 decrypt                   1760 ops/s        14048 ops/s
RC4 encrypt (8k message)           317 MBytes/s      2265 MBytes/s
MD5 Hash (8k message)              531 MBytes/s      6085 MBytes/s
SHA-1 Hash (8k message)            356 MBytes/s      2545 MBytes/s
AES-256 encryption (8k message)    136.9 MBytes/s    1212.6 MBytes/s

Please note that these numbers are with Hyper-Threading (HT) enabled on the Nehalem processor, in which two virtual processors share the same execution pipeline. The performance of all algorithms is seen to scale pretty linearly from one core to eight cores. Disabling HT did not make much of a difference to the benchmarks. This could be because crypto operations do not have many stalls in the execution pipeline, so the benefit from having virtual processors is smaller.

For further notes on Sun's Intel Nehalem based servers and blades, I recommend reading Heather's blog, which cross-links all Nehalem-based blog entries. And please do leave your comments and feedback.
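The scaling can be checked directly from the table by dividing each 16-thread figure by its 1-thread figure. Here is a small Python sketch that does this arithmetic; the numbers are copied from the table above, while the dictionary layout and the function name `speedups` are just choices made for this example.

```python
# Throughput figures copied from the table above: (1-thread, 16-threads).
RESULTS = {
    "RSA-1024 encrypt (K ops/s)": (24.9, 199.2),
    "RSA-1024 decrypt (ops/s)": (1760, 14048),
    "RC4 encrypt (MBytes/s)": (317, 2265),
    "MD5 hash (MBytes/s)": (531, 6085),
    "SHA-1 hash (MBytes/s)": (356, 2545),
    "AES-256 encrypt (MBytes/s)": (136.9, 1212.6),
}


def speedups(results):
    """Return the 16-thread / 1-thread throughput ratio for each benchmark."""
    return {name: multi / single for name, (single, multi) in results.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RESULTS).items():
        print(f"{name:<28} {ratio:5.2f}x")
```

The ratios fall between roughly 7x and 11.5x. RSA comes in at about 8x, which matches the eight physical cores, and MD5 is the one benchmark that lands well above 8x.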

This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

