Wednesday May 27, 2009

Hello world with VirtualBox

Sun VirtualBox is one of the hottest virtualization products around. Although I work with a performance group at Sun Microsystems, I have not been involved with the programming or performance aspects of VirtualBox, so all my comments here are those of an outsider to VirtualBox.

My experience of using VirtualBox has been simply great. I did have some hiccups getting everything working together, so this blog post is intended to be a comprehensive summary of everything I had to scour the web for to get things working.

I needed virtualized operating systems (OSs) on my home desktop machine for a few reasons. My host OS is the latest Fedora. I host two websites at home: a vertical search engine for travel, and a blog that my wife maintains. Hosting at home works well for websites which are young and have low volumes of traffic, since you save on hosting fees. If you have a DSL or cable connection which gives you a dynamic IP, you may use dynamic DNS from zoneedit. But hosting at home also means that you want your machine running 24x7. So if you need any other OS for any other reason, you need a virtualization tool. VirtualBox is free and open source, and this greatly helps. I needed the following guest OSs:

(i) Windows 7 Beta: I needed a Windows installation to try the latest OS from Microsoft and also for software such as Sopcast and iTunes.

(ii) Fedora 10: I needed another Fedora 10 so that I could run a VPN out of it without disrupting my web server. Virtualization really helps with this particular requirement.

(iii) Solaris Nevada release: I keep trying the latest unsupported release of Solaris Nevada available here, since I work with Solaris OS code in my daily job.

Here is what I had to do to get everything working together. The trickiest part is getting a good display resolution in the guest OS, so please pay attention to the display-related steps.


1) Installing VirtualBox on Fedora 10: For some reason the latest VirtualBox release, 2.2.2, had problems starting up on my Fedora 10. So I went back to 2.2.0, and thereafter this was very smooth. Here are the steps, which are documented in greater detail over here.

Get the kernel development packages:
yum install kernel-devel make automake autoconf gcc

Get the latest VirtualBox rpm and install it. Here is an example:
rpm -ivh VirtualBox-2.2.0_45846_fedora9-1.x86_64.rpm

Setup virtualbox:
/etc/init.d/vboxdrv setup

Add users who can use virtualbox:
usermod -a -G vboxusers <username>

Start virtualbox:
VirtualBox, or open from the GUI.

2) Installing Windows 7 Beta Guest OS: Installing a guest OS on VirtualBox is pretty simple. One of the things I did not realize at first is the utility of the VirtualBox Media Manager. There is no need to burn the ISO image of the OS to a DVD; you can use the Media Manager to load the ISO image. When you create a new instance, VirtualBox asks which OS is to be installed. The other information that VirtualBox needs is the amount of memory you would like to allocate for the guest OS, the amount of video memory, and a hard disk image where the guest OS will be installed. Most of these pop up with default values that VirtualBox recommends.

The installation went very smoothly and once it completed, Windows came up just fine. The next step is to install the guest additions. These help with a number of things: seamless integration of the mouse with the guest OS, and improved display resolution. Once this was done, Windows booted up fine minus networking. I had to manually enter the IP address of my wireless router as the DNS server, and then the networking ran great.

I have observed that Windows 7 Beta requires at least 1 GB RAM to perform sanely; anything less makes it remarkably slow.

3) Installing Fedora 10 Guest OS: Installing one guest OS is much like installing another. With Fedora 10, I didn't have to do any hanky-panky for networking. The display came up with 800x600 resolution. Interestingly, the xorg.conf file was missing. I generated an xorg.conf file, manually edited it to match the settings in my Fedora 10 host OS xorg.conf, and then the resolution was excellent. Installing the guest additions was trickier. When I ran the script, it complained about missing header files. The solution is to run "yum install kernel kernel-devel" so that all the kernel headers are installed. The script ran fine after this.

I installed vpnc, and now I can connect to my work via VPN and also have the server running on the host OS at the same time. For some reason, my work DNS was not working on the guest OS although it worked easily on the host OS. I couldn't resolve this, but an easy workaround was to manually add all the IP addresses I connect to to the /etc/hosts file.

4) Installing Solaris Nevada: This was by far the easiest. The display came up in 1200x1080, with networking going, so I didn't have much reason to play around with either of them.

On the whole, I have found the performance of guest OSs on VirtualBox extremely satisfactory. I worked on Word and PowerPoint presentations on Windows and the difference is barely noticeable. There is a small difference in latency for Internet browsing, but not enough to bother me.

I think what is important for VirtualBox, though, is having a lot of RAM on your machine. My machine has 2 GB RAM, and that was sufficient only for running the host OS with 1 GB RAM and one of the guest OSs with 1 GB RAM. Running more than one guest OS, each with less memory, inevitably hurt performance. For me this was not a limitation since I am not planning to work on multiple OSs at the same time.

What I feel is missing
1) I would love to have a drag-and-drop copy feature so that I can just cut and paste files from one OS to the other. I don't think VirtualBox supports this feature yet, and I am told from Internet postings that VMware Fusion does, so this would definitely be a great feature to have. As of now I am moving files around using scp, which I find a pain given that everything is on the same machine. Similarly, there is no way I can copy a link from a browser in the host OS to a browser in the guest OS if I need to.

2) Sound support: I still couldn't get sound working in any of my guest OSs. I plan to post an update when I get it working.

On the whole I am a very satisfied customer of VirtualBox and I hope they keep the good work up.

Thursday Feb 19, 2009

Trip Report from IOM-2009

Last weekend I participated in IOM-2009, a workshop on The Influence of I/O on Microprocessor Architecture, co-located with the High Performance Computer Architectures (HPCA) 2009 conference in Raleigh, NC. Ram Huggahalli, Principal Engineer and Platform Architect in the Communication Technology Lab at Intel, served as the workshop chair. I must say that this workshop was very well organized. There were 8 presentations, 45 minutes each, leaving ample time for valuable discussions and feedback.

The workshop addressed the challenge of how to provide more I/O to systems, particularly with 40 GigE and 100 GigE getting standardized very soon. Here are the challenges as listed by the workshop chair.

1) Making I/O an integral part of chip/system design: networking and other I/O need to be integrated with microprocessor design instead of being thought of as a peripheral device.
2) Making some kind of revolutionary change in the way network I/O is performed, because (i) memory access won't become faster, and (ii) demand for I/O (networking) will only increase, and the current approach will probably not scale.

Main viewpoints expressed in the workshop:
1) A presentation from Sandia National Labs demonstrated experimental data on pre-release Nehalem systems. Main points: (i) Nehalem shows much improved network I/O because of reduced local memory latency,
(ii) NUMA effects play a major role in Nehalem: many benchmarks show a great difference between using local memory and using remote memory. The presentation included a list of commercial benchmarks for which performance varies a lot depending on whether memory is located locally or remotely. Memory placement is controlled in Linux with numactl.

2) Main bottleneck for network I/O: the copy required from user space to kernel space for transmit, and vice versa for receive. Strategies presented for avoiding it:
(i) Cache injection (2 papers: 1 from IBM Labs, Zurich, and 1 from Univ. of Victoria): The strategy is to inject received data directly into the cache so that when receive() is issued, there is no cache miss and the data is readily available. The challenge is that the data displaced from the cache will itself cause cache misses, and therefore it is hard to come up with an algorithm which suits all workloads. Presenters from Univ. of Victoria presented strategies in the context of MPI running on an IBM Cell processor.

(ii) IOMMU to drive hardware accelerators (1 paper from Univ. Pittsburgh, Intel): Using IOMMU hardware to translate virtual addresses to physical memory so that a hardware accelerator device can directly access memory when given a virtual address. The presenter demonstrated this approach in the context of a USB drive.

(iii) Creating a DMA cache (Chinese Academy of Sciences): Having a separate cache to hold I/O data before it is read by the application. As a result, the primary cache is not affected.

(iv) Intelligent NICs (Virginia Tech): NICs which can interact with the CPU and transfer data when required.

3) Other interesting papers:
(i) Using network processors to create virtual NICs (Univ. Massachusetts, Lowell).
(ii) Active end-system analysis for finding the bottleneck rate for receive network I/O: This work is mainly from UC Davis, while I have contributed to the theoretical part. This work demonstrates the importance of pacing on the transmit side, and illustrates how to compute the bottleneck at the receiver using a stochastic model. Slides are available here.

Thursday Aug 28, 2008

Notes from Intel Developer Forum

Intel Developer Forum (IDF) was held last week from 19th to 21st August; I attended Day 2, 20th August. Intel seems to be focussed on three key markets: (i) Mobile Internet Devices (MIDs), (ii) converged internet and multimedia, and (iii) high-end enterprise solutions. Intel is targeting these markets with the following processors respectively: (i) Intel Atom, which is smaller than a quarter coin but has as many transistors as the Pentium IV, (ii) Intel Media Processor CE 3100, and (iii) Intel Nehalem, which is Intel's first foray into a NUMA architecture with high memory bandwidth using the Intel QuickPath Interconnect (QPI) technology.

The plenary talks revolved around the above. Out of a gamut of applications and gadgets talked about, I found two interesting. The first is Gypsii, a unique application combining mobile computing and social networking. On a mobile device, such as your iPhone, you can locate all your friends on a map and instantly communicate with them, for example to meet up for lunch. The other is a TV internet widget jointly developed by Intel and Yahoo (see press release here). With this widget, you will have a toolbar at the bottom of your television screen with which you can check your mail, stocks, weather, news, and what not.

Many of the technical sessions were based on parallel computing and the Nehalem architecture. Nehalem seems to have improved branch prediction, better handling of unaligned cache accesses, and improved store and load performance, besides significantly higher interconnect bandwidth, which should help in faster memory access and better I/O. That seems to indicate that Nehalem will have great 10 GigE network I/O, so it will be fun to do some performance characterization and analysis on a Nehalem box. In addition, there was a 2 hour session on the Intel Advanced Vector Extensions (AVX) instructions, which are expected in 2010. AVX will operate on 256-bit registers, allowing for increased vectorization and 256-bit add, multiply and shuffle operations.

All presentation materials from IDF are now publicly available here.

Thursday Feb 28, 2008

Getting more beef from your network

Let us consider the process by which the operating system (OS) handles network I/O. For simplicity we consider the receive path. While there are some differences between Solaris, Linux, and other flavors of Unix, I will try to generalize the steps to construct a high-level representative picture. Here is an outline of the steps:

1. When packets are received, the Network Interface Card (NIC) performs a Direct Memory Access (DMA) to transfer the data to the main memory. Once a sufficient size of data prescribed by the interrupt coalescing parameter is received, an interrupt is raised to inform the device driver of this event. The device driver assigns a data structure called the receive descriptor to handle the memory location identified by the DMA.

2. In the interrupt handling context, the device driver handles the packet in the DMA memory. The packet is processed through the network protocol stack (MAC, IP, TCP layers) in the interrupt context and is ultimately copied to the TCP socket buffer. The work of the interrupt handler ends at this stage. Solaris GLDv3 based drivers have a tunable to employ independent kernel threads (also known as soft rings) to handle the network protocol stack so that the interrupt CPU does not become the bottleneck. This is sometimes required on UltraSPARC based systems because of the large number of cores that they support.

3. The application thread, usually executing as a user-level process, then reads the packet from the socket buffer and processes the data appropriately.

Thus, data transfer between the NIC and the application may involve at least two copies of data: one from the DMA memory to kernel space, and the other from kernel space to user space. In addition, if the application is writing data to the disk, there may be an additional copy of data from memory to the disk. Such a large number of copies has high overhead, particularly when the network transmission line rate is high. Moreover, the CPU becomes increasingly burdened with the large amount of packet processing and copying.
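To get a rough feel for why these copies matter at high line rates, here is a back-of-the-envelope sketch in Python. The 10 Gb/s line rate, the 1500-byte packet size and the function name are my own illustrative assumptions, not measurements from any system.

```python
def copy_overhead(line_rate_bps, packet_bytes, copies):
    """Return (packets per second, bytes copied per second) at full line rate."""
    bytes_per_second = line_rate_bps / 8
    packets_per_second = bytes_per_second / packet_bytes
    copied_bytes_per_second = bytes_per_second * copies
    return packets_per_second, copied_bytes_per_second


# Assumed figures: 10 Gb/s link, 1500-byte packets, two copies per packet.
pps, copied = copy_overhead(10e9, 1500, 2)
print(f"{pps:,.0f} packets/s, {copied / 1e9:.2f} GB/s of copy traffic")
```

Under these assumptions the receiver sees over 800,000 packets per second and has to move 2.5 GB of payload per second just for the two copies. Each copy is both a read and a write, so the real load on the memory system is higher still.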

The following techniques have been studied to improve the end-system performance.

Protocol Offload Engines (POE)
Offload engines implement the functionality of the network protocol in on-chip hardware (usually in the NIC), which reduces the CPU overhead for network I/O. The most common offload engines are TCP Offload Engines (TOEs). TOEs have been demonstrated to deliver higher throughput as well as reduce the CPU utilization for data transfers. Although POEs improve network I/O performance, they do not completely eliminate the I/O bottleneck, as the data still must be copied to the application buffer space.

Moreover, TOE has numerous vulnerabilities, because of which operating system support for it is limited. Patches to provide TOE support to Linux were rejected for many reasons, which are documented here. The main reasons are: (i) difficulty of applying security updates since TOE resides firmly in hardware, (ii) inability of TOE to perform well under stress, (iii) vulnerability to SYN flooding attacks, and (iv) difficulties in long-term kernel maintenance as TOE hardware evolves.

Zero-Copy Optimizations
Zero-copy optimizations, such as the sendfile() implementation in Linux 2.4, aim to reduce the number of copy operations between kernel and user space. As an example, in sendfile(), only one copy of data occurs when data is transferred from the file to the NIC. Numerous zero-copy enabled versions of TCP/IP have been developed, and implementations are available for Linux, Solaris, and FreeBSD. A limitation of most zero-copy implementations is the amount of data that may be transferred. As an example, sendfile() has the limitation of a maximum file size of 2 GB. Although zero-copy optimizations improve performance, they do not eliminate the contention for end-system resources.
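As a minimal sketch of what using sendfile() looks like from an application, here is a Python example. It relies on the standard library's socket.sendfile(), which uses os.sendfile() where the platform provides it and falls back to ordinary send() calls otherwise. The helper names send_file and demo, the payload, and the local socket pair standing in for a real network connection are all my own assumptions for illustration.

```python
import os
import socket
import tempfile


def send_file(sock, path):
    """Send the file at `path` over `sock`; return the number of bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile() where available, so the data
        # does not pass through a user-space buffer; otherwise it falls back
        # to plain send() calls.
        return sock.sendfile(f)


def demo():
    """Send a small temporary file over a local socket pair and read it back."""
    payload = b"hello, zero-copy\n" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    try:
        sent = send_file(sender, tmp.name)
        sender.close()  # signal end-of-stream to the receiver
        received = b""
        while chunk := receiver.recv(4096):
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(tmp.name)
    return sent, received == payload
```

Calling demo() returns the number of bytes sent and whether the receiver got back exactly what was in the file. The application never reads the file contents into its own buffer on the send side, which is the point of the technique.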

Remote DMA (RDMA)
The RDMA protocol implements both POEs and zero-copy optimizations. RDMA allows data to be directly written to/read from the application buffer without the involvement of the CPU or OS. It thus avoids the overhead of the network protocol stack and context switches, and allows transfers to continue in parallel with other executing tasks. However, apart from cluster computing environments, the acceptance of RDMA has been rather limited because of the need for a separate networking infrastructure. Moreover, RDMA has security concerns, particularly in the setting of remote end-to-end data transfers.

Large Send Offload (LSO)/ Large Receive Offload (LRO)
LSO and LRO are NIC features to allow the network protocol stack to process large (up to 64 KB) segments. The NIC has hardware features to split the segments into 1500 byte MTU packets for send (LSO) and combine incoming MTU sized packets into a large segment for receive (LRO). LSO and LRO help save CPU cycles consumed in the network protocol stack because a single call can handle a 64 KB segment. LSO/LRO are supported in most NICs and are known to improve the CPU efficiency of networking considerably.
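The splitting and combining described above can be sketched in a few lines of Python. This is only a software model of what the NIC does in hardware. The function names are hypothetical, and the 1460-byte payload size assumes a 1500 byte MTU minus 20-byte IPv4 and 20-byte TCP headers without options.

```python
def segment(data, payload_size=1460):
    """Split one large buffer into payload-sized chunks, as LSO does on send."""
    return [data[i:i + payload_size] for i in range(0, len(data), payload_size)]


def coalesce(packets):
    """Merge consecutive payloads into one large buffer, as LRO does on receive."""
    return b"".join(packets)


big_segment = bytes(64 * 1024)    # one 64 KB segment handed down by the stack
packets = segment(big_segment)    # what the NIC would put on the wire
print(len(packets))               # 45
assert coalesce(packets) == big_segment
```

With these assumed sizes, the protocol stack makes one pass for the 64 KB segment instead of 45 passes for individual packets, which is where the CPU savings come from.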

Transport Protocol Mechanisms
There are several approaches to optimizing TCP performance. Most focus on improving the Additive Increase Multiplicative Decrease (AIMD) congestion control algorithm of TCP, which can be inefficient at very high bit-rates because a single packet loss may sharply reduce the transfer rate. Also, the congestion control algorithm in TCP has been demonstrated not to scale in high Bandwidth Delay Product (BDP) settings (connections with high bandwidth and Round Trip Time (RTT)). To remedy these problems, a large variety of TCP variants which improve on the congestion control algorithm have been proposed, such as FAST, High-Speed TCP (HS-TCP), Scalable TCP, BIC-TCP, and many others.
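To illustrate why AIMD struggles on high-BDP paths, here is a simplified Python sketch that counts how many round trips it takes to get back to a full window after a single loss. It assumes the textbook AIMD parameters (add one packet per RTT, halve on loss) and an example path of 10 Gb/s with a 100 ms RTT and 1500-byte packets. These numbers and the function names are my assumptions, and the model ignores slow start, timeouts and everything else a real TCP stack does.

```python
def aimd(cwnd, loss, increase=1.0, decrease=0.5):
    """One AIMD step per RTT: add `increase` packets, or scale by `decrease` on loss."""
    return cwnd * decrease if loss else cwnd + increase


def rtts_to_recover(full_window, increase=1.0, decrease=0.5):
    """Count RTTs needed to grow back to `full_window` after a single loss."""
    cwnd = aimd(full_window, loss=True, increase=increase, decrease=decrease)
    rtts = 0
    while cwnd < full_window:
        cwnd = aimd(cwnd, loss=False, increase=increase, decrease=decrease)
        rtts += 1
    return rtts


# Assumed path: 10 Gb/s, 100 ms RTT, 1500-byte packets.
bdp_packets = int(10e9 * 0.1 / (1500 * 8))    # window needed to fill the pipe
rtts = rtts_to_recover(bdp_packets)
print(bdp_packets, rtts, rtts * 0.1)          # packets, RTTs, seconds
```

Under these assumptions the window needed to fill the pipe is about 83,000 packets, and one loss costs over 41,000 round trips, or more than an hour, before the connection is back at full rate. This is the kind of behaviour the TCP variants listed above try to improve on.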


This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

