This article focuses on what is often the most important performance topic: I/O.
Most commercial applications (that is, everything except technical computing)
depend more on I/O than on anything else - but I/O doesn't always get the coverage it deserves.
Seriously: people obsess about CPU scheduling and overcommitment, but most performance problems (IMHO) are I/O.
We'll discuss network I/O (briefly, for a reason you'll see in a moment), and disk I/O.
Network performance is an essential component of overall system performance, especially in a virtual
machine environment. Networks are used in several roles: for live migration, cluster heartbeats,
access to virtual disks on NFS or iSCSI, system management, and for the guest virtual machines' own network
traffic. These roles, and other aspects of Oracle VM networking, are explained in the
outstanding article Looking "Under the Hood" at Networking in Oracle VM Server for x86 by Greg King and Suzanne Zorn.
There are other excellent Oracle VM white papers here.
dom0 has physical access to the network adapter cards, so it should see native I/O performance.
The tips at the hypervisor level are the same as in a physical network environment:
use bonding for resiliency and isolation; consider jumbo frames
(obligatory warning: if you use large MTU sizes, make sure that every network device in the transmission path is enabled for jumbo frames,
or Bad Things will happen);
and separate different networks based on
their requirements for latency and isolation. In particular, you want to protect the Oracle VM cluster heartbeat from
being delayed by heavy network I/O for storage or VM access.
The same applies to heartbeats for Oracle RAC and other cluster technologies that are sensitive to packet delay.
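The jumbo-frame warning above is easy to check mechanically. Below is a minimal sketch of that sanity check; the sysfs path is the standard Linux location for interface MTUs, but the interface names and the 9000-byte MTU value are assumptions you would adapt to your environment.

```python
from pathlib import Path

JUMBO_MTU = 9000  # common jumbo-frame MTU; your network may use a different value

def read_mtu(ifname: str) -> int:
    """Read the configured MTU for a Linux network interface from sysfs."""
    return int(Path(f"/sys/class/net/{ifname}/mtu").read_text())

def jumbo_mismatches(mtus: dict, required: int = JUMBO_MTU) -> list:
    """Return interfaces whose MTU is below the required jumbo-frame size.

    `mtus` maps interface name -> MTU, e.g.
    {name: read_mtu(name) for name in ("bond0", "eth2")} on a live host.
    """
    return sorted(name for name, mtu in mtus.items() if mtu < required)

# Hypothetical example: bond0 carries storage traffic and is jumbo-enabled,
# but eth2 on the same path was missed.
mtus = {"bond0": 9000, "eth2": 1500}
print(jumbo_mismatches(mtus))  # -> ['eth2']
```

Of course this only checks the hosts you can log in to; switches and other devices in the path need the same verification through their own management interfaces.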
Guest VM network traffic rides on virtual networks with the "VM" channel defined, and many of the same rules apply
as with physical networks. Oracle VM does not currently provide direct control over QoS, so use separate networks for different
types of traffic.
That's all by way of introduction. Rather than rehash network tuning advice, please read the white paper
Oracle VM 3: 10GbE Network Performance Tuning.
Driving a 1Gb network at line speed isn't a challenge for today's servers, but it gets a bit more difficult at 10GbE and higher network speeds.
This paper is kept up to date with tuning advice, and is linked from the MOS note
"Oracle VM 3: 10GbE Network Performance Tuning (Doc ID 1519875.1)".
Surprisingly, the most important tuning actions relate to controlling NUMA latency, as described in the previous articles in this series.
Disk I/O performance is critical for most applications, especially commercial workloads, which are typically database or file intensive.
For many such workloads, providing good disk I/O performance is the single most important performance task.
Rather than recapitulate everything about disk I/O performance, I'll plug Brendan Gregg's book again:
Systems Performance: Enterprise and the Cloud.
There are other good performance references, but this is very good and has a wealth of information.
Oracle VM's use of storage is described in
Chapter 3, Understanding Storage, of the Oracle VM 3 Concepts Guide.
There are disk I/O requirements for Oracle VM Manager and for the Oracle VM Server on each host, but
those are lightweight and not in the direct I/O path for workloads.
Most attention is on virtual machine disk I/O performance.
Virtual machine disk images typically reside on shared storage, as described in
Oracle VM 3: Architecture and Technical Overview.
Oracle VM Servers are grouped into server pools, and every server in a pool has access to shared storage, which can be NFS, Fibre Channel or iSCSI.
This lets VMs in the pool run on any physical server in the pool.
Local storage (that is, a server's internal hard disks) can also be configured but is often not appropriate for production,
as it can be a single point of failure, prevents migrating VMs, and is limited to per-server disk capacity and I/O performance.
The first deployment choice for a virtual machine is whether to use
virtual disks in a repository or physical disks, as described in
Chapter 4 Understanding Repositories and
4.9 How are Virtual Disks Managed?
in the Concepts Guide. The main considerations are:
Virtual disks provide good performance, permit scalability, and help resource sharing:
a single LUN or NFS share can host hundreds of virtual disks.
They are operationally convenient, and permit operations like cloning VMs and templates.
Repositories are always used for metadata about each VM, and optionally may contain virtual disk images, ISO images, assemblies and templates.
Physical disks generally have higher performance than virtual disks but require additional administrative
effort to create LUNs and present them to each physical server in the pool.
For what it's worth, there's nothing particularly "physical" about a physical disk: it needn't be
a rotating device, it might just be an iSCSI or FC LUN conjured by magic from a storage array, with its
supposedly physical disk blocks scattered all over actual media.
In practice, most VM disk images reside in repositories, with physical disks used when
an application requires them for functional reasons such as SCSI reservations (Oracle RAC requires physical disks), or for higher performance needs.
Virtual disks in a repository provide good performance, work with templates and clones, and are much easier to administer.
Repositories are always needed, whether used for virtual disks or not.
Repositories should be hosted on a storage backend that provides sufficient performance and capacity, or it will be a bottleneck
for the entire environment.
The primary decision is whether to base repositories on a LUN block device or on an NFS share:
There is no clear-cut advantage of one over the other (IMHO). Performance seems to be a wash except for cloning (see below).
I have seen performance reports showing a few percentage points difference of one over the other.
There are high speed transports (Fibre Channel LUNs, 10GbE for iSCSI LUNs or NFS) in all cases,
and the limiting constraint may well be the robustness of the storage underlying any of them rather than the protocol.
The main advantage of LUN block device repositories is that they permit thin cloning, where clones are
created using reference links to common blocks. Cloning a VM or template is almost instantaneous and consumes only
as much disk space as the differences between the cloned images and their source.
The disadvantage is administration: it's harder to grow a LUN and its OCFS2 file system, for example.
An OCFS2 file system on a single LUN also represents a potential single point of failure if the file system is corrupted.
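To make the thin-cloning idea concrete, here is a toy model of reference-linked clones. This is an illustration of the concept only, not the actual OCFS2 reflink implementation: a clone starts by sharing every block with its source, and only blocks that are subsequently written get private copies.

```python
class ThinImage:
    """Toy model of a thin clone: blocks are shared with the source until written."""

    def __init__(self, blocks=None):
        # Maps block number -> block data. Real implementations track this
        # in file system metadata; a dict stands in for that here.
        self.blocks = blocks if blocks is not None else {}

    def clone(self):
        # A clone is a shallow copy of the block map: entries reference the
        # same underlying data, so the clone consumes almost no extra space.
        return ThinImage(dict(self.blocks))

    def write(self, block_no, data):
        # Copy-on-write: only modified blocks get private storage,
        # which is why clones grow only by their differences from the source.
        self.blocks[block_no] = data

    def read(self, block_no):
        return self.blocks.get(block_no)

base = ThinImage({0: b"bootloader", 1: b"kernel"})
clone = base.clone()
clone.write(1, b"patched kernel")
print(base.read(1))                   # b'kernel' -- the source is unchanged
print(clone.read(1))                  # b'patched kernel'
print(clone.read(0) is base.read(0))  # True -- unmodified block still shared
```

The space accounting follows directly: cloning is one metadata copy, and storage is consumed only as the clone diverges.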
The main advantage of an NFS-based repository, again - my opinion, is that it's easier to set up and administer.
Backing it up is easy, increasing its size is easy too (just increase the quota on the file system it resides on).
Performance over 10GbE networks with a large MTU is very good.
However, cloning a template or VM requires copying the complete disk images rather than aliasing them, which consumes both space and time.
Also, NFS permits per-virtual disk analytics on the storage subsystem: with block storage repositories, all of the virtual
disks are on a single LUN and the storage server can't distinguish one from the other.
With NFS-based storage, each virtual disk is a different file, and good file servers like the ZFS Storage Appliance
can show individual per-vdisk I/O as shown here. That can be incredibly valuable for answering questions like "which VM is hammering storage?".
The same analysis can and should be done for the individual physical disks described previously.
In the accompanying graphic I can see the top two virtual disks, and by scrolling to the file name I can see which vdisk and then
work back to the virtual machine that owns it.
Regardless of protocol or interconnect, disk performance is going to be governed by the performance of the underlying storage system.
Provide a wimpy storage backend or a low speed storage network and there will be bad performance regardless of other decisions.
Most storage arrays provide cache, and increasingly solid state disk (SSD), which reduces
latency since there's no rotational or seek delay (until you fill the cache or SSD, of course).
Otherwise you are limited to the roughly 150 to 300 IOPS rotating media can deliver, and have to spread load over multiple spindles.
This is very important for virtual machine workloads, as all writes are synchronous (they must be committed to stable media
before being marked as "done") because hypervisors don't know which guest VM writes can be buffered (as in POSIX file semantics)
and which must be committed. Write-optimized SSD and disk cache are essential for write-intensive workloads.
In ZFS terms, that means having a "logzilla" for the ZFS Intent Log (ZIL) for quickly stashing blocks that will later be written
to the regular part of the ZFS pool. Also, for ZFS users, a ZFS mirror is preferred over RAIDZ1 or RAIDZ2 as it permits higher IOPS.
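A back-of-the-envelope sizing sketch follows from the IOPS figures above. The numbers and the mirror assumption are illustrative, not a sizing recommendation: each rotating disk delivers on the order of 150 to 300 IOPS, and in an N-way mirror every write is committed to all N sides.

```python
import math

def spindles_needed(target_iops, iops_per_disk=150, mirror_ways=2):
    """Rough spindle count for a mirrored layout.

    Each mirrored vdev delivers roughly one disk's worth of write IOPS
    (every write goes to all sides of the mirror), so we size the number
    of vdevs first and then multiply by the mirror width.
    """
    vdevs = math.ceil(target_iops / iops_per_disk)
    return vdevs * mirror_ways

# A hypothetical 3000 write-IOPS workload on 150-IOPS disks, 2-way mirrored:
print(spindles_needed(3000))  # -> 40 spindles
```

Numbers like this make it obvious why a write-optimized SSD log device changes the economics: it absorbs the synchronous writes that would otherwise demand dozens of spindles.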
Within the guest VM, the main tuning action is to make sure the paravirtualized device drivers are in use if this is a hardware virtualized guest.
They outperform emulated device drivers by a large margin.
Also, there is little or no benefit to guest VMs trying to do seek optimization: what guests think are contiguous disk blocks may be scattered
over multiple physical disks. Trying to schedule seeks is probably unnecessary overhead.
If the guest OS provides control over seek optimization, it might be worth disabling it, as
discussed in the MOS Note "I/O Scheduler Selection to Optimize Oracle VM Performance (Doc ID 2069125.1)".
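On Linux guests, the active I/O scheduler is visible in /sys/block/<dev>/queue/scheduler, which lists the available schedulers with the active one in brackets (for example "noop deadline [cfq]"). The sketch below parses that format; the device name is a placeholder, and which scheduler is right for your release is exactly what the MOS note covers.

```python
import re
from pathlib import Path

def current_scheduler(sched_line: str) -> str:
    """Parse the active scheduler from a /sys/block/<dev>/queue/scheduler line.

    The kernel lists every available scheduler and brackets the active one,
    e.g. 'noop deadline [cfq]' -> 'cfq'.
    """
    match = re.search(r"\[([^\]]+)\]", sched_line)
    if not match:
        raise ValueError(f"unrecognized scheduler line: {sched_line!r}")
    return match.group(1)

def scheduler_path(dev: str) -> Path:
    return Path(f"/sys/block/{dev}/queue/scheduler")

# On a live guest you would read the sysfs file; here, a canned example line:
print(current_scheduler("noop deadline [cfq]"))  # -> cfq
# To switch schedulers (as root), write the new name back, e.g.:
#   scheduler_path("xvda").write_text("noop")
```

Changes made this way do not survive a reboot; a kernel boot parameter or udev rule makes them permanent.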
Besides application I/O, there's I/O for the guest operating systems.
You can get "boot storms", as I described in the first
article in this series, when systems are restarted after an outage.
Every VM starting up has to load its OS images and fire up processes from its boot drives.
This sometimes catches people by surprise because they think they can skip planning for OS and boot disk performance
on the grounds that "those have only light I/O load" - but that's not always true.
Treat boot disks like other disks to avoid unpleasant surprises.
One optimization may be available if you are using thin-cloned VMs on OCFS2-based LUN repositories:
VM boot disks cloned from the same template or VM source share many disk blocks
(at least until the guest operating systems are individually upgraded).
The first OS to read a disk block warms the disk read cache (if available) and other guests using
the same block get a free ride: their reads should complete very quickly because they are already cached.
Blogger's prerogative: I like to tell old war stories and anecdotes about performance.
If you don't enjoy them, feel free to stop here! (If you do like them, drop me a note so I know you did :) )
I went to graduate school at Cornell University. Totally unrelated to the computer science department
(with which it had at best an equivocal relationship), the datacenter was one of the most innovative sites of the day,
with brilliant developers. This is also where I was first exposed to virtual machines.
There was a performance
problem that was getting senior management attention: students' little compile-and-go batch jobs
were taking too long. This was using Cornell's PL/C dialect of PL/I, which
normally compiles very quickly. The staff wizards were pursuing deep questions on CPU scheduling
to see if the VM used for compiles and other batch work was getting enough CPU, or if the compile
jobs within the VM were getting enough CPU. That was a hot topic of the day, but didn't seem to be the problem.
I happened to wander into the machine room and
looked at the disk drives, which were top-loading washing-machine style with removable disks storing
(woo-hoo!) 100MB each. These drives had the clever feature of showing the current cylinder seek
address in binary in blinking lights on the top of the drive, along with a 'disk busy' light. I noticed
that one disk was busy all the time, and the high-order bits of the seek address
kept changing. I thought "that can't be good."
There was no instrumentation available at the time, so
I went back to my cubicle and wrote a disk seek monitor program to learn what was going on. The program
woke up every N milliseconds and read kernel memory to retrieve the device busy status and seek address
for every disk device. After some number of iterations it printed out a table that showed how
busy each disk was, and an ASCII-art (well, EBCDIC-art, actually) histogram of the disk hot spots.
The hot spots were the library containing the PL/C compiler binaries on one end of the disk (a high seek address)
and a system checkpoint area updated many times per second on the other end of the disk (a low seek address).
No wonder there was trouble: the two hottest disk locations in the house were on opposite ends of the
same disk drive. This was before disk caches or solid state disk, so all the time was spent doing long seeks to
shuttle the access arm back and forth. (Yes, boys and girls, back then file placement made a tremendous difference,
and a lot of effort was spent on reducing seek and rotational latency.) I ran to the manager-of-wizards and told him that
I might have an answer, mere Boy Systems Programmer / grad student that I was.
He looked at the printout, quickly went into his office (he rated an office) and
(change control procedures cast to the winds!) moved the library containing the compiler to a low-usage spindle.
The problem disappeared immediately! Everyone was happy, and a few days later the wizard-in-chief, a real
performance expert, came up to me and said "thanks, that change made my numbers look really good".
This was much more fun than the computability theory I was supposed to be learning in the CS department.
I took serious lessons from this experience. The most important lesson is that it's essential to have
relevant performance data. The other message that stayed with me is that problems are often rooted in almost
boring low-tech details that might be disregarded. There's an expression in medicine, "When you hear hoofbeats,
think horses, not zebras" - in other words, don't reach for the most exotic explanation first.
Footnote: I used this pattern on several operating systems. On an early Unix I wrote a monitoring pseudo-device
/dev/mon to collect such data instead of sampling. The disk device driver was modified to put information
about each disk I/O in a circular buffer in kernel memory, and userland code could read from /dev/mon to get it.
I wish I had the foresight to generalize this for multiple sources of information, make it configurable (size of
buffer, which data to collect, etc). That would have been really useful but I didn't understand
how important this could be till much later. Every OS should have instrumentation.
This article discussed network and disk I/O performance and how to measure and improve it, and told an ancient war story.