First, let's consider ways to not evaluate performance. Performance is often stated in unquantified generalities ("Give good response time!") or complaints ("Response time is terrible today. Fix it!"). That doesn't help understand the performance situation.
Another habit is to look at resource utilization without relevance to delivered performance. For example:
High CPU busy may just mean you're getting your money's worth from your servers. It is only a problem if service level objectives aren't being met due to resource starvation. However, it could be a symptom of a problem (program looping, excessive error handling, etc) rather than being a problem.
Low CPU may mean the workload is idle, or it's a single threaded app on a server with many CPUs. (Is this an 8 CPU machine where one CPU is 100% busy while the others are idle?)
Applications are often unable to drive CPU because they are waiting on I/O, or because memory is over committed and they are thrashing. Low utilization might be innocent, or can indicate a bottleneck elsewhere.
Raw utilization numbers are not good or bad of themselves. They can be clues when used in the right context - "it depends". Another trap is to use average numbers, which can hide peak loads and spikes.
There are other popular measurements that don't help: such as microbenchmarks that don't look at all like the actual workload, or measuring how long it takes to run the dd command to write a gigabyte of zeroes when the expected workload is random I/O. Unless the purpose of the system is to use dd to write zeroes, it's of limited utility. Measuring the wrong thing because it's easy is so common that it has its own name, the streetlight effect.
Instead, performance analysts and systems administrators should measure against requirements of their business users, using service level objectives stated in external terms: meeting a deadline to run a particular task (such as: get payroll out, post and clear stock trades, or close the books at end of quarter), or meeting response times for different times of transaction (load a web page, do a stock quote, transact a purchase or trade). Performance objectives are commonly expressed in the form of response times ("95% of a specified transaction must complete in X seconds at a rate of N transactions per second") or in throughput rates ("handle N payroll records in H hours").
In the preceding paragraphs, I've touched on essential performance concepts: thoughput, response time, latency, utilization, service levels. We'll use those terms to relate to virtual machine performance.
Some may find the preceding material too abstract and not specific to virtualization, so lets change gears and discuss architecture of the Oracle VM hypervisors.
The Oracle VM family includes Oracle VM Server for x86 and Oracle VM Server for SPARC. two hypervisors that optionally share Oracle VM Manager as a common management infrastructure. There is also a desktop virtualization product, Oracle VM VirtualBox. VirtualBox is popular for end-user virtual machines on a desktop for laptop, but is out of scope for this series of articles.
Oracle VM Server for x86 and Oracle VM Server for SPARC have architectural similarities and differences. Both use a small hypervisor in conjunction with a privileged virtual machine ("domain") for aministration, and virtual and physical device management. The hypervisor resides in firmware on SPARC, and in software on x86.That constrasts with traditional virtual machine systems that use a monolithic hypervisor kernel for system control and device management.
Oracle VM Server for x86 is based on Xen virtualization technology, and uses a "dom0" (domain 0) as an administrative control point and to provide virtual I/O device services to the guest VMs ("domU"). Oracle VM Server for SPARC uses a small firmware-based hypervisor coupled with a "control domain" that can be compared to dom0 on x86, with the option of having multiple "servicedomains" for resiliency.
The two products also have similarities and differences in how they handle systems resources:
|Oracle VM Server for x86||Oracle VM Server for SPARC|
|CPU||CPUs can be shared, oversubscribed, and timesliced using a share-based scheduler.
CPUs can be allocated cores (or CPU threads if hyperthreading is enabled.)
The number of virtual CPUs in a domain can be changed while the domain is running.
|CPUs are dedicated to each domain with static assignment when the domain is "bound".
Domains are given exclusive use of some number of CPU cores or threads, which can be
changed while the domain is running.
|Memory||Memory is dedicated to each domain, no over-subscription.
The hypervisor attempts to assign a VM's memory to a single NUMA node,
and has CPU affinity rules to try to keep a VM's virtual CPUs near its memory for local latency.
|Memory is dedicated to each domain, no over-subscription.
The hypervisor attempts to assign memory on a single NUMA node,
and allocate CPUs on the same NUMA node for local latency.
|I/O||Guest VMs are provided virtual network, console, and disk devices provided by dom0||Guest VMs are provided virtual HBA, network, console, and disk devices provided by control domain and optional service domains.
VMs can also use physical I/O with direct connection to SR-IOV virtual functions or PCIe buses.
|Domain types||Guest VMs (domains) may be hardware virtualization (HVM), paravirtualized (PV) or hardware virtualized with PV device drivers.||Guest VMs (domains) are paravirtualized.|
That's a lot of similarity for two products with different origins. When I'm asked for a quick summary, I say that the two products have a common memory model (VM memory is fixed, not overcommitted or swapped - very important), but different CPU models (Oracle VM Server for SPARC uses dedicated CPUs on servers that have lots of them, while x86 uses a more traditional software based scheduler that time-slices virtual CPUs onto physical CPUs. Both products are aware of the NUMA affects and try in different ways to reduce remote memory latency from CPUs. Both have virtual network and virtual disk devices, but the SPARC side has additional options for device backends and non-virtualized I/O. Finally, the x86 side has more domain types, reflecting the wide range of x86 operating systems.
That's an introduction to the concepts. The next article (rubbing hands together in anticipation!) will delve more into the technology
and their performance implications.
For general performance analysis, I recommend Brendan Gregg's book Systems Performance: Enterprise and the Cloud. It has excellent content for any performance analyst, as well as details for various versions of Linux and Solaris.