News, tips, partners, and perspectives for Oracle’s virtualization offerings

Oracle VM Performance and Tuning - Part 4

Jeff Savit
Product Management Senior Manager
The previous article in this series described goals for virtual machines in the cloud, and CPU and memory recommendations to help achieve them. This article goes into more detail on CPU and memory management for Oracle VM Server for x86.

Quick review

An extremely brief review of Oracle VM Server for x86 resource usage and architecture:

CPU: CPUs can be shared, oversubscribed, and time-sliced using a share-based scheduler. CPUs can be allocated as cores (or as CPU threads if hyper-threading is enabled). The number of virtual CPUs in a domain can be changed while the domain is running.

Memory: Memory is dedicated to each domain; there is no over-subscription. The hypervisor attempts to assign a VM's memory to a single NUMA node, and has CPU affinity rules to try to keep a VM's virtual CPUs near its memory for local latency.

Oracle VM does not over-subscribe memory because that can unpredictably harm virtual machine performance. Guest VMs have poor locality of reference so are not good candidates for normal page replacement, and nesting guest operating systems that oversubscribe under a hypervisor that oversubscribes can lead to pathological (albeit interesting) performance problems and double-paging.

Domain types: Guest VMs (domains) may be hardware virtualized (HVM), paravirtualized (PV), or hardware virtualized with PV device drivers.


In addition, a privileged domain "dom0" is used for system control and to map guest VMs ("domU") virtual I/O onto physical devices. It is essential to make sure that dom0 performs well, as all guest I/O is processed by dom0.

Optimizing CPU and memory on Oracle VM Server for x86

The most important tuning action is to control the allocation of virtual CPUs to physical ones. As mentioned in the previous article, giving a domain too many CPUs can harm performance by increasing NUMA latency or by increasing multiprocessor overhead such as lock management. EDIT: there is a nice write-up from ACM Queue at http://queue.acm.org/detail.cfm?id=2852078

Similarly, giving a domain too much memory can cause NUMA latency if the memory straddles sockets. Since the number of VMs is limited by how many fit in memory, over-allocating VM memory also reduces the density of guests that can run.

Oracle VM applies this tuning to dom0 by default: recent versions of Oracle VM size dom0 so its virtual CPUs are pinned to physical CPUs on the first socket of the server, and size its memory based on the server's capacity. This eliminates NUMA memory latency since dom0's CPUs and memory are on the same socket. This is especially important for low-latency networking - further details are in the whitepaper on 10GbE network performance.

Similar tuning should be applied to guest VMs. In particular:

  • The number of virtual CPUs should not exceed the number of physical CPUs per socket, to avoid NUMA effects. This is under administrator control when defining or editing a VM. Sometimes it's necessary to give a virtual machine more CPUs than are on a single socket because the workload requires more capacity than a socket provides. That potentially trades efficiency for scale, and is reasonable for large workloads.

    The physical server's architecture of sockets and cores can be determined by logging into dom0 and issuing the following command:

    root# xenpm get-cpu-topology
    # note the relationship between CPU number, core and socket
    CPU core socket node
    CPU0 0 0 0
    CPU1 0 0 0
    CPU2 1 0 0
    CPU3 1 0 0
    CPU4 2 0 0
    CPU5 2 0 0
    CPU6 3 0 0
    CPU7 3 0 0
    CPU8 4 0 0
    CPU9 4 0 0
    CPU10 5 0 0
    CPU11 5 0 0
    CPU12 6 0 0
    CPU13 6 0 0
    CPU14 7 0 0
    CPU15 7 0 0
    CPU16 0 1 1
    CPU17 0 1 1
    CPU18 1 1 1
    CPU19 1 1 1
    CPU20 2 1 1
    CPU21 2 1 1
    CPU22 3 1 1
    CPU23 3 1 1
    CPU24 4 1 1
    CPU25 4 1 1
    CPU26 5 1 1
    CPU27 5 1 1
    CPU28 6 1 1
    CPU29 6 1 1
    CPU30 7 1 1
    CPU31 7 1 1
    Similar output is produced by running the command xl info -n.
  • Never give an individual virtual machine more virtual CPUs than the number of physical CPUs on the server. Giving more CPUs than CPUs per socket may be necessary, but giving a VM more CPUs than are on the server always harms performance and can lead to errors due to race conditions.
  • The total number of virtual CPUs over all VMs can be higher than the number of physical CPUs. CPU over-subscription (in terms of the number of virtual CPUs rather than their utilization) is not a cause for concern. There is some overhead for running the CPU scheduler and dispatcher, but this is a mild effect. If CPU utilization is low (many virtual CPUs are idle) then this is benign: even the added overhead only means "system goes idle a little later."
  • On the other hand: avoid excessive over-subscription of active virtual CPUs (with high %CPU load). The problem is not the number of virtual CPUs; it is distributing a fixed resource (a server's total CPU capacity) over more requesters. Slicing the pie for more active consumers just means that they get smaller slices. This is classic queuing theory: response times (elapsed time to get service) get longer as utilization approaches saturation.

    To avoid this, monitor CPU utilization at the system level (Enterprise Manager provides observability, or you can use Oracle VM Manager's Health tab) to see if servers are approaching full utilization without head-room. Be cautious about averages: a smoothed average over time of (say) 95% CPU busy may hide intervals that were 100% busy.

    Within virtual machines running Linux or Solaris, use vmstat, mpstat or iostat commands and observe "percent steal" time (%st or %steal). When steal is non-zero, the VM was runnable, had work to do, but the hypervisor was unable to give it CPU cycles.
    For example see these commands (run on idle systems, so not themselves interesting):

    linux$ vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 3135616  49224 193668    0    0    0     0    2    5  0  0 100  0  0
     0  0      0 3135680  49224 193668    0    0    0     0   58   60  0  0 100  0  0
    linux$ iostat 1
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.01    0.00    0.00    0.00    0.00   99.99
    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    xvda              0.36         1.09         4.50    1446588    5980034
    solaris$ mpstat 1
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  st idl
      0    1   0    0   490  246  102    0    1    1    0    90    0   1   0  99
      1    1   0    0    90   18   96    0    1    1    0    80    0   0   0 100
      2    1   0    0    89   17   93    0    1    1    0    79    0   0   0 100
      3    2   0    0    91   18   97    0    1    1    0    81    0   0   0 100
    solaris$ iostat 1
       tty        xdf0           cpu
     tin tout kps tps serv  us sy st id
       0    0   3   0    5   0  0  0 99
       0  110   0   0    0   0  1  0 99
       0   38   0   0    0   0  1  0 99

    Being CPU saturated is not a problem - it means you're getting your money's worth. Being CPU saturated with latent, unserviced resource demand is the problem.

    On the other hand, high steal is not a problem for "cycle soaker" VMs (if you are lucky enough to have them): compute-intensive workloads with lax service levels that can run when there are no cycles needed by anyone else.
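The queuing effect and the warning about averages in the list above can be illustrated with a few lines of Python. This is only a sketch under a simplifying assumption: it uses the textbook M/M/1 single-server model (response time = service time / (1 - utilization)), which is not something Oracle VM itself computes, and the function names and sample values are made up for illustration.

```python
def response_time_multiplier(utilization):
    """M/M/1 model: mean response time divided by mean service time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be at least 0 and below 1")
    return 1.0 / (1.0 - utilization)


def summarize(samples, saturated_at=1.0):
    """Return (average utilization, number of intervals at or above saturated_at)."""
    average = sum(samples) / len(samples)
    saturated = sum(1 for s in samples if s >= saturated_at)
    return average, saturated


# Response time grows slowly at first, then sharply near saturation.
for busy in (0.50, 0.80, 0.90, 0.95):
    print(f"{busy:.0%} busy: response time is about "
          f"{response_time_multiplier(busy):.0f}x the service time")

# Hypothetical per-interval CPU utilization samples (fractions of capacity).
samples = [1.00, 1.00, 0.95, 0.95, 0.95, 0.95, 0.90, 0.90]
average, saturated = summarize(samples)
print(f"average {average:.0%}, but {saturated} of {len(samples)} intervals were saturated")
```

Real workloads rarely match the M/M/1 assumptions exactly, so treat the multipliers as a picture of the curve's shape rather than a prediction.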

Hyper-threading (HTT)

Hyper-threading (HTT) is a controversial topic, as it improves performance in some cases and degrades it in others. With hyper-threading, each core runs two CPU threads instead of one, and time-slices the threads (in hardware) onto the core's resources. If one thread is idle or stalls on a cache miss, the other thread can continue to execute, which provides a potential throughput advantage. On the other hand, because both threads compete for the core's resources, especially the level 1 cache ("L1$"), each thread may run slower than if it owned the entire core.

EDIT: Nice writeup can be found at https://en.wikipedia.org/wiki/Hyper-threading

Each thread is assignable by the Xen hypervisor to a VM virtual CPU, and can be dispatched as a CPU. The performance effect is workload dependent, so it must be tested for different workload types.

  • Linux and other guest operating systems try to spread work over cores, but guests don’t know if virtual CPUs are on the same core or socket. Real topology data isn't available, so the guest dispatcher may make poor scheduling choices that increase overhead.
  • Hyper-threading can help scalable apps when the number of vCPUs is greater than the number of cores per socket but less than the number of threads per socket. This can keep CPU allocation on the same socket.
  • Hyper-threading can keep a core busy when a CPU thread stalls, which can increase overall throughput. SPARC uses a very advanced form of multi-threading for this reason.

In general, hyper-threading is best for multi-threaded applications that can drive multiple vCPUs, but can reduce per-vCPU performance and affect single-thread dominated applications. It needs to be evaluated in the context of the application.
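The sizing rules from this article (stay within one socket where possible, use hyper-threads to stay on a socket, never exceed the server's CPU count) can be checked against the topology listing. The following is a minimal Python sketch, assuming the four-column `xenpm get-cpu-topology` output format shown earlier. The names `parse_topology` and `classify_vcpus` and the result strings are hypothetical, not part of any Oracle VM tool, and the sample text is generated to mirror the 2-socket, 8-core, 2-thread server in the listing.

```python
from collections import defaultdict


def parse_topology(text):
    """Map each socket to {core: number of CPU threads} from xenpm-style rows."""
    sockets = defaultdict(lambda: defaultdict(int))
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like "CPU5 2 0 0"; the header and comments are skipped.
        if len(parts) == 4 and parts[0].startswith("CPU") and parts[0][3:].isdigit():
            core, socket = int(parts[1]), int(parts[2])
            sockets[socket][core] += 1
    return sockets


def classify_vcpus(vcpus, sockets):
    """Describe how a VM with the given vCPU count fits the physical topology."""
    cores_per_socket = min(len(cores) for cores in sockets.values())
    threads_per_socket = min(sum(cores.values()) for cores in sockets.values())
    total_threads = sum(sum(cores.values()) for cores in sockets.values())
    if vcpus <= cores_per_socket:
        return "fits on the cores of one socket"
    if vcpus <= threads_per_socket:
        return "fits on one socket only by using hyper-threads"
    if vcpus <= total_threads:
        return "spans sockets: expect NUMA latency"
    return "exceeds the server's physical CPUs: avoid"


# Sample text in the same layout as the listing: 32 CPUs, 2 threads per core.
sample = "CPU core socket node\n" + "\n".join(
    f"CPU{n} {(n % 16) // 2} {n // 16} {n // 16}" for n in range(32)
)
topology = parse_topology(sample)
for n in (8, 12, 24, 40):
    print(n, "vCPUs:", classify_vcpus(n, topology))
```

On a real server the text would come from the saved output of the command run in dom0. The thresholds only describe placement; whether hyper-threads actually help still has to be measured per workload.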

You can determine whether hyper-threading is enabled in several ways. Log into dom0 on Oracle VM Server and issue the commands shown below, which were run on a server that has hyper-threading enabled:

root# dmidecode | grep HTT
HTT (Hyper-threading technology)
HTT (Hyper-threading technology)
root# egrep 'siblings|cpu cores' /proc/cpuinfo | head -2
siblings: 2
cpu cores : 1
root# grep '^flags\b' /proc/cpuinfo | tail -1
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc pni pclmulqdq est ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm ida arat epb pln pts dts fsgsbase erms
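The comparison of siblings and cpu cores can also be scripted. The sketch below is a heuristic, not an official check: it assumes that a processor entry in /proc/cpuinfo reporting more siblings than cpu cores indicates hyper-threading, which matches the output above. The function name is hypothetical.

```python
def hyperthreading_enabled(cpuinfo_text):
    """Heuristic: True if any processor entry reports more siblings than cores."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines separate processor entries
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        if siblings is not None and cores is not None:
            if siblings > cores:
                return True
            siblings = cores = None  # start over for the next processor entry
    return False


# The two lines from the egrep output above.
sample = "siblings: 2\ncpu cores : 1\n"
print(hyperthreading_enabled(sample))
```

On a live system, pass the contents of /proc/cpuinfo instead, for example `hyperthreading_enabled(open("/proc/cpuinfo").read())`.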


This article described CPU management on Oracle VM Server for x86, with an emphasis on reducing NUMA latency, avoiding problems that can come with over-subscription and explaining the effects of hyper-threading. The next article will discuss Xen domain types and virtual I/O.

An anecdote

Blogger's prerogative: I like to tell old war stories and anecdotes about performance. If you don't enjoy them, feel free to stop here! (If you do like them, drop me a note so I know you did :) )

Back in Ye Olde Days, you could simply look at the system to know what was going on. As Yogi Berra said, "You can observe a lot by just watching." On mainframes and minicomputers in days of yore, you could look at the lights on the front panel (we had great computer lights then) and get a clue about activity. No more, alas, whether because of cloud computing, or simply because the systems we work on are in a data center somewhere, or darn it because we just don't put enough lights on the computer.

While working my way through college as Boy Systems Programmer, I had to write a program that searched accounting data on tape (OS/360 SMF data, for those who know what that is) for records matching a list of file names, to see who had accessed or deleted them. I wrote a program that read through the data, which was on 2400 foot tape reels. When I hit the record type for file access, I searched for matches in an in-memory table of interesting file names. I walked into the computer room while the batch job was running and saw that the CPU was pinned (the CPU "WAIT" light was off) and the tape wasn't moving all that fast. Nice thing about the old tape drives: you could watch them spin and judge performance by how fast they moved. "Hmm, that can't be good." Canceled the job, replaced the sequential search through the table of file names with a binary search, and reran it. The 0.5 MIPS (!) CPU was no longer CPU-saturated, and the tape drive was spinning as fast as it could.

Sometimes seeing the system is the best way to even know there's a problem. You can't fix a problem if you don't know you have a problem. We usually can't look at the computer any more, and even if we can it doesn't tell us much, so we have to replace tangible observation with the right tools.
