Best Practices - Top Ten Tuning Tips
By jsavit on Feb 04, 2013
This is the original version of this blog entry kept for reference. Please refer to the updated version.
This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly called Logical Domains).
Top Ten Tuning Tips
Oracle VM Server for SPARC is a high performance virtualization technology for SPARC servers. It provides native CPU performance without the virtualization overhead typical of hypervisors. The way memory and CPU resources are assigned to domains avoids problems often seen in other virtual machine environments, and there are intentionally few "tuning knobs" to adjust.
However, there are best practices that can enhance or ensure performance. This blog post lists and briefly explains performance tips and best practices that should be used in most environments. Detailed instructions are in the Oracle VM Server for SPARC Administration Guide. Other important information is in the Release Notes. (The Oracle VM Server for SPARC documentation home page is here.)
Big Rules / General Advice
Some important notes first:
- "Best practices" may not apply to every situation. There are often exceptions or trade-offs to consider. We'll mention them so you can make informed decisions. Please evaluate these practices in the context of your requirements and systems.
- Best practices and "rules of thumb" change over time as technology changes. What is "best" at one time may not be the best answer later, as new features are added or enhanced.
- Continuously measure, and tune and allocate resources to meet service level objectives. Then do something else - it's rarely worth trying to squeeze the last bit of performance when performance objectives have been achieved!
- Standard Solaris tools and tuning apply in a domain or virtual machine just as on bare metal: the *stat tools, DTrace, driver options, TCP window sizing, /etc/system settings, and so on.
- The answer to many performance questions is "it depends". Your mileage may vary. In other words: there are few fixed "rules" that say how much performance boost you'll achieve from a given practice.
Keep firmware, Logical Domains Manager, and Solaris up to date - Performance
enhancements are continually added to Oracle VM Server for SPARC, so staying
current is important.
That includes the firmware, which is easy to "install once and forget". The firmware contains much of the logical domains infrastructure, so it should be kept current. The Release Notes list minimum and recommended firmware and software levels needed for each platform.
Some enhancements improve performance automatically just by installing the new versions. Others require administrators to configure and enable new features. The following items will mention them as needed.
Allocate sufficient CPU and memory resources to each domain, especially
control, I/O and service domains - This should be obvious, but cannot be
overemphasized. If a service domain is short on CPU, then all of its clients are
delayed. Within the domain you can use prstat to see if there is pent-up demand for CPU. Alternatively, issue ldm list -l from the control domain.
Good news: you can dynamically add and remove CPUs to meet changing load conditions, even on the control domain. You can do this manually or automatically with the built-in policy-based resource manager. That's a Best Practice of its own, especially if you have guest domains with peak and idle periods.
The same applies to memory. Again, the good news is that standard Solaris tools can be used to see if a domain is low on memory, and memory can also be added to or removed from a domain. Applications need the same amount of RAM to run efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor is required. Logical domains do not oversubscribe memory, which avoids problems like unpredictable thrashing.
For the control domain and other service domains, a good starting point is at least 1 core (8 vCPUs) and 4GB or 8GB of memory. Actual requirements must be based on system load: small CPU and memory allocations were appropriate with older, smaller LDoms-capable systems, but larger values are better choices for the demanding, higher scaled systems and applications now used with domains. Today's faster CPUs are capable of generating much higher I/O rates than older systems, and service domains have to be suitably provisioned to support the load. Don't starve the service domains! Two cores and 8GB of RAM are a good starting point if there is substantial I/O load.
Live migration is known to run much faster if the control domain has at least 2 cores, both for total migration time and suspend time, so don't run with a minimum-sized control domain if live migration times are important.
In general, add another core if ldm list shows that the control domain is busy. Add more RAM if you are hosting lots of virtual devices or are running agents, management software, or applications in the control domain and vmstat -p shows that you are short on memory. Both can be done dynamically without an outage.
Allocate domains on core boundaries - SPARC servers supporting logical
domains have multiple CPU cores with 8 CPU threads each. Avoid "split core"
situations in which CPU cores are shared by more than one domain (different domains
have CPU threads on the same core). This can reduce performance by causing "false
cache sharing" in which domains compete for a core's Level 1 cache. The impact on
performance is highly variable, depending on the domains' behavior.
Split core situations are easily avoided by always assigning virtual CPUs in multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain). It is rarely good practice to give tiny allocations of 1 or 2 virtual CPUs, and definitely not for production workloads. If fine-grain CPU granularity is needed for multiple applications, deploy them in zones within a logical domain for sub-core resource control.
Alternatively, use the whole-core constraint (ldm set-core 1 mydomain or ldm add-core 3 mydomain). The whole-core constraint requires that a domain be given its own cores, or the bind operation will fail. This prevents unnoticed sub-optimal configurations.
In most cases the logical domain manager avoids split-core situations even if you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to allocate different cores to different domains even when partial core allocations are used. It is not always possible, though, so the best practice is to allocate entire cores.
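The "multiples of 8" rule can be sketched as a small helper. This is a minimal illustration, not a supported tool: the function names whole_core_vcpus and print_set_vcpu are made up for this example, and THREADS_PER_CORE=8 assumes the 8-thread cores described above. The script only prints the ldm command so it can be reviewed before being run on the control domain.

```shell
#!/bin/sh
# Assumption: 8 CPU threads per core, as on the servers described in this post.
THREADS_PER_CORE=8

# Round a requested vCPU count up to the next whole-core multiple.
whole_core_vcpus() {
    # $1 = requested number of vCPUs (positive integer)
    echo $(( ($1 + $THREADS_PER_CORE - 1) / $THREADS_PER_CORE * $THREADS_PER_CORE ))
}

# Print (do not run) the ldm command for a whole-core vCPU allocation.
print_set_vcpu() {
    # $1 = requested vCPUs, $2 = domain name
    echo "ldm set-vcpu $(whole_core_vcpus "$1") $2"
}

print_set_vcpu 10 mydomain
```

A request for 10 vCPUs is rounded up to 16 (two full cores), so no other domain ends up with threads on a core that mydomain is using.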
For a slightly lengthier writeup, see Best Practices - Core allocation.
- Use Solaris 11 in the control and service domains - Solaris 11 contains functional and performance improvements over Solaris 10 (some will be mentioned below), and will be where future enhancements are made. It is also required to use Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10 and Solaris 11, so there is no problem doing "mix and match" regardless of which version of Solaris is used in the control domain. It is a best practice to deploy Solaris 11 in the control domain even if you haven't upgraded the domains running applications.
NUMA latency - Servers with more than one CPU socket, such as a T4-4, have
non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory
access from CPUs on the same socket has lower latency than "remote". This can have
an effect on applications, especially those with large memory footprints that do
not fit in cache, or are otherwise sensitive to memory latency.
Starting with release 3.0, the logical domains manager attempts to bind domains to CPU cores and RAM locations on the same CPU socket, making all memory references local. If this is not possible because of the domain's size or prior core assignments, the domain manager tries to distribute CPU core and RAM equally across sockets to prevent an unbalanced configuration. This optimization is automatically done at domain bind time, so subsequent reallocation of CPUs and memory may not be optimal. Keep in mind that this does not apply to single board servers, like a T4-1. In many cases, the best practice is to do nothing special.
To further reduce the likelihood of NUMA latency, size domains so they don't unnecessarily span multiple sockets. This is unavoidable for very large domains that need more CPU cores or RAM than are available on a single socket, of course.
If you must control this for the most stringent performance requirements, you can use "named resources" to allocate specific CPU and memory resources to the domain, using commands like ldm add-core cid=3 ldm1 and ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in the SPARC Supercluster engineered system, which is rigorously tested on a fixed number of configurations. This should be avoided in general purpose environments unless you are certain of your requirements and configuration, because it requires model-specific knowledge of CPU and memory topology, and increases administrative overhead.
- Single thread CPU performance - Starting with the T4 processor, SPARC
servers supporting domains can use a dynamic threading mode that allocates all of a
core's resources to a thread for highest single thread performance.
Solaris will generally detect threads that will benefit from this mode and "do the right thing"
with little or no administrative effort, whether in a domain or not.
An excellent writeup can be found in Critical Threads Optimization
in the Observatory blog.
Mentioned for completeness' sake: there is also a deprecated command to control this at the domain level, ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be done.
Live Migration - Live migration is CPU intensive in the control domain of
the source (sending) host. Configure at least 1 core (8 vCPUs) to the control
domain in all cases, but optionally add an additional core to speed migration
and reduce suspend time. The core can be added just before starting migration and
removed afterwards. If the machine is older than T4, add crypto accelerators to the
control domains. No such step is needed on later machines.
Perform migrations during low activity periods. Guests that heavily modify their memory take more time to migrate since memory contents have to be retransmitted, possibly several times. The overhead of tracking changed pages also increases CPU utilization.
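The "add a core just before migration, remove it afterwards" sequence can be sketched as a dry run. The guest name ldg1 and the host name target-host are placeholders, primary is assumed to be the control domain name, and the helper print_migration_plan is hypothetical. It also assumes the control domain is allocated in whole cores; if yours is configured with ldm set-vcpu, the equivalent would be ldm add-vcpu 8 and ldm remove-vcpu 8.

```shell
#!/bin/sh
# Print (do not run) the commands for a migration with a temporary extra core.
print_migration_plan() {
    # $1 = guest domain, $2 = target host, $3 = control domain (default: primary)
    ctl=${3:-primary}
    echo "ldm add-core 1 $ctl"        # extra core on the sending control domain
    echo "ldm migrate-domain $1 $2"   # the live migration itself
    echo "ldm remove-core 1 $ctl"     # return the core once migration completes
}

print_migration_plan ldg1 target-host
```

Printing the plan first keeps the example safe to try anywhere; on a real system the three commands would be run in order from the source host's control domain.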
Network I/O - Configure aggregates, use multiple network links,
use jumbo frames, and adjust TCP windows and other system settings the same way and for the
same reasons as you would in a non-virtual environment.
Use RxDring support to substantially reduce network latency and CPU utilization. To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for each of the involved domains. The domains must run Solaris 11 or Solaris 10 update 10 and later, and the involved domains (including the control domain) will require a domain reboot for the change to take effect. This also requires 4MB of RAM per guest.
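Because the setting has to be applied to every involved domain, a short loop helps avoid missing one. This is a sketch only: the helper print_enable_mapin is hypothetical, the guest names ldg1 and ldg2 are placeholders, and primary is assumed to be the control domain. It prints the commands rather than running them, since each domain needs a reboot afterwards.

```shell
#!/bin/sh
# Print (do not run) the command that enables extended mapin space
# for each domain named on the command line.
print_enable_mapin() {
    for dom in "$@"; do
        echo "ldm set-domain extended-mapin-space=on $dom"
    done
}

print_enable_mapin primary ldg1 ldg2
```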
If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native NIC or aggr interface is plumbed, there can be a performance impact since each packet may be duplicated to provide a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw. The vsw doesn't need to be plumbed either unless the guest domains need to communicate with the service domain. This isn't an issue for Solaris 11 - another reason to use that in the service domain. (Thanks to Raghuram for this great tip.)
As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization (SR-IOV) to provide native-level network I/O performance. They provide superior performance but currently have two main limitations: they cannot be used in conjunction with live migration, and they cannot be dynamically added to or removed from a running domain. SR-IOV is described in an excellent blog article by Raghuram Kothakota.
Disk I/O - For best performance, use a whole disk backend (a LUN or full
disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing
(just as you would do in a non-virtual environment).
Flat files in a file system are convenient and easy to set up as backends, but deliver lower performance.
For completely native performance, use a PCIe root complex domain and physical I/O.
ZFS can also be used for disk backends. This provides flexibility and useful features (clones, snapshots, compression) but can impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration, because a zpool can be imported by only one host at a time. When using ZFS backends for virtual disk, use a zvol rather than a flat file - it performs much better. Also, make sure that the ZFS recordsize for the ZFS dataset matches the application (again, just as in a non-virtual environment). This avoids read-modify-write cycles that inflate I/O counts and overhead. The default of 128K is not optimal for small random I/O.
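A zvol backend with an application-sized block can be sketched as follows. Everything named here is an assumption for illustration: the pool and volume tank/ldg1-disk0, the virtual disk service primary-vds0, the guest ldg1, and the 20g and 8k values. Note that for a zvol the block size is the volblocksize property, which is set when the volume is created, while recordsize applies to ZFS file systems. The hypothetical helper print_zvol_backend only prints the commands.

```shell
#!/bin/sh
# Print (do not run) the commands to create a zvol and export it as a virtual disk.
# Assumes a virtual disk service named primary-vds0 already exists.
print_zvol_backend() {
    # $1 = zvol (pool/name), $2 = size, $3 = block size, $4 = guest domain
    vol=$1; size=$2; bs=$3; guest=$4
    name=${vol##*/}    # last path component, reused as the volume and vdisk name
    echo "zfs create -V $size -o volblocksize=$bs $vol"
    echo "ldm add-vdsdev /dev/zvol/dsk/$vol ${name}@primary-vds0"
    echo "ldm add-vdisk $name ${name}@primary-vds0 $guest"
}

print_zvol_backend tank/ldg1-disk0 20g 8k ldg1
```

The 8k value is only an example for a workload doing small random I/O; pick the block size from what the application actually issues.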
Networked disk on NFS and iSCSI -
NFS and iSCSI also can perform quite well if an appropriately fast network is used.
Apply the same network tuning you would use for non-virtual applications.
For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.
If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla" ZFS Intent Logs (ZIL) to speed up synchronous writes.
By design, logical domains don't have a lot of "tuning knobs", and many tuning practices you would do for Solaris in a non-domained environment apply equally when domains are used. However, there are configuration best practices and tuning steps you can use to improve performance. This blog note itemizes some of the most effective (and least exotic) performance best practices.