This post is one of a series of "best practices" notes for Oracle VM Server for
SPARC (formerly called Logical Domains).
Oracle VM Server for SPARC (originally, and still often, called Logical Domains) provides live migration,
which non-disruptively moves a running guest domain from one SPARC server to another.
As with most things, this requires planning and has several technical requirements:
Several virtualization technologies offer live migration, and while details differ, they work in roughly similar ways:
Migration time is one of the first things that people ask about. There are actually several times to consider:
Total migration time is the one most people think of, but in practice suspension time is usually more important, because it represents the time that an application is unresponsive. Overhead added by migration is also significant, because it changes the user's experience of the guest's performance.
Unfortunately, there is no easy way to calculate these times, and no rule of thumb that accurately predicts how long it will take an arbitrary virtual machine to migrate from one server to another. They are a function of server speed, the algorithms used in the live migration process, the speed of the network connecting the servers, the size of the virtual machine's memory, and how actively the guest VM is changing its memory contents. Even knowing the memory size is not enough: a large-memory VM may be using only a small working set, while a relatively small VM may be actively changing most of its memory (think of a large Java heap during a full GC), and a CPU can update memory much faster than any network can transmit it. This is true across the industry. You can estimate how long it will take to migrate a virtual machine only by experimenting with a specific system and workload.
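To make the interaction between memory size, change rate, and network speed concrete, here is a toy model of a generic iterative "pre-copy" migration. It is only a sketch under stated assumptions: it is not the algorithm Oracle VM Server for SPARC actually uses, it ignores compression, CPU overhead, and varying workloads, and the function name, parameters, and default thresholds are all invented for illustration.

```python
def simulate_precopy(memory_gb, dirty_gb_per_s, link_gb_per_s,
                     max_rounds=30, stop_threshold_gb=0.5):
    """Toy model of an iterative pre-copy live migration (illustrative only).

    Each round copies whatever is still outstanding while the guest keeps
    running and dirtying memory. Once the remainder is small enough, or the
    round limit is hit, the guest is suspended and the remainder is copied.

    Returns (total_seconds, suspend_seconds, rounds).
    """
    to_send = float(memory_gb)   # GB still to be transferred
    total = 0.0                  # seconds spent copying while the guest runs
    rounds = 0

    while to_send > stop_threshold_gb and rounds < max_rounds:
        round_time = to_send / link_gb_per_s
        total += round_time
        rounds += 1
        # Memory dirtied while this round was in flight must be re-sent,
        # but it can never exceed the size of the guest's memory.
        to_send = min(float(memory_gb), dirty_gb_per_s * round_time)

    # Final copy happens with the guest suspended.
    suspend = to_send / link_gb_per_s
    return total + suspend, suspend, rounds


if __name__ == "__main__":
    # Hypothetical guests: (memory GB, dirty rate GB/s, link GB/s)
    for case in [(64, 0.1, 1.0), (8, 2.0, 1.0)]:
        total, suspend, rounds = simulate_precopy(*case)
        print(case, f"total={total:.1f}s suspend={suspend:.2f}s rounds={rounds}")
```

In this model, a 64 GB guest that changes memory slowly converges in a few rounds with a very short suspension, while an 8 GB guest that dirties memory faster than the link can carry it never converges and ends up suspended for a full memory copy. That is the same point made above: memory size alone does not predict migration behavior, and real numbers come only from testing a specific system and workload.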
The following best practices can be used to reduce live migration time on Oracle VM Server for SPARC:
Note that clocks are not advanced during domain suspension, which can create clock skew. Run the NTP client service in the guest domain so that it can resynchronize to the correct time after migration completes.
This is an important question - live migration is not always the solution to the problem at hand. Just because you can do something doesn't mean you should!
First, it's not a substitute for fault resiliency or high availability technology: you cannot live migrate a virtual machine from a server that isn't alive (yes, I have been asked that). It is useful for vacating a server when planned in advance of a maintenance window or even a tech refresh. That makes it possible to service or even replace a server without an application outage - provided you plan to do it while the source server is available! Of course, if you have a server that is beginning to fail, then by all means use live migration to evacuate it before it goes away.
Second, uninterrupted service can sometimes be provided better at the application level than at the virtualization level. For example, if you have a stateless web application behind a load balancer that sprays web requests to multiple machines, it is much simpler to remove a server from the load balancer than to migrate all of its guests off. Optionally, you can fire up new application instances on other servers in the web farm. It's a straightforward and well-understood method that works with or without virtualization.
Enterprise applications that provide their own resiliency offer a strong alternative to live migration. For example, Oracle WebLogic and Oracle Real Application Clusters (RAC) already have the ability to control an application that is distributed over cooperating instances on multiple server nodes. It is generally easier and faster to halt a WebLogic instance or RAC node and remove it from the application cluster than to live migrate the virtual machine containing it. Those applications tend to have large memories that they actively update, making them the ones most likely to have longer migration and suspend times.
Yes, they can be live migrated, but in this case there's an easier way to have the same effect. You can combine techniques to leverage their strengths: for example, remove a node from a cluster (a fast, well-understood technique), shut down the VM it runs in, optionally do a cold migration of the guest domain to another server, and start it up again. The advantage is that cold migration is essentially instantaneous. There is no need to live migrate a massive Oracle database memory image in order to rehost it.
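The sequence above can be sketched as a small plan builder. This is a hedged illustration, not a tested procedure: the domain and host names are hypothetical, the step that removes the node from the application cluster is application-specific and is left as a placeholder, and the function only builds `ldm` command lines (`stop-domain`, `migrate-domain`, `start-domain`) rather than running them. Check the options against the `ldm` man page for your release before using anything like it.

```python
def cold_migration_plan(domain, target_host):
    """Build (where, command) steps for a stop / cold-migrate / start cycle.

    'where' says which control domain the command would be run on.
    Nothing is executed here; the caller decides how to run each step.
    """
    return [
        # Placeholder: drain the node from the application cluster first,
        # using the application's own tooling (not shown, app-specific).
        ("application", ["# remove node from cluster (application-specific)"]),
        # Stop the guest so that it is no longer running on the source.
        ("source", ["ldm", "stop-domain", domain]),
        # Migrating a domain that is not running is a cold migration:
        # only configuration moves, not a memory image.
        ("source", ["ldm", "migrate-domain", domain, target_host]),
        # Start the guest on the target server's control domain.
        ("target", ["ldm", "start-domain", domain]),
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for where, cmd in cold_migration_plan("racnode2", "target-host"):
        print(f"[{where}] {' '.join(cmd)}")
```

Building the plan as data keeps the ordering explicit and lets you review or log the steps before anything touches a production domain.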
Live migration is sometimes used for distributed resource management. An important guest VM might be migrated off a mostly-full server to a server with more capacity in order to give it more CPU and memory capacity. Conversely, less important guests can be migrated off of a server in order to free up resources for an important VM that stays where it is.
This is a widely used and effective technique, which can be used with Oracle VM Server for SPARC.
However, OVM SPARC offers an elegant alternative method to achieve similar effects.
Rather than migrate guests off of a box in order to free up resources, one can simply redistribute resources by shrinking the CPU and memory assignments of less-important or idle guests and giving the freed-up resources to the guests that need them. That can be faster and more effective than moving entire virtual machines from one server to another across an intervening network. Not all virtualization technologies permit dynamically adding resources to a running VM and removing them from it (removal is often the harder task), but Oracle VM Server for SPARC provides this through cooperation between the hypervisor and Solaris OS.
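As a rough sketch of what such a redistribution looks like, the helper below builds the `ldm` commands that would move cores and memory from a donor domain to a recipient on the same server. The domain names and amounts are hypothetical, the function only constructs command lines and does not run them, and it assumes the domains are configured with whole cores (a domain configured by virtual CPU count would use `add-vcpu` and `remove-vcpu` instead).

```python
def rebalance_plan(donor, recipient, cores=0, memory_gb=0):
    """Build ldm commands that shift resources from donor to recipient.

    Resources are removed from the donor first so that they are free
    on the server before being added to the recipient.
    """
    cmds = []
    if cores:
        cmds.append(["ldm", "remove-core", str(cores), donor])
        cmds.append(["ldm", "add-core", str(cores), recipient])
    if memory_gb:
        size = f"{memory_gb}G"  # ldm accepts a size with a unit suffix
        cmds.append(["ldm", "remove-memory", size, donor])
        cmds.append(["ldm", "add-memory", size, recipient])
    return cmds


if __name__ == "__main__":
    # Hypothetical example: move 2 cores and 16 GB from an idle guest.
    for cmd in rebalance_plan("idle-guest", "busy-guest", cores=2, memory_gb=16):
        print(" ".join(cmd))
```

Nothing crosses the network in this approach, which is why it can be quicker than migrating a guest to another server to achieve the same rebalancing.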
Oracle VM Server for SPARC can live migrate running guest domains between servers. This can be an effective way to
enhance operational flexibility, and can be used to evacuate a server or to provide distributed resource management. This blog entry describes some rules about how to use it effectively, and offers alternatives that can be even more effective.
Irrelevant observation: for the first time ever, I saw a system where all the domains were assigned CPU counts that were prime numbers. :-)