Live Migration in Oracle VM Server for SPARC 2.1
By Liam Merwick on Jun 08, 2011
You may have seen the press release announcing that Oracle VM Server for SPARC 2.1 (a.k.a. LDoms) has just been released (you can find the download links here). There is a considerable list of new features (including Dynamic Resource Management, Virtual Device Service Validation and many more), but the key feature for me is Live Migration, which allows an active domain to be migrated with no impact on applications; users shouldn't even notice that the guest domain is now running on a new machine. (OK, I would say that, since I'm one of the migration developers...)
It has been possible to migrate an active domain since LDoms 1.1 (released in 2008). However, until now the domain was suspended while its runtime state was copied from the source machine to the target, which could result in an outage on the order of minutes if the domain had a large amount of memory (the suspend time was roughly linearly proportional to the guest domain's memory size). With Live Migration, we transfer the memory contents while the domain keeps running, at the same time keeping track of which memory is being modified. We iterate through the memory, transferring modified pages to the target system, until the amount of memory still being modified is minimal. At the end, we momentarily suspend the domain, copy the remaining modified pages, and resume the domain on the target. This suspension can take less than a second, though it can take longer if the domain is rapidly modifying a lot of its memory pages.
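To make the iterative pre-copy idea concrete, here is a minimal sketch of that loop in Python. Everything here is hypothetical and simplified (the page store, the dirty-page tracker, and the `send` callback are all simulated stand-ins; the real work happens in the hypervisor and domain manager), but the shape of the algorithm matches the description above:

```python
# Hypothetical sketch of iterative pre-copy live migration.
# `pages` maps page id -> contents; `get_dirty_pages()` returns the set of
# page ids modified since the last call; `send(batch)` copies a batch of
# pages to the target machine. All three are simulated placeholders.

def live_migrate(pages, get_dirty_pages, send, threshold=16, max_rounds=30):
    # Round 0: copy every page while the domain keeps running.
    send({pid: pages[pid] for pid in pages})
    rounds = 1
    # Re-send pages dirtied during the previous transfer until the
    # still-changing working set is small (or we give up on converging).
    while rounds < max_rounds:
        dirty = get_dirty_pages()
        if len(dirty) <= threshold:
            break
        send({pid: pages[pid] for pid in dirty})
        rounds += 1
    # Final phase: momentarily suspend the domain, copy the last few
    # dirty pages, then resume the domain on the target.
    final_dirty = get_dirty_pages()
    send({pid: pages[pid] for pid in final_dirty})
    return rounds, len(final_dirty)
```

The key property is that the suspension at the end only has to cover the final, small dirty set, which is why a workload that keeps rewriting a large fraction of its memory lengthens the suspension: the loop never shrinks the dirty set below the threshold.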
One other migration performance enhancement in this release is that multiple network connections between the source and target machines are utilised (the number is based on the number of virtual CPUs in the control domains), which improves the throughput of the memory transfer and reduces the overall migration time.
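A toy sketch of that fan-out, again with hypothetical names (the real code derives the connection count from the control domain's vcpu configuration and sends over actual sockets; here `send_on` is a placeholder callback):

```python
# Hypothetical sketch: split the memory transfer round-robin across
# several connections, one worker thread per connection.
from concurrent.futures import ThreadPoolExecutor

def parallel_send(page_ids, send_on, num_conns):
    # Deal the pages out round-robin so each connection carries a
    # roughly equal share of the transfer.
    batches = [page_ids[i::num_conns] for i in range(num_conns)]
    with ThreadPoolExecutor(max_workers=num_conns) as pool:
        # send_on(conn_id, batch) pushes one batch over one connection.
        list(pool.map(lambda args: send_on(*args), enumerate(batches)))
    return batches
```

Round-robin dealing is just one simple balancing choice; the point is that the aggregate throughput of several streams beats a single TCP connection on these machines.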
I've found from running experiments that having 16 vcpus in the control domains gives a significant improvement over 8 vcpus (and up to 32 vcpus helps further; beyond that there's no noticeable difference). The other best practice is to ensure that cryptographic units (a.k.a. MAUs) are also assigned to the control domains: the memory contents are protected by SSL while being transferred over the network, and offloading those operations to the T-series crypto hardware makes a big difference.
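As a rough sketch of that tuning using the `ldm` CLI (run as root in each control domain; the counts below are just the values from my experiments and the guest name `ldg1` and host name `target-host` are placeholders, so treat this as a starting point rather than a recipe):

```shell
# Give the control domain enough vcpus to drive the parallel
# memory transfer (do this on both the source and target machines).
ldm set-vcpu 16 primary

# Assign a crypto unit (MAU) to the control domain so the SSL
# work is offloaded to the T-series hardware.
ldm set-mau 1 primary

# Dry-run the migration first (-n checks requirements without
# migrating), then migrate the active guest domain for real.
ldm migrate-domain -n ldg1 target-host
ldm migrate-domain ldg1 target-host
```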
[Update: Jeff Savit wrote a great post about his experiences using Live Migration]