Solaris and SPARC virtualization management features of Oracle Enterprise Manager Ops Center including "Live Migration"
By LeonShaner on Feb 07, 2012
(Short attention-span? Skip to the video. Don't forget to come back. *smile*)
Acceleration of Technology Adoption
Some of the most interesting and rewarding periods in IT Service Management are when it is finally time to turn the corner and begin to adopt and deploy new technologies. The greater the business need and the more polished the technology is in addressing requirements, the faster the relative rate of adoption. When technologies advance in a structured way, we begin to see accelerated rates of adoption that are relative to a company's operational maturity level.
Virtualization is not a new concept; however, there have been numerous advances in recent years that are helping businesses to be more effective at managing their virtualized environments. The easier it is to manage assets reliably, with reduced risk of downtime, the better the ability to focus on optimizing asset utilization in balance with required Service Level Objectives.
Key Advancements in Solaris Virtualization
Since the first release of Solaris 10 in 2005, Solaris Container technology has grown incredibly mature and is the most widely used method of virtualizing Solaris, largely because it works on any machine that runs Solaris 10 or later. In 2006, Logical Domains (LDoms) were introduced and have evolved together with corresponding enhancements in SPARC servers built around Chip Multi-Threading (CMT) technology, the latest of which are the SPARC T4 family of servers.
The first significant increase in Logical Domains adoption began in 2008 with the introduction of servers based on the UltraSPARC T2 Plus chip-set, with an even greater rise in popularity, owing to advancements brought by the SPARC T3 and SPARC T4 server families. The technology we have known as Logical Domains is now called Oracle VM Server for SPARC.
There are several factors contributing to the rise in popularity of Oracle VM Server for SPARC (formerly Logical Domains). Some of the key factors include the following break-through evolutions, which have been designed to work symbiotically together:
- Underlying SPARC Chip Multi-Threading technology, especially SPARC T4 -- far faster and more capable than any previous SPARC CMT-based servers.
- Oracle VM Server for SPARC advancements toward (near) Live Migration. For the majority of workloads this represents an "imperceptible" transition of the LDom Guest from one physical host to another.
- Oracle Enterprise Manager Ops Center advancements toward seamless, end-to-end lifecycle management for LDom Host and Guests -- starting from bare-metal up through Server Pools as collections of virtualized assets, storage, and networks.
Enterprise Manager Ops Center and the Psychology of Technology Adoption
One of the more interesting interplays between Solaris Containers and Oracle VM Server for SPARC (Logical Domains) has come to light while working with customers who have been using Containers for years and have more or less not even looked at LDoms. If these customers have looked at LDoms at all, it was probably several years ago, without the benefit of the newer chip-sets, the newer features of the LDom Hypervisor, or user-friendly tools to manage it all together, easily and reliably.
Very often the conversation starts with OEM Ops Center helping customers get past challenges in patching systems with very large numbers of Solaris Containers. From there the interest moves toward how to migrate their Containers off of sub-optimal storage configurations that impede parallel patching, and onto newer technologies based on ZFS with Live Upgrade, etc. From there, the capability evolves toward "rolling upgrades" where new Global Zones are built with the latest patches, and rather than patching old hosts, only the non-global Zones themselves are upgraded via Zone Migration from the old Global Zone to the newly built OS, and the old Global Zone is simply re-provisioned with the new OS image.
Once an IT organization is able to realize that kind of capability through OEM Ops Center, they begin to rely on that very straightforward and easy-to-use approach. However, these customers still have the business problem of how to actually leverage the technology in the context of Service Level Objectives -- how to minimize the downtime associated with Zone Migration with Upgrade on Attach. That approach, as advanced as it is, is nearing the limit of how far that downtime can be reduced.
Fortunately, somewhere in the middle of making Zone Migration with Upgrade on Attach as good as it can possibly be, we are also planting ideas about how to solve the overarching business need of keeping IT services running smoothly, while optimally available + utilized + reliable. It's the intersection of optimally available + utilized that drives the need for an effective virtualization strategy and it is the intersection of available + reliable that drives the need for an unobtrusive patching strategy.
Each of those spokes of the combined business need finds its hub around OEM Ops Center; however, the underlying key toward availability with reliability is the wheel that glides on Oracle VM Server for SPARC (Logical Domains). The reason is that the latest LDom technology is where Live Migration becomes possible (or at least "virtually indistinguishable from live," for most workloads).
Migrating LDom Guests around, instead of Zones, has the potential to virtually eliminate the downtime associated with maintaining and re-balancing utilization across a collection of virtualized servers (guests). The general "rolling upgrade" paradigm can be achieved for each OS instance, but the mechanics of it are based on migrating an LDom Guest with a single non-Global Zone to an already upgraded LDom Host with practically zero downtime. When it is time to upgrade the Global Zone within the LDom Guest, there is only a very short amount of downtime. The reduction in downtime is possible because of Live Upgrade, but also because LDom reboots are fast, and with just the one Zone there is no protracted upgrade on attach.
The psychology of getting to that conversation is about business stakeholders having already experienced the pain around getting from point A to point Y, while not yet getting to their point Z -- that being "running smoothly, while optimally available + utilized + reliable" to be enabled through Live Migration and Live Upgrade. If the stakeholders are at point Y, then they are already looking for ways to cut out the downtime associated with rolling upgrades for Solaris Containers. When we show them how to get to point Z with OEM Ops Center, they often ask why we didn't tell them about it sooner. ( *smile* )
Building Blocks of Optimally Available + Reliable + Utilized
Any number of virtualization technologies and components may be okay for certain development and non-production activities; however, the requirements for production Service Level Objectives are bound to narrow the choices. A need to achieve an optimally available, well-utilized, and reliable virtualized environment that stakeholders can "trust in production," puts us on the path of addressing the overall manageability requirements. That means manageability as applied across hardware, software, tools, and with lowered complexity and maximum ease-of-use.
With overall manageability in mind, now is the perfect time for businesses and IT stakeholders to catch up to the present for their Solaris and SPARC assets. The first step involves seriously evaluating the most feature-complete and reliable combination of technologies available to date as the "to be" environment for virtualization-appropriate workloads, then looking past the "but we've always done it this other way" and focusing on the business drivers for making it happen.
LDom Migration capabilities for Solaris and SPARC have never been more effective than they are on the newer SPARC T-Series servers, like SPARC T3 and especially the SPARC T4, together with Oracle VM Server for SPARC 2.1. All of those technologies are brought together by Ops Center, to enable an effective end-to-end solution for managing virtualization in the Solaris and SPARC world.
Live Migration in Action
Really, no amount of explaining how Ops Center enables more effective virtualization can convince business and IT stakeholders as effectively as showing it to them -- especially when they've been told already that it has been tried and it doesn't work, or it's too slow, or too complex. Hence the need to catch up to the present, which is why I thought it would be interesting to show it to you, now. ( *smile* )
Why not make things interesting and show Solaris and SPARC Virtualization working exceedingly well, in what are decidedly "sub-optimal" conditions? This is where the "near" Live Migration caveat comes in. It is possible to get really close to live, even with a sub-optimal configuration; however, the better the config and the faster the host hardware involved, the more LDom Migration can approach Live Migration (as far as anyone/anything would notice).
In practice the difference between the perception of "can't tell it isn't Live" vs. "nearly Live" vs. "it's sorta Live" will ultimately lead to Tier-1, Tier-2, Tier N, Service Levels which today could map to SPARC T4, vs. SPARC T3, vs. SPARC T2+ (a la SPARC Enterprise T5xx0 Servers). Such tiers would have corresponding classes of network and storage tiers in terms of bandwidth, spindles, etc.
So, to define "sub-optimal" I am going to show LDom Migration working very well on the following configuration:
- 2 x SPARC Enterprise T5140's @ 1.165 GHz
- Migration Network/Interconnect Speed: 1 GbE
- Migration Guest Storage: 2 Gbit FC
Without further ado, the following video shows LDom Guest migration in the above configuration:
In this configuration, it is pretty typical to see ~5 milliseconds, or less, wall-clock guest time "slippage," during the LDom guest migration. To put it into perspective, network response and drive seek times are typically measured in the 10s of milliseconds and very few things even notice a 5 millisecond "pause" in operations. If your app needs faster than that, then go with the Tier-1 config, not this sub-optimal config. ( *smile* )
Ultimately, the ~5 ms of time "slippage" shown will depend on many factors, but in this case most of that time can be attributed to the extremely slow processor speed on the T5140's that were used in the test. The wall-clock time for the memory synchronization and actual migration step was 44 seconds. Had there been more memory allocated to the LDom Guest, then the 1 GbE network speed would have increased the wall-clock time for the migration step; HOWEVER, that would not have had a noticeable impact on the passage of time from the LDom Guest perspective.
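To make the memory-vs-network point a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. It is only arithmetic, not anything Ops Center or the LDom Hypervisor actually runs: it computes the theoretical minimum time to push a guest's memory across the migration link once. The function name and the guest memory sizes are hypothetical (I have not stated how much memory the guest in the video had), and it deliberately ignores protocol overhead, compression, and any re-sending of pages that change during the copy, so real wall-clock times will differ.

```python
def min_transfer_seconds(guest_memory_gib: float, link_gbit_per_s: float) -> float:
    """Lower bound on the time to copy a guest's memory once over the link.

    Assumptions (not from any Oracle tool): memory is sent exactly once,
    uncompressed, at the full nominal line rate of the migration network.
    """
    memory_bits = guest_memory_gib * 1024**3 * 8   # GiB -> bits
    link_bits_per_s = link_gbit_per_s * 1e9        # Gbit/s -> bit/s
    return memory_bits / link_bits_per_s


if __name__ == "__main__":
    # Hypothetical guest sizes, compared over 1 GbE and 10 GbE links.
    for gib in (2, 4, 8, 16):
        one_gbe = min_transfer_seconds(gib, 1.0)
        ten_gbe = min_transfer_seconds(gib, 10.0)
        print(f"{gib:>2} GiB guest: >= {one_gbe:6.1f} s on 1 GbE, >= {ten_gbe:5.1f} s on 10 GbE")
```

The takeaway matches what I said above: the floor on the migration step grows linearly with guest memory and shrinks with a faster interconnect, but this is the background copy time, not the brief "slippage" the guest itself experiences.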
I promise to revisit this topic with a SPARC T4 configuration, periodically, when there are advancements in Server Firmware and LDom Hypervisor code.
Please let me know what you think. Until next time...
Leon Shaner | Senior IT/Product Architect
Systems Management | Ops Center Engineering @ Oracle
The views expressed on this [blog; Web site] are my own and do not necessarily reflect the views of Oracle.