My Own Private Cloud-aho - The Goals
By me on Apr 14, 2009
For quite some time I have been informally tracking the progress of the xVM Xen team, playing with code drops on perhaps a quarterly basis, and doing benchmarking to determine if things are stable enough and efficient enough to run in a production data-center.
Probably about 6 months ago, performance looked pretty decent, and I implemented a tiny proof-of-concept “cloud” using some friends as guinea pigs. It's been running great for us. In fact, I just checked my virtual server and it's been running for 245 days so far. Pretty stable indeed.
Recently I have had the opportunity to build a small cloud in one of our data-centers, expanding on the concepts from the POC into a more full-blown and mature solution, suitable for a near-production environment. It's hard to really call it production quality when so many moving parts are still pre-release.
I'd like to share what I have done with xVM Xen, in hopes that it may be useful for others who wish to do the same. But before I get into the details, let me describe some of the goals that we hope to achieve with this new virtual environment.
Replace the current virtual hosting environment
I hate to knock Solaris Containers since it's a Sun technology, but we've struggled somewhat in our usage. The problem isn't really with the technology, but rather a scheduling problem when it comes to planned and unplanned maintenance. In our environment we typically load up 4-8 application zones within a single physical host. These applications are generally maintained by separate teams. When we want to do something like patch the system, the operations team needs to schedule downtime with all the application teams for the same time so patching can occur, and we've found that the operations team can spin quite a few cycles getting all the ducks in a row.
The model with xVM Xen is quite a bit different. Since each domain, or virtual server, is more or less independent of the others, the operations team need only schedule downtime with a single team at a time. So although they are doing more scheduling and patching, overall we hope to reduce the total amount of real time they spend on it.
Simple, Rapid (Virtual) Server Creation
The operations team has things dialed in pretty well with JASS, Jumpstart, and so on, for installing physical hosts, but the creation of virtual servers isn't as streamlined. The hope is that through the use of newer technologies available with ZFS (snapshots and cloning, for one), as well as a networked storage infrastructure built on iSCSI, we can really streamline the process so that we can spin up virtual servers within minutes. The goal is to make the gating factor how fast the system administrator can type.
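To give a flavor of what that provisioning flow might look like, here is a minimal sketch. The pool and dataset names are hypothetical, and it assumes guests are backed by ZFS volumes on a storage host, using the Solaris-era `shareiscsi` property to export a zvol as an iSCSI target:

```shell
# Hypothetical dataset names; a sketch of clone-based guest provisioning.
GOLDEN=tank/guests/golden     # pre-installed "golden" OS image (a zvol)
NEW=tank/guests/web01         # backing store for the new virtual server

# Snapshot the golden image once; clones are cheap copy-on-write references.
zfs snapshot ${GOLDEN}@deploy

# Clone it for the new guest -- near-instant, regardless of image size.
zfs clone ${GOLDEN}@deploy ${NEW}

# Export the clone as an iSCSI target so a dom0 can boot the guest from it.
zfs set shareiscsi=on ${NEW}
```

Because the clone shares blocks with the snapshot, each additional guest costs almost no time or space up front; only its divergence from the golden image consumes storage.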
Faster & Error Free Deployments
Although most of the applications behind sun.com don't require huge clusters of servers, we do have a few that span perhaps a dozen or more physical hosts. The problem is that typically the process of deploying a new version of the application requires the same set of repeated steps for each instance of the application, introducing the fat-finger problem. What if, using those same ZFS and iSCSI technologies, we could install once, then clone the application for the other instances? As long as that initial install is done correctly, this can greatly reduce the possibility of errors when replicating the changes across the cluster.
Easier Horizontal Expansion
Occasionally, during launches or, even worse, DOS attacks, applications can hit their capacity, which results in reduced quality of service for everyone. In those cases, it would be great if we could instantly increase our capacity. Are there technologies that we could employ to do this easily? We think there are.
Painless Migration to Newer, Bigger, Faster Hardware
Although we've tried to employ some best practices which attempt to separate the O/S and the application on different areas of the filesystem(s), it still isn't the easiest exercise to upgrade an application to a new chunk of hardware. Essentially it becomes another case where the application team has to spend some cycles installing the service on the new hardware, test, verify, yadda, yadda, yadda.
We think that the live migration capabilities of Xen have great potential here. Since the application would be installed in a virtual server, the process of upgrading simply becomes a push of the running application from one physical host to another. And, this could even be something the operations team does all by itself, unbeknownst to the application team at all!
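The push itself is a one-liner with the classic Xen `xm` toolstack. A sketch, assuming hypothetical guest and host names, that both dom0s have relocation enabled, and that the guest's disk lives on shared (e.g. iSCSI) storage reachable from both:

```shell
# Move the running guest "web01" to another physical host without downtime.
# --live keeps the domain running while its memory is copied to newhost.
xm migrate --live web01 newhost.example.com
```

From the application team's perspective, nothing happens: connections stay up, and the guest simply finds itself on newer hardware.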
Better Hardware Utilization
A while ago I gave a talk with the xVM Xen team about what we had done. I don't think I really explained this one correctly, because their initial comment was something about Solaris Zones providing the most efficient method of squeezing every ounce out of a physical host.
That's not really what this is about. Many of the physical hosts are incredibly underutilized, perhaps peaking at somewhere near 30% sustained CPU even at the 95th percentile. With hundreds of hosts running that way, we're really just wasting power, cooling, and space when we don't need to.
We're hoping that with the virtualization capabilities that xVM Xen provides, we can make the practice of doubling or tripling up applications on a physical host more common, increasing sustained utilization to somewhere between 60% and 80% and lowering our datacenter footprint overall. When an application begins to run hot, monitoring would help us decide to move it, via Xen live migration, to a less used, larger, or private physical host.
Faster Disaster Recovery
What we are looking for here is ways to recover better from catastrophic failures. The server is on fire: how do we get the application off of it and up and running quickly on another physical host? How do we reduce the need for someone hands-on to physically fix a problem on a piece of hardware? Again, we're hoping virtualization and other technologies will be helpful here.
I probably forgot a few goals, but in general these are the bigger problems which we hope to solve with a more virtualized data-center.
In my next post, I'll describe the infrastructure we built in detail.