Thursday Apr 16, 2009

My Own Private Cloud-aho - The Details

In my previous post, I went over some of the goals we are trying to attain through a data-center architecture based on virtualization with xVM Xen. Essentially, we are looking for ways to work smarter and faster and to be more flexible. In this entry, I will go into the details of the infrastructure we built.

Network Layout

A good place to start is the general network layout. The first diagram shows what we have done. Nothing too fancy. At the heart of it are two independent physical networks. We do this to increase availability. Two load balancers are at the top of the stack, cross connected to switches on each physical network, and they talk something like VRRP or some other crazy protocol to each other to decide who is the master and who is on standby.

There are two logical networks that run throughout the infrastructure, the Public network and the Backend network. If the names aren't clear enough, the Public network is meant to serve traffic between us and our customers (via the internet). The Backend network is for various flavors of server-to-server communication within the data-center.

Each physical host, therefore, has two network connections, one to each logical network. The notable exception is the Sun Storage Unified Storage 7410 Cluster, which is connected only to the Backend network, and has connections to both physical networks. The cluster is configured in an active-passive mode, which means that only one of the two 7410 head units is doing the file serving at any one time. In order to ensure that it can still serve stuff to the Even segment even though the Odd segment is down, we need to give it a presence on each physical segment. I'll discuss more about what specifically we're doing with the Unified Storage Cluster later on.

Install Servers

Nothing really out of the ordinary here. We've set up a pair of PXE Boot/Install servers, mainly to handle installing additional Hypervisors. The Install servers themselves are running Solaris Nevada 105; I don't recall what that translates to in terms of SXCE releases. There is nothing special about that version other than that it happened to be the most recent drop we could get when we started assembling everything.

These are more or less independent of each other. Most of the installs work against the primary install server. The only real reason to switch to the other server is if the primary is down or if its network segment is down. But as you'll see, we're really not doing many installs at all.

DHCP Servers

The DHCP servers actually run on the same physical hosts as the install servers. They are using the DHCP daemons that come as part of Solaris/Nevada. They do play a key part in the whole PXE Boot process, so they have had additional customizations made to allow for that - there are a few good articles out there describing the additional macros to accomplish PXE Boot.

The one other interesting thing to note is that the pair of DHCP servers is configured as a cluster. The DHCP daemon doesn't really have a notion of being in a cluster; however, it can be configured to use a shared datastore, which each daemon can read and update. In our case, this is the first use of the Unified Storage Cluster. The DHCP servers are configured to use an NFS share on the cluster. The caveat here is that you must configure the daemons to use the SUNWfiles datasource. Neither SUNWbinfiles nor SUNWnisplus will work.

There are a couple of other things to be aware of when clustering DHCP. First, set the OWNER_IP parameter in each daemon's local dhcpsvc.conf file to a comma-separated list of the IP addresses of all the interfaces that will serve DHCP requests, across all servers. Second, set RESCAN_INTERVAL to a value that is reasonable for you; in our case we just set it to 1 minute. Both of these values can be updated with the dhcpconfig -P command.

dhcpconfig -P OWNER_IP=,,, 
dhcpconfig -P RESCAN_INTERVAL=1 
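
Building the OWNER_IP list by hand gets error-prone as interfaces are added, so here is a minimal sketch of how it could be assembled. The helper name is made up for illustration, and the addresses are documentation-range placeholders rather than our real ones. The script only prints the dhcpconfig commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# join_owner_ips: join all arguments into one comma-separated list.
# (Hypothetical helper, not part of Solaris.)
join_owner_ips() {
    list=""
    for ip in "$@"; do
        if [ -z "$list" ]; then
            list="$ip"
        else
            list="$list,$ip"
        fi
    done
    printf '%s\n' "$list"
}

# Placeholder addresses; substitute every interface that serves DHCP
# requests on every server in the cluster.
owner_ips=$(join_owner_ips 192.0.2.10 192.0.2.11 198.51.100.10 198.51.100.11)

# Dry run: print the commands rather than executing them.
echo "dhcpconfig -P OWNER_IP=$owner_ips"
echo "dhcpconfig -P RESCAN_INTERVAL=1"
```

The same list has to be applied on each DHCP server, since the setting lives in each daemon's local dhcpsvc.conf.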

The Hypervisors

First things first. The Hypervisors are all well loaded Sun Fire X4150 servers. They have been installed with Solaris/Nevada 105, again just because that was current at the time. We chose Nevada instead of OpenSolaris primarily because it offers a better unattended/headless install environment. They are configured with ZFS root on local hard drives, and the ZFS mirroring has been set up to increase uptime.

Not much in the way of customizations to Dom-0 or the Hypervisor. Obviously in Dom-0 we disable as many unnecessary services as possible. Since these are servers, we shut down all of the Desktop related services like the X server and the like.

We also limit Dom-0 memory to only 2G using the dom0_mem parameter to the hypervisor. This might be a little aggressive since ZFS is memory hungry, but we want to try to keep as much memory available for the guest domains as possible, and we haven't seen a problem with this yet.

We also set the hypervisor console to com1, in case we need to break into the console for any sort of debugging (knock on wood we don't have to do that.)

Both of these parameters are set on the kernel line in the GRUB boot entry:

kernel$ /boot/$ISADIR/xen.gz com1=9600,8n1 console=com1 dom0_mem=2G

We also use the vanity device naming capabilities of Solaris - via dladm - to give all the Public and Backend interfaces the same names. So, although today we're using Sun Fire X4150 servers which all have Intel Pro/1000 network interface controllers in them, in the future, we can move to different platforms and still maintain a consistent device naming convention. This is actually pretty crucial for Xen Live Migration to work. Xen needs the same network interfaces to be available on the source and destination of the Live Migration. Without this, the migration will not succeed.

dladm rename-link e1000g0 fe0
dladm rename-link e1000g1 be0
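
To show how the same convention could carry over to another platform, here is a small sketch that generates the rename commands from a driver name. The helper names are hypothetical, and nxge is used only as an example of a different NIC driver, not something we run today. It prints the dladm commands instead of executing them.

```shell
#!/bin/sh
# rename_cmd: print the dladm command that gives a physical link its
# vanity name. Hypothetical helper; it prints instead of executing.
rename_cmd() {
    physical=$1
    vanity=$2
    printf 'dladm rename-link %s %s\n' "$physical" "$vanity"
}

# plan_links: instance 0 of the driver becomes the Public interface,
# instance 1 the Backend interface. That ordering matches our X4150s
# and is an assumption for any other platform.
plan_links() {
    driver=$1
    rename_cmd "${driver}0" fe0   # Public network
    rename_cmd "${driver}1" be0   # Backend network
}

plan_links e1000g
plan_links nxge
```

Whatever the underlying driver, the guests and the migration logic only ever see fe0 and be0.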

Finally, while we're talking about Live Migration, it's something we need to enable in xend. A couple of simple SMF changes handle that.

svccfg -s xend setprop config/xend-relocation-address =
svccfg -s xend setprop config/xend-relocation-hosts-allow = astring: \\

For what it's worth, we currently have 22 Hypervisors in the cloud, so clearly not a huge deployment yet.

Unified Storage Cluster

These units sort of fell in our lap at just the right time. Although we could have approached the level of availability they offer with some solution built on top of Solaris & SunCluster, the ease of installation, configuration and maintenance that the Unified Storage Cluster affords should help keep the infrastructure simple and straightforward to maintain.

As I mentioned earlier, these units are configured in an active-passive cluster mode, and have a presence on each physical network segment. Configuring them as a cluster is amazingly simple, as any appliance should be. Once the CLUSTRON interface is connected between them, the initial boot to configure the first head node automatically detects the presence of the second head node, and prompts you to configure them as a cluster.

As far as how we are using them, well, they are used in a few capacities. First, as I mentioned earlier, they are used to house the DHCP servers' shared datastore. We also use them to house various administrative tools and bits.

But that's not the main use. Primarily we are using them as the virtual disk in which all the guest operating systems (virtual servers) are installed. This is done by creating LUNs in the Unified Storage and exposing them as iSCSI targets, which are then attached to the Dom-0's and made available to the Hypervisors.

This is another critical piece that makes some of the cool features of xVM Xen, like Live Migration, possible. Live Migration is the process of moving a running virtual server from one physical host to another. Did you read that, a running virtual server! For this to happen, the virtual disk must be available on the target physical host. Using iSCSI makes this a snap since all you need to do is attach the LUN to the target physical host and you are done. If you think of the alternative with local storage, you would have to somehow transfer the bits from the local storage on the source physical host to the target, while the virtual server is still running, which among other things is nearly impossible to do in a real-time manner.
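
For reference, a live migration is kicked off from the source Dom-0 with xm migrate. A minimal sketch follows; the guest and host names are invented, and the wrapper is hypothetical and only prints the command so it is safe to try.

```shell
#!/bin/sh
# migrate_cmd: print the command for a live migration of one guest.
# Hypothetical helper; prints rather than executes.
migrate_cmd() {
    domain=$1
    source_host=$2
    target_host=$3
    if [ "$source_host" = "$target_host" ]; then
        echo "source and target are the same host" >&2
        return 1
    fi
    # The guest's iSCSI LUN must be reachable from the target host,
    # and the target must accept relocation requests from the source.
    printf 'xm migrate --live %s %s\n' "$domain" "$target_host"
}

# guest-1, hv-01 and hv-02 are placeholder names.
migrate_cmd guest-1 hv-01 hv-02
```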

Ok, so one other use of the Unified Storage is for shared storage for applications running inside the virtual servers. Building horizontally scalable, redundant applications sometimes requires the use of storage that can be accessed by all nodes in the application cluster. We provide this on the Unified Storage cluster with NFS.

Putting it All Together

That's a summary of all the pieces. Now, how does it all fit together? Here's a diagram that shows all the interactions.

Let's look at the organization on the Unified Storage Cluster first. We use different projects within the Unified Storage to keep things manageable. The first thing we've done is build a set of master images of various guest operating systems like OpenSolaris - 08.11 & Dev, Nevada, and so forth. Those images are kept in a Masters project. These images are pre-installed instances of guest operating systems that reside on (iSCSI) LUNs in the Unified Storage. A snapshot of the image is taken after any revision we make to the image (note that snapshots should only be taken when the guest has been shut down).

When we're ready to spin up a new virtual machine, we select the current snapshot of the O/S image we want, and clone it as a new LUN in a separate Unified Storage project. The great thing here is that the clones initially take up zero additional space, and only start to use their own space when something changes on the virtual disk, for example when the operating system writes to it or an application is installed. Quite a savings in disk space. We have some scripts that interact with the Unified Storage to do this.

These commands will create a new project called appl-1, and then clone a master image to the project as vm-1.

domu-project appl-1
domu-clone masters/osol-0811@version-01 appl-1/vm-1
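
The domu-* scripts talk to the Unified Storage appliance and their internals aren't shown here. To illustrate the underlying snapshot-and-clone idea, here is a rough sketch of the equivalent steps on a plain ZFS host. This is not the appliance's interface; the pool name tank and the helper name are assumptions, and the script only prints the zfs commands.

```shell
#!/bin/sh
# clone_cmds: print the zfs commands that would clone a master snapshot
# into a project. Arguments follow the shape of the domu-clone example:
#   $1 = <project>/<image>@<snapshot>   $2 = <project>/<guest>
# "tank" is a hypothetical pool name.
clone_cmds() {
    pool=tank
    source_snap=$1
    dest=$2
    case $source_snap in
        *@*) ;;
        *)
            echo "source must name a snapshot (image@version)" >&2
            return 1
            ;;
    esac
    # Everything before the first slash is the project.
    project=${dest%%/*}
    printf 'zfs create -p %s/%s\n' "$pool" "$project"
    printf 'zfs clone %s/%s %s/%s\n' "$pool" "$source_snap" "$pool" "$dest"
}

clone_cmds masters/osol-0811@version-01 appl-1/vm-1
```

Because a clone shares blocks with the snapshot it came from, the new volume costs almost nothing until the guest starts writing to it.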

After the image is cloned, we instruct the Unified Storage to export the cloned LUN as an iSCSI target. The target is then attached to a Dom-0 and Hypervisor. Currently we've been putting about 4-6 guests on each Hypervisor before moving onto a new one for additional virtual servers.

From here, the usual Xen commands are used to define a domain, that is, the xm create or virsh create commands. As part of the domain configuration, we specify the attached iSCSI LUN as the virtual disk for the domain. Again, this is simplified through scripting by using a pre-defined XML template for domain creation.

This example shows the creation of a paravirtualized guest - pv, which has both a Public and Backend interface - fe-be, on the Odd network segment - odd, with 1G of memory - 1024, and 1 CPU - 1.

domu-init pv fe-be odd 1024 1 appl-1/vm-1

I should point out that we're not really pre-attaching the iSCSI LUN to Dom-0, but instead using one of the enhancements of xVM Xen which will do this for us. We simply specify the iSCSI Qualified Name (IQN) and the IP address of the target as the virtual disk in the domain configuration, and let xVM Xen deal with attaching it when the domain starts up.

Now, we're almost there, but we need to assign IP addresses to the virtual server. We use DHCP for this. More specifically we bind specific IP addresses to specific MAC addresses to ensure that a virtual server always uses its assigned address(es). Xen creates a sort of synthetic MAC address for each interface it configures and persists the address. We can grab these addresses before the domain starts to update DHCP.

This is a good time to point out that all the master images of guest operating systems have been configured as DHCP clients. This really simplifies the entire process, since there is no need to do any post-configuration of the cloned images to give them the correct network resources. With DHCP, it just happens. You can also see why it is critical to have a highly available DHCP cluster, since so much relies on it. Once again, this is scripted.

This example shows the assignment of a Public and Backend interface to the new guest:

domu-assign-ip appl-1:vm-1 fe0 vm-host1
domu-assign-ip appl-1:vm-1 be0 vm-host1-be
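
Under the covers, binding an address to a guest comes down to registering a DHCP client ID derived from the interface's MAC address. Here is a minimal sketch of that conversion. The helper name, the MAC and the IP addresses are invented for illustration, and the pntadm options in the final line are an assumption to verify against the pntadm(1M) man page before use.

```shell
#!/bin/sh
# mac_to_client_id: turn an Ethernet MAC address into a DHCP client ID,
# which is "01" (the Ethernet hardware type) followed by the six octets
# as upper-case hex. Solaris often prints MACs without leading zeros
# (0:16:3e:...), so each octet is padded to two digits.
mac_to_client_id() {
    old_ifs=$IFS
    IFS=:
    set -- $1          # split the MAC on colons into $1..$6
    IFS=$old_ifs
    id=01
    for octet in "$@"; do
        id="$id$(printf '%02X' "0x$octet")"
    done
    printf '%s\n' "$id"
}

# Placeholder MAC and addresses, not taken from our environment.
client_id=$(mac_to_client_id 0:16:3e:1a:2b:3c)

# Dry run: print the table update rather than executing it.
echo "pntadm -A 192.0.2.50 -i $client_id -f PERMANENT 192.0.2.0"
```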

And that's it. At this point, we're ready to fire up the guest.

virsh start appl-1:vm-1
virsh console appl-1:vm-1

Wrap Up

That's about it for the basic details of what we've done. It's all been working incredibly well, especially considering we're running about ¾ development code everywhere.

Before I forget, I would really like to thank the xVM Xen team for their support while we've been setting things up and tinkering around. They have all been very helpful and responsive to my questions on as well as private threads. Mark Johnson deserves a special mention since I glommed onto him the most.

Up next, a summary of how well we're doing on the previously outlined goals.

Tuesday Apr 14, 2009

My Own Private Cloud-aho - The Goals

For quite some time I have been informally tracking the progress of the xVM Xen team, playing with code drops on perhaps a quarterly basis, and doing benchmarking to determine if things are stable enough and efficient enough to run in a production data-center.

Probably about 6 months ago, performance looked pretty decent, and I implemented a tiny proof-of-concept “cloud” using some friends as guinea pigs. It's been running great for us. In fact, I just checked my virtual server and it's been running for 245 days so far. Pretty stable indeed.

Recently I have had the opportunity to build a small cloud in one of our data-centers, expanding on the concepts from the POC into a more full-blown and mature solution, suitable for a near-production environment. It's hard to really say it's production quality when so many moving parts are still pre-release.

I'd like to share what I have done with xVM Xen, in hopes that it may be useful for others who wish to do the same. But before I get into the details, let me describe some of the goals that we hope to achieve with this new virtual environment.

The Goals

  • Replace the current virtual hosting environment

    I hate to knock Solaris Containers since it's a Sun technology, but we've struggled somewhat in our usage. The problem isn't really with the technology, but rather a scheduling problem when it comes to planned and unplanned maintenance. In our environment we typically load up 4-8 application zones within a single physical host. These applications are generally maintained by separate teams. When we want to do something like patch the system, the operations team needs to schedule downtime with all the application teams for the same time so patching can occur, and we've found that the operations team can spin quite a few cycles lining all the ducks in a row.

    The model with xVM Xen is quite a bit different. Since each domain, or virtual server, is more or less independent of the others, the operations team need only schedule downtime with a single team at a time. So although they are doing more scheduling and patching, we hope that overall we can reduce the total amount of real time they spend on it.

  • Simple, Rapid (Virtual) Server Creation

    The operations team has things dialed in pretty well with JASS, Jumpstart, and so on, for installing physical hosts, but the creation of virtual servers isn't as streamlined. The hope is that through the use of newer technologies available with ZFS (snapshots and cloning, for one) as well as a network storage infrastructure and iSCSI, we can really streamline the process so that we can spin up virtual servers within minutes. The goal is to make the gating factor be how fast the system administrator can type.

  • Faster & Error Free Deployments

    Although most of the applications behind don't require huge clusters of servers, we do have a few that span perhaps a dozen or more physical hosts. The problem is that the process of deploying a new version of the application typically requires the same set of repeated steps for each instance of the application, introducing the fat-finger problem. What if, using those same ZFS and iSCSI technologies, we could install once, then clone the application for the other instances? As long as that initial install is done correctly, it can greatly reduce the possibility of errors when replicating the changes across the cluster.

  • Easier Horizontal Expansion

    Occasionally, during launches or, even worse, DoS attacks, applications can hit their capacity, which results in reduced quality of service for everyone. In those cases, it would be great if we could instantly increase our capacity. Are there technologies that we could employ to do this easily? We think there are.

  • Painless Migration to Newer, Bigger, Faster Hardware

    Although we've tried to employ some best practices which attempt to separate the O/S and the application on different areas of the filesystem(s), it still isn't the easiest exercise to upgrade an application to a new chunk of hardware. Essentially it becomes another case where the application team has to spend some cycles installing the service on the new hardware, test, verify, yadda, yadda, yadda.

    We think that the live migration capabilities of Xen have great potential here. Since the application would be installed in a virtual server, the process of upgrading simply becomes a push of the running application from one physical host to another. And, this could even be something the operations team does all by itself, unbeknownst to the application team at all!

  • Better Hardware Utilization

    A while ago I gave a talk with the xVM Xen team about what we had done. I don't think I really explained this one correctly, because their initial comment was something about Solaris Zones providing the most efficient method of squeezing every ounce out of a physical host.

    That's not really what this is about. Many of the physical hosts are really incredibly underutilized, perhaps peaking out at somewhere near 30% of sustained CPU used even at the 95th percentile. With hundreds of hosts running that way, we're really just wasting power, cooling and space when we don't need to.

    We're hoping that with the virtualization capabilities that xVM Xen provides, we can make the practice of doubling or tripling up applications on a physical host more common, increasing the sustained utilization to somewhere between 60% and 80% and lowering our datacenter footprint overall. Where an application begins to run hotter, monitoring would help us decide to move it, via Xen live migration, to a less used, larger, or private physical host.

  • More Resiliency

    What we are looking for here is ways to recover better from catastrophic failures. The server is on fire; how do we get the application off of it and up and running quickly on another physical host? How do we reduce the need for someone to be hands-on to physically fix a problem on a piece of hardware? Again, we're hoping virtualization and other technologies will be helpful here.

I probably forgot a few goals but in general these are the bigger problems which we hope to solve with a more virtualized data-center.

In my next post, I'll describe the infrastructure we built in detail.



