My Own Private Cloud-aho - The Details

In my previous post, I went over some of the goals we are trying to attain through a data-center architecture based on virtualization with xVM Xen: essentially, looking for ways to work smarter and faster and be more flexible. In this entry, I will attempt to go into the details of the infrastructure we built.

Network Layout

A good place to start is the general network layout. The first diagram shows what we have done. Nothing too fancy. At the heart of it are two independent physical networks, which we run to increase availability. Two load balancers sit at the top of the stack, cross-connected to switches on each physical network, and they talk something like VRRP or some other crazy protocol to each other to decide who is the master and who is on standby.

There are two logical networks that run throughout the infrastructure, the Public network and the Backend network. If the names aren't clear enough, the Public network is meant to serve traffic between us and our customers (via the internet). The Backend network is for various flavors of server-to-server communication within the data-center.

Each physical host, therefore, has two network connections, one to each logical network. The notable exception is the Sun Storage Unified Storage 7410 Cluster, which is connected only to the Backend network but has connections to both physical networks (the Odd and Even segments). The cluster is configured in an active-passive mode, which means that only one of the two 7410 head units is doing the file serving at any one time. To ensure that it can still serve the Even segment when the Odd segment is down, we need to give it a presence on each physical segment. I'll discuss more about what specifically we're doing with the Unified Storage Cluster later on.

Install Servers

Nothing really out of the ordinary here. We've set up a pair of PXE Boot/Install servers, mainly to handle installing additional Hypervisors. The Install servers themselves are running Solaris Nevada 105; I don't recall what that translates to in terms of SXCE releases. There's nothing special about that version other than that it happened to be the most recent drop we could get when we started assembling everything.

The two install servers are more or less independent of each other. Most of the installs work against the primary install server. The only real reason to switch to the other server is if the primary is down or if its network segment is down. But as you'll see, we're really not doing many installs at all.

DHCP Servers

The DHCP servers actually run on the same physical hosts as the install servers. They are using the DHCP daemons that come as part of Solaris/Nevada. They do play a key part in the whole PXE Boot process, so they have had additional customizations made to allow for that. There are a few good articles out there describing the additional macros needed to accomplish PXE Boot.

The one other interesting thing to note is that the pair of DHCP servers is configured as a cluster. The DHCP daemon doesn't really have a notion of being in a cluster; however, it can be configured to use a shared datastore, which each daemon can read and update. In our case, this is the first use of the Unified Storage Cluster. The DHCP servers are configured to use an NFS share on the cluster. The caveat here is that you must configure the daemons to use the SUNWfiles datastore. Neither SUNWbinfiles nor SUNWnisplus will work.

A couple of other things to be aware of when clustering DHCP: make sure you set the OWNER_IP parameter in each daemon's local dhcpsvc.conf file to a comma-separated list of the IP addresses of all the interfaces that will serve DHCP requests, across all servers. Also make sure you set RESCAN_INTERVAL to a value that is reasonable for you; in our case we just set it to 1 minute. Both of these values can be updated with the dhcpconfig -P command.

dhcpconfig -P OWNER_IP=,,, 
dhcpconfig -P RESCAN_INTERVAL=1 
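
As a rough sketch of how those two values could be assembled, the Python below builds the comma-separated OWNER_IP list from every DHCP-serving interface on every server and prints the matching dhcpconfig -P commands. The host names and the 192.0.2.x addresses are made-up placeholders for illustration, not our real configuration.

```python
# Hypothetical inventory: every interface address that answers DHCP requests,
# grouped by server. Names and addresses are placeholders, not real values.
DHCP_SERVERS = {
    "install-1": ["192.0.2.10", "192.0.2.74"],
    "install-2": ["192.0.2.11", "192.0.2.75"],
}


def owner_ip_value(servers):
    """Join the serving addresses of all servers into one comma-separated list."""
    addresses = []
    for host in sorted(servers):
        addresses.extend(servers[host])
    return ",".join(addresses)


def dhcpconfig_commands(servers, rescan_minutes=1):
    """Return the dhcpconfig -P invocations to run on each DHCP server."""
    return [
        "dhcpconfig -P OWNER_IP=" + owner_ip_value(servers),
        "dhcpconfig -P RESCAN_INTERVAL=" + str(rescan_minutes),
    ]


for command in dhcpconfig_commands(DHCP_SERVERS):
    print(command)
```

The same list goes on every server in the cluster, which is why it is generated once from a single inventory rather than typed by hand on each host.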

The Hypervisors

First things first. The Hypervisors are all well-loaded Sun Fire X4150 servers. They have been installed with Solaris/Nevada 105, again just because that was current at the time. We chose Nevada instead of OpenSolaris primarily because it offers a better unattended/headless install environment. They are configured with ZFS root on local hard drives, mirrored with ZFS to increase uptime.

There's not much in the way of customizations to Dom-0 or the Hypervisor. Obviously, in Dom-0 we disable as many unnecessary services as possible. Since these are servers, we shut down all of the Desktop-related services, such as the X server.

We also limit Dom-0 memory to only 2G using the dom0_mem parameter to the hypervisor. This might be a little aggressive since ZFS is memory hungry, but we want to try to keep as much memory available for the guest domains as possible, and we haven't seen a problem with this yet.

We also set the hypervisor console to com1, in case we need to break into the console for any sort of debugging (knock on wood we don't have to do that.)

Both of these parameters are set from the GRUB boot commands:

kernel$ /boot/$ISADIR/xen.gz com1=9600,8n1 console=com1 dom0_mem=2G
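
To get a feel for what the 2G Dom-0 cap leaves for guests, here is a back-of-the-envelope sketch in Python. The 16 GB host size is an assumption picked purely for the example (it is not a figure for our X4150s), and the calculation ignores the hypervisor's own overhead, so treat the result as an upper bound.

```python
def guest_capacity(host_ram_gb, guest_ram_gb, dom0_ram_gb=2):
    """How many equally sized guests fit once Dom-0's share is set aside.

    Ignores the hypervisor's own memory overhead, so this is an upper bound.
    """
    if guest_ram_gb <= 0:
        raise ValueError("guest_ram_gb must be positive")
    available = host_ram_gb - dom0_ram_gb
    return max(0, int(available // guest_ram_gb))


# Assumed 16 GB host, purely for illustration.
print(guest_capacity(16, 2))  # guests of 2 GB each
print(guest_capacity(16, 1))  # guests of 1 GB each
```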

We also use the vanity device naming capabilities of Solaris - via dladm - to give all the Public and Backend interfaces the same names. So, although today we're using Sun Fire X4150 servers which all have Intel Pro/1000 network interface controllers in them, in the future, we can move to different platforms and still maintain a consistent device naming convention. This is actually pretty crucial for Xen Live Migration to work. Xen needs the same network interfaces to be available on the source and destination of the Live Migration. Without this, the migration will not succeed.

dladm rename-link e1000g0 fe0
dladm rename-link e1000g1 be0
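
The idea can be sketched in Python as a per-platform table that maps whatever the driver calls the links onto the same vanity names. The e1000g entries mirror the commands above; the "other-platform" entry and its nxge link names are hypothetical, standing in for some future hardware.

```python
# Driver-assigned link name -> vanity name, per hardware platform.
# The "other-platform" entry is a hypothetical example of future hardware.
LINK_NAMES = {
    "x4150": {"e1000g0": "fe0", "e1000g1": "be0"},
    "other-platform": {"nxge0": "fe0", "nxge1": "be0"},
}


def rename_commands(platform):
    """Return the dladm rename-link commands for one platform."""
    links = LINK_NAMES[platform]
    return ["dladm rename-link %s %s" % (old, links[old]) for old in sorted(links)]


def migration_compatible(platform_a, platform_b):
    """True when both platforms end up exposing the same vanity link names."""
    names_a = set(LINK_NAMES[platform_a].values())
    names_b = set(LINK_NAMES[platform_b].values())
    return names_a == names_b


for command in rename_commands("x4150"):
    print(command)
print(migration_compatible("x4150", "other-platform"))
```

Whatever the underlying driver is called, a guest configured against fe0 and be0 finds the same names on any host it lands on.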

Finally, while we're talking about Live Migration, it's something we need to enable in xend. A couple of simple SMF changes handle that.

svccfg -s xend setprop config/xend-relocation-address =
svccfg -s xend setprop config/xend-relocation-hosts-allow = astring: \\

For what it's worth, we currently have 22 Hypervisors in the cloud, so clearly not a huge deployment yet.

Unified Storage Cluster

These units sort of fell in our lap at just the right time. Although we could have approached the level of availability they offer with some solution built on top of Solaris & SunCluster, the ease of installation, configuration and maintenance that the Unified Storage Cluster affords should help keep the infrastructure simple and straightforward to maintain.

As I mentioned earlier, these units are configured in an active-passive cluster mode, and have a presence on each physical network segment. Configuring them as a cluster is amazingly simple, as any appliance should be. Once the CLUSTRON interface is connected between them, the initial boot to configure the first head node automatically detects the presence of the second head node, and prompts you to configure them as a cluster.

As far as how we are using them, well, they are used in a few capacities. First, as I mentioned earlier, they are used to house the DHCP servers' shared datastore. We also use them to house various administrative tools and bits.

But that's not the main use. Primarily we are using them to hold the virtual disks on which all the guest operating systems (virtual servers) are installed. This is done by creating LUNs in the Unified Storage and exposing them as iSCSI targets, which are then attached to the Dom-0's and made available to the Hypervisors.

This is another critical piece that makes some of the cool features of xVM Xen, like Live Migration, possible. Live Migration is the process of moving a running virtual server from one physical host to another. Did you read that? A running virtual server! For this to happen, the virtual disk must be available on the target physical host. Using iSCSI makes this a snap, since all you need to do is attach the LUN to the target physical host and you are done. With local storage, the alternative would be to somehow transfer the bits from the local storage on the source physical host to the target while the virtual server is still running, which among other things is nearly impossible to do in real time.

Ok, so one other use of the Unified Storage is for shared storage for applications running inside the virtual servers. Building horizontally scalable, redundant applications sometimes requires the use of storage that can be accessed by all nodes in the application cluster. We provide this on the Unified Storage cluster with NFS.

Putting it All Together

That's a summary of all the pieces. Now, how does it all fit together? Here's a diagram that shows all the interactions.

Let's look at the organization on the Unified Storage Cluster first. We use different projects within the Unified Storage to keep things manageable. The first thing we've done is build a set of master images of various guest operating systems like OpenSolaris (08.11 & Dev), Nevada, and so forth. Those images are kept in a Masters project. These images are pre-installed instances of guest operating systems that reside on (iSCSI) LUNs in the Unified Storage. A snapshot of the image is taken after any revision we make to the image (note that snapshots should only be taken when the guest has been shut down).

When we're ready to spin up a new virtual machine, we select the current snapshot of the O/S image we want, and clone it as a new LUN in a separate Unified Storage project. The great thing here is that the clones initially take up zero additional space, and only start to use their own space when the operating system changes anything on the virtual disk, an application is installed, and so on. That's quite a savings in disk space. We have some scripts that interact with the Unified Storage to do this.

These commands will create a new project called appl-1, and then clone a master image to the project as vm-1.

domu-project appl-1
domu-clone masters/osol-0811@version-01 appl-1/vm-1
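
To illustrate why a fresh clone costs nothing until the guest writes to it, here is a toy copy-on-write model in Python. It is only a sketch of the general idea, with made-up class names and block contents; it says nothing about how the Unified Storage implements snapshots and clones internally.

```python
class Snapshot:
    """Read-only view of a master image's blocks at a point in time."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block number -> data


class Clone:
    """Writable clone that shares every unmodified block with its snapshot."""

    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.own_blocks = {}  # only blocks written since the clone was made

    def read(self, number):
        # Prefer the clone's private copy, fall back to the shared snapshot.
        if number in self.own_blocks:
            return self.own_blocks[number]
        return self.snapshot.blocks.get(number)

    def write(self, number, data):
        # The snapshot is never touched; the clone keeps its own copy.
        self.own_blocks[number] = data

    def space_used(self):
        """Number of blocks this clone occupies beyond the shared snapshot."""
        return len(self.own_blocks)


master = Snapshot({0: "boot", 1: "kernel", 2: "etc"})
vm1 = Clone(master)
print(vm1.space_used())   # nothing written yet
vm1.write(2, "etc-modified")
print(vm1.space_used())   # one private block
print(master.blocks[2])   # the master image is unchanged
```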

After the image is cloned, we instruct the Unified Storage to export the cloned LUN as an iSCSI target. The target is then attached to a Dom-0 and Hypervisor. Currently we've been putting about 4-6 guests on each Hypervisor before moving onto a new one for additional virtual servers.
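
That fill-one-host-then-move-on approach can be sketched in a few lines of Python. The host and guest names are invented for the example, and the limit of 6 is simply the top of the 4-6 range mentioned above; we do not actually run a placement script like this.

```python
def place_guests(guests, hypervisors, per_host_limit=6):
    """Fill each hypervisor up to per_host_limit before moving to the next."""
    if per_host_limit < 1:
        raise ValueError("per_host_limit must be at least 1")
    if len(guests) > per_host_limit * len(hypervisors):
        raise ValueError("not enough hypervisors for that many guests")
    placement = {}
    for index, guest in enumerate(guests):
        host = hypervisors[index // per_host_limit]
        placement.setdefault(host, []).append(guest)
    return placement


# Hypothetical names: 22 hypervisors and 13 guests to place.
hypervisors = ["hv-%02d" % n for n in range(1, 23)]
guests = ["vm-%d" % n for n in range(1, 14)]
layout = place_guests(guests, hypervisors)
for host in sorted(layout):
    print(host, len(layout[host]))
```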

From here, the usual Xen commands are used to define a domain, that is, the xm create or virsh create commands. As part of the domain configuration, we specify the attached iSCSI LUN as the virtual disk for the domain. Again, this is simplified through scripting by using a pre-defined XML template for domain creation.

This example shows the creation of a paravirtualized guest - pv, which has both a Public and Backend interface - fe-be, on the Odd network segment - odd, with 1G of memory - 1024, and 1 CPU - 1.

domu-init pv fe-be odd 1024 1 appl-1/vm-1
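
For a sense of what template-driven domain creation looks like, here is a minimal Python sketch that parses the same six arguments and fills in a small libvirt-style XML template. This is not our actual script or template: the template contents, the bridge names and the rule for deriving the domain name from the LUN path are all assumptions, and the disk and network segment are deliberately left out because their exact configuration is not shown here.

```python
from dataclasses import dataclass
from string import Template
import xml.etree.ElementTree as ET


@dataclass
class GuestSpec:
    virt_type: str   # "pv" in the example
    interfaces: str  # "fe-be" means Public plus Backend
    segment: str     # "odd" or "even" (not used by this sketch)
    memory_mb: int
    vcpus: int
    lun: str         # project/LUN, e.g. "appl-1/vm-1"


# Hypothetical template; libvirt expresses <memory> in KiB by default.
DOMAIN_TEMPLATE = Template("""<domain type='xen'>
  <name>$name</name>
  <memory>$memory_kib</memory>
  <vcpu>$vcpus</vcpu>
  <devices>
$interfaces
  </devices>
</domain>""")

# Assumed mapping from interface keyword to the vanity link name.
LINKS = {"fe": "fe0", "be": "be0"}


def parse_args(argv):
    """Turn the six positional arguments into a GuestSpec."""
    virt_type, interfaces, segment, memory, vcpus, lun = argv
    return GuestSpec(virt_type, interfaces, segment, int(memory), int(vcpus), lun)


def render(spec):
    """Fill the template; disk and segment handling are intentionally omitted."""
    interface_xml = "\n".join(
        "    <interface type='bridge'><source bridge='%s'/></interface>" % LINKS[name]
        for name in spec.interfaces.split("-")
    )
    return DOMAIN_TEMPLATE.substitute(
        name=spec.lun.replace("/", ":"),  # assumed naming rule
        memory_kib=spec.memory_mb * 1024,
        vcpus=spec.vcpus,
        interfaces=interface_xml,
    )


spec = parse_args("pv fe-be odd 1024 1 appl-1/vm-1".split())
document = render(spec)
ET.fromstring(document)  # raises if the rendered XML is not well-formed
print(document)
```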

I should point out that we're not really pre-attaching the iSCSI LUN to Dom-0, but instead using one of the enhancements of xVM Xen which will do this for us. We simply specify the iSCSI Qualified Name (IQN) and the IP address of the target as the virtual disk in the domain configuration, and let xVM Xen deal with attaching it when the domain starts up.

Now, we're almost there, but we need to assign IP addresses to the virtual server. We use DHCP for this. More specifically we bind specific IP addresses to specific MAC addresses to ensure that a virtual server always uses its assigned address(es). Xen creates a sort of synthetic MAC address for each interface it configures and persists the address. We can grab these addresses before the domain starts to update DHCP.

This is a good time to point out that all the master images of guest operating systems have been configured as DHCP clients. This really simplifies the entire process, since there is no need to do any post-configuration of the cloned images to give them the correct network resources. With DHCP, it just happens. You can also see why it is critical to have a highly available DHCP cluster, since so much relies on it. Once again, this is scripted.

This example shows the assignment of a Public and a Backend interface to the new guest:

domu-assign-ip appl-1:vm-1 fe0 vm-host1
domu-assign-ip appl-1:vm-1 be0 vm-host1-be
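
Under the covers, binding an address to a MAC boils down to a lookup table keyed by the DHCP client identifier. The Python below is a sketch of that idea only, not the domu-assign-ip script: the MAC addresses, IP addresses and class name are invented, and it assumes the common convention of forming the client identifier from the Ethernet hardware type 01 followed by the MAC's hex digits.

```python
def client_id(mac):
    """Client identifier for an Ethernet MAC: type 01 plus the hex digits."""
    octets = mac.split(":")
    if len(octets) != 6:
        raise ValueError("expected six colon-separated octets: %r" % mac)
    return "01" + "".join("%02X" % int(octet, 16) for octet in octets)


class BindingTable:
    """Fixed MAC-to-IP bindings, so a guest always gets the same address."""

    def __init__(self):
        self._by_client = {}  # client id -> (ip, hostname)

    def assign(self, mac, ip, hostname):
        cid = client_id(mac)
        for other, (other_ip, _) in self._by_client.items():
            if other != cid and other_ip == ip:
                raise ValueError("%s is already bound to another client" % ip)
        self._by_client[cid] = (ip, hostname)

    def lookup(self, mac):
        return self._by_client.get(client_id(mac))


# Invented addresses; 00:16:3e is the prefix Xen uses for generated MACs.
table = BindingTable()
table.assign("00:16:3e:12:34:56", "192.0.2.21", "vm-host1")
table.assign("00:16:3e:12:34:57", "198.51.100.21", "vm-host1-be")
print(client_id("00:16:3e:12:34:56"))
print(table.lookup("00:16:3e:12:34:57"))
```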

And that's it. At this point, we're ready to fire up the guest.

virsh start appl-1:vm-1
virsh console appl-1:vm-1

Wrap Up

That's about it for the basic details of what we've done. It's all been working incredibly well, especially considering we're running about ¾ development code everywhere.

Before I forget, I would really like to thank the xVM Xen team for their support while we've been setting things up and tinkering around. They have all been very helpful and responsive to my questions in public as well as private threads. Mark Johnson deserves a special mention since I glommed onto him the most.

Up next, a summary of how well we're doing on the previously outlined goals.


Really cool stuff! Thanks for sharing all the details.

One detail you did not mention was how long does the live migration take? I bet that it depends on the size of memory (RAM) allocated/used by the guest. But let's say that for a guest with 2GB RAM, would it be <1 second, a few seconds, more? And what happens to existing network (http) connections between the guest and clients? Would they get preserved?

Posted by Igor Minar on April 16, 2009 at 04:29 PM PDT #

From start to finish, the whole live migration process takes a little time, say under 20-30 seconds. During most of the process the domain is still active, but Xen does need to suspend it momentarily when it transfers execution from the source physical host to the target.

I notice in general attaching an iSCSI target to a physical host takes a few seconds for Solaris to dynamically configure itself.

Then, in the best case with your 2GB example, transferring the pages over a 1Gbit link would take approximately 16 seconds (1Gbit/sec = 1024Mbit/sec = 128MB/sec, and 2048MB / 128MB/sec = 16 sec). It could take longer, since the domain could dirty some pages that have already been transferred once.
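
The estimate above, as a small Python sketch. The first function is just the best-case arithmetic; the second is a deliberately simplified model of re-sending dirtied pages in rounds, with an assumed dirty rate and round count, and should not be read as Xen's actual migration algorithm.

```python
def best_case_seconds(ram_mb, link_mbit_per_sec=1024):
    """Time to push every page across the link exactly once."""
    link_mb_per_sec = link_mbit_per_sec / 8
    return ram_mb / link_mb_per_sec


def precopy_seconds(ram_mb, dirty_mb_per_sec, link_mbit_per_sec=1024, rounds=5):
    """Simplified model: each round resends what was dirtied during the last.

    dirty_mb_per_sec and rounds are assumptions chosen for illustration.
    """
    link_mb_per_sec = link_mbit_per_sec / 8
    to_send = ram_mb
    total = 0.0
    for _ in range(rounds):
        elapsed = to_send / link_mb_per_sec
        total += elapsed
        # Pages dirtied while this round was on the wire must go again.
        to_send = min(ram_mb, dirty_mb_per_sec * elapsed)
    return total


print(best_case_seconds(2048))                      # 2GB guest, idle
print(precopy_seconds(2048, dirty_mb_per_sec=32))   # same guest, busy
```

As long as the guest dirties memory more slowly than the link can carry it, each round is shorter than the one before, which is why the final suspend can be kept brief.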

All the network connections are preserved through the migration. There can be a momentary blip while the switches update themselves (hey, MAC x:y:z:a:b:c is now on that port). I believe Xen attempts to minimize this by sending a gratuitous ARP packet to help force the switches to update themselves.

Posted by Joe Mocker on April 17, 2009 at 02:16 AM PDT #

So from the user perspective, the entire migration is transparent? Let's say I'm downloading a file from a webserver running on a guest OS. If I move this OS to a new physical host, will the download get interrupted, or will TCP's guaranteed delivery recover any networking errors that might occur?

From what you described, it sounds like the original guest OS keeps on working even after the migration has started, and only when the copy to the target host is mostly complete will control be moved over there. Is that correct? That would mean that the time the guest OS is suspended is greatly minimized.

Posted by Igor Minar on April 17, 2009 at 03:09 AM PDT #

You are correct, the migration is transparent. I believe the guest knows it was suspended and resumed but everything should just pick up on the new physical host where it left off on the old one. Including whatever network activity was happening, like HTTP downloads or what have you.

Xen I believe has different strategies to migrate pages over while keeping the guest running as long as possible. So yes, it does try to minimize any suspend time as much as possible.

Posted by Joe Mocker on April 17, 2009 at 08:37 AM PDT #

cool.. thnx!

Posted by Igor Minar on April 17, 2009 at 09:06 AM PDT #
