My Own Private Cloud-aho - The Details
By me on Apr 16, 2009
In my previous post, I went over some of the goals we are trying to attain through a data-center architecture based on virtualization with xVM Xen. Essentially, we are looking for ways to work smarter and faster and to be more flexible. In this entry, I will attempt to go into the details of the infrastructure we built.
A good place to start is the general network layout. The first diagram shows what we have done. Nothing too fancy. At the heart of it are two independent physical networks. We do this to increase availability. Two load balancers are at the top of the stack, which are cross connected to switches on each physical network, and do something like talk VRRP or some other crazy protocol to each other to decide who is the master and who is on standby.
There are two logical networks that run throughout the infrastructure, the Public network and the Backend network. If the names aren't clear enough, the Public network is meant to serve traffic between us and our customers (via the internet). The Backend network is for various flavors of server-to-server communication within the data-center.
Each physical host, therefore, has two network connections, one to each logical network. The notable exception is the Sun Storage 7410 Unified Storage Cluster, which is connected only to the Backend network, and has connections to both physical networks. The cluster is configured in an active-passive mode, which means that only one of the two 7410 head units is doing the file serving at any one time. To ensure that it can still serve the Even segment when the Odd segment is down, we need to give it a presence on each physical segment. I'll discuss more about what specifically we're doing with the Unified Storage Cluster later on.
Nothing really out of the ordinary here. We've set up a pair of PXE Boot/Install servers, mainly to handle installing additional Hypervisors. The Install servers themselves are running Solaris Nevada 105; I don't recall what that translates to in terms of SXCE releases. There is nothing special about that version other than that it happened to be the most recent drop we could get when we started assembling everything.
These are more or less independent of each other. Most of the installs work against the primary install server. The only real reason to switch to the other server is if the primary is down or if its network segment is down. But as you'll see, we're really not doing many installs at all.
The DHCP servers actually run on the same physical hosts as the install servers. They are using the DHCP daemons that come as part of Solaris/Nevada. They do play a key part in the whole PXE Boot process, so they have had additional customizations made to allow for that - there are a few good articles out there describing the additional macros needed to accomplish PXE Boot.
The one other interesting thing to note is that the pair of DHCP servers is configured as a cluster. The DHCP daemon doesn't really have a notion of being in a cluster; however, it can be configured to use a shared datastore, which each daemon can read and update. In our case, this is the first use of the Unified Storage Cluster. The DHCP servers are configured to use an NFS share on the cluster. The caveat here is that you must configure the daemons to use the SUNWfiles datastore. Neither SUNWbinfiles nor SUNWnisplus will work.
There are a couple of other things to be aware of when clustering DHCP. First, set the OWNER_IP parameter in each daemon's local dhcpsvc.conf file to a comma-separated list of the IP addresses of every interface that will serve DHCP requests, across all servers. Second, set RESCAN_INTERVAL to a value that is reasonable for you; in our case we just set it to 1 minute. Both of these values can be updated with the dhcpconfig -P command.
dhcpconfig -P OWNER_IP=192.168.78.14,192.168.78.16,192.168.76.25,192.168.76.25
dhcpconfig -P RESCAN_INTERVAL=1
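Since the same OWNER_IP list has to be set on every server, a repeated or missing address is easy to overlook. Here is a minimal sketch of how the list could be assembled and checked before applying it. The helper names join_ips and find_duplicates are hypothetical, the addresses are examples only, and the script prints the dhcpconfig commands for review rather than running them.

```sh
#!/bin/sh
# join_ips (hypothetical helper): join all arguments with commas.
join_ips() {
    list=""
    for ip in "$@"; do
        if [ -z "$list" ]; then list="$ip"; else list="$list,$ip"; fi
    done
    printf '%s\n' "$list"
}

# find_duplicates (hypothetical helper): print any address listed twice.
find_duplicates() {
    printf '%s\n' "$@" | sort | uniq -d
}

# Example addresses only; list every interface that serves DHCP requests.
set -- 192.168.78.14 192.168.78.16 192.168.76.25

dupes=$(find_duplicates "$@")
if [ -n "$dupes" ]; then
    echo "warning: duplicate address(es): $dupes" >&2
fi

# Print the commands for review instead of executing them.
echo "dhcpconfig -P OWNER_IP=$(join_ips "$@")"
echo "dhcpconfig -P RESCAN_INTERVAL=1"
```

Running the printed commands on each DHCP server keeps the two dhcpsvc.conf files in step.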
First things first. The Hypervisors are all well loaded Sun Fire X4150 servers. They have been installed with Solaris/Nevada 105, again just because that was current at the time. We chose Nevada instead of OpenSolaris primarily because it offers a better unattended/headless install environment. They are configured with ZFS root on local hard drives, and the ZFS mirroring has been set up to increase uptime.
Not much in the way of customizations to Dom-0 or the Hypervisor. Obviously in Dom-0 we disable as many unnecessary services as possible. Since these are servers, we shut down all of the Desktop related services like the X server and the like.
We also limit Dom-0 memory to only 2G using the dom0_mem parameter to the hypervisor. This might be a little aggressive since ZFS is memory hungry, but we want to try to keep as much memory available for the guest domains as possible, and we haven't seen a problem with this yet.
We also set the hypervisor console to com1, in case we need to break into the console for any sort of debugging (knock on wood, we won't have to do that).
Both of these parameters are set from the GRUB boot commands:
kernel$ /boot/$ISADIR/xen.gz com1=9600,8n1 console=com1 dom0_mem=2G
We also use the vanity device naming capabilities of Solaris - via dladm - to give all the Public and Backend interfaces the same names. So, although today we're using Sun Fire X4150 servers which all have Intel Pro/1000 network interface controllers in them, in the future, we can move to different platforms and still maintain a consistent device naming convention. This is actually pretty crucial for Xen Live Migration to work. Xen needs the same network interfaces to be available on the source and destination of the Live Migration. Without this, the migration will not succeed.
dladm rename-link e1000g0 fe0
dladm rename-link e1000g1 be0
Finally, while we're talking about Live Migration, it's something we need to enable in xend. A couple of simple SMF changes handle that.
svccfg -s xend setprop config/xend-relocation-address = 192.168.78.21
svccfg -s xend setprop config/xend-relocation-hosts-allow = astring: \
    \"^localhost$ ^192\.168\.78\.[0-9]*$\"
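The hosts-allow value is a space-separated list of regular expressions, and a host may relocate a domain to this machine only if it matches one of them. To get a feel for what the patterns above admit, here is a rough sketch that tests a few addresses with grep -E. Note that relocation_allowed is a hypothetical helper, not part of xend, and that grep's extended regular expressions are only a stand-in for xend's own matching, though they should agree for simple patterns like these.

```sh
#!/bin/sh
# Space-separated patterns, in the same form as xend-relocation-hosts-allow.
ALLOW='^localhost$ ^192\.168\.78\.[0-9]*$'

# relocation_allowed (hypothetical helper): succeed if the host matches
# any pattern in ALLOW. grep treats newline-separated patterns as
# alternatives, so the spaces are turned into newlines first.
relocation_allowed() {
    patterns=$(printf '%s\n' "$ALLOW" | tr ' ' '\n')
    printf '%s\n' "$1" | grep -E -q -- "$patterns"
}

for h in localhost 192.168.78.21 192.168.76.25; do
    if relocation_allowed "$h"; then
        echo "$h: allowed"
    else
        echo "$h: denied"
    fi
done
```

With these patterns only localhost and addresses on the 192.168.78 Backend network are accepted, so a request arriving from the Public network would be refused.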
For what it's worth, we currently have 22 Hypervisors in the cloud, clearly not a huge deployment yet.
Unified Storage Cluster
These units sort of fell in our lap at just the right time. Although we could have approached the level of availability they offer with some solution built on top of Solaris & SunCluster, the ease of installation, configuration and maintenance that the Unified Storage Cluster affords should help keep the infrastructure simple and straightforward to maintain.
As I mentioned earlier, these units are configured in an active-passive cluster mode, and have a presence on each physical network segment. Configuring them as a cluster is amazingly simple, as any appliance should be. Once the CLUSTRON interface is connected between them, the initial boot to configure the first head node automatically detects the presence of the second head node, and prompts you to configure them as a cluster.
As far as how we are using them, well, they are used in a few capacities. First, as I mentioned earlier, they are used to house the DHCP servers' shared datastore. We also use them to house various administrative tools and bits.
But that's not the main use. Primarily we are using them to hold the virtual disks on which all the guest operating systems (virtual servers) are installed. This is done by creating LUNs in the Unified Storage and exposing them as iSCSI targets, which are then attached to the Dom-0's and made available to the Hypervisors.
This is another critical piece that makes some of the cool features of xVM Xen, like Live Migration, possible. Live Migration is the process of moving a running virtual server from one physical host to another. Did you read that, a running virtual server! For this to happen, the virtual disk must be available on the target physical host. Using iSCSI makes this a snap since all you need to do is attach the LUN to the target physical host and you are done. If you think of the alternative with local storage, you would have to somehow transfer the bits from the local storage on the source physical host to the target, while the virtual server is still running, which among other things is nearly impossible to do in a real-time manner.
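Once the LUN is reachable from both hosts and relocation is enabled in xend, the migration itself comes down to a single command. Here is a minimal sketch, assuming the stock xm migrate --live syntax. The wrapper migrate_cmd is hypothetical and only prints the command, and the target address 192.168.78.22 is made up for illustration.

```sh
#!/bin/sh
# migrate_cmd (hypothetical wrapper): print the live-migration command
# for a domain and a target host instead of running it.
migrate_cmd() {
    if [ $# -ne 2 ]; then
        echo "usage: migrate_cmd DOMAIN TARGET_HOST" >&2
        return 1
    fi
    printf 'xm migrate --live %s %s\n' "$1" "$2"
}

# Example: move the running guest appl-1:vm-1 to another Hypervisor.
migrate_cmd appl-1:vm-1 192.168.78.22
```

Using a Backend address as the target keeps the transfer of the guest's memory off the Public network.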
Ok, so one other use of the Unified Storage is for shared storage for applications running inside the virtual servers. Building horizontally scalable, redundant applications sometimes requires the use of storage that can be accessed by all nodes in the application cluster. We provide this on the Unified Storage cluster with NFS.
Putting it All Together
That's a summary of all the pieces. Now, how does it all fit together? Here's a diagram that shows all the interactions.
Let's look at the organization on the Unified Storage Cluster first. We use different projects within the Unified Storage to keep things manageable. The first thing we've done is build a set of master images of various guest operating systems like OpenSolaris - 08.11 & Dev, Nevada, and so forth. Those images are kept in a Masters project. These images are pre-installed instances of guest operating systems that reside on (iSCSI) LUNs in the Unified Storage. A snapshot of the image is taken after any revision we make to the image (note that snapshots should only be taken when the guest has been shut down).
When we're ready to spin up a new virtual machine, we select the current snapshot of the O/S image we want, and clone it as a new LUN in a separate Unified Storage project. The great thing here is that the clones initially take up zero additional space, and only start to use their own space when the operating system changes something on the virtual disk, an application is installed, and the like. Quite a savings in disk space. We have some scripts that interact with the Unified Storage to do this.
These commands will create a new project called appl-1, and then clone a master image to the project as vm-1.
domu-project appl-1
domu-clone masters/osol-0811@version-01 appl-1/vm-1
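The domu-project and domu-clone scripts talk to the appliance and are not shown here. The idea underneath is the ZFS snapshot-and-clone mechanism, which on a plain Solaris host with ZFS looks roughly like the sketch below. This is an analogy, not the appliance's own interface: the pool name tank and the helper clone_cmds are assumptions, and the commands are printed rather than run.

```sh
#!/bin/sh
POOL=tank   # hypothetical pool name

# clone_cmds (hypothetical helper): print the ZFS commands that create a
# project-like parent dataset and clone a master snapshot into it.
#   $1 = master snapshot, e.g. masters/osol-0811@version-01
#   $2 = new clone,       e.g. appl-1/vm-1
clone_cmds() {
    snapshot=$1
    target=$2
    project=${target%%/*}          # text before the first slash
    echo "zfs create $POOL/$project"
    echo "zfs clone $POOL/$snapshot $POOL/$target"
}

clone_cmds masters/osol-0811@version-01 appl-1/vm-1
```

A clone shares its blocks with the snapshot it came from, which is why a freshly cloned guest disk takes up almost no additional space.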
After the image is cloned, we instruct the Unified Storage to export the cloned LUN as an iSCSI target. The target is then attached to a Dom-0 and Hypervisor. Currently we've been putting about 4-6 guests on each Hypervisor before moving onto a new one for additional virtual servers.
From here, the usual Xen commands are used to define a domain, that is, the xm create or virsh create commands. As part of the domain configuration, we specify the attached iSCSI LUN as the virtual disk for the domain. Again, this is simplified through scripting by using a pre-defined XML template for domain creation.
This example shows the creation of a paravirtualized guest - pv, which has both a Public and Backend interface - fe-be, on the Odd network segment - odd, with 1G of memory - 1024, and 1 CPU - 1.
domu-init pv fe-be odd 1024 1 appl-1/vm-1
I should point out that we're not really pre-attaching the iSCSI LUN to Dom-0, but instead using one of the enhancements of xVM Xen which will do this for us. We simply specify the iSCSI Qualified Name (IQN) and the IP address of the target as the virtual disk in the domain configuration, and let xVM Xen deal with attaching it when the domain starts up.
Now, we're almost there, but we need to assign IP addresses to the virtual server. We use DHCP for this. More specifically we bind specific IP addresses to specific MAC addresses to ensure that a virtual server always uses its assigned address(es). Xen creates a sort of synthetic MAC address for each interface it configures and persists the address. We can grab these addresses before the domain starts to update DHCP.
This is a good time to point out that all the master images of guest operating systems have been configured as DHCP clients. This really simplifies the entire process, since there is no need to do any post-configuration of the cloned images to give them the correct network resources. With DHCP, it just happens. You can also see why it is critical to have a highly available DHCP cluster, since so much relies on it. Once again, this is scripted.
This example shows the assignment of a Public and a Backend interface to the new guest:
domu-assign-ip appl-1:vm-1 fe0 vm-host1 192.168.76.71
domu-assign-ip appl-1:vm-1 be0 vm-host1-be 192.168.78.71
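The domu-assign-ip script itself is not shown, but the core of binding an address to a MAC in a DHCP table is turning the MAC into a client identifier: hardware type 01 (Ethernet) followed by the six octets in upper-case hex. Here is a minimal sketch of that conversion. The helper mac_to_client_id and the sample MAC are hypothetical, and the pntadm line is only one plausible way to add such an entry with the Solaris DHCP tools, printed rather than run; the real script may well do it differently.

```sh
#!/bin/sh
# mac_to_client_id (hypothetical helper): convert a colon-separated MAC
# address into a DHCP client identifier. Solaris tools may print octets
# without leading zeros (0:16:3e:...), so each octet is padded to two
# digits. The body runs in a subshell so the IFS change stays local.
mac_to_client_id() (
    IFS=:
    set -- $1
    id=01
    for octet in "$@"; do
        id="$id$(printf '%02X' "0x$octet")"
    done
    printf '%s\n' "$id"
)

# Example MAC only; use the address Xen generated for the interface.
cid=$(mac_to_client_id 0:16:3e:1a:2b:3c)
echo "client id: $cid"

# Assumed invocation: add a permanent entry for the Public address to
# the 192.168.76.0 network table.
echo "pntadm -A 192.168.76.71 -i $cid -f PERMANENT 192.168.76.0"
```

Because the guest images are DHCP clients, adding this entry before the first boot is all it takes for the guest to come up on its assigned address.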
And that's it. At this point, we're ready to fire up the guest.
virsh start appl-1:vm-1
virsh console appl-1:vm-1
That's about it for the basic details of what we've done. It's all been working incredibly well, especially considering we're running about ¾ development code everywhere.
Before I forget, I would really like to thank the xVM Xen team for their support while we've been setting things up and tinkering around. They have all been very helpful and responsive to my questions on firstname.lastname@example.org as well as in private threads. Mark Johnson deserves a special mention since I glommed onto him the most.
Up next, a summary of how well we're doing on the previously outlined goals.