Wednesday Sep 09, 2015

Configuring and Managing OpenStack Nova's Image Cache


One piece of configuration that we had originally neglected in setting up Solaris Engineering's OpenStack cloud was the cache that Nova uses to store downloaded Glance images while deploying each guest.  For many sites this likely wouldn't be a big problem, as they'll often use a small set of images for a long period of time, so the cache won't grow much.  In our case, that's definitely not true; we generate new Solaris images every two weeks for each release that is under development, and we usually have two of those (an update release and the next minor release), so we are introducing new images every week (and that's multiplied by having images for both SPARC and x86; yes, our Glance catalog is pretty large, a topic for a future post).  And of course the most recent images are the most used, so a particular image will usually fall out of favor within a month or two.  We recently reached the point where the cache became a problem on our compute nodes, so here we'll share the solution we're using until the solariszones driver in OpenStack gets the smarts to handle this.

By default, the solariszones driver uses /var/lib/nova/images as its download cache for Glance images.  This has two problems: it's not governed by a quota, and it's inside the boot environment's /var dataset, meaning that as we update the compute node's OS, each new boot environment (or BE) will create a snapshot that refers to whatever is there, so it becomes difficult to actually free up space; you need to delete both the image and any old BE snapshots that refer to it.  There's also a related problem: attempting to create an archive of the system using archiveadm(1M) will capture all of these images and bloat the archive to Godzilla size.  Thus, our solution needs to move the cache outside of the boot environment hierarchy.  Altogether, there are three pieces:

  1. Create a dataset and set a quota on it
  2. Introduce an SMF service to manage the cache against the quota
  3. Reconfigure Nova to use the new cache dataset
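
For reference, step 1 above amounts to something like the following when done by hand; in our cloud we let Puppet do it (that manifest appears later in this post), and the dataset name, mountpoint, and quota here are the same ones it uses.

zfs create -o mountpoint=/export/nova -o quota=15g rpool/export/nova
chown nova:nova /export/nova
chmod 0755 /export/nova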

The SMF Service

For this initial iteration, I've built a really simple service with lots of hard-coded policy.  A real product-quality solution would be more configurable and such, but right now I'm in quick & dirty mode as we have bigger problems we're working on.  Here's the script at the core of the service:


#!/usr/bin/ksh
#
# Nova image cache cleaner
# Presently all hard-coded policy; should be configured via SMF properties
#
# Cache locations; these match the dataset the Puppet manifest below creates
cacheds=rpool/export/nova
cachedir=/export/nova/images

quota=$(zfs get -pH -o value quota ${cacheds})

# No quota set means nothing to manage
(( quota == 0 )) && exit 0

(( limit=quota-2*1024*1024*1024 ))

size=$(zfs get -pH -o value used ${cacheds})

# Delete images until we are at least 2 GB below quota; always delete oldest
while (( size > limit )); do
    oldest=$(ls -tr ${cachedir}|head -1)
    rm -v ${cachedir}/${oldest}
    size=$(zfs get -pH -o value used ${cacheds})
done

exit 0

To run something like this you'd traditionally use cron, but we have the new SMF periodic services at our disposal, and that's what we'll use here; the script will be run hourly, and gets to run for however long it takes (which should be but a second or two, but no reason to be aggressive).

<?xml version="1.0" ?>
<!DOCTYPE service_bundle
  SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type="manifest" name="nova-cache-cleanup">
    <service version="1" type="service" name="site/nova-cache-cleanup">
        <!-- The following dependency keeps us from starting unless
             nova-compute is also enabled -->
        <dependency restart_on="none" type="service"
            name="nova-compute" grouping="require_all">
            <service_fmri value="svc:/application/openstack/nova/nova-compute"/>
        </dependency>
        <!-- Run the cleaner hourly, with no timeout on the method -->
        <periodic_method period="3600" timeout_seconds="0"
            exec="/lib/svc/method/nova-cache-cleanup"/>
        <instance enabled="false" name="default"/>
        <template>
            <common_name>
                <loctext xml:lang="C">
                        OpenStack Nova Image Cache Management
                </loctext>
            </common_name>
            <description>
                <loctext xml:lang="C">
                        A periodic service that manages the Nova image cache
                </loctext>
            </description>
        </template>
    </service>
</service_bundle>

To deliver it, we package it up in IPS:

set name=pkg.fmri value=pkg://site/nova-cache-cleanup@0.1
set name=pkg.summary value="Nova compute node cache management"
set name=pkg.description value="Nova compute node glance cache cleaner"
file nova-cache-cleanup.xml path=lib/svc/manifest/site/nova-cache-cleanup.xml owner=root group=bin \
    mode=0444 restart_fmri=svc:/system/manifest-import:default
file nova-cache-cleanup path=lib/svc/method/nova-cache-cleanup owner=root group=bin mode=0755

Use pkgsend(1) to publish all of that into your local IPS repository.
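
The exact invocation depends on your repository layout; assuming the manifest above is saved as nova-cache-cleanup.p5m in a directory containing the script and the SMF manifest, and a file-based repository at /export/repo/site (both paths are examples), it looks something like:

pkgsend publish -s /export/repo/site -d . nova-cache-cleanup.p5m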

Creating the Dataset

Since we use Puppet to handle the day-to-day management of our cloud, I updated our compute_node class to create the dataset and distribute the package.  Here's what that looks like:

# Configuration custom to compute nodes
class compute_node {

    # Customize kernel settings
    file { "site:cloud" :
        path => "/etc/system.d/site:cloud",
        owner => "root",
        group => "root",
        mode => 444,
        source => "puppet:///modules/compute_node/site:cloud",
    }

    # Create separate dataset for nova to cache glance images
    zfs { "rpool/export/nova" :
        ensure => present,
        mountpoint => "/export/nova",
        quota => "15G",
    }

    file { "/export/nova" :
        ensure => directory,
        owner => "nova",
        group => "nova",
        mode => 755,
        require => Zfs['rpool/export/nova'],
    }

    # Install service to manage cache and enable it
    package { "pkg://site/nova-cache-cleanup" :
        ensure => installed,
    }

    service { "site/nova-cache-cleanup" :
        ensure => running,
        require => Package['pkg://site/nova-cache-cleanup'],
    }
}

It's important that the directory have the correct owner and permissions; the solariszones driver will create the /export/nova/images directory within it to store the images, and the nova-compute service runs as nova/nova.  Get this wrong and guests will just plain fail to deploy.  The kernel settings referred to above are unrelated to this posting; they set user_reserve_hint_pct so that the ZFS ARC is managed appropriately for running lots of zones (we're using a value of 90 percent, as our compute nodes are beefy - 256 or 512 GB of RAM).
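
For the curious, the site:cloud file that Puppet distributes boils down to a one-line /etc/system fragment along these lines (90 percent is the value we use; tune it for your own systems):

set user_reserve_hint_pct=90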

Reconfiguring Nova

Once you have all that in place and the Puppet agent has run successfully to create the dataset and install the package, it's just a small bit of Python to point Nova at the new cache on all the compute nodes:


import iniparse
from subprocess import CalledProcessError, Popen, PIPE, check_call

def configure_cache():
    # Point the image cache at the new dataset and restart nova-compute
    ini = iniparse.ConfigParser()
    ini.read('/etc/nova/nova.conf')
    ini.set('DEFAULT', 'glancecache_dirname', '/export/nova/images')
    with open('/etc/nova/nova.conf', 'w') as fh:
        ini.write(fh)
    check_call(["/usr/sbin/svcadm", "restart", "nova-compute"])

if __name__ == "__main__":
    configure_cache()

I used cssh(1) to run this at once on all the compute nodes, but you could also do this with Puppet.  Ideally we'd package nova.conf or use Puppet's OpenStack modules to do this part, but the packaging solution doesn't work due to a couple of node-specific items in nova.conf and we don't yet have Puppet OpenStack available in Solaris (stay tuned!).

The final step once you've done all the above is to take out the trash from your old image cache:

rm -rf /var/lib/nova/images

Again, if you have older BEs around this won't free up all the space yet; you'll need to remove the old BEs as well, but be careful not to leave yourself with just one!
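
beadm makes it easy to see which BEs and snapshots are still holding space and to clean them up; the BE name below is just an example:

beadm list -s                 # show BEs and their snapshots
beadm destroy old-be-name     # example name; you'll be prompted to confirm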

Friday Jul 10, 2015

Upgrading Solaris Engineering's OpenStack Cloud

The Solaris 11.3 Beta release includes an update to the bundled OpenStack packages from the Havana version to Juno [1].  Over on the OpenStack blog my colleague Drew Fisher has a detailed post that looks under the covers at the work the community and our Solaris development team did to make this major upgrade as painless as possible. Here, I'll talk about applying that upgrade technology from the operations side, as we recently performed this upgrade on the internal cloud that we're operating for Solaris engineering.  See my series of posts from last year on how our cloud was initially constructed.

Our Upgrade Process

The first thing to understand about our upgrade process is that, since the Solaris Nova driver as yet lacks live migration support, we can't upgrade compute nodes without an outage for the guest instances.  We also don't yet have an HA configuration deployed for the database and all the services, so those also require outages to upgrade [2].  Therefore, all of our upgrades have downtime scheduled for the entire cloud and we attempt to upgrade all the nodes to the same build.  We typically schedule two hours for upgrades.  If everything were to go smoothly we could be done in less than 30 minutes, but it never works out that way, at least so far.

Right now, we're still doing the upgrades fairly manually, with a small script that we run on each node in turn.  That script looks something like:

# shut down puppet so that patches don't get pulled before they are required
svcadm disable -t puppet:agent
# shut down zones so update goes more quickly, use synchronous to wait for this
svcadm disable -ts zones
# Disable nova API and BUI; we use temporary for API so it will
# come back on reboot but persistent for BUI so that it's not available
# until we're ready to end the outage.
# Dump database for disaster recovery
if [[ $node == "cloud-controller" ]]; then
        svcadm disable -t nova-api-osapi-compute 
        svcadm disable apache22
        mysqldump --user=root --password='password' --add-drop-database --all-databases >/tank/all_databases.sql
fi
pkg update --be-name solaris11_3 -C 5

Once the script completes, we can reboot the system.  The comment above about Puppet relates to specifics in how we are using it; since we sometimes have bugs in the builds that we can work around, we typically use Puppet to distribute those workarounds, but we don't want them to take effect until we've rebooted into the new boot environment.  There's almost certainly a better way to do this; we're just not that smart yet ;-)

We run the above script on all the nodes in parallel, which is fine to do because the upgrade is always creating a new boot environment and we'll wait until all of the core service nodes (keystone, cinder, nova controller, neutron, glance) are done before we reboot any of them.  We don't necessarily wait for all of the compute nodes since they can take longer if any of the guests are non-global zones, and they are the last thing we reboot anyway.

Once the updates are complete, we reboot nodes in the following order:

  1. Nova controller - MySQL, RabbitMQ, Keystone, the Nova APIs, Heat
  2. Neutron controller
  3. Cinder controller
  4. Glance
  5. Compute nodes

This order minimizes disruptions to the services connecting to RabbitMQ and MySQL, which have been a point of fragility for many operators of OpenStack clouds.  It also ensures that the compute nodes don't see disruptions to iSCSI connections for running zones, which we've seen occasionally lead to ZFS pools ending up in a suspended state.  As we build out the cloud we'll be separating the functions that are in the Nova controller into separate instances, which will necessitate some adjustments to this sequencing, but the basic idea is to work from the database to rabbitmq to keystone to the nova services.
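
After each node comes back up, a couple of quick checks help confirm that the services reconnected cleanly; the commands below assume you're on the controller with admin credentials loaded:

svcs -xv              # anything stuck in maintenance?
nova service-list     # have all the compute services checked back in?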

Verifying the Upgrade Worked

Once we've rebooted the nodes we run a couple of quick tests to launch both SPARC and x86 guests, ensuring that basically all of the machinery is working.  I've started doing this with a fairly simple Heat template:

heat_template_version: 2013-05-23

description: >
  HOT template to deploy SPARC & x86 servers as a quick sanity test

parameters:
  x86_image:
    type: string
    description: Name of image to use for x86 server
  sparc_image:
    type: string
    description: Name of image to use for SPARC server

resources:
  x86_server1:
    type: OS::Nova::Server
    properties:
      name: test_x86
      image: { get_param: x86_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: x86_server1_port }

  x86_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  x86_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: x86_server1_port }

  sparc_server1:
    type: OS::Nova::Server
    properties:
      name: test_sparc
      image: { get_param: sparc_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: sparc_server1_port }

  sparc_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  sparc_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: sparc_server1_port }

outputs:
  x86_server1_public_ip:
    description: Floating IP address of x86 server in public network
    value: { get_attr: [ x86_server1_floating_ip, floating_ip_address ] }
  sparc_server1_public_ip:
    description: Floating IP address of SPARC server in public network
    value: { get_attr: [ sparc_server1_floating_ip, floating_ip_address ] }

Once that test runs successfully, we declare the outage over and re-enable the Apache service to restore access to the Horizon dashboard.
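
With the template saved locally, the test and the end-of-outage steps look roughly like this; the template file name and image names are examples:

heat stack-create -f sanity_test.yaml \
    -P x86_image="sol-11_3-x86" -P sparc_image="sol-11_3-sparc" sanity-test
heat stack-show sanity-test       # wait for CREATE_COMPLETE, then try the floating IPs
heat stack-delete sanity-test     # clean up the test guests
svcadm enable apache22            # outage over: bring Horizon back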

Our Upgrade Experiences

Since we went into production almost a year ago, we've upgraded the entire cloud infrastructure, including the OpenStack packages, seven times.  Had we met our goals we would have done the upgrade every two weeks as each full Solaris development build is released internally (and thus would have done over 20 upgrades), but the reality of running at the bleeding edge of the operating system's development is that we find bugs, and several of them have been too serious and too difficult to work around for us to proceed, so we've had to delay a number of times while we waited for fixes to integrate. Through all of this, we've learned a lot and are continually refining our upgrade process.

So far, we've only had one upgrade over the last year that was unsuccessful, and that was reasonably painless, since we just re-activated the old boot environment on each node and rebooted back to it.  We now pre-stage each upgrade on a single-node stack that's configured similarly to the production cloud to verify there aren't any truly catastrophic problems with kernel zones, ZFS, or networking.  That's mostly been successful, but we're going to build a small multi-node cloud for the staging to ensure that we can catch issues in additional areas such as iSCSI that aren't exercised properly by the single-node stack.  The lesson, as always, is to have your test environment replicate production as closely as possible.

For this particular upgrade, we did a lot more testing; I spent the better part of two weeks running trial upgrades from Havana to Juno to shake out issues there, which allowed the development team to fix a bunch of bugs in the configuration file and database upgrades before we went ahead with the actual upgrade.  Even so, the production upgrade was more of an adventure than we expected.  We ran into three issues:

  1. After we rebooted the controller node, the heat-db service went into maintenance.  The database had been corrupted by the service exceeding its start method timeout, which caused SMF to kill and restart it, and that apparently happened at a very inopportune time.  Fortunately, we had made little use of Heat with Havana and we could simply drop the database and recreate it.  The SMF method timeout is being fixed (for heat-db and other services), though that fix isn't in the 11.3 beta release.  We're also having some discussion about whether SMF should generally default to much longer start method timeouts.  We find that developers are consistently over-optimistic about the true performance of systems in production, and the short timeouts of 30 seconds or 1 minute that are often used are more likely to cause harm than good.
  2. The puppet:master service went into maintenance when that node was rebooted; with truss we determined that for some reason it was attempting to kill labeld, failing, and exiting.  This is still being investigated, as we've had difficulty reproducing it.  Fortunately, disabling labeld worked around the problem and we were able to proceed.
  3. After we had resolved the above issues, the test launches we use to verify the cloud is working would not complete - they'd be queued but not actually happen.  This took us over an hour to diagnose, in part because we're not that experienced with RabbitMQ issues; to that point it had "just worked".  It turned out that we were victims of the default file descriptor limit for RabbitMQ, at 256, being too low to handle all of the connections from the various services using it.  Apparently Juno is just more resource-hungry in this respect than Havana, and it's not something we could have observed in the smaller test environment.  Adding a "ulimit -n 1024" to the rabbitmq start method worked around this for now (a quick way to check the limit that's actually in effect is sketched below); this has sparked some internal discussion on whether the default limits should be increased, as yet unresolved.  The values are relics from many years ago and likely could use some updating.
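
For reference, prctl will show the descriptor limit a running broker actually has; the Erlang VM process name used in the pgrep pattern here is an assumption, so adjust it for your system:

prctl -n process.max-file-descriptor $(pgrep beam)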

Overall, this upgrade clocked in at a bit over 4 hours of downtime, not the 3 hours that we'd scheduled.  Happily, our cloud has run very smoothly in the weeks since the upgrade to Juno, and our users are very pleased with the much-improved Horizon dashboard.  We're now working our way through a long list of improvements to our cloud and getting the equipment in place to move to an HA environment, which will let us move towards our goal of rolling, zero-downtime upgrades.  More updates to come!


  1. If you're following the OpenStack community, you'll ask, "What about Icehouse?"  Well, we skipped it, in order to get closer to the community releases more quickly.
  2. I am happy to note that, in spite of this lack of HA, we've had only a few minutes of unscheduled service interruptions over the course of the year, due mostly to panics in the Cinder or Neutron servers.  That seems pretty good considering the bleeding-edge nature of the software we're running.

Tuesday Oct 28, 2014

Heating Up Your OpenStack Cloud

As part of the support updates to Solaris 11.2, we recently added the Heat orchestration engine to our OpenStack distribution.  If you aren't familiar with Heat, I highly recommend getting to know it, as you'll find it invaluable in deploying complex application topologies within an OpenStack cloud.  I've updated the script tarball from my recent series on building the Solaris engineering cloud to include configuration of Heat, so if you download that and update your cloud controller to the latest SRU, you can use the updated scripts to turn Heat on.

OK, once you've done that, what can you do with Heat?  Well, I've added a script and a Heat template that it uses to the tarball to give you at least one idea.  The script, create_image, is similar to a script that we run to create custom Unified Archive images internally for the Solaris cloud.  The basic idea is to deploy an OpenStack instance using the standard archive that release engineering constructs for the product build, add some things we need to it, then save an image of that for the users of the cloud to use as a base deployment image.  I'd originally written a script to do this using the nova CLI, but using a Heat template simplified it.  The template file in the tarball is the one the script uses; it's a simpler version of a two-node template from the heat-templates repository.  It's fairly self-explanatory so I'm not going to walk through it here.

As for create_image itself, the standard Solaris archive contains the packages in the solaris-minimal-server group, a pretty small package set that really isn't too useful for anything itself, but makes a nice base to build images that include the specific things you need.  In our case, I've defined a group package that pulls in a bunch of things we typically use in Solaris development work: ssh client, LDAP, NTP, Kerberos, NFS client and automounter, the man command, and less.  Here's what the main part of the package manifest looks like:

depend fmri=network/ssh type=group
depend fmri=group/system/solaris-minimal-server type=group
depend fmri=ldapcert type=group
depend fmri=naming/ldap type=group
depend fmri=security/nss-utilities type=group
depend fmri=service/network/ntp type=group
depend fmri=service/security/kerberos-5 type=group
depend fmri=system/file-system/autofs type=group
depend fmri=system/file-system/nfs type=group
depend fmri=system/network/nis type=group
depend fmri=text/doctools type=group
depend fmri=text/less type=group

In our case we bundle the package in a package archive file that we copy into the image using scp, and then install the group package.  Doing this saves our developers a few minutes in getting what they need deployed, and that's one easy way we can show them value in using the cloud rather than our older lab infrastructure.  It's certainly possible to do much more interesting customizations than this, so experiment and share your ideas; we're looking to make Heat much more useful on Solaris OpenStack as we move ahead.  You can also talk to us at the OpenStack summit in Paris next week; a number of us will be manning the booth at various times when we're not in sessions at the design summit or the conference itself.
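
Our actual repository and package names aren't shown here, but the flow amounts to roughly this; the repository URL, group package FMRI, and guest address are all examples:

pkgrecv -s http://ips.example.com/solaris -a -d devtools.p5p site/group/devtools
scp devtools.p5p root@192.0.2.10:/var/tmp/
ssh root@192.0.2.10 pkg install -g /var/tmp/devtools.p5p site/group/devtools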

Oh, and for those who are interested, the Solaris development cloud is now up past 100 users and has 5 compute nodes deployed.  Still not large by any measure, but it's growing quickly and we're learning more about running OpenStack every day.

Friday Sep 19, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 4

The prior parts of this series discussed the design and deployment of the undercloud nodes on which our cloud is implemented.  Now it's time to configure OpenStack and turn the cloud on.  Over on OTN, my colleague David Comay has published a general getting started guide that does a manual setup based on the OpenStack all-in-one Unified Archive; I recommend at least browsing through that for background that will come in handy as you deal with the inevitable issues that occur in running software with the complexity of OpenStack.  It's even better to run through that single-node setup to get some experience before moving on to trying to build a multi-node cloud.

For our purposes, I needed to script the configuration of a multi-node cloud, and that makes everything more complex, not the least of the problems being that you can't just use the loopback IP address (127.0.0.1) as the endpoint for every service.  We had (compliments of my colleague Drew Fisher) a script for single-system configuration already, so I started with that and hacked away to build something that could configure each component correctly in a multi-node cloud.  That Python script and some associated scripts are available for download.  Here, I'll walk through the design and key pieces.


Before the proper OpenStack configuration process, you'll need to run the script to create some SSH keys.  These are used to secure the Solaris RAD (Remote Administration Daemon) transport that the Solaris Elastic Virtual Switch (EVS) controller uses to manage the networking between the Nova compute nodes and the Neutron controller node.  The script creates evsuser, neutron, and root sub-directories in whatever location you run it, and this location will be referenced later in configuring the Neutron and Nova compute nodes, so you want to put it in a directory that's easily shared via NFS.  You can (and probably should) unshare it after the nodes are configured, though.

Global Configuration

The first part of the script is a whole series of global declarations that parameterize the services deployed on various nodes.  You'll note that the PRODUCTION variable can be set to control the layout used; if its value is False, you'll end up with a single-node deployment.  I have a couple of extra systems that I use for staging, and this makes it easy to replicate the configuration well enough to do some basic sanity testing before deploying changes.

import platform
import socket

MY_NAME = platform.node()
MY_IP = socket.gethostbyname(MY_NAME)

# When set to False, you end up with a single-node deployment
PRODUCTION = True

if PRODUCTION:
    # node names omitted here; fill in the hosts for your site
    KEYSTONE_NODE = ""
    GLANCE_NODE = ""
    CINDER_NODE = ""
else:
    KEYSTONE_NODE = GLANCE_NODE = CINDER_NODE = MY_NAME

Next, we configure the main security elements, the root password for MySQL plus passwords and access tokens for Keystone, along with the URL's that we'll need to configure into the other services to connect them to Keystone.

MYSQL_ROOTPW = "mysqlroot"
ADMIN_PASSWORD = "adminpw"
SERVICE_PASSWORD = "servicepw"

AUTH_URL = "http://%s:5000/v2.0/" % KEYSTONE_NODE
IDENTITY_URL = "http://%s:35357" % KEYSTONE_NODE

The remainder of this section configures specifics of Glance, Cinder,  Neutron, and Horizon.  For Glance and Cinder, we provide the name of the base ZFS dataset that each will be using.  For Neutron, the NIC, VLAN tag, and external network addresses, as well as the subnets for each of the two tenants we are providing in our cloud.  We chose to have one tenant for developers in the organization that is funding this cloud, and a second tenant for other Oracle employees who want to experiment with OpenStack on Solaris; this gives us a way to grossly allocate resources between the two, and of course most go to the tenant paying the bill.  The last element of each tuple in the tenant network list is the number of floating IP addresses to set as the quota for the tenant.  For Horizon, the paths to a server certificate and key must be configured, but only if you're using TLS, and that's only the case if the script is run with PRODUCTION = True.  The SSH_KEYDIR should be set to the location where you ran the script, above.

GLANCE_DATASET = "tank/glance"
CINDER_DATASET = "tank/cinder"

UPLINK_PORT = "aggr0"
if PRODUCTION:
    VXLAN_RANGE = "500-600"
    TENANT_NET_LIST = [("tenant1", "", 10),
                       ("tenant2", "", 60)]
else:
    VXLAN_RANGE = "400-499"
    TENANT_NET_LIST = [("tenant1", "", 5),
                       ("tenant2", "", 5)]

SERVER_CERT = "/path/to/horizon.crt"
SERVER_KEY = "/path/to/horizon.key"

SSH_KEYDIR = "/path/to/generated/keys"

Configuring the Nodes

The remainder of the script is a series of functions that configure each element of the cloud.  You select which element(s) to configure by specifying command-line arguments.  Valid values are mysql, keystone, glance, cinder, nova-controller, neutron, nova-compute, and horizon.  I'll briefly explain what each does below.  One thing to note is that each function first creates a backup boot environment so that if something goes wrong, you can easily revert to the state of the system prior to running the script.  This is a practice you should always use in Solaris administration before making any system configuration changes.  It also saved me a ton of time in development of the cloud, since I could reset within a minute or so every time I had a serious bug.  Even our best re-deployment times with AI and archives are about 10 times that when you have to cycle through network booting.
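
Should a run go badly, recovery is just a matter of activating the backup BE that was created and rebooting; for example (the BE name is illustrative):

beadm list                          # find the backup BE
beadm activate openstack-backup-1   # example name
init 6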


mysql

MySQL must be the first piece configured, since all of the OpenStack services use databases to store at least some of their objects.  This function sets the root password and removes some insecure aspects of the default MySQL configuration.  One key piece is that it removes remote root access; that forces us to create all of the databases in this module, rather than creating each component's database in its associated module.  There may be a better way to do this, but since I'm not a MySQL expert in any way, that was the easiest path here.  On review, it seems like enabling the mysql SMF service should really be moved over into the Puppet manifest from part 3.


keystone

The keystone function does some basic configuration, then calls the sample data script shipped under /usr/demo/openstack/keystone/ to configure users, tenants, and endpoints.  In our deployment I've customized this script a bit to create the two tenants rather than just one, so you may need to make some adjustments for your site; I have not included that customization in the downloaded files.


glance

The glance function configures and starts the various glance services, and also creates the base dataset for ZFS storage; we turn compression on to save on storage for all the images we'll have here.  If you're rolling back and re-running for some reason, this module isn't quite idempotent as written because it doesn't deal with the case where the dataset already exists, so you'd need to use zfs destroy to delete the glance dataset.
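
Done by hand, the dataset creation amounts to something like the following; the pool and dataset name match the GLANCE_DATASET setting shown earlier:

zfs create -o compression=on tank/glance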


cinder

Beyond just basic configuration of the cinder services, the cinder function also creates the base ZFS dataset under which all of the volumes will be created.  We create this as an encrypted dataset so that all of the volumes will be encrypted, which Darren Moffat covers at more length in OpenStack Cinder Volume encryption with ZFS. Here we use pktool to generate the wrapping key and store it in root's home directory.  One piece of work we haven't yet had time to take on is adding our ZFS Storage Appliance as an additional back-end for Cinder.  I'll post an update to cover that once we get it done.  Like the glance function, this function doesn't deal properly with the dataset already existing, so any rollback also needs to destroy the base dataset by hand.
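
The wrapping key generation and encrypted dataset creation look roughly like this; the cipher, key length, and key file location here are illustrative, and Darren's post covers the details:

pktool genkey keystore=file outkey=/root/cinderkey keytype=aes keylen=256
zfs create -o encryption=aes-256-ccm -o keysource=raw,file:///root/cinderkey tank/cinder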

nova_controller & nova_compute

Since our deployment runs the nova controller services separate from the compute nodes, the nova_controller function is run on the controller node to set up the API, scheduler, and conductor services.  If you combine the compute and controller nodes you would run this and then later run the nova_compute function.  The nova_compute function also makes use of a couple of helper functions to set up the ssh configuration for EVS.  For these functions to work properly you must run the neutron function on its designated node before running nova_compute on the compute nodes.


neutron

The neutron setup function is by far the most complex, as we not only configure the neutron services, including the underlying EVS and RAD functions, but also configure the external network and the tenant networks.  The external network is configured as a tagged VLAN, while the tenant networks are configured as VxLANs; you can certainly use VLANs or VxLANs for all of them, but this configuration was the most convenient for our environment.


horizon

For the production case, the horizon function just copies into place an Apache config file that configures TLS support for the Horizon dashboard and the server's certificate and key files.  If you're using self-signed certificates, then the Apache SSL/TLS Strong Encryption: FAQ is a good reference on how to create them.  For the non-production case, this function just comments out the pieces of the dashboard's local settings that enable SSL/TLS support.
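
Putting the whole sequence together: assuming the script is saved as configure_cloud.py (the name here is illustrative), a full build-out runs the pieces in roughly this order, each on its designated node:

./configure_cloud.py mysql            # controller node, always first
./configure_cloud.py keystone         # controller node
./configure_cloud.py glance           # glance node
./configure_cloud.py cinder           # cinder node
./configure_cloud.py nova-controller  # controller node
./configure_cloud.py neutron          # neutron node, before any compute node
./configure_cloud.py nova-compute     # each compute node
./configure_cloud.py horizon          # controller node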

Getting Started

Once you've run through all of the above functions of the script, you have a cloud, and pointing your web browser at http://<your server>/horizon should display the login page, where you can log in as the admin user with the password you configured in the global settings.

Assuming that works, your next step should be to upload an image.  The easiest way to start is by downloading the Solaris 11.2 Unified Archives.  Once you have an archive the upload can be done from the Horizon dashboard, but you'll find it easier to use the upload_image script that I've included in the download.  You'll need to edit the environment variables it sets first, but it takes care of setting several properties on the image that are required by the Solaris Zones driver for Nova to properly handle deploying instances.  Failure to set them is the single most common mistake that I and others have made in the early Solaris OpenStack deployments; when you forget and attempt to launch an instance, you'll get an immediate error, and the details from nova show will include the error:

| fault                                | {"message": "No valid host was 
found. ", "code": 500, "details": "  File 
 line 107, in schedule_run_instance |

When you snapshot a deployed instance with Horizon or nova image-create the archive properties will be set properly, so it's only manual uploads in Horizon or with the glance command that need care.

There's one more preparation task to do: upload an ssh public key that'll be used to access your instances. Select Access & Security from the list in the left panel of the Horizon Dashboard, then select the Keypairs tab, and click Import Keypair.  You'll want to paste the contents of your SSH public key (the .pub file under ~/.ssh/) into the Public Key field, and probably name your keypair the same as your username.

Finally, you are ready to launch instances.   Select Instances in the Horizon Dashboard's left panel list, then click the Launch Instance button.  Enter a name for the instance, select the Flavor, select Boot from image as the Instance Boot Source, and select the image to use in deploying the VM.  The image will determine whether you get a SPARC or x86 VM and what software it includes, while the flavor determines whether it is a kernel zone or non-global zone, as well as the number of virtual CPUs and amount of memory.  The Access & Security tab should default to selecting your uploaded keypair.  You must go to the Networking tab and select a network for the instance.  Then click Launch and the VM will be installed; you can follow progress by clicking on the instance name to see details and selecting the Log tab.  It'll take a few minutes at present; in the meantime you can Associate a Floating IP in the Actions field.  Pick any address from the list offered.  Your instance will not be reachable until you've done this.
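
The same steps can also be done with the CLI clients if you prefer; a rough equivalent, where the image name, key name, network UUID, and addresses are all examples:

neutron net-list                  # find your tenant network's UUID
nova boot --image "sol-11_2-sparc" --flavor 1 --key-name myuser \
    --nic net-id=<tenant-net-uuid> test-sparc
nova floating-ip-create external  # allocate an address from the external pool
nova floating-ip-associate test-sparc 203.0.113.25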

Once the instance has finished installing and reached the Active status, you can login to it.  To do so, use ssh root@<floating-ip-address>, which will login to the zone as root using the key you uploaded above.  If that all works, congratulations, you have a functioning OpenStack cloud on Solaris!

In future posts I'll cover additional tips and tricks we've learned in operating our cloud.  At this writing we're over 60 users and growing steadily, and it's been totally reliable over 3 months, with outages only for updates to the infrastructure.



Tuesday Sep 16, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 3

At the end of Part 2, we built the infrastructure needed to deploy the undercloud systems into our network environment.  However, there's more configuration needed on these systems than we can completely express via Automated Installation, and there's also the issue of how to effectively maintain the undercloud systems.  We're only running a half dozen initially, but expect to add many more as we grow, and even at this scale it's still too much work, with too high a probability of mistakes, to do things by hand on each system.  That's where a configuration management system such as Puppet shows its value, providing us the ability to define a desired state for many aspects of many systems and have Puppet ensure that state is maintained.  My team did a lot of work to include Puppet in Solaris 11.2 and extend it to manage most of the major subsystems in Solaris, so the OpenStack cloud deployment was a great opportunity to start working with another shiny new toy.

Configuring the Puppet Master

One feature of the Puppet integration with Solaris is that the Puppet configuration is expressed in SMF, and then translated by the new SMF Stencils feature to settings in the usual /etc/puppet/puppet.conf file.  This makes it possible to configure Puppet using SMF profiles at deployment time, and the examples in Part 2 showed this for the clients.  For the master, we apply the profile below:

<?xml version="1.0" ?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!--
  This profile configures the Puppet master
-->
<service_bundle type="profile" name="puppet">
  <service version="1" type="service" name="application/puppet">
    <instance enabled="true" name="master">
      <property_group type="application" name="config">
        <propval name="server" value=""/>
        <propval name="autosign" value="/etc/puppet/autosign.conf"/>
      </property_group>
    </instance>
  </service>
</service_bundle>

The interesting setting is the autosign configuration, which allows new clients to have their certificates automatically signed and accepted by the Puppet master.  This isn't strictly necessary, but makes operation a little easier when you have a reasonably secure network and you're not handing out any sensitive configuration via Puppet.  We use an autosign.conf that looks something like:


This means that we're accepting any system that identifies itself as being in our domain.  The main pain with autosigning is that if you reinstall any of the systems and you're using self-generated certificates on the clients, you need to clean out the old certificate before the new one will be accepted; this means issuing a command on the master like:

# puppet cert clean <client hostname>

There are lots of options in Puppet related to certificates and more sophisticated ways to manage them, but this is what we're doing for now.  We have filed some enhancement requests to implement ways of integrating Puppet client certificate delivery and signing with Automated Installation, which would make using the two together much more convenient.

Writing the Puppet Manifests

Next, we implemented a small Mercurial source repository to store the Puppet manifests and modules.  Using a source control system with Puppet is a highly recommended practice, and Mercurial happens to be the one we use for Solaris development, so it's natural for us in this case.  We configure /etc/puppet on the Puppet master as a child repository of the main Mercurial repository, so when we have new configuration to apply it's first checked into the main repository and then pulled into Puppet via hg pull -u, then automatically applied as each client polls the master.  Our repository presently contains the site manifest plus the modules described below.


An example tar file with all of the above is available for download.
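
The day-to-day flow for a configuration change is then just a couple of Mercurial commands; the commit message and path here are examples:

hg commit -m "compute_node: tune link aggregation"
hg push
# then on the Puppet master
cd /etc/puppet && hg pull -u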

The site manifest starts with:

include ntp
include nameservice

The ntp module is the canonical example of Puppet, and is really important for the OpenStack undercloud, as it's necessary for the various nodes to have a consistent view of time in order for the security certificates issued by Keystone to be validated properly.  I'll describe the nameservice module a little later in this post.

Since most of our nodes are configured identically, we can use a default node definition to configure them.  The main piece is configuring Datalink Multipathing (DLMP), which provides us additional bandwidth and higher availability than a single link.  We can't yet configure this using SMF, so the Puppet manifest:

  • Figures out the IP address the system is using with some embedded Ruby
  • Removes the net0 link and creates a link aggregation from net0 and net1
  • Enables active probing on the link aggregation, so that it can detect upstream failures on the switches that don't affect link state signaling (which is also used, and is the only means unless probing is enabled)
  • Configures an IP interface and the same address on the new aggregation link
  • Restricts Ethernet autonegotiation to 1 Gb to work around issues we have with these systems and the switches/cabling we're using in the lab; without this, we get 100 Mb speeds negotiated about 50% of the time, and that kills performance.
You'll note several uses of the require and before statements to ensure the rules are applied in the proper order, as we need to tear down the net0 IP interface before it can be moved into the aggregation, and the aggregation needs to be configured before the IP objects on top of it.
node default {
    $myip = inline_template("<% _erbout.concat(Resolv.getaddress('$fqdn').to_s) %>")

    # Force link speed negotiation to be at least 1 Gb
    link_properties { "net0":
        ensure => present,
        properties => { en_100fdx_cap => "0" },
    }
    link_properties { "net1":
        ensure => present,
        properties => { en_100fdx_cap => "0" },
    }

    link_aggregation { "aggr0" :
        ensure => present,
        lower_links => [ 'net0', 'net1' ],
        mode => "dlmp",
    }
    link_properties { "aggr0":
        ensure => present,
        require => Link_aggregation['aggr0'],
        properties => { 'probe-ip' => "+" },
    }
    ip_interface { "aggr0" :
        ensure => present,
        require => Link_aggregation['aggr0'],
    }
    ip_interface { "net0":
        ensure => absent,
        before => Link_aggregation['aggr0'],
    }
    address_object { "net0":
        ensure => absent,
        before => Ip_interface['net0'],
    }
    address_object { 'aggr0/v4':
        require => Ip_interface['aggr0'],
        ensure => present,
        address => "${myip}/24",
        address_type => "static",
        enable => "true",
    }
}

The controller node declaration includes all of the above functionality, but also adds these elements to keep rabbitmq running and install the mysql database.

    service { "application/rabbitmq" :
        ensure => running,
    }
    package { "database/mysql-55":
        ensure => installed,
    }

The database installation could have been part of the AI derived manifest as well, but it works just as well here and it's convenient to do it this way when I'm setting up staging systems to test builds before we upgrade.

The nameservice Puppet module is shown below.  It's handling both nameservice and RBAC (Role-based Access Control) configuration:

class nameservice {

    dns { "openstack_dns":
        search => [ '' ],
        nameserver => [ '', '' ],
    }

    service { "dns/client":
        ensure => running,
    }

    svccfg { "domainname":
        ensure => present,
        fmri => "svc:/network/nis/domain",
        property => "config/domainname",
        type => "hostname",
        value => "",
    }

    # nameservice switch
    nsswitch { "dns + ldap":
        default => "files",
        host =>  "files dns",
        password => "files ldap",
        group => "files ldap",
        automount => "files ldap",
        netgroup => "ldap",
    }

    # Set user_attr for administrative accounts
    file { "user_attr" :
        path => "/etc/user_attr.d/site-openstack",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/user_attr",
    }

    # Configure zlogin access
    file { "site-zlogin" :
        path => "/etc/security/prof_attr.d/site-zlogin",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/prof_attr-zlogin",
    }

    file { "zlogin-exec" :
        path => "/etc/security/exec_attr.d/site-zlogin",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/exec_attr-zlogin",
    }

    file { "policy.conf" :
        path => "/etc/security/policy.conf",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/policy.conf",
    }
}

You may notice that the nameservice configuration here is exactly the same as what we provided in the SMF profile in part 2.  We include it here because it's configuration we anticipate changing someday and we won't want to re-deploy the nodes.  There are ways we could prevent the duplication, but we didn't have time to spend on it right now and it also demonstrates that you could use a completely different configuration in operation than at deployment/staging time.

What's with the RBAC configuration?

The RBAC configuration is doing two things, the first being configuring the user accounts of the cloud administrators for administrative access on the cloud nodes.  The user_attr file we're distributing confers the System Administrator and OpenStack Management profiles, as well as access to the root role (oadmin is just an example account in this case):

oadmin::::profiles=System Administrator,OpenStack Management;roles=root

As we add administrators, I just need to add entries for them to the above file and they get the required access to all of the nodes.  Note that this doesn't directly provide administrative access to OpenStack's CLIs or its Dashboard; that's configured within OpenStack.
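
For completeness, granting that access on the OpenStack side is a separate step done with the keystone client; the account and tenant names here are examples:

keystone user-role-add --user oadmin --role admin --tenant tenant1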

A limitation of the OpenStack software we include in Solaris 11.2 is that we don't provide the ability to connect to the guest instance consoles, an important feature that's being worked on.  The zlogin User profile is something I created to work around this problem and allow our cloud users to get access to the consoles, as this is often needed in Solaris development and testing.  First, the profile is defined by a prof_attr file with the entry:

zlogin User:::Use zlogin to access guest instance consoles:

We also need an exec_attr file to ensure that zlogin is run with the needed uid and privileges:

 zlogin User:solaris:cmd:RO::/usr/sbin/zlogin:euid=0;privs=ALL

Finally, we modify the RBAC policy file so that all users are assigned to the zlogin User profile:

PROFS_GRANTED=zlogin User,Basic Solaris User

The result of all this is that a user can obtain access to their specific OpenStack guest instance via login to the compute node on which the guest is running, and running a command such as:

$ pfexec zlogin -C instance-0000abcd

At this point we have the undercloud nodes fully configured to support our OpenStack deployment.  In part 4, we'll look at the scripts used to configure OpenStack itself.

Friday Aug 22, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 1

One of the signature features of the recently-released Solaris 11.2 is the OpenStack cloud computing platform.  Over on the Solaris OpenStack blog the development team is publishing lots of details about our version of OpenStack Havana as well as some tips on specific features, and I highly recommend reading those to get a feel for how we've leveraged Solaris's features to build a top-notch cloud platform.  In this and some subsequent posts I'm going to look at it from a different perspective, which is that of the enterprise administrator deploying an OpenStack cloud.  But this won't be just a theoretical perspective: I've spent the past several months putting together a deployment of OpenStack for use by the Solaris engineering organization, and now that it's in production we'll share how we built it and what we've learned so far.

In the Solaris engineering organization we've long had dedicated lab systems dispersed among our various sites and a home-grown reservation tool for developers to reserve those systems; various teams also have private systems for specific testing purposes.  But as a developer, it can still be difficult to find systems you need, especially since most Solaris changes require testing on both SPARC and x86 systems before they can be integrated.  We've added virtual resources over the years as well in the form of LDOMs and zones (both traditional non-global zones and the new kernel zones).  Fundamentally, though, these were all still deployed in the same model: our overworked lab administrators set up pre-configured resources and we then reserve them.  Sounds like pretty much every traditional IT shop, right?  Which means that there's a lot of opportunity for efficiencies from greater use of virtualization and the self-service style of cloud computing.  As we were well into development of OpenStack on Solaris, I was recruited to figure out how we could deploy it to both provide more (and more efficient) development and test resources for the organization as well as a test environment for Solaris OpenStack.

At this point, let's acknowledge one fact: deploying OpenStack is hard.  It's a very complex piece of software that makes use of sophisticated networking features and runs as a ton of service daemons with myriad configuration files.  The web UI, Horizon, doesn't often do a good job of providing detailed errors.  Even the command-line clients are not as transparent as you'd like, though at least you can turn on verbose and debug messaging and often get some clues as to what to look for, though it helps if you're good at reading JSON structure dumps.  I'd already learned all of this in doing a single-system Grizzly-on-Linux deployment for the development team to reference when they were getting started so I at least came to this job with some appreciation for what I was taking on.  The good news is that both we and the community have done a lot to make deployment much easier in the last year; probably the easiest approach is to download the OpenStack Unified Archive from OTN to get your hands on a single-system demonstration environment.  I highly recommend getting started with something like it to get some understanding of OpenStack before you embark on a more complex deployment.  For some situations, it may in fact be all you ever need.  If so, you don't need to read the rest of this series of posts!

In the Solaris engineering case, we need a lot more horsepower than a single-system cloud can provide.  We need to support both SPARC and x86 VM's, and we have hundreds of developers so we want to be able to scale to support thousands of VM's, though we're going to build to that scale over time, not immediately.  We also want to be able to test both Solaris 11 updates and a release such as Solaris 12 that's under development so that we can work out any upgrade issues before release.  One thing we don't have is a requirement for extremely high availability, at least at this point.  We surely don't want a lot of down time, but we can tolerate scheduled outages and brief (as in an hour or so) unscheduled ones.  Thus I didn't need to spend effort on trying to get high availability everywhere.

The diagram below shows our initial deployment design.  We're using six systems, most of which are x86 because we had more of those immediately available.  All of those systems reside on a management VLAN and are connected with a two-way link aggregation of 1 Gb links (we don't yet have 10 Gb switching infrastructure in place, but we'll get there).  A separate VLAN provides "public" (as in connected to the rest of Oracle's internal network) addresses, while we use VxLANs for the tenant networks.

Solaris cloud diagram

One system is more or less the control node, providing the MySQL database, RabbitMQ, Keystone, and the Nova API and scheduler as well as the Horizon console.  We're curious how this will perform and I anticipate eventually splitting at least the database off to another node to help simplify upgrades, but at our present scale this works.

I had a couple of systems with lots of disk space, one of which was already configured as the Automated Installation server for the lab, so it's just providing the Glance image repository for OpenStack.  The other node with lots of disks provides the Cinder block storage service; we also have a ZFS Storage Appliance that will help back-end Cinder in the near future, I just haven't had time to get it configured yet.

There's a separate system for Neutron, which is our Elastic Virtual Switch controller and handles the routing and NAT for the guests.  We don't have any need for firewalling in this deployment so we're not doing so.  We presently have only two tenants defined, one for the Solaris organization that's funding this cloud, and a separate tenant for other Oracle organizations that would like to try out OpenStack on Solaris.  Each tenant has one VxLAN defined initially, but we can of course add more.  Right now we have just a single /24 network for the floating IPs; once we get demand up to where we need more, we'll add them.

Finally, we have started with just two compute nodes; one is an x86 system, the other is an LDOM on a SPARC T5-2.  We'll be adding more when demand reaches the level where we need them, but as we're still ramping up the user base it's less work to manage fewer nodes until then.

My next post will delve into the details of building this OpenStack cloud's infrastructure, including how we're using various Solaris features such as Automated Installation, IPS packaging, SMF, and Puppet to deploy and manage the nodes.  After that we'll get into the specifics of configuring and running OpenStack itself.

Tuesday Jan 31, 2012

Detroit Solaris 11 Forum, February 8

I'm just posting this quick note to help publicize the Oracle Solaris 11 Technology Forum we're holding in the Detroit area next week.  There's still time to register and come get a half-day overview of the great new stuff in Solaris 11.  The "special treat" that's not mentioned in the link is that I'll be joining Jeff Victor as a speaker.  Looking forward to being back in my home state for a quick visit, and hope I'll see some old friends there!

Tuesday Nov 15, 2011

Solaris 11 Technology Forums, NYC and Boston

By now you're certainly aware that we released Solaris 11; I was on vacation during the launch so haven't had time to write any material related to the Solaris 11 installers, but will get to that soon.  Following onto the release, we're scheduling events in various locations around the world to talk about some of the key new features in Solaris 11 in more depth.  In the northeast US, we've scheduled technology forums in New York City on November 29, and Burlington, MA on November 30.  Click on those links to go to the detailed info and registration.  I'll be one of the speakers at both of them, so hope to see you there!

Friday Dec 18, 2009

Big 2009 Finish for OpenSolaris Installation

As the end of 2009 approaches, there are a bunch of recent developments in the OpenSolaris installation software that I want to highlight.  All of the below will appear in OpenSolaris development build 130, due in the next few days.

First up is the addition of iSCSI support to Automated Installation (or AI).  You can now specify an iSCSI target for installation in the AI manifest.  It'll work on both SPARC and x86, provided you have firmware that can support iSCSI boot; on SPARC you'll need a very recent OBP patch to enable this support.  Official docs are in the works, but the design document should have enough info to piece it together if you're interested.

Next is the bootable AI image, which allows use of AI in a number of additional scenarios.  Probably the most generally interesting one is that you can now install OpenSolaris on SPARC without setting up an AI server first, by using the default AI manifest that's included on the ISO image.  One caveat is that the default manifest installs from the release repository; due to ZFS version changes between 2009.06 and present, this results in an installation that won't boot.  You'll want to make a copy of the default manifest, change the main url for the ai_pkg_repo_default_authority element to point to the development repository, and put the modified manifest at a URL that you can supply to the AI client once it boots.  Alok's mail and blog entry have more details.

Building on bootable AI, we've extended the Distribution Constructor (or DC) with a project known as Virtual Machine Constructor (VMC).  Succinctly, it extends DC to construct a pre-built virtual machine image that can be imported into hypervisors that support OVF 1.0, such as VirtualBox or VMware.  Glenn's mail notes a few limitations that will be addressed in the next few builds.  Anyone interested in building virtualization-heavy infrastructures should find this quite useful.

Finally, one more barrier to adding OpenSolaris on x86 to a system that's multi-booted with other OS's has fallen with the addition of extended partition support to both the live CD GUI installer and Automated Installation.  You can now install OpenSolaris into a logical partition carved from the extended partition.  Jean's mail has some brief notes on how to use this new feature.  I should also note at this point that a couple of builds ago the parted command and GParted GUI were added to the live CD, so the more complex preparations sometimes needed to free up space for OpenSolaris can now be done directly from the CD.

I'd like to thank my team for all the hard work that went into all of the above; they accomplished all of it with precious little help from me, as I spent most of the past three months either traveling around talking to literally hundreds of customers or working on architecture and design tasks.  Speaking of those, the review of the installer architecture is open, and I've also just this week posted the first draft design for AI service management improvements.

What's next?  Well, that will be the topic of my next post in early January.  It's time for a vacation!


I'm the architect for Solaris deployment and system management, with a lot of background in networking on the side. I spend a lot of my time currently operating Solaris Engineering's OpenStack cloud. I am co-author of the OpenSolaris Bible (Wiley, 2009). I also play a lot of golf.

