Wednesday Feb 03, 2016

Troubleshooting I/O Load in a Solaris OpenStack Cloud

The other day we noticed that a number of guests running on our Solaris OpenStack cloud were very lethargic, with even simple operations like listing boot environments with beadm taking a long time, let alone more complex things like installing packages.  It was easy to rule out CPU load on the compute nodes, as prstat quickly showed me that load averages were not at all high on the compute nodes hosting my guests.

The next suspect was disk I/O.  When you're running on bare metal with local disks, this is pretty easy to check; things like zpool status, zpool iostat, and iostat provide a pretty good high-level view of what's going on, and tools like prstat or truss might help identify the culprit.  In an OpenStack cloud it's not nearly that simple.  It's all virtual machines, the "disks" are ZFS volumes served up as LUNs by an iSCSI target, and with a couple hundred VMs running on a bunch of compute nodes, you're now searching for the needle(s) in a haystack.  Where to start?

First I confirmed that we didn't have any funny business going on with the network.  We've had issues with the ixgbe NICs falling back to 100 Mb with our switch occasionally, but dladm show-aggr -x aggr0 confirmed that wasn't an issue on either the Cinder node or the compute nodes.  My next step was to look at I/O on the Cinder volume server.  We're not yet running this on a ZFS Storage Appliance, just a Solaris box with a pile of disks, so no analytics BUI yet.  A quick look with zpool iostat showed we're pretty busy:
$ zpool iostat 30
          capacity     operations    bandwidth
pool   alloc   free   read  write   read  write
-----  -----  -----  -----  -----  -----  -----
rpool   224G  54.2G      0     46      0   204K
tank   7.17T  25.3T  1.10K  5.53K  8.78M  92.4M
But are we really that loaded as far as the disks are concerned?
$ iostat -xnzc 30

 us sy st id
  0 12  0 88
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   57.9  272.6  769.1 10579.3  0.0  4.7    0.0   14.1   0  88 c1t0d0
    0.0    8.9    0.0   78.1  0.0  0.0    0.0    2.1   0   1 c1t1d0
   54.6  256.9  759.7 10475.1  0.0  4.6    0.0   14.8   0  90 c1t2d0
   59.1  283.3  749.9 10070.2  0.0  4.3    0.0   12.5   0  87 c1t3d0
   76.9  224.9 1172.9 10275.5  0.0  5.4    0.0   18.0   0  96 c1t4d0
   51.7  259.8  784.0 10579.2  0.0  4.2    0.0   13.6   0  93 c1t5d0
   47.3  275.2  781.3 10109.4  0.0  4.3    0.0   13.4   0  93 c1t6d0
   49.6  262.0  766.2 10416.9  0.0  4.7    0.0   15.1   0  92 c1t7d0
   66.3  243.2 1163.7 10172.9  0.0  5.3    0.0   17.1   0  97 c1t8d0
   57.1  269.1  814.6 10584.6  0.0  4.2    0.0   13.0   0  90 c1t9d0
   62.3  272.2  820.1 10298.4  0.0  4.0    0.0   12.0   0  87 c1t10d0
   55.1  247.3  791.9 10558.6  0.0  4.6    0.0   15.3   0  91 c1t11d0
   63.5  221.6 1166.7 10559.2  0.0  5.7    0.0   20.0   0 100 c1t12d0
    0.0    9.1    0.0   78.1  0.0  0.0    0.0    4.9   0   1 c1t13d0
Yeah, that looks heavy; all of the disks in the tank pool are above 87% busy and it wasn't changing much over several minutes of monitoring, so it's not just a brief burst of activity.  Note that I don't have continuous monitoring set up for this data but past quick looks had shown we were typically about half those bandwidth numbers from zpool iostat and disks were in the 50-60% busy range.  So we have an unusually heavy load; but where's it coming from, and is it expected activity from something like a source build or something that's gone runaway?  This is where the problem gets challenging; we've got a bunch of compute nodes hammering this over iSCSI, so  we need to figure out which ones to look at.
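Eyeballing the %b column works for a dozen disks, but it's easy to script.  Here's a minimal Python sketch that pulls the busy devices out of iostat -xn output; the busy_devices name and the 85% default threshold are my own choices for illustration, not part of any tool, and the sample is just a few rows from the output above.

```python
def busy_devices(iostat_text, threshold=85.0):
    """Return (device, %b) pairs at or above threshold from `iostat -xn` output."""
    busy = []
    for line in iostat_text.splitlines():
        fields = line.split()
        # Data rows have 11 columns: ten numeric statistics, then the device name.
        if len(fields) != 11:
            continue
        try:
            pct_busy = float(fields[9])
        except ValueError:
            continue  # the "r/s w/s ... device" header row
        if pct_busy >= threshold:
            busy.append((fields[10], pct_busy))
    return busy


sample = """\
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   57.9  272.6  769.1 10579.3  0.0  4.7    0.0   14.1   0  88 c1t0d0
    0.0    8.9    0.0   78.1  0.0  0.0    0.0    2.1   0   1 c1t1d0
   63.5  221.6 1166.7 10559.2  0.0  5.7    0.0   20.0   0 100 c1t12d0
"""

for device, pct in busy_devices(sample):
    print(device, pct)
```

Feeding it successive 30-second samples would show whether the load is sustained or just a burst.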

Fortunately, my friends over at have a nice little DTrace script called iscsiio.d for breaking down iSCSI traffic at the target end.  Running it for less than a minute got this output:
# dtrace -s /var/tmp/iscsiio.d
Tracing... Hit Ctrl-C to end.
   REMOTE IP                     EVENT    COUNT     Kbytes     KB/sec
                                  read       56        238         11
                                  read      631       2835        132
                                  read      636       3719        173
                                  read      671       3561        166
                                 write      869       7330        342
                                 write     2200      23153       1080
                                  read     2273      13422        626
                                 write     3730      69042       3221
                                 write     4236      47745       2227
                                 write     5997     108624       5068
                                 write     6803     105302       4913
                                  read     8152      45458       2121
                                 write     9557     130043       6068
                                  read    46666     372022      17359
Wow, a lot of read traffic on that last one.  So I logged into that node and ran:
$ iostat -xnzc 30

 us sy st id
  0  4  0 96
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    4.3    0.0   27.6  0.0  0.0    0.0    2.3   0   0 c0t5000CCA03C31E530d0
    0.0    5.0    0.0   27.6  0.0  0.0    0.0    1.8   0   0 c0t5000CCA03C2DB384d0
    0.2    2.3    1.1    7.9  0.0  0.6    0.0  256.4   0  19 c0t600144F07DA70300000055DE259000B9d0
    0.2    2.4    1.1    5.4  0.0  1.1    0.0  421.1   0  24 c0t600144F07DA70300000056548EDA0080d0
    5.5    4.0 4543.6   18.8  0.0  3.5    0.0  363.5   0 100 c0t600144F07DA70300000055706B06000Dd0
    0.8   12.1    4.5   68.1  0.0  1.7    0.0  133.7   0  43 c0t600144F07DA7030000005655824D0085d0
    8.1    7.1 6127.3  443.2  0.0  6.4    0.0  417.8   0 100 c0t600144F07DA703000000555C7D48003Bd0
    0.9   11.6    4.7  110.0  0.0  2.0    0.0  157.8   0  58 c0t600144F07DA703000000564EB7950065d0
    0.0    0.7    0.0    2.1  0.0  0.4    0.0  550.8   0   7 c0t600144F07DA70300000056A9F12A0121d0
    0.2    2.2    1.1    6.2  0.0  0.5    0.0  215.4   0  12 c0t600144F07DA70300000056A8CB070112d0
    0.0    0.0    0.0    0.0  0.0  0.1    0.0    0.0   0   2 c0t600144F07DA703000000569558FE00BBd0
    0.2    2.4    1.1    6.5  0.0  0.2    0.0   75.6   0   7 c0t600144F07DA7030000005695379300B6d0
    0.4    4.9    2.3   12.3  0.0  0.9    0.0  163.0   0  34 c0t600144F07DA703000000569522F800B4d0
    0.4    2.8    2.3   14.6  0.0  0.5    0.0  158.9   0   7 c0t600144F07DA70300000056A1722900F4d0
    0.6    7.6    3.4   40.0  0.0  1.2    0.0  140.0   0  27 c0t600144F07DA70300000056A652E700FBd0
    0.2    2.6    1.1    6.6  0.0  0.0    0.0    0.6   0   0 c0t600144F07DA70300000056B02C5F0150d0
There are two LUNs that have a *lot* of read traffic going.  Which kernel zones are responsible?  That requires help from OpenStack to map it back to the responsible guest.  Back on the target, we get COMSTAR to tell us what a LUN maps to:
$ stmfadm list-lu -v 600144F07DA703000000555C7D48003B
LU Name: 600144F07DA703000000555C7D48003B
    Operational Status     : Online
    Provider Name          : sbd
    Alias                  : /dev/zvol/rdsk/tank/cinder/volume-ed69313a-8954-405b-b89d-a11e5723a88f
    View Entry Count       : 1
    Data File              : /dev/zvol/rdsk/tank/cinder/volume-ed69313a-8954-405b-b89d-a11e5723a88f
    Meta File              : not set
    Size                   : 107374182400
    Block Size             : 512
    Management URL         : not set
    Software ID            : not set
    Vendor ID              : SUN     
    Product ID             : COMSTAR         
    Serial Num             : not set
    Write Protect          : Disabled
    Write Cache Mode Select: Enabled
    Writeback Cache        : Enabled
    Access State           : Active
Now that I have the volume name, I can feed the UUID portion of that name to Cinder:
$ cinder show ed69313a-8954-405b-b89d-a11e5723a88f
|              attachments              | [{u'device': u'c1d0', 
u'server_id': u'5751bd18-78aa-4387-be55-8e395aa081eb', 
u'id': u'ed69313a-8954-405b-b89d-a11e5723a88f', 
u'host_name': None, u'volume_id': u'ed69313a-8954-405b-b89d-a11e5723a88f'}] |

(I've wrapped the above output to aid its readability) That maps to a server_id I can provide to nova show to identify the zone.

$ nova show 5751bd18-78aa-4387-be55-8e395aa081eb

I've elided that output as it's private info.  From there, I'm able to talk to the owner, use zlogin to inspect it, etc.
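The string handling in that chain is simple enough to script.  Here's a minimal Python sketch of the two text-munging steps: getting the LU name out of the device name that iostat reports, and getting the Cinder volume UUID out of the stmfadm output.  The function names are made up for illustration, and the regular expressions assume the name formats shown above (a 32-digit hex GUID between "t" and "d0", and a volume-<UUID> data file).

```python
import re


def lu_name_from_device(device):
    """Extract the 32-hex-digit LU name from a device name like c0t<GUID>d0."""
    match = re.fullmatch(r"c\d+t([0-9A-Fa-f]{32})d\d+", device)
    return match.group(1).upper() if match else None


def volume_uuid_from_stmfadm(output):
    """Pull the Cinder volume UUID out of the 'Data File' line of stmfadm output."""
    match = re.search(r"Data File\s*:\s*\S*/volume-([0-9a-f-]{36})\b", output)
    return match.group(1) if match else None


stmfadm_sample = (
    "LU Name: 600144F07DA703000000555C7D48003B\n"
    "    Data File              : /dev/zvol/rdsk/tank/cinder/"
    "volume-ed69313a-8954-405b-b89d-a11e5723a88f\n"
)

print(lu_name_from_device("c0t600144F07DA703000000555C7D48003Bd0"))
print(volume_uuid_from_stmfadm(stmfadm_sample))
```

Local disks such as c0t5000CCA03C31E530d0 have shorter names and return None, which conveniently filters them out.  The UUID is what you hand to cinder show.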

It turned out that the two problem zones were running a test OpenStack Swift deployment that seemed to have run away a bit; we shut them down and voila, we were back to nominal performance for the rest of the cloud.

Wednesday Sep 09, 2015

Configuring and Managing OpenStack Nova's Image Cache


One piece of configuration that we had originally neglected in configuring Solaris Engineering's OpenStack cloud was the cache that Nova uses to store downloaded Glance images in the process of deploying each guest.  For many sites, this wouldn't likely be a big problem as they will often use a small set of images for a long period of time, so the cache won't grow much.  In our case, that's definitely not true; we generate new Solaris images every two weeks for each release that is under development, and we usually have two of those (an update release and the next minor release), so we are introducing new images every week (and that's multiplied by having images for SPARC and x86; yes, our Glance catalog is pretty large, a topic for a future post).  And of course, the most recent images are usually the most used, so usually a particular image will be out of favor within a month or two.  We recently reached the point where the cache became a problem on our compute nodes, so here we'll share the solution we're using until the solariszones driver in OpenStack gets the smarts to handle this.

By default, the solariszones driver uses /var/lib/nova/images as its download cache for Glance images.  This has two problems: it's not quota'ed, and it's inside the boot environment's /var dataset, meaning that as we update the compute node's OS, each new boot environment (or BE) will create a snapshot that refers to whatever is there, and so it becomes difficult to actually free up space; you need to delete both the image and any old BE snapshots that refer to it.  There's also another related problem, in that attempting to create an archive of the system using archiveadm(1M) will also capture all of these images and bloat the archive to Godzilla size.  Thus, our solution needs to move the cache outside of the boot environment hierarchy.  All together, there are three pieces:

  1. Create a dataset and set a quota on it
  2. Introduce an SMF service to manage the cache against the quota
  3. Reconfigure Nova to use the new cache dataset

The SMF Service

For this initial iteration, I've built a really simple service with lots of hard-coded policy.  A real product-quality solution would be more configurable and such, but right now I'm in quick & dirty mode as we have bigger problems we're working on.  Here's the script at the core of the service:


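#!/bin/ksh

# Nova image cache cleaner
# Presently all hard-coded policy; should be configured via SMF properties

# Dataset and image directory.  These two definitions are missing from the
# listing; the values are inferred from the dataset layout used in this post.
cacheds=rpool/export/nova
cachedir=/export/nova/images

quota=$(zfs get -pH -o value quota ${cacheds})

(( quota == 0 )) && exit 0

(( limit=quota-2*1024*1024*1024 ))

size=$(zfs get -pH -o value used ${cacheds})

# Delete images until we are at least 2 GB below quota; always delete oldest

while (( size > limit )); do
    oldest=$(ls -tr ${cachedir}|head -1)
    # Nothing left to delete; stop rather than loop forever
    [[ -z ${oldest} ]] && break
    rm -v ${cachedir}/${oldest}
    size=$(zfs get -pH -o value used ${cacheds})
done
exit 0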
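
The policy in that script is easier to test when it's separated from the zfs and rm calls.  Here's a minimal Python sketch of just the eviction decision, under one simplifying assumption: it subtracts each image's file size from the usage figure, where the script re-reads the dataset's used property after every delete.  The images_to_evict name and its arguments are hypothetical.

```python
GIB = 1024 ** 3


def images_to_evict(images, used, quota, headroom=2 * GIB):
    """Pick oldest-first images to delete until usage is at or below quota - headroom.

    images is an iterable of (name, mtime, size_in_bytes) tuples.
    """
    if quota == 0:  # no quota set, so nothing to enforce
        return []
    limit = quota - headroom
    evicted = []
    for name, _mtime, size in sorted(images, key=lambda img: img[1]):
        if used <= limit:
            break
        evicted.append(name)
        used -= size
    return evicted


cache = [("newest", 300, 3 * GIB), ("oldest", 100, 1 * GIB), ("middle", 200, 2 * GIB)]
print(images_to_evict(cache, used=15 * GIB, quota=15 * GIB))
```

With a 15 GB quota the limit is 13 GB, so a full cache loses its two oldest images in this example and keeps the newest.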

To run something like this you'd traditionally use cron, but we have the new SMF periodic services at our disposal, and that's what we'll use here; the script will be run hourly, and gets to run for however long it takes (which should be but a second or two, but no reason to be aggressive).

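<?xml version="1.0" ?>
<!DOCTYPE service_bundle
  SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<!-- Bundle/service names and the periodic_method element were lost from this
     listing; they are reconstructed from the text (hourly, no timeout). -->
<service_bundle type="manifest" name="site/nova-cache-cleanup">
    <service version="1" type="service" name="site/nova-cache-cleanup">
        <!-- The following dependency keeps us from starting unless
             nova-compute is also enabled -->
        <dependency restart_on="none" type="service"
            name="nova-compute" grouping="require_all">
            <service_fmri value="svc:/application/openstack/nova/nova-compute"/>
        </dependency>
        <periodic_method period="3600" delay="0" jitter="0"
            exec="/lib/svc/method/nova-cache-cleanup" timeout_seconds="0"/>
        <instance enabled="false" name="default"/>
        <template>
            <common_name>
                <loctext xml:lang="C">
                        OpenStack Nova Image Cache Management
                </loctext>
            </common_name>
            <description>
                <loctext xml:lang="C">
                        A periodic service that manages the Nova image cache
                </loctext>
            </description>
        </template>
    </service>
</service_bundle>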

To deliver it, we package it up in IPS:

set name=pkg.fmri value=pkg://site/nova-cache-cleanup@0.1
set name=pkg.summary value="Nova compute node cache management"
set name=pkg.description value="Nova compute node glance cache cleaner"
file nova-cache-cleanup.xml path=lib/svc/manifest/site/nova-cache-cleanup.xml owner=root group=bin \
    mode=0444 restart_fmri=svc:/system/manifest-import:default
file nova-cache-cleanup path=lib/svc/method/nova-cache-cleanup owner=root group=bin mode=0755

Use pkgsend(1) to publish all of that into your local IPS repository.

Creating the Dataset

Since we use Puppet to handle the day-to-day management of our cloud, I updated our compute_node class to create the dataset and distribute the package.  Here's what that looks like:

# Configuration custom to compute nodes
class compute_node {

    # Customize kernel settings
    file { "site:cloud" :
        path => "/etc/system.d/site:cloud",
        owner => "root",
        group => "root",
        mode => 444,
        source => "puppet:///modules/compute_node/site:cloud",
    }

    # Create separate dataset for nova to cache glance images
    zfs { "rpool/export/nova" :
        ensure => present,
        mountpoint => "/export/nova",
        quota => "15G",
    }

    file { "/export/nova" :
        ensure => directory,
        owner => "nova",
        group => "nova",
        mode => 755,
        require => Zfs['rpool/export/nova'],
    }

    # Install service to manage cache and enable it
    package { "pkg://site/nova-cache-cleanup" :
        ensure => installed,
    }

    service { "site/nova-cache-cleanup" :
        ensure => running,
        require => Package['pkg://site/nova-cache-cleanup'],
    }
}

It's important that the directory have the correct owner and permissions; the solariszones driver will create the /export/nova/images directory within it to store the images, and the nova-compute service runs as nova/nova.  Get this wrong and guests will just plain fail to deploy.  The kernel settings referred to above are unrelated to this posting; they set user_reserve_hint_pct so that the ZFS ARC is managed appropriately for running lots of zones (we're using a value of 90 percent as our compute nodes are beefy - 256 or 512 GB).

Reconfiguring Nova

Once you have all that in place and the Puppet agent has run successfully to create the dataset and install the package, it's just a small bit of Python to point Nova at the new cache on all the compute nodes:


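import iniparse
from subprocess import check_call

NOVA_CONF = '/etc/nova/nova.conf'

def configure_cache():
    ini = iniparse.ConfigParser()
    # Load the existing settings so that rewriting the file preserves them
    ini.read(NOVA_CONF)
    # Point the cache at the images directory in the /export/nova dataset
    ini.set('DEFAULT', 'glancecache_dirname', '/export/nova/images')
    with open(NOVA_CONF, 'w') as fh:
        ini.write(fh)
    check_call(["/usr/sbin/svcadm", "restart", "nova-compute"])

if __name__ == "__main__":
    configure_cache()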

I used cssh(1) to run this at once on all the compute nodes, but you could also do this with Puppet.  Ideally we'd package nova.conf or use Puppet's OpenStack modules to do this part, but the packaging solution doesn't work due to a couple of node-specific items in nova.conf and we don't yet have Puppet OpenStack available in Solaris (stay tuned!).

The final step once you've done all the above is to take out the trash from your old image cache:

rm -rf /var/lib/nova/images

Again, if you have older BEs around, this won't free up all the space yet; you'll need to remove the old BEs, but be careful not to leave yourself with just one!

Friday Jul 10, 2015

Upgrading Solaris Engineering's OpenStack Cloud

The Solaris 11.3 Beta release includes an update to the bundled OpenStack packages from the Havana version to Juno[1].  Over on the OpenStack blog my colleague Drew Fisher has a detailed post that looks under the covers at the work the community and our Solaris development team did to make this major upgrade as painless as possible.  Here, I'll talk about applying that upgrade technology from the operations side, as we recently performed this upgrade on the internal cloud that we're operating for Solaris engineering.  See my series of posts from last year on how our cloud was initially constructed.

Our Upgrade Process

The first thing to understand about our upgrade process is that, since the Solaris Nova driver as yet lacks live migration support, we can't upgrade compute nodes without an outage for the guest instances.  We also don't yet have an HA configuration deployed for the database and all the services, so those also require outages to upgrade[2].  Therefore, all of our upgrades have downtime scheduled for the entire cloud and we attempt to upgrade all the nodes to the same build.  We typically schedule two hours for upgrades.  If everything were to go smoothly we could be done in less than 30 minutes, but it never works out that way, at least so far.

Right now, we're still doing the upgrades fairly manually, with a small script that we run on each node in turn.  That script looks something like:

# shut down puppet so that patches don't get pulled before they are required
svcadm disable -t puppet:agent
# shut down zones so update goes more quickly, use synchronous to wait for this
svcadm disable -ts zones
# Disable nova API and BUI; we use temporary for API so it will
# come back on reboot but persistent for BUI so that it's not available
# until we're ready to end the outage.
# ($node is assumed to have been set to this node's name)
if [[ $node == "cloud-controller" ]]; then
        svcadm disable -t nova-api-osapi-compute
        svcadm disable apache22
        # Dump database for disaster recovery
        mysqldump --user=root --password='password' --add-drop-database --all-databases >/tank/all_databases.sql
fi
pkg update --be-name solaris11_3 -C 5

Once the script completes, we can reboot the system.  The comment above about Puppet relates to specifics in how we are using it; since we sometimes have bugs in the builds that we can work around, we typically use Puppet to distribute those workarounds, but we don't want them to take effect until we've rebooted into the new boot environment.  There's almost certainly a better way to do this, we're just not that smart yet ;-)

We run the above script on all the nodes in parallel, which is fine to do because the upgrade is always creating a new boot environment and we'll wait until all of the core service nodes (keystone, cinder, nova controller, neutron, glance) are done before we reboot any of them.  We don't necessarily wait for all of the compute nodes since they can take longer if any of the guests are non-global zones, and they are the last thing we reboot anyway.

Once the updates are complete, we reboot nodes in the following order:

  1. Nova controller - MySQL, RabbitMQ, Keystone, Nova api's, Heat
  2. Neutron controller
  3. Cinder controller
  4. Glance
  5. Compute nodes

This order minimizes disruptions to the services connecting to RabbitMQ and MySQL, which have been a point of fragility for many operators of OpenStack clouds.  It also ensures that the compute nodes don't see disruptions to iSCSI connections for running zones, which we've seen occasionally lead to ZFS pools ending up in a suspended state.  As we build out the cloud we'll be separating the functions that are in the Nova controller into separate instances, which will necessitate some adjustments to this sequencing, but the basic idea is to work from the database to rabbitmq to keystone to the nova services.
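
If you wanted to drive those reboots from a script rather than by hand, the ordering is just a sort on each node's role.  Here's a minimal Python sketch; the role labels and host names are hypothetical stand-ins, not the names we actually use.

```python
# Roles in the order they should be rebooted, per the list above.
REBOOT_ORDER = ["nova-controller", "neutron", "cinder", "glance", "compute"]


def reboot_sequence(node_roles):
    """Return node names sorted by role order, with ties broken by node name."""
    rank = {role: position for position, role in enumerate(REBOOT_ORDER)}
    return sorted(node_roles, key=lambda node: (rank[node_roles[node]], node))


# Hypothetical inventory mapping host name to role.
nodes = {
    "c2": "compute",
    "ctl": "nova-controller",
    "img": "glance",
    "c1": "compute",
    "net": "neutron",
    "vol": "cinder",
}

print(reboot_sequence(nodes))
```

A role that isn't in REBOOT_ORDER raises a KeyError, which seems preferable to quietly rebooting an unknown node at the wrong time.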

Verifying the Upgrade Worked

Once we've rebooted the nodes we run a couple of quick tests to launch both SPARC and x86 guests, ensuring that basically all of the machinery is working.  I've started doing this with a fairly simple Heat template:

heat_template_version: 2013-05-23

description: >
  HOT template to deploy SPARC & x86 servers as a quick sanity test

# Section keys and resource names were lost from this listing; they are
# restored from the get_param/get_resource/get_attr references below.
# The server and output names are assumed.
parameters:
  x86_image:
    type: string
    description: Name of image to use for x86 server
  sparc_image:
    type: string
    description: Name of image to use for SPARC server

resources:
  x86_server1:
    type: OS::Nova::Server
    properties:
      name: test_x86
      image: { get_param: x86_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: x86_server1_port }

  x86_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  x86_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: x86_server1_port }

  sparc_server1:
    type: OS::Nova::Server
    properties:
      name: test_sparc
      image: { get_param: sparc_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: sparc_server1_port }

  sparc_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  sparc_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: sparc_server1_port }

outputs:
  x86_server1_public_ip:
    description: Floating IP address of x86 server in public network
    value: { get_attr: [ x86_server1_floating_ip, floating_ip_address ] }
  sparc_server1_public_ip:
    description: Floating IP address of SPARC server in public network
    value: { get_attr: [ sparc_server1_floating_ip, floating_ip_address ] }

Once that test runs successfully, we declare the outage over and re-enable the Apache service to restore access to the Horizon dashboard.

Our Upgrade Experiences

Since we went into production almost a year ago, we've upgraded the entire cloud infrastructure, including the OpenStack packages, seven times.  Had we met our goals we would have done the upgrade every two weeks as each full Solaris development build is released internally (and thus would have done over 20 upgrades), but the reality of running at the bleeding edge of the operating system's development is that we find bugs, and we've had several that were too serious and too difficult to work around to undertake upgrades, so we've had to delay a number of times while we waited for fixes to integrate. Through all of this, we've learned a lot and are continually refining our upgrade process.

So far, we've only had one upgrade over the last year that was unsuccessful, and that was reasonably painless, since we just re-activated the old boot environment on each node and rebooted back to it.  We now pre-stage each upgrade on a single-node stack that's configured similarly to the production cloud to verify there aren't any truly catastrophic problems with kernel zones, ZFS, or networking.  That's mostly been successful, but we're going to build a small multi-node cloud for the staging to ensure that we can catch issues in additional areas such as iSCSI that aren't exercised properly by the single-node stack.  The lesson, as always, is to have your test environment replicate production as closely as possible.

For this particular upgrade, we did a lot more testing; I spent the better part of two weeks running trial upgrades from Havana to Juno to shake out issues there, which allowed the development team to fix a bunch of bugs in the configuration file and database upgrades before we went ahead with the actual upgrade.  Even so, the production upgrade was more of an adventure than we expected.  We ran into three issues:

  1. After we rebooted the controller node, the heat-db service went into maintenance.  The database had been corrupted by the service exceeding its start method timeout, which caused SMF to kill and restart it, and that apparently happened at a very inopportune time.  Fortunately we had made little use of heat with Havana and we could simply drop the database and recreate it.  The SMF method timeout is being fixed (for heat-db and other services), though that fix isn't in the 11.3 beta release.  We're also having some discussion about whether SMF should generally default to much longer start method timeouts.  We find that developers are consistently overly optimistic about the true performance of systems in production and short timeouts of 30 seconds or 1 minute that are often used are more likely to cause harm than good.
  2. The puppet:master service went into maintenance when that node was rebooted.  With truss we determined that for some reason it was attempting to kill labeld, failing, and exiting.  This is still being investigated, as we've had difficulty reproducing it.  Fortunately, disabling labeld worked around the problem and we were able to proceed.
  3. After we had resolved the above issues, the test launches we use to verify the cloud is working would not complete - they'd be queued but not actually happen.  This took us over an hour to diagnose, in part because we're not that experienced with RabbitMQ issues; to that point it had "just worked".  It turned out that we were victims of the default file descriptor limit for RabbitMQ, at 256, being too low to handle all of the connections from the various services using it.  Apparently Juno is just more resource-hungry in this respect than Havana, and it's not something we could have observed in the smaller test environment.  Adding a "ulimit -n 1024" to the rabbitmq start method worked around this for now; this has sparked some internal discussion on whether the default limits should be increased, as yet unresolved.  The values are relics from many years ago and likely could use some updating.
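
For the file descriptor problem, what we actually did was add ulimit -n to the start method.  To show the same idea from inside a process, here's a minimal Python sketch using the standard resource module: it checks the soft limit on open files and raises it toward a minimum, capped by the hard limit.  The raise_nofile_limit name is made up, and this isn't how RabbitMQ itself manages its limits.

```python
import resource


def raise_nofile_limit(minimum=1024):
    """Raise this process's soft open-file limit to at least `minimum`.

    The soft limit is never pushed past the hard limit; returns the soft
    limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        target = minimum
    else:
        target = min(minimum, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft


print(raise_nofile_limit(1024))
```

Logging that value at service startup would have made the 256 limit obvious a lot sooner than an hour of head-scratching did.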

Overall, this upgrade clocked in at a bit over 4 hours of downtime, not the 3 hours that we'd scheduled.  Happily, our cloud has run very smoothly in the weeks since the upgrade to Juno, and our users are very pleased with the much-improved Horizon dashboard.    We're now working our way through a long list of improvements to our cloud and getting the equipment in place to move to an HA environment, which will let us move towards our goal of rolling, zero-downtime upgrades.  More updates to come!


  1. If you're following the OpenStack community, you'll ask, "What about Icehouse?"  Well, we skipped it, in order to get closer to the community releases more quickly.
  2. I am happy to note that, in spite of this lack of HA, we've had only a few minutes of unscheduled service interruptions over the course of the year, due mostly to panics in the Cinder or Neutron servers.  That seems pretty good considering the bleeding-edge nature of the software we're running.

Tuesday Oct 28, 2014

Heating Up Your OpenStack Cloud

As part of the support updates to Solaris 11.2, we recently added the Heat orchestration engine to our OpenStack distribution.  If you aren't familiar with Heat, I highly recommend getting to know it, as you'll find it invaluable in deploying complex application topologies within an OpenStack cloud.  I've updated the script tarball from my recent series on building the Solaris engineering cloud to include configuration of Heat, so if you download that and update your cloud controller to the latest SRU, you can run heat to turn it on.

OK, once you've done that, what can you do with Heat?  Well, I've added a script and a Heat template that it uses to the tarball to give you at least one idea.  The script, create_image, is similar to a script that we run to create custom Unified Archive images internally for the Solaris cloud.  The basic idea is to deploy an OpenStack instance using the standard archive that release engineering constructs for the product build, add some things we need to it, then save an image of that for the users of the cloud to use as a base deployment image.  I'd originally written a script to do this using the nova CLI, but using a Heat template simplified it.  The file in the tarball is the template that it uses; that template is a simpler version of a two-node template from the heat-templates repository.  It's fairly self-explanatory so I'm not going to walk through it here.

As for create_image itself, the standard Solaris archive contains the packages in the solaris-minimal-server group, a pretty small package set that really isn't too useful for anything itself, but makes a nice base to build images that include the specific things you need.  In our case, I've defined a group package that pulls in a bunch of things we typically use in Solaris development work: ssh client, LDAP, NTP, Kerberos, NFS client and automounter, the man command, and less.  Here's what the main part of the package manifest looks like:

depend fmri=/network/ssh type=group
depend fmri=group/system/solaris-minimal-server type=group
depend fmri=ldapcert type=group
depend fmri=naming/ldap type=group
depend fmri=security/nss-utilities type=group
depend fmri=service/network/ntp type=group
depend fmri=service/security/kerberos-5 type=group
depend fmri=system/file-system/autofs type=group
depend fmri=system/file-system/nfs type=group
depend fmri=system/network/nis type=group
depend fmri=text/doctools type=group
depend fmri=text/less type=group

In our case we bundle the package in a package archive file that we copy into the image using scp and then install the group package.  Doing this saves our developers a few minutes in getting what they need deployed, and that's one easy way we can show them value in using the cloud rather than our older lab infrastructure.  It's certainly possible to do much more interesting customizations than this, so experiment and share your ideas, we're looking to make Heat much more useful on Solaris OpenStack as we move ahead.  You can also talk to us at the OpenStack summit in Paris next week, a number of us will be manning the booth at various times when we're not in sessions at the design summit or the conference itself.

Oh, and for those who are interested, the Solaris development cloud is now up past 100 users and has 5 compute nodes deployed.  Still not large by any measure, but it's growing quickly and we're learning more about running OpenStack every day.

Friday Sep 19, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 4

The prior parts of this series discussed the design and deployment of the undercloud nodes on which our cloud is implemented.  Now it's time to configure OpenStack and turn the cloud on.  Over on OTN, my colleague David Comay has published a general getting started guide that does a manual setup based on the OpenStack all-in-one Unified Archive.  I recommend at least browsing through that for background that will come in handy as you deal with the inevitable issues that occur in running software with the complexity of OpenStack.  It's even better to run through that single-node setup to get some experience before moving on to trying to build a multi-node cloud.

For our purposes, I needed to script the configuration of a multi-node cloud, and that makes everything more complex, not the least of the problems being that you can't just use the loopback IP address as the endpoint for every service.  We had (compliments of my colleague Drew Fisher) a script for single-system configuration already, so I started with that and hacked away to build something that could configure each component correctly in a multi-node cloud.  That Python script and some associated scripts are available for download.  Here, I'll walk through the design and key pieces.


Before the proper OpenStack configuration process, you'll need to run the script to create some SSH keys.  These are used to secure the Solaris RAD (Remote Administration Daemon) transport that the Solaris Elastic Virtual Switch (EVS) controller uses to manage the networking between the Nova compute nodes and the Neutron controller node.  The script creates evsuser, neutron, and root sub-directories in whatever location you run it, and this location will be referenced later in configuring the Neutron and Nova compute nodes, so you want to put it in a directory that's easily shared via NFS.  You can (and probably should) unshare it after the nodes are configured, though.

Global Configuration

The first part of the script is a whole series of global declarations that parameterize the services deployed on various nodes.  You'll note that the PRODUCTION variable can be set to control the layout used; if its value is False, you'll end up with a single-node deployment.  I have a couple of extra systems that I use for staging and this makes it easy to replicate the configuration well enough to do some basic sanity testing before deploying changes.

MY_NAME = platform.node()
MY_IP = socket.gethostbyname(MY_NAME)

# When set to False, you end up with a single-node deployment



    GLANCE_NODE = ""
    CINDER_NODE = ""
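
The effect of the PRODUCTION switch can be sketched roughly as follows.  This is a minimal illustration rather than the code from the script: the node_layout helper and the example.com host names are hypothetical, and the real script covers more services than the three shown.

```python
import platform


def node_layout(production, my_name):
    """Return a mapping of service role -> host name.

    The production host names are hypothetical placeholders.
    """
    if production:
        return {
            "keystone": "controller.example.com",
            "glance": "glance.example.com",
            "cinder": "cinder.example.com",
        }
    # Single-node deployment: every service lives on this host.
    return {role: my_name for role in ("keystone", "glance", "cinder")}


PRODUCTION = False
nodes = node_layout(PRODUCTION, platform.node())
for role, host in sorted(nodes.items()):
    print("%s -> %s" % (role, host))
```

Keeping the layout in one place like this is what makes it practical to replay the same configuration on a staging system.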

Next, we configure the main security elements, the root password for MySQL plus passwords and access tokens for Keystone, along with the URL's that we'll need to configure into the other services to connect them to Keystone.

MYSQL_ROOTPW = "mysqlroot"
ADMIN_PASSWORD = "adminpw"
SERVICE_PASSWORD = "servicepw"

AUTH_URL = "http://%s:5000/v2.0/" % KEYSTONE_NODE
IDENTITY_URL = "http://%s:35357" % KEYSTONE_NODE

The remainder of this section configures specifics of Glance, Cinder,  Neutron, and Horizon.  For Glance and Cinder, we provide the name of the base ZFS dataset that each will be using.  For Neutron, the NIC, VLAN tag, and external network addresses, as well as the subnets for each of the two tenants we are providing in our cloud.  We chose to have one tenant for developers in the organization that is funding this cloud, and a second tenant for other Oracle employees who want to experiment with OpenStack on Solaris; this gives us a way to grossly allocate resources between the two, and of course most go to the tenant paying the bill.  The last element of each tuple in the tenant network list is the number of floating IP addresses to set as the quota for the tenant.  For Horizon, the paths to a server certificate and key must be configured, but only if you're using TLS, and that's only the case if the script is run with PRODUCTION = True.  The SSH_KEYDIR should be set to the location where you ran the script, above.

GLANCE_DATASET = "tank/glance"
CINDER_DATASET = "tank/cinder"

UPLINK_PORT = "aggr0"
    VXLAN_RANGE = "500-600"
    TENANT_NET_LIST = [("tenant1", "", 10),
                       ("tenant2", "", 60)]
    VXLAN_RANGE = "400-499"
    TENANT_NET_LIST = [("tenant1", "", 5), 
                       ("tenant2", "", 5)]


SERVER_CERT = "/path/to/horizon.crt"
SERVER_KEY = "/path/to/horizon.key"

SSH_KEYDIR = "/path/to/generated/keys"

Configuring the Nodes

The remainder of the script is a series of functions that configure each element of the cloud.  You select which element(s) to configure by specifying command-line arguments.  Valid values are mysql, keystone, glance, cinder, nova-controller, neutron, nova-compute, and horizon.  I'll briefly explain what each does below.  One thing to note is that each function first creates a backup boot environment so that if something goes wrong, you can easily revert to the state of the system prior to running the script.  This is a practice you should always use in Solaris administration before making any system configuration changes.  It also saved me a ton of time in development of the cloud, since I could reset within a minute or so every time I had a serious bug.  Even our best re-deployment times with AI and archives are about 10 times that when you have to cycle through network booting.
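
The pattern of dispatching on command-line arguments and taking a backup boot environment first might look something like the sketch below.  The function names, the boot environment naming scheme, and the injectable run parameter are my own assumptions for illustration; only the use of beadm create reflects the actual approach.

```python
import subprocess
import time


def create_backup_be(name, run=subprocess.check_call):
    """Create a backup boot environment before changing anything."""
    run(["beadm", "create", name])


def main(argv, handlers, run=subprocess.check_call):
    """Run the handler for each requested component, in the order given."""
    # Reject unknown names up front, before touching the system.
    for comp in argv:
        if comp not in handlers:
            raise SystemExit("unknown component: %s" % comp)
    for comp in argv:
        stamp = time.strftime("%Y%m%d-%H%M%S")
        # Hypothetical naming scheme for the backup boot environment.
        create_backup_be("%s-backup-%s" % (comp, stamp), run)
        handlers[comp]()
```

A real script would call something like main(sys.argv[1:], handlers) with a dictionary mapping each valid argument to its configuration function.  Passing run as a parameter makes it possible to exercise the dispatch logic on a system without beadm.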


MySQL must be the first piece configured, since all of the OpenStack services use databases to store at least some of their objects.  This function sets the root password and removes some insecure aspects of the default MySQL configuration.  One key piece is that it removes remote root access; that forces us to create all of the databases in this module, rather than creating each component's database in its associated module.  There may be a better way to do this, but since I'm not a MySQL expert in any way, that was the easiest path here.  On review it seems like the enable of the mysql SMF service should really be moved over into the Puppet manifest from part 3.


The keystone function does some basic configuration, then calls the /usr/demo/openstack/keystone/ script to configure users, tenants, and endpoints.  In our deployment I've customized this script a bit to create the two tenants rather than just one, so you may need to make some adjustments for your site; I have not included that customization in the downloaded files.


The glance function configures and starts the various glance services, and also creates the base dataset for ZFS storage; we turn compression on to save on storage for all the images we'll have here.  If you're rolling back and re-running for some reason, this module isn't quite idempotent as written because it doesn't deal with the case where the dataset already exists, so you'd need to use zfs destroy to delete the glance dataset.
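
One way the dataset creation could be made idempotent is to check for the dataset before creating it.  The sketch below is not what the script currently does; the helper names and the injectable run parameter are assumptions, and it relies only on zfs list and zfs create.

```python
import subprocess


def dataset_exists(name, run=subprocess.call):
    """Return True if `zfs list` exits successfully for the dataset."""
    return run(["zfs", "list", "-H", "-o", "name", name]) == 0


def ensure_dataset(name, options=None, run=subprocess.call):
    """Create the dataset only if it is missing; return True if created."""
    if dataset_exists(name, run):
        return False
    cmd = ["zfs", "create"]
    for key, value in sorted((options or {}).items()):
        cmd += ["-o", "%s=%s" % (key, value)]
    cmd.append(name)
    if run(cmd) != 0:
        raise RuntimeError("failed to create %s" % name)
    return True
```

With this in place, a call such as ensure_dataset("tank/glance", {"compression": "on"}) could be re-run after a rollback without first destroying the dataset by hand, at the cost of not verifying that an existing dataset has the expected properties.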


Beyond just basic configuration of the cinder services, the cinder function also creates the base ZFS dataset under which all of the volumes will be created.  We create this as an encrypted dataset so that all of the volumes will be encrypted, which Darren Moffat covers at greater length in OpenStack Cinder Volume encryption with ZFS.  Here we use pktool to generate the wrapping key and store it in root's home directory.  One piece of work we haven't yet had time to take on is adding our ZFS Storage Appliance as an additional back-end for Cinder.  I'll post an update to cover that once we get it done.  Like the glance function, this function doesn't deal properly with the dataset already existing, so any rollback also needs to destroy the base dataset by hand.

nova_controller & nova_compute

Since our deployment runs the nova controller services separate from the compute nodes, the nova_controller function is run on the controller node to set up the API, scheduler, and conductor services.  If you combine the compute and controller nodes you would run this and then later run the nova_compute function.  The nova_compute function also makes use of a couple of helper functions to set up the ssh configuration for EVS.  For these functions to work properly you must run the neutron function on its designated node before running nova_compute on the compute nodes.


The neutron setup function is by far the most complex, as we not only configure the neutron services, including the underlying EVS and RAD functions, but also configure the external network and the tenant networks.  The external network is configured as a tagged VLAN, while the tenant networks are configured as VxLANs; you can certainly use VLANs or VxLANs for all of them, but this configuration was the most convenient for our environment.


For the production case, the horizon function just copies into place an Apache config file that configures TLS support for the Horizon dashboard and the server's certificate and key files.  If you're using self-signed certificates, then the Apache SSL/TLS Strong Encryption: FAQ is a good reference on how to create them.  For the non-production case, this function just comments out the pieces of the dashboard's local settings that enable SSL/TLS support.

Getting Started

Once you've run through all of the above functions, you have a cloud, and pointing your web browser at http://<your server>/horizon should display the login page, where you can log in as the admin user with the password you configured in the script's global settings.

Assuming that works, your next step should be to upload an image.  The easiest way to start is by downloading the Solaris 11.2 Unified Archives.  Once you have an archive the upload can be done from the Horizon dashboard, but you'll find it easier to use the upload_image script that I've included in the download.  You'll need to edit the environment variables it sets first, but it takes care of setting several properties on the image that are required by the Solaris Zones driver for Nova to properly handle deploying instances.  Failure to set them is the single most common mistake that I and others have made in the early Solaris OpenStack deployments; when you forget and attempt to launch an instance, you'll get an immediate error, and the details from nova show will include the error:

| fault                                | {"message": "No valid host was 
found. ", "code": 500, "details": "  File 
 line 107, in schedule_run_instance |

When you snapshot a deployed instance with Horizon or nova image-create the archive properties will be set properly, so it's only manual uploads in Horizon or with the glance command that need care.

There's one more preparation task to do: upload an ssh public key that'll be used to access your instances. Select Access & Security from the list in the left panel of the Horizon Dashboard, then select the Keypairs tab, and click Import Keypair.  You'll want to paste the contents of your ~/.ssh/ into the Public Key field, and probably name your keypair the same as your username.

Finally, you are ready to launch instances.  Select Instances in the Horizon Dashboard's left panel list, then click the Launch Instance button.  Enter a name for the instance, select the Flavor, select Boot from image as the Instance Boot Source, and select the image to use in deploying the VM.  The image will determine whether you get a SPARC or x86 VM and what software it includes, while the flavor determines whether it is a kernel zone or non-global zone, as well as the number of virtual CPUs and amount of memory.  The Access & Security tab should default to selecting your uploaded keypair.  You must go to the Networking tab and select a network for the instance.  Then click Launch and the VM will be installed; you can follow progress by clicking on the instance name to see details and selecting the Log tab.  It'll take a few minutes at present.  In the meantime you can Associate a Floating IP in the Actions field.  Pick any address from the list offered.  Your instance will not be reachable until you've done this.

Once the instance has finished installing and reached the Active status, you can log in to it.  To do so, use ssh root@<floating-ip-address>, which will log in to the zone as root using the key you uploaded above.  If that all works, congratulations, you have a functioning OpenStack cloud on Solaris!

In future posts I'll cover additional tips and tricks we've learned in operating our cloud.  At this writing we're over 60 users and growing steadily, and it's been totally reliable over 3 months, with only outages for updates to the infrastructure.



Tuesday Sep 16, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 3

At the end of Part 2, we built the infrastructure needed to deploy the undercloud systems into our network environment.  However, there's more configuration needed on these systems than we can completely express via Automated Installation, and there's also the issue of how to effectively maintain the undercloud systems.  We're only running a half dozen initially, but expect to add many more as we grow, and even at this scale it's still too much work, with too high a probability of mistakes, to do things by hand on each system.  That's where a configuration management system such as Puppet shows its value, providing us the ability to define a desired state for many aspects of many systems and have Puppet ensure that state is maintained.  My team did a lot of work to include Puppet in Solaris 11.2 and extend it to manage most of the major subsystems in Solaris, so the OpenStack cloud deployment was a great opportunity to start working with another shiny new toy.

Configuring the Puppet Master

One feature of the Puppet integration with Solaris is that the Puppet configuration is expressed in SMF, and then translated by the new SMF Stencils feature to settings in the usual /etc/puppet/puppet.conf file.  This makes it possible to configure Puppet using SMF profiles at deployment time, and the examples in Part 2 showed this for the clients.  For the master, we apply the profile below:

<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!--
  This profile configures the Puppet master
-->
<service_bundle type="profile" name="puppet">
  <service version="1" type="service" name="application/puppet">
    <instance enabled="true" name="master">
      <property_group type="application" name="config">
        <propval name="server" value=""/>
        <propval name="autosign" value="/etc/puppet/autosign.conf"/>
      </property_group>
    </instance>
  </service>
</service_bundle>

The interesting setting is the autosign configuration, which allows new clients to have their certificates automatically signed and accepted by the Puppet master.  This isn't strictly necessary, but makes operation a little easier when you have a reasonably secure network and you're not handing out any sensitive configuration via Puppet.  We use an autosign.conf that looks something like:


This means that we're accepting any system that identifies as being in the domain.  The main pain with autosigning is that if you reinstall any of the systems and you're using self-generated certificates on the clients, you need to clean out the old certificate before the new one will be accepted; this means issuing a command on the master like:

# puppet cert clean
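
To make the autosign behavior described above concrete, here is a simplified approximation of how a domain glob in autosign.conf decides which clients are signed automatically.  This is not Puppet's actual matching code, the autosign_allows function is my own invention, and example.com stands in for the real domain.

```python
def autosign_allows(certname, patterns):
    """Return True if certname matches an exact entry or a '*.domain' glob."""
    name = certname.lower()
    for pattern in patterns:
        pattern = pattern.strip().lower()
        if not pattern:
            continue
        if pattern.startswith("*."):
            # '*.example.com' accepts any name ending in '.example.com'.
            if name.endswith(pattern[1:]):
                return True
        elif name == pattern:
            return True
    return False


patterns = ["*.example.com"]  # hypothetical domain
print(autosign_allows("compute1.example.com", patterns))
print(autosign_allows("compute1.example.org", patterns))
```

The first call is accepted and the second is rejected, which is why this is only appropriate on a network where you trust every host that can claim a name in the domain.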

There are lots of options in Puppet related to certificates and more sophisticated ways to manage them, but this is what we're doing for now.  We have filed some enhancement requests to implement ways of integrating Puppet client certificate delivery and signing with Automated Installation, which would make using the two together much more convenient.

Writing the Puppet Manifests

Next, we implemented a small Mercurial source repository to store the Puppet manifests and modules.  Using a source control system with Puppet is a highly recommended practice, and Mercurial happens to be the one we use for Solaris development, so it's natural for us in this case.  We configure /etc/puppet on the Puppet master as a child repository of the main Mercurial repository, so when we have new configuration to apply it's first checked into the main repository and then pulled into Puppet via hg pull -u, then automatically applied as each client polls the master.  Our repository presently contains the following:


An example tar file with all of the above is available for download.

The site manifest  starts with:

include ntp
include nameservice

The ntp module is the canonical example of Puppet, and is really important for the OpenStack undercloud, as it's necessary for the various nodes to have a consistent view of time in order for the security certificates issued by Keystone to be validated properly.  I'll describe the nameservice module a little later in this post.

Since most of our nodes are configured identically, we can use a default node definition to configure them.  The main piece is configuring Datalink Multipathing (DLMP), which provides us additional bandwidth and higher availability than a single link.  We can't yet configure this using SMF, so the Puppet manifest:

  • Figures out the IP address the system is using with some embedded Ruby
  • Removes the net0 link and creates a link aggregation from net0 and net1
  • Enables active probing on the link aggregation, so that it can detect upstream failures on the switches that don't affect link state signaling (which is also used, and is the only means unless probing is enabled)
  • Configures an IP interface and the same address on the new aggregation link
  • Restricts Ethernet autonegotiation to 1 Gb to work around issues we have with these systems and the switches/cabling we're using in the lab; without this, we get 100 Mb speeds negotiated about 50% of the time, and that kills performance.

You'll note several uses of the require and before statements to ensure the rules are applied in the proper order, as we need to tear down the net0 IP interface before it can be moved into the aggregation, and the aggregation needs to be configured before the IP objects on top of it.

node default {
$myip = inline_template("<% _erbout.concat('$fqdn').to_s) %>")
	# Force link speed negotiation to be at least 1 Gb
	link_properties { "net0":
	    ensure => present,
	    properties => { en_100fdx_cap => "0" },
	}
	link_properties { "net1":
	    ensure => present,
	    properties => { en_100fdx_cap => "0" },
	}

	link_aggregation { "aggr0" :
	    ensure => present,
	    lower_links => [ 'net0', 'net1' ],
	    mode => "dlmp",
	}
	link_properties { "aggr0":
	    ensure => present,
	    require => Link_aggregation['aggr0'],
	    properties => { probe-ip => "+" },
	}
	ip_interface { "aggr0" :
	    ensure => present,
	    require => Link_aggregation['aggr0'],
	}
	ip_interface { "net0":
	    ensure => absent,
	    before => Link_aggregation['aggr0'],
	}
	address_object { "net0":
	    ensure => absent,
	    before => Ip_interface['net0'],
	}
	address_object { 'aggr0/v4':
	    require => Ip_interface['aggr0'],
	    ensure => present,
	    address => "${myip}/24",
	    address_type => "static",
	    enable => "true",
	}
}
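
The ordering that the require and before statements impose can be checked with a small topological sort.  This sketch only models the relationships declared in the manifest; it says nothing about how Puppet itself computes its ordering, and the resource labels are just strings I made up to mirror the manifest.

```python
from graphlib import TopologicalSorter

# Map each resource to the set of resources that must be applied before it.
# "before => X" on A and "require => A" on X both mean A precedes X.
deps = {
    "ip_interface[net0]": {"address_object[net0]"},
    "link_aggregation[aggr0]": {"ip_interface[net0]"},
    "link_properties[aggr0]": {"link_aggregation[aggr0]"},
    "ip_interface[aggr0]": {"link_aggregation[aggr0]"},
    "address_object[aggr0/v4]": {"ip_interface[aggr0]"},
}

order = list(TopologicalSorter(deps).static_order())
for resource in order:
    print(resource)
```

Any order it prints tears down the net0 address and interface first, builds the aggregation next, and only then configures the IP objects on top of it.  graphlib requires Python 3.9 or later.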

The controller node declaration includes all of the above functionality, but also adds these elements to keep rabbitmq running and install the mysql database.

    service { "application/rabbitmq" :
        ensure => running,
    }
    package { "database/mysql-55":
        ensure => installed,
    }

The database installation could have been part of the AI derived manifest as well, but it works just as well here and it's convenient to do it this way when I'm setting up staging systems to test builds before we upgrade.

The nameservice Puppet module is shown below.  It's handling both nameservice and RBAC (Role-based Access Control) configuration:

class nameservice {

    dns { "openstack_dns":
        search => [ '' ],
        nameserver => [ '', '' ],
    }

    service { "dns/client":
        ensure => running,
    }

    svccfg { "domainname":
        ensure => present,
        fmri => "svc:/network/nis/domain",
        property => "config/domainname",
        type => "hostname",
        value => "",
    }

    # nameservice switch
    nsswitch { "dns + ldap":
        default => "files",
        host =>  "files dns",
        password => "files ldap",
        group => "files ldap",
        automount => "files ldap",
        netgroup => "ldap",
    }

    # Set user_attr for administrative accounts
    file { "user_attr" :
        path => "/etc/user_attr.d/site-openstack",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/user_attr",
    }

    # Configure zlogin access
    file { "site-zlogin" :
        path => "/etc/security/prof_attr.d/site-zlogin",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/prof_attr-zlogin",
    }

    file { "zlogin-exec" :
        path => "/etc/security/exec_attr.d/site-zlogin",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/exec_attr-zlogin",
    }

    file { "policy.conf" :
        path => "/etc/security/policy.conf",
        owner => "root",
        group => "sys",
        mode => 644,
        source => "puppet:///modules/nameservice/policy.conf",
    }
}

You may notice that the nameservice configuration here is exactly the same as what we provided in the SMF profile in part 2.  We include it here because it's configuration we anticipate changing someday and we won't want to re-deploy the nodes.  There are ways we could prevent the duplication, but we didn't have time to spend on it right now and it also demonstrates that you could use a completely different configuration in operation than at deployment/staging time.

What's with the RBAC configuration?

The RBAC configuration is doing two things, the first being configuring the user accounts of the cloud administrators for administrative access on the cloud nodes.  The user_attr file we're distributing confers the System Administrator and OpenStack Management profiles, as well as access to the root role (oadmin is just an example account in this case):

oadmin::::profiles=System Administrator,OpenStack Management;roles=root

As we add administrators, I just need to add entries for them to the above file and they get the required access to all of the nodes.  Note that this doesn't directly provide administrative access to OpenStack's CLIs or its Dashboard; that's configured within OpenStack.
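
If you want to sanity-check such entries before checking them in, the user_attr format is simple to parse: five colon-separated fields, the last of which holds semicolon-separated key=value attributes.  The parse_user_attr helper below is a hypothetical sketch, not part of our tooling.

```python
def parse_user_attr(line):
    """Split a user_attr entry into the user name and its attributes."""
    # Fields are user:qualifier:res1:res2:attr; only the first and last matter here.
    user, _qualifier, _res1, _res2, attr = line.strip().split(":", 4)
    attrs = {}
    for item in attr.split(";"):
        if item:
            key, _, value = item.partition("=")
            attrs[key] = value.split(",")
    return user, attrs


entry = "oadmin::::profiles=System Administrator,OpenStack Management;roles=root"
user, attrs = parse_user_attr(entry)
print(user, attrs["profiles"], attrs["roles"])
```

A check like this could be run against the file in the repository to catch a mistyped profile name before Puppet distributes it to every node.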

A limitation of the OpenStack software we include in Solaris 11.2 is that we don't provide the ability to connect to the guest instance consoles, an important feature that's being worked on.  The zlogin User profile is something I created to work around this problem and allow our cloud users to get access to the consoles, as this is often needed in Solaris development and testing.  First, the profile is defined by a prof_attr file with the entry:

zlogin User:::Use

We also need an exec_attr file to ensure that zlogin is run with the needed uid and privileges:

 zlogin User:solaris:cmd:RO::/usr/sbin/zlogin:euid=0;privs=ALL

Finally, we modify the RBAC policy file so that all users are assigned to the zlogin User profile:

PROFS_GRANTED=zlogin User,Basic Solaris User

The result of all this is that a user can obtain access to their specific OpenStack guest instance by logging in to the compute node on which the guest is running, and running a command such as:

$ pfexec zlogin -C instance-0000abcd

At this point we have the undercloud nodes fully configured to support our OpenStack deployment.  In part 4, we'll look at the scripts used to configure OpenStack itself.

Tuesday Sep 02, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 2


Continuing from where I left off with part 1 of this series, in this posting I'll discuss the elements that we put in place to deploy the OpenStack cloud infrastructure, also known as the undercloud.

The general philosophy here is to automate everything, both because it's a best practice and because this cloud doesn't have any dedicated staff to manage it; we're doing it ourselves in order to get first-hand operational experience that we can apply to improve both Solaris and OpenStack.  As I said in part 1, we don't have an HA requirement at this point, but we'd like to keep any outages, both scheduled and unscheduled, to less than a half hour, so redeploying a failed node should take no more than 20 minutes.  The pieces that we're using are:

  • Automated Installation services and manifests to deploy Solaris
  • SMF profiles to configure system services
  • IPS site packages installed as part of the AI deployment to automate some first-boot configuration
  • A Puppet master to provide initial and ongoing configuration automation

I'll elaborate on the first two below, and discuss Puppet in the next posting.  The IPS site packages we are using are specific to Oracle's environment so I won't be covering those in detail.

Sanitized versions of the manifests and profiles discussed below are available for download as a tar file.

Automated Installation

Building the undercloud nodes means we're doing bare-metal provisioning, so we'll be using the Automated Installation (AI) feature in Solaris 11.  Most of the OpenStack services could run in kernel zones, or even non-global zones, but we're planning for larger scale and want to have some extra horsepower.  Therefore we opted not to go in that direction for now, but it may well be an option we use later for some services.

I already had an existing AI server in this cloud's lab, and it provides services to systems that aren't part of this cloud.  As we release each development build of a Solaris 11 update or Solaris 12 there's a new service generated on it.  The pace of evolution of this cloud is likely to be different from those other systems as well, so that led me to create two new AI services specifically for the cloud; we can make these services aliases of existing services so we don't need to bother replicating the boot image, thus the commands look like (output elided):

# installadm create-service -n cloud-i386 --aliasof solaris11_2-i386
# installadm create-service -n cloud-sparc --aliasof solaris11_2-sparc 

The next step is setting up the manifests that specify the installation.  For this, I've taken the default derived manifest that we install for services and modified it to:

  1. Specify a custom package list
  2. Lay out all of the storage
  3. Select the package repository based on Solaris release
  4. Install a Unified Archive rather than a package set based on a boot argument

You can download the complete manifest; I'll discuss the various customizations here.

The package list we're explicitly installing is below.  There are of course a number of other packages pulled in as dependencies, so this expands out to just over 500 packages installed (perhaps not surprisingly, about 35% are Python libraries):


We start with solaris-minimal-server in order to build an effectively minimized environment.  We've chosen to install the same package set on all nodes so that any of them can be easily repurposed to a different role in the cloud if needed, so the openstack group package is used rather than the packages for the various OpenStack services.  We'll be using MySQL as the database, so need its client package.  snoop is there for network diagnostics (yes, we should use tshark instead but I'm old-school :-), some Python packages that we need to support OpenStack, as well as RabbitMQ as that's our message broker.  We use LDAP for authentication so that's included.  I find rsync convenient for caching crash dumps off to other systems for examination.  ssh is needed for remote access.  nss-utilities are needed for some LDAP configuration.  OpenStack needs consistent time, so NTP is required.  We use Kerberos for some NFS access so that's included, along with the automounter and NFS client.  We want to use SMTP notifications for any fault management events, so include it.  The utilities to manage Oracle hardware may come in handy, so we include them.  Puppet is going to provide ongoing configuration management, so it's included.  We need rad-evs-controller to back-end our Neutron driver.  bpf is listed only because of a missing dependency in another package that causes runaway console messages from the DLMP daemon; that's being fixed.  The NIS package provides some things that LDAP needs.  We're using kernel zones as the primary OpenStack guest, so need that zone brand installed.  The doctools package provides the man command, don't want to be caught without a man page when you need it!  less is there because it's better than more.  Finally, we install a couple of site packages, one that does some general customizations, another that delivers the base certificate needed for TLS access to our LDAP servers.

The storage layout we standardized on for the undercloud is to have a two-way mirror for the root pool, formed out of two of the smallest disks (usually 300 GB on the systems we're using), with any remaining disks in a separate pool, called tank on all of the systems, that can be used for other purposes.  On the Cinder node, it's where we put all the ZFS iSCSI targets; in the case of Glance it's where we store the images.  We're also planning to use it for Swift services on various nodes, but we haven't deployed Swift yet.  The tank pool gets built with varying amounts of redundancy based on the number of disks.  This logic is all in the last 60 lines of the manifest script.  It's an interesting example of using the derived manifest features to do some reasonably complex customization for individual nodes.

We internally have separate repositories for Solaris 11 and Solaris 12, so the manifest defaults to Solaris 12 and if it determines we've booted Solaris 11 to install, then it uses a different repository:

if [[ $(uname -r) = 5.11 ]]; then
        aimanifest set /auto_install/ai_instance/software[@type="IPS"]/source/publisher[@name="solaris"]/origin@name

The last trick I added was the ability to select a Unified Archive to install instead of the packages.  We'll be using archives as the backup/recovery mechanism for the infrastructure, so this provides a faster way to deploy nodes when we already have the desired archive available.  On a SPARC system you'd select this using a boot command like:

ok boot net:dhcp - install archive_uri=

On an x86 system you'd add this as -B archive_uri=<uri> to the $multiboot line in grub.cfg.

The code for this in the script looks like:

if [[ ${SI_ARCH} = sparc ]]; then
    ARCHIVE_URI=$(prtconf -vp | nawk \
        '/bootargs.*archive_uri=/{n=split($0,a,"archive_uri=");split(a[2],b);split(b[1],c,"'\''");print c[1]}')
else
    ARCHIVE_URI=$(devprop -s archive_uri)
fi

if [[ -n "$ARCHIVE_URI" ]]; then
    # Replace package software section with archive
    aimanifest delete software
    swpath=$(aimanifest add -r /auto_install/ai_instance/software@type ARCHIVE)
    aimanifest add $swpath/source/file@uri $ARCHIVE_URI
    inspath=$(aimanifest add -r $swpath/software_data@action install)
    aimanifest add $inspath/name global
fi
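
The nawk expression for the SPARC case is dense, so here is the same extraction written out in Python for readability.  It is only an illustration of the parsing steps; the function name, the sample bootargs line, and the URL in it are made up, and the manifest itself stays in ksh.

```python
def archive_uri_from_bootargs(line):
    """Extract the archive_uri= value from a bootargs line.

    Returns an empty string when the argument is absent.
    """
    marker = "archive_uri="
    if marker not in line:
        return ""
    # Take the text after the marker, keep the first whitespace-delimited
    # field, then drop anything from a trailing single quote onward.
    fields = line.split(marker, 1)[1].split()
    if not fields:
        return ""
    return fields[0].split("'")[0]


# Hypothetical line in the style of `prtconf -vp` output.
sample = "    bootargs:  '- install archive_uri=http://server.example.com/node.uar'"
print(archive_uri_from_bootargs(sample))
```

The final split on the single quote matters because the boot arguments are quoted in the output being parsed, so the last argument would otherwise carry the closing quote with it.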


Once we have the manifest, it's a simple matter to make it the default manifest for both of the cloud services:

# installadm create-manifest -n cloud-i386 -d -f havana.ksh
# installadm create-manifest -n cloud-sparc -d -f havana.ksh

Each of the systems we're including in the cloud infrastructure is assigned to the appropriate AI service with a command such as:

# installadm create-client -n cloud-sparc -e <mac address>

SMF Configuration Profiles

But before we go on to installing the systems, we also want to provide SMF (Service Management Facility) configuration profiles to automate the initial system configuration; otherwise, we'll be faced with running the interactive sysconfig tool during the initial boot.  For this deployment, we have a somewhat unusual twist, in that there is configuration we'd like to share between the infrastructure nodes and guests since they are ultimately all nodes on the Oracle internal network.  Also, for maximum flexibility and reuse, the configuration is expressed by multiple profiles, with each designed to configure only some aspects of the system.  In our case, we have a directory structure on the AI server that looks like:


The first three are specific to the infrastructure nodes.  The infrastructure.xml profile provides the fixed network configuration, along with coreadm setup and fault management notifications; we use SMTP notifications to alert us to any faults from the system.  The puppet.xml profile configures the puppet agents with the name of the master node.  The users.xml profile configures the root account as a role and sets its password, and also sets up a local system administrator account that's meant to be used in case of networking issues that prevent our administrators from using their normal user accounts.

The three profiles under the common directory are also used to configure guest instances in our cloud.  I'll show how that's done later in this series, but it's important that they be under a separate directory.  basic.xml configures the system's timezone, default locale, keyboard layout, and console terminal type.  dns.xml configures the DNS resolver, and ldap.xml configures the LDAP client.

We load each of these into the AI services with the command:

# installadm create-profile -n cloud-sparc -f <file name>

The important aspect of the above command is that no criteria are specified for the profiles, which means that they are applied to all clients of the service.  This also means that they must be disjoint; no two profiles can attempt to configure the same property on the same service, otherwise SMF will not apply the profiles that conflict.
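
Since overlapping profiles fail in a way that's easy to miss, it can be worth checking for overlap before loading them.  The sketch below is one possible approach and not a tool we ship: it treats a profile as simplified XML, ignores anything other than propval and property elements, and the function names and sample profiles are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def settings(profile_xml):
    """Return the set of (service, instance, group, property) a profile sets."""
    found = set()
    root = ET.fromstring(profile_xml)
    for service in root.iter("service"):
        # Property groups may hang off the service or off one of its instances.
        scopes = [("", service)]
        scopes += [(inst.get("name"), inst) for inst in service.findall("instance")]
        for inst_name, scope in scopes:
            for group in scope.findall("property_group"):
                for prop in group:
                    if prop.tag in ("propval", "property"):
                        found.add((service.get("name"), inst_name,
                                   group.get("name"), prop.get("name")))
    return found


def conflicts(profiles):
    """Map each setting made by more than one profile to the profile names."""
    owners = defaultdict(list)
    for name, text in sorted(profiles.items()):
        for setting in settings(text):
            owners[setting].append(name)
    return {s: names for s, names in owners.items() if len(names) > 1}


TEMPLATE = """<service_bundle type="profile" name="sample">
  <service name="system/timezone" version="1" type="service">
    <instance name="default" enabled="true">
      <property_group name="timezone" type="application">
        <propval name="localtime" value="%s"/>
      </property_group>
    </instance>
  </service>
</service_bundle>"""

print(conflicts({"basic.xml": TEMPLATE % "UTC", "other.xml": TEMPLATE % "US/Pacific"}))
```

An empty result means the profiles are disjoint at the property level, which is the condition the unqualified create-profile commands depend on.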

Once all that's done, we can see the results:

# installadm list -p -m -n cloud-sparc
Service Name Manifest Name Type    Status   Criteria
------------ ------------- ----    ------   --------
cloud-sparc  havana.ksh    derived default  none    

Service Name Profile Name       Criteria
------------ ------------       --------
cloud-sparc  basic.xml          none    
             dns.xml            none    
             infrastructure.xml none    
             ldap.xml           none    
             puppet.xml         none    
             users.xml          none    
At this point we've got enough infrastructure implemented to install the OpenStack undercloud systems.  In the next posting I'll cover the Puppet manifests we're using; after that we'll get into configuring OpenStack itself.

Friday Aug 22, 2014

Building an OpenStack Cloud for Solaris Engineering, Part 1

One of the signature features of the recently-released Solaris 11.2 is the OpenStack cloud computing platform.  Over on the Solaris OpenStack blog the development team is publishing lots of details about our version of OpenStack Havana as well as some tips on specific features, and I highly recommend reading those to get a feel for how we've leveraged Solaris's features to build a top-notch cloud platform.  In this and some subsequent posts I'm going to look at it from a different perspective, which is that of the enterprise administrator deploying an OpenStack cloud.  But this won't be just a theoretical perspective: I've spent the past several months putting together a deployment of OpenStack for use by the Solaris engineering organization, and now that it's in production we'll share how we built it and what we've learned so far.

In the Solaris engineering organization we've long had dedicated lab systems dispersed among our various sites and a home-grown reservation tool for developers to reserve those systems; various teams also have private systems for specific testing purposes.  But as a developer, it can still be difficult to find systems you need, especially since most Solaris changes require testing on both SPARC and x86 systems before they can be integrated.  We've added virtual resources over the years as well in the form of LDOMs and zones (both traditional non-global zones and the new kernel zones).  Fundamentally, though, these were all still deployed in the same model: our overworked lab administrators set up pre-configured resources and we then reserve them.  Sounds like pretty much every traditional IT shop, right?  Which means that there's a lot of opportunity for efficiencies from greater use of virtualization and the self-service style of cloud computing.  As we were well into development of OpenStack on Solaris, I was recruited to figure out how we could deploy it to both provide more (and more efficient) development and test resources for the organization as well as a test environment for Solaris OpenStack.

At this point, let's acknowledge one fact: deploying OpenStack is hard.  It's a very complex piece of software that makes use of sophisticated networking features and runs as a ton of service daemons with myriad configuration files.  The web UI, Horizon, often doesn't do a good job of providing detailed errors.  Even the command-line clients are not as transparent as you'd like, but at least you can turn on verbose and debug messaging and often get some clues as to what to look for; it helps if you're good at reading JSON structure dumps.  I'd already learned all of this in doing a single-system Grizzly-on-Linux deployment for the development team to reference when they were getting started, so I at least came to this job with some appreciation for what I was taking on.  The good news is that both we and the community have done a lot to make deployment much easier in the last year; probably the easiest approach is to download the OpenStack Unified Archive from OTN to get your hands on a single-system demonstration environment.  I highly recommend getting started with something like it to get some understanding of OpenStack before you embark on a more complex deployment.  For some situations, it may in fact be all you ever need.  If so, you don't need to read the rest of this series of posts!

In the Solaris engineering case, we need a lot more horsepower than a single-system cloud can provide.  We need to support both SPARC and x86 VM's, and we have hundreds of developers so we want to be able to scale to support thousands of VM's, though we're going to build to that scale over time, not immediately.  We also want to be able to test both Solaris 11 updates and a release such as Solaris 12 that's under development so that we can work out any upgrade issues before release.  One thing we don't have is a requirement for extremely high availability, at least at this point.  We surely don't want a lot of down time, but we can tolerate scheduled outages and brief (as in an hour or so) unscheduled ones.  Thus I didn't need to spend effort on trying to get high availability everywhere.

The diagram below shows our initial deployment design.  We're using six systems, most of which are x86 because we had more of those immediately available.  All of those systems reside on a management VLAN and are connected with a two-way link aggregation of 1 Gb links (we don't yet have 10 Gb switching infrastructure in place, but we'll get there).  A separate VLAN provides "public" (as in connected to the rest of Oracle's internal network) addresses, while we use VxLANs for the tenant networks.

Solaris cloud diagram

One system is more or less the control node, providing the MySQL database, RabbitMQ, Keystone, and the Nova API and scheduler as well as the Horizon console.  We're curious how this will perform and I anticipate eventually splitting at least the database off to another node to help simplify upgrades, but at our present scale this works.

I had a couple of systems with lots of disk space, one of which was already configured as the Automated Installation server for the lab, so it's just providing the Glance image repository for OpenStack.  The other node with lots of disks provides Cinder block storage service; we also have a ZFS Storage Appliance that will help back-end Cinder in the near future, but I just haven't had time to get it configured yet.

There's a separate system for Neutron, which is our Elastic Virtual Switch controller and handles the routing and NAT for the guests.  We don't have any need for firewalling in this deployment so we're not doing so.  We presently have only two tenants defined, one for the Solaris organization that's funding this cloud, and a separate tenant for other Oracle organizations that would like to try out OpenStack on Solaris.  Each tenant has one VxLAN defined initially, but we can of course add more.  Right now we have just a single /24 network for the floating IP's; once demand grows to the point where we need more, we'll add them.
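To put a number on what a single /24 gives us, here's a quick arithmetic sketch.  The `usable_hosts` helper is hypothetical, and it counts only the standard network and broadcast exclusions; whatever addresses the router or gateway consume in a given deployment come off the top of that.

```shell
# usable_hosts <prefix-length>
# IPv4 host addresses in a subnet, excluding the network and broadcast
# addresses.  Meaningful for prefix lengths of 30 or shorter.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 24    # prints 254
```

So one /24 caps us at a couple hundred floating IP's, which fits the plan of adding networks as demand grows.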

Finally, we have started with just two compute nodes; one is an x86 system, the other is an LDOM on a SPARC T5-2.  We'll be adding more when demand reaches the level where we need them, but as we're still ramping up the user base it's less work to manage fewer nodes until then.

My next post will delve into the details of building this OpenStack cloud's infrastructure, including how we're using various Solaris features such as Automated Installation, IPS packaging, SMF, and Puppet to deploy and manage the nodes.  After that we'll get into the specifics of configuring and running OpenStack itself.

Tuesday Jan 31, 2012

Detroit Solaris 11 Forum, February 8

I'm just posting this quick note to help publicize the Oracle Solaris 11 Technology Forum we're holding in the Detroit area next week.  There's still time to register and come get a half-day overview of the great new stuff in Solaris 11.  The "special treat" that's not mentioned in the link is that I'll be joining Jeff Victor as a speaker.  Looking forward to being back in my home state for a quick visit, and hope I'll see some old friends there!

Tuesday Nov 22, 2011

Solaris at LISA 2011

As is our custom, the Solaris team will be out in force at the USENIX LISA conference; this year it's in Boston so it's sort of a home game for me for a change.  The big event we'll have is Tuesday, December 6, the Oracle Solaris 11 Summit Day.  We'll be covering deployment, ZFS, Networking, Virtualization, Security, Clustering, and how Oracle apps run best on Solaris 11.  We've done this the past couple of years and it's always a very full day.

On Wednesday, December 7, we've got a couple of BOF sessions scheduled back-to-back.  At 7:30 we'll have the ever-popular engineering panel, with all of us who are speaking at Tuesday's summit day there for a free-flowing discussion of all things Solaris.  Following that, Bart & I are hosting a second BOF at 9:30 to talk more about deployment for clouds and traditional data centers.

Also, on Wednesday and Thursday we'll have a booth at the exhibition where there'll be demos and just a general chance to talk with various Solaris staff from engineering and product management.

The conference program looks great and I look forward to seeing you there!

Thursday Nov 17, 2011

Virtually the fastest way to try Solaris 11 (and Solaris 10 zones)

If you're looking to try out Solaris 11, there are the standard ISO and USB image downloads on the main page.  Those are great if you're looking to install Solaris 11 on hardware, and we hope you will.  But if you take the time to look down the page, you'll find a link off to the Oracle Solaris 11 Virtual Machine downloads.  There are two downloads there:

  1. A pre-built Solaris 10 zone
  2. A pre-built Solaris 11 VM for use with VirtualBox

If you're looking to try Solaris 11 on x86, the second one is what you want.  Of course, this assumes you have VirtualBox already (and if you don't, now's the time to try it; it's a terrific free desktop virtualization product).  Once you complete the 1.8 GB download, it's a simple matter of unzipping the archive and a few quick clicks in VirtualBox to get a Solaris 11 desktop booted.  While it's booting, you'll get to run through the new system configuration tool (that'll be the subject of a future posting here) to configure networking, a user account, and so on.

So what about that pre-built Solaris 10 zone download?  It's a really simple way to get yourself acquainted with the Solaris 10 zones feature, which you may well find indispensable in transitioning an existing Solaris 10 infrastructure to Solaris 11.  Once you've downloaded the file, it's a self-extracting executable that'll configure the zone for you; all you have to supply is an IP address for the zone.  It's really quite slick!

I expect we'll do a lot more pre-built VM's and zones going forward, as that's a big part of being a cloud OS; if there's one that would be really useful for you, let us know.

Tuesday Nov 15, 2011

Solaris 11 Technology Forums, NYC and Boston

By now you're certainly aware that we released Solaris 11; I was on vacation during the launch so haven't had time to write any material related to the Solaris 11 installers, but will get to that soon.  Following onto the release, we're scheduling events in various locations around the world to talk about some of the key new features in Solaris 11 in more depth.  In the northeast US, we've scheduled technology forums in New York City on November 29, and Burlington, MA on November 30.  Click on those links to go to the detailed info and registration.  I'll be one of the speakers at both of them, so hope to see you there!

Monday Mar 28, 2011

Solaris Online Forum on April 14

For all of you that are interested in what's happening with Solaris 11, we've scheduled a half-day of online forums on Thursday, April 14.  I'll be on for 45 minutes with my pal Bart to talk about deployment; other colleagues will be discussing the Solaris strategy, virtualization, and other features of Solaris 11.  We'll also have a live on-line chat where you can get one of us to answer your questions.  For the full details, see the registration page.  Hope to see you there!

Tuesday Nov 16, 2010

Solaris 11 Express Interactive Installation

One thing I didn't note in my previous entry on the Solaris 11 Express 2010.11 release is that there are some new developments in installation since the last available builds of OpenSolaris.  This post just discusses the interactive installation options, while a subsequent entry will discuss the Automated Installer.

Before digging into the details, it's probably useful to explain the philosophy of the interactive installers a bit for those encountering them for the first time, as it is somewhat of a departure from Solaris 10 and prior.  Our basic guiding principle is probably best summarized as, "Get the system installed and get out of the way."  To elaborate a bit, the idea is to collect a minimal amount of configuration required to make the installed system functional, execute the install quickly, and let the user get on with using the system.  That means that a lot of the configuration you might have been asked about in past Solaris releases, such as Kerberos or NFS domains, or installing additional, layered software, is just not present.  You're asked only to select a disk, partition it a bit if you want, provide timezone and locale, and create a user account.  You're also not prompted to interactively select the software to be installed.  Instead, the software that's present on the media is what's installed, providing a useful starting point at first boot.  From there, you can use tools like the pkg CLI or the Package Manager GUI to customize software to your heart's content, all installed from the convenience of a software repository on the network.

There are several reasons why we think this shift is appropriate.  First, many of the configuration settings that were prompted for in the past were of interest to only small minorities of users.  That means we were making it harder for the majority, which is almost always a bad choice.  Second, we've put in a concerted effort over the past 5+ years to make Solaris configured more correctly to start with, and more capable of self-configuring, so that more users get the best results, not just those who can figure out the right knobs to twist.  The end results should be better for all of us in the Solaris ecosystem, as behavior will be more consistent and predictable.  Finally, in terms of software selection, we've reached the point where the commonly-available media format (DVD) just isn't large enough to incorporate all the software we want to provide as part of the product - we've just plain outstripped the rate of improvement in software compression technology.  It's well past time that we oriented Solaris towards a network-centric software delivery paradigm.

Text Installer

The most obvious difference to OpenSolaris users is the addition of the Text Installer, a curses-based interactive UI designed to run comfortably on all those servers out there that have only serial consoles.  Those that were following the OpenSolaris development train did see a late preview of this from the project team back around build 134, but S11 Express is the first release that includes this installer.  This now means that there is an interactive install option for SPARC users, as the GUI install is offered only on the x86 live CD.

Philosophically, this UI shares a fair amount with the GUI: it's a fairly streamlined experience that doesn't allow customization of the software payload, but does allow a little more freedom in disk configuration (most notably, the ability to preserve existing VTOC slices).  Like the GUI, the installation is a direct copy of the media contents, so what is included on the media defines the installation.

Initially, we've opted to include this installer only on a new, separate ISO download, identified as Text Install on the downloads page.  This image might be more accurately called "Server Install", as that's what it really is meant to be: a generic server installation that includes most, if not all, of the Solaris server elements, but omits the GNOME desktop and related applications.  If this is the image you downloaded and installed but you really wanted the GNOME desktop (easy to do since it's the first image on the page), then the easy solution is to install the package set that appears on the live CD media; you can accomplish that with the command pkg install slim_install, slim_install being the IPS group package that we use to define the live CD contents.  Incidentally, the group package that defines the text install media contents is the server_install package.

One thing that server administrators will undoubtedly find missing is the ability to directly configure the network as part of the install; right now it defaults to the automatic configuration technology we call Network Auto-Magic (or NWAM).  We do plan to extend the text installer to also provide static network configuration, so you'll be able to supply an IP address and nameservice configuration directly, rather than having to do this post-installation.

GUI Installer

The GUI installer has undergone some small changes from the versions provided with OpenSolaris.  If the last time you used it was with OpenSolaris 2009.06, the biggest difference is that it provides support for extended partitions, which provides a little more flexibility in dealing with the limitations of the x86 partitioning scheme and eases co-existence with other OS's in multi-boot configurations.  The other change here, more subtle, is that the UI no longer separately prompts for the root password.  Instead, the password for the root role is set to the same password as the initial user account (which is now required, where it was optional during OpenSolaris releases).  The root password is created as expired, however, so the first time you su to root, you'll be prompted to change the password.  Finally, the initial user account is no longer assigned the Primary Administrator profile to enable administrative access.  Instead, the user account retains access to the root role, and is also given all access to sudo.  The text installer does allow independent setting of the root password at this release, but we expect to align it with the GUI in a future build.

Monday Nov 15, 2010

Oracle Solaris 11 Express 2010.11 is released

Today marks the release of Oracle Solaris 11 Express 2010.11, beginning the rollout of our long-gestating successor to Solaris 10.  The summary and links to most everything are available on the OTN Oracle Solaris 11 Overview.  Probably the biggest thing to emphasize is that this is a supported release, not a "beta" or preview; see the link for the support options.  That said, feature development continues in anticipation of a Solaris 11 release in 2011, as was outlined at OpenWorld back in September.

For those who used the OpenSolaris distribution releases, you'll find this release quite familiar, as it's the continuing evolution of the technology we introduced in those releases: the installers from the Caiman project, the IPS packaging system, and all the other great things that my colleagues in Solaris engineering have been developing for the past several years in networking, storage, security and so on.  The biggest visible differences are a different package repository, license terms, and of course Oracle branding.

For those of you who weren't users of OpenSolaris, well, now is the time to really start getting your feet wet, evaluating Solaris 11 and planning its deployment in your environment.  We hope you'll like it!


I'm the architect for Solaris deployment and system management, with a lot of background in networking on the side. I spend a lot of my time currently operating Solaris Engineering's OpenStack cloud. I am co-author of the OpenSolaris Bible (Wiley, 2009). I also play a lot of golf.

