
Upgrading Solaris Engineering's OpenStack Cloud

Dave Miner
Sr. Principal Software Engineer

The Solaris 11.3 Beta release includes an update to the bundled OpenStack packages from the Havana version to Juno [1].  Over on the OpenStack blog my colleague Drew Fisher has a detailed post that looks under the covers at the work the community and our Solaris development team did to make this major upgrade as painless as possible.  Here, I'll talk about applying that upgrade technology from the operations side, as we recently performed this upgrade on the internal cloud that we're operating for Solaris engineering.  See my series of posts from last year on how our cloud was initially constructed.

Our Upgrade Process

The first thing to understand about our upgrade process is that, since the Solaris Nova driver as yet lacks live migration support, we can't upgrade compute nodes without an outage for the guest instances.  We also don't yet have an HA configuration deployed for the database and all the services, so those also require outages to upgrade [2].  Therefore, all of our upgrades have downtime scheduled for the entire cloud and we attempt to upgrade all the nodes to the same build.  We typically schedule two hours for upgrades.  If everything were to go smoothly we could be done in less than 30 minutes, but it never works out that way, at least so far.

Right now, we're still doing the upgrades fairly manually, with a small script that we run on each node in turn.  That script looks something like:

# shut down puppet so that patches don't get pulled before they are required
svcadm disable -t puppet:agent
# shut down zones so update goes more quickly, use synchronous to wait for this
svcadm disable -ts zones
# Disable nova API and BUI; we use temporary for API so it will
# come back on reboot but persistent for BUI so that it's not available
# until we're ready to end the outage.
# Dump database for disaster recovery
if [[ $node == "cloud-controller" ]]; then
        svcadm disable -t nova-api-osapi-compute
        svcadm disable apache22
        mysqldump --user=root --password='password' --add-drop-database \
            --all-databases > /tank/all_databases.sql
fi
pkg update --be-name solaris11_3 -C 5

Once the script completes, we can reboot the system.  The comment above about Puppet relates to specifics in how we are using it: since we sometimes have bugs in the builds that we can work around, we typically use Puppet to distribute those workarounds, but we don't want them to take effect until we've rebooted into the new boot environment.  There's almost certainly a better way to do this; we're just not that smart yet ;-)
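
Before rebooting each node, it's worth confirming that the update created the new boot environment and marked it active on reboot.  A minimal check with beadm (the BE name matches the --be-name we pass to pkg update above):

# Confirm the new BE exists and shows "R" (active on reboot) in the Active column
beadm list
# If it doesn't, activate it explicitly before rebooting
# beadm activate solaris11_3
reboot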

We run the above script on all the nodes in parallel, which is safe because the upgrade always creates a new boot environment, and we wait until all of the core service nodes (keystone, cinder, nova controller, neutron, glance) are done before we reboot any of them.  We don't necessarily wait for all of the compute nodes, since they can take longer if any of the guests are non-global zones, and they are the last thing we reboot anyway.
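
For illustration, the parallel run can be as simple as a loop over the node list from an admin host.  This is a minimal sketch assuming password-less ssh, a nodes.txt file listing the node names, and the script above saved as upgrade.sh (all three are hypothetical names):

# Kick off the upgrade script on every node at once, then wait for all of them
while read -r node; do
    ssh root@"$node" "node=$node bash -s" < upgrade.sh &
done < nodes.txt
wait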

Once the updates are complete, we reboot nodes in the following order:

  1. Nova controller - MySQL, RabbitMQ, Keystone, Nova APIs, Heat
  2. Neutron controller
  3. Cinder controller
  4. Glance
  5. Compute nodes

This order minimizes disruptions to the services connecting to RabbitMQ and MySQL, which have been a point of fragility for many operators of OpenStack clouds.  It also ensures that the compute nodes don't see disruptions to iSCSI connections for running zones, which we've occasionally seen lead to ZFS pools ending up in a suspended state.  As we build out the cloud we'll be separating the functions that are in the Nova controller into separate instances, which will necessitate some adjustments to this sequencing, but the basic idea is to work from the database to RabbitMQ to Keystone to the Nova services.
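
After each node comes back up, a quick SMF health check tells us whether anything failed to start before we move on to the next one.  The service patterns below are illustrative:

# Report any services that didn't come up cleanly on this node
svcs -xv
# Spot-check the OpenStack-related services, e.g. on the Nova controller
svcs "*nova*" "*rabbitmq*" "*mysql*" "*keystone*"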

Verifying the Upgrade Worked

Once we've rebooted the nodes we run a couple of quick tests to launch both SPARC and x86 guests, ensuring that basically all of the machinery is working.  I've started doing this with a fairly simple Heat template:

heat_template_version: 2013-05-23

description: >
  HOT template to deploy SPARC & x86 servers as a quick sanity test

parameters:
  x86_image:
    type: string
    description: Name of image to use for x86 server
  sparc_image:
    type: string
    description: Name of image to use for SPARC server

resources:
  x86_server1:
    type: OS::Nova::Server
    properties:
      name: test_x86
      image: { get_param: x86_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: x86_server1_port }

  x86_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  x86_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: x86_server1_port }

  sparc_server1:
    type: OS::Nova::Server
    properties:
      name: test_sparc
      image: { get_param: sparc_image }
      flavor: 1
      key_name: testkey
      networks:
        - port: { get_resource: sparc_server1_port }

  sparc_server1_port:
    type: OS::Neutron::Port
    properties:
      network: internal

  sparc_server1_floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: external
      port_id: { get_resource: sparc_server1_port }

outputs:
  x86_server1_public_ip:
    description: Floating IP address of x86 server in public network
    value: { get_attr: [ x86_server1_floating_ip, floating_ip_address ] }
  sparc_server1_public_ip:
    description: Floating IP address of SPARC server in public network
    value: { get_attr: [ sparc_server1_floating_ip, floating_ip_address ] }
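
To run the test we launch a stack from that template with the Juno-era heat client; roughly like this, where the template file name and image names are placeholders for the ones we actually use:

# Launch the sanity-test stack, passing the image names as parameters
heat stack-create sanity-test -f sanity_test.yaml \
    -P x86_image=sol-11_3-x86 -P sparc_image=sol-11_3-sparc
# Wait for CREATE_COMPLETE, then read the floating IPs from the stack outputs
heat stack-show sanity-test
# Clean up once we've logged in to both guests
heat stack-delete sanity-test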

Once that test runs successfully, we declare the outage over and re-enable the Apache service to restore access to the Horizon dashboard.
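
Re-opening the dashboard just means undoing the persistent disable from the upgrade script on the controller:

# Bring Apache (and thus Horizon) back online and confirm it
svcadm enable apache22
svcs apache22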

Our Upgrade Experiences

Since we went into production almost a year ago, we've upgraded the entire cloud infrastructure, including the OpenStack packages, seven times.  Had we met our goals we would have upgraded every two weeks, as each full Solaris development build is released internally (and thus would have done over 20 upgrades), but the reality of running at the bleeding edge of the operating system's development is that we find bugs, and several have been too serious and too difficult to work around for us to upgrade, so we've had to delay a number of times while we waited for fixes to integrate.  Through all of this, we've learned a lot and are continually refining our upgrade process.

So far, we've only had one upgrade over the last year that was unsuccessful, and that was reasonably painless, since we just re-activated the old boot environment on each node and rebooted back to it.  We now pre-stage each upgrade on a single-node stack that's configured similarly to the production cloud to verify there aren't any truly catastrophic problems with kernel zones, ZFS, or networking.  That's mostly been successful, but we're going to build a small multi-node cloud for the staging to ensure that we can catch issues in additional areas such as iSCSI that aren't exercised properly by the single-node stack.  The lesson, as always, is to have your test environment replicate production as closely as possible.
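
Backing out is cheap precisely because pkg update leaves the previous boot environment intact; the rollback on each node is essentially this (the old BE name here is illustrative):

# List boot environments, re-activate the pre-upgrade one, and reboot into it
beadm list
beadm activate solaris11_2    # hypothetical name of the previous BE
reboot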

For this particular upgrade, we did a lot more testing; I spent the better part of two weeks running trial upgrades from Havana to Juno to shake out issues there, which allowed the development team to fix a bunch of bugs in the configuration file and database upgrades before we went ahead with the actual upgrade.  Even so, the production upgrade was more of an adventure than we expected.  We ran into three issues:

  1. After we rebooted the controller node, the heat-db service went into maintenance.  The database had been corrupted when the service exceeded its start method timeout, which caused SMF to kill and restart it, and that apparently happened at a very inopportune time.  Fortunately we had made little use of Heat with Havana, so we could simply drop the database and recreate it.  The SMF start method timeout is being fixed (for heat-db and other services), though that fix isn't in the 11.3 beta release.  We're also having some discussion about whether SMF should generally default to much longer start method timeouts; a sketch of how such a timeout can be raised follows this list.  We find that developers are consistently over-optimistic about the true performance of systems in production, and the short timeouts of 30 seconds or 1 minute that are often used are more likely to cause harm than good.
  2. The puppet:master service went into maintenance when that node was rebooted; using truss, we determined that for some reason it was attempting to kill labeld, failing, and exiting.  This is still being investigated, as we've had difficulty reproducing it.  Fortunately, disabling labeld worked around the problem and we were able to proceed.
  3. After we had resolved the above issues, the test launches we use to verify the cloud is working would not complete - they'd be queued but never actually happen.  This took us over an hour to diagnose, in part because we're not that experienced with RabbitMQ issues; to that point it had "just worked".  It turned out that we were victims of the default file descriptor limit for RabbitMQ, at 256, being too low to handle all of the connections from the various services using it.  Apparently Juno is simply more resource-hungry in this respect than Havana, and it's not something we could have observed in the smaller test environment.  Adding a "ulimit -n 1024" to the rabbitmq start method worked around this for now; this has sparked some internal discussion, as yet unresolved, on whether the default limits should be increased.  The values are relics from many years ago and likely could use some updating.
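
As mentioned in the first item above, a too-short SMF start method timeout can be raised per-service with svccfg; the abbreviated FMRI and the ten-minute value here are illustrative, not the fix that eventually shipped:

# Give the heat-db start method ten minutes instead of the default (example values)
svccfg -s heat-db setprop start/timeout_seconds = count: 600
svcadm refresh heat-db
# Clear the maintenance state so SMF retries the start with the new timeout
svcadm clear heat-db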

Overall, this upgrade clocked in at a bit over 4 hours of downtime, not the 3 hours that we'd scheduled.  Happily, our cloud has run very smoothly in the weeks since the upgrade to Juno, and our users are very pleased with the much-improved Horizon dashboard.  We're now working our way through a long list of improvements to our cloud and getting the equipment in place to move to an HA environment, which will let us move towards our goal of rolling, zero-downtime upgrades.  More updates to come!

Footnotes

  1. If you're following the OpenStack community, you'll ask,
    "What about Icehouse?"  Well, we skipped it, in order to get closer to
    the community releases more quickly.
  2. I am happy to note that, in spite of this lack of HA, we've had only a few
    minutes of unscheduled service interruptions over the course of the year,
    due mostly to panics in the Cinder or Neutron servers.  That seems pretty good
    considering the bleeding-edge nature of the software we're running.
