Thursday Nov 05, 2009

Behavior Driven Infrastructure

One problem I'm wrestling with in my day job at Web Engineering is: how do you know when a system you are building is ready?

When we build a new system, it goes through the following steps:

  1. Jumpstart
    Installs the OS and sets up basic configuration, like hostname, domainname, network.
  2. Puppet
    System specific configuration
  3. Manual steps
    This includes things which are too system dependent to automate, like creating a separate zpool for application data on external storage

For me it has been enough to review the puppet logs to determine whether the system has been correctly configured, but for my colleagues who aren't using puppet on a daily basis, it isn't. They have been asking "how do we know if a system is ready?", and I've realized that "review the puppet logs" isn't really a helpful answer for most people. What if you have forgotten to add a node definition for the system, and it gets the default node configuration? Then puppet will tell you everything is configured correctly, which is partly true: the things puppet has been told to configure are configured, but what about the stuff I forgot to tell it about?

So I've been thinking about using the same approach as I use when I write code: Behavior Driven Development. That is, you start by specifying the behavior of the program you are developing, and only then do you start to code. This has the benefit of easily letting you know when you are done: if your code passes all the behavior tests, you can release it.

Translating this to Solaris installs isn't that hard: instead of describing program behavior, you describe (operating) system behavior. You can use the same tools as you do for development. I've been using cucumber for my Ruby on Rails projects, so that is what I picked for my initial testing. Cucumber uses natural language to describe the behavior you want, which makes it easy for non-programmers to understand what it is testing.

When you write the definitions, you should not use technical language like "ssh to the host weblogs and grep for a passwd(4) entry for the user martin in /etc/passwd". Instead, use something like "I should be able to ssh to weblogs, and log in as the user martin", which is the behavior you want. Cucumber then matches each line of that definition against step definitions, small pieces of Ruby code that carry out the step and validate the result.
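To make the matching idea concrete, here is a minimal sketch in plain Ruby of how a runner can map a plain-language step to code. It is a toy stand-in for illustration, not Cucumber's actual implementation: the names STEPS, step and run_step are made up here, and the ssh connection is faked with a list of known users rather than a real session.

```ruby
# Toy step runner: maps plain-language steps to Ruby blocks via regexps.
# STEPS, step and run_step are hypothetical names, not Cucumber's API.
STEPS = []

# Register a block to run for any step text matching the pattern.
def step(pattern, &block)
  STEPS << [pattern, block]
end

# Find the first registered pattern that matches and run its block,
# passing the shared state and the regexp captures as arguments.
def run_step(text, state)
  STEPS.each do |pattern, block|
    match = pattern.match(text)
    return block.call(state, *match.captures) if match
  end
  raise ArgumentError, "undefined step: #{text}"
end

step(/^a user named "([^"]*)"$/) do |state, name|
  state[:user] = name
end

# A real step would open an ssh session here; this fake one only
# "succeeds" for users listed in state[:known_users].
step(/^connecting using ssh$/) do |state|
  state[:connected] = state[:known_users].include?(state[:user])
end

step(/^the connection should succeed$/) do |state|
  raise "could not log in as #{state[:user]}" unless state[:connected]
  true
end
```

Running the three steps in order against a state hash such as { known_users: ["martin"] } passes for "martin" and raises for any other user, which is the pass/fail signal a runner reports per step.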

This is how it can look when you run it:

martin@server$ cucumber
Feature: sendmail configure
  Systems should be able to send mail

  Scenario: should be able to send mail                  # features/
    When connecting to using ssh   # features/steps/ssh_steps.rb:12
    Then I want to send mail to "" # features/steps/mail_steps.rb:1

Feature: NIS client
  Systems on SWAN should be NIS clients

  Scenario: should be able to match entries in NIS    # features/
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then I want to lookup "xuan" in the passwd table   # features/steps/nis_steps.rb:1
    And I want to lookup "onnv" in the hosts table     # features/steps/nis_steps.rb:1

  Scenario: should be able to make lookups through NIS # features/
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then I want to lookup "xuan" through nsswitch.conf # features/steps/nis_steps.rb:5

Feature: SSH access
  SSH should be configured

  Scenario: ssh user access                            # features/
    Given a user named "martin"                        # features/steps/ssh_steps.rb:3
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should succeed                 # features/steps/ssh_steps.rb:28

  Scenario: no lingering default OpenSolaris user      # features/
    Given a user named "jack" with password "jack"     # features/steps/ssh_steps.rb:7
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should fail                    # features/steps/ssh_steps.rb:32

5 scenarios (5 passed)
13 steps (13 passed)

This makes it really easy to see if the behavior of the system is what you expect. All green means it is ready!

What I am working on at the moment is making the failures understandable to a non-programmer. For example, when a scenario fails (here it succeeds in logging in to a system where it should have failed), it looks like this:

  Scenario: no lingering default OpenSolaris user      # features/
    Given a user named "jack" with password "jack"     # features/steps/ssh_steps.rb:7
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should fail                    # features/steps/ssh_steps.rb:28
      expected not nil, got nil (Spec::Expectations::ExpectationNotMetError)
      ./features/steps/ssh_steps.rb:29:in `/^the connection should succeed$/'
      features/ `Then the connection should succeed'

Failing Scenarios:
cucumber features/ # Scenario: no lingering default OpenSolaris user

5 scenarios (1 failed, 4 passed)
13 steps (1 failed, 12 passed)

It is not obvious that "expected not nil, got nil" means that it could log in when it shouldn't be able to, so I am working on some custom rspec matchers to generate better error messages.
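As a rough idea of what such a matcher could look like, here is a sketch of a plain Ruby object that follows the RSpec matcher protocol (matches? plus failure messages). The class name RefuseLogin and the convention that a session is nil when the login was refused are assumptions for this example, and the method names follow current RSpec releases; older releases used different names for the failure messages.

```ruby
# Plain-Ruby object following the RSpec matcher protocol.
# RefuseLogin and the session convention are assumptions for this sketch:
# a session is nil when the login was refused, non-nil when it succeeded.
class RefuseLogin
  def initialize(user, host)
    @user = user
    @host = host
  end

  # The expectation passes only when no session was established.
  def matches?(session)
    session.nil?
  end

  # Shown when a login that should have been refused went through.
  def failure_message
    "expected login as #{@user} on #{@host} to be refused, but it succeeded"
  end

  # Shown for the negated form, when a login that should work was refused.
  def failure_message_when_negated
    "expected login as #{@user} on #{@host} to succeed, but it was refused"
  end
end
```

With something like this, the failing scenario would report "expected login as jack on weblogs to be refused, but it succeeded", which tells the reader what actually went wrong without any knowledge of nil.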

Once I've gotten a bit beyond playing around with this, I will publish the source if someone is interested in it.

Tuesday Jun 23, 2009

Planning to fail when using Puppet

We put a lot of thought into planning for failure when we set up our sites (like, and so on). Every component is redundant, from border firewalls to load-balancers to front end web servers to root disks. We even put the gear in separate racks on separate power, just in case someone accidentally knocks both power cables out. This is arranged in odd and even sides, and servers are placed in the corresponding side, i.e. is placed on the odd side and is placed on the even side. If we use more than two servers they are added to the respective side.

But the chain is only as strong as its weakest link: if I screw up when I update the puppet profile for our base server class, things will quickly go south.

No matter how carefully I test things before I commit my changes to the master mercurial repository and on to the puppetmaster (we only ran one per site before), there is still a chance things go boink! There are always some servers which were set up a few years ago, long before we started using puppet, that aren't installed and configured the way I expect, and when they are modified by puppet, they break!

So it doesn't matter that we are running multiple systems: they all get changed by puppet within 30 minutes.

To work around this problem I've set up two puppetmasters, and each serves the corresponding side (odd or even). This lets me push changes to one side first and let them stew for a while before I push them to the other side.
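The mapping from server to puppetmaster can be sketched in a few lines of Ruby. This assumes that hostnames end in a number whose parity decides the side (for example web1 and web2) and that the two puppetmasters are called puppet-odd and puppet-even; both the naming scheme and the master names are made up for illustration.

```ruby
# Pick the puppetmaster for a host from the parity of the trailing number
# in its short hostname. The naming scheme (web1, web2, ...) and the
# puppetmaster names are assumptions made for this sketch.
PUPPETMASTERS = { odd: "puppet-odd", even: "puppet-even" }

# Return :odd or :even for a hostname such as "web1" or "web2.example.com".
def side_for(hostname)
  short = hostname.split(".").first.to_s
  digits = short[/(\d+)\z/, 1]
  raise ArgumentError, "no trailing number in #{hostname}" if digits.nil?
  digits.to_i.odd? ? :odd : :even
end

# Look up the puppetmaster serving the side this host belongs to.
def puppetmaster_for(hostname)
  PUPPETMASTERS.fetch(side_for(hostname))
end
```

A change then goes to the master returned for the odd hosts first, and only reaches the even master once the odd side has run with it for a while.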

Tuesday Mar 03, 2009

Running puppet on OpenSolaris

I'm running puppet on the production servers I manage at Sun, and for Solaris 10 I've had to compile Ruby and create my own package (for easy distribution). I've also created my own puppet and facter packages, as I didn't want to set up rubygems.

Now on OpenSolaris this is much easier, as you can just run:

# pkg install -q SUNWruby18
# gem install -y puppet
Bulk updating Gem source index for:
Successfully installed puppet-0.24.7
Successfully installed facter-1.5.4
Installing ri documentation for puppet-0.24.7...
Installing RDoc documentation for puppet-0.24.7...

and you are all set to configure /etc/puppet/puppet.conf to get puppetmasterd and puppetd running!

Thursday May 08, 2008

Creating a user_attr puppet type

I've come a fair way in my puppet testing now, but one thing I lack is a user_attr type, i.e. a way to update the /etc/user_attr file using puppet.

This is what I have in mind for the syntax:

user_attr { "martin":
    type => normal,
    roles => ["root", "admin"],
    profiles => "Zone Management",
    auths => [

One thing I haven't figured out yet is whether the definitions should be absolute, i.e. whether the entry must be exactly like the definition, or whether it is enough that the listed values are present. In the above example, should the role list be exactly root,admin, or should puppet just make sure that those two roles are in the list, so that you can have the role audit too? Perhaps it would be good to be able to use the absent/present syntax on individual items?
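The two interpretations can be written down as two small checks in Ruby. This is only a sketch of the decision, not code from an actual puppet type, and the function names are hypothetical.

```ruby
# Two ways to decide whether a list in /etc/user_attr is "in sync" with
# the manifest. Function names are hypothetical.

# Absolute: the entry must contain exactly the listed values, no more.
def in_sync_exact?(current, wanted)
  current.uniq.sort == wanted.uniq.sort
end

# Minimum: every listed value must be present; extras are tolerated.
def in_sync_minimum?(current, wanted)
  (wanted - current).empty?
end

current = %w[root admin audit]   # what is in the file today
wanted  = %w[root admin]         # what the manifest lists

in_sync_exact?(current, wanted)    # => false, audit would be removed
in_sync_minimum?(current, wanted)  # => true, audit is left alone
```

The absolute form gives full control over the entry, while the minimum form is safer on systems where someone has added roles by hand.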

I haven't decided if I'm going to manage the other user attributes too, e.g. project, defaultpriv, limitpriv and lock_after_retries. I will probably leave that for a later release...


Friday Apr 18, 2008

Testing puppet configurations

I've set up a puppet environment which uses mercurial to store the configuration and manifests. Now I'm trying to build an environment where I can test changes before I commit them to the repository and they propagate to all our 400 servers, but I ran into a problem.

You can use a separate configuration directory with the --confdir option for both puppetd and puppetmasterd, and run everything on localhost, but the problem is the source parameter:

file { "/etc/profile":
    owner => root,
    group => root,
    mode => 644,
    source => "puppet://server/base/profile"
}

The above source parameter contains the hostname, so when I want to test it on my local mercurial repository, it still connects to the server instead of localhost when it fetches the files.

Luckily there is a solution! If you leave out the server part, puppetd will insert the name of the server it is connecting to.
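For the example above, leaving out the server part turns the source into puppet:///base/profile. If many manifests need the same change, a small Ruby helper along these lines could do the rewrite; strip_server is a hypothetical name, not part of puppet.

```ruby
# Drop the server part of a puppet:// source URL so the same manifest works
# against whichever puppetmaster the client is talking to.
# strip_server is a hypothetical helper name.
def strip_server(source)
  source.sub(%r{\Apuppet://[^/]*/}, "puppet:///")
end

strip_server("puppet://server/base/profile")  # => "puppet:///base/profile"
```

A source that already has no server part is returned unchanged, so the helper can be run over the same manifests more than once.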

Tuesday Apr 08, 2008

Trying out puppet

I'm looking for ways to better manage our servers, and right now I'm playing with puppet.

I immediately ran into a problem: it picked the wrong domain name. Internally at Sun we use NIS (yes, I know it is insecure and sucks in almost all aspects, but I'm not in a position to change it - and believe me I have tried) and our NIS domain name doesn't match the DNS domain name.

This is something puppet (facter to be exact) doesn't figure out, at least not on Solaris. Instead of picking the correct fqdn for a host, e.g., it picks, since that is what the domainname command returns.

They tried to fix this, but unfortunately it doesn't work for Solaris, as it relies on the dnsdomainname command, which we don't have.

I've worked around it by creating my own /usr/bin/dnsdomainname which gets called before domainname.

# Print the DNS domain by stripping the first label from the NIS domain name.
DOMAIN="`/usr/bin/domainname 2> /dev/null`"
if [ ! -z "$DOMAIN" ]; then
    echo "$DOMAIN" | sed 's/^[^.]*.//'
fi

So now I can continue to test my puppet configurations...
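For readers more at home in Ruby than sed, the same idea can be sketched as a function: drop everything up to and including the first dot of the NIS domain name. The function name and the example domain are made up, and unlike the sed expression this version matches a literal dot, so a name without any dot is returned unchanged.

```ruby
# Derive the DNS domain from the NIS domain name by removing the first
# label. dns_domain and the example names below are made up.
def dns_domain(nis_domain)
  return nil if nis_domain.nil? || nis_domain.empty?
  nis_domain.sub(/\A[^.]*\./, "")
end

dns_domain("nis.example.com")                 # => "example.com"
"weblogs.#{dns_domain('nis.example.com')}"    # => "weblogs.example.com"
```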



