Wednesday Apr 28, 2010

Last call for the paranoid git

I joined Sun on July 11th 1995, so I'm very close to making 15 years now that I'm being assimilated into Oracle on May 1st - it is a pity I didn't make it all the way.

I've been posting entries in this blog since April 4th 2004, and now that I'm being sucked into the huge beast that Oracle is, I'm going to end this blog and start posting at my new site instead, to allow me to speak freely without fear of violating some corporate policy of what I can and can't say.

Code Nursery is where I spend my spare time - working on open source projects I find interesting, like:

Or helping friends and family build websites for their interest groups or small businesses - just for the mental exercise, and because I like keeping myself up to date with new technology!

According to the "Proprietary Information Agreement" I signed today, Oracle wants to claim everything I do that falls under "any current or reasonably anticipated business of Oracle" ... "whether or not conceived during regular business hours". Luckily that doesn't hold up in court in Sweden; they can only claim stuff that I do at work which is related to what I'm hired to do.

However, to continue contributing to open source projects I have to be really careful not to mix business with pleasure, so that is yet another reason for me to move away from the corporate site, and to use my own hardware in my spare time from now on.

For those of you who come here for Solaris auditing information - I'm gradually going to review all those posts I've made over the years and do a "clean-room rewrite" of them at the new blog, so keep your eyes peeled if you are an audit junkie like me :)

That's all I had to say - now mosey on over to the new site and read about how I access it securely...


Thursday Nov 05, 2009

Behavior Driven Infrastructure

One problem I'm wrestling with in my day job at Web Engineering is: how do you know when a system you are building is ready?

When we build a new system, it goes through the following steps:

  1. Jumpstart
    Installs the OS and sets up basic configuration, like hostname, domainname, network.
  2. Puppet
    System specific configuration
  3. Manual steps
    This includes things which are too system dependent to automate, like creating a separate zpool for application data on external storage

For me it has been enough to review the puppet logs to determine if the system has been correctly configured, but for my colleagues who aren't using puppet on a daily basis, it isn't. They have been asking "how do we know if a system is ready?", and I've realized that "review the puppet logs" isn't really a helpful answer for most people. What if you have forgotten to add a node definition for the system, and you get the default node configuration? Then puppet will tell you everything is configured correctly - which is partly true: the things puppet has been told to configure are configured, but what about the stuff I forgot to tell it about?

So I've been thinking about using the same approach as I use when I write code: Behavior Driven Development. I.e. you start by specifying the behavior of the program you are developing, and after that you start to code. This has the benefit of easily letting you know when you are done: if your code passes all the behavior tests, you can release it.

Translating this to Solaris installs isn't that hard: instead of describing program behavior you describe (operating) system behavior. You can use the same tools as you do for development, and I've been using cucumber for my Ruby on Rails projects, so it is what I picked for my initial testing. Cucumber uses natural language to describe the behavior you want, which makes it easy for non-programmers to understand what it is testing.

When you write the definitions, you should not use technical language like "ssh to the host weblogs and grep for a passwd(4) entry for the user martin in /etc/passwd". Instead use something like "I should be able to ssh to weblogs, and log in as the user martin", which is the behavior you want. Cucumber then takes that definition and translates it into step-by-step instructions which can be validated.

This is how it can look when you run it:

martin@server$ cucumber
Feature: sendmail configure
  Systems should be able to send mail

  Scenario: should be able to send mail                  # features/
    When connecting to using ssh   # features/steps/ssh_steps.rb:12
    Then I want to send mail to "" # features/steps/mail_steps.rb:1

Feature: NIS client
  Systems on SWAN should be NIS clients

  Scenario: should be able to match entries in NIS    # features/
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then I want to lookup "xuan" in the passwd table   # features/steps/nis_steps.rb:1
    And I want to lookup "onnv" in the hosts table     # features/steps/nis_steps.rb:1

  Scenario: should be able to make lookups through NIS # features/
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then I want to lookup "xuan" through nsswitch.conf # features/steps/nis_steps.rb:5

Feature: SSH access
  SSH should be configured

  Scenario: ssh user access                            # features/
    Given a user named "martin"                        # features/steps/ssh_steps.rb:3
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should succeed                 # features/steps/ssh_steps.rb:28

  Scenario: no lingering default OpenSolaris user      # features/
    Given a user named "jack" with password "jack"     # features/steps/ssh_steps.rb:7
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should fail                    # features/steps/ssh_steps.rb:32

5 scenarios (5 passed)
13 steps (13 passed)

This makes it really easy to see if the behavior of the system is what you expect. All green means it is ready!

The stuff I am working on at the moment is to make the failures understandable by a non-programmer. For example, when a scenario fails (because the login succeeds on a system where it should have been refused), it looks like this:

  Scenario: no lingering default OpenSolaris user      # features/
    Given a user named "jack" with password "jack"     # features/steps/ssh_steps.rb:7
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should fail                    # features/steps/ssh_steps.rb:28
      expected not nil, got nil (Spec::Expectations::ExpectationNotMetError)
      ./features/steps/ssh_steps.rb:29:in `/^the connection should succeed$/'
      features/ `Then the connection should succeed'

Failing Scenarios:
cucumber features/ # Scenario: no lingering default OpenSolaris user

5 scenarios (1 failed, 4 passed)
13 steps (1 failed, 12 passed)

It is not obvious that expected not nil, got nil means that it could log in when it shouldn't be able to, so I am working on some custom rspec matchers to generate better error messages.
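
As a rough sketch of the kind of message I am after, a matcher can carry its own failure text. The class below is plain Ruby and does not load rspec; the names RefuseLogin and Connection, and the method names, are made up for illustration (the method names rspec expects for the message differ between versions), so treat it as an outline and not as the matcher I will end up with.

```ruby
# Hypothetical stand-in for the result of an ssh login attempt.
# session is nil when the login was refused.
Connection = Struct.new(:host, :user, :session)

# Matcher-style object: decides pass/fail and explains a failure in
# plain words instead of "expected not nil, got nil".
class RefuseLogin
  # Passes when no session was opened, i.e. the login was refused.
  def matches?(connection)
    @connection = connection
    connection.session.nil?
  end

  # Text to show when the login unexpectedly succeeded.
  def failure_message
    "expected login as '#{@connection.user}' on #{@connection.host} " \
      "to be refused, but it succeeded"
  end
end

matcher = RefuseLogin.new
attempt = Connection.new("weblogs", "jack", :open_session)
puts matcher.failure_message unless matcher.matches?(attempt)
```

The point of the design is that the step definition only says what it expects, and the matcher owns the wording of the failure.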

Once I've gotten a bit beyond playing around with this, I will publish the source if someone is interested in it.

Tuesday Jun 23, 2009

Planning to fail when using Puppet

We put a lot of thought into planning for failure when we set up our sites (like, and so on). Every component is redundant, from border firewalls to load-balancers to front end web servers to root disks. We even put the gear in separate racks on separate power, just in case someone accidentally knocks both power cables out. This is arranged in odd and even sides, and servers are placed in the corresponding side, i.e. is placed on the odd side and is placed on the even side. If we use more than two servers they are added to the respective side.

But the chain is only as strong as its weakest link: if I screw up when I update the puppet profile for our base server class, things will quickly go south.

No matter how carefully I test things before I commit my changes to the master mercurial repository and on to the puppetmaster (we only ran one per site before), there is still a chance things go boink! There are always some servers which were set up a few years ago, long before we started using puppet, that aren't installed and configured the way I expect, and when they are modified by puppet - they break!

So it doesn't matter that we are running multiple systems: they all get changed by puppet within 30 minutes.

To work around this problem I've set up two puppetmasters, and each serves the corresponding side (odd or even). This lets me push a change to one side first and let it stew for a while before I push it to the other side.
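
The odd/even split can be sketched in a few lines of Ruby. It assumes host names end in a number (something like web1 and web2); the side_for helper and the example names are made up for illustration and are not taken from our actual setup.

```ruby
# Map a host to the "odd" or "even" side from the number at the end of
# its short name. Assumption: names end in digits, e.g. "web1", "web2".
def side_for(hostname)
  short = hostname.split(".").first.to_s
  digits = short[/\d+\z/]
  raise ArgumentError, "no trailing number in #{hostname.inspect}" if digits.nil?
  digits.to_i.odd? ? "odd" : "even"
end

%w[web1 web2 web13].each do |host|
  puts "#{host} is served by the #{side_for(host)} puppetmaster"
end
```

Raising on a name without a number is deliberate in this sketch: a host that silently lands on the wrong side would defeat the purpose of the staged rollout.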

Tuesday Mar 03, 2009

Running puppet on OpenSolaris

I'm running puppet on the production servers I manage at Sun, and for Solaris 10 I've had to compile Ruby and create my own package (for easy distribution). I've also created my own puppet and facter packages, as I didn't want to set up rubygems.

Now on OpenSolaris this is much easier, as you can just run:

# pkg install -q SUNWruby18
# gem install -y puppet
Bulk updating Gem source index for:
Successfully installed puppet-0.24.7
Successfully installed facter-1.5.4
Installing ri documentation for puppet-0.24.7...
Installing RDoc documentation for puppet-0.24.7...
and you are all set to configure /etc/puppet/puppet.conf to get puppetmasterd and puppetd running!

Wednesday Dec 10, 2008

Sendmail, may I introduce Alteon to you?

Yesterday we started using an Alteon VIP to load balance SMTP traffic to our two mail servers, and everything was fine and dandy, but when I took a look in /var/log/syslog I found loads of entries like this:

Dec 11 18:17:14 prod-git1 sendmail[20899]: [ID 801593] j93FHDNX020899: []
did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA

The Alteon health check connects and then just issues a QUIT, which sendmail finds suspicious, and hence feels obliged to let me know about it. This becomes very annoying when you have two Alteons doing the check every other second!

After scratching my head for a while and searching for a solution, I came across this patch to sendmail, which lets you select systems which shouldn't generate the above log entry. The only caveat was that I'd have to build my own sendmail, and I really don't want to roll my own stuff as it requires more work to support, so I continued to look for another solution.

I finally figured out (after reading the sendmail source code) that if I in /etc/mail/ set

O PrivacyOption=authwarnings,needexpnhelo,needvrfyhelo

sendmail would be quiet if the Alteon changed the health check to doing the equivalent of this:

mconnect localhost
connecting to host localhost (, port 25
connection open
220 ESMTP Sendmail 8.13.8+Sun/8.13.8; Thu, 11 Dec 2008 13:58:48 +0100 (CET)
VRFY root
503 5.0.0 I demand that you introduce yourself first
221 2.0.0 closing connection

So we changed the health check from being smtp to a custom script (note that you need the double backslashes):

open 25,tcp
expect "ESMTP"
send "VRFY root\\n"
expect "503"
send "QUIT\\n"
expect "221"

And after pushing this change out, sendmail stopped filling the log with messages I don't want to see.
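
For trying the dialog outside the load balancer, the same exchange can be sketched in Ruby. The vrfy_health_check helper is hypothetical and takes any object with gets and write, such as a TCPSocket; the FakeServer class only replays replies like the ones shown above, so the sketch runs without a mail server.

```ruby
# Run the greeting / VRFY / QUIT dialog and report whether each reply
# contained the string the health check script expects.
def vrfy_health_check(conn)
  return false unless conn.gets.to_s.include?("ESMTP")
  conn.write("VRFY root\r\n")
  return false unless conn.gets.to_s.include?("503")
  conn.write("QUIT\r\n")
  conn.gets.to_s.include?("221")
end

# Stand-in for a socket: hands out canned replies and records what was sent.
class FakeServer
  attr_reader :sent

  def initialize(replies)
    @replies = replies.dup
    @sent = []
  end

  def gets
    @replies.shift
  end

  def write(data)
    @sent << data
  end
end

server = FakeServer.new([
  "220 ESMTP Sendmail\r\n",
  "503 5.0.0 I demand that you introduce yourself first\r\n",
  "221 2.0.0 closing connection\r\n"
])
puts vrfy_health_check(server) ? "healthy" : "unhealthy"
```

Against a real server you would pass in a TCPSocket opened on port 25 (after require "socket") instead of the fake.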

Thursday May 08, 2008

Creating a user_attr puppet type

I've come a fair way in my puppet testing now, but one thing I lack is a user_attr type, i.e. a way to update the /etc/user_attr file using puppet.

This is what I have in mind for the syntax:

user_attr { "martin":
    type => normal,
    roles => [
    profiles => "Zone Management",
    auths => [

One thing I haven't figured out yet is whether the definitions should be absolute, i.e. if the entry must be exactly like the definition, or if it is enough that the listed values are present. In the above example, should the role list be exactly root,admin, or should it just make sure that those two roles are in the list, so that you can have the role audit too? Perhaps it would be good to be able to use the absent/present syntax on individual items?

I haven't decided if I'm going to manage the other user attributes too, e.g. project, defaultpriv, limitpriv and lock_after_retries. I will probably leave that for a later release...


Friday Apr 18, 2008

Testing puppet configurations

I've set up a puppet environment which uses mercurial to store the configuration and manifests. Now I'm trying to build an environment to be able to test changes before I commit them to the repository, and they propagate to all our 400 servers - but I encountered a problem.

You can use a separate configuration directory with the --confdir option for both puppetd and puppetmasterd, and run everything on localhost, but the problem is the source parameter:

file { "/etc/profile":
    owner => root,
    group => root,
    mode => 644,
    source => "puppet://server/base/profile"
}

The above source parameter contains the hostname, so when I want to test it on my local mercurial repository, it still connects to the server instead of localhost when it fetches the files.

Luckily there is a solution! If you leave out the server part, i.e. write the source as puppet:///base/profile, puppetd will insert the name of the server it is connecting to.

Tuesday Apr 08, 2008

Trying out puppet

I'm looking for ways to better manage our servers, and right now I'm playing with puppet.

I immediately ran into a problem: it picked the wrong domain name. Internally at Sun we use NIS (yes, I know it is insecure and sucks in almost all aspects, but I'm not in a position to change it - and believe me I have tried) and our NIS domain name doesn't match the DNS domain name.

This is something puppet (facter to be exact) doesn't figure out, at least not on Solaris. Instead of picking the correct fqdn for a host, e.g., it picks, since that is what the domainname command returns.

They tried to fix this, but unfortunately it doesn't work for Solaris, as it relies on the dnsdomainname command, which we don't have.

I've worked around it by creating my own /usr/bin/dnsdomainname which gets called before domainname.

#!/bin/sh
# Print the NIS domain name with everything up to the first dot removed
DOMAIN="`/usr/bin/domainname 2> /dev/null`"
if [ ! -z "$DOMAIN" ]; then
    echo "$DOMAIN" | sed 's/^[^.]*.//'
fi

So now I can continue to test my puppet configurations...

Monday Mar 31, 2008

The danger of growing too fast

Our esteemed director has pushed us too far for too long - he requires us to rack 'em and stack 'em all day long, and after the last spree of installing alpha hardware he got from engineering (the new 4 way, 16-core Rock based systems, code name lurad) for the cluster we now have such a big mess in our server room that I thought I'd share it with you:

Picture by: VespaGT

We have added 72 of these little monsters since the beginning of last week and haven't had time to clean up the cables - so now it is time to bring out the dymo and start labeling...


Friday Mar 07, 2008

Converting HFS from case sensitive to case insensitive

I've managed to solve the problem I was blogging about earlier.

I started out by forcing TimeMachine to do a backup and since I wasn't sure I'd succeed in restoring my data using it, I did a gtar backup of all user directories too.

Once the backups were done I booted the Leopard install DVD, started DiskUtility, and reformatted the disk as HFS, Journaled and Case Insensitive. After that I started TimeMachine and chose the restore option. It immediately reformatted my disk to match the backup, and that wasn't what I wanted.

So I reformatted the disk again and then chose to do an install from scratch. When the installation completed and the system rebooted, the migration assistant asked if I would like to migrate old data, and I picked the option to restore from the last TimeMachine backup.

This time it didn't do anything with my file system and all files & settings were restored - and I could start the Photoshop CS3 installation and get it installed!

I don't know how it would have handled a conflict, i.e. restoring foo and Foo, since I wrote a Perl script to make sure that I didn't have any conflicts.
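
The conflict check itself is simple enough to sketch. The Ruby below is not the Perl script I used; case_conflicts is a made-up helper that groups paths by their lower-cased form and reports every group with more than one member. Lower-casing is only an approximation of how HFS compares names, and the sample paths are invented.

```ruby
# Return groups of paths that would collide on a case insensitive
# file system, i.e. paths that are equal once lower-cased.
def case_conflicts(paths)
  paths.group_by(&:downcase).values.select { |group| group.size > 1 }
end

# Invented sample data: foo and Foo collide, bar does not.
paths = ["docs/foo", "docs/Foo", "docs/bar"]
case_conflicts(paths).each { |group| puts group.join(" <-> ") }
```

To check a real tree you could feed it the result of Dir.glob("**/*") run from the top of the directory you want to restore.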

Monday Mar 03, 2008

Insensitive file systems

cASe inSEnsITIvE file system - what an utterly stupid idea!

When I installed Leopard on my MacBook Pro it was a natural choice to make the file system case sensitive. Besides being a UNIX geek I had a legitimate reason for doing so: on a case insensitive file system you can't do

hg clone ssh://

as the OpenSolaris source code contains case insensitivity conflicts.

So what am I bitching about then? Yesterday I tried to install Adobe Photoshop CS3 on my wife's MacBook Pro (which I also installed with case sensitivity) and got this very unintuitive dialog:

This software cannot be installed because the file system of the OS volume is not supported

After scratching my head for a while, I figured out that it is due to the case sensitivity! Adobe hasn't bothered to fix their code, and it is not like it is a new feature in Mac OS X either... they have had several years to fix it.

Unfortunately there is no solution to this other than to reformat the file system and make it case insensitive! To go from bad to worse, I can't use TimeMachine to do it, as it doesn't support backing up a case sensitive file system and restoring it to a case insensitive one. It would only have to alert me if there is a conflict - which there isn't in my case, I've checked!

Luckily Mac OS X comes with all the UNIX tools we love and cherish, so I'll just use cpio or gtar to back up all my data and then nuke the / partition (while keeping my zpool)

Update: as suggested by zdz and Dick Davies I tried creating a disk image with a case insensitive HFS, but that didn't work either for the Photoshop installer. The hint is in the error message "OS volume is not supported". Back to the original plan of backup/reinstall/restore...

Tuesday Nov 27, 2007

Trying out mirrored zfs root on Indiana

I've been playing around with project Indiana, and the new installer and packaging system, and they are really nice.

When you install, it turns the root disk into a zpool called zpl_slim, but it doesn't let you select two disks and mirror the zpool. Luckily you can fix this once the installation is done. When the system has booted, you can use the zpool attach command:

# zpool attach zpl_slim c7d0s0 c8d0s0
# zpool status
  pool: zpl_slim
 state: ONLINE
 scrub: resilver in progress, 11.75% done, 0h3m to go

        NAME        STATE     READ WRITE CKSUM
        zpl_slim    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7d0s0  ONLINE       0     0     0
            c8d0s0  ONLINE       0     0     0

errors: No known data errors

Friday Nov 09, 2007

CSWmercurial 0.9.5

Now that CSWpython is upgraded I've finally got my act together and found some spare cycles lying around in a drawer, so I could finish the update of the CSWmercurial package. I've sent it out for alpha testing, so hopefully I'll be able to publish it by the end of next week.

Saturday Nov 03, 2007


This post is a petition to Apple to get their act together and finish Java 6 for Leopard

If you wonder what the strange title means, read this blog post.

Monday Oct 29, 2007

Time Machine & ZFS

I've just installed Leopard on my MacBook Pro, and was at first disappointed that it only had read-only zfs, but after checking out ADC that was solved :)

I also wanted to try out Time Machine and thought that I could place the backups on zfs, but Time Machine doesn't let me select zfs as a destination. Hopefully I'll be able to trick it somehow ;)

Update: after Jeff Harrell's comment I read up on Time Machine here and here, and as Jeff says it uses directory hard-links, so that won't work with zfs. Bummer! :(



