Wednesday Apr 28, 2010

Last call for the paranoid git

I joined Sun on July 11th 1995, so with my assimilation into Oracle on May 1st I'm very close to the 15 year mark - it is a pity I didn't make it all the way.

I've been posting entries in this blog since April 4th 2004, and now that I'm being sucked into the huge beast that Oracle is, I'm going to end this blog and start posting stuff at instead, to allow me to speak freely without fear of violating some corporate policy about what I can and can't say.

Code Nursery is where I spend my spare time - working on open source projects I find interesting, like:

Or helping friends and family build websites for their interest groups or small businesses - just for the mental exercise, and because I like keeping up to date with new technology!

According to the "Proprietary Information Agreement" I signed today, Oracle wants to claim everything I do that falls under "any current or reasonably anticipated business of Oracle" ... "whether or not conceived during regular business hours". Luckily that doesn't hold up in court in Sweden: they can only claim things I do at work which are related to what I'm hired to do.

However, to continue contributing to open source projects I have to be really careful not to mix business with pleasure, so that is yet another reason for me to move away from the corporate site, and to use my own hardware in my spare time from now on.

For those of you who come here for Solaris auditing information - I'm gradually going to review all those posts I've made over the years and do a "clean-room rewrite" of them at the new blog, so keep your eyes peeled if you are an audit junkie like me :)

That's all I had to say - now mosey on over to the new site and read about how I access it securely...


Thursday Nov 05, 2009

Behavior Driven Infrastructure

One problem I'm wrestling with in my day job in Web Engineering is: how do you know when a system you are building is ready?

When we build a new system, it goes through the following steps:

  1. Jumpstart
    Installs the OS and sets up basic configuration, like hostname, domainname, network.
  2. Puppet
    System specific configuration
  3. Manual steps
    This includes things which are too system dependent to automate, like creating a separate zpool for application data on external storage

For me it has been enough to review the puppet logs to determine whether the system has been correctly configured, but for my colleagues who aren't using puppet on a daily basis, it isn't. They have been asking "how do we know if a system is ready?", and I've realized that "review the puppet logs" isn't really a helpful answer for most people. What if you have forgotten to add a node definition for the system, and it gets the default node configuration? Then puppet will tell you everything is configured correctly - which is partly true: the things puppet has been told to configure are configured, but what about the stuff I forgot to tell it about?

So I've been thinking about using the same approach as I use when I write code: Behavior Driven Development. That is, you start by specifying the behavior of the program you are developing, and only then do you start to code. This has the benefit of easily letting you know when you are done: if your code passes all the behavior tests, you can release it.

Translating this to Solaris installs isn't that hard: instead of describing program behavior, you describe (operating) system behavior. You can use the same tools as you do for development; I've been using cucumber for my Ruby on Rails projects, so that is what I picked for my initial testing. Cucumber uses natural language to describe the behavior you want, which makes it easy for non-programmers to understand what is being tested.

When you write the definitions you should not use technical language, like "ssh to the host weblogs and grep for a passwd(4) entry for the user martin in /etc/passwd"; instead use something like "I should be able to ssh to weblogs, and log in as the user martin", which is the behavior you want. Cucumber then takes that definition and translates it into step-by-step instructions which can be validated.
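Under the hood a cucumber step definition is little more than a regular expression with captures. A sketch (the step text is from the example above, but the regexp itself is illustrative, not from my actual test suite) of how the English line gets mapped onto one:

```ruby
# A cucumber step definition pairs a regexp like this with a block;
# the captures become the block's arguments.
STEP = /^I should be able to ssh to (\S+), and log in as the user (\S+)$/

line = "I should be able to ssh to weblogs, and log in as the user martin"
host, user = line.match(STEP).captures
# cucumber would now hand "weblogs" and "martin" to the step's
# block, which performs the actual ssh login check
```

This is what makes the plain-English feature files executable: each sentence only has to match some registered regexp.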

This is how it can look when you run it:

martin@server$ cucumber
Feature: sendmail configure
  Systems should be able to send mail

  Scenario: should be able to send mail                  # features/
    When connecting to using ssh   # features/steps/ssh_steps.rb:12
    Then I want to send mail to "" # features/steps/mail_steps.rb:1

Feature: NIS client
  Systems on SWAN should be NIS clients

  Scenario: should be able to match entries in NIS    # features/
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then I want to lookup "xuan" in the passwd table   # features/steps/nis_steps.rb:1
    And I want to lookup "onnv" in the hosts table     # features/steps/nis_steps.rb:1

  Scenario: should be able to make lookups through NIS # features/
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then I want to lookup "xuan" through nsswitch.conf # features/steps/nis_steps.rb:5

Feature: SSH access
  SSH should be configured

  Scenario: ssh user access                            # features/
    Given a user named "martin"                        # features/steps/ssh_steps.rb:3
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should succeed                 # features/steps/ssh_steps.rb:28

  Scenario: no lingering default OpenSolaris user      # features/
    Given a user named "jack" with password "jack"     # features/steps/ssh_steps.rb:7
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should fail                    # features/steps/ssh_steps.rb:32

5 scenarios (5 passed)
13 steps (13 passed)

This makes it really easy to see if the behavior of the system is what you expect. All green means it is ready!

What I am working on at the moment is making the failures understandable by a non-programmer. For example, when a scenario fails (here it succeeded in logging in to a system where it should have failed), it looks like this:

  Scenario: no lingering default OpenSolaris user      # features/
    Given a user named "jack" with password "jack"     # features/steps/ssh_steps.rb:7
    When connecting to using ssh # features/steps/ssh_steps.rb:12
    Then the connection should fail                    # features/steps/ssh_steps.rb:28
      expected not nil, got nil (Spec::Expectations::ExpectationNotMetError)
      ./features/steps/ssh_steps.rb:29:in `/^the connection should succeed$/'
      features/ `Then the connection should succeed'

Failing Scenarios:
cucumber features/ # Scenario: no lingering default OpenSolaris user

5 scenarios (1 failed, 4 passed)
13 steps (1 failed, 12 passed)

It is not obvious that "expected not nil, got nil" means that it could log in when it shouldn't be able to, so I am working on some custom rspec matchers to generate better error messages.
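The idea behind such a matcher can be sketched without rspec at all: any object that follows the matcher protocol rspec used at the time (matches?, failure_message, negative_failure_message) will do. Everything below is illustrative - the names and messages are made up, not my actual code:

```ruby
# A hand-rolled matcher object: instead of comparing against nil
# (and getting "expected not nil, got nil"), the failure text says
# what actually went wrong in plain English.
class BeReachableOverSsh
  def matches?(connection)
    !connection.nil?            # a nil connection means login failed
  end

  def failure_message
    "could not log in over ssh (check that the user exists and sshd is up)"
  end

  def negative_failure_message
    "could log in over ssh, but this user should have been denied"
  end
end

# helper so a step can read naturally, e.g.:
#   connection.should be_reachable_over_ssh
def be_reachable_over_ssh
  BeReachableOverSsh.new
end
```

With this, the "no lingering default OpenSolaris user" failure would print the negative message instead of a nil comparison.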

Once I've gotten a bit beyond playing around with this, I will publish the source if someone is interested in it.

Tuesday Jun 23, 2009

Planning to fail when using Puppet

We put a lot of thought into planning for failure when we set up our sites (like, and so on). Every component is redundant, from border firewalls to load balancers to front end web servers to root disks. We even put the gear in separate racks on separate power, just in case someone accidentally knocks both power cables out. The racks are arranged in odd and even sides, and servers are placed in the corresponding side, i.e. is placed on the odd side and is placed on the even side. If we use more than two servers they are added to the respective sides.

But the chain is only as strong as its weakest link: if I screw up when I update the puppet profile for our base server class, things will quickly go south.

No matter how carefully I test things before I commit my changes to the master mercurial repository and on to the puppetmaster (we only ran one per site before), there is still a chance things go boink! There are always some servers which were set up a few years ago, long before we started using puppet, that aren't installed and configured the way I expect, and when they are modified by puppet - they break!

So it doesn't matter that we are running multiple systems: they all get changed by puppet within 30 minutes.

To work around this problem I've set up two puppetmasters, each serving the corresponding side (odd or even). This lets me push changes to one side first and let them stew for a while, before I push them to the other side.

Tuesday Mar 03, 2009

Running puppet on OpenSolaris

I'm running puppet on the production servers I manage at Sun, and for Solaris 10 I've had to compile Ruby and create my own package (for easy distribution). I've also created my own puppet and facter packages, as I didn't want to setup rubygems.

Now on OpenSolaris this is much easier, as you can just run:

# pkg install -q SUNWruby18
# gem install -y puppet
Bulk updating Gem source index for:
Successfully installed puppet-0.24.7
Successfully installed facter-1.5.4
Installing ri documentation for puppet-0.24.7...
Installing RDoc documentation for puppet-0.24.7...
and you are all set to configure /etc/puppet/puppet.conf to get puppetmasterd and puppetd running!
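What goes into that file at this point is minimal. A sketch of the sort of puppet.conf the 0.24.x releases expected - the section names are from that era (per-process sections), and the server name is made up for illustration:

```
[main]
    vardir = /var/puppet
    logdir = /var/log/puppet

[puppetd]
    # the puppetmaster this client should talk to
    server = puppet.example.com
```

The puppetmasterd side mostly runs with the defaults until you start customizing manifests and file serving.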

Friday Dec 12, 2008

A new Solaris security book on the way

For the last few months I've been spending my evenings tapping away on the keyboard - but not producing code or managing Solaris servers like I usually do. I've been writing two chapters for an upcoming Solaris security book! It has been fun, but it has also been hard - not hard because I didn't know what to write, but hard to restrain myself from including too much.

The book is not intended to cover every nitty gritty detail of every security feature in Solaris - that would make it a real brick of a book! So I've had to think hard about what to include, and the level of detail of the included parts.

Parts of the book are already available on Safari Rough Cuts for review before we publish. Please leave comments on the Safari site so that nothing gets lost.

The chapter about File System Security is mine, and I've also authored the chapter about auditing (not very surprising). It hasn't been processed for publication yet, but when it is I'll post a blog entry with a link to it.

Wednesday Dec 10, 2008

Sendmail, may I introduce Alteon to you?

Yesterday we started using an Alteon VIP to load balance SMTP traffic to our two mail servers, and everything was fine and dandy, but when I took a look in /var/log/syslog I found loads of entries like this:

Dec 11 18:17:14 prod-git1 sendmail[20899]: [ID 801593] j93FHDNX020899: []
did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA

The Alteon health check connects and then just issues a QUIT, which sendmail finds suspicious and hence feels obliged to let me know about. This becomes very annoying when you have two Alteons doing the check every other second!

After scratching my head for a while and searching for a solution, I came across this patch to sendmail, which lets you select systems which shouldn't generate the above log entry. The only caveat was that I'd have to build my own sendmail, and I really don't want to roll my own stuff as it requires more work to support, so I continued to look for another solution.

I finally figured out (after reading the sendmail source code) that if I set the following in /etc/mail/

O PrivacyOptions=authwarnings,needexpnhelo,needvrfyhelo

sendmail would be quiet if the Alteon changed the health check to do the equivalent of this:

mconnect localhost
connecting to host localhost (, port 25
connection open
220 ESMTP Sendmail 8.13.8+Sun/8.13.8; Thu, 11 Dec 2008 13:58:48 +0100 (CET)
VRFY root
503 5.0.0 I demand that you introduce yourself first
221 2.0.0 closing connection

So we changed the health check from being smtp to a custom script (note that you need the double backslashes):

open 25,tcp
expect "ESMTP"
send "VRFY root\\n"
expect "503"
send "QUIT\\n"
expect "221"

And after pushing this change out, sendmail stopped filling the log with messages I don't want to see.

Tuesday Apr 08, 2008

Trying out puppet

I'm looking for ways to better manage our servers, and right now I'm playing with puppet.

I immediately ran into a problem: it picked the wrong domain name. Internally at Sun we use NIS (yes, I know it is insecure and sucks in almost all aspects, but I'm not in a position to change it - and believe me, I have tried) and our NIS domain name doesn't match the DNS domain name.

This is something puppet (facter to be exact) doesn't figure out, at least not on Solaris. Instead of picking the correct fqdn for a host, e.g., it picks, since that is what the domainname command returns.

They have tried to fix this, but unfortunately it doesn't work on Solaris, as it relies on the dnsdomainname command, which we don't have.

I've worked around it by creating my own /usr/bin/dnsdomainname, which gets called before domainname:

#!/bin/sh
# derive the DNS domain by stripping the first component
# of the NIS domain name
DOMAIN="`/usr/bin/domainname 2> /dev/null`"
if [ ! -z "$DOMAIN" ]; then
    echo "$DOMAIN" | sed 's/^[^.]*\.//'
fi
So now I can continue to test my puppet configurations...

Tuesday Nov 27, 2007

Trying out mirrored zfs root on Indiana

I've been playing around with project Indiana, and the new installer and packaging system, and they are really nice.

When you install, it turns the root disk into a zpool called zpl_slim, but it doesn't let you select two disks and mirror the zpool. Luckily you can fix this once the installation is done. When the system has booted, you can use the zpool attach command:

# zpool attach zpl_slim c7d0s0 c8d0s0
# zpool status
  pool: zpl_slim
 state: ONLINE
 scrub: resilver in progress, 11.75% done, 0h3m to go

        NAME        STATE     READ WRITE CKSUM
        zpl_slim    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7d0s0  ONLINE       0     0     0
            c8d0s0  ONLINE       0     0     0

errors: No known data errors

Thursday Aug 09, 2007

ZFS the perfect file system for audit trails

Those of you who have to deal with audit trails from busy systems know that they can get really big, and when you need to store them for a couple of years they consume a considerable amount of disk.

To minimize the disk usage you can compress the audit trails, and it works really well (I've adjusted the output to right-align the sizes):

root@warlord# ls -l
total 19447638
-rw-r--r--   1 root     other    1353214369 Aug  9 09:27 20070728065900.20070730163539.warlord
-rw-------   1 root     other      62209268 Aug  7 08:25 20070728065900.20070730163539.warlord.gz
-rw-r--r--   1 root     other    1965073391 Aug  7 09:35 20070728065900.20070730163539.warlord.txt
-rw-r--r--   1 root     other      71460194 Aug  7 09:35 20070728065900.20070730163539.warlord.txt.gz

As you can see the compression ratio is really good (over 90%), but one problem remains: if you need to work with the files you have to uncompress them before you can run your scripts to go and find out who edited /etc/passwd. Uncompressing one file doesn't take that long, but when you don't know exactly which file holds the audit records you are looking for, things start to take time.

Enter ZFS: with on-the-fly disk compression it is the perfect file system for storing audit trails. First you have to enable compression:

root@warlord# zfs set compression=on pool/audit

That only makes future writes compressed, so the files you already have need to be rewritten to become compressed. After having done that it is time to look at the compression ratio:

root@warlord# ls -l
total 19447638
-rw-r--r--   1 root     other    1353214369 Aug  9 09:27 20070728065900.20070730163539.warlord
-rw-r--r--   1 root     other    1965073391 Aug  7 09:35 20070728065900.20070730163539.warlord.txt

Wait a minute! That didn't compress anything - or did it? ls -l shows the size of the uncompressed file, so you have to use du to see the compressed size:

root@warlord# du -k
554067  20070728065900.20070730163539.warlord
732399  20070728065900.20070730163539.warlord.txt

Much better! (Note that du displays the size in kilobytes while ls -l displays it in bytes.) But it still is not as good as when I ran gzip on them. Why? ZFS uses the LZJB compression algorithm, which isn't as space efficient as the gzip algorithm, but it is much faster. If I had been running Nevada I could have used:

root@warlord# zfs set compression=gzip pool/audit

and gotten the same compression ratio as when I "hand compress" my audit trails. This is thanks to Adam, who integrated gzip support in ZFS in build 62 of Nevada.
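Plugging the sizes from the two listings above into a quick calculation shows the gap between the algorithms:

```ruby
# sizes taken from the ls -l and du -k output above
orig = 1_353_214_369        # uncompressed binary trail, bytes (ls -l)
lzjb = 554_067 * 1024       # du -k reports kilobytes, so scale to bytes
gz   = 62_209_268           # the hand-compressed .gz, bytes (ls -l)

lzjb_saved = 100.0 * (1 - lzjb.to_f / orig)   # roughly 58%
gz_saved   = 100.0 * (1 - gz.to_f / orig)     # roughly 95%
```

So LZJB gives up a fair amount of space compared to gzip, which is the trade-off you make for its speed.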


Tuesday Aug 07, 2007

ssh job queue

Today I started to ponder a problem which I can't be alone in having encountered: when you administer over 300 systems and want to perform bulk operations over ssh, there are always one or two systems which are down or unreachable, so your nifty little scripts which log on to each system to install a package, apply a patch, change a configuration setting, tweak a variable or just pull statistics from the system will fail.

So I started toying with the idea of an ssh job queue which helps you keep track of bulk operations, so you can see on which systems the operation has successfully completed. Once I started to try this out I figured that I can't be the first one to face this problem, so I thought I'd ask you for input.
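The core of the idea is small: a done-list per job, so a re-run only revisits the hosts that have not completed yet. A sketch (not an existing tool - the function names are made up, and the actual ssh invocation is passed in as a block so it can be swapped out):

```ruby
require "set"

# Run a command on each host, recording successes in done_file so
# that re-running the same job skips hosts that already completed.
# The block does the real work, e.g.:
#   run_queue(hosts, "patch-123.done") { |h| system("ssh", h, cmd) }
def run_queue(hosts, done_file)
  done = Set.new
  done.merge(File.readlines(done_file).map(&:strip)) if File.exist?(done_file)
  hosts.each do |host|
    next if done.include?(host)
    File.open(done_file, "a") { |f| f.puts host } if yield(host)
  end
end
```

When the down systems come back, you just run the job again and only they get touched.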

How do you deal with this problem? And "pen and paper" isn't the answer I'm looking for :)

Monday Aug 06, 2007

Ever wondered what the files /var/spool/cron/crontabs/\*.au are

You might have noticed some strange files in /var/spool/cron/crontabs ending in .au. These are not µ-law audio files, but auxiliary audit files for crontab, which are created when auditing has been enabled and you edit your crontab entry.

# cd /var/spool/cron/crontabs
# ls -l
total 19
-rw-------   1 root     sys         1010 Feb 25 18:04 adm
-r--------   1 root     root        1371 Feb 25 18:06 lp
-rw-------   1 root     martin        38 Jun 21 00:20 martin
-r--------   1 root     martin        45 Jun 21 00:20
-rw-------   1 root     sys         1401 Mar 13 04:28 root
-rw-------   1 root     sys         1128 Feb 25 18:09 sys

Looking closer at what is in my .au file we find the following:

# cat
1dad35c9 0 0 0

This is quite cryptic, especially as it isn't documented anywhere but in the source! Using the source you can discern what the above settings are.

The first number (300) is the audit id, i.e. my user id. The second and third rows are the pre-selection mask, split up in two parts: first the audit-on-success flags and then the audit-on-failure flags. The next three rows are the terminal id, starting with the port, then the address type and last the address. The port number (5f81600) is made up of two parts (major and minor) which are joined together. After that follows the address type (4), which represents IPv4, as defined in audit.h. Note that the address is made up of 4 numbers to fit IPv6 addresses, but since I logged in from a system using IPv4 only the first part is filled in. There is a gotcha here: how the number is written depends on the architecture. The example is from my X2200 M2, so the 1dad35c9 needs to be changed to network byte order to map correctly to an IP address. The last row is the session id (2441309132).
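The byte-order conversion is mechanical once you know the word was written as a native 32-bit integer on a little-endian (x86) host: pack the value back out little-endian and the bytes then sit in memory order, which is network order. A quick illustration (on a dump taken from a SPARC box you would pack big-endian, "N", instead):

```ruby
word = 0x1dad35c9                       # the address word from the .au file
octets = [word].pack("V").unpack("C4")  # "V" = 32-bit little-endian
ip = octets.join(".")                   # => "201.53.173.29"
```

That dotted quad is the terminal address the audit system recorded for the session.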

This file is created (and updated) when you edit your crontab, which can cause a lot of confusion. The pre-selection mask used by cron is calculated by logically ORing the entry in the .au file with the user's entry from audit_user and the global flags in audit_control. So if you reduce the auditing for a particular user in audit_user, you would expect the audit trail from the user's cron jobs to change too, but if the .au file has already been created the pre-selection masks are frozen.
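Spelled out with bit masks, it is easy to see why reducing audit_user alone does not help - the frozen copy in the .au file gets ORed back in (all the flag values below are made up for illustration):

```ruby
au_file_mask    = 0x1000   # frozen into the .au file when the crontab was edited
audit_user_mask = 0x0000   # the flags after you reduced them in audit_user
audit_control   = 0x0400   # global flags

# cron's effective pre-selection mask is the OR of all three, so the
# 0x1000 class you tried to turn off is still selected
effective = au_file_mask | audit_user_mask | audit_control
```

Whatever you clear in audit_user, the stale .au bits win until that file is rewritten.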

To fix this you need to update the .au file too when you change the audit flags or edit the crontab so that the .au file gets rewritten.


Friday Aug 03, 2007

Bye, bye Java - welcome back Solaris!

As of this week I am no longer working in the Java SE Security Team. I got tired of fixing other people's mistakes - I have a hard enough time fixing my own! It was a great place to work, with lots of clever, nice and cool people, but my job wasn't all that I thought it would be...

So what do I do now? I'm the security geek for .Sun engineering (DSE - Dot Sun Engineering), the group who run and operate most of Sun's external systems - like this one.

My main focus is Solaris security, but I will also help the development teams review their application security. I'm responsible for the security of the Solaris servers which run all the external sites - so that'll keep me busy :)

This means that I now have legitimate reasons to play with Solaris Auditing again, so I've already started to help Tomas alpha test the Remote Audit Trail Storage project. DSE will be a good test bed to see how well the project scales.

