Friday Nov 12, 2004

Run, IPC::Run

I do a lot of work that involves automation -- both product and software test. This week, my task was to automate some testing of Perl code which drives the Solaris 10 fault management log viewer fmdump. The fmdump utility allows administrators to view system errors and faults with varying levels of verbosity. In order to provide our manufacturing test process with information about faults that occurred during test, tools are required which can parse the varying information output by fmdump, and provide feedback to the test supervisor. Now, once you've created a tool which relies on a system utility, how do you test it in environments where that utility may not exist? Most often, this is done through the use of program stubs -- a program which behaves like the original, but is really just faking it.

Faking it in this case means that for any set of arguments, the stub program should return the same output as the original utility would. In our program stub terminology, this involves playing back a control file, which tells the stub what to output, when to output it, and what exit status to return at the end. I wanted my stubs for fmdump to be able to call on the real program in cases where an existing stub did not exist, and create a control file. Then the next time that same argument combination was passed to the stub, the generated control file would be replayed. While thinking about this problem, I happened across an extremely interesting Perl module called IPC::Run.
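
To make the playback idea concrete, here is a minimal sketch of the replay side of such a stub. The event layout (delay, stream, payload) and the fmdump output shown are made up for illustration; a real control file would be generated by recording the actual utility:

```perl
use strict;
use warnings;
use Time::HiRes qw/sleep/;

# A hypothetical control file, already parsed: each event is an
# inter-line delay in seconds, a stream name, and a payload.
my @events = (
    [ 0.0000, 'stdout', "TIME                 UUID\n" ],
    [ 0.0010, 'stdout', "Nov 12 09:00:00.1234 (example line)\n" ],
    [ 0.0000, 'status', 0 ],
);

# Replay the events: wait out each recorded delay, emit the line
# on the appropriate stream, and return the recorded exit status.
sub replay {
    my @events = @_;
    my $status = 0;
    for my $ev (@events) {
        my ($delay, $stream, $payload) = @$ev;
        sleep($delay) if $delay > 0;
        if    ($stream eq 'stdout') { print STDOUT $payload }
        elsif ($stream eq 'stderr') { print STDERR $payload }
        elsif ($stream eq 'status') { $status = $payload }
    }
    return $status;
}

my $status = replay(@events);
```

A real stub would finish with `exit $status;` so that callers see the same exit code the original utility produced.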

IPC::Run is a module which makes calling, managing, and collecting output from subprocesses easy. It is an extremely flexible module, which allows programmers to hook in at an API level that makes sense for their application. At its simplest, the function run() behaves like system(), but inverts the result so the code makes more sense:

# Old way, using system
system("diff fileA fileB") and die "Failed to run diff!";

# Newer way, using IPC::Run::run
run("diff fileA fileB") or die "Failed to run diff!";

The run() function is probably as far down the IPC::Run API foodchain as most people need to go, as it is flexible enough to serve most tasks (as we will see shortly). For instance, if I wanted to collect the standard output of the above process in a scalar, I'd change the code to this:

# Create a scalar to store the output
my $out;

# Collect that output
run(['diff', 'fileA', 'fileB'], '>', \$out)
    or die "Failed to collect diff output!";

Easy as that. Want the standard error too?

# Create a scalar to store the output
my ($err, $out);

# Collect that output
run(['diff', 'fileA', 'fileB'], '>', \$out, '2>', \$err)
    or die "Failed to collect diff output!";

There are a large number of redirections that can be implemented using run, including standard input, output, error and more. Also, instead of scalars for data sources and destinations, you can specify filehandles and even subroutines. The latter is what I needed for my control file generation. I wanted to be notified each time a line of output was created on either the standard output or error, and needed to record the time a line came in, with reasonable precision. To do this, I used the following call to run:

# Get hi-res versions of sleep and time
use Time::HiRes qw/sleep time/;
use IPC::Run qw/run new_chunker timeout/;

# Command to execute
my @cmd = qw/fmdump -V/;

# Array to collect 'events'
my @events;

# First, record the command and args
push @events, [ time, 'cmdline', join(" ", @cmd) ];

# Create control events
eval {
    run \@cmd,
        '>',  new_chunker, sub { push @events, [ time, 'stdout', shift ] },
        '2>', new_chunker, sub { push @events, [ time, 'stderr', shift ] },
        timeout( 5 );

    # Record the exit status once the process finishes
    push @events, [ time, 'status', $? >> 8 ];
};
die if $@;

This snippet of code runs the 'fmdump' argument combination I requested, and records information about how the command behaves. The new_chunker function in the filter chain of the run() command for both streams is a filter provided by IPC::Run. This filter brings in input from the process and splits it into lines (by default using the "\n" separator). Each line is then passed on to the anonymous sub, which records the stream source, time, and line in the event array. When the process exits, the routine then records the exit status in the event array. Note the use of the timeout() function. This tells run() how many seconds to allow the task to execute before giving up and throwing an exception. That exception is the main reason for invoking run() from within an eval block. This prevents misbehaving argument combinations from hanging forever. For instance, if someone specified fmdump -f, fmdump attempts to 'file tail' the fault log, and won't exit until it receives a signal. Using a timeout ensures that this will only last for 5 seconds, and will not generate an incomplete control file.

Now that we have the full details about what happened during the process run, it's time to do a bit of housekeeping. The first step is to go through the event list and calculate the line to line delay based on the timestamps. Times are then floor()d to a reasonable precision, rather than keeping the full Perl floating point precision. For this particular application, the precision is 4 digits (0.0001s or 100µs). The event list is then written to a control file on disk, encoded so that the same combination of program and arguments will generate the same control file name. All that is left to do now is to run the Perl module test suite, pointing it at our stubbed fmdump rather than the real one. This generates all of the control files necessary to run that suite of tests. I then can check these control files into revision control as part of the test suite. It is then possible to run the module test suite in any environment.
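
The delay calculation and truncation step described above might look something like this. This is a sketch, assuming the [time, type, payload] event layout built earlier; the timestamps are made up:

```perl
use strict;
use warnings;

# Events as recorded: absolute timestamp, type, payload.
my @events = (
    [ 1100275200.000000, 'cmdline', 'fmdump -V' ],
    [ 1100275200.123456, 'stdout',  "line one\n" ],
    [ 1100275200.234567, 'stdout',  "line two\n" ],
    [ 1100275200.250000, 'status',  0 ],
);

# Convert absolute timestamps into line-to-line delays, floor()d
# to 4 decimal places (0.0001s, or 100us, resolution).
my $prev = $events[0][0];
for my $ev (@events) {
    my $delay = int(($ev->[0] - $prev) * 10_000) / 10_000;
    $prev = $ev->[0];
    $ev->[0] = $delay;
}
```

After this pass, each event carries the delay since the previous event rather than a wall-clock time, which is exactly what the playback side needs.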

Wednesday Sep 29, 2004

RSS/Atom Auto-discovery

My last article on bookmark publishing was picked up on Dave Johnson's roller blog today, with some interesting ideas for enhancement. One idea I found interesting was RSS auto-discovery. A quick search on Google showed that a few people have expressed ideas about how to use HTML <link> elements to discover alternate content types for a particular page. As well, Dave suggested that I could output the bookmark hierarchy using Outline Processing Markup Language [OPML], and import the resulting document into Roller (once Roller is updated to 0.9.9, that is).

As there is only so much I can do between conference calls, writing requirements documents, and planning out my yearly goals, I will focus on idea the first. OPML will have to wait for another article, but if you have extra time, feel free to read ahead.

You might be familiar with the <link> tag. It allows you to specify cascading stylesheets, "favorite" icons, etc. It is also possible to specify additional content types for your page. If you click View -> Page Source on this page, for instance, you would see that I define an alternate content type of application/rss+xml. The path given in the link for this document is relative to the server, but it can also be absolute, to pull alternate content from a different server.

There are several different content types that I am interested in for the bookmark publisher script. While RSS is quite common, other formats such as Atom and RDF are popular. I want to ensure that among all of the favorite icons and style sheets, I get only the links to alternate content types. Each link tag has a type attribute, which contains the MIME type of the alternate link. As I search through the link entries, I'll check each one to see that it matches one of my desired content types. As I intend to roll this function back into the bookmark publisher script when I'm done, I'll write the functionality as a subroutine that takes a URI object and returns a hash reference. The hash reference will be keyed by MIME type, and will contain a URI for each verified content type. The function starts out by listing the acceptable content types:

use strict;
use LWP::UserAgent;
use URI;

my $ua = new LWP::UserAgent( env_proxy => 1 );

sub autodiscover {
    my $uri = shift;
    my $map;

    my %ALT_TYPES = map { $_ => 1 } qw(
        application/rss+xml
        application/atom+xml
        application/rdf+xml
    );

In contrast to the accept_bookmark subroutine I described yesterday, this routine will require LWP::UserAgent, which is the full featured object which underlies LWP::Simple. A UserAgent enables far greater control over the request and the response, which makes it ideal for this task. The above code creates a global user agent, which should be used by all sections of the final bookmark publisher script for grabbing files and checking URLs. The autodiscover function then grabs a URI to check, defines the $map of media types to URIs for return, and a list of alternate media types that we want to auto-discover. The map function maps all of the listed content types into the hash with '1' as the value. This makes checking for acceptable media types easier.

Now that we have poured the foundation, it is time to check the passed URI to see if it's pointing at anything interesting:

    my $rsp = $ua->get( $uri );
    return {} unless $rsp->is_success;

    # Record the link content type
    my $headers = $rsp->headers;
    my $ctypes  = ref($headers->{'content-type'}) eq 'ARRAY'
        ? $headers->{'content-type'} : [ $headers->{'content-type'} ];

    my ($ctype) = split /;/, $ctypes->[0];
    $map->{$ctype} = $uri;
    $map->{default} = $ctype;

In the accept subroutine discussed yesterday, we used the head method of LWP::Simple to check if a link is valid. In order to get a handle on the embedded link tags in the target HTML document, we need to use the get method instead. Unfortunately, the method provided by LWP::Simple does not return enough information to enable these links to be processed. The get() method of LWP::UserAgent returns an HTTP::Response object, whose headers() method returns the HTTP::Headers object we are most interested in.

After requesting the target link, we can check to see if the HTTP response code was success, and return an empty hash reference if it's not. This should indicate to the caller that no valid media types were found for the specified URI. Next, we read the response headers, and determine the content type of the returned document (there's no sense guessing). If there is only one content type, the content-type header contains a scalar value, but if more than one type is present (e.g. if there were different content encodings available), the field contains an ARRAY reference. We deal with this by detecting the field type, and forcing the scalar into an array reference. We can then split the content type from any additional information such as encodings or weightings. Type weightings are expressed in the form of q=x where 0 < x <= 1 which indicates preference when multiple types are available. For simplicity, this example discards weighting, but it might be useful in the future. Once we've separated the MIME type out of the header string, we assign the original URI to that content type, and indicate that this content type is the default.
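
Here is a hedged sketch of that header splitting, including the q-weight handling the article sets aside for later. The header values are made up, and parse_ctype is a hypothetical helper name:

```perl
use strict;
use warnings;

# Split a Content-Type style header value into the bare MIME type
# and a hash of parameters (charset, q-weight, and so on).
sub parse_ctype {
    my $header = shift;
    my ($type, @params) = split /\s*;\s*/, $header;
    my %param = map { split /=/, $_, 2 } @params;
    return ($type, \%param);
}

my ($type, $param) = parse_ctype('text/html; charset=utf-8; q=0.9');
```

The first field before any ';' is the MIME type itself; everything after it is kept around in case the weighting becomes useful in the final application.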

Now, I often save links to documents that are not HTML, like PDF documents and images. Obviously, these documents can't contain any link references, as they are not HTML, so we must skip auto-discovery for any content types that are not text/html. This should likely not be limited to just text/html, however, as other valid types like text/xhtml or even text/xml might contain useful information. We can worry about this in the final application -- this script is just for discovery, so it's OK to drop the ball here. Link tags, if they are available in the target document, come in the header field link. Here is our link extraction code:

    # Don't autodiscover for non-html links (e.g. pdf, images)
    if ($ctype eq 'text/html') {
        my $links = ref($headers->{link}) eq 'ARRAY'
            ? $headers->{link} : [ $headers->{link} ];

        foreach my $link (@$links) {
            my ($href, $type) = $link =~ /<(.+?)>;.+; type="(.+?)"$/;
            next unless $href and $type;

            if ($ALT_TYPES{$type}) {

                my $nuri;
                if ($href =~ m#^\w+://#) {
                    $nuri = new URI($href);
                }
                else {
                    $nuri = new URI($uri);
                    $nuri->path( $href );
                }

                # Check that the feed actually exists...
                $rsp = $ua->head( $nuri );
                $map->{$type} = $nuri if $rsp->is_success;
            }
        }
    }

    return $map;
}

The first couple of lines should be familiar. The link field works the same as the content field, expressed as a scalar value if there is only one link, or an ARRAY reference if there are more than one. Again, we force the scalar into an array reference. Next, we iterate through each of the links in the document. For each link entry, we extract the href attribute value and the MIME type. If we can't find both, then this is not a proper link entry, so we skip on to the next item. If we do properly extract both fields, we can then check the type against our predefined map of content types. If the content type is in the alternate type map, we construct a new URI value for the type, and check that it exists. Note that we first check to see if the link specified an absolute URL (e.g. one with a scheme). If it did not, the link is considered relative to the original site, so we just replace the path segment in the original URI. To be pedantic, the next step is to check that the listed feed actually responds. If it does, the type and source URI are inserted into the map, and returned to the caller.
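
The absolute-versus-relative decision boils down to checking for a URL scheme. A minimal sketch of that logic, without the URI module (the host names and paths are invented, and resolve_href is a hypothetical helper):

```perl
use strict;
use warnings;

# Resolve a link href against the base URL of the page it came
# from: absolute hrefs (those with a scheme) pass through as-is,
# relative ones replace the path portion of the base.
sub resolve_href {
    my ($base, $href) = @_;
    return $href if $href =~ m{^\w+://};      # already absolute
    my ($site) = $base =~ m{^(\w+://[^/]+)};  # scheme plus host
    $href = "/$href" unless $href =~ m{^/};
    return $site . $href;
}

my $abs = resolve_href('http://example.com/blog/',
                       'http://other.example.com/feed.xml');
my $rel = resolve_href('http://example.com/blog/', '/feeds/rss.xml');
```

In the real function, URI does this work: constructing a new URI from the base and calling path() replaces just the path segment, which is the same substitution shown here.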

When I point my autodiscover function at this blog, I get the following structure back (printed out here with Data::Dumper):

bash$ ./ 
$VAR1 = {
          'default' => 'text/html',
          'application/rss+xml' => bless( do{\(my $o = '')}, 'URI::http' ),
          'text/html' => bless( do{\(my $o = '')}, 'URI::http' )
        };
This is what I was expecting -- a default type of text/html, because the main page existed, and an alternate type of application/rss+xml which points to the RSS feed for my site. If the main page did not exist, I'd get back an empty structure, and I'd know to skip the site entirely. This might not be the desired sequence of events, however. It might be desirable to attempt auto-discovery even in the case of a broken main link. We'll see how things turn out when I integrate this function back into the main bookmark publisher script.

Tuesday Sep 28, 2004

Publishing Netscape Bookmarks

For quite some time, I've wanted to publish my large list of bookmarks on the web. The primary reason is to give me a map of information to use when I am not near a browser with my bookmarks. Another reason is to let others benefit from the time I've spent gathering and organizing these links. I could just upload my Netscape bookmarks.html file, as it is just HTML, but there are issues.

Issue the first is that I have Sun internal links sprinkled liberally through my bookmark folder. I could copy my bookmarks.html aside, then manually remove the internal links, but the current version contains 325 bookmarks, and I am impatient. I also don't want to go through this exercise every time I update, add, or delete a bookmark. Issue the second is that some of my links are old and outdated -- documents have moved (or decomposed). I need to identify links that are broken, and either deal with them, or remove them from the bookmark file altogether. Issue the last is that there are some links in my folders which I do not want published to the world. Sure, I trust you to keep where I bank a secret, but not that guy in the office next to you -- he's kinda shady.

So what to do when faced with a big text processing task like this? Whip out Perl, of course! There are several approaches here, as with every task Perl is involved with. My tack starts with a utility called HTML tidy. This utility will take the not very well formed Netscape bookmark HTML and give me well formed XML. In Perl, I prefilter the bookmarks like this:

use strict;
use File::Temp qw/tempfile/;

# Filter STDIN through tidy
my $temp = new File::Temp( UNLINK => 1 );
open TIDY, "| tidy -quiet -asxml 2>/dev/null 1>$temp"
    or die "Failed to open pipe to 'tidy': $!";
print TIDY $_ while (<>);
close TIDY;

Now the file named by $temp contains well-formed XML. Note that the temporary file uses UNLINK => 1. This will cause the tidy formatted XML file to be cleaned up when the program exits or the $temp variable goes out of scope, whichever comes first. Now that I have well formed XML, I can search through the bookmarks programmatically. My weapon of choice for tasks like this is the fine XML::XPath module set written by Matt Sergeant. To begin with, I need to identify the root of the personal toolbar folder:

use XML::XPath;

my $xp = new XML::XPath( filename => $temp );
my $root = $xp->find( '/html/body' );

Bookmarks in the file are all organized as HTML definition lists (DL/DD/DT). The very top of the document is inside of the body tag. All of the nodes within the body are now contained in the $root variable. The general pattern for folders and bookmarks within the file is as follows:

   H3 -> Folder Title
     A -> Bookmark
     A -> Bookmark
     H3 -> Sub-folder Title

This structure can be arbitrarily deep, so the script must be able to handle any level of nesting. The best way is to process the file using a recursive function:

sub collect_bookmarks {
    my ($ctx, $href) = @_;

    # For each folder root...
    my $f_result = $ctx->find( './dl/dd' );
    foreach my $f_node ($f_result->get_nodelist) {

        # Grab the folder title, skip if no name (separators)
        (my $f_name = $f_node->find( './h3' )) =~ s/^\s*|\s*$//g;
        next unless $f_name;

        # Within this folder, search for bookmark entries
        my $a_result = $f_node->find( './dl/dt/a' );
        foreach my $a_node ($a_result->get_nodelist) {

            # Retrieve and normalize the URL and bookmark title
            my $link = $a_node->getAttribute('href');
            my $title = $a_node->string_value();
            $link =~ s/^\s*|\s*$//g;  $link =~ s/\n//g;
            $title =~ s/^\s*|\s*$//g; $title =~ s/\n//g;

            # Store the bookmark unless it's bad
            if (accept_bookmark($title, $link)) {
                $href->{$f_name}{$title} = $link;
            }
            else {
                print "Skipping bookmark: $link\n";
            }
        }

        # Recursive call to process subfolders of this node
        collect_bookmarks($f_node, \%{$href->{$f_name}});
    }
}
We call this with the root folder as:

collect_bookmarks($root, \%BOOKMARKS);

This will identify and recursively process any subfolders that are present in the bookmark file structure. Each iteration passes in a hash reference, which that call will populate with a folder name and one or more bookmarks. Note the call to accept_bookmark(), which makes the final decision on whether a bookmark is good or bad. Bad bookmarks are defined by my own criteria, which filters out broken, invalid and internal links, as well as those that I don't want to publish. The function looks like this:

sub accept_bookmark {
    my ($title, $link) = @_;

    # Parse the link URL
    my $luri = new URI($link);

    # These things are all bad.
    return 0 if $luri->scheme !~ /http|ftp/
             or $title =~ $PRIVATE
             or $link =~ $PRIVATE
             or $luri->host =~ /\.(corp|ebay|sfbay|west|central|uk$)/
             or index($luri->host, '.') == -1
             or not head($luri);

    return 1;
}
This uses URI to filter out any file:// links that might be hanging around, and the head() function from LWP::Simple to check links. The $PRIVATE variable is a compiled regular expression (using qr{}) which contains title and link patterns that I don't want to publish.
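
A $PRIVATE pattern built with qr{} might look like the following. The patterns shown are placeholders for illustration, not the author's actual list:

```perl
use strict;
use warnings;

# Compile the private-link patterns once; qr{} produces a Regexp
# object that can be matched directly with =~ later on.
my $PRIVATE = qr{
    mybank        |   # banking links
    intranet      |   # internal sites
    top\s+secret      # anything explicitly marked private
}xi;

my $hit  = 'https://www.mybank.example/login' =~ $PRIVATE;
my $miss = 'http://www.perl.org/' =~ $PRIVATE;
```

Because the pattern is compiled once with qr{}, accept_bookmark() can match against it repeatedly without recompiling on every call.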

Now that I can browse all of the information in the bookmarks, the next step is to write them out in some browsable format. I'm thinking DHTML collapsible lists, but I could just as easily print out simple HTML. I haven't decided yet, but I'll publish the rest of the script in a followup post once it is complete.

PS: Syntax highlighting above was done using Vim 6.3 and the code2html.vim script by Soren Anderson.

Monday Sep 27, 2004

Take a SWIG

Before I begin, it occurs to me that my posts might be just a bit too long. Maybe I should split stuff over a number of posts (or even days), but when I get on a roll, I can type forever (frequency of posting shows my average motivation level ;). Maybe I'll change the format, but probably not... I'm unpredictable like that.

SWIG, the Simplified Wrapper and Interface Generator, is a great way to wrap C/C++ libraries for use with scripting languages. It supports the favorites -- Java, Perl, Tcl, Python, and PHP -- as well as some newer and perhaps more obscure languages such as Ruby, Guile, MzScheme, OCaml, and Chicken (huh?). For most standard calling conventions, it is possible to point swig at the header files for a library, and it will spit out a C interface file to that library. I say most standard calling conventions here, because as I've found out over the last couple of weeks, some things are hard to do in a language agnostic fashion.

For instance, consider the case of the Solaris libnvpair(3LIB) interface. In my project, I want to be able to wrap a function from another library which returns a pointer to an nvlist. I don't want to wrap the entire libnvpair interface in all of its glory just to be able to read a nested nvlist. I'm not interested in adding to or modifying the nvlist that is returned to my code, so I will turn the nvlist into a structure which is native to whatever scripting language I'm using. As I mentioned in my last article regarding extension and embedding, this boils down to Perl, and then maybe Python and Java (adoption of Perl for system admin tasks is high, less so for Python; Java is not widely used, but could be).

For Perl, it seems most natural to have the nvlist copied out into a hash reference, which the Perl code would then be able to walk through to make decisions about the returned data. The code to do this is quite interesting, and needs to be added on as inline code in the SWIG interface file. For instance, to create a simple hash reference, here is the Perl API code:

  HV *hash = (HV *)sv_2mortal((SV *)newHV());
  hv_store(hash, "mykey", 5, newSVpv("my value", 0));

This is similar to the following Perl code:

  my $href = {};
  $href->{mykey} = "my value";

Now, for the real application, C code will be required to traverse the nvlist and copy keys and values to an HV. Obviously, the above code is very Perl specific. To do the same thing in Python, I need to traverse the nvlist and create a Python dictionary object, for Java a java.util.HashMap or similar. Each will require a bit of VM magic code to make the translation. Luckily, SWIG is there to save the day, and makes it quite easy to #ifdef each language specific section, as it gives the interface file a once over with the C pre-processor before creating the wrapper. Then I just have to bind each wrapper with the corresponding language headers and libraries, and I get a nice loadable module.

My main concern, as C is really not a core competency in my department, is to keep the interface as clean and understandable as possible. I also want to have the flexibility to experiment and explore with other languages. SWIG gives me the best of both worlds.

Thursday Sep 23, 2004

Extending and Embedding Perl

In my experience researching complex topics, such as extending and embedding Perl, it is inevitable that the range of examples available will be large. Unfortunately, the examples that I have found regarding extending Perl fall into two categories. The first category of examples are far too simplistic. These range from the ubiquitous 'hello world' example written in XS, to tutorial examples on how SWIG might wrap the libgd graphics library. These are all very straightforward examples, and should suffice for most real world tasks. Not mine, however.

The other category of example code is mind blowingly complex. This camp includes most well known examples of Perl extension -- I've searched through the multi-language SWIG bindings for Subversion, and the Perl specific bindings to Tk. Both are stunning examples of how to do extension correctly. Unfortunately, there is a huge overhead to wrapping your mind around projects of this scale. Multi-module, multi-library projects are far far too complex for what I need to do.

I was really unable to find a happy medium -- I need to expose a number of C structures as Perl objects, reserving the option to expose them to other languages (e.g. Python, Java, Guile). For this, some of the aspects of the Subversion bindings are easy to understand. However, I also need to be able to pass callbacks to C from Perl code references to allow the C libraries to pass information to Perl. In Subversion, callbacks are provided by a language specific runtime library which implements a thunk editor. The Subversion thunk editor manages manipulation of the stack such that calling back and forth between Perl and C works according to the Perl stack protocol. Just hearing the word 'thunk' makes me shudder, with horrible visions of Win32 dancing in my head. At any rate, the Subversion method seemed to me to be just a bit more abstraction than I wanted to deal with for my (relatively) simple case.

As I was dreaming about how to get my extension project off the ground, I found a massively useful book at my local Barnes & Noble book store: Extending and Embedding Perl, by Tim Jenness and Simon Cozens. In my opinion, this book is the most complete and coherent source of Perl extension and embedding knowledge available. It's not terse like the documentation which comes with Perl -- although that continues to be an authoritative source of this info. Instead, it's a well written, easy to read coverage of all the details needed to get the job done. It includes fully documented code samples -- which cover a broad range of projects. Want to find out how to create a new Perl scalar value in C and pass it back to Perl? There is simple example code to demonstrate the technique, along with detailed breakdowns of the output of tools such as Devel::Peek.

Suffice it to say that I love this book. I've nearly read it front to back in about a week, which says a lot considering the dry nature of the subject material. However, I'm motivated to learn, as I want to complete my extension project. Over the coming days, I'll outline where I am with the project, and offer up a description of how to solve extension and embedding problems that aren't covered by 99% of available examples. Stay tuned.
