Tuesday Oct 07, 2008

Analyzing that old perl script

Guess I have to understand that script to rewrite it in Python. :->

First, gethead.pl reads through the file until it finds a line that starts with a '!'. It then creates a list of names of the form '$' plus the column name:

        $format = '$' . join(', $', split(/,/, $first_line));
        print $format . "\n";

Yields:

> ./r.pl r.txt
$started, $ended, $title, $company, $description

The magic really occurs in the main processing loop:

do main'read_txtfile_format(*LNG_FILE, *languages);

lang: while (<LNG_FILE>) {
        next lang if (/^#/ || /^!/);
        eval "($languages) = split(/[,\n]/)";

        print "$started - $ended: $title for $company\n\t$description\n\n";
}

The first line sets up '$languages' with the 'variable' names. Each time through the while loop, we call the eval to associate the columns with those variable names.
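For example, with the columns shown above, the string handed to eval compiles to the equivalent of:

        # what eval sees once $languages has been interpolated
        ($started, $ended, $title, $company, $description) = split(/[,\n]/);

Each data line gets split on commas and the pieces land in those scalars, ready for the print.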


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Monday Oct 06, 2008

An old perl script

I've got an old perl script that I have gotten a lot of mileage from:

package read_txtfile_format;

sub main'read_txtfile_format {
        local(*file, *format) = @_;
        local($first_line, $first_char) = '';

        do {
                $first_line = <file>;
                $first_line =~ /(.)(.*)/;

                $first_char = $1;
                $first_line = $2;
        } until ($first_char eq "!" || eof(file));

        if (eof(file)) {
                die "There is no ! header line in $file";
        }

        $format = '$' . join(', $', split(/,/, $first_line));
}

I didn't write it; I think either Mark Lawrence or Walt Gaber did while I was at DRD Corporation. Or they got it somewhere. I know they called it a data dictionary - which isn't how I would use that term these days. By the way, I still have and use my very beaten-up copy of Programming Perl from back then - a 1991 printing.

What it does is let you read in another file, generate variable names from a line starting with a '!', and then use those names for each data line. It is a cheap database laid out in a flat file.

I think we called this basic script and its associated text files data dictionaries because we would use it to quickly prototype and change data structures in C. I know I used it in my Genetic Programming research to describe the operators used in a new problem set.

Perhaps an example will show the power.

Resume example

I want to quickly take my resume and reformat it as needed. Perhaps I need it in HTML, as a plain text file, etc.

I can keep my data in a file, I can have a skeleton script to process it, and I can quickly change it to adapt to new styles...

I'm picking an example which looks like I'm pimping myself out because I thought it was quirky and fun to code. It is also a way I never would have thought to do an example with this piece of code.

Data file

!started,ended,title,company,description
1/05,present,Staff Engineer Software,Sun Microsystems,NFS development
6/01,12/05,File System Engineer,Network Appliance,WAFL and NFS development
4/01,6/01,Manager,Network Appliance,Manager of Engineering Internal Test
10/99,4/01,System Administrator,Network Appliance,Perl hacker and filer administrator

Perl script

#! /usr/bin/perl

do 'gethead.pl';

open(LNG_FILE, $ARGV[0]) || die "Can't open LNG_FILE: $!\n";

# Determine the Column Names
do main'read_txtfile_format(*LNG_FILE, *languages);

lang: while (<LNG_FILE>) {
        next lang if (/^#/ || /^!/);
        eval "($languages) = split(/[,\n]/)";

        print "$started - $ended: $title for $company\n\t$description\n\n";
}

Results

And here I have it dumping out a format much like the resume.txt file I have been updating as I change job functions:

> ./resume.pl resume.txt
1/05 - present: Staff Engineer Software for Sun Microsystems
        NFS development

6/01 - 12/05: File System Engineer for Network Appliance
        WAFL and NFS development

4/01 - 6/01: Manager for Network Appliance
        Manager of Engineering Internal Test

10/99 - 4/01: System Administrator for Network Appliance
        Perl hacker and filer administrator

But wait, there is an easier way

#! /usr/bin/perl

open(LNG_FILE, $ARGV[0]) || die "Can't open LNG_FILE: $!\n";

lang: while (<LNG_FILE>) {
        next lang if (/^#/ || /^!/);
        ($started, $ended, $title, $company, $description) = split(/[,\n]/);

        print "$started - $ended: $title for $company\n\t$description\n\n";
}

It does the same thing, with less code as well.

But it isn't as dynamic. I have to edit both the data file and the script to make a change. If I were to add a new field, location, after company, I would have to change the split in the script. Also, what if I have many scripts manipulating the same data? During my research, I had two different data files and six different scripts per problem set.
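For instance (a purely hypothetical change), suppose resume.txt grew a location column after company:

        # hypothetical extra column, for illustration only
        !started,ended,title,company,location,description
        1/05,present,Staff Engineer Software,Sun Microsystems,Anytown,NFS development

The data dictionary version sets up $location the next time the header is read and the print statement keeps working unchanged, while the hard-coded split above would quietly stuff 'Anytown' into $description until every script that reads the file has its variable list updated.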

Research examples

For clique detection, 5-6 lines of data dictionary entries resulted in about 400 lines of C code. For predator/prey, about 50 lines of data dictionary entries resulted in about 890 lines of C code.

An example set of data dictionary entries for the predator/prey would be:

!fId:fSymbol:fType:fArity:fMacro:fDifGen:fChild1:fChild2:fChild3:fChild4:fChild5:fDescription
Agent:Ag:Agent:1:True:False:Agent:NO_CHILD:NO_CHILD:NO_CHILD:NO_CHILD:Returns A's predatorId.
And:&&:Boolean:2:True:False:Boolean:Boolean:NO_CHILD:NO_CHILD:NO_CHILD:A AND B.
CellOf:CellOf:Cell:2:False:False:Agent:Tack:NO_CHILD:NO_CHILD:NO_CHILD:The (X,Y) coordinate of agent A if it moves from its current cell to the one in the Tack B.

(Note I've changed the separator to a ':' for clarity.)
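Presumably the split patterns in the scripts change to match; something along the lines of:

        # in gethead.pl, split the header line on ':' instead of ','
        $format = '$' . join(', $', split(/:/, $first_line));

        # and in the processing loop
        eval "($languages) = split(/[:\n]/)";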

Part of the processing script would be:

        # Put in all Caps
        ( $capId = 'GP_L' . $LBranch . '_F_' . $fId ) =~ tr/a-z/A-Z/;
...
        print INI_FP '    /*';
        print INI_FP '     * ' . $fDescription;
        print INI_FP '     */';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].iId = ' . $capId . ';';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].psSymbol = "' . $fSymbol . '";';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].ftType = ' . $capType . ';';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].arity = ' . $fArity . ';';
...

And some resulting code would be:

    /*
     * Branch 0 - Main language for the system
     */
    /*
     * Branch 0 - Functions
     */
    pgps->als[0].afs = (FunctionsStruct *)calloc( GP_L0_MAX_FUNCTIONS,
                                      GP_FUNCTIONS_SIZE );
    if ( !pgps->als[0].afs ) {
        GU_logError( stderr, "%s(%d): Out of Memory!\n",
                     __FILE__, __LINE__ );
        GU_exit ( -1 );
    }

    /*
     * Returns A's predatorId.
     */
    pgps->als[0].afs[i].iId = GP_L0_F_AGENT;
    pgps->als[0].afs[i].psSymbol = "Ag";
    pgps->als[0].afs[i].ftType = FT_L0_E_AGENT;
    pgps->als[0].afs[i].arity = 1;
    pgps->als[0].afs[i].bMacro = TRUE;
    pgps->als[0].afs[i].bActive = TRUE;
    pgps->als[0].afs[i].bDifGeneric = FALSE;

    pgps->als[0].afs[i].pftChildren = (FunctionTypes *)calloc( pgps->als[0].afs[i].arity, FT_TYPES_SIZE );
    if ( !pgps->als[0].afs[i].pftChildren ) {
        GU_logError( stderr, "%s(%d): Out of Memory!\n",
                     __FILE__, __LINE__ );
        GU_exit ( -1 );
    }

    pgps->als[0].afs[i].pftChildren[0] = FT_L0_E_AGENT;

    pgps->als[0].afs[i].pFct = gpf_L0_Agent;
    i++;

    /*
     * A AND B.
     */
    pgps->als[0].afs[i].iId = GP_L0_F_AND;
    pgps->als[0].afs[i].psSymbol = "&&";

By the way, the same script f_types.pl would process all of the language data dictionaries without being modified. If I happened to change the underlying data structures in the C code, i.e., FunctionsStruct, I could change that one script and rebuild all of the different languages.

So where's the Beef?

Why the walk down memory lane?

Well, I still use this script. I've used it to do volunteer scheduling at AAAI, generate Java opcodes for a simple JVM implementation, plan a new company, check for sibling conflicts during a recreational soccer season, implement testbeds for QA efforts for multiple companies, etc. I don't have to have a database on my system. I can suck data out of a database on a Windows box, store it in a CSV datafile on OpenSolaris, and play with the data. I don't have to know SQL and/or care too much about the data. I can generate "reports" and such from the CLI.

And it is the power of Perl (well, the eval() it offers) which lets me get away with this. One of the selling points of Perl for me was rapid prototyping, especially with respect to strings. I could have written C programs to do all of this, but why?

If I'm going to learn Python, I need to be able to replace this piece of functionality. Or else I'll be back with Perl before you know it.

And honestly, even if I learn to make Python bark for me, I'll pick up the tool I need when I have to. :->

Well off to sleep and I'll pick this up tomorrow when I start playing with Python.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily