Thursday Jan 29, 2009

Be careful of assumptions

I had been working on fixing 6775211: mirror mounts use the zcred; should use caller's cr augmented w/ PRIV_SYS_MOUNT before the Xmas break. I had it down to unit testing and thanks to Nico Williams, I even had some great unit tests. I just ran out of time to work on the bug.

Well it jumped in priority from a P4 to a P2 and was impacting other engineers. So I thought it might be easy to dust off and finish. I was wrong.

The bug was strange for several reasons, foremost was that it involved doing something that never occurred to us during development. The issue is when you have a NFSv4 client which has a user credential but not a machine credential - a configuration I thought was rare, but which is evidently common amongst Kerberos developers.

Under such a configuration, access is denied to a mirrormount (an automatic mounting in NFSv4 when a filesystem boundary is encountered). The underlying issue is that we use the kcred or zcred to always force the mount to occur. But in the case of there being no machine credentials, those creds do not carry any weight on a server. So we need to extend the privileges given to the user.

As an aside, I would consider this a security bug, but I'm coming to realize that while Kerberos may deal with security, the definition of a security bugs seems to be a bug in which an user gains unauthorized access. So a bug which keeps an authorized user from gaining expected access may be evil, but it is not worthy of a security tag.

Perhaps I am wrong and tempting fate. I hope not, I do not want to see the bug slapped with a security tag when I get up in the morning.

Anyway, back to the bug. The unit test is quite simple - create a share hierarchy on the server that allows a client machine to mount the parent with say AUTH_SYS, but requires kerberized access to the child:

zpool create pnfs1
zfs create pnfs1/scratch
zfs set sharenfs=on pnfs1/scratch
zfs create pnfs1/scratch/aaa
zfs set sharenfs=sec=krb5i pnfs1/scratch
zfs create pnfs1/scratch/aaa/bbb
zfs create pnfs1/scratch/dir1
zfs set sharenfs=sec=krb5i pnfs1/scratch/dir1
zfs create pnfs1/scratch/dir1/dir1
zfs create pnfs1/scratch/dir1/dir3
zfs create pnfs1/scratch/dir1/dir4
zfs create pnfs1/scratch/dir2
zfs set sharenfs=sec=krb5i pnfs1/scratch/dir2
zfs create pnfs1/scratch/dir2/dir3
zfs create pnfs1/scratch/dir2/dir4
zfs create pnfs1/scratch/dir3
zfs create pnfs1/scratch/dir3/dir1
zfs create pnfs1/scratch/dir3/dir2
zfs create pnfs1/scratch/dir4
zfs set sharenfs=sec=krb5i pnfs1/scratch/dir4
zfs create pnfs1/scratch/dir4/dir1
zfs create pnfs1/scratch/dir4/dir2
chown -R thud:staff /pnfs1/scratch/aaa
chown -R lover:staff /pnfs1/scratch/dir1
chown -R boy:staff /pnfs1/scratch/dir2
chown -R imoo:staff /pnfs1/scratch/dir4

Now, there are four configurations possible to show both the original problem and the fix:

IdCode baseMachine credsUser creds Notes
A - pnfs-9-23 onnv_105 Yes Yes Normal configuration
B - pnfs-9-24 onnv_105 No Yes Buggy configuration
C - pnfs-9-25 onnv_105 + fixes Yes Yes Should show no regression
D - pnfs-9-26 onnv_105 + fixes No Yes Should show the fix

Now, I made the claim that A and C should yield the same results (or there would be a regression). I also made the claim that D should yield the same results as both A and C (or there was a regression). And all I really cared about B was that is showed the bug.

So here is A in action:

[thud@pnfs-9-23 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-23 scratch]> ls -la
total 10
drwxr-xr-x   8 root     root           8 Jan 28 15:20 .
dr-xr-xr-x   2 root     root           2 Jan 29 00:08 ..
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 aaa
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir1
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir2
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir3
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir4
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 foo
[thud@pnfs-9-23 scratch]> cd aaa
[thud@pnfs-9-23 scratch]> cd aaa
[thud@pnfs-9-23 aaa]> ls -la
total 7
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 .
drwxr-xr-x  10 root     root          10 Jan 29 01:06 ..
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 bbb
[thud@pnfs-9-23 aaa]> cd ..
[thud@pnfs-9-23 scratch]> ls -la 
total 14
drwxr-xr-x  10 root     root          10 Jan 29 01:06 .
dr-xr-xr-x   2 root     root           2 Jan 29 21:18 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 aaa
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir1
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir2
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir3
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir4

This shows that a mirrormount only occurs when the user commits the client to accessing a share on the server. And the reporting of 'root:root' in the attributes is expected from prior testing. It mimics what happens with the automounter (which was a design specification). We also see once we do the mount, that we get the proper attributes showing.

I'm going to cheat (which I'll explain shortly), but here is the bug in action:

[thud@pnfs-9-23 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-24 scratch]> cd aaa
aaa: Permission denied.

And here is what we see with the fix for case D:

[thud@pnfs-9-26 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-26 scratch]> ls -la
total 17
drwxr-xr-x   8 root     root           8 Jan 28 15:20 .
dr-xr-xr-x   2 root     root           2 Jan 28 20:59 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 aaa
drwxr-xr-x   5 lover    staff          6 Jan 27 20:22 dir1
drwxr-xr-x   4 boy      staff          5 Jan 27 20:47 dir2
dr-xr-xr-x   1 root     root           1 Jan 28 20:59 dir3
drwxr-xr-x   4 imoo     staff          4 Jan 27 15:46 dir4
[thud@pnfs-9-26 scratch]> cd aaa
[thud@pnfs-9-26 aaa]> ls -la
total 9
drwxr-xr-x   3 thud     staff          3 Jan 27 23:54 .
drwxr-xr-x   7 root     sys            7 Jan 23 21:24 ..
drwxr-xr-x   2 thud     staff          2 Jan 27 23:54 bbb
[thud@pnfs-9-26 aaa]> cd ..
[thud@pnfs-9-26 scratch]> ls -al
total 17
drwxr-xr-x   7 root     sys            7 Jan 23 21:24 .
dr-xr-xr-x   2 root     root           2 Jan 29 21:32 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 23:54 aaa
drwxr-xr-x   5 lover    staff          5 Jan 27 23:54 dir1
drwxr-xr-x   4 boy      staff          4 Jan 27 23:54 dir2
dr-xr-xr-x   1 root     root           1 Jan 29 21:32 dir3
drwxr-xr-x   4 imoo     staff          4 Jan 27 23:54 dir4

The attributes do not match what we saw with the base case! Now it turns out that mirrormounts are not occuring, but a getattr will not necessarily cause one.

There are some red herrings here, why does 'dir3' not show up? It is related to the fact that a WRONGSEC will not be returned for it. But it turns out that the same calls are going across the wire for the security negotiation.

The big clue, in retrospect, is that no mirrormounting is going on. My new code can not be getting called and hence can not be the root cause.

So what assumptions did I make?

  1. C would be the same as A -- which holds!
  2. D would be the same as A and C -- which does not hold!
  3. B's results do not matter -- which is false, it better show the bug

The issues are that I've confused expected test results (access to the kerberized entries) with there being no other changes. But that is patently false, two things have changed from A to D and one from C to D. I.e., I have to remember that D also has no machine credentials!

Okay, back to where I cheated on showing the results for B. I cheated because when I did my unit testing, I ignored the results from the 'ls'. Time to go back and see what I got (which would have made it obvious to the astute reader what was going on):

[thud@pnfs-9-24 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-24 scratch]> ls -la
total 17
drwxr-xr-x   8 root     root           8 Jan 28 15:20 .
dr-xr-xr-x   2 root     root           2 Jan 28 20:59 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 aaa
drwxr-xr-x   5 lover    staff          6 Jan 27 20:22 dir1
drwxr-xr-x   4 boy      staff          5 Jan 27 20:47 dir2
dr-xr-xr-x   1 root     root           1 Jan 28 20:59 dir3
drwxr-xr-x   4 imoo     staff          4 Jan 27 15:46 dir4

So, I can now prove that D provides the same attribute listings without regards to my fix. I really don't care why there is a difference between attribute reportage when I have machine credentials versus when I do not - I care about it behaving consistently.

I actually found the difference was that for some reason the code detected that without machine credentials the new filesystems supported ACLs. Why?, I didn't get that far once I realized my error in assumptions.

It is easy to step back and realize that I made an over generalization in my testing assumptions. I was right with regards to getting access, but wrong on the unexpected attributes. But you also have to remember we never tested in the case where there were no machine credentials. So the lack of attributes was expected. I think it becomes easy to blame what has changed explicitly (versus hidden changes) because that is the first rule we are taught in debugging.

Why do I suddenly see Hannible Lecter asking Clarice:

"First principles, Clarice. Simplicity. Read Kernighan & Ritchie. Of each particular
bug ask: what is it in itself? What is its nature? What has changed, in this code you seek?"

Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Wednesday Nov 19, 2008

Debugging an array

My kernel is dying trying to pull apart an XDR decoded array. And I've got an array of two items, so how do I view the second one?

[0]> ffffff0010c3a5f0::print DS_REPORTAVAILargs
{
    ds_id = 0x64
    ds_verifier = 0xffffff02efe3b060
    ds_addrs = {
        ds_addrs_len = 0x1
        ds_addrs_val = 0xffffff0307106368
    }
    ds_attrvers = 1 (DS_ATTR_v1)
    ds_storinfo = {
        ds_storinfo_len = 0x2
        ds_storinfo_val = 0xffffff0304545a80
    }
}

We can easily see the first one:

[0]> 0xffffff0304545a80::print ds_storinfo
{
    type = 1 (ZFS)
    ds_storinfo_u = {
        zfs_info = {
            guid_map = {
                ds_guid = {
                    stor_type = 1 (ZFS)
                    ds_guid_u = {
                        zfsguid = {
                            zfsguid_len = 0x10
                            zfsguid_val = 0xffffff02ff6de768
                        }
                    }
                }
                mds_sid_array = {
                    mds_sid_array_len = 0
                    mds_sid_array_val = 0
                }
            }
            attrs = {
                attrs_len = 0x1
                attrs_val = 0xffffff04c8fd6578
            }
        }                             
    }
}

But how do I get to the second one? I.e., how do I tell if it is just a pointer size away or if some special packing is going on?

I can use the array dcmd:

[0]> 0xffffff0304545a80::array ds_storinfo 2
ffffff0304545a80
ffffff0304545ac0

And that shows us:

[0]> ffffff0304545ac0::print  ds_storinfo
{
    type = 1 (ZFS)
    ds_storinfo_u = {
        zfs_info = {
            guid_map = {
                ds_guid = {
                    stor_type = 1 (ZFS)
                    ds_guid_u = {
                        zfsguid = {
                            zfsguid_len = 0x10
                            zfsguid_val = 0xffffff0304def2e0
                        }
                    }
                }
                mds_sid_array = {
                    mds_sid_array_len = 0
                    mds_sid_array_val = 0
                }
            }
            attrs = {
                attrs_len = 0x1
                attrs_val = 0xffffff04bc8f62a0
            }
        }                             
    }
}

Which looks valid and we can quickly test:

[0]> 0xffffff04bc8f62a0::print ds_zfsattr    
{
    attrname = {
        utf8string_len = 0x4
        utf8string_val = 0xffffff0508e88760
    }
    attrvalue = {
        attrvalue_len = 0x15
        attrvalue_val = 0xffffff050853f270
    }
}
[0]> 0xffffff050853f270::dump -w 2
                   \\/ 1 2 3  4 5 6 7  8 9 a b  c d e f  0 1 2 3  4 5 6 7  8 9 a
 b  c d e f  v123456789abcdef0123456789abcdef
ffffff050853f270:  706e6673 2d392d32 343a6461 7461322f 706e6673 32bbddba cefaed
fe 98140000  pnfs-9-24:data2/pnfs2...........

So either it is valid or it just happens to point to what I expect!


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Monday Oct 20, 2008

Another common task for Python

I'm in the midst of debugging a snoop implementation and I wanted to recompile with gcc and use gdb. I saved the output from the make command and basically used vi to put each .o file on a single line:

[th199096@jhereg snoop]> more files.make 
nfs4_xdr.o
snoop.o
snoop_aarp.o
snoop_adsp.o
snoop_aecho.o
snoop_apple.o

Note that I could strip off the '.o's manually, but typically I would leave them there. What I want to do is take the filename and use it twice in command. I.e.

% gcc -g -c -o nfs4_xdr.o nfs4_xdr.c

So I decided to use Python to learn a bit more about it:

[th199096@jhereg snoop]> more tran.py 
#!/usr/sfw/bin/python

l1 = []
print "#!/bin/sh -x"
print "# Make no changes here, machine generated!"
print "rm snoop \*.o"

for line in open("files.make"):
        [name, ext] = line.split('.')
        print "gcc -g -DUSE_FOR_SNOOP -c -I/builds/th199096/snoop/proto/root_i386/usr/include" \\
                " -I. -I/builds/th199096/snoop/usr/src/common/net/dhcp -o %s.o %s.c" % (name, name)
        l1.append(name)

print 'gcc -g -DTEXT_DOMAIN="SUNW_OST_OSCMD" -D_TS_ERRN -Bdirect -o snoop ',
for name in l1:
        print "%s.o" % (name),

print "-L/builds/th199096/snoop/proto/root_i386/lib -L/builds/th199096/snoop/proto/root_i386/usr/lib" \\
        " -ldhcputil -ldlpi -lsocket -lnsl -ltsol"

One thing that jumped out was since I threw away the 'ext', I didn't have to worry abotu stripping off the '\\n'. I also made use of the ',' on the end of the print statements to keep a line going.

I liked the ease of adding to the list of filenames. And in general, I found it easy to make a quick change and retest.

Could I have done this another way, say with the Makefile? Sure, but it wouldn't have been a learning experience. And off I go, the gdb prompt is calling me!


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Wednesday Oct 08, 2008

Another reader suggestion for the Python script

So Justin suggested:

try

template =  """"%(started)s - %(ended)s: %(title)s for      
%(company)s
    %(jobdesc)s"""


print template % row

I'm trying to capture what he put in, the line wrap on the first 'template' line is probably the comment system.

So I entered this, making a reformat tweak, as:


template = """"%(started)s - %(ended)s: %(title)s for %(company)s %(description)s"""

if __name__ == "__main__":
        for row in main(sys.argv[1],'!'):
                print template % row

And I got the wrong output. The error is of course mine:

> ./justin.py r4.txt
"1/05 - present: Staff Engineer Software for Sun Microsystems NFS development
"6/01 - 12/05: File System Engineer for Network Appliance WAFL and NFS development
"4/01 - 6/01: Manager for Network Appliance Manager of Engineering Internal Test
"10/99 - 4/01: System Administrator for Network Appliance Perl hacker and filer administrator

I shouldn't have joined the lines in the 'template' format. But why the extra '"'?

Ahh, Justin had one in his entry.

Okay, so the new changes work, here is the full program:

#!/usr/bin/env python

import csv, sys

def main(dfile,format,delimiter=","):
        db=open(dfile,'U')
        start=0
        for line in db:
                if line.startswith(format):
                        db.seek(start+len(format))
                        return csv.DictReader(db,delimiter=delimiter)
                else:
                        start+=len(line)+(len(db.newlines)==2) #windows hackery
        raise "There is no %s header line in %s" % (format,dfile)

template = """%(started)s - %(ended)s: %(title)s for %(company)s
        %(description)s"""

if __name__ == "__main__":
        for row in main(sys.argv[1],'!'):
                print template % row

And now I need to go figure out what Justin did...

Okay, the '%(name)s' has to be a formatting option. Can I duplicate it?

>>> for row in csv.DictReader(file("r4.txt")):
...     print "%(title)s" % row
...
Staff Engineer Software
File System Engineer
Manager
System Administrator

And now I do understand it - 3.6.2 String Formatting Operations:

A conversion specifier contains two or more characters and has the following components, which must occur in this order:

   1. The "%" character, which marks the start of the specifier.
   2. Mapping key (optional), consisting of a parenthesised sequence of characters (for example, (somename)). 

So, this code would be good for printing, but not necessarily for doing more complex manipulations.

By the way, I'm having fun figuring this stuff out...


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

A reader suggestion on how to solve the Perl script

Neil doesn't like that our comment section wipes out whitespace. His concern is certainly valid where it comes to the way Python uses indentation.

He suggested the following implementation:

#!/usr/bin/env python

import csv

def main(dfile,format,delimiter=","):
        db=open(dfile,'U')
        start=0
        for line in db:
                if line.startswith(format):
                        db.seek(start+len(format))
                        return csv.DictReader(db,delimiter=delimiter)
                else:
                        start+=len(line)+(len(db.newlines)==2) #windows hackery
        raise "There is no %s header line in %s" % (format,dfile)


if __name__ == "__main__":
        for row in main('data.txt','!'):
                print "%s - %s: %s for %s\\n\\t%s\\n\\n" % \\
                                tuple([row[column] for column in ['started','ended','title','company','jobdesc']])

And he provided the following note:

So what about something like this?

The csv module should take care of delimiters within columns
Simplification is possible if you don't need to deal with windows or
unix style line terminators
Changing delimiters is easy too.

I like that he caught on to making the separator an argument. It makes the code much more portable. I'm not sure it is as robust with respect to error handling, but in all fairness that could easily be handled and I did add those after posting the Perl script. Oh, and it does easily handle the addition of a new column in the data file.

I like the use of raise, I'm certainly not used to exception handlers any more.

I can see part of what is going on here:

>>> for row in neil.main("r4.txt",'!'):
...     print row
...
{'description': 'NFS development', 'title': 'Staff Engineer Software', 'started': '1/05', 'company': 'Sun Microsystems', 'ended': 'present', 'mad': 'money'}
{'description': 'WAFL and NFS development', 'title': 'File System Engineer', 'started': '6/01', 'company': 'Network Appliance', 'ended': '12/05', 'mad': 'honey'}
{'description': 'Manager of Engineering Internal Test', 'title': 'Manager', 'started': '4/01', 'company': 'Network Appliance', 'ended': '6/01', 'mad': 'scot'}
{'description': 'Perl hacker and filer administrator', 'title': 'System Administrator', 'started': '10/99', 'company': 'Network Appliance', 'ended': '4/01', 'mad': 'dam'}

And I thing the stuff with 'start' is what gets over the '!' in the first line???

>>> import csv
>>> help(csv.DictReader)

>>> for row in csv.DictReader(file("r4.txt")):
...     print row
...
{'!started': '1/05', 'description': 'NFS development', 'title': 'Staff Engineer Software', 'company': 'Sun Microsystems', 'ended': 'present', 'mad': 'money'}
{'!started': '6/01', 'description': 'WAFL and NFS development', 'title': 'File System Engineer', 'company': 'Network Appliance', 'ended': '12/05', 'mad': 'honey'}
{'!started': '4/01', 'description': 'Manager of Engineering Internal Test', 'title': 'Manager', 'company': 'Network Appliance', 'ended': '6/01', 'mad': 'scot'}
{'!started': '10/99', 'description': 'Perl hacker and filer administrator', 'title': 'System Administrator', 'company': 'Network Appliance', 'ended': '4/01', 'mad': 'dam'}

But I haven't figured out yet how the result is being built up. Okay, yes I have. I was fixated on the 'if' and 'else' thinking that was handling the header line versus the data line. But no, all it does it get you to the header line (i.e., there are comments in the file) and then the 'db.seek' gets you to the start of the header line and + 1 (via 'len(format)') for the format character. Then, just as in my interactive example, 'csv.DictReader' does the magic for you!

Sweet, Neil's code does what I had the Perl script doing!

It also shows I'm not used to all of the Python way of doing things. But it was fun to figure out what his script was doing!


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Tuesday Oct 07, 2008

Analyzing that old perl script

Guess I have to understand that script to rewrite it in Python. :->

First, gethead.pl reads through the file until it finds a line which starts with a '!'. In which case it creates a list of names of the form '$'column name:

        $format = '$' . join(', $', split(/,/, $first_line));
        print $format . "\\n";

Yields:

> ./r.pl r.txt
$started, $ended, $title, $company, $description

The magic really occurs in the main processing loop:

do main'read_txtfile_format(\*LNG_FILE, \*languages);

lang: while (<LNG_FILE>) {
        next lang if (/\^#/ || /\^!/);
        eval "($languages) = split(/[,\\n]/)";

        print "$started - $ended: $title for $company\\n\\t$description\\n\\n";
}

The first line gets '$languages' setup to the 'variable' names. Each time through the while loop, we call the eval to translate/associate the columns to variable names.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Python strings are Immutable

So I added some code to my simple script that wasn't in the Perl:

for line in open(sys.argv[1]):
        line.lstrip()

And my intent was to strip out all of the leading spaces. I didn't have to, but I created a simple test case with the first line pushed over by a tab and the second line pushed over by 8 spaces. The first one worked correctly and the second did not:

Update, I wasn't thinking correctly here, I knew I had two bad lines here and I didn't know why. After solving the coding problem, I can see that both lines of input failed. The header line is being treated here as if it were a normal line and being processed.

> ./simple2.py r2.txt
        !started - ended: title for company
        description



     1/05 - present: Staff Engineer Software for Sun Microsystems
        NFS development

Well d'oh, even in Perl I just told it to strip out one character. I've got to tell it that while there is whitespace, strip it out:

for line in open(sys.argv[1]):
        while line.isspace(): line.lstrip()

And this doesn't work either. At which point I realize it must be because strings are immutable, right? I mean it is never changing! Note I get to the right conclusion, but for the wrong reasons. If it were immutable and the string had whitespace at the start, I should be stuck in an endless loop here. See the ending section for that analysis.

It also points out that I never did anything with that line.lstrip(). It never changes line, but does create a reference to a new string. Which we can see here:

>>> st2 = "     This is the radio clash!"
>>> print st2.lstrip()
This is the radio clash!
>>> print st2
     This is the radio clash!
>>>

See, st2.lstrip() actually works!

I've fixed up the script (in a boring way) and it works:

for line in open(sys.argv[1]):
        line2 = line.lstrip()
        if line2.startswith("!") or line2.startswith("#"): continue
        print "%s - %s: %s for %s\\n\\t%s\\n\\n" % tuple(line2.split(","))

Another mistake I just made

Okay, to try to understand this, I did the following in the shell:

>>> st2 = "     This is the radio clash!"
>>> while st2.isspace():
...     print st2
...     st2.lstrip()
...

Which should be an endless loop according to what I know now. But nothing gets done. Which means that st2.isspace() is FALSE. And a help(st2.isspace) shows that:

Help on built-in function isspace: isspace(...) S.isspace() -> bool Return True if all characters in S are whitespace and there is at least one character in S, False otherwise.

I.e., my misunderstanding of st2.lstrip() being immutable made me think that st2.isspace() worked on the first character of the string. Actually, I made a bad assumption based on what I thought C would do. My bad.

So I don't ever want to do that while loop on a string which is really all whitespace.

All the reading in the world about Python strings will not help me understand the immutability of them as much as this simple example.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

The simple version of the old perl script

My first pass at a Python version of An old perl script reveals my inner C programmer. I've restricted the program to the simple version which does not generate the column name as local variables - first I want to get my proof of concept correct:

#!/usr/bin/python

for line in open("resume.txt"):
        line.lstrip()
        if line.startswith("!") or line.startswith("#"): continue
        (started, ended, title, company, description) = line.split(",")
        print "%s - %s: %s for %s\\n\\t%s\\n\\n", started, ended, title, company, description

It looks like it will reformat, but I've messed up the print statement:

> ./simple_1.py
%s - %s: %s for %s
        %s

1/05 present Staff Engineer Software Sun Microsystems NFS development

%s - %s: %s for %s
        %s

6/01 12/05 File System Engineer Network Appliance WAFL and NFS development

%s - %s: %s for %s
        %s

4/01 6/01 Manager Network Appliance Manager of Engineering Internal Test

%s - %s: %s for %s
        %s

10/99 4/01 System Administrator Network Appliance Perl hacker and filer administrator

I.e., I treated print like a C printf. Okay, I can try again with this one:

#!/usr/bin/python

for line in open("resume.txt"):
        line.lstrip()
        if line.startswith("!") or line.startswith("#"): continue
        (started, ended, title, company, description) = line.split(",")
        print "%s - %s: %s for %s\\n\\t%s\\n\\n" % (started, ended, title, company, description)

And get more of what I want to see:

> ./simple.py
1/05 - present: Staff Engineer Software for Sun Microsystems
        NFS development



6/01 - 12/05: File System Engineer for Network Appliance
        WAFL and NFS development



4/01 - 6/01: Manager for Network Appliance
        Manager of Engineering Internal Test



10/99 - 4/01: System Administrator for Network Appliance
        Perl hacker and filer administrator

I'm getting an extra line I don't want and I have to hard code the file to process. I can easily fix these both up:

#!/usr/bin/python

import sys

for line in open(sys.argv[1]):
        line.lstrip()
        if line.startswith("!") or line.startswith("#"): continue
        (started, ended, title, company, description) = line.split(",")
        print "%s - %s: %s for %s\\n\\t%s\\n" % (started, ended, title, company, description)

Feeling adventurous

Okay, with this simple example, I could get rid of the names in Perl and make it really simple. Can I do so in Python?

#!/usr/bin/python

import sys

for line in open(sys.argv[1]):
        line.lstrip()
        if line.startswith("!") or line.startswith("#"): continue
        print "%s - %s: %s for %s\\n\\t%s\\n\\n" % line.split(",")

No, not as I have tried:

> ./simple2.py resume.txt
Traceback (most recent call last):
  File "./simple2.py", line 8, in 
    print "%s - %s: %s for %s\\n\\t%s\\n\\n" % line.split(",")
TypeError: not enough arguments for format string

I've got a type error, hmm, I'm going to try this by hand:

> python
>>> st1 = "This is the radio clash!"
>>> st1.split()
['This', 'is', 'the', 'radio', 'clash!']
>>>

So I have a '[]' instead of a '()'. What does that mean? It means I have a list versus a tuple. And I find a converter called strangely enough, tuple:

        print "%s - %s: %s for %s\\n\\t%s\\n\\n" % tuple(line.split(","))

And that works.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Monday Oct 06, 2008

An old perl script

I've got an old perl script that I have gotten a lot of mileage from:

package read_txtfile_format;

sub main'read_txtfile_format {
        local(\*file,\*format) = @_;
        local($first_line, $first_char) = '';

        do {
                $first_line = <file>;
                $first_line =~ /(.)(.\*)/;

                $first_char = $1;
                $first_line = $2;
        } until ($first_char eq "!" || eof(file));

        if (eof(file)) {
                die "There is no ! header line in $file";
        }

        $format = '$' . join(', $', split(/,/, $first_line));
}

I didn't write it, I think either Mark Lawrence or Walt Gaber did while I was at DRD Corporation. Or they got it somewhere. I know they called it a data dictionary - which I'm not sure is a way I would use that term these days. By the way, I still have and use my very beaten up copy of Programming Perl from back then - a 1991 printing.

What it does is allow you to read in another file, generate variable names based on a line starting with a '!' and then use those names per line. It is a cheap database laid out in a flat file.

I think we called this basic script and its associated text files data dictionaries because we would use it to quickly prototype and change data structures in C. I know I used it in my Genetic Programming research to describe the operators used in a new problem set.

Perhaps an example will show the power.

Resume example

I want to quickly take my resume and reformat it as needed. Perhaps I need it in html format, a plain text file, etc.

I can keep my data in a file, I can have a skeleton script to process it, and I can quickly change it to adapt to new styles...

I'm picking an example which looks like I'm pimping myself out because I thought it was quirky and fun to code. It is also a way I never would have thought to do an example with this piece of code.

Data file

!started,ended,title,company,jobdesc
1/05,present,Staff Engineer Software,Sun Microsystems,NFS development
6/01,12/05,File System Engineer,Network Appliance,WAFL and NFS development
4/01,6/01,Manager,Network Appliance,Manager of Engineering Internal Test
10/99,4/01,System Administrator,Network Appliance,Perl hacker and filer administrator

Perl script

#! /usr/bin/perl

do 'getthead.pl';

open(LNG_FILE, $ARGV[0]) || die "Can't open LNG_FILE: $!\\n";

# Determine the Column Names
do main'read_txtfile_format(\*LNG_FILE, \*languages);

lang: while (<LNG_FILE>) {
        next lang if (/\^#/ || /\^!/);
        eval "($languages) = split(/[,\\n]/)";

        print "$started - $ended: $title for $company\\n\\t$description\\n\\n";
}

Results

And here I have it dumping out a format much like the resume.txt file I have been updating as I change job functions:

> ./resume.pl resume.txt
1/05 - present: Staff Engineer Software for Sun Microsystems
        NFS development

6/01 - 12/05: File System Engineer for Network Appliance
        WAFL and NFS development

4/01 - 6/01: Manager for Network Appliance
        Manager of Engineering Internal Test

10/99 - 4/01: System Administrator for Network Appliance
        Perl hacker and filer administrator

But wait, there is an easier way

#! /usr/bin/perl

open(LNG_FILE, $ARGV[0]) || die "Can't open LNG_FILE: $!\\n";

lang: while (<LNG_FILE>) {
        next lang if (/\^#/ || /\^!/);
        ($started, $ended, $title, $company, $description) = split(/[,\\n]/);

        print "$started - $ended: $title for $company\\n\\t$description\\n\\n";
}

It does the same thing, less code as well.

But it isn't as dynamic. I have to edit both the data file and the script to make a change. If I were to add a new field location after company, I would have to change the script on the split. Also, what if I have many scripts manipulating the same data? During my research, I had two different data files and six different scripts per problem set.

Research examples

For clique detection, 5-6 lines of data dictionary entries resulted in about 400 lines of C code. For predator/prey, about 50 lines of data dictionary entries resulted in about 890 lines of C code.

An example set of data dictionary entries for the predator/prey would be:

!fId:fSymbol:fType:fArity:fMacro:fDifGen:fChild1:fChild2:fChild3:fChild4:fChild5:fDescription
Agent:Ag:Agent:1:True:False:Agent:NO_CHILD:NO_CHILD:NO_CHILD:NO_CHILD:Returns A's predatorId.
And:&&:Boolean:2:True:False:Boolean:Boolean:NO_CHILD:NO_CHILD:NO_CHILD:A AND B.
CellOf:CellOf:Cell:2:False:False:Agent:Tack:NO_CHILD:NO_CHILD:NO_CHILD:The (X,Y) coordinate of agent A if it moves from its current cell to the one in the Tack B.

(Note I've changed the separator to a ':' for clarity.)

Part of the processing script would be:

        # Put in all Caps
        ( $capId = 'GP_L' . $LBranch . '_F_' . $fId ) =~ tr/a-z/A-Z/;
...
        print INI_FP '    /\*';
        print INI_FP '     \* ' . $fDescription;
        print INI_FP '     \*/';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].iId = ' . $capId . ';';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].psSymbol = "' . $fSymbol . '";';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].ftType = ' . $capType . ';';
        print INI_FP '    pgps->als[' . $LBranch . '].afs[i].arity = ' . $fArity . ';';
...

And some resulting code would be:

    /\*
     \* Branch 0 - Main language for the system
     \*/
    /\*
     \* Branch 0 - Functions
     \*/
    pgps->als[0].afs = (FunctionsStruct \*)calloc( GP_L0_MAX_FUNCTIONS,
                                      GP_FUNCTIONS_SIZE );
    if ( !pgps->als[0].afs ) {
        GU_logError( stderr, "%s(%d): Out of Memory!\\n",
                     __FILE__, __LINE__ );
        GU_exit ( -1 );
    }

    /\*
     \* Returns A's predatorId.
     \*/
    pgps->als[0].afs[i].iId = GP_L0_F_AGENT;
    pgps->als[0].afs[i].psSymbol = "Ag";
    pgps->als[0].afs[i].ftType = FT_L0_E_AGENT;
    pgps->als[0].afs[i].arity = 1;
    pgps->als[0].afs[i].bMacro = TRUE;
    pgps->als[0].afs[i].bActive = TRUE;
    pgps->als[0].afs[i].bDifGeneric = FALSE;

    pgps->als[0].afs[i].pftChildren = (FunctionTypes \*)calloc( pgps->als[0].afs[i].arity, FT_TYPES_SIZE );
    if ( !pgps->als[0].afs[i].pftChildren ) {
        GU_logError( stderr, "%s(%d): Out of Memory!\\n",
                     __FILE__, __LINE__ );
        GU_exit ( -1 );
    }

    pgps->als[0].afs[i].pftChildren[0] = FT_L0_E_AGENT;

    pgps->als[0].afs[i].pFct = gpf_L0_Agent;
    i++;

    /\*
     \* A AND B.
     \*/
    pgps->als[0].afs[i].iId = GP_L0_F_AND;
    pgps->als[0].afs[i].psSymbol = "&&";

By the way, the same script f_types.pl would process all of the language data dictionaries without being modified. If I happened to change the underlying data structures in the C code, i.e., FunctionsStruct, I could change that one script and rebuild all of the different languages.

So where's the Beef?

Why the walk down memory lane?

Well, I still use this script. I've used it to do volunteer scheduling at AAAI, generate Java opcodes for a simple JVM implementation, plan a new company, check for sibling conflicts during a recreational soccer season, implement testbeds for QA efforts for multiple companies, etc. I don't have to have a database on my system. I can suck data out of a database on a Windows box, store it in a CSV datafile on OpenSolaris, and play with the data. I don't have to know SQL and/or care too much about the data. I can generate "reports" and such from the CLI.

And it is the power of Perl (well, the eval() it offers) which lets me get away with this. One of the selling points of Perl for me was rapid prototyping, especially with respect to strings. I could have written C programs to do all of this, but why?

If I'm going to learn Python, I need to be able to replace this piece of functionality. Or else I'll be back with Perl before you know it.

And honestly, even if I learn to make Python bark for me, I'll pick up the tool I need when I have to. :->

Well off to sleep and I'll pick this up tomorrow when I start playing with Python.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Trying to figure out printing and variables in Python

I'm pretty used to referencing variables inside print blocks in Perl. I'm not at all comfortable with Python. I have a block of code that I want to change the 'onhg' to come out of a config file. So I set up a scratch directory and make a bare bones implementation:

[th199096@jhereg etc]> ls -laiR ~/scratch/
/home/th199096/scratch/:
total 42
       749 drwxr-xr-x   3 th199096 staff          4 Oct  6 16:36 .
         3 drwxr-xr-x  39 th199096 staff         55 Oct  6 16:35 ..
       752 drwxr-xr-x   2 th199096 staff          6 Oct  6 16:42 etc
       750 -rwxr-xr-x   1 th199096 staff       1297 Oct  6 16:42 updateoso.py

/home/th199096/scratch/etc:
total 25
       752 drwxr-xr-x   2 th199096 staff          6 Oct  6 16:42 .
       749 drwxr-xr-x   3 th199096 staff          4 Oct  6 16:36 ..
       753 -rw-r--r--   1 th199096 staff       1052 Oct  6 16:40 __init__.py
       754 -rw-r--r--   1 th199096 staff        243 Oct  6 16:41 __init__.pyc
       751 -rwxr-xr-x   1 th199096 staff         94 Oct  6 16:42 config.py
       756 -rw-r--r--   1 th199096 staff        257 Oct  6 16:42 config.pyc

Where the config file simply has:

GATE_USER = "onhg"
GATE_GROUP = "gk"

OSOREPO = "ssh://hg.opensolaris.org/hg/onnv/onnv-gate"

And the updateoso has:

import os, pwd, subprocess, sys

from mercurial import hg, repo
from mercurial.node import hex

sys.path.insert(1,
    os.path.realpath(os.path.join(os.path.dirname(__file__), "..")))
from etc import config

__USAGE = """
updateoso.py [-n] <-R repo root>
    -n: dry run, no email sent (displayed on stdout)
    -R: root dir of repo (where .hg is)

Attempt to send changes to 
%s

This script must be run as user "onhg".
You should set up RBAC and use pfexec(1).
""" % (config.OSOREPO)
__USAGE = __USAGE.strip()

print >> sys.stderr, __USAGE

Well, in isolation, I can already see what I am going to have to do. All I need to do is replace the 'ohng' with a %s and add a second argument:

[th199096@jhereg ~/scratch]> diff updateoso.py updateoso.py.first 
39c39
< This script must be run as user "%s".
---
> This script must be run as user "ohng".
41c41
< """ % (config.OSOREPO, config.GATE_USER)
---
> """ % (config.OSOREPO)

And we get:

[th199096@jhereg ~/scratch]> ./updateoso.py
updateoso.py [-n] <-R repo root>
    -n: dry run, no email sent (displayed on stdout)
    -R: root dir of repo (where .hg is)

Attempt to send changes to 
ssh://hg.opensolaris.org/hg/onnv/onnv-gate

This script must be run as user "onhg".
You should set up RBAC and use pfexec(1).

I ought to be able to test this inside the interactive shell:

[th199096@jhereg ~/scratch]> python
Python 2.4.4 (#1, Aug 25 2008, 03:30:42) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import updateoso
updateoso.py [-n] <-R repo root>
    -n: dry run, no email sent (displayed on stdout)
    -R: root dir of repo (where .hg is)

Attempt to send changes to 
ssh://hg.opensolaris.org/hg/onnv/onnv-gate

This script must be run as user "onhg".
You should set up RBAC and use pfexec(1).
>>> config.GATE_USER = "duke"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'config' is not defined

Okay, I should have known that wasn't going to work. It would probably work in the code (we'll see later), but for now this will work:

>>> updateoso.config.GATE_USER = "duke"
>>> reload(updateoso)
updateoso.py [-n] <-R repo root>
    -n: dry run, no email sent (displayed on stdout)
    -R: root dir of repo (where .hg is)

Attempt to send changes to 
ssh://hg.opensolaris.org/hg/onnv/onnv-gate

This script must be run as user "duke".
You should set up RBAC and use pfexec(1).
<module 'updateoso' from 'updateoso.pyc'>

To be honest, I knew the reference would work, but I expected it to be reset. In retrospect, I can see that I reloaded updateoso and etc/config. Just something to get used to. I could force it to 'reset' via:

>>> reload(etc/config)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'etc' is not defined
>>> from etc reload(config)
  File "<stdin>", line 1
    from etc reload(config)
                  \^
SyntaxError: invalid syntax
>>> reload(updateoso.config)
<module 'etc.config' from 'etc/config.pyc'>
>>> reload(updateoso)
updateoso.py [-n] <-R repo root>
    -n: dry run, no email sent (displayed on stdout)
    -R: root dir of repo (where .hg is)

Attempt to send changes to 
ssh://hg.opensolaris.org/hg/onnv/onnv-gate

This script must be run as user "onhg".
You should set up RBAC and use pfexec(1).
<module 'updateoso' from 'updateoso.pyc'>

Took me a bit to figure out the syntax.

Okay, can I see the change from the script:

[th199096@jhereg ~/scratch]> diff updateoso.py updateoso.py.second 
45,48d44
< 
< config.GATE_USER = "gark"
< print >> sys.stderr, __USAGE
< 

I don't expect this to work. And it doesn't.

This script must be run as user "onhg".
...
This script must be run as user "onhg".

How about a test driver script?

[th199096@jhereg ~/scratch]> more test.py 
import updateoso

print "Now change the user"

updateoso.config.GATE_USER = "nark"

reload(updateoso)

And that works:

This script must be run as user "onhg".
...
Now change the user
...
This script must be run as user "nark".

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Sunday Oct 05, 2008

Learning a new language - python

So I decided to learn python - why? because it is used by Mercurial. And there is at least one of the gatekeeping scripts which I needed to hack for the nfs41-gate.

I bought Learning Python, Third Edition by Mark Lutz because the local Borders did not have Programming Python, Third Edition. From some reviews I read, it would probably have been a better fit for me.

I know I can find most of what I want on the net, but I wanted a printed resource.

Anyway, I had a question right off the bat - about whether if file a imports modules b and c, what happens if c also imports b? Deeper in the book that I've read, it does state that an import is equivalent to load the file if it is not already loaded. But that doesn't help me learn the language. :->

The following example is quite simple, but effective in answering the question for me:

a.py

> cat a.py
#!/usr/bin/python

title = "This is the file a.py!"

print title

print "importing b from a"
import b

print "importing c from a"
import c

b.py

> cat b.py
#!/usr/bin/python

title = "This is the file b.py!"

print title

c.py

> cat c.py
#!/usr/bin/python

import b

title = "This is the file c.py!"

print title

Test 1

> ./a.py
This is the file a.py!
importing b from a
This is the file b.py!
importing c from a
This is the file c.py!

So we see that if a has loaded it, then c will not. How about the other way?

a2.py

> cat a2.py
#!/usr/bin/python

title = "This is the file a.py!"

print title

print "importing c from a"
import c

print "importing b from a"
import b

print "and b's title is"
print b.title

Test 2

> ./a2.py
This is the file a.py!
importing c from a
This is the file b.py!
This is the file c.py!
importing b from a
and b's title is
This is the file b.py!

We see that c loads b and that b's attributes are visible from a.

What would really help me here is if b could state the call stack of what is importing it.

Test 3

A simple change fails:

> cat b.py
#!/usr/bin/python

title = "This is the file b.py!"

print title

print "Called from", __file__

but it does show the effect of byte code compilation:

> ./a.py
This is the file a.py!
importing b from a
This is the file b.py!
Called from /home/tdh/python/b.py
importing c from a
This is the file c.py!
> ./a2.py
This is the file a.py!
importing c from a
This is the file b.py!
Called from /home/tdh/python/b.pyc
This is the file c.py!
importing b from a
and b's title is
This is the file b.py!

I can see the "nesting" if I pop into an interactive session:

> python
>>> import a2
This is the file a.py!
importing c from a
This is the file b.py!
Called from b.pyc
This is the file c.py!
importing b from a
and b's title is
This is the file b.py!
>>> a2.__dict__.keys()
['c', 'b', 'title', '__builtins__', '__file__', '__name__', '__doc__']
>>> a2.c.__dict__.keys()
['b', 'title', '__builtins__', '__file__', '__name__', '__doc__']
>>> a2.c.b.__dict__.keys()
['__builtins__', '__name__', '__file__', '__doc__', 'title']

But this doesn't answer my question of how to figure this out recursively. I.e., I guess I am looking for a parent "pointer" and I could walk it to get my answer.

But I've still learned more than just reading the book linearly.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Tuesday Sep 30, 2008

Why you should do basic triage before sending out a plea for help

So I think I have found a bug in our implementation of Mercurial. I'm all excited and I'm about to send the following email off to the ON gatekeepers:

Take a zfs clone of a workspace which has ssh://aus1500-home//pool/ws/th199096/spe-gate as a parent.

[th199096@jhereg ~]> zfs clone pool/builds/th199096/spe-gate@postjimw pool/builds/th199096/spe-fix
[th199096@jhereg ~]> ws /builds/th199096/spe-fix

Workspace			: /builds/th199096/spe-fix
Workspace Parent		: ssh://aus1500-home//pool/ws/th199096/spe-gate
Proto area ($ROOT)		: /builds/th199096/spe-fix/proto/root_i386
Root of source ($SRC)		: /builds/th199096/spe-fix/usr/src
Root of test source ($TSRC)  : /builds/th199096/spe-fix/usr/ontest
Current directory ($PWD)	: /builds/th199096/spe-fix

Reparent to ssh://aus1500-home//pool/ws/nfs41-clone

[th199096@jhereg spe-fix]> hg reparent ssh://aus1500-home//pool/ws/nfs41-clone
[th199096@jhereg spe-fix]> ws usr/closed/

Workspace			: /builds/th199096/spe-fix/usr/closed
Workspace Parent		: ssh://aus1500-home//pool/ws/th199096/spe-gate/usr/closed
Proto area ($ROOT)		: /builds/th199096/spe-fix/usr/closed/proto/root_i386
Root of source ($SRC)		: /builds/th199096/spe-fix/usr/closed/usr/src
Root of test source ($TSRC)  : /builds/th199096/spe-fix/usr/closed/usr/ontest
Current directory ($PWD)	: /builds/th199096/spe-fix/usr/closed

[th199096@jhereg closed]>  hg reparent ssh://aus1500-home//pool/ws/nfs41-clone/usr/closed
[th199096@jhereg closed]> exit
exit
[th199096@jhereg spe-fix]> hg list
added:
      	usr/src/cmd/fs.d/nfs/sped/Makefile
	usr/src/cmd/fs.d/nfs/sped/sped.c
	usr/src/cmd/fs.d/nfs/sped/sped_dt.d
	usr/src/cmd/fs.d/nfs/sped/sped_server.c
	usr/src/cmd/fs.d/nfs/sped/spedaemon.c
	usr/src/cmd/fs.d/nfs/sped/spedaemon.h
	usr/src/cmd/fs.d/nfs/svc/spe.xml
	usr/src/head/rpcsvc/spe_prot.x
	usr/src/uts/common/fs/nfs/spe.c
	usr/src/uts/common/nfs/spe.h
	usr/src/uts/common/nfs/spe_attr.h
	usr/src/uts/common/nfs/spe_impl.h
modified:
        usr/src/cmd/fs.d/nfs/Makefile
	usr/src/cmd/fs.d/nfs/svc/Makefile
	usr/src/head/Makefile
	usr/src/head/rpcsvc/daemon_utils.h
	usr/src/lib/libshare/nfs/libshare_nfs.h
	usr/src/pkgdefs/SUNWhea/prototype_com
	usr/src/pkgdefs/SUNWhea/prototype_com
	usr/src/pkgdefs/SUNWnfscr/prototype_com
	usr/src/pkgdefs/SUNWnfscu/prototype_com
	usr/src/uts/common/Makefile
	usr/src/uts/common/Makefile.files
	usr/src/uts/common/dserv/dserv_mds.c
	usr/src/uts/common/fs/Makefile
	usr/src/uts/common/fs/nfs/ds_srv.c
	usr/src/uts/common/fs/nfs/nfs41_srv.c
	usr/src/uts/common/fs/nfs/nfs41_state.c
	usr/src/uts/common/fs/nfs/nfs_sys.c
	usr/src/uts/common/nfs/Makefile
	usr/src/uts/common/nfs/mds_state.h
	usr/src/uts/common/nfs/nfs4.h
	usr/src/uts/common/nfs/nfssys.h
	usr/src/uts/intel/nfs/Makefile
	usr/src/uts/sparc/nfs/Makefile
Time spent in user mode	  (CPU seconds) : 3.61s
Time spent in kernel mode (CPU seconds) : 5.44s
Total time				: 0:27.46s
CPU utilisation (percentage)		: 32.9%

End up making changes to usr/src/head/rpcsvc/ds_prot.x, usr/src/uts/common/fs/nfs/ds_srv.c, usr/src/uts/common/fs/nfs/nfs41_state.c, and usr/src/uts/common/nfs/nfs_serv_inst.h

[th199096@jhereg spe-fix]> vi usr/src/head/rpcsvc/ds_prot.x
[th199096@jhereg spe-fix]> vi usr/src/uts/common/dserv/dserv_mds.c
[th199096@jhereg spe-fix]> vi usr/src/uts/common/fs/nfs/ds_srv.c
[th199096@jhereg spe-fix]> vi usr/src/uts/common/fs/nfs/nfs41_srv.c
[th199096@jhereg spe-fix]> vi usr/src/uts/common/fs/nfs/nfs41_state.c
[th199096@jhereg spe-fix]> vi usr/src/uts/common/fs/nfs/nfs41_state.c
[th199096@jhereg spe-fix]> pushd usr/src/uts/common/fs/nfs
/builds/th199096/spe-fix/usr/src/uts/common/fs/nfs /builds/th199096/spe-fix
[th199096@jhereg nfs]> grep ds_guid_info_idx \*
ds_srv.c:	    mds_server->ds_guid_info_idx,
nfs41_state.c:rfs4_index_t \*ds_guid_info_idx;
nfs41_state.c:	instp->ds_guid_info_idx = rfs4_index_create(instp->ds_guid_info_tab,
[th199096@jhereg nfs]> vi ds_srv.c
[th199096@jhereg nfs]> popd
/builds/th199096/spe-fix
[th199096@jhereg spe-fix]> vi usr/src/uts/common/fs/nfs/nfs41_state.c
[th199096@jhereg spe-fix]> vi usr/src/uts/common/fs/nfs/nfs41_state.c\\
?
[th199096@jhereg spe-fix]> vi usr/src/uts/common/nfs/nfs_serv_inst.h

Reparent back to ssh://aus1500-home//pool/ws/th199096/spe-gate

[th199096@jhereg spe-fix]> hg reparent ssh://aus1500-home//pool/ws/th199096/spe-gate
[th199096@jhereg spe-fix]> ws usr/closed/

Workspace			: /builds/th199096/spe-fix/usr/closed
Workspace Parent		: ssh://aus1500-home//pool/ws/nfs41-clone/usr/closed
Proto area ($ROOT)		: /builds/th199096/spe-fix/usr/closed/proto/root_i386
Root of source ($SRC)		: /builds/th199096/spe-fix/usr/closed/usr/src
Root of test source ($TSRC)  : /builds/th199096/spe-fix/usr/closed/usr/ontest
Current directory ($PWD)	: /builds/th199096/spe-fix/usr/closed

[th199096@jhereg closed]> hg reparent ssh://aus1500-home//pool/ws/th199096/spe-gate/usr/closed
[th199096@jhereg closed]> exit
exit

And no changes ???

[th199096@jhereg spe-fix]> hg outgoing
comparing with ssh://aus1500-home//pool/ws/th199096/spe-gate
searching for changes
no changes found
[th199096@jhereg spe-fix]> hg push
pushing to ssh://aus1500-home//pool/ws/th199096/spe-gate
searching for changes
no changes found
[th199096@jhereg spe-fix]>

Up until I've just been cutting and pasting what had happened, now I start to debug to show that I'm not just emailing without trying anything:

Hmm, I happen to have a copy of that gate on this machine:

[th199096@jhereg spe-fix]> diff usr/src/head/rpcsvc/ds_prot.x ../spe-gate/usr/src/head/rpcsvc/ds_prot.x
255d254
<	utf8string	ds_path;
[th199096@jhereg spe-fix]> diff usr/src/uts/common/dserv/dserv_mds.c ../spe-gate/usr/src/uts/common/dserv/dserv_mds.c
[th199096@jhereg spe-fix]> diff usr/src/uts/common/fs/nfs/ds_srv.c ../spe-gate/usr/src/uts/common/fs/nfs/ds_srv.c
139,141d138
<			UTF8STRING_FREE(res_ok->guid_map.guid_map_val[i].
<                           ds_path);
<
711,713d707
<			(void) utf8_copy(&pip->ds_path,
<                           &guid_map[count].ds_path);
<
[th199096@jhereg spe-fix]> diff usr/src/uts/common/fs/nfs/nfs41_srv.c ../spe-gate/usr/src/uts/common/fs/nfs/nfs41_srv.c
[th199096@jhereg spe-fix]> diff usr/src/uts/common/fs/nfs/nfs41_state.c ../spe-gate/usr/src/uts/common/fs/nfs/nfs41_state.c
1919c1919
<  \* this will populate the following MDS tables.
---
>  \* this will populste the following MDS tables.
2016c2016
<	ds_guid_info_t		\*pip;
---
>	mds_pool_info_t		\*pip;
2022c2022
<	rw_enter(&ds_guid_info_lock, RW_READER);
---
>	rw_enter(&mds_pool_info_lock, RW_READER);
2024c2024
<	pip = (ds_guid_info_t \*)rfs4_dbsearch(ds_guid_info_path_idx,
---
>	pip = (mds_pool_info_t \*)rfs4_dbsearch(mds_pool_info_path_idx,
2031c2031
<	rw_exit(&ds_guid_info_lock);
---
>	rw_exit(&mds_pool_info_lock);
2119c2119
< ds_guid_info_path_compare(rfs4_entry_t entry, void \*key)
---
> mds_pinfo_path_compare(rfs4_entry_t entry, void \*key)
2121c2121
<	ds_guid_info_t \*pip = (ds_guid_info_t \*)entry;
---
>	mds_pool_info_t \*pip = (mds_pool_info_t \*)entry;
2127c2127
< ds_guid_info_path_mkkey(rfs4_entry_t entry)
---
> mds_pinfo_path_mkkey(rfs4_entry_t entry)
2129c2129
<	ds_guid_info_t \*pip = (ds_guid_info_t \*)entry;
---
>	mds_pool_info_t \*pip = (mds_pool_info_t \*)entry;
2433,2438d2432
<	instp->ds_guid_info_path_idx =
<	    rfs4_index_create(instp->ds_guid_info_tab,
<	    "DS_guid_path-idx", ds_guid_info_hash, ds_guid_info_path_compare,
<	    ds_guid_info_path_mkkey,
<	    TRUE);
<
[th199096@jhereg spe-fix]> diff usr/src/uts/common/nfs/nfs_serv_inst.h ../spe-gate/usr/src/uts/common/nfs/nfs_serv_inst.h
193d192
<	rfs4_index_t \*ds_guid_info_path_idx;
[th199096@jhereg spe-fix]>

So the files are different

[th199096@jhereg spe-fix]> diff usr/src/uts/common/nfs/nfs_serv_inst.h /net/aus1500-home/pool/ws/th199096/spe-gate/usr/src/uts/common/nfs/nfs_serv_inst.h
193d192<	rfs4_index_t \*ds_guid_info_path_idx;

Yes, really they are

Why doesn't Mercurial think so?

It knows that the files have changed

[th199096@jhereg spe-fix]> hg list
modified:
        usr/src/head/rpcsvc/ds_prot.x
	usr/src/uts/common/fs/nfs/ds_srv.c
	usr/src/uts/common/fs/nfs/nfs41_state.c
	usr/src/uts/common/nfs/nfs_serv_inst.h

[th199096@jhereg spe-fix]> hg outgoing
comparing with ssh://aus1500-home//pool/ws/th199096/spe-gate
searching for changes
no changes found

And here the email stops as I RTFM and realize I forgot a step:

[th199096@jhereg spe-fix]> hg commit
[th199096@jhereg spe-fix]> hg outgoing
comparing with ssh://aus1500-home//pool/ws/th199096/spe-gate
searching for changes
changeset:   7779:ced6eccb4366
tag:		tip
user:		Thomas Haynes 
date:		Tue Sep 30 14:57:46 2008 -0500
summary:	Fix up for new NFS instances

Without a hg commit, even with changes, there is nothing to integrate back to the parent.

So, instead of an email, I blog about it...

[th199096@jhereg spe-fix]> hg push
pushing to ssh://aus1500-home//pool/ws/th199096/spe-gate
searching for changes
Are you sure you wish to push? [y/N]: y
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 4 changes to 4 files

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Thursday Apr 03, 2008

Putting that throw away code to good use

I'm going to narrate a massive debug session I just had, in which I used my XDR dump routine to flush out bugs. I'm not going to give all of the code, and my approach is very free form. But that is very much like a massive debug session. Enjoy or not!

Some my upcall from the kernel to the sped daemon is working fine, but the return path is not. My 64 bit gpool-id is off:

# Apr  3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523570| gives |35865356092523570| and |35865356092523570|
10, 16, 64000, 35865356092523570 uid == 200096
20, 32, 2000, 35865356092523571 uid == 1066
30, 64, 1000, 35865356092523572 uid == 0
40, 2, 2000, 35865356092523573 subnet ==  10.1.233.0/24
Apr  3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523571| gives |35865356092523571| and |35865356092523571|
Apr  3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523572| gives |35865356092523572| and |35865356092523572|
Apr  3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523573| gives |35865356092523573| and |35865356092523573|
We got a policy 64, -17890057, 18444853557553841283 for /nippy/bogey4
Apr  3 16:13:34 pnfs-3-15 nfs: WARNING: sped: 2 0 2 200096 10 33554432 /nippy/bogey4 0 0
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:    7:   0  0  0  2  0  0  0  0
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:   15:   0  0  0  2  0  3  D A0
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:   23:   0  0  0  A  2  0  0  0
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:   31:   0  0  0  D 2F 6E 69 70
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:   39:  70 79 2F 62 6F 67 65 79
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:   47:  34  0  0  0  0  0  0  0
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:   52:   0  0  0  0
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]: We got a policy 64, -17890057, 18444853557553841283 for /nippy/bogey4
Apr  3 16:13:34 pnfs-3-15 ./sped[100781]:    4:   0  0  0  5

We can see it change from 35865356092523570 to 18444853557553841283. Hmm, it used to have one more change. Anyway, the value of -17890057 looks off. It should be one of 1000, 2000, or 64000. And perhaps the 64 is off as well.

Anyway, we can see that the input XDR looks okay. I've even added a byte counter for it.

And that lets us see that the last line must be the reply. The counter has started again and 4 bytes as a reply looks too small. Oh crud, what does a 5 mean as a status code?

/\*
 \* Return status codes
 \*/
#define SPED_OK         0

/\*
 \* Caller provided invalid arguments
 \*/
#define SPED_INVALID    1

/\*
 \* Internal error in daemon e.g. out of memory, can't return result
 \*/
#define SPED_INTERNAL   2

/\*
 \* Were layouts revoked?
 \*/
#define SPED_REVOKED    3

/\*
 \* Was there a matching rule, but not enough available gpools?
 \*/
#define SPED_MATCH_NOT_ENOUGH   4

/\*
 \* Was there no matching rules?
 \*/
#define SPED_NO_MATCH   5

Grin, so I thought we were going to find that the buffer was too small and something in the encoding was way off. Looks like the buffer is just right and now I need to debug why my input is not matching a policy. But, my debug code is correct here! What is being printed is the garbage in the local variable.

I added some code to zap the local variable and to also dump the input parameters:

Apr  3 16:37:20 pnfs-3-15 ./sped[100790]:   52:   0  0  0  0
Apr  3 16:37:20 pnfs-3-15 ./sped[100790]: Matching against: 200096 10 0 0 0 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogey5 (N/A)

Hmm, no XDR dump of the output. That means we are erroring out earlier, probably due to zapping the local variable with a memset!

It is coring now somewhere in here:


                if (sar.gpoolids)
                        u = sar.gpoolids[0];
                else
                        u = (uint64_t)0xe4e5c7c8;

                printf("We got a policy %d, %d, %llu for %s\\n",
                    sar.lh_unit_size, sar.lh_stripe_count,
                    sar.gpoolids[0], saa.path);

                syslog(LOG_ERR, "We got a policy %d, %d, %llu for %s",
                    sar.lh_unit_size, sar.lh_stripe_count,
                    sar.gpoolids[0], saa.path);

We should be concerned that we generate u but never use it. My guess is that sar.gpoolids is now NULL instead of random junk. If we fix that, we get back to:

Apr  3 21:38:41 pnfs-3-15 ./sped[100841]: Matching against: 200096 10 0 0 33554432 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogey8 (N/A)
Apr  3 21:38:41 pnfs-3-15 ./sped[100841]: We got a policy 0, 0, 3840264136 for /nippy/bogey8

Ignoring the need to fix the receiving code to detect a NULL array, what can we do to get a match?

Another unitialized variable in going from the stand alone policy debugger to the sped daemon. How did I find it?

(gdb) n
spe_eval_thunk (st={st_is_interior = B_FALSE, st_node = 0x8067af0}, 
    pat=0x8047144, sa=0x8047078, prc=0x80470d8, bServer=B_TRUE)
    at spedaemon.c:827
827		return (b);
(gdb) p b
$19 = B_TRUE
(gdb) n
spe_eval_attribute (si=0x80674e8, pat=0x8047144, prc=0x80470d8, bServer=B_TRUE)
    at spedaemon.c:864
864			if (\*prc != 0) {
(gdb) p \*prc
$21 = -16785880

Why was \*prc such an odd number, no error cases were hit.

int
spe_global_eval(policy_attributes_t \*pat, uint64_t \*\*gooies,
    uint32_t \*n_gooies, uint32_t \*unit_size, uint32_t \*stripes)
{
        spe_policy_t    \*sp;
        boolean_t       b = FALSE;
        int             rc;

If we set rc = 0, it works.

Apr  3 22:54:24 pnfs-3-15 ./sped[100938]: Matching against: 200096 10 0 0 33554432 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogeya (N/A)
Apr  3 22:54:24 pnfs-3-15 ./sped[100938]: Considering rule 0
Apr  3 22:54:24 pnfs-3-15 ./sped[100938]: Considering 200096 vs 200096
Apr  3 22:54:24 pnfs-3-15 ./sped[100938]:    7:   0  0  0  0  8  6 F0 38

An 8 byte reply looks wrong. And we have another braindead problem. (Please realize this is the first time I am testing code I wrote over a month ago. I had to task switch to a P1 bug.):

                result = spe_global_eval(&pat, &sar.gpoolids,
                    &sar.n_gooies, &sar.lh_unit_size, &sar.lh_stripe_count);
                if (!result) {
                        sped_door_return(result, &sar, proc);
                }

0 is a valid match, everything else is an error. Either I planned to make it a boolean (in which case spe_global_eval is wrong) or I messed up the check here. I'll assume the check:


Hey, the policy match info is correct, but why no payload?

        sped_door_return(result, NULL, NULL);
        /\* NOTREACHED \*/

The two sped_door_return instances are mixed up. Flip them. Okay, back to a result and a 4 byte payoff.

And proc has a bad initialization:<

        switch (command) {
        case (SPED_DS_ZPOOL):
                proc = xdr_spe_alloc_caller_t;

Should be: xdr_spe_alloc_res_t

I think we finally have good results:

Apr  3 23:14:02 pnfs-3-15 ./sped[100957]: Matching against: 200096 10 0 0 33554432 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogeyd (N/A)
Apr  3 23:14:02 pnfs-3-15 ./sped[100957]: We got a policy 64000, 16, 35865356092523570 for /nippy/bogeyd
Apr  3 23:14:02 pnfs-3-15 ./sped[100957]:    7:   0  0  0  0  0  0  0  1
Apr  3 23:14:02 pnfs-3-15 ./sped[100957]:   15:   0 7F 6B 59 F1 A3 44 32
Apr  3 23:14:02 pnfs-3-15 ./sped[100957]:   23:   0  0  0  0  0  0  0  0
Apr  3 23:14:02 pnfs-3-15 ./sped[100957]:   31:   0  0  0  0  0  0 FA  0
Apr  3 23:14:02 pnfs-3-15 ./sped[100957]:   39:   0  0  0 10  0  0  0  0
Apr  3 23:14:02 pnfs-3-15 nfssrv: WARNING: sped: 64000 16 0 35865356092523570

There are some take home points here:

  1. Try to unit test/debug your code when you write it.
    1. You'll remember shortcuts you took.
    2. You'll remember the code.
  2. Write debugging tools if necessary.
    1. If I had to step through a debugger every time, I would have gone crazy.
    2. The XDR dump routine allowed me to focus on problems.
  3. Remember, sometimes the debugging tools introduce other bugs (see the stuff with u above.
  4. You didn't see it, but I created small focused test cases when necessary.
    1. Small one-off files allow you to eliminate large chunks of code.
    2. Allowed me to use gdb, in which case I fought the code and not the debugger.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Debugging fat code

I've been working on a kernel to user space door call which encodes the data in an XDR stream. Normally, I view XDR output as it goes across the wire with snoop. As this stuff is not exposed, I ended up writing a little piece of code to dump the bits on the console. I know I had some problems when I wrote it, but I'm just going to give you the current code:

void
sped_xdr_dump(XDR \*x)
{
        int     i;
        int     j;
        char    buf[100];
        char    str[10];
        char    cbuf[4];

        j = 0;
        memset(&cbuf, '\\0', 4);

        buf[0] = '\\0';
        for (i = 0; i < x->x_handy; i++) {
                cbuf[0] = x->x_private[i];
                sprintf(str, " %X", cbuf[0]);
                strcat(buf, str);
                j++;
                if (j == 8) {
                        syslog(LOG_ERR, "%s", buf);
                        j = 0;
                        buf[0] = '\\0';
                }
        }

        if (j != 0) {
                syslog(LOG_ERR, "%s", buf);
        }
}

The crap with cbuf is where I consider this code to be fat. It is a lazy hack because I felt the bytes were running together. It is lazy because I don't need it. And it is where I should be looking for bugs. Anyway, with the following input:

> fec3fce8::print XDR 
{
    x_op = 1 (XDR_DECODE)
    x_ops = 0xfee45bd8
    x_public = 0
    x_private = 0xfec3fdd0
    x_base = 0xfec3fdd0
    x_handy = 0x30
}
> 0xfec3fdd0/48B
0xfec3fdd0:     0       0       0       2       0       0       0       0
                0       0       0       2       0       3       d       a0
                0       0       0       a       2       0       0       0
                0       0       0       a       2f      6e      69      70
                70      79      2f      68      32      62      0       0
                0       0       0       0       0       0       0       0

As shown by mdb, which I don't like, I get the following output:

Apr  3 12:04:06 pnfs-3-15 last message repeated 1 time
Apr  3 12:04:22 pnfs-3-15 sped[100920]:  0 0 0 2 0 0 0 0
Apr  3 12:04:22 pnfs-3-15 sped[100920]:  0 0 0 2 0 3 D FFFFFFA0
Apr  3 12:04:22 pnfs-3-15 sped[100920]:  0 0 0 A 2 0 0 0
Apr  3 12:04:22 pnfs-3-15 sped[100920]:  0 0 0 A 2F 6E 69 70
Apr  3 12:04:22 pnfs-3-15 sped[100920]:  70 79 2F 68 32 62 0 0
Apr  3 12:04:22 pnfs-3-15 sped[100920]:  0 0 0 0 0 0 0 0

Where is FFFFFFA0 coming from? Okay, I took the code and stuck it in a driver:

> cat y.c
#include <sys/systm.h>
#include <sys/sdt.h>
#include <rpc/types.h>
#include <rpc/svc.h>
#include <rpc/xdr.h>

#include <rpc/auth_sys.h>

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

void
sped_xdr_dump(char \*jack, int len)
{
        int     i;
        int     j;
        char    buf[100];
        char    str[10];
        char    cbuf[4];

        j = 0;
        memset(&cbuf, '\\0', 4);

        buf[0] = '\\0';
        for (i = 0; i < len; i++) {
                cbuf[0] = jack[i];
                sprintf(str, " %X", cbuf[0]);
                strcat(buf, str);
                j++;
                if (j == 8) {
                        printf("%s\\n", buf);
                        j = 0;
                        buf[0] = '\\0';
                }
        }

        if (j != 0) {
                printf("%s\\n", buf);
        }
}

int
main(int argc, char \*argv[])
{
	char	jack[] = {
            0x0,  0x0,  0x0,  0x2,  0x0,  0x0,  0x0,  0x0,
            0x0,  0x0,  0x0,  0x2,  0x0,  0x3,  0xd,  0xa0,
            0x0,  0x0,  0x0,  0xa,  0x2,  0x0,  0x0,  0x0,
            0x0,  0x0,  0x0,  0xa,  0x2f, 0x6e, 0x69, 0x70,
            0x70, 0x79, 0x2f, 0x68, 0x32, 0x62, 0x0,  0x0,
            0x0,  0x0,  0x0,  0x0,  0x0,  0x0,  0x0,  0x0

	};

	sped_xdr_dump(jack, sizeof(jack));
	return (0);
}
>  ./a.out 
 0 0 0 2 0 0 0 0
 0 0 0 2 0 3 D FFFFFFA0
 0 0 0 A 2 0 0 0
 0 0 0 A 2F 6E 69 70
 70 79 2F 68 32 62 0 0
 0 0 0 0 0 0 0 0

So, I've captured the problem. Using gdb, I can get to the heart of the issue:

Breakpoint 3, sped_xdr_dump (jack=0x8047134 "", len=48) at x.c:32
32	                cbuf[0] = jack[i];
5: str = " D\\000\\000\\000\\000\\000\\000X\\003"
4: j = 7
3: i = 15
2: jack[i] = -96 '?'
1: cbuf = "\\r\\000\\000"
(gdb) p jack[i-1]
$1 = 13 '\\r'
(gdb) q

As tempting as it is to say that I'm grabbing a bit from the previous word, it is clear that the fat code is really working. And while I could do shenanigans with bit shifting, the problem isn't in the way I am explicitly pulling out the data. Nor is the " %X" an issue.

The real issue is that I've got the wrong type of data, I'm trying to print out signed data, and 'A0' has a high order bit set.

> diff y.c y1.c	
21c21
<         char    cbuf[4];
---
>         unsigned char    cbuf[4];
> ./a.out 
 0 0 0 2 0 0 0 0
 0 0 0 2 0 3 D A0
 0 0 0 A 2 0 0 0
 0 0 0 A 2F 6E 69 70
 70 79 2F 68 32 62 0 0
 0 0 0 0 0 0 0 0

Okay, we can do better than that. We need to get rid of cbuf:

> diff y.c y2.c
21d20
<         char    cbuf[4];
24d22
<         memset(&cbuf, '\\0', 4);
28,29c26
<                 cbuf[0] = jack[i];
<                 sprintf(str, " %X", cbuf[0]);
---
>                 sprintf(str, " %X", (unsigned char)jack[i]);
> ./a.out
 0 0 0 2 0 0 0 0
 0 0 0 2 0 3 D A0
 0 0 0 A 2 0 0 0
 0 0 0 A 2F 6E 69 70
 70 79 2F 68 32 62 0 0
 0 0 0 0 0 0 0 0

And we need to clean-up the output:

> diff y2.c y3.c
26c26
<                 sprintf(str, " %X", (unsigned char)jack[i]);
---
>                 sprintf(str, " %2X", (unsigned char)jack[i]);
> ./a.out
  0  0  0  2  0  0  0  0
  0  0  0  2  0  3  D A0
  0  0  0  A  2  0  0  0
  0  0  0  A 2F 6E 69 70
 70 79 2F 68 32 62  0  0
  0  0  0  0  0  0  0  0

I might be able to make it even more either smarter or compact, but it is now throw away code. But even thow away code has to be functionally correct. When it isn't, look for areas where you are doing more work than you need to. If you remember you tried something and weren't convinced it would work, and it only worked partially, go back and question that code.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Thursday Mar 13, 2008

Using mdb to enable error injection

I've got a bug for which I do not have a reproducible test case. But I'm pretty confident I found what is going wrong. I can run regression tests to show I haven't broken anything - but those same regression tests never tripped the bug in the first place.

The fix is:

------- usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c -------

Index: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c
23c23
<  \* Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
---
>  \* Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
27c27
< #pragma ident      "@(#)nfs4_stub_vnops.c  1.3     07/10/25 SMI"
---
> #pragma ident      "%Z%%M% %I%     %E% SMI"
1751a1757,1769
>             \* Someone is already working on it. We
>             \* need to back off and let them proceed.
>             \*
>             \* We return EBUSY so that the caller knows
>             \* something is going on. Note that by that
>             \* time, the umount in the other thread
>             \* may have already occured.
>             \*/
>            if (was_locked) {
>                    return (EBUSY);
>            }
> 
>            /\*
1762,1763c1780
<            if (was_locked == FALSE &&
<                !mutex_tryenter(&net->net_tree_lock)) {
---
>            if (!mutex_tryenter(&net->net_tree_lock)) {
1814c1831
<            } else if (was_locked == FALSE) {
---
>            } else {

In English, the lock detection used to only handle when the lock was not being held.

What I want to do is force was_locked to be true at this point in both the original code (to verify I can trigger the panic at will) and also in my fix (to verify I have fixed the correct bug).

I can do that by adding the following code:

# wx diffs

------- usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c -------

122a123,124
> int   nfsv4_mm_was_locked = FALSE;
> 
1750a1753,1755
>               if (nfsv4_mm_was_locked)
>                       was_locked = TRUE;

This is a global in the nfs module which by default does not force was_locked to be set. I can use mdb to change it on the fly:

# mdb -kw
Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs mpt ip hook neti sctp arp usba fctl nca lofs cpc random zfs nfs fcip logindmux ptm sppp ]
> nfsv4_mm_was_locked::print
0
> nfsv4_mm_was_locked/W 1
nfsv4_mm_was_locked:            0               =       0x1
> nfsv4_mm_was_locked::print
0x1
> $q

Note that I do not want to add special code to check the environment, add something in /etc/default/nfs, or anything else which requires changing anything on a system. I leave it entirely in the kernel and I use mdb to control it.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily
About

tdh

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today