Saturday Aug 21, 2010

Tulsa Labs

Tulsa Labs


Originally posted on Kool Aid Served Daily
Copyright (C) 2010, Kool Aid Served Daily

Wednesday Aug 18, 2010

LinkedIn

Tom Haynes


Originally posted on Kool Aid Served Daily
Copyright (C) 2010, Kool Aid Served Daily

Sunday Mar 21, 2010

IETF 77 in Anaheim

I just attended the Newcomers Tutorial and the take home point for me was that IETF is a commune and not a police state. I.e., the IETF looks for "rough consensus" and does not "enforce".


Originally posted on Kool Aid Served Daily
Copyright (C) 2010, Kool Aid Served Daily

Wednesday Nov 11, 2009

The pain of updating a router versus a desktop

I recently updated my home desktop to Win7 - I wanted to get the 64 bit professional version and update my system memory to 8G. I managed to loose my Firefox config information - i.e., mainly my accounts and passwords. Not that big of a deal to recover from. I did have a recent backup of my bookmarks, so I didn't loose all of my favorite web comics.

Last night I updated my Linksys wireless router to an Airport Extreme and even though I had all of the config information, that was much more of a pain. Changing routers always is it seems.

I went with the Airport Extreme because I liked the Airport Express I carry around to quickly setup a wireless network when I travel. There was a Hilton Garden Inn in Austin that does not like my Mac Book Air, so the Express solved that issue.

I checked to make sure I could configure port translation (look in the NAT settings) and I assumed I could easily VPN out of it.

The earliest issues I had were all related to getting the internet connection established and understanding the Apple nomenclature. I could not figure out whether the DHCP pane was talking about the external connection or the internal one. I think the internal address is determined by the DHCP range that you offer.

The big problem was that the DSL modem and the Extreme didn't seem to want to communicate. I could ping the Extreme, but couldn't seem to see out. Note, make sure to record an external pingable IP before you change your network infrastructure. Don't count on DNS working!

I put back the Linksys, did some searching, and decided to reboot the DSL modem and then the Extreme. That did the trick. Note that with other routers you might try MAC address cloning to fix this issue. Evidently the DSL router I have is smart enough to detect the MAC address it is connected to upon reboot. But it will also only talk to that one device.

My other issues revolved with the Airport Utility software - making a simple change to the Extreme configuration requires a reboot. And then the Airport Utility software might not be able to detect the Extreme. I rotated among three different clients to make changes.

I got through all that and tried using my Express to extend the Extreme's network. Now I can't see it at all via Airport Utility. And searching the web does not make me feel comfortable about getting Airport Utility to work properly.

The last issue I hit was that my mail server started rejecting requests from my Thunderbird desktop mail client. I don't know how I had my earlier Thunderbird outgoing server set up, but the current one was set to my outside DNS name. The Linksys must have automatically short cut the packet back to my internal server. The Airport Extreme was sending the request back with the external address of my DSL router. And my sendmail config was rejecting the relay request.

The solution here was to modify my mail server account in Thunderbird to be in the internal DNS subdomain.

I'm happy enough with the Airport Extreme - like most Apple products, it takes some getting used to. BTW - the reason I replaced the Linksys was it started locking up - after one "We've lost the internet!" call on a business trip, it is just too much to deal with! :-> I wanted N, dual band, and GigE ports. We'll see if I stay happy with the Airport Extreme...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

chown(1) and username spaces under NFSv3 and NFSv4

A colleague came to me with an interesting question. They had a share on a server which allowed root access and could not get chown(1) to work via NFS. They wanted to know why our NFS server implementation denied it.

I immediately ran the test and it worked right off the bat! So now the question was what was different on the two systems? First of all, I tried it with a Linux client:

root@silent:~# mount slammer:/export/slammer /slammer/slammer/
root@silent:~# cd /slammer/slammer/
root@silent:/slammer/slammer# touch sil
root@silent:/slammer/slammer# chown thud sil
root@silent:/slammer/slammer# ls -la sil
-rw-r--r--+ 1 thud root 0 2009-11-11 12:40 sil

To eliminate that as the culprit, I tried an OpenSolaris client:

[root@silver tdh]> mount slammer:/export/slammer /slammer/slammer/
[root@silver tdh]> cd /slammer/slammer/
[root@silver slammer]> touch ver
[root@silver slammer]> chown thud ver
chown: ver: Permission denied

While it appears to be an issue between OpenSolaris and Linux, it is far from the case. The issue is at the NFS protocol layer and I can easily show it by flipping the success on each client!

Working OpenSolaris:

[root@silver slammer]> cd
[root@silver ~]> umount /slammer/slammer/
[root@silver ~]> mount -o vers=3 slammer:/export/slammer /slammer/slammer/
[root@silver ~]> cd /slammer/slammer/
[root@silver slammer]> touch ver_3
[root@silver slammer]> chown thud ver_3
[root@silver slammer]> ls -la ver_3
-rw-r--r--   1 thud     root           0 Nov 11 12:50 ver_3

Broken Linux:

root@silent:~# mount -t nfs4 slammer:/export/slammer /slammer/slammer/
root@silent:~# cd /slammer/slammer/
root@silent:/slammer/slammer# touch sil_4
root@silent:/slammer/slammer# chown thud sil_4
root@silent:/slammer/slammer# ls -la sil_4
-rw-r--r-- 1 thud 4294967294 0 2009-11-11 12:53 sil_4

But wait, you are saying it worked for Linux. Well, no, it did not. If we look at this on the server, we see:

slammer# ls -la sil\* ver\*
-rw-r--r--   1 2008     root           0 Nov 11 18:40 sil
-rw-r--r--   1 nobody   root           0 Nov 11 18:53 sil_4
-rw-r--r--   1 nobody   nobody         0 Nov 11 18:44 ver
-rw-r--r--   1 2008     root           0 Nov 11 18:50 ver_3

So while the Linux client thinks that it worked, it did not. We can confirm that by remounting with NFSv3:

root@silent:/slammer/slammer# ls -la sil\* ver\*
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:40 sil
-rw-r--r--+ 1 4294967294 root       0 2009-11-11 12:53 sil_4
-rw-r--r--+ 1 4294967294 4294967294 0 2009-11-11 12:44 ver
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:50 ver_3

And for fun, if we remount with NFSv4:

root@silent:/slammer/slammer# cd
root@silent:~# umount /slammer/slammer/
root@silent:~# mount -t nfs4 slammer:/export/slammer /slammer/slammer/
root@silent:~# cd /slammer/slammer/
root@silent:/slammer/slammer# ls -la sil\* ver\*
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:40 sil
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:53 sil_4
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:44 ver
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:50 ver_3

The first thing we normally check in such cases is the mapid domain:

slammer# cat /var/run/nfs4_domain
internal.excfb.com
[root@silver slammer]> cat /var/run/nfs4_domain
internal.excfb.com
root@silent:/slammer/slammer# cat /proc/net/rpc/nfs4.idtoname/content
#domain type id [name]

So I need to get the Linux box in the right domain! Actually, I need to modify '/etc/default/nfs-common' on this Ubuntu system and tell it to run mapid!

Much better:

root@silent:/slammer/slammer# touch sil_4_2
root@silent:/slammer/slammer# chown thud sil_4_2
chown: changing ownership of `sil_4_2': Invalid argument

Now the Linux and OpenSolaris clients are agreeing to behave the same way with the server! What is the issue here?

In NFSv4, we change the owner with a SETATTR compound op:

NFS:  Op = 34 (SETATTR)
NFS:  State ID hash = SPECIAL_0
NFS:      len = 12    val = 000000000000000000000000
NFS:      State ID Sequence ID = 00000000
NFS:    0x00
NFS:    0x00
NFS:    0x00
NFS:    0x00
NFS:    0x10   OWNER
NFS:    0x00
NFS:    0x00
NFS:    0x00
NFS:  Owner = thud@internal.excfb.com

And in NFSv3, we try with this:

NFS:  Proc = 2 (Set file attributes)
NFS:  File handle = [86D5]
NFS:   A76A323408831F040A001500000000003F0C00000A000300000000002D000000
NFS:  Mode = (not set)
NFS:  User ID = 2008
NFS:  Group ID = (not set)
NFS:  Size = (not set)
NFS:  Access time = (do not set)
NFS:  Modification time = (do not set)
NFS:

So in the NFSv4 case, we need to be able to map 'thud@internal.excfb.com' to an user id. If that mapping does not exist, then we disallow the action. In NFSv3, we already have the user id, and we don't care if it is undefined on the server:

slammer# grep thud /etc/passwd
slammer# ls -la sil\* ver\*
-rw-r--r--   1 2008     root           0 Nov 11 18:40 sil
-rw-r--r--   1 nobody   root           0 Nov 11 18:53 sil_4
-rw-r--r--   1 root     root           0 Nov 11 19:18 sil_4_2
-rw-r--r--   1 nobody   nobody         0 Nov 11 18:44 ver
-rw-r--r--   1 2008     root           0 Nov 11 18:50 ver_3

If we add the user id of thud, we see:

slammer# useradd -u 2009 -g 10 -d /export/slammer thud
slammer# ls -la sil\* ver\*
-rw-r--r--   1 2008     root           0 Nov 11 18:40 sil
-rw-r--r--   1 nobody   root           0 Nov 11 18:53 sil_4
-rw-r--r--   1 root     root           0 Nov 11 19:18 sil_4_2
-rw-r--r--   1 nobody   nobody         0 Nov 11 18:44 ver
-rw-r--r--   1 2008     root           0 Nov 11 18:50 ver_3

Yes, I did an off by one error on purpose! What happens if we try to chown(1) via NFSv4 now:

root@silent:/slammer/slammer# chown thud base
root@silent:/slammer/slammer# ls -la base sil\* ver\*
-rw-r--r-- 1 thud   root    0 2009-11-11 13:29 base
-rw-r--r-- 1 nobody root    0 2009-11-11 12:40 sil
-rw-r--r-- 1 nobody root    0 2009-11-11 12:53 sil_4
-rw-r--r-- 1 root   root    0 2009-11-11 13:18 sil_4_2
-rw-r--r-- 1 nobody nogroup 0 2009-11-11 12:44 ver
-rw-r--r-- 1 nobody root    0 2009-11-11 12:50 ver_3

Those 'nobody's should be 'thud' and that 'thud' should not map! What does NFSv3 say:

root@silent:/slammer/slammer# ls -la base sil\* ver\*
-rw-r--r--+ 1       2009 root       0 2009-11-11 13:29 base
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:40 sil
-rw-r--r--+ 1 4294967294 root       0 2009-11-11 12:53 sil_4
-rw-r--r--+ 1 root       root       0 2009-11-11 13:18 sil_4_2
-rw-r--r--+ 1 4294967294 4294967294 0 2009-11-11 12:44 ver
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:50 ver_3

The NFSv4 listing shows you the username space of the server and the NFSv3 listing shows you the username space of the client.

The big difference is that NFSv4 enforces that both systems understand the name. It doesn't mandate that the names have the same userid, just that they have a mapping.

To get everything in sync, to have both NFSv3 and NFSv4 work the same, use either NIS or LDAP to maintain your user mappings.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Friday Oct 23, 2009

Using the 7000 BUI to document script generation

I'm old skool enough that I prefer a CLI versus a GUI for administration tasks. I've got that disclaimer there because I happen to like web browsers, GIMP, and even WireShark at times.

Given a choice between being able to ssh (or even telnet) in to a box versus going to a web based interface, I'll almost always chose the ASCII interface. The exceptions would be when I don't have a manual and need to do something fast. If I'm not in a hurry, I'm willing to learn by trial and error.

I've been running some tests on a 7410 and I need to be able to restart NFS. I've been using the BUI because I really find it intuitive. For example, I needed to bring up another interface on a backend network and it was very simple to go to Configuration -> Network and just click my way through it. I also liked that my routing tables were set up correctly. I had just fought that in a similar situation on a v20z running Fedora 11.

I like being able to click on the Device and getting the relevant datalinks and interfaces. I like the apply changes implementation. I like it checks for uncommited transactions if I try to leave the current entry.

But what I really like is a very understated correlation between the BUI and CLI - the BUI documents scripting actions you can do in the CLI. Likewise, if you understand the CLI commands, you can quickly navigate to a BUI counterpart.

So my goal was to be able to restart NFS from the CLI. Rather than picking up a manual, I looked at where I was at in the BUI, which was Configuration -> Services. And to restart it, I selected the circular arrows to the left of the NFS Data Service.

This was 99% of the equivalent command:

washdc:> configuration services nfs 
abort    commit   done     get      script   show     
assert   disable  enable   help     set      

I was looking for a "restart" option and I realized that the button must be a combination of "disable" and "enable". Sure enough, this:

washdc:> configuration services nfs disable
washdc:> configuration services nfs enable

was functionally equivalent.

I find that every time I interact with either the BUI or CLI, I find something else to remind me of the tight integration and also to appreciate them both being designed at the same time.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Friday Oct 02, 2009

args[ ] may not be referenced because probe description matches an unstable set of probes

I need to be able to see if a mount request generated an error in the mountd daemon. I have a custom kernel that has changed the mount() function call to return the error code. So, now I'm using a very simple DTrace script to catch the error codes:

#!/usr/sbin/dtrace -s

pid$1::mount:return
{
        printf("rc = %d", args[1]);
}

pid$1::audit_mountd_mount:return
{
        printf("rc = %d", args[1]);
}

The issue is what happens when I try to run it:

[root@pnfs-4-11 ~]> ./mountd_res.d `pgrep -x mountd`
dtrace: failed to compile script ./mountd_res.d: line 18: args[ ] may not be referenced because probe description pid100824::mount:return matches an unstable set of probes

What I think this means is that I've got multiple declarations of mount() and they all do not return something. Okay, I can narrow down the probe to just the one I want:

[root@pnfs-4-11 ~]> dtrace -l -f mount
   ID   PROVIDER            MODULE                          FUNCTION NAME
 2378 lx-syscall                                               mount entry
 2379 lx-syscall                                               mount return
13708        fbt           genunix                             mount entry
13709        fbt           genunix                             mount return
31190    syscall                                               mount entry
31191    syscall                                               mount return
65944  pid100824            mountd                             mount entry
65945  pid100824         libc.so.1                             mount entry
65946  pid100824            mountd                             mount return
65947  pid100824         libc.so.1                             mount return

And if I adjust my script:

pid$1:mountd:mount:return
{
        printf("rc = %d", args[1]);
}

We see I've fixed this issue!

[root@pnfs-4-11 ~]> ./mountd_res.d `pgrep -x mountd`
dtrace: failed to compile script ./mountd_res.d: line 18: index 1 is out of range for pid100824:mountd:mount:return args[ ]

Okay, I got my syntax wrong for the return code:

pid$1:mountd:mount:return
{
        printf("rc = %d", arg1);
}

And now I see the correct output:

[root@pnfs-4-11 ~]> ./mountd_res.d `pgrep -x mountd`
dtrace: script './mountd_res.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  65946                     mount:return rc = 0
  0  65240        audit_mountd_mount:return rc = 1
  0  65946                     mount:return rc = 0
  0  65240        audit_mountd_mount:return rc = 1


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Sunday Aug 30, 2009

Investigating mountd

We've had some interesting question on nfs-discuss lately about mountd and I thought I'd use an userland script I wrote, snoop, and DTrace to show some interesting properties about mountd.

My perl script will send UDP mount requests to a server and spoof the client IP. I want to control what I send and I'll sometimes spoof a request from a non-existent machine.

BTW - you will notice I don't talk about what the share is that much, unless I state otherwise, it is:

[root@silver ~]> share | grep tdh
-@tank/home     /export/zfs/tdh   rw   ""  

Need IP to name mappings

We can try a host without a name mapping:

[tdh@slayer ~/src]> host 10.10.20.41
Host 41.20.10.10.in-addr.arpa. not found: 3(NXDOMAIN)
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 10.10.20.42

Note that since we don't have a client to receive the reply, we'll snoop it:

 30   0.01297  10.10.20.42 -> silver       MOUNT3 C Mount /export/zfs/tdh
 31   0.01712       silver -> thens.internal.excfb.com DNS C 42.20.10.10.in-addr.arpa. Internet PTR ?
 32   0.05912 thens.internal.excfb.com -> silver       DNS R  Error: 3(Name Error)
 33   0.00522       silver -> thens.internal.excfb.com DNS C 42.20.10.10.in-addr.arpa. Internet PTR ?
 34   0.00056 thens.internal.excfb.com -> silver       DNS R  Error: 3(Name Error)
 35   0.00777       silver -> 10.10.20.42  MOUNT3 R Mount Permission denied

BTW - right off the bat we can see that mountd tries to resolve the client IP. What happens if there is a reverse entry?

[tdh@slayer ~/src]> host 192.168.4.14
14.4.168.192.in-addr.arpa domain name pointer blast-4-14.internal.excfb.com.
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.14

Well?

 37   0.03198 blast-4-14.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 38   0.00089       silver -> thens.internal.excfb.com DNS C 14.4.168.192.in-addr.arpa. Internet PTR ?
 39   0.00051 thens.internal.excfb.com -> silver       DNS R 14.4.168.192.in-addr.arpa. Internet PTR blast-4-14.internal.excfb.com.
 40   0.00290       silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix

So a client has to have a reverse mapping from IP to host name before we allow a mount to succeed. And we can see that in the source code for usr/src/cmd/fs.d/nfs/mountd/mountd.c:

    875 	getclientsnames(transp, &nb, &clnames);
    876 	if (clnames == NULL || nb == NULL) {
    877 		/\*
    878 		 \* We failed to get a name for the client, even 'anon',
    879 		 \* probably because we ran out of memory. In this situation
    880 		 \* it doesn't make sense to allow the mount to succeed.
    881 		 \*/
    882 		error = EACCES;
    883 		goto reply;
    884 	}

Need to match exactly on hostname

What if I change the share to be:

[root@silver ~]> zfs set sharenfs=rw=blast4-14:blast4-15 tank/home/tdh

Will it work or fail?

[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.14
[tdh@slayer ~/src]> sudo ./udp_raw.pl src_addr 192.168.4.15

Note that I wanted to show something warm in the cache and something cold:

 24   0.03500 blast-4-14.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 25   0.00129       silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount Permission denied
 41   0.03706 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 42   0.01419       silver -> thens.internal.excfb.com DNS C 15.4.168.192.in-addr.arpa. Internet PTR ?
 43   0.00048 thens.internal.excfb.com -> silver       DNS R 15.4.168.192.in-addr.arpa. Internet PTR blast-4-15.internal.excfb.com.
 44   0.00089       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied

Two points here, the DNS cache is not flushed when a share reloads, but the nfs auth cache must be. If it were not flushed, we would have gotten permission granted.

Okay, I can show you what is going on by this example:

[root@silver ~]> zfs set sharenfs=rw=blast4-14.internal.excfb.com:blast4-15 tank/home/tdh

And;

 24   0.03769 blast-4-14.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 25   0.00122       silver -> blast-4-14.internal.excfb.com MOUNT3 R Mount Permission denied
 41   0.03115 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 42   0.00092       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied

Okay, we should have gotten permission granted for the first!

[root@silver ~]> zfs set sharenfs=rw=blast4-17.internal.excfb.com:blast4-15 tank/home/tdh

And that fails as well:

 20   0.03105 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 21   0.00086       silver -> thens.internal.excfb.com DNS C 17.4.168.192.in-addr.arpa. Internet PTR ?
 22   0.00056 thens.internal.excfb.com -> silver       DNS R 17.4.168.192.in-addr.arpa. Internet PTR blast-4-17.internal.excfb.com.
 23   0.00163       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount Permission denied

Ohhh! Smug mode off

Okay, I'm interested in what the function in_access_list will tell me. I've got a small DTrace script, which I iterated over the development off, which will let me know what is going on here:


!/usr/sbin/dtrace -Fs

/\*
 \* Thanks to Peter Harvey;
 \* http://blogs.sun.com/peteh/entry/dereferencing_user_space_pointers_in
 \*
 \*      # ./mountd.d `pgrep -x mountd`
 \*
 \*/

dtrace:::BEGIN
{
        printf("Sampling... Hit Ctrl-C to end.\\n");
}

pid$1::check_client_new:return
{
        printf("Access permission is %d", arg1);
}

pid$1::in_access_list:entry
{
        self->trace_me = 1;
        printf("Access list is %s", copyinstr(arg2));
}

pid$1::in_access_list:return
{
        self->trace_me = 0;
        printf("Access permission is %d", arg1);
}

pid$1::strcasecmp:entry
/self->trace_me == 1/
{
        printf("host vs list entry: |%s|, |%s|\\n", copyinstr(arg0),
                copyinstr(arg1));
}

pid$1::strcasecmp:entry
/self->trace_me == 1/
{
        printf("Comparison is %d", arg1);
}

Note that I need to use strcasecmp because I can't iterate over the array. I get as a result:

[root@silver ~]> ./mount_trace.sh
+ pgrep -x mountd 
+ /root/mountd.d 634 
dtrace: script '/root/mountd.d' matched 6 probes
CPU FUNCTION                                 
  0 | :BEGIN                                  Sampling... Hit Ctrl-C to end.

  0  -> in_access_list                        Access list is blast4-17.internal.excfb.com:blast4-15
  0    -> strcasecmp                          host vs list entry: |blast4-17.internal.excfb.com|, |blast-4-17.internal.excfb.com|

  0     | strcasecmp:entry                    Comparison is 135156944
  0     | strcasecmp:entry                    host vs list entry: |blast4-15|, |blast-4-17.internal.excfb.com|

  0     | strcasecmp:entry                    Comparison is 135156944
  0    <- in_access_list                      Access permission is 0
\^C

D'oh! My share is wrong, wrong I say!

If instead I try:

[root@silver ~]> zfs set sharenfs=rw=blast-4-17.internal.excfb.com:blast-4-15 tank/home/tdh

We expect blast-4-15 to fail and blast-4-17 to succeed see:

 28   0.03211 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 29   0.00150       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied
 49   0.03460 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 50   0.00087       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix

Which shows again, that you need an exact match and we don't append the domain name to end of the hostname. What would happen if added this to end of the server's /etc/hosts?

192.168.4.14    blast-4-15

I expect it to work. Does it?

[root@silver ~]> grep MOUNT xxx 
 24   0.03176 blast-4-15.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 27   0.00124       silver -> blast-4-15.internal.excfb.com MOUNT3 R Mount Permission denied
 49   0.00160 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 52   0.00098       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix

Hmm, I bet the name entry is cached, which I can show by trying one which is not cached:

[root@silver ~]> zfs set sharenfs=rw=blast-4-17.internal.excfb.com:blast-4-31 tank/home/tdh

And

[root@silver ~]> grep MOUNT xxx
 24   0.03633 blast-4-17.internal.excfb.com -> silver       MOUNT3 C Mount /export/zfs/tdh
 27   0.00102       silver -> blast-4-17.internal.excfb.com MOUNT3 R Mount OK FH=01CC Auth=unix
 45   0.02997   blast-4-31 -> silver       MOUNT3 C Mount /export/zfs/tdh (retransmit)
 46   0.00189       silver -> blast-4-31   MOUNT3 R Mount OK FH=01CC Auth=unix

And we can see the lack of caching because of two clues, the name output in snoop, i.e., "blast-4-31", and the lack of DNS activity between the two packets.

Summary

Some of the behaviour shocked me and I made some stupid mistakes that were hard to figure out what I had done. As an exercise in triage, it was great. I now have the beginnings of a DTrace debugging tool that I can point people at if they need some help. I'm very, very happy about that part!


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Question: Why isn't a policy working?

One of the hardest things about developing software is testing it. As a developer, you get too close to the code and what you expect it to do. So I was very happy to get a question in email about why was a set of policies not working correctly:

I tried to reset the SPE policies to the following:

# cat /etc/policies.spe
101, 1, 8k, DS2, path == /diskpool/JUNK/TEST/P1
102, 2, 4k, DS1, path == /diskpool/JUNK/TEST/P2
103, 3, 2k, mix3, path == /diskpool/JUNK/TEST/P3
104, 4, 1k, mix4, path == /diskpool/JUNK/TEST/P4
#105, 5, 128k, default, path == /diskpool/JUNK/TEST/P5 || path == /export/JUNK
111, 5, 32k, default, path == /diskpool/JUNK/TEST/P5
112, 5, 64k, default, path == /export/JUNK

and have the client uses the path to verify the stripe.  For stripe 1,2,3
things seem working OK; however when I tried with strip 4 or 5 with
the policy above, I got "nfsstat -l" to list stripe 10 and count 32k:

pnfs-minipit1-3:~#
pnfs-minipit1-3:~# /mntpit1-3: mount -o vers=4 pnfs-minipit1-5:/diskpool/JUNK/TEST/P5 /mnt
pnfs-minipit1-3:~# nfsstat -m /mnt
/mnt from pnfs-minipit1-5:/diskpool/JUNK/TEST/P5
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

pnfs-minipit1-3:~# mkfile -v 1m /mnt/kk
/mnt/kk 1048576 bytes
pnfs-minipit1-3:~# nfsstat -l /mnt/kk
Number of layouts: 1
Proxy I/O count: 0
DS I/O count: 33
Layout [0]:
        Layout obtained at: Sun Aug 30 00:23:47:199247 2009
        status: UNKNOWN, iomode: LAYOUTIOMODE_RW
        offset: 0, length: EOF
        num stripes: 10, stripe unit: 32768
        Stripe [0]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [1]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [2]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [3]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [4]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [5]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [6]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [7]:
                tcp:pnfs-minipit1-7.Central.Sun.COM:10.1.235.56:4920 OK
        Stripe [8]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
        Stripe [9]:
                tcp:pnfs-minipit1-6.Central.Sun.COM:10.1.232.125:4920 OK
pnfs-minipit1-3:~#

What did I do wrong?

Before we even look at the issue, what did the person asking for help do right

  1. Described the problem.
    1. What did they expect to see?
    2. What did they actually see?
  2. Showed me how to reproduce the problem.

In short, I'm confident that the person took all reasonable steps to solve the problem on their own.

So, now let's try to figure out what is going on. I have a DTrace script which will help out here. Ouch, I should have shared that earlier. Anyway, spe.d should help. I'll start it up and monitor the results as I do a file create:

[root@pnfs-minipit1-5 ~]> ./spe.d
dtrace: script './spe.d' matched 8 probes
CPU     ID                    FUNCTION:NAME
  1   3420 nfs41_spe_allocate:spe-i-check_open 200096 (10) from 10.1.233.191 is checking /diskpool/JUNK/TEST/P5/foollu2
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 101 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 102 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 103 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 104 did not match with error 0 from 10.1.233.191
  1   3419 nfs41_spe_allocate:spe-i-policy_eval Policy 111 matched with error 0 from 10.1.233.191
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[0] = pnfs-minipit1-6:pNFSpool1/p1DS2
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[1] = pnfs-minipit1-6:pNFSpool2/p2DS2
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[2] = pnfs-minipit1-6:pNFSpool3/p3DS2
  1  63564    mds_ds_path_to_mds_sid:return mds_sid[3] = pnfs-minipit1-7:pNFSpool1/p1DS1
  1  63564    mds_ds_path_to_mds_sid:return ERROR - could not find a matching pgi for pnfs-minipit1-7:pNFSpool1/p2DS1
  1  56851        nfs41_spe_allocate:return No matching policy

So we did find a matching policy, but we only found 5 datasets and the policy requires 5. So, why isn't that dataset in memory?

[root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_data|::print ds_guid_info_t ds_dataset_name.utf8string_val|::eval /s" | mdb -k
0xffffff01e73e5fc0:             pnfs-minipit1-6:pNFSpool3/p3DS2�����f
0xffffff01e7227078:             pnfs-minipit1-6:pNFSpool3/p3DS1�����f
0xffffff01e7227120:             pnfs-minipit1-6:pNFSpool2/p2DS2�����f
0xffffff01e72271c8:             pnfs-minipit1-6:pNFSpool2/p2DS1�����f
0xffffff01e7227270:             pnfs-minipit1-6:pNFSpool1/p1DS2�����f
0xffffff01e7227318:             pnfs-minipit1-6:pNFSpool1/p1DS1�����f
0xffffff01e5d62e70:             pnfs-minipit1-7:pNFSpool2/p2DS2�����f
0xffffff01e6597628:             pnfs-minipit1-7:pNFSpool2/p2DS1�����f
0xffffff01e4e40468:             pnfs-minipit1-7:pNFSpool1/p1DS2�����f
0xffffff01e5f84cb8:             pnfs-minipit1-7:pNFSpool1/p1DS1�����f
[root@pnfs-minipit1-5 ~]> echo "::walk mds_DS_guid_entry_cache|::print struct rfs4_dbe dbe_invalid" | mdb -k
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0
dbe_invalid = 0 

Hmm, I need to figure out a better way to process UTF8 stings in mdb, but it appears to be there and in a valid entry. I couldn't figure this out until I lined up the dataset name output:

pnfs-minipit1-7:pNFSpool2/p2DS1
pnfs-minipit1-7:pNFSpool1/p2DS1

The naming convention shows the error, 'p2DS1' should be on pool #2, but it is on pool #1: pNFSpool1.

If this were implemented in SMF instead of a as a flat file, we would be able to do checking when we create the policy to catch this type of configuration error. But for now, DTrace and mdb can catch it.

For me, getting mdb to dump the dataset names was the hardest part. I had to experiment and search the web. I'm glad for all of the effort here, I'll use that trick again and again.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Wednesday Aug 26, 2009

Beating my head against named on Fedora!

I added 3 new subdomains to my home network for testing. I added the records to my chroot'ed named at /var/named/chroot/var/named/named.conf. I just did reverse pointers and I couldn't get it to work:

[root@adept var]# host 192.168.4.120
Host 120.4.168.192.in-addr.arpa. not found: 3(NXDOMAIN)

I did this with a simple Perl script, so I debugged the heck out of it and checked for tabs galore. I finally added forward lookups, which worked:

[root@adept var]# host blast-4-120
blast-4-120.internal.excfb.com has address 192.168.4.120

Heck, I've been burnt by a bad link in /etc before, so I checked it:

[root@adept var]# ls -al /etc/named.conf
lrwxrwxrwx 1 root named 21 2008-02-25 16:15 /etc/named.conf -> /var/named/named.conf
[root@adept var]# ls -la /var/named/named.conf
lrwxrwxrwx 1 root named 38 2008-02-25 16:24 /var/named/named.conf -> /var/named/chroot/var/named/named.conf

I even diff'ed them to be really, really sure. I ran named manually with '-g', fixed the warnings I got and then found out it didn't handle the chroot nicely. I looked at the init file and gave up on understanding it.

I couldn't find a log file for it, so I sent a SIGHUP to look for a database dump. I added logging to the config file and never saw any output. I never found that database dump.

But I did find an option that said where it should be:

       dump-file "/var/named/data/cache_dump.db";

I then asked myself, is there another copy of the config file?

[root@adept var]# ps -ef | grep named
named     4047     1  0 00:30 ?        00:00:00 /usr/sbin/named -u named -t /var/named/chroot
root      4207  2886  0 01:00 pts/3    00:00:00 grep named
[root@adept var]# cd /var
[root@adept var]# find . -name named.conf
./named/named.conf
./named/chroot/etc/named.conf
./named/chroot/var/named/named.conf
[root@adept var]# ls -la ./named/named.conf
lrwxrwxrwx 1 root named 38 2008-02-25 16:24 ./named/named.conf -> /var/named/chroot/var/named/named.conf
[root@adept var]# ls -la ./named/chroot/etc/named.conf
-rw-r--r-- 1 root named 2741 2008-02-25 20:49 ./named/chroot/etc/named.conf

Why yes, yes there is and it doesn't have my new zones!

[root@adept etc]# pwd
/var/named/chroot/etc
[root@adept etc]# mv named.conf named.conf.fracked
[root@adept etc]# ln -s ../var/named/named.conf .
[root@adept etc]# ls -la ../var/named/named.conf
-rw-r----- 1 root named 4920 2009-08-27 00:13 ../var/named/named.conf
[root@adept etc]# service named restart
Stopping named:                                            [  OK  ]
Starting named:                                            [  OK  ]
[root@adept etc]#
[root@adept etc]# host 192.168.4.120
120.4.168.192.in-addr.arpa domain name pointer blast-4-120.internal.excfb.com.

Now what was I doing before I fell down this rat hole?


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Sunday Aug 16, 2009

Removes are failing, time to debug

Removes are failing and I don't think this is the classic issue we have had with this operation. I know that at the last BAT, the removes were succeeding enough to cause a panic. And we don't seem to be getting to that code even.

Snoop shows the client sending the remove, but the MDS is not sending anything to the DS. Barring the MDS sending to the wrong DS, I've put together a simple DTrace script to see what is going on:

[root@pnfs-17-24 ~]> more remove.d
#!/usr/sbin/dtrace -s

::do_ctl_mds_remove:return
{
        printf("rc = %d", args[1]);
}

::ctl_mds_clnt_remove_file:return
{
        printf("rc = %d", args[1]);
}

:::nfss-e-vop_fid_pseudo_failed
{
}

:::nfss-i-layout_is_null_cannot_remove
{
}

::mds_op_remove:entry
{
        self->resop = (nfs_resop4 \*)arg1;
        self->resp = self->resop->nfs_resop4_u.opremove;
}

::mds_op_remove:return
{

        printf("rc = %d", self->resp.status);
}

I don't like the hoops I have to go through to get the status on mds_op_remove, but I have to do that because that function is of type void. I'm going to change them to return a status. It doesn't cost much and makes simple DTrace scripting easier.

Anyway, here is our result:

[root@pnfs-17-24 ~]> ./remove.d
dtrace: script './remove.d' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0  30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 
  0  62050         do_ctl_mds_remove:return rc = 28
  0  62052             mds_op_remove:return rc = 0
\^C

What is that 28?

[thud@ultralord nfssrv]> grep 28 /usr/include/sys/errno.h 
#define	ENOSPC	28	/\* No space left on device		\*/

Now I want to know what vop_fid_pseudo() is returning, but only when called in this context. I can use the thread properties like this:

::vop_fid_pseudo:entry
/self->live == 1/
{
        self->vp = (vnode_t \*)arg0;
        self->fidp = (fid_t \*)arg1;
}

::vop_fid_pseudo:return
/self->live == 1/
{
        printf("fid len = %d\\trc = %d",
            self->fidp->un._fid.len, args[1]);
}

Only if self->live is set will I see output:

[root@pnfs-17-24 ~]> ./remove.d
dtrace: script './remove.d' matched 8 probes
CPU     ID                    FUNCTION:NAME
  0  62468            vop_fid_pseudo:return fid len = 10        rc = 28
  0  30503 do_ctl_mds_remove:nfss-e-vop_fid_pseudo_failed 
  0  62050         do_ctl_mds_remove:return rc = 28
  0  62052             mds_op_remove:return rc = 0
\^C

And now I need to know where that error is set:

  49 vop_fid_pseudo(vnode_t \*vp, fid_t \*fidp)
  50 {
  51         struct vattr va;
  52         int error;
  53 
  54         error = VOP_FID(vp, fidp, NULL);
  55 
  56         /\*
  57          \* XXX nfs4_fid() does nothing and returns EREMOTE.
  58          \* XXX nfs3_fid()/nfs_fid() returns nfs filehandle as its fid
  59          \* which has a bigger length than local fid.
  60          \* NFS_FH4MAXDATA is the size of
  61          \* fhandle4_t.fh_xdata[NFS_FH4MAXDATA].
  62          \*
  63          \* Note: nfs[2,3,4]_fid() only gets called for diskless clients.
  64          \*/
  65         if (error == EREMOTE ||
  66             (error == 0 && fidp->fid_len > NFS_FH4MAXDATA)) {
  67 
  68                 va.va_mask = AT_NODEID; 
  69                 error = VOP_GETATTR(vp, &va, 0, CRED(), NULL);
  70                 if (error)
  71                         return (error);
  72 
  73                 fidp->fid_len = sizeof (va.va_nodeid);
  74                 bcopy(&va.va_nodeid, fidp->fid_data, fidp->fid_len);
  75                 return (0);
  76         }
  77 
  78         return (error);
  79 }

And that means I need to know what VOP_GETATTR() is returning. I'll pull the same trick as before, but I'll need to be looking at, well, I have to figure that out now, don't I?

The trick is to let the vnode tell me:

int
fop_getattr(
        vnode_t \*vp,
...
        err = (\*(vp)->v_op->vop_getattr)(vp, vap, flags, cr, ct);

So I need to find the pointer to the function:

        printf("vop_getattr = %p", self->vp->v_op->vop_getattr);

And then I need to find out what that is:

  0  62467             vop_fid_pseudo:entry vop_getattr = fffffffff883c180
...
[root@pnfs-17-24 ~]> mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs mpt sd sockfs ip hook neti sctp arp usba stmf fctl nca lofs idm cpc random zfs nfs fcip logindmux ptm sppp ]
> fffffffff883c180::dis
zfs_getattr:                    pushq  %rbp

Okay, we at least know which filesystem is complaining. I probably could have done this differently, but this way works. The only error in that function comes from zfs_zaccess():

  0  53373               zfs_zaccess:return rc = 0
  0  53125               zfs_getattr:return rc = 0
  0  53125               zfs_getattr:return rc = 0

Which means I was reading the code incorrectly! Note that these are being called by the code, my check forces that to be true, but the error must be elsewhere. It must be the VOP_FID() which is causing the issue!

4323 static int
4324 zfs_fid(vnode_t \*vp, fid_t \*fidp, caller_context_t \*ct)
4325 {
4326         znode_t         \*zp = VTOZ(vp);
4327         zfsvfs_t        \*zfsvfs = zp->z_zfsvfs;
4328         uint32_t        gen;
4329         uint64_t        object = zp->z_id;
4330         zfid_short_t    \*zfid;
4331         int             size, i;
4332 
4333         ZFS_ENTER(zfsvfs);
4334         ZFS_VERIFY_ZP(zp);
4335         gen = (uint32_t)zp->z_gen;
4336 
4337         size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
4338         if (fidp->fid_len < size) {
4339                 fidp->fid_len = size;
4340                 ZFS_EXIT(zfsvfs);
4341                 return (ENOSPC);
4342         }

Some quick DTrace will tell me I'm correct:

::zfs_fid:entry
/self->live == 1/
{
        zp = (znode_t \*)(self->vp)->v_data;
        zfsvfs = zp->z_zfsvfs;

        lfid = sizeof (zfid_long_t) - sizeof (uint16_t);
        sfid = sizeof (zfid_short_t) - sizeof (uint16_t);

        size = (zfsvfs->z_parent != zfsvfs) ? lfid : sfid;

        printf("fidlen %s size! (%d, %d, %d, %d)",
            self->fidp->un._fid.len < size ? "<" : "not <",
            self->fidp->un._fid.len, size, lfid, sfid);
}
...
  0  53162                    zfs_fid:entry fidlen < size! (0, 10, 20, 10)
  0  53163                   zfs_fid:return rc = 28

So is this new? Or something I introduced? I don't see how the code ever worked here. We pass a fid_t off the stack, i.e., uninitialized, into vop_fid_pseudo() and when we call zfs_fid(), that check should trigger every time. The comments on lines 57-61 lead me to think that vop_fid_pseudo() is not intended for types other than NFS. In any event, the other callers of it do:

makefh4(nfs_fh4 \*fh, vnode_t \*vp, struct exportinfo \*exi)
...
        fid_t fid;
...
        bzero(&fid, sizeof (fid));
        fid.fid_len = MAXFIDSZ;

And the file should have been removed!

  0  67453                   zfs_fid:return rc = 0
  0  81693             vop_fid_pseudo:entry vop_getattr = fffffffff8088180
  0  67452                    zfs_fid:entry fidlen not < size! (64, 10, 20, 10)
  0  67453                   zfs_fid:return rc = 0
  0  81694            vop_fid_pseudo:return fid len = 10        rc = 0

From the snoop we can both see the MDS communicating with the DS and that I need to go teach snoop about a remove!

CTL-MDS:  ----- Sun CTL-MDS -----
CTL-MDS:
CTL-MDS:  Proc = 10 (Remove object(s) or entire fsid at the DS)
CTL-MDS:

While I fixed a valid bug (and we are past the code that was panicking at the BAT, I doubt I fixed the problem we had been seeing:

Before a huge copy:

[root@pnfs-17-22 ~]> zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
pnfs1                       9.35G  6.28G    23K  /pnfs1

After a huge copy:

[root@pnfs-17-22 ~]> zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
pnfs1                       12.5G  3.09G    23K  /pnfs1

And after the delete:

[root@pnfs-17-22 ~]> zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
pnfs1                       12.5G  3.09G    23K  /pnfs1

Note, the DS did report that it tried to delete the file:

[root@pnfs-17-22 ~]> ./remove.d 
dtrace: script './remove.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  0  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  1  62434        ctl_mds_srv_remove:return status = 0
  0  62434        ctl_mds_srv_remove:return status = 0

Amazingly enough, I had forgotten I was running that DTrace script. So the issue is that ctl_mds_srv_remove() reports success, but the file contents are still there.

I'm starting to wonder if I haven't hit my head on a known ZFS bug that is fixed in 118. The pNFS code is based on 117. I'll investigate.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Wednesday Aug 12, 2009

Regression testing saves my bacon

So, my unit testing is all done and it is time to integrate. Bzzt! Wrong answer - we have to do regression testing. In our case, it is called pNFS/miniPIT and was mainly crafted by Helen Chao. And it was bombing on my new code.

First of all, the core was confusing:

[root@pnfs-minipit2-4 ~]> 
panic[cpu0]/thread=30008a61560: 000000047ad
BAD TRAP: type=31 rp=2a100f596d0 addr=14 mmu_fsr=0 occurred in module "nfssrv" due to a NULL pointer dereference

nfsd: trap type = 0x31
addr=0x14
pid=100528, pc=0x7afb3d64, sp=0x2a100f58f71, tstate=0x4414001604, context=0x29b
g1-g7: 7be42af4, 2, 2000, 0, 0, 29b, 30008a61560

The NULL pointer wasn't anything to do with the function being called. Secondly, I focused on trying to reproduce it inside the test harness - it wasn't until I made a simple test case that I progressed.

Third, I focused on the server being 32 bit instead of the real difference - it was not setup as a MDS - therefore it had no DSes reporting to it. I'll show why that is important:

4463         error = nfs41_spe_allocate(&spe_va, claddr,
4464             vp->v_path, &lc, TRUE);
4465         if (error) {
4466                 /\*
4467                  \* XXX: Until we get the SMF code
4468                  \* in place, we handle all errors by
4469                  \* using the default layout of the
4470                  \* old prototype code
4471                  \*
4472                  \* At that point, we should return the
4473                  \* given error.
4474                  \*
4475                  \* XXX: Any way \*plo is NULL here?
4476                  \*/
4477                 \*plo = mds_gen_default_layout(cs->instp);
4478 
4479                 /\*
4480                  \* Record the layout, don't get bent out of shape
4481                  \* if it fails, we'll try again at checkstate time.
4482                  \*/
4483                 (void) mds_put_layout(\*plo, vp);
4484 
4485                 return (NFS4_OK);
4486         }
4487 
4488         /\*
4489          \* XXX: Any way \*plo is NULL here?
4490          \*/
4491         \*plo = mds_add_layout(&lc);

The fourth, and major problem, is that I had valid comments and I ignored them. Yes, it was certainly valid for the \*plo to be NULL after 4477 but probably not 4491. But they are both easy enough to code for!

The issue is that a NFSv4.1 server, i.e., no pNFS, also goes through this code path. We need to handle not having a layout:

4474                  \* At that point, we should return the
4475                  \* given error.
4476                  \*/
4477                 \*plo = mds_gen_default_layout(cs->instp);
4478                 if (\*plo == NULL) {
4479                         status = NFS4ERR_LAYOUTUNAVAILABLE;
4480                 } else {
4481                         /\*
4482                          \* Record the layout, don't get
4483                          \* bent out of shape if it fails,
4484                          \* we'll try again at checkstate time.
4485                          \*/
4486                         (void) mds_put_layout(\*plo, vp);
4487                 }
4488 
4489                 return (status);

Which appears okay, except with just this change, files will not be created. We want to return NFS4ERR_LAYOUTUNAVAILABLE, mainly for DTrace probing, but in the caller, we want to do:

4892         } else {
4893                 status = mds_createfile_get_layout(req, vp, cs, &ct, plo);
4894 
4895                 /\*
4896                  \* Allow mds_createfile_get_layout() to be verbose
4897                  \* in what it presents as a status, but be aware
4898                  \* that it is permissible to not generate a
4899                  \* layout.
4900                  \*/
4901                 if (status == NFS4ERR_LAYOUTUNAVAILABLE) {
4902                         status = NFS4_OK;
4903                 }
4904         }

Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Monday Aug 10, 2009

Tracking down a nasty delegation return bug

I've been fixing up a refcnt bug I found in recall_read_delegations(). One of the things I found out while in Austin was how to unit test those changes. We need a NFSv4.0 and a NFSv4.1 client both reading a file from a NFSv4.1 server. Note that we can't have a pNFS server, due to we do not yet support a mount option to force a 4.1 over a 4.0 mount. And we may never.

Anyway, you have two clients with a read delegation and force a recall by touching the files from another client.

I know it is a good test case because the server consistently crashes:

panic[cpu0]/thread=ffffff02d457db40: kernel heap corruption detected

ffffff001054de90 genunix:kmem_error+4a9 ()
ffffff001054deb0 genunix:kmem_free+d0 ()
ffffff001054df30 nfssrv:recall_read_delegations+8e8 ()
ffffff001054df90 nfssrv:deleg_rd_setattr+45 ()
ffffff001054e000 genunix:vnext_setattr+84 ()
ffffff001054e060 nfssrv:deleg_rd_setattr+66 ()
ffffff001054e0e0 genunix:vhead_setattr+9f ()
ffffff001054e140 genunix:fop_setattr+ad ()

Firefox rules! My UPS just failed and I thought I lost all of this. Yeah Firefox for saving the day!

I added DTrace probes, printf() statements when the probes failed to fire, and finally delays in the code. I couldn't see anything. I finally decided to remove all of that code and worked on cleaning the original code up. There was a rfs4_file_t pointer (fp), array of pointers (fpa), and last active pointer (fap). I was getting confused.

I fixed all that and while the panic was still occurring, the code was easier to read. I finally started adding some ASSERTs and nailed down the problem!

If we look at the original code:

 89         /\*
 90          \* Is there more than one file structure for this vp?
 91          \* Get the vsd for each instance of the server, if it exists.
 92          \*/
 93         fpa = kmem_zalloc((sizeof (rfs4_file_t \*) \* nsi_count), KM_SLEEP);
 94         DTRACE_NFSV4_1(rrd__i__fpa_alloc, int, nsi_count);
 95 
 96         fpa[0] = fp;
 97         cnt = 1;
 98         mutex_enter(&vp->v_lock);
 99         for (instp = list_head(&nsi_head); instp != NULL;
100             instp = list_next(&nsi_head, &instp->nsi_list)) {
101                 fpa[cnt] = (rfs4_file_t \*)vsd_get(vp, instp->vkey);
102                 if (fpa[cnt] && (fpa[cnt] != fp))
103                         cnt++;
104         }
105         mutex_exit(&vp->v_lock);

I said that before 101, it should be the case that cnt is less than nsi_count. I added an ASSERT and it triggered right away. I looked back at the code and while I still thought the ASSERT was valid, I saw why it could trigger in a good code path. Consider a case where the number of items in the list is greater than nsi_count. Then consider that we immediately fill in the remaining items in fpa, leaving some items in the list. We then read an item, but we don't mean to store it.

That is where the bug actually comes in I believe in the author's mind. Either there never is more than nsi_count entries in the list or not considering there can be more valid ones. Another point here is that there is no way to guarantee that the first item returned in the list is fp.

My fix is this:

103         for (instp = list_head(&nsi_head); instp != NULL;
104             instp = list_next(&nsi_head, &instp->nsi_list)) {
105                 rfs4_file_t     \*temp;
106 
107                 temp = (rfs4_file_t \*)vsd_get(vp, instp->vkey);
108                 if (temp && (temp != fp)) {
109                         ASSERT(cnt < nsi_count);
110                         fpa[cnt++] = temp;
111                 }
112         }

It works as a great defensive fix, but I'm concerned it misses the bigger issue of we should either count the items first in the list for the allocation or we should determine why we have more entries than we expect. I'll have to confer with my colleague once he gets back from vacation...


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Code walk through was a success

The code walk through in Austin was great for me - I ended up resolving all of my open issues and several bugs were found. I need to do another round of unit testing and then start running our regression test suite.

I also got some great help in putting the spe daemon into smf:

[root@pnfs-17-24 nfs]> svcadm enable svc:/network/nfs/spe
[root@pnfs-17-24 nfs]> ps -ef | grep spe
    root 100882 100854   0 11:02:47 pts/1       0:00 grep spe
    root 100880 100004   0 11:02:30 ?           0:00 /usr/lib/nfs/sped

Not sure I have the dependencies right, but that may be because I have such a horribly hacked together test rig. I'll have to do a fresh install to make sure things start right the first time.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Sunday Aug 02, 2009

Off to Austin for a code walkthrough

I'll be in Austin this week getting approval for the adoption of mds_sids to manage devices, layouts (in core and on disk), and file handles. Oh, and to add kspe!

For those of you interested, the code is over here: http://cr.opensolaris.org/~tdh/perse/. I've removed the spe debug printfs and added DTRACE probes as appropriate.

On my todo list is to perhaps remove some debug DTrace probes (don't see a real need for in shipping code, kinda just kernel printfs) and to determine whether kspe lives in the nfs or nfssrv module. I made a half-hearted attempt to stuff it back into the nfs module and got reminded why I had it in nfssrv in the first place. I.e., need to get at mds_sid info and to abstract the kspe globals such that we aren't in the way of zonification. That last is the kicker - everything in the nfs module needs to know everything about kspe and it gets messy.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
About

tdh

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today