Wednesday Aug 18, 2010

LinkedIn

Tom Haynes


Originally posted on Kool Aid Served Daily
Copyright (C) 2010, Kool Aid Served Daily

Sunday Mar 21, 2010

IETF 77 in Anaheim

I just attended the Newcomers Tutorial and the take home point for me was that IETF is a commune and not a police state. I.e., the IETF looks for "rough consensus" and does not "enforce".


Originally posted on Kool Aid Served Daily
Copyright (C) 2010, Kool Aid Served Daily

Wednesday Nov 11, 2009

chown(1) and username spaces under NFSv3 and NFSv4

A colleague came to me with an interesting question. They had a share on a server which allowed root access and could not get chown(1) to work via NFS. They wanted to know why our NFS server implementation denied it.

I immediately ran the test and it worked right off the bat! So now the question was what was different on the two systems? First of all, I tried it with a Linux client:

root@silent:~# mount slammer:/export/slammer /slammer/slammer/
root@silent:~# cd /slammer/slammer/
root@silent:/slammer/slammer# touch sil
root@silent:/slammer/slammer# chown thud sil
root@silent:/slammer/slammer# ls -la sil
-rw-r--r--+ 1 thud root 0 2009-11-11 12:40 sil

To eliminate that as the culprit, I tried an OpenSolaris client:

[root@silver tdh]> mount slammer:/export/slammer /slammer/slammer/
[root@silver tdh]> cd /slammer/slammer/
[root@silver slammer]> touch ver
[root@silver slammer]> chown thud ver
chown: ver: Permission denied

While it appears to be an issue between OpenSolaris and Linux, it is far from the case. The issue is at the NFS protocol layer and I can easily show it by flipping the success on each client!

Working OpenSolaris:

[root@silver slammer]> cd
[root@silver ~]> umount /slammer/slammer/
[root@silver ~]> mount -o vers=3 slammer:/export/slammer /slammer/slammer/
[root@silver ~]> cd /slammer/slammer/
[root@silver slammer]> touch ver_3
[root@silver slammer]> chown thud ver_3
[root@silver slammer]> ls -la ver_3
-rw-r--r--   1 thud     root           0 Nov 11 12:50 ver_3

Broken Linux:

root@silent:~# mount -t nfs4 slammer:/export/slammer /slammer/slammer/
root@silent:~# cd /slammer/slammer/
root@silent:/slammer/slammer# touch sil_4
root@silent:/slammer/slammer# chown thud sil_4
root@silent:/slammer/slammer# ls -la sil_4
-rw-r--r-- 1 thud 4294967294 0 2009-11-11 12:53 sil_4

But wait, you are saying it worked for Linux. Well, no, it did not. If we look at this on the server, we see:

slammer# ls -la sil\* ver\*
-rw-r--r--   1 2008     root           0 Nov 11 18:40 sil
-rw-r--r--   1 nobody   root           0 Nov 11 18:53 sil_4
-rw-r--r--   1 nobody   nobody         0 Nov 11 18:44 ver
-rw-r--r--   1 2008     root           0 Nov 11 18:50 ver_3

So while the Linux client thinks that it worked, it did not. We can confirm that by remounting with NFSv3:

root@silent:/slammer/slammer# ls -la sil\* ver\*
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:40 sil
-rw-r--r--+ 1 4294967294 root       0 2009-11-11 12:53 sil_4
-rw-r--r--+ 1 4294967294 4294967294 0 2009-11-11 12:44 ver
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:50 ver_3

And for fun, if we remount with NFSv4:

root@silent:/slammer/slammer# cd
root@silent:~# umount /slammer/slammer/
root@silent:~# mount -t nfs4 slammer:/export/slammer /slammer/slammer/
root@silent:~# cd /slammer/slammer/
root@silent:/slammer/slammer# ls -la sil\* ver\*
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:40 sil
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:53 sil_4
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:44 ver
-rw-r--r-- 1 4294967294 4294967294 0 2009-11-11 12:50 ver_3

The first thing we normally check in such cases is the mapid domain:

slammer# cat /var/run/nfs4_domain
internal.excfb.com
[root@silver slammer]> cat /var/run/nfs4_domain
internal.excfb.com
root@silent:/slammer/slammer# cat /proc/net/rpc/nfs4.idtoname/content
#domain type id [name]

So I need to get the Linux box in the right domain! Actually, I need to modify '/etc/default/nfs-common' on this Ubuntu system and tell it to run mapid!

Much better:

root@silent:/slammer/slammer# touch sil_4_2
root@silent:/slammer/slammer# chown thud sil_4_2
chown: changing ownership of `sil_4_2': Invalid argument

Now the Linux and OpenSolaris clients are agreeing to behave the same way with the server! What is the issue here?

In NFSv4, we change the owner with a SETATTR compound op:

NFS:  Op = 34 (SETATTR)
NFS:  State ID hash = SPECIAL_0
NFS:      len = 12    val = 000000000000000000000000
NFS:      State ID Sequence ID = 00000000
NFS:    0x00
NFS:    0x00
NFS:    0x00
NFS:    0x00
NFS:    0x10   OWNER
NFS:    0x00
NFS:    0x00
NFS:    0x00
NFS:  Owner = thud@internal.excfb.com

And in NFSv3, we try with this:

NFS:  Proc = 2 (Set file attributes)
NFS:  File handle = [86D5]
NFS:   A76A323408831F040A001500000000003F0C00000A000300000000002D000000
NFS:  Mode = (not set)
NFS:  User ID = 2008
NFS:  Group ID = (not set)
NFS:  Size = (not set)
NFS:  Access time = (do not set)
NFS:  Modification time = (do not set)
NFS:

So in the NFSv4 case, we need to be able to map 'thud@internal.excfb.com' to an user id. If that mapping does not exist, then we disallow the action. In NFSv3, we already have the user id, and we don't care if it is undefined on the server:

slammer# grep thud /etc/passwd
slammer# ls -la sil\* ver\*
-rw-r--r--   1 2008     root           0 Nov 11 18:40 sil
-rw-r--r--   1 nobody   root           0 Nov 11 18:53 sil_4
-rw-r--r--   1 root     root           0 Nov 11 19:18 sil_4_2
-rw-r--r--   1 nobody   nobody         0 Nov 11 18:44 ver
-rw-r--r--   1 2008     root           0 Nov 11 18:50 ver_3

If we add the user id of thud, we see:

slammer# useradd -u 2009 -g 10 -d /export/slammer thud
slammer# ls -la sil\* ver\*
-rw-r--r--   1 2008     root           0 Nov 11 18:40 sil
-rw-r--r--   1 nobody   root           0 Nov 11 18:53 sil_4
-rw-r--r--   1 root     root           0 Nov 11 19:18 sil_4_2
-rw-r--r--   1 nobody   nobody         0 Nov 11 18:44 ver
-rw-r--r--   1 2008     root           0 Nov 11 18:50 ver_3

Yes, I did an off by one error on purpose! What happens if we try to chown(1) via NFSv4 now:

root@silent:/slammer/slammer# chown thud base
root@silent:/slammer/slammer# ls -la base sil\* ver\*
-rw-r--r-- 1 thud   root    0 2009-11-11 13:29 base
-rw-r--r-- 1 nobody root    0 2009-11-11 12:40 sil
-rw-r--r-- 1 nobody root    0 2009-11-11 12:53 sil_4
-rw-r--r-- 1 root   root    0 2009-11-11 13:18 sil_4_2
-rw-r--r-- 1 nobody nogroup 0 2009-11-11 12:44 ver
-rw-r--r-- 1 nobody root    0 2009-11-11 12:50 ver_3

Those 'nobody's should be 'thud' and that 'thud' should not map! What does NFSv3 say:

root@silent:/slammer/slammer# ls -la base sil\* ver\*
-rw-r--r--+ 1       2009 root       0 2009-11-11 13:29 base
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:40 sil
-rw-r--r--+ 1 4294967294 root       0 2009-11-11 12:53 sil_4
-rw-r--r--+ 1 root       root       0 2009-11-11 13:18 sil_4_2
-rw-r--r--+ 1 4294967294 4294967294 0 2009-11-11 12:44 ver
-rw-r--r--+ 1 thud       root       0 2009-11-11 12:50 ver_3

The NFSv4 listing shows you the username space of the server and the NFSv3 listing shows you the username space of the client.

The big difference is that NFSv4 enforces that both systems understand the name. It doesn't mandate that the names have the same userid, just that they have a mapping.

To get everything in sync, to have both NFSv3 and NFSv4 work the same, use either NIS or LDAP to maintain your user mappings.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Thursday Jul 16, 2009

Testing mirrormounts over UFS

We've been seeing an intermittent bug where when we have a mirrormount and we unmount the parent, it fails to do so. We don't know why, but we thought we had a reproducible test case. It was in the miniPIT test suite. The problem with that is that it is huge and thus hard to debug.

At a first blush, this is easy to conceptualize:

[root@pnfs-18-09 ~]> mount -o vers=4 pnfs-18-11:/a_mnt /mnt
[root@pnfs-18-09 ~]> cd /mnt
[root@pnfs-18-09 /mnt]> cd stress/
[root@pnfs-18-09 stress]> nfsstat -m `pwd`
/mnt/stress from pnfs-18-11:/a_mnt/stress
 Flags:         vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

[root@pnfs-18-09 stress]> cd 
[root@pnfs-18-09 ~]> umount /mnt

We see that the mirrormount did succeed, but we also see that the unmount did as well.

It turned out that the test case we had was on UFS and not ZFS. I was wondering how the test suite could easily create a myriad of UFS filesystems on the fly - after all, there aren't that many slices available. My colleague Helen Chao showed me how we do it with mkfile and lofiadm:

[root@pnfs-18-11 ~]> mkfile 1G /export/FILE2
[root@pnfs-18-11 ~]> mkfile 1G /export/FILE1
[root@pnfs-18-11 ~]> lofiadm -a /export/FILE1
/dev/lofi/1
[root@pnfs-18-11 ~]> lofiadm -a /export/FILE2
/dev/lofi/2
[root@pnfs-18-11 ~]> newfs /dev/lofi/1
newfs: construct a new file system /dev/rlofi/1: (y/n)? y
/dev/rlofi/1:   2097000 sectors in 3495 cylinders of 1 tracks, 600 sectors
        1023.9MB in 219 cyl groups (16 c/g, 4.69MB/g, 2240 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 9632, 19232, 28832, 38432, 48032, 57632, 67232, 76832, 86432,
 2006432, 2016032, 2025632, 2035232, 2044832, 2054432, 2064032, 2073632,
 2083232, 2092832
[root@pnfs-18-11 ~]> newfs /dev/lofi/2
newfs: construct a new file system /dev/rlofi/2: (y/n)? y
/dev/rlofi/2:   2097000 sectors in 3495 cylinders of 1 tracks, 600 sectors
        1023.9MB in 219 cyl groups (16 c/g, 4.69MB/g, 2240 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 9632, 19232, 28832, 38432, 48032, 57632, 67232, 76832, 86432,
 2006432, 2016032, 2025632, 2035232, 2044832, 2054432, 2064032, 2073632,
 2083232, 2092832
[root@pnfs-18-11 ~]> mkdir /a_mnt
[root@pnfs-18-11 ~]> mount /dev/lofi/1 /a_mnt
[root@pnfs-18-11 ~]> mkdir /a_mnt/stress
[root@pnfs-18-11 ~]> mount /dev/lofi/2 /a_mnt/stress
[root@pnfs-18-11 ~]> share -F nfs /a_mnt
[root@pnfs-18-11 ~]> share -F nfs /a_mnt/stress

I probably overkilled the file size in the mkfile since I'm not writing to them, but in all, this is a very neat piece of hackery she does here. I really like how she follows the Unix paradigm of using small tools to do complex tasks.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Friday Apr 03, 2009

Hearty discussion of server-side copy offload I-D

If you've ever wondered how the NFS protocol is maintained, designed, discussed, etc, now is a great time to watch it happen live. James Lentini has proposed a method to allow client to initiate a file "copy" from one location to another on the same server. With the traditional method, the client has to read the file from the server and then write it back. With the COPY proposal, all of that overhead is removed.

That part is pretty pedestrian and not being debated. What is being debated is a server-to-server copy proposal. You can see the benefits here:

                                         /
                                         /
             Existing approach           /            New COPY procedures
  Client           Source   Destination  / Client           Source   Destination
    +                +          +        /   +                +          +
    |                |          |        /   |                |          |
1   |>--get FH------>|          |        /   |>-COPY_NOTIFY-->|          |
    |               /|          |        /   |               /|          |
    |              / |          |        /   |              / |          |
    |             /  |          |        /   |             /  |          |
2   |<-----------+   |          |        /   |<-----------+   |          |
    |                |          |        /   |                |          |
    |                |          |        /   |                |          |
3   |>--get FH----------------->|        /   |>--COPY------------------->|
    |                |         /|        /   |                |         /|
    |                |        / |        /   |                |        / |
4   |<-----------------------+  |        /   |<-----------------------+  |
    |                |          |        /   |                |          |
5   |>--read-------->|          |        /   |                |<--read--<|
    |               /|          |        /   |                |\\         |
    |              / |          |        /   |                | \\        |
    |             /  |          |        /   |                |  \\       |
6   |<-----------+   |          |        /   |                |   +----->|
    |                |          |        /   |                |          |
7   |>--write------------------>|        /   |                |<--read--<|
    |                |         /|        /   |                |\\         |
    |                |        / |        /   |                | \\        |
    |                |       /  |        /   |                |  \\       |
8   |<----------------------+   |        /   |                |   +----->|
    |                |          |        /   |                |          |
9   |>--read-------->|          |        /   |                |<--read--<|
    |               /|          |        /   |                |\\         |
    |              / |          |        /   |                | \\        |
    |             /  |          |        /   |                |  \\       |
10  |<-----------+   |          |        /   |                |   +----->|
    |                |          |        /   |                |          |

Don't read anything into the length of a horizontal request, i.e., the read at time 5 is not twice as fast as the write at time 7. Assume that everything is just 1 network hop away.

Also, ignore that we can have multiple reads or writes in flight. In the end, you can see that we will issue almost 1/2 the number of over the wire calls if we let the servers handle the copy.

And the mailing list is full of lively discussion about how to do this - there isn't a disagreement that it is a bad thing to do. Instead the discussion is all about how to implement it.

Rather than bias you towards my view, you can go watch the action in either the thread archives or join the working group alias nfsv4 -- NFSv4 Working Group.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

RFC3530bis is now on the ietf.org site

I decided to undertake the editing of RFC3530bis, which is a second effort on RFC3530. The intent is to make the document cleaner, but to not introduce any over-the-wire protocol changes. (We have NFSv4.1 for that.)

The main effort so far has been in taking the text version of RFC3530 and converting it to XML. The reason to do that is that it really makes it easy to organize the layout. Effectively, this is just like keeping my thesis in LaTeX. Plus I always love compiling a document. The previous version of the document had been maintained in troff and the known source was out of sync with the official document.

So now you can look at the documents here:

  1. draft-ietf-nfsv4-rfc3530bis-00.txt [draft]
  2. draft-ietf-nfsv4-rfc3530bis-dot-x-00.txt[draft]

The first major change was the pulling of the XDR description of NFSv4.0 out of the main document and into a separate one. That allows for the easy creation of your own .x file from that document.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Sunday Feb 22, 2009

Set up a Mercurial repository for the Cthon NFS tests

The NFS tests we employ at Connectathon are available as a tarball at NFS Testsuite. And they are all read-only. Someone somewhere has a SCCS repository for it.

All of this was setup some time ago, way before git, Mercurial, etc. We should be able to do better than this for managing this code. As the source tree is lost, we've had a problem updating changes. The last official change occurred in 2004.

So I created a Mercurial tree on hg.opensolaris.org. You can grab a copy of the source via:

hg clone ssh://anon@hg.opensolaris.org/hg/nfsv41/cthon

which does not assume you have an OpenSolaris account. You could update the tree with something along the lines of:

hg push -R /home/nfs4hg/src/cthon04 \\
-e "ssh -q -F /home/nfs4hg/opensolaris/config"

which assumes you do have an OpenSolaris account and that in this case you have setup a secure and passwordless ssh key. (I'm not sure those last two really belong together.)


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Thursday Jan 29, 2009

Connectathon 2009 Talk schedule posted

We've posted the talk schedule for Connectathon 2009. They will be presented for a 3 day period of Feb 23rd - 25th.

It is pretty much set, but you might see some minor adjustments being made.

The talks are open to the public and will probably be a bit different this year. In the past, they tended to focus on:

What did we do over the last year?

Whereas we are now trying to change the focus to

What are we going to do over the next year?

The change is subtle, but we expect less slides and more audience participation. The impact of that will be that the slides posted to the site (after the conference) will probably not stand on their own as well. In short, I'm urging you to get out and attend the talks.


Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily

Thursday Oct 16, 2008

Another Linux Dynamic Pseudo Root patch submitted by Steve Dickson of Redhat

So another patch set for a Linux Dynamic Pseudo Root has been submitted by Steve Dickson to the Linux NFSv4 mailing list:

The following patch series gives rpc.mountd the ability to allocate a dynamic pseudo root, so the
'fsid=0' export option is no longer required. This allows v2, v3 and v4 clients mounts without
any changes to the server's exports list.

One anomaly of the Linux NFS server is that it requires a pseudo root to be defined. Currently
the only way a pseudo root can be defined is by setting the fsid to zero (i.e. fsid=0). So
if we wanted to make v4 the default mounting version and have things just work like v2/v3 all
of the existing exports configurations would have to change (i.e. a 'fsid=0' would have to
be added) to support a v4 mounts, which, imho, is unacceptable. So this patch series address this problem.

Steve has really highlighted a huge gap in seamless integration of the Linux NFSv4 implementation into automounters, etc. The path to an export should not change based on the version of the protocol.

Hmm, strike that, from a re-reading of I'm not sure if this patch eliminates my concern about the mount path or not. I.e., above he talks about adding a 'fsid=0' on the server and not what the client has to do about the path.

Time to ask him!

Update: Steve says it does address the mount path issue we've seen in the past.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Thursday Oct 09, 2008

Pushed out the changes for 6751438 mirror mounted mountpoints panic when umounted

6751438 mirror mounted mountpoints panic when umounted, finally got through the code review process, the RTI process, etc. I wanted to get this into snv_101 because that is the candidate to become OpenSolaris 2008.11. I want a stable mirror mount experience out there for users.

The other bug fix I have in the works has passed stalled, which I'm really okay with at this point. This is the 6738223 Can not share a single IP address bug. Anyway, I've gotten two positive reviews and one reviewer asking questions. If I were tricky, I could try to submit an RTI, but the fact of the matter is that the third reviewer can still be satisfied.

Also, there is no way I want to integrate this into snv_101. While the fix is trivial, it can wait for snv_102. I.e., I don't see a business need to put it back right now.

Some of the key differences between the bugs and why one has to go in:

  1. Mirror mount bug is a panic, single addr one has a valid and documented work around
  2. Mirror mount bug is impacting regression testing, single addr one does not have a regression test case
  3. Mirror mounts are a hot request from some customers, single addr is being driven by my getting frustrated by unwieldy notation

So I feel one is a strong must and the other is a nice to have. And I don't think 'nice to have' meets the entry criteria for snv_101.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Friday Oct 03, 2008

So the closed binaries are live

I just announced on nfs41-discuss that the closed-binaries went live! See Mercurial Repository created.

When I configured my community, I followed all of the steps outlined at Setting up a pnfs community except for the mdsadm command on the MDS server:

[root@pnfs-9-10 ~]> mdsadm -o add -t auth -a ip=10.1.233.50
adding: IP Addr - 10.1.233.50

I experienced the following on my MDS console, which are being investigated but do not appear to be fatal:

[root@pnfs-9-14 ~]> Oct  3 17:00:11 pnfs-9-14 /usr/lib/nfs/nfsd[101025]: write failed for /var/nfs/v4_state/mds_/010.001.233.053-6448e6939a: write(669938620) returned -1 errno=14 ss_len=669938600
Oct  3 17:05:55 pnfs-9-14 nfssrv: NOTICE: op_destroy_session: SP4_NONE

And ditto for these on the DS console:

[root@pnfs-9-13 ~]> dservadm enable
[root@pnfs-9-13 ~]> Oct  3 16:52:29 pnfs-9-13 dserv[101033]: bad cmd: 3
Oct  3 16:52:29 pnfs-9-13 last message repeated 1 time
Oct  3 16:52:32 pnfs-9-13 dserv: WARNING: CLNT_CALL() ds protocol to mds failed: 5
Oct  3 16:52:32 pnfs-9-13 dserv[101033]: ioctl failed: I/O error
sahre
sahre: Command not found.
[root@pnfs-9-13 ~]> share
-@data/nfs4     /data/nfs4   anon=0,sec=sys,rw   ""  
[root@pnfs-9-13 ~]> Oct  3 17:00:27 pnfs-9-13 /usr/lib/nfs/nfsd[101019]: write failed for /var/nfs/v4_state/mds_/010.001.233.053-6448e693c4: write(419178265) returned -1 errno=14 ss_len=419178245

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Thursday Oct 02, 2008

Juggling work load

The group has so much to do and it feels like so little time to do it. I don't think anyone just codes. I'm looking at my action list and it is all over the place:

Test fix for 6738223 Can not share a single IP address
Unit testing was successful, but mini-PIT testing is mixed. I think this is more a configuration issue than a code issue.
Push code review along for 6751438 mirror mounted mountpoints panic when umounted
Frank gave me a good review/discussion, but I need another reviewer.
Figure out how to configure a minimal jumpstart
I could ask someone to do it, but I've been meaning to understand jumpstarting myself. This is an easy way to ease into it.
Open up an OpenSolaris gate for NFSv41
Chewed up a large chunk of my day. Hey, I need to do a test integration to see if it works. If it does, I have to run because Dave Marker is going to eat my beating heart. And it works! Here is my Linux box at home:
[thud@adept src]> hg incoming
comparing with ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate
searching for changes
changeset:   7742:9fab48a31a4a
tag:         tip
user:        Thomas Haynes 
date:        Thu Oct 02 21:19:03 2008 -0500
summary:     Test of push to osol

[thud@adept src]> hg pull -u
pulling from ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
Build closed-binaries for OpenSolaris
Trivial, but time consuming. I will need to also install and test. We also just got rid of auth records, so I need to see how we are changing the configuration of the DS and MDS.
Figure out how to translate data path to guuid
The above mentioned server change really hosed up my progress on spe. I've got the final pieces in my mind, but I need to just grind it all out. And then I need to start testing. I may need to go to VirtualBoxes to get enough DSes for testing.
D'oh, need to get a VirtualBox image together for OpenSolaris
I can piggyback on the push of the closed binaries.
Modernize style of blog site
Looks dated and I want to add a tag cloud. At least one of the shared styles will do this, but I just don't want to use it.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Some usr/src/uts/common/fs/nfs naming conventions

In usr/src/uts/common/fs/nfs/, there are some historical naming conventions that can help you understand where you are in the code:

rfs_
You are in the NFSv2 or NFSv3 server code. It may also be generic code that will call down into the NFSv4 layer.
rfs4_
You are in the NFSv4 server code.
rfs41_
You are in the NFS4.1 and/or pNFS server code.
nfs_
You are in the NFSv2 or NFSv3 client code. It may also be generic code that will call down into the NFSv4 layer.
rfs4_
You are in the NFSv4 client code.
rfs41_
You are in the NFS4.1 and/or pNFS client code.

You may end up in code which doesn't follow these conventions, i.e., the spe code I will be adding, some of the mirror mount code, etc. But if you are look at a stack trace, you would be able to see this type of code being called by a function following the above patterns.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Thursday Sep 25, 2008

More on that refcnt debugging

Earlier I wrote about Debugging a refcnt. I'm still not convinced I have the answer, but I am convinced I'm on the right track. So I've made a some changes to the code base to make things better/correct:

  1. Introduced a incr function to make sure that all bumps are correct:
       291  static void
       292  nfs4_ephemeral_tree_incr(nfs4_ephemeral_tree_t \*net)
       293  {
       294          ASSERT(mutex_owned(&net->net_cnt_lock));
       295          net->net_refcnt++;
       296          ASSERT(net->net_refcnt != 0);
       297  }
       298
       299  static void
       300  nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t \*net)
       301  {
       302          mutex_enter(&net->net_cnt_lock);
       303          nfs4_ephemeral_tree_incr(net);
       304          mutex_exit(&net->net_cnt_lock);
       305  }
    ...
      1859                  } else {
      1860                          nfs4_ephemeral_tree_incr(net);
      1861                  }
    
  2. Fixed up the one locking issue:
       734          } else {
       735                  net = mi->mi_ephemeral_tree;
       736                  nfs4_ephemeral_tree_hold(net);
       737
       738                  mutex_exit(&mi->mi_lock);
    
  3. And I still have the change which I thought would fix the bug:
      2006                  if (was_locked == FALSE)
      2007                          mutex_exit(&net->net_tree_lock);
    

    As a matter of fact, after struggling to work my way through all of the cases in nfs4_ephemeral_umount, I need to add more documentation about the interactions of the booleans \*pmust_unlock, is_recursed, and was_locked. I could add more booleans to be explicit about what is going on, but I think better comments will suffice.

I've also introduced some key debug logic to help me figure out if my conjecture about what is happening is correct or not inside nfs4_ephemeral_umount_activate:

  1732  void
  1733  nfs4_ephemeral_umount_activate(mntinfo4_t \*mi, bool_t \*pmust_unlock,
  1734      nfs4_ephemeral_tree_t \*\*pnet)
  1735  {
  1736          nfs4_ephemeral_tree_t   \*checker = mi->mi_ephemeral;
  1737
  1738          /\*
  1739           \* Now we need to get rid of the ephemeral data if it exists.
  1740           \*/
  1741          mutex_enter(&mi->mi_lock);
  1742          if (mi->mi_ephemeral) {
  1743                  /\*
  1744                   \* If we are the root node of an ephemeral branch
  1745                   \* which is being removed, then we need to fixup
  1746                   \* pointers into and out of the node.
  1747                   \*/
  1748                  if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
  1749                          nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
  1750
  1751                  ASSERT(mi->mi_ephemeral != NULL);
  1752
  1753                  kmem_free(mi->mi_ephemeral, sizeof (\*mi->mi_ephemeral));
  1754                  mi->mi_ephemeral = NULL;
  1755          }
  1756          mutex_exit(&mi->mi_lock);
  1757
  1758          if (\*pmust_unlock) {
  1759                  ASSERT(checker != \*pnet);
  1760          }
  1761
  1762          nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);
  1763  }

I didn't want to change 1753 to be an assignment to NULL, which would mimic the case for mi->mi_ephemeral_tree. And I don't want to change this code too much.

So I save off a local copy of mi->mi_ephemeral right off the bat. I allow the free and NULLing to occur. And then, only in the case where we must unlock (because the code is more likely to be valid in the other case), do I check to make sure that the saved address is different than the node we are about to unlock. Note that if the mi->mi_ephemeral pointer were NULL when we entered the function, we will not trigger here. Actually, if we do, we have more problems. :->

And if this assert does trigger, I'll have credible evidence I know what is going on and I'll have a good handle on how to fix it.

Now if my build would just finish...

Update:

Okay, I'm way off base and probably need to take a break from this:

../../common/fs/nfs/nfs4_stub_vnops.c:1736: warning: initialization from incompatible pointer type
"../../common/fs/nfs/nfs4_stub_vnops.c", line 1736: warning: assignment type mismatch:

I realized earlier that the types were different, but I didn't let that sink in. The core is consistent:

> \*ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}

The mntinfo4 does point to the tree.

Okay, I took my first advice to add more comments and I have a new piece of code to question in nfs4_ephemeral_umount:

   1763 	int			is_recursed = FALSE;
   1764 	int			was_locked = FALSE;
...
   1770 	\*pmust_unlock = FALSE;
...
   1817 	if (!is_recursed) {
...
   1845 			was_locked = TRUE;
   1846 		} else {
   1847 			net->net_refcnt++;
   1848 			ASSERT(net->net_refcnt != 0);
...
   1859 	if (was_locked == FALSE) {
...
   1866 		if (mutex_tryenter(&net->net_tree_lock)) {
   1867 			\*pmust_unlock = TRUE;
   1868 		} else {

So, if !is_recursed, then we can end up on line 1847 to bump the refcnt. At this point, \*pmust_unlock is FALSE and was_locked is also FALSE. If we grab the net->net_tree_lock right away on 1866, then crud, it is okay.

But the other case is not. Assume that is_recursed is TRUE. That means we do not bump the refcnt and at 1859, the following values hold: was_locked is FALSE and \*pmust_unlock is FALSE. Now if we grab the mutex right away, we end up with \*pmust_unlock as TRUE. In my examination of the code, that then implies we \*must\* drop the refcnt at some exit point. But we just saw that we never grabbed it in this scenario.

This would account for us trying to rele a refcnt we never held. We return TRUE to nfs4_free_mount and it sends this into nfs4_ephemeral_umount_activate and thus into nfs4_ephemeral_umount_unlock.

I'm going to introduce more state in a \*pmust_rele to keep track of whether I need to rele or not.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Debugging a refcnt

A refcnt is a way to keep track of how many outstanding holds are being used on a data structure. Say that you need to reference something in the data structure, but you don't want to do it while holding a lock. You do want to make sure that the data is still there when you go back. What you do is bump the refcnt and have the cleanup code never free that data until the refcnt goes back to 0.

Some issues are what happens if it never goes back to 0? Or if it accidentally goes back to 0 before you are done with it? The answers are that you leak memory and crash, in that order.

I'm debugging the scenario where the refcnt goes back to 0 before you release your hold. Before I dive into looking at that, let's look at the relevant code in the mirror mount source: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c (Note that this code will change over time, I'm linking to the copy of the gate.)

We bump the count here:

    291 static void
    292 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t \*net)
    293 {
    294 	mutex_enter(&net->net_cnt_lock);
    295 	net->net_refcnt++;
    296 	ASSERT(net->net_refcnt != 0);
    297 	mutex_exit(&net->net_cnt_lock);
    298 }

The first thing to note is that we grab net_cnt_lock before we do anything. How do we even know that is valid? Well, we grab a lock on the mountinfo and check to see if there is an ephemeral node:

    696 	mutex_enter(&mi->mi_lock);
...
    701 	if (mi->mi_ephemeral_tree == NULL) {
...
    727 	} else {
    728 		net = mi->mi_ephemeral_tree;
    729 		mutex_exit(&mi->mi_lock);
    730 
    731 		nfs4_ephemeral_tree_hold(net);

Oh, that is very naughty indeed - what I said and what the code says disagree. What do other parts of the code say?

   1422 	mutex_enter(&mi->mi_lock);
...
   1444 	nfs4_ephemeral_tree_hold(net);
...
   1528 	mutex_exit(&mi->mi_lock);

And a special case in nfs4_ephemeral_harvest_forest

   2098 	mutex_enter(&ntg->ntg_forest_lock);
   2099 	for (net = ntg->ntg_forest; net != NULL; net = next) {
   2100 		next = net->net_next;
   2101 
   2102 		nfs4_ephemeral_tree_hold(net);

I expect the code at 2102 to not hold the mntinfo4. We don't have our hands on that info at the time. But I believe the code at 731 to be a bug that will eventually bit us.

But back to the hold code:

    295 	net->net_refcnt++;
    296 	ASSERT(net->net_refcnt != 0);

We bump the refcnt and immediately assert that it is not 0. As refcnt is an unsigned int, this will catch the case where we wrap over the size of an integer. We don't ever expect to hit this case.

It is in the rele case that we are more concerned:

    300 /\*
    301  \* We need a safe way to decrement the refcnt whilst the
    302  \* lock is being held.
    303  \*/
    304 static void
    305 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t \*net)
    306 {
    307 	ASSERT(mutex_owned(&net->net_cnt_lock));
    308 	ASSERT(net->net_refcnt != 0);
    309 	net->net_refcnt--;
    310 }
    311 
    312 static void
    313 nfs4_ephemeral_tree_rele(nfs4_ephemeral_tree_t \*net)
    314 {
    315 	mutex_enter(&net->net_cnt_lock);
    316 	nfs4_ephemeral_tree_decr(net);
    317 	mutex_exit(&net->net_cnt_lock);
    318 }

First note that either we hold the lock ourself and call nfs4_ephemeral_tree_decr or we let nfs4_ephemeral_tree_rele grab the lock for us. The assert on 307 checks that this is valid. The assert on 308 checks to see that we are not already at 0 when we decrement. I.e., because the refcnt is unsigned, once we drop below 0, we can not do a check like:

                ASSERT(net->net_refcnt >= 0);

By the way, we do not need to be holding mi->mi_lock everywhere we rele or decr. The reason we need to when we hold is that we need to know the memory is valid. With the other two operations, by definition, the memory must be valid.

To prove that statement, we need to show that the ephemeral tree can not be disassociated from the mntinfo4 without the mi->mi_lock being held. The only place we do that is in nfs4_ephemeral_umount:

   2003 		/\*
   2004 		 \* At this point, the tree should no
   2005 		 \* longer be associated with the
   2006 		 \* mntinfo4. We need to pull it off
   2007 		 \* there and let the harvester take
   2008 		 \* care of it once the refcnt drops.
   2009 		 \*/
   2010 		mutex_enter(&mi->mi_lock);
   2011 		mi->mi_ephemeral_tree = NULL;
   2012 		mutex_exit(&mi->mi_lock);

Okay, further validation that we should have the lock held at 731.

The final piece of the puzzle is what happens in the harvester? I.e., we decided in 2011 to disassociate the tree with the mntinfo4. But we didn't release the memory. How do we do that? Well, we look in nfs4_ephemeral_harvest_forest and we also see why we don't care about holding mi->mi_lock there.

   2098 	mutex_enter(&ntg->ntg_forest_lock);
   2099 	for (net = ntg->ntg_forest; net != NULL; net = next) {
   2100 		next = net->net_next;
   2101 
   2102 		nfs4_ephemeral_tree_hold(net);

When we hold ntg->ntg_forest_lock, no other thread is allowed to manipulate the forest of ephemeral trees. What that means is that the current forest is locked and the memory is not going away. Hence we don't need to hold mi->mi_lock. The other code has to hold it because the reference in the mntinfo4 can go away (see line 2011).

   2140 		while (e) {
...
   2247 			e = prior;
   2248 		}
   2249 
   2250 check_done:
   2251 
   2252 		/\*
   2253 		 \* At this point we are done processing this tree.
   2254 		 \*
   2255 		 \* If the tree is invalid and we are the only reference
   2256 		 \* to it, then we push it on the local linked list
   2257 		 \* to remove it at the end. We avoid that action now
   2258 		 \* to keep the tree processing going along at a fair clip.
   2259 		 \*
   2260 		 \* Else, even if we are the only reference, we drop
   2261 		 \* our hold on the current tree and allow it to be
   2262 		 \* reused as needed.
   2263 		 \*/
   2264 		mutex_enter(&net->net_cnt_lock);
   2265 		if (net->net_refcnt == 1 &&
   2266 		    net->net_status & NFS4_EPHEMERAL_TREE_INVALID) {
   2267 			nfs4_ephemeral_tree_decr(net);
   2268 			net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
   2269 			mutex_exit(&net->net_cnt_lock);
   2270 			mutex_exit(&net->net_tree_lock);
   2271 
   2272 			if (prev)
   2273 				prev->net_next = net->net_next;
   2274 			else
   2275 				ntg->ntg_forest = net->net_next;
   2276 
   2277 			net->net_next = harvest;
   2278 			harvest = net;
   2279 			continue;
   2280 		}
   2281 
   2282 		nfs4_ephemeral_tree_decr(net);
   2283 		net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
   2284 		mutex_exit(&net->net_cnt_lock);
   2285 		mutex_exit(&net->net_tree_lock);
   2286 
   2287 		prev = net;
   2288 	}
   2289 	mutex_exit(&ntg->ntg_forest_lock);

2248 completes the a while processing loop started in 2140 above. Among other things, this loop decides if a mirror mount has been idle long enough to automatically release it. It does this in a recursion elimination to visit every child and sibling node of the root of the ephemeral tree.

The interesting code starts on lines 2265 and 2266. If the refcnt is 1, then the hold we established on 2102 is the only reference to this node. Then, if the tree has been invalidated, either by the harvest loop above (this time or earlier) or by a manual unmount, then we know it is safe to release this node. We decrement the refcnt at 2267. This is important because if we leave it at 1 and push it on the harvest list, then another node can do a rele and we think it is okay. We remove it from the forst in lines 2272 to 2275 and push it onto a local linked list harvest on lines 2277 and 2278. We then go on to process the next node.

When we have finished all trees in the forest, they are all in one of 3 states:

  1. Not in use by a thread and harvested.
  2. Not idled out. May or may not be in use by a thread.
  3. In use by a thread, idled out, and not harvested.

With all of that background out of the way, let us look at a coredump:

> $c
vpanic()
assfail+0x7e(fffffffff8490b98, fffffffff8490b70, 134)
nfs4_ephemeral_tree_decr+0x64(ffffff020db7fc18)
nfs4_ephemeral_umount_unlock+0x3f(ffffff0004f33c14, ffffff0004f33bb8)
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
nfs4_free_mount+0x2fe(ffffff047ae40dc8, 400, ffffff01974f2b28)
nfs4_free_mount_thread+0x24(ffffff016623a9f8)
thread_start+8()
> \*ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}
> ffffff0004f33c14::print boolean_t
1 (B_TRUE)

This is the assert on line 308. Some things we can tell right away is that this should be a manual unmount, since we came through nfs4_free_mount and not the harvester. The flags are 0x400, which means MS_FORCE - a forced umount call. Also, while the refcnt is 0, it has not been harvested. In that case, we would expect to see that the state would include NFS4_EPHEMERAL_TREE_INVALID. And indeed, the mntinfo4 (see net_mount) does still point to this node:

> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree
mi_ephemeral_tree = 0xffffff020db7fc18
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}

So, who did us in? Well, the first thought I had was this piece of code in nfs4_ephemeral_umount : (which we would have just called)

   1990 		nfs4_ephemeral_tree_decr(net);
   1991 		mutex_exit(&net->net_cnt_lock);
   1992 
   1993 		if (was_locked == FALSE)
   1994 			mutex_exit(&net->net_tree_lock);

We only hold the refcnt in here if we do not set was_locked to TRUE:

   1764 	int			was_locked = FALSE;
...
   1845 			was_locked = TRUE;
   1846 		} else {
   1847 			net->net_refcnt++;
   1848 			ASSERT(net->net_refcnt != 0);
   1849 		}
   1850 
   1851 		mutex_exit(&net->net_cnt_lock);

Hmm, we also avoid using nfs4_ephemeral_tree_hold. We should probably create a symmetric function for incr like we did for decr.

Anyway, I fixed the code at 19909 and we still see the panic. The new one has the same signature. By the way, I found this probable error by matching holds to reles. I stopped with just the simple cases when I found the above issue.

Okay, what does nfs4_free_mount() do?

   4039 	/\*
   4040 	 \* If we got an error, then do not nuke the
   4041 	 \* tree. Either the harvester is busy reclaiming
   4042 	 \* this node or we ran into some busy condition.
   4043 	 \*
   4044 	 \* The harvester will eventually come along and cleanup.
   4045 	 \* The only problem would be the root mount point.
   4046 	 \*
   4047 	 \* Since the busy node can occur for a variety
   4048 	 \* of reasons and can result in an entry staying
   4049 	 \* in df output but no longer accessible from the
   4050 	 \* directory tree, we are okay.
   4051 	 \*/
   4052 	if (!nfs4_ephemeral_umount(mi, flag, cr,
   4053 	    &must_unlock, &eph_tree))
   4054 		nfs4_ephemeral_umount_activate(mi, &must_unlock,
   4055 		    &eph_tree);
   4056 

So we call nfs4_ephemeral_umount, which had that suspect piece of code. We know that we will not set NFS4_EPHEMERAL_TREE_LOCKED ourselves. We can see that it was not held, because pmust_unlock was set to TRUE. Also, the status tells us we did not mark the tree as NFS4_EPHEMERAL_TREE_INVALID and the fact that the mntinfo4 pointer is valid tells us we did not disassociate the tree from it. All of which tells me fixing line 1990 would probably have no impact on the execution path. I.e., I'm starting to suspect we don't have a problem with a race case.

The only way to return 0 and not disassociate the tree is to go down this path back in nfs4_ephemeral_umount:

   1945 	if (!is_derooting) {
   1946 		/\*
   1947 		 \* Only work on children if the caller has not already
   1948 		 \* done so.
   1949 		 \*/
   1950 		if (!is_recursed) {
   1951 			ASSERT(eph != NULL);
   1952 
   1953 			error = nfs4_ephemeral_unmount_engine(eph,
   1954 			    FALSE, flag, cr);
   1955 			if (error)
   1956 				goto is_busy;
   1957 		}
   1958 	} else {

And as a matter of fact, the evidence suggests that no other code did this either. nfs4_ephemeral_unmount_engine must have been error free.

So either we have decremented the refcnt twice for our one hold or we decremented when we did not have a hold.

Heh, I've gone down many a code branch, only to backtrack as I found the code was correct. Perhaps we should return to our debug session? nfs4_ephemeral_umount_activate takes a pointer to a mntinfo4 and a pointer to a pointer of a nfs4_ephemeral_tree_t:

nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
> \*ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral
mi_ephemeral = 0

The ephemeral node is not pointing to the same thing as the function. And I have more faith in the function, because it had to just execute:

   1733 	if (mi->mi_ephemeral) {
   1734 		/\*
   1735 		 \* If we are the root node of an ephemeral branch
   1736 		 \* which is being removed, then we need to fixup
   1737 		 \* pointers into and out of the node.
   1738 		 \*/
   1739 		if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
   1740 			nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
   1741 
   1742 		ASSERT(mi->mi_ephemeral != NULL);
   1743 
   1744 		kmem_free(mi->mi_ephemeral, sizeof (\*mi->mi_ephemeral));
   1745 		mi->mi_ephemeral = NULL;
   1746 	}
   1747 	mutex_exit(&mi->mi_lock);
   1748 
   1749 	nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);

I.e., if we pass in a mntinfo4, either the ephemeral tree is already NULL or we free it and NULL it out. And oh crap, if mi->mi_ephemeral was a pointer to the same memory location as \*pnet, we just screwed the pooch and caused a core. I.e., the only place we should be removing this memory is at:

   2010 		mutex_enter(&mi->mi_lock);
   2011 		mi->mi_ephemeral_tree = NULL;
   2012 		mutex_exit(&mi->mi_lock);

Well, this entry is long enough as it is. I'll convince myself that removing 1744 is the correct thing to do and then test that fix. Note, this kmem_free might explain why systems do not always core. If the memory is garbage, all it has to be is non-zero for the assert to not trigger. Whether or not other nasty things could have happened is to be seen later.

UPDATE: Hmm, one is a mi->mi_ephemeral and the other is a mi->mi_ephemeral_tree. The end result may stand, but my analysis may be off. Back with more later!


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily
About

tdh

Search

Archives
« July 2015
SunMonTueWedThuFriSat
   
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
 
       
Today