Thursday Oct 02, 2008

How to tie our closed-bins to the new gate

I was trying to relax and I realized we would have an ongoing problem in keeping the new ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate in sync with our copy of the closed binaries. But, I think we will be saved by a couple of things:

  1. We don't update the internal nfs41-gate automatically with every change in the onnv-gate. We actually normally sync up with the 2 week releases. This means that random changes to the closed source will not impact the osol gate. As a matter of fact, we control when a change causes a respin of the closed bits.
  2. Just like the ON gatekeepers tag their gate every two weeks, we could also use 'hg tag' to mark when the closed binaries changed. We could store the closed binaries on the project download page, and when an external developer saw the tag change, they could pick up a new copy.

Plus, with the commit mail going out to the dev mailing list, people would be able to see when they need to pick up a new set of closed binaries.

[thud@adept src]> hg incoming
comparing with ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate
searching for changes
changeset:   7743:c672b1cb86be
tag:         tip
user:        Thomas Haynes 
date:        Thu Oct 02 22:28:30 2008 -0500
summary:     Added tag closedv1 for changeset 9fab48a31a4a

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Juggling work load

The group has so much to do and it feels like so little time to do it. I don't think anyone just codes. I'm looking at my action list and it is all over the place:

Test fix for 6738223 Can not share a single IP address
Unit testing was successful, but mini-PIT testing is mixed. I think this is more a configuration issue than a code issue.
Push code review along for 6751438 mirror mounted mountpoints panic when umounted
Frank gave me a good review/discussion, but I need another reviewer.
Figure out how to configure a minimal jumpstart
I could ask someone to do it, but I've been meaning to understand jumpstarting myself. This is an easy way to ease into it.
Open up an OpenSolaris gate for NFSv41
Chewed up a large chunk of my day. Hey, I need to do a test integration to see if it works. If it does, I have to run because Dave Marker is going to eat my beating heart. And it works! Here is my Linux box at home:
[thud@adept src]> hg incoming
comparing with ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate
searching for changes
changeset:   7742:9fab48a31a4a
tag:         tip
user:        Thomas Haynes 
date:        Thu Oct 02 21:19:03 2008 -0500
summary:     Test of push to osol

[thud@adept src]> hg pull -u
pulling from ssh://anon@hg.opensolaris.org/hg/nfsv41/nfs41-gate
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
Build closed-binaries for OpenSolaris
Trivial, but time consuming. I will need to also install and test. We also just got rid of auth records, so I need to see how we are changing the configuration of the DS and MDS.
Figure out how to translate data path to guuid
The above mentioned server change really hosed up my progress on spe. I've got the final pieces in my mind, but I need to just grind it all out. And then I need to start testing. I may need to go to VirtualBoxes to get enough DSes for testing.
D'oh, need to get a VirtualBox image together for OpenSolaris
I can piggyback on the push of the closed binaries.
Modernize style of blog site
Looks dated and I want to add a tag cloud. At least one of the shared styles will do this, but I just don't want to use it.


Trying to understand the NFSv4 database interface

For spe, I need to understand the database stuff in usr/src/uts/common/fs/nfs/nfs4_db.c. Note that this code is not part of the NFSv4 spec, it is a Sun implementation detail. Specifically, I need to know how to create two indexes on the same set of data. I could create two tables, but I think having a second copy is overkill. Also, it would be ugly to make sure that they stayed in sync.

The issue is that I need to keep track of the mapping from data source name to guuid, whereas the existing code only needs to be aware of the guuid.

I've gone through this code in the past, but I've forgotten most of what I learned. So I'm going to walk through it and annotate it here. I'll provide quick links which should stay relevant as the code changes.

We can see the server start to use the database code in rfs4_state_init:

   1238 	/* Create the overall database to hold all server state */
   1239 	rfs4_server_state = rfs4_database_create(rfs4_database_debug);

It then creates some tables and indexes:

   1241 	/* Now create the individual tables */
   1242 	rfs4_client_cache_time *= rfs4_lease_time;
   1243 	rfs4_client_tab = rfs4_table_create(rfs4_server_state,
   1244 	    "Client",
   1245 	    rfs4_client_cache_time,
   1246 	    2,
   1247 	    rfs4_client_create,
   1248 	    rfs4_client_destroy,
   1249 	    rfs4_client_expiry,
   1250 	    sizeof (rfs4_client_t),
   1251 	    TABSIZE,
   1252 	    MAXTABSZ/8, 100);
   1253 	rfs4_nfsclnt_idx = rfs4_index_create(rfs4_client_tab,
   1254 	    "nfs_client_id4", nfsclnt_hash,
   1255 	    nfsclnt_compare, nfsclnt_mkkey,
   1256 	    TRUE);
   1257 	rfs4_clientid_idx = rfs4_index_create(rfs4_client_tab,
   1258 	    "client_id", clientid_hash,
   1259 	    clientid_compare, clientid_mkkey,
   1260 	    FALSE);

Looks like I now know I can create two different indexes on the same table.

Of interest is that only one index can be used to create entries in the table. That is given by the last parameter to rfs4_index_create. Indeed, we can see that rfs4_nfsclnt_idx is the create index for rfs4_client_tab.

BTW: Using the naming convention (see Some usr/src/uts/common/fs/nfs naming conventions), we can easily tell that the code, table, and indexes are all for the NFSv4 server.

Going back to the code, we can see this property is enforced:

    381 	if (createable) {
    382 		table->ccnt++;
    383 		if (table->ccnt > 1)
    384 			panic("Table %s currently can have only have one "
    385 			    "index that will allow creation of entries",
    386 			    table->name);
    387 		idx->createable = TRUE;
    388 	} else {
    389 		idx->createable = FALSE;
    390 	}

Lines 383-384 are basically a VERIFY which spits out useful information. Note that only developers should ever see this panic, as it would fire the first time they try to boot a kernel with such a table definition. I.e., index creation does not normally depend on executing some special runtime path.

Hmm, that was quick. I guess when I have to dive further, I'll add more annotation. ;>



Some usr/src/uts/common/fs/nfs naming conventions

In usr/src/uts/common/fs/nfs/, there are some historical naming conventions that can help you understand where you are in the code:

rfs_
You are in the NFSv2 or NFSv3 server code. It may also be generic code that will call down into the NFSv4 layer.
rfs4_
You are in the NFSv4 server code.
rfs41_
You are in the NFS4.1 and/or pNFS server code.
nfs_
You are in the NFSv2 or NFSv3 client code. It may also be generic code that will call down into the NFSv4 layer.
nfs4_
You are in the NFSv4 client code.
nfs41_
You are in the NFS4.1 and/or pNFS client code.

You may end up in code which doesn't follow these conventions, e.g., the spe code I will be adding, some of the mirror mount code, etc. But if you are looking at a stack trace, you will be able to see this type of code being called by a function following the above patterns.



Tuesday Sep 30, 2008

branch gatekeeping 101

So I maintain nfs41-gate which is a development branch of onnv-gate. With the introduction of Mercurial as our version control system, my life has changed a bit, but the basic tasks for gatekeeping stay the same:

Build binaries
When developers change the source base, build new BFU bits for QA and/or other developers. I.e., reference bits without anyone else's changes present.
Branch merge with the ON gate
When something we want is introduced into ON or we don't want to drift too far, we sync up with onnv-gate.

I've noticed that I find merging to be easier with Mercurial, so it tends to happen more often.

Building Binaries

I typically already have an existing workspace and I'll do an incremental build in it. Also, I do this for both sparc and i386. I don't have to worry about conflicts or merging since nothing changes in the child. A typical session would be:

[th199096@aus-build-x86 ~]> ws /builds/th199096/nfs41-gk

Workspace                    : /builds/th199096/nfs41-gk
Workspace Parent             : ssh://aus1500-home//pool/ws/nfs41-clone
Proto area ($ROOT)           : /builds/th199096/nfs41-gk/proto/root_i386
Root of source ($SRC)        : /builds/th199096/nfs41-gk/usr/src
Root of test source ($TSRC)  : /builds/th199096/nfs41-gk/usr/ontest
Current directory ($PWD)     : /builds/th199096/nfs41-gk

[th199096@aus-build-x86 nfs41-gk]>  hg pull -u
pulling from ssh://aus1500-home//pool/ws/nfs41-clone
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 93 changes to 93 files
93 files updated, 0 files merged, 0 files removed, 0 files unresolved
[th199096@aus-build-x86 nfs41-gk]> ws usr/closed/

Workspace                    : /builds/th199096/nfs41-gk/usr/closed
Workspace Parent             : ssh://aus1500-home//pool/ws/nfs41-clone/usr/closed
Proto area ($ROOT)           : /builds/th199096/nfs41-gk/usr/closed/proto/root_i386
Root of source ($SRC)        : /builds/th199096/nfs41-gk/usr/closed/usr/src
Root of test source ($TSRC)  : /builds/th199096/nfs41-gk/usr/closed/usr/ontest
Current directory ($PWD)     : /builds/th199096/nfs41-gk/usr/closed

[th199096@aus-build-x86 closed]> hg pull -u
pulling from ssh://aus1500-home//pool/ws/nfs41-clone/usr/closed
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 3 changes to 3 files
3 files updated, 0 files merged, 0 files removed, 0 files unresolved
[th199096@aus-build-x86 closed]> exit
exit
[th199096@aus-build-x86 nfs41-gk]> `which nightly` -in nightly.env 

At which point I go do something else, like blog about what I am doing.

I said typical, except that this change set (which gets rid of auth records in the MDS as a byproduct) happens to have touched something in closed. We rarely seem to make changes there.

Okay, the build is done (remember to check the logs) and in this case it did not fail. So push the BFU bits out and then send out email telling people about the new reference bits.

[th199096@aus-build-x86 nfs41-gk]> ~/gk/hg-push.sh nfs41-gk i386 2008-09-30
ARCH=i386
BASE=/builds/th199096
AUS=/net/aus1500-home.central/pool/ws/nfs41-gate-hg-archives/i386
NIGHTLY=/builds/th199096/nfs41-gk/archives/i386/nightly
NIGHTLY_ND=/builds/th199096/nfs41-gk/archives/i386/nightly-nd
DATER=/builds/th199096/nfs41-gk/archives/i386/2008-09-30
DATER_ND=/builds/th199096/nfs41-gk/archives/i386/2008-09-30-nd
+ mv /builds/th199096/nfs41-gk/archives/i386/nightly /builds/th199096/nfs41-gk/archives/i386/2008-09-30 
+ mv /builds/th199096/nfs41-gk/archives/i386/nightly-nd /builds/th199096/nfs41-gk/archives/i386/2008-09-30-nd 
+ cp -r /builds/th199096/nfs41-gk/archives/i386/2008-09-30 /net/aus1500-home.central/pool/ws/nfs41-gate-hg-archives/i386/2008-09-30 
+ cp -r /builds/th199096/nfs41-gk/archives/i386/2008-09-30-nd /net/aus1500-home.central/pool/ws/nfs41-gate-hg-archives/i386/2008-09-30-nd 
+ rm /net/aus1500-home.central/pool/ws/nfs41-gate-hg-archives/i386/latest /net/aus1500-home.central/pool/ws/nfs41-gate-hg-archives/i386/latest-nd 
+ cd /net/aus1500-home.central/pool/ws/nfs41-gate-hg-archives/i386 
+ ln -s 2008-09-30 latest 
+ ln -s 2008-09-30-nd latest-nd 

Branch merging

This case is a bit more complicated and can be summarized by:

  1. Take a clone of nfs41-clone - call it nfs41-sync
  2. Reparent it to onnv-clone (onnv-gate is write only)
  3. Pull across the changes and merge them.
    This is actually the only difficulty in the entire process
  4. On sparc and x86 build boxes, get clones of nfs41-sync and do full builds.
    Incrementals are not sufficient in this case.
  5. Populate a pNFS community (client, DS, and MDS) with these changes and make sure the Connectathon tests all pass.
    Depending on the scope of the changes and/or the difficulty of the merge, we may skip this item; it's a pure judgment call. Also, note that these clones will become the basis for future incremental builds as described in the previous section.
  6. If there have been further integrations to nfs41-gate, reparent to nfs41-clone (which is automatically kept up to date with nfs41-gate), and do the pull/merge cycle until everything is up to date. You may have to rebuild and retest. Much easier with a small group of developers to ask them not to integrate.
  7. ZFS snapshot nfs41-gate to make rolling back the changes easier.
  8. Reparent nfs41-sync to nfs41-gate and integrate the changes.

By not changing the nfs41-gate until the final moment, I can throw everything away if needed. And believe me, as painful as that is, I've done it. Also, note that when I talk about a workspace above, I am also talking about working in parallel with the closed version of it.

But now onto a detailed example:

Get a backup snapshot of the gate:

[th199096@aus1500-home ~]> zfs snapshot pool/ws/nfs41-clone@sync99

Now grab your copy for merging

[th199096@aus1500-home ~]> cd /pool/ws/th199096/
[th199096@aus1500-home th199096]> ~/bin/hg-clone ssh://aus1500-home//pool/ws/nfs41-clone nfs41-sync
397b36b5473d
=== clone open tree: ssh://aus1500-home//pool/ws/nfs41-clone ===
requesting all changes
adding changesets
adding manifests
adding file changes
added 7515 changesets with 101024 changes to 51313 files
updating working directory
42507 files updated, 0 files merged, 0 files removed, 0 files unresolved
2a39f20bc20e
=== clone closed tree: ssh://aus1500-home//pool/ws/nfs41-clone/usr/closed ===
requesting all changes
adding changesets
adding manifests
adding file changes
added 968 changesets with 8269 changes to 4389 files
updating working directory
2677 files updated, 0 files merged, 0 files removed, 0 files unresolved

~/bin/hg-clone is a simple script to get both the open and closed versions of the gate.

And reparent it to onnv-clone

[th199096@aus1500-home th199096]> ws nfs41-sync/

Workspace			: /pool/ws/th199096/nfs41-sync
Workspace Parent		: ssh://aus1500-home//pool/ws/nfs41-clone
Proto area ($ROOT)		: /pool/ws/th199096/nfs41-sync/proto/root_i386
Root of source ($SRC)		: /pool/ws/th199096/nfs41-sync/usr/src
Root of test source ($TSRC)  : /pool/ws/th199096/nfs41-sync/usr/ontest
Current directory ($PWD)	: /pool/ws/th199096/nfs41-sync

[th199096@aus1500-home nfs41-sync]> hg reparent ssh://onnv.eng//export/onnv-clone

Pull and merge

[th199096@aus1500-home nfs41-sync]> hg pull -u
pulling from ssh://onnv.eng//export/onnv-clone
searching for changes
adding changesets
adding manifests
adding file changes
added 210 changesets with 2122 changes to 1808 files (+1 heads)
not updating, since new heads added
(run 'hg heads' to see heads, 'hg merge' to merge)
[th199096@aus1500-home nfs41-sync]> hg merge
merging usr/src/cmd/Makefile
merging usr/src/cmd/zfs/zfs_iter.c
merging usr/src/cmd/zfs/zfs_main.c
merging usr/src/lib/Makefile
merging usr/src/lib/libzfs/common/libzfs.h
merging usr/src/lib/libzfs/common/libzfs_dataset.c
merging usr/src/lib/libzfs/common/mapfile-vers
merging usr/src/pkgdefs/Makefile
merging usr/src/pkgdefs/SUNWcsu/prototype_com
merging usr/src/pkgdefs/SUNWhea/prototype_com
merging usr/src/pkgdefs/etc/exception_list_sparc
merging usr/src/uts/common/Makefile.files
merging usr/src/uts/common/Makefile.rules
merging usr/src/uts/common/fs/zfs/dsl_dataset.c
merging usr/src/uts/common/fs/zfs/zfs_ioctl.c
merging usr/src/uts/common/sys/Makefile
merging usr/src/uts/common/sys/fs/zfs.h
merging usr/src/uts/intel/Makefile.intel.shared
merging usr/src/uts/intel/os/minor_perm
merging usr/src/uts/intel/os/name_to_major
merging usr/src/uts/sparc/Makefile.sparc.shared
merging usr/src/uts/sparc/os/minor_perm
merging usr/src/uts/sparc/os/name_to_major
1785 files updated, 23 files merged, 62 files removed, 0 files unresolved
(branch merge, don't forget to commit)

So, I used filemerge to do any manual editing in the above merge. It is invoked in my .hgrc:

# Merge tool
[merge-patterns]
** = filemerge

[merge-tools]
filemerge.executable = /ws/onnv-tools/teamware/bin/filemerge
filemerge.args = -a $base $local $other $output
filemerge.checkchanged = true
filemerge.gui = true

Then I would commit and repeat the cycle for the closed branch:

[th199096@aus1500-home nfs41-sync]> hg commit
[th199096@aus1500-home nfs41-sync]> ws usr/closed

Workspace			: /pool/ws/th199096/nfs41-sync/usr/closed
Workspace Parent		: ssh://aus1500-home//pool/ws/nfs41-clone/usr/closed
Proto area ($ROOT)		: /pool/ws/th199096/nfs41-sync/usr/closed/proto/root_i386
Root of source ($SRC)		: /pool/ws/th199096/nfs41-sync/usr/closed/usr/src
Root of test source ($TSRC)  : /pool/ws/th199096/nfs41-sync/usr/closed/usr/ontest
Current directory ($PWD)	: /pool/ws/th199096/nfs41-sync/usr/closed

[th199096@aus1500-home closed]> hg reparent ssh://onnv.eng//export/onnv-clone/usr/closed
[th199096@aus1500-home closed]> hg pull -u
pulling from ssh://onnv.eng//export/onnv-clone/usr/closed
searching for changes
adding changesets
adding manifests
adding file changes
added 15 changesets with 145 changes to 137 files (+1 heads)
not updating, since new heads added
(run 'hg heads' to see heads, 'hg merge' to merge)
[th199096@aus1500-home closed]> hg merge
135 files updated, 0 files merged, 10 files removed, 0 files unresolved
(branch merge, don't forget to commit)
[th199096@aus1500-home closed]> hg commit

The next step is a clone/build on one of the build machines. As this looks a lot like the cloning in this section and the build from the prior, I'm going to leave it out.

After the build and verification is done, we prepare the nfs41-gate for the integration. Because the branch merge comments are not in the approved RTI format, we need to turn off sanity checking for this operation. Note that it is okay to keep it on for developers pushing to the gate:

[th199096@aus1500-home ~]> su - nfs4hg
Password:
Sun Microsystems Inc.	SunOS 5.11	snv_92	January 2008
[nfs4hg@aus1500-home ~]> cd /pool/ws/nfs41-gate/usr/closed/.hg
[nfs4hg@aus1500-home .hg]> cp hgrc hgrc.good
[nfs4hg@aus1500-home .hg]> vi hgrc
[nfs4hg@aus1500-home .hg]> diff hgrc hgrc.good
73c73
< #pretxnchangegroup.1 = python:hook.sanity.sanity
---
> pretxnchangegroup.1 = python:hook.sanity.sanity

Reparent to the gate and push

[th199096@aus1500-home closed]> hg reparent ssh://nfs4hg@aus1500-home//pool/ws/nfs41-gate/usr/closed
[th199096@aus1500-home closed]> hg push
pushing to ssh://nfs4hg@aus1500-home//pool/ws/nfs41-gate/usr/closed
searching for changes
Are you sure you wish to push? [y/N]: y
pushing to ssh://nfs4hg@aus1500-home//pool/ws/nfs41-gate/usr/closed
...
remote: Preparing gk email...
remote: ...gk email sent

Fix the .hgrc to turn sanity checking back on, and repeat the whole process for the open bits.



Thursday Sep 25, 2008

Debugging a refcnt

A refcnt is a way to keep track of how many outstanding holds are being used on a data structure. Say that you need to reference something in the data structure, but you don't want to do it while holding a lock. You do want to make sure that the data is still there when you go back. What you do is bump the refcnt and have the cleanup code never free that data until the refcnt goes back to 0.

Some issues: what happens if it never goes back to 0? Or if it accidentally goes back to 0 before you are done with it? The answers are that you leak memory and crash, in that order.

I'm debugging the scenario where the refcnt goes back to 0 before you release your hold. Before I dive into looking at that, let's look at the relevant code in the mirror mount source: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c (Note that this code will change over time, I'm linking to the copy of the gate.)

We bump the count here:

    291 static void
    292 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t *net)
    293 {
    294 	mutex_enter(&net->net_cnt_lock);
    295 	net->net_refcnt++;
    296 	ASSERT(net->net_refcnt != 0);
    297 	mutex_exit(&net->net_cnt_lock);
    298 }

The first thing to note is that we grab net_cnt_lock before we do anything. How do we even know that is valid? Well, we grab a lock on the mountinfo and check to see if there is an ephemeral node:

    696 	mutex_enter(&mi->mi_lock);
...
    701 	if (mi->mi_ephemeral_tree == NULL) {
...
    727 	} else {
    728 		net = mi->mi_ephemeral_tree;
    729 		mutex_exit(&mi->mi_lock);
    730 
    731 		nfs4_ephemeral_tree_hold(net);

Oh, that is very naughty indeed - what I said and what the code says disagree. What do other parts of the code say?

   1422 	mutex_enter(&mi->mi_lock);
...
   1444 	nfs4_ephemeral_tree_hold(net);
...
   1528 	mutex_exit(&mi->mi_lock);

And a special case in nfs4_ephemeral_harvest_forest

   2098 	mutex_enter(&ntg->ntg_forest_lock);
   2099 	for (net = ntg->ntg_forest; net != NULL; net = next) {
   2100 		next = net->net_next;
   2101 
   2102 		nfs4_ephemeral_tree_hold(net);

I expect the code at 2102 not to hold the mntinfo4; we don't have our hands on that info at the time. But I believe the code at 731 is a bug that will eventually bite us.

But back to the hold code:

    295 	net->net_refcnt++;
    296 	ASSERT(net->net_refcnt != 0);

We bump the refcnt and immediately assert that it is not 0. As refcnt is an unsigned int, this will catch the case where we wrap around past the maximum value of an integer. We don't ever expect to hit this case.

It is in the rele case that we are more concerned:

    300 /*
    301  * We need a safe way to decrement the refcnt whilst the
    302  * lock is being held.
    303  */
    304 static void
    305 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t *net)
    306 {
    307 	ASSERT(mutex_owned(&net->net_cnt_lock));
    308 	ASSERT(net->net_refcnt != 0);
    309 	net->net_refcnt--;
    310 }
    311 
    312 static void
    313 nfs4_ephemeral_tree_rele(nfs4_ephemeral_tree_t *net)
    314 {
    315 	mutex_enter(&net->net_cnt_lock);
    316 	nfs4_ephemeral_tree_decr(net);
    317 	mutex_exit(&net->net_cnt_lock);
    318 }

First note that either we hold the lock ourselves and call nfs4_ephemeral_tree_decr, or we let nfs4_ephemeral_tree_rele grab the lock for us. The assert on 307 checks that this is valid. The assert on 308 checks that we are not already at 0 when we decrement. Because the refcnt is unsigned, it can never actually drop below 0, so a check like the following would always pass:

                ASSERT(net->net_refcnt >= 0);

By the way, we do not need to be holding mi->mi_lock everywhere we rele or decr. The reason we need it when we hold is that we need to know the memory is valid. With the other two operations, the caller already has a hold, so by definition the memory must be valid.

To prove that statement, we need to show that the ephemeral tree can not be disassociated from the mntinfo4 without the mi->mi_lock being held. The only place we do that is in nfs4_ephemeral_umount:

   2003 		/*
   2004 		 * At this point, the tree should no
   2005 		 * longer be associated with the
   2006 		 * mntinfo4. We need to pull it off
   2007 		 * there and let the harvester take
   2008 		 * care of it once the refcnt drops.
   2009 		 */
   2010 		mutex_enter(&mi->mi_lock);
   2011 		mi->mi_ephemeral_tree = NULL;
   2012 		mutex_exit(&mi->mi_lock);

Okay, further validation that we should have the lock held at 731.

The final piece of the puzzle is what happens in the harvester. I.e., we decided at 2011 to disassociate the tree from the mntinfo4, but we didn't release the memory. How do we do that? Well, we look in nfs4_ephemeral_harvest_forest and we also see why we don't care about holding mi->mi_lock there.

   2098 	mutex_enter(&ntg->ntg_forest_lock);
   2099 	for (net = ntg->ntg_forest; net != NULL; net = next) {
   2100 		next = net->net_next;
   2101 
   2102 		nfs4_ephemeral_tree_hold(net);

When we hold ntg->ntg_forest_lock, no other thread is allowed to manipulate the forest of ephemeral trees. What that means is that the current forest is locked and the memory is not going away. Hence we don't need to hold mi->mi_lock. The other code has to hold it because the reference in the mntinfo4 can go away (see line 2011).

   2140 		while (e) {
...
   2247 			e = prior;
   2248 		}
   2249 
   2250 check_done:
   2251 
   2252 		/*
   2253 		 * At this point we are done processing this tree.
   2254 		 *
   2255 		 * If the tree is invalid and we are the only reference
   2256 		 * to it, then we push it on the local linked list
   2257 		 * to remove it at the end. We avoid that action now
   2258 		 * to keep the tree processing going along at a fair clip.
   2259 		 *
   2260 		 * Else, even if we are the only reference, we drop
   2261 		 * our hold on the current tree and allow it to be
   2262 		 * reused as needed.
   2263 		 */
   2264 		mutex_enter(&net->net_cnt_lock);
   2265 		if (net->net_refcnt == 1 &&
   2266 		    net->net_status & NFS4_EPHEMERAL_TREE_INVALID) {
   2267 			nfs4_ephemeral_tree_decr(net);
   2268 			net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
   2269 			mutex_exit(&net->net_cnt_lock);
   2270 			mutex_exit(&net->net_tree_lock);
   2271 
   2272 			if (prev)
   2273 				prev->net_next = net->net_next;
   2274 			else
   2275 				ntg->ntg_forest = net->net_next;
   2276 
   2277 			net->net_next = harvest;
   2278 			harvest = net;
   2279 			continue;
   2280 		}
   2281 
   2282 		nfs4_ephemeral_tree_decr(net);
   2283 		net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
   2284 		mutex_exit(&net->net_cnt_lock);
   2285 		mutex_exit(&net->net_tree_lock);
   2286 
   2287 		prev = net;
   2288 	}
   2289 	mutex_exit(&ntg->ntg_forest_lock);

Line 2248 completes the while loop started at 2140 above. Among other things, this loop decides if a mirror mount has been idle long enough to automatically release it. It does this with recursion elimination, visiting every child and sibling node of the root of the ephemeral tree.

The interesting code starts on lines 2265 and 2266. If the refcnt is 1, then the hold we established on 2102 is the only reference to this node. Then, if the tree has been invalidated, either by the harvest loop above (this time or earlier) or by a manual unmount, we know it is safe to release this node. We decrement the refcnt at 2267. This is important because if we left it at 1 and pushed it on the harvest list, another thread could do a rele and we would think that was okay. We remove it from the forest in lines 2272 to 2275 and push it onto a local linked list, harvest, on lines 2277 and 2278. We then go on to process the next node.

When we have finished all trees in the forest, they are all in one of 3 states:

  1. Not in use by a thread and harvested.
  2. Not idled out. May or may not be in use by a thread.
  3. In use by a thread, idled out, and not harvested.

With all of that background out of the way, let us look at a coredump:

> $c
vpanic()
assfail+0x7e(fffffffff8490b98, fffffffff8490b70, 134)
nfs4_ephemeral_tree_decr+0x64(ffffff020db7fc18)
nfs4_ephemeral_umount_unlock+0x3f(ffffff0004f33c14, ffffff0004f33bb8)
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
nfs4_free_mount+0x2fe(ffffff047ae40dc8, 400, ffffff01974f2b28)
nfs4_free_mount_thread+0x24(ffffff016623a9f8)
thread_start+8()
> \*ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}
> ffffff0004f33c14::print boolean_t
1 (B_TRUE)

This is the assert on line 308. Some things we can tell right away are that this should be a manual unmount, since we came through nfs4_free_mount and not the harvester. The flags are 0x400, which means MS_FORCE - a forced umount call. Also, while the refcnt is 0, the tree has not been harvested. If it had been, we would expect the status to include NFS4_EPHEMERAL_TREE_INVALID. And indeed, the mntinfo4 (see net_mount) does still point to this node:

> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree
mi_ephemeral_tree = 0xffffff020db7fc18
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}

So, who did us in? Well, the first thought I had was this piece of code in nfs4_ephemeral_umount (which we would have just called):

   1990 		nfs4_ephemeral_tree_decr(net);
   1991 		mutex_exit(&net->net_cnt_lock);
   1992 
   1993 		if (was_locked == FALSE)
   1994 			mutex_exit(&net->net_tree_lock);

We only take a hold in here if we do not set was_locked to TRUE:

   1764 	int			was_locked = FALSE;
...
   1845 			was_locked = TRUE;
   1846 		} else {
   1847 			net->net_refcnt++;
   1848 			ASSERT(net->net_refcnt != 0);
   1849 		}
   1850 
   1851 		mutex_exit(&net->net_cnt_lock);

Hmm, we also avoid using nfs4_ephemeral_tree_hold. We should probably create a symmetric function for incr like we did for decr.

Anyway, I fixed the code at 1990 and we still see the panic. The new one has the same signature. By the way, I found this probable error by matching holds to reles. I stopped with just the simple cases when I found the above issue.

Okay, what does nfs4_free_mount() do?

   4039 	/*
   4040 	 * If we got an error, then do not nuke the
   4041 	 * tree. Either the harvester is busy reclaiming
   4042 	 * this node or we ran into some busy condition.
   4043 	 *
   4044 	 * The harvester will eventually come along and cleanup.
   4045 	 * The only problem would be the root mount point.
   4046 	 *
   4047 	 * Since the busy node can occur for a variety
   4048 	 * of reasons and can result in an entry staying
   4049 	 * in df output but no longer accessible from the
   4050 	 * directory tree, we are okay.
   4051 	 */
   4052 	if (!nfs4_ephemeral_umount(mi, flag, cr,
   4053 	    &must_unlock, &eph_tree))
   4054 		nfs4_ephemeral_umount_activate(mi, &must_unlock,
   4055 		    &eph_tree);
   4056 

So we call nfs4_ephemeral_umount, which had that suspect piece of code. We know that we will not set NFS4_EPHEMERAL_TREE_LOCKED ourselves. We can see that it was not held, because pmust_unlock was set to TRUE. Also, the status tells us we did not mark the tree as NFS4_EPHEMERAL_TREE_INVALID, and the fact that the mntinfo4 pointer is valid tells us we did not disassociate the tree from it. All of which tells me fixing line 1990 would probably have had no impact on this execution path. I.e., I'm starting to suspect we don't have a race condition here.

The only way to return 0 and not disassociate the tree is to go down this path back in nfs4_ephemeral_umount:

   1945 	if (!is_derooting) {
   1946 		/*
   1947 		 * Only work on children if the caller has not already
   1948 		 * done so.
   1949 		 */
   1950 		if (!is_recursed) {
   1951 			ASSERT(eph != NULL);
   1952 
   1953 			error = nfs4_ephemeral_unmount_engine(eph,
   1954 			    FALSE, flag, cr);
   1955 			if (error)
   1956 				goto is_busy;
   1957 		}
   1958 	} else {

And as a matter of fact, the evidence suggests that no other code did this either. nfs4_ephemeral_unmount_engine must have been error free.

So either we have decremented the refcnt twice for our one hold or we decremented when we did not have a hold.

Heh, I've gone down many a code branch, only to backtrack as I found the code was correct. Perhaps we should return to our debug session? nfs4_ephemeral_umount_activate takes a pointer to a mntinfo4 and a pointer to a pointer to a nfs4_ephemeral_tree_t:

nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
> *ffffff0004f33bb8::print nfs4_ephemeral_tree_t
{
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    }
    net_status = 0
    net_refcnt = 0
}
> ffffff0524673000::print mntinfo4_t mi_ephemeral
mi_ephemeral = 0

The mntinfo4's mi_ephemeral is not pointing at the same thing the function was passed. And I have more faith in the function's pointer, because it had to just execute:

   1733 	if (mi->mi_ephemeral) {
   1734 		/*
   1735 		 * If we are the root node of an ephemeral branch
   1736 		 * which is being removed, then we need to fixup
   1737 		 * pointers into and out of the node.
   1738 		 */
   1739 		if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
   1740 			nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
   1741 
   1742 		ASSERT(mi->mi_ephemeral != NULL);
   1743 
   1744 		kmem_free(mi->mi_ephemeral, sizeof (*mi->mi_ephemeral));
   1745 		mi->mi_ephemeral = NULL;
   1746 	}
   1747 	mutex_exit(&mi->mi_lock);
   1748 
   1749 	nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);

I.e., if we pass in a mntinfo4, either the ephemeral tree is already NULL or we free it and NULL it out. And oh crap, if mi->mi_ephemeral was a pointer to the same memory location as *pnet, we just screwed the pooch and caused a core. The only place we should be removing this memory is at:

   2010 		mutex_enter(&mi->mi_lock);
   2011 		mi->mi_ephemeral_tree = NULL;
   2012 		mutex_exit(&mi->mi_lock);

Well, this entry is long enough as it is. I'll convince myself that removing line 1744 is the correct thing to do and then test that fix. Note, this kmem_free might explain why systems do not always core: if the memory is garbage, all it has to be is non-zero for the assert not to trigger. Whether other nasty things could have happened remains to be seen.

UPDATE: Hmm, one is a mi->mi_ephemeral and the other is a mi->mi_ephemeral_tree. The end result may stand, but my analysis may be off. Back with more later!


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Friday Dec 07, 2007

Steve Dickson (of RedHat) releases prototype pseudo-fs root for Linux NFSv4

I've slammed the Linux NFSv4 implementation before for not having the same namespace as NFSv3. I.e., it used the 'fsid=0' hack to export the root of the v4 namespace and thus that path may not be the same as '/'.

Well, over on the nfsv4 <at> linux-nfs.org mailing list, Steve just announced a prototype which fixes that problem! And the crowd goes wild!

The following patch series gives rpc.mountd the ability to allocate
a dynamic pseudo root, so the 'fsid=0' export option is no longer 
required. This allows v2, v3 and v4 clients mounts without any 
changes to the server's exports list.

One anomaly of the Linux NFS server is that it requires a pseudo root
to be defined. Currently the only way a pseudo root can be defined is by 
setting the fsid to zero (i.e. fsid=0). So if we wanted to make v4
the default mounting version and have things just work like v2/v3
all of the existing exports configurations would have to change 
(i.e. a 'fsid=0' would have to be added) to support a v4 mounts,
which, imho, is unacceptable. So this patch series address
this problem.

I think this might also mark the first major piece of work on the Linux NFSv4 code to come from some place other than CITI. I might be wrong, but I think this is a sign of the maturity of the code.


Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Tuesday Jun 26, 2007

OpenSolaris Project Models and pNFS

I believe that the majority of OpenSolaris development occurs within Sun Microsystems Engineering. As much as we would like for it to snowball in the wild, that has not happened. I'm saying this from my biased view, I know some projects have been proposed externally from Sun, e.g., the i18n port of the closed library. I also acknowledge the work that Dennis Clark is leading for the PPC port. There are more and I am not trying to take away from them. I am relating my experience with trying to get projects off the ground on OpenSolaris - see for example OpenSolaris Project: NFS Server in non-Global Zones.

So what does happen is that a new project gets started and there is no external indication of forward progress. People might start asking for code drops and the reality is that, because of the huge internal pressure towards quality in Sun Engineering, that is not going to happen until the code has baked a bit. It gets to the point that a prime question on new project proposals is whether code will be released. Again, there isn't some hidden agenda within Sun to withhold the code - we are just new to this model and we want things to be perfect, not just good enough.

Look back at the discussion that went on for Project Proposal -- Honeycomb Information and dev tools and the lack of a code drop. The OpenSolaris Project: HoneyComb Fixed Content Storage already shows a binary drop and plans for a code drop in the Fall of 2007. Some valid reasons for a group to not drop code right away are that they do not understand the process (they need someone to help them) and they need to clear a legal hurdle to make sure that they are not violating the rights of either an individual or a company. I've seen both occur internally. The good news is that we have internal people ready and willing to help development groups.

What I find really exciting are projects that have a significant external presence. And sometimes that external presence doesn't contribute directly to the code work. In NFSv4 and NFSv4.1, the external collaboration takes place through the IETF and Connectathon. Both companies and open source developers come together to design and implement future NFS protocol extensions. Interoperability across multiple OS platforms is ensured via the yearly meetings at Connectathon. And with the UMICH CITI developers working on Projects: NFS Version 4 Open Source Reference Implementation, which is mainly distributed to Linux, but forms a reference for both BSD directly and OSX indirectly, and Sun working on OpenSolaris, it is possible for vendors to do compatibility testing all year long.

Take for example NetApp, which provides only an NFS server. They are able to test new NFSv4.1 features against Linux and OpenSolaris clients. Admittedly this isn't new; NetApp was able to use the Solaris 10 beta code to test NFSv4. And the companies in question all sign NDAs and exchange hardware and engineering drops of binaries for testing.

So there is almost no work being driven from OpenSolaris into this open design project. There is an OpenSolaris Project: NFS version 4.1 pNFS, but it is mainly a portal to the Sun NFS team's work. A question that they asked themselves was whether they were going to do binary drops, code drops, or any drop at all. It wasn't a legal issue; the design is done in the open and all of the coding is new development. It wasn't a fear of the unknown; they had already shared binaries in the past. No, rather it was a concern about the impact of providing a drop on the development schedule. Would the overhead of publishing code and/or binaries kill the final deliverable?

Another OpenSolaris reality is that Sun expects to make money. I know that is an evil concept to some open source developers, but we bet the company on being able to deliver quality and sell service along with the source. So making the deadline for the pNFS deliverable is a major concern for the group.

I'm happy that the group decided that they could both deliver on time and make code and binary drops. Lisa just announced for the group the latest drop in FYI: pNFS Code and BFU Archives posted. You can check out the b66 implementation by downloading it. The code is rough in the sense that you wouldn't want to put it in production, but it gives other developers a chance to see what is going on and allows them to test their own implementations. Remember, this code has not been putback into Nevada - it lives in a group workspace. Before OpenSolaris, it would have only been shared under NDA and the expectation that the person installing the code assumed responsibility for any problems.

Project development in OpenSolaris is different from that occurring in other open source communities. There are different hurdles to jump, but there are different expectations as well. Internal developers are proud of the quality that they demand of the code and want to keep that bar high. That in turn makes early code drops hard for them to deliver. It is something they are learning to do. And the pNFS team is leading the way.


Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Monday Feb 12, 2007

Trying to get a Kerberized NFSv4 server/client on an NSLU2

Normally I don't summarize what I'm about to write about; however, this entry is all over the place, and there is useful information buried in here. So: I'm trying to get first Kerberos and then NFSv4 working on an NSLU2 running OpenSlug. To validate my results, I also try to get a Linux NFSv4 server up and running on one of my Shuttle SS51G boxes. I finally get that to work, but I have no luck getting the NSLU2 working correctly as either a server or a client.

I decided to try another Linux client to see if I could get the process streamlined:

[tdh@sandman ~]> kadmin -p tdh/admin
Couldn't open log file /var/krb5/kdc.log: Permission denied
Authenticating as principal tdh/admin with password.
Password for tdh/admin@INTERNAL.EXCFB.COM:
kadmin:  addprinc -randkey nfs/mrbill.internal.excfb.com
WARNING: no policy specified for nfs/mrbill.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
Principal "nfs/mrbill.internal.excfb.com@INTERNAL.EXCFB.COM" created.
kadmin:  addprinc -randkey host/mrbill.internal.excfb.com
WARNING: no policy specified for host/mrbill.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
Principal "host/mrbill.internal.excfb.com@INTERNAL.EXCFB.COM" created.
kadmin:  ktadd -k /export/keytabs/mrbill.keytab -e des-cbc-crc:normal nfs/mrbill.internal.excfb.com
kadmin: No such file or directory while adding key to keytab

Okay, not only do I need to fix the above, I also need to fix not being able to write to /var/krb5/kdc.log. We can get the keytab generated with:

[tdh@sandman /export]> sudo chown tdh:staff keytabs/

And we see:

kadmin:  ktadd -k /export/keytabs/mrbill.keytab -e des-cbc-crc:normal nfs/mrbill.internal.excfb.com
Entry for principal nfs/mrbill.internal.excfb.com with kvno 4, encryption type DES cbc mode with CRC-32 added to keytab WRFILE:/export/keytabs/mrbill.keytab.
kadmin:  ktadd -k /export/keytabs/mrbill.keytab -e des-cbc-crc:normal host/mrbill.internal.excfb.com
Entry for principal host/mrbill.internal.excfb.com with kvno 3, encryption type DES cbc mode with CRC-32 added to keytab WRFILE:/export/keytabs/mrbill.keytab.

Okay, the first thing to note is that mrbill is running OpenSlug:

root@mrbill:~# uname -a
Linux mrbill 2.6.16 #1 PREEMPT Fri Jun 9 07:34:31 PDT 2006 armv5teb unknown unknown GNU/Linux

We try to get the keytab:

root@mrbill:~# mount sandman:/export/keytabs /mnt/sandman/keytabs
mount: can't get address for sandman
root@mrbill:~# host sandman
-sh: host: not found

Why? Well it turns out that:

root@mrbill:~# cat /etc/resolv.conf
search mshome
nameserver 192.168.2.108
nameserver 182.168.2.1

I thought that the domain entered in the turnup init was for the CIFS domain. Easy enough to fix...

root@mrbill:~# cat /etc/resolv.conf
search internal.excfb.com
nameserver 192.168.2.108
nameserver 182.168.2.1
root@mrbill:~#  mount sandman:/export/keytabs /mnt/sandman/keytabs
root@mrbill:~# cd /etc
root@mrbill:/etc# cp /mnt/sandman/keytabs/mrbill.keytab krb5.keytab
cp: cannot open `/mnt/sandman/keytabs/mrbill.keytab' for reading: Permission denied

What now? (Permissions)

root@mrbill:/etc# ls -la /mnt/sandman/keytabs
total 9
drwxr-xr-x  2 tdh  uucp  512 Feb 12  2007 .
drwxr-xr-x  5 root root 4096 Feb 12 08:22 ..
-rw-r--r--  1 root root 1968 Feb 12 06:50 krb5.conf
-rw-------  1 tdh  uucp  161 Feb 12  2007 mrbill.keytab
-rw-r--r--  1 root root  155 Feb 12 06:48 mrx.keytab

Fix them up on the server and:

root@mrbill:/etc# cp /mnt/sandman/keytabs/mrbill.keytab krb5.keytab

We need to get a good copy of krb5.conf, idmapd.conf, and sysconfig/nfs. For now, we will leave idmapd.conf alone, to illustrate the NFSv4 mapid issue.

root@mrbill:/etc# scp mrx:/etc/krb5.conf .
root@mrbill:/etc# scp mrx:/etc/sysconfig/nfs sysconfig

Now this time I know kerberos is not installed:

root@mrbill:/# ls -la ./usr/kerberos/bin/kinit
ls: ./usr/kerberos/bin/kinit: No such file or directory

And we can easily add it:

root@mrbill:/# ipkg list | grep krb5
kernel-module-rpcsec-gss-krb5 - 2.6.16-r6.6 - rpcsec-gss-krb5 kernel module
root@mrbill:/# ipkg install kernel-module-rpcsec-gss-krb5
Installing kernel-module-rpcsec-gss-krb5 (2.6.16-r6.6) to root...
Downloading http://ipkg.nslu2-linux.org/feeds/slugos-bag/cross/3.10-beta/kernel-module-rpcsec-gss-krb5_2.6.16-r6.6_ixp4xxbe.ipk
Installing kernel-module-auth-rpcgss (2.6.16-r6.6) to root...
Downloading http://ipkg.nslu2-linux.org/feeds/slugos-bag/cross/3.10-beta/kernel-module-auth-rpcgss_2.6.16-r6.6_ixp4xxbe.ipk
Configuring kernel-module-auth-rpcgss
Configuring kernel-module-rpcsec-gss-krb5

Still not there for me:

root@mrbill:/# ls -la ./usr/kerberos/bin/kinit
ls: ./usr/kerberos/bin/kinit: No such file or directory
root@mrbill:/# find . -name kinit

My guess is that you can export with Kerberos; you just can't mount it.

We should confirm that!

root@mrbill:~# mkdir /home/nfs4
root@mrbill:~# chmod 777 /home/nfs4
root@mrbill:~# cd /home/nfs4
root@mrbill:/home/nfs4# touch see_me
root@mrbill:/home/nfs4# chown tdh:10 see_me
root@mrbill:/home/nfs4# ls -la
total 8
drwxrwxrwx  2 root root 4096 Feb 12 09:00 .
drwxrwxr-x  8 root root 4096 Feb 12 09:00 ..
-rw-r--r--  1 tdh  uucp    0 Feb 12 09:00 see_me

And I try to add the export:

root@mrbill:/home/nfs4# more /etc/exports
/home/NFS4 172.16.0.0/16(rw,fsid=0,insecure,no_subtree_check,sync,anonuid=65534,anongid=65534)
root@mrbill:/home/nfs4# cd ..
root@mrbill:/home# ls -la
total 32
drwxrwxr-x   8 root root  4096 Feb 12 09:00 .
drwxr-xr-x  18 root root  4096 Feb  5 22:44 ..
drwxrwxrwx   2 tdh  uucp  4096 Feb  5 23:03 NFS4
drwxrwxrwx   2 root root  4096 Feb 12 09:00 nfs4
drwxr-xr-x   2 root root  4096 Feb  5 22:53 nfsv2
drwxr-xr-x   2 root root  4096 Feb  5 22:53 nfsv3
drwxr-xr-x   2 root root  4096 Feb  5 22:53 nfsv4
lrwxrwxrwx   1 root root     7 Feb  5 22:26 root -> ../root
drwxr-xr-x   2 tdh  staff 4096 Feb  7 21:21 tdh
root@mrbill:/home#

Looks like /home/NFS4 was created for me, or I'm suffering from severe memory loss...

I could have done this last week; note the time stamp.

root@mrbill:/home# ls -la NFS4
total 8
drwxrwxrwx  2 tdh    uucp 4096 Feb  5 23:03 .
drwxrwxr-x  8 root   root 4096 Feb 12 09:00 ..
-rw-r--r--  1 200096 uucp    0 Feb  5 23:03 ut

Must be memory loss!

root@mrbill:/home# cd NFS4/
root@mrbill:/home/NFS4# touch see_me
root@mrbill:/home/NFS4# chown tdh:10 see_me
root@mrbill:/home/NFS4# ls -la
total 8
drwxrwxrwx  2 tdh    uucp 4096 Feb 12 09:03 .
drwxrwxr-x  8 root   root 4096 Feb 12 09:00 ..
-rw-r--r--  1 tdh    uucp    0 Feb 12 09:03 see_me
-rw-r--r--  1 200096 uucp    0 Feb  5 23:03 ut

And yes:

[tdh@mrx ipk]> showmount -e mrbill
Export list for mrbill:
/home/NFS4 172.16.0.0/16

I was in 172.16.0.0/16 space last week. Touch up the export and:

[tdh@mrx ipk]> showmount -e mrbill
Export list for mrbill:
/home/NFS4 192.168.2.0/24
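The touch-up is nothing more than swapping the client subnet in the /etc/exports line and re-running exportfs. A minimal sketch of that edit, done on a scratch copy rather than the real /etc/exports so it is safe to run anywhere:

```shell
# Work on a scratch copy rather than the live /etc/exports.
cat > exports.scratch <<'EOF'
/home/NFS4 172.16.0.0/16(rw,fsid=0,insecure,no_subtree_check,sync,anonuid=65534,anongid=65534)
EOF

# The CIDR contains slashes, so use | as the sed delimiter.
sed -i 's|172.16.0.0/16|192.168.2.0/24|' exports.scratch

cat exports.scratch
# On the real file this would be followed by: exportfs -rv
```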

Okay, I do the mount and I'll claim it gets done as NFSv3:

[tdh@mrx ipk]> sudo mount mrbill:/home/NFS4 /mnt/mrbill/NFS4
[tdh@mrx ipk]> ls -la /mnt/mrbill/NFS4
total 8
drwxrwxrwx 2 tdh    wheel 4096 Feb 12 03:03 .
drwxr-xr-x 3 root   root  4096 Feb 12 11:08 ..
-rw-r--r-- 1 tdh    wheel    0 Feb 12 03:03 see_me
-rw-r--r-- 1 200096 wheel    0 Feb  5 17:03 ut

Why do I claim it is NFSv3? Because if this were NFSv4, the idmapping would be hosed. Can we verify this? Yes:

[tdh@mrx ipk]> sudo umount /mnt/mrbill/NFS4
[tdh@mrx ipk]> sudo mount -o vers=3 mrbill:/home/NFS4 /mnt/mrbill/NFS4
[tdh@mrx ipk]> ls -la /mnt/mrbill/NFS4
total 8
drwxrwxrwx 2 tdh    wheel 4096 Feb 12 03:03 .
drwxr-xr-x 3 root   root  4096 Feb 12 11:08 ..
-rw-r--r-- 1 tdh    wheel    0 Feb 12 03:03 see_me
-rw-r--r-- 1 200096 wheel    0 Feb  5 17:03 ut
[tdh@mrx ipk]> sudo umount /mnt/mrbill/NFS4
[tdh@mrx ipk]> sudo mount -o vers=4 mrbill:/home/NFS4 /mnt/mrbill/NFS4
'vers=4' is not supported.  Use '-t nfs4' instead.
[tdh@mrx ipk]> sudo mount -t nfs4 mrbill:/home/NFS4 /mnt/mrbill/NFS4
mount.nfs4: mount point /mnt/mrbill/NFS4 does not exist
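That exchange is worth spelling out: on a Linux client of this era, NFSv2/v3 are selected with '-o vers=' while NFSv4 is its own filesystem type. A tiny helper sketch that just prints the appropriate invocation rather than running mount, so it is safe to try anywhere (hostname and paths are the ones from this session):

```shell
# Print (not run) the Linux mount invocation for a given NFS version.
nfs_mount_cmd() {
    vers=$1; server=$2; path=$3; mntpt=$4
    if [ "$vers" = 4 ]; then
        # On these clients NFSv4 is a distinct filesystem type...
        echo "mount -t nfs4 $server:$path $mntpt"
    else
        # ...while v2/v3 are chosen with -o vers=
        echo "mount -o vers=$vers $server:$path $mntpt"
    fi
}

nfs_mount_cmd 3 mrbill /home/NFS4 /mnt/mrbill/NFS4
nfs_mount_cmd 4 mrbill /home/NFS4 /mnt/mrbill/NFS4
```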

Okay, mrbill knows nothing about NFSv4 as far as I can tell:

root@mrbill:/home/NFS4# mount -t nfs4 sandman:/export/home /mnt/sandman/home
mount: unknown filesystem type 'nfs4'

I'm sensing protocol discrimination here:

root@mrbill:/home/NFS4# ipkg list | grep -i nfs
kernel-module-lockd - 2.6.16-r6.6 - lockd kernel module; NFS file locking service version 0.5.
kernel-module-nfs - 2.6.16-r6.6 - nfs kernel module
kernel-module-nfs - 2.6.16-r6.4 -
kernel-module-nfsd - 2.6.16-r6.6 - nfsd kernel module
nfs-utils - 1.0.6-r7 - userspace utilities for kernel nfs
nfs-utils-doc - 1.0.6-r7 - userspace utilities for kernel nfs

Time to check the log file:

Feb 12 09:08:29 (none) user.warn kernel: nfsd: nfsv4 idmapping failing: has idmapd not been started?

Okay, configure idmapping and reboot:

Feb 12 09:16:37 (none) user.info kernel: Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
Feb 12 09:16:37 (none) user.warn kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Feb 12 09:16:37 (none) user.warn kernel: NFSD: unable to find recovery directory /var/lib/nfs/v4recovery
Feb 12 09:16:37 (none) user.warn kernel: NFSD: starting 90-second grace period

Try the mount again:

[tdh@mrx ipk]> sudo mount -t nfs4 mrbill:/home/NFS4 /mnt/mrbill/NFS4
mount.nfs4: Permission denied

And try it from a Solaris client:

[tdh@sandman keytabs]> sudo mount mrbill:/home/NFS4 /mnt/mrbill/NFS4
[tdh@sandman keytabs]> sudo mount mrbill:/home/NFS4 /mnt/mrbill/NFS4
NFS compound failed for server mrbill: error 7 (RPC: Authentication error)
NFS compound failed for server mrbill: error 7 (RPC: Authentication error)
NFS compound failed for server mrbill: error 7 (RPC: Authentication error)
NFS compound failed for server mrbill: error 7 (RPC: Authentication error)
NFS compound failed for server mrbill: error 7 (RPC: Authentication error)
NFS compound failed for server mrbill: error 7 (RPC: Authentication error)
nfs mount: mount: /mnt/mrbill/NFS4: Permission denied

Okay, can we get Kerberos working at all on the NSLU2?

root@mrbill:~# more /etc/exports
/home/NFS4 192.168.2.0/24(rw,fsid=0,sec=krb5,insecure,no_subtree_check,sync,anonuid=65534,anongid=65534)
root@mrbill:~# exportfs -rv
exportfs: /etc/exports:1: unknown keyword "sec=krb5"
unexporting sandman.internal.excfb.com:/home/NFS4 from kernel

The keyword is not correct? Time to try it on a known good Linux config:

[tdh@mrx ipk]> cat /etc/exports
/home/tdh 192.168.2.0/24(rw,fsid=0,sec=krb5,insecure,no_subtree_check,sync,anonuid=65534,anongid=65534)
[tdh@mrx ipk]> sudo exportfs -rv
exportfs: /etc/exports:1: unknown keyword "sec=krb5"

Okay, here is what we are supposed to do:

[tdh@mrx ipk]> cat /etc/exports
/home/tdh gss/krb5(rw,fsid=0,insecure,no_subtree_check,sync,anonuid=65534,anongid=65534)
[tdh@mrx ipk]> sudo exportfs -rv
exporting gss/krb5:/home/tdh
exporting gss/krb5:/home/tdh to kernel
gss/krb5:/home/tdh: Cannot allocate memory
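So with nfs-utils of this vintage, Kerberos is not expressed as a sec= option at all; the export is made to the gss/krb5 pseudo-client in place of the address list. A sketch of that rewrite on a scratch copy of the exports file (against the real /etc/exports you would follow it with exportfs -rv):

```shell
# Scratch copy of the line that exportfs rejected.
cat > exports.krb5 <<'EOF'
/home/tdh 192.168.2.0/24(rw,fsid=0,sec=krb5,insecure,no_subtree_check,sync,anonuid=65534,anongid=65534)
EOF

# Old-style nfs-utils: drop the sec=krb5 option and export to the
# gss/krb5 pseudo-client instead of the subnet.
sed -i -e 's|sec=krb5,||' -e 's|192.168.2.0/24|gss/krb5|' exports.krb5

cat exports.krb5
```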

By sheer effort of will, I determined that the firewall was on.

root@mrbill:~# showmount -e mrx
Export list for mrx:
/home/tdh gss/krb5

First let's see what happens without Kerberos:

[tdh@sandman ~]> sudo mount -o vers=3 mrx:/home/tdh /mnt/mrx/tdh
[tdh@sandman ~]> ls -la /mnt/mrx/tdh
total 230394
drwxr-xr-x   7 tdh      staff       4096 Feb 12 02:01 .
drwxr-xr-x   3 root     root         512 Feb 12 11:49 ..

And NFSv4:

[tdh@sandman ~]> sudo mount mrx:/home/tdh /mnt/mrx/tdh
nfs mount: mrx:/home/tdh: No such file or directory

Okay, I knew about this, but had forgotten it: with fsid=0, the export becomes the pseudo-root, so the v4 mount is of '/'. I think I heard Bruce complaining about still having this behavior:

[tdh@sandman ~]> sudo mount mrx:/ /mnt/mrx/tdh
[tdh@sandman ~]> ls -al /mnt/mrx/tdh
total 230394
drwxr-xr-x   7 tdh      nobody      4096 Feb 12 02:01 .
drwxr-xr-x   3 root     root         512 Feb 12 11:49 ..
-rw-------   1 tdh      nobody        68 Feb 12 01:51 .Xauthority
-rw-------   1 tdh      nobody        96 Feb 12 11:31 .lesshst

And now we turn on kerberos:

[tdh@sandman ~]> sudo mount mrx:/ /mnt/mrx/tdh
NFS compound failed for server mrx: error 7 (RPC: Authentication error)
NFS compound failed for server mrx: error 7 (RPC: Authentication error)
NFS compound failed for server mrx: error 7 (RPC: Authentication error)
nfs mount: mount: /mnt/mrx/tdh: Permission denied

We can be very specific about what security flavor we want to use:

[tdh@sandman ~]> sudo mount -o sec=krb5 mrx:/ /mnt/mrx/tdh
nfs mount: mount: /mnt/mrx/tdh: Permission denied

Note that the 'compound failed' messages must have been about AUTH_NONE, AUTH_SYS, and AUTH_DH.

I think I've found the answer in Mike Eisler's blog Real Authentication in NFS, scroll down into the comments:

> Also, does NetApp require a root principle like Solaris did prior to 10?

Actually even prior to Solaris 10, the Solaris NFS server would allow
an NFSv3 mount if root didn't have Kerberos credentials. ONTAP is the
same way. However, if using NFSv4, because NFSv4 has no separate mount
protocol, an NFSv4 server cannot distinguish a mount from a LOOKUP. If
a volume is exported with sec=krb5, then the NFSv4 requests need to be
using Kerberos. Since UNIX clients usually require one to be superuser
to do an NFS mount, superuser (root) needs to have credentials. Root
credentials aren't required, but whatever uid the credentials map to
has to have search permissions for the path name.

And we can try that here:

kadmin:  addprinc root
WARNING: no policy specified for root@INTERNAL.EXCFB.COM; defaulting to no policy
Enter password for principal "root@INTERNAL.EXCFB.COM":
Re-enter password for principal "root@INTERNAL.EXCFB.COM":
Principal "root@INTERNAL.EXCFB.COM" created.

And then we grab a ticket:

[tdh@sandman ~]> sudo kinit root
Password for root@INTERNAL.EXCFB.COM:
[tdh@sandman ~]> sudo mount -o sec=krb5 mrx:/ /mnt/mrx/tdh

Aargh!

[tdh@sandman ~]> ls -la /mnt/mrx/tdh
total 230394
drwxr-xr-x   7 tdh      nobody      4096 Feb 12 02:01 .
drwxr-xr-x   3 root     root         512 Feb 12 11:49 ..
-rw-------   1 tdh      nobody        68 Feb 12 01:51 .Xauthority
-rw-------   1 tdh      nobody        96 Feb 12 11:31 .lesshst

Since we can't even get the export shared with Kerberos on mrbill, that does not explain the issue on that machine.

This works:

[tdh@sandman ~]> sudo mount -o vers=3 mrbill:/home/NFS4 /mnt/mrbill/NFS4

And this does not:

[tdh@sandman ~]> sudo mount -o vers=4 mrbill:/ /mnt/mrbill/NFS4
nfs mount: mount: /mnt/mrbill/NFS4: Resource temporarily unavailable

I'll come back to this later...


Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Sunday Feb 11, 2007

Installing a Kerberos KDC and setting up NFS mounts

We always seem to have problems at Connectathon setting up Kerberos. So I decided to take the cookbook we use there and get Kerberos working on my home systems. Please note that I could easily clean up these notes to not show the errors I make. But then, where is the love?

Also, as with any first foray into a new tool, I have no clue what I am doing. I kinda understand tickets and the ideas behind Kerberos, but I'm really in the dark as to what I'm supposed to do.

First edit /etc/krb5/krb5.conf:

# diff krb5.conf stock/krb5.conf
35c35
<         default_realm = INTERNAL.EXCFB.COM
---
>         default_realm = ___default_realm___
38,41c38,43
<         INTERNAL.EXCFB.COM = {
<                 kdc = sandman.internal.excfb.com
<                 kdc = ultralord.internal.excfb.com
<                 admin_server = sandman.internal.excfb.com
---
>         ___default_realm___ = {
>                 kdc = ___master_kdc___
>                 kdc = ___slave_kdc1___
>                 kdc = ___slave_kdc2___
>                 kdc = ___slave_kdcN___
>                 admin_server = ___master_kdc___
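Filled in, those edits amount to a realms stanza like the following. This sketch writes a scratch copy so it can be run anywhere; the live file is /etc/krb5/krb5.conf:

```shell
# Scratch version of the sections edited above.
cat > krb5.conf.scratch <<'EOF'
[libdefaults]
        default_realm = INTERNAL.EXCFB.COM

[realms]
        INTERNAL.EXCFB.COM = {
                kdc = sandman.internal.excfb.com
                kdc = ultralord.internal.excfb.com
                admin_server = sandman.internal.excfb.com
        }
EOF

grep 'kdc = ' krb5.conf.scratch
```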

Then edit /etc/krb5/kdc.conf:

# diff kdc.conf stock/kdc.conf
32c32
<       INTERNAL.EXCFB.COM = {
---
>       ___default_realm___ = {
41,42d40
<               sunw_dbprob_enable = true
<               sunw_dbprop_master_ulogsize = 1000

Make sure you can get at the KDCs via DNS (or whatever name service /etc/resolv.conf points at):

# host sandman
sandman.internal.excfb.com has address 192.168.2.109
# host sandman.internal.excfb.com
sandman.internal.excfb.com has address 192.168.2.109

Create the Kerberos database:

# /usr/sbin/kdb5_util create -r INTERNAL.EXCFB.COM -s
Initializing database '/var/krb5/principal' for realm 'INTERNAL.EXCFB.COM',
master key name 'K/M@INTERNAL.EXCFB.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:

Start getting some principals:

# /usr/sbin/kadmin.local
Authenticating as principal root/admin@INTERNAL.EXCFB.COM with password.
kadmin.local:  addprinc tdh/admin
WARNING: no policy specified for tdh/admin@INTERNAL.EXCFB.COM; defaulting to no policy
Enter password for principal "tdh/admin@INTERNAL.EXCFB.COM":
Re-enter password for principal "tdh/admin@INTERNAL.EXCFB.COM":
Principal "tdh/admin@INTERNAL.EXCFB.COM" created.

Get some kiprop installed:

kadmin.local:  addprinc -randkey kiprop/sandman.internal.excfb.com
WARNING: no policy specified for kiprop/sandman.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
add_principal: Principal or policy already exists while creating "kiprop/sandman.internal.excfb.com@INTERNAL.EXCFB.COM".
kadmin.local:  addprinc -randkey kiprop/ultralord.internal.excfb.com
WARNING: no policy specified for kiprop/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
Principal "kiprop/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM" created.

Enable kadmin and changepw:

kadmin.local:  ktadd -k /etc/krb5/kadm.keytab kadmin/sandman.internal.excfb.com
Entry for principal kadmin/sandman.internal.excfb.com with kvno 3, encryption type AES-128 CTS mode with 96-bit SHA-1 HMAC added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal kadmin/sandman.internal.excfb.com with kvno 3, encryption type Triple DES cbc mode with HMAC/sha1 added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal kadmin/sandman.internal.excfb.com with kvno 3, encryption type ArcFour with HMAC/md5 added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal kadmin/sandman.internal.excfb.com with kvno 3, encryption type DES cbc mode with RSA-MD5 added to keytab WRFILE:/etc/krb5/kadm.keytab.
kadmin.local:  ktadd -k /etc/krb5/kadm.keytab changepw/sandman.internal.excfb.com
Entry for principal changepw/sandman.internal.excfb.com with kvno 3, encryption type AES-128 CTS mode with 96-bit SHA-1 HMAC added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal changepw/sandman.internal.excfb.com with kvno 3, encryption type Triple DES cbc mode with HMAC/sha1 added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal changepw/sandman.internal.excfb.com with kvno 3, encryption type ArcFour with HMAC/md5 added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal changepw/sandman.internal.excfb.com with kvno 3, encryption type DES cbc mode with RSA-MD5 added to keytab WRFILE:/etc/krb5/kadm.keytab.

Enable kiprop:

kadmin.local:  ktadd -k /etc/krb5/kadm.keytab kiprop/sandman.internal.excfb.com
Entry for principal kiprop/sandman.internal.excfb.com with kvno 3, encryption type AES-128 CTS mode with 96-bit SHA-1 HMAC added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal kiprop/sandman.internal.excfb.com with kvno 3, encryption type Triple DES cbc mode with HMAC/sha1 added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal kiprop/sandman.internal.excfb.com with kvno 3, encryption type ArcFour with HMAC/md5 added to keytab WRFILE:/etc/krb5/kadm.keytab.
Entry for principal kiprop/sandman.internal.excfb.com with kvno 3, encryption type DES cbc mode with RSA-MD5 added to keytab WRFILE:/etc/krb5/kadm.keytab.

Quit:

kadmin.local:  quit

Enable the services:

# svcadm enable -r network/security/krb5kdc
# svcadm enable -r network/security/kadmin

Authenticate the admin account:

# /usr/sbin/kadmin -p tdh/admin
Authenticating as principal tdh/admin with password.
Password for tdh/admin@INTERNAL.EXCFB.COM:
kadmin: Communication failure with server while initializing kadmin interface

Hmm, I got the right password. I can see what happens when it is wrong:

# /usr/sbin/kadmin -p tdh/admin
Authenticating as principal tdh/admin with password.
Password for tdh/admin@INTERNAL.EXCFB.COM:
kadmin: Incorrect password while initializing kadmin interface

Ahh, let's see if Kerberos is up and running:

# grep kadmin /var/adm/messages
Feb 11 23:31:19 sandman svc.startd[7]: [ID 748625 daemon.error] network/security/kadmin:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)
Feb 11 23:31:57 sandman kadmin[4143]: [ID 737709 user.error] unable to open connection to ADMIN server (t_error 9)
Feb 11 23:33:56 sandman kadmin[4146]: [ID 737709 user.error] unable to open connection to ADMIN server (t_error 9)

No, it is not.

# svcs -xv
svc:/network/security/kadmin:default (Kerberos administration daemon)
 State: maintenance since Sun Feb 11 23:31:19 2007
Reason: Restarting too quickly.
   See: http://sun.com/msg/SMF-8000-L5
   See: man -M /usr/share/man -s 1M kadmind
   See: /var/svc/log/network-security-kadmin:default.log
Impact: This service is not running.

Clear the maintenance state:

# svcadm clear /network/security/kadmin:default

Restart:

# svcadm enable -r network/security/kadmin

Check:

# svcs -xv #
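This clear/enable/check cycle comes up every time kadmind flaps, so it is worth writing down. A helper sketch that just prints the sequence rather than invoking svcadm, making it safe to run off-box (the FMRI is the one from this session):

```shell
# Print the SMF recovery sequence for a service stuck in maintenance.
smf_recover_cmds() {
    fmri=$1
    echo "svcadm clear $fmri"
    echo "svcadm enable -r $fmri"
    echo "svcs -xv"
}

smf_recover_cmds network/security/kadmin:default
```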

And try again:

# /usr/sbin/kadmin -p tdh/admin
Authenticating as principal tdh/admin with password.
Password for tdh/admin@INTERNAL.EXCFB.COM:
kadmin: Communication failure with server while initializing kadmin interface

If we look at kadm5.acl:

*/admin@___default_realm___ *

Hmm, touch that up:

*/admin@INTERNAL.EXCFB.COM *

And for sanity:

# grep default *
kdc.conf:[kdcdefaults]
kdc.conf:               default_principal_flags = +preauth
krb5.conf:[libdefaults]
krb5.conf:        default_realm = INTERNAL.EXCFB.COM
krb5.conf:      ___domainname___ = ___default_realm___
krb5.conf:        default = FILE:/var/krb5/kdc.log
krb5.conf:[appdefaults]

Okay, time to fix up krb5.conf as well:

[domain_realm]
        ___domainname___ = INTERNAL.EXCFB.COM
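The touched-up stanza maps DNS domains onto the realm. Assuming the domain is internal.excfb.com (inferred from the hostnames above; the transcript itself still shows the ___domainname___ placeholder), a scratch-file sketch looks like:

```shell
# Scratch version of the [domain_realm] stanza. The domain name is an
# ASSUMPTION based on the hostnames used elsewhere in this setup; the
# transcript leaves the ___domainname___ placeholder in place.
cat > domain_realm.scratch <<'EOF'
[domain_realm]
        .internal.excfb.com = INTERNAL.EXCFB.COM
        internal.excfb.com = INTERNAL.EXCFB.COM
EOF

grep 'INTERNAL.EXCFB.COM' domain_realm.scratch
```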

And restart:

# svcadm restart network/security/krb5kdc
# svcadm restart network/security/kadmin

And try again:

# /usr/sbin/kadmin -p tdh/admin
Authenticating as principal tdh/admin with password.
Password for tdh/admin@INTERNAL.EXCFB.COM:
kadmin: Communication failure with server while initializing kadmin interface

Okay, we know it is talking to something, i.e., it understands a bad password.

Let's try something else:

# kadmin.local
Authenticating as principal root/admin@INTERNAL.EXCFB.COM with password.
kadmin.local:  addprinc admin/admin@INTERNAL.EXCFB.COM
WARNING: no policy specified for admin/admin@INTERNAL.EXCFB.COM; defaulting to no policy
Enter password for principal "admin/admin@INTERNAL.EXCFB.COM":
Re-enter password for principal "admin/admin@INTERNAL.EXCFB.COM":
Principal "admin/admin@INTERNAL.EXCFB.COM" created.
kadmin.local:  quit

Okay, time to search. If we look at the System Administration Guide: Security Services:

Communication failure with server while initializing kadmin interface

    Cause: The host that was entered for the admin server, also called the master KDC,
    did not have the kadmind daemon running.

    Solution: Make sure that you specified the correct host name for the master KDC.
    If you specified the correct host name, make sure that kadmind is running on
    the master KDC that you specified.

But wait:

# svcs | grep krb
online         23:43:04 svc:/network/security/krb5kdc:default
# svcs | grep kad
maintenance    23:42:54 svc:/network/security/kadmin:default
# svcs -vx
svc:/network/security/kadmin:default (Kerberos administration daemon)
 State: maintenance since Sun Feb 11 23:42:54 2007
Reason: Restarting too quickly.
   See: http://sun.com/msg/SMF-8000-L5
   See: man -M /usr/share/man -s 1M kadmind
   See: /var/svc/log/network-security-kadmin:default.log
Impact: This service is not running.

Let's look at the log file:

Feb 11 23:42:53 sandman kadmind[4275](Error): Keytab file "/etc/krb5/kadm5.keytab" does not exist
Feb 11 23:42:53 sandman kadmind[4275](Error): Keytab file "/etc/krb5/kadm5.keytab" does not exist
Feb 11 23:42:53 sandman kadmind[4275](info): No dictionary file specified, continuing without one.
Feb 11 23:42:53 sandman kadmind[4275](Error): Unable to set RPCSEC_GSS service names ('kadmin@sandman.internal.excfb.com,changepw@sandman.internal.excfb.com')
krb5kdc: Interrupted system call - while selecting for network input(1)
Feb 11 23:43:03 sandman krb5kdc[4105](info): shutting down

Hmm, we need to create a keytab:

# ls -la /etc/krb5/kadm5.keytab
/etc/krb5/kadm5.keytab: No such file or directory

Ack, why do I have a kadm.keytab and not a kadm5.keytab?

# mv kadm.keytab kadm5.keytab

Because that is what I frigging entered in my session!

# /usr/sbin/kadmin -p tdh/admin
Authenticating as principal tdh/admin with password.
Password for tdh/admin@INTERNAL.EXCFB.COM:
kadmin:

The correct incantations should have been:

kadmin.local:  ktadd -k /etc/krb5/kadm5.keytab kadmin/sandman.internal.excfb.com
kadmin.local:  ktadd -k /etc/krb5/kadm5.keytab changepw/sandman.internal.excfb.com
kadmin.local:  ktadd -k /etc/krb5/kadm5.keytab kiprop/sandman.internal.excfb.com

Okay, back to our regularly scheduled programming:

What principals exist?

kadmin:  listprincs
K/M@INTERNAL.EXCFB.COM
admin/admin@INTERNAL.EXCFB.COM
changepw/sandman.internal.excfb.com@INTERNAL.EXCFB.COM
kadmin/changepw@INTERNAL.EXCFB.COM
kadmin/history@INTERNAL.EXCFB.COM
kadmin/sandman.internal.excfb.com@INTERNAL.EXCFB.COM
kiprop/sandman.internal.excfb.com@INTERNAL.EXCFB.COM
kiprop/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
krbtgt/INTERNAL.EXCFB.COM@INTERNAL.EXCFB.COM
tdh/admin@INTERNAL.EXCFB.COM

To kerberize NFS, we need to touch up /etc/nfssec.conf:

# diff nfssec.conf nfssec.conf.stock
48,50c48,50
< krb5          390003  kerberos_v5     default -               # RPCSEC_GSS
< krb5i         390004  kerberos_v5     default integrity       # RPCSEC_GSS
< krb5p         390005  kerberos_v5     default privacy         # RPCSEC_GSS
---
> #krb5         390003  kerberos_v5     default -               # RPCSEC_GSS
> #krb5i                390004  kerberos_v5     default integrity       # RPCSEC_GSS
> #krb5p                390005  kerberos_v5     default privacy         # RPCSEC_GSS

We need to add a nfs principal:

kadmin:  addprinc -randkey nfs/sandman.internal.excfb.com
WARNING: no policy specified for nfs/sandman.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
Principal "nfs/sandman.internal.excfb.com@INTERNAL.EXCFB.COM" created.
kadmin:  ktadd nfs/sandman.internal.excfb.com
Entry for principal nfs/sandman.internal.excfb.com with kvno 3, encryption type AES-128 CTS mode with 96-bit SHA-1 HMAC added to keytab WRFILE:/etc/krb5/krb5.keytab.
Entry for principal nfs/sandman.internal.excfb.com with kvno 3, encryption type Triple DES cbc mode with HMAC/sha1 added to keytab WRFILE:/etc/krb5/krb5.keytab.
Entry for principal nfs/sandman.internal.excfb.com with kvno 3, encryption type ArcFour with HMAC/md5 added to keytab WRFILE:/etc/krb5/krb5.keytab.
Entry for principal nfs/sandman.internal.excfb.com with kvno 3, encryption type DES cbc mode with RSA-MD5 added to keytab WRFILE:/etc/krb5/krb5.keytab.

Verify that it does indeed exist:

# klist -k
Keytab name: FILE:/etc/krb5/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   3 nfs/sandman.internal.excfb.com@INTERNAL.EXCFB.COM
   3 nfs/sandman.internal.excfb.com@INTERNAL.EXCFB.COM
   3 nfs/sandman.internal.excfb.com@INTERNAL.EXCFB.COM
   3 nfs/sandman.internal.excfb.com@INTERNAL.EXCFB.COM

And now we are going to have to make a kerberized share and set up a client to access it:

# /usr/sbin/kclient

Starting client setup

---------------------------------------------------
Do you want to use DNS for kerberos lookups ? [y/n]: n
        No action performed.
Enter the Kerberos realm: INTERNAL.EXCFB.COM
Specify the KDC hostname for the above realm: sandman.internal.excfb.com
sandman.internal.excfb.com

Note, this system and the KDC's time must be within 5 minutes of each other for Kerberos to function.  Both systems should run some form of time
 synchronization system like Network Time Protocol (NTP).

Setting up /etc/krb5/krb5.conf.

Enter the krb5 administrative principal to be used: tdh/admin
Obtaining TGT for tdh/admin ...
Password for tdh/admin@INTERNAL.EXCFB.COM:

Do you have multiple DNS domains spanning the Kerberos realm INTERNAL.EXCFB.COM ? [y/n]: n
        No action performed.

Do you plan on doing Kerberized nfs ? [y/n]: y

nfs/ultralord.internal.excfb.com entry ADDED to KDC database.
nfs/ultralord.internal.excfb.com entry ADDED to keytab.

host/ultralord.internal.excfb.com entry ADDED to KDC database.
host/ultralord.internal.excfb.com entry ADDED to keytab.

Do you want to copy over the master krb5.conf file ? [y/n]: y
Enter the pathname of the file to be copied: /etc/krb5/krb5.conf
cp: /etc/krb5/krb5.conf and /etc/krb5/krb5.conf are identical

Copy of /etc/krb5/krb5.conf failed, exiting.
---------------------------------------------------
Setup FAILED.

Hmm, how are we supposed to enter that? I bet we need to use /net, which I don't have configured right now. Okay, the hard way:

# scp sandman:/etc/krb5/krb5.conf /etc/krb5/krb5.conf

Now, let's set up a test share:

# cd /export
# mkdir kerberos
# cd kerberos
# touch see_me
# chown tdh:staff see_me
# ls -la
total 4
drwxr-xr-x   2 root     root         512 Feb 12 00:23 .
drwxr-xr-x   4 root     sys          512 Feb 12 00:23 ..
-rw-r--r--   1 tdh      staff          0 Feb 12 00:23 see_me
# share -F nfs -o sec=krb5:krb5i:krb5p -d "Kerberos" /export/kerberos
# share -F nfs -d "Home dirs" /export/home
# share
-               /export/kerberos   sec=krb5,sec=krb5i,sec=krb5p   "Kerberos"
-               /export/home   rw   "Home dirs"

Now try to get some access:

[tdh@ultralord ~]> kinit
kinit(v5): Client not found in Kerberos database while getting initial credentials
[tdh@ultralord ~]> sudo klist -k
Keytab name: FILE:/etc/krb5/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   4 nfs/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 nfs/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 nfs/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 nfs/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 host/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 host/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 host/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM
   4 host/ultralord.internal.excfb.com@INTERNAL.EXCFB.COM

Okay, I think I need to add user principals for tdh:

kadmin:  addprinc tdh
WARNING: no policy specified for tdh@INTERNAL.EXCFB.COM; defaulting to no policy
Enter password for principal "tdh@INTERNAL.EXCFB.COM":
Re-enter password for principal "tdh@INTERNAL.EXCFB.COM":
Principal "tdh@INTERNAL.EXCFB.COM" created.

[tdh@ultralord ~]> kinit
Password for tdh@INTERNAL.EXCFB.COM:

And now I want to get a mount:

[tdh@ultralord ~]> sudo mkdir -p /mnt/sandman/home
[tdh@ultralord ~]> sudo mkdir -p /mnt/sandman/kerberos
[tdh@ultralord ~]> sudo showmount -e sandman
export list for sandman:
/export/kerberos (everyone)
/export/home     (everyone)
[tdh@ultralord ~]> sudo mount sandman:/export/kerberos /mnt/sandman/kerberos
[tdh@ultralord ~]> sudo mount sandman:/export/home /mnt/sandman/home
[tdh@ultralord ~]> ls -al /mnt/sandman/kerberos
total 4
drwxr-xr-x   2 root     root         512 Feb 12 00:23 .
drwxr-xr-x   4 root     root         512 Feb 12 00:36 ..
-rw-r--r--   1 tdh      staff          0 Feb 12 00:23 see_me
[tdh@ultralord ~]> ls -la /mnt/sandman/home
total 22
drwxr-xr-x   4 root     root         512 Dec 30 15:01 .
drwxr-xr-x   4 root     root         512 Feb 12 00:36 ..
drwx------   2 root     root        8192 Dec 20 11:28 lost+found
drwxr-xr-x   4 tdh      staff        512 Jan 21 20:48 tdh

Success!

But wait, we need to show that a client without kerberos enabled will be denied access to sandman:/export/kerberos:

[tdh@kanigix ~]> sudo mkdir -p /mnt/sandman/home
[tdh@kanigix ~]> sudo mkdir -p /mnt/sandman/kerberos
[tdh@kanigix ~]> sudo mount sandman:/export/kerberos /mnt/sandman/kerberos
nfs mount: mount: /mnt/sandman/kerberos: Permission denied

Some other things to do would be to set up /etc/pam.conf to allow single sign-on, i.e., using ssh without a password. We also need to set up ultralord as a slave.
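For the single sign-on piece, the relevant module is pam_krb5. Here is a minimal sketch of the sort of /etc/pam.conf lines involved (the stacking and flags are assumptions to verify against the pam_krb5(5) man page, not a tested configuration):

```
# /etc/pam.conf fragment (sketch only):
# try Kerberos first for authentication, fall back to UNIX passwords
other   auth sufficient   pam_krb5.so.1
other   auth required     pam_unix_auth.so.1
```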

But before I tune this out, we need to get a Linux client up and running. Why? Because we need to show we can interoperate.

Some systems only support single DES, so we need to create special keytabs for them:

kadmin:  addprinc -randkey nfs/mrx.internal.excfb.com
WARNING: no policy specified for nfs/mrx.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
Principal "nfs/mrx.internal.excfb.com@INTERNAL.EXCFB.COM" created.
kadmin:  addprinc -randkey host/mrx.internal.excfb.com
WARNING: no policy specified for host/mrx.internal.excfb.com@INTERNAL.EXCFB.COM; defaulting to no policy
Principal "host/mrx.internal.excfb.com@INTERNAL.EXCFB.COM" created.

Now, I've created /export/keytabs to store the keytab files we will need:

# cd /export
# mkdir keytabs
# share -F nfs -o ro /export/keytabs

And we can create the keytab:

kadmin:  ktadd -k /export/keytabs/mrx.keytab -e des-cbc-crc:normal nfs/mrx.internal.excfb.com
Entry for principal nfs/mrx.internal.excfb.com with kvno 3, encryption type DES cbc mode with CRC-32 added to keytab WRFILE:/export/keytabs/mrx.keytab.
kadmin:  ktadd -k /export/keytabs/mrx.keytab -e des-cbc-crc:normal host/mrx.internal.excfb.com
Entry for principal host/mrx.internal.excfb.com with kvno 3, encryption type DES cbc mode with CRC-32 added to keytab WRFILE:/export/keytabs/mrx.keytab.

We see we are in business:

# cp /etc/krb5/krb5.conf /export/keytabs/
# ls -la
total 10
drwxr-xr-x   2 root     root         512 Feb 12 00:50 .
drwxr-xr-x   5 root     sys          512 Feb 12 00:46 ..
-rw-r--r--   1 root     root        1968 Feb 12 00:50 krb5.conf
-rw-------   1 root     root         155 Feb 12 00:48 mrx.keytab
# chmod +r mrx.keytab

And now we set up the Linux machine:

[root@mrx ~]# mkdir -p /mnt/sandman/keytabs
[root@mrx ~]# showmount -e sandman
Export list for sandman:
/export/kerberos (everyone)
/export/home     (everyone)
/export/keytabs  (everyone)
[root@mrx ~]# mount sandman:/export/keytabs /mnt/sandman/keytabs

We should make sure we do not have access to sandman:/export/kerberos:

[root@mrx ~]# mkdir -p /mnt/sandman/kerberos
[root@mrx ~]# mkdir -p /mnt/sandman/home
[root@mrx ~]# mount sandman:/export/kerberos /mnt/sandman/kerberos
mount: sandman:/export/kerberos failed, security flavor not supported

What do we need to change?

[root@mrx ~]# cd /etc
[root@mrx etc]# ls -la k*
-rw-r--r-- 1 root root  657 Jan  9 14:03 krb5.conf
-rw-r--r-- 1 root root 2241 Jul 13  2006 krb.conf
-rw-r--r-- 1 root root 1296 Jul 13  2006 krb.realms
[root@mrx etc]# mkdir stock
[root@mrx etc]# cp k* stock
[root@mrx etc]# cp /mnt/sandman/keytabs/krb5.conf .
cp: overwrite `./krb5.conf'? y
[root@mrx etc]# cp /mnt/sandman/keytabs/mrx.keytab krb5.keytab

And we try to authenticate:

[tdh@mrx ~]> kinit
kinit: Command not found.

Okay, we need to install the kerberos packages:

[tdh@mrx /]> sudo yum install krb5-workstation
Loading "installonlyn" plugin
Setting up Install Process
Setting up repositories
Reading repository metadata in from local files
Parsing package install arguments
Nothing to do

No, we don't. Where is that rascally rabbit?

[tdh@mrx /]> sudo find . -name kinit
./usr/kerberos/bin/kinit
[tdh@mrx /]> ./usr/kerberos/bin/kinit
Password for tdh@INTERNAL.EXCFB.COM:

And we try the mount:

[tdh@mrx /]> sudo mount sandman:/export/kerberos /mnt/sandman/kerberos
mount: sandman:/export/kerberos failed, security flavor not supported
[tdh@mrx /]> ./usr/kerberos/bin/klist
Ticket cache: FILE:/tmp/krb5cc_1066
Default principal: tdh@INTERNAL.EXCFB.COM

Valid starting     Expires            Service principal
02/12/07 01:01:42  02/12/07 09:01:42  krbtgt/INTERNAL.EXCFB.COM@INTERNAL.EXCFB.COM
        renew until 02/13/07 00:59:17



Kerberos 4 ticket cache: /tmp/tkt1066
klist: You have no tickets cached

What is up here?

# snoop -x 0,2000 -o /tmp/m2s.snoop sandman mrx
Using device /dev/hme (promiscuous mode)
33 ^C

Note: I used -x 0,2000 to get payload data. I knew I would want to look at most of the packet.

And

[tdh@mrx ~]> sudo mount -t nfs4 sandman:/export/kerberos /mnt/sandman/kerberos
mount.nfs4: Operation not permitted

 26   0.00034 mrx.internal.excfb.com -> sandman      NFS C 4 () PUTFH FH=324D LOOKUP export GETFH GETATTR 10011a 30a23a
 27   0.00030      sandman -> mrx.internal.excfb.com NFS R 4 () NFS4_OK PUTFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=30E6 GETATTR NFS4_OK
 28   0.00033 mrx.internal.excfb.com -> sandman      NFS C 4 () PUTFH FH=30E6 LOOKUP kerberos GETFH GETATTR 10011a 30a23a
 29   0.00021      sandman -> mrx.internal.excfb.com NFS R 4 () NFS4ERR_WRONGSEC PUTFH NFS4_OK LOOKUP NFS4ERR_WRONGSEC

I popped into wireshark and I found out that mrx is only sending AUTH_SYS and AUTH_NULL.

Note: I used wireshark because it will parse the payload data for me. I didn't want to be doing byte conversions and consulting some specs!

In NetApp Filer, NFSv4, and Linux, we find a suggestion to use -o sec=krb5. We can try that:

[tdh@mrx ~]> sudo mount -t nfs4 -o sec=krb5 sandman:/export/kerberos /mnt/sandman/kerberos
Warning: rpc.gssd appears not to be running.
mount.nfs4: Invalid argument

Which is strange, since it is configured to start at boot:

[tdh@mrx ~]> sudo chkconfig --list | grep rpcgssd
rpcgssd         0:off   1:off   2:off   3:on    4:on    5:on    6:off
[tdh@mrx ~]> sudo chkconfig --list | grep rpcidmapd
rpcidmapd       0:off   1:off   2:off   3:on    4:on    5:on    6:off

What does the log state:

RPC: Couldn't create auth handle (flavor 390003)

I've copied the stock krb5.conf back and now the diffs are:

[tdh@mrx /etc]> diff krb5.conf stock/krb5.conf
7c7
<  default_realm = INTERNAL.EXCFB.COM
---
>  default_realm = EXAMPLE.COM
14,17c14,17
<  INTERNAL.EXCFB.COM = {
<   kdc = sandman.internal.excfb.com:88
<   admin_server = sandman.internal.excfb.com:749
<   default_domain = internal.excfb.com
---
>  EXAMPLE.COM = {
>   kdc = kerberos.example.com:88
>   admin_server = kerberos.example.com:749
>   default_domain = example.com
21,22c21,22
<  .internal.excfb.com = INTERNAL.EXCFB.COM
<  internal.excfb.com = INTERNAL.EXCFB.COM
---
>  .example.com = EXAMPLE.COM
>  example.com = EXAMPLE.COM

You know what, rpc.gssd is not running!

[tdh@mrx /etc]> ps -ef | grep rpc
rpc       1877     1  0 01:49 ?        00:00:00 portmap
root      1898     1  0 01:49 ?        00:00:00 rpc.statd
root      1931     1  0 01:49 ?        00:00:00 rpc.idmapd
tdh       2697  2519  0 02:04 pts/0    00:00:00 grep rpc

[tdh@mrx /etc]> sudo sh -c "ulimit -c unlimited;/usr/sbin/rpc.gssd -f -vvv"
Using keytab file '/etc/krb5.keytab'
Processing keytab entry for principal 'nfs/mrx.internal.excfb.com@INTERNAL.EXCFB.COM'
We will use this entry (nfs/mrx.internal.excfb.com@INTERNAL.EXCFB.COM)
Processing keytab entry for principal 'host/mrx.internal.excfb.com@INTERNAL.EXCFB.COM'
We will NOT use this entry (host/mrx.internal.excfb.com@INTERNAL.EXCFB.COM)
Using (machine) credentials cache: 'MEMORY:/tmp/krb5cc_machine_INTERNAL.EXCFB.COM'

And I put it in the background. Hmm, why doesn't it like the host entry?

Alright, I went back to why rpc.gssd isn't starting up at boot. The init script has:

[ -f /etc/sysconfig/nfs ] && . /etc/sysconfig/nfs
[ "${SECURE_NFS}" != "yes" ] && exit 0

# ls -la /etc/sysconfig/nfs
#

Time to create it (look at Learning NFSv4 with Fedora Core 2 (Linux 2.6.5 kernel)):

# This entry should be "yes" if you are using RPCSEC_GSS_KRB5 (auth=krb5,krb5i, or krb5p)
SECURE_NFS="yes"
# This entry sets the number of NFS server processes.  8 is the default
RPCNFSDCOUNT=8
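The steps above (create the file, then start the service) can be wrapped in a small helper. This is a sketch; the destination path is taken as a parameter so the content can be staged and reviewed before landing in /etc:

```shell
# write_nfs_sysconfig: generate the /etc/sysconfig/nfs contents that the
# rpcgssd init script checks for; takes the destination path as $1.
write_nfs_sysconfig() {
  cat > "$1" <<'EOF'
# Set to "yes" when using RPCSEC_GSS_KRB5 (sec=krb5, krb5i, or krb5p)
SECURE_NFS="yes"
# Number of NFS server processes; 8 is the default
RPCNFSDCOUNT=8
EOF
}

# Usage (matches what the post does by hand):
#   write_nfs_sysconfig /etc/sysconfig/nfs
#   /etc/init.d/rpcgssd start
```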

[tdh@mrx sysconfig]> sudo /etc/init.d/rpcgssd start
Starting RPC gssd:                                         [  OK  ]

God I'm totally hacked about this:

[tdh@mrx sysconfig]> sudo mount -o sec=krb5 sandman:/export/kerberos /mnt/sandman/kerberos
[tdh@mrx sysconfig]> ls -la /mnt/sandman/kerberos
total 5
drwxr-xr-x 2 root root   512 Feb 12 00:23 .
drwxr-xr-x 5 root root  4096 Feb 12 00:49 ..
-rw-r--r-- 1 tdh  wheel    0 Feb 12 00:23 see_me

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Sunday Dec 31, 2006

How NFSv4 should work when crossing filesystems

In Some fun with NFSv4 and automount across a ssh tunnel, I revealed the work going on in Solaris for Mirror Mounts. The example was a desire to automount across a ssh tunnel. Well, I dusted off wont, the box from hell (being used by my son for video games) and created some zfs filesystems on it:

# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zoo                398K   118G  24.5K  /zoo
zoo/home           256K   118G  35.5K  /export/zfs
zoo/home/braves   24.5K   118G  24.5K  /export/zfs/braves
zoo/home/kanigix  24.5K   118G  24.5K  /export/zfs/kanigix
zoo/home/loghyr   24.5K   118G  24.5K  /export/zfs/loghyr
zoo/home/mrx      24.5K   118G  24.5K  /export/zfs/mrx
zoo/home/nfsv2    24.5K   118G  24.5K  /export/zfs/nfsv2
zoo/home/nfsv3    24.5K   118G  24.5K  /export/zfs/nfsv3
zoo/home/nfsv4    24.5K   118G  24.5K  /export/zfs/nfsv4
zoo/home/spud     24.5K   118G  24.5K  /export/zfs/spud
zoo/home/tdh      24.5K   118G  24.5K  /export/zfs/tdh
# uname -a
SunOS wont 5.11 snv_55 i86pc i386 i86pc

I then opened a ssh tunnel to it on my Fedora Core 4 box and did a little bit of exploring:

[tdh@adept tdh]> uname -a
Linux adept 2.6.15-1.1833_FC4 #1 Wed Mar 1 23:41:37 EST 2006 i686 i686 i386 GNU/Linux
[tdh@adept ~/usenix]> ssh -fN -L "5049:wont:2049" wont
Password:
[tdh@adept ~/usenix]> sudo mount -o port=5049 -t nfs4 localhost:/ /nfs4/wont
[tdh@adept ~/usenix]> cd /nfs4/wont
[tdh@adept wont]> ls -la
total 6
drwxr-xr-x  38 root root 1024 Dec 31 17:49 .
drwxr-xr-x   4 root root 4096 Dec 31 18:17 ..
drwxr-xr-x   4 root sys   512 Dec 31 17:50 export
[tdh@adept wont]> cd export
[tdh@adept export]> ls -la
total 4
drwxr-xr-x   4 root sys   512 Dec 31 17:50 .
drwxr-xr-x  38 root root 1024 Dec 31 17:49 ..
drwxr-xr-x  11 root sys    11 Dec 31 17:50 zfs
[tdh@adept export]> cd zfs
[tdh@adept zfs]> ls -la
total 16
drwxr-xr-x  11 root sys  11 Dec 31 17:50 .
drwxr-xr-x   4 root sys 512 Dec 31 17:50 ..
drwxr-xr-x   2 root sys   2 Dec 31 17:50 braves
drwxr-xr-x   2 root sys   2 Dec 31 17:50 kanigix
drwxr-xr-x   2 root sys   2 Dec 31 17:50 loghyr
drwxr-xr-x   2 root sys   2 Dec 31 17:50 mrx
drwxr-xr-x   2 root sys   2 Dec 31 17:50 nfsv2
drwxr-xr-x   2 root sys   2 Dec 31 17:50 nfsv3
drwxr-xr-x   2 root sys   2 Dec 31 17:50 nfsv4
drwxr-xr-x   2 root sys   2 Dec 31 17:50 spud
drwxr-xr-x   2 root sys   2 Dec 31 17:50 tdh
[tdh@adept zfs]> cd tdh
[tdh@adept tdh]> ls -la
total 3
drwxr-xr-x   2 root sys  2 Dec 31 17:50 .
drwxr-xr-x  11 root sys 11 Dec 31 17:50 ..

Notice that I only did one mount command. As I crossed down into the exported filesystems, the Linux 2.6 implementation of NFSv4 did the mounts automatically for me in the background. Also, note that since '/' is not exported from wont, this must be a pseudo-fs:

[tdh@adept tdh]> showmount -e wont
Export list for wont:
/export/zfs         (everyone)
/export/zfs/nfsv2   (everyone)
/export/zfs/nfsv3   (everyone)
/export/zfs/nfsv4   (everyone)
/export/zfs/tdh     (everyone)
/export/zfs/loghyr  (everyone)
/export/zfs/kanigix (everyone)
/export/zfs/mrx     (everyone)
/export/zfs/spud    (everyone)
/export/zfs/braves  (everyone)

Let's export '/' and see what happens:

# share -F nfs -o rw -d "root" /

And on the Linux box:

[tdh@adept tdh]> cd /nfs4/wont
[tdh@adept wont]> ls -la
total 6
drwxr-xr-x  38 root root 1024 Dec 31 17:49 .
drwxr-xr-x   4 root root 4096 Dec 31 18:17 ..
drwxr-xr-x   4 root sys   512 Dec 31 17:50 export

What happened? Why didn't we see the root directory on wont? Well, when we did the mount command earlier, we basically got a reference to a file handle in the pseudo-fs. We need to flush this by umounting and remounting:

[tdh@adept wont]> cd
[tdh@adept ~]> sudo umount /nfs4/wont/
[tdh@adept ~]> sudo mount -o port=5049 -t nfs4 localhost:/ /nfs4/wont
[tdh@adept ~]> cd /nfs4/wont
[tdh@adept wont]> ls -la
total 67
drwxr-xr-x  38 root root 1024 Dec 31 17:49 .
drwxr-xr-x   4 root root 4096 Dec 31 18:17 ..
lrwxrwxrwx   1 root root    9 Dec 31 13:17 bin -> ./usr/bin
drwxr-xr-x   5 root sys   512 Dec 31 14:12 boot
drwxr-xr-x   2 root root  512 Dec 31 14:51 Desktop
drwxr-xr-x  24 root sys  4096 Dec 31 14:42 dev
drwxr-xr-x  10 root sys   512 Dec 31 14:42 devices
drwxr-xr-x   2 root root  512 Dec 31 14:51 Documents
drwxr-xr-x   9 root root  512 Dec 31 17:31 .dt
-rwxr-xr-x   1 root root 5111 Dec 31 14:51 .dtprofile
-rw-------   1 root root   16 Dec 31 17:31 .esd_auth
drwxr-xr-x  87 root sys  4608 Dec 31 17:52 etc
drwxr-xr-x   4 root sys   512 Dec 31 17:50 export
...

Let's walk down the paths again and see what happens:

[tdh@adept wont]> cd export
[tdh@adept export]> ls -la
total 5
drwxr-xr-x   4 root sys   512 Dec 31 17:50 .
drwxr-xr-x  38 root root 1024 Dec 31 17:49 ..
drwxr-xr-x   2 root root  512 Dec 31 13:17 home
drwxr-xr-x  11 root sys    11 Dec 31 17:50 zfs
[tdh@adept export]> cd zfs
[tdh@adept zfs]> ls -la
total 16
drwxr-xr-x  11 root sys  11 Dec 31 17:50 .
drwxr-xr-x   4 root sys 512 Dec 31 17:50 ..
drwxr-xr-x   2 root sys   2 Dec 31 17:50 braves
drwxr-xr-x   2 root sys   2 Dec 31 17:50 kanigix
drwxr-xr-x   2 root sys   2 Dec 31 17:50 loghyr
drwxr-xr-x   2 root sys   2 Dec 31 17:50 mrx
drwxr-xr-x   2 root sys   2 Dec 31 17:50 nfsv2
drwxr-xr-x   2 root sys   2 Dec 31 17:50 nfsv3
drwxr-xr-x   2 root sys   2 Dec 31 17:50 nfsv4
drwxr-xr-x   2 root sys   2 Dec 31 17:50 spud
drwxr-xr-x   2 root sys   2 Dec 31 17:50 tdh
[tdh@adept zfs]> cd tdh
[tdh@adept tdh]> ls -la
total 3
drwxr-xr-x   2 root sys  2 Dec 31 17:50 .
drwxr-xr-x  11 root sys 11 Dec 31 17:50 ..

Let's make sure we are in the right place:

# scp sandman:/export/home/tdh/.tcshrc .
Password:
.tcshrc              100% |*****************************************************************|  5417       00:00
# chown tdh:staff .tcshrc
# ls -la
total 18
drwxr-xr-x   2 root     sys            3 Dec 31 18:10 .
drwxr-xr-x  11 root     sys           11 Dec 31 17:50 ..
-rw-------   1 tdh      staff       5417 Dec 31 18:10 .tcshrc

And on the client:

[tdh@adept tdh]> ls -la
total 9
drwxr-xr-x   2 root sys       3 Dec 31 18:10 .
drwxr-xr-x  11 root sys      11 Dec 31 17:50 ..
-rw-------   1 tdh  nobody 5417 Dec 31 18:10 .tcshrc
[tdh@adept tdh]> grep 10 /etc/group
wheel:x:10:root

The nobody shows up for the group because there is no mapping between the string "staff" and "wheel". In NFSv3, the numeric 10 would have gone across the wire and the ls command would have spit out "wheel".
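As an aside, another common reason for ids collapsing to nobody over NFSv4 is a mismatch in the identity-mapping domain between client and server. A sketch of where that setting lives (the domain value here is hypothetical, and fixing the group name above would additionally require a matching group entry on the client):

```
# Solaris: /etc/default/nfs (read by nfsmapid)
NFSMAPID_DOMAIN=example.com

# Linux: /etc/idmapd.conf (read by rpc.idmapd)
[General]
Domain = example.com
```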

Okay, let's check to see what the Solaris client would have done:

[tdh@sandman ~]> ssh -fN -L "5049:wont:2049" wont
Password:
[tdh@sandman ~]> su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_54  October 2007
# mkdir -p /nfs4/wont
# mount -o port=5049 localhost:/ /nfs4/wont
# exit
[tdh@sandman ~]> cd /nfs4/wont
[tdh@sandman wont]> ls -la
total 134
drwxr-xr-x  38 root     root        1024 Dec 31 17:49 .
drwxr-xr-x   3 root     root         512 Dec 31 18:17 ..
...
drwxr-xr-x   2 root     root         512 Dec 31 14:51 Desktop
drwxr-xr-x   2 root     root         512 Dec 31 14:51 Documents
lrwxrwxrwx   1 root     root           9 Dec 31 13:17 bin -> ./usr/bin
drwxr-xr-x   5 root     sys          512 Dec 31 14:12 boot
drwxr-xr-x  24 root     sys         4096 Dec 31 14:42 dev
drwxr-xr-x  10 root     sys          512 Dec 31 14:42 devices
drwxr-xr-x  87 root     sys         4608 Dec 31 17:52 etc
drwxr-xr-x   4 root     sys          512 Dec 31 17:50 export
...
[tdh@sandman wont]> cd export
[tdh@sandman export]> ls -la
total 9
drwxr-xr-x   4 root     sys          512 Dec 31 17:50 .
drwxr-xr-x  38 root     root        1024 Dec 31 17:49 ..
drwxr-xr-x   2 root     root         512 Dec 31 13:17 home
drwxr-xr-x  11 root     sys           11 Dec 31 17:50 zfs
[tdh@sandman export]> cd zfs
[tdh@sandman zfs]> ls -la
total 5
drwxr-xr-x  11 root     sys           11 Dec 31 17:50 .
drwxr-xr-x   4 root     sys          512 Dec 31 17:50 ..

Okay, we have hit the crux of the problem for Mirror Mounts. We have a filesystem crossing on the server which needs to be mirrored on the client. We have to do this manually (or with an automounter if the ports are open):

[tdh@sandman zfs]> cd
[tdh@sandman ~]> su -
Password:
Sun Microsystems Inc.   SunOS 5.11      snv_54  October 2007
# mount -o port=5049 localhost:/export/zfs /nfs4/wont/export/zfs
# ls -la /nfs4/wont/export/zfs
total 32
drwxr-xr-x  11 root     sys           11 Dec 31 17:50 .
drwxr-xr-x   4 root     sys          512 Dec 31 17:50 ..
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 braves
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 kanigix
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 loghyr
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 mrx
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 nfsv2
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 nfsv3
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 nfsv4
drwxr-xr-x   2 root     sys            2 Dec 31 17:50 spud
drwxr-xr-x   2 root     sys            3 Dec 31 18:10 tdh
# tcsh
# ls -la /nfs4/wont/export/zfs/tdh
total 6
drwxr-xr-x   2 root     sys            3 Dec 31 18:10 .
drwxr-xr-x  11 root     sys           11 Dec 31 17:50 ..
# mount -o port=5049 localhost:/export/zfs/tdh /nfs4/wont/export/zfs/tdh
# ls -la  /nfs4/wont/export/zfs/tdh
total 18
drwxr-xr-x   2 root     sys            3 Dec 31 18:10 .
drwxr-xr-x  11 root     sys           11 Dec 31 17:50 ..
-rw-------   1 tdh      staff       5417 Dec 31 18:10 .tcshrc

Notice how '/export/zfs' gave information about the child filesystems whereas '/' did not. Also, note how we get the correct group name because '/etc/group' is the same on the two Solaris hosts. Finally, even with ZFS presenting the child filesystems, we still had to manually mount each child in order to peer into it.

So the Mirror Mounts project in the NFSv4 development team is going to fix all of this. Under the hood, the client is going to understand it is about to traverse to a different filesystem and do the equivalent of a NFSv3 mount.


Originally posted on Kool Aid Served Daily
Copyright (C) 2006, Kool Aid Served Daily

Saturday Dec 30, 2006

Some fun with NFSv4 and automount across a ssh tunnel

I just saw an interesting question on one of the internal Sun developer lists:

ssh -fN -L "3049:somehost.someext:2049" jonb@somehost.someext
mount -o port=3049 localhost:/f /f

This allows you to mount a filesystem from a remote machine over a ssh tunnel. The question was why doesn't automounting also work?

so I have a

/-              auto_direct     -nosuid,rw

in auto_master

/f -rw,port=3049 localhost:/f

in auto_direct.

And since this is Solaris on both ends, by default it is NFSv4 and we shouldn't need anything but port 2049. The question was why this didn't work. A good resource on this is Tunneling NFS traffic via ssh, which is provided by Spencer Shepler. In the comments, it specifically mentions that all you need is port 2049 for NFSv4 traffic.

In my reply on the mailing list, I pointed out that this does not apply to the automounter. It has to call on the server's portmapper, which listens on port 111. Let's repeat the experiment and see if we can prove this is what is happening.

[tdh@sandman ~]> ssh -fN -L "3049:ultralord:2049" ultralord
The authenticity of host 'ultralord (192.168.2.104)' can't be established.
RSA key fingerprint is a5:e6:e6:9c:e2:9c:2e:d9:6c:d2:91:c0:ed:41:8f:38.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ultralord,192.168.2.104' (RSA) to the list of known hosts.
Password:

And

# mount -o port=3049 localhost:/f/1 /f
# ls -la /f
total 4
drwxr-xr-x   2 root     root         512 Dec 30 15:08 .
drwxr-xr-x  24 root     root         512 Dec 30 15:08 ..
# tcsh
# ls -la /f
total 4
drwxr-xr-x   2 root     root         512 Dec 30 15:11 .
drwxr-xr-x  24 root     root         512 Dec 30 15:08 ..
-rw-r--r--   1 root     root           0 Dec 30 15:11 ultralord
# showmount -e localhost
showmount: localhost: RPC: Program not registered

I'll claim I don't need to set up the automounter. I'll do that later to confirm things. Okay, let's try some snoop. Krap! snoop only works on non-loopback interfaces. I found a dtrace script which is supposed to work: DTrace TCP Snoop, but I need to work on it a bit more. Anyway, let's look at how showmount works over a regular interface:

[tdh@sandman ~]> showmount -e ultralord
export list for ultralord:
/f/1 (everyone)
/f/2 (everyone)

And the snoop:

# snoop sandman ultralord
Using device /dev/hme (promiscuous mode)
     sandman -> ultralord PORTMAP C GETPORT prog=100005 (MOUNT) vers=1 proto=TCP
     ultralord -> sandman PORTMAP R GETPORT port=36795
     sandman -> ultralord MOUNT1 C Get export list
     ultralord -> sandman MOUNT1 R Get export list 2 entries

Which can be translated as:

sandman asks ultralord's portmapper: "Do you support MOUNT vers 1 over TCP?"
ultralord replies: "Yeah and you can get it on port 36795."
sandman then asks ultralord's mountd: "What exports do you have?"
ultralord then replies: "/f/1 and /f/2"

It makes sense that the automounter would also use the MOUNT protocol. It basically needs to do a MOUNTPROC3_DUMP (see RFC 1813) in order to determine what the server has available. In effect, you would have to punch a hole for both port 111 and the port on which ultralord advertises mountd. And nothing specifies that the server has to use the same port every time.
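To see which extra port a hole would need to be punched for, you can ask the portmapper directly. A sketch (ultralord is the server from the example; the awk filter just parses the columns that `rpcinfo -p` prints: program, version, proto, port, service):

```shell
# mountd_ports: given `rpcinfo -p <host>` output on stdin, print the
# unique proto/port pairs that mountd is registered on.
mountd_ports() {
  awk '$5 == "mountd" { print $3 "/" $4 }' | sort -u
}

# Usage against a live server:
#   rpcinfo -p ultralord | mountd_ports
```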

With NFSv4, you do not need an automounter to discover all of the exports on a server. But the implementation shipped in Solaris 10 does not support the automatic mounting of new filesystems as you cross boundaries. In part, it was an engineering decision to ship NFSv4 without this functionality. You can use automounters to get some of the equivalent functionality. You also have to look at typical usage of shares in pre-ZFS Solaris: small sets of shares per server and no child shares.

The problem is given this set of shares on a server:

/    ro
/f   rw=.eng.sun.com,ro=@172.16
/f/1 rw=.lab.sun.com,anon=0
/f/2 rw,root=.lab.sun.com

How do you dynamically determine where you can go? We haven't specified that this is a Solaris server, so let's assume that '/', '/f', '/f/1', and '/f/2' are all different filesystems. Per the NFSv4 spec (RFC 3530), you should be able to do the following:

# mount -o vers=4 sandman:/ /sandman
...
[tdh@sandman ~]> cd /sandman
[tdh@sandman ~]> ls -la
total 6
drwxr-xr-x   3 tdh      staff        512 Dec 30 16:18 .
drwxr-xr-x   5 tdh      staff        512 Dec 30 16:18 ..
drwxr-xr-x   2 tdh      staff        512 Dec 30 16:18 f
[tdh@sandman ~]> cd f
[tdh@sandman ~]> ls -la
total 10
drwxr-xr-x   5 tdh      staff        512 Dec 30 16:18 .
drwxr-xr-x  24 root     root         512 Dec 30 15:07 ..
drwxr-xr-x   2 tdh      staff        512 Dec 30 15:11 1
drwxr-xr-x   2 tdh      staff        512 Dec 30 15:11 2
drwxr-xr-x   3 tdh      staff        512 Dec 30 16:18 3
[tdh@sandman ~]> cd 1
[tdh@sandman ~]> ls -la
total 4
drwxr-xr-x   2 tdh      staff        512 Dec 30 15:11 .
drwxr-xr-x   5 tdh      staff        512 Dec 30 16:18 ..
-rw-r--r--   1 tdh      staff          0 Dec 30 15:11 ultralord
[tdh@sandman ~]> cd ../3
[tdh@sandman ~]> ls -la
total 6
drwxr-xr-x   3 tdh      staff        512 Dec 30 16:21 .
drwxr-xr-x   5 tdh      staff        512 Dec 30 16:18 ..
drwxr-xr-x   2 tdh      staff        512 Dec 30 16:18 g

Under the current Solaris implementation, you would need to make the mounts manually before each of the cd commands - except for the last one, since '/f/3' lives on the filesystem represented by '/f'.
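What makes this browsing possible in RFC 3530 terms is the server-side pseudo filesystem: the server synthesizes every directory on the path from '/' down to each export so a client mounting '/' can walk to all of them. A toy model of that synthesis (the function and its shape are mine, not the Solaris implementation):

```python
# Given a server's export list, compute every pseudo-fs node the
# server must expose so a client can browse from '/' to each export.
# A toy model -- real servers also track which nodes are real exports.
def pseudo_nodes(exports):
    nodes = {"/"}
    for path in exports:
        parts = path.strip("/").split("/")
        # every ancestor of an export must be reachable in the pseudo fs
        for i in range(len(parts)):
            nodes.add("/" + "/".join(parts[: i + 1]))
    return sorted(nodes)

print(pseudo_nodes(["/f/1", "/f/2"]))  # → ['/', '/f', '/f/1', '/f/2']
```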

The NFS development team is actively working on fixing this issue; we call the fix Mirror Mounts. Once that functionality is in place, you will be able to explore a remote host across both a firewall and an ssh tunnel via only one port: 2049.


Originally posted on Kool Aid Served Daily
Copyright (C) 2006, Kool Aid Served Daily