Be careful of assumptions

I had been working on fixing 6775211: mirror mounts use the zcred; should use caller's cr augmented w/ PRIV_SYS_MOUNT before the Xmas break. I had it down to unit testing and thanks to Nico Williams, I even had some great unit tests. I just ran out of time to work on the bug.

Well it jumped in priority from a P4 to a P2 and was impacting other engineers. So I thought it might be easy to dust off and finish. I was wrong.

The bug was strange for several reasons, foremost was that it involved doing something that never occurred to us during development. The issue is when you have a NFSv4 client which has a user credential but not a machine credential - a configuration I thought was rare, but which is evidently common amongst Kerberos developers.

Under such a configuration, access is denied to a mirrormount (an automatic mounting in NFSv4 when a filesystem boundary is encountered). The underlying issue is that we use the kcred or zcred to always force the mount to occur. But in the case of there being no machine credentials, those creds do not carry any weight on a server. So we need to extend the privileges given to the user.

As an aside, I would consider this a security bug, but I'm coming to realize that while Kerberos may deal with security, the definition of a security bugs seems to be a bug in which an user gains unauthorized access. So a bug which keeps an authorized user from gaining expected access may be evil, but it is not worthy of a security tag.

Perhaps I am wrong and tempting fate. I hope not, I do not want to see the bug slapped with a security tag when I get up in the morning.

Anyway, back to the bug. The unit test is quite simple - create a share hierarchy on the server that allows a client machine to mount the parent with say AUTH_SYS, but requires kerberized access to the child:

zpool create pnfs1
zfs create pnfs1/scratch
zfs set sharenfs=on pnfs1/scratch
zfs create pnfs1/scratch/aaa
zfs set sharenfs=sec=krb5i pnfs1/scratch
zfs create pnfs1/scratch/aaa/bbb
zfs create pnfs1/scratch/dir1
zfs set sharenfs=sec=krb5i pnfs1/scratch/dir1
zfs create pnfs1/scratch/dir1/dir1
zfs create pnfs1/scratch/dir1/dir3
zfs create pnfs1/scratch/dir1/dir4
zfs create pnfs1/scratch/dir2
zfs set sharenfs=sec=krb5i pnfs1/scratch/dir2
zfs create pnfs1/scratch/dir2/dir3
zfs create pnfs1/scratch/dir2/dir4
zfs create pnfs1/scratch/dir3
zfs create pnfs1/scratch/dir3/dir1
zfs create pnfs1/scratch/dir3/dir2
zfs create pnfs1/scratch/dir4
zfs set sharenfs=sec=krb5i pnfs1/scratch/dir4
zfs create pnfs1/scratch/dir4/dir1
zfs create pnfs1/scratch/dir4/dir2
chown -R thud:staff /pnfs1/scratch/aaa
chown -R lover:staff /pnfs1/scratch/dir1
chown -R boy:staff /pnfs1/scratch/dir2
chown -R imoo:staff /pnfs1/scratch/dir4

Now, there are four configurations possible to show both the original problem and the fix:

IdCode baseMachine credsUser creds Notes
A - pnfs-9-23 onnv_105 Yes Yes Normal configuration
B - pnfs-9-24 onnv_105 No Yes Buggy configuration
C - pnfs-9-25 onnv_105 + fixes Yes Yes Should show no regression
D - pnfs-9-26 onnv_105 + fixes No Yes Should show the fix

Now, I made the claim that A and C should yield the same results (or there would be a regression). I also made the claim that D should yield the same results as both A and C (or there was a regression). And all I really cared about B was that is showed the bug.

So here is A in action:

[thud@pnfs-9-23 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-23 scratch]> ls -la
total 10
drwxr-xr-x   8 root     root           8 Jan 28 15:20 .
dr-xr-xr-x   2 root     root           2 Jan 29 00:08 ..
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 aaa
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir1
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir2
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir3
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 dir4
dr-xr-xr-x   1 root     root           1 Jan 29 00:29 foo
[thud@pnfs-9-23 scratch]> cd aaa
[thud@pnfs-9-23 scratch]> cd aaa
[thud@pnfs-9-23 aaa]> ls -la
total 7
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 .
drwxr-xr-x  10 root     root          10 Jan 29 01:06 ..
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 bbb
[thud@pnfs-9-23 aaa]> cd ..
[thud@pnfs-9-23 scratch]> ls -la 
total 14
drwxr-xr-x  10 root     root          10 Jan 29 01:06 .
dr-xr-xr-x   2 root     root           2 Jan 29 21:18 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 aaa
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir1
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir2
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir3
dr-xr-xr-x   1 root     root           1 Jan 29 21:18 dir4

This shows that a mirrormount only occurs when the user commits the client to accessing a share on the server. And the reporting of 'root:root' in the attributes is expected from prior testing. It mimics what happens with the automounter (which was a design specification). We also see once we do the mount, that we get the proper attributes showing.

I'm going to cheat (which I'll explain shortly), but here is the bug in action:

[thud@pnfs-9-23 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-24 scratch]> cd aaa
aaa: Permission denied.

And here is what we see with the fix for case D:

[thud@pnfs-9-26 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-26 scratch]> ls -la
total 17
drwxr-xr-x   8 root     root           8 Jan 28 15:20 .
dr-xr-xr-x   2 root     root           2 Jan 28 20:59 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 aaa
drwxr-xr-x   5 lover    staff          6 Jan 27 20:22 dir1
drwxr-xr-x   4 boy      staff          5 Jan 27 20:47 dir2
dr-xr-xr-x   1 root     root           1 Jan 28 20:59 dir3
drwxr-xr-x   4 imoo     staff          4 Jan 27 15:46 dir4
[thud@pnfs-9-26 scratch]> cd aaa
[thud@pnfs-9-26 aaa]> ls -la
total 9
drwxr-xr-x   3 thud     staff          3 Jan 27 23:54 .
drwxr-xr-x   7 root     sys            7 Jan 23 21:24 ..
drwxr-xr-x   2 thud     staff          2 Jan 27 23:54 bbb
[thud@pnfs-9-26 aaa]> cd ..
[thud@pnfs-9-26 scratch]> ls -al
total 17
drwxr-xr-x   7 root     sys            7 Jan 23 21:24 .
dr-xr-xr-x   2 root     root           2 Jan 29 21:32 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 23:54 aaa
drwxr-xr-x   5 lover    staff          5 Jan 27 23:54 dir1
drwxr-xr-x   4 boy      staff          4 Jan 27 23:54 dir2
dr-xr-xr-x   1 root     root           1 Jan 29 21:32 dir3
drwxr-xr-x   4 imoo     staff          4 Jan 27 23:54 dir4

The attributes do not match what we saw with the base case! Now it turns out that mirrormounts are not occuring, but a getattr will not necessarily cause one.

There are some red herrings here, why does 'dir3' not show up? It is related to the fact that a WRONGSEC will not be returned for it. But it turns out that the same calls are going across the wire for the security negotiation.

The big clue, in retrospect, is that no mirrormounting is going on. My new code can not be getting called and hence can not be the root cause.

So what assumptions did I make?

  1. C would be the same as A -- which holds!
  2. D would be the same as A and C -- which does not hold!
  3. B's results do not matter -- which is false, it better show the bug

The issues are that I've confused expected test results (access to the kerberized entries) with there being no other changes. But that is patently false, two things have changed from A to D and one from C to D. I.e., I have to remember that D also has no machine credentials!

Okay, back to where I cheated on showing the results for B. I cheated because when I did my unit testing, I ignored the results from the 'ls'. Time to go back and see what I got (which would have made it obvious to the astute reader what was going on):

[thud@pnfs-9-24 thud]> cd /net/theserver/pnfs1/scratch
[thud@pnfs-9-24 scratch]> ls -la
total 17
drwxr-xr-x   8 root     root           8 Jan 28 15:20 .
dr-xr-xr-x   2 root     root           2 Jan 28 20:59 ..
drwxr-xr-x   3 thud     staff          3 Jan 27 21:10 aaa
drwxr-xr-x   5 lover    staff          6 Jan 27 20:22 dir1
drwxr-xr-x   4 boy      staff          5 Jan 27 20:47 dir2
dr-xr-xr-x   1 root     root           1 Jan 28 20:59 dir3
drwxr-xr-x   4 imoo     staff          4 Jan 27 15:46 dir4

So, I can now prove that D provides the same attribute listings without regards to my fix. I really don't care why there is a difference between attribute reportage when I have machine credentials versus when I do not - I care about it behaving consistently.

I actually found the difference was that for some reason the code detected that without machine credentials the new filesystems supported ACLs. Why?, I didn't get that far once I realized my error in assumptions.

It is easy to step back and realize that I made an over generalization in my testing assumptions. I was right with regards to getting access, but wrong on the unexpected attributes. But you also have to remember we never tested in the case where there were no machine credentials. So the lack of attributes was expected. I think it becomes easy to blame what has changed explicitly (versus hidden changes) because that is the first rule we are taught in debugging.

Why do I suddenly see Hannible Lecter asking Clarice:

"First principles, Clarice. Simplicity. Read Kernighan & Ritchie. Of each particular
bug ask: what is it in itself? What is its nature? What has changed, in this code you seek?"

Originally posted on Kool Aid Served Daily
Copyright (C) 2009, Kool Aid Served Daily
Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

tdh

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today