Thursday Feb 01, 2007

Understanding snoop(1M) NFSv3 file handles

Introduction

Reading the snoop(1M) trace of NFS traffic you'll see references to file handles, but interpreting these is not straightforward. As any NFS engineer should tell you, file handles are opaque to the NFS client and are meaningful only to the NFS server that issued them.

For Solaris the file handle is derived from the underlying file system so some parts of the file handle are meaningful only to the underlying file system.

Here's an example snoop(1M) output:

  3   0.00000 v4u-450f-gmp03 -> v4u-80a-gmp03 NFS C GETATTR3 FH=FD0D
  4   0.00035 v4u-80a-gmp03 -> v4u-450f-gmp03 NFS R GETATTR3 OK

So what does FH=FD0D mean? File handles are surely longer than that?

Analysis

To make things easier to read, snoop(1M) hashes the file handle to 16-bits. Check the sum_filehandle() code in OpenSolaris:

sum_filehandle(len)
	int len;
{
	int i, l;
	int fh = 0;

	for (i = 0; i < len; i += 4) {
		l = getxdr_long();
		fh \^= (l >> 16) \^ l;
	}

	return (fh);
}

To see the complete file handle we need the verbose output:

# snoop -p3,4 -v -i /tmp/snoop2.out | grep NFS:
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  File handle = [FD0D]
NFS:   0080001000000002000A0000000091EF23696D2E000A000000008FDF5F48F2A0
NFS:  
NFS:  ----- Sun NFS -----
NFS:  
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 0 (OK)
NFS:    File type = 1 (Regular File)
NFS:    Mode = 0644
NFS:     Setuid = 0, Setgid = 0, Sticky = 0
NFS:     Owner's permissions = rw-
NFS:     Group's permissions = r--
NFS:     Other's permissions = r--
NFS:    Link count = 1, User ID = 0, Group ID = 0
NFS:    File size = 301, Used = 1024
NFS:    Special: Major = 0, Minor = 0
NFS:    File system id = 137438953488, File id = 37359
NFS:    Last access time      = 01-Feb-07 15:12:16.398735000 GMT
NFS:    Modification time     = 01-Feb-07 15:12:16.410570000 GMT
NFS:    Attribute change time = 01-Feb-07 15:12:16.410570000 GMT
NFS:  
NFS:  
# 

Note the file system ID and file ID values for later.

To break down that file handle you need to understand the NFS server's implementation, for NFSv3 on OpenSolaris we have:

typedef struct {
	fsid_t	_fh3_fsid;			/\* filesystem id \*/
	ushort_t _fh3_len;			/\* file number length \*/
	char	_fh3_data[NFS_FH3MAXDATA];	/\* and data \*/
	ushort_t _fh3_xlen;			/\* export file number length \*/
	char	_fh3_xdata[NFS_FH3MAXDATA];	/\* and data \*/
} fhandle3_t;

Which means ...

fh3_fsid    0080001000000002
fh3_len     000A
fh3_data    0000000091EF23696D2E
fh3_xlen    000A
fh3_xdata   000000008FDF5F48F2A0

The fh3_fsid is itself a compressed version of the dev_t for the device and the file system type, check cmpldev(). Essentially:

  • the first 32 bits (0x00800010) are the major number shifted right 14 bits plus the minor number
  • the second 32 bits (0x00000002) are the file system type, see struct vfssw

That compressed fsid is what you see in mnttab, for this case:

# grep 800010 /etc/mnttab
/dev/dsk/c0t0d0s0       /       ufs     rw,intr,largefiles,logging,xattr,onerror=panic,dev=800010       1170093759
# 

Reassuringly, file system type 2 is ufs.

The fh3_data is derived from the underlying file system (ufs) which is a ufid structure:

struct ufid {
	ushort_t ufid_len;
	ushort_t ufid_flags;
	int32_t	ufid_ino;
	int32_t	ufid_gen;
};

So 0000000091EF23696D2E breaks down as:

ufid_flags   0000
ufid_ino     000091EF
ufid_gen     23696D2E

Reassuringly again, ufid_ino (the inode) makes sense:

# mdb
> 000091EF=U
                37359           
> !ls -li /export
total 922
     37359 -rw-r--r--   1 root     root         301 Feb  1 15:12 motd

That's the file I was checking from the NFS client and it matches the file ID from the snoop output.

The fh3_xdata represents the export data, ie the exported file system. The inode number in this case is 0x00008FDF. Checking:

> !share
-               /export   rw   ""  
> 00008FDF=U
                36831           
> !ls -lid /export
     36831 drwxr-xr-x   2 root     sys          512 Feb  1 15:12 /export
> 

If you've been paying attention you might be wondering what happened to the file system ID (137438953488). This is the uncompressed dev_t value. We can check it by compressing it (14 bit shift of the major, add the minor):

> 0t137438953488=J
                2000000010      
> (0t137438953488>>0t14) + 0t137438953488&0xffffffff=J
                800010          
> 

Yes, that looks familiar.

Conclusion

As already noted, NFS file handles are only meaningful to the NFS server and this example is just for the Solaris NFSv3 implementation. However, I hope it's given some insight into how that works and with this knowledge it's relatively easy to match snoop(1M) file handles to files on the NFS server.

Wednesday Jan 10, 2007

pfiles(1), locked files and a pinch of salt

Following a few days working this out I thought I'd share my thoughts on why the locking information returned by pfiles(1) should be taken with a pinch of salt.

What pfiles(1) tells us about locked open files

On Solaris, if we run pfiles(1) on a process that has open files it checks each file to see if it has any locks. The way it does this is to use fcntl(2) and F_GETLK. This is more complex than it first appears as it runs the system call from the context of the process being inspected. Suffice to say, it produces output like this:

# pfiles 102226
102226: /home/foo/lock /home/foo/testfile
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:301,0 ino:12582918 uid:54321 gid:7 rdev:24,1
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:1
   1: S_IFCHR mode:0620 dev:301,0 ino:12582918 uid:54321 gid:7 rdev:24,1
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:1
   2: S_IFCHR mode:0620 dev:301,0 ino:12582918 uid:54321 gid:7 rdev:24,1
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:1
   3: S_IFREG mode:0666 dev:314,5 ino:2557867 uid:54321 gid:10 size:343
      O_RDWR
      advisory write lock set by system 0x1 process 102225
      /home/foo/testfile
# 

All fine and dandy, but what is meant by "system" and is that process ID actually useful?

How the process and system IDs are derived

To understand this we need to look much closer at how these values are set and in particular whether we're using NFSv3 or NFSv4. I'm ignoring NFSv2 but that will behave similarly to NFSv3.

There are two sides to this, who sets the lock and who checks it with pfiles(1).

Setting the lock

For NFSv3 we use lockd/statd and the locking code passes the PID of the locking process to the server. There is no 'system id' passed to the server. The NFSv3 server passes the PID to underlying file system in the flock structure of the VOP_FRLOCK() call.

For NFSv4 it is considerably more complicated but essentially it passes what should be an opaque 'lock_owner' structure to the server. This can only by understood by the client that sent it. The NFSv4 server creates a unique identifier which it assigns to the PID member of the flock structure. As with NFSv3 this flock structure is passed to the underlying file system.

The pfiles(1) output

The pfiles(1) output is based on a F_GETLK call for the open file. This returns a structure with l_pid and l_sysid. The values of these is as follows:

For NFSv3 the locking code returns the PID passed to the underlying file system in the original VOP_FRLOCK() call. For locks set for NFSv3 this is the true PID of the locking process. The sysid may be set to zero in which case it is ignored by pfiles(1). If the lock was held by a process on this machine the sysid will be set to the system id of the NFS server plus 0x4000. For locks set by NFSv4 the PID is not the PID but the unique identifier used at lock time on the server.

For NFSv4, if we're the client that set the lock we can return sensible values for the PID and system ID - though the latter is \*always\* the client itself. If we're not the client we fabricate the PID and sysid as follows:

  • sysid: we take the sysid of the server and add 1
  • pid: we add up all the bytes in the clientid and owner
   15370 		/\*
   15371 		 \* Construct a new sysid which should be different from
   15372 		 \* sysids of other systems.
   15373 		 \*/
   15374 
   15375 		flk->l_sysid++;
   15376 		flk->l_pid = lo_to_pid(&lockt_denied->owner);

Interpreting the results

So, with that in mind we have the following result:

NFSv3 set the lock, read by NFSv3

Process ID is correct, no system identification. eg:
      advisory write lock set by process 102250

Or perhaps:

      advisory write lock set by system 0x4001 process 102265

NFSv4 set the lock, read by NFSv3

Process ID is unique but meaningless to the client. No system identification. eg:
      advisory write lock set by process 776458

NFSv3 set the lock, read by NFSv4

The lock owner will not match on the client, result is fabricated. eg:
      advisory write lock set by system 0x2 process 512

NFSv4 set the lock, read by NFSv4

If the lock owner matches, we return the true PID of the process running on this system. The sysid is that of the server as understood by the client. eg:
      advisory write lock set by system 0x1 process 102313
If the lock owner doesn't match the client the result is fabricated. eg:
      advisory write lock set by system 0x2 process 1580

Bugs

While checking this out I found and logged two curious bugs in this area.

If pfiles(1) is run on an x64 box the 64-bit version is run. If the target process is 32-bit it mixes up the process and system id. Compare these two outputs:

# /usr/bin/amd64/pfiles 2353 | grep advisory
      advisory write lock set by system 0x943
# /usr/bin/i86/pfiles 2353 | grep advisory  
      advisory write lock set by system 0x1 process 2371
# 

Jan 11th - the following is not a bug

The other bug is that for failed lock attempts on NFSv4 or NFSv3 the PID is not set, in my test code I set the l_pid to 12345.

My code output:

F_GETLK returned type 2, sysid 1, pid 2371
F_SETLK returned type 2, sysid 1, pid 12345

truss -t fcntl -vall output

fcntl(3, F_GETLK, 0x08047C60)                   = 0
        typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=1  pid=2371
fcntl(3, F_SETLK, 0x08047C60)                   Err#11 EAGAIN
        typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=1  pid=12345

Notice the PID is wrong when the lock fails but is correct when tested?

The fcntl(2) man page clearly states that the l_pid and l_sysid values are only used for F_GETLK calls:

     The l_pid and l_sysid fields are used only with  F_GETLK  or
     F_GETLK64  to return the process ID of the process holding a
     blocking lock and to indicate which system is  running  that
     process.

One more complication - pfiles(1) doesn't show locks held by the process itself

The reason pfiles(1) fails to show locks held by a process is due to the way that pfiles works. It effectively hijacks the process being checked and makes the fcntl(2) call within that process context. Processes can re-lock files they already have open and locked as per the fcntl(2) man page:

     There will be at most one type of lock set for each byte  in
     the  file.  Before  a  successful  return  from  an F_SETLK,
     F_SETLK64, F_SETLKW, or F_SETLKW64 request when the  calling
     process has previously existing locks on bytes in the region
     specified by the request, the previous lock  type  for  each
     byte  in  the  specified  region will be replaced by the new
     lock type.

Conclusion

The output of pfiles(1) is tremendously useful but the locking information is not necessarily accurate or useful.

If you're running pfiles(1) against an NFSv4 file then if the process ID matches a local process it's more than likely the process holding the lock. Otherwise, it's meaningless.

If you're running pfiles(1) against an NFSv3 file then it's harder. Again if the process ID makes sense on this client then it's probably correct. Otherwise you could check it against known NFS clients of the same file system ... but that's time consuming.

To properly identify lock owners some mdb(1) or dtrace(1M) is required - and this blog entry is already too long :-)

About

PeteH

Search

Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today