bigger filehandles for NFSv4 - die NFSv2 die

So what does a filehandle created by the Solaris NFS server look like? If we take a gander at the fhandle_t struct, we see its layout:

struct svcfh {
    fsid_t	fh_fsid;			/* filesystem id */
    ushort_t    fh_len;			        /* file number length */
    char	fh_data[NFS_FHMAXDATA];		/* and data */
    ushort_t    fh_xlen;			/* export file number length */
    char	fh_xdata[NFS_FHMAXDATA];	/* and data */
};
typedef struct svcfh fhandle_t;

Here fh_len is the number of valid bytes in fh_data, and likewise fh_xlen is the number of valid bytes in fh_xdata. Note, NFS_FHMAXDATA used to be:

#define	NFS_FHMAXDATA	((NFS_FHSIZE - sizeof (struct fhsize) + 8) / 2)

To be less confusing, I removed fhsize and shortened that to:

#define NFS_FHMAXDATA    10

Ok, but where does fh_data come from? It's the FID (via VOP_FID) of the local file system. fh_data represents the actual file of the filehandle, and fh_xdata represents the exported file/directory. So for NFSv2 and NFSv3, the filehandle is basically:
fsid + file FID + exported FID

NFSv4 is pretty much the same thing, except at the end we add two fields, and you can see the layout in nfs_fh4_fmt_t:

struct nfs_fh4_fmt {
 	fhandle_ext_t fh4_i;
 	uint32_t      fh4_flag;
 	uint32_t      fh4_volatile_id;
};

The fh4_flag is used to distinguish named attributes from "normal" files, and fh4_volatile_id is currently only used for testing purposes - for testing volatile filehandles of course. Since Solaris doesn't have a local file system that lacks persistent filehandles, we don't need to use fh4_volatile_id quite yet.

So back to the magical "10" for NFS_FHMAXDATA... what's going on there? Well, adding those fields up, you get: 8(fsid) + 2(len) + 10(data) + 2(xlen) + 10(xdata) = 32 bytes. Which is the protocol limitation of NFSv2 - just look for "FHSIZE". So the Solaris server is currently limiting its filehandles to 10 byte FIDs just to make NFSv2 happy. Note, this limitation has purposely crept into the local file systems to make this all work, check out UFS's ufid:

/*
 * This overlays the fid structure (see vfs.h)
 * LP64 note: we use int32_t instead of ino_t since UFS does not use
 * inode numbers larger than 32-bits and ufid's are passed to NFS
 * which expects them to not grow in size beyond 10 bytes (12 including
 * the length).
 */
struct ufid {
 	ushort_t ufid_len;
 	ushort_t ufid_flags;
	int32_t	ufid_ino;
 	int32_t	ufid_gen;
};
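The byte counts above are easy to check mechanically. Here is a minimal sketch, not the Solaris headers: fsid_sketch_t is a stand-in for fsid_t (assumed to be two 32-bit values, 8 bytes), all the _sketch names are made up for illustration, and it assumes a typical ABI that adds no padding to these layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFS_FHSIZE_V2	32	/* NFSv2 protocol limit (FHSIZE) */
#define NFS_FHMAXDATA	10	/* per-FID payload, as in the post */

/* Stand-in for fsid_t: assumed to be 8 bytes (two 32-bit values). */
typedef struct { int32_t val[2]; } fsid_sketch_t;

/* Same field layout as the fhandle_t shown earlier. */
struct svcfh_sketch {
	fsid_sketch_t	fh_fsid;
	uint16_t	fh_len;
	char		fh_data[NFS_FHMAXDATA];
	uint16_t	fh_xlen;
	char		fh_xdata[NFS_FHMAXDATA];
};

/* Same field layout as UFS's ufid. */
struct ufid_sketch {
	uint16_t	ufid_len;
	uint16_t	ufid_flags;
	int32_t		ufid_ino;
	int32_t		ufid_gen;
};

/* Bytes of a ufid that count as FID data (everything after ufid_len). */
size_t
ufid_data_bytes(void)
{
	return (sizeof (struct ufid_sketch) - sizeof (uint16_t));
}

/* Abort if the layouts do not add up the way the text says. */
void
check_layouts(void)
{
	/* 8 + 2 + 10 + 2 + 10 = 32, exactly the NFSv2 limit. */
	assert(sizeof (struct svcfh_sketch) == NFS_FHSIZE_V2);
	/* 12 bytes including the length, so 10 bytes of FID data. */
	assert(sizeof (struct ufid_sketch) == 12);
	assert(ufid_data_bytes() == NFS_FHMAXDATA);
}
```

Calling check_layouts() from a test program is enough: if a field were widened (say, ufid_ino to 64 bits), the assertions would fire, which is the constraint the UFS comment is warning about.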

Note that NFSv3's protocol limitation is 64 bytes and NFSv4's limitation is 128 bytes. So these two versions could theoretically give out bigger filehandles, but there are two reasons why they don't for currently existing data: 1) there's really no need and, more importantly, 2) the filehandles MUST be the same on the wire before and after any change. If 2) isn't satisfied, then all clients with active mounts will get STALE errors when the longer filehandles are introduced. Imagine a server giving out 32 byte filehandles over NFSv3 for a file, then the server is upgraded and now gives out 64 byte filehandles - even if all the extra 32 bytes are zeroed out, that's a different filehandle and the client will think it has a STALE reference. Now a forced umount or client reboot will fix the problem, but it seems pretty harsh to force all active clients to perform some manual admin action for a simple (and what should be harmless) server upgrade.
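To see why zero padding doesn't help, here is a minimal sketch that treats a filehandle as opaque data (a length plus raw bytes) and compares two handles on both. The fh_sketch_t type and both function names are hypothetical, and this is not how any particular client or server is written; it only illustrates the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Opaque filehandle as a client might hold it: a length plus raw bytes. */
typedef struct {
	size_t		len;
	unsigned char	val[64];	/* 64 = NFSv3 protocol limit */
} fh_sketch_t;

/* Two handles match only if the length AND every byte match. */
int
fh_equal(const fh_sketch_t *a, const fh_sketch_t *b)
{
	return (a->len == b->len && memcmp(a->val, b->val, a->len) == 0);
}

/* Build the "upgraded" handle: same leading bytes, zero-padded to new_len. */
fh_sketch_t
fh_zero_extend(const fh_sketch_t *old, size_t new_len)
{
	fh_sketch_t out;

	assert(new_len >= old->len && new_len <= sizeof (out.val));
	memset(&out, 0, sizeof (out));
	memcpy(out.val, old->val, old->len);
	out.len = new_len;
	return (out);
}
```

Extending a 32 byte handle to 64 bytes with fh_zero_extend() leaves the first 32 bytes untouched, yet fh_equal() reports the two handles as different because the lengths no longer agree.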

So yeah my blog title is how i changed filehandles to be bigger - which almost contradicts the above paragraph. The key point to note is that files that have never been served up via NFS have never had a filehandle generated for them (duh), so they can be whatever length the protocol allows and we don't have to worry about STALE filehandles.

If you're not familiar with ZFS's .zfs/snapshot, there will be a future blog on it soon. But basically it places a dot file (.zfs) under the "main" file system at its root, and all snapshots created are then placed namespace-wise under .zfs/snapshot. Here's an example:

fsh-mullet# zfs snapshot bw_hog@monday
fsh-mullet# zfs snapshot bw_hog@tuesday
fsh-mullet# ls -a /bw_hog
.         ..        .zfs      aces.txt  is.txt    zfs.txt
fsh-mullet# ls -a /bw_hog/.zfs/snapshot
.        ..       monday   tuesday
fsh-mullet# ls -a /bw_hog/.zfs/snapshot/monday
.         ..        aces.txt  is.txt    zfs.txt

With the introduction of .zfs/snapshot, we were faced with an interesting dilemma for NFS - either only have NFS clients that could do "mirror mounts" have access to the .zfs directory OR increase ZFS's fid for files under .zfs. "Mirror mounts" would allow us to do the technically correct solution of having a unique FSID for the "main" file system and each of its snapshots. This requires NFS clients to cross server mount points. The latter option has one FSID for the "main" file system and all of its snapshots. This means the same file under the "main" file system and any of its snapshots will appear to be the same - so things like "cp" over NFS won't like it.
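Why would "cp" care? Tools like it commonly decide whether two paths are the same file by comparing the device and inode numbers that stat() returns, and an NFS client typically derives those from the FSID and the file id the server reports. The sketch below shows that check in isolation; looks_like_same_file is a hypothetical name, and this is an illustration of the idea rather than the code of any particular cp.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same (device, inode) pair,
 * 0 if they differ, and -1 if either stat() call fails.
 */
int
looks_like_same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);
	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return (-1);
	return (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);
}
```

With a single FSID shared by the "main" file system and its snapshots, a file and its snapshot copy can end up with the same pair, so a check like this would conclude they are one file.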

"Mirror mounts" is our lingo for letting clients cross server file system boundaries - as dictated by the FSID (file system identifier). This is totally legit in NFSv4 (see section "7.7. Mount Point Crossing" and section "5.11.7. mounted_on_fileid" in rfc 3530). NFSv3 doesn't really allow this functionality (see "3.3.3 Procedure 3: LOOKUP - Lookup filename" in rfc 1813). Though, with a little trickery, i'm sure it could be achieved - perhaps via the automounter?

The problem with mirror mounts is that no one has actually implemented them. So if we went with the more technically correct solution of having a unique FSID for the "main" local file system and a unique FSID for all its snapshots, only Solaris Update 2(?) NFSv4 clients would be able to access .zfs upon initial delivery of ZFS. That seems silly.

If we instead bend a little on the unique FSID, then all NFS clients in existence today can access .zfs. That seems much more attractive. Oh wait... small problem. We would rather like at least the filehandles to be different for files in the "main" file system from the snapshots - this ensures NFS doesn't get completely confused. Slight problem is that the filehandles we give out today are maxed out at the 32 byte NFSv2 protocol limitation (as mentioned above). If we add any other bit of uniqueness to the filehandles (such as a snapshot identifier) then v2 just can't handle it.... hmmm...

Well you know what? Tough s*&t v2. Seriously, you are antiquated and really need to go away. Since the snapshot identifier doesn't need to be added for the "main" file system, FIDs for files that aren't under .zfs will remain the same size and fit within NFSv2's limitations. So we can access ZFS over NFSv2, it just will be denied .zfs's goodness:

fsh-weakfish# mount -o vers=2 fsh-mullet:/bw_hog /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot/      
monday   tuesday
fsh-weakfish# ls /mnt/.zfs/snapshot/monday
/mnt/.zfs/snapshot/monday: Object is remote

So what about v3 and v4? Well since v4 is the default for Solaris and its code is simpler, i just changed v4 to handle bigger filehandles for now. NFSv3 is coming soooon. So we basically have the same structure as fhandle_t, except we extend it a bit for NFSv4 via fhandle4_t:

/*
 * This is the in-memory structure for an NFSv4 extended filehandle.
 */
typedef struct {
        fsid_t  fhx_fsid;                       /* filesystem id */
        ushort_t fhx_len;                       /* file number length */
        char    fhx_data[NFS_FH4MAXDATA];    /* and data */
        ushort_t fhx_xlen;                      /* export file number length */
        char    fhx_xdata[NFS_FH4MAXDATA];   /* and data */
} fhandle4_t;

So the only difference is that FIDs can be up to 26 bytes instead of 10 bytes. Why 26? That's what fits within NFSv3's protocol limitation of 64 bytes. And if we ever need larger than 64 byte filehandles for NFSv4, it's easy to change - just create a new struct with the capacity for larger FIDs and use that for NFSv4. Why will it be easier in the future than it was for this change? Well, part of what i needed to do to make NFSv4 filehandles backwards compatible is that when filehandles are actually XDR'd, we need to parse them so that filehandles that used to be given out with 10 byte FIDs (based on the fhandle_t struct) continue to be given out based on 10 byte FIDs, but at the same time VOP_FID()s that return larger than 10 byte FIDs (such as .zfs) are allowed to do so. So NFSv4 will return different length filehandles based on the need of the local file system.
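The 10 and the 26 fall out of the same arithmetic: take the protocol's filehandle limit, subtract the 8 byte fsid and the two 2 byte length fields, and split what's left between the two FIDs. A small sketch of that calculation follows; max_fid_bytes and the two macros are made-up names for illustration, and the function ignores the two extra uint32_t fields that nfs_fh4_fmt carries.

```c
#include <assert.h>
#include <stddef.h>

#define FSID_BYTES	8	/* fsid_t */
#define LEN_BYTES	2	/* each ushort_t length field */

/* Largest FID such that fsid + 2 * (length + FID) fits in fh_limit bytes. */
size_t
max_fid_bytes(size_t fh_limit)
{
	size_t fixed = FSID_BYTES + 2 * LEN_BYTES;

	assert(fh_limit >= fixed);
	return ((fh_limit - fixed) / 2);
}
```

So max_fid_bytes(32) gives 10 for NFSv2, and max_fid_bytes(64) gives 26 for the NFSv3 limit: 8 + 2 + 26 + 2 + 26 = 64.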

So checking out xdr_nfs_resop4, the old code (knowing that the filehandle was safe to treat as a contiguous set of bytes) simply did this:

case OP_GETFH:
	if (!xdr_int(xdrs,
		     (int32_t *)&objp->nfs_resop4_u.opgetfh.status))
		return (FALSE);
	if (objp->nfs_resop4_u.opgetfh.status != NFS4_OK)
		return (TRUE);
	return (xdr_bytes(xdrs,
	    (char **)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_val,
	    (uint_t *)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_len,
	    NFS4_FHSIZE));

Now, instead of simply doing an xdr_bytes, we use the template of fhandle_ext_t and internally always have the space for 26 byte FIDs, but for OTW (over the wire) we skip bytes depending on what fhx_len and fhx_xlen are; see xdr_encode_nfs_fh4.
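To make the "skip bytes" idea concrete, here is a rough sketch of packing only the used part of each FID. It is not xdr_encode_nfs_fh4: the names are hypothetical, lengths are copied in host byte order, the fh4_flag and fh4_volatile_id fields are left out, and it assumes that FIDs shorter than the legacy 10 bytes are still padded to 10 so that old handles keep their 32 byte wire form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NFS_FHMAXDATA	10	/* legacy FID size */
#define NFS_FH4MAXDATA	26	/* extended FID size */

/* In-memory handle with room for 26 byte FIDs (stand-in for fhandle4_t). */
typedef struct {
	uint8_t		fhx_fsid[8];	/* stand-in for fsid_t */
	uint16_t	fhx_len;
	char		fhx_data[NFS_FH4MAXDATA];
	uint16_t	fhx_xlen;
	char		fhx_xdata[NFS_FH4MAXDATA];
} fhx_sketch_t;

/* Bytes a FID occupies on the wire: never fewer than the legacy 10. */
static size_t
wire_fid_len(uint16_t len)
{
	return (len < NFS_FHMAXDATA ? NFS_FHMAXDATA : len);
}

/* Append one length-prefixed FID to out; returns bytes written. */
static size_t
put_fid(uint8_t *out, uint16_t len, const char *data)
{
	size_t n = wire_fid_len(len);

	memcpy(out, &len, sizeof (len));	/* host byte order, sketch only */
	memset(out + sizeof (len), 0, n);	/* zero any padding */
	memcpy(out + sizeof (len), data, len);
	return (sizeof (len) + n);
}

/*
 * Pack fh into out (capacity cap). Returns the wire length, or 0 if
 * the handle is malformed or does not fit.
 */
size_t
fhx_pack(const fhx_sketch_t *fh, uint8_t *out, size_t cap)
{
	size_t need, off = 0;

	if (fh->fhx_len > NFS_FH4MAXDATA || fh->fhx_xlen > NFS_FH4MAXDATA)
		return (0);
	need = sizeof (fh->fhx_fsid) + 2 * sizeof (uint16_t) +
	    wire_fid_len(fh->fhx_len) + wire_fid_len(fh->fhx_xlen);
	if (need > cap)
		return (0);

	memcpy(out, fh->fhx_fsid, sizeof (fh->fhx_fsid));
	off += sizeof (fh->fhx_fsid);
	off += put_fid(out + off, fh->fhx_len, fh->fhx_data);
	off += put_fid(out + off, fh->fhx_xlen, fh->fhx_xdata);
	assert(off == need);
	return (off);
}
```

With both lengths at 10 this produces 32 bytes, the same size as before the change; with both at 26 it produces 64. The in-memory struct is the same in both cases, only the packed length differs.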

whew, that's enough about filehandles for 2005.


I have 3 machines (OS = Solaris):
A - running OS10, sharing /ramdisk via NFS
B - running OS9, client to the A /ramdisk via NFS
C - running OS10, client to the A /ramdisk via NFS

If A is rebooted, I get "no such file or directory" when doing an ls -l from B or C. It takes about 90 seconds before I can finally access /ramdisk from B or C. How can I fix this? Would a "fixed" file handle resolve the problem, and if so do you have the instructions for doing this? Thanks.

Posted by Julius Rahmandar on January 27, 2006 at 02:49 AM PST #

Ok, so "A" is your server running s10... "B" (s9) and "C" (s10) are your clients.

Is the /ramdisk temporary? What are you trying to "ls -l" when you get "no such file or directory"?

Is the 90 seconds after "A" comes back up or is that including the reboot time?

Note this blog isn't about fixing filehandles, it's about extending them when the local file system needs a larger FID - like ZFS's .snapshot. So what you're seeing has nothing to do with filehandles.

Please follow up with a message to - it's easier to respond there.

Posted by eric kustarz on January 27, 2006 at 02:58 AM PST #
