Wednesday Mar 28, 2007

sharemgr - a new way to share

Back in Nov. of 2006 (snv_53), Doug introduced sharemgr(1M), a new way of managing your NFS shares in Solaris.

He also has a blog on how sharemgr currently interacts with ZFS.

happy sharing,

Friday Jun 30, 2006

NFSv4 client for MacOSX 10.4

Just saw this pass by on a mailing list from Rick... excellent news!

If anyone is interested in trying an NFS Version 4 client (including Kerberos
support) on xnu-792.6.70 (Mac OSX 10.4.6 on PPC or Darwin8.0.1 on x86), you can
download the patch from (currently Beta test, I'd say):

There is also a "Readme" file and a primitive web page at:

If you are interested in seeing further announcements related to this,
please join I'll try and refrain from spamming these
lists with announcements most of you aren't interested in.

Just in case you are interested in it, rick
ps: If you have already downloaded the patch, you should do so again,
    since I just updated it a few minutes ago.

Perhaps soon my laptop will have NFSv4 and ZFS....

Monday Apr 17, 2006

Linux support for mirror mounts

RFC [PATCH 0/6] Client support for crossing NFS server mountpoints
Trond Myklebust 
Tue, 11 Apr 2006 13:45:43 -0400

The following series of patches implement NFS client support for crossing
server submounts (assuming that the server is exporting them using the
'nohide' option).  We wish to ensure that inode numbers remain unique
on either side of the mountpoint, so that programs like 'tar' and
'rsync' do not get confused when confronted with files that have the same
inode number, but are actually on different filesystems on the server.

This is achieved by having the client automatically create a submount
that mirrors the one on the server.

In order to avoid confusing users, we would like for this mountpoint to be
transparent to 'umount': IOW: when the user mounts the filesystem '/foo',
then an automatic submount by the NFS client for /foo/bar should not cause
'umount /foo' (particularly since the kernel cannot create entries for
/foo/bar in /etc/mtab). To get around this we mark automatically
created submounts using the new flag MNT_SHRINKABLE, and then allow
the NFS client to attempt to unmount them whenever the user calls umount on
the parent.

Note: This code also serves as the base for NFSv4 'referral' support, in
which one server may direct the client to a different server as it crosses
into a filesystem that has been migrated.

NFSv4 mailing list

F'n sweet.

AIX support for referrals and replicas

Spencer pointed me to this documentation. Looks like recent AIX versions have support for referrals and replicas! Note that it's only for NFSv4. The v4 train is rolling, good news!

Tuesday Mar 14, 2006

NFSv3 support for .zfs

Rob did the putback last night to enable NFSv3 access to .zfs/snapshot! So now both v3 and v4 have access.

fsh-weakfish# mount -o vers=3 fsh-mullet:/pool /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot
fsh-weakfish# ls /mnt/.zfs/snapshot/krispies
fsh-weakfish# cat /mnt/.zfs/snapshot/krispies/neat.txt

In my previous blog on how the v4 support was done, I introduced a new data structure, "fhandle_ext_t". That structure has now been renamed to fhandle4_t to ease code changes in the future in case we ever need to increase v4's potential filehandle size (it's currently set to v3's protocol limitation). NFSv3, similarly, uses the data structure fhandle3_t.

These changes will be in build 36 of nevada.

Wednesday Nov 16, 2005

bigger filehandles for NFSv4 - die NFSv2 die

So what does a filehandle created by the Solaris NFS server look like? If we take a gander at the fhandle_t struct, we see its layout:

struct svcfh {
    fsid_t	fh_fsid;			/* filesystem id */
    ushort_t    fh_len;			        /* file number length */
    char	fh_data[NFS_FHMAXDATA];		/* and data */
    ushort_t    fh_xlen;			/* export file number length */
    char	fh_xdata[NFS_FHMAXDATA];	/* and data */
};
typedef struct svcfh fhandle_t;

Where fh_len represents the length of valid bytes in fh_data, and likewise, fh_xlen is the length of fh_xdata. Note, NFS_FHMAXDATA used to be:

#define	NFS_FHMAXDATA	((NFS_FHSIZE - sizeof (struct fhsize) + 8) / 2)

To be less confusing, I removed fhsize and shortened that to:

#define NFS_FHMAXDATA    10

Ok, but where does fh_data come from? It's the FID (via VOP_FID) of the local file system. fh_data represents the actual file of the filehandle, and fh_xdata represents the exported file/directory. So for NFSv2 and NFSv3, the filehandle is basically:
fsid + file FID + exported FID

NFSv4 is pretty much the same thing, except at the end we add two fields, and you can see the layout in nfs_fh4_fmt_t:

struct nfs_fh4_fmt {
 	fhandle_ext_t fh4_i;
 	uint32_t      fh4_flag;
 	uint32_t      fh4_volatile_id;
};

The fh4_flag is used to distinguish named attributes from "normal" files, and fh4_volatile_id is currently only used for testing purposes - for testing volatile filehandles, of course. Since Solaris doesn't have a local file system without persistent filehandles, we don't need to use fh4_volatile_id quite yet.

So back to the magical "10" for NFS_FHMAXDATA... what's going on there? Well, adding those fields up, you get: 8 (fsid) + 2 (len) + 10 (data) + 2 (xlen) + 10 (xdata) = 32 bytes, which is the protocol limitation of NFSv2 - just look for "FHSIZE". So the Solaris server is currently limiting its filehandles to 10 byte FIDs just to make NFSv2 happy. Note, this limitation has purposely crept into the local file systems to make this all work; check out UFS's ufid:

/*
 * This overlays the fid structure (see vfs.h)
 * LP64 note: we use int32_t instead of ino_t since UFS does not use
 * inode numbers larger than 32-bits and ufid's are passed to NFS
 * which expects them to not grow in size beyond 10 bytes (12 including
 * the length).
 */
struct ufid {
 	ushort_t ufid_len;
 	ushort_t ufid_flags;
	int32_t	ufid_ino;
 	int32_t	ufid_gen;
};

Note that NFSv3's protocol limitation is 64 bytes and NFSv4's limitation is 128 bytes. So these two file systems could theoretically give out bigger filehandles, but there are two reasons why they don't for currently existing data: 1) there's really no need and, more importantly, 2) the filehandles MUST stay the same on the wire across any change. If 2) isn't satisfied, then all clients with active mounts will get STALE errors when the longer filehandles are introduced. Imagine a server giving out 32 byte filehandles over NFSv3 for a file; then the server is upgraded and now gives out 64 byte filehandles - even if all the extra 32 bytes are zeroed out, that's a different filehandle and the client will think it has a STALE reference. Now a forced umount or client reboot will fix the problem, but it seems pretty harsh to force all active clients to perform some manual admin action for a simple (and should-be-harmless) server upgrade.

So yeah, my blog title says I changed filehandles to be bigger - which almost contradicts the above paragraph. The key point is that files that have never been served up via NFS have never had a filehandle generated for them (duh), so their filehandles can be whatever length the protocol allows and we don't have to worry about STALE filehandles.

If you're not familiar with ZFS's .zfs/snapshot, there will be a future blog on it soon. But basically it places a dot directory (.zfs) at the root of the "main" file system, and all snapshots created are then placed namespace-wise under .zfs/snapshot. Here's an example:

fsh-mullet# zfs snapshot bw_hog@monday
fsh-mullet# zfs snapshot bw_hog@tuesday
fsh-mullet# ls -a /bw_hog
.         ..        .zfs      aces.txt  is.txt    zfs.txt
fsh-mullet# ls -a /bw_hog/.zfs/snapshot
.        ..       monday   tuesday
fsh-mullet# ls -a /bw_hog/.zfs/snapshot/monday
.         ..        aces.txt  is.txt    zfs.txt

With the introduction of .zfs/snapshot, we were faced with an interesting dilemma for NFS: either only NFS clients that can do "mirror mounts" get access to the .zfs directory, OR we increase ZFS's FID for files under .zfs. "Mirror mounts" would allow the technically correct solution of having a unique FSID for the "main" file system and each of its snapshots; this requires NFS clients to cross server mount points. The latter option has one FSID for the "main" file system and all of its snapshots. This means the same file under the "main" file system and any of its snapshots will appear to be the same - so things like "cp" over NFS won't like it.

"Mirror mounts" is our lingo for letting clients cross server file system boundaries - as dictated by the FSID (file system identifier). This is totally legit in NFSv4 (see section "7.7. Mount Point Crossing" and section "5.11.7. mounted_on_fileid" in RFC 3530). NFSv3 doesn't really allow this functionality (see "3.3.3 Procedure 3: LOOKUP - Lookup filename" here). Though, with a little trickery, I'm sure it could be achieved - perhaps via the automounter?

The problem with mirror mounts is that no one has actually implemented them. So if we went with the more technically correct solution of having a unique FSID for the "main" local file system and a unique FSID for each of its snapshots, only Solaris Update 2(?) NFSv4 clients would be able to access .zfs upon initial delivery of ZFS. That seems silly.

If we instead bend a little on the unique FSID, then all NFS clients in existence today can access .zfs. That seems much more attractive. Oh wait... small problem. We would like at least the filehandles to be different for files in the "main" file system versus the snapshots - this ensures NFS doesn't get completely confused. Slight problem: the filehandles we give out today are maxed out at the 32 byte NFSv2 protocol limitation (as mentioned above). If we add any other bit of uniqueness to the filehandles (such as a snapshot identifier) then v2 just can't handle it... hmmm...

Well you know what? Tough s*&t, v2. Seriously, you are antiquated and really need to go away. Since the snapshot identifier doesn't need to be added to the "main" file system, FIDs for non-.zfs-snapshot files will remain the same size and fit within NFSv2's limitations. So we can still access ZFS over NFSv2 - we'll just be denied .zfs's goodness:

fsh-weakfish# mount -o vers=2 fsh-mullet:/bw_hog /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot/      
monday   tuesday
fsh-weakfish# ls /mnt/.zfs/snapshot/monday
/mnt/.zfs/snapshot/monday: Object is remote

So what about v3 and v4? Well, since v4 is the default for Solaris and its code is simpler, I just changed v4 to handle bigger filehandles for now. NFSv3 is coming soooon. So we basically have the same structure as fhandle_t, except we extend it a bit for NFSv4 via fhandle4_t:

/*
 * This is the in-memory structure for an NFSv4 extended filehandle.
 */
typedef struct {
        fsid_t  fhx_fsid;                       /* filesystem id */
        ushort_t fhx_len;                       /* file number length */
        char    fhx_data[NFS_FH4MAXDATA];       /* and data */
        ushort_t fhx_xlen;                      /* export file number length */
        char    fhx_xdata[NFS_FH4MAXDATA];      /* and data */
} fhandle4_t;

So the only difference is that FIDs can be up to 26 bytes instead of 10 bytes. Why 26? That comes from NFSv3's protocol limitation of 64 bytes. And if we ever need larger than 64 byte filehandles for NFSv4, it's easy to change - just create a new struct with the capacity for larger FIDs and use that for NFSv4. Why will it be easier in the future than it was for this change? Well, part of what I needed to do to make NFSv4 filehandles backwards compatible is to parse filehandles when they are actually XDR'd, so that filehandles that used to be given out with 10 byte FIDs (based on the fhandle_t struct) continue to be given out with 10 byte FIDs, but at the same time VOP_FID()s that return larger than 10 byte FIDs (such as .zfs) are allowed to do so. So NFSv4 will return different length filehandles based on the need of the local file system.

So checking out xdr_nfs_resop4, the old code (knowing that the filehandle was safe to be a contiguous set of bytes) simply did this:

case OP_GETFH:
	if (!xdr_int(xdrs,
		     (int32_t *)&objp->nfs_resop4_u.opgetfh.status))
		return (FALSE);
	if (objp->nfs_resop4_u.opgetfh.status != NFS4_OK)
		return (TRUE);
	return (xdr_bytes(xdrs,
	    (char **)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_val,
	    (uint_t *)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_len,
	    NFS4_FHSIZE));
Now, instead of simply doing an xdr_bytes, we use the template of fhandle_ext_t and internally always have the space for 26 byte FIDs, but OTW we skip bytes depending on the values of fhx_len and fhx_xlen; see xdr_encode_nfs_fh4.

whew, that's enough about filehandles for 2005.

Monday Oct 03, 2005

SUSE (Beta) support for NFSv4

Just saw Bryce post this on

Novell is now including the latest NFSv4 packages with kernel support
for NFSv4 by default in their SUSE 10 Beta 3.  You can activate NFSv4 in
SUSE Linux 10.0 Beta 3 and onswards by setting NFS4_SUPPORT="yes" in
/etc/sysconfig/nfs.  They are testing using pynfs and finding similar
results as the other testers.  Note this is not SLES10; beta testing of
that will begin later.


Those with more Linux knowledge can correct me, but I believe that makes SUSE the third Linux distribution to include NFSv4, behind Gentoo and Fedora.

Tuesday Sep 27, 2005

nasconf 2005

nasconf is coming this October 18-20!

Once only for NFS, the conference has grown to include anything NAS related. There will be talks on NFS, CIFS, iSCSI, RDMA, OSD, etc. And your own humble narrator will be giving a talk on using filebench (also mentioned here) to evaluate the Solaris NFSv4 implementation.

sign up today!

ps: a personal thank you to Katie for making the website look all professional.

Monday Aug 15, 2005

Default Use of Privileged Ports Changed


Noel recently changed (putback in snv_22) the Solaris NFS client's default behavior for port selection. Previously, the client would default to using privileged ports via the variables 'clnt_cots_do_bindresvport' for TCP and 'clnt_clts_do_bindresvport' for UDP.

Why did we default to privileged ports in the first place, way back when? It served its purpose in the older days of insecure NFS, when servers would automatically deny client requests coming from a non-reserved port. Now with RPCSEC_GSS, we can move forward.

Another piece to this puzzle, is nfs_portmon. It has this comment in the code:

/*
 * If nfs_portmon is set, then clients are required to use privileged
 * ports (ports < IPPORT_RESERVED) in order to get NFS services.
 * N.B.: this attempt to carry forward the already ill-conceived notion
 * of privileged ports for TCP/UDP is really quite ineffectual.  Not only
 * is it transport-dependent, it's laughably easy to spoof.  If you're
 * really interested in security, you must start with secure RPC instead.
 */
static int nfs_portmon = 0;

It can also be found in nfsd(1M). Turning it on forces the server to only accept privileged ports. Since I dislike nfs_portmon so much, I'm leaving it up to the reader to figure out how to turn it on.

So Noel's fix is to have the client (by default) try using a non-privileged port; if that fails with AUTH_TOOWEAK due to someone (unfortunately) having nfs_portmon turned on, then it retries the request using a privileged port (assuming the client has one available).


Tuesday Jun 14, 2005

NFSv4 Client's Recovery Messages

NFSv4 Client's Recovery Messages for Post-Mortem Analysis

When something goes awry with NFS, one of the typical first things we ask for is a snoop trace. This works ok when the problem is reproducible and you're talking to other developers - but when a non-reproducible failure occurs on a production system, snoop traces just don't cut it. This is especially true when it comes to post-mortem analysis (say of a panic or hard hang). It would be quite useful to know exactly what was going on in the system that led to the crash.

NFSv4 is a stateful protocol (much like NFSv3's side locking protocol NLM). So when a server fails (such as reboot or lease timeout), recovery of the accumulated state needs to happen. We decided we would like to keep track of when recovery wasn't 100% successful and how we got to that point to aid in post-mortem analysis.

I did this by adding an in-kernel queue to store "facts" and "events". Events are direct actions the NFS client takes, whereas facts are interesting tidbits that help determine which actions are to be taken but don't directly cause recovery actions -- facts are intended to be supplemental.

Having this information proved quite useful when I fixed 6225426. I knew the code and had an inclination that it might be related to the client receiving NFS4ERR_BAD_SEQID. Luckily, we had already added the "event" RE_BAD_SEQID, and all I had to do was use the mdb dcmd nfs4_diag to tell me if the client did receive that error -- I didn't have to ask for a snoop trace + reproduction. From the core file, I just type:

> ::nfs4_diag -s

vfs: 3052ca3c600        mi: 306d9183000
mount point: /tools/somemntpt
mount from: server:/export/tools
Messages queued:
2005 Feb  2 01:51:38: fact RF_ERR
2005 Feb  2 01:52:42: event RE_BAD_SEQID
2005 Feb  2 01:52:42: event RE_END
2005 Feb  2 01:52:42: fact RF_DELMAP_CB_ERR

So let's say you wanted to track when a filesystem migrated in NFS... how would you add that ability? (Note: the NFSv4 working group is currently discussing a protocol to do replication and migration, so Solaris doesn't officially support migration directly within NFS yet.) We can use OpenSolaris to peer into the code...

1. Adding a new fact name

You'll find the facts defined in nfs4_clnt.h. We can go ahead and add a fact named RF_MIGRATE:

typedef enum {
        ...
        RF_MIGRATE
} nfs4_fact_type_t;

2. Add a callout using the new fact

Then wherever you need to record this fact, call nfs4_queue_fact() and pass in your newly defined RF_MIGRATE. This function can be found in nfs4_client_debug.c. So a probable place to record this fact is within errs_to_action() in nfs4_recovery.c.

errs_to_action(recov_info_t *recovp,
        nfs4_server_t *sp, mntinfo4_t *mi, stateid4 *sidp,
        nfs4_lost_rqst_t *lost_rqstp, int unmounted, nfs_opnum4 op,
        nfs4_bseqid_entry_t *bsep)
...
	} else {
                recovp->rc_error = geterrno4(stat);
                switch (stat) {
                ...
                case NFS4ERR_STALE:
                        action = NR_STALE;
                        break;
                case NFS4ERR_MOVED:
                        action = NR_MIGRATE;
                        nfs4_queue_fact(RF_MIGRATE, mi, 0, 0, 0, FALSE, NULL,
                            0, NULL);
                        break;

3. Plugging into the kernel

Within set_fact(), you can set the nfs4_rfact_t (a structure to hold the fields of a "fact"). In our simple example, all we wish to store is when the migration actually happened. That information is actually already stored in nfs4_debug_msg_t (along with which server and mountpoint recovery was active on - see nfs4_clnt.h ). So we just need to store the fact type and don't have to store any additional information within the nfs4_rfact_t itself.

typedef enum {
        RM_EVENT,
        RM_FACT
} nfs4_msg_type_t;

typedef struct nfs4_debug_msg {
        timespec_t              msg_time;
        nfs4_msg_type_t         msg_type;
        char                    *msg_srv;
        char                    *msg_mntpt;
        union {
                nfs4_rfact_t    msg_fact;
                nfs4_revent_t   msg_event;
        } rmsg_u;
        nfs4_msg_status_t       msg_status;
        list_node_t             msg_node;
} nfs4_debug_msg_t;

The nfs4_msg_type_t determines whether the rmsg_u is for a fact or event.

static void
set_fact(nfs4_fact_type_t id, nfs4_rfact_t *fp, nfsstat4 stat4,
    nfs4_recov_t raction, nfs_opnum4 op, bool_t reboot, int error,
    vnode_t *vp)
{
        rnode4_t *rp1;

        switch (id) {
        case RF_MIGRATE:
                fp->rf_op = op;
                fp->rf_stat4 = stat4;

                rp1 = VTOR4(vp);
                fp->rf_rp1 = rp1;
                if (rp1 && rp1->r_svnode.sv_name)
                        fp->rf_char1 = fn_path(rp1->r_svnode.sv_name);
                else
                        fp->rf_char1 = NULL;
                break;
        ...
        default:
                zcmn_err(getzoneid(), CE_NOTE, "illegal fact %d", id);
                break;
        }
}

Then we note whether this fact was generated by a successful reply from the server and mark that in successful_comm() (in our case, migration happens only after receiving NFS4ERR_MOVED, which means we got a legit reply). Note: the reason why we need to know whether it was a successful reply is outside the scope of our example.

static int
successful_comm(nfs4_debug_msg_t *msgp)
{
        if (msgp->msg_type == RM_EVENT) {
                ...
        } else {
                switch (msgp->rmsg_u.msg_fact.rf_type) {
                case RF_MIGRATE:
                        return (1);
                ...
                default:
                        return (0);
                }
        }
}

Add the new fact to get_facts() so we can print out a summary "fact sheet". The fact sheet is a way of condensing up to NFS4_MSG_MAX facts into a one line summary. In our simple example, we don't actually need to save anything to the fact sheet:

static int
get_facts(nfs4_debug_msg_t *msgp, nfs4_rfact_t *ret_fp, char **mnt_pt,
    mntinfo4_t *mi)
...
		switch (cur_fp->rf_type) {
		case RF_MIGRATE:
			break;
		...
		default:
			zcmn_err(getzoneid(), CE_NOTE,
			    "get facts: illegal fact %d", cur_fp->rf_type);

Add to queue_print_fact() to actually print this information in /var/adm/messages. Note: only when something goes really bad (like an open had to be closed on the user or siglost) do we actually dump the queue's contents to /var/adm/messages - normally we just keep it in kernel memory.

static void
queue_print_fact(nfs4_debug_msg_t *msg, int dump)
{
	nfs4_rfact_t    *fp;
        zoneid_t        zoneid;

        fp = &msg->rmsg_u.msg_fact;
        zoneid = getzoneid();

        switch (fp->rf_type) {
	...
	case RF_MIGRATE:
		zcmn_err(zoneid, CE_NOTE, "![NFS4][Server: %s][Mntpt: %s]"
		    "file system migrated", msg->msg_srv, msg->msg_mntpt);
		break;
	...

4. Plugging into mdb

The last thing to do is to add this new fact to the mdb dcmd nfs4_diag via nfs4_diag.c. Unfortunately, the nfs mdb module is encumbered code so I can't show the source (yet!), but I can name the functions.

Add this to fact_to_str():

        case RF_MIGRATE:
                return ("RF_MIGRATE");

Then just like the kernel function queue_print_fact(), we add the human readable output to nfs4_fact_print():

		mdb_readstr(buf, sizeof (buf), (uintptr_t)fp->rf_char1);
		mdb_printf("[NFS4]%Y: file system migrated\n",
		    msg->msg_time.tv_sec);

When the nfs4_diag.c file does become unencumbered, I'll update this blog... until then, feel free to contact me to make changes.

5. mdb output

#mdb -k
> ::nfs4_diag

vfs: 6000095c900        mi: 3000d2ad000
   mount point: /mnt
    mount from: fsh-weakfish:/export/tmp
Messages queued:
[NFS4]2005 May 25 17:14:36: file system migrated
> ::nfs4_diag -s

vfs: 6000095c900        mi: 3000d2ad000
   mount point: /mnt
    mount from: fsh-weakfish:/export/tmp
Messages queued:
2005 May 25 17:14:36: fact RF_MIGRATE
> $q

So now, if there was ever a problem with any server's implementation of migration, we would have a clue on the Solaris client of why things went bad.


Tuesday May 03, 2005

dscript to get active NFS clients / nfstop

Have you ever wondered which client(s) were banging away when your NFS server got a little sluggish? Well, with the power of D, it becomes quite simple. Here's a dscript to do exactly that:

#!/usr/sbin/dtrace -FCs

#define AF_INET  2
#define AF_INET6  26

  self->ca = (struct sockaddr *)(args[0]->rq_xprt->xp_xpc.xpc_rtaddr.buf);
  self->sin_addr = (uchar_t *)&((struct sockaddr_in *)self->ca)->sin_addr;

  self->in = 1;

/self->in && self->ca->sa_family == AF_INET/
  self->sin_addr = (uchar_t *)&((struct sockaddr_in *)self->ca)->sin_addr;
  @hosts[self->sin_addr[0], self->sin_addr[1], self->sin_addr[2],
  self->sin_addr[3]] = count();

  self->in = 0;
  self->ca = 0;
  self->sin_addr = 0;

/self->in && self->ca->sa_family == AF_INET6/
  self->sin6 = (uchar_t *)&((struct sockaddr_in6 *)self->ca)->sin6_addr;

  @hosts6[self->sin6[0], self->sin6[1], self->sin6[2], self->sin6[3],
   self->sin6[4], self->sin6[5], self->sin6[6], self->sin6[7],
   self->sin6[8], self->sin6[9], self->sin6[10], self->sin6[11],
   self->sin6[12], self->sin6[13], self->sin6[14], self->sin6[15]]
   = count();

  self->in = 0;
  self->ca = 0;
  self->sin6 = 0;

  printa("\nhost: %d.%d.%d.%d num nfs calls: %@d", @hosts);
  printa("\nhost: %x%x:%x%x:%x%x:%x%x:%x%x:%x%x:%x%x:%x%x num nfs calls: %@d", @hosts6);

And here's an example of the output:

# ./get_nfs_clients.d
dtrace: script './get_nfs_clients.d' matched 4 probes
host:  num nfs calls: 2
host:  num nfs calls: 2
host:  num nfs calls: 2
host:  num nfs calls: 5
host:  num nfs calls: 6
host:  num nfs calls: 10
host:  num nfs calls: 10
host:  num nfs calls: 25
host: fe80:00:00:00:00:00:00:xx  num nfs calls: 2


Sorted automatically from least to most active... with IPv4 and IPv6 hosts being sorted separately (due to being separate aggregates).

very cool! aaaah, but Richard, being the DE that he is, one-upped me to create an 'nfstop' of sorts by replacing the above 'END' with:

  trunc (@hosts, 20);
  trunc (@hosts6, 20);
  printa("\nhost: %d.%d.%d.%d  num nfs calls: %@d", @hosts);
  printa("\nhost: %x%x:%x%x:%x%x:%x%x:%x%x:%x%x:%x%x:%x%x num nfs calls: %@d", @hosts6);
  trunc (@hosts);
  trunc (@hosts6);

Now we're talking!

Friday Mar 04, 2005


So now that Connectathon is over, what was the biggest highlight?

I would have to say having the first successful demonstration of pNFS (courtesy of Sun and Netapp). The three amigos from Sun (Sam Faulkner, Lisa Week, and Alok) ran a Solaris client against Netapp's server (Garth Goodson). Garth later gave a talk about it. The basic premise is that for very large files, having only one server can run you into bandwidth limitations. Striping a very large file across several data servers (in the presentations there are ALWAYS three), passing that "layout" information back to the client via the meta-data server, and then sending all reads/writes from the client directly to the data servers solves the bandwidth problem (and lets you stay in vogue with horizontal scaling).

The internet draft for pNFS can be found here.

Now Garth Gibson (and not Garth Goodson - heh) originally started coming up with a solution (mentioned here) and CITI actually had a workshop back in Dec 2003.

This demonstration used files/NFSv4 as the backend protocol to move the actual data (as opposed to just talking to the meta-data server). But you can easily see how the block people are excited to use their own solutions. And you can also see how the cluster people are very interested in an open standard.

Ok, so it's parallel... why should I care? To quote Goodson:


This draft considers the problem of limited bandwidth to NFS servers. The bandwidth limitation exists because an NFS server has limited network, CPU, memory and disk I/O resources. Yet, access to any one file system through the NFSv4 protocol requires that a single server be accessed. While NFSv4 allows file system migration, it does not provide a mechanism that supports multiple servers simultaneously exporting a single writable file system.

This problem has become aggravated in recent years with the advent of very cheap and easily expanded clusters of application servers that are also NFS clients. The aggregate bandwidth demands of such clustered clients, typically working on a shared data set preferentially stored in a single file system, can increase much more quickly than the bandwidth of any server. The proposed solution is to provide for the parallelization of file services, by enhancing NFSv4 in a minor version.


So now it works, excellent. Sam, Lisa, Alok... when are we going to get perf numbers? heh

Lastly, I'd like to point out that this shows the beauty behind NFSv4. This is exactly what v4 was designed for: openness and enhancements (via minor versions) where everyone can play. We can build upon the base of the protocol to solve interesting problems instead of redoing the (un-interesting) work every five or ten years.

Wednesday Mar 02, 2005

Connectathon is almost over!

The ultimate in NFS conferences, Connectathon is now in its final days for '05.

check out the talks:

The last one had an interesting security note:
pNFS Update and Security Discussion - Brent Welch, Panasas

In reference to pNFS (later blog), the block solution for "security" from the client to the data servers is authentication via a "capability". The only problem is that this is completely susceptible to snoop/man-in-the-middle attacks. So we're really relying on restricting physical access to the network.

And I believe I was just complaining to fellow group members last week about how silly it is that we still support DH. Oh well.

man i hate blocks.

Note: Brent is a super sharp guy and didn't come up with the security solution; he was just explaining it.
Note2: Brent mentioned you could use IPsec to protect the "capability", but why not just use a files solution and krb5?

Thursday Feb 03, 2005

NFSv4 Cross-Domain Considerations, DNS TXT RR solution

As mentioned in my previous entry on 'NFSMAPID_DOMAIN', there was going to be an internet draft for the DNS TXT RR solution to finding a proper NFSv4 domain, and Rick has now submitted it:

Section "2.3 IETF DNS Community Considerations" is interesting as it hints more work is needed in the future.

Friday Jan 28, 2005


What did NFSv4 change in the user/group attributes representation?

NFSv4 introduces the use of UTF8 strings for user and group attributes instead of the old NFSv3 way of a 32-bit unsigned int (which is really the UNIX uid/gid). We now have the form:

user@domain
This makes intuitive sense (for larger networks), better matches Windows' SIDs, and more importantly solves the "scalability" of user/group identity (see the example below).

An example where this is quite useful

If two companies merged and were now housed under the same administrative domain, but existed under a different DNS/NIS/LDAP/etc domain, then its quite possible that both companies used some of the exact same uids/gids - but they of course refer to different user/groups. By enforcing the domain as part of the attribute, NFSv4 allows the uids/gids to remain the same, but the OTW representation will be different, and everything works: is internally represented as uid 1000 from CompanyA. is internally represented as uid 1000 from CompanyB.

CompanyA merges with CompanyB. To ease the transition, all of CompanyA's machines are separated out into their own DNS domain. All of CompanyB's machines are left in their DNS domain.

In NFSv3, we would just pass user "1000" OTW; in NFSv4, we pass "" OTW. Within the DNS domain, it's not obvious whether "1000" represents joe or bob, but it is obvious "" is not bob. This is detected within NFSv4 itself; no need for a special proprietary protocol/solution to do the uid/gid mappings.

Real access

One thing to remember is that access (including ACLs) is evaluated by the server based on the RPC credentials, not on the user/group attrs. So if the user/group attr is not recognizable (for example, the server couldn't find a proper mapping and translates it to "nobody"), then only apps/commands that depend on that attribute will fail.

So a properly written application will use access(2) to perform permission checks, and will not rely on the mode bits or uid/gid from stat(2).

It should be noted, though, that even if the client cannot accurately determine the user/group attribute, properly written applications will still run successfully - since their permission checks are done on the RPC cred, not on the user/group attribute.

So what does Solaris do?

Solaris currently only handles one NFSv4 domain. If the client or server receives a user/group string that does not match its domain, it will map that user/group to uid/gid "nobody" (60001). In the future Solaris may support multiple domains, domain equivalence, or a mapping of users between domains.

As an example, assume a flat uid numberspace, user "eric"'s uid is 1000, the client resides in DNS domain and therefore1 has NFSMAPID_DOMAIN set to, and the server resides in DNS domain and has NFSMAPID_DOMAIN also as

If a client contacts the server, "" is passed OTW, and the server will not be able to map "" to uid 1000, since the domains don't match. The server will translate this to "nobody".

1 Note, below I will explain the use of a DNS TXT RR that we use to solve this problem (and are proposing via the IETF for others to use as well).


So the OTW representation is set, but if we're talking Solaris client to Solaris server, then end to end we need to translate a uid to user@domain at the client, and then user@domain back to a uid at the server.

So how does Solaris figure out what its NFSv4 domain is?

Solaris allows each "node" to explicitly set its own NFSv4 domain, derive its NFSv4 domain from a common name service domain, or to acquire the NFSv4 domain from a DNS network service. Explicitly setting the domain doesn't scale, so we encourage customers to use their existing DNS (or NIS/NIS+/LDAP/etc) domain as their NFSv4 domain. In the case where a site has multiple name service domains (such as some nodes using DNS whereas other nodes only use NIS), but keep a flat or non-overlapping numberspace, we recommend the third option - acquiring the NFSv4 domain via DNS and a RR.

With the common default configurations, customers should not have to explicitly define the NFSv4 domain. The automatic algorithm that is used should be sufficient.

Here is the Solaris algorithm for this mapping:

if (NFSMAPID_DOMAIN2 is uncommented/set in /etc/default/nfs)
        use it and return;
if (DNS is configured)
        lookup TXT RR3
        if (TXT RR)
                use it and return;
        if (DNS domainname)
                use it and return;
if (non-DNS domainname: NIS/NIS+/LDAP/etc)
        strip leading component;
        use it and return;
Otherwise, send OTW stringified uid/gids4

2NFSMAPID_DOMAIN is a Solaris variable - see nfs(4). Also note that when this is not manually set, Solaris will automatically figure out a NFSv4 domain (as described above); when it does the automatic figuring it does NOT set the variable NFSMAPID_DOMAIN in /etc/default/nfs -- it leaves it alone.

3 The TXT resource record we use is:
and will be explained in full detail in a forthcoming Internet draft.

4Section 5.8 goes into detail on why using stringified uids/gids is not recommended and mentions how the server should handle them: " A server is not obligated to accept such a string, but may return NFS4ERR_BADOWNER instead. To avoid this mechanism being used to subvert user and group translation, so that a client might pass all of the owners and groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER error when there is a valid translation for the user or owner designated in this way. " However, they are useful in cases where no domain is discernible - such as a diskless client that needs to access its boot files over the net before its domain is set.


So if you do see the user/group popping up as "nobody", make sure that the client and server have matching domains for NFSv4. You can do this easily by looking at:


or running this simple dtrace script which shows what's being sent OTW:

#!/usr/sbin/dtrace -s

#pragma D option quiet

        printf("DOMAIN FOR OWNER:       %s\n", stringof(arg0));

        printf("DOMAIN FOR OWNER_GROUP: %s\n", stringof(arg0));


