NFSv4 Client's Recovery Messages for Post-Mortem Analysis

When something goes awry with NFS, one of the first things we typically ask for is a snoop trace. That works fine when the problem is reproducible and you're talking to other developers - but when a non-reproducible failure occurs on a production system, snoop traces just don't cut it. This is especially true when it comes to post-mortem analysis (say, of a panic or hard hang), where it would be quite useful to know exactly what was going on in the system that led to the crash.

NFSv4 is a stateful protocol (much like NLM, NFSv3's side locking protocol). So when a server fails (such as on a reboot or lease timeout), the accumulated state needs to be recovered. We decided we would like to keep track of when recovery wasn't 100% successful, and how we got to that point, to aid in post-mortem analysis.

I did this by adding an in-kernel queue that stores "facts" and "events". Events are direct actions the NFS client takes, whereas facts are interesting tidbits that help determine which actions to take but don't directly cause recovery actions -- facts are intended to be supplemental.
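To make the mechanism concrete, here is a hypothetical, user-space sketch of such a bounded diagnostic queue. The real client keeps its queue in kernel memory (capped at NFS4_MSG_MAX entries and using the kernel's list facilities); all names below are illustrative stand-ins, not the actual implementation:

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical user-space sketch of a diagnostic queue: each entry
 * records whether it is a "fact" or an "event", plus a timestamp.
 * The queue is bounded; when full, the oldest entry is evicted.
 */
#define	DIAG_MSG_MAX	4	/* stand-in for NFS4_MSG_MAX */

typedef enum { DM_EVENT, DM_FACT } diag_msg_type_t;

typedef struct {
	time_t		dm_time;
	diag_msg_type_t	dm_type;
	int		dm_id;	/* e.g. RF_MIGRATE or RE_BAD_SEQID */
} diag_msg_t;

typedef struct {
	diag_msg_t	dq_msgs[DIAG_MSG_MAX];
	int		dq_head;	/* index of oldest entry */
	int		dq_count;
} diag_queue_t;

void
diag_queue_init(diag_queue_t *qp)
{
	memset(qp, 0, sizeof (*qp));
}

/* Append an entry, evicting the oldest one if the queue is full. */
void
diag_queue_add(diag_queue_t *qp, diag_msg_type_t type, int id)
{
	int slot = (qp->dq_head + qp->dq_count) % DIAG_MSG_MAX;

	if (qp->dq_count == DIAG_MSG_MAX) {
		/* Overwrite the oldest entry and advance the head. */
		slot = qp->dq_head;
		qp->dq_head = (qp->dq_head + 1) % DIAG_MSG_MAX;
	} else {
		qp->dq_count++;
	}
	qp->dq_msgs[slot].dm_time = time(NULL);
	qp->dq_msgs[slot].dm_type = type;
	qp->dq_msgs[slot].dm_id = id;
}

/* Oldest entry still queued. */
const diag_msg_t *
diag_queue_oldest(const diag_queue_t *qp)
{
	return (&qp->dq_msgs[qp->dq_head]);
}
```

A bounded queue like this keeps memory use predictable while still preserving the most recent history leading up to a failure.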

Having this information proved quite useful when I fixed 6225426. I knew the code and had an inkling that the problem might be related to the client receiving NFS4ERR_BAD_SEQID. Luckily, we had already added the RE_BAD_SEQID "event", and all I had to do was use the mdb dcmd nfs4_diag to tell me whether the client did receive that error -- I didn't have to ask for a snoop trace plus a reproduction. From the core file, I just typed:

> ::nfs4_diag -s


*********************************************
vfs: 3052ca3c600        mi: 306d9183000
   mount point: /tools/somemntpt
    mount from: server:/export/tools
Messages queued:
=============================================
2005 Feb  2 01:51:38: fact RF_ERR
...
2005 Feb  2 01:52:42: event RE_BAD_SEQID
2005 Feb  2 01:52:42: event RE_END
2005 Feb  2 01:52:42: fact RF_DELMAP_CB_ERR
>

So let's say you wanted to track when a file system migrated in NFS... how would you add that ability? (Note: the NFSv4 working group is currently discussing a protocol for replication and migration, so Solaris doesn't yet officially support migration directly within NFS.) We can use OpenSolaris to peer into the code...

1. Adding a new fact name

You'll find the facts defined in nfs4_clnt.h. We can go ahead and add a fact named RF_MIGRATE:

typedef enum {
        RF_BADOWNER,
        RF_ERR,
        RF_RENEW_EXPIRED,
        RF_SRV_NOT_RESPOND,
        RF_SRV_OK,
        RF_SRVS_NOT_RESPOND,
        RF_SRVS_OK,
        RF_DELMAP_CB_ERR,
        RF_MIGRATE
} nfs4_fact_type_t;

2. Add a callout using the new fact

Then, wherever you need to record this fact, call nfs4_queue_fact() and pass in your newly defined RF_MIGRATE. That function can be found in nfs4_client_debug.c. A likely place to record this fact is within errs_to_action() in nfs4_recovery.c.

void
errs_to_action(recov_info_t *recovp,
        nfs4_server_t *sp, mntinfo4_t *mi, stateid4 *sidp,
        nfs4_lost_rqst_t *lost_rqstp, int unmounted, nfs_opnum4 op,
        nfs4_bseqid_entry_t *bsep)
...
        } else {
                recovp->rc_error = geterrno4(stat);
                switch (stat) {
                ...
                case NFS4ERR_STALE:
                        action = NR_STALE;
                        break;
                case NFS4ERR_MOVED:
                        action = NR_MIGRATE;
                        nfs4_queue_fact(RF_MIGRATE, mi, 0, 0, 0, FALSE, NULL,
                            0, NULL);
                        break;
                default:
                ...
                }
        }
...
}

3. Plugging into the kernel

Within set_fact(), you fill in the nfs4_rfact_t (a structure that holds the fields of a "fact"). In our simple example, all we wish to record is when the migration actually happened. That information is already stored in the nfs4_debug_msg_t (along with which server and mount point recovery was active on - see nfs4_clnt.h). So we just need to store the fact type and don't have to store any additional information within the nfs4_rfact_t itself.

typedef enum {
        RM_EVENT,
        RM_FACT
} nfs4_msg_type_t;

typedef struct nfs4_debug_msg {
        timespec_t              msg_time;
        nfs4_msg_type_t         msg_type;
        char                    *msg_srv;
        char                    *msg_mntpt;
        union {
                nfs4_rfact_t    msg_fact;
                nfs4_revent_t   msg_event;
        } rmsg_u;
        nfs4_msg_status_t       msg_status;
        list_node_t             msg_node;
} nfs4_debug_msg_t;

The nfs4_msg_type_t determines whether rmsg_u holds a fact or an event.
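As a hypothetical stand-alone illustration of that tagged-union pattern (with simplified stand-in types, not the actual kernel definitions), a consumer always checks the tag before touching the union:

```c
#include <assert.h>

/*
 * Simplified stand-ins for the nfs4_debug_msg_t definitions above;
 * these are illustrative, not the real kernel structures.
 */
typedef enum { RM_EVENT, RM_FACT } msg_type_t;

typedef struct { int rf_type; } rfact_t;	/* fact payload */
typedef struct { int re_type; } revent_t;	/* event payload */

typedef struct {
	msg_type_t	msg_type;	/* the tag: which union arm is live */
	union {
		rfact_t		msg_fact;
		revent_t	msg_event;
	} rmsg_u;
} debug_msg_t;

/* Return the type id of whichever arm of the union is active. */
int
msg_payload_type(const debug_msg_t *msgp)
{
	if (msgp->msg_type == RM_FACT)
		return (msgp->rmsg_u.msg_fact.rf_type);
	return (msgp->rmsg_u.msg_event.re_type);
}
```

The union keeps each queue entry small: only one payload ever exists per message, so facts and events share the same storage.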

static void
set_fact(nfs4_fact_type_t id, nfs4_rfact_t *fp, nfsstat4 stat4,
    nfs4_recov_t raction, nfs_opnum4 op, bool_t reboot, int error,
    vnode_t *vp)
{
        rnode4_t *rp1;

        switch (id) {
        ...
        case RF_DELMAP_CB_ERR:
                fp->rf_op = op;
                fp->rf_stat4 = stat4;

                rp1 = VTOR4(vp);
                fp->rf_rp1 = rp1;
                if (rp1 && rp1->r_svnode.sv_name)
                        fp->rf_char1 = fn_path(rp1->r_svnode.sv_name);
                else
                        fp->rf_char1 = NULL;
                break;
        case RF_MIGRATE:
                break;
        default:
                zcmn_err(getzoneid(), CE_NOTE, "illegal fact %d", id);
                break;
        }
}

Then we want to record whether this fact was generated by a successful reply from the server, and mark that in successful_comm(). (In our case, migration happens only after receiving NFS4ERR_MOVED, which means we got a legitimate reply.) The reason why we need to know whether the reply was successful is outside the scope of this example.

static int
successful_comm(nfs4_debug_msg_t *msgp)
{
        if (msgp->msg_type == RM_EVENT) {
                ...
        } else {
                switch (msgp->rmsg_u.msg_fact.rf_type) {
                ...
                case RF_DELMAP_CB_ERR:
                case RF_MIGRATE:
                        return (1);
                case RF_SRV_NOT_RESPOND:
                case RF_SRVS_NOT_RESPOND:
                        return (0);
                default:
                        return (0);
                }
        }
}

Add the new fact to get_facts() so we can print a summary "fact sheet". The fact sheet condenses up to NFS4_MSG_MAX facts into a one-line summary. In our simple example, we don't actually need to save anything to the fact sheet:

static int
get_facts(nfs4_debug_msg_t *msgp, nfs4_rfact_t *ret_fp, char **mnt_pt,
    mntinfo4_t *mi)
{
        ...
                switch (cur_fp->rf_type) {
                ...
                case RF_DELMAP_CB_ERR:
                        break;
                case RF_MIGRATE:
                        break;
                default:
                        zcmn_err(getzoneid(), CE_NOTE,
                            "get facts: illegal fact %d", cur_fp->rf_type);
                        break;
                }
        ...
}

Add to queue_print_fact() to actually print this information to /var/adm/messages. Note: only when something goes really bad (like an open file being forcibly closed on the user, or SIGLOST being posted) do we actually dump the queue's contents to /var/adm/messages - normally we just keep it in kernel memory.

static void
queue_print_fact(nfs4_debug_msg_t *msg, int dump)
{
        nfs4_rfact_t    *fp;
        zoneid_t        zoneid;

        fp = &msg->rmsg_u.msg_fact;
        zoneid = getzoneid();

        switch (fp->rf_type) {
        ...
        case RF_DELMAP_CB_ERR:
                ...
                break;
        case RF_MIGRATE:
                zcmn_err(zoneid, CE_NOTE, "![NFS4][Server: %s][Mntpt: %s]"
                    "file system migrated", msg->msg_srv, msg->msg_mntpt);
                break;
        default:
        ...
        }
        ...
}
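The keep-it-in-memory, dump-only-on-fatal policy can be sketched in user-space C as follows (all names here are illustrative; the real code logs via zcmn_err() to /var/adm/messages):

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of the "dump only when things go really bad"
 * policy: messages accumulate silently in memory, and only a fatal
 * condition flushes the whole backlog (here into a string buffer
 * instead of the system log). Names are illustrative, not the kernel's.
 */
#define	LOG_MAX	8

static const char *pending[LOG_MAX];
static int npending;

/* Record a message; it stays in memory, nothing is printed yet. */
void
diag_note(const char *msg)
{
	if (npending < LOG_MAX)
		pending[npending++] = msg;
}

/* On a fatal error, dump the backlog; returns the number flushed. */
int
diag_dump(char *out, size_t outlen)
{
	int i, n = npending;

	out[0] = '\0';
	for (i = 0; i < npending; i++) {
		strncat(out, pending[i], outlen - strlen(out) - 1);
		strncat(out, "\n", outlen - strlen(out) - 1);
	}
	npending = 0;
	return (n);
}
```

Deferring the dump this way keeps the console and log quiet during normal operation while still preserving the full history for the cases that matter.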

4. Plugging into mdb

The last thing to do is add this new fact to the mdb dcmd nfs4_diag via nfs4_diag.c. Unfortunately, the NFS mdb module is encumbered code, so I can't show the source (yet!), but I can name the functions.

Add this to fact_to_str():

        case RF_MIGRATE:
                return ("RF_MIGRATE");

Then, just like the kernel function queue_print_fact(), we add the human-readable output to nfs4_fact_print():

	case RF_MIGRATE:
		mdb_printf("[NFS4]%Y: file system migrated\n",
		    msg->msg_time.tv_sec);
		break;

When the nfs4_diag.c file does become unencumbered, I'll update this blog... until then, feel free to contact me to make changes.

5. mdb output

# mdb -k
> ::nfs4_diag

*********************************************
vfs: 6000095c900        mi: 3000d2ad000
   mount point: /mnt
    mount from: fsh-weakfish:/export/tmp
Messages queued:
=============================================
[NFS4]2005 May 25 17:14:36: file system migrated
>
>
> ::nfs4_diag -s


*********************************************
vfs: 6000095c900        mi: 3000d2ad000
   mount point: /mnt
    mount from: fsh-weakfish:/export/tmp
Messages queued:
=============================================
2005 May 25 17:14:36: fact RF_MIGRATE
> $q

So now, if there were ever a problem with any server's implementation of migration, we would have a clue on the Solaris client as to why things went wrong.

