Monday Dec 07, 2009

NIS+ is now NIS-

As already blogged by Chris,

With the push of this feature into Solaris:

6874309 Remove NIS+ from Solaris
PSARC/2009/530 Removal of NIS+

a bit of Solaris history is made. The namespace that was to replace NIS (YP) has been survived by the system it was to replace.

My first day at Sun Microsystems in 1995 was the day I first touched NIS+ having geekily read and re-read the white papers prior to my joining. Chris was already service's recognised NIS+ expert world-wide so I was in excellent company as tentatively typed my first niscat.

To witness its removal is eerie to the least. The irony is that I'm now the manager responsible for the team that just removed something that's been been a golden thread running through my career at Sun for what is almost 15 years. I'll raise a virtual glass to all the people who've worked on it and with it - cheers. As Chris said, "it was fun".

Wednesday Jun 27, 2007

OpenSolaris Troubleshooting - The Unofficial Tourist Guide

I had the pleasure of presenting to the London OpenSolaris User Group (LOSUG) in June. The topic was OpenSolaris Troubleshooting - The Unofficial Tourist Guide.

My experience of debugging problems on non-Solaris systems was one of 'I know what I want to do, but I don't know how they do it', which led me to paraphrase the famous LP Hartly quote, "Diagnosing problems on unfamiliar operating systems is a foreign country: they do things differently there".

It also wanted to link the essence of good troubleshooting to the practical details of getting the data. I'm an Sun Global Resolution Troubleshooting (SGRT) programme leader - which is Sun's implementation of the Kepner-Tregoe processes. Linking the abstract troubleshooting process to gathering actual useful data is one of the trickier aspects when it comes to teaching SGRT.

Suffice to say, OpenSolaris has a rich set of diagnostic tools - not just DTrace - and this talk covers the ones I use regularly.

The PDF of the presentation, which includes an update to the slides on how dbx and mdb deal with core files from other systems, is available on the LOSUG OpenSolaris pages.

It's a presentation I plan to keep updated so if you have any comments please let me know.

Using DTrace to observe nscd(1M) and nis_cachemgr(1M) interaction

I wanted to know why the NIS+ cache manager (nis_cachemgr(1M), aka NCM) was struggling on one of our internal Sun Ray servers. It appeared to be wasting time looking up non-existent directories. This is a problem for NIS+ as:

  • (a) The NCM is single threaded, other callers must wait,
  • (b) The directory cache managed by the NCM doesn't cache negative hits, and
  • (c) Our internal NIS+ domain had a bogus domain with a single unresponding server.

The first two aren't so bad as the NCM deals with that quite quickly, the real problem was the last point. For some reason we kept trying to contact that broken domain.

The NCM doesn't know why it's being asked to bind a directory, only the caller knows why and this is where DTrace came in. I knew the most common caller would be name services and that means nscd(1M), the name services cache. I wrote a D script that extracted the name being looked up:

#!/usr/sbin/dtrace -qs

#pragma D option nspec=32

	self->timestamp = timestamp;
	self->line = 0;
	self->spec = speculation();
	self->lookup = stringof(copyinstr(arg3));
	printf("%016.16#x %02d Looking up: %s\\n",
	    self->timestamp, self->line++,

/self->spec && ! self->called_ncm/
	self->spec = 0;

/self->spec && self->called_ncm/
	this->time = timestamp - self->timestamp;
	this->secs = this->time / 1000000000;
	this->msecs = (this->time - this->secs \* 1000000000) / 1000000;
	this->timestamp = timestamp;
   "%016.16#x %02d Total time for %s: %2d.%03d secs\\n%016.16#x %02d\\n",
	    self->timestamp, self->line++,
	    self->lookup, this->secs, this->msecs,
	    self->timestamp, self->line++);
	self->spec = 0;
	self->called_ncm = 0;

	printf("%016.16#x %02d Calling nis_list(): %s\\n",
	    self->timestamp, self->line++,

	self->called_ncm = 1;
	this->dnamep = \*((uint32_t \*)copyin(arg0, 4));
	self->dname = stringof(copyinstr(this->dnamep));
	self->ncm_start = timestamp - self->timestamp;

	nis_error[0] = " Success";
	nis_error[5] = " Name unreachable";
	nis_error[20] = " No such name";
	this->err = \*((uint32_t \*)copyin(arg1, 4));
	this->start_secs = self->ncm_start / 1000000000;
	this->start_msecs = (self->ncm_start - this->start_secs \* 1000000000)
	    / 1000000;
	printf("%016.16#x %02d %2d.%03d %26.26s (%d%s)\\n",
	    self->timestamp, self->line++,
	    this->start_secs, this->start_msecs,
	    self->dname, this->err, nis_error[this->err]);

It's run like this:

# dtrace -o dtrace.out -q -s nis_list_cache_calls.d $(pgrep nscd)

The interesting thing for me was using speculations. We speculate when nscd calls the NIS+ specific code and record what lookup is taking place. We trace calls to nis_cache_bind_replica_2() and mark this speculation as worthy of interest. When we return from the NIS+ code we check to see if the NCM has been called and if so commit() the speculation.

One thing I learnt was that speculation buffers, like other DTrace buffers, are per-CPU and the traced output may not be in time order. I had to add a timestamp prefix and a line number to the output to make sense of it.

Here's some sample output:

 timestamp-.  Line-.  Action (some with start times)-.
           |       |                                 |
<----------+-----> +> <------------------------------+----------------------->

0x0009c35e1ad134f0 00 Looking up:
0x0009c35e1ad134f0 01 Calling nis_list(): [name=mail]
0x0009c35e1ad134f0 02  0.000 (20 No such name)
0x0009c35e1ad134f0 03  0.001 (20 No such name)
0x0009c35e1ad134f0 04  0.003 (20 No such name)
0x0009c35e1ad134f0 05  0.004 (20 No such name)
0x0009c35e1ad134f0 06  0.005 (20 No such name)
0x0009c35e1ad134f0 07  0.006 (20 No such name)
0x0009c35e1ad134f0 08  0.008 (20 No such name)
0x0009c35e1ad134f0 09  0.009 (20 No such name)
0x0009c35e1ad134f0 10 Total time for  0.010 secs
0x0009c3697101c690 00 Looking up:
0x0009c3697101c690 01 Calling nis_list(): [name=foo]
0x0009c3697101c690 02  0.000 (20 No such name)
0x0009c3697101c690 03  0.001 (20 No such name)
0x0009c3697101c690 04  0.002 (5 Name unreachable)
0x0009c3697101c690 05  5.025 (5 Name unreachable)
0x0009c3697101c690 06 10.045 (20 No such name)
0x0009c3697101c690 07 10.048 (20 No such name)
0x0009c3697101c690 08 Total time for 10.050 secs

In the first example someone is trying to resolve and we can see NIS+ following its search path as that's not a fully qualified name. It took 10ms, not too shabby.

In the second example I tried to resolve which hits a broken domain, that takes a far more significant 10s to resolve as each attempt to bind to the domain has a 5s timeout.

It turned out that the reason we were having this problem is that our Sun Rays now allow direct access to web sites, ie no proxy. That means a far more varied set of name lookups. As nisplus is before dns in our /etc/nsswitch.conf all these are checked with NIS+ first. The combination of the wide range of names, the fact they are not usually fully qualified, the short negative cache time in nscd(1M), the NIS+ search path, the lack of NIS+ caching of non-existent directories, the fact that NCM is single threaded and the broken NIS+ domain resulted in a somewhat hesitant name service.

The resolution was to fix the broken domain and/or put dns first in the /etc/nsswitch.conf. Some people argue that if you're using DNS for host name resolution then don't use NIS+ for host name resolution as well -- this is one example of how that pays off.

Another day, more things learnt, I love this job :-)

Thursday Jun 14, 2007

Fixing terminfo so that terminal applications like vi(1) restore the screen

Ever noticed in Solaris that when you finish using vi(1) the terminal screen is not restored? Similarly true for applications like screen(1) and prstat(1M).

This was irritating me somewhat a few months ago but I've only now just logged bug 6569261 (xterm, xtermc, xterm-color: smcup/rmcup definitions are incorrect/missing) on it.

Essentially the problem is that the smcup and rmcup definitions don't work with our gnome-terminal and xterm implementations.

terminfo(4) defines smcup and rmcup as follows:

       exit_ca_mode       rmcup  te       String to end programs
                                          that use cup

       enter_ca_mode      smcup  ti       String to begin programs
                                          that use cup

These sequences are typically used by terminal applications that control the cursor (eg vi). smcup switches to an alternate screen buffer, rmcup switches back.

The practical effect of this is that the scrolling screen buffer is replaced by a non-scrolling screen buffer when running vi and restored when vi exits.

The rmcup/smcup terminfo descriptions fail to work on Solaris xterm and gnome-terminal implementations. The scolling buffer is not restored and, in some cases, strange escape sequences are seen (eg mysterious '0' characters).

The Solaris xtermc and xterm-color terminfo defines:


These escape sequences don't make much sense for xterm. The xterm terminfo has nothing for rmcup/smcup.

The Sun Freeware terminfo definitions are more as expected:

  xterm-color rmcup=\\E[2J\\E[?47l\\E8,

For 'standard' xterm: smcup saves the cursor and switches to the alternate screen buffer; rmcup clears the screen, returns to the normal screen buffer and restores the cursor position.

  xterm       rmcup=\\E[?1049l

The 1049 sequence is a XFree addition which our xterm doesn't appear to support. Our current gnome-terminal (snv_65) does support this sequence.

  xtermc      rmcup=\\E@0\\E[?4r

This is as broken as our definition.

The workaround is to define smcup and rmcup appropriately. Here's one way to fix it:

  1. mkdir $HOME/terminfo
  2. TERM=xterm infocmp > xterm.src
  3. Edit xterm.src and add/change rmcup and smcup:
    rmcup=\\E[2J\\E[?47l\\E8, smcup=\\E7\\E[?47h,
    every entry must be terminated in a comma (,).
  4. TERMINFO=$HOME/terminfo tic xterm.src
  5. TERMINFO=$HOME/terminfo TERM=xterm xterm

You should now find that your scrolling terminal window is retored after 'fixed' window terminal applications finish.

Wednesday Feb 28, 2007

Process exit status explored: SMF and what does shell's $? represent?


All UNIX shells I've seen allow access to the exit value of the last command through the $? macro. The ksh(1) man page neatly states:

    The value of a simple-command is its exit status if it terminates
    normally. If it terminates abnormally due to receipt of a signal,
    the value is the signal number plus 128.  See signal.h(3HEAD) for
    a list of signal values. Obviously, normal exit status values 129
    to 255 cannot be distinguished from abnormal exit caused by
    receiving signal numbers 1 to 127.

How is exit status set? And how much can we derive from it? The reason I asked this question was I was trying to understand SMF's handling of process exit status.

SMF digression

I was looking at the SMF method failure messages. There are three method failed error paths in usr/src/cmd/svc/startd/method.c method_run():

signal"%s: Method \\"%s\\" failed due to signal %s.\\n"
exit()"%s: Method \\"%s\\" failed with exit status %d.\\n"
other"%s: Method \\"%s\\" failed with exit status %d.\\n"

What I was confused about is under what conditions the third error path can ever be taken. Note the ambiguity in that the last two error messages are identical.

Exit status

Exit status is documented in the wait(3c) man page. This is what we have:

      In the  following,  status  is  the  object  pointed  to  by

        o  If the child process terminated due to an _exit() call,
           the  low  order 8 bits of status will be 0 and the high
           order 8 bits will contain the low order 7 bits  of  the
           argument  that the child process passed to _exit(); see

        o  If the child process terminated due to  a  signal,  the
           high order 8 bits of status will be 0 and the low order
           7bits will contain the number of the signal that caused
           the  termination.  In  addition, if  WCOREFLG is set, a
           "core   image"   will   have   been    produced;    see
           signal.h(3HEAD) and wait.h(3HEAD).

In other words:

If lower 8 bits are zero, we called exit() or fell off the end of main(), exit status in the top 8 bits.

If lower 8 bits are not zero and the upper 8 bits are zero, we took a signal, the top bit (WCOREFLG) is set in the lower 8 bit we also core dumped. The signal taken is in the lower 7 bits.

What this doesn't document is when the lower 8 bits are non-zero and the upper 8 bits are non-zero. The exit status is set by wstat():

 \* convert code/data pair into old style wait status
wstat(int code, int data)
    int stat = (data & 0377);

    switch (code) {
    case CLD_EXITED:
        stat <<= 8;
    case CLD_DUMPED:
        stat |= WCOREFLG;
    case CLD_KILLED:
    case CLD_TRAPPED:
    case CLD_STOPPED:
        stat <<= 8;
        stat |= WSTOPFLG;
        stat = WCONTFLG;
        cmn_err(CE_PANIC, "wstat: bad code");
        /\* NOTREACHED \*/
    return (stat);

If the lower 8 bits and the upper 8 bits are non-zero it looks like a CLD_STOPPED or CLD_TRAPPED. Reading the source further the upper 8 bits will be the signal that caused it.

What you also have to bear in mind is that applications interpret the status returned from wait(3c). For ksh(1) this is as documented above, but what of SMF?

SMF handling of exit status

The SMF code looks like this (with comments from me):

   if (!WIFEXITED(ret_status)) {

       WIFEXITED tests for ((int)((stat)&0xFF) == 0)

       We didn't exit cleanly, so let's find out why

     if (WIFSIGNALED(ret_status)) {

       WIFSIGNALED does this test:
       ((int)((stat)&0xFF) > 0 && (int)((stat)&0xFF00) == 0) \*/

       We already know the first test is non-zero and, assuming
       I've not confused my types, ANDing 0xFF with a signed int
       will always be >=0 as we implicitly cast to the larger
       type (unsigned), AND with 0xFF and then explicitly cast
       back to signed. In other words, it's something of a no-op.

       The second test is more interesting as it relates to the
       overloading of the exit status. If the upper 8 bits are
       zero then it's a simple signal.

       Log: "%s: Method \\"%s\\" failed due to signal %s.\\n"

       Signal is derived using WTERMSIG(), see
       wait.h(3head). Currently that's:

       #define	WTERMSIG(stat)		((int)((stat)&0x7F))

     } else {

       We can only reach this clause if we have something more
       complex than a signal, like the child stopping.

       Log: "%s: Method \\"%s\\" failed with exit status %d.\\n"

       The exit status is: ((int)(((stat)>>8)&0xFF))

       As noted above, I believe the value is the signal that
       that caused the CLD_STOPPED or CLD_TRAPPED.

     \*\* Jump out \*\*

   Normal exit
   \*exit_code = WEXITSTATUS(ret_status);
   if (\*exit_code != 0) {

     Log: "%s: Method \\"%s\\" failed with exit status %d.\\n"



There's an ambiguity in the SMF method failed code in that we don't distinguish between a clean exit and the child being stopped or trapped.

As far as shells are concerned, the man pages need to be checked as regards the handling of the exit status and the resulting $? exit value presented by the shell.

Friday Feb 16, 2007

Dereferencing user space pointers in DTrace(1M) - a worked example


Every time I write a DTrace script I learn something new. This time is was all about exploring a user process problem and using copyin().

Take a look at bug 6523693 (mountd core dump in nfsauth_prog on S10). I could have added some debugging to the mountd source or tinkered about with truss(1) but it was (a) Friday afternoon, and (b) it's not the Solaris 10 way :-)

What I was trying to find ...

Having explored the problem and the core dump a little I found that check_client() was being called with a NULL clnames pointer.

in_access_list(ffbffc60, 0, ffa33, 1, 2a7d0, 19400)
check_client_old(0, ffbffc60, 0, 1, ffa30, ffa33))
nfsauth_access_svc(ffbffcd4, ffbffcd0, 2d330, ff32f928, 4b59c, cfb40)
nfsauth_prog+0x98(fdb38, fdb38, 2d330, 0, ff0d2000, 1)`_svc_prog_dispatch+0x184(fdb38, fdc18, 2d330, 1, 162a0, 1)`_svc_run_mt+0x640(fdb38, ff32ce34, 0, ff322f58, ffffff39, ff327054)`svc_run+0x94(193dc, ff327018, 2a008, ff322f58, 4f788, 0)
main+0x754(19000, 19000, 12c00, 18400, 3, 19000)
_start+0x108(0, 0, 0, 0, 0, 0)

The OpenSolaris version of nfsauth_access_svc looks like this:

     52 static void
     53 nfsauth_access(auth_req \*argp, auth_res \*result)
     54 {
     55 	struct netconfig \*nconf;
     56 	struct nd_hostservlist \*clnames = NULL;
     57 	struct netbuf nbuf;
     58 	struct share \*sh;
     59 	char tmp[MAXIPADDRLEN];
     60 	char \*host = NULL;
     62 	result->auth_perm = NFSAUTH_DENIED;
     64 	/\*
     65 	 \* Convert the client's address to a hostname
     66 	 \*/
     67 	nconf = getnetconfigent(argp->req_netid);
     68 	if (nconf == NULL) {
     69 		syslog(LOG_ERR, "No netconfig entry for %s", argp->req_netid);
     70 		return;
     71 	}
     73 	nbuf.len = argp->req_client.n_len;
     74 	nbuf.buf = argp->req_client.n_bytes;
     76 	if (netdir_getbyaddr(nconf, &clnames, &nbuf)) {
     77 		host = &tmp[0];
     78 		if (strcmp(nconf->nc_protofmly, NC_INET) == 0) {
     79 			struct sockaddr_in \*sa;
     81 			/\* LINTED pointer alignment \*/
     82 			sa = (struct sockaddr_in \*)nbuf.buf;
     83 			(void) inet_ntoa_r(sa->sin_addr, tmp);
     84 		} else if (strcmp(nconf->nc_protofmly, NC_INET6) == 0) {
     85 			struct sockaddr_in6 \*sa;
     86 			/\* LINTED pointer \*/
     87 			sa = (struct sockaddr_in6 \*)nbuf.buf;
     88 			(void) inet_ntop(AF_INET6, sa->sin6_addr.s6_addr,
     89 				    tmp, INET6_ADDRSTRLEN);
     90 		}
     91 		clnames = anon_client(host);
     92 	}
     94 	/\*
     95 	 \* Now find the export
     96 	 \*/
     97 	sh = findentry(argp->req_path);
     98 	if (sh == NULL) {
     99 		syslog(LOG_ERR, "%s not exported", argp->req_path);
    100 		goto done;
    101 	}
    103 	result->auth_perm = check_client(sh, &nbuf, clnames, argp->req_flavor);

The combination of static function names and pointers to pointers made this somewhat hard to debug using the pid provider. In particular, I wanted to decode struct nd_hostservlist \*clnames.

Dereferencing a user address pointer

We first decode the call to netdir_getbyaddr(nconf, &clnames, &nbuf) which graciously provides us with the address of the pointer:

/self->nfsauth = 1/
	self->arg1 = arg1;
	self->clnames = \*(uint32_t \*)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);

We save the value of arg1 (&clnames) as a thread-local variable so that we can use it later to monitor what happens to that pointer. As it it's a user address pointer we can't look at it directly, we have to copy it into the kernel.

copyin() copies from the user address to DTrace scratch space. It returns a DTrace pointer to the result. At this point we have to assume that we're examining a 32-bit process - pointers are 32-bit - so the copyin() copies 4 bytes. To use this further we need to cast it to what it represents ... in this case a pointer to a 32-bit number. To examine that value we dereference the DTrace pointer. The result is what clnames is pointing to - which should be NULL at this point.

On the return from that function we can check what clnames is set to, and decode it if possible. As DTrace doesn't have conditionals we have to use predicates:

/self->nfsauth = 1 && \*(uint32_t \*)copyin(self->arg1, 4)/
	printf("returned %d, ", arg1);

	self->clnames = \*(uint32_t \*)copyin(self->arg1, 4);
	printf("clnames %#p\\n", self->clnames);

	this->h_cnt = (int \*)copyin(self->clnames, 4);
	this->h_hostservs = (uint32_t \*)copyin(self->clnames + 4, 4);
	printf("h_cnt %d, ", \*this->h_cnt);
	printf("h_hostservs %#p\\n", \*this->h_hostservs);

	this->h_hostp = (uint32_t \*)copyin(\*this->h_hostservs, 4);
	this->h_host = copyinstr(\*this->h_hostp);
	printf("h_host %s", this->h_host);

/self->nfsauth = 1 && ! \*(uint32_t \*)copyin(self->arg1, 4)/
	printf("returned %d, ", arg1);

	self->clnames = \*(uint32_t \*)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);

The predicate dereferences the clnames pointer and only decodes the result if it's non-NULL.

The probe actions further demonstrate dereferencing user pointers. The structure we're picking apart looks like this:

struct nd_hostservlist {
	int			h_cnt;		/\* number of nd_hostservs \*/
	struct nd_hostserv	\*h_hostservs;	/\* the entries \*/

As self->clnames is a DTrace pointer to a 32-bit user address we can use copyin() again to get what it's pointing to ... in this case a structure. The first member of which is an int ...

	this->h_cnt = (int \*)copyin(self->clnames, 4);

The second member of which is another pointer to another structure:

	this->h_hostservs = (uint32_t \*)copyin(self->clnames + 4, 4);

The latter can be dereferenced again using copyin() to decode:

struct nd_hostserv {
	char		\*h_host;	/\* the host name \*/
	char		\*h_serv;	/\* the service name \*/

This is the complete script

#!/usr/sbin/dtrace -s

 \* This DTrace script traces calls to netdir_byaddr() et al. in
 \* nfsauth_access_svc(). It's somewhat complicated by some of the
 \* functions being static and the general complexity of manipulating
 \* user pointers in DTrace's kernel space. It handles both resolved
 \* and unresolved names.

 \* This was written to debug 6523693 (mountd core dump in nfsauth_prog
 \* on S10) where clnames was NULL on calling client_check(). It should
 \* be started:
 \*	# ./mountd-auth-debug.d $(pgrep -x mountd)

	self->nfsauth = 1;

/self->nfsauth = 1/
	self->nfsauth = 0;

/self->nfsauth = 1/
	self->arg1 = arg1;
	self->clnames = \*(uint32_t \*)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);

/self->nfsauth = 1 && \*(uint32_t \*)copyin(self->arg1, 4)/
	printf("returned %d, ", arg1);

	self->clnames = \*(uint32_t \*)copyin(self->arg1, 4);
	printf("clnames %#p\\n", self->clnames);

	this->h_cnt = (int \*)copyin(self->clnames, 4);
	this->h_hostservs = (uint32_t \*)copyin(self->clnames + 4, 4);
	printf("h_cnt %d, ", \*this->h_cnt);
	printf("h_hostservs %#p\\n", \*this->h_hostservs);

	this->h_hostp = (uint32_t \*)copyin(\*this->h_hostservs, 4);
	this->h_host = copyinstr(\*this->h_hostp);
	printf("h_host %s", this->h_host);

/self->nfsauth = 1 && ! \*(uint32_t \*)copyin(self->arg1, 4)/
	printf("returned %d, ", arg1);

	self->clnames = \*(uint32_t \*)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);

/self->nfsauth = 1/
	printf("host '%s'", copyinstr(arg0));

/self->nfsauth = 1/
	printf("returned %#p\\n", arg1);

	this->h_cnt = (int \*)copyin(arg1, 4);
	this->h_hostservs = (uint32_t \*)copyin(arg1 + 4, 4);
	printf("h_cnt %d, ", \*this->h_cnt);
	printf("h_hostservs %#p\\n", \*this->h_hostservs);

	this->h_hostp = (uint32_t \*)copyin(\*this->h_hostservs, 4);
	this->h_host = copyinstr(\*this->h_hostp);
	printf("h_host %s", this->h_host);

/self->nfsauth = 1/
	printf("clnames %#p\\n", arg2);

	this->h_cnt = (int \*)copyin(arg2, 4);
	this->h_hostservs = (uint32_t \*)copyin(arg2 + 4, 4);
	printf("h_cnt %d, ", \*this->h_cnt);
	printf("h_hostservs %#p\\n", \*this->h_hostservs);

	this->h_hostp = (uint32_t \*)copyin(\*this->h_hostservs, 4);
	this->h_host = copyinstr(\*this->h_hostp);
	printf("h_host %s", this->h_host);


As we say in Perl TIMTOWTDI and I did this as much as an exercise as anything else but, think about this:

I can send this script to any customer running Solaris 10 and decode their mountd behaviour associated with the bug. No debug binaries, no intrusive and wasteful trussing or complicated debugger procedures.

As any DTrace person will tell you, most DTrace scripts are evolved as they explore the problem. DTrace is a superb observability tool but like all observations it matters where you look and having found one clue you follow the trail to the next.

Think: Treasure trail ... enjoy :-)

PS Bonus points for anyone who can spot the obvious lack of error checking in that code

Thursday Feb 01, 2007

Understanding snoop(1M) NFSv3 file handles


Reading the snoop(1M) trace of NFS traffic you'll see references to file handles, but interpreting these is not straightforward. As any NFS engineer should tell you, file handles are opaque to the NFS client and are meaningful only to the NFS server that issued them.

For Solaris the file handle is derived from the underlying file system so some parts of the file handle are meaningful only to the underlying file system.

Here's an example snoop(1M) output:

  3   0.00000 v4u-450f-gmp03 -> v4u-80a-gmp03 NFS C GETATTR3 FH=FD0D
  4   0.00035 v4u-80a-gmp03 -> v4u-450f-gmp03 NFS R GETATTR3 OK

So what does FH=FD0D mean? File handles are surely longer than that?


To make things easier to read, snoop(1M) hashes the file handle to 16-bits. Check the sum_filehandle() code in OpenSolaris:

	int len;
	int i, l;
	int fh = 0;

	for (i = 0; i < len; i += 4) {
		l = getxdr_long();
		fh \^= (l >> 16) \^ l;

	return (fh);

To see the complete file handle we need the verbose output:

# snoop -p3,4 -v -i /tmp/snoop2.out | grep NFS:
NFS:  ----- Sun NFS -----
NFS:  Proc = 1 (Get file attributes)
NFS:  File handle = [FD0D]
NFS:   0080001000000002000A0000000091EF23696D2E000A000000008FDF5F48F2A0
NFS:  ----- Sun NFS -----
NFS:  Proc = 1 (Get file attributes)
NFS:  Status = 0 (OK)
NFS:    File type = 1 (Regular File)
NFS:    Mode = 0644
NFS:     Setuid = 0, Setgid = 0, Sticky = 0
NFS:     Owner's permissions = rw-
NFS:     Group's permissions = r--
NFS:     Other's permissions = r--
NFS:    Link count = 1, User ID = 0, Group ID = 0
NFS:    File size = 301, Used = 1024
NFS:    Special: Major = 0, Minor = 0
NFS:    File system id = 137438953488, File id = 37359
NFS:    Last access time      = 01-Feb-07 15:12:16.398735000 GMT
NFS:    Modification time     = 01-Feb-07 15:12:16.410570000 GMT
NFS:    Attribute change time = 01-Feb-07 15:12:16.410570000 GMT

Note the file system ID and file ID values for later.

To break down that file handle you need to understand the NFS server's implementation, for NFSv3 on OpenSolaris we have:

typedef struct {
	fsid_t	_fh3_fsid;			/\* filesystem id \*/
	ushort_t _fh3_len;			/\* file number length \*/
	char	_fh3_data[NFS_FH3MAXDATA];	/\* and data \*/
	ushort_t _fh3_xlen;			/\* export file number length \*/
	char	_fh3_xdata[NFS_FH3MAXDATA];	/\* and data \*/
} fhandle3_t;

Which means ...

fh3_fsid    0080001000000002
fh3_len     000A
fh3_data    0000000091EF23696D2E
fh3_xlen    000A
fh3_xdata   000000008FDF5F48F2A0

The fh3_fsid is itself a compressed version of the dev_t for the device and the file system type, check cmpldev(). Essentially:

  • the first 32 bits (0x00800010) are the major number shifted right 14 bits plus the minor number
  • the second 32 bits (0x00000002) are the file system type, see struct vfssw

That compressed fsid is what you see in mnttab, for this case:

# grep 800010 /etc/mnttab
/dev/dsk/c0t0d0s0       /       ufs     rw,intr,largefiles,logging,xattr,onerror=panic,dev=800010       1170093759

Reassuringly, file system type 2 is ufs.

The fh3_data is derived from the underlying file system (ufs) which is a ufid structure:

struct ufid {
	ushort_t ufid_len;
	ushort_t ufid_flags;
	int32_t	ufid_ino;
	int32_t	ufid_gen;

So 0000000091EF23696D2E breaks down as:

ufid_flags   0000
ufid_ino     000091EF
ufid_gen     23696D2E

Reassuringly again, ufid_ino (the inode) makes sense:

# mdb
> 000091EF=U
> !ls -li /export
total 922
     37359 -rw-r--r--   1 root     root         301 Feb  1 15:12 motd

That's the file I was checking from the NFS client and it matches the file ID from the snoop output.

The fh3_xdata represents the export data, ie the exported file system. The inode number in this case is 0x00008FDF. Checking:

> !share
-               /export   rw   ""  
> 00008FDF=U
> !ls -lid /export
     36831 drwxr-xr-x   2 root     sys          512 Feb  1 15:12 /export

If you've been paying attention you might be wondering what happened to the file system ID (137438953488). This is the uncompressed dev_t value. We can check it by compressing it (14 bit shift of the major, add the minor):

> 0t137438953488=J
> (0t137438953488>>0t14) + 0t137438953488&0xffffffff=J

Yes, that looks familiar.


As already noted, NFS file handles are only meaningful to the NFS server and this example is just for the Solaris NFSv3 implementation. However, I hope it's given some insight into how that works and with this knowledge it's relatively easy to match snoop(1M) file handles to files on the NFS server.

Wednesday Jan 10, 2007

pfiles(1), locked files and a pinch of salt

Following a few days working this out I thought I'd share my thoughts on why the locking information returned by pfiles(1) should be taken with a pinch of salt.

What pfiles(1) tells us about locked open files

On Solaris, if we run pfiles(1) on a process that has open files it checks each file to see if it has any locks. The way it does this is to use fcntl(2) and F_GETLK. This is more complex than it first appears as it runs the system call from the context of the process being inspected. Suffice to say, it produces output like this:

# pfiles 102226
102226: /home/foo/lock /home/foo/testfile
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:301,0 ino:12582918 uid:54321 gid:7 rdev:24,1
   1: S_IFCHR mode:0620 dev:301,0 ino:12582918 uid:54321 gid:7 rdev:24,1
   2: S_IFCHR mode:0620 dev:301,0 ino:12582918 uid:54321 gid:7 rdev:24,1
   3: S_IFREG mode:0666 dev:314,5 ino:2557867 uid:54321 gid:10 size:343
      advisory write lock set by system 0x1 process 102225

All fine and dandy, but what is meant by "system" and is that process ID actually useful?

How the process and system IDs are derived

To understand this we need to look much closer at how these values are set and in particular whether we're using NFSv3 or NFSv4. I'm ignoring NFSv2 but that will behave similarly to NFSv3.

There are two sides to this, who sets the lock and who checks it with pfiles(1).

Setting the lock

For NFSv3 we use lockd/statd and the locking code passes the PID of the locking process to the server. There is no 'system id' passed to the server. The NFSv3 server passes the PID to underlying file system in the flock structure of the VOP_FRLOCK() call.

For NFSv4 it is considerably more complicated but essentially it passes what should be an opaque 'lock_owner' structure to the server. This can only by understood by the client that sent it. The NFSv4 server creates a unique identifier which it assigns to the PID member of the flock structure. As with NFSv3 this flock structure is passed to the underlying file system.

The pfiles(1) output

The pfiles(1) output is based on a F_GETLK call for the open file. This returns a structure with l_pid and l_sysid. The values of these is as follows:

For NFSv3 the locking code returns the PID passed to the underlying file system in the original VOP_FRLOCK() call. For locks set for NFSv3 this is the true PID of the locking process. The sysid may be set to zero in which case it is ignored by pfiles(1). If the lock was held by a process on this machine the sysid will be set to the system id of the NFS server plus 0x4000. For locks set by NFSv4 the PID is not the PID but the unique identifier used at lock time on the server.

For NFSv4, if we're the client that set the lock we can return sensible values for the PID and system ID - though the latter is \*always\* the client itself. If we're not the client we fabricate the PID and sysid as follows:

  • sysid: we take the sysid of the server and add 1
  • pid: we add up all the bytes in the clientid and owner
   15370 		/\*
   15371 		 \* Construct a new sysid which should be different from
   15372 		 \* sysids of other systems.
   15373 		 \*/
   15375 		flk->l_sysid++;
   15376 		flk->l_pid = lo_to_pid(&lockt_denied->owner);

Interpreting the results

So, with that in mind we have the following result:

NFSv3 set the lock, read by NFSv3

Process ID is correct, no system identification. eg:
      advisory write lock set by process 102250

Or perhaps:

      advisory write lock set by system 0x4001 process 102265

NFSv4 set the lock, read by NFSv3

Process ID is unique but meaningless to the client. No system identification. eg:
      advisory write lock set by process 776458

NFSv3 set the lock, read by NFSv4

The lock owner will not match on the client, result is fabricated. eg:
      advisory write lock set by system 0x2 process 512

NFSv4 set the lock, read by NFSv4

If the lock owner matches, we return the true PID of the process running on this system. The sysid is that of the server as understood by the client. eg:
      advisory write lock set by system 0x1 process 102313
If the lock owner doesn't match the client the result is fabricated. eg:
      advisory write lock set by system 0x2 process 1580


While checking this out I found and logged two curious bugs in this area.

If pfiles(1) is run on an x64 box the 64-bit version is run. If the target process is 32-bit it mixes up the process and system id. Compare these two outputs:

# /usr/bin/amd64/pfiles 2353 | grep advisory
      advisory write lock set by system 0x943
# /usr/bin/i86/pfiles 2353 | grep advisory  
      advisory write lock set by system 0x1 process 2371

Jan 11th - the following is not a bug

The other bug is that for failed lock attempts on NFSv4 or NFSv3 the PID is not set, in my test code I set the l_pid to 12345.

My code output:

F_GETLK returned type 2, sysid 1, pid 2371
F_SETLK returned type 2, sysid 1, pid 12345

truss -t fcntl -vall output

fcntl(3, F_GETLK, 0x08047C60)                   = 0
        typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=1  pid=2371
fcntl(3, F_SETLK, 0x08047C60)                   Err#11 EAGAIN
        typ=F_WRLCK  whence=SEEK_SET start=0     len=0     sys=1  pid=12345

Notice the PID is wrong when the lock fails but is correct when tested?

The fcntl(2) man page clearly states that the l_pid and l_sysid values are only used for F_GETLK calls:

     The l_pid and l_sysid fields are used only with  F_GETLK  or
     F_GETLK64  to return the process ID of the process holding a
     blocking lock and to indicate which system is  running  that

One more complication - pfiles(1) doesn't show locks held by the process itself

The reason pfiles(1) fails to show locks held by a process is due to the way that pfiles works. It effectively hijacks the process being checked and makes the fcntl(2) call within that process context. Processes can re-lock files they already have open and locked as per the fcntl(2) man page:

     There will be at most one type of lock set for each byte  in
     the  file.  Before  a  successful  return  from  an F_SETLK,
     F_SETLK64, F_SETLKW, or F_SETLKW64 request when the  calling
     process has previously existing locks on bytes in the region
     specified by the request, the previous lock  type  for  each
     byte  in  the  specified  region will be replaced by the new
     lock type.


The output of pfiles(1) is tremendously useful but the locking information is not necessarily accurate or useful.

If you're running pfiles(1) against an NFSv4 file then if the process ID matches a local process it's more than likely the process holding the lock. Otherwise, it's meaningless.

If you're running pfiles(1) against an NFSv3 file then it's harder. Again if the process ID makes sense on this client then it's probably correct. Otherwise you could check it against known NFS clients of the same file system ... but that's time consuming.

To properly identify lock owners some mdb(1) or dtrace(1M) is required - and this blog entry is already too long :-)

Friday Dec 08, 2006

SIGBUS versus SIGSEGV - according to siginfo.h(3head)

Having asked a number of colleagues I failed to find a consistent answer to the question of the differences between SIGBUS and SIGSEGV. According to the Solaris signal(3head) man page we have:

     Name             Value   Default     Event
     SIGBUS           10      Core        Bus Error
     SIGSEGV          11      Core        Segmentation Fault

So I dug a bit further and found that siginfo_t can tell you more about the origins of the signal, in particular we have, from the siginfo.h(3head) man page:

     Signal         Code                 Reason
     SIGSEGV        SEGV_MAPERR          address not mapped to object
                    SEGV_ACCERR          invalid permissions for mapped object
     SIGBUS         BUS_ADRALN           invalid address alignment
                    BUS_ADRERR           non-existent physical address
                    BUS_OBJERR           object specific hardware error

Obviously this may be open to interpretation but that clarifies a few things for me.

For the techie take a look at the OpenSolaris source code for the trap() function. Here we see the handling for various types of trap including page faults. For example, there's a section where a decision is made as to return SIGBUS or SIGSEGV:

	case T_WIN_OVERFLOW + T_USER:	/\* window overflow in ??? \*/
	case T_WIN_UNDERFLOW + T_USER:	/\* window underflow in ??? \*/
	case T_SYS_RTT_PAGE + T_USER:	/\* window underflow in user_rtt \*/
	case T_INSTR_MMU_MISS + T_USER:	/\* user instruction mmu miss \*/
	case T_DATA_MMU_MISS + T_USER:	/\* user data mmu miss \*/
	case T_DATA_PROT + T_USER:	/\* user data protection fault \*/
		switch (type) {
		 \* In the case where both pagefault and grow fail,
		 \* set the code to the value provided by pagefault.
		(void) instr_size(rp, &addr, rw);
		bzero(&siginfo, sizeof (siginfo));
		siginfo.si_addr = addr;
		if (FC_CODE(res) == FC_OBJERR) {
			siginfo.si_errno = FC_ERRNO(res);
			if (siginfo.si_errno != EINTR) {
				siginfo.si_signo = SIGBUS;
				siginfo.si_code = BUS_OBJERR;
				fault = FLTACCESS;
		} else { /\* FC_NOMAP || FC_PROT \*/
			siginfo.si_signo = SIGSEGV;
			siginfo.si_code = (res == FC_NOMAP) ?
			fault = FLTBOUNDS;

I was digging around this following a discussion regarding bug 6466257 (mmap file writing fails on nfs3 client with EMC nas device) and the signals delivered by mmap(2). The man page suggests that either SIGBUS or SIGSEGV can be returned for a number of error conditions but doesn't seem sure which. The answer is, "it depends".

So my conclusion is sadly another question - can an application developer infer anything from a SIGBUS versus a SIGSEGV? The answer, I believe, is yes - but quite often the result is the same which is to fix the code :-)

Monday Jul 10, 2006

Compiling lsof 4.77 for Solaris 10 6/06 (known internally as update 2)

We found a problem compiling lsof 4.77 for Solaris 10 6/06, see this bug:

  6440943 <sys/kstat.h> should include <sys/types32.h>

When you compile lsof with Sun's compiler you get "syntax error near caddr32_t". There's a similar failure for gcc.

The workaround in the bug allows lsof 4.77 to be compiled.

We also found that you have to recompile lsof for Solaris 10 6/06. Versions compiled for Solaris 10, Solaris 10 3/05 and Solaris 10 1/06 don't work properly. Here's a diff of the two different outputs:
\*\*\* /tmp/1      Mon Jul 10 14:39:32 2006
--- /tmp/2 Mon Jul 10 14:39:38 2006
\*\*\* 12,19 \*\*\*\*
nis_cache 195 root 1u VCHR 0,0 4 /devices/pseudo/cn@0:console
nis_cache 195 root 2u VCHR 0,0 4 /devices/pseudo/cn@0:console
nis_cache 195 root 3u VCHR 0,0 4 /devices/pseudo/cn@0:console
! nis_cache 195 root 4u IPv4 0x600012e4c80 0t0 UDP \*:65535
nis_cache 195 root 5r DOOR 0t0 54 /var/run/rpc_door/rpc_100029.2 (door to keyserv[178]) (FA:-&gt;0x600011fb0c0)
! nis_cache 195 root 6u IPv4 0x60001ffef00 0t0 UDP \*:65535
! nis_cache 195 root 7u IPv4 0x60001ffe100 0t0 UDP \*:65535
! nis_cache 195 root 8u IPv4 0x6000095adc0 0t524 TCP \*:65535 (CLOSE_WAIT)
--- 12,19 ----
nis_cache 195 root 1u VCHR 0,0 4 /devices/pseudo/cn@0:console
nis_cache 195 root 2u VCHR 0,0 4 /devices/pseudo/cn@0:console
nis_cache 195 root 3u VCHR 0,0 4 /devices/pseudo/cn@0:console
! nis_cache 195 root 4u IPv4 0x600012e4c80 0t0 UDP \*:32800
nis_cache 195 root 5r DOOR 0t0 54 /var/run/rpc_door/rpc_100029.2 (door to keyserv[178]) (FA:-&gt;0x600011fb0c0)
! nis_cache 195 root 6u IPv4 0x60001ffef00 0t0 UDP \*:32975
! nis_cache 195 root 7u IPv4 0x60001ffe100 0t0 UDP \*:32977
! nis_cache 195 root 8u IPv4 0x6000095adc0 0t524 TCP mach1:32921->mach2:58028 (CLOSE_WAIT)

The UDP and TCP ports are shown incorrectly. I've passed this on to the lsof maintainers.

Thursday Jun 29, 2006

A simple workaround for /net not showing newly exported filesystems

I just submitted this to our internal documentation system that ultimately publishes it on SunSolve (assuming it's approved) [Note: now published as SRDB 86189]. But I often get asked this - so here's the answer:

Problem Statement:

If the automounter on an NFS client has already mounted a -hosts map (eg /net) for an NFS server it will not show file systems later exported on that server.


There are at least three different ways to resolve this:

  1. Wait for the /net mount to time-out
  2. Use a different name for the host
    eg /net/foo
    Each one will be a new 'view' on the exported directories.

  3. Reboot
There's an RFE for this which, if you follow the duplicate of links, comes to another RFE which I've recently reopened. Workaround 2 is by far the easiest :-)


Wednesday Jun 28, 2006

Deskbar and the Mozilla/Firefox bookmarks search

I wasn't getting the Mozilla/Firefox features appearing in Deskbar. In other words: searching bookmarks, using the quick searches and using the search engine(s).

After some digging around the python source and some truss output I found that it searched for the 'enabled' flag in the gconf preferences:

  def is_preferred_browser(test):
        # We will import only if the user's preferred browser is mozilla
        http_handler = gconf.client_get_default().get_string(
        if not gconf.client_get_default().get_bool(
                return False

        if http_handler.find(test) != -1:
                return True

        http_handler = http_handler.split(" ")[0]
For some reason I didn't have an 'enabled' key. I added it using gconf-editor, restarted Deskbar and voilĂ  - my Firefox searches appeared. Magic. Anyone know why that key was missing?

One possible cause is that I didn't delete any of my Gnome config files when I moved to snv_41. AFAIK Chris \*did\* delete his Gnome config and his Deskbar Mozilla/Firefox searching worked first time.

Thursday Mar 30, 2006

Interesting article on "conscientious software"

Read an interesting interview with Ron Goldman on Conscientious Software today. The idea of self-maintaining software which handles the inevitable bugs in code struck a chord with the work that I do.

I work in the Solaris sustaining organisation which means we maintain the current releases, typically by fixing bugs escalated to us by customers. Some bugs are new, some are already known about and some are already fixed in the development release - see OpenSolaris for details :-)

As you might expect, we don't have infinite resources so we prioritise this work. In other words, don't be surprised that there really are bugs in Solaris and we really do know about them and no, we haven't fixed them yet.

We are painfully aware that most current software isn't conscientious as the article discusses. I am enormously grateful for the development of tools such as libumem and DTrace but code that's also able to handle its failings and environment would be welcome too.

Fault Management Architecture is perhaps a step in the right direction though it's currently more applicable to hardware. A few of us looked at applying Fault trees to software ... but it proved very difficult - at least when it came to retro-fitting to existing software designs.


Wednesday Mar 22, 2006

Underground bunkers, ZFS and cool threads

underground bunker Fascinating thread on the OpenSolaris NFS forum. David Orman is working on building an underground data centre in Hawaii and is getting interested in ZFS, NFS and CoolThreads.

There a Pacific Business News article on this and David sent a link to some pictures.

The \*ahem\* cool bit is seeing how OpenSolaris lead David to not only ZFS but to Solaris and Sun systems in general.

Other bunkers of interest include:


Friday Oct 07, 2005

Thoughts about a bug in share(1M) and utmpx

When I submitted bug 6318057 (share(1M) shouldn't break if the system time is wrong) I had a nagging doubt and if I've learned anything from working in software it's listen to those nagging doubts.

Essentially I didn't know how utmpx(4) was used and, in particular, what was contained in the very first entry.

The essence of the bug is that share(1M) checks the time of the sharetab(4) file against the last shutdown time as read from the first entry in utmpx. If the current time is earlier than the last shutdown time (for whatever reason) the sharetab is cleared every time share is run resulting in only one thing ever being shared.

But surely, I thought, utmpx is cleared on boot so the first entry must be the boot time and not the shutdown time?

Not so.

Here's some simple code to check the times from utmpx.

#include ‹stdio.h›
#include ‹utmpx.h›
#include ‹time.h›
#include ‹stdlib.h›

#define BUFSIZE 100

static char \*ut_type_name[] = {
        "DOWN_TIME" };

main(int argc, char \*\*argv)
        struct utmpx \*utmpxp;
        char buf[100];
        int i, recs;

        if (argc > 1) {
                recs = atoi(argv[1]);
        } else {
                recs = 2;

        for (i = 0; i < recs; i++) {
                utmpxp = getutxent();
                (void) cftime(buf, 0, &utmpxp->ut_xtime);
                (void) printf("Time is %s, ", buf);
                (void) printf("Type is %d (%s)\\n", utmpxp->ut_type,
                    utmpxp->ut_type <
                    (sizeof (ut_type_name))/(sizeof (char \*)) ?
                    ut_type_name[utmpxp->ut_type] :
        return (0);

When run it shows the first and second entry in utmpx are:

  Time is Mon Oct  3 03:46:24 BST 2005, Type is 10 (DOWN_TIME)
  Time is Mon Oct  3 03:53:32 BST 2005, Type is 2 (BOOT_TIME)

In other words, the first record in utmpx and thus the record read by share is the DOWN_TIME. You can see this logic from the SMF svc.startd(1M) code where we do indeed clear utmpx ...

    150 	utmpx_clear_old();
    151 	utmpx_write_boottime();

But we also write a DOWN_TIME record:

    370 	if (stat(WTMPX_FILE, &stbuf) == 0 && stbuf.st_size != 0) {
    371 		tstamp = (stbuf.st_atime >= stbuf.st_mtime) ?
    372 		    stbuf.st_atime : stbuf.st_mtime;
    373 		utmpx_write_entry(DOWN_TIME, DOWN_MSG, tstamp);
    374 	}
    376 	/\*
    377 	 \* The boot time (or start time, for a non-global zone) is retrieved in
    378 	 \* log_init().
    379 	 \*/
    380 	tstamp = st->st_start_time.tv_sec;
    382 	utmpx_write_entry(BOOT_TIME, BOOT_MSG, tstamp);

So that explains why the first record is DOWN_TIME not BOOT_TIME. I feel happier now :-)

Technorati Tag:
Technorati Tag:




« July 2016