Wednesday Jun 27, 2007

Using DTrace to observe nscd(1M) and nis_cachemgr(1M) interaction

I wanted to know why the NIS+ cache manager (nis_cachemgr(1M), aka NCM) was struggling on one of our internal Sun Ray servers. It appeared to be wasting time looking up non-existent directories. This is a problem for NIS+ as:

  • (a) The NCM is single threaded, so other callers must wait,
  • (b) The directory cache managed by the NCM doesn't cache negative hits, and
  • (c) Our internal NIS+ namespace had a bogus domain with a single unresponsive server.

The first two aren't so bad as the NCM deals with them quite quickly; the real problem was the last point. For some reason we kept trying to contact that broken domain.

The NCM doesn't know why it's being asked to bind a directory; only the caller knows why, and this is where DTrace came in. I knew the most common caller would be name services, and that means nscd(1M), the name service cache daemon. I wrote a D script that extracted the name being looked up:

#!/usr/sbin/dtrace -qs

#pragma D option nspec=32

pid$1::_nss_nisplus_expand_lookup:entry
{
	self->timestamp = timestamp;
	self->line = 0;
	self->spec = speculation();
	speculate(self->spec);
	self->lookup = stringof(copyinstr(arg3));
	printf("%016.16#x %02d Looking up: %s\n",
	    self->timestamp, self->line++,
	    self->lookup);
}

pid$1::_nss_nisplus_expand_lookup:return
/self->spec && ! self->called_ncm/
{
	discard(self->spec);
	self->spec = 0;
}

pid$1::_nss_nisplus_expand_lookup:return
/self->spec && self->called_ncm/
{
	speculate(self->spec);
	this->time = timestamp - self->timestamp;
	this->secs = this->time / 1000000000;
	this->msecs = (this->time - this->secs * 1000000000) / 1000000;
	this->timestamp = timestamp;
	printf(
   "%016.16#x %02d Total time for %s: %2d.%03d secs\n%016.16#x %02d\n",
	    self->timestamp, self->line++,
	    self->lookup, this->secs, this->msecs,
	    self->timestamp, self->line++);
	commit(self->spec);
	self->spec = 0;
	self->called_ncm = 0;
}

pid$1:libnsl:nis_list:entry
/self->spec/
{
	speculate(self->spec);
	printf("%016.16#x %02d Calling nis_list(): %s\n",
	    self->timestamp, self->line++,
	    stringof(copyinstr(arg0)));
}

pid$1:libnsl:nis_cache_bind_replica_2:entry
/self->spec/
{
	self->called_ncm = 1;
	this->dnamep = *((uint32_t *)copyin(arg0, 4));
	self->dname = stringof(copyinstr(this->dnamep));
	self->ncm_start = timestamp - self->timestamp;
}

pid$1:libnsl:nis_cache_bind_replica_2:return
/self->spec/
{
	speculate(self->spec);
	nis_error[0] = " Success";
	nis_error[5] = " Name unreachable";
	nis_error[20] = " No such name";
	this->err = *((uint32_t *)copyin(arg1, 4));
	this->start_secs = self->ncm_start / 1000000000;
	this->start_msecs = (self->ncm_start - this->start_secs * 1000000000)
	    / 1000000;
	printf("%016.16#x %02d %2d.%03d %26.26s (%d%s)\n",
	    self->timestamp, self->line++,
	    this->start_secs, this->start_msecs,
	    self->dname, this->err, nis_error[this->err]);
}

It's run like this:

# dtrace -o dtrace.out -q -s nis_list_cache_calls.d $(pgrep nscd)

The interesting thing for me was using speculations. We speculate when nscd calls the NIS+ specific code and record what lookup is taking place. We trace calls to nis_cache_bind_replica_2() and mark this speculation as worthy of interest. When we return from the NIS+ code we check to see if the NCM has been called and if so commit() the speculation.

One thing I learnt was that speculation buffers, like other DTrace buffers, are per-CPU and the traced output may not be in time order. I had to add a timestamp prefix and a line number to the output to make sense of it.

Here's some sample output:


 timestamp-.  Line-.  Action (some with start times)-.
           |       |                                 |
<----------+-----> +> <------------------------------+----------------------->

0x0009c35e1ad134f0 00 Looking up: mail.google.com
0x0009c35e1ad134f0 01 Calling nis_list(): [name=mail]ipnodes.org_dir.google.com
0x0009c35e1ad134f0 02  0.000        org_dir.google.com. (20 No such name)
0x0009c35e1ad134f0 03  0.001 ipnodes.org_dir.google.com (20 No such name)
0x0009c35e1ad134f0 04  0.003 org_dir.google.com.eu.cte. (20 No such name)
0x0009c35e1ad134f0 05  0.004 ipnodes.org_dir.google.com (20 No such name)
0x0009c35e1ad134f0 06  0.005 org_dir.google.com.cte.sun (20 No such name)
0x0009c35e1ad134f0 07  0.006 ipnodes.org_dir.google.com (20 No such name)
0x0009c35e1ad134f0 08  0.008 org_dir.google.com.sun.com (20 No such name)
0x0009c35e1ad134f0 09  0.009 ipnodes.org_dir.google.com (20 No such name)
0x0009c35e1ad134f0 10 Total time for mail.google.com:  0.010 secs
...
0x0009c3697101c690 00 Looking up: foo.swiss
0x0009c3697101c690 01 Calling nis_list(): [name=foo]hosts.org_dir.swiss
0x0009c3697101c690 02  0.000 org_dir.swiss.eu.cte.sun.c (20 No such name)
0x0009c3697101c690 03  0.001 hosts.org_dir.swiss.eu.cte (20 No such name)
0x0009c3697101c690 04  0.002 org_dir.swiss.cte.sun.com. (5 Name unreachable)
0x0009c3697101c690 05  5.025 hosts.org_dir.swiss.cte.su (5 Name unreachable)
0x0009c3697101c690 06 10.045     org_dir.swiss.sun.com. (20 No such name)
0x0009c3697101c690 07 10.048 hosts.org_dir.swiss.sun.co (20 No such name)
0x0009c3697101c690 08 Total time for foo.swiss: 10.050 secs

In the first example someone is trying to resolve mail.google.com and we can see NIS+ following its search path as that's not a fully qualified name. It took 10ms, not too shabby.

In the second example I tried to resolve foo.swiss which hits a broken domain, that takes a far more significant 10s to resolve as each attempt to bind to the domain has a 5s timeout.

It turned out that the reason we were having this problem is that our Sun Rays now allow direct access to web sites, ie no proxy. That means a far more varied set of name lookups. As nisplus is before dns in our /etc/nsswitch.conf all these are checked with NIS+ first. The combination of the wide range of names, the fact they are not usually fully qualified, the short negative cache time in nscd(1M), the NIS+ search path, the lack of NIS+ caching of non-existent directories, the fact that NCM is single threaded and the broken NIS+ domain resulted in a somewhat hesitant name service.

The resolution was to fix the broken domain and/or put dns first in /etc/nsswitch.conf. Some people argue that if you're using DNS for host name resolution then you shouldn't use NIS+ for host name resolution as well -- this is one example of how that advice pays off.

Another day, more things learnt, I love this job :-)

Friday Feb 16, 2007

Dereferencing user space pointers in DTrace(1M) - a worked example

Introduction

Every time I write a DTrace script I learn something new. This time it was all about exploring a user process problem and using copyin().

Take a look at bug 6523693 (mountd core dump in nfsauth_prog on S10). I could have added some debugging to the mountd source or tinkered about with truss(1), but it was (a) Friday afternoon, and (b) that's not the Solaris 10 way :-)

What I was trying to find ...

Having explored the problem and the core dump a little I found that check_client() was being called with a NULL clnames pointer.

in_access_list(ffbffc60, 0, ffa33, 1, 2a7d0, 19400)
check_client_old(0, ffbffc60, 0, 1, ffa30, ffa33)
check_client()
nfsauth_access_svc(ffbffcd4, ffbffcd0, 2d330, ff32f928, 4b59c, cfb40)
nfsauth_prog+0x98(fdb38, fdb38, 2d330, 0, ff0d2000, 1)
libnsl.so.1`_svc_prog_dispatch+0x184(fdb38, fdc18, 2d330, 1, 162a0, 1)
libnsl.so.1`_svc_run_mt+0x640(fdb38, ff32ce34, 0, ff322f58, ffffff39, ff327054)
libnsl.so.1`svc_run+0x94(193dc, ff327018, 2a008, ff322f58, 4f788, 0)
main+0x754(19000, 19000, 12c00, 18400, 3, 19000)
_start+0x108(0, 0, 0, 0, 0, 0)

The OpenSolaris version of nfsauth_access_svc looks like this:

static void
nfsauth_access(auth_req *argp, auth_res *result)
{
	struct netconfig *nconf;
	struct nd_hostservlist *clnames = NULL;
	struct netbuf nbuf;
	struct share *sh;
	char tmp[MAXIPADDRLEN];
	char *host = NULL;

	result->auth_perm = NFSAUTH_DENIED;

	/*
	 * Convert the client's address to a hostname
	 */
	nconf = getnetconfigent(argp->req_netid);
	if (nconf == NULL) {
		syslog(LOG_ERR, "No netconfig entry for %s", argp->req_netid);
		return;
	}

	nbuf.len = argp->req_client.n_len;
	nbuf.buf = argp->req_client.n_bytes;

	if (netdir_getbyaddr(nconf, &clnames, &nbuf)) {
		host = &tmp[0];
		if (strcmp(nconf->nc_protofmly, NC_INET) == 0) {
			struct sockaddr_in *sa;

			/* LINTED pointer alignment */
			sa = (struct sockaddr_in *)nbuf.buf;
			(void) inet_ntoa_r(sa->sin_addr, tmp);
		} else if (strcmp(nconf->nc_protofmly, NC_INET6) == 0) {
			struct sockaddr_in6 *sa;
			/* LINTED pointer */
			sa = (struct sockaddr_in6 *)nbuf.buf;
			(void) inet_ntop(AF_INET6, sa->sin6_addr.s6_addr,
				    tmp, INET6_ADDRSTRLEN);
		}
		clnames = anon_client(host);
	}

	/*
	 * Now find the export
	 */
	sh = findentry(argp->req_path);
	if (sh == NULL) {
		syslog(LOG_ERR, "%s not exported", argp->req_path);
		goto done;
	}

	result->auth_perm = check_client(sh, &nbuf, clnames, argp->req_flavor);

The combination of static functions and pointers to pointers made this somewhat hard to debug using the pid provider. In particular, I wanted to decode struct nd_hostservlist *clnames.

Dereferencing a user address pointer

We first decode the call to netdir_getbyaddr(nconf, &clnames, &nbuf) which graciously provides us with the address of the pointer:

pid$1::netdir_getbyaddr:entry
/self->nfsauth == 1/
{
	self->arg1 = arg1;
	self->clnames = *(uint32_t *)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);
}

We save the value of arg1 (&clnames) as a thread-local variable so that we can use it later to monitor what happens to that pointer. As it's a user address pointer we can't look at it directly; we have to copy it into the kernel.

copyin() copies from the user address to DTrace scratch space. It returns a DTrace pointer to the result. At this point we have to assume that we're examining a 32-bit process - pointers are 32-bit - so the copyin() copies 4 bytes. To use this further we need to cast it to what it represents ... in this case a pointer to a 32-bit number. To examine that value we dereference the DTrace pointer. The result is what clnames is pointing to - which should be NULL at this point.

On the return from that function we can check what clnames is set to, and decode it if possible. As DTrace doesn't have conditionals we have to use predicates:

pid$1::netdir_getbyaddr:return
/self->nfsauth == 1 && *(uint32_t *)copyin(self->arg1, 4)/
{
	printf("returned %d, ", arg1);

	self->clnames = *(uint32_t *)copyin(self->arg1, 4);
	printf("clnames %#p\n", self->clnames);

	this->h_cnt = (int *)copyin(self->clnames, 4);
	this->h_hostservs = (uint32_t *)copyin(self->clnames + 4, 4);
	printf("h_cnt %d, ", *this->h_cnt);
	printf("h_hostservs %#p\n", *this->h_hostservs);

	this->h_hostp = (uint32_t *)copyin(*this->h_hostservs, 4);
	this->h_host = copyinstr(*this->h_hostp);
	printf("h_host %s", this->h_host);
}

pid$1::netdir_getbyaddr:return
/self->nfsauth == 1 && ! *(uint32_t *)copyin(self->arg1, 4)/
{
	printf("returned %d, ", arg1);

	self->clnames = *(uint32_t *)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);
}

The predicate dereferences the clnames pointer and only decodes the result if it's non-NULL.

The probe actions further demonstrate dereferencing user pointers. The structure we're picking apart looks like this:

struct nd_hostservlist {
	int			h_cnt;		/* number of nd_hostservs */
	struct nd_hostserv	*h_hostservs;	/* the entries */
};

As self->clnames now holds a 32-bit user address we can use copyin() again to get what it's pointing to ... in this case a structure. The first member of which is an int ...

	this->h_cnt = (int *)copyin(self->clnames, 4);

The second member of which is another pointer to another structure:

	this->h_hostservs = (uint32_t *)copyin(self->clnames + 4, 4);

The latter can be dereferenced again using copyin() to decode:

struct nd_hostserv {
	char		*h_host;	/* the host name */
	char		*h_serv;	/* the service name */
};

This is the complete script:

#!/usr/sbin/dtrace -s

/*
 * This DTrace script traces calls to netdir_getbyaddr() et al. in
 * nfsauth_access_svc(). It's somewhat complicated by some of the
 * functions being static and the general complexity of manipulating
 * user pointers in DTrace's kernel space. It handles both resolved
 * and unresolved names.
 */

/*
 * This was written to debug 6523693 (mountd core dump in nfsauth_prog
 * on S10) where clnames was NULL on calling check_client(). It should
 * be started:
 *
 *	# ./mountd-auth-debug.d $(pgrep -x mountd)
 *
 */

pid$1::nfsauth_prog:entry
{
	self->nfsauth = 1;
}

pid$1::nfsauth_prog:return
/self->nfsauth == 1/
{
	self->nfsauth = 0;
}

pid$1::netdir_getbyaddr:entry
/self->nfsauth == 1/
{
	self->arg1 = arg1;
	self->clnames = *(uint32_t *)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);
}

pid$1::netdir_getbyaddr:return
/self->nfsauth == 1 && *(uint32_t *)copyin(self->arg1, 4)/
{
	printf("returned %d, ", arg1);

	self->clnames = *(uint32_t *)copyin(self->arg1, 4);
	printf("clnames %#p\n", self->clnames);

	this->h_cnt = (int *)copyin(self->clnames, 4);
	this->h_hostservs = (uint32_t *)copyin(self->clnames + 4, 4);
	printf("h_cnt %d, ", *this->h_cnt);
	printf("h_hostservs %#p\n", *this->h_hostservs);

	this->h_hostp = (uint32_t *)copyin(*this->h_hostservs, 4);
	this->h_host = copyinstr(*this->h_hostp);
	printf("h_host %s", this->h_host);
}

pid$1::netdir_getbyaddr:return
/self->nfsauth == 1 && ! *(uint32_t *)copyin(self->arg1, 4)/
{
	printf("returned %d, ", arg1);

	self->clnames = *(uint32_t *)copyin(self->arg1, 4);
	printf("clnames %#p", self->clnames);
}

pid$1::anon_client:entry
/self->nfsauth == 1/
{
	printf("host '%s'", copyinstr(arg0));
}

pid$1::anon_client:return
/self->nfsauth == 1/
{
	printf("returned %#p\n", arg1);

	this->h_cnt = (int *)copyin(arg1, 4);
	this->h_hostservs = (uint32_t *)copyin(arg1 + 4, 4);
	printf("h_cnt %d, ", *this->h_cnt);
	printf("h_hostservs %#p\n", *this->h_hostservs);

	this->h_hostp = (uint32_t *)copyin(*this->h_hostservs, 4);
	this->h_host = copyinstr(*this->h_hostp);
	printf("h_host %s", this->h_host);
}

pid$1::check_client:entry
/self->nfsauth == 1/
{
	printf("clnames %#p\n", arg2);

	this->h_cnt = (int *)copyin(arg2, 4);
	this->h_hostservs = (uint32_t *)copyin(arg2 + 4, 4);
	printf("h_cnt %d, ", *this->h_cnt);
	printf("h_hostservs %#p\n", *this->h_hostservs);

	this->h_hostp = (uint32_t *)copyin(*this->h_hostservs, 4);
	this->h_host = copyinstr(*this->h_hostp);
	printf("h_host %s", this->h_host);
}

Conclusion

As we say in Perl, TIMTOWTDI (there's more than one way to do it), and I did this as much as an exercise as anything else. But think about this:

I can send this script to any customer running Solaris 10 and decode their mountd behaviour associated with the bug. No debug binaries, no intrusive and wasteful trussing or complicated debugger procedures.

As any DTrace person will tell you, most DTrace scripts evolve as you explore the problem. DTrace is a superb observability tool, but as with all observation it matters where you look; having found one clue you follow the trail to the next.

Think: Treasure trail ... enjoy :-)

PS Bonus points for anyone who can spot the obvious lack of error checking in that code

About

PeteH
