DTrace and customer service

Today, I thought I'd share a real-world experience that might portray DTrace in a slightly different light than you're used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you'd be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong - some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

   #!/usr/sbin/dtrace -s

   auto_lookup_request:entry
   {
      @lookups[execname, stringof(args[0]->fi_path)] = count();
   }

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.
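
For a longer capture, the same idea can be left running and made to report at intervals. The following is only a sketch, not what was sent to the customer: the explicit fbt provider, the 10-second interval, and the column widths are assumptions chosen for illustration, while auto_lookup_request and fi_path come from the script above.

```dtrace
#!/usr/sbin/dtrace -s

/* Suppress the default per-probe output; only the printa() lines appear. */
#pragma D option quiet

/* Assumption: the probe is named with the fbt provider spelled out. */
fbt::auto_lookup_request:entry
{
   /* Count lookups per (program name, automount path) pair. */
   @lookups[execname, stringof(args[0]->fi_path)] = count();
}

/* Assumption: a 10-second reporting interval, picked arbitrarily. */
tick-10s
{
   printa("%-16s %-32s %@d\n", @lookups);
   /* Drop all keys so the next interval starts from zero. */
   trunc(@lookups);
}
```

Each interval prints the per-program, per-path lookup counts and then resets them, so a short-lived offender shows up in the interval in which it ran.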

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don't expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just two emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper: automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you'd still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived that they never showed up in prstat(1) output, hiding from traditional system-wide tools.

After a little thought, I imagined a few Solaris 9 scenarios where I'd either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

  • Disruptive to the running system
  • Not guaranteed to isolate the particular problem
  • Difficult for the customer to understand and execute

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can't EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support...

Comments:

Dead link: http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf" Shouldn't it be http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf?

Posted by Rayson Ho on February 01, 2005 at 10:09 PM PST #

Yep, fixed now. Thanks for catching that.

Posted by Eric Schrock on February 02, 2005 at 01:28 AM PST #

In fact you could have done this purely from the command line ...

$ dtrace -n 'auto_lookup_request:entry{@lookups[execname, stringof(args[0]->fi_path)] = count();}'
dtrace: description 'auto_lookup_request:entry' matched 1 probe
^C

  ls             /clones     1
  csh            /clones     5
  dtwm           /home       5

I edited the above for clarity (removing some spaces).

Alan.

Posted by Alan Hargreaves on February 03, 2005 at 12:33 PM PST #

You got me - that's what I actually sent to the customer. But the script looks nicer for illustrative purposes ;-)

Posted by Eric Schrock on February 03, 2005 at 02:52 PM PST #

You might want to consider giving more examples. I'm running a Solaris 10 training for 40+ IT folks from my key customers in April, and examples like this are certainly useful for driving the point home. Thank you. Iwan.

Posted by iwan rahabok on March 21, 2005 at 10:04 PM PST #
