Tuesday Feb 01, 2005

DTrace and customer service

Today, I thought I'd share a real-world experience that might portray DTrace in a slightly different light than you're used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you'd be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong - some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

   #!/usr/sbin/dtrace -s

   auto_lookup_request:entry
   {
      @lookups[execname, stringof(args[0]->fi_path)] = count();
   }

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don't expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just 2 emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper - automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you'd still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived, that they never showed up in prstat(1) output, hiding from traditional system-wide tools.

After a little thought, I imagined a few Solaris 9 scenarios where I'd either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

  • Disruptive to the running system
  • Not guaranteed to isolate the particular problem
  • Difficult for the customer to understand and execute

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can't EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support...

About

Musings about Fishworks, Operating Systems, and the software that runs on them.

Search

Archives
« February 2005 »
SunMonTueWedThuFriSat
  
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
     
       
Today