Thursday Nov 02, 2006

DTrace: a history

An unsurprisingly common request on the DTrace discussion forum has been for updated documentation. People have been -- on the whole -- very pleased with the Solaris Dynamic Tracing Guide that we worked hard to produce, but I readily admit that we haven't been nearly as diligent in updating it. OK: we haven't updated it at all.

But we have been updating DTrace itself: adding new variables and functions, tacking on new features, adding new providers, and fixing bugs. Unless you've been scraping our putback logs or reading between the lines on the discussion forum, though, those changes haven't necessarily been obvious. To that end, I've scraped the putback logs, tried to tease out some of the bigger features, and put them all on the DTrace Change Log. We'll try to keep this up to date so you can see which features are in the build of Solaris Nevada or the Solaris 10 release you're running.

This is probably going to be handy in its own right and ameliorate the documentation gap, but we do still need to update the documentation. I'm torn between keeping it in SGML (or converting it to XML), and converting it to a wiki. The former has the disadvantage of being overly onerous to update (witness the complete lack of updates), while the latter prevents us from releasing it in printed form (unless someone knows of a tool that can turn a wiki into a book). If anyone from the community is interested in working on this project, it would be a tremendous help to every DTrace user and developer.


Friday Sep 22, 2006

DTrace is a web developer's best friend

I have this friend who might be most accurately described as a web developer. When DTrace was able to observe php he was interested. Me: "I should give you a demo some time." Him: "Absolutely..."

When DTrace ticked Ruby off its list, he was more enthusiastic. Him: "Cool! I loves me the Ruby!" Me: "Let me know when you want that demo".

The other day I got an IM from my friend. Him: "DTrace for JavaScript, eh?" Me: "How 'bout that, huh?" Him: "So when can I get that demo?"

Last week Brendan Gregg released Helper Monkey -- a DTrace-enabled version of Mozilla's SpiderMonkey JavaScript engine. Why was this the tipping point for my friend the web developer? Probably not because he's more fond of JavaScript than php or Ruby; much more likely, it's because JavaScript is an observability atrocity. Even if you don't use any probes in DTrace other than the ones associated with JavaScript, Helper Monkey is a galactic improvement on the state of JavaScript development. Consider the next 40,000 - 200,000 DTrace probes gravy.


Wednesday Sep 13, 2006

DTrace user number one

Some people think DTrace was built for developers; others think it was for system administrators; some even think it was a tool designed just for Solaris kernel hackers but was so useful we decided to unleash it on the world. All wrong. The user we always had in mind was Solaris user extraordinaire Jarod Jenson. DTrace lets you explore virtually any element of the system -- its biggest limitation is the user's own knowledge of the system. Jarod has the most diverse and expansive knowledge of enterprise computing bar none; in his hands DTrace seemingly has no limit.

Here's how Jarod works. He gets on a plane and arrives at some gigantic (and usually petulant) corporation. Having never seen the application, he then says something like: I'll get you a 20% win or don't pay me. He sits down with DTrace and gets between 20% and 20,000% (no joke). And from those experiences, he's provided input that's improved DTrace immeasurably (in fact, Jarod hacked up the first version of Java support in DTrace "right quick").

So how does he do it? I only have a very vague idea. Luckily, DTrace user number one is also the latest member of the blogosphere. Check out Jarod's blog to get the latest DTrace war stories. I know I've been awaiting this one for a while, and it's about damned time.

Monday Aug 07, 2006

Apple's DTrace team

As Bryan wrote, Apple has ported DTrace to Mac OS X. The Apple kernel team invited us to the WWDC today for the (albeit muted) announcement of their DTrace support, and then for a demo and dinner. It was surprisingly fun to play with DTrace on another OS, and it was a true pleasure to talk to the Apple guys who worked on the port. And it's my pleasure to introduce those engineers to the DTrace community at large:


James McIlree, Tom Duffy, Steve Peters, Terry Lambert (DTrace on Mac OS X inset)

And here's team DTrace with our new friends at Apple:


Congratulations to the Apple team and to Mac OS X users. I'm installing the DTrace-enabled Mac OS X on my laptop right now -- something I've been wanting for a long time...


Friday Jun 16, 2006

DTrace on Geek Muse

DTrace was recently featured on episode 35 of Geek Muse; it was brought to their attention because of John Birrell's recent work to port it to FreeBSD (nice work, John!). The plug was nice, but I did want to respond to a few things:

DTrace was referred to as "a scripting language for debugging". While I can understand why one might get that impression, it's kind of missing the point. DTrace, concisely, is a systemic observability framework that's designed explicitly for use on mission-critical systems. It lets users and system administrators get concise answers to arbitrary questions. The scripting language aspect of DTrace lets you express those questions, but that's really just one component. James Dickens took a stab at an all-encompassing definition of DTrace...
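To make "concise answers to arbitrary questions" a little more concrete, here's the flavor of thing I mean (just an illustrative sketch): a one-liner that answers "who's making all the system calls on this box?"

# dtrace -n 'syscall:::entry { @[execname] = count(); }'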

One of the podcasters said something to the effect of "I'm just a web developer..." One of the great things about DTrace is that it has uses for developers at almost every layer of the stack. Initially DTrace could only observe the kernel and C and C++ code, but since its release in Solaris 10 well over a year ago, DTrace has been extended to Java, Ruby, php, python, perl, and a handful of other dynamic languages that folks who are "just web developers" tend to use. In addition to being able to understand how your own code works, you'll be able to see how it interacts with every level of the system all the way down to things like disk I/O and the CPU scheduler.

Shortly after that, someone opined "I could use it for looking at XML-RPC communication". For sure! DTrace is crazy useful for understanding communication between processes; in particular, for XML-RPC you can view calls and replies quickly and easily.
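As a rough sketch of what that might look like (the process name "xmlrpcd" here is made up, and copyinstr() stops at the first NUL it finds), you could watch the requests and replies a server writes with something like:

# dtrace -n 'syscall::write:entry /execname == "xmlrpcd"/ { printf("%s", copyinstr(arg1, arg2)); }'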

At one point they also identified the need to make sure users can't use DTrace to spy on each other. By default, DTrace is only executable by the root user. System administrators can dole out various levels of DTrace privilege to users as desired. Check out the manual -- and the security chapter in particular.
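As a sketch of what that looks like in practice (the username here is made up; the security chapter describes exactly what each privilege permits), an administrator might allow a user to apply DTrace to his or her own processes like this:

# usermod -K defaultpriv=basic,dtrace_proc,dtrace_user alice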


Monday May 22, 2006

JavaOne DTrace Session 2006

Jarod Jenson and I presented at JavaOne last week. If you're interested, here are the slides we used:

Admittedly, a pdf can't capture the scene of Jarod typing in DTrace commands while I futilely try to narrate, but you can run your own demos by checking out this, this, and this.


Wednesday May 17, 2006

DTrace caption contest

Two of the DTrace creators (that's me on the right, and Bryan on the left) apparently square off at the recent OpenSolaris user group meeting at JavaOne. You fill in the bubbles; the funniest response wins nothing but my heartfelt amusement.

Friday May 12, 2006

DTrace at JavaOne 2006

At last year's JavaOne, DTrace enjoyed some modicum of celebrity, being featured in a keynote, a session, and the DTrace challenge. The session was attended by about 1,000 people (out of about 10,000 at the conference), and the DTrace Challenge -- which promised to find a performance win in Java applications brought to us or fork over an iPod -- went an impressive 15 for 16 (and that one outlier was a trivial 200 line program -- we've added a 2,000 line minimum this year).

Building on that success, Jarod Jenson and I are giving an updated session on DTrace for Java Thursday 5/18 2:45-3:45 in the Moscone Center Gateway 102/103. We'll be giving a similar introduction to DTrace and to using DTrace on Java as at last year's talk, and showing off some of the improvements we've made in the past year. Jarod's also manning the DTrace Challenge again this year, so get your Java application onto a USB key, CD, or whatever, and bring it by booth 739; if he can't make it go faster you'll get an iPod, but if you were hoping for an iPod rather than a performance win, prepare to be disappointed. Angelo Rajadurai and Peter Karlsson are also giving a hands-on lab on using DTrace to examine Java applications (Friday 5/19 1:15-2:15 Moscone Center Hall E 130/131), so be sure to sign up for that if you want to get your hands dirty with the stuff we talk about at our session.


Monday May 08, 2006

User-land tracing gets better and better

As I've mentioned in the past, developers can add their own DTrace probes using the user-land statically defined tracing (USDT) mechanism. It's been used to instrument Postgres and Apache, and to add observability into dynamic languages such as Java, Ruby, and php. I recently made a couple of improvements to USDT that I mentioned here and here, but I think deserve a little more discussion.

Adding USDT probes (as described in the DTrace manual) requires creating a file defining the probes, modifying the source code to identify those probe sites, and modifying the build process to invoke dtrace(1M) with the -G option, which causes it to emit an object file that is then linked into the final binary. Bart wrote up a nice example of how to do this. The mechanisms are mostly the same, but have been tweaked a bit.

USDT in C++

One of the biggest impediments to using USDT was its (entirely understandable) enmity toward C++. Briefly, the problem was that the modifications to the source code used a structure that was incompatible with C++ (it turns out you can only extern "C" symbols at the file scope -- go figure). To address this, I added a new -h option that creates a header file based on the probe definitions. Here's what the new way looks like:

provider.d

provider database {
        probe query__start(char *);
        probe query__done(char *);
};

src.c or src.cxx

...
#include "provider.h"
...
static int
db_query(char *query, char *result, size_t size)
{
        ...
        DATABASE_QUERY_START(query);
        ...
        DATABASE_QUERY_DONE(result);
        ...
}

Here's how you compile it:

$ dtrace -h -s provider.d
$ gcc -c src.c
$ dtrace -G -s provider.d src.o
$ gcc -o db provider.o src.o ...

If you've looked at the old USDT usage, the big differences are the creation and use of provider.h, and that we use the PROVIDER_PROBE() macro rather than the generic DTRACE_PROBE1() macro. In addition to working with C++, this has the added benefit that it engages the compiler's type checking since the macros in the generated header file require the types specified in the provider definition.
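Once the probes are compiled in, enabling them looks like any other D enabling. Here's a rough sketch of a script that uses the database provider and the db binary from the example above to report a latency distribution per query:

querytime.d

#!/usr/sbin/dtrace -s

database$target:::query-start
{
        self->query = copyinstr(arg0);
        self->ts = timestamp;
}

database$target:::query-done
/self->ts/
{
        /* histogram of nanoseconds spent in each query */
        @[self->query] = quantize(timestamp - self->ts);
        self->ts = 0;
        self->query = 0;
}

You'd run it against a running process with something like ./querytime.d -p `pgrep db`.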

Is-Enabled Probes

One of the tenets of DTrace is that the mere presence of probes can never slow down the system. We achieve this for USDT probes by only adding the overhead of a few no-op instructions. And while it's mostly true that USDT probes induce no overhead, there are some cases where the overhead can actually be substantial. The actual probe site is as cheap as a no-op, but setting up the arguments to the probe can be expensive. This is especially true for dynamic languages where probe arguments such as the class or method name are often expensive to compute. As a result, some providers -- the one for Ruby, for example -- couldn't be used in production due to the disabled probe effect.

To address this problem, Bryan and I came up with the idea of what -- for lack of a better term -- I call is-enabled probes. Every probe specified in the provider definition has an associated is-enabled macro (in addition to the actual probe macro). That macro lets the program check whether the DTrace probe is currently enabled, so it only does the work of computing the requisite arguments when they're actually needed.

For comparison, Rich Lowe's prototype Ruby provider basically looked like this:

rb_call(...
{
        ...
        RUBY_ENTRY(rb_class2name(klass), rb_id2name(method));
        ...
        RUBY_RETURN(rb_class2name(klass), rb_id2name(method));
        ...
}
Where rb_class2name() and rb_id2name() perform quite expensive operations.

With is-enabled probes, Bryan was able to greatly reduce the overhead of the Ruby provider to essentially zero:

rb_call(...
{
        ...
        if (RUBY_ENTRY_ENABLED())
                RUBY_ENTRY(rb_class2name(klass), rb_id2name(method));
        ...
        if (RUBY_RETURN_ENABLED())
                RUBY_RETURN(rb_class2name(klass), rb_id2name(method));
        ...
}

When the source objects are post-processed by dtrace -G, each is-enabled site is turned into a simple move of 0 into the return value register (%eax, %rax, or %o0 depending on your ISA and bitness). When probes are disabled, we get to skip all the expensive argument setup; when a probe is enabled, the is-enabled site changes so that the return value register will have a 1. (It's also worth noting that you can pull some compiler tricks to make sure that the program text for the uncommon case -- probes enabled -- is placed out of line.)

The obvious question is then "When should is-enabled probes be used?" As with so many performance questions the only answer is to measure both. If you can eke by without is-enabled probes, do that: is-enabled probes are incompatible with versions of Solaris earlier than Nevada build 38 and they incur a greater enabled probe effect. But if acceptable performance can only be attained by using is-enabled probes, that's exactly where they were designed to be used.


Sunday Feb 12, 2006

DTrace at SCALE 4x

I gave a talk today at SCALE 4X describing how to use DTrace to understand Linux applications by using BrandZ. I admit I was a little leery about how a talk about a Solaris feature would go over at a Linux conference, but it was very well received. Here's my slide deck:

For more, check out my blog post on the subject.


Friday Feb 03, 2006

DTrace in ACM Queue

While usually a shameless self-promoter, Bryan has been conspicuously silent about his recent article in ACM Queue. It's quite good, and goes beyond what has previously been written about DTrace in discussing its origins, the landscape of the problem, and some fundamental innovations embodied in DTrace -- information that may have been too obscure in our USENIX paper (according to at least one Stanford professor).


Tuesday Dec 13, 2005

DTrace for Linux

With BrandZ, it's now possible to use DTrace on Linux applications. For the uninitiated, DTrace is the dynamic tracing facility in OpenSolaris; it allows for systemic analysis of a scope and precision unequalled in the industry. With DTrace, administrators and developers can trace low-level services like I/O and scheduling, up the system stack through kernel function calls, system calls, and system library calls, and into applications written in C and C++ or any of a host of dynamic languages like Java, Ruby, Perl or php. One of my contributions to BrandZ was to extend DTrace support for Linux binaries executed in a branded Zone.

DTrace has several different instrumentation providers that know how to instrument a particular part of the system and provide relevant probes for that component. The io provider lets you trace disk I/O, the fbt (function boundary tracing) provider lets you trace any kernel function call, etc. A typical system will start with more than 30,000 probes, but providers can create probes dynamically to trace new kernel modules or user-land processes. When strictly focused on a user-land application, the most useful providers are typically the syscall provider to examine system calls and the pid provider that can trace any instruction in any process executing on the system.
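If you're curious about the probe count on your own system, dtrace(1M) will happily list them (just an illustration; the count varies from system to system and release to release):

# dtrace -l | wc -l
# dtrace -l -P io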

For Linux processes, the pid provider just worked (well, once Russ built a library to understand the Linux run-time linker), and we introduced a new provider -- the lx-syscall provider -- to trace entry and return for emulated Linux system calls. With these providers, it's possible to understand every facet of a Linux application's behavior, and with the other DTrace probes it's possible to reason about an application's use of system resources. In other words, you can take that sluggish Linux application, stick it in a branded Zone, dissect it using Solaris tools, and then bring it back to a native Linux system with the fruits of your DTrace investigation[1].

To give an example of using DTrace on Linux applications, I needed an application to examine. I wanted a well-known program that either didn't run on Solaris or operated sufficiently differently that examining the Linux version rather than the Solaris port made sense. I decided on /usr/bin/top partly because of the dramatic differences between how it operates on Linux vs. Solaris (due to the differences in /proc), but mostly because of what I've heard my colleague, Bryan, refer to as the "top problem": your system is slow, so you run top. What's the top process? Top!

With top running in the Linux branded zone, I opened a shell in the global (Solaris) zone to use DTrace. I started as I do with Solaris applications: I looked at system calls. I was interested to see which system calls were being executed most frequently, which is easily expressed in DTrace:

bash-3.00# dtrace -n lx-syscall:::entry'/execname == "top"/{ @[probefunc] = count(); }'
dtrace: description 'lx-syscall:::entry' matched 272 probes
^C

  fstat64                                                         322
  access                                                          323
  gettimeofday                                                    323
  gtime                                                           323
  llseek                                                          323
  mmap2                                                           323
  munmap                                                          323
  select                                                          323
  getdents64                                                     1289
  lseek                                                          1291
  stat64                                                         3545
  rt_sigaction                                                   5805
  write                                                          6459
  fcntl64                                                        6772
  alarm                                                          8708
  close                                                         11282
  open                                                          14827
  read                                                          14830

Note the use of the aggregation denoted with the '@'. Aggregations are the mechanism by which DTrace allows users to examine patterns of system behavior rather than examining each individual datum -- each system call for example. (In case you also noticed the strange discrepancy between the number of open and close system calls, many of those opens are failing so it makes sense that they would have no corresponding close. I used the lx-syscall provider to suss this out, but I omitted that investigation in a vain appeal for brevity.)

There may well be something fishy about this output, but nothing struck me as so compellingly fishy to explore immediately. Instead, I fired up vi and wrote a short D script to see which system calls were taking the most time:

lx-sys.d

#!/usr/sbin/dtrace -s

lx-syscall:::entry
/execname == "top"/
{
        self->ts = vtimestamp;
}

lx-syscall:::return
/self->ts/
{
        @[probefunc] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

This script creates a table of system calls and the time spent in them (in nanoseconds). The results were fairly interesting.

bash-3.00# ./lx-sys.d
dtrace: script './lx-sys.d' matched 550 probes
^C

  llseek                                                      4940978
  gtime                                                       5993454
  gettimeofday                                                6603844
  fstat64                                                    14217312
  select                                                     26594875
  lseek                                                      30956093
  mmap2                                                      43463946
  access                                                     49033498
  alarm                                                      72216971
  fcntl64                                                   188281402
  rt_sigaction                                              197646176
  stat64                                                    268188885
  close                                                     417574118
  getdents64                                                781844851
  open                                                     1314209040
  read                                                     1862007391
  write                                                    2030173630
  munmap                                                   2195846497

That seems like a lot of time spent in munmap for top. In fact, I'm rather surprised that there's any mapping and unmapping going on at all (I guess that should have raised an eyebrow after my initial system call count). Unmapping memory is a pretty expensive operation that gets more expensive on bigger systems as it requires the kernel to do some work on every CPU to completely wipe out the mapping.
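As an aside, if you want to see that cross-CPU work directly, the sysinfo provider's xcalls probe fires when a CPU cross call is posted; a rough one-liner to see who's inducing cross calls while top is running:

# dtrace -n 'sysinfo:::xcalls { @[execname] = count(); }'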

I then modified lx-sys.d to record the total amount of time top spent on the CPU and the total amount of time spent in system calls to see how large a chunk of time these seemingly expensive unmap operations were taking:

lx-sys2.d

#!/usr/sbin/dtrace -s

lx-syscall:::entry
/execname == "top"/
{
        self->ts = vtimestamp;
}

lx-syscall:::return
/self->ts/
{
        @[probefunc] = sum(vtimestamp - self->ts);
        @["- all syscalls -"] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

sched:::on-cpu
/execname == "top"/
{
        self->on = timestamp;
}

sched:::off-cpu
/self->on/
{
        @["- total -"] = sum(timestamp - self->on);
        self->on = 0;
}

I used the sched provider to see when top was going on and off of the CPU, and I added a row to record the total time spent in all system calls. Here were the results (keep in mind I was just hitting ^C to end the experiment after a few seconds so it's expected that these numbers would be different from those above; there are ways to have more accurately timed experiments):

bash-3.00# ./lx-sys2.d
dtrace: script './lx-sys2.d' matched 550 probes
^C

  llseek                                                       939771
  gtime                                                       1088745
  gettimeofday                                                1090020
  fstat64                                                     2494614
  select                                                      4566569
  lseek                                                       5186943
  mmap2                                                       7300830
  access                                                      8587484
  alarm                                                      11671436
  fcntl64                                                    31147636
  rt_sigaction                                               33207341
  stat64                                                     45223200
  close                                                      69338595
  getdents64                                                131196732
  open                                                      220188139
  read                                                      309764996
  write                                                     340413183
  munmap                                                    365830103
  - all syscalls -                                         1589236337
  - total -                                                3258101690

So system calls are consuming roughly half of top's time on the CPU and the munmap syscall is consuming roughly a quarter of that. This was enough to convince me that there was probably room for improvement and further investigation might bear fruit.
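As an aside, one way to get the more accurately timed experiment alluded to above is to have the script end itself after a fixed interval rather than waiting for ^C; adding a clause like this (a sketch) to lx-sys2.d would stop it after ten seconds:

profile:::tick-10sec
{
        exit(0);
}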

Next, I wanted to understand what this mapped memory was being used for so I wrote a little script that traces all the functions called in the process between when memory is mapped using the mmap2(2) system call and when it's unmapped and returned to the system through the munmap(2) system call:

map.d

#!/usr/sbin/dtrace -s

#pragma D option quiet

lx-syscall::mmap2:return
/pid == $target/
{
        self->ptr = arg1;
        self->depth = 10;

        printf("%\*.s <- %s syscall\\n", self->depth, "", probefunc);
}

pid$target:::entry
/self->ptr/
{
        self->depth++;
        printf("%\*.s -> %s`%s\\n", self->depth, "", probemod, probefunc);
}

pid$target:::return
/self->ptr/
{
        printf("%\*.s <- %s`%s\\n", self->depth, "", probemod, probefunc);
        self->depth--;
}

lx-syscall::munmap:entry
/arg0 == self->ptr/
{
        self->depth++;
        printf("%\*.s -> %s syscall\\n", self->depth, "", probefunc);

        self->ptr = 0;
        self->depth = 0;
        exit(0);
}

This script uses the $target variable, which means that we need to run it with the -p option whose argument is the process ID of top. When mmap2 returns, we set a thread-local variable, 'ptr', which stores the address at the base of the mapped region; for every function entry and return in the process we call printf() if self->ptr is set; finally, we exit DTrace when munmap is called with that same address. Here are the results:

bash-3.00# ./map.d -p `pgrep top`
           <- mmap2 syscall
           <- LM2`libc.so.1`syscall
          <- LM2`lx_brand.so.1`lx_emulate
         <- LM2`lx_brand.so.1`lx_handler
        <- libc.so.6`mmap
       <- libc.so.6`malloc
       -> libc.so.6`memset
       <- libc.so.6`memset
       -> libc.so.6`cfree
       <- libc.so.6`cfree
       -> libc.so.6`munmap
       <- LM2`lx_brand.so.1`lx_handler_table
       -> LM2`lx_brand.so.1`lx_handler
        -> LM2`lx_brand.so.1`lx_emulate
         -> LM2`libc.so.1`syscall
          -> munmap syscall

I traced the probemod (shared object name) in addition to probefunc (function name) so that we'd be able to differentiate proper Linux functions (in this case all in libc.so.6) from functions in the emulation library (LM2`lx_brand.so.1). What we can glean from this is that the mmap call is a result of a call to malloc() and the unmap is due to a call to free(). What's unfortunate is that we're not seeing any function calls in top itself. For some reason, the top developer chose to strip this binary (presumably to save the precious 2k the symbol table would have used on disk), so we're stuck with no symbolic information for top's functions and no ability to trace the individual function calls[2], but we can still reason about this a bit more.

A little analysis in mdb revealed that cfree (an alias for free) makes a tail-call to a function that calls munmap. It seems strange to me that freeing memory would immediately result in an unmap operation (since it would cause exactly the high overhead we're seeing here). To explore this, I wrote another script which looks at what proportion of calls to malloc result in a call to mmap():

malloc.d

#!/usr/sbin/dtrace -s

pid$target::malloc:entry
{
        self->follow = arg0;
}

lx-syscall::mmap2:entry
/self->follow/
{
        @["mapped"] = count();
        self->follow = 0;
}

pid$target::malloc:return
/self->follow/
{
        @["no map"] = count();
        self->follow = 0;
}

Here are the results:

bash-3.00# ./malloc.d -p `pgrep top`
dtrace: script './malloc.d' matched 11 probes
^C

  mapped                                                          275
  no map                                                         3024

So a bunch of allocations result in an mmap, but not a huge number. Next, I decided to explore whether there might be a correlation between the size of the allocation and whether or not it resulted in a call to mmap, using the following script:

malloc2.d

#!/usr/sbin/dtrace -s

pid$target::malloc:entry
{
        self->size = arg0;
}

lx-syscall::mmap2:entry
/self->size/
{
        @["mapped"] = quantize(self->size);
        self->size = 0;
}

pid$target::malloc:return
/self->size/
{
        @["no map"] = quantize(self->size);
        self->size = 0;
}

Rather than just counting the frequency, I used the quantize aggregating action to build a power-of-two histogram of the number of bytes being allocated (self->size). The output was quite illustrative:

bash-3.00# ./malloc2.d -p `pgrep top`
dtrace: script './malloc2.d' matched 11 probes
^C

  no map
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@@@@@@                                  426
               8 |@@@@@@@@@@@@@@@                          852
              16 |@@@@@@@@@@@                              639
              32 |@@@@                                     213
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |@@@@                                     213
            1024 |                                         0

  mapped
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 213
          524288 |                                         0

All the allocations that required an mmap were huge -- between 256k and 512k. Now it makes sense why the Linux libc allocator would treat these allocations a little differently than reasonably sized allocations. And this is clearly a smoking gun for top performance: it would do much better to preallocate a huge buffer and grow it as needed (assuming it actually needs it at all) than to malloc it each time. Tracking down the offending line of code would just require a non-stripped binary and a little DTrace invocation like this:

# dtrace -n pid`pgrep top`::malloc:entry'/arg0 >= 262144/{@[ustack()] = count()}'

From symptoms to root cause on a Linux application in a few DTrace scripts -- and it took me approximately 1000 times longer to cobble together some vaguely coherent prose describing the scripts than it did for me to actually perform the investigation. BrandZ opens up some pretty interesting new vistas for DTrace. I look forward to seeing Linux applications being brought in for tune-ups on BrandZ and then reaping those benefits either back on their mother Linux or sticking around to enjoy the fault management, ZFS, scalability, and, of course, continued access to DTrace in BrandZ.


[1] Of course, results may vary since the guts of the Linux kernel differ significantly from those of the Solaris kernel, but they're often fairly similar. I/O or scheduling problems will be slightly different, but often not so different that the conclusions lack applicability.
[2] Actually, we can still trace function calls -- in fact, we can trace any instruction -- but it takes something of a heroic effort. We could disassemble parts of top to identify call sites and then use the esoteric pid123::-:address probe format to trace the stripped function. I said we could do it; I never said it would be pretty.
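To give a flavor of what that would look like, an enabling of one of those address-based probes might look something like this (the pid and the address here are, of course, made up):

# dtrace -n 'pid123::-:8052e4c { @[ustack()] = count(); }'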


Thursday Oct 13, 2005

Too much pid provider

Perhaps it's a bit Machiavellian, but I just love code that in some way tricks another piece of code. For example, in college I wrote some code that trolled through the address space of my favorite game to afford me certain advantages. Most recently, I've been working on some code that tricks other code into believing a complete fiction[1] about what operating system it's executing on. While working on that, I discovered an interesting problem with the pid provider -- code that's all about deception and sleight of hand. Before you read further, be warned: I've already written two completely numbing accounts of the details of the pid provider here and here, and this is going to follow much the same pattern. If you skip this one for fear of being bored to death[2], I won't be offended.

The problem arose because the traced process tried to execute an x86 instruction like this:

    call    *0x10(%gs)

This instruction is supposed to perform a call to the address loaded from 0x10 bytes beyond the base of the segment described by the %gs selector. The neat thing about the pid provider (in case you've skipped those other posts) is that most instructions are executed natively, but some -- and call is one of them -- have to be emulated in the kernel. This instruction's somewhat unusual behavior needed to be emulated precisely; the pid provider, however, didn't know from selector prefixes and blithely tried to load from the absolute virtual address 0x10. Whoops.

To correct this, I needed to add some additional logic to parse the instruction and then augment the emulation code to know how to deal with these selectors. The first part was trivial, but the second half involved some digging into the x86 architecture manual. There are two kinds of descriptor tables, the LDT (local) and GDT (global). The value of %gs, in this case, tells us which table to look in, the index into that table, and the permissions associated with that selector.

Below is the code I added to usr/src/uts/intel/dtrace/fasttrap_isa.c to handle this case. You can find the context here.

1145                         if (tp->ftt_code == 1) {
1146 
1147                                 /*
1148                                  * If there's a segment prefix for this
1149                                  * instruction, first grab the appropriate
1150                                  * segment selector, then pull the base value
1151                                  * out of the appropriate descriptor table
1152                                  * and add it to the computed address.
1153                                  */
1154                                 if (tp->ftt_segment != FASTTRAP_SEG_NONE) {
1155                                         uint16_t sel, ndx;
1156                                         user_desc_t *desc;
1157 
1158                                         switch (tp->ftt_segment) {
1159                                         case FASTTRAP_SEG_CS:
1160                                                 sel = rp->r_cs;
1161                                                 break;
1162                                         case FASTTRAP_SEG_DS:
1163                                                 sel = rp->r_ds;
1164                                                 break;
1165                                         case FASTTRAP_SEG_ES:
1166                                                 sel = rp->r_es;
1167                                                 break;
1168                                         case FASTTRAP_SEG_FS:
1169                                                 sel = rp->r_fs;
1170                                                 break;
1171                                         case FASTTRAP_SEG_GS:
1172                                                 sel = rp->r_gs;
1173                                                 break;
1174                                         case FASTTRAP_SEG_SS:
1175                                                 sel = rp->r_ss;
1176                                                 break;
1177                                         }
1178 
1179                                         /*
1180                                          * Make sure the given segment register
1181                                          * specifies a user priority selector
1182                                          * rather than a kernel selector.
1183                                          */
1184                                         if (!SELISUPL(sel)) {
1185                                                 fasttrap_sigsegv(p, curthread,
1186                                                     addr);
1187                                                 new_pc = pc;
1188                                                 break;
1189                                         }
1190 
1191                                         ndx = SELTOIDX(sel);
1192 
1193                                         if (SELISLDT(sel)) {
1194                                                 if (ndx > p->p_ldtlimit) {
1195                                                         fasttrap_sigsegv(p,
1196                                                             curthread, addr);
1197                                                         new_pc = pc;
1198                                                         break;
1199                                                 }
1200 
1201                                                 desc = p->p_ldt + ndx;
1202 
1203                                         } else {
1204                                                 if (ndx >= NGDT) {
1205                                                         fasttrap_sigsegv(p,
1206                                                             curthread, addr);
1207                                                         new_pc = pc;
1208                                                         break;
1209                                                 }
1210 
1211                                                 desc = cpu_get_gdt() + ndx;
1212                                         }
1213 
1214                                         addr += USEGD_GETBASE(desc);
1215                                 }

The thing I learned by writing this is how to find the base address for those segment selectors, something I'd been meaning to figure out for a while. We (and most other operating systems) get to the thread pointer through a segment selector, so when debugging in mdb(1) I've often wondered how to perform the mapping from the value of %gs to the thread pointer that I care about. I haven't put that code back yet, so feel free to point out any problems you see. Anyway, if you made it here, congratulations and thanks.


[1]Such is my love of the elaborate ruse that I once took months setting up a friend of mine for a very minor gag. Lucas and I were playing Scrabble and he was disappointed to hear that the putative word "fearslut" wasn't good. Later I conspired with a friend at his company to have a third party send mail to an etymology mailing list claiming that he had found the word "fearslut" in an old manuscript of an obscure Shakespeare play. Three months later Lucas triumphantly announced to me that, lo and behold, "fearslut" was a word. I think I passed out, I was laughing so hard.

[2]My parents are fond of recounting my response when they asked what I was doing in my operating systems class during college: "If I told you, you wouldn't understand, and if I explained it, you'd be bored."

Thursday Aug 18, 2005

DTrace boot camp

I just got back from an extended DTrace world tour -- Prague, Watford, London, San Francisco, Los Angeles, and San Diego. Along the way, I gave a couple of workshops on using DTrace. The slides are far from finished[1], but if you're learning DTrace, they might be a good place to start. The workshop is designed to take you through DTrace gradually and includes many examples and exercises.

If you have suggestions for how to improve the talk or content you'd like to see in there, please let me know. If you'd like some hints or solutions for the exercises, feel free to post a comment, mail me privately, or (probably the best) post a question to the DTrace discussion forum (or subscribe to it).


[1] I first wrote these slides in a four-hour marathon immediately after a Sun volleyball tournament in Prague. After getting home around 10pm, I sat down, turned on the Red Sox game on mlb.tv, and cranked out slides until 2am. There are some places that definitely read like they were written a little too late at night...

Wednesday Aug 10, 2005

At the Unix Users' Group in LA

Next week I'll be continuing the DTrace world tour in LA and San Diego (see Matt's announcement). On Monday, August 15 from 7pm-9pm I'll be talking to the Unix Users' Group here:
Sun Office in El Segundo
222 N. Sepulveda Blvd., 10th Floor
El Segundo, CA 90245

My plan is to demonstrate DTrace (including the recent advancements in Java and php) and talk a bit about OpenSolaris. If there's anything in particular you'd like to hear about, feel free to leave a comment or just bring your question Monday night. Hope to see you there.

