Tom Erickson's Weblog

Using libdtrace

Using libdtrace

On this occasion of the full release of OpenSolaris I thought I would share some of my experiences using libdtracelibdtrace is the library to programmatically control DTrace, and I have gained some familiarity with it in the course of developing a Java DTrace API (available soon!).  While libdtrace is still private (and therefore still subject to change), I thought it would be worthwhile to at least describe what's there and some of the issues I encountered.

I first approached libdtrace with the obvious questions: How do I use it?  What functionality is available?  How do I get data out of it?  I started with existing programs that use libdtrace.  From the source for the dtrace command (dtrace.c) I learned the natural life cycle of a DTrace consumer:
open() returns a handle needed to access libdtrace functionality  (a dtrace_hdl_t pointer, the first argument to all consumer functions).  Two compile() functions both take a D program, one as a string and the other as a file.  A single consumer may compile multiple programs, as happens when you specify multiple -n or -s options to the dtrace command.  I was at first confused by exec() and go(), which sound like the same thing, until I learned that exec() is what enables probes, instrumenting code at matching probe points.  It is called once for each compiled D program and as an extra favor gives back information about the resulting program enablings such as the number of matching probes.  go() is a non-blocking call issued once to begin tracing data from all enabled probes, writing the output of probe actions into per-consumer DTrace buffers whenever a probe fires.

stop() on second thought isn't exactly a natural stage in the life of a DTrace consumer.  A consumer dies of natural causes when it encounters the exit() action in a D program, and there is no need to call stop()stop() is a programmatic way of interrupting a consumer outside of any D program action.  Whether or not the consumer dies of natural causes, it leaves any unconsumed data available to functions like dtrace_aggregate_snap() and dtrace_aggregate_walk().  Finally, close() releases all system resources held by the consumer.

After dtrace.c, I looked at the source for the intrstat command (intrstat.c), another example of a DTrace consumer. The biggest difference between it and the dtrace command is the absence of calls to dtrace_sleep() and dtrace_work().  Modeling the Java DTrace API after intrstat was a bit of a false start, based partly on the assumption that API users were in most cases interested only in  aggregations.  It took a while before I understood the difference between these two approaches to consuming DTrace data.  In particular, I remember going back to these examples one day and wondering why dtrace.c did not call dtrace_aggregate_print() until after exiting its consumer loop:

do {
        if (!g_intr && !done)
       dtrace_sleep(g_dtp);

        if (g_newline) {
                /\*
                 \* Output a newline just to make the output look
                 \* slightly cleaner.  Note that we do this even in
                 \* "quiet" mode...
                 \*/
                oprintf("\\n");
                g_newline = 0;
        }

        if (done || g_intr || (g_psc != 0 && g_pslive == 0)) {
                done = 1;
                if (dtrace_stop(g_dtp) == -1)
                        dfatal("couldn't stop tracing");
        }

        switch (dtrace_work(g_dtp, g_ofp, chew, chewrec, NULL)) {
        case DTRACE_WORKSTATUS_DONE:
                done = 1;
                break;
        case DTRACE_WORKSTATUS_OKAY:
                break;
        default:
                if (!g_impatient && dtrace_errno(g_dtp) != EINTR)
                        dfatal("processing aborted");
        }

        if (fflush(g_ofp) == EOF)
                clearerr(g_ofp);
} while (!done);
oprintf("\\n");

if (!g_impatient) {
        if (dtrace_aggregate_print(g_dtp, g_ofp, NULL) == -1 &&
            dtrace_errno(g_dtp) != EINTR)
                dfatal("failed to print aggregations");
}

dtrace_close(g_dtp);

whereas intrstat.c was calling dtrace_aggregate_snap(), dtrace_aggregate_walk(),and dtrace_aggregate_clear() repeatedly inside of its consumer loop:

while (indefinite || iter) {

        (void) sigsuspend(&set);

        (void) dtrace_status(g_dtp);

        if (g_intr == 0)
                continue;

        iter--;
        g_intr--;
        check_pset();

        now = gethrtime();
        g_interval = now - last;
        last = now;

        if (dtrace_aggregate_snap(g_dtp) != 0)
                fatal("failed to add to aggregate");

        g_start = g_end = 0;

        do {
                g_header = 1;

                if (dtrace_aggregate_walk_keyvarsorted(g_dtp,
                    walk, NULL) != 0)
                        fatal("failed to sort aggregate");

                if (g_start == g_end)
                        break;
        } while ((g_start = g_end) < g_max_cpus);

       dtrace_aggregate_clear(g_dtp);
}

The answer is that the various aggregate functions act on all aggregations at once - all aggregations from all enabled D programs - regardless of the actions in the probe clauses of those D programs (a programmatic alternative to relying on D actions).  That approach is not appropriate for the dtrace command, which generates output only from D program actions until interrupted with Ctrl-C, when it prints all unconsumed aggregation data before exiting.  Consider the following D program:

syscall:::entry
{
        @calls[execname] = count();
        @fcalls[probefunc] = count();
}

profile:::tick-1sec
{
        printa(@calls);
        clear(@calls);
}

Without the tick probe, this program produces no output when run with dtrace -s (except the number of matching probes) until Ctrl-C, when dtrace_aggregate_print() prints both the @calls and the @fcalls aggregations together.  With the tick probe, however, dtrace prints the @calls aggregation once every second, but waits to print the @fcalls aggregation until Ctrl-C.  In fact, it skips all but unprinted aggregations (i.e. aggregations that have not already been printed at least once as a result of the printa() action), in this case printing only the @fcalls aggregation after Ctrl-C.

The D program in intrstat.c, on the other hand, has no need of a tick probe, because intrstat relies completely on aggregate functions to process all aggregations at once:

static const char \*g_prog =
"interrupt-start"
"{"
"       self->ts = vtimestamp;"
"}"
""
"interrupt-complete"
"/self->ts/"
"{"
"       this->devi = (struct dev_info \*)arg0;"
"       @counts[stringof(`devnamesp[this->devi->devi_major].dn_name),"
"            this->devi->devi_instance] = count();"
"       @times[stringof(`devnamesp[this->devi->devi_major].dn_name),"
"            this->devi->devi_instance] = sum(vtimestamp - self->ts);"
"       self->ts = 0;"
"}"

This is convenient for a command that collects aggregated statistics on its own programmatic interval.  It works because intrstat does not need to support DTrace options like aggrate and switchrate (so no need for dtrace_sleep() ), and because it handles all output itself and does not need any output-producing actions in its D program.

The Java DTrace API, however, needs to support all kinds of D programs, those that use output-producing actions and those that do not.  Like dtrace (and unlike intrstat) it needs to call dtrace_work(), because that is the only way to consume the output of D actions like trace(), printf(), and printa().  It also needs to support options like aggrate and switchrate, so it uses dtrace_sleep() to regulate its consumer loop, which wakes up as often as needed to honor all D options that control consumption rates.   However, this introduces another problem, because  dtrace_work()  sends its output to a file (stdout by default), whereas the Java DTrace API must not print to a file but rather generate data instances for the user of the API and leave that user in control of how (or whether) to output that data.  For example, an application using the API might not print any text at all, but instead display the information in a graph.  Like intrstat, the Java API needs to take control of what to do with the data it consumes.

libdtrace provides for this need with dtrace_handle_buffered(), which lets a user register a function to be called whenever there is buffered output that would otherwise be written to a file.  This is similar to the callback mechanism libdtrace employs to handle data drops, runtime errors, and target process completion (of interest when running dtrace -c).  For example, in dtrace.c:

if (g_mode != DMODE_LIST) {
        if (dtrace_handle_err(g_dtp, &errhandler, NULL) == -1)
                dfatal("failed to establish error handler");

        if (dtrace_handle_drop(g_dtp, &drophandler, NULL) == -1)
                dfatal("failed to establish drop handler");

        if (dtrace_handle_proc(g_dtp, &prochandler, NULL) == -1)
                dfatal("failed to establish proc handler");
}

These calls register three user-defined functions (implemented in dtrace.c) to call under the appropriate circumstances: errhandler(), drophandler(), and prochandler().  When the Java DTrace API registers a similar callback for buffered output with dtrace_handle_buffered(), it has the added effect of turning off output to the usual output file.  In all of these callback registration examples, the resulting calls to the registered handler function occur in the same thread that calls dtrace_work().  Here I found something that I consider especially considerate on the part of the libdtrace developers.  The NULL argument to each of the three handler registrations (above) is passed to a parameter (unused in this case) that could instead take a pointer to a user argument that is forwarded to every handler invocation.  Anything special needed by your handler function can be provided via the user arg set when the handler is registered.  In the case of the Java API, this gives the handler access to the JNI environment, including cached classes and method IDs, needed to pass data back to interested Java listeners.

Once the Java API registers a buffered output handler, it can use dtrace_work() to get probe data printed to an in-memory buffer instead of to a file.  Now that the Java API can support output-producing D actions like trace(), printf(), and printa(), we still want it to support the convenience enjoyed by intrstat, that is, the ability to request all aggregations at once on a programmatic interval outside of any D actions.  Supporting both approaches simultaneously requires a per-consumer lock to prevent the needed dtrace_aggregate_print() from interfering with dtrace_work(). (These two approaches to data consumption are inherently asynchronous.)

While on the subject of threads, it is also possible to run multiple consumers simultaneously, as long as you protect calls to libdtrace with a single global lock.  Fortunately exempt from this requirement are the most time-consuming calls, dtrace_work() and its competing dtrace_aggregate_print(), which only need to be kept from interfering with each other.  Note that the API uses dtrace_aggregate_print() instead of dtrace_aggregate_walk() in order to leverage the buffered output handler (which otherwise does not get called during aggregate traversal).

Using the libdtrace Buffered Output Handler

The purpose of the buffered output handler (bufhandler) is to make the output of the libdtrace print routines available without writing it to a file (such as stdout).  This is needed for the stack(), ustack(), and jstack() actions to get human-readable stack values, since there is no public function in libdtrace to convert the internal representation of stack values.  It is also used to get the formatted string resulting from the D printf() action.  The mechanism anticipates other similar formatting requirements for which a public libdtrace conversion function is not suitable.

dtrace_handle_buffered() takes a handler function of type dtrace_handle_buffered_f and a user argument forwarded to each callback.  In addition to the user argument, the bufhandler takes a pointer to a dtrace_bufdata_t structure:

typedef struct dtrace_bufdata {
        dtrace_hdl_t \*dtbda_handle;             /\* handle to DTrace library \*/
        const char \*dtbda_buffered;             /\* buffered output \*/
        dtrace_probedata_t \*dtbda_probe;        /\* probe data \*/
        dtrace_recdesc_t \*dtbda_recdesc;        /\* record description \*/
        dtrace_aggdata_t \*dtbda_aggdata;        /\* aggregation data, if agg. \*/
} dtrace_bufdata_t;

In addition to the formatted output (dtbda_buffered) of the libdtrace print routines normally sent to a file, the bufdata structure contains any probedata (including the current probedata record) and any aggregation data processed while generating that output.

The bufhandler is called once for each output-producing, non-aggregating D action, such as trace() or printf(), and once for each libdtrace aggregation record (whether in response to the D printa() action or an asynchronous aggregate request).   There is one libdtrace aggregation record per tuple element plus one for the corresponding value.  The bufhandler is called with a null record at the end of each complete tuple/value pair.

Using the buffered output handler leads to a few interesting observations:
  • The buffered output handler (bufhandler) is never called with action type DTRACEACT_PRINTA.
  • The bufhandler never sees the D exit() action
  • The base probedata pointer (bufdata->dtbda_probe->dtpda_data) is overwritten during probedata consumption and needs to be cached ahead of the bufhandler callback (in the probe-handling function, see below)
The points listed above highlight the need to coordinate use of the bufhandler with two other callback functions passed to dtrace_work():
Both of these functions also take a pointer to a user arg optionally passed to dtrace_work() and forwarded to each callback.

The probe-consuming function, dtrace_consume_probe_f, is called once each time a probe fires and ahead of any calls to the record consuming function or the bufhandler (for that probe's data).  It takes a probedata argument that includes an array of records corresponding to actions and statements in that probe.  In the case of the printf() action it includes a record for each unformatted printf element.  A statement that is not a D action, such as assignment to a variable, can be distinguished from an action by the fact that it has no size (rec->dtrd_size == 0).

The record-consuming function, dtrace_consume_rec_f, is called once for each unskipped probedata record and ahead of any call to the bufhandler (resulting from output associated with that record).  It differs from the bufhandler as follows:
  1. It does not have access to libdtrace-formatted output.
  2. It is called for each D program statement, including non-actions (size in probedata buffer is zero); the bufhandler is not bothered with non-actions, since they produce no output.
  3. It is called for the D exit() action; the bufhandler is not.
  4. In response to the printa() action, it is called with a record having an action of type DTRACEACT_PRINTA. The bufhandler never sees that action type, only the output-producing aggregation records (with action types such as DTRACEACT_STACK and DTRACEAGG_COUNT).
  5. It is called with a NULL record at the end of each probedata.
I said the record handler is called for each "unskipped" record because it is called only once for the printf() action, even though there may be multiple probedata records corresponding to the unformatted elements of that printf().   The knowledge of how many probedata records to consume to process a single printf() is private to libdtrace, so we cannot know ahead of time how many probedata records belong to the current printf().  Instead, we need to look back at how many records since the previous call were skipped to reach the current probedata buffer offset, for example:

for (r = prev_probedata_index;
        ((r < edesc->dtepd_nrecs) &&
        (edesc->dtepd_rec[r].dtrd_offset < rec->dtrd_offset));
        ++r);

Here edesc refers to the enabled probe description (metadata that includes an array of all the probedata records).  Records from prev_probedata_index up to and not including the current record index, r, are associated with the previous printf() action.  For proof of these observations, the following dtrace command gives a glimpse into the internal workings just described:

$ dtrace -n 'BEGIN { n = 1; printf("%s %d\\n", "cat", 9); trace(1/--n); exit(0); }'
dtrace: description 'BEGIN ' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): divide-by-zero in action #4 at DIF offset 20
$

The divide-by-zero error message says it failed in action #4, even though there are only two actions in the program specified with the -n option: printf() and trace().  Internally, however, the assignment n = 1 is the first probedata record with action type DTRACEACT_DIFEXPR and size zero.  The printf() accounts for the next two probedata records, one for each unformatted element: "cat" and 9.  The fourth record (with action type DTRACEACT_DIFEXPR) is the trace() action with the divide-by-zero error (action type DTRACEACT_DIFEXPR indicates the trace() action unless size is zero).  From an implementation point of view, trace() is in this case "action #4".

By the time the action type DTRACEACT_PRINTF reaches the bufhandler, the probe- and record-consuming functions have already had their chance to see all information about that printf() except for the final formatted string.  That is now available as bufdata->dtbda_buffered.

The bufhandler should normally return DTRACE_HANDLE_OK, but in the case of an unexpected program error, returning DTRACE_HANDLE_ABORT stops the consumer prematurely.  Similarly, the record- and probe-consuming functions normally return DTRACE_CONSUME_THIS, but may return DTRACE_CONSUME_ABORT to stop the consumer.  If the current record is the exit() action  or  is  NULL (to delimit probedata), return DTRACE_CONSUME_NEXT.

Unlike the probe- and record-consuming functions, the bufhandler is called not only from dtrace_work() but also from dtrace_aggregate_print() (in an intrstat-style aggregate request).  This means that the user arg passed to dtrace_handle_buffered() is forwarded to the bufhandler in both contexts, even though it may contain state that is invalid outside of the thread that called dtrace_handle_buffered() (the same thread that calls dtrace_work() ).  To avoid the resulting (probably surprising) errors in callbacks from dtrace_aggregate_print(), you can use pthread_key_create() to create a user arg capable of obtaining user data specific to each context.  You just need to associate user data with the created key in each thread using pthread_setspecific(), then in the bufhandler use pthread_getspecific() to obtain the user data specific to the current thread (dtrace_work() or dtrace_aggregate_print() ).  In general, take care that state recorded in the bufhandler during dtrace_work() does not interfere with state recorded in the bufhandler during dtrace_aggregate_print() and vice-versa.


Technorati Tag:
Technorati Tag:
Technorati Tag:

Comments:

Hi there. Sorry if this question is in the wrong place but I don't know a better place to ask. Linux kernel people are passionate about writing drivers for almost every piece of hardware available. Now I have heard that with progressive Solaris releases older machines like older SPARC servers/workstations get dropped off the compatibility list. I am wondering if OpenSolaris is gonna travel the Linux road and be eventually runnable on \*ancient\* Sun (even non-Sun) hardware or are they gonna continue the Solaris non-backward compatible trend. Thank you very much for your input. I'm just curious, and I really like DTrace and ZFS and Zones but if it's not gonna work on older machines then that's a potential issue for future development. Thanks :)

Posted by Simon on January 04, 2007 at 06:56 PM PST #

I'm nos sure if i'm writting in the right place but it's the only place i have found where my questions could be answered.

I'm trying to learn how to use libdtrace. I have found a source code in this web page http://www.javapassion.com/handsonlabs/solaris_dtrace/exercise14.html
and they mentioned your work.

The problem is that the first example that they use it doesn't work in my MacOsx 10.6. I don't know why.

Posted by Josepg Forestein on February 01, 2010 at 11:49 PM PST #

It looks like the hyperlinks into source that use mnemonic tags (instead of line numbers) don't work. Looks like a big effort to clean them all up, but it would be nice if they took you to the right place...

Posted by Michael Ernest on April 08, 2010 at 09:40 AM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

tomee

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Blogs
Other

No bookmarks in folder