DTrace Inlines, Translators, and File Descriptors

I've recently extended DTrace's inline facility, so it seems like a good time to go back and review some of the more advanced features of DTrace's D language, and how they are used to make observing the system easier for DTrace users. This entry is a bit long, but if you hang in there you'll be rewarded with a peek at a new DTrace feature that is headed for Solaris Express.


Early in DTrace's development, once Bryan and I had assembled the nascent DTrace prototype far enough to be able to locate probes, trace data, and execute simple D expressions, it was obvious that even this early stage of DTrace was an incredibly powerful kernel debugging tool. For fun and posterity, here is one of the earliest known actual D programs at work once the compiler, tracing framework, and access to kernel types had been connected together:

From: Bryan Cantrill <bmc@eng.sun.com>
Subject: Leaving now...
To: mws@eng.sun.com (Michael Shapiro)
Date: Tue, 12 Feb 2002 17:47:36 -0800 (PST)

But this is pretty hot:

# dtrace -f 'bcopy/(arg2 > 1000) &&
    (curthread->t_procp->p_cred->cr_uid == 31992)/{stack(20)}'
dtrace: 2 probes enabled.
CPU     ID                    FUNCTION:NAME
  1   8576                      bcopy:entry 

And while this was indeed hot (and still is), you can immediately see how the relationship between the question ("What are the stack traces of all bcopy() calls performed on behalf of user Bryan of length greater than 1000 bytes?") and its realization in D requires knowledge of the Solaris kernel implementation (i.e. that a kthread_t has a proc_t pointer, which has a cred_t pointer, which contains the UID of the user associated with that process). So while great for us kernel programmers, this immediately presented two challenges for us to grapple with in making DTrace more accessible:
  • How can we allow administrators and developers to express concepts that they readily understand, like the idea that a process has a particular UID associated with it, without requiring them to understand how those concepts are implemented?

  • How can we allow DTrace users to write programs that continue to work as the implementation of these concepts changes over time inside of Solaris?

The second question is of particular importance because one of the challenges in writing observability tools and debuggers is that by exposing everything inside of a software system, you increase the risk of users and programs coming to depend upon that knowledge, which then begins to constrain the implementors. One of my most hated examples of this phenomenon is the fact that in the original UNIX ABI the stdio FILE structure was actually exported to programs along with a set of macros that referenced its members, and the file descriptor was represented as an unsigned char (0-255) instead of an int. The result: 32-bit program binaries that used the fileno(3C) macro could be created that would break or cause silent data corruption if we then fixed the design to support file descriptor values above 255. This issue is still causing problems more than a decade later, despite changing fileno() to a function and fixing the issue entirely in the 64-bit Solaris ABI. But I digress.

To address these two issues in DTrace, we created the notion of a translator. A translator is a collection of D assignment statements provided by the supplier of an interface that can be used to translate an input expression into an object of struct type. Like any D statements, the body of the translator can refer directly to kernel types and kernel global data structures, as well as other DTrace variables. If you're familiar with object-oriented programming, you can imagine a translator sort of like a class that implements a bunch of "get" methods (of course, we don't have functions in D since we can't allow recursion). Translator definitions correspond to the implementation of some piece of software, like a part of the kernel, but they yield a struct that is in effect a stable interface to that software.

For example, DTrace provides a translator from a kernel thread pointer such as the built-in curthread variable to the /proc lwpsinfo structure. This structure is well-defined and documented in proc(4), and is what you get if you read the file /proc/pid/lwp/lwpid/lwpsinfo on your Solaris system. Here is an excerpt of the translator definition, which is delivered to you in the file /usr/lib/dtrace/procfs.d:

translator lwpsinfo_t < kthread_t *T > {
        pr_syscall = T->t_sysnum;
        pr_pri = T->t_pri;
        pr_clname = `sclass[T->t_cid].cl_name;

As you can see, each statement in the translator body is in effect an expression that can be inlined by the D compiler to produce the value of that member when it is referenced. For example, the pr_clname field represents the idea that every Solaris LWP has an associated scheduling class with a well-defined name (e.g. TS=timeshare, RT=real-time) that you can pass to commands like priocntl(1) using the -c option. To retrieve the string, you take the class ID, an integer index into the kernel's sclass array, and then grab the name from the contents of that array. The translator isolates DTrace programs that you write from that implementation detail, so that if we were to, say, delete sclass and replace it with a hash table, you could still reliably use pr_clname in DTrace on various versions of Solaris and get the same result.


To use a translator in D, you apply the xlate operator to an input expression and specify an output type of either the desired structure or a pointer to it, as shown in the following example:

        printf("%s tid %d waiting for i/o, class=%s\n",
            execname, tid, xlate<lwpsinfo_t>(curthread).pr_clname);

Here we translate curthread to retrieve pr_clname and record the scheduling class of every thread that blocks on an i/o. The results of running this for a few seconds on my desktop look like this:

dtrace: script '/dev/stdin' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1   2053               biowait:wait-start sched tid 0 waiting for i/o, class=SYS
  1   2053               biowait:wait-start cat tid 1 waiting for i/o, class=TS
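
Translators are not limited to the ones Solaris delivers. As a minimal sketch, assuming the declaration syntax shown in the procfs.d excerpt above also works in a script of your own, the following defines a hypothetical struct thrinfo and translates curthread into it. The struct and its ti_ members are invented for illustration; only t_pri and t_sysnum come from the excerpt, and the io:::wait-start probe description is inferred from the output above:

```
/* Hypothetical stable view of a kernel thread; names are illustrative only. */
struct thrinfo {
        int ti_pri;             /* priority, taken from t_pri */
        int ti_sysnum;          /* system call number, taken from t_sysnum */
};

translator struct thrinfo < kthread_t *T > {
        ti_pri = T->t_pri;
        ti_sysnum = T->t_sysnum;
};

io:::wait-start
{
        printf("%s tid %d pri %d\n", execname, tid,
            xlate <struct thrinfo> (curthread).ti_pri);
}
```

If the kernel's thread structure changed, only the translator body would need updating; the clause that reads ti_pri would stay the same.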


While using the xlate operator directly is fun (for me, anyway), DTrace also provides an inline facility that makes D programs that use translators easier to read and write. An inline is the declaration of a typed identifier that is replaced by the compiler with the result of an expression whenever that identifier is referenced somewhere else in the program. This is more powerful than simple lexical substitution like the sort provided by C's #define, as we'll see in a moment. Here are some example inline declarations:

inline int c = 123;
inline uid_t uid = curthread->t_procp->p_cred->cr_uid;
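
As a rough sketch of how an inline shortens a predicate, the early bcopy() experiment from the email above could be written along these lines in today's syntax. The name cur_uid is hypothetical (chosen to avoid clashing with any existing identifier), and the fbt probe description is an assumption about how bcopy is instrumented on your system:

```
/* cur_uid is an illustrative name, not something DTrace provides. */
inline uid_t cur_uid = curthread->t_procp->p_cred->cr_uid;

fbt::bcopy:entry
/arg2 > 1000 && cur_uid == 31992/
{
        stack(20);
}
```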

Once declared, inlines can be used anywhere as if they were variables provided for you by DTrace. We can also use inlines to substitute translator expressions, which allows us to connect together all of the ideas discussed so far. For example, DTrace provides a built-in curlwpsinfo variable to let you access all of the process model information for the current LWP. This variable is not a variable at all, but instead the following inline provided for you by /usr/lib/dtrace/procfs.d:

inline lwpsinfo_t *curlwpsinfo = xlate <lwpsinfo_t *> (curthread);

So using the inlines and translators provided for you by DTrace, you can rewrite the previous example like this, using only the stable interfaces defined in proc(4):

        printf("%s tid %d waiting for i/o, class=%s\n",
            execname, tid, curlwpsinfo->pr_clname);

Together, inlines and translators let us provide stable representations of Solaris kernel interfaces in a form that resembles a Solaris administrative or user-programming concept that is already well-understood, while allowing us to continue to evolve the Solaris implementation underneath.
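
Put together as a complete script, this might look like the following sketch. The io:::wait-start probe description is inferred from the biowait:wait-start probe in the output shown earlier, so treat it as an assumption rather than the exact script used above:

```
#!/usr/sbin/dtrace -s

io:::wait-start
{
        /* curlwpsinfo is the inline over xlate <lwpsinfo_t *> (curthread). */
        printf("%s tid %d waiting for i/o, class=%s\n",
            execname, tid, curlwpsinfo->pr_clname);
}
```

Nothing in this version names a kernel structure member directly; it depends only on the lwpsinfo members described in proc(4).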

Observing File Descriptors

I recently added an extension to the inline facility in DTrace to permit inlines to define identifiers that act like D associative arrays, rather than like the scalar variables in the examples in the previous section. Everything from this point forward will be available in Build 16 of Nevada (aka the next Solaris release), which you will be able to download at some point in the future. We'll likely backport this feature to a Solaris 10 Update later this year as well. To create an inline that acts like an associative array using the new DTrace feature, you can use a declaration like this:

inline int a[int x, int y] = x + y;
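
A minimal sketch that exercises this declaration from a BEGIN clause follows. The clause and the printed text are illustrative, and it assumes a build that includes the new array-style inlines:

```
inline int a[int x, int y] = x + y;

BEGIN
{
        /* The compiler substitutes the expression 1 + 2 for a[1, 2]. */
        printf("a[1, 2] = %d\n", a[1, 2]);
        exit(0);
}
```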

Given this definition, a reference to the expression a[1, 2] would be as if you typed 3 in your program. Using this new facility, I've added an fds[] array to DTrace that returns information about the file descriptors associated with the process corresponding to the current thread. The array's base type is the fileinfo_t structure already used by DTrace's I/O provider, with a new member for the open(2) flags. Here's an example of fds[] in action:

$ dtrace -q -s /dev/stdin
syscall::write:entry
/ execname == "ksh" && fds[arg0].fi_oflags & O_APPEND /
{
        printf("ksh %d appending to %s\n", pid, fds[arg0].fi_pathname);
}

If I run this command on my desktop and start typing commands in another shell, I see output like this:

ksh 127453 appending to /home/mws/.sh_history
ksh 127453 appending to /home/mws/.sh_history

That is, given a file descriptor specified as an argument to write(2), I can match writes by ksh where the file descriptor was opened O_APPEND and then print the pathname of the file to which the data is being appended.
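
Because fds[] yields an ordinary fileinfo_t, its members can also serve as aggregation keys. As a hedged sketch, assuming arg2 is the byte count passed to write(2) and with @bytes as a purely illustrative aggregation name, this sums the bytes ksh asks to write, per file:

```
syscall::write:entry
/execname == "ksh"/
{
        /* Key on the pathname behind the descriptor passed in arg0. */
        @bytes[fds[arg0].fi_pathname] = sum(arg2);
}
```

The totals are printed when the script exits, for example when you press Ctrl-C.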

All of the implementation for fds[] is provided by a translator and an inline (i.e. zero new kernel support required). The translator converts a kernel file structure to a DTrace fileinfo_t, and then the inline declaration to define fds[] looks like this:

inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
    fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

I'll discuss how inlines can affect how we programmatically compute the stability of your DTrace programs in a future blog.

