Tuesday Sep 13, 2005

Man, myth, legend

On a Sunday night shortly after we bought Kealia, I got a call at home from John Fowler. He asked me if I'd like to join Glenn Weinberg and him the next morning to meet with Andy Bechtolsheim at Kealia's offices in Palo Alto. It's hard to express my excitement at this proposition -- it was like asking a kid if he wants to go throw a ball around with Joe DiMaggio. And indeed, my response was something akin to the "Gee mister -- that'd be swell!" that this image evokes...

When we walked into Kealia's offices the next morning, there, in the foyer, was Andy! Andy blinked for a moment, and then -- without any introductions -- began to excitedly describe some of his machines. Still talking, he marched to his office, whereupon he went to the whiteboard and started furiously drawing block diagrams. Here, at long last, was the Real Deal: a fabled engineer who didn't disappoint -- a giant who dwarfed the substantial legend that preceded him. After several minutes at the whiteboard, Andy got so excited that he had to actually get the plans to show us how some particular piece had been engineered. And with that, he flew out of the room.

As we caught our breath, Glenn looked at me and said "just so you know, this is what it's like trying to talk to you." While I was still trying to figure out if this was a compliment or an insult (which I still haven't figured out, by the way), Andy flew back in, unfurled some plans for a machine and excitedly pointed out some of the finer details of his design. Andy went on for a few more minutes when, like a raging prairie fire that had suddenly hit a fireline, he went quiet. With that, I kicked the door down (metaphorically) and started describing what we had been working on in Solaris 10. (After all, John hadn't brought me along to just sit back and watch.) As I went through revolutionary feature after revolutionary feature, I was astounded by how quickly Andy grasped detail -- he asked incisive questions that reflected a greater understanding of software than that of any other hardware engineer I had ever encountered. And as he seemed to be absorbing detail faster and faster, I began delivering it faster and faster. Now, as others have observed, I'm not exactly a slow talker; this might have been one of the few times in my life where I thought I actually needed to speak faster to stay in front of my audience. Whew! Most impressive of all, Andy had a keen intuition for the system -- he immediately saw how his innovative hardware and our innovative software could combine to deliver some uniquely innovative systems to our customers. He was excited about our software; we were excited about his hardware. How much better than that does it get?

Needless to say, ever since that morning -- which was nearly a year and a half ago now -- I have eagerly awaited the day that Andy's boxes would ship. If you've talked to me over the last year, you know that I've been very bullish on Sun; now you know why. (Well, now you have a taste as to why; believe it or not, Andy's best boxes are still to come.) Not everyone can own a car designed by Enzo Ferrari or a lamp crafted by Louis Comfort Tiffany -- but at just over two grand a pop, pretty much everyone can own a machine designed by the greatest single-board computer designer in history. Congratulations Andy and team on an historic launch! And might I add that it was especially fitting that it was welcomed with what is easily the funniest ad in Sun's history.

Friday Sep 09, 2005

TR35

So MIT's Technology Review has named me as one of their TR35 -- the top 35 innovators under the age of thirty-five. It's a great honor, especially because the other honorees are actually working on things like cures for cancer and rocket science -- domains that I have known only as rhetorical flourish. Should you like to hear me make a jackass out of myself on the subject, you might want to check out Richard Giles's latest I/O podcast, in which he interviewed me about the award.

Sunday Aug 21, 2005

DTrace and Ruby

It's been an exciting few weeks for DTrace. The party got started with Wez Furlong's new PHP DTrace provider at OSCON. Then Devon O'Dell announced that he was starting to work in earnest on a DTrace port to FreeBSD. And now, Rich Lowe has made available a prototype Ruby DTrace provider. To install this, grab Ruby 1.8.2, apply Rich's patch, and run ./configure with the --enable-dtrace option. When you run the resulting ruby, you'll see two probes: function-entry and function-return. The arguments to these probes are as follows:
  • arg0 is the name of the class (a pointer to a string within Ruby)

  • arg1 is the name of the method (also a pointer to a string within Ruby)

  • arg2 is the name of the file containing the call site (again, a pointer to a string within Ruby)

  • arg3 is the line number of the call site.

So if, for example, you'd like to know the classes and methods that are called in a particular Ruby script, you could do it with this simple D script:

#pragma D option quiet

ruby$target:::function-entry
{
        @[copyinstr(arg0), copyinstr(arg1)] = count();
}

END
{
        printf("%15s %30s   %s\\n", "CLASS", "METHOD", "COUNT");
        printa("%15s %30s   %@d\\n", @);
}

To run this against the cal.rb that ships in the sample directory of Ruby, call the above script whatmethods.d and run it this way:

# dtrace -s ./whatmethods.d -c "../ruby ./cal.rb"
    August 2005
 S  M Tu  W Th  F  S
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

          CLASS                         METHOD   COUNT
          Array                             <<   1
          Array                        compact   1
          Array                        reverse   1
          Array                          shift   1
          Array                           size   1
          Array                        unshift   1
          Array                      values_at   1
            Cal                          group   1
            Cal                     initialize   1
            Cal                        monthly   1
            Cal                          opt_c   1
            Cal                          opt_j   1
            Cal                          opt_m   1
            Cal                          opt_t   1
            Cal                          opt_y   1
            Cal                           pict   1
            Cal                     set_params   1
            Cal                        unlines   1
          Class                            now   1
          Class                   valid_civil?   1
          Class                    valid_date?   1
           Date                              -   1
             IO                          write   1
         Module                append_features   1
         Module                        include   1
         Module                       included   1
         Module                   undef_method   1
         Object                         detect   1
         Object                           nil?   1
         Object                          print   1
         Object     singleton_method_undefined   1
         String                             ==   1
         String                         center   1
         String                         empty?   1
         String                           scan   1
         String                          split   1
           Time                     initialize   1
           Time                           to_a   1
          Array                             ==   2
          Class                      jd_to_ajd   2
         Fixnum                             >=   2
           Hash                           each   2
           Hash                           keys   2
         Module                           attr   2
         Module               method_undefined   2
         Module                         public   2
       Rational                         coerce   2
          Array                              +   3
          Class                    civil_to_jd   3
           Hash                             []   3
         Object                        collect   3
          Array                        collect   4
          Class                      inherited   4
          Range                           each   4
         String                           size   4
         Module           private_class_method   5
         Object                           eval   5
         Object                        require   5
         String                           gsub   5
          Class                     jd_to_wday   7
          Class                           once   7
           Date                       __8713__   7
           Date                           wday   7
         Fixnum                              %   8
          Array                           join   10
           Hash                            []=   10
         String                              +   10
          Array                           each   11
       NilClass                           to_s   11
         Module                   alias_method   22
         Module                        private   22
         Symbol                           to_s   26
         Module                    module_eval   28
           Date                           mday   31
         Object                           send   31
           Date                            mon   42
           Date                      __11105__   43
          Class                    jd_to_civil   45
           Date                           succ   47
          Class                            os?   48
           Date                              +   49
         Fixnum                             <=   49
         String                          rjust   49
          Class                      ajd_to_jd   50
          Class                        clfloor   50
           Date                      __10417__   50
        Integer                           to_i   50
         Object                        Integer   50
       Rational                         divmod   50
       Rational                           to_i   50
           Date                            <=>   51
           Date                            ajd   51
           Date                             jd   51
       Rational                            <=>   51
          Class                           new0   52
           Date                     initialize   52
        Integer                           to_r   52
         Object         singleton_method_added   67
           Date                          civil   75
         Symbol                           to_i   91
          Float                              *   96
          Float                         coerce   96
         Fixnum                              /   97
         Object                        frozen?   100
       Rational                              -   104
         Fixnum                           to_s   123
          Array                             []   141
         Object                          class   150
         Module                   method_added   154
          Float                              /   186
         Module                            ===   200
       Rational                              /   204
       Rational                              +   248
          Float                          floor   282
         Fixnum                             <<   306
          Class                         reduce   356
        Integer                            gcd   356
         Object                       Rational   356
         Fixnum                              +   414
          Class                           new!   610
       Rational                     initialize   610
          Class                            new   612
         Fixnum                            abs   712
         Fixnum                            div   762
         Fixnum                              *   1046
         Fixnum                              <   1069
         Fixnum                              >   1970
         Fixnum                              -   2398
         Object                       kind_of?   2439
         Fixnum                             >>   4698
         Fixnum                             []   7689
         Fixnum                             ==   11436

This may leave us with many questions. For example, there are a couple of calls to construct new objects -- where are they coming from? To answer that question:

#pragma D option quiet

ruby$target:::function-entry
/copyinstr(arg1) == "initialize"/
{
        @[copyinstr(arg0), copyinstr(arg2), arg3] = count();
}

END
{
        printf("%-10s %-40s %-10s %s\\n", "CLASS",
            "INITIALIZED IN FILE", "AT LINE", "COUNT");
        printa("%-10s %-40s %-10d %@d\\n", @);
}
Calling the above whereinit.d, we can run it in a similar manner:
# dtrace -s ./whereinit.d -c "../ruby ./cal.rb"
    August 2005
 S  M Tu  W Th  F  S
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

CLASS      INITIALIZED IN FILE                      AT LINE    COUNT
Cal        ./cal.rb                                 144        1
Date       /usr/local/lib/ruby/1.8/date.rb          593        1
Date       /usr/local/lib/ruby/1.8/date.rb          703        1
Date       /usr/local/lib/ruby/1.8/date.rb          916        1
Time       /usr/local/lib/ruby/1.8/date.rb          702        1
Date       /usr/local/lib/ruby/1.8/date.rb          901        49
Rational   /usr/local/lib/ruby/1.8/rational.rb      374        610

Looking at the Date class, line 901 of /usr/local/lib/ruby/1.8/date.rb is of particular interest:

   897   # If +n+ is not a Numeric, a TypeError will be thrown.  In
   898   # particular, two Dates cannot be added to each other.
   899   def + (n)
   900     case n
   901     when Numeric; return self.class.new0(@ajd + n, @of, @sg)
   902     end
   903     raise TypeError, 'expected numeric'
   904   end

This makes sense: we're initializing new Date instances in the + method for Date. And where are those coming from? It's not hard to build a script that will tell us the file and line for the call site of an arbitrary class and method:

#pragma D option quiet

ruby$target:::function-entry
/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/
{
        @[copyinstr(arg2), arg3] = count();
}

END
{
        printf("%-40s %-10s %s\\n", "FILE", "LINE", "COUNT");
        printa("%-40s %-10d %@d\\n", @);
}

For this particular example (Date#+()), call the above wherecall.d and run it this way:

# dtrace -s ./wherecall.d "Date" "+" -c "../ruby ./cal.rb"
    August 2005
 S  M Tu  W Th  F  S
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

FILE                                     LINE       COUNT
./cal.rb                                 102        2
./cal.rb                                 60         6
./cal.rb                                 63         41

And looking at the indicated lines in cal.rb:

    55   def pict(y, m)
    56     d = (1..31).detect{|d| Date.valid_date?(y, m, d, @start)}
    57     fi = Date.new(y, m, d, @start)
    58     fi -= (fi.jd - @k + 1) % 7
    59
    60     ve  = (fi..fi +  6).collect{|cu|
    61       %w(S M Tu W Th F S)[cu.wday]
    62     }
    63     ve += (fi..fi + 41).collect{|cu|
    64       if cu.mon == m then cu.send(@da) end.to_s
    65     }
    66

So this is doing exactly what we would expect, given the code. Now, if we were interested in making this perform a little better, we might be interested to know the work that is being induced by Date#+(). Here's a script that reports the classes and methods called by a given class/method -- and its callees:

#pragma D option quiet

ruby$target:::function-entry
/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/
{
        self->date = 1;
}

ruby$target:::function-entry
/self->date/
{
        @[strjoin(strjoin(copyinstr(arg0), "#"),
            copyinstr(arg1))] = count();
}

ruby$target:::function-return
/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/
{
        self->date = 0;
        ndates++;
}

END
{
        normalize(@, ndates);
        printf("Each call to %s#%s() induced:\\n\\n", $$1, $$2);
        printa("%@8d call(s) to %s()\\n", @);
}

Calling the above whatcalls.d, we can answer the question about Date#+() this way:

# dtrace -s ./whatcalls.d "Date" "+" -c "../ruby ./cal.rb"
    August 2005
 S  M Tu  W Th  F  S
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

Each call to Date#+() induced:

       1 call(s) to Class#new0()
       1 call(s) to Class#reduce()
       1 call(s) to Date#+()
       1 call(s) to Date#initialize()
       1 call(s) to Fixnum#+()
       1 call(s) to Fixnum#<<()
       1 call(s) to Integer#gcd()
       1 call(s) to Module#===()
       1 call(s) to Object#Rational()
       1 call(s) to Object#class()
       2 call(s) to Class#new()
       2 call(s) to Class#new!()
       2 call(s) to Fixnum#abs()
       2 call(s) to Fixnum#div()
       2 call(s) to Rational#+()
       2 call(s) to Rational#initialize()
       3 call(s) to Fixnum#*()
       3 call(s) to Fixnum#<()
       8 call(s) to Object#kind_of?()
      10 call(s) to Fixnum#-()
      10 call(s) to Fixnum#>()
      23 call(s) to Fixnum#>>()
      37 call(s) to Fixnum#[]()
      52 call(s) to Fixnum#==()

That's a lot of work for something that should be pretty simple! Indeed, it's counterintuitive that, say, Integer#gcd() would be called from Date#+() -- and it certainly seems suboptimal. I'll leave further exploration into this as an exercise for the reader, but suffice it to say that this has to do with the use of a rational number in the Date class -- the elimination of which would eliminate most of the above calls and presumably greatly improve the performance of Date#+().
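If one wanted to quantify the potential win, a natural first step would be to measure the time actually spent in Date#+(). Here is a minimal sketch that does so, using the same $$1/$$2 argument convention as the scripts above (note that it does not account for recursive calls into the measured method, so treat the result as a rough figure); calling it, say, howlong.d (a hypothetical name), it would be run just like wherecall.d:

#pragma D option quiet

ruby$target:::function-entry
/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/
{
        self->ts = timestamp;
}

ruby$target:::function-return
/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2 && self->ts/
{
        @ = avg(timestamp - self->ts);
        self->ts = 0;
}

END
{
        printf("average time in %s#%s(): ", $$1, $$2);
        printa("%@d nanoseconds\n", @);
}

# dtrace -s ./howlong.d "Date" "+" -c "../ruby ./cal.rb"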

Now, Ruby aficionados may note that some of the above functionality has been available in Ruby via set_trace_func (upon which the Ruby profiler is implemented). While that's true (to a point -- set_trace_func seems to be a pretty limited mechanism), the Ruby DTrace provider is nonetheless a great lurch forward for Ruby developers: it allows developers to use DTrace-specific constructs like aggregations and thread-local variables to home in on a problem; it allows Ruby-related work performed lower in the stack (e.g., in the I/O subsystem, CPU dispatcher or network stack) to be connected to the Ruby code inducing it; it allows a running Ruby script to be instrumented (and reinstrumented) without stopping or restarting it; and it allows multiple, disjoint Ruby scripts to be coherently observed and understood as a single entity. More succinctly, it's just damned cool. So thanks to Rich for developing the prototype provider -- and if you're a Ruby developer, enjoy!
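As a concrete (if simplified) illustration of that "lower in the stack" point, here is a sketch that attributes system calls to the Ruby method most recently entered on the calling thread. It only tracks the innermost method -- so it's an approximation, not an attribution a profiler would swear by -- but it shows how the Ruby provider and the syscall provider can be combined:

#pragma D option quiet

ruby$target:::function-entry
{
        /* remember the most recently entered Ruby class and method */
        self->rclass = copyinstr(arg0);
        self->rmethod = copyinstr(arg1);
        self->depth++;
}

ruby$target:::function-return
/self->depth > 0/
{
        self->depth--;
}

syscall:::entry
/self->depth/
{
        /* attribute this system call to the innermost Ruby method */
        @[self->rclass, self->rmethod, probefunc] = count();
}

END
{
        printf("%-15s %-25s %-15s %s\n", "CLASS", "METHOD", "SYSCALL", "COUNT");
        printa("%-15s %-25s %-15s %@d\n", @);
}

As with the other scripts, this is run with -c (or -p) so that $target is defined.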

Tuesday Aug 16, 2005

DTrace on FreeBSD?

One of the exciting things about OpenSolaris is that it's released under a license -- the CDDL -- that allows ports of individual components to other systems. In particular, at my OSCON presentation two weeks ago, I discussed some of the expertise required to port one such component, DTrace, to another system. I'm happy to now report that Devon O'Dell has started working on a port to one such system, FreeBSD. This has been talked about before (in some cases, with braggadocio), but Devon is the first to start the work in earnest. And indeed, work it will be: DTrace isn't a simple system, and it has several dependencies on other, Solaris-specific system components. That said, it should certainly be possible, and we on Team DTrace are available to help out in any way we can. So if you're interested in working on this, you should ping Devon -- I know that he'll welcome the help. And if you have specific questions about DTrace internals (or anything, for that matter), swing by #opensolaris and join the party!

Thursday Aug 11, 2005

Ubuntu and DTrace break bread

Today, Mike, Adam and I had lunch with Jeff Waugh, who is in town for LinuxWorld. He showed us his, we showed him ours, and a great time was had by all. I think everyone agreed that a system with Debian packages, Ubuntu package management, the Solaris Service Management Facility and (of course) DTrace would be one hell of a system. And with OpenSolaris, this paradise seems tantalizingly attainable -- or does original sin prevent us from reentering the garden?

Friday Aug 05, 2005

DTrace and PHP, demonstrated

If you were at my presentation at OSCON yesterday, I apologize for the brisk pace -- there was a ton of material to cover, and forty-five minutes isn't much time. Now that I've got a little more time, I'd like to get into the details of the PHP demonstration that I did during the presentation. Here is the (very) simple PHP program that I was using:
<?php
function blastoff()
{
        echo "Blastoff!\\n\\n";
}

function one()
{
        echo "One...\\n";
        blastoff();
}

function two()
{
        echo "Two...\\n";
        one();
}

function three()
{
        echo "Three...\\n";
        two();
}

function launch()
{
        three();
}

while (1)
{
        launch();
        sleep(1);
}
?>
Running this in a window just repeats this output:
% php ./blastoff.php
php ./blastoff.php
Content-type: text/html
X-Powered-By: PHP/5.1.0-dev

Three...
Two...
One...
Blastoff!

Three...
Two...
One...
Blastoff!
...
Now, because I have specified Wez Furlong's new dtrace.so as an extension in my php.ini file, when I run the above, two probes show up:
# dtrace -l | grep php
  4   php806       dtrace.so          php_dtrace_execute function-return
  5   php806       dtrace.so          php_dtrace_execute function-entry
The function-entry and function-return probes have three arguments:
  • arg0 is the name of the function (a pointer to a string within PHP)

  • arg1 is the name of the file containing the call site (also a pointer to a string within PHP)

  • arg2 is the line number of the call site

So for starters, let's just get an idea of which functions are being called in my PHP program:

# dtrace -n function-entry'{printf("called %s() in %s at line %d\\n", \\
    copyinstr(arg0), copyinstr(arg1), arg2)}' -q
called launch() in /export/php/blastoff.php at line 32
called three() in /export/php/blastoff.php at line 27
called two() in /export/php/blastoff.php at line 22
called one() in /export/php/blastoff.php at line 16
called blastoff() in /export/php/blastoff.php at line 10
called launch() in /export/php/blastoff.php at line 32
called three() in /export/php/blastoff.php at line 27
called two() in /export/php/blastoff.php at line 22
called one() in /export/php/blastoff.php at line 16
called blastoff() in /export/php/blastoff.php at line 10
^C
If you're new to DTrace, note that you have the power to trace an arbitrary D expression in your action. For example, instead of printing out the file and line number of the call site, we could trace the walltimestamp:

# dtrace -n function-entry'{printf("called %s() at %Y\\n", \\
    copyinstr(arg0), walltimestamp)}' -q
called launch() at 2005 Aug  5 08:08:24
called three() at 2005 Aug  5 08:08:24
called two() at 2005 Aug  5 08:08:24
called one() at 2005 Aug  5 08:08:24
called blastoff() at 2005 Aug  5 08:08:24
called launch() at 2005 Aug  5 08:08:25
called three() at 2005 Aug  5 08:08:25
called two() at 2005 Aug  5 08:08:25
called one() at 2005 Aug  5 08:08:25
called blastoff() at 2005 Aug  5 08:08:25
^C
Note, too, that (unless I direct it not to) this will aggregate across PHP instances. So, for example:

#!/usr/sbin/dtrace -s

#pragma D option quiet

php*:::function-entry
{
        @bypid[pid] = count();
        @byfunc[copyinstr(arg0)] = count();
        @bypidandfunc[pid, copyinstr(arg0)] = count();
}

END
{
        printf("By pid:\\n\\n");
        printa("  %-40d %@d\\n", @bypid);

        printf("\\nBy function:\\n\\n");
        printa("  %-40s %@d\\n", @byfunc);

        printf("\\nBy pid and function:\\n\\n");
        printa("  %-9d %-30s %@d\\n", @bypidandfunc);
}
If I run three instances of blastoff.php and then the above script, I see the following after I ^C:

By pid:

  806                                      30
  875                                      30
  889                                      30

By function:

  launch                                   18
  three                                    18
  two                                      18
  one                                      18
  blastoff                                 18

By pid and function:

  875       two                            6
  875       three                          6
  875       launch                         6
  875       blastoff                       6
  889       blastoff                       6
  806       launch                         6
  889       one                            6
  806       three                          6
  889       two                            6
  806       two                            6
  889       three                          6
  806       one                            6
  889       launch                         6
  806       blastoff                       6
  875       one                            6
The point is that DTrace allows you to aggregate across PHP instances, allowing you to understand not just what a particular PHP is doing, but what PHP is doing more generally.

If we're interested in better understanding the code flow in a particular PHP instance, we can write a script that uses a thread-local variable to follow the code flow:

#pragma D option quiet

self int indent;

php$target:::function-entry
/copyinstr(arg0) == "launch"/
{
        self->follow = 1;
}

php$target:::function-entry
/self->follow/
{
        printf("%\*s", self->indent, "");
        printf("-> %s\\n", copyinstr(arg0));
        self->indent += 2;
}

php$target:::function-return
/self->follow/
{
        self->indent -= 2;
        printf("%\*s", self->indent, "");
        printf("<- %s\\n", copyinstr(arg0));
}

php$target:::function-return
/copyinstr(arg0) == "launch"/
{
        self->follow = 0;
        exit(0);
}
Running the above requires giving the script a particular PHP process; assuming the above script is named blast.d, it might look like this:

# dtrace -s ./blast.d -p `pgrep -n php`
-> launch
  -> three
    -> two
      -> one
        -> blastoff
        <- blastoff
      <- one
    <- two
  <- three
<- launch

This shows us all of the PHP functions that were called from launch(), but it doesn't tell the full story -- we know that our PHP functions are calling into native code to do some of their work. To add this to the mix, we'll write a slightly longer script:

#pragma D option quiet

self int indent;

php$target:::function-entry
/copyinstr(arg0) == "launch"/
{
        self->follow = 1;
}

php$target:::function-entry
/self->follow/
{
        printf("%\*s", self->indent, "");
        printf("-> %-20s %\*sphp\\n", copyinstr(arg0),
            46 - self->indent, "");
        self->indent += 2;
}

php$target:::function-return
/self->follow/
{
        self->indent -= 2;
        printf("%\*s", self->indent, "");
        printf("<- %-20s %\*sphp\\n", copyinstr(arg0),
            46 - self->indent, "");
}

pid$target:libc.so.1::entry
/self->follow/
{
        printf("%\*s", self->indent, "");
        printf("-> %-20s %\*s%s\\n", probefunc, 46 - self->indent, "",
            probemod);
        self->indent += 2;
}

pid$target:libc.so.1::return
/self->follow/
{
        self->indent -= 2;
        printf("%\*s", self->indent, "");
        printf("<- %-20s %\*s%s\\n", probefunc, 46 - self->indent, "",
            probemod);
}

php$target:::function-return
/copyinstr(arg0) == "launch"/
{
        self->follow = 0;
        exit(0);
}

Running the above yields the following output:

-> launch                                          php
  -> three                                         php
    -> write                                       libc.so.1
      -> _save_nv_regs                             libc.so.1
      <- _save_nv_regs                             libc.so.1
      -> _write                                    libc.so.1
      <- _write                                    libc.so.1
    <- write                                       libc.so.1
    -> two                                         php
      -> write                                     libc.so.1
        -> _save_nv_regs                           libc.so.1
        <- _save_nv_regs                           libc.so.1
        -> _write                                  libc.so.1
        <- _write                                  libc.so.1
      <- write                                     libc.so.1
      -> one                                       php
        -> write                                   libc.so.1
          -> _save_nv_regs                         libc.so.1
          <- _save_nv_regs                         libc.so.1
          -> _write                                libc.so.1
          <- _write                                libc.so.1
        <- write                                   libc.so.1
        -> blastoff                                php
          -> write                                 libc.so.1
            -> _save_nv_regs                       libc.so.1
            <- _save_nv_regs                       libc.so.1
            -> _write                              libc.so.1
            <- _write                              libc.so.1
          <- write                                 libc.so.1
        <- blastoff                                php
      <- one                                       php
    <- two                                         php
  <- three                                         php
<- launch                                          php

This shows us what's really going on -- or at least more of what's really going on: our PHP functions are inducing work from libc. This is much more information, but of course, we're still only seeing what's going on at user-level. To add the kernel into the mix, add the following to the script:

fbt:genunix::entry
/self->follow/
{
        printf("%\*s", self->indent, "");
        printf("-> %-20s %\*skernel\\n", probefunc, 46 - self->indent, "");
        self->indent += 2;
}

fbt:genunix::return
/self->follow/
{
        self->indent -= 2;
        printf("%\*s", self->indent, "");
        printf("<- %-20s %\*skernel\\n", probefunc, 46 - self->indent, "");
}

Running this new script generates much more output:

-> launch                                          php
  -> three                                         php
    -> write                                       libc.so.1
      -> _save_nv_regs                             libc.so.1
      <- _save_nv_regs                             libc.so.1
      -> _write                                    libc.so.1
        -> syscall_mstate                          kernel
          -> gethrtime_unscaled                    kernel
          <- gethrtime_unscaled                    kernel
        <- syscall_mstate                          kernel
        -> syscall_entry                           kernel
          -> pre_syscall                           kernel
          <- pre_syscall                           kernel
          -> copyin_args32                         kernel
            -> copyin_nowatch                      kernel
              -> watch_disable_addr                kernel
              <- watch_disable_addr                kernel
            <- copyin_nowatch                      kernel
          <- copyin_args32                         kernel
        <- syscall_entry                           kernel
        -> write32                                 kernel
          -> write                                 kernel
            -> getf                                kernel
              -> set_active_fd                     kernel
              <- set_active_fd                     kernel
            <- getf                                kernel
            ...
            -> releasef                            kernel
              -> cv_broadcast                      kernel
              <- cv_broadcast                      kernel
            <- releasef                            kernel
          <- write                                 kernel
        <- write32                                 kernel
        ...
        -> syscall_mstate                          kernel
          -> gethrtime_unscaled                    kernel
          <- gethrtime_unscaled                    kernel
        <- syscall_mstate                          kernel
      <- _write                                    libc.so.1
    <- write                                       libc.so.1
    -> two                                         php
      -> write                                     libc.so.1
        -> _save_nv_regs                           libc.so.1
        <- _save_nv_regs                           libc.so.1
        -> _write                                  libc.so.1
          -> syscall_mstate                        kernel
            -> gethrtime_unscaled                  kernel
            <- gethrtime_unscaled                  kernel
          <- syscall_mstate                        kernel
          ...

Now we're seeing everything that our PHP program is doing, from the PHP through the native code, into the kernel and back. So using DTrace on PHP has a number of unique advantages: you can look at the entire system, and you can look at the entire stack -- and you can do it all in production. Thanks again to Wez for putting together the PHP provider -- and if you're a PHP developer, bon appetit!

Wednesday Aug 03, 2005

DTrace and PHP

Tonight during our OpenSolaris BOF at OSCON, PHP core developer Wez Furlong was busy adding a DTrace provider to PHP. After a little bit of work (and a little bit of debugging), we got it working -- and damn is it cool. Wez implemented it as a shared object, which may then be loaded via an explicit extension directive in php.ini. Once loaded, two probes show up: function-entry and function-return. These probes have as their arguments a pointer to the function name, a pointer to the file name, and a line number. This allows one to, for example, get a count of all PHP functions being called:

# dtrace -n function-entry'{@[copyinstr(arg0)] = count()}'
Or you can aggregate on file name and quantize by line number:

# dtrace -n function-entry'{@[copyinstr(arg1)] = lquantize(arg2, 0, 5000)}'
Or you can determine the amount of wall time taken by a given PHP function:

# dtrace -n function-entry'/copyinstr(arg0) == "myfunc"/{self->ts = timestamp}' \
    -n function-return'/self->ts/{@ = avg(timestamp - self->ts); self->ts = 0}'
And because it's DTrace, this can all be done on a production box -- and without regard to the number of PHP processes. (So if you have 200 Apache processes handling PHP, the above invocations would aggregate across them.) When I get back, I'll download Wez's provider and post some more comprehensive examples. In the meantime, if you're a PHP developer at OSCON, stop Wez if you see him and ask him to give you a demo -- it's the kind of thing that needs to be seen to be appreciated...

Finally, if you're interested in adding your own DTrace provider to the application, language or system that you care about, be sure to check out my presentation on DTrace tomorrow at 4:30 in Portland room 255. (Hopefully this time I won't be tortured by memories of being mindfucked by Inside UFO 54-40.)

Tuesday Aug 02, 2005

Party in our room

If you're at OSCON and you're interested in an after-party with -- I'm told -- "premium drinks", swing by the OpenSolaris suite (room 1560 at the Doubletree) from 10pm on. If for nothing else, you'll want to be there in the early morning hours when Keith throws the furniture out the window in a drunken rage against gcc's SPARC backend.

Sunday Jul 31, 2005

DTrace at OSCON

So I'll be at OSCON this week, where I'll be giving two presentations on DTrace. The first is a free tutorial that Keith and I will be giving on OpenSolaris development with DTrace. This tutorial is on Tuesday from 1:30p to 5p in room D140, and did I mention that this tutorial is free? So even if you didn't plan to attend any of the tutorials, if you're going to be in Portland on Tuesday afternoon, you should feel free to swing by -- no need to preregister. This tutorial should be fun -- we're going to keep it very hands-on, and it will be a demo-intensive introduction both to DTrace and to the larger tool-set that we use to build OpenSolaris -- and to diagnose it when it all goes horribly wrong. Hopefully you can join us!

The second session is a presentation exclusively on DTrace. This is quite a bit shorter (it's 45 minutes), so this presentation is going to give a quick review of the DTrace fundamentals, and then focus on the confluence of DTrace and open source -- both what DTrace can do for your open source project, and what you can do (if you're so inclined) for the DTrace open source project. This session is on Thursday at 4:30 in Portland room 255.

Other than that, my schedule is filled only with odds and ends; if you're going to be at OSCON and you want to connect, drop me a line or leave a message for me at my hotel, the 5th Avenue Suites. See you in Portland!

Wednesday Jul 27, 2005

Using DTrace to debug NTP

Brian Utterback has a great blog entry describing using DTrace to debug a really nasty problem in NTP. This problem is a good object lesson for two reasons:
  • The pathology -- a signal of mysterious origin killing an app -- is a canonically nasty problem, and (before DTrace) it was very difficult (or damned near impossible) to diagnose. (A minimal sketch of the kind of DTrace enabling that exposes such a signal appears after this list.)

  • While Brian was able to use DTrace to get a big jump on the problem, completely understanding it took some great insight on Brian's part. This has been said before, but it merits reemphasis: DTrace is a tool, not a magician. That is, DTrace still needs to be used by someone with a brain. And not, by the way, because DTrace is difficult to use, but rather because systems -- especially misbehaving or suboptimal ones -- have endemic complexity. We have tried to make it as easy as possible to use DTrace to understand that complexity, but the complexity exists nonetheless.
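Here is the minimal sketch mentioned above -- an enabling of the proc provider's signal-send probe, which fires in the context of the sender and therefore tells you exactly who is delivering the mysterious signal. (The victim name, ntpd, is specific to this example; substitute whatever process is being killed.)

#pragma D option quiet

proc:::signal-send
/args[1]->pr_fname == "ntpd"/
{
        printf("%s (pid %d) sent signal %d to %s (pid %d)\n",
            execname, pid, args[2], args[1]->pr_fname, args[1]->pr_pid);
}

Of course, as Brian's entry makes clear, knowing who sent the signal is only where the real detective work begins.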

Summary: DTrace allows you to solve problems that were previously unsolvable (or damned near) -- but it means you'll be using your brain more, not less, so you'd better stop stabbing it with Q-tips.

Tuesday Jul 19, 2005

DTrace Safety

DTrace is a big piece of technology, and it can be easy to lose the principles in the details. But understanding these principles is key to understanding the design decisions that we have made -- and to understanding the design decisions that we will make in the future. Of these principles, the most fundamental is the principle of safety: DTrace must not be able to accidentally induce system failure. It is our strict adherence to this principle that allows DTrace to be used with confidence on production systems -- and it is its use on production systems that most fundamentally separates DTrace from what has come before it.

Of course, it's easy to say that this should be true, but what does the safety constraint mean? First and foremost, given that DTrace allows for dynamic instrumentation, this means that the user must not be allowed to instrument code and contexts that are unsafe to instrument. In any sufficiently dynamic instrumentation framework, such code and contexts exist (if nothing else, the framework itself cannot be instrumented without inducing recursion), and this must be dealt with architecturally to assure safety. We have designed DTrace such that probes are provided by instrumentation providers that guarantee their safety. That is, instead of the user picking some random point to instrument, instrumentation providers make available only the points that can be safely instrumented -- and the user is restricted to selecting among these published probes. This puts the responsibility for instrumentation safety where it belongs: in the provider. The specific techniques that the providers use to assure safety are a bit too arcane to discuss here,[1] but suffice it to say that the providers are very conservative in their instrumentation.

This addresses one aspect of instrumentation safety -- instrumenting wholly unsafe contexts -- but it doesn't address the recursion issue, where the code required to process a probe (a context that we call probe context) ends up itself encountering an enabled probe. This kind of recursion can be dealt with in one of two ways: lazily (that is, the recursion can be detected when it happens, and processing of the probe that induced the recursion can be aborted) or proactively (the system can be designed such that recursion is impossible). For a myriad of reasons, we elected for the second approach: to make recursion architecturally impossible. We achieve this by mandating that while in probe context, DTrace itself must not call into any facilities in the kernel at-large. This means both implicit and explicit transfers of control into the kernel-at-large -- so just as DTrace must avoid (for example) allocating memory in probe context, it must also avoid inducing scheduler activity by blocking.[2]

Once the fundamental safety issues of instrumentation are addressed, focus turns to the safety of user-specified actions and predicates. Very early in our thinking on DTrace, we knew that we wanted actions and predicates to be completely programmable, giving rise to a natural question: how are they executed? For us, the answer was so clear that it was almost unspoken: we knew that we needed to develop a virtual machine that could act as a target instruction set for a custom compiler. Why was this the clear choice? Because the alternative -- executing user-specified code natively in the kernel -- is untenable from a safety perspective, for many reasons:

  • Explicit stores to memory would have to be forbidden. To allow for user-defined variables (as we knew we wanted to do), one must clearly allow data to be stored. But if one executes natively, one has no way of differentiating a store to legal variable memory from a stray store to arbitrary kernel state. One is reduced to either forbidding stores completely (destroying a critical architectural component in the process), rewriting the binary to add checks around the store, or emulating stores and manually checking the target address in the emulation code. The first option is unacceptable from a feature perspective, the second option is a non-trivial undertaking rife with new failure modes, and the third option -- emulating stores -- shouldn't be thought of as native execution but rather as executing in a virtual machine that happens to match the underlying instruction set architecture.

  • Loops would have to be dynamically detected. One cannot allow user-defined code to spin on the CPU in the kernel, so loops must be dynamically detected and halted. Static analysis might be tempting, but in a Turing-complete system such analysis will always be heuristic -- one cannot solve the Halting Problem. While heuristics are fine when trying to achieve performance, they are not acceptable when correctness is the constraint. The problem must therefore be solved dynamically, and dynamic detection of loops isn't simple -- one must detect loops using any control transfer mechanism, including procedure calls. As with stores, one is reduced to either forbidding calls and backwards branches, rewriting the binary to add dynamic checks before all control transfers, or emulating them and manually checking the target address in the emulation code. And again, emulating these instructions negates the putative advantages of native execution.

  • Code would have to be very carefully validated for illegal instructions. For example, any instruction that operates on floating point state is (generally) illegal to execute in the kernel (Solaris -- like most operating systems -- doesn't save and restore the floating point registers on a kernel/user context switch; using the floating point registers in the kernel would corrupt user floating point state); floating point operations would have to be detected and code containing them rejected. There are many such examples (e.g. register-indirect transfers of control must be prohibited to prevent user-specified code from transferring control into the kernel at large, privileged instructions must be prohibited to prevent user-specified code from hijacking the operating system, etc. etc.), and detecting them isn't fail-safe: if one fails to detect so much as one of these cases, the entire system is vulnerable.

  • Executing natively isn't portable. This might seem counterintuitive because executing natively seems to "automatically" work on any instruction set architecture that the operating system supports -- it leverages the existing tool chain for compiling the user-specified code. But this leverage is Fools' Gold: as described above, the techniques to assure safety for native execution are profoundly specific to the instruction set -- and any new instruction set would require completely new validation code. And again, this isn't fail-safe: so much as one slip-up in a new instruction set architecture means that the entire system is at risk on the new architecture.

We left these many drawbacks of native execution largely unspoken because the alternative -- a purpose-built virtual machine for executing user-specified code -- was so clearly the better choice. The virtual machine that we designed, the D Intermediate Format (DIF) virtual machine, has the following safety properties:
  • It has no mechanism for storing to arbitrary addresses; the only store opcodes represent stores to specific kinds of user-defined variables. This solves two problems in a single design decision: it prohibits stores to arbitrary kernel memory, and it allows us to distinguish stores to different kinds of variables (global, thread-local, clause-local, etc.) from the virtual instruction opcode itself. This allows us to off-load intelligence about variable management from the instruction set and into the runtime where it belongs. (A short sketch of how these kinds of variables appear at the D source level follows this list.)

  • It has no backwards branches, and supports only calls to defined runtime support routines -- eliminating the possibility of user-defined loops altogether. This may seem unnecessarily severe (this makes loops an impossibility by architecture), but to us it was an acceptable tradeoff to achieve absolute safety.

  • It is sufficiently simple to be easily (and rigorously) validated, as can be seen from the straightforward DIF object validation code.

  • It is completely portable, allowing the validation and emulation code to be written and debugged once -- accelerating bringup of DTrace on new platforms.
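Here is the sketch promised above: the three kinds of user-defined scalar variables as they appear in D source. (The probes -- read(2) entry and return -- are arbitrary; they're here only to give the variables something to do.)

#pragma D option quiet

BEGIN
{
        total = 0;                              /* global variable */
}

syscall::read:entry
{
        self->ts = timestamp;                   /* thread-local variable */
}

syscall::read:return
/self->ts/
{
        this->elapsed = timestamp - self->ts;   /* clause-local variable */
        total += this->elapsed;
        self->ts = 0;
}

END
{
        printf("total time in read(2): %d nanoseconds\n", total);
}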

Just having an appropriately restricted virtual machine addressed many safety issues, but several niggling safety issues still had to be dealt with explicitly:

  • Runtime errors like division-by-zero or misaligned loads. While a virtual machine doesn't solve these in and of itself, it makes them trivial to solve: the emulator simply refuses to perform such operations, aborting processing and indicating a runtime error. (A trivial sketch demonstrating this behavior follows this list.)

  • Loads from I/O space. In the Solaris kernel, devices have memory that can be mapped into the kernel's address space. Loads from these memory ranges can have side-effects; they must be prohibited. This is only slightly more complicated than dealing with divisions-by-zero; before performing a load, the emulator checks that the virtual address does not fall in a range reserved for memory mapped devices, aborting processing if it does.

  • Loads from unmapped memory. Given that we wanted to allow user-defined code to chase pointers in the kernel, we knew that we had to deal with user-defined code attempting to load from an unmapped address. This can't be dealt with strictly in the emulator, as it would require probing kernel VM structures from probe context (which, if allowed, would prohibit instrumentation of the VM system). We dealt with this instead by modifying the kernel's page fault handler to check if a load has been DIF-directed before vectoring into the VM system to handle the fault. If the fault is DIF-directed, the kernel sets a bit indicating that the load has faulted, increments the trapping instruction pointer past the faulting load, and returns from the trap. The emulation code checks the faulted bit after emulating each instruction, aborting processing if it is set.
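And here is the trivial sketch promised above for the first of these: a clause that will, sooner or later, divide by zero. When it does, DTrace aborts processing of that particular probe firing and reports a runtime error -- the script keeps running, and the system is unharmed:

profile-1
{
        this->divisor = timestamp % 2;   /* zero roughly half the time */
        trace(1000 / this->divisor);     /* runtime error whenever the divisor is zero */
}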

So DTrace is not safe by accident -- DTrace is safe by deliberate design and by careful execution. DTrace's safety comes from a probe discovery process that assures safe instrumentation, a purpose-built virtual machine that assures safe execution, and a careful implementation that assures safe exception handling. Could a safe system for dynamic instrumentation be built with a different set of design decisions? Perhaps -- but we believe that were such a system to be as safe, it would either be so under-powered or so over-complicated as to invalidate those design decisions.

[1] The best place to see this provider-based safety is in the implementation of the FBT provider (x86, SPARC) and in the implementation of the pid provider (x86, SPARC).

[2] While it isn't a safety issue per se, this led us to two other important design decisions: probe context is lock-free (and almost always wait-free), and interrupts are disabled in probe context.


Monday Jul 11, 2005

Using DTrace to understand GNOME

I read with some interest about the GNOME startup bounty. As Stephen O'Grady pointed out, this problem is indeed perfect for DTrace. To get a feel for the problem, I wrote a very simple D script:
#!/usr/sbin/dtrace -s

#pragma D option quiet

proc:::exec-success
/execname == "gnome-session"/
{
        start = timestamp;
        go = 1;
}

io:::start
/go/
{
        printf("%10d { -> I/O %d %s %s %s }\\n",
            (timestamp - start) / 1000000, pid, execname,
            args[0]->b_flags & B_READ ? "reads" : "writes",
            args[2]->fi_pathname);
}

io:::done
/go/
{
        printf("%10d { <- I/O to %s }\\n",
            (timestamp - start) / 1000000, args[2]->fi_pathname);
}

io:::start
/go && ((struct buf *)arg0)->b_file != NULL &&
    ((struct buf *)arg0)->b_file->v_path == NULL/
{
        printf("%10s   (vp %p)\n", "", ((struct buf *)arg0)->b_file);
}

io:::start
/go/
{
        @apps[execname] = count();
        @files[args[2]->fi_pathname] = count();
        @appsfiles[execname, args[2]->fi_pathname] = count();
}

proc:::exec
/go/
{
        self->parent = execname;
}

proc:::exec-success
/self->parent != NULL/
{
        printf("%10d -> %d %s (from %d %s)\\n",
            (timestamp - start) / 1000000, pid, execname,
            curpsinfo->pr_ppid, self->parent);
        self->parent = NULL;
}

proc:::exit
/go/
{
        printf("%10d <- %d %s\\n",
            (timestamp - start) / 1000000, pid, execname);

}

profile-101hz
/go && arg1 != NULL/
{
        printf("%10d [ %d %s ]\\n",
            (timestamp - start) / 1000000, pid, execname);
}

profile-101hz
/go && arg1 == NULL &&
  (curlwpsinfo->pr_flag & PR_IDLE)/
{
        printf("%10d [ idle ]\\n",
            (timestamp - start) / 1000000);
}

END
{
        printf("\\n  %-72s %s\\n", "APPLICATION", "I/Os");
        printa("  %-72s %@d\\n", @apps);
        printf("\\n  %-72s %s\\n", "FILE", "I/Os");
        printa("  %-72s %@d\\n", @files);
        printf("\\n  %-16s %-55s %s\\n", "APPLICATION", "FILE", "I/Os");
        printa("  %-16s %-55s %@d\\n", @appsfiles);
}
This script uses a combination of CPU sampling and I/O tracing to determine roughly what's going on over the course of logging in. I ran the above script on my Acer Ferrari 3400 laptop running OpenSolaris by dropping to the command-line login after a reboot and running:
# nohup ./login.d > /var/tmp/login.out &
I then logged in, and made sure that the first thing I did was launch a gnome-terminal to pkill dtrace. (This could obviously be made quite a bit more precise, but it works as a crude way of indicating the completion of login.) Here is the output from performing this experiment. The first column is the millisecond offset from the start of gnome-session. CPU samples are contained within square brackets, I/O is contained within curly braces, and launched programs are noted with arrows. An I/O summary is provided at the end. A couple of observations:
  • Nearly one third of all I/O is to shared object text and read-only data. This is classic death-of-a-thousand-cuts, and it's hard to see that there's an easy way to fix this. But perhaps text could be reordered to save some I/O? More investigation is probably merited here.

  • Over a quarter of all I/O is from GConf -- and many of these are from wandering around an expansive directory hierarchy looking for configuration information. It is well-known that the XML backend is a big performance problem, and a better backend is apparently being worked on. At any rate, solving this problem is clearly in GConf's future.

  • Of the remaining I/O, a bunch is to icon files. Glynn Foster pointed out that this looks to be addressed by the GTK+ gtk-update-icon-cache, new in GTK+ 2.6 and contained within GNOME 2.10, so this experiment will obviously need to be repeated on GNOME 2.10.

  • CPU utilization doesn't look like a big issue -- or it's one that is at least dwarfed by the cost of performing I/O. That said, gconfd-2 looks to be a bit piggy in this department as well. We're only using CPU sampling -- and we're using a pretty coarse grained sample -- and gconfd-2 still showed up. More precise investigation into CPU utilization can be performed with the sched provider and its on-cpu and off-cpu probes.
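Picking up that last point, a minimal sketch of the more precise investigation -- charging on-CPU time to each process via the sched provider -- might look like the following. (One would want to bracket it the way the login script does, only counting once gnome-session has exec'ed, but the idea is the same.)

#pragma D option quiet

sched:::on-cpu
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @oncpu[execname] = sum(timestamp - self->ts);
        self->ts = 0;
}

END
{
        normalize(@oncpu, 1000000);
        printf("%-40s %s\n", "APPLICATION", "ON-CPU (ms)");
        printa("%-40s %@d\n", @oncpu);
}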

And an end-note: you might note many files named "<unknown>" in the output. This is due to I/Os being induced by lookups that are going through directories and/or symlinks that haven't been explicitly opened. For these I/Os, my script also gives you the pointer to the vnode structure in the kernel; you can get the path to these by using ::vnode2path in MDB:
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip
sctp usba uhci s1394 random nca lofs nfs audiosup sppp ptm ipc ]
> ffffffff8f6851c0::vnode2path
/usr/sfw/lib/libXrender.so.1
>
Yes, having to do this sucks a bit, and it's a known issue. And rumor has it that Eric even has a workspace with the fix, so stay tuned...

Update: Eric pointed me to a prototype with the fix, and I reran the script on a GNOME cold start; here is the output from that run. Interestingly, because the symlinks now show up, a little postprocessing reveals that we chased nearly eighty symlinks on startup! From the output, reading many of these took 10 to 20 milliseconds; we might be spending as much as one second of GNOME startup blocked on I/O just to chase symlinks! Ouch! Again, it's not clear how one would fix this; having an app link to libpango-1.0.so.0 and not libpango-1.0.so.400.1 is clearly a Good Thing, and having this be a symlink instead of a hardlink is clearly a Good Thing -- but all of that goodness leaves you with a read dependency that's hard to work around. Anyway, be looking for Eric's fix in OpenSolaris and then in an upcoming Solaris Express release; it makes this kind of analysis much easier -- and thanks Eric for the quick prototype!
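To put a harder number on the time spent blocked on this I/O, one could extend the approach to sum I/O latency per file with the io provider. Here is a rough sketch; it matches completions to requests by device and block number, so it measures device-level latency rather than strictly the time an application spent blocked, but it's a reasonable first approximation:

#pragma D option quiet

io:::start
{
        iostart[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/iostart[args[0]->b_edev, args[0]->b_blkno]/
{
        @latency[args[2]->fi_pathname] =
            sum(timestamp - iostart[args[0]->b_edev, args[0]->b_blkno]);
        iostart[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
        normalize(@latency, 1000000);
        printf("%-64s %s\n", "FILE", "I/O (ms)");
        printa("%-64s %@d\n", @latency);
}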


Monday Jun 27, 2005

DTrace at JavaOne

I spent the day today at JavaOne in the City, where DTrace was featured prominently in the morning keynote: JohnnyL described DTrace in the abstract, and drew a (surprisingly large!) round of applause when he announced that DTrace now supports Java. He then called Adam on stage to do a demo, the most remarkable aspect of which is that it was actually DTrace and not the canned, gussied-up gimcrack that normally passes for a keynote demo. In fact, this demo was so tangibly real, Adam has blogged the scripts from the demo. With these scripts, you can now give Adam's keynote in the comfort of your own home!

In other JavaOne news, Adam, Stephen and I are going to be leading the charge at an OpenSolaris BOF tonight at 7:30pm at the San Francisco Marriott in Golden Gate B1. With the IBM announcement today supporting Solaris, with the new super-fast, super-cheap Solaris Opteron workstation, and with new Solaris technologies like DTrace and Zones, Solaris has never been more attractive to the Java developer; swing by our BOF to see what all the fuss is about!


Saturday Jun 25, 2005

Still more blog sifting...

Even though the launch of OpenSolaris was well over a week ago, and even though the Opening Day entries have now been sifted through in five different blog entries (here, here, here, here and here), there are still some great uncategorized entries. Without further ado...

Phew! I think that about does it. When we first tried to encourage Solaris engineers to blog on Opening Day, I thought we were going to have a hard time convincing engineers to blog -- I knew that providing in-depth, technical content takes a lot of time, and I knew that everyone had other priorities. So when we were planning the launch and talking about the possibility of dealing with a massive amount of Opening Day content, my response was "hurt me with that problem." Well, as it turns out, most engineers didn't need much convincing -- many provided rich, deep content -- and I was indeed hurt with that problem! While it was time consuming to sift through them, hopefully you've enjoyed reading these entries as much as I have. And let it be said once more: welcome to OpenSolaris!


Monday Jun 20, 2005

Yet more blog sifting...

Despite there now being four blog entries sifting through the Opening Day entries (one from me, one from Liane, one from Claire and another from me), there are still some great entries that have gone uncategorized. Many of these entries fall into the category of "debugging war stories" -- engineers describing a particularly tough or interesting bug that they nailed. The proliferation of these kinds of stories among the Opening Day entries is revealing: in some organizations, you're a hero among engineers if you come up with a whizzy new app (even if the implementation is a cesspool), while in others you might gain respect among engineers by getting a big performance win (even if it makes things a bit flakey), but in Solaris development, nothing gains the respect of peers like debugging some diabolical bug that has plagued the operating system for years. So if you're new to OpenSolaris and you're looking to make a name for yourself, a good place to start is in the never-ending search for elusive bugs. To get a flavor for the thrill of the hunt, check out: And here are a few more on the simple (but thorough) satisfaction from fixing old bugs: That about does it for today. There are still many more entries to categorize, but fortunately, I think I can see the light at the end of the tunnel -- or is that just the approach of death from the exhaustion of sifting through all of this content?

