Happy 5th Birthday, DTrace!

It's hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers -- and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling...


Our project had been running for nearly two years, but it was not until mid-morning on September 2nd -- the day before we were slated to integrate -- that it was discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, "failed to boot" can meet a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel -- and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than to start adding print statements -- or start ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I starting ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another...


Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can't remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one -- and I remain grateful to this day that Eric didn't quit on the spot...


Back on our problem, Mike -- through process of elimination -- had made the key breakthrough: it wasn't actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes -- this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) became compiled in only when DEBUG was defined. We hadn't seen this until now because these drivers were only used on ancient machines -- machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG -- and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)


To celebrate DTrace's birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:


  • The primordial presentation on DTrace, developed in late 1999. Some of the core ideas are present here (in particular, production instrumentation and zero disabled probe effect), but we hadn't yet figured out some very basic notions -- like that we needed our own language.


  • Our first real internal presentation on DTrace, presented March 12, 2002 as a Kernel Technical Discussion. Here the thinking is much better fleshed out around kernel-level instrumentation -- and a prototype existed and was demonstrated. But a key direction for the technology -- the ability to instrument user-level generally and in semantically relevant ways in particular -- was still to come when Adam joined the team shortly after this presentation. (A video of this presentation also exists; in the unlikely event that anyone wants to actually relive three hours of largely outmoded thinking, I'll find a way to make it available.)


  • The e-mail we sent out after integration, September 3, 2003 -- five years ago today.


We said it then, and it's even truer today: it's been quite a ride. Happy 5th Birthday, DTrace -- and thanks again to everyone in the DTrace community for making it what it has become!

Comments:

Hey, Bryan,

I'll raise a glass to you three this evening in honor of the occasion.

Chad

Posted by Chad Mynhier on September 03, 2008 at 04:23 AM PDT #

Raise a pint my friend, ya'll deserved it. Of all the Solaris 10 features DTrace & ZFS are the absolute winners.

Posted by Anantha on September 03, 2008 at 05:28 AM PDT #

Congratulations!

Posted by Jonathan Adams on September 03, 2008 at 06:38 AM PDT #

I've been spoiled for years. I can't even think about looking at a system that does not have DTrace!

Congratulations on showing the world how to do it right.

Posted by Jim Mauro on September 03, 2008 at 08:03 AM PDT #

here's to applying Moore's Law to the next 5 years!
It was truly inspiring being introduced to DTrace at your Lisa in 2006 presentation - I was reminded of the Java presentation I saw James Gosling give in Calgary a loooong time ago

may you never run out of interesting problems to solve!

Cheers!

Posted by fletch on September 03, 2008 at 09:27 AM PDT #

Congratulations Bryan, Mike and Adam! Can't imagine a world without DTrace. Thank you thank you.

Posted by Jason Hoffman on September 03, 2008 at 10:57 AM PDT #

Congratulations guys. Thank you for bringing light to the darkness. :-) Can't imagine life before it again.

Posted by Jason J. W. Williams on September 03, 2008 at 05:50 PM PDT #

Congrats and many thanks for this tool.
Your dtrace presentation on goggle tech talks really blew me away. It was actually one of the reasons why I immediately switched from Linux to Solaris after a decade. I now use it almost every day when debugging problems.
Dtrace makes working with software so much more fun and a lot more productive.

Regards,
Björn

Posted by Björn Herzig on September 03, 2008 at 10:40 PM PDT #

Congrats, guys! :)

Posted by Jim Grisanzio on September 04, 2008 at 07:56 PM PDT #

Happy Birthday to a really great tool!

So, are there any anniversary T-Shirts? :-)

Posted by Magnus on September 05, 2008 at 02:36 AM PDT #

And by the way, yes, please share the presentation video.

Posted by Magnus on September 05, 2008 at 02:41 AM PDT #

Congratulations! Any idea if the PSARC case could be opened for us all to see?

Posted by Ceri Davies on September 05, 2008 at 05:11 AM PDT #

Hilarious Google techtalk, at first I thought you'd lost me with the rambling intro... but your point was well made. Great talk, a lot of fun.

I think the value of Dtrace will be to make OS's much much better by having a superb textual visualization tool - Honestly I'm an avid fan of FreeBSD, Linux and Solaris... it'll be great to see all three of these improve so much as your tool gets used day to day by other kernel hackers.

I wonder whether the Linux community will eventually make an (API-, D-language-, DtraceVM- ) compatible version from your specs, in order to have a fully GPL'd Dtrace impl in mainline.

After seeing Dtrace running...like your analogy, its hard to go back to normal TV, when you've tried TiVo :]

cheers,

gord.

Posted by just...gord on September 30, 2008 at 04:04 PM PDT #

how dtrace its thread safe? i mean changing the nops for traps instructions in the probe.

Posted by CarlOS on October 18, 2008 at 07:44 AM PDT #

CarlOS: What's your concern, exactly?

Posted by Jonathan Adams on October 18, 2008 at 08:04 AM PDT #

according to this http://blogs.sun.com/basant/entry/performance_impact_of_dtrace_probes normally the probes are just nops, but when dtrace use the probe the nops are changed for a trap instruction. How does this works on a multithreaded app?
What happend if multiple threads step in the same probe and DTrace at the same time change the nops of the probe for the trap instruction?

Posted by CarlOS on October 18, 2008 at 09:34 AM PDT #

The first thing dtrace does when processing a probe firing is to disable interrupts; at that point, the processing is locked to a CPU. Tracing data goes into per-CPU buffers.

The provider (fbt or sdt) doesn't have to worry about any of this; it just has to turn the 'nop' into something which calls dtrace_probe() with the appropriate arguments.

Posted by Jonathan Adams on October 18, 2008 at 05:07 PM PDT #

Post a Comment:
Comments are closed for this entry.
About

bmc

Search

Top Tags
Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today