Debugging the Debugger

I've been missing for almost a week, mostly because of my involvement with the amd64 bringup effort. A while ago, I was recruited to get the ptools and mdb up and running in 64-bit mode. This certainly made me appreciate some of the old war stories - all the Solaris veterans have their favorite bug that they debugged using only hex dumps, a pocket knife, and a ball of string. Over time, you start taking the Solaris debugging tools for granted: try going back to Solaris 9 after spending a year with DTrace[1]. I was in for quite a shock when I learned that my spoiled lifestyle wasn't going to cut it in the jungles of amd64.

It's no secret that we've had the amd64 kernel up and running for a while now. Thankfully, I was not part of the initial bringup effort. By the time I joined, the kernel was already booting multiuser, and I never had to lay a finger on a simulator or diagnose a double fault. 64-bit applications would load and run (thanks in part to a certain linker alien[2]), but debugging them was basically impossible: no truss, no mdb, no pstack. So where do you begin?

Thankfully, we've had a 64-bit OS for years, and most of the infrastructure was already working. All our tools worked with 64-bit ELF files out of the box, for example. But a lot of things were still broken. I ended up along roughly the following path:

pstack on corefiles

So pstack segfaulted the first time I ran it. At this point I could run elfdump on the corefile, but not much else. The first task was getting pstack to run on corefiles, so I at least knew where to begin inserting my printf() statements. Walking a stack on amd64 can be a tricky thing - so I began with a simple version that works 99% of the time.
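To make the "simple version" concrete, here is a minimal sketch of a frame-pointer stack walk, assuming the target keeps a conventional %rbp chain (saved %rbp at the frame pointer, return address 8 bytes above it). The stack_image_t type and the read_word() and walk_stack() functions are hypothetical names invented for this illustration; they are not pstack's or libproc's actual code, and the "stack" is just an in-memory array standing in for data read from a corefile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a target's stack, as if read from a corefile. */
typedef struct {
    uint64_t base;          /* target virtual address of words[0] */
    const uint64_t *words;  /* stack contents, one 64-bit word per slot */
    size_t nwords;
} stack_image_t;

/* Read one 8-byte word at a target address; fail if it is out of range. */
static int read_word(const stack_image_t *s, uint64_t addr, uint64_t *out)
{
    if (addr < s->base || (addr - s->base) % 8 != 0)
        return -1;
    size_t idx = (size_t)((addr - s->base) / 8);
    if (idx >= s->nwords)
        return -1;
    *out = s->words[idx];
    return 0;
}

/*
 * Follow the saved-%rbp chain starting at rbp, recording each return
 * address in pcs[]. Returns the number of frames found. Stops on a null
 * frame pointer, an unreadable address, or a chain that fails to move
 * toward higher addresses (which would otherwise loop forever).
 */
static size_t walk_stack(const stack_image_t *s, uint64_t rbp,
    uint64_t *pcs, size_t max)
{
    size_t n = 0;

    assert(s != NULL && pcs != NULL);
    while (rbp != 0 && n < max) {
        uint64_t next, ret;

        if (read_word(s, rbp, &next) != 0 ||
            read_word(s, rbp + 8, &ret) != 0)
            break;
        pcs[n++] = ret;
        if (next != 0 && next <= rbp)
            break;
        rbp = next;
    }
    return n;
}
```

One well-known way a walker like this can go wrong is code compiled without a frame pointer, which the amd64 ABI permits; whether that is the failing 1% the post has in mind is an assumption.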

mdb for corefiles

The next step was to get MDB chewing on these corefiles. A stack trace is all well and good, but we need to be able to examine registers and memory. This turned out to be quite a bit of work; mdb is quite a heavy consumer of libproc, and uses some little-used interfaces in libc (in particular, getcontext(2) and makecontext(3C) were annoying). But with a lot of printfs, a few fixes and a few hacks, we had post-mortem debugging.


truss

Sadly, I can't take credit for this one. This turned out to be just a bug in fork(2), and once that was fixed, truss worked flawlessly.

mdb for live processes

This was not too difficult thanks to the magic of libproc, which allows us to manipulate live processes and corefiles through the same interface. A few minor tweaks were needed here and there, and some of the finer bugs have yet to be fixed, but it's basically working. Most of the ISA-specific actions (such as setting breakpoints) are the same on ia32 and amd64.
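The "same interface" idea can be sketched as a handle that carries its own read routine, so the consumer never needs to know whether it is looking at a live process or a corefile. Everything below (target_t, core_read(), live_read(), peek_byte()) is a hypothetical illustration and not libproc's actual API; the "live" backend is a flat buffer standing in for a process address space, and the "core" backend is a small segment table standing in for a corefile's mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct target target_t;

/* One handle type for both kinds of target; only the backend differs. */
struct target {
    long (*read)(const target_t *, void *buf, size_t len, uint64_t addr);
    const void *backend;
};

/* Hypothetical corefile backend: a table of dumped memory segments. */
typedef struct { uint64_t vaddr; size_t len; const uint8_t *data; } segment_t;
typedef struct { const segment_t *segs; size_t nsegs; } core_t;

/* Hypothetical live backend: one flat window of the address space. */
typedef struct { uint64_t base; size_t len; const uint8_t *mem; } live_t;

static long core_read(const target_t *t, void *buf, size_t len, uint64_t addr)
{
    const core_t *c = t->backend;

    for (size_t i = 0; i < c->nsegs; i++) {
        const segment_t *s = &c->segs[i];

        /* The request must fall entirely inside one segment. */
        if (addr >= s->vaddr && addr - s->vaddr < s->len &&
            len <= s->len - (addr - s->vaddr)) {
            memcpy(buf, s->data + (addr - s->vaddr), len);
            return (long)len;
        }
    }
    return -1;
}

static long live_read(const target_t *t, void *buf, size_t len, uint64_t addr)
{
    const live_t *l = t->backend;

    if (addr < l->base || addr - l->base > l->len ||
        len > l->len - (addr - l->base))
        return -1;
    memcpy(buf, l->mem + (addr - l->base), len);
    return (long)len;
}

/* Consumer code: identical for live processes and corefiles. */
static int peek_byte(const target_t *t, uint64_t addr, uint8_t *out)
{
    assert(t != NULL && out != NULL);
    return t->read(t, out, 1, addr) == 1 ? 0 : -1;
}
```

With this shape, a debugger command written once against peek_byte() works on either kind of target, which is the property that made the live-process port mostly a matter of minor tweaks.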

agent LWP and pfiles

Finally, I had to get Psyscall (the libproc internal function that executes a system call in the context of a target process) working. This was particularly annoying, mostly because the code was poorly structured - rather than having separate ISA-specific actions in different files, we had tons of #ifdefs scattered throughout the code. A large part of this was just ripping apart the code and restructuring it in a way that made porting easier. Someday when someone ports Solaris to run on Adam's laptop, they'll appreciate it.
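As a rough illustration of that restructuring, the sketch below keeps the ISA-independent logic free of #ifdefs and pushes the per-ISA argument marshalling behind a small table of function pointers. All names here (agent_state_t, isa_ops_t, setup_syscall() and the two marshal routines) are hypothetical and heavily simplified; this is not the actual Psyscall code. It assumes the usual conventions: amd64 passes system call arguments in registers (%rdi, %rsi, %rdx, %r10, %r8, %r9), while ia32 passes them as 32-bit words on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARGS 6

/* Hypothetical, simplified register and stack image for the agent LWP. */
typedef struct {
    uint64_t sysnum;           /* system call number (%eax or %rax) */
    uint64_t regs[MAX_ARGS];   /* argument registers, if the ISA uses them */
    uint64_t stack[MAX_ARGS];  /* argument words placed on the stack */
    size_t nregs;
    size_t nstack;
} agent_state_t;

/* Everything ISA-specific hides behind this table. */
typedef struct {
    const char *name;
    void (*marshal)(agent_state_t *, uint64_t sysnum,
        const uint64_t *argv, size_t argc);
} isa_ops_t;

/* Would live in its own amd64-only source file. */
static void amd64_marshal(agent_state_t *st, uint64_t sysnum,
    const uint64_t *argv, size_t argc)
{
    /* regs[0..5] stand for %rdi, %rsi, %rdx, %r10, %r8, %r9. */
    st->sysnum = sysnum;
    st->nregs = argc;
    st->nstack = 0;
    for (size_t i = 0; i < argc; i++)
        st->regs[i] = argv[i];
}

/* Would live in its own ia32-only source file. */
static void ia32_marshal(agent_state_t *st, uint64_t sysnum,
    const uint64_t *argv, size_t argc)
{
    /* Arguments go on the stack, truncated to 32-bit words. */
    st->sysnum = sysnum;
    st->nregs = 0;
    st->nstack = argc;
    for (size_t i = 0; i < argc; i++)
        st->stack[i] = argv[i] & 0xffffffffULL;
}

static const isa_ops_t amd64_ops = { "amd64", amd64_marshal };
static const isa_ops_t ia32_ops = { "ia32", ia32_marshal };

/* ISA-independent code: no #ifdef needed here. */
static int setup_syscall(const isa_ops_t *ops, agent_state_t *st,
    uint64_t sysnum, const uint64_t *argv, size_t argc)
{
    assert(ops != NULL && st != NULL);
    if (argc > MAX_ARGS)
        return -1;
    ops->marshal(st, sysnum, argv, argc);
    return 0;
}
```

Porting to a new ISA then means writing one more marshal routine and one more table entry, rather than threading another #ifdef branch through the shared code.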

In a testament to the portability of Solaris, there were no large infrastructure changes outside of Psyscall. Basically, I just fixed one small bug after another. So all the debugging tools are now up and running, and with Bryan and Matt helping, we have DTrace and KMDB as well. So now I can go back to a pampered life in my Hollywood Hills mansion, surrounded by DTrace, MDB, and a few of my closest ptools.

[1] Solaris debugging can be roughly divided into three eras: pre-mdb (Paleozoic), pre-DTrace (Mesozoic), and modern day (Cenozoic). The arrival of CTF data could be seen as the end of the Triassic period and the beginning of the Jurassic, while KMDB may begin the Pleistocene (a.k.a. modern) era. Sounds like an interesting science project...

[2] There were many others involved in getting the kernel this far. But Mike's the only one with a blog, so he gets all the credit.

