Real Java debugging w/ DTrace

When I was in college one of the rights of passage in the computer science department was the software engineering class which involved a large group project. Fresh from completing that class, my brother turned up the other day in San Francisco (where I live); naturally I wanted to try out the game he and his friends had written. Hogs is a 3-D tank game written in Java -- when it failed to run on my Solaris 10 laptop I decided to use the new DTrace agents for the JVM that I blogged about recently.

After downloading the game and the requisite libraries (jogl, OGL, etc.) I tried running it and got this:

java.net.UnknownHostException: epizooty: epizooty
        at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
        at hogs.net.client.RemoteEngine.(RemoteEngine.java:79)
        at hogs.net.client.NetEngine.forHost(NetEngine.java:93)
        at hogs.common.Controller.(Controller.java:226)
        at hogs.common.Controller.main(Controller.java:118)

Without understanding much about Java or anything about how my brother's game worked, I guessed that this code was trying to figure out the hostname of my laptop. The strange thing was that it seemed to find the name -- epizooty -- bu then get confused and throw some exception. The stack backtrace didn't give me much to go on so I decided to put this new Java DTrace agent to the test.

Using the dvm provider was, initially, a bit of a pain (through no fault of its own). The dvm provider is very easy to use for long running Java programs: you just fire up the JVM and enable the probes at some later time. Because of the failure during start up, the game wasn't sticking around long enough for me to enable the probes. And while dtrace(1M) has a -c option that lets you specify a command to examine with DTrace the dvm probes don't show up until a little later when the JVM has initialized. It's worth mentioning that on the next version of Solaris (available via Solaris Express we've added a feature that lets you specify probes that don't yet exist that will be enabled when they show up; that feature will be in an early Solaris 10 update. Since this was a stock Solaris 10 system though, I had to get creative.

Using some knowledge of how DTrace user-level statically defined tracing (USDT) providers load, I wrote stop.d that waits until the dvm provider loads and stops the process. After the process is stopped, another invocation of DTrace can then use the dvm provider.

#!/usr/sbin/dtrace -s

#pragma D option destructive

syscall::close:entry
/pid == $target &&
 basename(curthread->t_procp->p_user.u_finfo.fi_list[arg0].uf_file->f_vnode->v_path) == "dtrace@0:helper"/
{
        self->interested = 1;
}

syscall::close:entry
/self->interested/
{
        cnt++;
}

syscall::close:entry
/self->interested && cnt == 2/
{
        stop();
        printf("stopped %d\\n", pid);
        exit(0);
}

syscall::close:return
/self->interested/
{
        self->interested = 0;
}

DTrace USDT provider and helpers open a special helper psuedo device to register with the DTrace framework. When they're done, they use the close(2) system call to close the file descriptor to the device. What this script does is look for calls to close(2) where the file descriptor corresponds to that pseudo device. It's worth mentioning here that in the next version of Solaris there's a fds[] array that gives you the file name and other information for an open file descriptor so this will be a little cleaner in the future. The script looks for the second such close(2) because the JVM itself has a DTrace helper which enables the magic of the jstack() action. To be clear: I'm not particularly proud of this script, but it got the job done.

Once I had the game stopped at the right spot, I run amid the noise this snippet looked interesting:

  0  34481       _method_entry:method-entry -> java/net/InetAddress$1.lookupAllHostAddr()
  0  34481       _method_entry:method-entry -> java/net/UnknownHostException.()

So this localAllHostAddr() method was throwing the exception that was causing me so much heartache. I wanted to understand the actual interaction between this method and lower level address resolution. It turned out that the native library calls were in a shared object that the JVM was lazily loading so I needed to stop the process after the native library had been loaded but before the method had completed. I wrote the following as a sort of conditional breakpoint:

#!/usr/sbin/dtrace -s

#pragma D option destructive

dvm$target:::method-entry
/copyinstr(arg1) == "getLocalHost"/
{
        self->state = 1;
}

dvm$target:::method-entry
/copyinstr(arg1) == "lookupAllHostAddr" && self->state == 1/
{
        self->state = 2;
        stop();
        exit(0);
}


dvm$target:::method-return
/copyinstr(arg1) == "lookupAllHostAddr" && self->state == 2/
{
        self->state = 1;
}

dvm$target:::method-return
/copyinstr(arg1) == "getLocalHost" && self->state == 1/
{
        self->state = 0;
}

Sifting through some more data, I figured out the name of the native function that was being used to implement lookupAllHostAddr() and wrote this script to follow the program flow from there:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

pid$target::Java_java_net_Inet4AddressImpl_lookupAllHostAddr:entry
{
        self->interested = 1;
}

pid$target:::entry
/self->interested/
{
}

pid$target:::return
/self->interested/
{
        printf("+%x %x (%d)", arg0, arg1, errno);
}

pid$target::gethostbyname_r:entry
/self->interested/
{
        printf("hostname = %s", copyinstr(arg0));
}

pid$target::Java_java_net_Inet4AddressImpl_lookupAllHostAddr:return
/self->interested/
{
        self->interested = 0;
}

In the output I found a smoking gun: gethostbyname_r(3NSL) was returning NULL. A little more investigation confirmed that the argument to gethostbyname_r(3NSL) was "epizooty"; a little test program showed the same problem. Now well away from Java and in more familar waters, I quickly realized that adding an entry into /etc/hosts was all I needed to do to clear up the problem.

This was a great experience: not only was I able to use this dvm stuff to great effect (for which my excitement had been largely theoretical), but I got to prove to my brother how amazingly cool this DTrace thing really is. As I haven't done any serious Java debugging for quite a while I'd like to pose this question to anyone who's managed to stay with me so far: How would anyone debug this without DTrace? Are there other tools that let you observe Java and the native calls and the library routines? And, though I didn't need it here, are there tools that let you correlate Java calls to low level kernel facilities? I welcome your feedback.


Technorati tag:
Comments:

Well, I'm not sure about the native libraries, but one can use SunStudio10, it has support for both java-level and native debugging.

However, I myself use a combination of dbx and Netbeans on solaris, and VisualStudio+Netbeans on windows. That is, I'd start an application and attach to it with both java and native debugger.

It's especially fun when debugging j2se's full-screen support on windows, then I have the windows host running the application in fullscreen exclusive mode, and Netbeans running on my Solaris box attached to that application to debug the java level, and VisualStudio running on my notebook (the only reason I have to keep windows on that notebook =( ), also attached to the same application for native code debugging.

Thanks,
Dmitri
Java2D Team

Posted by Dmitri Trembovetski on May 30, 2005 at 03:21 PM PDT #

Well, knowing that it was an UnknownHostException, I'd search my naming databases for an entry for the string that is printed in the stack trace. The exception is very self-explanatory!

Posted by S.G. on June 01, 2005 at 01:31 PM PDT #

S.G.,
I hope you're being facetious. It obviously found the name of the host, and other tools (ping(1M), traceroute(1M), etc.) were able to resolve that hostname. The problem was to understand the precise behavior of the Java implementation to understand how it differed from what other tools that query the naming database were doing.

Posted by Adam Leventhal on June 02, 2005 at 01:09 AM PDT #

[Trackback] Although a little crusty, if you call February 2005 crusty, and I do, this study on Linux 2.6 and Solaris 10 just floated past my desk. Some of it's right, but some of it's FUD, as expected from a "sponsored" study by IBM and Red Hat. Let me...

Posted by The Alethiometer on July 12, 2005 at 02:13 AM PDT #

Post a Comment:
Comments are closed for this entry.
About

Adam Leventhal, Fishworks engineer

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today