DTrace Mozilla - what's going on with my DNS

DNS Lookup Probes:
 mozilla<pid>:::dnslookup-start 
mozilla<pid>:::dnslookup-done

Args:
arg0 is of type void \* - unique_id
arg1 is an enum nsTraceDNSLookupType - TYPE_DNSLOOKUP_LOAD or TYPE_DNSLOOKUP_REQUEST
arg2 is of type struct nsTraceDNSLookupInfo \* - pointer to the DNS Lookup info

struct nsTraceDNSLookupInfo {
char \* host;
char \* address;
int status;
};

Well here's a little output using the probes in our timing script, to look at what happens when we load up Jonathan's blog :) The full listing is here, both when doing a hard load (not cached) and an ordinary refresh (lots of caching).

DTrace Firefox load times for
http://blogs.sun.com/jonathan

Hard - Reload
================
$ browserspy_time_short.d
Tracing Browser Network Activity... Hit Ctrl-C to end.
\^C

PROBE STATS:

TYPE TOTAL COUNT
DNS_REQ Network Requests - Done 14
DNS_REQ Network Requests - Started 14
DNS_LOAD Lookups - Network and Cache 25

TYPE TOTAL COUNT
URI Network Requests - Done 23
URI Network Requests - Started 23
IMAGE_REQ Network Requests - Done 53
IMAGE_REQ Network Requests - Started 53
IMAGE_LOAD Image Loads - Network and Cache 182

Browser DNS lookup requests,

PID UID STATUS COUNT AVG(msec) SUM(msec) IP HOST
27359 65535 1 1 0 0 72.5.124.61 www.sun.com
27359 65535 1 1 0 0 65.214.39.152 www.bloglines.com
27359 65535 1 1 0 0 66.249.93.91 sb.google.com
27359 65535 1 1 0 0 66.249.93.99 www.google.com
27359 65535 1 1 0 0 213.200.109.9 images.cafepress.com
27359 65535 1 1 10029 10029 64.78.155.105 www.newsgator.com
27359 65535 1 1 10035 10035 145.97.39.156 upload.wikimedia.org
27359 65535 1 1 10038 10038 159.134.196.118 us.i1.yimg.com
27359 65535 1 1 10045 10045 72.32.62.117 www.37signals.com
27359 65535 1 1 10052 10052 64.125.132.39 weblogs.java.net
27359 65535 1 1 10153 10153 216.14.208.10 russian-crafts.com
27359 65535 1 1 10207 10207 204.16.104.2 marketing.openoffice.org
27359 65535 1 1 10287 10287 205.219.188.135 www.netstate.com
27359 65535 1 1 10492 10492 208.106.144.101 www.gobuffaloniagara.com

Now it's great that we can get this data back along with the resolved IP, but what was worrying is the DNS lookup times. They are either less than a msec or around 10 sec. This looked weird, especially when an ordinary nslookup on my machine takes around 300 msec.

With Brendan Gregg's and biesi's help on irc.mozilla.org #developers we dug in a little more to try and figure out what was happening. We thought maybe the probe location in the nsHostResolver::ThreadFunc(...) was to blame, so we moved them around a little to avoid being blocked by some logging calls and an  OnComplete( ) but still no joy. I was suspicious that some queries I was doing in the dnslookup-done probe to get the IP from the result  might have been expensive, they weren't. So the probes looked OK, something else was up.

Brendan put together a one-liner to look at the lower level dns lookup call the probes where surrounding, but still we got the 10 sec delays on first lookup.

dtrace -n 'BEGIN { self->last = timestamp; } pid$target::\*getaddrinfo\*:entry,\\
pid$target::\*getaddrinfo\*:return { trace(timestamp - self->last); self->last = timestamp; }'\\
-p `pgrep -n firefox-bin`

So I needed to hack together a quick script to measure function elapsed times for  getaddrinfo, looked at syscalls it was making, timed them along with total cpu for the function. When I ran this it was clear that the function was not making an excessive number of syscalls and the time for them and the total cpu was only accounting for a fraction of the 10 sec delay. So I added a stack trace aggregation to the off-cpu probe to see why the function was going off cpu. This is what we got:

              libc.so.1`__lwp_park+0x19
libc.so.1`cond_wait_queue+0x3e
libc.so.1`cond_wait_common+0x1e1
libc.so.1`_cond_timedwait+0x4a
libc.so.1`cond_timedwait+0x27
libc.so.1`pthread_cond_timedwait+0x21
libnspr4.so`PR_WaitCondVar+0x24a
libxul.so`__1cOnsHostResolverKThreadFunc6Fpv_v_+0xb6
libnspr4.so`_pt_root+0xca
libc.so.1`_thr_setup+0x52
libc.so.1`_lwp_start

You can see from the stack that for some reason we are doing a pthread_cond_timedwait, so this would certainly explain why the delay. Now it's getting late and taking a quick look at the code reveals nothing obvious, so it's back to pester the mozilla developers and ask them why the thread doing the dns lookups seems to being put into a conditional wait. [Update] talking to biesi apparently there's a limit on the number of threads for DNS lookups, perhaps we are hitting this limit, but from the code we shouldn't, puzzling...

Well the good news at least is that the dns lookup probes are working :)

Another issue that came up on talking with Brendan Gregg and MaKroZ on irc.mozilla.org #developers was about how can you group all the resource load events and lookups for a specific page. Timestamps don't cut it as the loads of the various elements are asynchronous :( What we'd need to do is pass in a common page identifier shared by all the probes, including the loading URI, the image load probes and the dns lookup probes upon which we could aggregate. I've been able to concoct unique probe identifiers which are shared between the load-start and load-done for either URI or Image, but I haven't found an additional group identifier to allow me to group elements being loaded for a given page. MaKroZ is looking into it so hopefully he'll turn up something we can use.

Have to say the work certainly isn't dull, but ohh it would be nice to be in the same time zone for a change!! The 12pm midnight chats on irc are doing my beauty sleep no good at all :) Though without the help of Brendan, biesi and the other mozilla developers I'd be losing a lot more sleep !!





        
    
Comments:

Post a Comment:
Comments are closed for this entry.
About

jmr

Search

Top Tags
Categories
Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today