DTrace TCP, UDP providers

We recently integrated DTrace TCP and UDP providers into Solaris/OpenSolaris - they will appear in build 142. I should mention at the outset that Brendan did the lion's share of this work - I just had to merge his changes with the latest Solaris Nevada builds and do the paperwork mostly.

The first thing to note is that there is a great advantage in tracing network events above IP in the networking stack - at that point both the inbound and outbound data has been mapped to a connection, which in turn is tied to a processs, a zone etc. This allows us to represent that information (along with an identifier that uniquely represents that connection for its lifetime) at the various probe points via a csinfo_t (Connection State INFO Type) probe argument:

typedef struct csinfo {
        uintptr_t cs_addr;
	uint64_t cs_cid;     /\* connection ID \*/
	pid_t cs_pid;        /\* associated PID \*/
	zoneid_t cs_zoneid;  /\* associated zoneid \*/
} csinfo_t;

There is great utility in being able to tie TCP and UDP events to a particular process, as illustrated by the tcptop.d script in /usr/demo/dtrace, which is essentially Brendan's DTraceToolkit tcptop using the stable tcp provider. Here's some sample output - we see the top processes as measured by TCP payload bytes (when 0s appear, they signify connections sending control segments like ACKs which have no payload).

# dtrace -s /usr/demo/dtrace/tcptop.d
Sampling... Please wait.
2010 Jun  4 07:56:56,  load: 0.36,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 102238       49668        9010         0
     0 102238 ::1             50795 ::1              9010         0

2010 Jun  4 07:57:01,  load: 0.35,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 102238       41433        9010         0
     0 102238 ::1             36782 ::1              9010         0
     0 132388       53480        9010         0
     0 132388       62234        9010         0
     0 132388 ::1             50266 ::1              9010         0
     0 132388 ::1             63145 ::1              9010         0

2010 Jun  4 07:57:06,  load: 0.33,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 132388       47516        9010         0
     0 132388       48726        9010         0
     0 132388 ::1             57467 ::1              9010         0
     0 132388 ::1             62543 ::1              9010         0

2010 Jun  4 07:57:11,  load: 0.32,  TCPin:      0 Kb,  TCPout:      0 Kb

  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0 132388 56142    993       247


These events are collected via the tcp:::send and tcp:::receive probe firing, and since we are interested in valid pids, we exclude TCP segments which have no associated process (such as RST|ACK segments generated in response to inbound connection requests for ports where no TCP service is listening). Here we collect the details in the @out aggregation:

/ args[1]->cs_pid != -1 /
 	@out[args[1]->cs_zoneid, args[1]->cs_pid, args[2]->ip_saddr,
            args[4]->tcp_sport, args[2]->ip_daddr, args[4]->tcp_dport] =
            sum(args[2]->ip_plength - args[4]->tcp_offset);

As a quick way to see which processes/zones are busy, this is pretty useful. To drill down to individual segments, tcpsnoop can be used:

# dtrace -s /usr/demo/dtrace/tcpsnoop.d
  TIME          PID           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
735575877102 102238  ->       0 (SYN)
735575877139     -1   <-      0 (SYN)
735575877168     -1   ->      0 (RST|ACK)
735575877230 102238  <-       0 (RST|ACK)
735581534280 132408  ->       712 (PUSH|ACK)
735581725757 132408  <-         0 (ACK)
735581735388 132408  <-       356 (PUSH|ACK)
735581850134 132408  ->         0 (ACK)

As mentioned above, note how RST|ACKs sent specify a PID of -1 (NOPID) - this indicates that they are generated by the kernel protocol stack and have no associated process. Notice also that they are sent and received locally, so they appear twice, as does all localhost traffic.

Similar udptop.d and udpsnoop.d demo scripts are also available in /usr/demo/dtrace.

From the above, I learned a general lesson about DTrace providers - while it is clearly of great benefit to add a stable provider to allow probing into a subsystem, it is of even greater benefit if that provider can tie the events it probes to other system abstractions (in this case, we link network protocol processing to user processes and system zones).

The other thing to emphasize is that the above DTrace probes are not intended as replacements for snoop or wireshark - those tools do packet sniffing on the wire (often in promiscuous mode), whereas the above scripts monitor traffic that reaches IP and is fanned out to TCP or UDP connections on the inbound path, or has not yet reached IP (on the outbound path). So, in other words, a snoop trace could look quite different. The primary goal of this work (at least to me) is not to replace those packet-sniffing tools, but rather to augment the kinds of analysis we can do - in particular we can view TCP and UDP processing events in a wider system context (Which process did this traffic originate from? Which zone is receiving the most TCP traffic? Which ports are busy?).

In the next entry, I'm going to talk about how we can carry out TCP latency measurements using the tcp provider.


Shall Not...

Posted by KATE on June 20, 2010 at 11:55 PM PDT #

Hi Alan,

thanks for this blog entry. We were looking for this kind of stuff about 2 years ago as we needed to trace the processing of multi-cast UDP messages.
At this time some complicated traversing of kernel structures was required to get some tiny bits of information.

I would greatly appreciate, if you also could write a blog entry specifically related to multi-cast messages.

Many thanks, Jörg

Posted by Jörg Thönnes on June 21, 2010 at 04:19 AM PDT #

Hi Jorg, thanks for the feedback. IPv4 UDP multicast send/receive is pretty straightforward - we just use a predicate which narrows down probing to events that specify an address in the appropriate IP address range for multicast. Since the IP source and destination addresses are available to us via the ipinfo structure in args[2] to the TCP and UDP probe points, and since we can utilize the fact that DTrace supports relational operators on strings, we can do something as simple as this to narrow down to udp multicast send events where the destination is a multicast address in the range:

/ args[2]->ip_daddr >= "224." && args[2]->ip_daddr < "240." /

The simple ASCII comparison that happens to work here would break down for the
more complex IPv6 case of course - it just happens to work for IPv4 multicast addresses because "." is less than "0" in ASCII, so for example "22.0" is < "224" when evaluated as an ASCII relational comparison. Hope this helps, Alan.

Posted by Alan Maguire on June 22, 2010 at 12:03 AM PDT #

Post a Comment:
Comments are closed for this entry.



« April 2014