In this post, Alan Maguire shows how to use DTrace to inspect common network-related system calls. DTrace is a powerful tracing tool for understanding user-space and kernel behaviour.
We’ve been steadily adding DTrace support for networking in Oracle Linux. Let’s take a quick tour of the support that is available for dynamically tracing the network stack, and see how we can use these probes to better understand network behaviour.
Before diving into the details, though, it’s worth asking: why use DTrace at all? Don’t we already have enough tools in the networking space? The reasons I use it are:
- DTrace is open-ended while being production-safe. In other words, if I want to find something out, there’s usually a way of doing it, and, most importantly, I can do it without panicking a system. The hard part, of course, is turning those questions into queries that DTrace understands. I hope the next few blog posts will provide some useful hints on how to do so.
- DTrace allows me to reach across multiple subsystems, e.g. how effectively are interrupts being coalesced when receiving packets? Which locks are hot on the network receive path? In cloud environments especially, it’s critical to be able to connect the dots from network behaviour to other subsystems and abstractions. People love to blame “the network”. DTrace allows us to see where the costs are really incurred.
Socket layer: syscall provider – available since UEK2
As on other operating systems, the syscall provider is available, and we can trace network-related system calls – sendto, sendmsg, sendmmsg, recvfrom, recvmsg, and recvmmsg.
Here’s a simple example – trace bytes sent on a per-executable basis. Hit Ctrl+C when you’re done collecting info.
# dtrace -n 'syscall::sendto:entry { @c[execname] = sum(arg2); }'
dtrace: description 'syscall::sendto:entry ' matched 1 probe
^C
  DNS Resolver #6                                                65
  ping                                                          192
  Socket Thread                                                2006
  vpnagentd                                                    3872
This is what we call a DTrace one-liner. Let’s examine it piece-by-piece.
- The -n option says we are directly specifying our probes rather than using a DTrace script (if we wanted to do that, we would use “dtrace -s <scriptname>”).
- The probe is a 4-element description, with each element separated by a “:”. The first element is the provider – the DTrace subsystem which we are using. In this case it is “syscall”. The second is the module, which matters for some other DTrace providers but not here, so we can leave it blank (for the kernel providers on Linux it is always “vmlinux”). The third is the function – in the case of the syscall provider, that’s the system call name. And finally we have the probe name. For each system call there are two probes, entry and return, corresponding to system call request/completion.
- Next is the action, contained in the curly brackets. DTrace code takes the form “probe /predicate/ { action }”. In this case we have no predicate, but if we did, it would determine whether the action executes (we’ll see one in a sketch below). Next let’s describe the action.
- Here we are using an aggregation variable “c” – the aggregation is signified by the “@” symbol. Aggregations can specify multiple keys in between the square brackets, and the associated values for the various sets of keys are computed using one of the aggregating functions sum(), count(), avg(), etc. Aggregations are collected efficiently in-kernel. Here we are sum()ing arg2 – the third argument to sendto(), which, if you look at the manual page, you will see is the buffer size.
So to summarize, our one-liner says “show me the sum total of bytes sent via sendto() system calls, and collect the sums for each program name” (“execname” is a built-in variable that references the current process’s executable name).
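For example, adding a predicate lets us restrict the same action to a single program. Here’s a minimal sketch – the name “sshd” is just an arbitrary example, not anything from the output above:
# dtrace -n 'syscall::sendto:entry /execname == "sshd"/ { @c[execname] = sum(arg2); }'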
Now of course these are requests, and some may fail. Again, note that the probe we used was syscall::sendto:entry; there is also a corresponding syscall::sendto:return. A common pattern in DTrace is to measure how long it takes to get from :entry to :return, as in the sketch below.
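Here’s a minimal sketch of that pattern (sendto-latency.d is just a name I’ve chosen for the script; run it with “dtrace -s sendto-latency.d”). It stashes a thread-local timestamp on entry and, on return, quantizes the elapsed nanoseconds per executable:

syscall::sendto:entry
{
        /* remember when this thread entered sendto() */
        self->ts = timestamp;
}

syscall::sendto:return
/self->ts/
{
        /* bucket elapsed nanoseconds from entry to return, per program */
        @lat[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}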
Layer 4 (Transport layer): TCP, UDP providers – available since UEK4QU5
In Linux – as in Solaris – we have transport-level providers for TCP and UDP. These trace transmit/receive events, and for TCP we also trace TCP state machine transitions and connection establishment milestones.
# dtrace -l -P udp
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1836        udp           vmlinux                      udp_send_skb send
 1838        udp           vmlinux                       udp_recvmsg receive
 1840        udp           vmlinux                 udp_queue_rcv_skb receive
 1868        udp           vmlinux                   udp_v6_send_skb send
 1869        udp           vmlinux                     udpv6_recvmsg receive
 1871        udp           vmlinux               udpv6_queue_rcv_skb receive
# dtrace -l -P tcp
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1818        tcp           vmlinux                     tcp_set_state state-change
 1819        tcp           vmlinux                  tcp_conn_request state-change
 1820        tcp           vmlinux                tcp_finish_connect connect-established
 1821        tcp           vmlinux             tcp_rcv_state_process accept-established
 1822        tcp           vmlinux             tcp_rcv_state_process connect-refused
 1823        tcp           vmlinux                  tcp_transmit_skb send
 1824        tcp           vmlinux                  tcp_transmit_skb connect-request
 1825        tcp           vmlinux                       tcp_connect state-change
 1826        tcp           vmlinux                   tcp_v4_send_ack send
 1827        tcp           vmlinux                 tcp_v4_send_reset send
 1828        tcp           vmlinux                 tcp_v4_send_reset accept-refused
 1829        tcp           vmlinux                tcp_v4_send_synack send
 1830        tcp           vmlinux                        tcp_v4_rcv receive
 1831        tcp           vmlinux                     tcp_time_wait state-change
 1882        tcp           vmlinux              tcp_v6_send_response send
 1883        tcp           vmlinux              tcp_v6_send_response accept-refused
 1884        tcp           vmlinux                tcp_v6_send_synack send
 1885        tcp           vmlinux                        tcp_v6_rcv receive
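As a quick sketch of the state-machine tracing, we can count TCP state transitions by the state being entered. This assumes the tcps_state field of the tcpsinfo_t in args[3] reflects the new state, as in the Solaris-compatible translator (the states print as the translator’s numeric state values):
# dtrace -n 'tcp:::state-change { @c[args[3]->tcps_state] = count(); }'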
Here’s a quick example – what systems (remote IP) are we refusing connections to, and which ports are they trying to connect to?
# dtrace -n 'tcp:::accept-refused { @c[args[2]->ip_daddr, args[4]->tcp_sport] = count(); }'
dtrace: description 'tcp:::accept-refused ' matched 2 probes
^C
127.0.0.1 7 1
I telnetted to port 7 locally in another window to generate this; port 7 was closed, so we see one connection being refused.

The TCP, UDP and IP providers are all statically defined tracing (SDT) providers. The idea behind these is that rather than trying to write probes that sit on ever-changing function entries or returns, gathering information from data structures that are also changing, we define a set of events of interest, and at each place in the code where one of these events occurs we place a static probe. So, for example, we can see that the tcp:::send probe is in tcp_v4_send_reset, tcp_v4_send_ack, etc. Importantly, these probes also define stable translators which take the OS-specific structures and translate them into stable arguments that DTrace consumers can rely on. To see the relevant definitions for Oracle Linux, you can look in /usr/lib64/dtrace/4.1/tcp.d, udp.d, and ip.d.
For the most part we have preserved compatibility with Solaris. If there’s something missing you really need, let us know! Here’s how the various arguments look. We use args[0] instead of arg0 to let DTrace know we’re trying to access the stable, translated argument. The raw arguments are still available via arg0, arg1 etc.
args[0] - pktinfo_t * [packet information]
args[1] - csinfo_t * [connection information]
args[2] - ipinfo_t * [IP protocol information]
args[3] - tcpsinfo_t * for TCP, udpsinfo_t * for UDP [TCP/UDP state information]
args[4] - tcpinfo_t * for TCP, udpinfo_t * for UDP [TCP/UDP header information]
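As a quick sketch using these translated arguments, here’s a one-liner that totals inbound TCP payload bytes per remote address. It assumes the ip_saddr and ip_plength (payload length) fields of ipinfo_t, matching the Solaris translator definitions – check /usr/lib64/dtrace/4.1/ip.d if yours differ:
# dtrace -n 'tcp:::receive { @bytes[args[2]->ip_saddr] = sum(args[2]->ip_plength); }'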
Layer 3 (IP provider) – available since UEK4QU4
The IP provider traces send, receive, and inbound/outbound drops.
# dtrace -l -P ip
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1839         ip           vmlinux                     ip_rcv_finish drop-in
 1840         ip           vmlinux                  ip_local_deliver receive
 1841         ip           vmlinux                            ip_rcv drop-in
 1842         ip           vmlinux                  __ip_append_data drop-out
 1843         ip           vmlinux                 __ip_local_out_sk send
 1844         ip           vmlinux                    ip_append_page drop-out
 1845         ip           vmlinux                       ip_send_skb drop-out
 1861         ip           vmlinux                       raw_sendmsg send
 1878         ip           vmlinux        __ip6_flush_pending_frames drop-out
 1879         ip           vmlinux                          ip6_xmit send
 1880         ip           vmlinux                          ip6_xmit drop-out
 1881         ip           vmlinux                ip6_finish_output2 drop-out
 1882         ip           vmlinux                 __ip6_append_data drop-out
 1883         ip           vmlinux                       ip6_forward drop-out
 1884         ip           vmlinux                        ip6_output drop-out
 1885         ip           vmlinux                      ip6_send_skb drop-out
 1886         ip           vmlinux                  ip6_input_finish drop-in
 1887         ip           vmlinux                          ipv6_rcv drop-in
 1888         ip           vmlinux                         ip6_input receive
 1895         ip           vmlinux                    ndisc_send_skb send
 1900         ip           vmlinux                     rawv6_sendmsg send
 1903         ip           vmlinux                      mld_sendpack send
 1904         ip           vmlinux                      mld_sendpack drop-out
 1905         ip           vmlinux                        igmp6_send send
 1906         ip           vmlinux                        igmp6_send drop-out
 1923         ip           vmlinux                __ip6_local_out_sk send
For IP, the arguments are:
args[0] - pktinfo_t * [packet information]
args[1] - csinfo_t * [connection information]
args[2] - ipinfo_t * [IP protocol information]
args[3] - ifinfo_t * [IP interface information]
args[4] - ipv4info_t * [IPv4 header info if IPv4 packet]
args[5] - ipv6info_t * [IPv6 header info if IPv6 packet]
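For example, here’s a sketch that counts inbound drops per network interface, assuming the if_name field of ifinfo_t as defined in the Solaris-compatible translator:
# dtrace -n 'ip:::drop-in { @drops[args[3]->if_name] = count(); }'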
Other
DTrace on Oracle Linux also includes a perf provider, which traces perf events. And of course function boundary tracing (fbt) is available too. Here’s an example of using function boundary tracing to get counts of socket buffer (struct sk_buff) allocation sizes. sk_buff is the key structure for network data, and here we’re examining the counts of arg0 – the requested allocation size passed to __alloc_skb() – in each power-of-two bucket 0-1, 1-2, 2-4, etc.; this is the quantize() aggregating action. We see, for example, that the largest share of allocations fell in the 2048-4096 range.
# dtrace -n 'fbt::__alloc_skb:entry { @size = quantize(arg0); }'
dtrace: description 'fbt::__alloc_skb:entry ' matched 1 probe
^C
value ------------- Distribution ------------- count
-1 | 0
0 | 80
1 | 3
2 | 0
4 | 137
8 | 988
16 |@@ 5719
32 |@@@@@ 10977
64 | 779
128 |@@@ 6805
256 |@@@@@@@@@@@ 25462
512 | 508
1024 |@@ 5108
2048 |@@@@@@@@@@@@@@@@ 39011
4096 | 8
8192 | 0
16384 | 16
32768 | 0
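A natural follow-up – a sketch rather than output captured from this system – is to see which kernel code paths are doing the allocating, by aggregating on the kernel stack trace:
# dtrace -n 'fbt::__alloc_skb:entry { @c[stack()] = count(); }'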
We’ll describe some more examples using fbt and perf providers in a future blog post!