We recently got a request to find more information about why packet drops were happening on a system. Happily with the latest BPF-based DTrace release along with some neat packet drop-related observability improvements in the Linux kernel, this is a lot easier now. Let’s take a look!
From kfree_skb() to kfree_skb_reason()
The socket buffer sk_buff is the key networking-related data structure holding metadata about packets. When an sk_buff is freed due to an error condition, kfree_skb() is called.
In Linux 5.15, kfree_skb_reason() was introduced; it has a second parameter termed a “drop reason”. Drop reasons are now available for many different situations; for example, in include/net/dropreason-core.h we see:
/**
* enum skb_drop_reason - the reasons of skb drops
*
* The reason of skb drop, which is used in kfree_skb_reason().
*/
enum skb_drop_reason {
/**
* @SKB_NOT_DROPPED_YET: skb is not dropped yet (used for no-drop case)
*/
SKB_NOT_DROPPED_YET = 0,
/** @SKB_CONSUMED: packet has been consumed */
SKB_CONSUMED,
/** @SKB_DROP_REASON_NOT_SPECIFIED: drop reason is not specified */
SKB_DROP_REASON_NOT_SPECIFIED,
/**
* @SKB_DROP_REASON_NO_SOCKET: no valid socket that can be used.
* Reason could be one of three cases:
* 1) no established/listening socket found during lookup process
* 2) no valid request socket during 3WHS process
* 3) no valid child socket during 3WHS process
*/
SKB_DROP_REASON_NO_SOCKET,
/** @SKB_DROP_REASON_PKT_TOO_SMALL: packet size is too small */
SKB_DROP_REASON_PKT_TOO_SMALL,
/** @SKB_DROP_REASON_TCP_CSUM: TCP checksum error */
SKB_DROP_REASON_TCP_CSUM,
/** @SKB_DROP_REASON_SOCKET_FILTER: dropped by socket filter */
SKB_DROP_REASON_SOCKET_FILTER,
...
So here we will make use of drop reasons to provide more info about packet drops.
Note that for Oracle Linux, drop reasons are available on UEK7 and later – a 5.15-based release. However, new reasons have been added over time, so not all reasons are available on that release.
print() Support in DTrace
The 2.0.0-1.14 release of DTrace gained some interesting features to help us analyze dropped packets.
DTrace can print typed data associated with a pointer via the print() action; for example
print((struct sk_buff *)arg0);
will print typed sk_buff metadata for the sk_buff * represented by arg0. This is useful as it allows us to see the metadata associated with our dropped packets.
We can also use the same print() action – albeit indirectly – to print the name associated with the drop reason enumerated value. This is really useful as drop reasons change from kernel to kernel, with later kernels containing more drop reasons. So having a way to display drop reasons based on the Compact Type Format (CTF) associated with that kernel is really useful.
Note also that since DTrace 2.0.0-1.14, distributions that do not include CTF will generate CTF automatically from the BPF Type Format (BTF) information built into the kernel and modules. BTF is included in most distributions these days, so either direct CTF availability or CTF via BTF should work.
As mentioned above, since print() expects a pointer, and we have a numeric drop reason, we will need to provide a pointer value. This can be done via the alloca() action as follows:
self->reason = (enum skb_drop_reason *)alloca(sizeof(enum skb_drop_reason));
*(self->reason) = arg1;
print(self->reason);
A bit of a pain, but much better than hardcoding drop reasons which change in every kernel!
pcap() Support in DTrace
Recently DTrace also gained the ability to do packet capture. This is done via the pcap() action. The arguments associated with pcap() are
- as you might expect – the
sk_buff *(skb) associated with the packet; and - the starting protocol of the packet
The second argument requires a bit more explanation.
When an skb is being constructed for sending, or being parsed during receive, skb->data points at the packet data. We can push additional data during send or pull data during receive when we are finished with a protocol layer. But for packet capture purposes, to interpret the packet data at skb->data we need to know which protocol it is. So the current set of parameters supported are
PCAP_ETHER(ethernet)PCAP_WIFI(wifi)PCAP_IP(IPv4 or IPv6)PCAP_IPOIB(IP over InfiniBand)
Knowing the starting protocol is important for correct capture/display. If we say PCAP_ETHER, we are saying that skb->data points at an ethernet header, while saying PCAP_IP means that either the ethernet header hasn’t been added in the send codepath, or on the receive codepath we’re finished processing layer 2 information and are dealing with IPv4 or IPv6 now.
So to capture IP packet data from the skb pointer in arg0, we have
pcap((struct sk_buff *)arg0, PCAP_IP);
Putting it all Together: Tracing IP Drops on Receive
So here we will demonstrate using a DTrace script to show
- drop reasons via
kfree_skb_reason()s second parameter which specifies the reason for the drop; sk_buffmetadata associated with the packet via theprint()action;- a call stack at the drop point; and
- the dropped packet data using
pcap()
for dropped inbound IP traffic on a specific interface. With the above, we get much greater insight into what is being dropped and why.
We will need to use the skb->dev pointer to get the device name and compare it against our command-line parameter specifying the device.
The script is as follows:
#!/usr/sbin/dtrace -s
fbt::ip_local_deliver:entry
{
self->rcv = 1;
}
fbt::kfree_skb_reason:entry
/self->rcv && arg1>1 &&
(self->skb = (struct sk_buff *)arg0) != NULL &&
self->skb->dev != NULL && stringof(self->skb->dev->name) == stringof($$1)/
{
self->reason = (enum skb_drop_reason *)alloca(sizeof(enum skb_drop_reason));
*(self->reason) = arg1;
printf("dropping packet due to [%d] ", arg1);
print(self->reason);
printf("\t\t\t\t\tskb: ");
print(self->skb);
stack();
pcap(self->skb, PCAP_IP);
}
fbt::ip_local_deliver:return
{
self->rcv = 0;
}
The script ensures we are dealing with an ingress IP packet by using self->rcv as a predicate for kfree_skb_reason(). self->rcv is a per-thread variable set on entering ip_local_deliver() and unset when returning from it. We also make sure the device (from skb->dev) matches the one we want to trace and passed via command-line parameter $1.
Here’s an example of running this:
$ dtrace -s kfree_skb_reason.d wlo1
dtrace: script 'kfree_skb_reason.d' matched 3 probes
CPU ID FUNCTION:NAME
1 73361 kfree_skb_reason:entry dropping packet due to [8] 0xffffced77fa77ea0 = *
(enum skb_drop_reason)SKB_DROP_REASON_NETFILTER_DROP,
skb: 0xffff9f8ae29a0600 = *
(struct sk_buff) {
(struct) {
(struct) {
.prev = (struct sk_buff *)0xffffaed78002ecb8,
(struct) {
.dev = (struct net_device *)0xffff9f8988424000,
.dev_scratch = (long unsigned int)18446638011778220032,
},
},
.rbnode = (struct rb_node) {
.rb_right = (struct rb_node *)0xffffaed78002ecb8,
.rb_left = (struct rb_node *)0xffff9f8988424000,
},
.list = (struct list_head) {
.prev = (struct list_head *)0xffffaed78002ecb8,
},
},
(struct) {
.tstamp = (ktime_t)1712922420187684987,
.skb_mstamp_ns = (u64)1712922420187684987,
},
.cb = (char [48]) [
(char)'',
],
(struct) {
(struct) {
._skb_refdst = (long unsigned int)18446638011745515969,
},
.tcp_tsorted_anchor = (struct list_head) {
.next = (struct list_head *)0xffff9f89864f39c1,
},
._sk_redir = (long unsigned int)18446638011745515969,
},
._nfct = (long unsigned int)18446638018900752898,
.len = (unsigned int)103,
.mac_len = (__u16)14,
(struct) {
(struct) {
.pkt_type = (unsigned char)4,
.ip_summed = (unsigned char)2,
.nf_skip_egress = (unsigned char)1,
.alloc_cpu = (u16)1,
.skb_iif = (int)2,
(struct) {
.napi_id = (unsigned int)8197,
.sender_cpu = (unsigned int)8197,
},
.protocol = (__be16)8,
.transport_header = (__u16)60,
.network_header = (__u16)40,
.mac_header = (__u16)26,
},
.headers = (struct) {
.pkt_type = (unsigned char)4,
.ip_summed = (unsigned char)2,
.nf_skip_egress = (unsigned char)1,
.alloc_cpu = (u16)1,
.skb_iif = (int)2,
(struct) {
.napi_id = (unsigned int)8197,
.sender_cpu = (unsigned int)8197,
},
.protocol = (__be16)8,
.transport_header = (__u16)60,
.network_header = (__u16)40,
.mac_header = (__u16)26,
},
},
.tail = (sk_buff_data_t)143,
.end = (sk_buff_data_t)320,
.head = (unsigned char *)0xffff9f8a45785b80,
.data = (unsigned char *)0xffff9f8a45785ba8,
.truesize = (unsigned int)896,
.users = (refcount_t) {
.refs = (atomic_t) {
.counter = (int)1,
},
},
.extensions = (struct skb_ext *)0x74010074666f,
}
vmlinux`kfree_skb_reason+0x5
vmlinux`ip_local_deliver+0xe4
vmlinux`ip_sublist_rcv_finish+0x85
vmlinux`ip_sublist_rcv+0x18d
vmlinux`ip_list_rcv+0x145
vmlinux`__netif_receive_skb_list_core+0x290
vmlinux`netif_receive_skb_list_internal+0x1d1
vmlinux`napi_complete_done+0x78
sunrpc`cache_initialize+0x94c7b
vmlinux`__napi_poll+0x2e
vmlinux`net_rx_action+0x294
vmlinux`__do_softirq+0x10b
vmlinux`do_softirq.part.0+0x7a
vmlinux`__local_bh_enable_ip+0x8b
sunrpc`cache_initialize+0x95610
vmlinux`irq_thread_fn+0x25
vmlinux`irq_thread+0xec
vmlinux`kthread+0x102
vmlinux`ret_from_fork+0x3e
vmlinux`ret_from_fork_asm+0x1b
pcap 1 0.000000 192.168.1.254 → 192.168.1.255 UDP 103 9431 → 9431 Len=75
^C
What’s nice about this is we’ve combined observability about the OS state representation (the sk_buff, the drop reason, the call stack) with the “on-the-wire” view of the packet.
In this case we can see that netfilter rejected the broadcast UDP packet from 192.168.1.254 -> 192.168.1.255.
Digging Further
What if the summary packet details are not enough? We can create a packet capture file instead! To do so we will add a second parameter – the capture file name. To capture to the file we call
freopen(filename);
prior to calling pcap() to redirect output to the capture file, and
freopen("");
after, to restore output to the console.
Here is the updated script; we have two parameters now, device name and capture file.
#!/usr/sbin/dtrace -ws
BEGIN
{
packet_id = 1;
}
fbt::ip_local_deliver:entry
{
self->rcv = 1;
}
fbt::kfree_skb_reason:entry
/self->rcv && arg1>1 &&
(self->skb = (struct sk_buff *)arg0) != NULL &&
self->skb->dev != NULL && stringof(self->skb->dev->name) == stringof($$1)/
{
self->reason = (enum skb_drop_reason *)alloca(sizeof(enum skb_drop_reason));
*(self->reason) = arg1;
printf("dropping packet #%d due to [%d] ", ++packet_id, arg1);
print(self->reason);
printf("\t\t\t\t\tskb: ");
print(self->skb);
stack();
freopen($2);
pcap(self->skb, PCAP_IP);
freopen("");
}
fbt::ip_local_deliver:return
{
self->rcv = 0;
}
To run the script we must enable destructive actions (-w) since saving data to a file is potentially destructive:
$ dtrace -ws kfree_skb_reason_file.d wlo1 /tmp/mycap
dtrace: script 'kfree_skb_reason_file.d' matched 3 probes
dtrace: allowing destructive actions
CPU ID FUNCTION:NAME
7 73361 kfree_skb_reason:entry dropping packet #1 due to [8] 0xffffced77fbf9318 = *
(enum skb_drop_reason)SKB_DROP_REASON_NETFILTER_DROP,
skb: 0xffff9f8c37954e00 = *
(struct sk_buff) {
(struct) {
(struct) {
.prev = (struct sk_buff *)0xffffaed7802b4cb8,
(struct) {
.dev = (struct net_device *)0xffff9f8988424000,
.dev_scratch = (long unsigned int)18446638011778220032,
},
},
.rbnode = (struct rb_node) {
.rb_right = (struct rb_node *)0xffffaed7802b4cb8,
.rb_left = (struct rb_node *)0xffff9f8988424000,
},
.list = (struct list_head) {
.prev = (struct list_head *)0xffffaed7802b4cb8,
},
},
(struct) {
.tstamp = (ktime_t)1712926988362589831,
.skb_mstamp_ns = (u64)1712926988362589831,
},
.cb = (char [48]) [
(char)'',
],
(struct) {
(struct) {
._skb_refdst = (long unsigned int)18446638024891807425,
},
.tcp_tsorted_anchor = (struct list_head) {
.next = (struct list_head *)0xffff9f8c95e3b6c1,
},
._sk_redir = (long unsigned int)18446638024891807425,
},
._nfct = (long unsigned int)18446638016824631298,
.len = (unsigned int)103,
.mac_len = (__u16)14,
(struct) {
(struct) {
.pkt_type = (unsigned char)4,
.ip_summed = (unsigned char)2,
.nf_skip_egress = (unsigned char)1,
.decrypted = (unsigned char)1,
.slow_gro = (unsigned char)1,
.alloc_cpu = (u16)7,
.skb_iif = (int)2,
(struct) {
.napi_id = (unsigned int)8197,
.sender_cpu = (unsigned int)8197,
},
.protocol = (__be16)8,
.transport_header = (__u16)60,
.network_header = (__u16)40,
.mac_header = (__u16)26,
},
.headers = (struct) {
.pkt_type = (unsigned char)4,
.ip_summed = (unsigned char)2,
.nf_skip_egress = (unsigned char)1,
.decrypted = (unsigned char)1,
.slow_gro = (unsigned char)1,
.alloc_cpu = (u16)7,
.skb_iif = (int)2,
(struct) {
.napi_id = (unsigned int)8197,
.sender_cpu = (unsigned int)8197,
},
.protocol = (__be16)8,
.transport_header = (__u16)60,
.network_header = (__u16)40,
.mac_header = (__u16)26,
},
},
.tail = (sk_buff_data_t)143,
.end = (sk_buff_data_t)320,
.head = (unsigned char *)0xffff9f8cb080c280,
.data = (unsigned char *)0xffff9f8cb080c2a8,
.truesize = (unsigned int)896,
.users = (refcount_t) {
.refs = (atomic_t) {
.counter = (int)1,
},
},
}
vmlinux`kfree_skb_reason+0x5
vmlinux`ip_local_deliver+0xe4
vmlinux`ip_sublist_rcv_finish+0x85
vmlinux`ip_sublist_rcv+0x18d
vmlinux`ip_list_rcv+0x145
vmlinux`__netif_receive_skb_list_core+0x290
vmlinux`netif_receive_skb_list_internal+0x1d1
vmlinux`napi_complete_done+0x78
sunrpc`cache_initialize+0x94c7b
vmlinux`__napi_poll+0x2e
vmlinux`net_rx_action+0x294
vmlinux`__do_softirq+0x10b
vmlinux`do_softirq.part.0+0x7a
vmlinux`__local_bh_enable_ip+0x8b
sunrpc`cache_initialize+0x95610
vmlinux`irq_thread_fn+0x25
vmlinux`irq_thread+0xec
vmlinux`kthread+0x102
vmlinux`ret_from_fork+0x3e
vmlinux`ret_from_fork_asm+0x1b
^C
So output is same as before, but when we hit Ctrl^C we have a packet capture file too:
$ tshark -r /tmp/mycap
1 0.000000 192.168.1.254 → 192.168.1.255 UDP 103 9431 → 9431 Len=75
So we can dig into packet details with verbose mode etc.
How Does This Work?
The pcap() action records the skb->data associated with the packet along with the starting protocol and length, and when this data is consumed in userspace libpcap is used to either dump to a file or we pipe the data to tshark (the commandline version of wireshark). The default capture size can be adjusted via the pcapsize option (-x pcapsize=new_size).
Summary
Here we’ve described how additional observability-friendly features in Linux can be used by DTrace to determine causes of packet drops using the pcap() and print() actions.