Introduction
As Ksplice engineers, we often have to look at completely different sub-systems of the Linux kernel to patch them, either to fix a vulnerability or to add trip wires. As a result, we gain a lot of knowledge in various areas. In this blog post, I’ll share my experience with the functionality and high-level architecture of the netfilter subsystem and its key kernel objects, such as tables, sets, chains, rules, and expressions, based on insights gained while working on a Known Exploit Detection update.
Netfilter provides the in-kernel hook points and infrastructure for packet processing. Using nftables on top of Netfilter, we organize packet-filtering logic into tables, chains, and rules, and use sets to store elements such as IP addresses. A base chain binds to a hook like INPUT or OUTPUT, and rules attached to that chain can accept or drop packets, update sets, or jump to other chains.
nftables is a powerful and flexible framework in the Linux kernel for packet filtering, network address translation (NAT), and other packet mangling operations. It introduces a unified and consistent syntax for defining firewall rules, replacing the older iptables framework. In this post, we’ll explore key nftables components – tables, sets, set elements, objects, chains, and rules. I’ll also provide user-space examples along with a high-level mapping to the corresponding kernel objects.
Netfilter hooks and chain priority
Netfilter exposes well-defined hook points along the packet path: PREROUTING (packets just received, before routing decision), INPUT (packets destined for the local host), FORWARD (packets being routed through the host), OUTPUT (locally generated packets before routing), and POSTROUTING (packets leaving the host, after routing decision). In nftables, a base chain attaches to one of these hooks and specifies a priority, which controls ordering relative to other chains bound to the same hook; lower numeric values run earlier, higher values later (e.g., a chain with priority -100 runs before priority 0). The chain’s type (filter, nat, route) must be compatible with the chosen hook.
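The ordering rule above (lower priority value runs first) can be sketched in plain C. This is a toy model only: the names `demo_chain` and `sort_by_priority` are hypothetical and are not kernel or libnftnl APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a base chain registered on one hook. */
struct demo_chain {
    const char *name;
    int priority;   /* signed: negative values are valid and run earlier */
};

/* qsort comparator: ascending priority, so lower values come first. */
static int cmp_priority(const void *a, const void *b)
{
    const struct demo_chain *ca = a;
    const struct demo_chain *cb = b;

    /* Compare without subtracting to avoid signed overflow. */
    return (ca->priority > cb->priority) - (ca->priority < cb->priority);
}

/* Sort the chains of a single hook into the order they would be run. */
void sort_by_priority(struct demo_chain *chains, size_t n)
{
    assert(chains != NULL || n == 0);
    if (n > 1)
        qsort(chains, n, sizeof(*chains), cmp_priority);
}
```

Sorting chains with priorities 0, -150, -100, and 100 with `sort_by_priority()` yields the order -150, -100, 0, 100, which is the order in which they would see a packet on that hook.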
Filter, NAT, and route chain types
In nftables, a base chain’s type defines its purpose and allowable expressions. Filter chains implement classic firewall policy: they match packet and connection attributes and issue verdicts (accept, drop, jump, return). NAT chains perform address/port translation and are typically attached to PREROUTING (DNAT) and POSTROUTING (SNAT) to rewrite destination or source fields as packets enter or leave the host. Route chains attach to the OUTPUT hook and cause the kernel to redo the route lookup when a rule changes routing-relevant packet fields or the packet mark. Choosing the correct type ensures that the expressions the chain needs are available and that the chain can legally attach to the intended hook.
At a high level, the allowed hook/type combinations are:
- filter chains: Can be attached to the usual packet-processing hooks such as prerouting, input, forward, output, and postrouting
- nat chains: Used on translation hooks such as prerouting, input, output, and postrouting
- route chains: Used in the output path and attached to the output hook
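The combinations listed above can be restated as a small lookup table. The sketch below is illustrative only: `chain_type_allows_hook` is a hypothetical helper, and it covers just the type/hook pairs named in this post, not every family-specific hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table restating the type/hook combinations listed above. */
struct type_hooks {
    const char *type;
    const char *hooks[6];   /* NULL-terminated list of allowed hooks */
};

static const struct type_hooks allowed[] = {
    { "filter", { "prerouting", "input", "forward", "output", "postrouting", NULL } },
    { "nat",    { "prerouting", "input", "output", "postrouting", NULL } },
    { "route",  { "output", NULL } },
};

/* Return true if a base chain of this type may attach to this hook. */
bool chain_type_allows_hook(const char *type, const char *hook)
{
    assert(type != NULL && hook != NULL);
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (strcmp(allowed[i].type, type) != 0)
            continue;
        for (size_t j = 0; allowed[i].hooks[j] != NULL; j++) {
            if (strcmp(allowed[i].hooks[j], hook) == 0)
                return true;
        }
        return false;   /* known type, hook not permitted */
    }
    return false;       /* unknown chain type */
}
```

For example, `chain_type_allows_hook("nat", "forward")` is false, while `chain_type_allows_hook("route", "output")` is true.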
Throughout this blog post, the nft user-space utility will be used to execute various nftables commands.
In the nft CLI, many commands are positional. For example, in:
nft add rule ip mytable input tcp dport 22 accept
- ip is the address family
- mytable is the table name
- input is the chain name
- tcp dport 22 is the match expression
- accept is the verdict
Key Components of nftables
1. Tables
A table is a container for chains, sets, and other objects. It defines a namespace and a family for the rules it contains.
Common families are:
- ip: IPv4 only
- ip6: IPv6 only
- inet: Mixed IPv4/IPv6 rules
- arp: ARP traffic
- bridge: bridge-layer filtering
- netdev: filtering on a network device
To create a table named mytable for IPv4:
nft add table ip mytable
Here, ip is the family and mytable is the table name.
2. Chains
A chain is a sequence of rules applied to packets.
There are two important kinds of chains in nftables:
- Base chains: Chains attached directly to a Netfilter hook such as input, output, or forward. A base chain receives packets directly from the packet-processing path and therefore specifies a hook, a priority, and a chain type.
- Regular chains: Chains that are not attached directly to a Netfilter hook. Packets reach them only when another rule explicitly sends control to them, for example with jump or goto.
When used as base chains, chains can have different types:
- filter: Used for general packet filtering decisions such as accept, drop, jump, and return
- nat: Used for network address translation, such as DNAT and SNAT
- route: Used for rerouting-related decisions in the output path
To create an input chain in the mytable table:
nft add chain ip mytable input { type filter hook input priority 0 \; }
This is a base chain because it is attached to the input hook and specifies both a chain type and a priority.
In this command:
- ip: The address family, in this case IPv4
- mytable: The table name
- input: The chain name
- type filter: The chain type, meaning that this chain filters packets by deciding whether to accept or drop them
- hook input: The Netfilter input hook to which this chain is attached, where packets destined for the local system are processed
- priority 0: The execution order, which sets the chain’s priority relative to others on the same hook
3. Rules
A rule specifies the conditions for matching packets and the actions to take when a match occurs. Within a rule, expressions are evaluated in sequence, left to right. If any expression evaluates to false, the remaining expressions in that rule are skipped and evaluation continues with the next rule. Rules in a chain are processed in order; once a rule returns an “accept” verdict, no further rules in that base chain (or in chains reached from it via jump or goto) are evaluated, and the packet moves on to the next base chain registered on the hook, which may still drop it. Conversely, a “drop” verdict is final: the packet is discarded immediately, and all remaining expressions, rules, and chains are skipped.
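The evaluation order described above can be modeled in a few lines of C. This is a simplified sketch, not kernel code: `demo_packet`, `demo_rule`, and `eval_chain` are hypothetical names, and the `policy` argument is an assumption standing in for the verdict applied when no rule issues one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_verdict { DEMO_CONTINUE, DEMO_ACCEPT, DEMO_DROP };

/* Hypothetical, heavily simplified packet: only the fields the demo matches on. */
struct demo_packet {
    uint8_t  l4proto;   /* 6 = TCP, 17 = UDP */
    uint16_t dport;
};

typedef bool (*demo_match_fn)(const struct demo_packet *pkt);

/* A rule: match expressions evaluated in order, then a verdict. */
struct demo_rule {
    demo_match_fn matches[4];   /* NULL-terminated */
    enum demo_verdict verdict;
};

bool is_tcp(const struct demo_packet *pkt)   { return pkt->l4proto == 6; }
bool dport_22(const struct demo_packet *pkt) { return pkt->dport == 22; }

/* Evaluate one rule: stop at the first expression that does not match. */
static enum demo_verdict eval_rule(const struct demo_rule *r,
                                   const struct demo_packet *pkt)
{
    for (size_t i = 0; i < 4 && r->matches[i] != NULL; i++) {
        if (!r->matches[i](pkt))
            return DEMO_CONTINUE;   /* rule does not apply, try the next rule */
    }
    return r->verdict;
}

/* Walk the rules in order; the first accept or drop ends the traversal. */
enum demo_verdict eval_chain(const struct demo_rule *rules, size_t n,
                             const struct demo_packet *pkt,
                             enum demo_verdict policy)
{
    assert(pkt != NULL && (rules != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        enum demo_verdict v = eval_rule(&rules[i], pkt);
        if (v != DEMO_CONTINUE)
            return v;
    }
    return policy;   /* no rule issued a verdict */
}
```

With a first rule of `{ is_tcp, dport_22 }` that accepts and a second rule with no matches that drops, a TCP packet to port 22 is accepted by the first rule, while a UDP packet to port 53 falls through to the second rule and is dropped.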
To add a rule that accepts incoming SSH traffic:
nft add rule ip mytable input tcp dport 22 accept
This matches TCP packets whose destination port is 22 and returns the accept verdict. Other common matches use the same style, for example:
nft add rule ip mytable input udp dport 53 accept
nft add rule ip mytable input ip saddr 192.168.1.10 accept
nft add rule ip mytable input icmp type echo-request accept
4. Sets
Sets are collections of elements (e.g., IP addresses, port numbers) that can be referenced in rules for efficient matching.
- Named sets: Persistent, reusable objects within a table that a user creates once and references from multiple rules. They are ideal for reusable or frequently updated lists, such as trusted IPs, blocked subnets, or service port groups, because their elements can be managed independently without changing the rules, and they remain until explicitly deleted. In this example, trusted_ips is the set name. nftables looks up the source IP address of each arriving packet in the named set trusted_ips. An empty named set matches nothing, so before any elements are added, this rule accepts no packets. Example (elements are added in the next “Set Elements” section):
nft add set ip mytable trusted_ips { type ipv4_addr \; flags interval \; }
nft add rule ip mytable input ip saddr @trusted_ips accept
- Anonymous sets: Inline, unnamed sets embedded directly in a rule. They are useful for small, rule-local lists that don’t require reuse or runtime updates, and they exist only for the lifetime of the rule, so changing them requires editing or replacing that rule. In the following example, the values inside { ... } are set elements separated by commas. This rule matches TCP destination port 22, 80, or 443. Example:
nft add rule ip mytable input tcp dport { 22, 80, 443 } accept
5. Set Elements
Set elements are the individual items contained within a set.
When written on the command line, set elements are usually placed inside curly braces and separated by commas.
- Adding elements to a named set:
nft add element ip mytable trusted_ips { 192.168.1.1, 192.168.1.2, 10.0.0.5 }
- Elements created implicitly within a rule: Anonymous (inline) set elements are defined as part of the rule itself, without a separately named set, and are removed when the rule is deleted. In this example, the ports 22, 80, and 443 are anonymous set elements attached to the rule; they are created and managed with the rule and cannot be updated independently:
nft add rule ip mytable input tcp dport { 22, 80, 443 } accept
- Updating or removing elements in a named set:
nft delete element ip mytable trusted_ips { 10.0.0.5 }
nft add element ip mytable trusted_ips { 10.0.0.8 }
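To make the named-set behaviour concrete, here is a minimal user-space model of a set whose elements are added, removed, and looked up independently of any rule. It is a sketch under stated assumptions: `demo_set` and its helpers are hypothetical, the fixed capacity and linear search are demo choices, and the real kernel set backends work differently.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SET_CAP 16

/* Hypothetical stand-in for a named set of IPv4 addresses. */
struct demo_set {
    uint32_t elems[DEMO_SET_CAP];   /* network byte order, as written by inet_pton */
    size_t   count;
};

/* Parse dotted-quad text; returns false on malformed input. */
static bool parse_ipv4(const char *text, uint32_t *out)
{
    struct in_addr a;

    if (inet_pton(AF_INET, text, &a) != 1)
        return false;
    *out = a.s_addr;
    return true;
}

/* Lookup: an empty set matches nothing. */
bool demo_set_contains(const struct demo_set *s, const char *ip)
{
    uint32_t key;

    assert(s != NULL);
    if (!parse_ipv4(ip, &key))
        return false;
    for (size_t i = 0; i < s->count; i++)
        if (s->elems[i] == key)
            return true;
    return false;
}

/* Add an element; in this demo, duplicates are silently ignored. */
bool demo_set_add(struct demo_set *s, const char *ip)
{
    uint32_t key;

    assert(s != NULL);
    if (!parse_ipv4(ip, &key))
        return false;
    if (demo_set_contains(s, ip))
        return true;
    if (s->count == DEMO_SET_CAP)
        return false;
    s->elems[s->count++] = key;
    return true;
}

/* Delete an element; returns false if it was not present. */
bool demo_set_delete(struct demo_set *s, const char *ip)
{
    uint32_t key;

    assert(s != NULL);
    if (!parse_ipv4(ip, &key))
        return false;
    for (size_t i = 0; i < s->count; i++) {
        if (s->elems[i] == key) {
            s->elems[i] = s->elems[--s->count];   /* order is irrelevant in a set */
            return true;
        }
    }
    return false;
}
```

Replaying the commands above against this model (add 192.168.1.1, 192.168.1.2, and 10.0.0.5, then delete 10.0.0.5 and add 10.0.0.8) leaves three elements, and a lookup of 10.0.0.5 no longer matches. The rule that references the set never has to change.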
6. Objects
Objects in nftables refer to various entities like counters, quotas, and limits that can be attached to rules for monitoring and controlling traffic.
To add a counter object:
nft add counter ip mytable my_counter
Creating the counter object only defines it. To use it, a rule must reference it. For example:
nft add rule ip mytable input tcp dport 22 counter name my_counter accept
That rule both accepts SSH traffic and increments my_counter each time it matches. You can later inspect counters with commands such as nft list counter ip mytable my_counter.
Comprehensive User-Space Example
Let’s put it all together by creating a simple firewall configuration using nft, and then walk through a closely related user-space C program that uses Netlink and nftables-specific libraries.
# Create table
nft add table inet testfirewall
# Create input chain
# family=inet, table=testfirewall, chain=input
nft add chain inet testfirewall input { type filter hook input priority 0 \; }
# Create a set for trusted IPs
nft add set inet testfirewall trusted_ips { type ipv4_addr \; }
# Add trusted IPs to the set
nft add element inet testfirewall trusted_ips { 192.168.1.1, 192.168.1.2 }
# Add rules
nft add rule inet testfirewall input ip saddr @trusted_ips accept
nft add rule inet testfirewall input drop
The names are:
- family: inet
- table: testfirewall
- chain: input
- set: trusted_ips
The final drop rule is important because this chain does not set policy drop. Without the trailing drop rule, packets that do not match the trusted-IP rule would reach the end of the chain and then follow the chain policy; because the default policy of a base chain is accept, they would be accepted rather than denied. If the chain is instead created with a default drop policy:
nft add chain inet testfirewall input { type filter hook input priority 0 \; policy drop \; }
then packets that do not match the trusted_ips rule are dropped automatically at the end of the chain, and a separate trailing drop rule is no longer necessary.
libmnl and libnftnl
libmnl is a compact user-space Netlink library that provides helpers to open and bind Netlink sockets, build and send messages, and process ACK/error replies. libnftnl is built on top of libmnl and exposes higher-level constructs for nftables objects (tables, chains, sets, rules, and expressions), so that nftables operations can be composed and batched without hand-encoding Netlink attributes. Used together, they allow nftables policy to be implemented directly from C with clear mappings to kernel objects.
nftables batch transactions
Batching groups multiple nftables operations into a single, atomic transaction. A batch begins with nftnl_batch_begin(), followed by enqueuing one or more messages (for example, creating tables, sets, elements, chains, and rules), and is finalized with nftnl_batch_end() before transmission via libmnl. If any message in the batch fails, the kernel rejects the entire batch to prevent partial configuration. This approach reduces syscall overhead, preserves creation and evaluation order, and maintains consistency.
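The all-or-nothing property can be illustrated with a toy transaction: apply every queued operation to a scratch copy and publish the copy only if none of them fails. This is a conceptual sketch with hypothetical names (`demo_state`, `demo_commit_batch`); it does not reproduce how the kernel actually implements commit and abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ruleset state, reduced to simple counts. */
struct demo_state {
    int tables;
    int chains;
    int rules;
};

/* One queued operation; returns false to signal failure. */
typedef bool (*demo_op_fn)(struct demo_state *st);

bool op_add_table(struct demo_state *st) { st->tables++; return true; }

bool op_add_chain(struct demo_state *st)
{
    if (st->tables == 0)
        return false;   /* a chain needs a table to live in */
    st->chains++;
    return true;
}

bool op_add_rule(struct demo_state *st)
{
    if (st->chains == 0)
        return false;   /* a rule needs a chain to live in */
    st->rules++;
    return true;
}

/* Apply every op to a scratch copy; publish the copy only if all succeed. */
bool demo_commit_batch(struct demo_state *live, const demo_op_fn *ops, size_t n)
{
    assert(live != NULL && (ops != NULL || n == 0));
    struct demo_state scratch = *live;

    for (size_t i = 0; i < n; i++) {
        if (!ops[i](&scratch))
            return false;   /* live state is left untouched */
    }
    *live = scratch;
    return true;
}
```

A batch that tries to add a rule before its chain exists fails as a whole and leaves the state at zero tables, even though the table creation earlier in the same batch would have succeeded on its own. This also shows why ordering inside the batch matters.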
Program context and goal
/* This program performs the nft CLI sequence via libmnl + libnftnl:
 * It creates:
 *  - table "testfirewall" (family inet)
 *  - chain "input" (the core object is created here; base-chain hook/type attributes are shown below as examples)
 *  - set "trusted_ips" with two IPv4 addresses
 *  - rule 1: payload(ip saddr) -> lookup(@trusted_ips) -> verdict accept
 *  - rule 2: verdict drop
 *
 * It also creates private user, mount, and network namespaces so the test can
 * run without touching the host nftables state.
 *
 * LICENSE: GPLv2
 */
#define _GNU_SOURCE
#include <stdint.h>   /* uint32_t */
#include <stdio.h>    /* snprintf */
#include <string.h>
#include <time.h>
#include <unistd.h>   /* getuid, getgid, write, close */
#include <sched.h>    /* unshare */
#include <err.h>
#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nf_tables.h>
#include <libmnl/libmnl.h>
#include <libnftnl/common.h>   /* nftnl_batch_begin, nftnl_nlmsg_build_hdr */
#include <libnftnl/chain.h>
#include <libnftnl/expr.h>
#include <libnftnl/rule.h>
#include <libnftnl/set.h>
#include <libnftnl/table.h>
#define mnl_batch_limit (1024 * 1024)
char mnl_batch_buffer[2 * mnl_batch_limit];
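/* Write a short string to an existing file (used for the /proc/self/ id maps) */
void write_file(const char *filename, const char *text) {
    int fd = open(filename, O_WRONLY);
    if (fd < 0)
        err(1, "open %s", filename);
    if (write(fd, text, strlen(text)) < 0)
        err(1, "write %s", filename);
    close(fd);
}
void new_ns(void) {
    uid_t uid = getuid();
    gid_t gid = getgid();
    char buffer[0x100];

    /* Private user and mount namespaces first, then a network namespace */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0)
        err(1, "unshare(CLONE_NEWUSER | CLONE_NEWNS)");
    if (unshare(CLONE_NEWNET) < 0)
        err(1, "unshare(CLONE_NEWNET)");
    /* setgroups must be denied before an unprivileged process may write gid_map */
    write_file("/proc/self/setgroups", "deny");
    /* Map root inside the namespace to the calling user and group outside it */
    snprintf(buffer, sizeof(buffer), "0 %u 1", (unsigned int)uid);
    write_file("/proc/self/uid_map", buffer);
    snprintf(buffer, sizeof(buffer), "0 %u 1", (unsigned int)gid);
    write_file("/proc/self/gid_map", buffer);
}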
1) Setup namespace and mnl socket
/* Open and bind a NETLINK_NETFILTER socket using libmnl */
int main(void)
{
    struct mnl_socket *nl;

    new_ns();
    nl = mnl_socket_open(NETLINK_NETFILTER);
    if (nl == NULL) {
        err(1, "Error calling mnl_socket_open()");
    }
    if (mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID) < 0) {
        err(1, "Error calling mnl_socket_bind()");
    }
2) Begin a transaction batch
    /* Begin a netlink batch for atomic nftables operations */
    unsigned int seq = time(NULL), portid;
    int family = NFPROTO_INET;
    const char *table_name = "testfirewall";
    const char *chain_name = "input";
    const char *set_name = "trusted_ips";
    uint32_t set_id = 1;
    uint32_t set_flags = 0;
    uint32_t set_key_len = 4; /* IPv4 */
    struct mnl_nlmsg_batch *batch = mnl_nlmsg_batch_start(mnl_batch_buffer, mnl_batch_limit);

    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);
3) Create table: nft add table inet testfirewall
    /* Create an nftables table (inet family) named "testfirewall" */
    struct nftnl_table *table = nftnl_table_alloc();
    if (!table) errx(1, "nftnl_table_alloc");
    nftnl_table_set_u32(table, NFTNL_TABLE_FAMILY, family);
    nftnl_table_set_str(table, NFTNL_TABLE_NAME, table_name);
    struct nlmsghdr *nlh = nftnl_table_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWTABLE,
        family,
        NLM_F_CREATE | NLM_F_ACK,
        seq++
    );
    nftnl_table_nlmsg_build_payload(nlh, table);
    mnl_nlmsg_batch_next(batch);
    nftnl_table_free(table);
4) Create set: nft add set inet testfirewall trusted_ips { type ipv4_addr \; }
    /* Create a set to hold trusted IPv4 addresses */
    struct nftnl_set *set = nftnl_set_alloc();
    if (!set) errx(1, "nftnl_set_alloc");
    nftnl_set_set_u32(set, NFTNL_SET_FAMILY, family);
    nftnl_set_set_str(set, NFTNL_SET_TABLE, table_name);
    nftnl_set_set_str(set, NFTNL_SET_NAME, set_name);
    nftnl_set_set_u32(set, NFTNL_SET_ID, set_id);
    nftnl_set_set_u32(set, NFTNL_SET_FLAGS, set_flags);
    nftnl_set_set_u32(set, NFTNL_SET_KEY_LEN, set_key_len);
    nlh = nftnl_set_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWSET,
        family,
        NLM_F_CREATE | NLM_F_ACK,
        seq++
    );
    nftnl_set_nlmsg_build_payload(nlh, set);
    mnl_nlmsg_batch_next(batch);
    nftnl_set_free(set);
5) Add set elements: nft add element inet testfirewall trusted_ips { 192.168.1.1, 192.168.1.2 }
    /* Insert IPv4 addresses as elements of "trusted_ips" */
    struct nftnl_set *set_e = nftnl_set_alloc();
    if (!set_e) errx(1, "nftnl_set_alloc for elements");
    nftnl_set_set_u32(set_e, NFTNL_SET_FAMILY, family);
    nftnl_set_set_str(set_e, NFTNL_SET_TABLE, table_name);
    nftnl_set_set_str(set_e, NFTNL_SET_NAME, set_name);
    /* element 1 */
    struct nftnl_set_elem *se1 = nftnl_set_elem_alloc();
    if (!se1) errx(1, "nftnl_set_elem_alloc se1");
    uint32_t ip1;
    inet_pton(AF_INET, "192.168.1.1", &ip1);
    nftnl_set_elem_set(se1, NFTNL_SET_ELEM_KEY, &ip1, sizeof(ip1));
    nftnl_set_elem_add(set_e, se1);
    /* element 2 */
    struct nftnl_set_elem *se2 = nftnl_set_elem_alloc();
    if (!se2) errx(1, "nftnl_set_elem_alloc se2");
    uint32_t ip2;
    inet_pton(AF_INET, "192.168.1.2", &ip2);
    nftnl_set_elem_set(se2, NFTNL_SET_ELEM_KEY, &ip2, sizeof(ip2));
    nftnl_set_elem_add(set_e, se2);
    /* Build NEWSETELEM message */
    nlh = nftnl_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWSETELEM,
        family,
        NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK,
        seq++
    );
    nftnl_set_elems_nlmsg_build_payload(nlh, set_e);
    mnl_nlmsg_batch_next(batch);
    nftnl_set_free(set_e);
6) Create chain object for input (base-chain attributes shown as examples below)
    /* Create a chain named "input"; set base-chain attributes if available in your libnftnl:
     * type=filter, hook=input, priority=0 (policy can be omitted since we add an explicit drop rule).
     * Without type, hook, and priority the kernel creates a regular chain that no hook invokes.
     */
    struct nftnl_chain *chain = nftnl_chain_alloc();
    if (!chain) errx(1, "nftnl_chain_alloc");
    nftnl_chain_set_u32(chain, NFTNL_CHAIN_FAMILY, family);
    nftnl_chain_set_str(chain, NFTNL_CHAIN_TABLE, table_name);
    nftnl_chain_set_str(chain, NFTNL_CHAIN_NAME, chain_name);
    /* Example (uncomment if your libnftnl supports these attributes in your environment):
     * nftnl_chain_set_str(chain, NFTNL_CHAIN_TYPE, "filter");
     * nftnl_chain_set_u32(chain, NFTNL_CHAIN_HOOKNUM, NF_INET_LOCAL_IN);
     * nftnl_chain_set_s32(chain, NFTNL_CHAIN_PRIO, 0);
     */
    nlh = nftnl_chain_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWCHAIN,
        family,
        NLM_F_CREATE | NLM_F_ACK,
        seq++
    );
    nftnl_chain_nlmsg_build_payload(nlh, chain);
    mnl_nlmsg_batch_next(batch);
    nftnl_chain_free(chain);
7) Add rule: nft add rule inet testfirewall input ip saddr @trusted_ips accept
    /* Rule: ip saddr @trusted_ips accept
     * Build expressions: payload(ip saddr) -> lookup(@trusted_ips) -> immediate(verdict=ACCEPT)
     */
    struct nftnl_rule *r1 = nftnl_rule_alloc();
    if (!r1) errx(1, "nftnl_rule_alloc r1");
    nftnl_rule_set_u32(r1, NFTNL_RULE_FAMILY, family);
    nftnl_rule_set_str(r1, NFTNL_RULE_TABLE, table_name);
    nftnl_rule_set_str(r1, NFTNL_RULE_CHAIN, chain_name);
    /* payload: load IPv4 saddr (offset 12, length 4) into reg1.
     * Note: for an inet table the nft CLI also emits a meta nfproto check for IPv4
     * before this load; it is omitted here, so offset 12 is read from any network header.
     */
    struct nftnl_expr *ex_payload = nftnl_expr_alloc("payload");
    if (!ex_payload) errx(1, "expr payload");
    nftnl_expr_set_u32(ex_payload, NFTNL_EXPR_PAYLOAD_BASE, NFT_PAYLOAD_NETWORK_HEADER);
    nftnl_expr_set_u32(ex_payload, NFTNL_EXPR_PAYLOAD_OFFSET, 12);
    nftnl_expr_set_u32(ex_payload, NFTNL_EXPR_PAYLOAD_LEN, 4);
    nftnl_expr_set_u32(ex_payload, NFTNL_EXPR_PAYLOAD_DREG, NFT_REG_1);
    nftnl_rule_add_expr(r1, ex_payload);
    /* lookup: reg1 ∈ set trusted_ips */
    struct nftnl_expr *ex_lookup = nftnl_expr_alloc("lookup");
    if (!ex_lookup) errx(1, "expr lookup");
    nftnl_expr_set_u32(ex_lookup, NFTNL_EXPR_LOOKUP_SREG, NFT_REG_1);
    nftnl_expr_set_str(ex_lookup, NFTNL_EXPR_LOOKUP_SET, set_name);
    nftnl_expr_set_u32(ex_lookup, NFTNL_EXPR_LOOKUP_SET_ID, set_id);
    nftnl_expr_set_u32(ex_lookup, NFTNL_EXPR_LOOKUP_FLAGS, 0);
    nftnl_rule_add_expr(r1, ex_lookup);
    /* immediate: verdict accept */
    struct nftnl_expr *ex_accept = nftnl_expr_alloc("immediate");
    if (!ex_accept) errx(1, "expr immediate accept");
    nftnl_expr_set_u32(ex_accept, NFTNL_EXPR_IMM_DREG, NFT_REG_VERDICT);
    nftnl_expr_set_u32(ex_accept, NFTNL_EXPR_IMM_VERDICT, NF_ACCEPT);
    nftnl_rule_add_expr(r1, ex_accept);
    /* enqueue rule */
    nlh = nftnl_rule_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWRULE,
        family,
        NLM_F_APPEND | NLM_F_CREATE | NLM_F_ACK,
        seq++
    );
    nftnl_rule_nlmsg_build_payload(nlh, r1);
    mnl_nlmsg_batch_next(batch);
    nftnl_rule_free(r1);
8) Add rule: nft add rule inet testfirewall input drop
    /* Rule: drop (default catch-all) */
    struct nftnl_rule *r2 = nftnl_rule_alloc();
    if (!r2) errx(1, "nftnl_rule_alloc r2");
    nftnl_rule_set_u32(r2, NFTNL_RULE_FAMILY, family);
    nftnl_rule_set_str(r2, NFTNL_RULE_TABLE, table_name);
    nftnl_rule_set_str(r2, NFTNL_RULE_CHAIN, chain_name);
    struct nftnl_expr *ex_drop = nftnl_expr_alloc("immediate");
    if (!ex_drop) errx(1, "expr immediate drop");
    nftnl_expr_set_u32(ex_drop, NFTNL_EXPR_IMM_DREG, NFT_REG_VERDICT);
    nftnl_expr_set_u32(ex_drop, NFTNL_EXPR_IMM_VERDICT, NF_DROP);
    nftnl_rule_add_expr(r2, ex_drop);
    nlh = nftnl_rule_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWRULE,
        family,
        NLM_F_APPEND | NLM_F_CREATE | NLM_F_ACK,
        seq++
    );
    nftnl_rule_nlmsg_build_payload(nlh, r2);
    mnl_nlmsg_batch_next(batch);
    nftnl_rule_free(r2);
9) End batch, send, and close
    /* Finalize batch and send to kernel */
    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);
    portid = mnl_socket_get_portid(nl);
    (void)portid; /* suppress unused warning if not receiving ACKs */
    if (mnl_socket_sendto(nl, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0) {
        err(1, "Error calling mnl_socket_sendto()");
    }
    mnl_nlmsg_batch_stop(batch);
    /* Close socket (ACK handling can be added via mnl_socket_recvfrom + mnl_cb_run) */
    mnl_socket_close(nl);
    return 0;
}
Mapping user-space to kernel objects
The user-space sequence above results in the following kernel-level constructs being created and evaluated:
- nft_table: The top-level container for chains, sets, and objects for a given family (e.g., inet).
- nft_set and nft_set_elem: The set that stores trusted IP addresses and its elements. The lookup expression probes this set at runtime.
- nft_chain: The example creates a chain named input; if base-chain attributes are set, it can attach to the input hook in the packet path, and rule traversal is in-order.
- nft_rule: Each rule is a list of nft_expr instances evaluated sequentially.
- nft_expr: The individual operations within a rule. nft_payload retrieves the IPv4 source address from the packet buffer; nft_lookup compares the loaded address against the set; nft_immediate writes the verdict (NF_ACCEPT) into the verdict register. A subsequent rule can issue NF_DROP as a default.
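The payload, lookup, and immediate steps in the list above can be traced with a small register-machine model. This is a user-space sketch, not kernel code: `demo_regs`, `expr_payload`, `expr_lookup`, and `eval_input` are hypothetical names, and only the offset 12 / length 4 values come from the program above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum pipe_verdict { PIPE_CONTINUE, PIPE_ACCEPT, PIPE_DROP };

/* Hypothetical register file: one data register and one verdict register. */
struct demo_regs {
    uint8_t data[4];
    enum pipe_verdict verdict;
};

/* "payload": copy len bytes at offset from the network header into the data register. */
bool expr_payload(struct demo_regs *regs, const uint8_t *nh, size_t nh_len,
                  size_t offset, size_t len)
{
    if (len > sizeof(regs->data) || offset > nh_len || len > nh_len - offset)
        return false;   /* header too short: the rule does not match */
    memcpy(regs->data, nh + offset, len);
    return true;
}

/* "lookup": succeed only if the data register equals one of the set keys. */
bool expr_lookup(const struct demo_regs *regs, const uint8_t (*set)[4], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(regs->data, set[i], 4) == 0)
            return true;
    return false;
}

/* Rule 1 then rule 2 of the example: accept trusted sources, drop the rest. */
enum pipe_verdict eval_input(const uint8_t *ipv4_hdr, size_t hdr_len,
                             const uint8_t (*trusted)[4], size_t n)
{
    struct demo_regs regs = { {0}, PIPE_CONTINUE };

    assert(ipv4_hdr != NULL);
    /* rule 1: payload(offset 12, len 4) -> lookup -> immediate accept */
    if (expr_payload(&regs, ipv4_hdr, hdr_len, 12, 4) &&
        expr_lookup(&regs, trusted, n))
        regs.verdict = PIPE_ACCEPT;
    /* rule 2: immediate drop, reached only if rule 1 issued no verdict */
    if (regs.verdict == PIPE_CONTINUE)
        regs.verdict = PIPE_DROP;
    return regs.verdict;
}
```

Given a 20-byte IPv4 header whose bytes 12 to 15 hold 192.168.1.2 and a trusted list of 192.168.1.1 and 192.168.1.2, `eval_input()` returns `PIPE_ACCEPT`; changing the last source byte to 99 makes the lookup fail, so the second rule's drop verdict applies.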
Conclusion
Netfilter provides the kernel hook points where packet-processing decisions are made, while nftables gives us the user-space model for expressing those decisions in terms of tables, chains, rules, sets, and objects. Once those pieces are mapped to their kernel counterparts, it becomes much easier to reason about how packets traverse the system and how user-space configuration turns into concrete kernel state.
That mapping is especially useful when debugging behavior, reviewing kernel fixes, or writing Ksplice transformers and exploit detection for the netfilter subsystem. Even a small nftables ruleset exercises a rich set of kernel objects, and understanding their relationships makes the subsystem much less opaque.
As Ksplice engineers, we often have to learn new areas of the kernel very rapidly to understand how a patch can be applied without rebooting the system. This work can be challenging, but it’s also rewarding to explore new areas of the kernel and understand how they interact with each other.
If you find this type of work interesting, consider applying for a job with the Ksplice team! Feel free to drop us a line at ksplice-support_ww@oracle.com.