Introduction

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. This is accomplished by encapsulating an InfiniBand (IB) transport packet over Ethernet. Mellanox introduced a few counters in MLNX_OFED 4.1 and kernel 4.12 that expose congestion-related information in RoCE networks. This blog describes these counters and uses practical data to show how they help in understanding congestion.

The Counters

These counters are:

  • rp_cnp_handled [Reaction Point Congestion Notification Packets Handled]
  • rp_cnp_ignored [Reaction Point Congestion Notification Packets Ignored]
  • np_cnp_sent    [Notification Point Congestion Notification Packets Sent]
  • np_ecn_marked_roce_packets [Notification Point Explicit Congestion Notification Marked RoCE Packets]

Before diving into the above counters, it is essential to understand Explicit Congestion Notification (ECN).

ECN enables end-to-end congestion notification between an ECN-capable sender and receiver on TCP/IP-based networks. Any device in the transmission path that does not support ECN breaks this end-to-end functionality. Unlike congestion control algorithms such as slow start, restart, and AIMD (Additive Increase Multiplicative Decrease), ECN works by informing the sender that congestion is building up at the routers, so that the sender can take measures to avoid it before packets are dropped. ECN is a congestion signaling mechanism defined in RFC 3168. It uses two bits in the TCP header to negotiate ECN support and two bits in the IP header to signal congestion. The two bits in the IP header are ECT (ECN Capable Transport) and CE (Congestion Experienced), and together they yield the four codepoints shown below:

 

ECT  CE  Codepoint                      Sent From  To
---  --  -----------------------------  ---------  --------
 0    0  non-ECT                        any        any
 0    1  ECT(1): ECN Capable Transport  sender     receiver
 1    0  ECT(0): ECN Capable Transport  sender     receiver
 1    1  CE: Congestion Experienced     router     receiver

 

  • [0 0] means non-ECT: the packet's transport is not ECN-capable, so there is no point in marking it. If there is congestion, this packet must be dropped instead of marked.
  • [0 1] is the ECT(1) codepoint, meaning the transport is ECN-capable.
  • [1 0] is the ECT(0) codepoint, which also means the transport is ECN-capable. A packet carrying ECT(0) or ECT(1) will not be dropped by a congested router; it will be marked instead.
  • [1 1] is the CE codepoint. When a router is congested and the packet is ECT-capable, the router sets both bits to 1, turning the ECT codepoint into CE. It does not drop the packet; it simply marks it and forwards it to the receiver. (A short sketch after this list shows how these codepoints can be decoded programmatically.)
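
For illustration, the ECN codepoint of a packet can be decoded from the low two bits of the IP TOS (Traffic Class) byte. The short Python sketch below is illustrative only; the function name and example TOS values are assumptions made for this example:

# Decode the ECN codepoint from the low two bits of the IP TOS byte.
# Illustrative sketch only; names and example values are hypothetical.

ECN_CODEPOINTS = {
    0b00: "non-ECT (not ECN-capable)",
    0b01: "ECT(1)  (ECN-capable transport)",
    0b10: "ECT(0)  (ECN-capable transport)",
    0b11: "CE      (congestion experienced)",
}

def decode_ecn(tos_byte: int) -> str:
    """Return the ECN codepoint encoded in the low two bits of the TOS byte."""
    return ECN_CODEPOINTS[tos_byte & 0b11]

if __name__ == "__main__":
    # Example TOS values: 0x00 -> non-ECT, 0x02 -> ECT(0), 0x01 -> ECT(1), 0x03 -> CE
    for tos in (0x00, 0x02, 0x01, 0x03):
        print(f"TOS=0x{tos:02x} -> {decode_ecn(tos)}")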

The image below is a snapshot of tcpdump output, highlighting the congestion notification bits of a non-ECN-enabled node.

[Image: tcpdump output highlighting the ECN bits]

A sender (Reaction Point, RP) that is ECN-capable marks its packets as either ECT(0) or ECT(1). If these packets experience congestion while transiting an ECN-enabled switch, they are re-marked as CE. Upon receiving such packets, the receiver (Notification Point, NP) periodically sends Congestion Notification Packets (CNPs) back to the sender. The Reaction Point, upon receiving these CNPs, throttles the rate at which it sends data.
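
To make the Reaction Point behaviour concrete, here is a deliberately simplified Python sketch of CNP-driven rate throttling. It is not the DCQCN algorithm implemented in the NIC firmware; the constants, function names, and update rules are illustrative assumptions only: each interval with a CNP cuts the rate multiplicatively, and intervals without CNPs slowly recover it.

# Simplified, hypothetical sketch of how a Reaction Point (sender) might
# throttle on CNPs. Constants and update rules are illustrative only and
# do NOT reproduce the actual congestion control run by the NIC firmware.

LINE_RATE_GBPS = 100.0
CUT_FACTOR = 0.5          # multiplicative decrease applied per CNP interval
RECOVERY_STEP_GBPS = 5.0  # additive recovery when no CNP is received

def next_rate(current_rate: float, cnp_received: bool) -> float:
    """Return the sender's next transmit rate in Gbps."""
    if cnp_received:
        return current_rate * CUT_FACTOR  # congestion reported: back off
    return min(LINE_RATE_GBPS, current_rate + RECOVERY_STEP_GBPS)  # recover

if __name__ == "__main__":
    rate = LINE_RATE_GBPS
    # CNPs arrive for the first three intervals, then congestion clears.
    for interval, cnp in enumerate([True, True, True, False, False, False]):
        rate = next_rate(rate, cnp)
        print(f"interval {interval}: cnp={cnp} rate={rate:.1f} Gbps")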

CNP Counters

np_cnp_sent: The number of congestion notification packets (CNPs) sent by the Notification Point after it observed the CE mark (ECN bits) in the RoCEv2 IP header.

rp_cnp_handled: The number of CNPs received and handled by the Reaction Point in order to throttle its transmission rate.
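
On mlx5 devices these counters are typically exposed per port under sysfs, for example /sys/class/infiniband/<device>/ports/<port>/hw_counters/ (the exact path can vary with driver and kernel version). Below is a minimal Python sketch to read them, assuming that layout and an example device name of mlx5_0:

# Minimal sketch: read the RoCE congestion counters for one port from sysfs.
# Assumes the mlx5 hw_counters layout; device and port names are examples.
from pathlib import Path

COUNTERS = ("np_cnp_sent", "rp_cnp_handled",
            "np_ecn_marked_roce_packets", "rp_cnp_ignored")

def read_counters(device: str = "mlx5_0", port: int = 1) -> dict:
    base = Path(f"/sys/class/infiniband/{device}/ports/{port}/hw_counters")
    return {name: int((base / name).read_text()) for name in COUNTERS}

if __name__ == "__main__":
    for name, value in read_counters().items():
        print(f"{name:30s} {value}")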

These two counters increment on two different nodes: the node generating congestion (rp_cnp_handled) and the node reporting it (np_cnp_sent). This behavior was observed when a storage node saw a surge in the number of RDMA bytes being read from it: congestion was created, and the CNP counters on the corresponding nodes can be seen incrementing.

time      change in data traffic(Gbps)
--------------------------------------
10:37:52 0.394479
10:39:52 0.257222
10:41:52 8.17605
10:43:52 54.1771   <== Spike in the rdma bytes being sent from storage node (being read by compute node)
10:45:52 0.289964
10:47:52 0.344469
10:49:52 0.285193
10:51:52 20.9374
10:53:52 24.7994
10:55:53 4.85491
10:57:53 0.084384
10:59:53 0.0644531
11:01:53 0.0644531
11:03:53 0.0645313
11:05:53 0.0645617
time: 2022-06-15 10:43:40
np_cnp_sent                        0                             0
rp_cnp_handled                     0                             0
time: 2022-06-15 10:43:45
np_cnp_sent                        0                             0
rp_cnp_handled                     1106                          1552   <== At the same time, where there was a spike in the send_rdma_bytes, we see rp_cnp_handled growing.
time: 2022-06-15 10:43:50
np_cnp_sent                        0                             0
rp_cnp_handled                     790                           868
time: 2022-06-15 10:43:55
np_cnp_sent                        0                             0
rp_cnp_handled                     754                           1292
time: 2022-06-15 10:44:00
np_cnp_sent                        0                             0
rp_cnp_handled                     989                           1117
time: 2022-06-15 10:44:10
np_cnp_sent                        0                             0
rp_cnp_handled                     1290                          1287
time: 2022-06-15 10:44:20
np_cnp_sent                        0                             0
rp_cnp_handled                     1695                          1639
time: 2022-06-15 10:44:30
np_cnp_sent                        0                             0
rp_cnp_handled                     1471                          1240
time: 2022-06-15 10:44:40
np_cnp_sent                        0                             0
rp_cnp_handled                     1142                          918
time: 2022-06-15 10:44:50
np_cnp_sent                        0                             0
rp_cnp_handled                     155                           172
time: 2022-06-15 10:45:00
np_cnp_sent                        0                             0
rp_cnp_handled                     882                           715
time: 2022-06-15 10:45:10
np_cnp_sent                        0                             0
rp_cnp_handled                     109                           308
time: 2022-06-15 10:45:20
np_cnp_sent                        0                             0
rp_cnp_handled                     16                            19
time: 2022-06-15 10:45:25
np_cnp_sent                        0                             0
rp_cnp_handled                     0                             0

At the same time, another DB node, which was affected by this congestion, reports an increase in its np_cnp_sent counter.

time: 2022-06-15 10:43:39
np_cnp_sent                        0                             0
rp_cnp_handled                     0                             0
time: 2022-06-15 10:43:44
np_cnp_sent                        5357                          9400 <== Increase in the np_cnp_sent on db node at the same time.
rp_cnp_handled                     20                            16
time: 2022-06-15 10:43:49
np_cnp_sent                        5149                          5352
rp_cnp_handled                     18                            13
time: 2022-06-15 10:43:54
np_cnp_sent                        4019                          7488
rp_cnp_handled                     10                            19
time: 2022-06-15 10:43:59
np_cnp_sent                        6065                          6628
rp_cnp_handled                     18                            17
time: 2022-06-15 10:44:10
np_cnp_sent                        8528                          5701
rp_cnp_handled                     15                            7
time: 2022-06-15 10:44:20
np_cnp_sent                        9960                          9042
rp_cnp_handled                     26                            11
time: 2022-06-15 10:44:30
np_cnp_sent                        7514                          5613
rp_cnp_handled                     20                            10
time: 2022-06-15 10:44:40
np_cnp_sent                        6935                          4623
rp_cnp_handled                     25                            6
time: 2022-06-15 10:44:50
np_cnp_sent                        940                           1496
rp_cnp_handled                     1                             1
time: 2022-06-15 10:45:00
np_cnp_sent                        5489                          4194
rp_cnp_handled                     13                            6
time: 2022-06-15 10:45:10
np_cnp_sent                        688                           5118
rp_cnp_handled                     2                             4
time: 2022-06-15 10:45:20
np_cnp_sent                        134                           607
rp_cnp_handled                     0                             0
time: 2022-06-15 10:45:25
np_cnp_sent                        0                             0
rp_cnp_handled                     0                             0

Two other counters are np_ecn_marked_roce_packets and rp_cnp_ignored. The first gives the number of packets received by the receiver (Notification Point) that were marked as Congestion Experienced, while the second gives the number of CNPs received but ignored by the sender (Reaction Point).
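
Note that the per-interval numbers in the dumps above are deltas; the hardware counters themselves are cumulative. A small polling loop, sketched below under the same sysfs-layout assumption as the earlier snippet (the 5-second interval simply mirrors the sampling used in the dumps), can produce similar output:

# Hedged sketch: poll the RoCE congestion counters every few seconds and print
# per-interval deltas, similar in spirit to the dumps shown above.
# Paths and device names are assumptions (mlx5 hw_counters layout).
import time
from datetime import datetime
from pathlib import Path

COUNTERS = ("np_cnp_sent", "rp_cnp_handled",
            "np_ecn_marked_roce_packets", "rp_cnp_ignored")

def read_counters(device: str = "mlx5_0", port: int = 1) -> dict:
    base = Path(f"/sys/class/infiniband/{device}/ports/{port}/hw_counters")
    return {name: int((base / name).read_text()) for name in COUNTERS}

def poll_deltas(interval_s: int = 5) -> None:
    previous = read_counters()
    while True:
        time.sleep(interval_s)
        current = read_counters()
        print(f"time: {datetime.now():%Y-%m-%d %H:%M:%S}")
        for name in COUNTERS:
            print(f"{name:30s} {current[name] - previous[name]}")
        previous = current

if __name__ == "__main__":
    poll_deltas()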

Summary

These congestion-related counters show which nodes in the cluster are creating or experiencing congestion, along with the intensity of that congestion. This helps track down the applications responsible for generating congestion so that appropriate action can be taken.

References

  1. https://www.juniper.net/documentation/us/en/software/junos/cos/topics/concept/cos-qfx-series-explicit-congestion-notification-understanding.html
  2. https://www.geeksforgeeks.org/what-is-ecnexplicit-congestion-notification/
  3. https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters