Network interface transmission queue timeout: causes and recovery methods

August 22, 2023 | 11 minute read

Introduction

A network interface transmission queue is said to have timed out if it is in the stopped state and the time since the last transmission from that queue has exceeded the threshold that the driver dictates to the watchdog timer. A transmission queue is a collection of data packets waiting to be transmitted by the corresponding networking device. During the initialization of the network device, its transmission queues and scheduler are activated via dev_activate(). At the same time, the method dev_watchdog_up() starts a timer to detect transmission problems. This timer calls the method dev_watchdog() every dev->watchdog_timeo ticks to check that the network adapter is working properly. If the routine dev_watchdog() identifies that a transmission queue is disabled (netif_queue_stopped() returns TRUE) and the time since the last frame transmission has exceeded the wait threshold (dev->watchdog_timeo), the timer's handler invokes a recovery routine registered by the device driver. Details of these handlers are discussed in the subsequent sections of this article.
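
To make this mechanism concrete, below is a condensed sketch of the check performed by dev_watchdog() in net/sched/sch_generic.c (the exact code varies across kernel versions):

dev_watchdog
{
    ...
    for (i = 0; i < dev->num_tx_queues; i++) {
        struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

        /* queue stopped and idle longer than the driver's threshold? */
        if (netif_xmit_stopped(txq) &&
            time_after(jiffies, txq->trans_start + dev->watchdog_timeo)) {
                some_queue_timedout = 1;
                break;
        }
    }

    if (some_queue_timedout)
        /* hand recovery over to the driver's .ndo_tx_timeout handler */
        dev->netdev_ops->ndo_tx_timeout(dev);
    ...
}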

Networking interfaces attached to Network Interface Cards (NICs) produced by different vendors can time out for multiple reasons, and each driver has its own way of handling the situation. In this article, we will see how the Mellanox, Broadcom and Intel device drivers handle an interface timeout.

Reasons that can cause a working transmission queue to time out include, but are not limited to, the following:

  • When the CPUs to which the network interface's interrupt request (IRQ) lines are mapped are busy, interrupts raised for packet transactions go unhandled, which can leave the queue stalled and eventually timed out.
  • The network card to which the interfaces are attached either stops transmitting packets or fails to receive acknowledgements for the packets it has sent.
  • The remote network card is busy and unable to handle the incoming packets, causing congestion at the source.
  • Congestion at the switch, or fabric issues, can also stall the transmission queues.

The sections below concentrate on how individual NIC device drivers recover the interfaces they handle whenever a timeout is seen. Oracle Linux 7 source code has been used for this study.

Mellanox (mlx5) Driver

In the Mellanox driver, the watchdog timer is configured to wait for 15 seconds before determining whether any of the transmission queues corresponding to a Mellanox interface are stopped and need to be recovered. The data flow below provides insight into how a Mellanox interface transmission queue stall is dealt with.

mlx5e_build_nic_netdev
{
    ...
    netdev->watchdog_timeo = 15 * HZ;
}

mlx5e_netdev_ops
{
    ...
    .ndo_tx_timeout = mlx5e_tx_timeout,
}

Whenever the watchdog timer determines that a queue is stopped, the handler mlx5e_tx_timeout is called, as registered in the mlx5e initialization routine mlx5e_build_nic_netdev. This routine first alerts us by printing the message "TX timeout detected" and then examines every queue associated with the device. For each stopped queue it prints detailed information, such as the consumer index, producer index, send queue number, completion queue number, and the time since the queue was stopped from transmitting.

mlx5e_tx_timeout
{
    ...
    for (i = 0; i < priv->channels.num * priv->channels.params.num_tc; i++) {
                struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, i);
                struct mlx5e_txqsq *sq = priv->txq2sq[i];
 
                if (!netif_xmit_stopped(dev_queue))
                        continue;
                netdev_err(dev, "TX timeout on queue: %d, SQ: 0x%x, CQ: 0x%x, SQ Cons: 0x%x SQ Prod: 0x%x, usecs since last trans: %u\n",
                           i, sq->sqn, sq->cq.mcq.cqn, sq->cc, sq->pc,
                           jiffies_to_usecs(jiffies - dev_queue->trans_start));                 
                ...    
    }
}

Next, with the help of the routine mlx5_eq_poll_irq_disabled, the driver performs recovery of the transmission queue by processing any unhandled event queue entries (EQEs) on the event queue (EQ) associated with the corresponding send queue (SQ). This primarily involves disabling the interrupt request line, polling for and handling any pending EQEs, and re-enabling the IRQ line corresponding to the event queue, as shown below.

mlx5_eq_poll_irq_disabled
{
    ...
    disable_irq(eq->irqn);
    count_eqe = eq->cons_index;
    mlx5_eq_int(eq->irqn, eq);
    count_eqe = eq->cons_index - count_eqe;
    enable_irq(eq->irqn);

    return count_eqe;
}

If there are any pending EQEs, they are handled now by the routine mlx5_eq_int, and the value of the variable count_eqe in the snippet above will be greater than 0, indicating that a certain number of EQEs had gone unhandled and have now been recovered. This is confirmed by the message shown below, printed by the function mlx5e_tx_timeout_eq_recover. Event queue entries go unhandled when the CPU is busy with another task and misses the interrupts arriving on the IRQ line associated with the interface.

mlx5e_tx_timeout_eq_recover
{
    ...
    eqe_count = mlx5_eq_poll_irq_disabled(eq);
    if (!eqe_count)
        return false;

    netdev_err(dev, "Recover %d eqes on EQ 0x%x\n", eqe_count, eq->eqn);
    return true;
}

When the count of recovered EQEs is non-zero, the function mlx5e_tx_timeout_eq_recover returns true; otherwise it returns false. Depending on this return value, the function mlx5e_tx_timeout decides on further action using the variable reopen_channels, as shown below.

mlx5e_tx_timeout
{
    ...
                if (!mlx5e_tx_timeout_eq_recover(dev, sq)) {
                        clear_bit(MLX5E_SQ_STATE_ENABLED, &sq->state);
                        reopen_channels = true;
                }
        } 
        if (reopen_channels && test_bit(MLX5E_STATE_OPENED, &priv->state))
                schedule_work(&priv->tx_timeout_work);
}

If reopen_channels is true, it indicates that no unhandled EQEs were found, so recovering missed interrupts could not fix the stall; the driver therefore schedules the worker to perform the next set of actions to recover the transmission queue.

mlx5e_build_nic_netdev_priv
{
    ...
    INIT_WORK(&priv->tx_timeout_work, mlx5e_tx_timeout_work);
}

The function mlx5e_tx_timeout_work then executes two routines, mlx5e_close_locked and mlx5e_open_locked, as part of the recovery. mlx5e_close_locked takes the carrier down, stops all the transmission queues, disables any further transmissions, and deactivates the corresponding channels. mlx5e_open_locked performs the reverse operations: it opens the channels, activates the associated transmission queues, and brings the carrier back up. This finishes the recovery of the timed-out interface.

mlx5e_tx_timeout_work
{
    ...
    mlx5e_open_channels(priv, &priv->channels);
    mlx5e_activate_priv_channels(priv);
    if (priv->profile->update_carrier)
                priv->profile->update_carrier(priv);
}
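
For completeness, here is a condensed sketch of the close side, mlx5e_close_locked (drawn from contemporaneous upstream mlx5 sources; exact contents vary by kernel version):

mlx5e_close_locked
{
    ...
    clear_bit(MLX5E_STATE_OPENED, &priv->state);

    netif_carrier_off(priv->netdev);        /* take the carrier down */
    mlx5e_deactivate_priv_channels(priv);   /* stop TX queues, disable further transmissions */
    mlx5e_close_channels(&priv->channels);  /* deactivate and tear down the channels */
}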

Broadcom (bnxt) Driver

Similar to the Mellanox driver, the Broadcom driver also dictates its own threshold time to the watchdog timer. With respect to the kernel under consideration, the timeout for a Broadcom interface is configured to be 5 seconds as shown below.

#define BNXT_TX_TIMEOUT         (5 * HZ)
bnxt_init_one
{
    ...
    dev->watchdog_timeo = BNXT_TX_TIMEOUT;
}

bnxt_netdev_ops
{
    ...
    .ndo_tx_timeout = bnxt_tx_timeout,
}

The routine bnxt_tx_timeout also alerts us by printing the timeout message "TX timeout detected, starting reset task!", shown below, before performing the recovery action. It sets the bit BNXT_RESET_TASK_SP_EVENT, and the reset task is then executed through the routine bnxt_sp_task.

bnxt_tx_timeout
{
    ...
    netdev_err(bp->dev, "TX timeout detected, starting reset task!\n");
    set_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event);
}

bnxt_sp_task
{
    ...
    if (test_and_clear_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event))
        bnxt_reset(bp, false);
}

The routine bnxt_reset eventually executes bnxt_reset_task with the rtnl lock held. The routine bnxt_reset_task is responsible for the important reset activities. Because the second argument, "false", is passed to bnxt_reset, it stops all upper-layer protocol activity and closes the NIC using the routine bnxt_close_nic, which internally involves dropping the carrier, stopping all the transmission queues, flushing all the rings, disabling interrupts, and disabling NAPI.
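
A condensed sketch of the wrapper bnxt_reset (drawn from contemporaneous upstream bnxt_en sources; details vary by kernel version) shows the rtnl lock being taken around the reset work:

bnxt_reset
{
    bnxt_rtnl_lock_sp(bp);      /* clear BNXT_STATE_IN_SP_TASK and take the rtnl lock */
    if (test_bit(BNXT_STATE_OPEN, &bp->state))
        bnxt_reset_task(bp, silent);
    bnxt_rtnl_unlock_sp(bp);
}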

bnxt_reset_task
{
    ...
    bnxt_ulp_stop(bp);
    bnxt_close_nic(bp, true, false);
    rc = bnxt_open_nic(bp, true, false);
    bnxt_ulp_start(bp, rc);
}

bnxt_close_nic
{
    ...
    bnxt_tx_disable(bp);
    bnxt_shutdown_nic(bp, irq_re_init);
    bnxt_disable_napi(bp);
}

bnxt_tx_disable
{
    ...
    netif_carrier_off(bp->dev);
    netif_tx_disable(bp->dev);
}

Opening the NIC via the routine bnxt_open_nic performs the reverse tasks, such as enabling NAPI, enabling the transmission queues, and turning the carrier back on. This finishes the reset task of the Broadcom driver.

__bnxt_open_nic
{
    ...
    bnxt_enable_napi(bp);
    bnxt_tx_enable(bp);
}

bnxt_tx_enable
{
    ...
    netif_tx_wake_all_queues(bp->dev);
    netif_carrier_on(bp->dev);
}

Intel (IXGBE) Driver

Like Broadcom, Intel also has its watchdog timer configured to wait for 5 seconds before executing the timeout recovery procedure, as shown below. Responding to the watchdog timer's notification, the routine ixgbe_tx_timeout immediately executes ixgbe_tx_timeout_reset, which performs the reset activity outside of interrupt context. The routine ixgbe_tx_timeout_reset ensures that the device is not in a down state by checking the bit __IXGBE_DOWN, then sets the bit __IXGBE_RESET_REQUESTED in the adapter structure and prints a message on the console indicating that the reset activity will be initiated. It then executes the routine ixgbe_service_event_schedule.

ixgbe_probe
{
    ...
    netdev->watchdog_timeo = 5 * HZ;
}

ixgbe_netdev_ops
{
    ...
    .ndo_tx_timeout = ixgbe_tx_timeout,
}

ixgbe_tx_timeout
{
    ...
    ixgbe_tx_timeout_reset(adapter);
}

The routine ixgbe_service_event_schedule again ensures that the device is neither down nor in the process of being removed by checking the bits __IXGBE_DOWN and __IXGBE_REMOVING. After the check, it sets the bit __IXGBE_SERVICE_SCHED, indicating that a service task is being scheduled, and finally queues the adapter's service task on the work queue.

ixgbe_tx_timeout_reset
{
    /* perform the reset outside of interrupt context */
    if (!test_bit(__IXGBE_DOWN, &adapter->state)) {
        set_bit(__IXGBE_RESET_REQUESTED, &adapter->state);
        e_warn(drv, "initiating reset due to tx timeout\n");
        ixgbe_service_event_schedule(adapter);
    }
}

ixgbe_service_event_schedule
{
        if (!test_bit(__IXGBE_DOWN, &adapter->state) &&
            !test_bit(__IXGBE_REMOVING, &adapter->state) &&
            !test_and_set_bit(__IXGBE_SERVICE_SCHED, &adapter->state))
                queue_work(ixgbe_wq, &adapter->service_task);
}

The adapter's service task is handled by the routine ixgbe_service_task, which executes multiple sub-tasks to detect issues and take corrective action on the adapter. One such sub-task is ixgbe_check_hang_subtask, which determines whether any of the transmission queues are hung and re-arms the queues to ensure the transmission rings are cleaned. The routine ixgbe_service_task is also responsible for clearing the bit __IXGBE_SERVICE_SCHED via the routine ixgbe_service_event_complete, indicating that the scheduled job on the adapter has completed.

ixgbe_service_task
{
    ...
    ixgbe_check_hang_subtask(adapter);
    ...
    ixgbe_service_event_complete(adapter);
}

ixgbe_check_hang_subtask
{
    ...
    set_check_for_tx_hang(adapter->tx_ring);
    ixgbe_irq_rearm_queues(adapter, eics);
}
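
The bit __IXGBE_RESET_REQUESTED that was set earlier by ixgbe_tx_timeout_reset is consumed by another sub-task of the same service task. A condensed sketch of that path (drawn from contemporaneous upstream ixgbe sources; details vary by kernel version):

ixgbe_reset_subtask
{
    if (!test_and_clear_bit(__IXGBE_RESET_REQUESTED, &adapter->state))
        return;
    ...
    netdev_err(adapter->netdev, "Reset adapter\n");
    adapter->tx_timeout_count++;
    ixgbe_reinit_locked(adapter);   /* bring the adapter down and back up */
}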

Examples

Diagnostic data collection tools like OSWatcher capture Netstat data, which one can use to verify whether a network interface recovered after experiencing a timeout. Below are a few such real-world examples.

In the first example, the interface eth4 timed out twice; after each timeout the driver performed the recovery task and the interface went back to transmitting and receiving packets.

The outputs below show the per-interval difference in the packet counts from Netstat for the interface eth4 around each timeout:

[pkannoju@amomv0033 log]$ grep timeout messages
Oct  9 05:07:06 nodename kernel: bnxt_en 0000:af:00.1 eth4: TX timeout detected, starting reset task!
Oct  9 05:35:49 nodename kernel: bnxt_en 0000:af:00.1 eth4: TX timeout detected, starting reset task!

--------------------
# grep -e zzz -e eth4 Netstat.dat | awk 'BEGIN {a=0}; /eth4/ {print ($7 - a); a = $7}; /zzz/ {print $0} '
   zzz <10/09/2022 05:05:35> subcount: 29
112343
   zzz <10/09/2022 05:06:15> subcount: 30
120427
   zzz <10/09/2022 05:06:54> subcount: 31
75210
   zzz <10/09/2022 05:07:34> subcount: 32
115417                                            
   zzz <10/09/2022 05:08:15> subcount: 33
128876
   zzz <10/09/2022 05:08:54> subcount: 34
118452
------------------------
grep -e zzz -e eth4 Netstat.dat | awk 'BEGIN {a=0}; /eth4/ {print ($7 - a); a = $7}; /zzz/ {print $0} '
   zzz <10/09/2022 05:33:15> subcount: 24
61364
   zzz <10/09/2022 05:33:53> subcount: 25
93251
   zzz <10/09/2022 05:34:32> subcount: 26
123679
   zzz <10/09/2022 05:35:10> subcount: 27
97251
   zzz <10/09/2022 05:35:49> subcount: 28
66729
   zzz <10/09/2022 05:36:27> subcount: 29
38339
   zzz <10/09/2022 05:37:06> subcount: 30
55957
   zzz <10/09/2022 05:37:44> subcount: 31
46832
   zzz <10/09/2022 05:38:23> subcount: 32
41696
------------------------

In the next example, the interface eth3 recovered from the first timeout; however, it did not recover from the second:

[pkannoju@amomv0033 log]$ grep timeout messages
Nov 16 16:45:45 nodename kernel: [1562431.085825] bnxt_en 0000:af:00.0 eth3: TX timeout detected, starting reset task!
Nov 16 17:20:48 nodename kernel: [1564533.867772] bnxt_en 0000:af:00.0 eth3: TX timeout detected, starting reset task!


#grep -e zzz -e eth3 Netstat.dat | awk 'BEGIN {a=0}; /eth3/ {print ($7 - a); a = $7}; /zzz/ {print $0} '
zzz <11/16/2020 16:45:11> subcount: 65
482432
zzz <11/16/2020 16:45:19> subcount: 66
381809
zzz <11/16/2020 16:45:27> subcount: 67
849563
zzz <11/16/2020 16:46:01> subcount: 68
15762594101                             <== Interface recovered after this timeout
zzz <11/16/2020 16:46:09> subcount: 69
719911
zzz <11/16/2020 16:46:17> subcount: 70
596325
zzz <11/16/2020 16:46:25> subcount: 71
445348
zzz <11/16/2020 16:46:34> subcount: 72
436979
zzz <11/16/2020 16:46:42> subcount: 73
412436
zzz <11/16/2020 16:46:50> subcount: 74

#grep -e zzz -e eth3 Netstat.dat | awk 'BEGIN {a=0}; /eth3/ {print ($7 - a); a = $7}; /zzz/ {print $0} '
zzz <11/16/2020 17:20:07> subcount: 325
324822
zzz <11/16/2020 17:20:15> subcount: 326
292150
zzz <11/16/2020 17:20:23> subcount: 327
253354
zzz <11/16/2020 17:20:48> subcount: 328
15870074355                             <== Interface did not recover after this timeout; no further transmissions occur.
zzz <11/16/2020 17:21:00> subcount: 329
1
zzz <11/16/2020 17:21:07> subcount: 330
0
zzz <11/16/2020 17:21:14> subcount: 331
0
zzz <11/16/2020 17:21:21> subcount: 332
1
zzz <11/16/2020 17:21:29> subcount: 333
0
zzz <11/16/2020 17:21:36> subcount: 334
0

Conclusion

This article explains why network interfaces experience timeouts, how interfaces are recovered after a timeout, and how one can verify whether the recovery task succeeded.


Praveen Kumar Kannoju

