Linux: Fake Tx hang detected

The following messages show up in dmesg:

[424948.577401] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds
[424949.535143] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds
[424955.536045] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 10 seconds
[424955.567988] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 10 seconds
[424957.579250] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 10 seconds
[424957.579285] ixgbe 0000:3b:00.1 eth3: Fake Tx hang detected with timeout of 10 seconds
[424958.568923] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds
[424959.526676] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds
[424975.489166] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 20 seconds
[424975.553019] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 20 seconds
[424977.532376] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 20 seconds
[424977.532409] ixgbe 0000:3b:00.1 eth3: Fake Tx hang detected with timeout of 20 seconds

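The function that handles the Tx timeout is shown below. Note that the log lines above come from the ixgbe driver, while the code quoted here is the fm10k driver's handler, which prints the same message: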

static void fm10k_tx_timeout(struct net_device *netdev)
{
    struct fm10k_intfc *interface = netdev_priv(netdev);
    bool real_tx_hang = false;
    int i;

#define TX_TIMEO_LIMIT 16000
    for (i = 0; i < interface->num_tx_queues; i++) {
        struct fm10k_ring *tx_ring = interface->tx_ring[i];

        if (check_for_tx_hang(tx_ring) && fm10k_check_tx_hang(tx_ring))
            real_tx_hang = true;
    }

    if (real_tx_hang) {
        fm10k_tx_timeout_reset(interface);
    } else {
        netif_info(interface, drv, netdev,
               "Fake Tx hang detected with timeout of %d seconds\n",
               netdev->watchdog_timeo / HZ);

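        /* fake Tx hang - increase the kernel timeout.
         * The timeout doubles on each fake hang until it reaches
         * TX_TIMEO_LIMIT (16000 jiffies, i.e. 16 s assuming HZ=1000);
         * in the log above it goes 5 -> 10 -> 20 seconds.
         */
        if (netdev->watchdog_timeo < TX_TIMEO_LIMIT)
            netdev->watchdog_timeo *= 2;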
    }
}
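
To make the 5 -> 10 -> 20 progression concrete, here is a small user-space sketch of the doubling rule. It is not driver code: next_watchdog_timeo and fake_hang_sequence are hypothetical helper names, and the jiffies-per-second value is passed in as a parameter because the real HZ of the machine in the log is not known (1000 is assumed in the example call).

```c
#include <assert.h>
#include <stddef.h>

#define TX_TIMEO_LIMIT 16000 /* same constant as in the driver, in jiffies */

/* One fake-hang event: double the timeout unless it already reached the limit. */
static long next_watchdog_timeo(long watchdog_timeo)
{
    if (watchdog_timeo < TX_TIMEO_LIMIT)
        watchdog_timeo *= 2;
    return watchdog_timeo;
}

/*
 * Fill seconds[0..n-1] with the value that would be printed on each of n
 * consecutive fake hangs, starting from start_seconds.
 * hz is an assumption (jiffies per second), not read from any kernel.
 */
static void fake_hang_sequence(int start_seconds, int hz, int *seconds, size_t n)
{
    long timeo = (long)start_seconds * hz;
    size_t i;

    assert(hz > 0);
    for (i = 0; i < n; i++) {
        seconds[i] = (int)(timeo / hz);     /* watchdog_timeo / HZ in the message */
        timeo = next_watchdog_timeo(timeo); /* applied after the message is printed */
    }
}
```

Calling fake_hang_sequence(5, 1000, out, 4) fills out with 5, 10, 20, 20: once the timeout is at or above 16000 jiffies it stops growing. With a smaller HZ the same 16000-jiffy limit corresponds to a longer wall-clock time, so the sequence would keep doubling further.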

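The key function that decides whether the NIC is really hung is fm10k_tx_timeout. If the condition check_for_tx_hang(tx_ring) && fm10k_check_tx_hang(tx_ring) holds for any Tx ring, the timeout is treated as a real hang and the interface is reset; otherwise it is reported as a fake hang.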

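check_for_tx_hang(tx_ring) is normally true, since the flag is generally set as early as probe time (note that fm10k_check_tx_hang clears it again through clear_check_for_tx_hang, so it has to be set again before the next check). The code of fm10k_check_tx_hang is as follows: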

bool fm10k_check_tx_hang(struct fm10k_ring *tx_ring)
{
    u32 tx_done = fm10k_get_tx_completed(tx_ring);
    u32 tx_done_old = tx_ring->tx_stats.tx_done_old;
    u32 tx_pending = fm10k_get_tx_pending(tx_ring, true);

    clear_check_for_tx_hang(tx_ring);

    /* Check for a hung queue, but be thorough. This verifies
     * that a transmit has been completed since the previous
     * check AND there is at least one packet pending. By
     * requiring this to fail twice we avoid races with
     * clearing the ARMED bit and conditions where we
     * run the check_tx_hang logic with a transmit completion
     * pending but without time to complete it yet.
     */
    if (!tx_pending || (tx_done_old != tx_done)) {
        /* not a hang: no packets are pending, or tx_done has advanced since the previous check */
        /* update completed stats and continue */
        tx_ring->tx_stats.tx_done_old = tx_done;
        /* reset the countdown */
        clear_bit(__FM10K_HANG_CHECK_ARMED, &tx_ring->state);

        return false;
    }

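    /* make sure it is true for two checks in a row: test_and_set_bit()
     * returns the previous value of the bit, so the first stalled check
     * only arms it and returns false, and the second one returns true
     */
    return test_and_set_bit(__FM10K_HANG_CHECK_ARMED, &tx_ring->state);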
}
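
The two-strike behaviour is easier to see in isolation. The sketch below models the same decision in user space; struct ring_model and model_check_tx_hang are hypothetical names introduced only for illustration, and the armed field stands in for the __FM10K_HANG_CHECK_ARMED bit (a plain bool here, without the atomicity of the real bit operations).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of struct fm10k_ring used by the check. */
struct ring_model {
    uint32_t tx_done;     /* descriptors completed so far */
    uint32_t tx_done_old; /* value of tx_done seen at the previous check */
    uint32_t tx_pending;  /* descriptors queued but not yet completed */
    bool armed;           /* models __FM10K_HANG_CHECK_ARMED */
};

/* Same decision logic as fm10k_check_tx_hang, applied to the model above. */
static bool model_check_tx_hang(struct ring_model *r)
{
    bool was_armed;

    if (!r->tx_pending || r->tx_done_old != r->tx_done) {
        r->tx_done_old = r->tx_done; /* remember the progress made */
        r->armed = false;            /* reset the countdown */
        return false;
    }

    /* models test_and_set_bit(): report the old value, then set the bit */
    was_armed = r->armed;
    r->armed = true;
    return was_armed;
}
```

A ring with nothing pending never reports a hang, however often it is checked. A ring with pending descriptors and no new completions returns false on the first check and true on the second; any completion in between clears the armed state and starts the count again.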

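These hang messages are usually accompanied by CPU soft lockup reports. If a CPU is stuck in a soft lockup and the Tx queues are bound to CPUs, the Tx queue that belongs to that CPU has no pending packets, so fm10k_check_tx_hang returns false and the timeout is reported as a fake hang. If the queues are not bound, a Tx queue may be used by several CPUs; if a hang still shows up in that case, check whether the lock of that Tx queue has been taken and never released.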

Reposted from www.cnblogs.com/10087622blog/p/9558024.html