The following messages show up in dmesg:
[424948.577401] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds
[424949.535143] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds
[424955.536045] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 10 seconds
[424955.567988] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 10 seconds
[424957.579250] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 10 seconds
[424957.579285] ixgbe 0000:3b:00.1 eth3: Fake Tx hang detected with timeout of 10 seconds
[424958.568923] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds
[424959.526676] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds
[424975.489166] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 20 seconds
[424975.553019] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 20 seconds
[424977.532376] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 20 seconds
[424977.532409] ixgbe 0000:3b:00.1 eth3: Fake Tx hang detected with timeout of 20 seconds
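The function that handles the Tx timeout is shown below. Note that the log above comes from ixgbe, while the listing is the handler from the fm10k driver, which prints the same message and follows the same pattern: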
static void fm10k_tx_timeout(struct net_device *netdev)
{
    struct fm10k_intfc *interface = netdev_priv(netdev);
    bool real_tx_hang = false;
    int i;

#define TX_TIMEO_LIMIT 16000
    for (i = 0; i < interface->num_tx_queues; i++) {
        struct fm10k_ring *tx_ring = interface->tx_ring[i];

        if (check_for_tx_hang(tx_ring) && fm10k_check_tx_hang(tx_ring))
            real_tx_hang = true;
    }

    if (real_tx_hang) {
        fm10k_tx_timeout_reset(interface);
    } else {
        netif_info(interface, drv, netdev,
                   "Fake Tx hang detected with timeout of %d seconds\n",
                   netdev->watchdog_timeo / HZ);

        /* fake Tx hang - increase the kernel timeout */
        /* Note: the timeout doubles on every fake hang until it reaches or
         * exceeds TX_TIMEO_LIMIT (16000 jiffies, i.e. 16 s when HZ is 1000).
         * In the log above it goes 5 -> 10 -> 20 seconds.
         */
        if (netdev->watchdog_timeo < TX_TIMEO_LIMIT)
            netdev->watchdog_timeo *= 2;
    }
}
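fm10k_tx_timeout is the key function that decides whether the NIC is really hung. If the condition check_for_tx_hang(tx_ring) && fm10k_check_tx_hang(tx_ring) holds for any Tx ring, the event is treated as a real hang and the interface is reset; otherwise it is reported as a fake hang.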
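The back-off in the fake-hang branch can be reproduced in user space. The following is only a sketch: MODEL_HZ, next_watchdog_timeo and print_backoff are hypothetical names introduced for illustration, and HZ is assumed to be 1000, which the source does not state.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: HZ is 1000, so one jiffy is one millisecond. */
#define MODEL_HZ        1000
/* Same constant as in the driver listing; the unit is jiffies. */
#define TX_TIMEO_LIMIT  16000

/* Mirrors the fake-hang branch: double the timeout while it is below the limit. */
unsigned int next_watchdog_timeo(unsigned int watchdog_timeo)
{
    if (watchdog_timeo < TX_TIMEO_LIMIT)
        watchdog_timeo *= 2;
    return watchdog_timeo;
}

/* Prints the timeout that each successive fake-hang report would show. */
void print_backoff(unsigned int start_seconds, int events)
{
    unsigned int timeo = start_seconds * MODEL_HZ;

    assert(events >= 0);
    for (int i = 0; i < events; i++) {
        printf("fake hang %d: timeout of %u seconds\n",
               i + 1, timeo / MODEL_HZ);
        timeo = next_watchdog_timeo(timeo);
    }
}
```

Under these assumptions, print_backoff(5, 4) reports 5, 10, 20 and 20 seconds: once the timeout has passed 16000 jiffies it is no longer doubled, which matches the 5, 10, 20 progression in the log.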
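check_for_tx_hang(tx_ring) can normally be assumed to hold, since the driver sets that flag itself (typically from probe time onward). The code of fm10k_check_tx_hang is as follows: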
bool fm10k_check_tx_hang(struct fm10k_ring *tx_ring)
{
    u32 tx_done = fm10k_get_tx_completed(tx_ring);
    u32 tx_done_old = tx_ring->tx_stats.tx_done_old;
    u32 tx_pending = fm10k_get_tx_pending(tx_ring, true);

    clear_check_for_tx_hang(tx_ring);

    /* Check for a hung queue, but be thorough. This verifies
     * that a transmit has been completed since the previous
     * check AND there is at least one packet pending. By
     * requiring this to fail twice we avoid races with
     * clearing the ARMED bit and conditions where we
     * run the check_tx_hang logic with a transmit completion
     * pending but without time to complete it yet.
     */
    /* Note: true when nothing is pending, or when tx_done has advanced
     * since the previous check, i.e. the queue is idle or making progress.
     */
    if (!tx_pending || (tx_done_old != tx_done)) {
        /* update completed stats and continue */
        tx_ring->tx_stats.tx_done_old = tx_done;
        /* reset the countdown */
        clear_bit(__FM10K_HANG_CHECK_ARMED, &tx_ring->state);

        return false;
    }

    /* make sure it is true for two checks in a row */
    /* Note: test_and_set_bit returns the previous value, so this is true
     * only when ARMED was already set by an earlier stalled check.
     */
    return test_and_set_bit(__FM10K_HANG_CHECK_ARMED, &tx_ring->state);
}
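The "two stalled checks in a row" rule is easier to see in a stripped-down model. The sketch below is not driver code: struct model_ring and model_check_tx_hang are hypothetical names, the ARMED bit is modelled as a plain bool, and the atomic bit operations are replaced by ordinary assignments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few ring fields the check uses. */
struct model_ring {
    unsigned int tx_done;      /* descriptors completed so far */
    unsigned int tx_done_old;  /* tx_done as seen by the previous check */
    unsigned int tx_pending;   /* descriptors queued but not yet completed */
    bool armed;                /* models the __FM10K_HANG_CHECK_ARMED bit */
};

/* Same decision structure as fm10k_check_tx_hang(), without atomics. */
bool model_check_tx_hang(struct model_ring *r)
{
    /* Idle queue, or progress since the last check: not hung, disarm. */
    if (!r->tx_pending || r->tx_done_old != r->tx_done) {
        r->tx_done_old = r->tx_done;
        r->armed = false;
        return false;
    }

    /* Stalled with work pending: report a hang only if already armed. */
    bool was_armed = r->armed;  /* like test_and_set_bit(): read old value */
    r->armed = true;            /* ...then set the bit */
    return was_armed;
}
```

In this model, a ring with pending work and an unchanged tx_done returns false on the first call and true on the second. Any completion between the two calls, or an empty queue, disarms it again, so the timeout handler would report a fake hang rather than a real one.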
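These NIC hang messages are usually accompanied by CPU soft lockup reports. If a CPU is in a soft lockup and the Tx queues are pinned to CPUs, the Tx queue that belongs to the stuck CPU ends up with no pending packets, so fm10k_check_tx_hang returns false and the timeout is reported as a fake hang. If the queues are not pinned, a Tx queue may be used by several CPUs; if the hang still occurs in that case, check whether the lock for that Tx queue has been taken and never released.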