How ethtool works and how to troubleshoot network card packet loss

Foreword

I previously wrote up a case of packet loss caused by soft interrupts when the network card of an LVS machine was under excessive traffic load: The practice of multi-queue performance tuning for RPS and RFS network cards [1]. Most people are unlikely to run into that problem when the load is not high. This time I want to share how to troubleshoot common packet loss on server network cards. Point-to-point packet loss along a network path is a broader topic; for that, you may wish to refer to the earlier article How to use MTR to diagnose network problems [2]. On Linux, the commonly used tool for analysing network card packet loss is naturally ethtool.

ethtool is used to view and modify the driver parameters and hardware settings of network devices (especially wired Ethernet devices). You can change the parameters of the Ethernet card according to your needs, including parameters such as auto-negotiation, speed, duplex, and wake-on-lan. By configuring the Ethernet card, your computer can communicate efficiently over the network. This tool provides a lot of information about Ethernet devices connected to your Linux system.

1. Understand the process of receiving packets

Receiving data packets is a complex process involving many underlying technical details, but generally requires the following steps:

  1. The NIC receives the packet.
  2. The packet is transferred from the NIC's hardware buffer to server memory.
  3. The kernel is notified to process it.
  4. The TCP/IP protocol stack processes it layer by layer.
  5. The application reads the data from the socket buffer through read().

Transfer packets received by NIC to host memory (NIC interacts with driver)

After the NIC receives a data packet, it first has to hand the data over to the kernel, and the bridge between the two is the rx ring buffer, an area shared by the NIC and the driver. The rx ring buffer does not store the actual packet data, only descriptors, each of which points to the address where the data is really stored. The process is as follows:

  1. The driver allocates a buffer in memory to receive data packets, called sk_buff;
  2. The address and size of that buffer (that is, the receive descriptor) are added to the rx ring buffer. The buffer address in the descriptor is the physical address used by DMA;
  3. The driver notifies the network card that there is a new descriptor;
  4. The network card takes the descriptor from the rx ring buffer to learn the address and size of the buffer;
  5. The network card receives a new data packet;
  6. The network card writes the new data packet directly into the sk_buff through DMA.

When the driver cannot keep up with the rate at which the network card receives packets, it does not allocate buffers in time, so packets received by the NIC cannot be written to an sk_buff promptly and start to accumulate. Once the NIC's internal buffer is full, further data is discarded, causing packet loss. This kind of loss is counted as rx_fifo_errors; it shows up as growth in the fifo field of /proc/net/dev and in the overruns counter of ifconfig.
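As a rough illustration of reading the fifo field mentioned above, here is a minimal sketch. The helper name rx_fifo is hypothetical, and the sketch assumes the usual /proc/net/dev column order on the receive side (bytes, packets, errs, drop, fifo, ...).

```shell
# Hypothetical helper: print the receive-side fifo counter for one interface.
# Reads text in /proc/net/dev format on stdin; $1 is the interface name.
rx_fifo() {
    awk -v ifname="$1" '
        {
            line = $0
            sub(/:/, " ", line)              # "eth0:123" -> "eth0 123"
            n = split(line, f, " ")
            # f[1]=name f[2]=bytes f[3]=packets f[4]=errs f[5]=drop f[6]=fifo
            if (n >= 6 && f[1] == ifname) print f[6]
        }
    '
}
```

One way to use it is to run rx_fifo eth0 < /proc/net/dev twice, a few seconds apart, and compare the two numbers. A value that keeps growing suggests the NIC's internal buffer is overflowing.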

Notify the system kernel for processing (the driver interacts with the Linux kernel)

At this point the data packet has been transferred to the sk_buff. As mentioned earlier, this is a buffer that the driver allocated in memory and that was written by DMA. DMA writes the data to memory without involving the CPU, which means the kernel does not actually know that there is new data in memory. So how does the kernel find out that new data has arrived? The answer is the interrupt, which tells the kernel that new data has come in and needs to be processed.

Interrupts come in two kinds, hard interrupts and soft interrupts. Briefly, they differ as follows:

  • Hard interrupt: generated by the hardware itself and arriving at unpredictable times. When the CPU receives a hard interrupt, it runs the corresponding interrupt handler. The handler only does the critical work that can be finished quickly; the remaining time-consuming work is deferred and completed by a soft interrupt. The hard interrupt part is also known as the top half.
  • Soft interrupt: raised by the interrupt handler of the hard interrupt. It is usually built into the code in advance and is not random. (There are also application-triggered soft interrupts, which are unrelated to the packet reception discussed in this article.) It is also known as the bottom half.

When the NIC has copied the packet into the kernel buffer sk_buff by DMA, it immediately raises a hardware interrupt. When the CPU receives it, execution first enters the top half: the interrupt handler for the network card interrupt, which is part of the network card driver. The handler then raises a soft interrupt, and the bottom half starts consuming the data in the sk_buff and hands it to the kernel protocol stack for processing.

Interrupts let the system respond to the network card quickly, but with a large amount of data they generate a flood of interrupt requests, and the CPU spends most of its time servicing interrupts, which is very inefficient. To solve this problem, current kernels and drivers process data with a mechanism called NAPI (New API). In simple terms it is interrupt plus polling: when traffic is heavy, after one interrupt the driver receives a batch of packets by polling before returning, which avoids repeated interrupts.
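One place where the effect of this polling can be observed is /proc/net/softnet_stat, which the article itself does not cover. The sketch below is an assumption-laden illustration: the helper name softnet_totals is hypothetical, and it relies on the file having one line per CPU with hexadecimal columns, of which the first three are packets processed, packets dropped, and time_squeeze (the number of times a polling round ended with work still pending).

```shell
# Hypothetical helper: sum the first three hex columns of softnet_stat.
# Reads text in /proc/net/softnet_stat format on stdin (one line per CPU).
softnet_totals() {
    processed=0 dropped=0 squeezed=0
    while read -r p d s _rest; do
        [ -n "$s" ] || continue              # skip blank or short lines
        processed=$((processed + 0x$p))
        dropped=$((dropped + 0x$d))
        squeezed=$((squeezed + 0x$s))
    done
    echo "processed=$processed dropped=$dropped time_squeeze=$squeezed"
}
```

It can be run as softnet_totals < /proc/net/softnet_stat. Growing dropped or time_squeeze values would hint that soft interrupt processing is not keeping up.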

2. Interpreting ifconfig output

[root@localhost ~]# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.1.135 netmask 255.255.255.0 broadcast 192.168.1.255
        inet6 fe80::20c:29ff:fe9b:52d3 prefixlen 64 scopeid 0x20<link>
        ether 00:0c:29:9b:52:d3 txqueuelen 1000 (Ethernet)
        RX packets 833 bytes 61846 (60.3 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 122 bytes 9028 (8.8 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

  • RX errors

Indicates the total number of receive errors, including too-long frames, Ring Buffer overflows, CRC errors, frame synchronization errors, FIFO overruns, and missed packets.

  • RX dropped

Indicates that the packet made it into the Ring Buffer but was discarded while being copied into memory because of system-level problems such as insufficient memory.

  • RX overruns

Indicates FIFO overruns, which occur when the Ring Buffer (aka Driver Queue) receives more I/O than the kernel can handle. The Ring Buffer here refers to the buffer that sits in front of the IRQ request. An increase in overruns means that packets are being discarded at the network card's physical layer before they ever reach the Ring Buffer, and a CPU that cannot service interrupts in time is one of the reasons the Ring Buffer fills up. The problematic machine lost packets because its interrupts were unevenly distributed (all of them landed on core0) and no interrupt affinity had been configured.

  • RX frame

Represents misaligned frames.
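To see whether interrupts are piling up on one core, as described under RX overruns, the per-CPU counts in /proc/interrupts can be summed for a given network card. This is a minimal sketch: the helper name irq_per_cpu is hypothetical, and it assumes the first line of the input is the CPU header row and that the IRQ lines of interest contain the interface name.

```shell
# Hypothetical helper: sum per-CPU interrupt counts for matching IRQ lines.
# Reads text in /proc/interrupts format on stdin; $1 is a substring to match,
# for example an interface name.
irq_per_cpu() {
    awk -v pat="$1" '
        NR == 1 {                            # header row: CPU0 CPU1 ...
            ncpu = NF
            for (i = 1; i <= NF; i++) cpu[i] = $i
            next
        }
        index($0, pat) > 0 {
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++) printf "%s %d\n", cpu[i], total[i]
        }
    '
}
```

Running irq_per_cpu eth0 < /proc/interrupts prints one total per CPU. If almost everything lands on CPU0, the interrupts are not being spread across cores.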

3. How the network card works

Network card sending packets

The network card driver adds a 14-byte MAC header to the IP packet to form a frame (without CRC). The frame (without CRC) contains the MAC addresses of the sender and the receiver. Since the MAC header is created by the driver, the addresses can be filled in arbitrarily, which also makes it possible to impersonate another host.

The driver program copies the frame (without CRC) to the buffer inside the network card chip, which is processed by the network card.

The network card chip wraps the incomplete frame (without CRC) into a packet that can be sent, that is, it adds the header synchronization information and the CRC, and then puts it on the network cable, completing the transmission of one IP packet. All network cards on the network can see the packet.

Network card receiving packets

A packet on the wire is first picked up by the network card, which verifies the packet's CRC to ensure integrity and then strips the packet header to obtain the frame. The network card checks the destination MAC address in the frame; if it differs from the card's own MAC address, the frame is discarded (except in promiscuous mode).

The network card copies the frame to its internal FIFO buffer and triggers a hardware interrupt. (For a network card with a ring buffer, it seems the frame can be stored in the ring buffer first and a soft interrupt triggered afterwards; the next article will explain in detail the path a frame takes through Linux. The ring buffer is shared by the network card and the driver. It is memory used by the device but visible to the operating system: the network card drivers in the Linux kernel source allocate it with kcalloc, so the ring buffer generally has an upper limit. Also, the ring buffer size should denote the number of frames it can hold, not a size in bytes. On some systems the ethtool command cannot change the ring parameters to resize the ring buffer. I don't know why; perhaps the driver does not support it.)

The network card driver builds an sk_buff in the hard interrupt handler, copies the frame from the network card FIFO into the in-memory skb, and then hands it to the kernel for processing. (Network cards that support NAPI should place frames directly in the ring buffer without triggering hard interrupts, and instead use soft interrupts to copy the data out of the ring buffer and pass it straight to the upper layer. In one soft interrupt pass, each network card can process up to weight frames.)

During the process, the network card chip performs MAC filtering on the frame to reduce the system load. (except promiscuous mode)

NIC interrupt handler

Every device that generates an interrupt has a corresponding interrupt handler that is part of the device driver. Each network card has an interrupt handler, which is used to notify the network card that the interrupt has been received, and copy the data packet in the network card buffer to the memory.

When the network card receives a packet from the network, it needs to notify the kernel that the packet has arrived. The NIC issues an interrupt immediately. The kernel responds by executing the interrupt handler function that the network card has registered. The interrupt handler starts executing, notifies the hardware, copies the latest network packet to memory, and then reads more packets from the network card.

These are important, urgent, and hardware-related tasks. The kernel usually needs to quickly copy network packets to system memory, because the buffer size of the network card receiving network packets is fixed, and it is much smaller than the system memory. So once the above copying action is delayed, it will inevitably cause the FIFO buffer of the network card to overflow - the incoming data packets fill up the buffer of the network card, and subsequent packets can only be discarded, which should be the source of overrun in ifconfig.

Once the network packet has been copied to system memory, the interrupt's work is done, and control returns to the program that was running before the interrupt.

Buffer access

The network card's kernel buffer lives in the PC's memory and is controlled by the kernel, while the network card itself has a FIFO buffer or ring buffer; the two should be kept distinct. The FIFO is relatively small, and whenever it holds data the card tries to move that data into the kernel buffer.

The buffer in the network card belongs to neither kernel space nor user space. It is a hardware buffer that provides buffering between the network card and the operating system;

The kernel buffer is in the kernel space, in the memory, used for the kernel program, as a data buffer read from or written to the hardware;

The user buffer is in user space, in memory, for user programs, as a data buffer for reading from or writing to the hardware;

In addition, in order to speed up data interaction, the kernel buffer can be mapped to the user space, so that the kernel program and the user program can access this area at the same time.

For a network card with a ring buffer, the ring buffer is shared by the driver and the network card, so the kernel can access it directly. The kernel generally copies the frames into its own kernel space for processing and delivers them to the upper-layer protocols; from then on only the skb pointer is passed along until the user obtains the data. So for a ring buffer network card, most of the copying happens when frames move from the ring buffer into the memory controlled by the kernel.

4. Packet Loss Troubleshooting Ideas

The network card works at the data link layer, which performs some checks and encapsulates data into frames. We can look at whether those checks report errors to determine whether transmission itself has a problem, and then, at the software level, whether packets are being lost because a buffer is too small.

Check the hardware first

If a machine frequently raises packet loss alarms, first check whether there is a problem at the lower layers:

  1. Check whether the working mode is normal
[root@localhost ~]# ethtool eth0 | grep -E 'Speed|Duplex'
Speed: 1000Mb/s
Duplex: Full
  2. Check whether CRC verification is normal
[root@localhost ~]# ethtool -S eth0 | grep crc
rx_crc_errors: 0

Speed, Duplex, CRC and the like are all fine, and physical interference can basically be ruled out.
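The working mode check above can be wrapped in a small filter. The following is only a sketch: link_summary is a hypothetical helper name, and treating anything other than Duplex: Full as suspicious is a simplifying assumption.

```shell
# Hypothetical helper: summarise Speed and Duplex from "ethtool <iface>" output.
# Reads the ethtool output on stdin and flags a duplex value other than Full.
link_summary() {
    awk -F': *' '
        /Speed:/  { speed = $2 }
        /Duplex:/ { duplex = $2 }
        END {
            status = (duplex == "Full") ? "ok" : "check"
            printf "speed=%s duplex=%s status=%s\n", speed, duplex, status
        }
    '
}
```

For example, ethtool eth0 | link_summary gives a one-line result that is easy to collect from many machines.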

overruns and buffer size

for i in `seq 1 100`; do ifconfig eth2 | grep RX | grep overruns; sleep 1; done

RX packets:346547657 errors:0 dropped:0 overruns:35345 frame:0
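The loop above prints the raw line every second, whereas what matters is whether the overruns counter keeps growing. As a sketch, the number can be pulled out with a hypothetical helper, rx_overruns, which assumes one of the two ifconfig layouts shown in this article (overruns:35345 or overruns 0).

```shell
# Hypothetical helper: print the RX overruns counter from ifconfig output.
# Reads "ifconfig <iface>" output on stdin; handles both "overruns:N"
# and "overruns N" layouts.
rx_overruns() {
    awk '
        /RX/ && /overruns/ {
            gsub(/:/, " ")                   # normalise "overruns:35345"
            for (i = 1; i < NF; i++)
                if ($i == "overruns") { print $(i + 1); exit }
        }
    '
}
```

Taking two samples, for example before=$(ifconfig eth2 | rx_overruns), then sleep 10, then after=$(ifconfig eth2 | rx_overruns), and printing $((after - before)) shows how many packets overran during the interval.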

-g, --show-ring: Queries the specified ethernet device for rx/tx ring parameter information.
-G, --set-ring: Changes the rx/tx ring parameters of the specified ethernet device.

[root@localhost ~]# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256

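Increase the rx and tx ring sizes (the values must not exceed the pre-set maximums), then query again to confirm:

[root@localhost ~]# ethtool -G eth0 rx 2048
[root@localhost ~]# ethtool -G eth0 tx 2048
[root@localhost ~]# ethtool -g eth0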
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048
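Before changing the ring size it helps to compare the current RX value with the pre-set maximum. The sketch below does this from the ethtool -g output; ring_rx is a hypothetical helper name, and it assumes the two section headings appear exactly as in the output shown above.

```shell
# Hypothetical helper: report current and maximum RX ring size.
# Reads "ethtool -g <iface>" output on stdin.
ring_rx() {
    awk '
        /Pre-set maximums/          { section = "max" }
        /Current hardware settings/ { section = "cur" }
        $1 == "RX:"                 { rx[section] = $2 }
        END {
            printf "rx_current=%d rx_max=%d\n", rx["cur"], rx["max"]
            if (rx["cur"] + 0 < rx["max"] + 0)
                print "rx ring can still be increased"
        }
    '
}
```

For example, ethtool -g eth0 | ring_rx. The RX Mini and RX Jumbo lines are ignored because their first field is RX, without the colon.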

Red Hat official solution

Issue

Why is rx_crc_errors incrementing in the receive counters of the ethtool -S output?

$ ethtool -S <Interface_name> | grep -i error
     rx_error_bytes: 0
     tx_error_bytes: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 9244
     rx_align_errors: 0

Resolution

  1. Change the cable.
  2. Check the switch configuration.
  3. Change the network interface card.

Root Cause

  1. Most of the time, an incrementing rx_crc_errors value means the problem is in Layer 1 of the networking model.
  2. When a packet is received at the interface, it goes through a data integrity check called a cyclic redundancy check (CRC). If the packet fails that check, it is counted in rx_crc_errors.
  3. The switch was forcing the NIC to operate in half-duplex mode. Configuring the switch to tell the NIC to operate in full-duplex mode resolved the issue.

Diagnostic Steps

Check the ethtool -S output and find where the drops and errors are.

$ ethtool -S <Interface_name> | grep -i error
     rx_error_bytes: 0
     tx_error_bytes: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 9244  >>>>>>
     rx_align_errors: 0

Check the numbers corresponding to rx_crc_errors.
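When the statistics list is long, it can be easier to print only the error counters that are not zero. A minimal sketch, assuming the name: value layout shown above; nonzero_errors is a hypothetical helper name.

```shell
# Hypothetical helper: list error counters with a non-zero value.
# Reads "ethtool -S <iface>" output on stdin.
nonzero_errors() {
    awk -F': *' '
        tolower($1) ~ /error/ && $2 + 0 > 0 {
            name = $1
            gsub(/^[ \t]+/, "", name)        # drop leading indentation
            print name "=" ($2 + 0)
        }
    '
}
```

For example, ethtool -S eth0 | nonzero_errors. With the sample output above it would print only rx_crc_errors=9244.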

View the basic settings of the interface:

$ ethtool p1p1

Settings for p1p1:
 Supported ports: [ FIBRE ]
 Supported link modes:   10000baseT/Full
 Supported pause frame use: Symmetric
 Supports auto-negotiation: No
 Supported FEC modes: Not reported
 Advertised link modes:  10000baseT/Full
 Advertised pause frame use: Symmetric
 Advertised auto-negotiation: No
 Advertised FEC modes: Not reported
 Speed: 10000Mb/s
 Duplex: Full
 Port: FIBRE
 PHYAD: 0
 Transceiver: internal
 Auto-negotiation: off
 Supports Wake-on: d
 Wake-on: d
 Current message level: 0x00000007 (7)
          drv probe link
 Link detected: yes

The output shows the port type, link mode, speed and other information for p1p1, and whether a cable is currently connected (for a copper network cable, Supported ports shows TP; for optical fiber it shows FIBRE). Three fields are the most important:

Supported ports: [ FIBRE ]
Speed: 10000Mb/s
Link detected: yes

$ ethtool -S p1p1 | grep -i error
     rx_errors: 0
     tx_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     rx_length_errors: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_csum_offload_errors: 0

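To physically identify a port, ethtool -p blinks the LED of the given interface:

$ ethtool -p <Interface_name>
$ ethtool -p eth0

To query driver and firmware information:

$ ethtool -i p1p1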

driver: ixgbe
version: 5.1.0-k-rh7.6
firmware-version: 0x80000960, 18.3.6
expansion-rom-version:
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

To set the link speed manually, for example to 100 Mb/s:

$ ethtool -s eth0 speed 100

 

Origin blog.csdn.net/youzhangjing_/article/details/131455694