Troubleshooting a DPDK packet-transmission failure in a production load-balancing product

ULB4 is a high-availability layer-4 load balancing product that UCloud developed in-house on top of DPDK, a high-performance open-source data plane development kit; its forwarding capability is close to line rate. As the global entry point for user applications, ULB4 must keep user services continuously stable under high traffic and in diverse scenarios; that is the technical mission of the UCloud network product team. A single ULB cluster on the live network now carries 10G of bandwidth at 830,000 PPS in a complex operating environment. Even in the face of unexpected factors (such as triggering an unknown bug), we must do everything possible to keep the product running normally and avoid serious impact.

Recently, we discovered a DPDK packet-transmission anomaly in the online ULB4 environment. Because the ULB product has a cluster architecture, the anomaly did not make user services unavailable. Still, to guarantee sufficient stability for user services at all times, the team captured the abnormal packets out of gigabytes of live traffic using GDB, a packet export tool, and production traffic mirroring, then analyzed the DPDK source code to locate the cause: a bug in DPDK itself, which has since been fixed. User business was unaffected throughout, further safeguarding the stable operation of tens of thousands of UCloud ULB instances.

Starting from the symptoms, this article walks through the whole process of locating, analyzing, and solving the problem, in the hope of offering reference and inspiration to ULB users and DPDK developers.

Problem background

In early December, a failover suddenly occurred in a ULB4 cluster that had been running stably: one ULB4 server began working abnormally and was automatically removed from the cluster. The symptoms at the time were:

The forwarding-plane service's monitoring showed normal traffic in the NIC's receive direction but zero traffic in the transmit direction. After the forwarding-plane service was restarted, the machine could send and receive normally again. Meanwhile, other machines in the cluster also experienced the same anomaly from time to time. On the user side, a small number of connections saw slight jitter and then recovered quickly.

The following is the whole troubleshooting process. We made various attempts along the way and finally completed the analysis and solution with the help of the DPDK source code. We also plan to open-source the packet export tool we developed.

Problem location and analysis

The ULB4 cluster had been working stably, yet the same problem suddenly appeared on different machines, and after a recovered machine rejoined the cluster, the problem recurred after a while. Based on our operating experience, our preliminary guess was that some kind of abnormal packet was triggering a program bug. But how could we capture the abnormal packets out of gigabytes of traffic, and how could we find the problem without affecting the business?

1. Debugging with GDB reveals a suspicious spot

To understand why the program stopped sending packets, the best way is to step into it and watch the actual execution. For a DPDK user-space program, GDB is the obvious tool. We set breakpoints in the send path and viewed the function's execution with the disassemble command; the disassembly ran to more than 700 lines, because many of the functions it calls are declared inline and expand into a large number of instructions.

Stepping through instruction by instruction against the source of the corresponding DPDK version, we found after many attempts that execution always returned early at the place shown in the figure below.

The general flow is that when i40e_xmit_pkts() finds the transmit queue full while sending, it calls i40e_xmit_cleanup() to reclaim queue slots. In DPDK, after the NIC finishes sending a packet it writes a specific field back into the descriptor to indicate completion, and the driver checks that field to tell whether the packet has been sent. The problem here was that the driver believed the packets in the queue had never been sent out by the NIC, so subsequent packets could not be added to the queue and were dropped directly.
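To make that cleanup logic concrete, here is a minimal sketch of the mechanism. It is not the real i40e driver code; the struct, field, and function names are illustrative. The shape of the check is the point: reclaiming a slot requires the NIC to have written a completion flag back into the descriptor, and when that write-back never happens, the queue looks permanently full and every new packet is dropped.

```c
#include <stdint.h>

#define DD_FLAG 0x1ULL  /* "descriptor done" bit the NIC writes back */

struct tx_desc {
    uint64_t cmd_type_offset_bsz;  /* NIC sets DD_FLAG here on completion */
};

struct tx_queue {
    struct tx_desc *ring;   /* descriptor ring */
    uint16_t nb_desc;       /* ring size */
    uint16_t next_to_clean; /* oldest in-flight descriptor */
    uint16_t nb_free;       /* slots available to software */
};

/* Reclaim one completed descriptor. Returns 1 on success, 0 if the NIC
 * has not written the DD flag back -- after a tx hang this never
 * succeeds, so the queue appears permanently full. */
static int tx_cleanup_one(struct tx_queue *q)
{
    struct tx_desc *d = &q->ring[q->next_to_clean];

    if (!(d->cmd_type_offset_bsz & DD_FLAG))
        return 0;
    d->cmd_type_offset_bsz = 0;
    q->next_to_clean = (uint16_t)((q->next_to_clean + 1) % q->nb_desc);
    q->nb_free++;
    return 1;
}

/* Enqueue path: with no free slot and cleanup failing, the packet is
 * dropped -- exactly the "transmit direction is 0" symptom. */
static int tx_enqueue(struct tx_queue *q)
{
    if (q->nb_free == 0 && !tx_cleanup_one(q))
        return -1;  /* queue full: drop */
    q->nb_free--;
    return 0;
}
```

In this model, flipping DD_FLAG on the oldest descriptor plays the role of the NIC's write-back; while it stays clear, tx_enqueue() fails forever, which is what we were seeing.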

So the direct cause was found: for some reason the NIC either did not send the packets or failed to write back the completion field correctly, so the driver considered the transmit queue permanently full and could not enqueue any subsequent packets.

But why was the queue full? Was an abnormal packet involved? With this question in mind, we made a second attempt.

2. One-click export of the NIC queue packets

A full queue that accepts no new packets means the packets already in the queue are stuck there. Since we suspected an abnormal packet, could it still be sitting in the queue? If we could export every packet currently in the queue, we could verify the guess further.

Based on our study of DPDK, we exported the packets in the following steps.

  • Looking at the i40e_xmit_pkts() function, its first parameter is the transmit queue, so the queue information can be obtained from it.
  • As shown in the figure below, on first hitting the breakpoint we inspected the registers to obtain the function's arguments.
  • When printing the queue's packets we found there was no symbol information; loading the i40e_rxtx.o object produced during compilation, as shown in the figure below, supplied the missing symbols.
  • With the queue information in hand, we used GDB's dump command to export every packet in the queue in order, naming each file by its position in the queue.

  • The exported data was still raw packet bytes, which Wireshark cannot open directly. So, as shown in the figure below, we wrote a small tool with the libpcap library to convert it into a pcap file that Wireshark can parse.
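The conversion step can also be sketched without libpcap, since the pcap container is just a fixed 24-byte global header plus a 16-byte record header per packet (magic 0xa1b2c3d4, version 2.4, linktype 1 for Ethernet); those constants come from the pcap file format, the rest of the code is our own illustration rather than the actual tool:

```c
#include <stdio.h>
#include <stdint.h>

/* Wrap raw packet bytes (as dumped from the TX queue with GDB's "dump"
 * command) in pcap framing so Wireshark can open them. Our tool used
 * libpcap; writing the two fixed headers by hand, as here, gives the
 * same result for a single packet. */
static int write_pcap(const char *path, const uint8_t *pkt, uint32_t len)
{
    /* pcap global header: magic, version 2.4, timezone 0, sigfigs 0,
     * snaplen, linktype 1 = Ethernet. All members are naturally
     * aligned, so the struct is exactly 24 bytes with no padding. */
    struct {
        uint32_t magic;
        uint16_t major, minor;
        int32_t  thiszone;
        uint32_t sigfigs, snaplen, network;
    } gh = { 0xa1b2c3d4u, 2, 4, 0, 0, 65535u, 1 };

    /* Per-packet record header: timestamp (zeroed here), captured
     * length, original length. */
    struct {
        uint32_t ts_sec, ts_usec, incl_len, orig_len;
    } rh = { 0, 0, len, len };

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(&gh, sizeof gh, 1, f) != 1 ||
        fwrite(&rh, sizeof rh, 1, f) != 1 ||
        fwrite(pkt, len, 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

Looping this record-plus-payload step over every dumped file merges a whole queue into one capture.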

Sure enough, as shown in the figure below, among the exported packets was one 26 bytes long whose content was all zeros. It looked highly abnormal and seemed to be a first confirmation of our guess:

To speed up packet export during troubleshooting, we wrote a one-click tool that, whenever an anomaly occurs, exports all queued packets and converts them to pcap format in a single step.

After exporting packets several more times, we found a pattern: each time there was a 26-byte all-zero packet, immediately preceded by a packet of the same length, and each time the source IP addresses came from network segments in the same region.


3. Traffic mirroring confirms the abnormal packet

The conclusion from step two moved the whole investigation a big step forward, but the queued packets had already passed through a series of programs and were not the original business packets. Refusing to give up before reaching the goal, we turned to mirrored packet capture: we urgently asked our network operations colleagues to configure port mirroring on the switch, mirroring the traffic destined for the ULB4 cluster to an idle server for capture. The mirror server itself also needed special configuration, as follows:

  1. Set the NIC to promiscuous mode to collect the mirrored traffic (ifconfig net2 promisc).
  2. Disable GRO (ethtool -K net2 gro off), so that we collect the most original packets and prevent Linux's GRO from coalescing packets in advance.

Based on the geographic pattern of the abnormal source IPs, we captured only the traffic from certain source IP segments.

Reference command: nohup tcpdump -i net2 -s0 -w %Y%m%d_%H-%M-%S.pcap -G 1800 "proto gre and (((ip[54:4]&0x11223000)==0x11223000) or ((ip[58:4]&0x11223000)==0x11223000))" &

After many attempts the hard work paid off: a fault occurred, and after layers of filtering and screening we found the following packet:

It is an IP fragment, but strangely the second fragment contains only an IP header. Careful comparison showed that these two packets correspond exactly to the two adjacent packets in the exported queue, and the final 26 bytes match the all-zero packet exactly.

We know that in TCP/IP, when an IP packet longer than the MTU is sent, IP fragmentation splits it into several smaller fragments for transmission, and under normal circumstances every fragment carries data. But this fragment is highly abnormal: its total length is 20, meaning it consists of nothing but an IP header and carries no payload at all, which makes the packet meaningless. Because the packet is shorter than the minimum frame length, the switch also padded it with zeros, producing the 26-byte all-zero packet.

At this point we had finally found the abnormal packet, largely verifying our guess; it still remained to confirm in practice that this packet was indeed the cause. (Looking at the whole exchange, the packet was originally sent as a TCP segment with fragmentation disallowed, but a public-network gateway along the path forcibly allowed fragmentation, and the resulting fragments took this unusual form.)

4. The solution

If the abnormal packet was indeed the cause, it would suffice to check for and discard such packets on receive. We therefore modified our DPDK program to drop them. As verification, we first deployed the change to one online server; after a day of operation, no failover occurred. Having confirmed the root cause, namely that this kind of abnormal packet made DPDK misbehave, we rolled the change out to the whole network as a gray release.
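The check itself is simple once the anomaly is understood. The following sketch is modeled on, but not copied from, that fix; the struct and function names are our own. A fragment whose IP total length leaves no room for payload beyond the header is dropped before reassembly can chain it to a cached first fragment:

```c
#include <stdint.h>
#include <arpa/inet.h>  /* ntohs */

/* Minimal model of the start of an IPv4 header; only the fields this
 * check needs. Names are ours, not DPDK's. */
struct ipv4_hdr_min {
    uint8_t  version_ihl;   /* version (high nibble) + IHL in 32-bit words */
    uint8_t  tos;
    uint16_t total_length;  /* network byte order */
};

/* Returns 1 if the fragment carries no payload beyond its IP header
 * and should be discarded on receive, 0 if it is a normal fragment. */
static int frag_is_empty(const struct ipv4_hdr_min *ip)
{
    uint16_t hlen = (uint16_t)((ip->version_ihl & 0x0fu) * 4u);
    uint16_t tlen = ntohs(ip->total_length);

    return tlen <= hlen;  /* e.g. total length 20 with a 20-byte header */
}
```

The header-only fragment we captured (total length 20, IHL 5) fails this test, while any fragment carrying data passes.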

5. Feedback to the DPDK community

Out of a sense of responsibility to the open-source community, we prepared to report the bug to the DPDK community. Comparing against the latest commits, we found one submitted on November 6 describing exactly the same situation:

ip_frag: check fragment length of incoming packet

The issue is already fixed in the latest DPDK 18.11 release, and the fix is consistent with our handling: the abnormal packet is discarded.

Review and summary

With the problem fully handled, we carried out an overall review.

1. Why ULB could not send packets

The full sequence that left ULB4 unable to send packets is as follows:

  1. DPDK received the first fragment of the fragmented packet and cached it while waiting for the remaining fragments;
  2. The abnormal second fragment, containing only an IP header, arrived. DPDK handled it through the normal path without checking and discarding it, so the rte_mbuf structures of the two fragments were chained together and returned to ULB4 as one chained packet;
  3. Because the total length of the resulting packet was below the threshold that requires fragmentation, ULB4 passed it straight to DPDK's send interface;
  4. DPDK did not check the abnormal packet either and handed it directly to the user-space NIC driver for transmission;
  5. Sending this abnormal packet triggered a tx hang in the NIC;
  6. Once the tx hang occurred, the NIC stopped working and no longer set the completion flag on the send descriptors in the driver queue;
  7. Subsequent packets kept arriving and piled up in the transmit queue until it was completely full, after which every new packet was dropped directly.
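Step 2 above is where a defensive check could have broken the chain of events: reassembly links fragment buffers into an mbuf chain, and the header-only fragment becomes a segment carrying zero payload bytes. Here is a minimal sketch of such a pre-send sanity check, using a toy struct that mimics only the two rte_mbuf fields involved (next and data_len); it is an illustration, not the real DPDK structure or API:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for a chained packet buffer. */
struct seg {
    struct seg *next;   /* next segment in the chain, NULL at the end */
    uint16_t data_len;  /* payload bytes carried by this segment */
};

/* Pre-send sanity check over a segment chain: returns -1 if any
 * segment carries zero bytes (the header-only fragment that drove the
 * NIC into tx hang), 0 if every segment carries data. */
static int chain_check(const struct seg *s)
{
    for (; s != NULL; s = s->next)
        if (s->data_len == 0)
            return -1;
    return 0;
}
```

Run before handing a chain to the driver, such a check turns a NIC-level hang into a single dropped packet.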

2. Why the abnormal packet triggered the NIC tx hang

First, let's look at the DPDK code related to packet transmission on the NIC.

As the figure above shows, it is essential to set the relevant descriptor fields correctly according to the NIC's datasheet. If a field is set incorrectly for any reason, the consequences are unpredictable (see the NIC datasheet for details).

As shown in the figure below, the NIC's datasheet describes each of these fields, and the NIC driver generally defines a matching data structure.

With this basic understanding, we wondered: if we construct a similar abnormal packet directly in a program, will it also make the NIC stop sending?

The answer is yes.

As shown in the figure below, we used a small code fragment to build such an abnormal packet and sent it directly through the DPDK interface; the NIC soon entered tx hang.

3. Thoughts on operating hardware directly

Operating hardware directly calls for great caution. In a traditional Linux system, the driver lives in kernel space, is managed by the kernel, and performs extensive exception handling in its code, so a user program rarely puts the hardware into a non-working state. DPDK, by contrast, uses user-space drivers that operate the hardware directly, and it applies many optimizations for performance; a problem in the user's own program can therefore drive the NIC into an abnormal state such as a tx hang.

4. The value of tooling

We wrote a tool that exports the packets in the DPDK driver queue with one click, so every time the problem occurred we could quickly dump everything in the NIC driver's transmit queue, which greatly improved troubleshooting efficiency. Once polished, the tool will be open-sourced on UCloud's GitHub, in the hope of helping other DPDK developers.

Closing remarks

As an open-source kit, DPDK's stability and reliability are usually not in question, but real application scenarios vary endlessly, and some special situations can make DPDK misbehave. Although the probability is very small, DPDK usually sits at a critical gateway position, where even a rare problem has serious impact once it occurs.

It is therefore of great significance for a technical team to understand how DPDK works, analyze its source code, and be able to locate its problems step by step from concrete symptoms; all of this improves the service reliability of the whole DPDK program. It is also worth mentioning that ULB4's high-availability cluster architecture played an important role in resolving this incident: when one machine became unavailable, the other machines in the cluster continued to provide reliable service, effectively safeguarding the reliability of users' business.

Original link: https://mp.weixin.qq.com/s/LTJO

Origin blog.csdn.net/weixin_60043341/article/details/126650328