Linux network performance optimization strategies

This article introduces Linux network performance optimization strategies from the bottom up.

00

Network card configuration optimization

Counting from 0 is a basic instinct for programmers.

Network card function configuration

Generally speaking, for the same function, hardware far outperforms software. As hardware evolves, NICs support more and more features, so we should offload work to the hardware whenever possible.

Use ethtool -k to list the features supported by the NIC and their current status. Below is the output from one of the author's virtual machines.
[Figure: ethtool -k output]

Note: the output differs from one NIC to another and also varies slightly between kernel versions.

Generally, the following functions need to be enabled:

  1. rx-checksumming: verify the checksum of received packets.

  2. tx-checksumming: compute the checksum of transmitted packets.

  3. scatter-gather: support scatter-gather DMA, i.e. the data of an outgoing packet need not be contiguous in memory and may be spread across multiple pages.

  4. tcp-segmentation-offload: support segmentation of large TCP packets (TSO).

  5. udp-fragmentation-offload: support automatic fragmentation of UDP packets (UFO).

  6. generic-segmentation-offload: usually enabled together with TSO and UFO. TSO and UFO are performed by the NIC hardware, while GSO is mostly implemented in software at the driver layer in Linux. For forwarding devices I personally recommend not enabling GSO; earlier tests showed that turning it on increases forwarding latency.

  7. rx-vlan-offload: enable it when deployed in a VLAN network environment.

  8. tx-vlan-offload: same as above.

  9. receive-hashing: enable it if you use the software RPS/RFS features.

You can use ethtool -K to enable specific functions.
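As a rough sketch (assuming an interface named eth0; adjust the device name and the feature list to what ethtool -k reports for your NIC):

    # checksum offload, scatter-gather and the segmentation offloads
    ethtool -K eth0 rx on tx on sg on tso on gso on
    ethtool -K eth0 ufo on          # only accepted if the driver supports UFO
    # VLAN offload and hardware receive hashing
    ethtool -K eth0 rxvlan on txvlan on rxhash on
    # verify the result
    ethtool -k eth0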


Network card ring buffer configuration

The default ring buffer of the network card driver is generally not large. Take the author's virtual machine as an example.
[Figure: ethtool -g output]

Both the receive and transmit rings hold 256 descriptors, which is clearly small; under bursty traffic the NIC's receive ring can fill up and packets get dropped.
On high-performance, high-traffic servers or forwarding devices, the rings are generally set to 2048 or higher. And, if the author remembers correctly, the Intel NIC drivers recommend making the transmit ring twice the size of the receive ring, although no reason is given.
Use ethtool -G to set the NIC ring buffer sizes. The author usually sets them to 2048 and 4096; on a forwarding device they may be set even larger.
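A minimal sketch (eth0 assumed; the supported maximums vary by NIC and are shown by ethtool -g):

    # show the current and maximum ring sizes
    ethtool -g eth0
    # enlarge the rings, e.g. 2048 RX / 4096 TX descriptors as suggested above
    ethtool -G eth0 rx 2048 tx 4096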

Interrupt settings

Most modern NICs are multi-queue NICs, and each queue has its own interrupt. To improve parallel processing capability, we want to spread the different interrupts across CPU cores.
Check the hardware interrupt distribution with cat /proc/interrupts.
[Figure: /proc/interrupts output]

In the figure above, the NIC interrupts on the author's virtual machine are spread fairly evenly across the CPU cores.
Next, view the CPU affinity of the corresponding interrupts:
[Figure: smp_affinity of the NIC interrupts]

The interrupts corresponding to different receive/send queues are allocated to CPU0~7.

By default, an interrupt's smp_affinity is usually ff, meaning the interrupt may be delivered to any core. In theory this looks better than pinning each queue to a specific core, but in practice it usually is not; the result depends on the hardware and OS implementation. In the author's experience, with smp_affinity left at ff the hard-interrupt load is never well balanced: interrupts concentrate on a few cores while the other cores receive almost none.
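A hedged sketch of pinning queue interrupts by hand (the IRQ numbers 45-48 are made up for illustration; read the real ones for each rx/tx queue from /proc/interrupts, and note that a running irqbalance daemon may overwrite these values):

    echo 01 > /proc/irq/45/smp_affinity        # queue 0 -> CPU0 (bitmask)
    echo 02 > /proc/irq/46/smp_affinity        # queue 1 -> CPU1
    echo 04 > /proc/irq/47/smp_affinity        # queue 2 -> CPU2
    echo 08 > /proc/irq/48/smp_affinity        # queue 3 -> CPU3
    # alternatively write CPU numbers instead of a bitmask
    echo 3 > /proc/irq/48/smp_affinity_list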

Therefore, in general, we assign the NIC's receive queues to different CPUs one by one. This raises a question: how does the NIC decide which queue to put a packet on?

Network card RSS settings

The NIC also uses a hash to decide which receive queue a packet goes to. We usually cannot change the hash algorithm itself, but we can configure which packet fields feed into the hash, which affects the final result.
Use ethtool -n <dev> rx-flow-hash <proto> (the option is also known as --show-nfc / --show-ntuple) to view the fields used for a given protocol.
Different NICs have different RSS capabilities, and the supported protocols and configurable fields differ as well. Strangely, the default hash key for UDP differs from TCP: only source IP + destination IP are used. As a result, during UDP performance tests driven from a single client machine, all generated UDP packets hash to the same queue, so only one CPU on the server handles the interrupts, which distorts the test results.
Therefore we usually change the UDP RSS key with ethtool -N (--config-nfc / --config-ntuple) so that it includes the ports, like TCP.
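For example (eth0 assumed; not every NIC/driver allows the UDP hash fields to be changed):

    # show which fields feed the RSS hash for UDP over IPv4
    ethtool -n eth0 rx-flow-hash udp4
    # hash on source IP, destination IP, source port and destination port ("sdfn")
    ethtool -N eth0 rx-flow-hash udp4 sdfn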

01

Optimization strategies for the receive path

Now we move on to optimizations on the software side.

NAPI mechanism

Modern Linux network device drivers generally support NAPI, which combines interrupts and polling: one interrupt can be followed by multiple polls of the device, giving the advantages of both. (For pure forwarding devices, polling can of course be used directly.) So how many packets may be polled per interrupt? This is configured through /proc/sys/net/core/netdev_budget, default 300, and the budget is shared by all devices: when the receive softirq runs, several network devices may have raised interrupts and been added to the NAPI list, and those devices share the 300-packet budget. The per-device limit for a single NAPI poll call (the NAPI weight) is usually hard-coded in the driver, typically 64. Both 64 and 300 are upper bounds: if the hardware has no packets ready, polling exits even though the budget is not used up; if packets keep arriving, the receive softirq keeps harvesting them until the budget is exhausted.
When the softirq takes up a lot of CPU, applications on that CPU cannot be scheduled, so the budget should be chosen according to the workload. Because different protocols take different amounts of time to process, controlling softirq CPU usage through a packet count is not very intuitive, so newer kernels introduce netdev_budget_usecs to cap the CPU time of one receive-softirq run.
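For instance (values are examples only; netdev_budget_usecs exists only on newer kernels):

    sysctl net.core.netdev_budget net.core.netdev_budget_usecs
    sysctl -w net.core.netdev_budget=600
    sysctl -w net.core.netdev_budget_usecs=4000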

RPS and RFS
Before multi-queue NICs existed, a NIC could generate only one interrupt, delivered to a single CPU. How, then, could multiple cores be used to process packets in parallel? RPS was born to solve this problem. RPS works like the NIC's RSS, except that the hash is computed in software: the CPU calculates a hash over the packet headers, uses it to pick a target CPU, places the packet on that CPU's receive (backlog) queue, and sends an IPI to notify it. So even if only one CPU receives the hardware interrupt, RPS can redistribute packets across multiple CPUs.
By writing to /sys/class/net/ethX/queues/rx-0/rps_cpus you can choose which CPUs a receive queue may be steered to.
RFS is similar to RPS; the difference is the middle letter, Flow versus Packet, which also hints at how it works. RPS distributes packets purely by the characteristics of the packet itself, while RFS takes the flow into account. The "flow" here is not just the tuple: it also considers the behavior of the application, i.e. the CPU that last processed this flow (where the application read the socket) becomes the target CPU.
Now that multi-queue NICs are common and the RSS hash fields can be tuned with ntuple settings, RPS is not of much use anymore.
Has RFS also become history? I personally think not. Imagine the following scenario: a service S is deployed on an 8-core server, its 6 worker threads occupy CPUs 0-5, and CPUs 6-7 handle other services. Because there are 8 cores, the NIC is typically configured with 8 queues mapped to CPUs 0-7. Now a packet for service S arrives, RSS hashes it to queue 6, and the interrupt is delivered to CPU6, but no worker thread of S runs on CPU6. The packet is appended to the receive buffer of a socket owned by one of the worker threads; on one hand this may contend with the worker thread that is reading the socket at the same time, and on the other hand, when the worker thread eventually reads the data, it has to be pulled into that CPU's cache again. RFS solves this: when a worker thread processes a socket, the kernel records which CPU handled it and saves the mapping in a flow table. Then, even though CPU6 takes the interrupt, it looks up the flow table by the packet's protocol fields, finds that the flow should be handled by, say, CPU3, and puts the packet on CPU3's backlog queue, avoiding the problems above.
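A sketch of the corresponding configuration (eth0 and the sizes are examples; the RFS table sizes should be tuned to the expected number of concurrent flows):

    # RPS: allow receive queue 0 to be steered to CPUs 0-5 (bitmask 0x3f)
    echo 3f > /sys/class/net/eth0/queues/rx-0/rps_cpus
    # RFS: global flow table size
    echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
    # per-queue flow count, commonly rps_sock_flow_entries / number of rx queues
    echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt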

XPS
RPS and RFS establish a mapping from receive queues to processing CPUs. XPS can not only establish a mapping between CPUs and transmit queues, but also between receive queues and transmit queues. With the former, transmit-completion work is handled on the designated CPU; with the latter, the transmit queue is selected based on the receive queue.
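A sketch (eth0 assumed; xps_rxqs is only available on newer kernels):

    # let CPUs 0-1 (bitmask 0x3) use transmit queue 0
    echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
    # or map receive queue 0 to this transmit queue instead
    echo 1 > /sys/class/net/eth0/queues/tx-0/xps_rxqs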

netfilter and nf_conntrack
netfilter is the in-kernel implementation behind the iptables tool. Its performance is mediocre, especially with a large number of rules or when extended match conditions are used, so use it according to your situation. nf_conntrack is the connection-tracking facility netfilter needs to act as a stateful firewall. In early Linux versions the session table was protected by a big global lock, which hurt performance. In production it is generally not recommended to load this module, which means stateful firewalling, NAT, synproxy and the like cannot be used.

early_demux switch
Readers familiar with the Linux kernel know that after receiving a packet, the kernel looks up the routing table to decide whether the packet is delivered locally or forwarded. If it is for the local machine, the kernel then has to find which socket it belongs to according to the layer-4 protocol. That is two lookups. For established TCP connections (and connected UDP sockets) the "connection" already exists and its route can be treated as immutable, so the routing result can be cached on the socket. With /proc/sys/net/ipv4/tcp_early_demux or udp_early_demux enabled, the two lookups can be merged into one: if the layer-4 protocol implements early demux, the kernel looks up the socket first and, if it finds one, directly uses the routing result cached in the socket. On a forwarding device this switch does not need to be turned on, since such a device mainly forwards traffic and runs few local services.
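For example (the separate UDP switch only exists on newer kernels):

    sysctl -w net.ipv4.tcp_early_demux=1
    sysctl -w net.ipv4.udp_early_demux=1
    # on a pure forwarding device both can be left at 0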

Enabling busy_poll
busy_poll was originally called Low Latency Sockets and was introduced to reduce the latency of kernel packet processing. The main idea: during a socket system call such as read, the socket layer directly calls into the driver to poll for packets for a bounded amount of time, which can raise the PPS processing capacity severalfold.
There are two system-level settings for busy_poll. The first, /proc/sys/net/core/busy_poll, sets the busy-poll time budget used inside the select and poll system calls, in microseconds. The second, /proc/sys/net/core/busy_read, sets the busy-poll time budget for read operations, also in microseconds.
Test results show that busy_poll is clearly effective, but it has limitations: the gain only appears when each NIC receive queue is read by exactly one application. If multiple applications busy-poll the same receive queue, a scheduler has to be introduced to arbitrate, which just adds overhead.
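For example (50 us is an illustrative starting value, not a recommendation):

    sysctl -w net.core.busy_poll=50
    sysctl -w net.core.busy_read=50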

Receive buffer size
A Linux socket receive buffer has two relevant settings: the default size and the maximum size. They can be read with sysctl -a | grep rmem_default (or rmem_max), or from /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max. The Linux defaults (a few hundred KB) are on the small side for server programs, so increase both the default and the maximum via sysctl or by writing to the proc files directly, to keep bursty traffic from filling the receive buffer and dropping packets before the application can catch up.
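For example (sizes are illustrative; pick values that match your traffic and memory budget):

    sysctl net.core.rmem_default net.core.rmem_max
    sysctl -w net.core.rmem_default=8388608     # 8 MB
    sysctl -w net.core.rmem_max=67108864        # 64 MB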

TCP configuration parameters
/proc/sys/net/ipv4/tcp_abort_on_overflow: controls the behavior when a TCP connection completes the handshake but the accept (backlog) queue is full. With the default 0, the final ACK is effectively ignored, the server retransmits SYN+ACK and the peer retransmits its ACK; with 1, an RST is sent directly. The former is the gentler treatment but makes a full backlog harder to notice. Choose a value that suits your service.

/proc/sys/net/ipv4/tcp_available_congestion_control: shows the TCP congestion control algorithms supported by the current system (tcp_allowed_congestion_control additionally limits which of them unprivileged processes may select).

/proc/sys/net/ipv4/tcp_congestion_control: configures the TCP congestion control algorithm in use, which must be one of the algorithms shown above.

/proc/sys/net/ipv4/tcp_app_win: used to adjust the cache size at the application layer and the allocation of the TCP window.

/proc/sys/net/ipv4/tcp_dsack: Whether to enable Duplicate SACK.

/proc/sys/net/ipv4/tcp_fast_open: Whether to enable the TCP Fast Open extension. This extension can improve the response time of long-distance communications.

/proc/sys/net/ipv4/tcp_fin_timeout: Used to control the timeout time of waiting for the FIN packet of the opposite end after the local end actively shuts down. It is used to avoid DOS attacks, in seconds.

/proc/sys/net/ipv4/tcp_init_cwnd: initial congestion window size; a larger value can improve transmission efficiency. (This knob is not present on all kernels; on mainline kernels the initial window is usually set per route, e.g. with ip route ... initcwnd.)

/proc/sys/net/ipv4/tcp_keepalive_intvl: The interval for sending keepalive packets.

/proc/sys/net/ipv4/tcp_keepalive_probes: the maximum number of keepalive probes sent before the connection is dropped when no response is received.

/proc/sys/net/ipv4/tcp_keepalive_time: The idle time for TCP connection to send keepalive.

/proc/sys/net/ipv4/tcp_max_syn_backlog: The queue length of the TCP three-way handshake that did not receive the client ack. For the server, it needs to be adjusted to a larger value.

/proc/sys/net/ipv4/tcp_max_tw_buckets: The number of sockets whose TCP is in the TIME_WAIT state, used to defend against simple DOS attacks. After this number is exceeded, the socket will be closed directly.

/proc/sys/net/ipv4/tcp_sack: Set whether to enable SACK, it is enabled by default.

/proc/sys/net/ipv4/tcp_syncookies: Used to prevent syn flood attacks. When the syn backlog queue is full, syncookie will be used to verify the client.

/proc/sys/net/ipv4/tcp_window_scaling: Set whether to enable the TCP window scale extension function. You can notify the other party of a larger receiving window to improve transmission efficiency. Enabled by default.
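A hedged example of applying a few of the parameters above with sysctl (values are illustrative only, not recommendations; the congestion control module must appear in tcp_available_congestion_control):

    sysctl -w net.ipv4.tcp_max_syn_backlog=8192
    sysctl -w net.ipv4.tcp_syncookies=1
    sysctl -w net.ipv4.tcp_fin_timeout=30
    sysctl -w net.ipv4.tcp_keepalive_time=600
    sysctl -w net.ipv4.tcp_congestion_control=cubic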

Commonly used socket options
SO_KEEPALIVE: whether to enable TCP keepalive.

SO_LINGER: sets the timeout for a socket "graceful shutdown" (the author's own term). With the linger option enabled, close() does not return immediately if the socket's send buffer still holds data; it waits until the data is sent or the linger timeout expires. One special case: if linger is enabled with a timeout of 0, an RST is sent to the peer immediately.

SO_RCVBUF: sets the socket's receive buffer size.

SO_RCVTIMEO: sets the timeout for receive operations. Server programs usually run their sockets in non-blocking mode and leave this at its default of 0 (no timeout).

SO_REUSEADDR: controls whether bind() checks for address/port conflicts. For example, once a port has been bound with ANY_ADDR, no other local address can bind the same port. Server programs should enable it so that bind() does not fail when the program restarts.

SO_REUSEPORT: allows several sockets to bind exactly the same address and port. More importantly, when an incoming packet matches multiple sockets bound to the same address and port, the kernel distributes the traffic among them, achieving load balancing.

Other system parameters
Maximum number of file descriptors: for a TCP server, every connection consumes a file descriptor, so the default limits are far from enough. We need to raise both the system-wide and the per-process limits. The former can be set by writing /proc/sys/fs/file-max or with sysctl -w fs.file-max=xxxxxx; the latter with ulimit -n or setrlimit.
CPU binding: bind each thread of the service to a specific CPU. You can use taskset or cgroups (cpuset) to bind a given service thread to a specific CPU or set of CPUs, or call pthread_setaffinity_np from the program itself. Pinning threads to CPUs keeps the caches warm (a high hit rate) on one hand, and lets you shape the CPU load distribution to fit the business on the other.
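A brief sketch (the limits and the server binary name ./my_server are made up for illustration):

    # raise the system-wide and per-process file descriptor limits
    sysctl -w fs.file-max=2097152
    ulimit -n 1048576
    # pin the service's threads to CPUs 0-5
    taskset -c 0-5 ./my_server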

03

Kernel bypass

So far we have been improving Linux network performance by tuning kernel parameters, but for application-layer services several costs remain unavoidable, such as copying data between the kernel and user space. Hence kernel-bypass solutions such as DPDK, netmap and PF_RING were born, DPDK being the most widely used. Compared with the kernel path it has three advantages: 1. it avoids copying data into and out of the kernel; 2. it uses huge pages to improve the TLB hit rate; 3. it polls by default, improving network performance.

However, these packet I/O frameworks do not yet come with a protocol stack and network tooling as complete as the kernel's (although DPDK now ships with many libraries and tools).

Network forwarding devices basically only process layer-2 and layer-3 packets and do not demand much of a protocol stack. Server programs need a more complete stack; current options include DPDK+mtcp, DPDK+fstack and DPDK+Nginx.

Since this article focuses on improving Linux network performance within the kernel, kernel-bypass solutions are only briefly introduced.

