Linux network performance optimization: practical study notes

The following content is based on the Geek Time course on Linux performance optimization; see the original course for the full details.

Network performance

Network performance fundamentals

1. Basic knowledge

After the NIC receives a frame and places it in a buffer, the kernel protocol stack takes the frame from the buffer and processes it layer by layer, from the bottom of the stack up. For example:

  1. The link layer checks the validity of the frame, determines the type of the upper-layer protocol (such as IPv4 or IPv6), strips the frame header and trailer, and hands the payload to the network layer.
  2. The network layer takes out the IP header and decides where the packet goes next, for example whether it is delivered to an upper layer or forwarded. When the network layer confirms the packet is destined for the local machine, it determines the upper-layer protocol type (such as TCP or UDP), removes the IP header, and hands the packet to the transport layer.
  3. The transport layer takes out the TCP or UDP header, uses the <source IP, source port, destination IP, destination port> four-tuple to find the corresponding socket, and copies the data into that socket's receive buffer.
  4. Finally, the application reads the newly received data through the socket interface.
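
You can watch this layer-by-layer parsing from user space with tcpdump; a minimal illustration (eth0 and port 80 are placeholder values):

# Print the IP and TCP header fields the stack parses for HTTP traffic:
# addresses, ports, flags, sequence numbers, and so on
$ tcpdump -i eth0 -nn tcp port 80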

2. NIC indicators

In the ifconfig output, the fourth part shows the numbers of bytes, packets, errors, and dropped packets sent and received on the interface. In particular, when the errors, dropped, overruns, carrier, or collisions counters in the TX and RX sections are non-zero, it usually indicates a network I/O problem. Among them:

  • errors is the number of packets with errors, such as checksum errors or frame alignment errors;
  • dropped is the number of dropped packets, i.e. packets that made it into the ring buffer but were dropped due to insufficient memory or similar reasons;
  • overruns is the number of overrun packets, i.e. the network I/O rate was so high that packets in the ring buffer could not be processed in time (the queue filled up) and were lost;
  • carrier is the number of packets with carrier errors, such as a duplex mode mismatch or a faulty physical cable;
  • collisions is the number of collision packets.
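
These counters can be read with ifconfig or, as sketched below, with ip and ethtool (eth0 is an example device; ethtool -S output varies by driver):

# Per-interface RX/TX statistics, including errors, dropped, and overrun
$ ip -s link show eth0
# Driver-level counters, filtered to errors and drops
$ ethtool -S eth0 | grep -iE 'error|drop'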

3. C10K, C1000K problem optimization

I/O model optimization
  • Level-triggered: as long as the file descriptor is ready for non-blocking I/O, notifications keep being delivered. The application can therefore check the descriptor's status at any time and perform I/O based on that status.
  • Edge-triggered: a notification is delivered only when the file descriptor's status changes (i.e. when an I/O event arrives). The application must then perform as much I/O as possible, stopping only when it can no longer read or write. If it does not perform the I/O, or does not get to it in time, the notification is lost.
  1. The first approach: non-blocking I/O with level-triggered notification, e.g. select or poll.
  2. The second: non-blocking I/O with edge-triggered notification, e.g. epoll.
  3. The third: asynchronous I/O (AIO).
Working model optimization

1. The first type: a main process plus multiple worker child processes. This is the most commonly used model.
Note that accept() and epoll_wait() calls still suffer from the thundering herd problem: when a network I/O event occurs, multiple processes are woken up at the same time, but only one of them actually handles the event, and the other awakened processes go back to sleep. The thundering herd problem for accept() was solved in Linux 2.6; for epoll, it was solved only in Linux 4.5, via EPOLLEXCLUSIVE. To avoid the problem, Nginx uses a global lock (accept_mutex): the worker processes compete for the lock, and only the process that wins it registers with epoll, which guarantees that only one worker is woken up.
2. The second type: a multi-process model in which every process listens on the same port (via the SO_REUSEPORT socket option, supported since Linux 3.9).

4. How to evaluate system network performance

  • At the application layer, you can use wrk, JMeter, etc. to simulate user load and measure the application's requests per second, processing latency, error count, and so on;
  • At the transport layer, you can use tools such as iperf to test TCP throughput;
  • Further down, you can use pktgen, which ships with the Linux kernel, to test the server's PPS.
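
For example, a TCP throughput test with iperf3 might look like the following sketch (192.168.0.30 and port 10000 are placeholders):

# On the server: listen on port 10000
$ iperf3 -s -p 10000
# On the client: test for 15 seconds, reporting every 3 seconds
$ iperf3 -c 192.168.0.30 -p 10000 -t 15 -i 3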

5. How to mitigate DDoS

  1. Limit the number of new connections and the SYN rate per source IP:
# Limit the rate of new SYN packets to 1 per second
$ iptables -A INPUT -p tcp --syn -m limit --limit 1/s -j ACCEPT
# Reject a source IP that opens more than 10 new connections to port 80 within 60 seconds
$ iptables -I INPUT -p tcp --dport 80 --syn -m recent --name SYN_FLOOD --update --seconds 60 --hitcount 10 -j REJECT
  2. Increase the size of the half-open connection queue (the default is 256):
$ sysctl -w net.ipv4.tcp_max_syn_backlog=1024
net.ipv4.tcp_max_syn_backlog = 1024
  3. For each connection in the SYN_RECV state, the kernel automatically retransmits the SYN+ACK when it gets no response; the default number of retries is 5. You can reduce it to 1 with:
$ sysctl -w net.ipv4.tcp_synack_retries=1
  4. Turn on SYN cookies:
$ sysctl -w net.ipv4.tcp_syncookies=1
net.ipv4.tcp_syncookies = 1

6. Delayed ACK on the client (TCP_QUICKACK disabled) plus Nagle's algorithm enabled on the server increases latency

The two mechanisms interact badly: Nagle's algorithm on the server waits for an ACK of the outstanding data before sending more small packets, while the client's delayed ACK holds that ACK back, so each exchange can stall for up to the delayed-ACK timeout (about 40 ms on Linux). The usual fix is to set TCP_NODELAY on the server to disable Nagle's algorithm.

7. How to optimize NAT

Mainly adjust the default parameters of the connection tracking table. The key parameters are:

  • net.netfilter.nf_conntrack_count, the current number of tracked connections;
  • net.netfilter.nf_conntrack_max, the maximum number of tracked connections;
  • net.netfilter.nf_conntrack_buckets, the size of the connection tracking hash table.
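
A quick sketch of checking and raising these limits (the new value is illustrative; each tracked connection costs memory, as computed below):

# Read the current values (requires the nf_conntrack module to be loaded)
$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max net.netfilter.nf_conntrack_buckets
# Raise the ceiling when "nf_conntrack: table full, dropping packet" shows up in dmesg
$ sysctl -w net.netfilter.nf_conntrack_max=131072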

Memory size occupied by the connection tracking table:

# With a connection tracking object size of 376 bytes and a hash list entry size of 16 bytes:
nf_conntrack_max * object size + nf_conntrack_buckets * list entry size
= 1000 * 376 + 65536 * 16 B
≈ 1.4 MB

Reference materials:
https://mp.weixin.qq.com/s/VYBs8iqf0HsNg9WAxktzYQ

8. Common ideas for network performance optimization

Application

  • Use I/O multiplexing such as epoll, asynchronous I/O (AIO), and the like;
  • Use long-lived connections instead of short ones. This significantly reduces the cost of TCP connection setup, and the effect is especially pronounced when the request rate is high.
  • Cache infrequently changing data in memory or similar stores. This reduces the number of network I/O operations and speeds up the application's responses.
  • Use a compact serialization format such as Protocol Buffers to reduce the amount of data going through network I/O and improve application throughput.
  • Use DNS caching, prefetching, HTTPDNS, and similar techniques to reduce DNS resolution latency and improve the overall speed of network I/O.

Socket

Each socket has a receive and a send buffer, and the kernel caps their sizes. To improve network throughput, you usually need to enlarge these buffers. For example:

  • Increase the per-socket buffer size limit net.core.optmem_max;
  • Increase the maximum socket receive buffer size net.core.rmem_max and send buffer size net.core.wmem_max;
  • Increase the TCP receive buffer range net.ipv4.tcp_rmem and send buffer range net.ipv4.tcp_wmem.
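
A minimal sketch of enlarging them (the values are illustrative; see the note on the bandwidth-delay product just below):

$ sysctl -w net.core.optmem_max=81920
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.wmem_max=16777216
# tcp_rmem and tcp_wmem each take three values: min, default, max (in bytes)
$ sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'
$ sysctl -w net.ipv4.tcp_wmem='4096 65536 16777216'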

For example, the ideal send buffer size is bandwidth * delay (the bandwidth-delay product), which achieves maximum network utilization.
In addition, the socket interface also provides some configuration options to modify the behavior of the network connection:

  • Setting TCP_NODELAY on a TCP connection disables Nagle's algorithm;
  • Enabling TCP_CORK on a TCP connection aggregates small packets into larger ones before sending (note that this delays the sending of small packets);
  • SO_SNDBUF and SO_RCVBUF set the socket send buffer and receive buffer sizes, respectively.

Transport layer

The first category: when the request rate is high, you may see a large number of connections in the TIME_WAIT state, which consume a lot of memory and port resources. In this case you can optimize the kernel options related to TIME_WAIT, for example with the following measures (a combined sketch follows the list).

  • Increase the number of allowed TIME_WAIT connections net.ipv4.tcp_max_tw_buckets, and increase the size of the connection tracking table net.netfilter.nf_conntrack_max.
  • Reduce net.ipv4.tcp_fin_timeout and net.netfilter.nf_conntrack_tcp_timeout_time_wait so the system releases the resources they occupy sooner.
  • Enable port reuse with net.ipv4.tcp_tw_reuse, so that ports held by TIME_WAIT connections can be used for new connections.
  • Widen the local port range net.ipv4.ip_local_port_range to support more connections and higher overall concurrency.
  • Increase the maximum number of file descriptors: fs.nr_open raises the per-process maximum and fs.file-max the system-wide maximum; alternatively, set LimitNOFILE in the application's systemd unit file.
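
A sketch of these settings together (values are illustrative, not recommendations):

$ sysctl -w net.ipv4.tcp_max_tw_buckets=1048576
$ sysctl -w net.ipv4.tcp_fin_timeout=15
$ sysctl -w net.ipv4.tcp_tw_reuse=1
$ sysctl -w net.ipv4.ip_local_port_range='10000 65000'
$ sysctl -w fs.nr_open=1048576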

The second category: to mitigate performance problems caused by attacks that exploit TCP protocol features, such as SYN flood, you can optimize the kernel options related to the SYN state, e.g. increasing net.ipv4.tcp_max_syn_backlog, reducing net.ipv4.tcp_synack_retries, and enabling net.ipv4.tcp_syncookies, as shown in the DDoS section above.

The third category: in long-lived connection scenarios, TCP keepalive is usually used to probe the state of a connection, so that it can be reclaimed automatically after the peer disconnects. However, the system's default keepalive probe interval and retry count generally cannot meet the application's performance requirements, so you need to tune the keepalive-related kernel options, for example (a combined sketch follows the list):

  • Shorten the idle time between the last data packet and the first keepalive probe, net.ipv4.tcp_keepalive_time;
  • Shorten the interval between keepalive probes, net.ipv4.tcp_keepalive_intvl;
  • Reduce the number of failed probes before the connection is declared dead and the application is notified, net.ipv4.tcp_keepalive_probes.
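
For example (values are illustrative):

$ sysctl -w net.ipv4.tcp_keepalive_time=600   # seconds of idle time before the first probe
$ sysctl -w net.ipv4.tcp_keepalive_intvl=30   # seconds between probes
$ sysctl -w net.ipv4.tcp_keepalive_probes=3   # failed probes before the connection is dropped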

UDP is a datagram-oriented protocol: it needs no connection setup and provides no reliability guarantees, so UDP optimization is much simpler than TCP. Here are several common optimizations (a combined sketch follows the list).

  • As in the socket section above, increase the socket buffer sizes and the UDP buffer range;
  • As in the TCP section above, widen the range of local port numbers;
  • Adjust UDP datagram sizes to the MTU to reduce or avoid fragmentation.
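
A sketch under the same illustrative-value caveat:

$ sysctl -w net.core.rmem_max=26214400
# udp_mem takes three values: min, pressure, max (in pages)
$ sysctl -w net.ipv4.udp_mem='765000 1020000 1530000'
$ sysctl -w net.ipv4.ip_local_port_range='10000 65000'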

Link layer

The interrupt handlers (especially soft interrupt handlers) that run after the NIC receives packets consume a lot of CPU, so scheduling them across different CPUs can significantly improve network throughput. This can usually be done in the following two ways:

  • Configure CPU affinity for the NIC's hard interrupts (smp_affinity), or enable the irqbalance service.
  • Enable RPS (Receive Packet Steering) and RFS (Receive Flow Steering) to schedule the application and the soft interrupt handling onto the same CPU, which increases the CPU cache hit rate and reduces network latency. A configuration sketch follows.
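
A sketch of steering receive processing, assuming a device eth0 with a single receive queue rx-0 (the CPU mask and table sizes are illustrative):

# Steer rx-0 soft interrupt processing to CPUs 0-3 (hex bitmask)
$ echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# Size of the global RFS socket flow table
$ sysctl -w net.core.rps_sock_flow_entries=32768
# Per-queue RFS flow count
$ echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt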

In addition, modern NICs are very feature-rich: work that used to be done in software by the kernel can be offloaded to the NIC and executed in hardware.

  • TSO (TCP Segmentation Offload) and UFO (UDP Fragmentation Offload): the protocol stack sends large packets directly, and the NIC performs the TCP segmentation (by MSS) and UDP fragmentation (by MTU).
  • GSO (Generic Segmentation Offload): when the NIC does not support TSO/UFO, the segmentation of TCP/UDP packets is delayed until just before they enter the NIC. This not only reduces CPU consumption but also means that on packet loss only the lost segments need to be retransmitted.
  • LRO (Large Receive Offload): when receiving TCP segments, the NIC assembles and merges them before handing them to the upper layers of the network stack. Note that LRO must be disabled when IP forwarding is needed: if the header information of the merged packets is inconsistent, LRO merging causes packet checksum errors.
  • GRO (Generic Receive Offload): GRO fixes the flaws of LRO and is more general; it supports both TCP and UDP.
  • RSS (Receive Side Scaling): also known as multi-queue receive. It distributes receive processing across the hardware's multiple receive queues, so that multiple CPUs can process incoming packets.
  • VXLAN offload: the NIC performs the VXLAN encapsulation (see the ethtool sketch below).
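
You can inspect and toggle these features with ethtool; a hedged example (eth0 is a placeholder, and not every NIC supports every feature):

# Show the current offload settings
$ ethtool -k eth0 | grep -iE 'segmentation|offload'
# Enable TSO, GSO, and GRO on NICs that support them
$ ethtool -K eth0 tso on gso on gro on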

Note: TCP has a timestamp option, tcp_timestamps, used for RTT measurement and PAWS validation. Avoid enabling it in a NAT environment (in particular, combined with tcp_tw_recycle it can cause connections from clients behind the same NAT to be dropped). You can check whether it is enabled with cat /proc/sys/net/ipv4/tcp_timestamps.

Common network commands

1. Socket information ss or netstat

# head -n 3 shows only the first 3 lines
# -l shows listening sockets only
# -n shows numeric addresses and ports (instead of names)
# -p shows process information
$ netstat -nlp | head -n 3
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 840/systemd-resolve

# -l shows listening sockets only
# -t shows TCP sockets only
# -n shows numeric addresses and ports (instead of names)
# -p shows process information
$ ss -ltnp | head -n 3
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=840,fd=13))
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=1459,fd=3))

Pay special attention to the receive queue (Recv-Q) and send queue (Send-Q): they should usually be 0. When they are not, network packets are piling up somewhere. Note also that their meaning differs depending on the socket state.
When the socket is connected (ESTABLISHED):

  • Recv-Q is the number of bytes in the socket buffer that the application has not yet read (the length of the receive queue).
  • Send-Q is the number of bytes that the remote host has not yet acknowledged (the length of the send queue).

When the socket is listening (LISTEN):

  • Recv-Q is the current length of the full connection (accept) queue.
  • Send-Q is the maximum length of the full connection queue.

A "full connection" is one for which the server has received the client's ACK and completed the TCP three-way handshake; the connection is then moved into the full connection queue. Sockets in this queue must be taken by the accept() system call before the server can actually process the client's requests. Alongside the full connection queue there is also a half-open (semi-connection) queue, for connections that have not yet completed the three-way handshake: after the server receives a client's SYN packet, it puts the connection in the half-open queue and replies with a SYN+ACK.
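
A sketch of checking the queues and their overflow counters (the grep patterns match typical netstat -s output):

# On LISTEN sockets, Recv-Q/Send-Q show the current and maximum accept queue length
$ ss -ltn
# Count full connection queue overflows and dropped SYNs (half-open queue pressure)
$ netstat -s | grep -iE 'overflowed|SYNs to LISTEN'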

2. Protocol stack information ss -s or netstat -s
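
For example, as a hedged illustration:

# Per-socket-type summary (totals of TCP/UDP/RAW sockets, etc.)
$ ss -s
# Per-protocol counters; retransmissions are a common thing to check
$ netstat -s | grep -i retrans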

3. Network throughput and PPS (sar)

# The trailing 1 means output a new set of statistics every second
$ sar -n DEV 1
Linux 4.15.0-1035-azure (ubuntu) 01/06/19 _x86_64_ (2 CPU)
13:21:40 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
13:21:41 eth0 18.00 20.00 5.79 4.25 0.00 0.00 0.00 0.00
13:21:41 docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
13:21:41 lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

The output has many columns; briefly, their meanings are:

  • rxpck/s and txpck/s are the receive and transmit PPS, in packets/second.
  • rxkB/s and txkB/s are the receive and transmit throughput, in KB/second.
  • rxcmp/s and txcmp/s are the numbers of compressed packets received and transmitted, in packets/second.
  • %ifutil is the network interface utilization: (rxkB/s + txkB/s) / Bandwidth in half-duplex mode, and max(rxkB/s, txkB/s) / Bandwidth in full-duplex mode.
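
The Bandwidth used for %ifutil can be queried from the NIC with ethtool (eth0 is an example device):

# Query the link speed (a gigabit NIC reports "Speed: 1000Mb/s")
$ ethtool eth0 | grep Speed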


Source: blog.csdn.net/zimu312500/article/details/114239806