Project stress testing

The relationship between several key metrics:
QPS = concurrency / average response time
Concurrency = QPS * average response time
In other words, the number of concurrent connections reflects how much load the server can withstand and how many connections it can accept, while QPS reflects how fast the server processes requests at a given concurrency: the shorter the response time, the higher the QPS.
Higher concurrency does not automatically mean higher QPS; QPS usually peaks at some appropriate level of concurrency, where the server handles the most requests per second.
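As a quick worked example (numbers invented purely for illustration): with 1000 concurrent connections and an average response time of 50 ms,

$$ \text{QPS} = \frac{\text{concurrency}}{\text{average response time}} = \frac{1000}{0.05\,\text{s}} = 20000 $$

so halving the response time at the same concurrency would roughly double the QPS.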

First, get familiar with a few stress-testing tools: webbench, ab, and wrk.

A standard webbench run tells us two things about the server: the number of requests handled per second (QPS) and the amount of data transferred per second. webbench can test not only static pages but also dynamic pages (ASP, PHP, Java, CGI).
Its principle is to fork many child processes; each child repeatedly opens connections to the target host and port. You set the concurrency, the number of requests, and the duration yourself, and the results are finally aggregated in the parent process.
webbench can simulate up to about 30,000 concurrent connections to test a site's load capacity.
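To make the fork-and-aggregate idea concrete, here is a minimal sketch of a webbench-style load generator, not webbench's actual source: the names (run_child, the pipe protocol, the fixed GET request) are made up for illustration. Each child opens short connections against the target for a fixed duration and reports its request count to the parent over a pipe; it does not even read the responses.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <ctime>

// One child: open connections against ip:port for `seconds`, count completed requests.
static long run_child(const char* ip, int port, int seconds) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    const char* req = "GET / HTTP/1.0\r\n\r\n";
    long ok = 0;
    time_t end = time(nullptr) + seconds;
    while (time(nullptr) < end) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) continue;
        if (connect(fd, (sockaddr*)&addr, sizeof(addr)) == 0 &&
            send(fd, req, strlen(req), 0) > 0)
            ++ok;                       // count a sent request
        close(fd);                      // short connection: close every time
    }
    return ok;
}

int main() {
    const int clients = 300, seconds = 10;   // "concurrency" and test duration
    int pipefd[2];
    pipe(pipefd);                            // children report their counts here
    for (int i = 0; i < clients; ++i) {
        if (fork() == 0) {                   // child process
            long n = run_child("127.0.0.1", 8080, seconds);
            write(pipefd[1], &n, sizeof(n));
            _exit(0);
        }
    }
    long total = 0, n = 0;
    for (int i = 0; i < clients; ++i) {      // parent aggregates the results
        read(pipefd[0], &n, sizeof(n));
        total += n;
    }
    while (wait(nullptr) > 0) {}
    printf("requests in %d s: %ld (%.0f QPS)\n", seconds, total,
           (double)total / seconds);
}
```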

WebBench installation
First download webbench from http://home.tiscali.cz/cz210552/distfiles/webbench-1.5.tar.gz (or get it from GitHub)
Unpack it: tar xvzf webbench-1.5.tar.gz
Enter the webbench directory
Switch to the root account: su root
Install: make && make install
Test command:
webbench -c 300 -t 60 http://test.domain.com/phpinfo.php
webbench -c <concurrency> -t <test duration in seconds> URL
This is equivalent to 300 clients (one connection per child process) requesting the URL continuously for 60 seconds.

The following results will appear

Webbench - Simple Web Benchmark 1.5
Copyright (c) Radim Kolar 1997-2004, GPL Open Source Software.

Benchmarking: GET http://test.domain.com/phpinfo.php
300 clients, running 60 sec.

Speed=24525 pages/min, 20794612 bytes/sec.
Requests: 24525 susceed, 0 failed.

Requests handled: Speed = 24525 pages/min (about 409 requests per second), with 20794612 bytes/sec of data transferred.
Request counts: 24525 succeeded, 0 failed.

Changing the concurrency to 1000:
Speed=24920 pages/min, 21037312 bytes/sec.
Requests: 24833 susceed, 87 failed.

The failed requests indicate that the server is overloaded at this concurrency.

Problems encountered during stress testing
1. Is muduo's concurrency limit really this low?
When stress testing the muduo library, requests started failing at a concurrency of 1000, which is not normal; 10k should be no problem.
Because muduo's logging is very good, the problem showed up immediately in the log: Acceptor::accept was failing with "too many open files".
Checking with top showed that CPU usage was normal, only around 45%, so the load itself was not too heavy. Checking file descriptor usage with lsof showed that only 1024 descriptors were open... That was the problem: by default Linux allows a process to open only 1024 file descriptors, which is obviously not enough for a server that is supposed to handle 1000 (C1K) concurrent connections.
Below are two ways to raise the open-file limit: editing /etc/security/limits.conf (persists across reboots, applied at login) and ulimit (only for the current session).

1. sudo vim /etc/security/limits.conf
Add the following two lines at the bottom (sets the maximum number of file descriptors a process may open to 65536):
* soft nofile 65536
* hard nofile 65536

2. ulimit -n 65536 (sets the maximum file descriptor count for the current user's session; it resets to the default 1024 after a reboot)
3. ulimit -n or ulimit -a (check the current maximum number of open file descriptors)
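You can also check or raise the limit from inside the server process at startup. A minimal sketch using getrlimit/setrlimit, which raises the soft limit up to the current hard limit; going beyond the hard limit still needs the limits.conf change above or root privileges:

```cpp
#include <sys/resource.h>
#include <cstdio>

int main() {
    rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    // Raise the soft limit up to the hard limit (65536 if limits.conf was changed).
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
}
```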

For reference, muduo with short connections should reach around 31000 QPS.

To solve the previous problem, we need to understand the factors that affect the number of concurrent connections.
First, the hardware sets the fundamental limits. For long connections the achievable concurrency is related to how many connections can be carried, which in turn relates to the number of threads: in theory, the more threads you create, the more concurrency you can serve. But thread creation is limited by memory, since every thread occupies stack space; with 4 GB of memory, a few hundred threads (at the default 8 MB stack size) will exhaust it, so in practice you cannot create that many. For CPU-bound tasks, where IO can be ignored, the thread count should match the number of CPU cores so the cores stay fully utilized; for IO-bound tasks, a common rule of thumb is threads = cores * (1 + IO time / CPU time). For example, on 4 cores with tasks that spend three times as long on IO as on CPU, this suggests 4 * (1 + 3) = 16 threads. Too many threads cause frequent context switching and lower CPU utilization. Besides CPU cores there is also network bandwidth, which often becomes the bottleneck.
After excluding hardware, there are a few specific factors to consider. The first is the number of file descriptors available: each TCP connection needs one file descriptor, and a process can open only 1024 by default. That alone caps you at roughly 1000 connections, since the process itself uses up some descriptors, so remember to raise this default.
The second is resource leaks: if file descriptors and other resources are opened and never closed, you get memory leaks and descriptor exhaustion. The server's memory usage climbs steadily, new descriptors are consumed on every request, and within tens of seconds the process blows up and is killed. (Solution: smart pointers; how do smart pointers solve memory leaks? See the next item.)

2. Memory usage keeps growing. Use smart pointers (RAII) to fix the leaks and make sure file descriptors are reclaimed.
You can watch this with top.
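A minimal sketch of the RAII idea for file descriptors (the Fd class and make_shared_fd are made-up names for illustration; muduo has its own socket wrappers): the descriptor is closed exactly once when the last owner goes away, so an early return or exception can no longer leak it.

```cpp
#include <memory>
#include <unistd.h>

// Owns a file descriptor; closes it exactly once in the destructor.
class Fd {
public:
    explicit Fd(int fd) : fd_(fd) {}
    ~Fd() { if (fd_ >= 0) ::close(fd_); }
    Fd(const Fd&) = delete;             // no accidental double-close
    Fd& operator=(const Fd&) = delete;
    int get() const { return fd_; }
private:
    int fd_;
};

// Shared ownership across the event loop and worker threads:
// the descriptor is closed when the last shared_ptr is destroyed.
std::shared_ptr<Fd> make_shared_fd(int fd) {
    return std::make_shared<Fd>(fd);
}
```

If the connection object that owns the Fd is itself held by shared_ptr, then when the last reference (epoll handler, timer, queued task) drops it, the socket closes automatically and the descriptor is recycled.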

3. The main thread's CPU usage is too high and many client connections time out (with nothing in the log), while the worker threads' usage is very low.
At 1000 concurrent connections there should not be such high CPU usage in the main thread; normally it should be around 30%.
The problem is the second argument of listen, the backlog; it should be raised to a few hundred.
Imagine 10,000 concurrent connection attempts: when the queue is full, the server ignores the client's connect and the client can only retry. If the queue is still full on the retry, it retries again, and if it is unlucky enough it times out before it ever gets connected. Because clients keep retrying, they keep initiating three-way handshakes, and the server stays busy handshaking with them, sending SYN+ACK and so on, which burns a lot of CPU.
This is related to DoS attacks, namely half-connection (SYN flood) and full-connection attacks. A SYN flood uses spoofed IPs to start three-way handshakes with the server but never sends the final ACK, filling the half-connection queue and paralyzing the server.
A full-connection attack completes the three-way handshake but then does nothing except keep opening more connections, which likewise exhausts the server's memory and CPU.

The backlog I had set was 5, because many books use 5, so I assumed it was a reasonable value, much like assuming that growing a vector by 1.5-2x is a good factor.
The value n passed to listen(fd, n) is the maximum number of connections the operating system will queue for the listening socket before it starts refusing further ones; n can be thought of as the queue length.
TCP actually maintains two queues:
SYN queue (half-connection queue): when the server receives a client's SYN, it replies with SYN/ACK and the connection enters the SYN_RECEIVED state. Connections in SYN_RECEIVED sit in the SYN queue; when the final ACK of the three-way handshake arrives and the state becomes ESTABLISHED, they are moved to the accept queue. The SYN queue size is set by the kernel parameter /proc/sys/net/ipv4/tcp_max_syn_backlog.
Accept queue (full-connection queue): the accept queue holds connections that have already completed the three-way handshake; the accept system call simply takes a connection off this queue, it does not perform the handshake itself. The accept queue size is controlled by the second argument of listen.
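A minimal sketch of creating the listening socket with a larger backlog (port 8080 and the value 512 are arbitrary here; note that the effective queue length is also capped by /proc/sys/net/core/somaxconn, so that kernel parameter may need raising as well):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdio>

int main() {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);

    int on = 1;
    setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    if (bind(listenfd, (sockaddr*)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    // backlog = 512: the accept (full-connection) queue can hold roughly 512
    // completed handshakes waiting for accept(), instead of only 5.
    if (listen(listenfd, 512) < 0) {
        perror("listen");
        return 1;
    }
    // ... accept loop ...
}
```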


4. After optimizing the worker threads, QPS still differs from muduo by about 16%, and the worker threads' CPU usage is about 10% higher.
Could it be that the business code that parses HTTP (an HTTP echo) is doing more work than necessary? Can the protocol parsing be optimized?

5. How to start surpassing it?
Analyze the bottleneck, then rework the design yourself: the thread pool / work queue part? The coroutine part? ET mode?

6. How do you test long connections versus short connections, and why does the difference matter for the stress-test results?
At 1000 concurrent connections:

| Server | Short-connection QPS | Long-connection QPS |
| --- | --- | --- |
| WebServer | 126798 | 335338 |
| Muduo | 88430 | 358302 |

As you can see, QPS with long connections is about 3-4 times that of short connections, because there is no overhead for setting up and tearing down TCP connections, no frequent accept and shutdown/close system calls, and no repeated creation and destruction of the corresponding per-connection structures.
The reason it can surpass muduo here is ET mode: muduo uses level-triggered epoll, and ET mode suits high-concurrency short-connection scenarios.
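One way to produce the two rows above (an assumption about methodology, not necessarily how these exact numbers were obtained): wrk keeps connections alive by default, so it measures long connections, while adding a Connection: close header forces a new TCP connection per request, which measures short connections. With a placeholder address:

wrk -t 8 -c 1000 -d 30 'http://127.0.0.1:8080/' (long connections, HTTP keep-alive)
wrk -t 8 -c 1000 -d 30 -H 'Connection: close' 'http://127.0.0.1:8080/' (short connections)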

Level trigger (LT): as long as the socket receive buffer is non-empty, i.e. there is data to read, the read event keeps firing; as long as the socket is writable, the write event keeps firing. Edge trigger (ET): an event fires once per transition, unreadable to readable, readable to unreadable, unwritable to writable, writable to unwritable. With ET we therefore normally use a while() loop to drain all the data from the sockfd.
Level trigger behavior:

1. After accept succeeds, the writable event fires until the send buffer is full. When the peer has sent you data, the readable event keeps firing as long as there is still data in your receive buffer.
2. When the peer disconnects, whether gracefully or abnormally, both readable and writable events fire. Why? Because recv and send return immediately: read returns 0 on a graceful close and -1 on an abnormal one, and write returns -1 (and can trigger SIGPIPE).

// Level trigger: one read per wakeup is enough; epoll will notify again if data remains.
int number = epoll_wait(epollfd, events, MAX_EVENT_NUMBER, -1); // wait for events; ready events are stored in `events`
sockfd = events[i].data.fd;           // take the file descriptor of one ready event
n = recv(sockfd, buff, MAXLNE, 0);    // copy data from sockfd's receive buffer into buff

// Edge trigger: we are only notified once, so keep reading until EAGAIN.
int number = epoll_wait(epollfd, events, MAX_EVENT_NUMBER, -1); // wait for events; ready events are stored in `events`
sockfd = events[i].data.fd;           // take the file descriptor of one ready event
m_read_idx = 0;                       // index where the next chunk of data is appended
while (true) {
    // append new data at buff + m_read_idx
    n = recv(sockfd, buff + m_read_idx, MAXLNE - m_read_idx, 0);
    if (n == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            // the receive buffer has been drained, nothing left to read
            break;
        }
        // real error: give up on this connection
        //close(sockfd);
        break;
    } else if (n == 0) {
        // the peer closed the connection
        //close(sockfd);
        break;
    }
    m_read_idx += n;
}

Although level trigger is easier to code, the repeated event notifications hurt the performance of a high-concurrency server, because every batch of events returned by epoll_wait involves a system call and a switch between user mode and kernel mode. With high-concurrency short connections in particular, epoll_wait keeps returning because the data has not all been read yet, and the constant setup and teardown of short connections also keeps triggering readable/writable events that make epoll return.
So under 1000 concurrent short connections ET performs better than LT, while for long connections the difference is small.
Is ET really faster than LT?
First, if concurrency is low, or every read is small, there should be no difference. With large amounts of data, LT makes more needless system calls and is slower. From ET's processing flow you can see that ET requires you to keep reading and writing until EAGAIN, otherwise events are missed; with LT, reading until EAGAIN is not a hard requirement, although the usual code does so anyway, but LT needs one extra step of toggling the EPOLLOUT event compared with ET.
Second, if the server's responses are usually small and rarely fill the send buffer, EPOLLOUT is seldom triggered and LT is a good fit; redis is an example. nginx, as a high-performance general-purpose server, can push network traffic up into the gigabit range, where EPOLLOUT is triggered easily, so it uses ET.

To sum up, for high concurrency with large amounts of data ET is clearly better, and for small data ET is still slightly faster. The reason: if you have to write, say, 1 MB of data, you cannot write it all in one call. LT enables the EPOLLOUT event and disables it once the write is finished; ET just keeps writing until everything is written or it gets EAGAIN, and if the write completes it never has to enable EPOLLOUT at all.
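A minimal sketch of the LT write path described above (epollfd, connfd, and lt_send are placeholder names, not the project's actual code): after a partial write we keep EPOLLOUT registered, and once everything is flushed we must unregister it, otherwise the writable event fires on every epoll_wait.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>

// Try to send [data, data+len); returns the number of bytes still unsent,
// or -1 on a real error.
ssize_t lt_send(int epollfd, int connfd, const char* data, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = ::send(connfd, data + sent, len - sent, 0);
        if (n > 0) {
            sent += static_cast<size_t>(n);
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                      // kernel send buffer full, finish later
        } else {
            return -1;                  // real error: caller should close the connection
        }
    }
    // LT mode: watch EPOLLOUT only while unsent data remains, otherwise the
    // writable event would fire on every epoll_wait. These extra epoll_ctl
    // calls are exactly the overhead LT pays compared with ET.
    epoll_event ev{};
    ev.data.fd = connfd;
    ev.events = EPOLLIN | (sent < len ? EPOLLOUT : 0);
    ::epoll_ctl(epollfd, EPOLL_CTL_MOD, connfd, &ev);
    return static_cast<ssize_t>(len - sent);
}
```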

**How do you find performance bottlenecks through stress testing and other methods?** Have your own answer ready.
Are file descriptors not being closed in time?
Is there a problem in the processing logic?

Summary:

1. At 1000 concurrent connections some connections failed. The first suspect was the file descriptor limit, so I raised it from 1024, but that alone did not fix it. It turned out that in ET mode the accept notification is delivered only once; under heavy concurrency, pending connections were not accepted and added to epoll in time, so many of them timed out and failed. The fix is to wrap accept in a while loop until there is nothing left to accept, and the same goes for read and write in ET mode: they must be drained in one pass or data gets lost. In the end I simply used LT there and found it more efficient.
So it is not that ET is inherently more efficient than LT; it depends on the scenario.
For reading it makes little difference: under high concurrency epoll_wait returns frequently anyway, and you do not get many extra wakeups just because you have not finished reading. The real impact is on writing. With LT, as long as the send buffer is not full the writable event keeps firing, so after writing what you have (or hitting EAGAIN) you must disable the write event with epoll_ctl, otherwise it keeps triggering and you never get back to epoll_wait to monitor the other fds, and you re-enable it later when there is more to send. epoll_ctl is a system call, so LT costs a few extra system calls. The larger the data, the more likely a partial write in the middle, and "come back and keep writing when notified" fits LT well; conversely, writing one fd to completion in ET can starve the other fds.
For example, redis uses LT and nginx uses ET. The underlying reason is that redis deals with small amounts of data, while nginx can easily be pushing on the order of 1 GB; the more data there is, the more the write path dominates.

2. The backlog problem.
The main thread's CPU usage was too high and many client connections timed out (with nothing in the log), while the worker threads' usage was very low.
At a concurrency of 1000 there should not be such high CPU usage in the main thread, normally around 30%; the low usage in the worker threads means they were receiving very few requests, because many clients never got connected.
At first I suspected my coroutines, but in principle switching between a sub-coroutine and the main coroutine only swaps a few register values; it could hardly drive CPU that high while connections fail.
I was stuck here for a long time. Eventually the culprit turned out to be the second argument of listen.
Raising the backlog from 5 to a few hundred (582 in my case) fixed it; the reasoning about the SYN queue and the accept queue is the same as described in problem 3 above.

After this change everything was normal.
QPS is around 80,000, roughly the same as muduo; without the coroutines it might be a little higher.

3. Finding the bottleneck.
Analysis: I noticed that at 10,000 connections the CPU usage of the worker threads, and of the main thread, was not particularly high (around 70%). In theory,
with a master-slave reactor model under that much concurrency, the main thread and worker threads should be saturated, because there is no database, cache or RPC behind them; a worker thread takes a task and sends the data straight back, so there is not much blocking to wait on.
I worked out that the problem was the task queue. It is protected by a single lock over the whole queue, so enqueue and dequeue are serialized and block each other. (Enqueue and dequeue are also slow in themselves: besides locating the slot, the head and tail indices suffer from false sharing, so updating one invalidates the cache line holding the other and each access then goes out to main memory at a cost of tens to hundreds of cycles; pop also has to release the task node, which drags in heap/free-list management.) So I replaced it with a lock-free circular message queue with a pre-allocated ring buffer.
After the change, CPU usage rose to about 92%, and QPS increased noticeably, by around 40%.
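A minimal sketch of the idea, assuming a single-producer/single-consumer ring (RingQueue and its details are illustrative, not the project's actual code): the buffer is pre-allocated, head and tail are atomics, and alignas(64) keeps them on separate cache lines to avoid the false sharing mentioned above.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Single-producer / single-consumer lock-free ring queue.
template <typename T>
class RingQueue {
public:
    explicit RingQueue(size_t capacity) : buf_(capacity) {}

    bool push(const T& v) {                       // called by the producer only
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire))
            return false;                         // queue full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& v) {                              // called by the consumer only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                         // queue empty
        v = buf_[h];
        head_.store((h + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;                          // pre-allocated storage, no per-push allocation
    alignas(64) std::atomic<size_t> head_{0};     // separate cache lines so the producer
    alignas(64) std::atomic<size_t> tail_{0};     // and consumer do not false-share
};
```

A multi-producer/multi-consumer version needs CAS loops on head and tail (or one queue per worker thread), but the pre-allocation and false-sharing points are the same.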

top shows CPU usage dynamically: first an overall summary (process count, CPU usage, memory usage, swap status), then the per-process list below. (A process can easily show more than 100%, because multiple threads run on multiple cores and their usage adds up.)
But what we want to check is per-thread usage.
top -H -p 15736: first find the process, then the -H flag shows the CPU usage of the threads inside it.

4. Future directions
From C10K to C10M.
Hardware is no longer the problem; the core issue is software. (Blocking happens not only in IO but also in data copying, address translation, memory management, and data processing.)
The core of the problem: don't let the OS kernel do all the heavy lifting. Move packet processing, data copying, memory management, processor scheduling and address translation out of the kernel into the application, where they can be done efficiently, so that Linux handles only the control plane and the data plane is handled entirely by the application.

1. The Linux kernel has been a time-sharing system from the start; its scheduler (CFS) puts fairness first. Our actual workloads differ, so it is better to let the application do its own scheduling and spare the kernel from pointless scheduling work.
2. CPU affinity (core binding): otherwise tasks migrate between cores, which on today's NUMA machines causes context switches and cache hit-rate problems.
3. Let packets interact directly with the application layer. The Linux protocol stack is complex and heavy; packets passing through it lose a lot of performance and consume a lot of memory. DPDK, for example, runs the NIC driver in user space so data bypasses the kernel (packets destined for the host itself still go through the kernel).
4. Enable huge pages to reduce address-translation overhead (4 KB -> 2 MB pages).
5. Zero-copy techniques (as in Kafka): sendfile. The file stays memory-mapped in the kernel page cache and only the offset and length are passed around, so reads are served without copying the data into a user buffer and back into the socket buffer. (For example, after our thread pool processes a request and sends the response back to the client, the data goes through four copies; with zero copy, CPU utilization could be better.)
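A minimal sketch of the sendfile path for serving a static file (error handling trimmed; connfd is assumed to be an already-accepted, blocking socket):

```cpp
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Send a whole file to the client without copying it through user space:
// the kernel moves data from the page cache toward the socket directly.
bool send_file(int connfd, const char* path) {
    int filefd = open(path, O_RDONLY);
    if (filefd < 0) return false;

    struct stat st{};
    fstat(filefd, &st);

    off_t offset = 0;                              // sendfile advances this for us
    while (offset < st.st_size) {
        ssize_t n = sendfile(connfd, filefd, &offset, st.st_size - offset);
        if (n <= 0) { close(filefd); return false; }   // error or peer closed
    }
    close(filefd);
    return true;
}
```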

2.1. HTTP echo test QPS
Test machine configuration: CentOS virtual machine, 6 GB memory, 4 CPU cores

Test tool: wrk: https://github.com/wg/wrk.git

Deployment: wrk and the TinyRPC service run on the same virtual machine, with TinyRPC logging turned off

Test command:

// -c is the number of concurrent connections; change it to match each column of the table below
wrk -c 1000 -t 8 -d 30 --latency 'http://127.0.0.1:19999/qps?id=1'

Test Results:

| IO threads | 1000 wrk connections | 2000 wrk connections | 5000 wrk connections | 10000 wrk connections |
| --- | --- | --- | --- | --- |
| 1 | 27000 QPS | 26000 QPS | 20000 QPS | 20000 QPS |
| 4 | 140000 QPS | 130000 QPS | 123000 QPS | 118000 QPS |
| 8 | 135000 QPS | 120000 QPS | 100000 QPS | 100000 QPS |
| 16 | 125000 QPS | 127000 QPS | 123000 QPS | 118000 QPS |






Origin blog.csdn.net/weixin_53344209/article/details/130148525