Key points to watch in high-concurrency services

System Load:

First, what is Load Average?

System load (System Load) is a measure of how busy the system's CPUs are, that is, how many processes are waiting for CPU scheduling (the length of the process wait queue).
Load average (Load Average) is the average system load over a period of time; the periods are conventionally 1 minute, 5 minutes, and 15 minutes.

Second, how do we view the load?

1, cat /proc/loadavg

2, uptime

3, top

Any of these three commands shows the current system load. The three numbers after "load average:", for example 0.95, 0.94, 0.92, are the interesting part.

They are, in order, the load averages over the last one minute, five minutes, and fifteen minutes (in effect, how many processes were keeping the CPUs busy during each of those periods).
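For reference, here is an illustration of what these commands print; the numbers are made up to mirror the example above, not taken from our servers:

$ cat /proc/loadavg
0.95 0.94 0.92 2/1318 24537
(fields: 1-min, 5-min, 15-min load averages, running/total tasks, most recent PID)

$ uptime
 14:32:01 up 90 days,  3:12,  1 user,  load average: 0.95, 0.94, 0.92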

Third, how do we assess the current system's load average?

Rule of thumb: experienced system administrators generally set the "full load" line at 70% (i.e. the busy CPUs as a percentage of the server's total number of cores).

Currently all of our online servers are 64-core (vCPU) ECS instances, and the xadserver service runs under uwsgi with the number of worker processes set to 45. The ratio works out to 45 / 64 = 0.703125.

So the number of processes our service currently starts puts us just slightly above the 70% line.
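Working back from the rule of thumb: 64 cores x 0.70 = 44.8, so 45 worker processes is the smallest whole number at or above that line, and 45 / 64 ≈ 70.3%.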

Now let's read the load levels of our online servers:

Over the 1, 5, and 15 minute windows the number of active CPUs is about 45, the same as the number of worker processes, so because of that process count our server load stays at about 70% (while the service operates normally).

The figure shows the daily low-load period, when the number of requests is small. The trend line makes it clear that from about 9:00 pm until 5:00 am the next day the number of requests keeps falling and then rises again; the maximum system load is bounded by the service's 45 worker processes.

The next figure shows the system load curve on the day of the incident.

High concurrency began at 11:00, and because services were being restarted at the time, the SLB spread the concurrent load over the remaining five machines, driving up the system load on those five machines (it never exceeded the 64-core ceiling, though; the rising load is a symptom of the service being overwhelmed, not the root cause, but it can serve as corroborating evidence).

After the service was overwhelmed, the system load curve swung sharply up and down, and there is no definitive explanation yet. My own view is that once the service is overwhelmed, the number of client requests processed in real time drops sharply and responses cannot be returned normally, so the curve plunges; then, once the load has bottomed out, a flood of client retries and normal requests pours in and overwhelms the service again, and this cycle repeating is what makes the curve plunge and spike over and over.

 

 

TCP:

On Linux, whether you are writing a client or a server program, when handling highly concurrent TCP connections the maximum concurrency is limited by the system's limit on how many files a single user process may have open at the same time (the system creates a socket handle for every TCP connection, and every socket handle is also a file handle).

 

There are three levels of limits on the number of file handles:

1, Soft limit: within the range the current system can bear, Linux further restricts the number of files a user may have open at the same time.

2, Hard limit: the maximum number of files the system can have open simultaneously, calculated from the system's hardware resources (mainly memory). (The soft limit is normally less than or equal to the hard limit.)

3, System limit: the maximum number of files the current Linux system allows to be open at the same time (i.e. the total across all users). This is the Linux system-level hard limit, and no user-level open-file limit should exceed it. It is normally the optimal maximum computed by Linux at boot time from the hardware resources, and unless there is a special need it should not be changed, except when you want to set a user-level open-file limit above it.

Soft/hard limits: /etc/security/limits.conf → the soft nofile / hard nofile entries in that file

System limit: cat /proc/sys/fs/file-max

All three of these limits are currently set to 655350 on our online machines.
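As a quick reference, these are the standard commands for checking the values described above (a minimal sketch; the 655350 figures are simply the values from this article, adjust to your own environment):

# per-process limits for the current shell/user
ulimit -Sn        # soft nofile
ulimit -Hn        # hard nofile

# system-wide limit
cat /proc/sys/fs/file-max

# persistent per-user settings in /etc/security/limits.conf, e.g.:
#   *  soft  nofile  655350
#   *  hard  nofile  655350

# persistent system-wide setting, e.g. in /etc/sysctl.conf:
#   fs.file-max = 655350
# then apply with: sysctl -p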

So in theory a single online server can support up to 655350 concurrent connections (the real maximum concurrency an instance can support also depends on the server's hardware configuration and the network environment).

On the day of the incident, a single ECS instance had between 24000 and 25000 TCP connections in the ESTABLISHED state, while the total number of TCP connections stayed between 90000 and 100000. Total TCP connections - ESTABLISHED connections = TIME_WAIT connections + CLOSE_WAIT connections (the other TCP states have little impact and are ignored here).
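Taking the midpoints of those ranges as a rough illustration: about 95000 total - 24500 ESTABLISHED ≈ 70500 connections sitting in TIME_WAIT or CLOSE_WAIT on that instance during the incident.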

TCP connections in the TIME_WAIT and CLOSE_WAIT states still occupy file handles and reduce the system's capacity for concurrent processing. Since the system configuration of our online servers had already been tuned for TIME_WAIT handling, the direct cause of the incident is that the number of TCP connections exceeded the maximum the system could actually carry.

During normal operation of the service, a single server has 9000-10000 ESTABLISHED connections, and the total number of TCP connections stays between 45,000 and 60,000.

 

Estimating the current total concurrency on the SLB side: 9500 * 10 = 95000

Querying the SLB's concurrent connection count shows that the SLB's currently active (ESTABLISHED) connections are of the same order of magnitude as the total TCP connections across the ECS instances.

 

Monitoring:

System load is only a symptom that shows up after the online service has already run into trouble. Any advance warning has to come from monitoring the TCP connections of each ECS instance, so that unhealthy growth in TIME_WAIT and CLOSE_WAIT is discovered early (in practice, what we will ultimately monitor is the total number of TCP connections).

Monitoring command: netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

The figure shows the TCP connections of one online machine: the number of ESTABLISHED connections is currently on the order of 9000, while TIME_WAIT reaches about 40,000, which is normal for us.
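For reference, the output of the monitoring command above looks like the following; the numbers are illustrative only, chosen to roughly match the magnitudes just described:

ESTABLISHED 9123
TIME_WAIT 40211
CLOSE_WAIT 37
FIN_WAIT1 12
SYN_RECV 5
LAST_ACK 3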

Because our service handles high-concurrency short-lived connections, the server correctly closes each connection on its own initiative as soon as the request has been processed. In this scenario a large number of sockets naturally end up in the TIME_WAIT state, and the system configuration of our online servers has already been tuned for this business scenario.

The final direct cause of the service being taken down by too many TCP connections is that the total number of TCP connections occupied too many file handles, beyond what the system could bear.

 

Therefore, to guard against the online ECS instances facing higher concurrency than they can bear, we need to monitor the current total number of TCP connections on each ECS instance, set a threshold, and raise an alarm ahead of time.
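A minimal sketch of such a check, assuming a cron-driven shell script; the 80000 threshold, log path, and alert mechanism below are placeholders to adapt to your own environment and alerting system:

#!/bin/bash
# Count all TCP connections on this host and warn when a threshold is exceeded.
THRESHOLD=80000                        # example value, tune to your capacity
TOTAL=$(netstat -n | grep -c '^tcp')   # total TCP connections in any state

if [ "$TOTAL" -gt "$THRESHOLD" ]; then
    # placeholder: hook this into your real alerting channel
    echo "$(date) WARNING: TCP connections = $TOTAL (threshold $THRESHOLD)" >> /var/log/tcp_conn_alert.log
fi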

 
