Mastering Nginx monitoring for operations and maintenance: this article is all you need

Nginx is a free, open source, high-performance HTTP and reverse proxy server that can also act as an IMAP/POP3 proxy server. Making full use of its features helps deal with highly concurrent traffic and CC attacks.



This article discusses an approach to Nginx monitoring in an e-commerce scenario and shares the problems we ran into in practice, along with their solutions.


Nginx features


As a web server, Nginx is inevitably compared with Apache.


Compared with Apache, Nginx offers high concurrency and low resource consumption thanks to its asynchronous, non-blocking working model. Its highly modular design makes it very extensible, and it shows clear advantages in serving static files and handling reverse proxy requests.


Common ways to use Nginx


Nginx can be used as a reverse proxy server that forwards user requests, and while doing so it can load-balance across back-end instances to distribute those requests. Nginx can also be configured as a local static server to handle requests for static content.


Nginx monitoring


Sorting out monitoring indicators


The entire request-handling process of Nginx should be monitored so that we can tell promptly whether the service is working properly.


Nginx records request handling in detail in the access.log and error.log files. Table 1 lists the key indicators that need to be monitored:

Table 1: Key indicators


Monitoring practice


The following describes Nginx monitoring practice in terms of the four golden signals: latency, errors, traffic, and saturation.


Latency monitoring


Latency monitoring focuses on $request_time. We plot TP (top percentile) curves from it and track the TP99 value.


We can also monitor $upstream_response_time to help locate the cause of the latency.

Figure 1: TP indicator


Figure 1 shows how long Nginx took to process user requests over the past 15 minutes: 90% of requests completed within 0.1s and 99% within 0.3s.
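
As a rough illustration of how TP values like these can be derived, the sketch below applies the nearest-rank percentile method to a window of $request_time samples. The function names and the sample data are hypothetical, and a real pipeline would normally compute percentiles in the log or metrics backend instead.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct% of the sample count, so no float rounding is involved.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def tp_report(request_times):
    """Summarise one window of $request_time samples (in seconds)."""
    return {f"TP{p}": percentile(request_times, p) for p in (50, 90, 99)}


# Hypothetical window: 90 fast requests, 9 slower ones, 1 outlier.
samples = [0.02] * 90 + [0.25] * 9 + [1.2]
report = tp_report(samples)
print(report)
```

With this data the single 1.2s outlier does not move TP99, which is why a TP curve is a steadier basis for alarm thresholds than the maximum.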


Set the latency alarm threshold based on the TP values and on how much latency the specific business can tolerate.


Error monitoring


As a web server, Nginx needs monitoring not only of its own running state but also of its error responses. HTTP error status codes and the detailed errors recorded in error.log should both be monitored to help solve problems.


①Nginx port monitoring based on HTTP semantics


Simple port monitoring cannot reflect the real running state of the service. What matters is both that Nginx itself is alive and that it can actually serve requests normally.


Based on our practice, we recommend semantic monitoring instead of port monitoring: request http://local_ip:port/ from the Nginx host itself and verify that the format and content of the returned data and the HTTP status code match expectations.
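
A minimal sketch of such a semantic check follows. It assumes, purely for illustration, that the probed URL returns JSON containing "status": "ok"; the function names and the expected payload are hypothetical and should be replaced with whatever the real endpoint returns.

```python
import json
import urllib.error
import urllib.request


def evaluate(status, content_type, body,
             expected_status=200, expected_key="status", expected_value="ok"):
    """Return a list of problems; an empty list means the response looks healthy."""
    problems = []
    if status != expected_status:
        problems.append(f"unexpected status {status}")
    if "application/json" not in (content_type or ""):
        problems.append(f"unexpected content type {content_type!r}")
        return problems
    try:
        payload = json.loads(body)
    except ValueError:
        problems.append("body is not valid JSON")
        return problems
    if not isinstance(payload, dict) or payload.get(expected_key) != expected_value:
        problems.append(f"missing {expected_key}={expected_value!r} in body")
    return problems


def probe(url, timeout=3.0):
    """Fetch url (for example http://local_ip:port/) and evaluate the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return evaluate(resp.status, resp.headers.get("Content-Type"), resp.read())
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status, headers, and body worth checking.
        return evaluate(exc.code, exc.headers.get("Content-Type"), exc.read())
    except (urllib.error.URLError, OSError) as exc:
        return [f"request failed: {exc}"]
```

Unlike a plain port check, probe() reports a problem when Nginx accepts the connection but answers with an error page or an unexpected body.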


②Error code monitoring


Monitor 5xx server error status codes such as 500, 502, and 504, which indicate a problem with the service itself.

Figure 2: Status code monitoring


The frequency of 5xx errors should stay in the single digits per minute. Anything above that should be investigated and resolved promptly. 4xx errors can help uncover unexpected permission errors, missing resources, or performance problems.


You can optionally monitor 301/302 redirects, in particular those produced by special redirect rules. For example, Nginx may be configured to redirect when the backend server returns 5xx, so the client receives the result of the redirected request.
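
The single-digits-per-minute rule can be checked directly against access.log. The sketch below assumes the default "combined" log format; the regular expression, function names, and sample lines are illustrative and need adjusting for a custom log_format.

```python
import re
from collections import Counter

# Assumes the "combined" format: ... [17/Nov/2020:10:00:01 +0800] "GET / HTTP/1.1" 200 ...
LINE_RE = re.compile(
    r'\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\] '
    r'"[^"]*" (?P<status>\d{3}) '
)


def count_5xx_per_minute(lines):
    """Count 5xx responses per minute, keyed by 'dd/Mon/yyyy:HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status").startswith("5"):
            counts[match.group("minute")] += 1
    return counts


def minutes_over_limit(counts, limit=9):
    """Return the minutes whose 5xx count exceeds the limit (single digits by default)."""
    return sorted(minute for minute, n in counts.items() if n > limit)


# Hypothetical sample lines.
sample = [
    '10.0.0.1 - - [17/Nov/2020:10:00:01 +0800] "GET /a HTTP/1.1" 200 10 "-" "curl"',
    '10.0.0.2 - - [17/Nov/2020:10:00:30 +0800] "GET /b HTTP/1.1" 502 0 "-" "curl"',
    '10.0.0.2 - - [17/Nov/2020:10:00:59 +0800] "GET /b HTTP/1.1" 500 0 "-" "curl"',
    '10.0.0.3 - - [17/Nov/2020:10:01:02 +0800] "POST /c HTTP/1.1" 504 0 "-" "curl"',
]
counts = count_5xx_per_minute(sample)
print(dict(counts), minutes_over_limit(counts))
```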


③Monitor the error log


Nginx records request-processing errors in detail and saves them in the error.log file.


There are many types of errors. We mainly collect and monitor critical errors that reflect server-side exceptions, to help locate faults:

Table 2: Error log information
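
One possible way to collect such entries is to filter error.log by severity. The sketch below relies on the standard error.log line prefix (timestamp, level in brackets, pid#tid); treating error, crit, alert, and emerg as the interesting levels is an assumption rather than the contents of Table 2, and the sample lines are hypothetical.

```python
import re

# Nginx error.log lines start with "YYYY/MM/DD HH:MM:SS [level] pid#tid: message".
ERROR_RE = re.compile(
    r"^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[a-z]+)\] (?P<pid>\d+)#(?P<tid>\d+): (?P<message>.*)$"
)
# Assumed set of levels worth alerting on.
SEVERE_LEVELS = {"error", "crit", "alert", "emerg"}


def severe_entries(lines, levels=SEVERE_LEVELS):
    """Return (time, level, message) tuples for lines at one of the given levels."""
    entries = []
    for line in lines:
        match = ERROR_RE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            entries.append((match.group("time"), match.group("level"), match.group("message")))
    return entries


# Hypothetical sample lines.
sample = [
    "2020/11/17 10:00:01 [notice] 1200#0: signal process started",
    "2020/11/17 10:00:02 [error] 1201#0: *7 connect() failed (111: Connection refused) while connecting to upstream",
    "2020/11/17 10:00:03 [crit] 1201#0: *8 open() \"/tmp/example\" failed (28: No space left on device)",
]
for entry in severe_entries(sample):
    print(entry)
```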


Traffic monitoring


①Monitoring the total number of requests accepted by Nginx


Watch the traffic cycle and catch sudden increases and drops. As a rule of thumb, a 20% deviation from the usual steady-state trough or peak values deserves attention.


For services with a clear cycle, we can also alert on relative increases and decreases, comparing the current value both with the same point in the previous cycle and with the immediately preceding period, so that traffic changes are detected promptly.
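
The comparison described above can be sketched as follows. The 20% threshold comes from the text; the function names, the two baselines passed in, and the sample numbers are hypothetical.

```python
def relative_change(current, baseline):
    """Fractional change of current relative to a positive baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def traffic_alerts(current, previous_interval, same_time_last_cycle, threshold=0.20):
    """Compare current traffic with both baselines and report deviations at or above threshold."""
    alerts = []
    baselines = (
        ("previous interval", previous_interval),
        ("same time last cycle", same_time_last_cycle),
    )
    for name, baseline in baselines:
        change = relative_change(current, baseline)
        if abs(change) >= threshold:
            alerts.append((name, round(change, 3)))
    return alerts


# Hypothetical request counts: 30% above the same time yesterday, 4% above the last interval.
print(traffic_alerts(1300, 1250, 1000))
```

Checking both baselines matters for cyclic traffic: a normal morning ramp looks like a jump against the previous interval but not against the same time in the previous cycle.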

Figure 3: PV traffic graph

Figure 4: Traffic graph for key pages


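Figure 3 shows one week of traffic for a JD Cloud platform. The traffic has clear troughs and peaks and follows a daily cycle.


Based on the site's traffic characteristics, we monitor traffic fluctuations against the trough and peak values, and we use our own monitoring dashboard to chart traffic for the site's key pages (Figure 4) to assist troubleshooting.

Figure 5: NIC traffic graph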


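②Monitoring machine-level traffic such as NIC I/O


This reveals pressure on the server hardware in good time. When Nginx is used as a file server, this metric deserves particular attention.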


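Saturation monitoring


The Google SRE book notes that saturation should focus on how heavily a service uses its resources and how much more load the service can take under current conditions.


Nginx is a high-performance server with low resource consumption, but in scenarios such as e-commerce, a flash sale for a new product can make CPU utilization, the number of request connections, and disk writes spike within a short time.


For CPU utilization, also consider the usage of the specific cores that worker processes are bound to through worker_cpu_affinity. Under heavy traffic, this setting can reduce the performance cost of CPU switching.


The maximum number of connections Nginx can accept is determined by the product of two parameters in the nginx.conf configuration file: worker_processes and worker_connections.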

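Figure 6


The built-in http_stub_status_module can be used to monitor Nginx's real-time runtime information (Figure 6).


Because we care more about how Nginx is running right now than about requests it has already handled, we collect and monitor only the following metrics:

Table 3: Metric meanings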
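
As an illustration, the sketch below parses the text served by the stub_status module and relates active connections to the configured maximum. Collecting exactly the active, reading, writing, and waiting counters is an assumption about Table 3, and the values worker_processes=4 and worker_connections=1024 are hypothetical.

```python
import re


def parse_stub_status(text):
    """Extract the current-state counters from stub_status output."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    if not (active and states):
        raise ValueError("unrecognised stub_status output")
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


def connection_saturation(active, worker_processes, worker_connections):
    """Share of the configured connection capacity currently in use."""
    return active / (worker_processes * worker_connections)


# Sample output in the format the module produces.
sample = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""
status = parse_stub_status(sample)
saturation = connection_saturation(status["active"], worker_processes=4, worker_connections=1024)
print(status, f"{saturation:.1%}")
```

When Nginx acts as a reverse proxy, each client request also holds a connection to the upstream, so the usable client capacity is lower than the raw product suggests.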


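Building a visual Nginx monitoring system on open source software


①Visual log monitoring with Elasticsearch + Logstash + Kibana


Figure 7: ELK stack architecture


For the four golden signals above, we built ELK stack dashboards and set up commonly used Nginx log filter rules (Figure 8) so that problems can be located and analyzed quickly.

Figure 8: ELK dashboard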


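②Visual log monitoring with Kibana + Elasticsearch + Rsyslog + Grafana


Figure 9: Grafana visualization architecture


Kibana is quick at searching logs, whereas Grafana offers more flexibility in presenting data. In some situations the two complement each other.

Figure 10: Grafana dashboard


In practice we implemented Nginx log visualization with both architectures. In terms of the requirements themselves, the ELK stack model provides real-time log search, filtering by various log rules, and data display, which basically covers the needs of Nginx log monitoring.


The Grafana architecture model cannot search or browse logs, but it provides role-based permissions to protect access to sensitive data.


In addition, Grafana's richer chart types and data source support give it more application scenarios.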


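Cases of finding and locating problems through Nginx monitoring


Case 1: Heavy traffic surge


Problem: A platform ran a flash sale for a new product. During the event, a traffic spike increased the processing time of core functions such as the product detail page and order placement.

Figure 11: PV spike


Solution: After order monitoring and Nginx metrics such as PV and request time raised alarms, the operations staff quickly turned to the self-built ELK monitoring dashboard to follow the site's traffic changes and look at the top IPs and top URLs of user requests. They found a large amount of malicious purchasing by scalpers, which was delaying back-end processing.


We therefore lowered the anti-attack thresholds for the relevant interfaces in the anti-DDoS product and in the Nginx rate-limiting configuration. This intercepted, in time, the order-brushing traffic that was putting pressure on the system and kept the new product promotion running smoothly.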
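
The top IP and top URL views used in this case come from the ELK dashboard. As a rough stand-in for what such a view computes, the sketch below tallies client IPs and request paths straight from access.log lines, assuming the default "combined" log format. The names and sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumes the "combined" format: ip - user [time] "METHOD url PROTOCOL" status ...
REQ_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+)[^"]*"')


def top_talkers(lines, n=3):
    """Return the n most frequent client IPs and request paths (query strings dropped)."""
    ips, urls = Counter(), Counter()
    for line in lines:
        match = REQ_RE.match(line)
        if match:
            ips[match.group("ip")] += 1
            urls[match.group("url").split("?", 1)[0]] += 1
    return ips.most_common(n), urls.most_common(n)


# Hypothetical sample: one client hammering the order endpoint.
sample = [
    '203.0.113.7 - - [17/Nov/2020:10:00:01 +0800] "POST /order/submit?sku=1 HTTP/1.1" 200 5 "-" "bot"',
    '203.0.113.7 - - [17/Nov/2020:10:00:01 +0800] "POST /order/submit?sku=2 HTTP/1.1" 200 5 "-" "bot"',
    '203.0.113.7 - - [17/Nov/2020:10:00:02 +0800] "POST /order/submit?sku=3 HTTP/1.1" 200 5 "-" "bot"',
    '198.51.100.2 - - [17/Nov/2020:10:00:02 +0800] "GET /item/1001 HTTP/1.1" 200 900 "-" "Mozilla"',
]
top_ips, top_urls = top_talkers(sample)
print(top_ips, top_urls)
```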


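Case 2: Nginx error status codes reveal a service anomaly


Problem: While a platform was adjusting its back-end servers, the upstream of one Nginx instance was misconfigured and pointed at an unintended back-end service.


After the wrong configuration was released to production, the site began to fail intermittently, accompanied by a spike in the number of 500 and 302 status codes.

Figure 12: 302 status code statistics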


Solution: After the Nginx error status code alarm fired, we used the ELK platform to filter the URLs of requests that returned 302. The requested URLs all related to one back-end module, and the requests were being redirected to the site's home page.


Further investigation showed that one Nginx instance pointed at the wrong back-end server, which returned a large number of 500 errors. Because the Nginx configuration redirects on 500 errors, these surfaced as a large number of 302 status codes.

Figure 13: Upstream health check configuration


As a follow-up improvement, we upgraded Nginx and adopted OpenResty with Lua to health-check the back-end servers (Figure 13) and dynamically update the servers in the upstream. Abnormal back-end servers can then be removed quickly, which limits the damage.
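
The configuration in Figure 13 is not reproduced in the text, so the following is not the OpenResty/Lua code used here. It is only a small Python sketch of the general idea behind active health checks: a backend is removed after a number of consecutive failed probes and restored after a number of consecutive successes. The class name and the fall and rise parameters are hypothetical.

```python
class BackendHealth:
    """Tracks one backend: mark it down after `fall` failures in a row, up after `rise` successes."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall
        self.rise = rise
        self.healthy = True
        self._streak = 0  # consecutive probe results that contradict the current state

    def record(self, ok):
        """Feed one probe result and return the resulting health state."""
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            limit = self.rise if ok else self.fall
            if self._streak >= limit:
                self.healthy = ok
                self._streak = 0
        return self.healthy


def active_backends(states):
    """Return the addresses that should currently receive traffic."""
    return [addr for addr, state in states.items() if state.healthy]


# Hypothetical upstream with two servers; the second fails three probes in a row.
states = {"10.0.0.11:8080": BackendHealth(), "10.0.0.12:8080": BackendHealth()}
for _ in range(3):
    states["10.0.0.11:8080"].record(True)
    states["10.0.0.12:8080"].record(False)
print(active_backends(states))
```

Requiring several consecutive results before changing state keeps a single slow or dropped probe from flapping a server in and out of the upstream.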


Case 3: Nginx server disk space is exhausted and the service is abnormal


Problem: Nginx serves as the front end of the image server. One instance raised an alarm while no changes were being made to the production environment: 500 status codes made up too large a share of its overall traffic.


Solution: We quickly took this machine out of production so that it no longer served traffic. Checking the Nginx error log then turned up the following error:

open() "/home/work/upload/client_body_temp/0000030704" failed (28: No space left on device)


When Nginx handles a request, any client POST body larger than client_body_buffer_size is written, in part or in whole, to a temporary file in the client_body_temp_path directory. When the disk is full, this produces the error above.


In the end, we confirmed the cause: after a product upgrade, the maximum supported image size had been raised from 15MB to 50MB, and the operations team then ran a new product promotion, so the volume of images uploaded by users surged and filled the disk.
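
A disk usage check on the partition that holds client_body_temp_path would flag this condition before requests start failing. The sketch below uses only the Python standard library; the 85% warning level and the function names are assumptions, and the demo checks the current directory rather than the real temp path.

```python
import shutil


def disk_usage_ratio(path):
    """Fraction of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_temp_dir(path, warn_at=0.85):
    """Return (needs_attention, ratio) for the filesystem holding path."""
    ratio = disk_usage_ratio(path)
    return ratio >= warn_at, ratio


# In production, path would be the client_body_temp_path directory,
# for example /home/work/upload/client_body_temp from the error above.
needs_attention, ratio = check_temp_dir(".")
print(needs_attention, f"{ratio:.1%}")
```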



Origin blog.51cto.com/14410880/2551520