Nginx Upstream Monitoring and Alerting

In an earlier article I described how to monitor the traffic of each server in Nginx, mainly by adding a third-party status module that shows the state of every server and upstream. Since then, readers have kept asking whether upstreams can be monitored with alerting, so this article walks through a complete upstream monitoring and alerting setup.


Application: Nginx/Tengine

Module: ngx_http_upstream_check_module

Monitoring: zabbix

Alert: Enterprise WeChat/DingTalk


Nginx's upstream health checking is passive by default: a backend is only marked as failed when real requests to it fail, and nothing probes it proactively. For active checks, this setup uses Tengine's upstream_check module directly.


If you use Tengine, the module is enabled by default from version 1.4 onward. If you use Nginx, you need to recompile it with the module: download the module source and pass its path to configure with --add-module. The build procedure itself is not covered here.


The upstream_check module provides active health checks for backend servers. The directives it provides are described below.


  • check

Syntax: check interval=milliseconds [fall=count] [rise=count] [timeout=milliseconds] [default_down=true|false] [type=tcp|http|ssl_hello|mysql|ajp] [port=check_port]
Default: interval=30000 fall=5 rise=2 timeout=1000 default_down=true type=tcp
Context: upstream
 
 

This directive enables health checking for the backend servers. Its parameters are:

interval : the interval between health check packets sent to the backend, in milliseconds

fall (fall_count) : if the number of consecutive failures reaches fall_count, the server is considered down

rise (rise_count) : if the number of consecutive successes reaches rise_count, the server is considered up

timeout : the timeout for a health check request to the backend, in milliseconds

default_down : the initial state of the server. If true, the server starts as down; if false, it starts as up

type : the type of health check packet. The following types are supported:

  • tcp: a simple TCP connection; if the connection succeeds, the backend is considered healthy

  • ssl_hello: sends an initial SSL hello packet and waits for the server's SSL hello packet

  • http: sends an HTTP request and judges whether the backend is alive by the status of its response

  • mysql: connects to the MySQL server and judges whether the backend is alive by receiving the server's greeting packet

  • ajp: sends an AJP protocol CPing packet to the backend and judges whether the backend is alive by receiving the CPong packet

port : the port used for health checks on the backend server. It can differ from the port of the real service: for example, if the backend serves an application on port 443, you can check port 80 to determine its health. The default is 0, which means the check uses the same port as the real service provided by the backend server. This option was added in Tengine-1.4.0.
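
The way fall, rise and default_down interact is easier to see when simulated. The following is a minimal Python sketch of the semantics described above, not the module's actual implementation; the class name CheckState and its attributes are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Tracks one backend following the fall/rise rules described above."""
    fall: int = 5              # consecutive failures needed to mark the server down
    rise: int = 2              # consecutive successes needed to mark the server up
    default_down: bool = True  # initial state before any probe has completed

    def __post_init__(self):
        self.up = not self.default_down
        self.successes = 0  # length of the current run of successful probes
        self.failures = 0   # length of the current run of failed probes

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return whether the server counts as up."""
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.rise:
                self.up = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.fall:
                self.up = False
        return self.up


# Two successes bring the server up; three failures in a row take it down again.
state = CheckState(fall=3, rise=2)
history = [state.record(ok) for ok in (True, True, False, False, False, True)]
print(history)
```

Note that a single failed probe does not take a healthy server out of rotation; only an unbroken run of fall failures does, which is what keeps a brief network blip from triggering an alert.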


  • check_keepalive_requests

Syntax: check_keepalive_requests request_num
Default: 1
Context: upstream

This directive sets the number of requests sent over one connection. The default value is 1, which means that Tengine closes the connection after completing one request.


  • check_http_send

Syntax: check_http_send http_packet
Default: "GET / HTTP/1.0\r\n\r\n"
Context: upstream

This directive sets the request sent by the http health check. To reduce the amount of data transferred, the "HEAD" method is recommended. When keep-alive connections are used for health checks, the keep-alive request header must be added to the directive, for example: "HEAD / HTTP/1.1\r\nConnection: keep-alive\r\n\r\n". When the "GET" method is used, the page behind the request URI should not be too large, so that the transfer can complete within one interval; otherwise the health check module treats it as a backend server or network failure.


  • check_http_expect_alive

Syntax: check_http_expect_alive [ http_2xx | http_3xx | http_4xx | http_5xx ]
Default: http_2xx | http_3xx
Context: upstream

This directive specifies which HTTP response statuses count as success. By default, 2XX and 3XX statuses are considered healthy.
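
To make the relationship between check_http_send and check_http_expect_alive concrete, here is a small Python sketch of the logic: it builds the request text and decides from a status line whether the response counts as alive. The function names are hypothetical, and the keep-alive request simply mirrors the example string quoted above.

```python
def build_probe(method="HEAD", uri="/", keepalive=False):
    """Return the raw request text that a check_http_send value would hold."""
    if keepalive:
        return f"{method} {uri} HTTP/1.1\r\nConnection: keep-alive\r\n\r\n"
    return f"{method} {uri} HTTP/1.0\r\n\r\n"


def status_class(status_line):
    """Map a status line such as 'HTTP/1.1 204 No Content' to 'http_2xx'."""
    code = int(status_line.split()[1])
    return f"http_{code // 100}xx"


def is_alive(status_line, expect_alive=("http_2xx", "http_3xx")):
    """True when the response status falls in one of the accepted classes."""
    return status_class(status_line) in expect_alive


print(repr(build_probe(keepalive=True)))
print(is_alive("HTTP/1.1 302 Found"))
print(is_alive("HTTP/1.1 404 Not Found"))
print(is_alive("HTTP/1.1 404 Not Found", expect_alive=("http_2xx", "http_4xx")))
```

Adding http_4xx to the accepted classes can be useful when the checked URI requires authentication, since such a backend answers with a 4XX status even though it is running.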


  • check_shm_size

Syntax: check_shm_size size
Default: 1M
Context: http

The health check state of all backend servers is stored in shared memory, and this directive sets the size of that memory. The default is 1M. If you have more than 1,000 servers and an error occurs when loading the configuration, you may need to increase it.


  • check_status

Syntax: check_status [html|csv|json]
Default: check_status html
Context: location

Displays the health status page of the servers. This directive is placed in a location inside the http block. Since Tengine-1.4.0 the format of the page is configurable. The supported formats are html, csv and json, and the default is html. You can also choose the format per request with the format parameter. Assuming that '/status' is the URL of your status page, the parameter is used as follows.

For example:

/status?format=html
/status?format=csv
/status?format=json
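
Putting the directives above together, a configuration might look like the fragment rendered by the Python helper below. This is a sketch under stated assumptions: the helper name render_check_config, the upstream name, the backend addresses and the allow 127.0.0.1 rule (which assumes the monitoring agent runs on the same host) are all illustrative, and the rendered text is meant to be included inside the http block.

```python
def render_check_config(upstream, servers, interval=3000, rise=2, fall=5,
                        timeout=1000, status_path="/status"):
    """Render an nginx config fragment that enables HTTP health checks."""
    server_lines = "\n".join(f"    server {s};" for s in servers)
    return f"""upstream {upstream} {{
{server_lines}

    check interval={interval} rise={rise} fall={fall} timeout={timeout} type=http;
    check_http_send "HEAD / HTTP/1.0\\r\\n\\r\\n";
    check_http_expect_alive http_2xx http_3xx;
}}

server {{
    listen 80;

    location / {{
        proxy_pass http://{upstream};
    }}

    location {status_path} {{
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }}
}}
"""


# Hypothetical upstream name and backend addresses, for illustration only.
config = render_check_config("web", ["192.0.2.10:8080", "192.0.2.11:8080"])
print(config)
```

Restricting the status location with allow and deny keeps the list of backend addresses from being exposed publicly.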

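Below is an example of the HTML status page. Server number is the number of backend servers, and generation is the number of times Nginx has been reloaded. Index is the index of the server, Upstream is the name of the upstream in the configuration, Name is the server IP, Status is the state of the server, Rise is the number of consecutive successful checks, Fall is the number of consecutive failed checks, Check type is the check method, and Check port is the port dedicated to health checks on the backend.

[Image]

Below is the JSON format:

[Image]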
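
The JSON format is the easiest one to consume from a script. As a rough sketch, the Python snippet below parses a payload of the shape the status page returns and groups the servers that are not up by upstream. The sample data (upstream names, addresses, counters) is made up for illustration, and down_servers is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Illustrative payload in the shape returned by /status?format=json.
SAMPLE = """
{"servers": {
  "total": 3,
  "generation": 1,
  "server": [
    {"index": 0, "upstream": "web", "name": "192.0.2.10:8080", "status": "up", "rise": 120, "fall": 0, "type": "http", "port": 0},
    {"index": 1, "upstream": "web", "name": "192.0.2.11:8080", "status": "down", "rise": 0, "fall": 7, "type": "http", "port": 0},
    {"index": 2, "upstream": "api", "name": "192.0.2.20:9000", "status": "up", "rise": 45, "fall": 0, "type": "http", "port": 0}
  ]
}}
"""


def down_servers(payload):
    """Group the names of servers whose status is not 'up' by upstream."""
    grouped = defaultdict(list)
    for server in payload["servers"]["server"]:
        if server["status"] != "up":
            grouped[server["upstream"]].append(server["name"])
    return dict(grouped)


summary = down_servers(json.loads(SAMPLE))
print(summary)
```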


The monitoring data is taken from this page. Add the following script to the zabbix agent:

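import json
import urllib3

# NOTE: the string literals of this script were lost when the article was
# extracted. The URL and the discovery macro names below are assumptions;
# adjust them to your own status page and item prototypes.

def call_api():
    url = "http://127.0.0.1/status?format=json"  # assumed status page URL
    http = urllib3.PoolManager()
    up_status = http.request("GET", url).data.decode()
    up_status = json.loads(up_status)
    upstreams = []
    # Each entry under servers.server describes one backend server.
    for upserver in up_status["servers"]["server"]:
        status = {
            "{#INDEX}": upserver["index"],
            "{#UPSTREAM}": upserver["upstream"],
            "{#NAME}": upserver["name"],
            "{#STATUS}": upserver["status"],
            "{#RISE}": upserver["rise"],
            "{#FALL}": upserver["fall"],
        }
        upstreams.append(status)
    # Zabbix low-level discovery expects the entries under a "data" key.
    result = {"data": upstreams}

    return result

if __name__ == "__main__":
    try:
        # Print JSON rather than a Python dict repr so that zabbix can parse it.
        print(json.dumps(call_api()))
    except Exception as e:
        print(e)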

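The script mainly converts the data returned by the status page into the format zabbix needs. Because I use zabbix's auto-discovery feature, it simply iterates over the servers. Running the script produces the following output:

[Image]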

With data collection working, the next step is to add an auto-discovery rule in zabbix:

[Image]

Next, add the item prototype:

[Image]

The item prototype mainly fetches the state of each upstream backend server. Next, add the trigger:

[Image]
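
The item prototype and trigger are only shown as screenshots here, so how the per-server value is collected is not spelled out. One possible approach, sketched in Python under that assumption, is a helper that the agent calls with the discovered upstream and server name and that returns a number a trigger can compare against. The function name server_up_value and the sample payload are hypothetical.

```python
def server_up_value(payload, upstream, name):
    """Return 1 if the named server is up, 0 if it is down or not listed."""
    for server in payload["servers"]["server"]:
        if server["upstream"] == upstream and server["name"] == name:
            return 1 if server["status"] == "up" else 0
    return 0


# Made-up status data in the shape returned by /status?format=json.
payload = {
    "servers": {
        "total": 2,
        "generation": 1,
        "server": [
            {"index": 0, "upstream": "web", "name": "192.0.2.10:8080",
             "status": "up", "rise": 120, "fall": 0, "type": "http", "port": 0},
            {"index": 1, "upstream": "web", "name": "192.0.2.11:8080",
             "status": "down", "rise": 0, "fall": 7, "type": "http", "port": 0},
        ],
    }
}

print(server_up_value(payload, "web", "192.0.2.10:8080"))
print(server_up_value(payload, "web", "192.0.2.11:8080"))
```

Treating a server that is missing from the status page as down is a deliberate choice in this sketch: a backend that silently disappears from the configuration then raises an alert instead of going unnoticed.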

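That is all the monitoring setup requires. When an upstream backend server goes down, the trigger fires and the alert is sent to Enterprise WeChat through the alert media. You can also use DingTalk or SMS, depending on the alert media you have configured.

[Image]

Notification after recovery:

[Image]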


Origin blog.51cto.com/15080021/2654494