AlertManager, a powerful monitoring and alerting tool (9)

Monitoring and alerting with Prometheus + AlertManager

1. Introduction to AlertManager

Prometheus itself cannot send alerts, so it has to be combined with a separate alerting component in order to turn monitoring metrics into notifications.

AlertManager is such an alerting component. First, alerting rules are configured in Prometheus; when a rule is triggered, Prometheus pushes the alert to AlertManager. After AlertManager receives the alert, it sends it to different receivers according to the configured routing and the alert's severity. AlertManager supports email, enterprise WeChat, and other notification channels.


2. Deploying AlertManager and its configuration file

2.1. Deploy AlertManager

[root@prometheus-server ~]# tar xf alertmanager-0.21.0.linux-amd64.tar.gz 
[root@prometheus-server ~]# mv alertmanager-0.21.0.linux-amd64 /data/alertmanager

2.2. Introduction to configuration files

global:
  resolve_timeout       // how long AlertManager waits before treating an alert as resolved; the recovery notification is not sent immediately, but only if the alert is not triggered again within this window, default 5m
  smtp_from             // sender's email address
  smtp_smarthost        // SMTP server address (host:port) of the mail provider
  smtp_auth_username    // email account used for authentication (the sender's account)
  smtp_auth_password    // email authorization code / password
  smtp_require_tls      // whether TLS is required, default true
  wechat_api_url        // enterprise WeChat API address
  wechat_api_secret     // API secret of the WeChat application
  wechat_api_corp_id    // enterprise (corp) ID

route:
  group_by              // which label(s) alerts are grouped by
  group_wait            // how long to wait after the first alert of a new group before sending, so that other alerts arriving in that window are bundled into the same notification
  group_interval        // how long to wait before sending a notification about new alerts added to an existing group
  repeat_interval       // how long to wait before re-sending a notification for an alert that is still firing; reduces notification frequency
  receiver              // name of the receiver that handles the alerts

receivers:
  name                  // receiver name, referenced by the receiver field in route
  email_configs
  - to                  // recipient's email address
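To show how route and receivers fit together, below is a hedged sketch of a configuration that dispatches alerts to different receivers depending on their severity label; the wechat_configs values (corp_id, agent_id, api_secret) are placeholders and the label values are examples only, not taken from this article's setup.

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'mail'                    # default receiver
  routes:
  - match:
      severity: error                 # alerts labeled severity=error go to enterprise WeChat
    receiver: 'wechat'
receivers:
- name: 'mail'
  email_configs:
  - to: '[email protected]'
- name: 'wechat'
  wechat_configs:
  - corp_id: 'your-corp-id'           # enterprise WeChat corp ID (placeholder)
    agent_id: '1000002'               # application agent ID (placeholder)
    api_secret: 'your-api-secret'     # application secret (placeholder)
    to_user: '@all'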

2.3. Configure AlertManager email alerts

1. Edit the main configuration file
[root@prometheus-server ~]# cd /data/alertmanager/
[root@prometheus-server /data/alertmanager]# vim /data/alertmanager/alertmanager.yml 
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'        
  smtp_from: '[email protected]'       
  smtp_auth_username: '[email protected]'  
  smtp_auth_password: 'yzjqxhsranbpdijd'     
  smtp_require_tls: false
  
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '[email protected]'
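By default, email receivers do not send a recovery notification when an alert clears (see resolve_timeout above). If you also want that, email_configs has a send_resolved switch; a minimal sketch of the same receiver with it enabled:

receivers:
- name: 'mail'
  email_configs:
  - to: '[email protected]'
    send_resolved: true               # also notify when the alert is resolved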
  
2. Check the syntax
[root@prometheus-server /data/alertmanager]# ./amtool check-config alertmanager.yml 
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 1 receivers
 - 0 templates

3. Start AlertManager
[root@prometheus-server /data/alertmanager]# nohup ./alertmanager --config.file="/data/alertmanager/alertmanager.yml" &

4. Check the listening ports
[root@prometheus-server /data/alertmanager]# netstat -lnpt | grep alert
tcp6       0      0 :::9093                 :::*                    LISTEN      31401/./alertmanage 
tcp6       0      0 :::9094                 :::*                    LISTEN      31401/./alertmanage 
  

2.4. Configure Prometheus to integrate with AlertManager

1. Edit the configuration file
[root@prometheus-server ~]# vim /data/prometheus/prometheus.yml 
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.81.210:9093                    # AlertManager address

rule_files:                                    # path to the alerting rule files
  - "rules/*.yml"

2. Create the rules directory for alerting rules
[root@prometheus-server ~]# mkdir /data/prometheus/rules

3. Reload the configuration
[root@prometheus-server ~]# curl -XPOST 192.168.81.210:9090/-/reload

The configuration has been reloaded, and AlertManager can now be used for alerting.

3. Alerting rules

3.1. Alerting rule syntax

groups:                                 // defines a group of alerting rules
- name: general.rules                   // group name; alerts of the same type can be put into one group
  rules:                                // the alerting rules; a group can contain several
  - alert: 主机宕机                      // alert name ("host down"), i.e. the title of the alert message; one alert entry is one rule
    expr: up == 0                       // expression; the rule matches when the expression holds
    for: 1m                             // how long the condition must keep holding before the alert is sent
    labels:                             // extra labels attached to the alert
      severity: error                   // alert level, e.g. warning or error
    annotations:                        // the alert content
      summary: "主机 {{ $labels.instance }} 停止工作"                                 // message ("host ... stopped working"); $labels.instance is a label of the monitored target
      description: "{{ $labels.instance }} job {{ $labels.job }} 已经宕机5分钟以上!"   // detailed description ("... has been down for more than 5 minutes")
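As another illustration of the syntax, below is a hedged sketch of a rule that fires when CPU usage stays high; it assumes a node_exporter target is being scraped, and the 80% threshold and 5m duration are arbitrary example values.

groups:
- name: node.rules
  rules:
  - alert: HighCpuUsage
    expr: 100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage on {{ $labels.instance }} is above 80%"
      description: "{{ $labels.instance }} has had CPU usage above 80% for more than 5 minutes."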

3.2. Alerting rule states

An alerting rule can be in one of three states:

  • inactive (green): the rule is not triggered and everything is normal
  • pending (yellow): the threshold has been exceeded, but not yet for the duration set by for in the rule; while the for window is still running the alert is not sent to AlertManager, and once the duration has elapsed it is sent immediately
  • firing (red): the threshold has been exceeded for at least the for duration, and the alert is sent to the receiver

4. Create an alert to detect host downtime

Each monitored instance has an up metric. If the value of this metric is 0, the host is down; a value of 1 means it is up.


4.1. Write the host-down alerting rule

1. Write the rule
[root@prometheus-server /data/prometheus]# vim rules/hostdown.yml
groups:
- name: general.rules
  rules:
  - alert: 主机宕机
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "主机 {{ $labels.instance }} 停止工作"
      description: "{{ $labels.instance }} job {{ $labels.job }} 已经宕机5分钟以上!"
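The rule above fires for any instance whose up value is 0. To limit it to a particular exporter, the expression can be narrowed with a label matcher; in the sketch below the job name mysql is only an example and has to match the job_name in your scrape configuration.

    expr: up{job="mysql"} == 0        # only instances scraped by the "mysql" job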

2. Check the syntax
[root@prometheus-server /data/prometheus]# promtool check config /data/prometheus/prometheus.yml 
Checking /data/prometheus/prometheus.yml
  SUCCESS: 1 rule files found

Checking /data/prometheus/rules/hostdown.yml
  SUCCESS: 1 rules found

3. Reload the configuration
[root@prometheus-server /data/prometheus]# curl -XPOST 192.168.81.210:9090/-/reload

The alerting rule we created can now be seen in the web UI under Status → Rules.


On the Alerts page you can also see whether the alerting rule has been triggered; if it has not, it is shown in green.


4.2. Trigger the alerting rule

Stop the mysql_exporter on 192.168.81.220 and see whether the alert is triggered

[root@192_168_81_220 ~]# ps aux | grep mysql
[root@192_168_81_220 ~]# kill -9 40268


4.3. Check whether AlertManager has received the alert

The Targets page now shows the instance as down


A red firing state means that the alert has been sent


The AlertManager page has also received the host-down alert pushed by Prometheus


4.4. Check whether the alert email has arrived

The email has arrived, so the whole pipeline now works end to end


4.5. Triggering alerts of the same type

"Alerts of the same type" means that a Prometheus alerting rule applies to all monitored instances, so when several instances trigger the same rule, the alerts are displayed together and sent in a single email

After the docker service is stopped as well, two alerts of the same type are generated

Although they come from two different services, they are triggered by the same alerting rule, so they are grouped together


The emails are also sent together. This is the effect of the group_wait parameter in the configuration file: when several alerts in the same group are triggered within that window, they are bundled into a single email.
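Which alerts end up in the same email is controlled by group_by in the route. With group_by: ['alertname'] (as configured above), all instances triggering the same rule share one notification; if you would rather receive one email per instance, instance can be added to the grouping labels. A sketch:

route:
  group_by: ['alertname', 'instance']   # one group, and one email, per rule and per instance
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'mail'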


Origin blog.csdn.net/weixin_44953658/article/details/113777121