[Monitoring System] Prometheus integrates Alertmanager for monitoring and email alert notifications

Alertmanager is open source software for managing alerts. It is tightly integrated with Prometheus, a popular open source monitoring and alerting system. Alertmanager receives alerts from clients such as the Prometheus server and decides, based on a set of configured rules, how to process them and where to send notifications.

Alertmanager's functionality can therefore be summarized as:

  • Receive alerts from monitoring systems
  • Process and deduplicate the received alerts according to configured rules
  • Send alert notifications

Alertmanager supports various notification methods, such as email and webhooks; DingTalk is typically connected through a webhook.

An alert rule in Prometheus consists of two parts:

  • Alert name: the name the user gives the alert rule.
  • Alert condition: defined mainly by a PromQL expression, together with how long the expression's query result must persist (the for field) before the alert is triggered.

Key features:

  • Grouping: combines related alerts into a single notification. This is useful when a large number of alerts fire at the same time, for example because of a system outage.
  • Inhibition (suppression): once an alert has been sent, notifications for other alerts caused by it can be suppressed, which avoids a flood of alerts.
  • Silencing: mutes alerts according to their labels. If an incoming alert matches a silence, Alertmanager does not send a notification for it.
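
As a rough illustration of the grouping feature, the Python sketch below buckets alerts by the values of chosen labels, which is conceptually what the group_by setting does. The function name group_alerts and the sample alerts are hypothetical and exist only for this example; this is not Alertmanager's actual implementation.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        # Alerts that share the same value for every group_by label
        # land in the same group, and so in the same notification.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical sample alerts, used only to exercise the function.
alerts = [
    {"labels": {"alertname": "InstanceDown", "instance": "app-1:8080"}},
    {"labels": {"alertname": "InstanceDown", "instance": "app-2:8080"}},
    {"labels": {"alertname": "HighLatency", "instance": "app-1:8080"}},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

With group_by set to alertname, the two InstanceDown alerts fall into one group and would be delivered together, while HighLatency forms a group of its own.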

Alertmanager installation

1. Download Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

2. Extract the archive
tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz

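# Enter the extracted directory and start in the foreground
cd alertmanager-0.24.0.linux-amd64
./alertmanager --config.file=alertmanager.yml

# Start in the background
nohup ./alertmanager --config.file=alertmanager.yml &

  • Visit ip+port, for example http://ip:9093/#/alerts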

The workflow for using Alertmanager:

  • Write alert rules in a Prometheus rule file (rule.yml in this article) and configure Prometheus to load it; the rules define under what circumstances an alert is raised.
  • Configure Alertmanager: add receivers such as email, DingTalk or SMS, and specify the targets and media for alert notifications.
  • Set up alert routing: define routes that distinguish and classify alerts by level, and assign different notification methods to different alert targets.

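Three states of an alert:

pending: the alert condition is met, but for less than the configured duration. The duration is the time set in the rule's for field. No notification is sent in this state.
firing: the alert condition has been met for longer than the configured duration. Notifications are sent in this state.
inactive: the alert is neither pending nor firing.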
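
The three states can be sketched as a small state machine. The Python function below is a simplified model, not Prometheus code: alert_states is a hypothetical name, and the 15-second evaluation interval in the sample data is an assumption chosen for illustration.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_met) in time order.
    for_seconds: the rule's `for` duration, in seconds.
    """
    states = []
    active_since = None  # when the condition first became true
    for timestamp, condition_met in samples:
        if not condition_met:
            # Condition no longer holds: the alert resets to inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# Hypothetical evaluations of `up == 0` every 15 s with `for: 1m`.
samples = [(0, False), (15, True), (30, True), (45, True),
           (60, True), (75, True), (90, False)]
print(alert_states(samples, 60))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

The alert stays pending until the condition has held for the full minute, fires at the next evaluation, and drops back to inactive as soon as the condition stops holding.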

The process by which Prometheus triggers an alert:

Prometheus -> threshold triggered -> duration exceeded -> Alertmanager -> grouping | inhibition | silencing -> media type -> email | DingTalk | WeChat, etc.

OK, now that Alertmanager is deployed, our requirement is to monitor the application: if the application goes down, an email should be sent to the developer.

First, go to the root directory of Prometheus and create the rule.yml file.

Let's first briefly introduce the configuration properties of rule.yml.

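groups: # alert rule groups
- name: server-alarm
  rules: # rules; multiple alerts can be configured

  - alert: # alert name
    expr:  # alert expression: a PromQL expression that defines the trigger condition and is evaluated to check whether any time series satisfies it
    for:  # evaluation wait time (optional): the alert is sent only after the condition has held for this long; while waiting, the new alert is in the pending state
    labels: # custom labels: an additional set of labels to attach to the alert
      severity:  # alert severity
    annotations: # additional information, such as text describing the alert in detail
      summary: # alert summary
      description: # detailed alert description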

Severity is a user-defined label; the following values are commonly used.

  • critical, used to describe situations that affect the main functions of the system or even cause the system to crash.
  • warning, used to describe a situation where an exception exists but does not cause the system to crash or stop serving.
  • info, used to describe normal status information corresponding to the normal operation of the business.
  • debug, used to describe debugging information that can be used to troubleshoot problems.

# Rule configuration
groups:
- name: server-alarm
  rules:
  - alert: "InstanceDown"
    expr: up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

In the Prometheus configuration file, set the Alertmanager address and load the rule file.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 192.168.140.133:9093
rule_files:
  - "rule.yml"

Reload the configuration dynamically (the endpoint is only available when Prometheus is started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload

Configure Alertmanager's alertmanager.yml file

alertmanager.yml mainly contains two parts: route and receivers

  • An alert enters the routing tree at the top-level route (route) in the configuration and is sent to the matching receiver according to the routing rules.
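
To make the routing tree more concrete, here is a simplified Python sketch of how an alert might be walked through it. It assumes equality-only matching and models routes as plain dictionaries; the helper names, the match and routes keys as used here, and the receivers oncall and db-team are hypothetical, and real Alertmanager routing has more options than shown.

```python
def matches(route, labels):
    """True when every label in the route's match dict equals the alert's label."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def resolve_receivers(route, labels, inherited=None):
    """Walk the routing tree depth-first and return the receivers for an alert."""
    # A child route without its own receiver inherits the parent's.
    receiver = route.get("receiver", inherited)
    found = []
    for child in route.get("routes", []):
        if not matches(child, labels):
            continue
        found.extend(resolve_receivers(child, labels, receiver))
        # Stop at the first matching child unless it asks to continue.
        if not child.get("continue", False):
            break
    # No child matched: the current node handles the alert itself.
    return found or [receiver]


# Hypothetical tree: a top-level route with two child routes.
tree = {
    "receiver": "email",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "oncall"},
        {"match": {"team": "db"}, "receiver": "db-team"},
    ],
}

print(resolve_receivers(tree, {"alertname": "InstanceDown", "severity": "warning"}))
print(resolve_receivers(tree, {"alertname": "InstanceDown", "severity": "critical"}))
```

An alert that matches no child route falls back to the top-level receiver, which is why the configuration in this article, with a single route and no children, sends everything to email.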

Edit the alertmanager.yml file and save it.

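global:
  smtp_smarthost: 'smtp.163.com:25' # SMTP server address and port
  smtp_from: '[email protected]' # address shown in the email's "From" field
  smtp_auth_username: '[email protected]' # username for SMTP authentication
  smtp_auth_password: 'TCNTXJTZUXJHJJPX' # secret for SMTP authentication; not the account login password
  smtp_require_tls: false # whether the SMTP server requires TLS

route:
  receiver: 'email' # receiver for alert notifications; must match a receiver name below
  group_wait: 10s # how long to wait before sending the first notification for a new group
  group_interval: 30s # how long to wait before notifying about new alerts added to an already-notified group
  repeat_interval: 10m # how long to wait before re-sending a notification
  group_by: ['alertname'] # group alerts by alert name

# Alert receivers
receivers:
- name: 'email' # receiver name
  email_configs:
  - to: '[email protected]' # recipient of the alert emails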

  • Restart Alertmanager
# Start in the background
nohup ./alertmanager --config.file=alertmanager.yml &
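
Before stopping a real service, the email path can be exercised by posting a hand-made alert to Alertmanager's v2 HTTP API. The Python sketch below uses only the standard library; the function names, the default address http://localhost:9093, and the sample label values are assumptions for illustration and should be adapted to the actual deployment.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(instance, job, severity="warning"):
    """Build a one-element alert list in the shape /api/v2/alerts expects."""
    return [{
        "labels": {
            "alertname": "InstanceDown",
            "instance": instance,
            "job": job,
            "severity": severity,
        },
        "annotations": {
            "summary": instance,
            "description": f"Test alert for {instance} of job {job}.",
        },
        # RFC 3339 timestamp marking when the alert started.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def build_request(alerts, base_url="http://localhost:9093"):
    """Build the POST request without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alerts(alerts, base_url="http://localhost:9093", timeout=5):
    """Send the alerts to Alertmanager and return the HTTP status code."""
    request = build_request(alerts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling send_alerts(build_test_alert("app-1:8080", "springboot")) against a running Alertmanager should make the alert appear in the web UI and, once group_wait has elapsed, trigger an email to the configured receiver.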

Check the Prometheus configuration and whether the rules are in effect

Take a look at Alertmanager.

OK, let’s start verifying the alarm function.

First, let's stop the Spring Boot application.

Prometheus now shows the service as down.

The alert appears in Alertmanager.

Finally, check the mailbox for the alert email.

OK, that's all. Remember to support the blogger!

Origin blog.csdn.net/weixin_47533244/article/details/132780119