Technology sharing | How to use Prometheus for system monitoring with email alert notifications

The previous article in this Prometheus series covered process monitoring. In a production environment, when a system process fails, the on-call operations staff must be notified immediately so they can check whether the system is still running normally. This article shows how to build monitoring and alert notification on top of Prometheus.

Prometheus handles alert notification through a separate component, Alertmanager. Alertmanager receives alerts from clients such as the Prometheus server, groups and deduplicates them, and routes them to the correct receiver, so alerts can reach different module owners according to different rules. Alertmanager supports notification channels such as WeChat, email, and webhooks; a webhook can in turn be connected to chat tools such as DingTalk.

Alarm process

  • Alerting rules are configured in Prometheus.
  • A monitored object crosses a rule's threshold.
  • The threshold stays exceeded for the configured duration.
  • Prometheus pushes the alert to Alertmanager.
  • Alertmanager processes the alert:
    1) Grouping: similar alerts are combined into one notification.
    2) Silences: notifications for matching alerts are muted for a set period, for example during a system upgrade.
    3) Inhibition: notifications for certain alerts are suppressed while related alerts are already firing, for example a warning is suppressed while the matching critical alert is active.
  • Alertmanager sends the notification through the configured channel (email, DingTalk, WeChat, etc.), and the recipients receive it.
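
As a rough illustration of the grouping step, the sketch below collapses alerts that share an alert name into a single notification line, which is the effect of grouping by alertname. The group_alerts function, the alert names, and the second host address are hypothetical, and the timers Alertmanager applies while grouping are not modeled.

```shell
# Sketch only: group "alertname instance" pairs by alert name and
# print one line (one "notification") per group.
group_alerts() {
  awk '{
         if (members[$1] == "") members[$1] = $2
         else members[$1] = members[$1] "," $2
       }
       END { for (name in members) print name ": " members[name] }' | sort
}

# Three incoming alerts collapse into two notifications.
printf '%s\n' 'ProcessDown 192.168.1.108:9256' \
              'ProcessDown 192.168.1.109:9256' \
              'DiskFull 192.168.1.108:9100' | group_alerts
```

The two ProcessDown alerts end up on one line, so the on-call engineer receives one message for the whole group instead of one per instance.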

Install and deploy Alertmanager

Deploy Alertmanager

Download the binary package

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

tar zxvf alertmanager-0.24.0.linux-amd64.tar.gz
mkdir -p /home/prometheus/alertmanager
mv alertmanager-0.24.0.linux-amd64 /home/prometheus/alertmanager/alertmanager
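
Before unpacking, it can be worth checking the archive against its published digest. The helper below is a minimal sketch: verify_sha256 is a hypothetical name, and it assumes the release page offers a sha256sums.txt file from which the expected digest is copied by hand.

```shell
# Sketch: compare a file's SHA-256 digest with an expected value.
# Returns 0 when they match, non-zero otherwise.
verify_sha256() {
  local file=$1 expected=$2 actual
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}

# Usage (replace <digest> with the value listed for this archive):
#   verify_sha256 alertmanager-0.24.0.linux-amd64.tar.gz "<digest>" && echo "checksum OK"
```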

Create the alertmanager service

vim /etc/systemd/system/alertmanager.service

[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target

[Service]
User=root
Type=simple
# Do not wrap the ExecStart arguments in single or double quotes
ExecStart=/home/prometheus/alertmanager/alertmanager/alertmanager --config.file=/home/prometheus/alertmanager/alertmanager/alertmanager.yml --storage.path=/home/prometheus/alertmanager/alertmanager/data --web.listen-address=:19093 --cluster.listen-address=0.0.0.0:19094 --web.external-url=http://192.168.1.108:19093
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start the service:

systemctl daemon-reload
systemctl enable --now alertmanager
systemctl status alertmanager
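
Besides systemctl status, the service can be probed over HTTP. The sketch below polls Alertmanager's /-/healthy endpoint on the listen address used above; wait_until_healthy is a hypothetical helper, and the retry count and one-second pause are arbitrary choices.

```shell
# Sketch: poll the health endpoint until it answers or the tries run out.
#   $1 = host:port of Alertmanager, $2 = number of tries (default 10)
wait_until_healthy() {
  local addr=$1 tries=${2:-10} i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS -o /dev/null "http://${addr}/-/healthy"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_until_healthy 192.168.1.108:19093 && echo "Alertmanager is up"
```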

Open http://192.168.1.108:19093 in a browser to reach the Alertmanager web UI.

Alertmanager configuration

The configuration file is shown below, using email alerts as the example:

vim /home/prometheus/alertmanager/alertmanager/alertmanager.yml
# Email sender
global:
  resolve_timeout: 30s
  smtp_smarthost: 'smtp.qq.com:465' 
  smtp_from: '[email protected]' 
  smtp_auth_username: '[email protected]' 
  smtp_auth_password: 'xxxxxxxxvpobcee'
  smtp_hello: '@qq.com'
  smtp_require_tls: false

templates:
  - '/home/prometheus/alertmanager/alertmanager/tmpl/email.tmpl'  # add the templates configuration
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'email'
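  routes:
  # The dingtalk-webhook route is commented out: this email-only example defines
  # no dingtalk-webhook receiver, and Alertmanager rejects undefined receivers.
  # - receiver: dingtalk-webhook
  #   group_wait: 10s
  - receiver: email
    group_wait: 10s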
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
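
To try the email path without waiting for a real failure, the configuration can be validated with amtool (shipped in the same archive as alertmanager) and a hand-made alert can be posted to Alertmanager's v2 API. The sketch below only builds the JSON payload; test_alert_json and the alert name EmailTest are hypothetical, and the commented commands assume the paths and address used in this article.

```shell
# Sketch: build a one-alert JSON payload for POST /api/v2/alerts.
# This simple version assumes the values contain no double quotes or backslashes.
test_alert_json() {
  local name=$1 severity=$2 instance=$3
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"manual test alert"}}]\n' \
    "$name" "$severity" "$instance"
}

# Usage, once the service is running:
#   /home/prometheus/alertmanager/alertmanager/amtool check-config \
#       /home/prometheus/alertmanager/alertmanager/alertmanager.yml
#   test_alert_json EmailTest warning 192.168.1.108 |
#     curl -fsS -X POST -H 'Content-Type: application/json' \
#          -d @- http://192.168.1.108:19093/api/v2/alerts
```

If the SMTP settings are correct, the test alert should arrive by email after the route's group_wait has elapsed.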

Prometheus rules

Create a new rule file and configure the group information, the alert threshold and duration, and the alert labels and annotations.

Rule expressions are written in PromQL. Most size-related metrics are reported in bytes, so thresholds given in KB, MB, or GB must be converted to bytes; for example, 2 MB = 2 * 1024 * 1024 bytes.
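
The conversion is plain integer arithmetic, so the shell can compute the threshold before it is pasted into a rule. The mib_to_bytes helper below is a hypothetical name used only for this sketch.

```shell
# Convert a size in MB (counted as 1024 * 1024 bytes) to bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 2   # prints 2097152
```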

The same Prometheus rule file works whether notifications go to email, DingTalk, or enterprise WeChat:

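vim /home/prometheus/prometheus/rule/qtalk_auth.yml
groups:
 - name: qtalk_auth process exited abnormally
   rules:
   - alert: Application process qtalk_auth exited abnormally # alert name
     expr: (namedprocess_namegroup_num_procs{groupname="map[:qtalk_auth]"}) == 0
     for: 30s # how long the condition must hold before the alert fires
     labels: # labels attached to the alert
       severity: error
       ip: 192.168.1.108
     annotations: # annotations that describe the alert in detail
       summary: "Process alert: {{ $labels.instance }} has been stopped for more than 30 seconds."
       description: "Process {{ $labels.groupname }} on {{ $labels.ip }} has stopped abnormally. Check it immediately!"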

Check the Prometheus alerting rule file with promtool; it should report SUCCESS:

/home/prometheus/prometheus/promtool check rules rule/qtalk_auth.yml
Checking rule/qtalk_auth.yml
  SUCCESS: 1 rules found

Prometheus configuration

Edit the Prometheus configuration file to set the IP address and port of the Alertmanager server and the path of the rule files:

vim /home/prometheus/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.1.108:19093"]
           #- alertmanager:["192.168.1.108:19093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
 - "rule/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

  - job_name: 'process'
    static_configs:
      - targets: ['192.168.1.108:9256']

Restart the Prometheus service:

systemctl restart prometheus.service
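
After the restart, the Prometheus HTTP API can confirm that the rule file was loaded and that the Alertmanager target was picked up. The sketch below is a thin wrapper around curl; prom_api and the PROM_URL variable are hypothetical names, and the default address assumes Prometheus listens on localhost:9090 as in the scrape configuration above.

```shell
# Sketch: fetch a Prometheus HTTP API path and print the raw JSON response.
prom_api() {
  local base=${PROM_URL:-http://localhost:9090}
  curl -fsS "${base}$1"
}

# Usage:
#   prom_api /api/v1/rules           # should include the qtalk_auth rule group
#   prom_api /api/v1/alertmanagers   # should list 192.168.1.108:19093 as active
#   prom_api /api/v1/alerts          # alerts that are currently pending or firing
```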

Email Alert

View Prometheus

On the Prometheus home page, open the Alerts tab to view alert information.

An alert can be in one of three states:

  • inactive: the rule's condition is not met; nothing is wrong.
  • pending: the threshold has been crossed, but the condition has not yet held for the required duration (the for field in the rule).
  • firing: the threshold has been crossed for the full duration, and the alert is sent to Alertmanager.

In the pending state the threshold has been crossed, but Prometheus keeps observing for another 30 seconds (for: 30s).

In the firing state, the threshold has remained exceeded for the full 30 seconds, so the alert is sent to Alertmanager.
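
The three states can be summarized as a small decision function. This is a simplified sketch, not Prometheus code: alert_state is a hypothetical helper, and in practice the state only changes when a rule is evaluated (every evaluation_interval), so transitions can lag the raw timings by up to one interval.

```shell
# Sketch: classify an alert the way the Alerts page labels it.
#   $1 = 1 if the rule expression currently matches, 0 otherwise
#   $2 = seconds the expression has matched without interruption
#   $3 = the rule's "for" duration in seconds
alert_state() {
  local matching=$1 elapsed=$2 for_seconds=$3
  if [ "$matching" -ne 1 ]; then
    echo inactive
  elif [ "$elapsed" -lt "$for_seconds" ]; then
    echo pending
  else
    echo firing
  fi
}

alert_state 1 10 30   # prints pending
alert_state 1 30 30   # prints firing
```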

View Alertmanager

Only alerts in the firing state in Prometheus are sent to Alertmanager. Open the Alertmanager home page to view them.

Check email

After Prometheus sends an alert to Alertmanager, Alertmanager emails the alert according to its notification settings.

While the alert keeps firing, the email is resent at the interval set in the Alertmanager configuration (repeat_interval, which can be changed in the configuration file).

This completes a simple Prometheus-based system monitoring and alert notification service. With it, operations staff learn about system health problems early, which helps keep the system highly available.

Origin blog.csdn.net/anyRTC/article/details/129297866