Prometheus Monitoring Learning Path (4)

Prometheus alert configuration

Alerting overview

Prometheus splits metric collection/storage and alerting between two independent components: Prometheus Server and Alertmanager. The former only evaluates "alerting rules" and produces alert notifications; the actual alert handling is carried out by the latter.
Alertmanager is responsible for processing alert notifications sent by clients:

  • The client is usually Prometheus Server, but alerts from other tools are also accepted
  • Alertmanager deduplicates and groups the alert notifications, then routes them to different receivers according to routing rules, such as Email, SMS or DingTalk

Alerting logic of the Prometheus monitoring system

First, Prometheus must be configured as an alert client of Alertmanager. In turn, Alertmanager is itself an application and should also be included among Prometheus's monitoring targets.
Configuration logic:
Define receivers on Alertmanager; these are usually the specific users who receive alert messages over a given medium

  • Email, WeChat, Slack and webhooks are common media for delivering alert messages
  • The address of an alert recipient is expressed differently depending on the medium

Define routing rules on Alertmanager so that received alert notifications can be handled separately as needed, and define alerting rules on Prometheus to produce alert notifications and send them to Alertmanager (a minimal routing sketch follows below).
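
For example, a minimal sketch of a route tree in alertmanager.yml: the team label and the receiver names are illustrative assumptions, and the email_configs here rely on SMTP settings defined in the global section.

route:
  receiver: 'default-email'      # fallback receiver when no child route matches
  group_by: ['alertname']
  routes:
  - match:
      team: database             # hypothetical label attached by the alerting rules
    receiver: 'dba-email'        # database-related alerts go to the DBA mailbox
receivers:
- name: 'default-email'
  email_configs:
  - to: 'ops@example.com'
- name: 'dba-email'
  email_configs:
  - to: 'dba@example.com'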


Alertmanager

In addition to basic alert notification capabilities, Alertmanager also supports deduplication, grouping, inhibition, silencing, and routing of alerts:

  • Grouping: combines similar alerts into a single notification. When a large-scale failure triggers a wave of alerts, grouping prevents users from being flooded with so much alert noise that the key information is buried
  • Inhibition: when a component or service fails and triggers an alert, other components or services that depend on it may trigger alerts as well. Inhibition is a way to suppress such cascading alerts so that users can focus on the real fault
  • Silence: within a specific time window, Alertmanager does not actually send alert information to users even if alert notifications are received. Silences are typically activated during routine system maintenance
  • Routing: configures how Alertmanager handles specific types of incoming alert notifications. The basic logic is to determine the processing path and behavior for the current alert notification from the results of the route matching rules (see the inhibition and silence sketch after this list)
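
As an illustration, a minimal sketch of an inhibit rule in alertmanager.yml; the severity label values are assumptions:

inhibit_rules:
- source_match:
    severity: 'critical'             # while a critical alert is firing...
  target_match:
    severity: 'warning'              # ...mute warning alerts...
  equal: ['alertname', 'instance']   # ...that carry the same alertname and instance labels

A silence can be created from the command line with amtool, which ships with Alertmanager; the matcher, duration and URL below are example values:

amtool silence add alertname=InstanceDown --duration=2h \
  --author=ops --comment="routine maintenance" \
  --alertmanager.url=http://192.168.0.181:9093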

Configure Alertmanager

Alertmanager is an independent Go binary that must be deployed and maintained separately.

tar xf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager
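
As an optional sanity check of the unpacked binary (path as set up above):

/usr/local/alertmanager/alertmanager --version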

Modify the Alertmanager configuration file

vim alertmanager.yml
global:
  resolve_timeout: 5m          # declare an alert resolved if it has not been updated for this long

route:
  group_by: ['alertname']      # group alerts that share the same alertname
  group_wait: 10s              # wait before sending the first notification for a new group
  group_interval: 10s          # wait before notifying about new alerts added to an existing group
  repeat_interval: 1h          # minimum time before re-sending a still-firing alert
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '***@163.com'
    from: '***@163.com'
    smarthost: 'smtp.163.com:25'
    auth_username: '***@163.com'
    auth_identity: '***@163.com'
    auth_password: 'OTFXYHONWUFELOTN'   # SMTP authorization code, not the mailbox login password
    require_tls: false

A 163 mailbox is used here; auth_password must be the 163 mailbox's SMTP authorization code, not the password used to log in to the mailbox.
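
Before starting, the configuration can be validated with amtool from the same tarball (path assumed from the layout above):

cd /usr/local/alertmanager
./amtool check-config alertmanager.yml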

  • Start Alertmanager (a minimal startup sketch follows)
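
A minimal way to run it in the foreground, assuming the layout above (in production it would usually be wrapped in a systemd unit):

cd /usr/local/alertmanager
./alertmanager --config.file=alertmanager.yml
# Alertmanager listens on port 9093 by default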

Modify the Prometheus configuration file and configure alerting rules

  • File-based service discovery is used here
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - "target/alertmanagers*.yaml"

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yaml"
  - "alert_rules/*.yaml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    file_sd_configs:
    - files:
      - target/prometheus-*.yaml
      refresh_interval: 2m

  # All nodes
  - job_name: 'nodes'
    file_sd_configs:
    - files:
      - target/nodes-*.yaml
      refresh_interval: 2m

  - job_name: 'alertmanagers'
    file_sd_configs:
    - files:
      - target/alertmanagers*.yaml
      refresh_interval: 2m
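
After editing, the configuration can be checked before (re)starting Prometheus; a quick sketch assuming prometheus.yml lives in /usr/local/prometheus:

cd /usr/local/prometheus
./promtool check config prometheus.yml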
  • Write the target files
vim /usr/local/prometheus/target/nodes.yaml
- targets:
  - 192.168.0.181:9100
  - 192.168.0.179:9100
  labels:
    app: node-exporter
    job: node
vim /usr/local/prometheus/target/prometheus-servers.yaml
- targets:
  - 192.168.0.181:9090
  labels:
    app: prometheus
    job: prometheus
vim /usr/local/prometheus/target/alertmanagers.yaml
- targets:
  - 192.168.0.181:9093
  labels:
    app: alertmanager
  • Configure alert rules
vim /usr/local/prometheus/alert_rules/instance_down.yaml
groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    # Condition for alerting
    expr: up == 0
    for: 1m
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Instance down'
      description: 'Instance has been down for more than 1 minute.'
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'critical'

The key point is that expr is a PromQL expression evaluated against metrics in Prometheus; when the condition holds for the duration given by `for` (1 minute here), the alert fires and is sent to Alertmanager.
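
The rule file itself can also be validated with promtool before it is loaded (paths as used above):

cd /usr/local/prometheus
./promtool check rules alert_rules/instance_down.yaml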

  • Start Prometheus
  • Stop node_exporter on one of the nodes, wait about a minute, and then check the mailbox (see the verification sketch below)

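Besides the email, the firing alert can be checked on the Prometheus web UI's /alerts page or queried directly from Alertmanager; a small sketch using the address from the target files above:

# list active alerts known to Alertmanager
/usr/local/alertmanager/amtool alert query --alertmanager.url=http://192.168.0.181:9093
# or query the HTTP API directly
curl -s http://192.168.0.181:9093/api/v2/alerts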

Origin blog.csdn.net/qq_33235529/article/details/113716274