Install Alertmanager

Download Alertmanager from https://prometheus.io/download/

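For example, assuming version 0.17.0 for Linux amd64 (the version used later in this post; adjust the version and URL to the release you actually pick):

# Download and unpack Alertmanager 0.17.0 (version assumed; adjust as needed)
cd /root
wget https://github.com/prometheus/alertmanager/releases/download/v0.17.0/alertmanager-0.17.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.17.0.linux-amd64.tar.gz
cd alertmanager-0.17.0.linux-amd64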

After extracting the archive, edit alertmanager.yml to configure email alerts. The modified file looks like this:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '***@163.com'
  smtp_auth_username: '***@163.com'
  smtp_auth_password: '******'    # authorization password
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m             # repeat interval; 1m here for testing, set to around 20m-30m in production
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '***@163.com'
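
Before starting, the file can optionally be syntax-checked with amtool, which ships in the same archive (a minimal sketch, run from the extracted directory):

# Validate alertmanager.yml with the bundled amtool
./amtool check-config alertmanager.yml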

Start it up:

nohup ./alertmanager --config.file=/root/alertmanager-0.17.0.linux-amd64/alertmanager.yml &
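
To confirm Alertmanager came up, check its default port 9093 (a quick sketch):

# Alertmanager listens on :9093 by default
ss -tlnp | grep 9093
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9093/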


Modify the Prometheus configuration as follows (pointing it at the host running Alertmanager):

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093   # Alertmanager runs on this same host, so 127.0.0.1; it can also be deployed on another host
rule_files:
  - "rules/*.yml"        # alert rule files


Add a general rule that alerts when a node goes down; the rule file reads as follows (see the validation sketch after the rule):

groups:
- name: general.rules
  rules:
  # Alert for any instance that is unreachable for more than 1 minute.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

Check the Targets page in the Prometheus UI; the node http://192.168.199.221:9100/metrics is in the UP state.


Stop node_exporter on the 221 node and observe again.
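
How to stop it depends on how node_exporter was started; for example (a sketch, run on the 221 node):

# Stop node_exporter to simulate the instance going down
pkill node_exporter
# or, if it runs as a systemd service (assumption):
# systemctl stop node_exporter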



Check the Alerts status:


After a short wait, the alert email arrives.



Add a memory alert rule as follows:

groups:
- name: mem.rules
  rules:
  # Alert when memory usage on an instance stays above the threshold for more than 1 minute.
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 5
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage too high"
      description: "Memory usage on {{ $labels.instance }} of job {{ $labels.job }} has been above 5% for more than 1 minute."

Note: since this is only a test, the threshold is lowered so the alert fires when usage exceeds 5%.

Reload Prometheus and check the Alerts status in the Prometheus UI.
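
Reloading can be done by sending SIGHUP to the Prometheus process, or via the /-/reload endpoint if Prometheus was started with --web.enable-lifecycle (a sketch):

# Option 1: signal the running Prometheus process
kill -HUP $(pgrep prometheus)

# Option 2: HTTP reload, only works with --web.enable-lifecycle
curl -X POST http://127.0.0.1:9090/-/reload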



Check whether the rule has taken effect:


After a short while, the alert email arrives.



Add a CPU alert rule file with the following contents:

groups:
- name: cpu.rules
  rules:
  # Alert when CPU usage on an instance stays above the threshold for more than 1 minute.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} CPU usage too high"
      description: "CPU usage on {{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."

The threshold is set to 1% here only for testing. Reload Prometheus, and after a moment the alert email arrives.
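
To double-check that the alert actually reached Alertmanager (and not just fired inside Prometheus), the alerts it currently holds can be listed with amtool (a sketch):

# List alerts currently held by Alertmanager
./amtool alert query --alertmanager.url=http://127.0.0.1:9093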
