Install Alertmanager
Download Alertmanager from https://prometheus.io/download/.
After extracting the archive, edit alertmanager.yml to enable email alerting; the modified configuration is as follows:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '***@163.com'
  smtp_auth_username: '***@163.com'
  smtp_auth_password: '******'      # authorization password
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m               # repeat interval; set to 1m here, in production set to around 20m-30m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '***@163.com'
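Before starting, the configuration can be sanity-checked with amtool, which ships in the same release archive (a minimal sketch; adjust the path to wherever you extracted Alertmanager):

./amtool check-config alertmanager.yml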
Start it:
nohup ./alertmanager --config.file=/root/alertmanager-0.17.0.linux-amd64/alertmanager.yml &
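To confirm Alertmanager came up, you can hit its health endpoint or open the web UI on port 9093 (assuming the default listen address):

curl http://127.0.0.1:9093/-/healthy
# or open http://127.0.0.1:9093 in a browser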
Modify the Prometheus configuration as follows (pointing it at the Alertmanager host):
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093    # Alertmanager is started on this host, so 127.0.0.1 is used; it can also be deployed on another host
rule_files:
  - "rules/*.yml"         # alert rule files
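The modified prometheus.yml, along with the rule files it references, can be validated with promtool before reloading (a sketch, assuming promtool sits next to the prometheus binary):

./promtool check config prometheus.yml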
Add a general node-down alert rule file with the following contents:
groups:
- name: general.rules
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
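The rule file can also be checked on its own; the file name rules/general.yml below is an assumption, use whatever name matches the rules/*.yml glob:

./promtool check rules rules/general.yml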
In the Prometheus UI, check the Targets page; the node http://192.168.199.221:9100/metrics is in the UP state.
Stop node_exporter on the 221 node and observe again.
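How node_exporter is stopped depends on how it was started; for example, as a systemd service (an assumption) or as a manually launched process:

systemctl stop node_exporter    # if managed by systemd (assumption)
pkill node_exporter             # if started manually with nohup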
Check the Alerts status.
After a short wait, the alert email arrives.
Add a memory alert rule file with the following contents:
groups:
- name: mem.rules
  rules:
  # Alert when memory usage stays above the threshold for more than 1 minute.
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 5
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage too high"
      description: "Memory usage of {{ $labels.instance }} (job {{ $labels.job }}) has been above the threshold for more than 1 minute."
Note: since this is only a test, the threshold is set so that the alert fires when memory usage exceeds 5%.
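The expression can be previewed in the Prometheus expression browser, or via the HTTP query API, to see the current memory usage percentage before relying on the rule (a sketch, assuming Prometheus listens on 127.0.0.1:9090):

curl -G http://127.0.0.1:9090/api/v1/query \
  --data-urlencode 'query=(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100'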
Reload Prometheus, then check the alert status in the Prometheus UI.
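Prometheus can be reloaded without a full restart by sending it SIGHUP, or via the lifecycle API if it was started with --web.enable-lifecycle (pick whichever matches your setup):

kill -HUP $(pidof prometheus)
# or, if --web.enable-lifecycle is enabled:
curl -X POST http://127.0.0.1:9090/-/reload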
Check whether the rules take effect
A short while later, the alert email is received.
Add a CPU alert rule file with the following contents:
groups:
- name: cpu.rules
  rules:
  # Alert when CPU usage stays above the threshold for more than 1 minute.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} cpu usage load too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."
The threshold is set to 1% here, again only for testing. Reload Prometheus, and after a short while the alert is received.
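To trigger the CPU alert deliberately on the test node, some artificial load can be generated, for example with the stress tool if it is installed (an assumption), or with a simple busy loop:

stress --cpu 2 --timeout 120    # requires the stress package (assumption)
# or a simple busy loop:
yes > /dev/null &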
Reproduced from: https://blog.51cto.com/lvsir666/2409063