Monitoring and alerting with Prometheus + AlertManager
1. Introduction to AlertManager
Prometheus itself cannot send alerts, so it needs to be paired with a third-party alerting program to turn monitoring metrics into notifications.
AlertManager is such a program. Prometheus is configured with alerting rules; when a rule is triggered, Prometheus pushes the alert to AlertManager. AlertManager then dispatches the alert to different receivers according to the configured routes and alert severity levels. AlertManager supports email, WeChat Work, and other notification channels.
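Under the hood, Prometheus delivers alerts to AlertManager over HTTP. To illustrate the interface, a hand-pushed alert might look like the following; this is a minimal sketch, assuming AlertManager is already listening on its default port 9093, and the alertname is made up:

[root@prometheus-server ~]# curl -XPOST http://127.0.0.1:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[{"labels":{"alertname":"test-alert","severity":"warning"}}]'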
2. Deploying AlertManager and its configuration file
2.1. Deploy AlertManager
[root@prometheus-server ~]# tar xf alertmanager-0.21.0.linux-amd64.tar.gz
[root@prometheus-server ~]# mv alertmanager-0.21.0.linux-amd64 /data/alertmanager
2.2. Introduction to the configuration file
global:
  resolve_timeout //how long to wait before an alert with no new updates is declared resolved; the recovery notification is not sent immediately, but only once the alert has stopped firing for this long, default is 5 minutes
  smtp_from //sender's email address
  smtp_smarthost //SMTP address of the mail provider
  smtp_auth_username //sender's email account
  smtp_auth_password //email authorization code
  smtp_require_tls //whether TLS is required, default is true
  wechat_api_url //WeChat Work API address
  wechat_api_secret //secret of the WeChat Work application
  wechat_api_corp_id //corp ID of the WeChat Work account
route:
  group_by //which labels to group alerts by
  group_wait //how long to wait after the first alert of a group arrives; the alert is not sent immediately, so other alerts arriving in the meantime can be batched and sent together
  group_interval //how long to wait before sending a notification about new alerts added to an existing group
  repeat_interval //how long to wait before re-sending an alert that has already been sent, which reduces notification frequency
  receiver //name of the receiver to send to
receivers:
  name //name of the receiver, which corresponds to the receiver in the route
  email_configs
    - to //recipient's email address
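To make the relationship between routes and receivers concrete, here is a hedged sketch of severity-based routing; the receiver names and addresses are made up for illustration:

route:
  receiver: 'default-mail'        # fallback receiver for everything not matched below
  group_by: ['alertname']
  routes:
  - match:
      severity: error             # alerts labeled severity=error ...
    receiver: 'critical-mail'     # ... go to a dedicated receiver
receivers:
- name: 'default-mail'
  email_configs:
  - to: '[email protected]'        # hypothetical address
- name: 'critical-mail'
  email_configs:
  - to: '[email protected]'     # hypothetical address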
2.3. Configure AlertManager email alerts
1. Modify the main configuration file
[root@prometheus-server ~]# cd /data/alertmanager/
[root@prometheus-server /data/alertmanager]# vim /data/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'yzjqxhsranbpdijd'
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '[email protected]'
2. Check the syntax
[root@prometheus-server /data/alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
- global config
- route
- 0 inhibit rules
- 1 receivers
- 0 templates
3. Start AlertManager
[root@prometheus-server /data/alertmanager]# nohup ./alertmanager --config.file="/data/alertmanager/alertmanager.yml" &
4. Check the listening ports
[root@prometheus-server /data/alertmanager]# netstat -lnpt | grep alert
tcp6 0 0 :::9093 :::* LISTEN 31401/./alertmanage
tcp6 0 0 :::9094 :::* LISTEN 31401/./alertmanage
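Optionally, push a test alert by hand to confirm that AlertManager is reachable before wiring up Prometheus; a minimal sketch using the bundled amtool, with a made-up alertname:

[root@prometheus-server /data/alertmanager]# ./amtool alert add alertname=test-alert severity=warning \
    --alertmanager.url=http://127.0.0.1:9093

The test alert should then show up in the AlertManager web UI on port 9093.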
2.4. Configure Prometheus to integrate with AlertManager
1. Modify the configuration file
[root@prometheus-server ~]# vim /data/prometheus/prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.81.210:9093      # AlertManager address
rule_files:                      # path to the alerting rule files
- "rules/*.yml"
2. Create the rules directory for the alerting rules
[root@prometheus-server ~]# mkdir /data/prometheus/rules
3. Reload the configuration
[root@prometheus-server ~]# curl -XPOST 192.168.81.210:9090/-/reload
The configuration is now updated, and AlertManager can be used for alerting.
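To confirm that Prometheus has registered the AlertManager, you can query its API; a quick sketch, assuming the addresses used throughout this article:

[root@prometheus-server ~]# curl -s 192.168.81.210:9090/api/v1/alertmanagers

The response should list 192.168.81.210:9093 under activeAlertmanagers.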
3. Alerting rules
3.1. Alerting rule syntax
groups:                       //define a group of alerting rules
- name: general.rules         //group name; alerts of the same type can be put into one group
  rules:                      //define the alerting rules; a group can contain several
  - alert: HostDown           //alert name, i.e. the title of the alert message; one alert entry is one alerting rule
    expr: up == 0             //expression; the alert matches when the expression holds
    for: 1m                   //how long the expression must keep holding before the alert is sent
    labels:                   //define labels
      severity: error         //alert level, e.g. warning or error
    annotations:              //define the alert content
      summary: "Host {{ $labels.instance }} is down"    //message content; $labels.instance is a label variable of the monitored target
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute!"    //detailed description
3.2. Alerting rule states
An alerting rule can be in one of three states:
- inactive (green): nothing is alerting; everything is normal
- pending (yellow): the threshold has been crossed but the alert duration, i.e. the for written in the rule, has not been met yet; the alert is not sent to AlertManager while still inside the for window, and is sent as soon as the for duration has elapsed
- firing (red): the threshold has been crossed and the alert duration has been met; the alert is sent to the receiver
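Prometheus exposes these states through its built-in ALERTS metric, so you can also inspect them with a query; a small sketch, assuming the server address from this article (inactive rules produce no ALERTS series):

[root@prometheus-server ~]# curl -s 192.168.81.210:9090/api/v1/query \
    --data-urlencode 'query=ALERTS{alertstate="firing"}'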
4. Create an alert to detect host downtime
Every monitored instance has an up metric: a value of 0 means the host is down, and 1 means it is up.
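Before writing the rule, you can check the current up values of all targets; a quick sketch against the query API:

[root@prometheus-server ~]# curl -s 192.168.81.210:9090/api/v1/query --data-urlencode 'query=up'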
4.1. Write the host-down alerting rule
1. Write the rule
[root@prometheus-server /data/prometheus]# vim rules/hostdown.yml
groups:
- name: general.rules
  rules:
  - alert: HostDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Host {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute!"
2. Check the syntax
[root@prometheus-server /data/prometheus]# promtool check config /data/prometheus/prometheus.yml
Checking /data/prometheus/prometheus.yml
SUCCESS: 1 rule files found
Checking /data/prometheus/rules/hostdown.yml
SUCCESS: 1 rules found
3. Reload the configuration
[root@prometheus-server /data/prometheus]# curl -XPOST 192.168.81.210:9090/-/reload
You can now see the alerting rule we created under Status -> Rules in the web UI.
On the Alerts page you can also see whether the rule has been triggered; if not, it is shown in green.
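The loaded rules can also be checked through the API instead of the UI; a quick sketch:

[root@prometheus-server /data/prometheus]# curl -s 192.168.81.210:9090/api/v1/rules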
4.2. Trigger the alerting rule
Stop the mysql_exporter on 192.168.81.220 to see whether the alert fires
[root@192_168_81_220 ~]# ps aux | grep mysql
[root@192_168_81_220 ~]# kill -9 40268
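After killing the exporter you can verify that Prometheus now sees a down target; a quick sketch that lists only targets whose up value is 0:

[root@prometheus-server ~]# curl -s 192.168.81.210:9090/api/v1/query --data-urlencode 'query=up == 0'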
4.3. Check whether AlertManager received the alert
The Targets page now shows the instance as down
A red firing state means the alert has been sent
The AlertManager page has also received the host-down alert pushed by Prometheus
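You can also list the alerts currently held by AlertManager from the command line; a minimal sketch using amtool:

[root@prometheus-server /data/alertmanager]# ./amtool alert query --alertmanager.url=http://192.168.81.210:9093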
4.4. Check whether the alert email arrived
It has been received, so the whole pipeline now works end to end
4.5. Trigger alerts of the same group
Grouped alerts work like this: a Prometheus alerting rule applies to all monitored instances, so when several instances trigger the same type of alert, they are displayed together and sent as one email
After we stop the docker service, two alerts of the same type are generated
Although they come from two different services, they belong to the same alerting rule, so they are put together
The emails are also sent together. This is the effect of the group_wait parameter in our configuration file: when several alerts in the same group fire at the same time, they are combined into a single email.
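If you wanted one email per instance instead, you could add instance to the grouping labels; a hedged variant of the route shown earlier:

route:
  group_by: ['alertname', 'instance']   # alerts for different instances now form separate groups
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'mail'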