Prometheus+Grafana监控系统

目录

一、Promethues. 1

二、Grafana. 2

三、环境准备... 2

四、安装Prometheus与Grafana. 3

五、配置Prometheus与Grafana. 4

1、配置prometheus. 4

2、配置grafana. 6

六、监控机器硬件资源... 7

七、监控基础服务... 9

1、监控NGINX. 9

扫描二维码关注公众号,回复: 14888629 查看本文章

2、监控MYSQL. 11

3、监控 REDIS. 16

八、监控应用程序... 18

九、监控业务接口数据... 19

十、告警设置... 21

1、设置告警方式... 21

2、设置告警规则... 24

①/data/prometheus/rules/node.yml 24

② /data/prometheus/rules/redis.yml 26

③/data/prometheus/rules/mysql.yml 27

④/data/prometheus/rules/nginx.yml 29

使用Prometheus+Grafana搭建监控系统,主要监控内容包括机器硬件资源,基础服务,应用程序,业务接口数据。

一、Promethues

Prometheus - Monitoring system & time series database

Prometheus是一个开源的服务监控系统和时间序列数据库。Prometheus生态系统由多个组件组成,包括负责数据采集和存储并提供PromQL查询语言支持的Prometheus Server,提供多语言的客户端SDK,支持临时性Job主动推送指标的中间网关Push Gateway,数据采集组件Exporter,它负责从目标处搜集数据,并将其转化为Prometheus支持的格式以及提供告警功能的Alertmanager。

Promethues Exporter组件与传统的数据采集组件不同的是它并不向中央服务器发送数据,而是等待中央服务器主动前来抓取,prometheus提供多种类型的exporter用于采集各种不同服务的运行状态。

Promethues官网提供的生态系统架构图:

二、Grafana

Grafana: The open observability platform | Grafana Labs

Grafana是一个跨平台的开源的度量分析和可视化工具,支持从多种数据源(如prometheus)获取数据进行可视化数据展示。

三、环境准备

&&&监控服务器的硬件资源:cpu 8核、内存32G、磁盘250G、网卡。

Prometheus及其它监控服务部署安排如下:

1.查看操作系统版本

[root@node64 ~]# cat /etc/redhat-release

CentOS Linux release 7.1.1503 (Core)

[root@node64 ~]# getconf LONG_BIT

64

2.查看网卡IP及配置

[root@node64 ~]# ip a

eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

link/ether fa:16:3e:a3:89:25 brd ff:ff:ff:ff:ff:ff

inet 192.168.0.91/24 brd 192.168.0.255 scope global eth1

valid_lft forever preferred_lft forever

inet6 fe80::f816:3eff:fea3:8925/64 scope link

valid_lft forever preferred_lft forever

[root@node64 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1

NAME=eth1

TYPE=Ethernet

BOOTPROTO=dhcp

DEVICE=eth1

ONBOOT=yes

IPV4_ROUTE_METRIC=100

四、安装Prometheus与Grafana

1.创建安装目录   mkir /data/prometheus

groupadd prometheus

useradd -g prometheus -s /sbin/nologin prometheus

chown -R prometheus:prometheus prometheus

2.下载prometheus,grafana安装包及prometheus插件包,

包括node_exporter、mysqld_exporter、nginx-vts-exporter、redis_exporter、alertmanager

Prometheus安装包参考如下:

wget https://github.com/prometheus/prometheus/releases/download/v2.20.0/prometheus-2.20.0.linux-amd64.tar.gz

   Grafana安装包参考如下:

wget https://dl.grafana.com/oss/release/grafana-7.1.1.linux-amd64.tar.gz

3.解压安装

五、配置Prometheus与Grafana

1、配置prometheus

a.修改配置文件prometheus.yml

vi /data/prometheus/prometheus/prometheus.yml

scrape_configs:

    metrics_path: /prometheus/metrics

static_configs:

      - targets: ['192.168.0.91:9090']

b.检查配置文件

/data/prometheus/prometheus

[root@node64 prometheus]# ./promtool check config prometheus.yml

注:使prometheus配置生效 pgrep -fl  prometheus

c.将prometheus注册为系统服务

[root@node64 prometheus]# cat /usr/lib/systemd/system/prometheus.service

[Unit]

Description=prometheus

After=network.target

[Service]

Type=simple

User=root

ExecStart=/data/prometheus/prometheus/prometheus --web.external-url=prometheus   --web.enable-admin-api --config.file=/data/prometheus/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus/prometheus/data --storage.tsdb.retention=15d --log.level=info --web.enable-lifecycle

Restart=on-failure

[Install]

WantedBy=multi-user.target

d.启动并查看prometheus服务

systemctl enable prometheus

systemctl start prometheus

systemctl status prometheus

   [root@node64 prometheus]# netstat -anp | grep 9090

e. nginx转发prometheus服务

    Prometheus 的 Node Exporter 并没有提供任何认证支持。不过,借助 Nginx 作为反向代理服务器,我们可以很容易地为 Node Exporter 添加 HTTP Basic Auth 功能。

yum -y install httpd

[root@node23 conf]# htpasswd -c .htpasswd_prometheus Prometheus

Iampwd

location /prometheus/ {

     proxy_set_header X-Real-IP $remote_addr;

     proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

     proxy_set_header Host $http_host;

     proxy_set_header X-Nginx-Proxy true;

     proxy_pass http://192.168.0.91:9090;

     proxy_redirect off;

     proxy_buffering off;

     proxy_read_timeout 90;

     proxy_send_timeout 90;

        auth_basic "Prometheus";

auth_basic_user_file ".htpasswd";

}

f. web访问prometheus

https://***.com:9091/prometheus/

prometheus/Iampwd

2、配置grafana

   a.修改配置文件default.ini

     vi /data/prometheus/grafana/conf/defaults.ini

     http_port = 3000

     root_url = %(protocol)s://%(domain)s:%(http_port)s/grafana/

  

b.将grafana注册为系统服务

     [root@node64 conf]# cat /usr/lib/systemd/system/grafana-server.service

[Unit]

Description=Grafana

After=network.target

[Service]

Type=notify

ExecStart=/data/prometheus/grafana/bin/grafana-server -homepath /data/prometheus/grafana

Restart=on-failure

[Install]

WantedBy=multi-user.target

c.启动并查看grafana服务

systemctl enable grafana-server

systemctl start grafana-server

systemctl status grafana-server

  

   d. nginx转发grafana服务

     location /grafana/ {

          proxy_set_header X-Real-IP $remote_addr;

          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

          proxy_set_header Host $http_host;

          proxy_set_header X-Nginx-Proxy true;

          proxy_pass http://192.168.0.91:3000/;

          proxy_redirect off;

          proxy_buffering off;

          proxy_read_timeout 90;

          proxy_send_timeout 90;

       }

   e. web访问grafana并配置数据源

     https://***.com:9091/grafana/

admin/Iampwd

     数据源URL: http://192.168.0.91:9090/prometheus

 

六、监控机器硬件资源

Prometheus使用node_exporter插件监控机器硬件资源,prometheus主动从安装了node_exporter服务的网络互通的机器上抓取所需的数据

1.在需要监控的机器上安装node_exporter

2.将node_exporter注册为系统服务

[root@node64 conf]# cat /usr/lib/systemd/system/node_exporter.service

[Unit]

Description=node_exporter

Documentation=https://prometheus.io/

After=network.target

[Service]

Type=simple

User=root

ExecStart=/data/prometheus/node_exporter/node_exporter

Restart=on-failure

[Install]

WantedBy=multi-user.target

3.启动并查看node_exporter服务

systemctl enable node_exporter

systemctl start node_exporter

systemctl status node_exporter

4.修改prometheus.yml并重启prometheus服务

scrape_configs:

- job_name: 'node_exporter'

static_configs:

-targets:['192.168.0.91:9100','192.168.0.92:9100'…]

   

5.访问prometheus查看监控状态

 6.引入grafana监控面板

Dashboards | Grafana Labs

node_exporter 8919

mysql_exporter 11323

mysql overview 7362

nginxvts 2949

redis 11835

七、监控基础服务

1、监控NGINX

Nginx通过nginx-module-vts模块获取nginx的某些指标数据, Prometheus再通过nginx-vts-exporter组件采集nginx信息。

a、在nginx服务器安装nginx-module-vts模块

./configure --prefix= /data/nginx --with-http_gzip_static_module --with-http_stub_status_module --with-http_ssl_module --with-pcre --with-file-aio --with-http_realip_module --add-module=/data/nginx-module-vts

make && make install

  • 修改nginx.conf文件

http{

    vhost_traffic_status_zone;

    vhost_traffic_status_filter_by_host on;

location /status {

vhost_traffic_status_display;

vhost_traffic_status_display_format html;

  }

}

b、在nginx服务器与prometheus服务器都下载安装nginx-vts-exporter插件

wget

https://github.com/hnlq715/nginx-vts-exporter/releases/download/v0.10.3/nginx-vts-exporter-0.10.3.linux-amd64.tar.gz

c、在nginx服务器(192.168.0.71)将nginx-vts-exporter注册为系统服务

cat /etc/systemd/system/nginx-vts-exporter.service

[Unit]

Description=nginx_exporter

After=network.target

[Service]

Type=simple

User=root

ExecStart=/data/nginx-vts-exporter/nginx-vts-exporter -nginx.scrape_uri=https://公网IP:9091/status/format/json

Restart=on-failure

[Install]

WantedBy=multi-user.target

d、在prometheus服务器(192.168.0.91)将nginx-vts-exporter注册为系统服务

cat /etc/systemd/system/nginx-vts-exporter.service

[Unit]

Description=nginx_exporter

After=network.target

[Service]

Type=simple

User=root

ExecStart=/data/prometheus/nginx-vts-exporter/nginx-vts-exporter -nginx.scrape_uri=https://192.168.0.71:9091/status/format/json

Restart=on-failure

[Install]

WantedBy=multi-user.target

e、在nginx服务器与prometheus服务器启动并查看nginx-vts-exporter服务

systemctl enable nginx-vts-exporter

systemctl start nginx-vts-exporter

systemctl status nginx-vts-exporter

f、在prometheus服务器上修改配置文件prometheus.yml并重启prometheus服务

- job_name: 'nginx'

static_configs:

      - targets: ['192.168.0.91:9913']

  

 https://公网IP:9091/statusg、查看nginx页面监控

https://公网IP:9091/status

在grafana上导入nginx监控面板 nginx-vts-exporter 2949并查看面板

       https://***.com:9091/grafana/d/5-RKCVxGk/nginx-vts-stats?orgId=1

 

2、监控MYSQL

Prometheus通过mysqld_exporter组件采集MySQL主从服务器的相关数据。

1)使用自动化脚本安装mysql后将mysql服务加入系统服务并设置开机自启

[root@centos7-min4 nginx]# cp /opt/mysql57/support-files/mysql.server /etc/rc.d/init.d/mysqld

chmod +x /etc/init.d/mysqld

chkconfig --add mysqld

chkconfig --list

 # systemctl start mysqld

# systemctl status mysqld

[mysql@centos7-min4 nginx]$ mysql -uroot -p —— 123456

mysql> select version();

+------------+

| version() |

+------------+

| 5.7.24-log |

2)在prometheus服务器上安装mysqld_exporter组件

prometheus监控mysql主从服务器

    • 登录mysql为exporter创建账号并授权

create user 'exporter'@'192.168.0.%' identified by 'Abc123';

grant process,replication client,select on *.* to 'exporter'@'192.168.0.%';

flush privileges;

   

    • 在Prometheus服务器上安装mysqld_exporter服务,同时监控mysql主从服务

   ls -al /data/prometheus/mysqld_exporter/

   .my-master.cnf

   .my-slave.cnf

       root@node64 mysqld_exporter]# cat .my-master.cnf

[client]

user=exporter 

password=Abc123

host=192.168.0.92

port=3306

[root@node64 mysqld_exporter]# cat .my-slave.cnf

[client]

user=exporter

password=Abc123

host=192.168.0.93

port=3306

    • 启动mysqld_exporter服务

为mysql主从服务分别启动一个服务:

Mysql主服务启动

/data/prometheus/mysqld_exporter/mysqld_exporter --web.listen-address=192.168.0.91:9104 --config.my-cnf=/data/prometheus/mysqld_exporter/.my-master.cnf --collect.auto_increment.columns --collect.binlog_size --collect.global_status --collect.engine_innodb_status --collect.global_variables --collect.info_schema.innodb_metrics --collect.info_schema.innodb_tablespaces --collect.info_schema.innodb_cmp --collect.info_schema.innodb_cmpmem --collect.info_schema.processlist --collect.info_schema.query_response_time --collect.info_schema.tables --collect.info_schema.tablestats --collect.info_schema.userstats --collect.perf_schema.eventswaits --collect.perf_schema.file_events --collect.perf_schema.indexiowaits --collect.perf_schema.tableiowaits --collect.perf_schema.tablelocks

Mysql从服务启动

/data/prometheus/mysqld_exporter/mysqld_exporter --web.listen-address=192.168.0.91:9105 --config.my-cnf=/data/prometheus/mysqld_exporter/.my-slave.cnf

(注:其余参数请与上面Mysql主服务启动保持一致)

3)修改prometheus配置文件信息并重启prometheus

prometheus.yml

- job_name: 'mysql_exporter'

    static_configs:

    #  - targets: ['192.168.0.92:9104','192.168.0.93:9104']

      - labels:

          instance: master:3306 # grafana显示的实例的别名

      - targets:

        - 192.168.0.91:9104 # mysqld_exporter暴露的端口

      - labels:

          instance: slave:3306 # grafana显示的实例的别名

      - targets:

        - 192.168.0.91:9105 # mysqld_exporter暴露的端口

4)查看prometheus与grafana面板的mysql数据,导入mysql监控面板

mysql_exporter 11323

mysql overview 7362

 

 

 

PS: mysql同步故障处理: Slave_SQL_Running: No

分析:造成mysql数据同步失败的原因

  1. 程序可能在slave上进行了写操作
  2. 可能是slave机器重启后,事务回滚造成的

处理办法:首先停掉slave服务,查看主服务器上主机状态,并根据File和Position对应的值同步到从服务器,最后启动slave服务,查看同步状态

主服务器:

mysql> show master status;

从服务器:

mysql> stop slave;

Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql>

mysql> change master to master_host='192.168.0.92',

-> master_user='repl',

-> master_password='123456',

-> master_log_file='mysql-bin-T-prod-3306.000005',

-> master_log_pos=653020;

Query OK, 0 rows affected, 2 warnings (0.05 sec)

mysql> start slave;

3、监控 REDIS

使用redis_exporter组件监控三主三从的redis集群。

     1)使用自动化脚本安装redis三主三从集群

     2)在prometheus服务器下载并安装redis_exporter服务

wget https://github.com/oliver006/redis_exporter/releases/download/v1.3.5/redis_exporter-v1.3.5.linux-amd64.tar.gz

    3)监控redis集群中其中一台服务器即可监控整个已创建好的集群

cd /data/prometheus/redis_exporter

./redis_exporter -redis.addr 192.168.0.93:7000 -redis.password 'zxcvb123' &

   4)修改prometheus配置文件信息并重启prometheus

- job_name: 'redis_exporter_targets'

    static_configs:

      - targets:

        - redis://192.168.0.3:7000

        - redis://192.168.0.2:7003

        - redis://192.168.0.72:7002

        - redis://192.168.0.35:7001

        - redis://192.168.0.14:7004

        - redis://192.168.0.13:7005

    metrics_path: /scrape

    relabel_configs:

      - source_labels: [__address__]

        target_label: __param_target

      - source_labels: [__param_target]

        target_label: instance

      - target_label: __address__

        replacement: 192.168.0.91:9121

  - job_name: 'redis_exporter'

    static_configs:

      - targets:

        - 192.168.0.91:9121

       5)查看prometheus与grafana面板数据,导入面板redis 11835

八、监控应用程序

Prometheus使用node_exporter组件监控应用程序

     1)在需要监控的应用程序服务器上安装process-exporter组件

wget

https://github.com/ncabatoff/process-exporter/releases/download/v0.5.0/process-exporter-0.5.0.linux-amd64.tar.gz

    2)配置应用程序监控信息

process-conf.yml

      3)启动应用程序监控服务并指定配置文件

./process-exporter -config.path  process-conf.yml &

      4)修改prometheus配置信息并重启prometheus

[root@node64 prometheus]# vi prometheus.yml

- job_name: process

static_configs:

- targets: ['192.168.0.35:9256','192.168.0.13:9256'…]

     5)查看prometheus与grafana面板数据  

    process-exporter对应的dashboard为:Named processes | Grafana Labs

 

 

九、监控业务接口数据

根据开发提供的业务监控接口数据配置grafana数据展示。

https://***.com:9092/api

  • 注意:

1.Grafana不保存普罗米修斯的数据。它查询Prometheus并显示UI。在这种情况下,您必须查看清除普罗米修斯数据。

Prometheus默认具有15天的保留期。但这可以通过-storage.local.retention标志进行调整以满足您的需求

2.prometheus会把过期数据置NaN,sum()求和函数不支持NaN,需要调整metric:

sum(st_invoke_count{app_id=~'$appid',road_type='1'}>0)

3.Grafana全局变量

4.Grafana菜单级联

十、告警设置

1、设置告警方式

> Prometheus

  Prometheus通过组件alertmanager实现告警,Alertmanager接收prometheus发送的告警并对告警进行一系列的处理后发送给指定的用户。

prometheus--->触发阈值--->超出持续时间--->alertmanager--->分组|抑制|静默--->媒体类型--->邮件、钉钉、微信等。

      1)安装alertmanager

wget https://github.com/prometheus/alertmanager/releases/download/v0.16.2/alertmanager-0.16.2.linux-amd64.tar.gz

      2)修改alertmanager配置文件信息alertmanager.yml

      3)开启smtp服务

      4)配置告警通知模板

[root@centos7-min4 alertmanager-0.21.0]# cat template/test.tmpl

{ { define "test.html" }}

<table border="5">

<tr>

<td>报警项</td>

<td>事项</td>

<td>报警阀值</td>

<td>开始时间</td>

</tr>

{ { range $i, $alert := .Alerts }}

<tr>

<td>{ { index $alert.Labels "alertname" }}</td>

<td>{ { index $alert.Labels "instance" }}</td>

<td>{ { index $alert.Annotations "value" }}</td>

<td>{ { $alert.StartsAt }}</td>

</tr>

{ { end }}

</table>

{ { end }}

      5)启动alertmanager服务

(1)指定配置文件启动

[root@centos7-min4 alertmanager-0.21.0]# ./alertmanager --config.file=alertmanager.yml &

(2)配置成系统服务启动

[root@centos7-min4 alertmanager-0.21.0]# cat /usr/lib/systemd/system/alertmanager.service

[Unit]

Description=https://prometheus.io

[Service]

Restart=on-failure

ExecStart=/opt/alertmanager-0.21.0/alertmanager --config.file=/opt/alertmanager-0.21.0/alertmanager.yml

[Install]

WantedBy=multi-user.target

systemctl enable alertmanager

systemctl start alertmanager

systemctl status alertmanager

      6)修改prometheus配置文件并重启prometheus

# Alertmanager configuration

alerting:

  alertmanagers:

  - static_configs:

    - targets:

       - 192.168.0.91:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.

rule_files:

    - "/data/prometheus/rules/node.yml"

    - "/data/prometheus/rules/redis.yml"

    - "/data/prometheus/rules/mysql.yml"

    - "/data/prometheus/rules/nginx.yml"

 - "/data/prometheus/rules/service-api.yml"

> Grafana

2、设置告警规则

①/data/prometheus/rules/node.yml

groups:

- name: NodeProcess

  rules:

  - alert: NodeStatus

    expr: up == 0

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}:服务器宕机"

      description: "{ {$labels.instance}}:服务器延时超过5分钟"

  - alert: NodeFilesystemUsage

    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80

    for: 2m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: { {$labels.mountpoint }} 分区使用过高"

      description: "{ {$labels.instance}}: { {$labels.mountpoint }} 分区使用大于 80% (当前值: { { $value }})"

  - alert: NodeMemoryUsage

    expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80

    for: 2m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: 内存使用过高"

      description: "{ {$labels.instance}}: 内存使用大于 80% (当前值: { { $value }})"

  - alert: NodeCPUUsage

    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80

    for: 2m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: CPU使用过高"

      description: "{ {$labels.instance}}: CPU使用大于 80% (当前值: { { $value }})"

  - alert: LoadCPU

    expr: node_load5 > 5

    for: 2m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: 负载过高"

      description: "{ {$labels.instance}}: 5分钟内平均负载超过5 (当前值: { { $value }})" 

  - alert: DiskIORead

    expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: I/O读负载过高"

      description: "{ {$labels.instance}}: I/O 每分钟读已超过 30MB/s (当前值: { { $value }})"

  - alert: DiskIOWrite

    expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: I/O写负载过高"

      description: "{ {$labels.instance}}: I/O 每分钟写已超过 30MB/s (当前值: { { $value }})"

  - alert: 流入网络带宽

    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 18432

    for: 1m

    labels:

      status: warning

    annotations:

      summary: "{ {$labels.mountpoint}} 流入网络带宽过高!"

      description: "{ {$labels.mountpoint }}流入网络带宽持续5分钟高于18M. RX带宽使用率{ {$value}}"

  - alert: 流出网络带宽

    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 18432

    for: 1m

    labels:

      status: warning

    annotations:

      summary: "{ {$labels.mountpoint}} 流出网络带宽过高!"

      description: "{ {$labels.mountpoint }}流出网络带宽持续5分钟高于18M. RX带宽使用率{ {$value}}"

  - alert: 网络连接数

    expr: node_sockstat_TCP_inuse > 240

    for: 1m

    labels:

      status: warning

    annotations:

      summary: "{ {$labels.mountpoint}} 连接数过高!"

      description: "{ {$labels.mountpoint }}当前连接数{ {$value}}"

② /data/prometheus/rules/redis.yml

groups:

- name:  Redis

  rules:

    - alert: RedisDown

      expr: redis_up  == 0

      for: 5m

      labels:

        severity: warning

      annotations:

        summary: "Redis down (instance { { $labels.instance }})"

        description: "Redis集群节点故障\n  VALUE = { { $value }}\n  LABELS: { { $labels }}"

    - alert: OutOfMemory

      expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90

      for: 5m

      labels:

        severity: warning

      annotations:

        summary: "Out of memory (instance { { $labels.instance }})"

        description: "Redis is running out of memory (> 90%)\n  VALUE = { { $value }}\n  LABELS: { { $labels }}"

    - alert: ReplicationBroken

      expr: delta(redis_connected_slaves[1m]) < 0

      for: 5m

      labels:

        severity: warning

      annotations:

        summary: "Replication broken (instance { { $labels.instance }})"

        description: "Redis instance lost a slave\n  VALUE = { { $value }}\n  LABELS: { { $labels }}"

    - alert: TooManyConnections

      expr: redis_connected_clients > 1000

      for: 5m

      labels:

        severity: warning

      annotations:

        summary: "Too many connections (instance { { $labels.instance }})"

        description: "Redis instance has too many connections\n  VALUE = { { $value }}\n  LABELS: { { $labels }}"     

    - alert: RejectedConnections

      expr: increase(redis_rejected_connections_total[1m]) > 0

      for: 5m

      labels:

        severity: warning

      annotations:

        summary: "Rejected connections (instance { { $labels.instance }})"

        description: "Some connections to Redis has been rejected\n  VALUE = { { $value }}\n  LABELS: { { $labels }}"

    - alert: AofSaveStatus

      expr: redis_aof_last_bgrewrite_status < 1

      for: 5m

      lables:

        serverity: warning

      annotations:

        summary: "Missing backup (instance { { $labels.instance }})"

        description: "Redis AOF persistence failed\n  VALUE = { { $value }}\n  LABELS: { { $labels }}"

③/data/prometheus/rules/mysql.yml

groups:

    - name: MySQL

      rules:

      - alert: MySQL Status

        expr: mysql_up == 0

        for: 5s

        labels:

          severity: warning

        annotations:

          summary: "{ {$labels.instance}}: MySQL has stop !!!"

          description: "检测MySQL数据库运行状态"

   

      - alert: MySQL Slave IO Thread Status

        expr: mysql_slave_status_slave_io_running != 1

        for: 5s

        labels:

          severity: warning

        annotations:

          summary: "{ {$labels.instance}}: MySQL Slave IO Thread has stop !!!"

          description: "检测MySQL主从IO线程运行状态"

      - alert: MySQL Slave SQL Thread Status

        expr: mysql_slave_status_slave_sql_running != 1

        for: 5s

        labels:

          severity: warning

        annotations:

          summary: '{ {$labels.instance}}: MySQL Slave SQL Thread has stop !!!' 

          description: "检测MySQL主从SQL线程运行状态"

      - alert: MySQL Slave Delay Status

        expr: mysql_slave_status_sql_delay == 30

        for: 5s

        labels:

          severity: warning

        annotations:

          summary: "{ {$labels.instance}}: MySQL Slave Delay has more than 30s !!!"

          description: "检测MySQL主从延时状态"

   

      - alert: Mysql_Too_Many_Connections

        expr: rate(mysql_global_status_threads_connected[5m]) > 200

        for: 2m

        labels:

          severity: warning

        annotations:

          summary: "{ {$labels.instance}}: 连接数过多"

          description: "{ {$labels.instance}}: 连接数过多,请处理 ,(current value is: { { $value }})" 

④/data/prometheus/rules/nginx.yml

groups:

- name: nginx

  rules:

  - alert: Nginx Status

    expr: up{instance="192.168.0.91:9913",job="nginx"} == 0

    for: 5s

    labels:

      severity: warning

    annotations:

      summary: "{ {$labels.instance}}: Nginx has stop !!!"

      description: "检测Nginx运行状态异常"

猜你喜欢

转载自blog.csdn.net/Wemesun/article/details/126455053