Build a high-availability web cluster with Nginx + Keepalived and implement monitoring and alerting

Prepare the servers

Plan the IP addresses and the cluster architecture, and disable SELinux and the firewall on every machine

# Disable SELinux
sed -i '/^SELINUX=/ s/enforcing/disabled/' /etc/selinux/config
# Disable the firewall
service firewalld stop
systemctl disable firewalld
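Note that the sed command only edits the config file, so it takes effect after a reboot; to turn SELinux off for the current session as well, and to double-check the firewall state, something like the following works (a quick check, assuming a CentOS 7 style system):

# apply immediately without a reboot
setenforce 0
# verify the result
getenforce                      # should print Permissive or Disabled
systemctl is-active firewalld   # should print inactive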

(Cluster architecture diagram)

web1 192.168.40.21 backend web
web2 192.168.40.22 backend web
web3 192.168.40.23 backend web
lb1 192.168.40.31 load balancer 1
lb2 192.168.40.32 load balancer 2
dns, prometheus 192.168.40.137 DNS server, monitoring server
nfs 192.168.40.138 NFS server

DNS server configuration

Install bind package

yum install bind* -y

Start the named process

[root@elk-node2 selinux]# service named start
Redirecting to /bin/systemctl start named.service
[root@elk-node2 selinux]# ps aux | grep named
named     44018  0.8  3.2 391060 60084 ?        Ssl  15:15   0:00 /usr/sbin/named -u named -c /etc/named.conf
root      44038  0.0  0.0 112824   980 pts/0    S+   15:15   0:00 grep --color=auto named

Modify the /etc/resolv.conf file and add a line that points the nameserver at this machine

nameserver 127.0.0.1

Test: names resolve successfully

[root@elk-node2 selinux]#  nslookup 
> www.qq.com
Server:         127.0.0.1
Address:        127.0.0.1#53

Non-authoritative answer:
www.qq.com      canonical name = ins-r23tsuuf.ias.tencent-cloud.net.
Name:   ins-r23tsuuf.ias.tencent-cloud.net
Address: 121.14.77.221
Name:   ins-r23tsuuf.ias.tencent-cloud.net
Address: 121.14.77.201
Name:   ins-r23tsuuf.ias.tencent-cloud.net
Address: 240e:97c:2f:3003::77
Name:   ins-r23tsuuf.ias.tencent-cloud.net
Address: 240e:97c:2f:3003::6a

To let other machines use this host as their DNS server, modify /etc/named.conf

listen-on port 53 { any; };
listen-on-v6 port 53 { any; };
allow-query     { any; };

Restart service

service named restart

Now the other machines can point at 192.168.40.137 for domain name resolution.
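To confirm this from another host, query 192.168.40.137 explicitly (a quick check, assuming bind-utils is installed on that host):

# run on any other machine, e.g. web1
dig @192.168.40.137 www.qq.com +short
# or
nslookup www.qq.com 192.168.40.137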

WEB server configuration

Configure static IP

Enter the /etc/sysconfig/network-scripts/ directory

Modify the ifcfg-ens33 file so that the machines can communicate with each other

web1 IP configuration

BOOTPROTO="none"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.40.21
PREFIX=24
GATEWAY=192.168.40.2
DNS1=114.114.114.114                       

web2 IP configuration

BOOTPROTO="none"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.40.22
PREFIX=24
GATEWAY=192.168.40.2
DNS1=114.114.114.114    

web3 IP configuration

BOOTPROTO="none"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.40.23
PREFIX=24
GATEWAY=192.168.40.2
DNS1=114.114.114.114    

Compile and install nginx

To compile and install nginx, see my blog post on installing, starting, and stopping Nginx.

After installation, you should be able to reach the default nginx page in a browser.
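For reference, a minimal compile-and-install sketch; the --prefix is an assumption based on the /usr/local/shengxia path used later in this article, and the version number is only an example, so adjust both to your environment (--with-stream is included because the load balancers use a stream block later):

yum install -y gcc make pcre-devel zlib-devel openssl-devel
curl -O http://nginx.org/download/nginx-1.24.0.tar.gz      # example version
tar xf nginx-1.24.0.tar.gz && cd nginx-1.24.0
./configure --prefix=/usr/local/shengxia --with-http_ssl_module --with-stream
make && make install
/usr/local/shengxia/sbin/nginx      # start nginx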

Load balancer configuration

Use nginx for load balancing

lb1

Configure static IP

BOOTPROTO="none"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.40.31
PREFIX=24
GATEWAY=192.168.40.2
DNS1=114.114.114.114

Modify nginx.conf in the installation directory and add the following

Layer 7 load balancing: the upstream block sits inside the http block, and forwarding is based on the HTTP protocol.

http {
    ……
    upstream lb1 {
        # real IP addresses of the backends, inside the http block
        ip_hash;   # use the ip_hash algorithm, or least_conn; for least connections
        # weight example: server 192.168.40.21 weight=5;
        server 192.168.40.21;
        server 192.168.40.22;
        server 192.168.40.23;
    }
    server {
        listen       80;
        ……
        location / {
            #root   html;   # commented out: this server only proxies, it does not serve files itself
            #index  index.html index.htm;
            proxy_pass http://lb1;   # proxy and forward
        }

Layer 4 load balancing: the stream block sits at the same level as the http block, and forwarding is based on IP + port.

stream {
    upstream lb1 {
        # backend servers go here, e.g. server 192.168.40.21:80;
    }
    server {
        listen 80;              # forward based on port 80
        proxy_pass lb1;
    }
    upstream dns_servers {
        least_conn;
        server 192.168.40.21:53;
        server 192.168.40.22:53;
        server 192.168.40.23:53;
    }
    server {
        listen 53 udp;          # forward based on port 53
        proxy_pass dns_servers;
    }
}

Reload nginx

nginx -s reload

Nginx uses the round-robin algorithm by default; you can check the effect by refreshing repeatedly.
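A quick way to watch the rotation, assuming each backend serves a page that identifies itself:

for i in $(seq 1 6); do curl -s http://192.168.40.31/; done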


lb2

Configure static IP

BOOTPROTO="none"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.40.32
PREFIX=24
GATEWAY=192.168.40.2
DNS1=114.114.114.114

Modify nginx.conf in the installation directory and add the following

Layer 7 load balancing: the upstream block sits inside the http block, and forwarding is based on the HTTP protocol.

http {
    ……
    upstream lb2 {
        # real IP addresses of the backends, inside the http block
        ip_hash;   # use the ip_hash algorithm, or least_conn; for least connections
        # weight example: server 192.168.40.21 weight=5;
        server 192.168.40.21;
        server 192.168.40.22;
        server 192.168.40.23;
    }
    server {
        listen       80;
        ……
        location / {
            #root   html;   # commented out: this server only proxies, it does not serve files itself
            #index  index.html index.htm;
            proxy_pass http://lb2;   # proxy and forward
        }

Reload nginx

nginx -s reload

Problem: the backend servers only see the load balancer's IP, not the real client IP. How do we fix that?

Reference article

       Use the variable $remote_addr to get the client's IP address, assign it to the X-Real-IP header field, and then reload nginx.
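On the load balancer this is one extra directive inside the location block (a minimal sketch; the header name X-Real-IP is a convention, not mandatory):

location / {
    proxy_set_header X-Real-IP $remote_addr;                       # pass the real client IP
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;   # optionally keep the whole chain
    proxy_pass http://lb1;
}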

nginx -s reload


On all backend servers, modify the log format to record the value of this field
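A sketch of the backend nginx change; the format string below is an illustration rather than the article's exact one ($http_x_real_ip holds the X-Real-IP header set by the load balancer):

http {
    log_format main '$remote_addr - $http_x_real_ip [$time_local] "$request" $status';
    access_log logs/access.log main;
}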

Check whether the real IP of the client is obtained


Question: what is the difference between Layer 4 and Layer 7 load balancing?

Reference article

  1. Layer 4 load balancing: load balancing performed at the transport layer. The load balancer forwards a request to a backend based on the source IP address, destination IP address, source port, and destination port. It only looks at the basic properties of the network connection and knows nothing about the request's content or protocol.

    Its advantages are speed and efficiency, and it is well suited to handling very large numbers of network connections (TCP and UDP). However, it has only a limited view of the request and cannot apply forwarding policies tailored to the needs of a specific application.

  2. Layer 7 load balancing: load balancing performed at the application layer. The load balancer understands application-layer protocols such as HTTP and HTTPS, so it can inspect the request's content and characteristics and forward intelligently based on the URL, request headers, session information, and so on.

    Layer 7 load balancing supports more flexible and customized forwarding policies. For example, requests can be routed to different backend servers by domain name, URL path, or specific request headers. This is useful for web applications and API services with specific routing rules and requirements.

In short, Layer 4 load balancing forwards based on transport-layer connection attributes and suits high-concurrency, large-scale connection scenarios, while Layer 7 load balancing understands the request at the application layer and suits intelligent forwarding based on request content and characteristics. In practice, choose the approach that matches your needs and application type, or combine the two for better performance and scalability.

High availability configuration

Use keepalived to achieve high availability

       Install keepalived on both load balancers; they communicate with each other via the VRRP protocol. See the reference article for an introduction to VRRP.

yum install keepalived

Single VIP configuration

       Go to the configuration directory /etc/keepalived/, edit the configuration file keepalived.conf, and define one VRRP instance.

lb1 configuration

vrrp_instance VI_1 {        # define one instance
    state MASTER            # role: master
    interface ens33         # network interface
    virtual_router_id 150   # router id
    priority 100            # priority
    advert_int 1            # advertisement interval: 1s
    authentication {        # authentication info
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {     # virtual IP exposed to clients
        192.168.40.51
    }
}

lb2 configuration

vrrp_instance VI_1 {
    state BACKUP            # role: backup
    interface ens33
    virtual_router_id 150
    priority 50             # lower priority than the master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.51
    }
}

Start keepalived; the VIP appears on the load balancer with the higher priority

service keepalived start
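You can confirm which node currently holds the VIP with ip addr (it shows up as a secondary address on ens33):

ip addr show ens33 | grep 192.168.40.51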

Double VIP configuration

       Go to the configuration directory /etc/keepalived/, edit the configuration file keepalived.conf, and define two VRRP instances so that both load balancers serve traffic and resources are better utilized.

lb1 configuration

vrrp_instance VI_1 {        # first instance
    state MASTER            # role: master
    interface ens33         # network interface
    virtual_router_id 150   # router id
    priority 100            # priority
    advert_int 1            # advertisement interval
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.51       # IP exposed to clients
    }
}
vrrp_instance VI_2 {        # second instance
    state BACKUP            # role: backup
    interface ens33         # network interface
    virtual_router_id 160   # router id
    priority 50             # priority
    advert_int 1            # advertisement interval
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.52       # IP exposed to clients
    }
}

lb2 configuration

vrrp_instance VI_1 {
    state BACKUP            # role: backup
    interface ens33
    virtual_router_id 150
    priority 50
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.51
    }
}
vrrp_instance VI_2 {
    state MASTER            # role: master
    interface ens33
    virtual_router_id 160
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.52
    }
}

Restart keepalived and you can see VIP on both load balancers.

service keepalived restart

       Write a script check_nginx.sh to monitor whether nginx is running. If nginx is down there is no point in keeping keepalived up: it wastes resources, and the master/backup roles need to be switched promptly.

#!/bin/bash
# exit 0 while nginx is running; otherwise stop keepalived and exit 1
if [[ $(netstat -anplut | grep nginx | wc -l) -ge 1 ]];then
        exit 0
else
        # stop keepalived (must run before exit, otherwise it is unreachable)
        service keepalived stop
        exit 1
fi

Grant execute permission

chmod +x check_nginx.sh 
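It is worth running the script by hand once to confirm the exit codes before keepalived calls it:

/etc/keepalived/check_nginx.sh; echo $?    # expect 0 while nginx is running, 1 after stopping it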

       At first the script did not execute. Checking the /var/log/messages log revealed the problem: there was no space between the script name and the brace in the track_script block...


lb1 configuration after adding the script

! Configuration File for keepalived

global_defs {
   notification_email {
     [email protected]
     [email protected]
     [email protected]
   }
   notification_email_from [email protected]
   smtp_server 192.168.200.1
   smtp_connect_timeout 30
   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
  #vrrp_strict
   vrrp_garp_interval 0
   vrrp_gna_interval 0
}

vrrp_script chk_nginx {
    script "/etc/keepalived/check_nginx.sh"   # location of the external script; use an absolute path
    interval 1
    weight -60    # after the weight is subtracted, the priority must be lower than the backup's
}

vrrp_instance VI_1 {        # first instance
    state MASTER
    interface ens33         # network interface
    virtual_router_id 150   # router id
    priority 100            # priority
    advert_int 1            # advertisement interval
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.51
    }
    track_script {          # a space is required between the name and the brace
        chk_nginx           # call the check script
    }
}

vrrp_instance VI_2 {        # second instance
    state BACKUP
    interface ens33         # network interface
    virtual_router_id 170   # router id
    priority 50             # priority
    advert_int 1            # advertisement interval
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.40.52
    }
}

The configuration of lb2 is the same as lb1; just attach the track_script block to the instance where lb2 is the MASTER.

       Testing nginx: with dual VIPs, when nginx is stopped keepalived is also stopped and the VIPs drift to the other load balancer.

       See the reference article for the usage of notify (it can also achieve the keepalived shutdown effect).

Usage of notify:
  notify_master: script to run when this node becomes master (usually used to start a service such as nginx or haproxy)
  notify_backup: script to run when this node becomes backup (usually used to stop a service such as nginx or haproxy)
  notify_fault: script to run when this node enters the fault state
  Example: start haproxy when becoming master, stop it when becoming backup
  notify_master "/etc/keepalived/start_haproxy.sh start"
  notify_backup "/etc/keepalived/start_haproxy.sh stop"

Question: What is the split brain phenomenon and what are the possible causes?

       Split brain happens when communication between the master and backup servers fails, so both nodes believe they are the master at the same time and compete for the VIP. Possible causes include:

  1. Network partition: in a keepalived cluster, if the network is partitioned and communication between the master and backup nodes is interrupted, split brain can occur.
  2. Inconsistent virtual router IDs: the virtual router ID uniquely identifies a master/backup pair. If it is set inconsistently, the nodes conflict with each other and may announce themselves as master at the same time, causing split brain.
  3. Mismatched authentication passwords: when the auth passwords differ, communication between the nodes is blocked, state synchronization and failover break down, and split brain can follow.
  4. Unsynchronized node state: errors or delays while synchronizing state between the master and backup nodes leave their states inconsistent, which can trigger split brain.
  5. Lost heartbeats: keepalived uses heartbeat (VRRP advertisement) messages to detect node state; if they are lost because of network delay or other reasons, a node's state can be misjudged and split brain can occur.

Question: what are keepalived's three processes?

  1. The keepalived main process: loads and parses the keepalived configuration file, creates and manages the VRRP instances, and monitors their state. It also handles communication with the other keepalived processes.
  2. The keepalived VRRP process: implements the virtual router redundancy protocol for the configured VRRP instances. It periodically sends VRRP advertisements, listens for advertisements from other nodes, and fails over according to the configured priorities.
  3. The keepalived check-script process: runs the user-defined health-check scripts. Based on a script's return value it can change the state of a VRRP instance or trigger a failover.

NFS server configuration

       Use NFS so the backend servers read their data from the NFS server: mount the NFS export on the web servers to keep the data consistent.

Configure a static IP

BOOTPROTO="none"
IPADDR=192.168.40.138
GATEWAY=192.168.40.2
DNS2=114.114.114.114
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"       

Install the packages

yum -y install rpcbind nfs-utils


Start the services: start the rpc service first, then the nfs service

# start the rpc service
[root@nfs ~]# service rpcbind start
Redirecting to /bin/systemctl start rpcbind.service
[root@nfs ~]# systemctl enable rpcbind
# start the nfs service
[root@nfs ~]# service nfs-server start
Redirecting to /bin/systemctl start nfs-server.service
[root@nfs ~]# systemctl enable nfs-server

Create the shared directory

Create /data/share/ and put your own index.html in it to check the effect later

mkdir -p /data/share/

Edit the configuration file: vim /etc/exports

/data/share/ 192.168.40.0/24(rw,no_root_squash,all_squash,sync)

Where:

  • /data/share/: the shared directory
  • 192.168.40.0/24: accept requests from IP addresses in the 192.168.40.0/24 range
  • (rw): allow read and write access to the directory
  • no_root_squash: do not restrict the root user; a client accessing as root is also treated as root on the server
  • all_squash: map all users to the anonymous user; whichever user the client connects as, the access is treated as anonymous on the server
  • sync: synchronous writes; data is flushed to disk before a write completes, which guarantees consistency and reliability but can hurt performance

Reload nfs so the configuration takes effect

systemctl reload nfs
exportfs -rv

Mounting on the web servers

       The three web servers only need the client-side packages (rpcbind, plus nfs-utils for the showmount and mount tools); they do not need to run an NFS server of their own.

yum install rpcbind nfs-utils -y

View the NFS server's exported directories from the web servers

[root@web1 ~]# showmount -e 192.168.40.138
Export list for 192.168.40.138:
/data/share 192.168.40.0/24
[root@web2 ~]# showmount -e 192.168.40.138
Export list for 192.168.40.138:
/data/share 192.168.40.0/24
[root@web3 ~]# showmount -e 192.168.40.138
Export list for 192.168.40.138:
/data/share 192.168.40.0/24

Mount the export onto the nginx web page directory

[root@web1 ~]# mount 192.168.40.138:/data/share /usr/local/shengxia/html
[root@web2 ~]# mount 192.168.40.138:/data/share /usr/local/shengxia/html
[root@web3 ~]# mount 192.168.40.138:/data/share /usr/local/shengxia/html

Configure the NFS filesystem to mount automatically at boot

vim /etc/rc.local
# append this line to the end of /etc/rc.local
mount -t nfs 192.168.40.138:/data/share /usr/local/shengxia/html

Also give /etc/rc.d/rc.local execute permission

chmod +x /etc/rc.d/rc.local
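An alternative to rc.local is an /etc/fstab entry (a sketch; _netdev delays the mount until the network is up):

# /etc/fstab
192.168.40.138:/data/share  /usr/local/shengxia/html  nfs  defaults,_netdev  0 0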

If you see this result, the mount succeeded.

Monitoring server configuration

       Download prometheus and the exporters for monitoring; for installation, see my blog post
Introduction, Installation and Use of Prometheus, Grafana, and cAdvisor

Install node_exporter

       After prometheus is installed, install node_exporter on every server to monitor the servers' state. Download it first.

       Every server except this host (192.168.40.137) needs it. One server is shown as an example; the steps are the same on the others.

Extract the archive

[root@web1 exporter]# ls
node_exporter-1.5.0.linux-amd64.tar.gz
[root@web1 exporter]# tar xf node_exporter-1.5.0.linux-amd64.tar.gz 
[root@web1 exporter]# ls
node_exporter-1.5.0.linux-amd64  node_exporter-1.5.0.linux-amd64.tar.gz

Create a directory

[root@web1 exporter]# mkdir -p /node_exporter

Copy the files under node_exporter to the target directory

[root@web1 exporter]# cp node_exporter-1.5.0.linux-amd64/* /node_exporter

Modify the PATH environment variable in /root/.bashrc: append this line to the end of the file, then reload it

PATH=/node_exporter/:$PATH
source /root/.bashrc

Start it in the background

[root@web1 exporter]# nohup node_exporter --web.listen-address 192.168.40.21:8899 &
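A quick way to confirm the exporter is answering (run from any host that can reach it):

curl -s http://192.168.40.21:8899/metrics | head -n 5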

If you see this page, it is working


Edit prometheus.yml

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.40.137:9090"]
  - job_name: "nfs"
    static_configs:
      - targets: ["192.168.40.138:8899"]
  - job_name: "lb1"
    static_configs:
      - targets: ["192.168.40.31:8899"]
  - job_name: "lb2"
    static_configs:
      - targets: ["192.168.40.32:8899"]
  - job_name: "web1"
    static_configs:
      - targets: ["192.168.40.21:8899"]
  - job_name: "web2"
    static_configs:
      - targets: ["192.168.40.22:8899"]
  - job_name: "web3"
    static_configs:
      - targets: ["192.168.40.23:8899"]

Restart prometheus

[root@dns-prom prometheus]# service prometheus restart

If you see this page, monitoring is working


Install alertmanager and the DingTalk plugin

Download

[root@dns-prom prometheus]# wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
[root@dns-prom prometheus]# wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

Extract

[root@dns-prom prometheus]# tar xf alertmanager-0.25.0.linux-amd64.tar.gz 
[root@dns-prom prometheus]# mv alertmanager-0.25.0.linux-amd64 alertmanager
[root@dns-prom prometheus]# tar xf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz 
[root@dns-prom prometheus]# mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 prometheus-webhook-dingtalk

Get the robot's webhook


Get the public IP that must be allowed to access the webhook; curl ifconfig.me returns it

[root@dns-prom alertmanager]# curl ifconfig.me
222.244.215.17


Modify the DingTalk alert template

# Location: /lianxi/prometheus/prometheus-webhook-dingtalk/contrib/templates/legacy/template.tmpl
[root@dns-prom legacy]# cat template.tmpl
{{ define "ding.link.title" }}{{ template "legacy.title" . }}{{ end }}
{{ define "ding.link.content" }}
{{ if gt (len .Alerts.Firing) 0 -}}
告警列表:
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}
{{ if gt (len .Alerts.Resolved) 0 -}}
恢复列表:
{{ template "__text_resolve_list" .Alerts.Resolved }}
{{- end }}
{{- end }}

Modify config.yml: add the robot's webhook token and point at the template file

[root@dns-prom prometheus-webhook-dingtalk]# cat config.yml 
templates:
  - /lianxi/prometheus/prometheus-webhook-dingtalk/contrib/templates/legacy/template.tmpl # 模板路径

targets:
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=<your own token>

Register prometheus-webhook-dingtalk as a service

[root@dns-prom system]# pwd
/usr/lib/systemd/system
[root@dns-prom system]# cat webhook-dingtalk.service
[Unit]
Description=prometheus-webhook-dingtalk
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk
After=network.target

[Service]
ExecStart=/lianxi/prometheus/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk  --config.file=/lianxi/prometheus/prometheus-webhook-dingtalk/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload systemd

[root@dns-prom system]# systemctl daemon-reload

Start the service

[root@dns-prom system]# service webhook-dingtalk start
Redirecting to /bin/systemctl start webhook-dingtalk.service

Configure alertmanager

Modify the alertmanager.yml file

global:
  resolve_timeout: 5m

route: # alert routing: defines how alerts are grouped and dispatched
  receiver: webhook
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
  group_by: [alertname]
  routes:
  - receiver: webhook
    group_wait: 10s

receivers: # alert receivers: define where alerts are delivered
- name: webhook 
  webhook_configs:
   ### note: the dingtalk config above uses the target name webhook2, so the URL must match it
  - url: http://192.168.40.137:8060/dingtalk/webhook2/send  # alert webhook URL
    send_resolved: true # whether to send resolved alerts; if true, a notification is sent when an alert is resolved
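The alertmanager tarball also ships amtool, which can validate this file before you restart the service (the paths follow the layout used above):

/lianxi/prometheus/alertmanager/amtool check-config /lianxi/prometheus/alertmanager/alertmanager.yml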

Register alertmanager as a service

[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target

[Service]
ExecStart=/lianxi/prometheus/alertmanager/alertmanager --config.file=/lianxi/prometheus/alertmanager/alertmanager.yml 
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload systemd

[root@dns-prom system]# systemctl daemon-reload

Check that it is running


Set up the alert rules

Create the alert rules in a rules.yml file in the prometheus directory.

[root@dns-prom prometheus]# pwd
/lianxi/prometheus/prometheus
[root@dns-prom prometheus]# cat rules.yml 
groups:
  - name: host_monitoring
    rules:
      - alert: 内存报警
        expr: netdata_system_ram_MiB_average{chart="system.ram",dimension="free",family="ram"} < 800
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: 内存报警
          Server: '{{$labels.instance}}'
          explain: "内存使用量超过90%,目前剩余量为:{{ $value }}M"
      - alert: CPU报警
        expr: netdata_system_cpu_percentage_average{chart="system.cpu",dimension="idle",family="cpu"} < 20
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: CPU报警
          Server: '{{$labels.instance}}'
          explain: "CPU使用量超过80%,目前剩余量为:{{ $value }}"
      - alert: 磁盘报警
        expr: netdata_disk_space_GiB_average{chart="disk_space._",dimension="avail",family="/"} < 4
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: 磁盘报警
          Server: '{{$labels.instance}}'
          explain: "磁盘使用量超过90%,目前剩余量为:{{ $value }}G"
      - alert: 服务告警
        expr: up == 0
        for: 2m
        labels:
          team: node
        annotations:
          Alert_type: 服务报警
          Server: '{{$labels.instance}}'
          explain: "netdata服务已关闭"

Modify the prometheus.yml file to associate it with alertmanager

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.40.137:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/lianxi/prometheus/prometheus/rules.yml" # 告警模板路径
  # - "first_rules.yml"
  # - "second_rules.yml"

Restart the prometheus service

[root@dns-prom prometheus]# service prometheus restart

You can see the monitoring data


Simulate a server outage: shut down web1, and an alert is raised


DingTalk received an alert


Install grafana

Download the Grafana software package from the Grafana official website and install it according to the official documentation

[root@dns-prom grafana]# yum install -y https://dl.grafana.com/enterprise/release/grafana-enterprise-9.5.1-1.x86_64.rpm

Start grafana

[root@dns-prom grafana]# service grafana-server restart
Restarting grafana-server (via systemctl):                 [  OK  ]

For the detailed steps, see this post: Introduction, Installation and Use of Prometheus, Grafana, and cAdvisor

Just pick a suitable dashboard template and the data will be displayed.


Conduct stress testing

Install ab (Apache Bench) to simulate requests

yum install httpd-tools -y    # the httpd-tools package provides the ab command

Keep sending simulated requests to see how much concurrency the cluster can handle.
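A typical run against the VIP looks like this (the request count and concurrency are just example numbers; ramp them up gradually):

ab -n 10000 -c 500 http://192.168.40.51/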
