[Linux] Adding monitoring targets to Prometheus in batches

Reposted from: https://blog.csdn.net/qq_39595769/article/details/119248666

How to automate the monitoring of hundreds of servers
The manual way:
1. Install node_exporter on each of the 100 servers by hand.
2. Add each of the 100 machines to the Prometheus configuration by hand.

The automated way:
1. Use Ansible to deploy node_exporter in batches.
2. Use Consul for service discovery.
3. Register each node_exporter's IP and port in Consul.
4. Prometheus obtains all IPs and ports from Consul and adds them to monitoring automatically.

These hundreds of servers include:

Web servers, DB servers, load-balancing servers, and message queue servers.

In day-to-day operations they are also managed by group:

"id": "web1","name": "webserver group","address": "xxxx"
"id": "web2","name": "webserver group","address": "xxxx"
"id": "web3","name": "webserver group","address": "xxxx"

"id": "db1","name": "dbserver group","address": "xxxx"
"id": "db2","name": "dbserver group","address": "xxxx"
"id": "db3","name": "dbserver group","address": "xxxx"


Install Ansible on the Prometheus server to automate the monitoring of hundreds of servers

Install the EPEL repository

yum install epel-release -y

Install Ansible

yum install ansible -y

Remove the node_exporter folder from the servers where the exporter was previously installed.

After the deletion, all endpoints on the Targets page in Prometheus show as down.

Delete the existing target configuration from the Prometheus configuration file, keeping only the Consul configuration.
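
The screenshots of the remaining configuration did not survive, so what follows is only a rough sketch of what a Consul-only scrape configuration could look like. The shell function prints a scrape_configs fragment with one consul_sd_configs job per Consul service name. The function name print_consul_scrape_config and the choice of one job per group are assumptions, and the Consul address is the one used in consul-register.sh further down.

```shell
#!/bin/bash
# Sketch: print a Prometheus scrape_configs fragment that discovers targets
# from Consul. The function name and the job naming are illustrative assumptions.
print_consul_scrape_config() {
  local consul_addr=$1
  shift
  echo "scrape_configs:"
  local service
  for service in "$@"; do
    cat <<EOF
  - job_name: '$service'
    consul_sd_configs:
      - server: '$consul_addr'
        services: ['$service']
EOF
  done
}

# Example: one job per inventory group.
print_consul_scrape_config 192.168.220.103:8500 webservers dbservers
```

The output can be compared against the scrape_configs section of the Prometheus configuration file. Using the group name as both the job name and the Consul service name keeps the two in step.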

Use Ansible and a playbook to complete the task

With the first four files below in place, you can write the playbook:

consul-register.sh
hosts
node_exporter-1.2.0.linux-amd64.tar.gz
node_exporter.service
playbook.yaml

Contents of each file

consul-register.sh

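#!/bin/bash
# Usage: consul-register.sh <service_name> <instance_id> <ip> <port>
# Registers one node_exporter instance with the Consul agent, with an HTTP health check every 5s.
service_name=$1
instance_id=$2
ip=$3
port=$4
curl -X PUT -d '{"id": "'"$instance_id"'","name": "'"$service_name"'","address": "'"$ip"'","port": '"$port"',"tags": ["'"$service_name"'"],"checks": [{"http": "http://'"$ip"':'"$port"'","interval": "5s"}]}' http://192.168.220.103:8500/v1/agent/service/register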
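
The nested quoting in the curl line is easy to get wrong. As a sketch, the same payload can be built in a separate step so that it can be printed and checked before it is sent. The helper name build_payload is hypothetical; the field names and values mirror the script above.

```shell
#!/bin/bash
# Sketch: build the same registration payload as consul-register.sh, but as a
# separate step so it can be inspected first. build_payload is a hypothetical name.
build_payload() {
  local service_name=$1 instance_id=$2 ip=$3 port=$4
  # %d makes printf report an error if the port is not a number.
  printf '{"id": "%s", "name": "%s", "address": "%s", "port": %d, "tags": ["%s"], "checks": [{"http": "http://%s:%d", "interval": "5s"}]}\n' \
    "$instance_id" "$service_name" "$ip" "$port" "$service_name" "$ip" "$port"
}

# Print the payload for the first web server from the hosts file.
build_payload webservers web1 192.168.220.102 9100

# To send it (not executed here):
# build_payload webservers web1 192.168.220.102 9100 |
#   curl -X PUT -d @- http://192.168.220.103:8500/v1/agent/service/register
```

Printing the payload first makes a malformed request visible before it reaches Consul.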

hosts

[webservers]
192.168.220.102 name=web1

[dbservers]
192.168.220.103 name=db1
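
Each host line supplies three of the four arguments that the playbook passes to consul-register.sh: the group becomes the service name, name becomes the instance ID, and the address becomes the IP. The sketch below lists those tuples for a simple inventory like the one above. The helper list_registrations is hypothetical, and it only understands this plain "address name=value" layout, not the full Ansible inventory syntax.

```shell
#!/bin/bash
# Sketch: list "group name ip port" for each host in a simple INI inventory,
# i.e. the four arguments the playbook passes to consul-register.sh.
# list_registrations is a hypothetical helper; 9100 is the default port.
list_registrations() {
  local inventory=$1 port=${2:-9100}
  awk -v port="$port" '
    /^\[.*\]$/ {                      # group header such as [webservers]
      group = substr($0, 2, length($0) - 2)
      next
    }
    NF >= 2 && $2 ~ /^name=/ {        # host line such as "192.168.220.102 name=web1"
      sub(/^name=/, "", $2)
      print group, $2, $1, port
    }
  ' "$inventory"
}
```

Running list_registrations hosts before the playbook is a quick way to see what would be registered in Consul.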

node_exporter.service

[Unit]
Description=node_exporter

[Service]
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target

playbook.yaml

- hosts: webservers
  gather_facts: no
  vars:
    port: 9100
  tasks:
  - name: Push the binary archive
    unarchive: src=node_exporter-1.2.0.linux-amd64.tar.gz dest=/usr/local
  - name: Rename the directory
    shell: |
         cd /usr/local
         if [ ! -d node_exporter ];then
             mv node_exporter-1.2.0.linux-amd64 node_exporter
         fi
  #- name: Push the configuration file
  #  copy: src=config.yml dest=/usr/local/node_exporter
  - name: Copy the systemd unit file
    copy: src=node_exporter.service dest=/usr/lib/systemd/system
  - name: Start the service
    systemd: name=node_exporter state=started enabled=yes daemon_reload=yes
  - name: Push the registration script
    copy: src=consul-register.sh dest=/usr/local/bin/
  - name: Register the current node
    # service name, instance ID, IP, port
    shell: /bin/bash /usr/local/bin/consul-register.sh {{ group_names[0] }} {{ name }} {{ inventory_hostname }} {{ port }}

With everything ready, use Ansible to deploy the exporter to the other servers:

ansible-playbook -i hosts playbook.yaml -uroot -k 

The run failed because a username and password were required.

Once they were supplied, the run succeeded.

The run prints a warning caused by the variable name: port is a name that Ansible reserves, and renaming the variable to exporter_port removes the warning.

Change port to exporter_port in playbook.yaml, both under vars and in the registration task.

After the change, the warning no longer appears.
With the webservers group monitored, move on to the dbservers group.

Modify the playbook.yaml file so that it targets the dbservers group.

The first run failed because the host's SSH fingerprint had to be verified on first connection; running the playbook again succeeded.
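
One possible way to avoid the first-connection fingerprint failure is to turn off Ansible's host key checking for the session. This is a sketch, not what the original author did (they simply ran the playbook a second time), and it assumes the hosts sit on a trusted network, since it weakens SSH host verification. ANSIBLE_HOST_KEY_CHECKING is a standard Ansible environment variable; the wrapper name run_playbook is hypothetical.

```shell
#!/bin/bash
# Sketch: skip the interactive fingerprint prompt on first connection.
# Assumption: the hosts are on a trusted network; this weakens SSH host verification.
export ANSIBLE_HOST_KEY_CHECKING=False

# Hypothetical wrapper around the command used above; extra arguments pass through.
run_playbook() {
  ansible-playbook -i hosts playbook.yaml -uroot -k "$@"
}
```

Calling run_playbook afterwards runs the same command as before with the prompt suppressed.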
The target's status is Down, so check the reason.

The service status is started, but the web page shows an error.
The cause turned out to be that the configuration file had not been copied.
Kill the process and start it again; the affected files also need to be modified again.

Then run the whole procedure again.

Grafana also shows the corresponding group.

To add machines in the future, just add them to the hosts file and run the playbook again.
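
As a sketch of that last step, the helper below inserts a new host line directly under an existing group header in the hosts file. The name add_host and the example address 192.168.220.104 are hypothetical. If the group header is not found, the file content stays the same.

```shell
#!/bin/bash
# Sketch: add a host entry directly under an existing group header.
# add_host is a hypothetical helper; nothing is added if the group is missing.
add_host() {
  local inventory=$1 group=$2 ip=$3 name=$4
  awk -v header="[$group]" -v entry="$ip name=$name" '
    { print }
    $0 == header { print entry }   # insert the new host right after the header
  ' "$inventory" > "$inventory.tmp" && mv "$inventory.tmp" "$inventory"
}

# Example (hypothetical address), followed by the same playbook run as before:
# add_host hosts webservers 192.168.220.104 web2
# ansible-playbook -i hosts playbook.yaml -uroot -k
```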

Origin blog.csdn.net/imliuqun123/article/details/129416433