Reposted from: https://blog.csdn.net/qq_39595769/article/details/119248666
How to automate the monitoring of hundreds of servers
The old way:
1. Install node_exporter on these 100 servers.
2. Add the configuration for each of these 100 machines to the Prometheus configuration file by hand.
Automated operation and maintenance:
1. Ansible deploys node_exporter in batches
2. Consul-based service discovery
3. Register each node_exporter's IP and port in Consul.
4. Prometheus pulls all of the IPs and ports from Consul and adds them to monitoring automatically.
Among these hundreds of servers are web servers, DB servers, load-balancing servers, and message-queue servers.
In day-to-day operations they are also managed by group, so the Consul registrations look like this:
"id": "web1","name": "webserver组","address": "xxxx"
"id": "web2","name": "webserver组","address": "xxxx"
"id": "web3","name": "webserver组","address": "xxxx"
"id": "db1","name": "dbserver组","address": "xxxx"
"id": "db2","name": "dbserver组","address": "xxxx"
"id": "db3","name": "dbserver组","address": "xxxx"
Install Ansible on the Prometheus server to automate monitoring of the hundreds of servers.
Install the EPEL repository:
yum install epel-release -y
Install Ansible
yum install ansible -y
Clear out the node_exporter directory on the servers where the exporter was installed by hand.
After deleting it, the corresponding Endpoints under Targets in Prometheus all show as down.
Delete the static configuration from the Prometheus configuration file and keep only the Consul service-discovery configuration, sketched below.
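For reference, a minimal sketch of what the remaining Consul-based part of prometheus.yml might look like; the job name and the relabeling rule are illustrative, and the Consul address is the one used by the registration script below:
scrape_configs:
  - job_name: 'consul-nodes'
    consul_sd_configs:
      - server: '192.168.220.103:8500'
        services: []                    # empty list = discover every registered service
    relabel_configs:
      - source_labels: [__meta_consul_service]
        regex: consul                   # skip Consul's own built-in service
        action: drop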
Use Ansible plus a playbook to complete the task.
Five files are involved: four supporting files plus the playbook itself.
consul-register.sh
hosts
node_exporter-1.2.0.linux-amd64.tar.gz
node_exporter.service
playbook.yaml
Contents of each file
consul-register.sh
#!/bin/bash
# Register a node_exporter instance with Consul through its HTTP API.
service_name=$1   # Consul service name (here, the Ansible group name)
instance_id=$2    # unique instance ID, e.g. web1
ip=$3             # address the exporter listens on
port=$4           # port the exporter listens on
curl -X PUT -d '{"id": "'"$instance_id"'","name": "'"$service_name"'","address": "'"$ip"'","port": '"$port"',"tags": ["'"$service_name"'"],"checks": [{"http": "http://'"$ip"':'"$port"'","interval": "5s"}]}' http://192.168.220.103:8500/v1/agent/service/register
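For reference, a manual invocation of the script would look like the following (the group, instance name, and address match the web node from the hosts file; the Consul agent address is hard-coded in the script):
/bin/bash /usr/local/bin/consul-register.sh webservers web1 192.168.220.102 9100
# verify the registration through the Consul HTTP API
curl -s http://192.168.220.103:8500/v1/agent/services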
hosts
[webservers]
192.168.220.102 name=web1
[dbservers]
192.168.220.103 name=db1
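Before running the playbook, a quick connectivity check with Ansible's ping module confirms the inventory is reachable (a sketch; -u root -k prompts for the SSH password):
ansible -i hosts all -m ping -u root -k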
node_exporter.service
[Unit]
Description=node_exporter
[Service]
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
playbook.yaml
- hosts: webservers
  gather_facts: no
  vars:
    port: 9100
  tasks:
    - name: Push the binary package
      unarchive: src=node_exporter-1.2.0.linux-amd64.tar.gz dest=/usr/local
    - name: Rename the directory
      shell: |
        cd /usr/local
        if [ ! -d node_exporter ];then
            mv node_exporter-1.2.0.linux-amd64 node_exporter
        fi
    #- name: Push the config file
    #  copy: src=config.yml dest=/usr/local/node_exporter
    - name: Copy the systemd unit file
      copy: src=node_exporter.service dest=/usr/lib/systemd/system
    - name: Start the service
      systemd: name=node_exporter state=started enabled=yes daemon_reload=yes
    - name: Push the registration script
      copy: src=consul-register.sh dest=/usr/local/bin/
    - name: Register the current node
      # arguments: service name, instance name, IP, port
      shell: /bin/bash /usr/local/bin/consul-register.sh {{ group_names[0] }} {{ name }} {{ inventory_hostname }} {{ port }}
Everything is ready, so use Ansible to deploy the exporter to the other servers:
ansible-playbook -i hosts playbook.yaml -uroot -k
The first attempt failed because a username and password were required for the SSH login.
Signs of success:
The warning appears because port is a reserved playbook keyword in Ansible, so using it as a variable name triggers a warning; renaming it to exporter_port makes the warning go away.
Modify port to exporter_port, both under vars and in the registration task.
Run the playbook again and there is no warning.
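At this point the exporter can be checked directly from the Prometheus host (assuming the web node 192.168.220.102 and the default port 9100):
curl -s http://192.168.220.102:9100/metrics | head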
With the webservers group monitored, the dbservers group can be brought in the same way.
Modify the playbook.yaml file as shown in the sketch below.
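The only change needed is the target group; everything else in the playbook stays the same (a sketch, using the renamed variable from the previous step):
- hosts: dbservers        # was: webservers
  gather_facts: no
  vars:
    exporter_port: 9100
  tasks:
    # ... identical to the webservers playbook ...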
It failed the first time because the SSH host-key fingerprint had to be confirmed on the first connection; running it again succeeded.
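To avoid the interactive fingerprint prompt on first connections, host key checking can be disabled in ansible.cfg (a common convenience setting; whether to use it is a security trade-off):
[defaults]
host_key_checking = False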
The execution status in Prometheus is Down, so check the reason.
The node_exporter service is started, but its web page returns an error.
The cause is that the configuration file was never copied: the task that pushes it is still commented out in the playbook.
Kill the process, fix the files involved, and do the whole deployment over again.
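Re-enabling the config-file task that was left commented out in the playbook would look like this (config.yml is the file referred to above but not shown in this post):
    - name: Push the config file
      copy: src=config.yml dest=/usr/local/node_exporter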
Grafana also shows a corresponding group for these machines.
When adding machines in the future, just modify the hosts file.
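For example, adding a second web node is just one more inventory line (the address and instance name here are hypothetical):
[webservers]
192.168.220.102 name=web1
192.168.220.104 name=web2

[dbservers]
192.168.220.103 name=db1
Run ansible-playbook -i hosts playbook.yaml -uroot -k again, and the new machine is deployed, registered in Consul, and picked up by Prometheus automatically.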