Recently I rebuilt the company's monitoring system. Among the monitoring systems in use today, Zabbix is still the most widely used for basic resource monitoring, and it has had new releases. Nightingale v5 was also released recently; I had researched it before, and it now supports Prometheus as a data source.
In the end I chose Prometheus: time was tight, and it suited the company's current situation better. Monitoring of basic resources, ports, and the application layer is therefore done with Prometheus. For alerting, the initial plan is to send alerts through an enterprise WeChat robot; email alerts can be added later.
In addition to Prometheus, there is APM monitoring. After comparing SkyWalking, Pinpoint, and CAT, I plan to use Pinpoint. This will be covered in detail in a later article.
Beyond that, ELK collects logs and is used to monitor nginx request and response data; abnormal errors reported in back-end logs are monitored as well.
My earlier research and hands-on notes on Nightingale v5:
About prometheus architecture
The initial idea was to build a federated cluster and use VictoriaMetrics, but after estimating the expected usage, it turned out not to be necessary.
The initial design looked like this:
In practice I did not build this. Based on the actual scenario, I set up an HA deployment instead, as shown below:
Prometheus will be deployed this way from now on. The collection agents and the alerting flow are shown in the following picture:
prometheus HA deployment
Involving components and versions
prometheus 2.36.0
nginx 1.6.2
alertmanager 0.24.0
grafana 8.5.3
Installation package acquisition path
prometheus:
https://github.com/prometheus/prometheus/releases/download/v2.36.1/prometheus-2.36.1.linux-amd64.tar.gz
alertmanager:
https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
grafana:
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.6-1.x86_64.rpm
nginx: compiled from source (no package URL listed)
prometheus deployment
Prometheus and Alertmanager are installed from tar packages, nginx is compiled from source, Grafana is installed from an rpm package, and Prometheus is managed with systemctl.
Deployment is actually very simple:
cd /data
tar -xvf prometheus-2.36.0.linux-amd64.tar.gz
mv prometheus-2.36.0.linux-amd64 prometheus
mkdir -p /data/prometheus/{log,data}
Then modify the configuration file. The current configuration is shown below; it mainly covers the scrape job configuration and the alerting configuration.
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.200.9:9093']
rule_files:
  - "rules/*_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.x.x:9090","192.168.x.1:9090"]
  - job_name: 'linux_base'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - config_exporter.json
  - job_name: blackbox_tcp
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - config_port.json
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.x.2:9115
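The three relabel rules in the blackbox_tcp job are easy to misread, so here is a small Python sketch that mimics what they do to one discovered target. It is only an illustration, not Prometheus code: the function names and the example addresses (10.0.0.5:3306, 127.0.0.1:9115) are made up for this sketch.

```python
from urllib.parse import urlencode


def apply_blackbox_relabel(target, exporter_addr):
    """Mimic the three relabel_configs rules for one discovered target."""
    labels = {"__address__": target}
    # Rule 1: copy the discovered address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: expose the probed endpoint as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at the blackbox exporter itself.
    labels["__address__"] = exporter_addr
    return labels


def scrape_url(labels, metrics_path="/probe", module="tcp_connect"):
    """Build the URL that ends up being requested for this target."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return "http://{}{}?{}".format(labels["__address__"], metrics_path, query)


# Hypothetical example target and exporter address.
labels = apply_blackbox_relabel("10.0.0.5:3306", "127.0.0.1:9115")
print(labels["instance"])
print(scrape_url(labels))
```

In other words, the scrape goes to the blackbox exporter, while the `instance` label still names the endpoint being probed.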
My node_exporter and blackbox exporter targets are kept in separate files outside the main configuration, and Prometheus picks up changes to them without a restart:
config_exporter.json
config_port.json
The file format looks like this:
[
  {
    "targets": [ "192.168.x.x:9100"],
    "labels": {
      "env": "yw"
    }
  }
]
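If the target list grows, maintaining these files by hand gets tedious. The following is a minimal sketch, assuming you keep a host inventory somewhere else, of generating a file in the same format. The helper names and the sample host are hypothetical. The file is written to a temporary name and then renamed, so that a half-written file is never read.

```python
import json
import os
import tempfile


def build_target_groups(hosts_by_env, port=9100):
    """Return a file_sd list with one target group per `env` label value."""
    return [
        {
            "targets": ["{}:{}".format(host, port) for host in hosts],
            "labels": {"env": env},
        }
        for env, hosts in sorted(hosts_by_env.items())
    ]


def write_atomically(path, groups):
    """Write the JSON to a temp file in the same directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical inventory: one host in the "yw" environment.
groups = build_target_groups({"yw": ["10.0.0.1"]})
print(json.dumps(groups, indent=2))
```

Calling `write_atomically("config_exporter.json", groups)` in the Prometheus directory would then replace the target file in one step.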
Prometheus is managed by systemctl
cat /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=root
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/data/prometheus/prometheus \
--config.file=/data/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus/data \
--storage.tsdb.retention.time=30d \
--web.console.libraries=/data/prometheus/console_libraries \
--web.console.templates=/data/prometheus/consoles \
--web.listen-address=0.0.0.0:9090 \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always
[Install]
WantedBy=multi-user.target
After creating the unit file, reload systemd so that it picks up the change:
systemctl daemon-reload
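Since the unit file starts Prometheus with --web.enable-lifecycle, the main configuration can also be reloaded over HTTP by sending a POST to /-/reload, without restarting the service. Below is a minimal sketch using only the Python standard library. The base URL is an assumption (a local instance on port 9090) and the function names are made up for this example.

```python
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:9090"):
    """Prepare the POST that asks Prometheus to re-read its configuration."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://127.0.0.1:9090", timeout=10):
    """Send the reload request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for non-2xx answers, for
    example when the new configuration is rejected.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Run `reload_prometheus()` on the Prometheus host after editing prometheus.yml or a rule file; `systemctl reload prometheus` achieves the same thing through the ExecReload line in the unit.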
alertmanager deployment
tar -xvf alertmanager-0.24.0.linux-amd64.tar.gz
Modify the configuration file. Because alerts go out through a WeChat robot, I wrote a small web service in Python and configured it as the webhook receiver:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://192.168.x.x:5000'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
When an alert is sent, the message looks roughly like this:
[Resolved] Production environment: alert blackbox_network_stats has recovered
Alert level: critical
Alert type: blackbox_network_stats
Alert host: 192.168.x.x:9100
Alert system: ops
Alert details: This requires immediate action!
Alert status: resolved
Triggered at: 2022-06-16 11:05:48 +08:00
Ended at: 2022-06-16 16:06:48 +08:00
I won't go into the Python service itself here; leave me a message if you need it.
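For readers who want a starting point anyway, here is a minimal sketch of what such a relay could look like. This is not the service used in this setup. It assumes the standard Alertmanager webhook body (an `alerts` list whose items carry `status`, `labels`, `annotations`, `startsAt`, and `endsAt`) and the enterprise WeChat group robot endpoint with a text message. The key placeholder, the `description` annotation name, and all function names are assumptions made for this example.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder: substitute the key of your own group robot.
WECHAT_WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"


def format_alert(alert):
    """Render one Alertmanager alert as a plain-text message."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    resolved = alert.get("status") == "resolved"
    lines = [
        "[{}] {}".format("Resolved" if resolved else "Firing",
                         labels.get("alertname", "unknown")),
        "Alert level: {}".format(labels.get("severity", "unknown")),
        "Alert host: {}".format(labels.get("instance", "unknown")),
        # Assumes the alert rules set a `description` annotation.
        "Alert details: {}".format(annotations.get("description", "")),
        "Alert status: {}".format(alert.get("status", "unknown")),
        "Triggered at: {}".format(alert.get("startsAt", "")),
    ]
    if resolved:
        lines.append("Ended at: {}".format(alert.get("endsAt", "")))
    return "\n".join(lines)


def build_wechat_payload(alertmanager_body):
    """Turn an Alertmanager webhook body into a WeChat text message."""
    texts = [format_alert(a) for a in alertmanager_body.get("alerts", [])]
    return {"msgtype": "text", "text": {"content": "\n\n".join(texts)}}


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body posted by Alertmanager.
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        data = json.dumps(build_wechat_payload(body)).encode("utf-8")
        request = urllib.request.Request(
            WECHAT_WEBHOOK_URL, data=data,
            headers={"Content-Type": "application/json"},
        )
        # Forward to the robot, then acknowledge the webhook call.
        urllib.request.urlopen(request, timeout=10).close()
        self.send_response(200)
        self.end_headers()


def serve(port=5000):
    """Listen on the port used in the webhook_configs URL."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the relay on port 5000, matching the webhook URL in the Alertmanager configuration. A real service would also want error handling and retries around the outgoing request.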
grafana deployment
This one is very simple: install directly from the rpm package.
rpm -ivh grafana-8.5.3-1.x86_64.rpm
Grafana is mostly for dashboard display. I added node and blackbox dashboards here; there are plenty of mature examples online that can be used directly.
nginx deployment
nginx is mainly used to load-balance Prometheus and to reverse-proxy some of the components.
Just compile and install.
With this, a basic monitoring system is deployed, with alerts delivered through the enterprise WeChat robot. Next up are cluster deployments of Pinpoint and ELK.
Deployment is only the first step; how the system is used afterwards is what matters.