Practice: building a Prometheus monitoring system with alerting via the enterprise WeChat robot

Table of contents

About prometheus architecture

prometheus HA deployment

Components and versions involved

Installation package download paths

prometheus deployment

alertmanager deployment

grafana deployment

nginx deployment


Recently the company's monitoring system was rebuilt. Comparing the common options, Zabbix is still the most widely used for basic resource monitoring and has kept releasing new versions; in addition, Nightingale v5 had just been released. I had researched it before, and it now supports Prometheus as a data source.

In the end I chose Prometheus: time was tight, and given the company's current situation it was the better fit. So monitoring of basic resources, ports, and the application layer is done with Prometheus. For alerting, the initial plan is to use the enterprise WeChat robot; email alerts can be added later.

In addition to Prometheus there is APM monitoring. After comparing SkyWalking, Pinpoint, and CAT, I plan to use Pinpoint; that will be covered in detail in a later article.

Besides that, ELK is used to collect logs, mainly to monitor nginx requests and responses; exceptions and errors reported in back-end logs will be monitored as well.

The earlier research and practice on Nightingale v5 was covered in a previous post.

About prometheus architecture

The initial idea was to build a federated cluster and use VictoriaMetrics, but after estimating the expected load this turned out to be unnecessary.

The initial design looked like this:

In the final practice I did not actually do that; based on the real scenario I set up an HA pair instead, as shown below:

Prometheus itself will be deployed that way; the collection agents and the subsequent alerting flow are shown in the following diagram:

prometheus HA deployment

Components and versions involved

prometheus 2.36.0
nginx 1.6.2
alertmanager 0.24.0
grafana 8.5.3

Installation package download paths

prometheus: https://github.com/prometheus/prometheus/releases/download/v2.36.1/prometheus-2.36.1.linux-amd64.tar.gz
alertmanager: https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
grafana: wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.5.6-1.x86_64.rpm
nginx: compiled and installed from source

prometheus deployment

Among them, prometheus and alertmanager are installed from the tar packages, nginx is compiled from source, grafana is installed from the rpm package, and prometheus is managed with systemctl.

Deployment is actually very simple

 
 
cd /data
tar -xvf prometheus-2.36.0.linux-amd64.tar.gz
mv prometheus-2.36.0.linux-amd64 prometheus
mkdir -p /data/prometheus/{log,data}

Then modify the configuration file. The current configuration mainly covers the job definitions and the alerting configuration.

 
 
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.200.9:9093']

rule_files:
  - "rules/*_rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.x.x:9090","192.168.x.1:9090"]

  - job_name: 'linux_base'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - config_exporter.json

  - job_name: blackbox_tcp
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - config_port.json
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.x.2:9115

The node_exporter and blackbox_exporter target lists are kept in separate files via file_sd, so they support hot updates without restarting Prometheus.

config_exporter.json

config_port.json

Their format looks like this:

[
  {
    "targets": [ "192.168.x.x:9100" ],
    "labels": {
      "env": "yw"
    }
  }
]

Prometheus is managed by systemctl

cat /usr/lib/systemd/system/prometheus.service

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=root
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/data/prometheus/prometheus \
  --config.file=/data/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention=30d \
  --web.console.libraries=/data/prometheus/console_libraries \
  --web.console.templates=/data/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target

After creating the unit file, reload systemd:

systemctl daemon-reload
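The original stops at daemon-reload; the usual next steps with the unit above are to enable and start the service and confirm it is up:

systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus
curl -s http://127.0.0.1:9090/-/ready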

alertmanager deployment

tar -xvf alertmanager-0.24.0.linux-amd64.tar.gz
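The original does not show how alertmanager is started. A minimal sketch, assuming the same /data layout as prometheus (not the author's exact setup), could be:

cd /data
mv alertmanager-0.24.0.linux-amd64 alertmanager
cd alertmanager
# amtool ships in the tarball and can validate the configuration
./amtool check-config alertmanager.yml
# listen on the default port 9093, which matches the alertmanagers target in prometheus.yml above
nohup ./alertmanager --config.file=alertmanager.yml --storage.path=data/ > alertmanager.log 2>&1 &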

Modify the configuration file. Since alerts are sent through the enterprise WeChat robot, a small web service written in Python receives the alerts, and Alertmanager is pointed at it through a webhook receiver:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://192.168.x.x:5000'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

When an alert fires or recovers, the message sent to the robot looks roughly like this:

[Recovered] Production environment blackbox_network_stats alert has recovered
Alert level: critical
Alert type: blackbox_network_stats
Alert host: 192.168.x.x:9100
Alert system: ops
Alert details: This requires immediate action!
Alert status: resolved
Triggered at: 2022-06-16 11:05:48 +08:00
Resolved at: 2022-06-16 16:06:48 +08:00

I won't go into the specific Python service here; leave a message if you need it.
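For illustration only (this is not the author's actual service), a minimal sketch of such a webhook receiver using Flask and requests; the robot key, message fields, and port are placeholders to adapt:

from flask import Flask, request
import requests

app = Flask(__name__)

# Hypothetical enterprise WeChat robot webhook; replace the key with your own.
WECHAT_WEBHOOK = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxx"


def format_alert(alert):
    """Render one Alertmanager alert entry as a plain-text message."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "firing")
    title = "[Recovered]" if status == "resolved" else "[Firing]"
    return "\n".join([
        f"{title} {labels.get('alertname', 'unknown')}",
        f"Alert level: {labels.get('severity', 'unknown')}",
        f"Alert host: {labels.get('instance', 'unknown')}",
        f"Alert details: {annotations.get('description', annotations.get('summary', ''))}",
        f"Alert status: {status}",
        f"Triggered at: {alert.get('startsAt', '')}",
    ])


@app.route("/", methods=["POST"])
def receive():
    # Alertmanager POSTs a JSON body that contains an "alerts" list.
    data = request.get_json()
    for alert in data.get("alerts", []):
        payload = {"msgtype": "text", "text": {"content": format_alert(alert)}}
        requests.post(WECHAT_WEBHOOK, json=payload, timeout=5)
    return "ok"


if __name__ == "__main__":
    # Port 5000 matches the webhook_configs url in alertmanager.yml.
    app.run(host="0.0.0.0", port=5000)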

grafana deployment

This one is very simple: install directly from the rpm package.

rpm -ivh grafana-8.5.3-1.x86_64.rpm

Grafana is mainly used here for dashboard display; node and blackbox dashboards were added. There are plenty of mature dashboards available online that can be imported and used directly.
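One common way (not covered in the original) to point Grafana at Prometheus is a provisioning file; a minimal sketch, assuming the default provisioning path /etc/grafana/provisioning/datasources/prometheus.yaml and a placeholder address:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Address of the Prometheus endpoint (or the nginx address in front of the HA pair).
    url: http://192.168.x.x:9090
    isDefault: true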

nginx deployment

Nginx is mainly used to load-balance the Prometheus instances and to proxy some components to the outside.

Just compile and install.
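The nginx configuration itself is not included in the original. A minimal sketch of load-balancing the two Prometheus instances (the upstream addresses match the scrape targets earlier; the server_name is a placeholder):

upstream prometheus {
    server 192.168.x.x:9090;
    server 192.168.x.1:9090;
}

server {
    listen 80;
    server_name prometheus.example.com;

    location / {
        proxy_pass http://prometheus;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}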

At this point, a basic monitoring system has been deployed and alerting through the enterprise WeChat robot is in place. What follows are the cluster deployments of Pinpoint and ELK.

In fact, deployment is only the first step; what matters is how the system is used afterwards.

Origin blog.csdn.net/smallbird108/article/details/125466785