Prometheus
Prometheus, a Cloud Native Computing Foundation project, is a system and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays results, and triggers alerts when specified conditions are observed.
The features that distinguish Prometheus from other metrics and monitoring systems are:
- A multi-dimensional data model (time series defined by a metric name and a set of key/value dimensions)
- PromQL, a powerful and flexible query language that leverages this dimensionality
- No dependency on distributed storage; single server nodes are autonomous
- An HTTP pull model for time series collection
- Support for pushing time series through an intermediary gateway for batch jobs
- Target discovery through service discovery or static configuration
- Multiple modes of graphing and dashboarding support
- Support for hierarchical and horizontal federation
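To make the label-based data model concrete: once a Prometheus server is running (installation is covered in the next section), any series can be selected by metric name plus label matchers through the HTTP API. This is only a minimal sketch. It assumes the server listens on its default address `localhost:9090`; `PROM_URL` and the helper `prom_query` are names invented here, and `name1` is the job configured later in this article.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumption; 9090 is the default Prometheus port.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # -G turns the URL-encoded expression into a GET query parameter.
  curl -fsS -G --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# Usage: select the `up` series of one job by its label.
# prom_query 'up{job="name1"}'
```

The response is JSON; each result carries the full label set of the matching series together with its current value.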
Architecture Overview
Install
Download the latest version of Prometheus for your platform, then extract it:
tar xvfz prometheus-*.tar.gz
cd prometheus-*
Before starting Prometheus, let's configure it.
prometheus.yml
# my global config
global:
  scrape_interval: 10s # Set the scrape interval to every 10 seconds. Default is every 1 minute.
  evaluation_interval: 10s # Evaluate rules every 10 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "second_rules.yml"
# Scrape configurations: each job below scrapes one Spring Boot service.
scrape_configs:
  # Configure the jobs; several services can be configured at the same time.
  - job_name: 'name1'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    scheme: "https"
    static_configs:
      - targets: ['xxx.com']
  - job_name: 'name2'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    scheme: "http"
    static_configs:
      - targets: ['abc.com']
Alerting rules
rules/xx_rules.yml
groups:
  - name: nqi-down
    rules:
      - alert: gx-node-down
        expr: up{instance="xxx.com:443"} == 0
        for: 10s
        labels:
          status: High
          team: xxx
        annotations:
          description: "xxx is Down ! ! !"
          summary: "The xxx service is down, please check!"
      - alert: test-node-down
        expr: up{instance="abc.com:80"} == 0
        for: 5s
        labels:
          status: Warn
          team: test
        annotations:
          description: "abc is Down ! ! !"
          summary: "The abc service is down, please check!"
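With prometheus.yml and the rule file in place, the configuration can be validated and the server started. This is a minimal sketch, assuming both files sit inside the extracted `prometheus-*` directory (relative `rule_files` paths are resolved against the directory of the config file). The function names and the directory argument are illustrative only; `promtool check config` and the `--config.file` flag ship with Prometheus.

```shell
# Validate prometheus.yml; promtool also checks the rule files it references.
check_prometheus() {
  prom_dir="${1:-.}"
  "$prom_dir/promtool" check config "$prom_dir/prometheus.yml"
}

# Start the server only if the configuration passes validation.
start_prometheus() {
  prom_dir="${1:-.}"
  check_prometheus "$prom_dir" \
    && "$prom_dir/prometheus" --config.file="$prom_dir/prometheus.yml"
}

# Usage, from inside the extracted directory:
# start_prometheus .
```

By default the web UI is served on port 9090, and its Alerts page lists the two rules above together with their current state.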
Install Alertmanager: download it and extract it
tar xvfz alertmanager-*.tar.gz
cd alertmanager-*
Configure
alertmanager.yml
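global:
  resolve_timeout: 5m # Timeout after which an alert is considered resolved
  smtp_smarthost: 'smtp.xxx.com:465' # SMTP server address
  smtp_from: '[email protected]' # Sender address
  smtp_auth_username: '[email protected]' # SMTP username
  smtp_auth_password: 'W7CKmqD2x0iGXM9R' # The mailbox's authorization code, not the login password
  smtp_require_tls: false # Whether to require TLS
templates:
  - ./email.html
route:
  group_by: ['abc']
  group_wait: 3s
  group_interval: 10s
  repeat_interval: 3h
  receiver: 'mail'
receivers:
  - name: 'mail'
    email_configs: # Email configuration
      - to: '[email protected], [email protected]' # Recipient addresses for alerts
        send_resolved: true # Also send a notification when the alert is resolved
        html: '{{ template "email.html" . }}'
inhibit_rules:
  # This rule matches on the `severity` label. The rules in rules/xx_rules.yml set
  # `status` instead, so they need a `severity` label if inhibition is wanted.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']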
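The Alertmanager configuration can be checked and the service started in the same way. This is a sketch under the assumption that alertmanager.yml and email.html are placed inside the extracted `alertmanager-*` directory. The function name and its argument are made up for illustration; `amtool check-config` and `--config.file` come with the Alertmanager distribution.

```shell
# Validate alertmanager.yml with amtool, then start Alertmanager if it passes.
start_alertmanager() {
  am_dir="${1:-.}"
  "$am_dir/amtool" check-config "$am_dir/alertmanager.yml" \
    && "$am_dir/alertmanager" --config.file="$am_dir/alertmanager.yml"
}

# Usage, from inside the extracted directory:
# start_alertmanager .
```

Alertmanager listens on port 9093 by default, which matches the `127.0.0.1:9093` target in the `alerting` section of prometheus.yml.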
Code
Add the dependencies to pom.xml
<!-- Monitoring -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Add the following configuration to application.properties
# Monitoring
management.endpoints.web.exposure.include=prometheus
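Before pointing Prometheus at the application, it helps to confirm that the endpoint really serves metrics. This is a minimal sketch that assumes the application runs locally on Spring Boot's default port 8080; `APP_URL` and `show_metrics` are names invented for this example.

```shell
# APP_URL is an assumption: adjust host and port to where the application runs.
APP_URL="${APP_URL:-http://localhost:8080}"

# Print the first few metric samples, skipping the "# HELP" / "# TYPE" comment lines.
show_metrics() {
  curl -fsS "$APP_URL/actuator/prometheus" | grep -v '^#' | head -n "${1:-5}"
}

# Usage:
# show_metrics 10
```

If nothing is printed, the endpoint is probably not exposed. In that case check the `management.endpoints.web.exposure.include` setting above.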
Install Grafana
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.4.6.linux-amd64.tar.gz
tar -zxvf grafana-enterprise-8.4.6.linux-amd64.tar.gz
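After extraction Grafana can be started from its own directory. This is a sketch that assumes the archive unpacks to `grafana-8.4.6`; the function name and its argument are illustrative, and `bin/grafana-server` is the server binary shipped in the 8.x tarball.

```shell
# Start Grafana from the extracted directory (assumed to be grafana-8.4.6).
start_grafana() {
  grafana_dir="${1:-grafana-8.4.6}"
  # Run from the Grafana home directory so it can locate conf/ and public/.
  ( cd "$grafana_dir" && ./bin/grafana-server )
}

# Usage:
# start_grafana grafana-8.4.6
```

Grafana listens on port 3000 by default, and the initial login is admin / admin. Prometheus can then be added as a data source using its address, for example http://localhost:9090.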