This guide installs and deploys a monitoring platform for a K8S cluster using the open-source monitoring stack "Prometheus+Grafana".
It then uses the Alertmanager alerting component together with WeChat Work (Enterprise WeChat) to add monitoring alerts for the cluster.
1. Lab node plan
Hostname | IP address | Installed components |
---|---|---|
m1 | 192.168.200.61 | Prometheus+Grafana+Alertmanager+node_exporter |
m2 | 192.168.200.62 | node_exporter |
m3 | 192.168.200.63 | node_exporter |
n1 | 192.168.200.64 | node_exporter |
n2 | 192.168.200.65 | node_exporter |
n3 | 192.168.200.66 | node_exporter |
2. Install Prometheus
Run the following steps on the master01 node (m1).
- Install Prometheus
# Download
wget https://github.com/prometheus/prometheus/releases/download/v2.34.0/prometheus-2.34.0.linux-amd64.tar.gz
# Extract
tar -zxvf prometheus-2.34.0.linux-amd64.tar.gz -C /usr/local/
# Rename
cd /usr/local/ && mv prometheus-2.34.0.linux-amd64 prometheus && cd prometheus
- Create the prometheus.service unit file
cat > /usr/lib/systemd/system/prometheus.service << EOF
[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --storage.tsdb.retention.time=15d --log.level=info
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
- Start the Prometheus service
systemctl daemon-reload && systemctl start prometheus && systemctl enable prometheus && systemctl status prometheus
- Check that Prometheus is listening on its port
netstat -lntp | grep prometheus
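Besides checking the listening port, the server can be polled over HTTP until it answers. The following is a minimal sketch, assuming `curl` is installed and that Prometheus serves its readiness endpoint `/-/ready` on the default port 9090. The function name `wait_for_http` and its retry defaults are illustrative and not part of the original setup.

```shell
# wait_for_http URL [RETRIES] [DELAY_SECONDS]
# Hypothetical helper: polls URL until curl gets a successful HTTP response,
# giving up after RETRIES attempts (default 10) spaced DELAY_SECONDS apart (default 2).
wait_for_http() {
  local url="$1" retries="${2:-10}" delay="${3:-2}" attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # -f: treat HTTP errors as failure, -s: silent, -o /dev/null: discard the body
    if curl -fs -o /dev/null --max-time 3 "$url"; then
      echo "ready: $url"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not ready after $retries attempts: $url" >&2
  return 1
}
```

On m1 this could be called as `wait_for_http http://192.168.200.61:9090/-/ready` right after `systemctl start prometheus`.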
3. Install node_exporter
Install node_exporter on every node; the steps are the same on each.
- Install node_exporter
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
# Extract
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz -C /usr/local/
# Rename
cd /usr/local && mv node_exporter-1.3.1.linux-amd64 node_exporter && cd node_exporter
- Create the node_exporter.service unit file
cat > /usr/lib/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
- Start the node_exporter service
systemctl daemon-reload && systemctl start node_exporter && systemctl enable node_exporter && systemctl status node_exporter
- Check that the node_exporter process is running
ps -ef | grep node_exporter
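A running process does not prove that metrics are actually being served. As a rough sketch, the helper below reads Prometheus text-format metrics on standard input and reports whether a given metric name has at least one sample. The name `has_metric` is hypothetical, and the sketch assumes node_exporter listens on its default port 9100 and exposes `node_cpu_seconds_total`.

```shell
# has_metric NAME
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# succeeds (exit 0) if at least one sample line belongs to metric NAME.
has_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      split($0, parts, /[{ ]/)          # metric name ends at "{" or a space
      if (parts[1] == name) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

A possible check against one node: `curl -s http://192.168.200.62:9100/metrics | has_metric node_cpu_seconds_total && echo OK`.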
4. Configure prometheus.yml
- Edit the prometheus.yml configuration file
[root@m1 prometheus]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090'] # to also monitor the local node_exporter, append 'localhost:9100'
  - job_name: 'K8S-Masters'
    # Overrides the global scrape interval for this job: 5 seconds instead of 15.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.200.61:9100']
      - targets: ['192.168.200.62:9100']
      - targets: ['192.168.200.63:9100']
  - job_name: 'K8S-Nodes'
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.200.64:9100']
      - targets: ['192.168.200.65:9100']
      - targets: ['192.168.200.66:9100']
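The configuration above lists each host in its own `- targets:` entry. An equivalent form puts all hosts of a job into a single targets list. The sketch below generates such an entry from a list of IP addresses; the function name `scrape_job` is hypothetical, and port 9100 is assumed to be the node_exporter port on every host.

```shell
# scrape_job JOB INTERVAL HOST...
# Hypothetical helper: prints one scrape_configs entry whose single targets
# list contains every HOST on port 9100 (assumed node_exporter port).
scrape_job() {
  local job="$1" interval="$2" targets="" host
  shift 2
  for host in "$@"; do
    # Append 'host:9100', separated by ", " after the first element
    targets="${targets:+$targets, }'$host:9100'"
  done
  printf "  - job_name: '%s'\n" "$job"
  printf "    scrape_interval: %s\n" "$interval"
  printf "    static_configs:\n"
  printf "      - targets: [%s]\n" "$targets"
}
```

For instance, `scrape_job K8S-Nodes 5s 192.168.200.64 192.168.200.65 192.168.200.66` prints an entry that can be pasted under `scrape_configs:` and then validated with promtool.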
- Validate the prometheus.yml configuration
./promtool check config prometheus.yml
- Restart the Prometheus service
systemctl daemon-reload && systemctl restart prometheus && systemctl status prometheus
- Restart the node_exporter service
systemctl daemon-reload && systemctl restart node_exporter && systemctl status node_exporter
- Open the Prometheus targets page
http://192.168.200.61:9090/targets
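The same information shown on the targets page is also available from the Prometheus HTTP API at `/api/v1/targets`. As a rough sketch, the helper below counts targets per health state from that JSON. It assumes the response is compact JSON containing fields of the form `"health":"up"`; a real JSON parser such as `jq` would be more robust, and the name `summarize_health` is hypothetical.

```shell
# summarize_health
# Hypothetical helper: reads the JSON body of /api/v1/targets on stdin and
# prints one "state=count" line per health state (for example up=6).
summarize_health() {
  grep -o '"health":"[a-z]*"' |   # keep only the health fields
    cut -d'"' -f4 |               # extract the state value
    sort | uniq -c |              # count occurrences of each state
    awk '{ print $2 "=" $1 }'
}
```

A possible invocation is `curl -s http://192.168.200.61:9090/api/v1/targets | summarize_health`.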
5. Install Grafana
Run the following steps on the master01 node (m1).
- Download and install Grafana
Download page: Download Grafana | Grafana Labs
# Download
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.4.5-1.x86_64.rpm
# Install
yum install -y grafana-enterprise-8.4.5-1.x86_64.rpm
- Start the Grafana service
systemctl start grafana-server.service && systemctl enable grafana-server.service && systemctl status grafana-server.service
- Check that Grafana is listening on its port
netstat -lntp | grep grafana-serve
- Open the Grafana web UI at http://192.168.200.61:3000
- Change the password (on first login with the default admin/admin credentials, Grafana prompts for a new one)
- Log in to the Grafana web UI
- Add a data source
- Select "Prometheus"
- Enter the URL of the Prometheus server
- Click "Save & test"; a green success message confirms that the data source works.
- Set up a Grafana dashboard for node_exporter
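The data source steps above are done by hand in the web UI. As an alternative, Grafana can load data sources from provisioning files at startup. The sketch below writes such a file; it assumes the RPM install's default provisioning directory `/etc/grafana/provisioning/datasources` and the Prometheus address `http://192.168.200.61:9090`, neither of which the original steps state explicitly. The function name `write_prom_datasource` is hypothetical.

```shell
# write_prom_datasource DIR URL
# Hypothetical helper: writes a Grafana data source provisioning file named
# prometheus.yaml into DIR, pointing at the Prometheus server at URL.
write_prom_datasource() {
  local dir="$1" url="$2"
  mkdir -p "$dir" || return 1
  cat > "$dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $url
    isDefault: true
EOF
}
```

On m1 this might be used as `write_prom_datasource /etc/grafana/provisioning/datasources http://192.168.200.61:9090 && systemctl restart grafana-server`.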
6. Install Alertmanager for alerting
- Install Alertmanager
# Download
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
# Extract
tar xvf alertmanager-0.24.0.linux-amd64.tar.gz -C /usr/local/
# Rename
cd /usr/local/ && mv alertmanager-0.24.0.linux-amd64 alertmanager && cd alertmanager/
- Log in to WeChat Work (Enterprise WeChat)
- Get the corp (enterprise) ID: ww9fxxxxxx03000
- Get the department ID: 2
- AgentId: 1000003
- Secret: 8FZ_LnlwuFKNf6xxxxxxxxxxxxWwVPH8R3ExJvIs
- Get the application ID
After these steps you have everything needed to configure Alertmanager: the corp ID, the AgentId, the Secret, and the ID of the department that receives alerts.
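Before wiring these values into Alertmanager, it can help to confirm that the corp ID and Secret are accepted by WeChat Work. The sketch below assumes the WeChat Work `gettoken` endpoint under the `https://qyapi.weixin.qq.com/cgi-bin/` base URL and a JSON response in which `"errcode":0` signals success. Both function names are hypothetical.

```shell
# gettoken_url CORP_ID SECRET
# Hypothetical helper: prints the access-token request URL for the given credentials.
gettoken_url() {
  printf 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s\n' "$1" "$2"
}

# token_ok
# Hypothetical helper: reads the gettoken JSON response on stdin and succeeds
# only if it reports errcode 0.
token_ok() {
  grep -Eq '"errcode"[[:space:]]*:[[:space:]]*0[^0-9]'
}
```

With real credentials substituted for the placeholders, a check could look like `curl -s "$(gettoken_url CORP_ID SECRET)" | token_ok && echo "credentials accepted"`.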
- Create the wechat.tmpl file. The template lists firing alerts and resolved alerts separately and reads each alert's description annotation.
[root@m1 template]# cat /usr/local/alertmanager/template/wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 -}}
Alert type: {{ $alert.Labels.alertname }}
Alert severity: {{ $alert.Labels.severity }}
=====================
{{- end }}
===Alert details===
Details: {{ $alert.Annotations.description }}
Fired at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===Reference info===
{{ if gt (len $alert.Labels.instance) 0 -}}Instance IP: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}Namespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}Node IP: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}Pod name: {{ $alert.Labels.pod_name }}{{- end }}
=====================
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 -}}
Alert type: {{ $alert.Labels.alertname }}
Alert severity: {{ $alert.Labels.severity }}
=====================
{{- end }}
===Alert details===
Details: {{ $alert.Annotations.description }}
Fired at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
===Reference info===
{{ if gt (len $alert.Labels.instance) 0 -}}Instance IP: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}Namespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}Node IP: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}Pod name: {{ $alert.Labels.pod_name }};{{- end }}
=====================
{{- end }}
{{- end }}
{{- end }}
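- Edit the alertmanager.yml configuration file. Template with placeholders:
global:
  resolve_timeout: 1m # declare an alert resolved if it has not been updated for 1 minute
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # do not change
  wechat_api_corp_id: '*************' # corp ID from WeChat Work
  wechat_api_secret: '************************' # Secret of the application created in WeChat Work
templates:
  - '/usr/local/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s # delay before the first notification for a new alert group
  group_interval: 10s # wait before notifying about new alerts added to an already-notified group
  repeat_interval: 1h # wait before re-sending a notification for alerts that are still firing
  # receiver: 'email'
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        message: '{{ template "wechat.default.message" . }}'
        to_party: '2' # ID of the WeChat Work department (K8S alert group) that receives alerts
        agent_id: '1000003' # ID of the application created in WeChat Work
        api_secret: '************************************' # Secret of the application created in WeChat Work
The configuration used in this lab (credentials partially masked):
global:
  resolve_timeout: 1m # declare an alert resolved if it has not been updated for 1 minute
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # do not change
  wechat_api_corp_id: 'ww9fxxxxxx03000' # corp ID from WeChat Work
  wechat_api_secret: '8FZ_LnlwuFKNf6xxxxxxxxxxxxWwVPH8R3ExJvIs' # Secret of the application created in WeChat Work
templates:
  - '/usr/local/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s # delay before the first notification for a new alert group
  group_interval: 3m # wait before notifying about new alerts added to an already-notified group
  repeat_interval: 3m # wait before re-sending a notification for alerts that are still firing
  # receiver: 'email'
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true # also notify when an alert is resolved
        to_user: '@all' # all users
        message: '{{ template "wechat.default.message" . }}'
        to_party: '2' # ID of the WeChat Work department (K8S alert group) that receives alerts
        agent_id: '1000003' # ID of the application created in WeChat Work
        api_secret: '8FZ_LnlwuFKNf6xxxxxxxxxxxxWwVPH8R3ExJvIs' # Secret of the application created in WeChat Work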
- Create the alertmanager.service unit file
cat > /usr/lib/systemd/system/alertmanager.service << EOF
[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/data/alertmanager
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
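- Start alertmanager.service
systemctl daemon-reload && systemctl start alertmanager.service && systemctl enable alertmanager.service
- Edit the prometheus.yml configuration file: the alerting and rule_files sections shown in step 4 must point at Alertmanager (127.0.0.1:9093) and at rules/*.yml
- Create node_status.yml under /usr/local/prometheus/rules
# Create the rules directory and enter it
mkdir -p /usr/local/prometheus/rules && cd /usr/local/prometheus/rules
# Create the node_status.yml rule file with the following content
cat node_status.yml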
groups:
  - name: Instance liveness alert rules
    rules:
      - alert: Instance liveness alert
        # Note: this expression covers the prometheus and K8S-Nodes jobs only, not K8S-Masters.
        expr: up{job="prometheus"} == 0 or up{job="K8S-Nodes"} == 0
        for: 1m
        labels:
          user: root
          severity: Disaster
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
          value: "{{ $value }}"
  - name: Memory alert rules
    rules:
      - alert: "Memory usage alert"
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 75
        for: 1m
        labels:
          user: root
          severity: warning
        annotations:
          summary: "Server: {{ $labels.instance }} memory alert"
          description: "{{ $labels.instance }}: memory utilization is above 75% (current value: {{ $value }}%)"
          value: "{{ $value }}"
  - name: CPU alert rules
    rules:
      - alert: CPU usage alert
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 70
        for: 1m
        labels:
          user: root
          severity: warning
        annotations:
          summary: "Server: {{ $labels.instance }} CPU alert"
          description: "Server: CPU usage is above 70% (current value: {{ $value }}%)"
          value: "{{ $value }}"
  - name: Disk alert rules
    rules:
      - alert: Disk usage alert
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 1m
        labels:
          user: root
          severity: warning
        annotations:
          summary: "Server: {{ $labels.instance }} disk alert"
          description: "Server: {{ $labels.instance }}, disk device: usage is above 80% (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
          value: "{{ $value }}"
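To see what the memory rule measures, the same formula can be evaluated by hand on a node: used percent = (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100. The sketch below applies it to input in `/proc/meminfo` format, on the assumption that node_exporter derives its `node_memory_*_bytes` metrics from those fields. The function name `mem_used_percent` is hypothetical.

```shell
# mem_used_percent
# Hypothetical helper: reads /proc/meminfo-style lines on stdin and prints
# (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100 with one decimal.
mem_used_percent() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    $1 == "Cached:"   { cached = $2 }
    END {
      if (total <= 0) exit 1          # no MemTotal line found
      printf "%.1f\n", (total - (free + buffers + cached)) / total * 100
    }
  '
}
```

Running `mem_used_percent < /proc/meminfo` on a node gives a value to compare against the rule's 75% threshold. The rule file itself can be validated with `./promtool check rules rules/node_status.yml` from /usr/local/prometheus.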
- Validate the alertmanager.yml configuration
./amtool check-config alertmanager.yml
Output:
[root@m1 alertmanager]# pwd
/usr/local/alertmanager
[root@m1 alertmanager]# ./amtool check-config alertmanager.yml
Checking 'alertmanager.yml' SUCCESS
Found:
- global config
- route
- 0 inhibit rules
- 1 receivers
- 1 templates
SUCCESS
- Start the Alertmanager service
systemctl daemon-reload && systemctl start alertmanager && systemctl enable alertmanager && systemctl status alertmanager
- Check that the alertmanager process is running
ps -ef | grep alertmanager
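Once Alertmanager is running, the WeChat Work delivery path can be exercised without waiting for a real fault by posting a test alert to the Alertmanager v2 API. The sketch below only builds the JSON payload. The alert name `ManualTestAlert`, the function name `test_alert_payload`, and the annotation text are made up for illustration; the label and annotation keys mirror those used by the alert rules in this guide.

```shell
# test_alert_payload INSTANCE [SEVERITY]
# Hypothetical helper: prints a JSON array with one alert for POST /api/v2/alerts.
# SEVERITY defaults to "warning".
test_alert_payload() {
  local instance="$1" severity="${2:-warning}"
  printf '[{"labels":{"alertname":"ManualTestAlert","severity":"%s","instance":"%s"},' "$severity" "$instance"
  printf '"annotations":{"description":"Manual test alert sent with curl"}}]\n'
}
```

On m1 the payload could be sent with `curl -s -X POST -H 'Content-Type: application/json' -d "$(test_alert_payload 192.168.200.61:9100)" http://127.0.0.1:9093/api/v2/alerts`. Because the payload carries no end time, Alertmanager should treat the alert as resolved once resolve_timeout passes without an update.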
- Restart the Prometheus service
systemctl daemon-reload && systemctl restart prometheus && systemctl status prometheus
- Open http://192.168.200.61:9090/alerts to view the loaded alert rules.
The integration between Prometheus and WeChat Work alerting is now complete: when a fault occurs, WeChat Work receives an alert message, followed by a recovery message once the fault clears.
To verify, simulate one host going down and check that the configured alert arrives in WeChat Work.
- Check the instance liveness alert on the alerts page; it shows that one host is down
- View the alert message delivered to WeChat Work