K8s cluster monitoring and alerting (Prometheus + AlertManager + Grafana + PrometheusAlert + DingTalk)

Background:

After the k8s cluster is deployed, a reliable, stable and low-latency monitoring and alerting system is urgently needed to confirm that the cluster is running normally and in an orderly fashion. After continuous research and testing, the Prometheus + AlertManager + Grafana + PrometheusAlert stack was finally selected, with fault notifications sent to a DingTalk group and to email. If additional metrics need to be collected, a pushgateway can also be deployed so that jobs can actively push data to Prometheus.

Deployment plan:

Prometheus + AlertManager + Grafana + PrometheusAlert + DingTalk (a pushgateway can optionally be deployed as well)

Prerequisites:

A k8s cluster has already been deployed. For details, see "Using kubeadm to build a single-master k8s cluster for a production environment".

Deployment

1. Prometheus deployment

Prometheus consists of several components, some of which are optional:

Prometheus Server: scrapes metrics and stores time-series data
exporter: exposes metrics so that they can be scraped
pushgateway: receives metrics pushed to it (push model)
alertmanager: the component that handles alerts
adhoc: used for ad-hoc data queries

Prometheus architecture diagram:
Prometheus scrapes metric data directly from targets, or receives it indirectly via the intermediate Pushgateway. It stores all collected metrics locally and applies rules to this data to produce aggregated series or alert notifications; Grafana or other tools can then be used to visualize the data.
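For the optional push path mentioned above, a job can push metrics to a Pushgateway over HTTP and Prometheus then scrapes the gateway. A minimal sketch, assuming a Pushgateway is deployed and reachable at the placeholder address below:

# push a throwaway gauge under job "demo"; Prometheus later scrapes it from the gateway
echo "demo_job_last_success_timestamp $(date +%s)" | \
  curl --data-binary @- http://<pushgateway-host>:9091/metrics/job/demo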

1.1 Create a namespace

kubectl create ns monitor 

1.2 Create the Prometheus configuration file

# prometheus-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager:9093"]
    rule_files:
      # - "first.rules"
      # - "second.rules"
    
    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets: ["localhost:9090"]
    
      - job_name: "kubernetes-apiservers"
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels:
              [
                __meta_kubernetes_namespace,
                __meta_kubernetes_service_name,
                __meta_kubernetes_endpoint_port_name,
              ]
            action: keep
            regex: default;kubernetes;https

      - job_name: "kubernetes-nodes"
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: "(.*):10250"
            replacement: "${1}:9100"
            target_label: __address__
            action: replace
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
    
      - job_name: 'controller-manager'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-controller-manager;https
    
      - job_name: 'kube-scheduler'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-scheduler;https
    
    
      - job_name: 'etcd'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: http
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;etcd;http
    
      - job_name: "etcd-https"
        metrics_path: "/metrics"
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /opt/categraf/pki/etcd/ca.crt
          cert_file: /opt/categraf/pki/etcd/client.crt
          key_file: /opt/categraf/pki/etcd/client.key
          insecure_skip_verify: true
        relabel_configs:
          - source_labels:
              [
                __meta_kubernetes_namespace,
                __meta_kubernetes_service_name,
                __meta_kubernetes_endpoint_port_name,
              ]
            action: keep
            regex: kube-system;etcd;https
#] kubectl apply -f prometheus-cm.yaml
configmap "prometheus-config" created

The global block controls the Prometheus server-wide configuration:
scrape_interval: how often Prometheus scrapes metric data (the upstream default is 1m; here we set it to 15s).
evaluation_interval: how often rules are evaluated; Prometheus uses rules to generate new time series or to fire alerts.
rule_files: the location of the rule files; Prometheus loads rules from here to generate new time series or alert information. No alerting rules are configured yet.
scrape_configs: controls which targets Prometheus scrapes.
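Optionally, the scrape configuration can be validated before applying the ConfigMap with promtool, which ships with the Prometheus release (a hedged sketch; it assumes you keep a local copy of the prometheus.yml content):

# validate the configuration file syntax and semantics
promtool check config prometheus.yml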

1.3 Create the Prometheus Deployment

# prometheus-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      securityContext:
        runAsUser: 0  # run the container as the root user
      serviceAccountName: prometheus
      containers:
#        - image: prom/prometheus:v2.34.0
        - image: prom/prometheus:v2.44.0
          name: prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus" # 指定tsdb数据路径
            - "--storage.tsdb.retention.time=24h"
            - "--web.enable-admin-api" # 控制对admin HTTP API的访问,其中包括删除时间序列等功能
            - "--web.enable-lifecycle" # 支持热更新,直接执行localhost:9090/-/reload立即生效
          ports:
            - containerPort: 9090
              name: http
          volumeMounts:
            - mountPath: "/etc/prometheus"
              name: config-volume
            - mountPath: "/prometheus"
              name: data
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
            limits:
              cpu: 100m
              memory: 512Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
        - configMap:
            name: prometheus-config
          name: config-volume

--storage.tsdb.path=/prometheus specifies the data directory.

Create the PV/PVC resource objects shown below. Note that this is a local PV with node affinity to the master node, so the directory has to be created on that node first:

mkdir -p /data/k8s/localpv/prometheus
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-local
  labels:
    app: prometheus
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage-prometheus
  local:
    path: /data/k8s/localpv/prometheus  # directory on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - master
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: prometheus
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-storage-prometheus
---
# local-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage-prometheus     # the StorageClass name, local-storage-prometheus, which is what we reference in the PV
provisioner: kubernetes.io/no-provisioner   # the PV is created manually, so no dynamic provisioning is needed
volumeBindingMode: WaitForFirstConsumer   # delayed binding

Prometheus needs to access some Kubernetes resource objects, so RBAC must be configured. Here we use a ServiceAccount named prometheus:

# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - nodes/metrics
    verbs:
      - get
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitor

#] kubectl apply -f prometheus-rbac.yaml
serviceaccount "prometheus" created
clusterrole.rbac.authorization.k8s.io "prometheus" created
clusterrolebinding.rbac.authorization.k8s.io "prometheus" created

1.4 Create Prometheus

kubectl apply -f prometheus-deploy.yaml
deployment.apps/prometheus created
➜ kubectl get pods -n monitor
NAME                         READY   STATUS             RESTARTS   AGE
prometheus-df4f47d95-vksmc   1/1     Running   3          98s

1.5 Create the Service

After the Pod is created successfully, we also need a Service object so that the Prometheus web UI can be accessed from outside the cluster:

# prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
    - name: web
      port: 9090
      targetPort: http
#] kubectl apply -f prometheus-svc.yaml
service "prometheus" created
#] kubectl get svc -n monitor
NAME         TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.96.194.29   <none>        9090:30980/TCP   13h

Now we can access the Prometheus web UI at http://<any node IP>:30980:
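Besides the web UI, the discovered scrape targets and their health can also be checked through the Prometheus HTTP API, for example via the NodePort shown above:

# list the active scrape targets and their health
curl -s http://<any node IP>:30980/api/v1/targets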

2. AlertManager deployment

2.1 Install AlertManager

AlertManager configuration file:

# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: monitor
data:
  config.yml: |-
    global:
      # how long AlertManager waits without receiving an alert before marking it as resolved
      resolve_timeout: 5m
      # email (SMTP) settings
      smtp_smarthost: 'smtp.qq.com:25'
      smtp_from: '257*******@qq.com'
      smtp_auth_username: '257*******@qq.com'
      smtp_auth_password: '<mailbox auth password>'
      smtp_hello: 'qq.com'
      smtp_require_tls: false
    # the root route that every alert enters; it defines how alerts are dispatched
    route:
      # labels used to regroup incoming alerts; for example, alerts carrying cluster=A and alertname=LatencyHigh are aggregated into a single group
      group_by: ['alertname', 'cluster']
      # after a new alert group is created, wait at least group_wait before the first notification, so several alerts for the same group can be sent together
      group_wait: 30s

      # interval between notifications for the same group
      group_interval: 30s

      # once an alert has been sent successfully, wait repeat_interval before re-sending it; tune this per alert type
      repeat_interval: 1h

      # default receiver: alerts that are not matched by any sub-route are sent here
      receiver: default

      # all of the attributes above are inherited by the sub-routes and can be overridden per route
      routes:
      - receiver: email
        group_wait: 10s
        group_by: ['instance'] # group by instance
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: '257*******@qq.com'
        send_resolved: true  # also notify when the alert is resolved
    - name: 'email'
      email_configs:
      - to: '257*******@qq.com'
        send_resolved: true

kubectl apply -f alertmanager-config.yaml

To configure the AlertManager container, you can directly use a Deployment to manage it. The corresponding YAML resource declaration is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor
  labels:
    app: alertmanager
spec:
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      volumes:
        - name: alertcfg
          configMap:
            name: alert-config
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.24.0
          imagePullPolicy: IfNotPresent
          args:
            - "--config.file=/etc/alertmanager/config.yml"
          ports:
            - containerPort: 9093
              name: http
          volumeMounts:
            - mountPath: "/etc/alertmanager"
              name: alertcfg
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 100m
              memory: 256Mi
---
# alertmanager-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitor
  labels:
    app: alertmanager
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
    - name: web
      port: 9093
      targetPort: http

After the AlertManager container starts, we also need to configure the AlertManager address in Prometheus so that Prometheus can reach it. Add the following to the Prometheus ConfigMap:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

Then execute a reload so the new configuration takes effect.
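Because the Deployment above starts Prometheus with --web.enable-lifecycle, the reload can be triggered over HTTP once the updated ConfigMap has propagated into the Pod (this can take up to a minute or so):

# hot-reload the Prometheus configuration
curl -X POST http://<any node IP>:30980/-/reload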

Add the following alarm rule configuration to the Prometheus configuration file:

rule_files:
  - /etc/prometheus/rules.yml

rule_files specifies the alerting rule files. Here we mount rules.yml into the /etc/prometheus directory through the same ConfigMap, for example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
      evaluation_interval: 30s  # evaluate rules every 30s (the upstream default is 1m)
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:9093"]
    rule_files:
    - /etc/prometheus/rules.yml
  ...... # the rest of prometheus.yml (scrape_configs, etc.) is omitted here
  rules.yml: |
    groups:
    - name: test-node-mem
      rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 20
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{
    
    {
    
    $labels.instance}}: High Memory usage detected"
          description: "{
    
    {
    
    $labels.instance}}: Memory usage is above 20% (current value is: {
    
    { $value }}"

This defines an alert rule named NodeMemoryUsage. An alert rule consists mainly of the following parts:

alert: the name of the alert rule
expr: the PromQL query expression used to evaluate the alert condition
for: the pending duration; the alert only fires after the condition has held for this long, and during this window newly triggered alerts stay in the pending state
labels: custom labels, allowing the user to attach an extra set of labels to the alert
annotations: another set of labels that are not part of the alert's identity; they usually carry extra information used when displaying the alert
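As a second, purely illustrative example (not part of the original rule file), a rule that fires when a scrape target stops responding could look like the following; the group name and duration are assumptions:

    groups:
    - name: test-node-up
      rules:
      - alert: InstanceDown
        expr: up == 0            # the target failed its most recent scrape
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: instance is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 2 minutes"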

3. Grafana deployment

Grafana is a visualization dashboard with beautiful charts and layouts. It is a full-featured metrics dashboard and graph editor that supports Graphite, Zabbix, InfluxDB, Prometheus, OpenTSDB, Elasticsearch and more as data sources. Compared with Prometheus' built-in graphing, it is far more powerful and flexible, and it has a rich plugin ecosystem.

3.1 Install Grafana

If you specify storageClassName: managed-nfs-storage, the StorageClass must be deployed in advance; the PV is then created automatically when the PVC is declared.
This article instead uses a local StorageClass and creates the host path in advance:

mkdir -p /data/k8s/localpv
---
#grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-pvc
      securityContext:
        runAsUser: 0
      containers:
        - name: grafana
#          image: grafana/grafana:8.4.6
          image: grafana/grafana:10.0.1
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
              name: grafana
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: admin321
          readinessProbe:
            failureThreshold: 10
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 30
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 150m
              memory: 512Mi
            requests:
              cpu: 150m
              memory: 512Mi
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: storage
---
#grafana-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitor
spec:
  type: NodePort
  ports:
    - port: 3000
  selector:
    app: grafana
 
---
#grafana-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  labels:
    app: grafana
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /data/k8s/localpv  # directory on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - master
---
#grafana-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitor
  labels:
    app: grafana
spec:
#  storageClassName: managed-nfs-storage
  storageClassName: local-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
# local-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage     # the StorageClass name, local-storage, which is what we reference in the PV
provisioner: kubernetes.io/no-provisioner   # the PV is created manually, so no dynamic provisioning is needed
volumeBindingMode: WaitForFirstConsumer   # delayed binding

The environment variables GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD configure Grafana's administrator user and password.

Grafana stores dashboards and plugin data under the /var/lib/grafana directory, so to persist this data we declare a volume mount for that path.

Check whether the Pod corresponding to grafana is normal:

[root@master grafana]# kubectl get pods -n monitor  -l app=grafana
NAME                       READY   STATUS    RESTARTS   AGE
grafana-85794dc4d9-mhcj7   1/1     Running   0          7m12s
[root@master grafana]# kubectl logs -f grafana-85794dc4d9-mhcj7 -n monitor 
...
logger=settings var="GF_SECURITY_ADMIN_USER=admin"
t=2019-12-13T06:35:08+0000 lvl=info msg="Config overridden from Environment variable"
......
t=2019-12-13T06:35:08+0000 lvl=info msg="Initializing Stream Manager"
t=2019-12-13T06:35:08+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=[::]:3000 protocol=http subUrl= socket=

[root@master grafana]# kubectl get svc -n monitor 
NAME      TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
grafana   NodePort   10.98.74.79   <none>        3000:31197/TCP   8m26s

Open http://<any node IP>:31197 in the browser, log in to Grafana, and add a data source. Prometheus and Grafana are in the same namespace (monitor), so the data source address is simply http://prometheus:9090 (since they share a namespace, the Service name can be used directly).
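Instead of configuring the data source by clicking through the UI, it could also be provisioned from a file. A minimal sketch, assuming a ConfigMap like the following is created and mounted into the Grafana container at /etc/grafana/provisioning/datasources (this ConfigMap and its mount are not part of the manifests above):

# grafana-datasource-cm.yaml (illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitor
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true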

4. PrometheusAlert deployment

1. PrometheusAlert deployment (in Kubernetes)

GitHub address: https://github.com/feiyu563/PrometheusAlert
PrometheusAlert is an open-source ops alert-center and message-forwarding system. It supports mainstream monitoring systems (Prometheus, Zabbix), log systems (Graylog2, Graylog3), data visualization systems (Grafana, SonarQube), Alibaba Cloud Monitoring, and any other system that can send alert messages over a WebHook interface. The received messages can be forwarded to DingTalk, WeChat, email, Feishu, Tencent SMS, Tencent Phone, Alibaba Cloud SMS, Alibaba Cloud Phone, Huawei SMS, Baidu Cloud SMS, Ronglian Cloud Phone, Qimo SMS, Qimo Voice, Telegram, Baidu Hi (Ruliu), and more.
PrometheusAlert can be deployed locally or on cloud platforms and supports Windows, Linux, public cloud, private cloud, hybrid cloud, containers and Kubernetes. Choose the deployment method that fits your scenario:
https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/base-install.md
This article runs it in Kubernetes.
Pull the image in advance:

docker pull feiyu563/prometheus-alert
[root@master ~]# docker images  | grep prometheus-alert
feiyu563/prometheus-alert                                                 latest                   d68864d68c3e   19 months ago   38.9MB

# In Kubernetes you can run the following command directly (note: the default deployment template does not mount the template database file db/PrometheusAlertDB.db; to avoid losing template data, add a volume mount yourself)
kubectl apply -n monitoring -f https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/example/kubernetes/PrometheusAlert-Deployment.yaml

# After startup, open http://[YOUR-PrometheusAlert-URL]:8080 in a browser
# The default login account and password are configured in app.conf
[root@master prometheusalert]# kubectl  logs prometheus-alert-center-7f76d88c98-fnjzz 
pass!
table `prometheus_alert_d_b` already exists, skip
table `alert_record` already exists, skip
2023/08/14 10:07:46.483 [I] [proc.go:225]  [main] 构建的Go版本: go1.16.5
2023/08/14 10:07:46.483 [I] [proc.go:225]  [main] 应用当前版本: v4.6.1
2023/08/14 10:07:46.483 [I] [proc.go:225]  [main] 应用当前提交: 1bc0791a637b633257ce69de05d57b79ddd76f7c
2023/08/14 10:07:46.483 [I] [proc.go:225]  [main] 应用构建时间: 2021-12-23T12:37:35+0000
2023/08/14 10:07:46.483 [I] [proc.go:225]  [main] 应用构建用户: root@c14786b5a1cd
2023/08/14 10:07:46.491 [I] [asm_amd64.s:1371]  http server Running on http://0.0.0.0:8080

Change the Service type to NodePort:

[root@master prometheusalert]# kubectl  edit svc prometheus-alert-center 
service/prometheus-alert-center edited

[root@master prometheusalert]# kubectl  get svc -A 
NAMESPACE     NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes                ClusterIP   10.96.0.1        <none>        443/TCP                  11d
default       service/prometheus-alert-center   NodePort    10.105.133.163   <none>        8080:32021/TCP           2m19s
kube-system   service/kube-dns                  ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   11d
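Alternatively, instead of editing interactively, the Service type can be switched with a one-line patch (use the namespace in which prometheus-alert-center was actually created; in the output above it is default):

kubectl patch svc prometheus-alert-center -n default -p '{"spec":{"type":"NodePort"}}'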

Access http://<any node IP>:32021 from a browser.

Because the container image published on GitHub is old, you can also write a Dockerfile that downloads the binary release and packages it into an image to deploy on k8s.
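A minimal Dockerfile sketch for that approach; the base image, release version and paths are illustrative assumptions rather than part of the original article:

FROM debian:bookworm-slim
# fetch and unpack the PrometheusAlert linux release (pick the version you need)
RUN apt-get update && apt-get install -y --no-install-recommends wget unzip ca-certificates \
    && wget -O /tmp/linux.zip https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip \
    && unzip /tmp/linux.zip -d /opt \
    && chmod +x /opt/linux/PrometheusAlert \
    && rm -rf /tmp/linux.zip /var/lib/apt/lists/*
WORKDIR /opt/linux
EXPOSE 8080
CMD ["./PrometheusAlert"]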

2. PrometheusAlert deployment (binary)

# Open the PrometheusAlert releases page, choose the version you need, download it, unzip it, and enter the unpacked directory,
# e.g. the Linux build (https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip)

# wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.9/linux.zip && unzip linux.zip && cd linux/

# After downloading, unzip the archive and enter the unpacked folder


# Run PrometheusAlert
./PrometheusAlert    # to run it in the background: nohup ./PrometheusAlert &

# After startup, open http://127.0.0.1:8080 in a browser
# The default login account and password are configured in app.conf

Note:
1. Configure alert routing
https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/web-router.md
2. Enable alert logging

# Whether to record alerts: 0 = disabled, 1 = enabled
AlertRecord=1

5. DingTalk configuration

Create the DingTalk robot

Open DingTalk, enter the DingTalk group, and choose Group Settings -> Smart Group Assistant -> Add Robot -> Customize.
Newer versions of DingTalk add security settings: select the custom-keyword option and set the keyword to the title value configured in Prometheus or in app.conf.
Copy the robot's Webhook address and fill it into the corresponding configuration item of the PrometheusAlert configuration file app.conf.

PS: The DingTalk robot supports @-mentioning specific people; to use this feature you need the mobile phone number bound to that person's DingTalk account.

DingTalk currently supports only a subset of Markdown syntax; the supported elements are as follows:

Headings
# Heading 1
## Heading 2
### Heading 3
#### Heading 4
##### Heading 5
###### Heading 6

Quote
> A man who stands for nothing will fall for anything.

Bold and italic
**bold**
*italic*

Link
[this is a link](http://name.com)

Image
![](http://name.com/pic.jpg)

Unordered list
- item1
- item2

Ordered list
1. item1
2. item2

DingTalk related configuration:

#---------------------↓ global configuration -----------------------
# alert message title
title=PrometheusAlert
# DingTalk alerts: logo icon URL for alert messages
logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
# DingTalk alerts: logo icon URL for recovery messages
rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png

#---------------------↓ webhook -----------------------
# whether to enable the DingTalk alert channel (several channels can be enabled at the same time); 0 = off, 1 = on
open-dingding=1
# default DingTalk robot webhook URL
ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx
# whether to @ everyone (0 = off, 1 = on)
dd_isatall=1
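Before wiring the robot into PrometheusAlert, the webhook can be tested with a plain HTTP request; the access_token is the placeholder from the configuration above, and the message text must contain the custom keyword configured in the robot's security settings:

curl -s -H 'Content-Type: application/json' \
  -d '{"msgtype":"text","text":{"content":"PrometheusAlert: test message"}}' \
  'https://oapi.dingtalk.com/robot/send?access_token=xxxxx'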

Take Prometheus alerts with a custom template as an example.

Alertmanager configuration reference (the route/receiver that forwards alerts to PrometheusAlert):

global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10m
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'web.hook.prometheusalert'
receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  - url: 'http://[prometheusalert_url]:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=<DingTalk robot webhook URL>,<DingTalk robot webhook URL 2>&at=18888888888,18888888889'

If you want an alert to be delivered through multiple channels, such as email plus DingTalk, add continue: true to the matching route so that evaluation continues to the next route, as in the sketch below.
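A hedged sketch of such a route, reusing the receiver name defined above (the email receiver and the match labels are assumptions to be adapted to your own rules):

route:
  receiver: 'web.hook.prometheusalert'
  routes:
  # continue: true lets the alert keep matching subsequent routes,
  # so it is delivered both by email and through the PrometheusAlert webhook
  - receiver: 'email'
    match:
      team: node
    continue: true
  - receiver: 'web.hook.prometheusalert'
    match:
      team: node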

Origin blog.csdn.net/weixin_45720992/article/details/132279028