A hidden pitfall when deploying Thanos Ruler on Kubernetes: duplicate alerts

1 Overview

1.1 Environment

Thanos Ruler and Alertmanager are deployed in a Kubernetes cluster. The versions are as follows:
a. Kubernetes cluster: v1.18.5
b. Thanos Ruler: v0.11.0
c. Alertmanager: v0.20.0

The YAML manifest for Thanos Ruler is as follows:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-rule
  name: thanos-rule
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-rule
  serviceName: thanos-rules
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-rule
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/gzlj/thanos-reloader:v0.1
        imagePullPolicy: Always
        name: reloader
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/thanos/rules/*rules.yaml
        - --data-dir=/var/thanos/rule
        - --label=rule_replica="$(NAME)"
        # Note the --alert.label-drop line below: its value is wrapped in double quotes ("")
        - --alert.label-drop="rule_replica"
        - --query=dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
        - --alertmanagers.url=http://alertmanager-main.monitoring.svc.cluster.local:9093
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: quay.mirrors.ustc.edu.cn/thanos/thanos:v0.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 24
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-rule
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 18
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /var/thanos/rule
          name: data
        - mountPath: /etc/thanos/rules
          name: thanos-rules
      restartPolicy: Always
      serviceAccount: thanos-rules
      serviceAccountName: thanos-rules
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: thanos-rules
        name: thanos-rules
      - emptyDir: {}
        name: data

The key part is the --alert.label-drop line, whose value is wrapped in double quotes.


1.2 Phenomenon

Alertmanager receives duplicate alerts. The only difference between the two duplicates is the value of the custom label rule_replica.
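
To make the symptom concrete, here is a rough sketch of the label sets of the two alerts as Alertmanager would see them. Alertmanager treats alerts with different label sets as different alerts, so the extra rule_replica label is enough to prevent deduplication. The alert name and severity are hypothetical placeholders. The pod names thanos-rule-0 and thanos-rule-1 are assumed from the StatefulSet name and replicas: 2 in the manifest above. This is an illustration written in YAML, not the actual wire format.

```yaml
# Illustrative only: label sets of the two duplicate alerts.
# alertname and severity are hypothetical placeholders.
- labels:
    alertname: ExampleAlert
    severity: warning
    rule_replica: thanos-rule-0   # sent by the first ruler pod
- labels:
    alertname: ExampleAlert
    severity: warning
    rule_replica: thanos-rule-1   # sent by the second ruler pod
```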


2 Solution

I first tried upgrading the Thanos Ruler image to v0.15.0, but the behavior stayed the same.
When I was about to give up, I changed the Thanos Ruler startup argument --alert.label-drop="rule_replica" to --alert.label-drop=rule_replica. Simply removing the double quotes stopped Alertmanager from receiving duplicate alerts.
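
A likely explanation (an assumption, not verified against the Thanos source): Kubernetes passes each args entry to the container process as-is, without a shell, so the double quotes are never stripped. Thanos Ruler is then asked to drop a label literally named "rule_replica", quotes included, which never matches the real label rule_replica. The --label flag behaves differently because its own name="value" syntax expects a quoted value. A minimal sketch of the relevant args after the fix, with every other field of the manifest unchanged:

```yaml
args:
- rule
# --label uses Thanos' name="value" syntax, so the quotes stay here
- --label=rule_replica="$(NAME)"
# no shell strips quotes from args, so pass the bare label name
- --alert.label-drop=rule_replica
```

Wrapping the whole argument in YAML quotes, for example '--alert.label-drop=rule_replica', would also work, because those quotes are consumed by the YAML parser and never reach the container.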


3 Phenomenon after resolution

Thanos Ruler now drops the rule_replica label from each alert before sending it to Alertmanager. Alertmanager then shows a single alert instead of the previous two.

Source: blog.csdn.net/nangonghen/article/details/109186245