1 Overview
1.1 Environment
Thanos Ruler and Alertmanager are deployed in a Kubernetes cluster. The versions are as follows:
a. Kubernetes cluster: v1.18.5
b. Thanos Ruler: v0.11.0
c. Alertmanager: v0.20.0
The Thanos Ruler manifest (YAML) is as follows:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-rule
  name: thanos-rule
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-rule
  serviceName: thanos-rules
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-rule
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/gzlj/thanos-reloader:v0.1
        imagePullPolicy: Always
        name: reloader
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/thanos/rules/*rules.yaml
        - --data-dir=/var/thanos/rule
        - --label=rule_replica="$(NAME)"
        # Note the --alert.label-drop line below: its value is wrapped in double quotes
        - --alert.label-drop="rule_replica"
        - --query=dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
        - --alertmanagers.url=http://alertmanager-main.monitoring.svc.cluster.local:9093
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: quay.mirrors.ustc.edu.cn/thanos/thanos:v0.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 24
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-rule
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 18
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /var/thanos/rule
          name: data
        - mountPath: /etc/thanos/rules
          name: thanos-rules
      restartPolicy: Always
      serviceAccount: thanos-rules
      serviceAccountName: thanos-rules
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: thanos-rules
        name: thanos-rules
      - emptyDir: {}
        name: data
The key screenshots are as follows
1.2 Symptom
Alertmanager receives duplicate alerts. The two copies differ only in the value of the custom label rule_replica, as shown in the figure:
2 Solution
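I first tried upgrading the Thanos Ruler image to v0.15.0, but the behavior was unchanged.
When I was about to give up, I changed the Thanos Ruler startup argument from --alert.label-drop="rule_replica" to --alert.label-drop=rule_replica. Simply removing the double quotes stopped Alertmanager from receiving duplicate alerts. The likely explanation is that Kubernetes passes each entry in args to the container verbatim, with no shell to strip the quotes, so Thanos Ruler was asked to drop a label literally named "rule_replica" (quotes included), which never matched the real label rule_replica.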
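The difference between the two forms of the flag can be illustrated with a small Python sketch. It is only a model of argument handling, not Thanos code: parse_flag is a hypothetical helper, and shlex.split stands in for what a POSIX shell would do if one were involved. Kubernetes args entries are passed to the process as-is, so the first case is the one that applies to the manifest above.

```python
import shlex


def parse_flag(arg):
    """Split '--name=value' into (name, value) without any unquoting.

    Hypothetical helper: it models a program that takes the flag value
    literally, which is an assumption about the ruler's behavior.
    """
    name, _, value = arg.partition("=")
    return name, value


raw = '--alert.label-drop="rule_replica"'

# Exec form (Kubernetes `args`): the string reaches the process unchanged,
# so the double quotes are part of the value.
exec_value = parse_flag(raw)[1]

# Shell form: a POSIX shell removes the quotes before the process starts.
shell_value = parse_flag(shlex.split(raw)[0])[1]

print(repr(exec_value))   # '"rule_replica"'
print(repr(shell_value))  # 'rule_replica'

# Only the unquoted name matches the label that the ruler actually attaches.
print(exec_value == "rule_replica")   # False
print(shell_value == "rule_replica")  # True
```

Under this model, the quoted form names a label that does not exist, so nothing is dropped and both replicas' alerts keep their distinct rule_replica values.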
3 Result after the fix
Thanos Ruler now drops the rule_replica label from each alert before sending it to Alertmanager. The alerts from the two replicas then carry identical label sets, so Alertmanager treats them as the same alert and shows one alert instead of the previous two.
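Why dropping the label removes the duplicate can be sketched in a few lines of Python. This is a simplified model, assuming that an alert's identity is determined by its full label set; the alert name and severity are made-up illustration values, while the replica names follow from the StatefulSet name thanos-rule with two replicas.

```python
def fingerprint(labels):
    """Identity of an alert in this model: its complete set of label pairs."""
    return frozenset(labels.items())


def distinct_alerts(alerts, drop=()):
    """Count distinct alerts after removing the label names listed in `drop`."""
    seen = set()
    for labels in alerts:
        kept = {k: v for k, v in labels.items() if k not in drop}
        seen.add(fingerprint(kept))
    return len(seen)


# The same alert as evaluated by each of the two ruler replicas.
# "HighLatency" and "warning" are hypothetical example values.
replica_alerts = [
    {"alertname": "HighLatency", "severity": "warning", "rule_replica": "thanos-rule-0"},
    {"alertname": "HighLatency", "severity": "warning", "rule_replica": "thanos-rule-1"},
]

print(distinct_alerts(replica_alerts))                         # 2
print(distinct_alerts(replica_alerts, drop=("rule_replica",)))  # 1
```

With the replica label still attached, the two label sets differ and are counted separately. Once it is dropped, they collapse into one.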